28 Apr, 2014

32 commits

  • When the guest cedes the vcpu or the vcpu has no guest to
    run it naps. Clear the runlatch bit of the vcpu before
    napping to indicate an idle cpu.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
     
  • The secondary threads in the core are kept offline before launching guests
    in kvm on powerpc: "371fefd6f2dc4666:KVM: PPC: Allow book3s_hv guests to use
    SMT processor modes."

    Hence their runlatch bits are cleared. When the secondary threads are called
    in to start a guest, their runlatch bits need to be set to indicate that they
    are busy. The primary thread has its runlatch bit set though, but there is no
    harm in setting this bit once again. Hence set the runlatch bit for all
    threads before they start guest.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
     
  • Up until now we have been setting the runlatch bits for a busy CPU and
    clearing it when a CPU enters idle state. The runlatch bit has thus
    been consistent with the utilization of a CPU as long as the CPU is online.

    However when a CPU is hotplugged out the runlatch bit is not cleared. It
    needs to be cleared to indicate an unused CPU. Hence this patch has the
    runlatch bit cleared for an offline CPU just before entering an idle state
    and sets it immediately after it exits the idle state.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
     
  • While testing memory hot-remove, I found following dead lock:

    Process #1141 is drmgr, trying to remove some memory, i.e. memory499.
    It holds the memory_hotplug_mutex, and blocks when trying to remove file
    "online" under dir memory499, in kernfs_drain(), at
    wait_event(root->deactivate_waitq,
    atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);

    Process #1120 is trying to online memory499 by
    echo 1 > memory499/online

    In .kernfs_fop_write, it uses kernfs_get_active() to increase
    &kn->active, thus blocking process #1141. While itself is blocked later
    when trying to acquire memory_hotplug_mutex, which is held by process

    The backtrace of both processes are shown below:

    [] 0xc000000001b18600
    [] .__switch_to+0x144/0x200
    [] .online_pages+0x74/0x7b0
    [] .memory_subsys_online+0x9c/0x150
    [] .device_online+0xb8/0x120
    [] .online_store+0xb4/0xc0
    [] .dev_attr_store+0x64/0xa0
    [] .sysfs_kf_write+0x7c/0xb0
    [] .kernfs_fop_write+0x154/0x1e0
    [] .vfs_write+0xe0/0x260
    [] .SyS_write+0x64/0x110
    [] syscall_exit+0x0/0x7c

    [] 0xc000000001b18600
    [] .__switch_to+0x144/0x200
    [] .__kernfs_remove+0x204/0x300
    [] .kernfs_remove_by_name_ns+0x68/0xf0
    [] .sysfs_remove_file_ns+0x38/0x60
    [] .device_remove_attrs+0x54/0xc0
    [] .device_del+0x158/0x250
    [] .device_unregister+0x34/0xa0
    [] .unregister_memory_section+0x164/0x170
    [] .__remove_pages+0x108/0x4c0
    [] .arch_remove_memory+0x60/0xc0
    [] .remove_memory+0x8c/0xe0
    [] .pseries_remove_memblock+0xd4/0x160
    [] .pseries_memory_notifier+0x27c/0x290
    [] .notifier_call_chain+0x8c/0x100
    [] .__blocking_notifier_call_chain+0x6c/0xe0
    [] .of_property_notify+0x7c/0xc0
    [] .of_update_property+0x3c/0x1b0
    [] .ofdt_write+0x3dc/0x740
    [] .proc_reg_write+0xac/0x110
    [] .vfs_write+0xe0/0x260
    [] .SyS_write+0x64/0x110
    [] syscall_exit+0x0/0x7c

    This patch uses lock_device_hotplug() to protect remove_memory() called
    in pseries_remove_memblock(), which is also stated before function
    remove_memory():

    * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
    * and online/offline operations before this call, as required by
    * try_offline_node().
    */
    void __ref remove_memory(int nid, u64 start, u64 size)

    With this lock held, the other process(#1120 above) trying to online the
    memory block will retry the system call when calling
    lock_device_hotplug_sysfs(), and finally find No such device error.

    Signed-off-by: Li Zhong
    Signed-off-by: Benjamin Herrenschmidt

    Li Zhong
     
  • module_init should return 0 or a negative errno.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Bump the boot wrapper BOOT_COMMAND_LINE_SIZE to match the
    kernel.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • I've had a report that the current limit is too small for
    an automated network based installer. Bump it.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We have two definitions of COMMAND_LINE_SIZE, one for the kernel
    and one for the boot wrapper. I assume this is so the boot
    wrapper can be self sufficient and not rely on kernel headers.

    Having two defines with the same name is confusing, I just
    updated the wrong one when trying to bump it.

    Make the boot wrapper define unique by calling it
    BOOT_COMMAND_LINE_SIZE.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • The catalog version number was changed from a be32 (with proceeding
    32bits of padding) to a be64, update the code to treat it as a be64

    Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • fixup for "powerpc/perf: Add support for the hv gpci (get performance
    counter info) interface".

    Makes the "not enabled" message less awful (and hidden unless
    debugging).

    Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • fixup for "powerpc/perf: Add support for the hv 24x7 interface"

    Makes the "not enabled" message less awful (and hides it in most cases).

    Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • The if condition check was based on a draft ISA doc. Remove the same.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     
  • Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We have two copies of code that creates an OPAL sg list. Consolidate
    these into a common set of helpers and fix the endian issues.

    The flash interface embedded a version number in the num_entries
    field, whereas the dump interface did did not. Since versioning
    wasn't added to the flash interface and it is impossible to add
    this in a backwards compatible way, just remove it.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Fix little endian issues with the OPAL error log code.

    Signed-off-by: Anton Blanchard
    Reviewed-by: Stewart Smith
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • The bitmap in opal_poll_events and opal_handle_interrupt is
    big endian, so we need to byteswap it on little endian builds.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We had some duplication of the internal OPAL functions.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Using size_t in our APIs is asking for trouble, especially
    when some OPAL calls use size_t pointers.

    Signed-off-by: Anton Blanchard
    Reviewed-by: Stewart Smith
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • On PowerNV platform, we are holding an unnecessary refcount on a pci_dev, which
    leads to the pci_dev is not destroyed when hotplugging a pci device.

    This patch release the unnecessary refcount.

    Signed-off-by: Wei Yang
    Signed-off-by: Benjamin Herrenschmidt

    Wei Yang
     
  • During the EEH hotplug event, iommu_add_device() will be invoked three times
    and two of them will trigger warning or error.

    The three times to invoke the iommu_add_device() are:

    pci_device_add
    ...
    set_iommu_table_base_and_group kobj->sd is not initialized. The
    dev->kobj->sd is initialized in device_add().
    The third time's warning is triggered by the re-attach of the iommu_group.

    After applying this patch, the error

    iommu_tce: 0003:05:00.0 has not been added, ret=-14

    and the warning

    [ 204.123609] ------------[ cut here ]------------
    [ 204.123645] WARNING: at arch/powerpc/kernel/iommu.c:1125
    [ 204.123680] Modules linked in: xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT bnep bluetooth 6lowpan_iphc rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc mlx4_ib ib_sa ib_mad ib_core ib_addr ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bnx2x tg3 mlx4_core nfsd ptp mdio ses libcrc32c nfs_acl enclosure be2net pps_core shpchp lockd kvm uinput sunrpc binfmt_misc lpfc scsi_transport_fc ipr scsi_tgt
    [ 204.124356] CPU: 18 PID: 650 Comm: eehd Not tainted 3.14.0-rc5yw+ #102
    [ 204.124400] task: c0000027ed485670 ti: c0000027ed50c000 task.ti: c0000027ed50c000
    [ 204.124453] NIP: c00000000003cf80 LR: c00000000006c648 CTR: c00000000006c5c0
    [ 204.124506] REGS: c0000027ed50f440 TRAP: 0700 Not tainted (3.14.0-rc5yw+)
    [ 204.124558] MSR: 9000000000029032 CR: 88008084 XER: 20000000
    [ 204.124682] CFAR: c00000000006c644 SOFTE: 1
    GPR00: c00000000006c648 c0000027ed50f6c0 c000000001398380 c0000027ec260300
    GPR04: c0000027ea92c000 c00000000006ad00 c0000000016e41b0 0000000000000110
    GPR08: c0000000012cd4c0 0000000000000001 c0000027ec2602ff 0000000000000062
    GPR12: 0000000028008084 c00000000fdca200 c0000000000d1d90 c0000027ec281a80
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
    GPR24: 000000005342697b 0000000000002906 c000001fe6ac9800 c000001fe6ac9800
    GPR28: 0000000000000000 c0000000016e3a80 c0000027ea92c090 c0000027ea92c000
    [ 204.125353] NIP [c00000000003cf80] .iommu_add_device+0x30/0x1f0
    [ 204.125399] LR [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
    [ 204.125443] Call Trace:
    [ 204.125464] [c0000027ed50f6c0] [c0000027ed50f750] 0xc0000027ed50f750 (unreliable)
    [ 204.125526] [c0000027ed50f750] [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
    [ 204.125588] [c0000027ed50f7d0] [c000000000069cc8] .pnv_pci_dma_dev_setup+0x78/0x340
    [ 204.125650] [c0000027ed50f870] [c000000000044408] .pcibios_setup_device+0x88/0x2f0
    [ 204.125712] [c0000027ed50f940] [c000000000046040] .pcibios_setup_bus_devices+0x60/0xd0
    [ 204.125774] [c0000027ed50f9c0] [c000000000043acc] .pcibios_add_pci_devices+0xdc/0x1c0
    [ 204.125837] [c0000027ed50fa50] [c00000000086f970] .eeh_reset_device+0x36c/0x4f0
    [ 204.125939] [c0000027ed50fb20] [c00000000003a2d8] .eeh_handle_normal_event+0x448/0x480
    [ 204.126068] [c0000027ed50fbc0] [c00000000003a35c] .eeh_handle_event+0x4c/0x340
    [ 204.126192] [c0000027ed50fc80] [c00000000003a74c] .eeh_event_handler+0xfc/0x1b0
    [ 204.126319] [c0000027ed50fd30] [c0000000000d1ea0] .kthread+0x110/0x130
    [ 204.126430] [c0000027ed50fe30] [c00000000000a460] .ret_from_kernel_thread+0x5c/0x7c
    [ 204.126556] Instruction dump:
    [ 204.126610] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71 7c7e1b78 60000000
    [ 204.126787] 60000000 e87e0298 3143ffff 7d2a1910 2fa90000 40de00c8 ebfe0218
    [ 204.126966] ---[ end trace 6e7aefd80add2973 ]---

    are cleared.

    This patch removes iommu_add_device() in pnv_pci_ioda_dma_dev_setup(), which
    revert part of the change in commit d905c5df(PPC: POWERNV: move
    iommu_add_device earlier).

    Signed-off-by: Wei Yang
    Signed-off-by: Benjamin Herrenschmidt

    Wei Yang
     
  • With this patch I was able to update firmware on an LE kernel.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We have a subtle race when sending CPUs back to OPAL on kexec.

    We mark them as "in real mode" right before we send them down. Once
    we've booted the new kernel, it might try to call opal_reinit_cpus()
    to change endianness, and that requires all CPUs to be spinning inside
    OPAL.

    However there is no synchronization here and we've observed cases
    where the returning CPUs hadn't established their new state inside
    OPAL before opal_reinit_cpus() is called, causing it to fail.

    The proper fix is to actually wait for them to go down all the way
    from the kexec'ing kernel.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • The size of the sysparam sysfs files is determined from the device tree
    at boot. However the buffer is hard coded to 64 bytes. If we encounter a
    parameter that is larger than 64, or miss-parse the device tree, the
    buffer will overflow when reading or writing to the parameter.

    Check it at discovery time, and if the parameter is too large, do not
    create a sysfs entry for it.

    Signed-off-by: Joel Stanley
    Signed-off-by: Benjamin Herrenschmidt

    Joel Stanley
     
  • Signed-off-by: Benjamin Herrenschmidt

    Joel Stanley
     
  • The sysparam code currently uses the userspace supplied number of
    bytes when memcpy()ing in to a local 64-byte buffer.

    Limit the maximum number of bytes by the size of the buffer.

    Signed-off-by: Benjamin Herrenschmidt

    Joel Stanley
     
  • The OPAL calls are returning int64_t values, which the sysparam code
    stores in an int, and the sysfs callback returns ssize_t. Make code a
    easier to read by consistently using ssize_t.

    Signed-off-by: Joel Stanley
    Signed-off-by: Benjamin Herrenschmidt

    Joel Stanley
     
  • When a sysparam query in OPAL returned a negative value (error code),
    sysfs would spew out a decent chunk of memory; almost 64K more than
    expected. This was traced to a sign/unsigned mix up in the OPAL sysparam
    sysfs code at sys_param_show.

    The return value of sys_param_show is a ssize_t, calculated using

    return ret ? ret : attr->param_size;

    Alan Modra explains:

    "attr->param_size" is an unsigned int, "ret" an int, so the overall
    expression has type unsigned int. Result is that ret is cast to
    unsigned int before being cast to ssize_t.

    Instead of using the ternary operator, set ret to the param_size if an
    error is not detected. The same bug exists in the sysfs write callback;
    this patch fixes it in the same way.

    A note on debugging this next time: on my system gcc will warn about
    this if compiled with -Wsign-compare, which is not enabled by -Wall,
    only -Wextra.

    Signed-off-by: Joel Stanley
    Signed-off-by: Benjamin Herrenschmidt

    Joel Stanley
     
  • commit 41dd03a9 may cause Oops in rtas_stop_self().

    The reason is that the rtas_args was moved into stack space. For a box
    with more that 4GB RAM, the stack could easily be outside 32bit range,
    but RTAS is 32bit.

    So the patch moves rtas_args away from stack by adding static before
    it.

    Signed-off-by: Li Zhong
    Signed-off-by: Anton Blanchard
    Cc: stable@vger.kernel.org # 3.14+
    Signed-off-by: Benjamin Herrenschmidt

    Li Zhong
     
  • Commit aac416fc38c (lkdtm: flush icache and report actions) calls
    flush_icache_range from a module. It's exported on most architectures
    that implement it, but not on powerpc. This patch exports it to fix
    the module link failure.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Benjamin Herrenschmidt

    Jeff Mahoney
     

21 Apr, 2014

4 commits


20 Apr, 2014

4 commits

  • …it/jolsa/perf into perf/urgent

    Pull perf/urgent fixes from Jiri Olsa:

    User visible changes:

    * Adjust symbols in VDSO to properly resolve its function names (Vladimir Nikulichev)

    * Improve error reporting for record session failure (Adrien BAK)

    * Fix 'Min time' counting in report command (Alexander Yarygin)

    Signed-off-by: Jiri Olsa <jolsa@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • In the current version, when using perf record, if something goes
    wrong in tools/perf/builtin-record.c:375
    session = perf_session__new(file, false, NULL);

    The error message:
    "Not enough memory for reading per file header"

    is issued. This error message seems to be outdated and is not very
    helpful. This patch proposes to replace this error message by
    "Perf session creation failed"

    I believe this issue has been brought to lkml:
    https://lkml.org/lkml/2014/2/24/458
    although this patch only tackles a (small) part of the issue.

    Additionnaly, this patch improves error reporting in
    tools/perf/util/data.c open_file_write.

    Currently, if the call to open fails, the user is unaware of it.
    This patch logs the error, before returning the error code to
    the caller.

    Reported-by: Will Deacon
    Signed-off-by: Adrien BAK
    Link: http://lkml.kernel.org/r/1397786443.3093.4.camel@beast
    [ Reorganize the changelog into paragraphs ]
    [ Added empty line after fd declaration in open_file_write ]
    Signed-off-by: Jiri Olsa

    Adrien BAK
     
  • pert-report doesn't resolve function names in VDSO:

    $ perf report --stdio -g flat,0.0,15,callee --sort pid
    ...
    8.76%
    0x7fff6b1fe861
    __gettimeofday
    ACE_OS::gettimeofday()
    ...

    In this case symbol values should be adjusted the same way as for executables,
    relocatable objects and prelinked libraries.

    After fix:

    $ perf report --stdio -g flat,0.0,15,callee --sort pid
    ...
    8.76%
    __vdso_gettimeofday
    __gettimeofday
    ACE_OS::gettimeofday()

    Signed-off-by: Vladimir Nikulichev
    Tested-by: Namhyung Kim
    Reviewed-by: Adrian Hunter
    Link: http://lkml.kernel.org/r/969812.163009436-sendEmail@nvs
    Signed-off-by: Jiri Olsa

    Vladimir Nikulichev
     
  • Every event in the perf-kvm has a 'stats' structure, which contains
    max/min/average/etc times of handling this event.
    The problem is that the 'perf-kvm stat report' command always shows
    that 'min time' is 0us for every event. Example:

    # perf kvm stat report

    Analyze events for all VCPUs:

    VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
    [..]
    0xB2 MSCH 12 0.07% 0.00% 0us 8us 7.31us ( +- 2.11% )
    0xB2 CHSC 12 0.07% 0.00% 0us 18us 9.39us ( +- 9.49% )
    0xB2 STPX 8 0.05% 0.00% 0us 2us 1.88us ( +- 7.18% )
    0xB2 STSI 7 0.04% 0.00% 0us 44us 16.49us ( +- 38.20% )
    [..]

    This happens because the 'stats' structure is not initialized and
    stats->min equals to 0. Lets initialize the structure for every
    event after its allocation using init_stats() function. This initializes
    stats->min to -1 and makes 'Min time' statistics counting work:

    # perf kvm stat report

    Analyze events for all VCPUs:

    VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
    [..]
    0xB2 MSCH 12 0.07% 0.00% 6us 8us 7.31us ( +- 2.11% )
    0xB2 CHSC 12 0.07% 0.00% 7us 18us 9.39us ( +- 9.49% )
    0xB2 STPX 8 0.05% 0.00% 1us 2us 1.88us ( +- 7.18% )
    0xB2 STSI 7 0.04% 0.00% 1us 44us 16.49us ( +- 38.20% )
    [..]

    Signed-off-by: Alexander Yarygin
    Signed-off-by: Christian Borntraeger
    Reviewed-by: David Ahern
    Link: http://lkml.kernel.org/r/1397053319-2130-3-git-send-email-borntraeger@de.ibm.com
    [ Fixing the perf examples changelog output ]
    Signed-off-by: Jiri Olsa

    Alexander Yarygin