03 Jul, 2012

24 commits

  • I blame Mikey for this. He elevated my slightly dubious testcase:

    to benchmark status. And naturally we need to be number 1 at creating
    zeros. So let's improve __clear_user some more.

    As Paul suggests we can use dcbz for large lengths. This patch gets
    the destination cacheline aligned then uses dcbz on whole cachelines.

    Before:
    10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s

    After:
    10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s

    39 GB/s, a new record.

    Signed-off-by: Anton Blanchard
    Tested-by: Olof Johansson
    Acked-by: Olof Johansson
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
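
    The align-then-clear-whole-cachelines approach described above can be
    sketched in portable C. This is only an illustration, not the kernel
    code: zero_cacheline() is a hypothetical stand-in for dcbz, and the
    128-byte cacheline size is an assumption that matches POWER7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cacheline size (128 bytes on POWER7; other CPUs differ). */
#define CACHELINE_BYTES 128u

/* Hypothetical stand-in for dcbz: zero one whole, aligned cacheline. */
static void zero_cacheline(unsigned char *line)
{
    assert(((uintptr_t)line % CACHELINE_BYTES) == 0);
    memset(line, 0, CACHELINE_BYTES);
}

/* Clear len bytes at dst; returns the number of whole cachelines cleared. */
static size_t clear_with_cachelines(unsigned char *dst, size_t len)
{
    size_t lines = 0;

    /* Head: byte stores until dst is cacheline aligned or len runs out. */
    while (len > 0 && ((uintptr_t)dst % CACHELINE_BYTES) != 0) {
        *dst++ = 0;
        len--;
    }

    /* Body: one cacheline-sized clear per iteration. */
    while (len >= CACHELINE_BYTES) {
        zero_cacheline(dst);
        dst += CACHELINE_BYTES;
        len -= CACHELINE_BYTES;
        lines++;
    }

    /* Tail: whatever is left is shorter than one cacheline. */
    memset(dst, 0, len);
    return lines;
}
```

    Short clears never reach the body loop, which is consistent with the
    message reserving dcbz for large lengths.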
     
  • At the moment all queues in a multiqueue adapter will serialise
    against the IOMMU table lock. This is proving to be a big issue,
    especially with 10Gbit ethernet.

    This patch creates 4 pools and tries to spread the load across
    them. If the table is under 1GB in size we revert back to the
    original behaviour of 1 pool and 1 largealloc pool.

    We create a hash to map CPUs to pools. Since we prefer interrupts to
    be affinitised to primary CPUs, without some form of hashing we are
    very likely to end up using the same pool. As an example, POWER7
    has 4 way SMT and with 4 pools all primary threads will map to the
    same pool.

    The largealloc pool is reduced from 1/2 to 1/4 of the space to
    partially offset the overhead of breaking the table up into pools.

    Some performance numbers were obtained with a Chelsio T3 adapter on
    two POWER7 boxes, running a 100 session TCP round robin test.

    Performance improved 69% with this patch applied.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
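
    To illustrate why the CPU-to-pool mapping needs a hash, here is a
    minimal sketch comparing a plain modulo with a multiplicative hash.
    The function names, the multiplier constant and the 2-bit pool index
    are assumptions made for this example, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_HASH_BITS  2u                      /* 2 bits give 4 pools */
#define NR_POOLS        (1u << POOL_HASH_BITS)
#define HASH_MULTIPLIER 0x9e370001u             /* assumed odd multiplier */

/* Naive mapping: CPUs 0, 4, 8, 12 (4-way SMT primaries) all land in pool 0. */
static unsigned int pool_by_modulo(unsigned int cpu)
{
    return cpu % NR_POOLS;
}

/* Multiplicative hash: multiply, wrap at 32 bits, keep the top bits. */
static unsigned int pool_by_hash(unsigned int cpu)
{
    uint32_t h = (uint32_t)cpu * HASH_MULTIPLIER;

    assert(POOL_HASH_BITS > 0 && POOL_HASH_BITS < 32);
    return h >> (32u - POOL_HASH_BITS);
}
```

    With these assumed constants the primary threads 0, 4, 8 and 12 map to
    pools 0, 1, 3 and 1 instead of all sharing pool 0. The spread is not
    perfectly even, but the load no longer collapses onto a single pool.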
     
  • In preparation for IOMMU pools, push the spinlock into
    iommu_range_alloc and __iommu_free.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • This patch moves tce_free outside of the lock in iommu_free.

    Some performance numbers were obtained with a Chelsio T3 adapter on
    two POWER7 boxes, running a 100 session TCP round robin test.

    Performance improved 25% with this patch applied.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We currently hold the IOMMU spinlock around tce_build and tce_flush.
    This causes our spinlock hold times to be much higher than required
    and can impact multiqueue adapters.

    This patch moves tce_build and tce_flush outside of the lock in
    iommu_alloc, and tce_flush outside of the lock in iommu_free.

    Some performance numbers were obtained with a Chelsio T3 adapter on
    two POWER7 boxes, running a 100 session TCP round robin test.

    Performance improved 32% with this patch applied.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
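
    The idea of holding the lock only while reserving a range, and filling
    in the entries afterwards, can be sketched as follows. This is a
    simplified userspace model: the bump allocator, the atomic_flag
    spinlock and all names (range_alloc, build_entries, map_pages) are
    hypothetical stand-ins for iommu_range_alloc, tce_build and iommu_alloc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TABLE_ENTRIES 64

static atomic_flag table_lock = ATOMIC_FLAG_INIT;
static size_t next_free;                      /* protected by table_lock */
static unsigned long table[TABLE_ENTRIES];

static void lock_table(void)
{
    while (atomic_flag_test_and_set_explicit(&table_lock, memory_order_acquire))
        ;                                     /* spin until the flag is free */
}

static void unlock_table(void)
{
    atomic_flag_clear_explicit(&table_lock, memory_order_release);
}

/* Reserve npages consecutive entries; the lock covers only the bookkeeping. */
static long range_alloc(size_t npages)
{
    long start = -1;

    lock_table();
    if (npages <= TABLE_ENTRIES - next_free) {
        start = (long)next_free;
        next_free += npages;
    }
    unlock_table();
    return start;
}

/* Stand-in for the expensive step: fill the entries of a reserved range. */
static void build_entries(long start, size_t npages, unsigned long base)
{
    assert(start >= 0);
    for (size_t i = 0; i < npages; i++)
        table[(size_t)start + i] = base + i;
}

/* Reserve under the lock, then build with no lock held. */
static long map_pages(size_t npages, unsigned long base)
{
    long start = range_alloc(npages);

    if (start >= 0)
        build_entries(start, npages, base);
    return start;
}
```

    Once a range is reserved no other caller can be handed the same
    entries, so in this model filling them in does not need the lock.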
     
  • tce_buildmulti_pSeriesLP uses a per cpu page to communicate with the
    hypervisor. We currently rely on the IOMMU table spinlock but
    subsequent patches will be removing that so disable interrupts
    around all accesses of tce_page.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
    instructions.

    This is a copy of the POWER7 optimised copy_to_user/copy_from_user
    loop. Detailed implementation and performance details can be found in
    commit a66086b8197d (powerpc: POWER7 optimised
    copy_to_user/copy_from_user using VMX).

    I noticed memcpy issues when profiling a RAID6 workload:

    .memcpy
    .async_memcpy
    .async_copy_data
    .__raid_run_ops
    .handle_stripe
    .raid5d
    .md_thread

    I created a simplified testcase by building a RAID6 array with 4 1GB
    ramdisks (booting with brd.rd_size=1048576):

    # mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3]

    I then timed how long it took to write to the entire array:

    # dd if=/dev/zero of=/dev/md0 bs=1M

    Before: 892 MB/s
    After: 999 MB/s

    A 12% improvement.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Version 2.06 of the POWER ISA introduced enhanced touch instructions,
    allowing us to specify a number of attributes including the length of
    a stream.

    This patch adds a software stream for both loads and stores in the
    POWER7 copy_tofrom_user loop. Since the setup is quite complicated
    and we have to use an eieio to ensure correct ordering of the "GO"
    command we only do this for copies above 4kB.

    To quantify any performance improvements we need a working set
    bigger than the caches so we operate on a 1GB file:

    # dd if=/dev/zero of=/tmp/foo bs=1M count=1024

    And we compare how fast we can read the file:

    # dd if=/tmp/foo of=/dev/null bs=1M

    before: 7.7 GB/s
    after: 9.6 GB/s

    A 25% improvement.

    The worst case for this patch will be a completely L1 cache contained
    copy of just over 4kB. We can test this with the copy_to_user
    testcase we used to tune copy_tofrom_user originally:

    http://ozlabs.org/~anton/junkcode/copy_to_user.c

    # time ./copy_to_user2 -l 4224 -i 10000000

    before: 6.807 s
    after: 6.946 s

    A 2% slowdown, which seems reasonable considering our data is unlikely
    to be completely L1 contained.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
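
    The quoted percentages can be rechecked with a small helper. This is
    just the arithmetic, using the figures from the message above; the
    function name is made up for the example.

```c
#include <assert.h>

/* Relative change from before to after, in percent. */
static double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

    percent_change(7.7, 9.6) is about 24.7, which rounds to the quoted 25%
    throughput gain, and percent_change(6.807, 6.946) is about 2.0, the
    quoted 2% slowdown in run time.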
     
  • While creating the PCI root bus through function pci_create_root_bus()
    of PCI core, it should have assigned the secondary bus number for the
    newly created PCI root bus. Thus we needn't do the explicit assignment
    for the secondary bus number again in pcibios_scan_phb().

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The form affinity for NUMA is set to 1 if the firmware supports
    OPAL. Otherwise, we have to retrieve that from OF node "/chosen".
    For the latter case, OF node "/chosen" reference count was never
    decreased.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • Implement a POWER7 optimised copy_page using VMX and enhanced
    prefetch instructions. We use enhanced prefetch hints to prefetch
    both the load and store side. We copy a cacheline at a time and
    fall back to regular loads and stores if we are unable to use VMX
    (eg we are in an interrupt).

    The following microbenchmark was used to assess the impact of
    the patch:

    http://ozlabs.org/~anton/junkcode/page_fault_file.c

    We test MAP_PRIVATE page faults across a 1GB file, 100 times:

    # time ./page_fault_file -p -l 1G -i 100

    Before: 22.25s
    After: 18.89s

    17% faster

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Subsequent patches will add more VMX library functions and it makes
    sense to keep all the C helper functions in one file.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • mtmsrd is an expensive instruction, so we save a few cycles by
    doing it once instead of twice.

    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Version 2.06 of the POWER ISA introduced enhanced touch instructions,
    allowing us to specify a number of attributes including the length of
    a stream.

    This patch adds a software stream for both loads and stores in the
    POWER7 copy_tofrom_user loop. Since the setup is quite complicated
    and we have to use an eieio to ensure correct ordering of the "GO"
    command we only do this for copies above 4kB.

    To quantify any performance improvements we need a working set
    bigger than the caches so we operate on a 1GB file:

    # dd if=/dev/zero of=/tmp/foo bs=1M count=1024

    And we compare how fast we can read the file:

    # dd if=/tmp/foo of=/dev/null bs=1M

    before: 7.7 GB/s
    after: 9.6 GB/s

    A 25% improvement.

    The worst case for this patch will be a completely L1 cache contained
    copy of just over 4kB. We can test this with the copy_to_user
    testcase we used to tune copy_tofrom_user originally:

    http://ozlabs.org/~anton/junkcode/copy_to_user.c

    # time ./copy_to_user2 -l 4224 -i 10000000

    before: 6.807 s
    after: 6.946 s

    A 2% slowdown, which seems reasonable considering our data is unlikely
    to be completely L1 contained.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • 1) call_function.lock used in smp_call_function_many() is just to protect
    call_function.queue and &data->refs; cpu_online_mask is outside of the
    lock. It is not necessary to protect cpu_online_mask,
    because data->cpumask is pre-calculated, and even if a CPU is brought up
    when calling arch_send_call_function_ipi_mask(), it's harmless because
    the validation test in generic_smp_call_function_interrupt() will take care
    of it.

    2) For the CPU down issue, stop_machine() will guarantee that no concurrent
    smp_call_function() is processing.

    Signed-off-by: Yong Zhang
    Signed-off-by: Benjamin Herrenschmidt

    Yong Zhang
     
  • I noticed __clear_user high up in a profile of one of my RAID stress
    tests. The testcase was doing a dd from /dev/zero which ends up
    calling __clear_user.

    __clear_user is basically a loop with a single 4 byte store which
    is horribly slow. We can do much better by aligning the destination
    and doing 32 bytes of 8 byte stores in a loop.

    The following testcase was used to verify the patch:

    http://ozlabs.org/~anton/junkcode/stress_clear_user.c

    To show the improvement in performance I ran a dd from /dev/zero
    to /dev/null on a POWER7 box:

    Before:

    # dd if=/dev/zero of=/dev/null bs=1M count=10000
    10485760000 bytes (10 GB) copied, 3.72379 s, 2.8 GB/s

    After:

    # time dd if=/dev/zero of=/dev/null bs=1M count=10000
    10485760000 bytes (10 GB) copied, 0.728318 s, 14.4 GB/s

    Over 5x faster.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
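
    The loop structure described above (align the destination, then clear
    32 bytes per iteration with 8-byte stores) might look roughly like this
    in portable C. It is a sketch under assumptions, not the kernel's
    assembly: clear_wide is a hypothetical name, and memcpy of a zero word
    is used so the compiler can emit an 8-byte store without aliasing
    problems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clear len bytes at dst using 8-byte stores for the bulk of the work. */
static void clear_wide(unsigned char *dst, size_t len)
{
    const uint64_t zero = 0;

    /* Head: byte stores until dst is 8-byte aligned or len runs out. */
    while (len > 0 && ((uintptr_t)dst % 8) != 0) {
        *dst++ = 0;
        len--;
    }

    /* Main loop: four 8-byte stores, 32 bytes per iteration. */
    while (len >= 32) {
        assert(((uintptr_t)dst % 8) == 0);
        memcpy(dst,      &zero, sizeof zero);
        memcpy(dst + 8,  &zero, sizeof zero);
        memcpy(dst + 16, &zero, sizeof zero);
        memcpy(dst + 24, &zero, sizeof zero);
        dst += 32;
        len -= 32;
    }

    /* Tail: fewer than 32 bytes remain. */
    while (len > 0) {
        *dst++ = 0;
        len--;
    }
}
```

    Compared with a loop of single 4-byte stores, each iteration here
    covers eight times as many bytes, which is the kind of change the
    dd numbers above reflect.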
     
  • irq_entry, irq_exit, timer_interrupt_entry and timer_interrupt_exit
    all do the same thing so use DECLARE_EVENT_CLASS to avoid duplicating
    everything 4 times.

    This saves quite a lot of space in both instruction text and data:

       text    data     bss     dec     hex filename
       9265   19622      16   28903    70e7 arch/powerpc/kernel/irq.o
       6817   19019      16   25852    64fc arch/powerpc/kernel/irq.o

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • When looking through some instruction traces I noticed our tracepoint
    checks were inline. It turns out we don't have CONFIG_JUMP_LABEL
    enabled.

    By enabling CONFIG_JUMP_LABEL we replace a load/compare/branch with
    a nop at every tracepoint call. For example in do_IRQ:

    CONFIG_JUMP_LABEL disabled:
    stdx 3,11,9
    lwz 0,8(29)
    cmpwi 7,0,0
    bne- 7,.L124
    bl .irq_enter

    CONFIG_JUMP_LABEL enabled:
    stdx 3,11,9
    nop
    bl .irq_enter

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • The following patch is to remove the pseries_notify_add_cpu() call
    and replace it by a hot plug notifier.

    This would prevent cpuidle resources being released and allocated each
    time a CPU comes online on pseries.

    The earlier design was causing a lockdep problem
    in start_secondary as reported on this thread:
    https://lkml.org/lkml/2012/5/17/2

    This applies on 3.4-rc7

    Signed-off-by: Deepthi Dharwar
    Signed-off-by: Benjamin Herrenschmidt

    Deepthi Dharwar
     
  • An upcoming release of firmware will add DDW extensions, in particular
    an API to "reset" the DMA window to the original configuration (32-bit,
    2GB in size). With that API available, we can safely remove the default
    window, increasing the resources available to firmware for creation of
    larger windows for the slot in question -- if we encounter an error, we
    can use the new API to reset the state of the slot.

    Further, this same release of firmware will make it a hard requirement
    for OSes to release the existing window before any other windows will be
    shown as available, to avoid conflicts in addressing between the two
    windows.

    In anticipation of these changes, always remove the default window
    before we do any DDW manipulations.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Benjamin Herrenschmidt

    Nishanth Aravamudan
     
  • The patch_instruction() interface is made to modify kernel text. It is
    safer to use that than probe_kernel_write() when modifying kernel
    code.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Benjamin Herrenschmidt

    Steven Rostedt
     
  • For ftrace to use the patch_instruction code, it needs to check for
    faults on write. Ftrace updates code all over the kernel, and we need to
    know if code is updated or not due to protections that are placed on
    some portions of the kernel. If ftrace does not detect a fault, it will
    error later on, and it will be much more difficult to find the problem.

    By changing patch_instruction() to detect faults, then ftrace will be
    able to make use of it too.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Benjamin Herrenschmidt

    Steven Rostedt
     
  • PowerPC does not have the synchronization issues that x86 has with
    modifying code on one CPU while another CPU is executing it.
    The other CPU will either see the old or new code without any
    issues, unlike x86 which may issue a GPF.

    Instead of calling the heavy stop_machine, just update the code.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Benjamin Herrenschmidt

    Steven Rostedt
     
  • Currently we build all board files regardless of the final zImage
    target. This is sub-optimal (in terms of compilation) and leads to
    problems in one platform needlessly causing failures for other
    platforms.

    Use the Kconfig variables to selectively construct the list of board
    files to build.

    Signed-off-by: Tony Breeds
    Signed-off-by: Benjamin Herrenschmidt

    Tony Breeds
     

02 Jul, 2012

1 commit

  • Pull two ARM fixes from Russell King:
    "It's been fairly quiet with the fixes. Just two this time. One fixes
    a long standing problem with KALLSYMS needing an additional pass, and
    the other sorts a problem with the vmalloc space interacting with
    static IO mappings."

    * 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
    ARM: 7438/1: fill possible PMD empty section gaps
    ARM: 7428/1: Prevent KALLSYM size mismatch on ARM.

    Linus Torvalds
     

01 Jul, 2012

10 commits

  • On ARM with the 2-level page table format, a PMD entry is represented by
    two consecutive section entries covering 2MB of virtual space.

    However, static mappings always were allowed to use separate 1MB section
    entries. This means in practice that a static mapping may create half
    populated PMDs via create_mapping().

    Since commit 0536bdf33f (ARM: move iotable mappings within the vmalloc
    region) those static mappings are located in the vmalloc area. We must
    ensure no such half populated PMDs are accessible once vmalloc() or
    ioremap() start looking at the vmalloc area for nearby free virtual
    address ranges, or various things leading to a kernel crash will happen.

    Signed-off-by: Nicolas Pitre
    Reported-by: Santosh Shilimkar
    Tested-by: "R, Sricharan"
    Reviewed-by: Catalin Marinas
    Cc: stable@vger.kernel.org
    Signed-off-by: Russell King

    Nicolas Pitre
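
    The "half populated PMD" condition can be expressed as simple address
    arithmetic. The sketch below assumes 1MB sections and 2MB PMDs as
    stated above; the helper names are hypothetical, and whether a reported
    neighbour really is empty depends on what other mappings exist, which
    this example does not track.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SIZE (UINT32_C(1) << 20)          /* 1MB section */
#define PMD_SIZE     (UINT32_C(2) << 20)          /* 2MB per PMD entry */
#define PMD_MASK     (~(PMD_SIZE - 1))

/*
 * If a section-aligned mapping starts in the upper half of a PMD, the lower
 * half is a potential gap. Returns 1 and stores its address in *gap.
 */
static int gap_below(uint32_t start, uint32_t *gap)
{
    assert((start & (SECTION_SIZE - 1)) == 0);
    if ((start & ~PMD_MASK) == 0)
        return 0;                                 /* start is PMD aligned */
    *gap = start & PMD_MASK;
    return 1;
}

/*
 * If a section-aligned mapping ends in the middle of a PMD, the upper half
 * (beginning at end) is a potential gap. Returns 1 and stores it in *gap.
 */
static int gap_above(uint32_t end, uint32_t *gap)
{
    assert((end & (SECTION_SIZE - 1)) == 0);
    if ((end & ~PMD_MASK) == 0)
        return 0;                                 /* end is PMD aligned */
    *gap = end;
    return 1;
}
```

    For example, a static mapping starting at 0xf0100000 leaves a possible
    1MB gap at 0xf0000000, while one starting at 0xf0200000 does not.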
     
  • Linus Torvalds
     
  • Pull ARM SoC fixes from Olof Johansson:
    "Another week, another batch of fixes.

    All are small, contained, targeted fixes for explicit problems --
    mostly build and boot failures across i.MX, OMAP, Renesas/Shmobile and
    Samsung."

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
    ARM: imx6q: fix suspend regression caused by common clk migration
    ARM: OMAP4470: Fix OMAP4470 boot failure
    ARM: EXYNOS: Fix EXYNOS_DEV_DMA Kconfig entry
    ARM: OMAP2+: nand: fix build error when CONFIG_MTD_ONENAND_OMAP2=n
    ARM: shmobile: r8a7779: Route all interrupts to ARM
    ARM: shmobile: kzm9d: use late init machine hook
    ARM: shmobile: kzm9g: use late init machine hook
    ARM: mach-shmobile: armadillo800eva: Use late init machine hook
    ARM: SAMSUNG: Fix for S3C2412 EBI memory mapping
    ARM: mach-shmobile: add missing GPIO IRQ configuration on mackerel
    ARM: mach-shmobile: Fix build when SMP is enabled and EMEV2 is not enabled
    ARM: shmobile: sh7372: bugfix: chclr_offset base
    ARM: shmobile: sh73a0: bugfix: SY-DMAC number
    ARM: SAMSUNG: Should check for IS_ERR(clk) instead of NULL

    Linus Torvalds
     
  • Fix kernel-doc warnings in printk.c: use correct parameter name.

    Warning(kernel/printk.c:2429): No description found for parameter 'buf'
    Warning(kernel/printk.c:2429): Excess function parameter 'line' description in 'kmsg_dump_get_buffer'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix kernel-doc warning. This struct member was removed in commit
    875682648b89 ("irq: Remove irq_chip->release()") so remove its
    associated kernel-doc entry also.

    Warning(include/linux/irq.h:338): Excess struct/union/enum/typedef member 'release' description in 'irq_chip'

    Signed-off-by: Randy Dunlap
    Cc: Richard Weinberger
    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • …/git/kgene/linux-samsung into fixes

    * 'v3.5-samsung-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
    ARM: EXYNOS: Fix EXYNOS_DEV_DMA Kconfig entry
    ARM: SAMSUNG: Fix for S3C2412 EBI memory mapping
    ARM: SAMSUNG: Should check for IS_ERR(clk) instead of NULL

    Olof Johansson
     
  • When moving to the common clk framework, the imx6q clks rom and mmdc_ch1_axi
    get different on/off states than with the old clk driver, which breaks the
    suspend function. There might be a better way to manage these clocks, but let's
    take the old clk driver approach to fix the regression first.

    Signed-off-by: Shawn Guo
    Signed-off-by: Olof Johansson

    Shawn Guo
     
  • …/git/tmlind/linux-omap into fixes

    From Tony Lindgren:
    "Here's one more regression fix that I missed earlier, and a
    trivial fix to get omap4470 booting."

    * tag 'omap-fixes-for-v3.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
    ARM: OMAP4470: Fix OMAP4470 boot failure
    ARM: OMAP2+: nand: fix build error when CONFIG_MTD_ONENAND_OMAP2=n

    Olof Johansson
     
  • Pull ACPI & Power Management patches from Len Brown.

    * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
    acpi_pad: fix power_saving thread deadlock
    ACPI video: Still use ACPI backlight control if _DOS doesn't exist
    ACPI, APEI, Avoid too much error reporting in runtime
    ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overriding
    ACPI: Remove one board specific WARN when ignoring timer overriding
    ACPI: Make acpi_skip_timer_override cover all source_irq==0 cases
    ACPI, x86: fix Dell M6600 ACPI reboot regression via DMI
    ACPI sysfs.c strlen fix

    Linus Torvalds
     
  • Pull driver Core fixes from Greg Kroah-Hartman:
    "Here is a number of printk() fixes, specifically a few reported by the
    crazy blog program that ships in SUSE releases (that's "boot log" and
    not "web log", it predates the general "blog" terminology by many
    years), and the restoration of the continuation line functionality
    reported by Stephen and others. Yes, the changes seem a bit big this
    late in the cycle, but I've been beating on them for a while now, and
    Stephen has even optimized it a bit, so all looks good to me.

    The other change in here is a Documentation update for the stable
    kernel rules describing how some distro patches should be backported,
    to hopefully drive a bit more response from the distros to the stable
    kernel releases.

    Signed-off-by: Greg Kroah-Hartman "

    * tag 'driver-core-3.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    printk: Optimize if statement logic where newline exists
    printk: flush continuation lines immediately to console
    syslog: fill buffer with more than a single message for SYSLOG_ACTION_READ
    Revert "printk: return -EINVAL if the message len is bigger than the buf size"
    printk: fix regression in SYSLOG_ACTION_CLEAR
    stable: Allow merging of backports for serious user-visible performance issues

    Linus Torvalds
     

30 Jun, 2012

5 commits

  • …-43168', 'bugzilla-40002' and 'bugfix-misc' into release

    bug fixes

    Len Brown
     
  • The acpi_pad driver can get stuck in destroy_power_saving_task()
    waiting for kthread_stop() to stop a power_saving thread. The problem
    is that the isolated_cpus_lock mutex is owned when
    destroy_power_saving_task() calls kthread_stop(), which waits for a
    power_saving thread to end, and the power_saving thread tries to
    acquire the isolated_cpus_lock when it calls round_robin_cpu(). This
    patch fixes the issue by making round_robin_cpu() use its own mutex.

    https://bugzilla.kernel.org/show_bug.cgi?id=42981

    Cc: stable@vger.kernel.org
    Signed-off-by: Stuart Hayes
    Signed-off-by: Len Brown

    Stuart Hayes
     
  • This fixes a regression in 3.4-rc1 caused by commit
    ea9f8856bd6d4ed45885b06a338f7362cd6c60e5
    (ACPI video: Harden video bus adding.)

    Some platforms don't have _DOS control method, but the ACPI
    backlight still works.
    We should not invoke _DOS for these platforms.

    https://bugzilla.kernel.org/show_bug.cgi?id=43168

    Cc: Igor Murzov
    Cc: stable@vger.kernel.org
    Signed-off-by: Zhang Rui
    Signed-off-by: Len Brown

    Zhang Rui
     
  • Pull power management fixes from Rafael J. Wysocki:

    * Fix for a bug in async suspend error code path causing parents to
    wait forever for their children in case of a suspend error from
    Mandeep Singh Baines (-stable material).

    * Fix for a suspend regression related to earlier changes in the ACPI
    cpuidle driver from Deepthi Dharwar.

    * tag 'pm-for-3.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    PM / ACPI: Fix suspend/resume regression caused by cpuidle cleanup.
    PM / Sleep: Prevent waiting forever on asynchronous suspend after abort

    Linus Torvalds
     
  • In reviewing Kay's fix up patch: "printk: Have printk() never buffer its
    data", I found two if statements that could be combined and optimized.

    Put together the two 'cont.len && cont.owner == current' if statements
    into a single one, and check if we need to call cont_add(). This also
    removes the unneeded double cont_flush() calls.

    Link: http://lkml.kernel.org/r/1340869133.876.10.camel@mop

    Signed-off-by: Steven Rostedt
    Cc: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt