05 Jan, 2012

2 commits

  • The commit 883c2cfc8bcc0fd00c5d9f596fb8870f481b5bda:

    "fix of_flat_dt_is_compatible() to match the full compatible string"

    causes silent boot death on the sbc8349 board because it was
    just looking for 8349 and not 8349E -- as originally there
    were non-E (no SEC/encryption) chips available. Just add the
    E to the board detection string since all boards I've seen
    were manufactured with the E versions.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Kumar Gala

    Paul Gortmaker
     
  • There is an issue on FSL-BookE 64-bit devices (P5020) in which PCIe
    devices that are capable of doing 64-bit DMAs (like an Intel e1000) do
    not function and crash the kernel if we have >4G of memory in the system.

    The reason is that the existing code only sets up one inbound window for
    access to system memory across PCIe. That window is limited to a 32-bit
    address space. So on such systems we'll end up utilizing SWIOTLB for dma
    mappings. However, the SWIOTLB dma ops implement dma_alloc_coherent() as
    dma_direct_alloc_coherent(). Thus we can end up with dma addresses that
    are not accessible because of the inbound window limitation.

    We could possibly set the SWIOTLB alloc_coherent op to
    swiotlb_alloc_coherent(); however, that does not address the issue, since
    swiotlb_alloc_coherent() behaves almost identically to
    dma_direct_alloc_coherent(): the device's coherent_dma_mask will be
    greater than any address allocated by swiotlb_alloc_coherent(), and thus
    we'll never bounce buffer it into a range that would be dma-able.

    The easiest and best solution is to just make it so that a 64-bit
    capable device is able to DMA to any internal system address.

    We accomplish this by opening up a second inbound window that maps all
    of memory above the internal SoC address width so we can set it up to
    access all of the internal SoC address space if needed.

    We then fix up the dma_ops and dma_offset for PCIe devices with a dma
    mask greater than the maximum internal SoC address.
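
    A minimal sketch of that fixup idea in C (the helper and constant names
    here are assumptions for illustration, not the actual fsl_pci code):

    #include <linux/pci.h>
    #include <asm/dma-mapping.h>

    /* Assumed values, for illustration only. */
    #define MAX_SOC_ADDR        0xfffffffffULL   /* 36-bit internal space */
    #define PCI64_DMA_OFFSET    (1ULL << 36)     /* base of the 2nd window */

    static void sketch_fsl_pci_dma_fixup(struct pci_dev *pdev)
    {
            /* Devices whose DMA mask fits within the SoC address space keep
             * the default (SWIOTLB-backed) ops and the 32-bit window. */
            if (pdev->dma_mask <= MAX_SOC_ADDR)
                    return;

            /* 64-bit capable devices go direct through the large second
             * inbound window, with a matching dma_offset. */
            set_dma_ops(&pdev->dev, &dma_direct_ops);
            set_dma_offset(&pdev->dev, PCI64_DMA_OFFSET);
    }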

    Signed-off-by: Kumar Gala

    Kumar Gala
     

03 Jan, 2012

4 commits

  • Unpaired calls to probe_hcall_entry and probe_hcall_exit might happen
    as follows, which could leave an incorrect preempt count.

    __trace_hcall_entry => trace_hcall_entry -> probe_hcall_entry =>
    get_cpu_var => preempt_disable

    __trace_hcall_exit => trace_hcall_exit -> probe_hcall_exit =>
    put_cpu_var => preempt_enable

    where:
    A => B and A -> B both mean A calls B, but:
    => means A calls B by function name, so B will definitely be
    called, while
    -> means A calls B through a function pointer, so B might not be
    called if the function pointer is not set.

    So an error occurs when only one of probe_hcall_entry and probe_hcall_exit
    gets called during an hcall.

    This patch tries to move the preempt count operations from
    probe_hcall_entry and probe_hcall_exit to their callers.
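
    A simplified sketch of the resulting shape (illustrative only; the real
    code lives in the pseries hcall tracing files and differs in detail):

    #include <linux/preempt.h>
    #include <asm/trace.h>

    /* The callers now own the preempt count, so it stays balanced even
     * when only one of the probes is registered. */
    void __trace_hcall_entry(unsigned long opcode, unsigned long *args)
    {
            preempt_disable();               /* was inside probe_hcall_entry */
            trace_hcall_entry(opcode, args); /* probe may or may not run */
    }

    void __trace_hcall_exit(long opcode, unsigned long retval,
                            unsigned long *retbuf)
    {
            trace_hcall_exit(opcode, retval, retbuf);
            preempt_enable();                /* was inside probe_hcall_exit */
    }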

    Reported-by: Paul E. McKenney
    Signed-off-by: Li Zhong
    Tested-by: Paul E. McKenney
    CC: stable@kernel.org [v2.6.32+]
    Signed-off-by: Benjamin Herrenschmidt

    Li Zhong
     
  • When using a >8bpp framebuffer, offb advertises truecolor, not directcolor,
    and doesn't touch the color map even if it has a corresponding access method
    for the real hardware.

    Thus it needs to set the pseudo-palette with all 3 components of the color,
    like other truecolor framebuffers, not with copies of the color index like
    a directcolor framebuffer would do.
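
    The pseudo-palette construction for a truecolor visual then looks roughly
    like this (a generic fbdev-style sketch, not the literal offb.c change):

    #include <linux/fb.h>

    static int sketch_setcolreg(unsigned regno, u16 red, u16 green, u16 blue,
                                struct fb_info *info)
    {
            u32 *pal = info->pseudo_palette;

            if (regno >= 16 || info->fix.visual != FB_VISUAL_TRUECOLOR)
                    return 0;

            /* Pack the real colour components at the offsets the mode
             * expects, instead of replicating the palette index. */
            pal[regno] =
                ((red   >> (16 - info->var.red.length))   << info->var.red.offset)   |
                ((green >> (16 - info->var.green.length)) << info->var.green.offset) |
                ((blue  >> (16 - info->var.blue.length))  << info->var.blue.offset);
            return 0;
    }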

    This went unnoticed for a long time because it's pretty hard to get offb
    to kick in with anything but 8bpp (old BootX under MacOS will do that and
    qemu does it).

    Signed-off-by: Benjamin Herrenschmidt
    CC: stable@kernel.org

    Benjamin Herrenschmidt
     
  • We rename the mach64 hack to "simple" since that's also applicable
    to anything using VGA-style DAC IO ports (set to 8-bit DAC) and we
    use it for qemu vga.

    Note that this is keyed on a device-tree "compatible" property that
    is currently only set by an upcoming version of SLOF when using the
    qemu "pseries" platform. This is on purpose as other qemu ppc platforms
    using OpenBIOS aren't properly setting the DAC to 8-bit at the time of
    writing this patch.

    We can fix OpenBIOS later to do that and add the required property, in
    which case it will be matched by this change.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • We used to try to request 8 times more vram than needed, which would
    fail if the card has too small a BAR (observed with qemu & kvm).

    Signed-off-by: Benjamin Herrenschmidt
    CC: stable@kernel.org

    Benjamin Herrenschmidt
     

22 Dec, 2011

1 commit

  • commit c55aef0e5bc6 ("powerpc/boot: Change the load address
    for the wrapper to fit the kernel") introduced a WARNING to
    inform the user that the uncompressed kernel would overlap
    the boot uncompressing wrapper code. Change it to an INFO.

    I initially thought this would be a 'WARNING' for those
    boards where the link_address should be fixed, so that the
    user could take action accordingly.

    Changing it to an INFO instead.

    Signed-off-by: Suzuki K. Poulose
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     

20 Dec, 2011

8 commits

  • The MPIC_PRIMARY define was recently made "default" and the meaning was
    inverted to MPIC_SECONDARY. This causes compile errors in currituck now, so
    fix it to the new manner of allocating mpics.

    Signed-off-by: Josh Boyer

    Josh Boyer
     
  • The wrapper code which uncompresses the kernel in the case of a 'ppc' boot
    is by default loaded at 0x00400000, and the kernel will be uncompressed
    to fit the region 0-0x00400000. But with dynamic relocations, the size
    of the kernel may exceed 0x00400000 (4MB). This would cause the
    uncompressed kernel to overlap the boot wrapper, causing the boot to
    fail.

    The message looks like:

    zImage starting: loaded at 0x00400000 (sp: 0x0065ffb0)
    Allocating 0x5ce650 bytes for kernel ...
    Insufficient memory for kernel at address 0! (_start=00400000, uncompressed size=00591a20)

    This patch shifts the load address of the boot wrapper code to the next
    higher MB, according to the size of the uncompressed vmlinux.

    With the patch, we get the following message while building the image :

    WARN: Uncompressed kernel (size 0x5b0344) overlaps the address of the wrapper(0x400000)
    WARN: Fixing the link_address of wrapper to (0x600000)
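
    The rounding itself is straightforward; an illustrative C sketch of the
    calculation (the real change lives in the boot wrapper, so the function
    name here is just for demonstration):

    #include <stdio.h>

    #define MB (1024UL * 1024UL)

    /* Round the wrapper link address up to the next MB boundary above the
     * uncompressed kernel size, so the two regions never overlap. */
    static unsigned long pick_link_address(unsigned long vmlinux_size)
    {
            return (vmlinux_size + MB - 1) & ~(MB - 1);
    }

    int main(void)
    {
            /* 0x5b0344 bytes uncompressed -> wrapper linked at 0x600000 */
            printf("0x%lx\n", pick_link_address(0x5b0344));
            return 0;
    }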

    Signed-off-by: Suzuki K. Poulose
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     
  • Now that we have a relocatable kernel, supporting CRASH_DUMP only requires
    turning the switches on for UP machines.

    We don't have kexec support on 47x yet. Enabling SMP support would be done
    as part of enabling the PPC_47x support.

    Signed-off-by: Suzuki K. Poulose
    Cc: Josh Boyer
    Cc: Benjamin Herrenschmidt
    Cc: linuxppc-dev
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     
  • The following patch adds relocatable kernel support - based on processing
    of dynamic relocations - for the PPC44x kernel.

    We find the runtime address of _stext and relocate ourselves based
    on the following calculation.

    virtual_base = ALIGN(KERNELBASE,256M) +
                   MODULO(_stext.run,256M)

    relocate() is called with the Effective Virtual Base Address (as
    shown below)

                 | Phys. Addr| Virt. Addr |
    Page (256M)  |------------------------|
    Boundary     |           |            |
                 |           |            |
                 |           |            |
    Kernel Load  |___________|_ __ _ _ _ _|
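
    In C terms the calculation amounts to the following (an illustrative
    sketch only; the real code runs in the early relocation path):

    /* 256M pinned TLB size for 44x, as described above. */
    #define TLB_PIN_SIZE 0x10000000UL

    /* Effective virtual base passed to relocate(): KERNELBASE aligned to
     * the 256M page, plus _stext's runtime offset within its 256M page. */
    static inline unsigned long effective_virtual_base(unsigned long kernelbase,
                                                       unsigned long stext_run)
    {
            return (kernelbase & ~(TLB_PIN_SIZE - 1)) +
                   (stext_run & (TLB_PIN_SIZE - 1));
    }
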
    Cc: Benjamin Herrenschmidt
    Cc: Kumar Gala
    Cc: Tony Breeds
    Cc: Josh Boyer
    Cc: linuxppc-dev
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     
  • We find the runtime address of _stext and relocate ourselves based
    on the following calculation.

    virtual_base = ALIGN(KERNELBASE,KERNEL_TLB_PIN_SIZE) +
                   MODULO(_stext.run,KERNEL_TLB_PIN_SIZE)

    relocate() is called with the Effective Virtual Base Address (as
    shown below)

                 | Phys. Addr| Virt. Addr |
    Page         |------------------------|
    Boundary     |           |            |
                 |           |            |
                 |           |            |
    Kernel Load  |___________|_ __ _ _ _ _|
    Cc: Benjamin Herrenschmidt
    Cc: Kumar Gala
    Cc: linuxppc-dev
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     
  • The following patch implements the dynamic relocation processing for the
    PPC32 kernel. relocate() accepts the target virtual address and relocates
    the kernel image to it.

    Currently the following relocation types are handled:

    R_PPC_RELATIVE
    R_PPC_ADDR16_LO
    R_PPC_ADDR16_HI
    R_PPC_ADDR16_HA

    The last 3 relocation types in the above list depend on the value of the
    symbol whose index is encoded in the relocation entry. Hence we need the
    symbol table for processing such relocations.

    Note: The GNU ld for ppc32 produces buggy relocations for relocation types
    that depend on symbols. The value of symbols with STB_LOCAL scope
    should be assumed to be zero. - Alan Modra
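
    A much-simplified sketch of how such relocations are applied (illustrative
    C only; the in-kernel relocate() is implemented in assembly):

    #include <elf.h>
    #include <stdint.h>

    /* Apply one relocation at its runtime location. "delta" is the
     * difference between the runtime and link-time addresses; "symval" is
     * the referenced symbol's value (taken as zero for STB_LOCAL symbols,
     * per the note above). */
    static void apply_reloc(const Elf32_Rela *r, uint32_t delta, uint32_t symval)
    {
            uint32_t value = symval + r->r_addend + delta;
            uint16_t *loc16 = (uint16_t *)(uintptr_t)(r->r_offset + delta);
            uint32_t *loc32 = (uint32_t *)(uintptr_t)(r->r_offset + delta);

            switch (ELF32_R_TYPE(r->r_info)) {
            case R_PPC_RELATIVE:
                    *loc32 = r->r_addend + delta;
                    break;
            case R_PPC_ADDR16_LO:
                    *loc16 = value & 0xffff;
                    break;
            case R_PPC_ADDR16_HI:
                    *loc16 = (value >> 16) & 0xffff;
                    break;
            case R_PPC_ADDR16_HA:
                    *loc16 = ((value + 0x8000) >> 16) & 0xffff;
                    break;
            }
    }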

    Signed-off-by: Suzuki K. Poulose
    Signed-off-by: Josh Poimboeuf
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Alan Modra
    Cc: Kumar Gala
    Cc: linuxppc-dev
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     
  • DYNAMIC_MEMSTART (the old RELOCATABLE) was restricted to PPC_47x variants
    of 44x. This patch enables DYNAMIC_MEMSTART for 440x-based chipsets.

    Signed-off-by: Suzuki K. Poulose
    Cc: Josh Boyer
    Cc: Kumar Gala
    Cc: Benjamin Herrenschmidt
    Cc: linuxppc-dev
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     
  • The current implementation of CONFIG_RELOCATABLE in BookE is based
    on mapping the page-aligned kernel load address to KERNELBASE. This
    approach, however, is not enough for platforms where the TLB page size
    is large (e.g., 256M on 44x). So we are renaming the RELOCATABLE used
    currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.

    The CONFIG_RELOCATABLE for PPC32 (BookE), based on processing of the
    dynamic relocations, will be introduced later in the patch series.

    This change would allow the use of the old RELOCATABLE method for
    platforms which can afford to enforce the page alignment (platforms with
    smaller TLB sizes).

    Changes since v3:

    * Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
    either RELOCATABLE or DYNAMIC_MEMSTART (suggested by Josh Boyer)

    Suggested-by: Scott Wood
    Tested-by: Scott Wood

    Signed-off-by: Suzuki K. Poulose
    Cc: Scott Wood
    Cc: Kumar Gala
    Cc: Josh Boyer
    Cc: Benjamin Herrenschmidt
    Cc: linuxppc-dev
    Signed-off-by: Josh Boyer

    Suzuki Poulose
     

19 Dec, 2011

8 commits

  • We have an array of 16 entries and a loop of 32 iterations... oops.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • As kernels and initrds get bigger, boot loaders and possibly
    kexec-tools will need to place the initrd outside the RMO. When this
    happens we end up with no lowmem and the boot doesn't get very far.

    Only use initrd_end as the limit for alloc_bottom if it's inside the
    RMO.
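
    The guard is essentially a single condition; a hedged sketch of the idea
    (simplified, with approximate names rather than the actual prom_init code):

    #include <linux/mm.h>   /* PAGE_ALIGN */

    /* Only let the initrd raise the allocation floor when it actually sits
     * inside the RMO; otherwise leave alloc_bottom alone so lowmem survives. */
    static unsigned long limit_alloc_bottom(unsigned long alloc_bottom,
                                            unsigned long initrd_end,
                                            unsigned long rmo_top)
    {
            if (initrd_end && initrd_end <= rmo_top)
                    return PAGE_ALIGN(initrd_end);
            return alloc_bottom;
    }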

    Signed-off-by: Paul Mackerras
    Signed-off-by: Tony Breeds
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     
  • We support 16TB of user address space and half a million contexts,
    so update the comment to reflect this.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Commit d57af9b (taskstats: use real microsecond granularity for CPU times)
    renamed msecs_to_cputime to usecs_to_cputime, but failed to update all
    numbers on the way. This causes nonsensical cpu idle/iowait values to be
    displayed in /proc/stat (the only user of usecs_to_cputime so far).

    This also renames __cputime_msec_factor to __cputime_usec_factor, adapting
    its value and using it directly in cputime_to_usecs instead of doing two
    multiplications.

    Signed-off-by: Andreas Schwab
    Acked-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Andreas Schwab
     
  • read_n_cells() cannot be marked as .devinit.text since it is referenced
    from two functions that are not in that section: of_get_lmb_size() and
    hot_add_drconf_scn_to_nid().

    Signed-off-by: David Rientjes
    Signed-off-by: Benjamin Herrenschmidt

    David Rientjes
     
  • mark_reserved_regions_for_nid() is only called from do_init_bootmem(),
    which is in .init.text, so it must be in the same section to avoid a
    section mismatch warning.

    Reported-by: Subrata Modak
    Signed-off-by: David Rientjes
    Signed-off-by: Benjamin Herrenschmidt

    David Rientjes
     
  • PPC64 uses long long for u64 in the kernel, but powerpc's asm/types.h
    prevents 64-bit userland from seeing this definition, instead defaulting
    to u64 == long in userspace. Some user programs (e.g. kvmtool) may actually
    want LL64, so this patch adds a check for __SANE_USERSPACE_TYPES__ so that,
    if defined, int-ll64.h is included instead.
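
    Conceptually the header check looks like this (a simplified sketch, not
    the verbatim asm/types.h contents):

    /* 64-bit userland keeps u64 == unsigned long unless it explicitly asks
     * for the long long definitions the kernel already uses. */
    #if defined(__powerpc64__) && \
        !defined(__KERNEL__) && !defined(__SANE_USERSPACE_TYPES__)
    # include <asm-generic/int-l64.h>    /* u64 is unsigned long */
    #else
    # include <asm-generic/int-ll64.h>   /* u64 is unsigned long long */
    #endif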

    Signed-off-by: Matt Evans
    Acked-by: Ingo Molnar
    Signed-off-by: Benjamin Herrenschmidt

    Matt Evans
     
  • Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
    For large aligned copies this new loop is over 10% faster, and for
    large unaligned copies it is over 200% faster.

    If we take a fault we fall back to the old version, this keeps
    things relatively simple and easy to verify.

    On POWER7 unaligned stores rarely slow down - they only flush when
    a store crosses a 4KB page boundary. Furthermore this flush is
    handled completely in hardware and should be 20-30 cycles.

    Unaligned loads on the other hand flush much more often - whenever
    crossing a 128 byte cache line, or a 32 byte sector if either sector
    is an L1 miss.

    Considering this information we really want to get the loads aligned
    and not worry about the alignment of the stores. Microbenchmarks
    confirm that this approach is much faster than the current unaligned
    copy loop that uses shifts and rotates to ensure both loads and
    stores are aligned.

    We also want to try to do the stores in cacheline aligned, cacheline
    sized chunks. If the store queue is unable to merge an entire
    cacheline of stores then the L2 cache will have to do a
    read/modify/write. Even worse, we will serialise this with the stores
    in the next iteration of the copy loop since both iterations hit
    the same cacheline.

    Based on this, the new loop does the following things:

    1 - 127 bytes
    Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
    boring and similar to how the current loop works.

    128 - 4095 bytes
    Get the source 8 byte aligned and use 8 byte loads and stores,
    1 cacheline at a time. We aren't doing the stores in cacheline
    aligned chunks so we will potentially serialise once per cacheline.
    Even so it is much better than the loop we have today.

    4096 bytes and up
    If both source and destination have the same alignment get them both
    16 byte aligned, then get the destination cacheline aligned. Do
    cacheline sized loads and stores using VMX.

    If source and destination do not have the same alignment, we get the
    destination cacheline aligned, and use permute to do aligned loads.

    In both cases the VMX loop should be optimal - we always do aligned
    loads and stores and are always doing stores in cacheline aligned,
    cacheline sized chunks.

    To be able to use VMX we must be careful about interrupts and
    sleeping. We don't use the VMX loop when in an interrupt (which should
    be rare anyway) and we wrap the VMX loop in disable/enable_pagefault
    and fall back to the existing copy_tofrom_user loop if we do need to
    sleep.
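
    The guard around the VMX path can be pictured roughly like this (a sketch
    of the rules above with hypothetical helper names; the real loop is in
    assembly):

    #include <linux/errno.h>
    #include <linux/hardirq.h>
    #include <linux/uaccess.h>

    /* Returns 0 on success, nonzero if the caller should use the old,
     * non-VMX copy loop instead. */
    static int try_vmx_copy(void *to, const void *from, unsigned long n)
    {
            if (in_interrupt())
                    return -EBUSY;           /* never touch VMX in interrupts */

            preempt_disable();
            enable_kernel_altivec();         /* saves live user VMX state */
            pagefault_disable();
            /* ... cacheline-sized VMX loads/stores go here; on a fault we
             * bail out so the caller can fall back to the existing
             * copy_tofrom_user loop, which is allowed to sleep ... */
            pagefault_enable();
            preempt_enable();
            return 0;
    }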

    The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:

    http://ozlabs.org/~anton/junkcode/copy_to_user.c

    Since we are using VMX and there is a cost to saving and restoring
    the user VMX state there are two broad cases we need to benchmark:

    - Best case - userspace never uses VMX

    - Worst case - userspace always uses VMX

    In reality a userspace process will sit somewhere between these two
    extremes. Since we need to test both aligned and unaligned copies we
    end up with 4 combinations. The point at which the VMX loop begins to
    win is:

    0% VMX
    aligned 2048 bytes
    unaligned 2048 bytes

    100% VMX
    aligned 16384 bytes
    unaligned 8192 bytes

    Considering this is a microbenchmark, the data is hot in cache, and
    the VMX loop has better store queue merging properties, we set the
    breakpoint to 4096 bytes, a little below the unaligned breakpoints.

    Some future optimisations we can look at:

    - Looking at the perf data, a significant part of the cost when a
    task is always using VMX is the extra exception we take to restore
    the VMX state. As such we should do something similar to the x86
    optimisation that restores FPU state for heavy users, i.e.:

    /*
     * If the task has used fpu the last 5 timeslices, just do a full
     * restore of the math state immediately to avoid the trap; the
     * chances of needing FPU soon are obviously high now
     */
    preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;

    and

    /*
     * fpu_counter contains the number of consecutive context switches
     * that the FPU is used. If this is over a threshold, the lazy fpu
     * saving becomes unlazy to save the trap. This is an unsigned char
     * so that after 256 times the counter wraps and the behavior turns
     * lazy again; this to deal with bursty apps that only use FPU for
     * a short time
     */

    - We could create a paca bit to mirror the VMX enabled MSR bit and check
    that first, avoiding multiple calls to enable_kernel_altivec.
    That should help with iovec based system calls like readv.

    - We could have two VMX breakpoints, one for when we know the user VMX
    state is loaded into the registers and one when it isn't. This could
    be a second bit in the paca so we can calculate the break points quickly.

    - One suggestion from Ben was to save and restore the VSX registers
    we use inline instead of using enable_kernel_altivec.

    [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

16 Dec, 2011

8 commits

  • As of commit dd472da38, rwsem.h was moved into asm-generic.
    This patch removes the arch file and points the build at
    its new location.

    Signed-off-by: Richard Kuo
    Signed-off-by: Benjamin Herrenschmidt

    Richard Kuo
     
  • Conflicts:
    arch/powerpc/platforms/40x/ppc40x_simple.c

    Benjamin Herrenschmidt
     
  • The code for "powersurge" SMP would kick in and cause a crash
    at boot due to the lack of a NULL test.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • In the old days, we treated all interrupts from the legacy Apple home-made
    interrupt controllers as level, with a trick reading the "level" register
    along with the "event" register to work around bugs where it would
    occasionally fail to latch some events.

    Doing so appeared to work fine for both level and edge interrupts.

    Later on, we discovered in Darwin source the magic masks that define which
    interrupts are actually level and which are edge, and implemented a
    different algorithm, more similar to what Apple does, that treats those
    differently.

    I recently discovered however that this caused problems (including loss
    of interrupts) with an old Wallstreet PowerBook when trying to use the
    internal modem (connected to a cascaded controller).

    It looks like some interrupts are treated as edge while they are really
    level and I'm starting to seriously doubt the correctness of the Darwin
    code (which has other obvious bugs when you read it, so ...)

    This patch reverts to our original behaviour of treating everything as
    a level interrupt. It appears to solve the problems with the modem on
    the Wallstreet and everything else seems to be working properly as well.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • This patch reworks & simplifies pmac_zilog handling of suspend/resume,
    essentially removing all the specific code in there and using the
    generic uart helpers.

    This required properly registering the tty as a child of the macio (or platform)
    device, so I had to delay the registration a bit (we used to register the ports
    very very early). We still register the kernel console early though.

    I removed a couple of unused or useless flags as well, relying on the
    core to not call us when asleep. I also removed the essentially useless
    interrupt mutex, simplifying the locking a bit.

    I removed some code for handling unexpected interrupts which should never
    be hit and could potentially be harmful (causing us to access a register
    on a powered-off SCC). We always disable port interrupts on close, so there
    should be no need to drain data on a closed port.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • Benjamin Herrenschmidt
     
  • Benjamin Herrenschmidt
     
  • Benjamin Herrenschmidt
     

09 Dec, 2011

8 commits


08 Dec, 2011

1 commit