09 May, 2007

1 commit

  • The basic issue is to be able to do what hugetlbfs does but with
    different page sizes for some other special filesystems; more
    specifically, my need is:

    - Huge pages

    - SPE local store mappings using 64K pages on a 4K base page size
    kernel on Cell

    - Some special 4K segments in 64K-page kernels for mapping a dodgy
    type of powerpc-specific infiniband hardware that requires 4K MMU
    mappings for various reasons I won't explain here.

    The main issues are:

    - To maintain/keep track of the page size per "segment" (as we can
    only have one page size per segment on powerpc, which are 256MB
    divisions of the address space).

    - To make sure special mappings stay within their allotted
    "segments" (including MAP_FIXED crap)

    - To make sure everybody else doesn't mmap/brk/grow_stack into a
    "segment" that is used for a special mapping

    Some of the necessary mechanisms to handle that were present in the
    hugetlbfs code, but mostly in ways not suitable for anything else.

    The patch relies on some changes to the generic get_unmapped_area()
    that just got merged. It still hijacks hugetlb callbacks here or
    there as the generic code hasn't been entirely cleaned up yet but
    that shouldn't be a problem.

    So what is a slice? Well, I re-used the mechanism used formerly by our
    hugetlbfs implementation which divides the address space in
    "meta-segments" which I called "slices". The division is done using
    256MB slices below 4G, and 1T slices above. Thus the address space is
    divided currently into 16 "low" slices and 16 "high" slices. (Special
    case: high slice 0 is the area between 4G and 1T).
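
    As an illustration only (not code from the patch), the slice layout
    described above can be sketched as a lookup from an address to a slice
    index. The names addr_to_slice, struct slice_id and the SLICE_* constants
    are hypothetical; only the sizes (256MB below 4G, 1T above, 16 of each)
    come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the commit text; the macro names are made up here. */
#define SLICE_LOW_SHIFT   28              /* 256MB low slices */
#define SLICE_HIGH_SHIFT  40              /* 1TB high slices */
#define SLICE_LOW_TOP     (1ULL << 32)    /* low slices end at 4GB */
#define SLICE_NUM_HIGH    16

/* Hypothetical result type: which slice an address falls in. */
struct slice_id {
    int is_high;        /* 0: a "low" slice, 1: a "high" slice */
    unsigned index;     /* 0..15 within its group */
};

static struct slice_id addr_to_slice(uint64_t addr)
{
    struct slice_id s;

    /* 16 high slices of 1TB each cover addresses below 16TB. */
    assert(addr < ((uint64_t)SLICE_NUM_HIGH << SLICE_HIGH_SHIFT));

    if (addr < SLICE_LOW_TOP) {
        s.is_high = 0;
        s.index = (unsigned)(addr >> SLICE_LOW_SHIFT);
    } else {
        /* Special case from the text: 4G..1T lands in high slice 0. */
        s.is_high = 1;
        s.index = (unsigned)(addr >> SLICE_HIGH_SHIFT);
    }
    return s;
}
```

    With this layout, tracking one page size per slice needs only 32 entries
    rather than one entry per 256MB segment.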

    Doing so simplifies significantly the tracking of segments and avoids
    having to keep track of all the 256MB segments in the address space.

    While I used the "concepts" of hugetlbfs, I mostly re-implemented
    everything in a more generic way and "ported" hugetlbfs to it.

    Slices can have an associated page size, which is encoded in the mmu
    context and used by the SLB miss handler to set the segment sizes. The
    hash code currently doesn't care, it has a specific check for hugepages,
    though I might add a mechanism to provide per-slice hash mapping
    functions in the future.

    The slice code provides a pair of "generic" get_unmapped_area() (bottomup
    and topdown) functions that should work with any slice size. There is
    some trickiness here, so I would appreciate it if people had a look at
    the implementation of these and let me know if I got something wrong.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Paul Mackerras

    Benjamin Herrenschmidt
     

24 Apr, 2007

1 commit

  • Save the trap number in the case of getting a bad stack in an exception
    handler. It is sometimes useful to know what exception it was that caused
    this to happen. Without this, no trap number is reported.

    Signed-off-by: Olof Johansson
    Signed-off-by: Paul Mackerras

    Olof Johansson
     

21 Mar, 2007

1 commit

  • Move the slb_shadow_ptr field into the first cache line since it is
    (like everything there) read-only after boot. It is in fact statically
    initialised and thereafter only read.

    Signed-off-by: Stephen Rothwell
    Acked-by: Michael Neuling
    Signed-off-by: Paul Mackerras

    Stephen Rothwell
     

16 Oct, 2006

1 commit

  • This implements a lazy strategy for disabling interrupts. This means
    that local_irq_disable() et al. just clear the 'interrupts are
    enabled' flag in the paca. If an interrupt comes along, the interrupt
    entry code notices that interrupts are supposed to be disabled, and
    clears the EE bit in SRR1, clears the 'interrupts are hard-enabled'
    flag in the paca, and returns. This means that interrupts only
    actually get disabled in the processor when an interrupt comes along.

    When interrupts are enabled by local_irq_enable() et al., the code
    sets the interrupts-enabled flag in the paca, and then checks whether
    interrupts got hard-disabled. If so, it also sets the EE bit in the
    MSR to hard-enable the interrupts.
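
    The flow above can be modelled in plain C as a toy state machine. This
    is a sketch for illustration, not the kernel code: the struct and
    function names are hypothetical, msr_ee merely stands in for the MSR EE
    bit, and setting hard_enabled again on enable is an assumption about
    how the flags are kept consistent.

```c
#include <stdbool.h>

/* Toy model of the per-cpu flags named in the commit text. */
struct lazy_paca_model {
    bool soft_enabled;   /* the 'interrupts are enabled' flag */
    bool hard_enabled;   /* the 'interrupts are hard-enabled' flag */
};

struct lazy_cpu_model {
    bool msr_ee;                  /* stands in for the MSR EE bit */
    struct lazy_paca_model paca;
};

/* Models local_irq_disable(): only the soft flag changes. */
static void model_irq_disable(struct lazy_cpu_model *cpu)
{
    cpu->paca.soft_enabled = false;
}

/* Models interrupt entry; returns true if the handler should run. */
static bool model_interrupt_entry(struct lazy_cpu_model *cpu)
{
    if (!cpu->msr_ee)
        return false;             /* hardware would not deliver it */
    if (!cpu->paca.soft_enabled) {
        /* Supposed to be disabled: hard-disable now and return. */
        cpu->msr_ee = false;
        cpu->paca.hard_enabled = false;
        return false;
    }
    return true;
}

/* Models local_irq_enable(). */
static void model_irq_enable(struct lazy_cpu_model *cpu)
{
    cpu->paca.soft_enabled = true;
    if (!cpu->paca.hard_enabled) {
        /* An interrupt arrived while soft-disabled: re-enable EE. */
        cpu->paca.hard_enabled = true;
        cpu->msr_ee = true;
    }
}
```

    In this model a disable/enable pair with no intervening interrupt never
    touches msr_ee, which is where the potential saving comes from.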

    This has the potential to improve performance, and also makes it
    easier to make a kernel that can boot on iSeries and on other 64-bit
    machines, since this lazy-disable strategy is very similar to the
    soft-disable strategy that iSeries already uses.

    This version renames paca->proc_enabled to paca->soft_enabled, and
    changes a couple of soft-disables in the kexec code to hard-disables,
    which should fix the crash that Michael Ellerman saw. This doesn't
    yet use a reserved CR field for the soft_enabled and hard_enabled
    flags. This applies on top of Stephen Rothwell's patches to make it
    possible to build a combined iSeries/other kernel.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     

14 Sep, 2006

1 commit


13 Sep, 2006

1 commit

  • This changes the writeX family of functions to have a sync instruction
    before the MMIO store rather than after, because the generally expected
    behaviour is that the device receiving the MMIO store can be guaranteed
    to see the effects of any preceding writes to normal memory.

    To preserve ordering between writeX and readX, and to preserve ordering
    between preceding stores and the readX, the readX family of functions
    have had a sync added before the load.

    Although writeX followed by spin_unlock is not officially guaranteed
    to keep the writeX inside the spin-locked region unless an mmiowb()
    is used, there are currently drivers that depend on the previous
    behaviour on powerpc, which was that the mmiowb wasn't actually required.
    Therefore we have a per-cpu flag that is set by writeX, cleared by
    __raw_spin_lock and mmiowb, and tested by __raw_spin_unlock. If it is
    set, __raw_spin_unlock does a sync and clears it.
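
    A minimal model of that flag protocol, for illustration only: the
    struct, its field names and the model_* functions are hypothetical, and
    the syncs counter exists purely so the effect can be observed. The real
    barriers and the MMIO store are left as comments.

```c
#include <stdbool.h>

/* Toy per-cpu state; the names are made up for this sketch. */
struct io_cpu_model {
    bool io_pending;   /* set by writeX, cleared by lock/mmiowb/unlock */
    unsigned syncs;    /* number of syncs issued by unlock */
};

/* Models writeX: (sync; MMIO store), then mark the flag. */
static void model_writeX(struct io_cpu_model *c)
{
    c->io_pending = true;
}

/* Models __raw_spin_lock: taking a lock clears the flag. */
static void model_spin_lock(struct io_cpu_model *c)
{
    c->io_pending = false;
}

/* Models mmiowb(): the explicit barrier also clears the flag. */
static void model_mmiowb(struct io_cpu_model *c)
{
    c->io_pending = false;
}

/* Models __raw_spin_unlock: sync only if a writeX is still pending. */
static void model_spin_unlock(struct io_cpu_model *c)
{
    if (c->io_pending) {
        c->syncs++;            /* the extra sync would be issued here */
        c->io_pending = false;
    }
}
```

    Under this model, a critical section that does no MMIO writes, or that
    already calls mmiowb(), pays nothing extra at unlock time.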

    This changes both 32-bit and 64-bit readX/writeX. 32-bit already has a
    sync in __raw_spin_unlock (since lwsync doesn't exist on 32-bit), and thus
    doesn't need the per-cpu flag.

    Tested on G5 (PPC970) and POWER5.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     

08 Aug, 2006

1 commit

  • This adds a shadow buffer for the SLBs and registers it with PHYP.
    Only the bolted SLB entries (top 3) are shadowed.

    The SLB shadow buffer tells the hypervisor what the kernel needs to
    have in the SLB for the kernel to be able to function. The hypervisor
    can use this information to speed up partition context switches.

    Signed-off-by: Michael Neuling
    Signed-off-by: Paul Mackerras

    Michael Neuling
     

23 Jun, 2006

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
    [POWERPC] re-enable OProfile for iSeries, using timer interrupt
    [POWERPC] support ibm,extended-*-frequency properties
    [POWERPC] Extra sanity check in EEH code
    [POWERPC] Dont look for class-code in pci children
    [POWERPC] Fix mdelay badness on shared processor partitions
    [POWERPC] disable floating point exceptions for init
    [POWERPC] Unify ppc syscall tables
    [POWERPC] mpic: add support for serial mode interrupts
    [POWERPC] pseries: Print PCI slot location code on failure
    [POWERPC] spufs: one more fix for 64k pages
    [POWERPC] spufs: fail spu_create with invalid flags
    [POWERPC] spufs: clear class2 interrupt status before wakeup
    [POWERPC] spufs: fix Makefile for "make clean"
    [POWERPC] spufs: remove stop_code from struct spu
    [POWERPC] spufs: fix spu irq affinity setting
    [POWERPC] spufs: further abstract priv1 register access
    [POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
    [POWERPC] spufs: dont try to access SPE channel 1 count
    [POWERPC] spufs: use kzalloc in create_spu
    [POWERPC] spufs: fix initial state of wbox file
    ...

    Manually resolved conflicts in:
    drivers/net/phy/Makefile
    include/asm-powerpc/spu.h

    Linus Torvalds
     

15 Jun, 2006

1 commit

  • Some POWER5+ machines can do 64k hardware pages for normal memory but
    not for cache-inhibited pages. This patch lets us use 64k hardware
    pages for most user processes on such machines (assuming the kernel
    has been configured with CONFIG_PPC_64K_PAGES=y). User processes
    start out using 64k pages and get switched to 4k pages if they use any
    non-cacheable mappings.

    With this, we use 64k pages for the vmalloc region and 4k pages for
    the imalloc region. If anything creates a non-cacheable mapping in
    the vmalloc region, the vmalloc region will get switched to 4k pages.
    I don't know of any driver other than the DRM that would do this,
    though, and these machines don't have AGP.

    When a region gets switched from 64k pages to 4k pages, we do not have
    to clear out all the 64k HPTEs from the hash table immediately. We
    use the _PAGE_COMBO bit in the Linux PTE to indicate whether the page
    was hashed in as a 64k page or a set of 4k pages. If hash_page is
    trying to insert a 4k page for a Linux PTE and it sees that it has
    already been inserted as a 64k page, it first invalidates the 64k HPTE
    before inserting the 4k HPTE. The hash invalidation routines also use
    the _PAGE_COMBO bit, to determine whether to look for a 64k HPTE or a
    set of 4k HPTEs to remove. With those two changes, we can tolerate a
    mix of 4k and 64k HPTEs in the hash table, and they will all get
    removed when the address space is torn down.
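
    The role of the _PAGE_COMBO bit can be sketched with a toy model. This
    is not the hash_page code: the struct, its fields and the model_*
    functions are hypothetical, a single flag stands in for the set of 4k
    HPTEs, and the counter is there only to make the invalidation visible.

```c
#include <stdbool.h>

/* Toy stand-in for a Linux PTE; only the state discussed is modelled. */
struct combo_pte_model {
    bool hashed;          /* some HPTE currently exists for this PTE */
    bool combo;           /* models _PAGE_COMBO: hashed as 4k pages */
    unsigned inval_64k;   /* counts 64k HPTE invalidations */
};

/* Hash the PTE in as a single 64k page. */
static void model_hash_64k(struct combo_pte_model *pte)
{
    pte->hashed = true;
    pte->combo = false;
}

/* Hash in a 4k page for this PTE. */
static void model_hash_4k(struct combo_pte_model *pte)
{
    if (pte->hashed && !pte->combo) {
        /* Already present as a 64k HPTE: invalidate that first. */
        pte->inval_64k++;
        pte->hashed = false;
    }
    pte->combo = true;
    pte->hashed = true;
}

/* HPTE size (in KB) the invalidation path should look for; 0 if none. */
static unsigned model_invalidate_size_kb(const struct combo_pte_model *pte)
{
    if (!pte->hashed)
        return 0;
    return pte->combo ? 4 : 64;
}
```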

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     

12 Jun, 2006

1 commit


26 Apr, 2006

1 commit


27 Mar, 2006

1 commit

  • We currently have a hack to flip the boot cpu and its secondary thread
    to logical cpuid 0 and 1. This means the logical - physical mapping will
    differ depending on which cpu is boot cpu. This is most apparent on
    kexec, where we might kexec on any cpu and therefore change the mapping
    from boot to boot.

    The patch below does a first pass early on to work out the logical cpuid
    of the boot thread. We then fix up some paca structures to match.

    I've also removed the boot_cpuid_phys variable for ppc64; to be
    consistent, we use get_hard_smp_processor_id(boot_cpuid) everywhere.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Paul Mackerras

    Anton Blanchard
     

24 Feb, 2006

1 commit

  • This implements accurate task and cpu time accounting for 64-bit
    powerpc kernels. Instead of accounting a whole jiffy of time to a
    task on a timer interrupt because that task happened to be running at
    the time, we now account time in units of timebase ticks according to
    the actual time spent by the task in user mode and kernel mode. We
    also count the time spent processing hardware and software interrupts
    accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If
    that is not set, we do tick-based approximate accounting as before.

    To get this accurate information, we read either the PURR (processor
    utilization of resources register) on POWER5 machines, or the timebase
    on other machines, on:

    * each entry to the kernel from usermode
    * each exit to usermode
    * transitions between process context, hard irq context and soft irq
    context in kernel mode
    * context switches.

    On POWER5 systems with shared-processor logical partitioning we also
    read both the PURR and the timebase at each timer interrupt and
    context switch in order to determine how much time has been taken by
    the hypervisor to run other partitions ("steal" time). Unfortunately,
    since we need values of the PURR on both threads at the same time to
    accurately calculate the steal time, and since we can only calculate
    steal time on a per-core basis, the apportioning of the steal time
    between idle time (time which we ceded to the hypervisor in the idle
    loop) and actual stolen time is somewhat approximate at the moment.

    This is all based quite heavily on what s390 does, and it uses the
    generic interfaces that were added by the s390 developers,
    i.e. account_system_time(), account_user_time(), etc.

    This patch doesn't add any new interfaces between the kernel and
    userspace, and doesn't change the units in which time is reported to
    userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(),
    times(), etc. Internally the various task and cpu times are stored in
    timebase units, but they are converted to USER_HZ units (1/100th of a
    second) when reported to userspace. Some precision is therefore lost
    but there should not be any accumulating error, since the internal
    accumulation is at full precision.
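
    To illustrate why converting only at report time avoids accumulating
    error, here is a small sketch. The function name and the tb_per_sec
    parameter (the timebase frequency) are assumptions made for the
    example; only the USER_HZ value of 100 comes from the text.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_USER_HZ 100u   /* 1/100th of a second, as stated above */

/*
 * Convert an accumulated timebase tick count to USER_HZ units.
 * tb_per_sec is the timebase frequency in ticks per second.
 */
static uint64_t model_ticks_to_user_hz(uint64_t tb_ticks, uint64_t tb_per_sec)
{
    uint64_t secs, rem;

    assert(tb_per_sec > 0);

    /* Split into whole seconds and a remainder to avoid overflow. */
    secs = tb_ticks / tb_per_sec;
    rem = tb_ticks % tb_per_sec;
    return secs * MODEL_USER_HZ + (rem * MODEL_USER_HZ) / tb_per_sec;
}
```

    For example, with a 512 MHz timebase, three intervals of 2,000,000
    ticks each convert to 0 units if rounded one by one, but their sum of
    6,000,000 ticks converts to 1 unit; keeping the running total in
    timebase units means the rounding happens once, on output.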

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     

10 Feb, 2006

1 commit


13 Jan, 2006

1 commit

  • At present the lppaca - the structure shared with the iSeries
    hypervisor and phyp - is contained within the PACA, our own low-level
    per-cpu structure. This doesn't have to be so; the patch below
    removes it, making a separate array of lppaca structures.

    This saves approximately 500*NR_CPUS bytes of image size and kernel
    memory, because we don't need an alignment gap between the Linux and
    hypervisor portions of every PACA. On the other hand it means an
    extra level of dereference in many accesses to the lppaca.

    The patch also gets rid of several places where we assign the paca
    address to a local variable for no particular reason.

    Signed-off-by: David Gibson
    Signed-off-by: Paul Mackerras

    David Gibson
     

11 Jan, 2006

1 commit

  • The current ppc64 per cpu data implementation is quite slow, e.g.:

    lhz 11,18(13) /* smp_processor_id() */
    ld 9,.LC63-.LCTOC1(30) /* per_cpu__variable_name */
    ld 8,.LC61-.LCTOC1(30) /* __per_cpu_offset */
    sldi 11,11,3 /* form index into __per_cpu_offset */
    mr 10,9
    ldx 9,11,8 /* __per_cpu_offset[smp_processor_id()] */
    ldx 0,10,9 /* load per cpu data */

    5 loads for something that is supposed to be fast, pretty awful. One
    reason for the large number of loads is that we have to synthesize 2
    64bit constants (per_cpu__variable_name and __per_cpu_offset).

    By putting __per_cpu_offset into the paca we can avoid the 2 loads
    associated with it:

    ld 11,56(13) /* paca->data_offset */
    ld 9,.LC59-.LCTOC1(30) /* per_cpu__variable_name */
    ldx 0,9,11 /* load per cpu data */
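
    In C terms, the shorter sequence corresponds roughly to adding a
    per-cpu offset, read from the paca, to the address of the variable.
    The sketch below is a userspace model, not the kernel macros: every
    name in it is hypothetical, and a static 2D array stands in for the
    per-cpu areas.

```c
#include <stddef.h>

#define MODEL_NR_CPUS   4
#define MODEL_AREA_LEN  8

/* Toy paca holding only the field discussed above. */
struct percpu_paca_model {
    size_t data_offset;   /* byte offset of this cpu's per-cpu area */
};

/* Row 0 doubles as the reference copy the variable addresses refer to. */
static long model_percpu_area[MODEL_NR_CPUS][MODEL_AREA_LEN];
static struct percpu_paca_model model_pacas[MODEL_NR_CPUS];

/* Record, per cpu, how far its area lies from the reference copy. */
static void model_setup_per_cpu(void)
{
    int cpu;

    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        model_pacas[cpu].data_offset =
            (size_t)((char *)model_percpu_area[cpu] -
                     (char *)model_percpu_area[0]);
}

/* One load for the offset, then a single indexed access. */
static long *model_per_cpu_ptr(const struct percpu_paca_model *paca, long *var)
{
    return (long *)((char *)var + paca->data_offset);
}
```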

    Longer term we should be able to do even better than 3 loads.
    If per_cpu__variable_name wasn't a 64bit constant and paca->data_offset
    was in a register we could cut it down to one load. A suggestion from
    Rusty is to use gcc's __thread extension here. In order to do this we
    would need to free up r13 (the __thread register and where the paca
    currently is). So far I've had a few unsuccessful attempts at doing that :)

    The patch also allocates per cpu memory node local on NUMA machines.
    This patch from Rusty has been sitting in my queue _forever_ but stalled
    when I hit the compiler bug. Sorry about that.

    Finally I also only allocate per cpu data for possible cpus, which comes
    straight out of the x86-64 port. On a pseries kernel (with NR_CPUS == 128)
    and 4 possible cpus we see some nice gains:

                 total       used       free     shared    buffers     cached
    Mem:       4012228     212860    3799368          0          0     162424

                 total       used       free     shared    buffers     cached
    Mem:       4016200     212984    3803216          0          0     162424

    A saving of 3.75MB. Quite nice for smaller machines. Note: we now have
    to be careful of per cpu users that touch data for !possible cpus.

    At this stage it might be worth making the NUMA and possible cpu
    optimisations generic, but per cpu init is done so early we have to be
    careful that all architectures have their possible map setup correctly.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Paul Mackerras

    Anton Blanchard
     

09 Jan, 2006

3 commits

  • include/asm-ppc/ had #ifdef __KERNEL__ in all header files that
    are not meant for use by user space, include/asm-powerpc does
    not have this yet.

    This patch gets us a lot closer there. There are a few cases
    where I was not sure, so I left them out. I have verified
    that no CONFIG_* symbols are used outside of __KERNEL__
    any more and that there are no obvious compile errors when
    including any of the headers in user space libraries.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Mackerras

    Arnd Bergmann
     
  • This patch removes several unnecessary fields from the paca:

    - next_jiffy_update_tb was simply unused. Remove trivially.

    - The exdsi exception save area was not used. There were plans to use
    it, but they never seem to have gone anywhere. If they ever do, we
    can put it back. Remove from the paca, and from asm-offsets.c

    - The default_decr field was used from asm, but was only ever assigned
    the value of tb_ticks_per_jiffy. Just access tb_ticks_per_jiffy from
    asm directly instead.

    Built and booted on POWER5 LPAR and iSeries RS64.

    Signed-off-by: David Gibson
    Signed-off-by: Paul Mackerras

    David Gibson
     
  • On iSeries, the paca contains, amongst other things, an ItLpRegSave
    structure used by the hypervisor to save registers. The hypervisor
    locates this area through a pointer at the beginning of the paca, so
    the structure itself can be located elsewhere. This patch moves the
    reg_save area out into its own array. This reduces the amount of
    iSeries specific gunk which is visible to general powerpc code via
    paca.h

    Built and booted on POWER5 LPAR and iSeries RS64.

    Signed-off-by: David Gibson
    Signed-off-by: Paul Mackerras

    David Gibson
     

10 Nov, 2005

1 commit

  • This patch moves a bunch of files from arch/ppc64 and
    include/asm-ppc64 which have no equivalents in ppc32 code into
    arch/powerpc and include/asm-powerpc. The files affected are:
    abs_addr.h
    compat.h
    lppaca.h
    paca.h
    tce.h
    cpu_setup_power4.S
    ioctl32.c
    firmware.c
    pacaData.c

    The only changes apart from the move and corresponding Makefile
    changes are:
    - #ifndef/#define in includes updated to _ASM_POWERPC_ form
    - trailing whitespace removed
    - comments giving full paths removed
    - pacaData.c renamed paca.c to remove studlyCaps
    - Misplaced { moved in lppaca.h

    Built and booted on POWER5 LPAR (ARCH=powerpc and ARCH=ppc64), built
    for 32-bit powermac (ARCH=powerpc).

    Signed-off-by: David Gibson
    Signed-off-by: Paul Mackerras

    David Gibson