03 May, 2007

40 commits

  • Add hooks to allow a paravirt implementation to track the lifetime of
    an mm. Paravirtualization requires three hooks, but only two are
    needed in common code. They are:

    arch_dup_mmap, which is called when a new mmap is created at fork

    arch_exit_mmap, which is called when the last process reference to an
    mm is dropped, which typically happens on exit and exec.

    The third hook is activate_mm, which is called from the arch-specific
    activate_mm() macro/function, and so doesn't need stub versions for
    other architectures. It's called when an mm is first used.
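    The hook pair can be pictured with a toy "paravirt implementation" that just counts live mms. Everything below (the stub struct, the counter) is an illustrative stand-in, not the kernel's actual code:

```c
#include <assert.h>

struct mm_struct { int dummy; };

/* toy paravirt implementation: track how many mms are alive */
static int live_mms;

/* called when a new mmap is created at fork */
static void arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
    (void)oldmm; (void)mm;
    live_mms++;
}

/* called when the last process reference to an mm is dropped
 * (typically on exit and exec) */
static void arch_exit_mmap(struct mm_struct *mm)
{
    (void)mm;
    live_mms--;
}
```

    A real implementation would register pagetables with the hypervisor here instead of bumping a counter.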

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: linux-arch@vger.kernel.org
    Cc: James Bottomley
    Acked-by: Ingo Molnar

    Jeremy Fitzhardinge
     
  • Normally when running in PAE mode, the 4th PMD maps the kernel address space,
    which can be shared among all processes (since they all need the same kernel
    mappings).

    Xen, however, does not allow guests to have the kernel pmd shared between page
    tables, so parameterize pgtable.c to allow both modes of operation.

    There are several side-effects of this. One is that vmalloc will update the
    kernel address space mappings, and those updates need to be propagated into
    all processes if the kernel mappings are not intrinsically shared. In the
    non-PAE case, this is done by maintaining a pgd_list of all processes; this
    list is used when all process pagetables must be updated. pgd_list is
    threaded via otherwise unused entries in the page structure for the pgd, which
    means that the pgd must be page-sized for this to work.

    Normally the PAE pgd is only four 64-bit entries (32 bytes), but Xen requires
    the PAE pgd to be page-aligned anyway, so this patch forces the pgd to be
    page-aligned and page-sized when the kernel pmd is unshared, to accommodate
    both these requirements.

    Also, since there may be several distinct kernel pmds (if the user/kernel
    split is below 3G), there's no point in allocating them from a slab cache;
    they're just allocated with get_free_page and initialized appropriately. (Of
    course they could be cached if there is just a single kernel pmd - which is
    the default with a 3G user/kernel split - but it doesn't seem worthwhile to
    add yet another case into this code.)

    [ Many thanks to wli for review comments. ]
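    The pgd_list mechanism described above can be sketched as a toy model: every process pagetable goes on a list, and a kernel-space update is copied into each one. All names and sizes here are illustrative, not the kernel's code:

```c
#include <assert.h>
#include <stddef.h>

#define PTRS_PER_PGD   4
#define KERNEL_PGD_IDX 3            /* 4th entry maps the kernel */

struct toy_pgd {
    unsigned long entry[PTRS_PER_PGD];
    struct toy_pgd *next;           /* threaded list of all pgds */
};

static struct toy_pgd *pgd_list;

/* every new process pagetable is put on pgd_list */
static void pgd_ctor(struct toy_pgd *pgd)
{
    pgd->next = pgd_list;
    pgd_list  = pgd;
}

/* a kernel-space change (e.g. from vmalloc) is copied into every
 * process pagetable when the kernel pmd is not shared */
static void sync_kernel_mapping(unsigned long new_entry)
{
    struct toy_pgd *p;
    for (p = pgd_list; p; p = p->next)
        p->entry[KERNEL_PGD_IDX] = new_entry;
}
```

    In the kernel the list is threaded through otherwise-unused fields of the pgd's struct page, which is why the pgd must be page-sized.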

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: William Lee Irwin III
    Signed-off-by: Andi Kleen
    Cc: Zachary Amsden
    Cc: Christoph Lameter
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton

    Jeremy Fitzhardinge
     
  • Allocate a fixmap slot for use by a paravirt_ops implementation. This
    is intended for early-boot bootstrap mappings. Once the zones and
    allocator have been set up, it would be better to use get_vm_area() to
    allocate some virtual space.

    Xen uses this to map the hypervisor's shared info page, which doesn't
    have a pseudo-physical page number, and therefore can't be mapped
    ordinarily. It is needed early because it contains the vcpu state,
    including the interrupt mask.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Acked-by: Ingo Molnar

    Jeremy Fitzhardinge
     
  • This patch introduces paravirt_ops hooks to control how the kernel's
    initial pagetable is set up.

    In the case of a native boot, the very early bootstrap code creates a
    simple non-PAE pagetable to map the kernel and physical memory. When
    the VM subsystem is initialized, it creates a proper pagetable which
    respects the PAE mode, large pages, etc.

    When booting under a hypervisor, there are many possibilities for what
    paging environment the hypervisor establishes for the guest kernel, so
    the construction of the kernel's pagetable depends on the hypervisor.

    In the case of Xen, the hypervisor boots the kernel with a fully
    constructed pagetable, which is already using PAE if necessary. Also,
    Xen requires particular care when constructing pagetables to make sure
    all pagetables are always mapped read-only.

    In order to make this easier, the kernel's initial pagetable construction
    has been changed to only allocate and initialize a pagetable page if
    there's no page already present in the pagetable. This allows the Xen
    paravirt backend to make a copy of the hypervisor-provided pagetable,
    allowing the kernel to establish any more mappings it needs while
    keeping the existing ones.

    A slightly subtle point which is worth highlighting here is that Xen
    requires all kernel mappings to share the same pte_t pages between all
    pagetables, so that updating a kernel page's mapping in one pagetable
    is reflected in all other pagetables. This makes it possible to
    allocate a page and attach it to a pagetable without having to
    explicitly enumerate that page's mapping in all pagetables.
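    The "only allocate if nothing is present" rule can be sketched with a toy two-level table; the structures and the function name are illustrative, and calloc stands in for the boot-time page allocator:

```c
#include <assert.h>
#include <stdlib.h>

#define ENTRIES 4

typedef struct { unsigned long pte[ENTRIES]; } pte_page;
typedef struct { pte_page *pmd[ENTRIES]; } toy_pgd;

/* allocate a pte page for slot i only if nothing is there yet,
 * so a pre-populated (e.g. hypervisor-provided) page survives */
static pte_page *one_page_table_init(toy_pgd *pgd, int i)
{
    if (!pgd->pmd[i])
        pgd->pmd[i] = calloc(1, sizeof(pte_page));
    return pgd->pmd[i];
}
```

    With this shape, the Xen backend can seed the table from the hypervisor-provided pagetable and the generic code will only fill the gaps.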

    And:

    From: "Eric W. Biederman"

    If we don't set the leaf page table entries it is quite possible that
    we will inherit an incorrect page table entry from the initial boot
    page table setup in head.S. So we need to redo the effort here so
    we pick up PSE, PGE and the like.

    Hypervisors like Xen require that their page tables be read-only,
    which is slightly incompatible with our low identity mappings; however,
    I discussed this with Jeremy and he has modified the Xen early set_pte
    function to avoid problems in this area.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Acked-by: William Irwin
    Cc: Ingo Molnar

    Jeremy Fitzhardinge
     
    Add a set of accessors to pack, unpack and modify page table entries
    (at all levels). This allows a paravirt implementation to control the
    contents of pgd/pmd/pte entries. For example, Xen uses this to
    convert the (pseudo-)physical address into a machine address when
    populating a pagetable entry, and to convert back to a pseudo-physical
    address when an entry is read.
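    A minimal sketch of such an interposed pack/unpack pair, with a made-up four-entry pfn-to-mfn translation table standing in for Xen's real p2m machinery:

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define FLAGS_MASK 0xfffUL

/* made-up pseudo-physical -> machine frame translation */
static const unsigned long p2m[4] = { 7, 3, 9, 1 };

/* pack: translate pfn to mfn while building the entry */
static unsigned long make_pte(unsigned long pfn, unsigned long flags)
{
    return (p2m[pfn] << PAGE_SHIFT) | (flags & FLAGS_MASK);
}

/* unpack: translate the mfn back to a pseudo-physical frame */
static unsigned long pte_pfn(unsigned long pte)
{
    unsigned long mfn = pte >> PAGE_SHIFT;
    unsigned long pfn;
    for (pfn = 0; pfn < 4; pfn++)
        if (p2m[pfn] == mfn)
            return pfn;
    return (unsigned long)-1;
}
```

    Native hardware would use the identity translation; only the paravirt backend needs the indirection.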

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Acked-by: Ingo Molnar

    Jeremy Fitzhardinge
     
    Add a _paravirt_nop function for use as a stub for no-op operations,
    and a paravirt_nop #define for a void * version to make using it
    easier (since all its uses are as a void *).

    This is useful to allow the patcher to automatically identify noop
    operations so it can simply nop out the callsite.
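    The shape of the stub and its alias, sketched in plain C (the is_noop helper is an illustration of how a patcher recognises no-op sites, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

static void _paravirt_nop(void) { }

/* void * alias, since all uses are as a void * (and, as Ingo notes
 * below, this loses type checking) */
#define paravirt_nop ((void *)_paravirt_nop)

/* the patcher can recognise no-op operations by pointer identity
 * and simply nop out the callsite */
static int is_noop(void *op)
{
    return op == paravirt_nop;
}
```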

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Acked-by: Ingo Molnar
    [mingo] but only as a cleanup of the current open-coded (void *) casts.
    My problem with this is that it loses the types. Not that there is much
    to check for, but still, this adds some assumptions about what function
    calls look like.

    Jeremy Fitzhardinge
     
  • Remove CONFIG_DEBUG_PARAVIRT. When inlining code, this option
    attempts to trash registers in the patch-site's "clobber" field, on
    the grounds that this should find bugs with incorrect clobbers.
    Unfortunately, the clobber field really means "registers modified by
    this patch site", which includes return values.

    Because of this, this option has outlived its usefulness, so remove
    it.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Rusty Russell

    Jeremy Fitzhardinge
     
  • Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Chris Wright
    Cc: Zachary Amsden
    Cc: Rusty Russell

    Jeremy Fitzhardinge
     
  • On Thu, 2007-03-29 at 13:16 +0200, Andi Kleen wrote:
    > Please clean it up properly with two structs.

    Not sure about this, now I've done it. Running it here.

    If you like it, I can do x86-64 as well.

    ==
    lguest defines its own TSS struct because the "struct tss_struct"
    contains linux-specific additions. Andi asked me to split the struct
    in processor.h.

    Unfortunately it makes usage a little awkward.
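    The split can be pictured like this; the field names are abbreviated placeholders, not the real layout:

```c
#include <assert.h>
#include <stddef.h>

/* hardware-defined part, usable by lguest as well
 * (architectural fields abbreviated) */
struct i386_hw_tss {
    unsigned long  sp0;
    unsigned short ss0, __ss0h;
    /* ... remaining architectural fields ... */
};

/* kernel's full TSS: the hardware part plus Linux-specific additions */
struct tss_struct {
    struct i386_hw_tss x86_tss;
    unsigned long io_bitmap[4];     /* linux-specific */
};
```

    The awkwardness is that every access gains a level of naming: tss->x86_tss.sp0 instead of tss->sp0.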

    Signed-off-by: Rusty Russell
    Signed-off-by: Andi Kleen

    Rusty Russell
     
    I have a 4-socket AMD Opteron system. The 2.6.18 kernel I have crashes
    when there is no memory in node0.

    AK: changed call to _nopanic

    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Andi Kleen

    James Puthukattukaran
     
    Setting the DEBUG_SIG flag breaks compilation due to a wrong
    struct access. Additionally, it raises two warnings. This is one
    patch to fix them all.

    Signed-off-by: Glauber de Oliveira Costa
    Signed-off-by: Andi Kleen

    Glauber de Oliveira Costa
     
  • Add "noreplace-smp" to disable SMP instruction replacement.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen

    Jeremy Fitzhardinge
     
  • The .smp_altinstructions section and its corresponding symbols are
    completely unused, so remove them.

    Also, remove a stray #ifdef __KERNEL__ in asm-i386/alternative.h.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen

    Jeremy Fitzhardinge
     
  • This patch is based on Rusty's recent cleanup of the EFLAGS-related
    macros; it extends the same kind of cleanup to control registers and
    MSRs.

    It also unifies these between i386 and x86-64; at least with regards
    to MSRs, the two had definitely gotten out of sync.

    Signed-off-by: H. Peter Anvin
    Signed-off-by: Andi Kleen

    H. Peter Anvin
     
  • Let's allow page-alignment in general for per-cpu data (wanted by Xen, and
    Ingo suggested KVM as well).

    Because larger alignments can use more room, we increase the max per-cpu
    memory to 64k rather than 32k: it's getting a little tight.

    Signed-off-by: Rusty Russell
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Acked-by: Ingo Molnar
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton

    Jeremy Fitzhardinge
     
    As a bug workaround, bank 0 on K7s is normally disabled, but there's
    no need to do that on other AMD CPUs.

    Cc: davej@redhat.com

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Update documentation for i386 smp_call_function* functions.

    As reported by Randy Dunlap

    [ I've posted this before but it seems to have been lost along the way. ]

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Randy Dunlap

    Jeremy Fitzhardinge
     
  • (I hope Andi is the right one to Cc, otherwise please add, thanks!)

    Use menuconfigs instead of menus, so the whole menu can be disabled at
    once instead of going through all options.

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Andi Kleen

    Jan Engelhardt
     
    It doesn't put the CPU into deeper sleep states, so it's better to use the
    standard idle loop to save power. But allow re-enabling it anyway for
    benchmarking.

    I also removed the obsolete idle=halt on i386.

    Cc: andreas.herrmann@amd.com

    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Most of asm-x86_64/bugs.h is code which should be in a C file, so put it there.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen
    Cc: Linus Torvalds

    Jeremy Fitzhardinge
     
  • Now that relocation of the VDSO for COMPAT_VDSO users is done at
    runtime rather than compile time, it is possible to enable/disable
    compat mode at runtime.

    This patch allows you to enable COMPAT_VDSO mode with "vdso=2" on the
    kernel command line, or via sysctl. (Switching on a running system
    shouldn't be done lightly; any process which was relying on the compat
    VDSO will be upset if it goes away.)

    The COMPAT_VDSO config option still exists, but if enabled it just
    makes vdso_enabled default to VDSO_COMPAT.

    From: Hugh Dickins

    Fix oops from i386-make-compat_vdso-runtime-selectable.patch.

    Even mingetty at system startup finds it easy to trigger an oops
    while reading /proc/PID/maps: though it has a good hold on the mm
    itself, that cannot stop exit_mm() from resetting tsk->mm to NULL.

    (It is usually show_map()'s call to get_gate_vma() which oopses,
    and I expect we could change that to check priv->tail_vma instead;
    but no matter, even m_start()'s call just after get_task_mm() is racy.)

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Zachary Amsden
    Cc: "Jan Beulich"
    Cc: Eric W. Biederman
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Roland McGrath

    Jeremy Fitzhardinge
     
  • Some versions of libc can't deal with a VDSO which doesn't have its
    ELF headers matching its mapped address. COMPAT_VDSO maps the VDSO at
    a specific system-wide fixed address. Previously this was all done at
    build time, on the grounds that the fixed VDSO address is always at
    the top of the address space. However, a hypervisor may reserve some
    of that address space, pushing the fixmap address down.

    This patch does the adjustment dynamically at runtime, depending on
    the runtime location of the VDSO fixmap.

    [ Patch has been through several hands: Jan Beulich wrote the original
    version; Zach reworked it, and Jeremy converted it to relocate phdrs
    as well as sections. ]
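    The core of the runtime adjustment is just shifting addresses by the delta between the link-time location and the fixmap address chosen at boot. A minimal sketch with a simplified program-header struct (the real code walks actual ELF phdrs and section headers):

```c
#include <assert.h>

struct toy_phdr { unsigned long p_vaddr; };

/* shift every program header by the delta between the link-time
 * address and the fixmap address actually chosen at boot */
static void relocate_vdso(struct toy_phdr *phdr, int n, long delta)
{
    int i;
    for (i = 0; i < n; i++)
        phdr[i].p_vaddr += delta;
}
```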

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Zachary Amsden
    Cc: "Jan Beulich"
    Cc: Eric W. Biederman
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Roland McGrath

    Jeremy Fitzhardinge
     
  • identify_cpu() is used to identify both the boot CPU and secondary
    CPUs, but it performs some actions which only apply to the boot CPU.
    Those functions are therefore really __init functions, but because
    they're called by identify_cpu(), they must be marked __cpuinit.

    This patch splits identify_cpu() into identify_boot_cpu() and
    identify_secondary_cpu(), and calls the appropriate init functions
    from each. Also, identify_boot_cpu() and all the functions it
    dominates are marked __init.
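    The structure of the split, sketched with placeholder bodies (a counter stands in for the boot-only __init work):

```c
#include <assert.h>

static int boot_only_work;   /* stands in for the __init-only actions */

/* common identification work shared by all CPUs */
static void identify_cpu(int cpu) { (void)cpu; }

static void identify_boot_cpu(void)
{
    identify_cpu(0);
    boot_only_work++;        /* done exactly once, at boot */
}

static void identify_secondary_cpu(int cpu)
{
    identify_cpu(cpu);       /* boot-only actions are not repeated */
}
```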

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen

    Jeremy Fitzhardinge
     
  • Most of asm-i386/bugs.h is code which should be in a C file, so put it there.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Linus Torvalds

    Jeremy Fitzhardinge
     
    Ugly ifdef, but it should handle all 64-bit platforms that have suitable
    zones. On some, like Altix, it's probably impossible to get such memory
    without using an IOMMU.

    Andi Kleen
     
  • The xmm space on x86_64 is 256 bytes.

    Signed-off-by: Avi Kivity
    Signed-off-by: Andi Kleen

    Avi Kivity
     
  • As per i386 patch: move X86_EFLAGS_IF et al out to a new header:
    processor-flags.h, so we can include it from irqflags.h and use it in
    raw_irqs_disabled_flags().

    As a side-effect, we can now use these flags in .S files.

    Signed-off-by: Rusty Russell
    Signed-off-by: Andi Kleen

    Andi Kleen
     
    Under CONFIG_DISCONTIGMEM, assuming that !pfn_valid() implies that all
    subsequent pfns are also invalid is wrong. Thus replace this by
    explicitly checking against the E820 map.

    AK: make e820 on x86-64 not initdata
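    The idea of consulting the map rather than assuming contiguity can be sketched with an illustrative two-range "E820" table; the entries and helper name are made up:

```c
#include <assert.h>

struct e820_range { unsigned long start_pfn, end_pfn; };

/* illustrative map: low memory, a hole, then RAM again */
static const struct e820_range e820[] = {
    { 0x000, 0x09f },
    { 0x100, 0x800 },
};

/* check the map explicitly instead of assuming that one invalid
 * pfn means everything above it is invalid too */
static int e820_pfn_is_ram(unsigned long pfn)
{
    unsigned i;
    for (i = 0; i < sizeof(e820) / sizeof(e820[0]); i++)
        if (pfn >= e820[i].start_pfn && pfn < e820[i].end_pfn)
            return 1;
    return 0;
}
```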

    Signed-off-by: Jan Beulich
    Signed-off-by: Andi Kleen
    Acked-by: Mark Langsdorf

    Jan Beulich
     
  • Rather than using a single constant PERCPU_ENOUGH_ROOM, compute it as
    the sum of kernel_percpu + PERCPU_MODULE_RESERVE. This is now common
    to all architectures; if an architecture wants to set
    PERCPU_ENOUGH_ROOM to something special, then it may do so (ia64 is
    the only one which does).
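    The computation itself is a one-liner; the reserve value below is illustrative, not the kernel's constant:

```c
#include <assert.h>

#define PERCPU_MODULE_RESERVE 8192   /* illustrative value */

/* reserved space = kernel's own per-cpu data size + module reserve,
 * replacing a single hard-coded PERCPU_ENOUGH_ROOM constant */
static unsigned long percpu_enough_room(unsigned long kernel_percpu)
{
    return kernel_percpu + PERCPU_MODULE_RESERVE;
}
```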

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Rusty Russell
    Cc: Eric W. Biederman
    Cc: Andi Kleen

    Jeremy Fitzhardinge
     
  • All were already in some header
    Signed-off-by: Andi Kleen

    Andi Kleen
     
  • Signed-off-by: Andi Kleen

    Andi Kleen
     
    machine_ops is an interface for the machine_* functions. This is
    intended to allow hypervisors to intercept the reboot process, but it
    could be used to implement other x86 subarchitecture reboots.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen

    Jeremy Fitzhardinge
     
    Add a smp_ops interface. This abstracts the SMP API for use within
    arch/i386. The primary intent is that it be used by a
    paravirtualizing hypervisor to implement SMP, but it could also be
    used by non-APIC-using sub-architectures.

    This is related to CONFIG_PARAVIRT, but is implemented unconditionally
    since it is simpler that way and not a highly performance-sensitive
    interface.
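    Such an ops table is just a struct of function pointers the platform fills in. The member names below echo typical SMP entry points but are a sketch, with a toy backend wired in:

```c
#include <assert.h>
#include <stddef.h>

/* the table a platform (APIC code, a hypervisor backend, ...) fills in */
struct smp_ops {
    void (*smp_prepare_cpus)(unsigned int max_cpus);
    int  (*cpu_up)(unsigned int cpu);
    void (*smp_send_reschedule)(int cpu);
};

static int booted[4];

/* toy backend: record which CPUs were brought up */
static int toy_cpu_up(unsigned int cpu) { booted[cpu] = 1; return 0; }

static struct smp_ops smp_ops = { .cpu_up = toy_cpu_up };
```

    Since the interface is not performance-sensitive, plain indirect calls (rather than patched callsites) are enough.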

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: James Bottomley

    Jeremy Fitzhardinge
     
  • Now we have an explicit per-cpu GDT variable, we don't need to keep the
    descriptors around to use them to find the GDT: expose cpu_gdt directly.

    We could go further and make load_gdt() pack the descriptor for us, or even
    assume it means "load the current cpu's GDT" which is what it always does.

    Signed-off-by: Rusty Russell
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton

    Rusty Russell
     
  • Fix "Section mismatch" warnings in arch/x86_64/kernel/time.c

    Signed-off-by: Bernhard Walle
    Signed-off-by: Andi Kleen

    Bernhard Walle
     
  • commit 5e518d7672dea4cd7c60871e40d0490c52f01d13 did the same change to
    i386's variant.

    With this change, i386's and x86-64's versions are identical, raising
    the question whether the x86-64 one should go (just like there's only
    one instance of edd.S).

    Signed-off-by: Jan Beulich
    Signed-off-by: Andi Kleen

    Jan Beulich
     
    This patch touches the NMI watchdog every MAX_ORDER_NR_PAGES pages
    to inhibit the machine from triggering an NMI while the CPUs
    are locked. This situation happens on boxes with more
    than 64 CPUs and 128GB of RAM when Alt-SysRq-m is performed.

    It has been successfully tested for regression on uni, 2, 4, 8,
    32, and 64 CPU boxes with various memory configurations.
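    The shape of the fix, sketched in plain C: inside the long pfn walk done with the lock held, poke the watchdog once per MAX_ORDER block. touch_nmi_watchdog() is modeled as a counter here and the block size is illustrative:

```c
#include <assert.h>

#define MAX_ORDER_NR_PAGES 1024      /* illustrative block size */

static int watchdog_touches;
static void touch_nmi_watchdog(void) { watchdog_touches++; }

/* long pfn walk: poke the watchdog once per MAX_ORDER block so the
 * NMI timeout never fires, however slow the box is */
static void show_mem_walk(unsigned long nr_pages)
{
    unsigned long pfn;
    for (pfn = 0; pfn < nr_pages; pfn++) {
        if (pfn % MAX_ORDER_NR_PAGES == 0)
            touch_nmi_watchdog();
        /* ... examine the page ... */
    }
}
```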

    Signed-off-by: Andi Kleen

    Konrad Rzeszutek
     
    The current vsyscall_gtod_data is large (3 or 4 cache lines dirtied at
    timer interrupt). We can shrink it to exactly 64 bytes (one cache line
    on AMD64).

    Instead of copying a whole struct clocksource, we copy only the needed
    fields.

    I deleted an unused field: offset_base.

    This patch also fixes one oddity in vgettimeofday(): it can return a
    timeval with tv_usec = 1000000. Maybe not a bug, but why not do the
    right thing?
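    The oddity being fixed is a missing carry. A minimal normalization sketch (plain C, no vsyscall specifics):

```c
#include <assert.h>

struct toy_timeval { long tv_sec; long tv_usec; };

/* carry overflow microseconds into the seconds field so callers
 * never see tv_usec == 1000000 */
static void normalize(struct toy_timeval *tv)
{
    while (tv->tv_usec >= 1000000) {
        tv->tv_usec -= 1000000;
        tv->tv_sec++;
    }
}
```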

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andi Kleen

    Eric Dumazet
     
    There is a tiny probability that the return value from vtime(time_t *t)
    is different from the value stored in *t.

    Using a temporary variable solves the problem and gives faster code.

        17: 48 85 ff              test   %rdi,%rdi
        1a: 48 8b 05 00 00 00 00  mov    0(%rip),%rax  # __vsyscall_gtod_data.wall_time_tv.tv_sec
        21: 74 03                 je     26
        23: 48 89 07              mov    %rax,(%rdi)
        26: c9                    leaveq
        27: c3                    retq

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andi Kleen

    Eric Dumazet
     
    Many years ago, UNEXPECTED_IO_APIC() contained printk()s (but nothing
    more).

    Now that it has been completely empty for years, we can remove it.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andi Kleen

    Adrian Bunk