05 Sep, 2005

40 commits

  • Add a clone operation for pgd updates.

    This helps complete the encapsulation of updates to page tables (or pages
    about to become page tables) into accessor functions rather than using
    memcpy() to duplicate them. This is both generally good for consistency
    and also necessary for running in a hypervisor which requires explicit
    updates to page table entries.

    The new function is:

    clone_pgd_range(pgd_t *dst, pgd_t *src, int count);

    dst - pointer to pgd range anywhere on a pgd page
    src - ""
    count - the number of pgds to copy.

    dst and src can be on the same page, but the range must not overlap
    and must not cross a page boundary.
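
    A minimal userspace sketch of the accessor described above, assuming a
    native (non-hypervisor) build where a plain memcpy() is sufficient. The
    pgd_t typedef here is a hypothetical stand-in for the arch-specific
    kernel type, and the overlap check is added only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's arch-specific pgd_t. */
typedef struct { unsigned long pgd; } pgd_t;

/*
 * Copy count pgd entries from src to dst. A native build can use a plain
 * memcpy(); a hypervisor port would replace the body with explicit
 * per-entry updates.
 */
static void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    size_t bytes;

    assert(count >= 0);
    bytes = (size_t)count * sizeof(pgd_t);

    /* The two ranges may share a page but must not overlap. */
    assert(d + bytes <= s || s + bytes <= d);

    memcpy(dst, src, bytes);
}
```

    Callers that previously used memcpy() on a pgd page would call
    clone_pgd_range() with the same pointers and an entry count instead of a
    byte count.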

    Note that I omitted using this call to copy pgd entries into the
    software suspend page root, since this is not technically a live paging
    structure, rather it is used on resume from suspend. CC'ing Pavel in case
    he has any feedback on this.

    Thanks to Chris Wright for noticing that this could be more optimal in
    PAE compiles by eliminating the memset.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • This patch adds a notify to the die_nmi notify that the system is about to
    be taken down. If the notify is handled with a NOTIFY_STOP return, the
    system is given a new lease on life.

    We also change the nmi watchdog to carry on if die_nmi returns.
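
    The control flow can be sketched in userspace as follows. This is an
    illustration only: the sketch_ names, the two-value enum and the handler
    signature are hypothetical and do not mirror the kernel's notifier API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; the kernel's notifier values differ. */
enum sketch_notify { SKETCH_NOTIFY_DONE, SKETCH_NOTIFY_STOP };

typedef enum sketch_notify (*sketch_handler_t)(int cpu);

/* Example handlers: one ignores the event, one (a debugger) claims it. */
static enum sketch_notify sketch_passive_handler(int cpu)
{
    (void)cpu;
    return SKETCH_NOTIFY_DONE;
}

static enum sketch_notify sketch_debugger_handler(int cpu)
{
    (void)cpu;
    return SKETCH_NOTIFY_STOP;
}

/*
 * Returns 1 if the system would be taken down, 0 if a handler claimed the
 * event and the watchdog should carry on.
 */
static int sketch_die_nmi(const sketch_handler_t *chain, size_t n, int cpu)
{
    size_t i;

    assert(chain != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (chain[i](cpu) == SKETCH_NOTIFY_STOP)
            return 0;   /* new lease on life */
    }
    return 1;           /* nobody handled it: proceed with the panic path */
}
```

    With only the passive handler registered the sketch reports a takedown;
    adding the debugger handler to the chain lets the watchdog continue.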

    This gives debug code a chance to a) catch watchdog timeouts and b) possibly
    allow the system to continue, realizing that the time out may be due to
    debugger activities such as single stepping which is usually done with
    "other" cpus held.

    Signed-off-by: George Anzinger
    Cc: Keith Owens
    Signed-off-by: George Anzinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Anzinger
     
  • Introduce a write accessor for updating the current LDT. This is required
    for hypervisors like Xen that do not allow LDT pages to be directly
    written.

    Testing - here's a fun little LDT test that can be trivially modified to
    test limits as well.

    /*
    * Copyright (c) 2005, Zachary Amsden (zach@vmware.com)
    * This is licensed under the GPL.
    */

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #define __KERNEL__
    #include

    int main(void)
    {
        struct user_desc desc;
        char *code;
        unsigned long long tsc;

        code = (char *)mmap(0, 8192, PROT_EXEC|PROT_READ|PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        desc.entry_number = 0;
        desc.base_addr = (unsigned long)code;
        desc.limit = 1;
        desc.seg_32bit = 1;
        desc.contents = MODIFY_LDT_CONTENTS_CODE;
        desc.read_exec_only = 0;
        desc.limit_in_pages = 1;
        desc.seg_not_present = 0;
        desc.useable = 1;
        if (modify_ldt(1, &desc, sizeof(desc)) != 0) {
            perror("modify_ldt");
        }
        printf("code base is 0x%08x\n", (unsigned)code);
        code[0x0ffe] = 0x0f; /* rdtsc */
        code[0x0fff] = 0x31;
        code[0x1000] = 0xcb; /* lret */
        __asm__ __volatile("lcall $7,$0xffe" : "=A" (tsc));
        printf("TSC is 0x%016llx\n", tsc);
        return 0;
    }

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • When reviewing GDT updates, I found the code:

    set_tss_desc(cpu,t); /* This just modifies memory; ... */
    per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS].b &= 0xfffffdff;

    This second line is unnecessary, since set_tss_desc() has already cleared
    the busy bit.

    Commented disassembly, line 1:

    c028b8bd: 8b 0c 86 mov (%esi,%eax,4),%ecx
    c028b8c0: 01 cb add %ecx,%ebx
    c028b8c2: 8d 0c 39 lea (%ecx,%edi,1),%ecx

    => %ecx = per_cpu(cpu_gdt_table, cpu)

    c028b8c5: 8d 91 80 00 00 00 lea 0x80(%ecx),%edx

    => %edx = &per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS]

    c028b8cb: 66 c7 42 00 73 20 movw $0x2073,0x0(%edx)
    c028b8d1: 66 89 5a 02 mov %bx,0x2(%edx)
    c028b8d5: c1 cb 10 ror $0x10,%ebx
    c028b8d8: 88 5a 04 mov %bl,0x4(%edx)
    c028b8db: c6 42 05 89 movb $0x89,0x5(%edx)

    => ((char *)%edx)[5] = 0x89
    (equivalent) ((char *)per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS])[5] = 0x89

    c028b8df: c6 42 06 00 movb $0x0,0x6(%edx)
    c028b8e3: 88 7a 07 mov %bh,0x7(%edx)
    c028b8e6: c1 cb 10 ror $0x10,%ebx

    => other bits

    Commented disassembly, line 2:

    c028b8e9: 8b 14 86 mov (%esi,%eax,4),%edx
    c028b8ec: 8d 04 3a lea (%edx,%edi,1),%eax

    => %eax = per_cpu(cpu_gdt_table, cpu)

    c028b8ef: 81 a0 84 00 00 00 ff andl $0xfffffdff,0x84(%eax)

    => per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS].b &= 0xfffffdff;
    (equivalent) ((char *)per_cpu(cpu_gdt_table, cpu)[GDT_ENTRY_TSS])[5] &= 0xfd

    Note that (0x89 & ~0xfd) == 0; i.e., set_tss_desc(cpu,t) has already stored
    the type field in the GDT with the busy bit clear.
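
    The bit arithmetic above can be checked with a small standalone sketch.
    It assumes, as the disassembly shows, that .b is the high 32-bit word of
    the descriptor, so byte 5 of the descriptor is bits 8-15 of .b. The
    function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Byte 5 of an 8-byte descriptor is bits 8-15 of its high dword (.b). */
static uint8_t access_byte(uint32_t b)
{
    return (uint8_t)(b >> 8);
}

/* The mask from the removed line: clears bit 9 of .b, the TSS busy bit. */
static uint32_t clear_busy(uint32_t b)
{
    return b & 0xfffffdffu;
}

/*
 * Check that masking is a no-op on a descriptor whose access byte is 0x89
 * (present, DPL 0, available 32-bit TSS), as stored by set_tss_desc().
 */
static void sketch_check_redundant(uint32_t b)
{
    assert(access_byte(b) == 0x89);
    assert(clear_busy(b) == b);
}
```

    By contrast, a busy TSS has access byte 0x8b, and clear_busy() turns
    0x00008b00 into 0x00008900.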

    Eliminating redundant and obscure code is always a good thing; in fact, I
    pointed out this same optimization many moons ago in arch/i386/setup.c,
    back when it used to be called that.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • The pushf/popf in switch_to are ONLY used to switch IOPL. Making this
    explicit in C code is clearer. This pushf/popf pair was added as a
    bugfix for leaking IOPL to unprivileged processes when using
    sysenter/sysexit based system calls (sysexit does not restore flags).

    When requesting an IOPL change in sys_iopl(), it is just as easy to change
    the current flags and the flags in the stack image (in case an IRET is
    required), but there is no reason to force an IRET if we came in from the
    SYSENTER path.

    This change is the minimal solution for supporting a paravirtualized Linux
    kernel that allows user processes to run with I/O privilege. Other
    solutions require radical rewrites of part of the low level fault / system
    call handling code, or do not fully support sysenter based system calls.

    Unfortunately, this added one field to the thread_struct. But as a bonus,
    on P4, the fastest time measured for switch_to() went from 312 to 260
    cycles, a win of about 17% in the fast case through this performance
    critical path.
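
    A rough sketch of the idea in plain C, operating on flag values rather
    than the real registers. IOPL occupies EFLAGS bits 12-13 (mask 0x3000);
    the struct and function names are hypothetical, and the compare before
    switching is an assumption about how the explicit C version avoids
    touching flags in the common case.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EFLAGS_IOPL_MASK 0x00003000u   /* EFLAGS bits 12-13 */

struct sketch_thread { uint32_t iopl; };      /* the one added field */
struct sketch_regs   { uint32_t eflags; };    /* saved stack image */

/* Replace only the IOPL bits of a flags value. */
static uint32_t with_iopl(uint32_t eflags, uint32_t iopl_bits)
{
    return (eflags & ~SKETCH_EFLAGS_IOPL_MASK) | iopl_bits;
}

/* What an iopl request records: the thread field and the stack image. */
static void sketch_set_iopl(struct sketch_thread *t, struct sketch_regs *regs,
                            unsigned level)
{
    assert(level <= 3);
    t->iopl = (uint32_t)level << 12;
    regs->eflags = with_iopl(regs->eflags, t->iopl);
}

/* On a context switch, rewrite the live flags only if the IOPL differs. */
static uint32_t sketch_switch_iopl(uint32_t live_eflags,
                                   const struct sketch_thread *prev,
                                   const struct sketch_thread *next)
{
    if (prev->iopl != next->iopl)
        return with_iopl(live_eflags, next->iopl);
    return live_eflags;
}
```

    For the cycle counts quoted above, (312 - 260) / 312 is roughly 0.167,
    which is where the figure of about 17% comes from.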

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Privilege checking cleanup. Originally, these diffs were much greater, but
    recent cleanups in Linux have already done much of the cleanup. I added
    some explanatory comments in places where the reasoning behind certain
    tests is rather subtle.

    Also, in traps.c, we can skip the user_mode check in handle_BUG(). The
    reason is, there are only two call chains - one via die_if_kernel() and one
    via do_page_fault(), both entering from die(). Both of these paths already
    ensure that a kernel mode failure has happened. Also, the original check
    here, if (user_mode(regs)) was insufficient anyways, since it would not
    rule out BUG faults from V8086 mode execution.

    Saving the %ss segment in show_regs() rather than assuming a fixed value
    also gives better information about the current kernel state in the
    register dump.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Some more assembler cleanups I noticed along the way.

    Signed-off-by: Zachary Amsden
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Noticed by Chuck Ebbert: the .ldt entry of the TSS was set up incorrectly.
    It never mattered since this was a leftover from old times, so remove it.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Also, setting PDPEs in PAE mode does not require atomic operations, since the
    PDPEs are cached by the processor, and only reloaded on an explicit or
    implicit reload of CR3.

    Since the four PDPEs must always be present in an active root, and the kernel
    PDPE is never updated, we are safe even from SMIs and interrupts / NMIs using
    task gates (which reload CR3). Actually, much of this is moot, since the user
    PDPEs are never updated either, and the only usage of task gates is by the
    doublefault handler. It appears the only place PGDs get updated in PAE mode
    is in init_low_mappings() / zap_low_mapping() for initial page table creation
    and recovery from ACPI sleep state, and these sites are safe by inspection.
    Getting rid of the cmpxchg8b saves code space and 720 cycles in pgd_alloc on
    P4.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Subtle fix: load_TLS has been moved after saving %fs and %gs segments to avoid
    creating non-reversible segments. This could conceivably cause a bug if the
    kernel ever needed to save and restore fs/gs from the NMI handler. It
    currently does not, but this is the safest approach to avoiding fs/gs
    corruption. SMIs are safe, since SMI saves the descriptor hidden state.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • GCC can generate better code around descriptor update and access functions
    when there is not an explicit "eax" register constraint.

    Testing: You won't boot if this is messed up, since the TSS descriptor will be
    corrupted. Verified the assembler and booted.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • i386 inline assembler cleanup.

    This change encapsulates descriptor and task register management. Also,
    it is possible to improve assembler generation in two cases; savesegment
    may store the value in a register instead of a memory location, which
    allows GCC to optimize stack variables into registers, and MOV MEM, SEG
    is always a 16-bit write to memory, making the casting in math-emu
    unnecessary.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • i386 arch cleanup. Introduce the serialize macro to serialize processor
    state. Why the microcode update needs it I am not quite sure, since wrmsr()
    is already a serializing instruction, but it is a microcode update, so I will
    keep the semantic the same, since this could be a timing workaround. As far
    as I can tell, this has always been there since the original microcode update
    source.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • i386 Inline asm cleanup. Use cr/dr accessor functions.

    Also, a potential bugfix: some CR accessors really should be volatile.
    Reads from CR0 (numeric state may change in an exception handler), writes to
    CR4 (flipping CR4.TSD) and reads from CR2 (page fault) prevent instruction
    re-ordering. I did not add memory clobber to CR3 / CR4 / CR0 updates, as it
    was not there to begin with, and in no case should kernel memory be clobbered,
    except when doing a TLB flush, which already has memory clobber.

    I noticed that page invalidation does not have a memory clobber. I can't find
    a bug as a result, but there is definitely a potential for a bug here:

    #define __flush_tlb_single(addr) \
    __asm__ __volatile__("invlpg %0": :"m" (*(char *) addr))

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • This makes the vDSO use nops for all its padding around instructions,
    rather than sometimes zeros, and nop-pads the end of the area containing
    instructions to a 32-byte cache line, to keep text and data in separate
    lines.
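
    The padding rule can be illustrated with a small helper that fills the
    gap up to the next 32-byte boundary with the one-byte x86 NOP (0x90).
    The actual vDSO does this in its assembler sources and linker script;
    the function below is a hypothetical userspace sketch of the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CACHE_LINE 32u   /* line size named in the commit message */
#define SKETCH_X86_NOP    0x90  /* one-byte x86 NOP opcode */

/*
 * Pad buf, which holds used bytes of instructions, with NOPs up to the
 * next cache-line boundary. Returns the padded length.
 */
static size_t sketch_nop_pad(unsigned char *buf, size_t used, size_t cap)
{
    size_t end = (used + SKETCH_CACHE_LINE - 1) & ~(size_t)(SKETCH_CACHE_LINE - 1);

    assert(end <= cap);
    memset(buf + used, SKETCH_X86_NOP, end - used);
    return end;
}
```

    Data placed at the returned offset then starts on its own cache line.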

    Signed-off-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This is a subarch update for ES7000. I've modified platform check code and
    removed unnecessary OEM table parsing for newer systems that don't use OEM
    information during boot. Parsing the table in fact is causing problems,
    and the platform doesn't get recognized. The patch only affects the ES7000
    subarch.

    Signed-off-by:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Natalie.Protasevich@unisys.com
     
  • The VIA VT8237's IOAPIC sends 'APIC De-Assert Messages' by default, causing
    another CPU interrupt when the IRQ pin is de-asserted. This feature is
    switched off by the patch to get rid of doubled ioapic level interrupt
    rates.

    Signed-off-by: Karsten Wiese
    Tested-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Karsten Wiese
     
  • Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Venkatesh Pallipadi
     
  • i386 generic subarchitecture requires explicit dmi strings or command line
    to enable bigsmp mode. The patch below removes that restriction, and uses
    bigsmp as soon as it finds more than 8 logical CPUs, Intel processors and
    xAPIC support.
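
    The new selection rule amounts to a three-way conjunction. A sketch,
    with a hypothetical struct standing in for whatever the kernel actually
    probes at boot:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of boot-time detection results. */
struct sketch_cpu_info {
    unsigned logical_cpus;
    int vendor_is_intel;
    int has_xapic;
};

/* Non-zero when bigsmp mode would be chosen without DMI or command line. */
static int sketch_use_bigsmp(const struct sketch_cpu_info *ci)
{
    assert(ci != NULL);
    return ci->logical_cpus > 8 && ci->vendor_is_intel && ci->has_xapic;
}
```

    Note that exactly 8 logical CPUs does not qualify; the threshold is
    "more than 8".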

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Venkatesh Pallipadi
     
  • o With the introduction of kexec as a boot loader, the assumption that the
      parameter segment will always be loaded at a lower address than the
      kernel and will be addressable by the early bootup page tables is no
      longer valid. In the kexec-on-panic case, the parameter segment might
      well be loaded beyond the kernel image and might not be addressable by
      the early boot page tables.
    o This case might be hit in the scenario where the user has reserved a
      chunk of memory for the second kernel, for example 16MB to 64MB, and has
      also built the second kernel for physical memory location 16MB. In this
      case kexec has no choice but to load the parameter segment at a higher
      address than the new kernel image, at a safe location where the new
      kernel does not stomp on it.
    o The problem should go away automatically once a relocatable kernel for
      i386 is in place and kexec can determine the location of the new kernel
      at run time and load the parameter segment at a lower address than the
      kernel image. Until then, this patch can go in (assuming it does not
      break something else).
    o This patch moves up the boot parameter saving code. Boot parameters are
      now copied out in protected mode before page tables are initialized.
      This ensures that the parameter segment is always addressable
      irrespective of its physical location.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • If the virtual 86 machine reaches an instruction which raises a General
    Protection Fault (such as CLI or STI), the instruction is emulated (in
    handle_vm86_fault). However, the emulation ignored the TF bit, so the
    hardware debug interrupt was not invoked after such an emulated instruction
    (and the DOS debugger missed it).

    This patch fixes the problem by emulating the hardware debug interrupt as
    the last action before control is returned to the VM86 program.
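
    In outline, the fix adds a trap-flag check at the end of the emulation
    path. The sketch below models that with a hypothetical state struct; TF
    is EFLAGS bit 8 (0x100), and everything prefixed sketch_ is invented for
    illustration rather than taken from handle_vm86_fault.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EFLAGS_TF 0x00000100u   /* trap flag, EFLAGS bit 8 */

enum sketch_insn { SKETCH_CLI, SKETCH_STI };

/* Hypothetical per-task VM86 state. */
struct sketch_vm86 {
    uint32_t eflags;          /* flags image of the VM86 program */
    int virt_if;              /* virtualized interrupt flag */
    int debug_trap_pending;   /* set when an emulated INT 1 is due */
};

/* Emulate a faulting instruction, then honor single-stepping. */
static void sketch_emulate(struct sketch_vm86 *vm, enum sketch_insn insn)
{
    switch (insn) {
    case SKETCH_CLI:
        vm->virt_if = 0;
        break;
    case SKETCH_STI:
        vm->virt_if = 1;
        break;
    default:
        assert(0 && "unknown instruction");
    }

    /* Last action before returning to the VM86 program. */
    if (vm->eflags & SKETCH_EFLAGS_TF)
        vm->debug_trap_pending = 1;
}
```

    Without the final check, a debugger single-stepping over CLI or STI would
    never see the step complete.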

    Signed-off-by: Petr Tesarik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Tesarik
     
  • The memory descriptors that comprise the EFI memory map are not fixed in
    stone such that the size could change in the future. This uses the memory
    descriptor size obtained from EFI to iterate over the memory map entries
    during boot. This enables the removal of an x86 specific pad (and ifdef)
    in the EFI header. I also couldn't stomach the broken up nature of the
    function to put EFI runtime calls into virtual mode any longer so I fixed
    that up a bit as well.
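
    The iteration pattern looks roughly like the sketch below: step through
    the map by the descriptor size reported by firmware rather than by
    sizeof() of the kernel's own struct. The trimmed descriptor layout and
    the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed descriptor: only the fields this sketch reads. */
struct sketch_mem_desc {
    uint32_t type;
    uint64_t phys_addr;
    uint64_t num_pages;
};

/*
 * Sum num_pages over nr_desc entries that are desc_size bytes apart.
 * desc_size comes from firmware and may exceed sizeof(struct sketch_mem_desc).
 */
static uint64_t sketch_total_pages(const void *map, size_t nr_desc,
                                   size_t desc_size)
{
    const unsigned char *p = map;
    uint64_t total = 0;
    size_t i;

    assert(desc_size >= sizeof(struct sketch_mem_desc));
    for (i = 0; i < nr_desc; i++, p += desc_size) {
        struct sketch_mem_desc d;

        memcpy(&d, p, sizeof(d));   /* copy out; avoids unaligned access */
        total += d.num_pages;
    }
    return total;
}
```

    Because the stride is a runtime value, no padding field is needed in the
    header to match a larger firmware descriptor.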

    For reference, this patch only impacts x86.

    Signed-off-by: Matt Tolentino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Tolentino
     
  • Use read_timer_tsc only when the CPU has a TSC. Thanks to Andrea for
    pointing this out. This should not be an issue on any platform, as all
    recent systems that have HPET also have CPUs that support TSC. The patch is
    still required for correctness.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Venkatesh Pallipadi
     
  • This patch pushes the creation of a rare signal frame (SIGBUS or SIGSEGV)
    into a separate function, thus saving stackspace in the main
    do_page_fault() stackframe. The effect is 132 bytes less of stack used by
    the typical do_page_fault() invocation - resulting in a denser
    cache-layout.

    (Another minor effect is that in case of kernel crashes that come from a
    pagefault, we add less space to the already existing frame, giving the
    crash functions a slightly higher chance to do their stuff without
    overflowing the stack.)

    (The changes also result in slightly cleaner code.)

    argument bugfix from "Guillaume C."

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • The LOG_BUF_SHIFT from lib/Kconfig.debug is sufficient.

    Signed-off-by: Adrian Bunk
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Extend the compat mode kludgeology in evdev to cover MIPS as well.

    Or why we should need something like is_compat_task() ...

    Signed-off-by: Ralf Baechle
    Cc: Vojtech Pavlik
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     
  • vr41xx doesn't need mach-vr41xx/timex.h. This patch has removed
    mach-vr41xx/timex.h.

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • This patch has fixed the following warnings.

    arch/mips/kernel/genex.S:250:5: warning: "CONFIG_64BIT" is not defined
    arch/mips/math-emu/cp1emu.c:1128:5: warning: "__mips64" is not defined
    arch/mips/math-emu/cp1emu.c:1206:5: warning: "__mips64" is not defined
    arch/mips/math-emu/cp1emu.c:1270:5: warning: "__mips64" is not defined
    arch/mips/math-emu/cp1emu.c:323:5: warning: "__mips64" is not defined
    arch/mips/math-emu/cp1emu.c:808:5: warning: "__mips64" is not defined
    arch/mips/math-emu/cp1emu.c:953:5: warning: "__mips64" is not defined
    arch/mips/mm/tlbex.c:519:5: warning: "CONFIG_64BIT" is not defined
    include/asm/reg.h:73:5: warning: "CONFIG_64BIT" is not defined

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • The addition of SYS_SUPPORTS_*_KERNEL and CPU_SUPPORTS_*_KERNEL is halfway.
    This patch has added more SYS_SUPPORTS_*_KERNEL and CPU_SUPPORTS_*_KERNEL
    to arch/mips/Kconfig. Please apply.

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • This patch has added pcibios_bus_to_resource to MIPS.

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • Add pcibios_select_root to MIPS.

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • Fix the MIPS coherency configuration such that we always keep the mapping
    state in when we need to on non-coherent platforms.

    Signed-off-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     
  • Signed-off-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     
  • Start cleaning 32-bit vs. 64-bit configuration.

    Signed-off-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     
  • This patch has changed from VR41XX to VR4100 series in arch/mips/Kconfig.

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • This patch has removed obsolete VRC4171 config.

    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • $ make menuconfig
    scripts/kconfig/mconf arch/i386/Kconfig
    drivers/char/Kconfig:847:warning: 'select' used by config symbol
    'TANBAC_TB0219' refer to undefined symbol 'PCI_VR41XX'

    Here is a patch for this warning fix.

    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • This patch has added default select configs for vr41xx.

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • This patch has added TANBAC VR4131 multichip module in arch/mips/Kconfig

    Signed-off-by: Yoichi Yuasa
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yoichi Yuasa
     
  • - MIPS Denmark no longer exists; the PCI vendor ID is now owned by
    MIPS Technologies.

    - Add ID for SOC-it, MIPS's system controller.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle