09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change the
generated code (from gcc's point of view we replaced unsigned int with
a typedef) and documents what's going on far better.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

01 Oct, 2005

1 commit

  • include/asm/hw_irq.h:70: warning: `struct hw_interrupt_type' declared inside parameter list
    include/asm/hw_irq.h:70: warning: its scope is only this definition or declaration, which is probably not what you want

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

22 Sep, 2005

1 commit

Follow-up to 4732efbeb997189d9f9b04708dc26bf8613ed721 - uml must just reuse
the backing architecture's support as-is. A micro-fixup is needed for the
included file, which won't affect i386 behaviour at all.

I've not tested compilation on x86_64, only on x86, but the code is almost the
same except for the culprit test, so everything should be OK on x86_64 too.

    Cc: Jakub Jelinek
    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Linus Torvalds

    Paolo 'Blaisorblade' Giarrusso
     

13 Sep, 2005

4 commits

  • As written in Documentation/feature-removal-schedule.txt, remove the
    io_remap_page_range() kernel API.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Original patch from Bertro Simul

    This is probably still not quite correct, but seems to be
    the best solution so far.

    Signed-off-by: Linus Torvalds

    Chuck Ebbert
     
As noted by matz@suse.de

The problem is that on i386 the syscallN
macro is defined like so:

    long __res; \
    __asm__ volatile ("int $0x80" \
    : "=a" (__res) \
    : "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
    "d" ((long)(arg3)),"S" ((long)(arg4)),"D" ((long)(arg5))); \

    If one of the arguments (in the _llseek syscall it's the arg4) is a pointer
    which the syscall is expected to write to (to the memory pointed to by this
    ptr), then this side-effect is not captured in the asm.

If anyone uses this macro to define their own version of the syscall
(sometimes necessary when not using glibc) and it's inlined, then GCC
doesn't know that this asm writes to "*dest", when called like so, for instance:

    out = 1;
    llseek (fd, bla, blubb, &out, trara)
    use (out);

    Here nobody tells GCC that "out" actually is written to (just a pointer to it
    is passed to the asm). Hence GCC might (and in the above bug did)
    copy-propagate "1" into the second use of "out".

    The easiest solution would be to add a "memory" clobber to the definition
    of this syscall macro. As this is a syscall, it shouldn't inhibit too many
    optimizations.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Since this is shared code I had to implement it for i386 too

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

11 Sep, 2005

2 commits

  • "extern inline" doesn't make much sense.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch (written by me and also containing many suggestions of Arjan van
    de Ven) does a major cleanup of the spinlock code. It does the following
    things:

    - consolidates and enhances the spinlock/rwlock debugging code

    - simplifies the asm/spinlock.h files

    - encapsulates the raw spinlock type and moves generic spinlock
    features (such as ->break_lock) into the generic code.

    - cleans up the spinlock code hierarchy to get rid of the spaghetti.

    Most notably there's now only a single variant of the debugging code,
    located in lib/spinlock_debug.c. (previously we had one SMP debugging
    variant per architecture, plus a separate generic one for UP builds)

Also, I've enhanced the rwlock debugging facility: it will now track
write-owners. There is new spinlock-owner/CPU-tracking on SMP builds too.
All locks have lockup detection now, which works for both soft and hard
spin/rwlock lockups.

    The arch-level include files now only contain the minimally necessary
    subset of the spinlock code - all the rest that can be generalized now
    lives in the generic headers:

    include/asm-i386/spinlock_types.h | 16
    include/asm-x86_64/spinlock_types.h | 16

    I have also split up the various spinlock variants into separate files,
    making it easier to see which does what. The new layout is:

SMP                          | UP
-----------------------------|-----------------------------------
asm/spinlock_types_smp.h     | linux/spinlock_types_up.h
linux/spinlock_types.h       | linux/spinlock_types.h
asm/spinlock_smp.h           | linux/spinlock_up.h
linux/spinlock_api_smp.h     | linux/spinlock_api_up.h
linux/spinlock.h             | linux/spinlock.h

    /*
    * here's the role of the various spinlock/rwlock related include files:
    *
    * on SMP builds:
    *
    * asm/spinlock_types.h: contains the raw_spinlock_t/raw_rwlock_t and the
    * initializers
    *
    * linux/spinlock_types.h:
    * defines the generic type and initializers
    *
    * asm/spinlock.h: contains the __raw_spin_*()/etc. lowlevel
    * implementations, mostly inline assembly code
    *
    * (also included on UP-debug builds:)
    *
    * linux/spinlock_api_smp.h:
    * contains the prototypes for the _spin_*() APIs.
    *
    * linux/spinlock.h: builds the final spin_*() APIs.
    *
    * on UP builds:
    *
* linux/spinlock_types_up.h:
    * contains the generic, simplified UP spinlock type.
    * (which is an empty structure on non-debug builds)
    *
    * linux/spinlock_types.h:
    * defines the generic type and initializers
    *
    * linux/spinlock_up.h:
    * contains the __raw_spin_*()/etc. version of UP
    * builds. (which are NOPs on non-debug, non-preempt
    * builds)
    *
    * (included on UP-non-debug builds:)
    *
    * linux/spinlock_api_up.h:
    * builds the _spin_*() APIs.
    *
    * linux/spinlock.h: builds the final spin_*() APIs.
    */

    All SMP and UP architectures are converted by this patch.

arm, i386, ia64, ppc, ppc64, s390/s390x and x64 were build-tested via
cross-compilers. m32r, mips, sh and sparc have not been tested yet, but
should be mostly fine.

    From: Grant Grundler

    Booted and lightly tested on a500-44 (64-bit, SMP kernel, dual CPU).
    Builds 32-bit SMP kernel (not booted or tested). I did not try to build
    non-SMP kernels. That should be trivial to fix up later if necessary.

    I converted bit ops atomic_hash lock to raw_spinlock_t. Doing so avoids
    some ugly nesting of linux/*.h and asm/*.h files. Those particular locks
    are well tested and contained entirely inside arch specific code. I do NOT
    expect any new issues to arise with them.

    If someone does ever need to use debug/metrics with them, then they will
    need to unravel this hairball between spinlocks, atomic ops, and bit ops
    that exist only because parisc has exactly one atomic instruction: LDCW
    (load and clear word).

    From: "Luck, Tony"

    ia64 fix

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Grant Grundler
    Cc: Matthew Wilcox
    Signed-off-by: Hirokazu Takata
    Signed-off-by: Mikael Pettersson
    Signed-off-by: Benoit Boissinot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

10 Sep, 2005

4 commits

  • Linus Torvalds
     
  • I have a system (Biostar IDEQ210M mini-pc with a VIA chipset) which will
    not reboot unless a keyboard is plugged in to it. I have tried all
    combinations of the kernel "reboot=x,y" flags to no avail. Rebooting by
    any method will leave the system in a wedged state (at the "Restarting
    system" message).

    I finally tracked the problem down to the machine's refusal to fully reboot
    unless the keyboard controller status register had bit 2 set. This is the
    "System flag" which when set, indicates successful completion of the
    keyboard controller self-test (Basic Assurance Test, BAT).

    I suppose that something is trying to protect against sporadic reboots
    unless the keyboard controller is in a good state (a keyboard is present),
    but I need this machine to be headless.

    I found that setting the system flag (via the command byte) before giving
    the "pulse reset line" command will allow the reboot to proceed. The patch
    is simple, and I think it should be fine for everybody whether they have
    this type of machine or not. This affects the "hard" reboot (as done when
    the kernel boot flags "reboot=c,h" are used).

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Truxton Fulton
     
  • Fix a typo involving CONFIG_ACPI_SRAT.

    Signed-off-by: Magnus Damm
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Magnus Damm
     
Building asm-offsets.h has been moved to a separate Kbuild file
located in the top-level directory. This allows us to share the
functionality across the architectures.

The old rules in architecture-specific Makefiles will die
in subsequent patches.

Furthermore, the usual kbuild dependency tracking is now used
when deciding to rebuild asm-offsets.s, so we no longer risk
a failed rebuild caused by asm-offsets.c dependencies being touched.

With this common rule-set we now force the same name across
all architectures. Following patches will fix the rest.

    Signed-off-by: Sam Ravnborg

    Sam Ravnborg
     

08 Sep, 2005

10 commits

  • Len Brown
     
  • This patch gathers all the struct flock64 definitions (and the operations),
    puts them under !CONFIG_64BIT and cleans up the arch files.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • This patch just gathers together all the struct flock definitions except
    xtensa into asm-generic/fcntl.h.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • This patch puts the most popular of each fcntl operation/flag into
    asm-generic/fcntl.h and cleans up the arch files.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • This patch puts the most popular of each open flag into asm-generic/fcntl.h
    and cleans up the arch files.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • This set of patches creates asm-generic/fcntl.h and consolidates as much as
    possible from the asm-*/fcntl.h files into it.

    This patch just gathers all the identical bits of the asm-*/fcntl.h files into
    asm-generic/fcntl.h.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Remove the deprecated (and unused) verify_area() from various uaccess.h
    headers.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • unused and useless..

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
The size of the auxiliary vector is fixed at 42 in linux/sched.h, but it
isn't very obvious when looking at linux/elf.h. This patch adds
AT_VECTOR_SIZE so that we can change it if necessary when a new vector is
added.

    Because of include file ordering problems, doing this necessitated the
    extraction of the AT_* symbols into a standalone header file.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H. J. Lu
     
  • ATM pthread_cond_signal is unnecessarily slow, because it wakes one waiter
    (which at least on UP usually means an immediate context switch to one of
    the waiter threads). This waiter wakes up and after a few instructions it
    attempts to acquire the cv internal lock, but that lock is still held by
    the thread calling pthread_cond_signal. So it goes to sleep and eventually
    the signalling thread is scheduled in, unlocks the internal lock and wakes
    the waiter again.

    Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal
    to avoid this performance issue, but it was removed when locks were
    redesigned to the 3 state scheme (unlocked, locked uncontended, locked
    contended).

The following scenario shows why simply using FUTEX_REQUEUE in
pthread_cond_signal together with using lll_mutex_unlock_force in place of
lll_mutex_unlock is not enough, and probably why it was disabled at
that time:

    The number is value in cv->__data.__lock.
    thr1 thr2 thr3
    0 pthread_cond_wait
    1 lll_mutex_lock (cv->__data.__lock)
    0 lll_mutex_unlock (cv->__data.__lock)
    0 lll_futex_wait (&cv->__data.__futex, futexval)
    0 pthread_cond_signal
    1 lll_mutex_lock (cv->__data.__lock)
    1 pthread_cond_signal
    2 lll_mutex_lock (cv->__data.__lock)
    2 lll_futex_wait (&cv->__data.__lock, 2)
    2 lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock)
    # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE
    2 lll_mutex_unlock_force (cv->__data.__lock)
    0 cv->__data.__lock = 0
    0 lll_futex_wake (&cv->__data.__lock, 1)
    1 lll_mutex_lock (cv->__data.__lock)
    0 lll_mutex_unlock (cv->__data.__lock)
    # Here, lll_mutex_unlock doesn't know there are threads waiting
    # on the internal cv's lock

    Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal,
    but it will cost us not one, but 2 extra syscalls and, what's worse, one of
    these extra syscalls will be done for every single waiting loop in
    pthread_cond_*wait.

    We would need to use lll_mutex_unlock_force in pthread_cond_signal after
    requeue and lll_mutex_cond_lock in pthread_cond_*wait after lll_futex_wait.

    Another alternative is to do the unlocking pthread_cond_signal needs to do
    (the lock can't be unlocked before lll_futex_wake, as that is racy) in the
    kernel.

    I have implemented both variants, futex-requeue-glibc.patch is the first
    one and futex-wake_op{,-glibc}.patch is the unlocking inside of the kernel.
    The kernel interface allows userland to specify how exactly an unlocking
    operation should look like (some atomic arithmetic operation with optional
    constant argument and comparison of the previous futex value with another
    constant).

    It has been implemented just for ppc*, x86_64 and i?86, for other
    architectures I'm including just a stub header which can be used as a
    starting point by maintainers to write support for their arches and ATM
    will just return -ENOSYS for FUTEX_WAKE_OP. The requeue patch has been
    (lightly) tested just on x86_64, the wake_op patch on ppc64 kernel running
    32-bit and 64-bit NPTL and x86_64 kernel running 32-bit and 64-bit NPTL.

    With the following benchmark on UP x86-64 I get:

for i in nptl-orig nptl-requeue nptl-wake_op; do echo time elf/ld.so --library-path .:$i /tmp/bench; \
for j in 1 2; do ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done; done
    time elf/ld.so --library-path .:nptl-orig /tmp/bench
    real 0m0.655s user 0m0.253s sys 0m0.403s
    real 0m0.657s user 0m0.269s sys 0m0.388s
    time elf/ld.so --library-path .:nptl-requeue /tmp/bench
    real 0m0.496s user 0m0.225s sys 0m0.271s
    real 0m0.531s user 0m0.242s sys 0m0.288s
    time elf/ld.so --library-path .:nptl-wake_op /tmp/bench
    real 0m0.380s user 0m0.176s sys 0m0.204s
    real 0m0.382s user 0m0.175s sys 0m0.207s

    The benchmark is at:
    http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt
    Older futex-requeue-glibc.patch version is at:
    http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt
    Older futex-wake_op-glibc.patch version is at:
    http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt
    Will post a new version (just x86-64 fixes so that the patch
    applies against pthread_cond_signal.S) to libc-hacker ml soon.

    Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded
    testcase that will not test the atomicity of the operation, but at least
    check if the threads that should have been woken up are woken up and
    whether the arithmetic operation in the kernel gave the expected results.

    Acked-by: Ingo Molnar
    Cc: Ulrich Drepper
    Cc: Jamie Lokier
    Cc: Rusty Russell
    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Jelinek
     

05 Sep, 2005

17 commits

Jeff Dike,
Paolo 'Blaisorblade' Giarrusso,
Bodo Stroesser

    Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling PTRACE_SYSCALL
    except that the kernel does not execute the requested syscall; this is useful
    to improve performance for virtual environments, like UML, which want to run
    the syscall on their own.

    In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry
    and on exit, and each time you also have two context switches; with SYSEMU you
    avoid the 2nd stop and so save two context switches per syscall.

    Also, some architectures don't have support in the host for changing the
    syscall number via ptrace(), which is currently needed to skip syscall
    execution (UML turns any syscall into getpid() to avoid it being executed on
    the host). Fixing that is hard, while SYSEMU is easier to implement.

    * This version of the patch includes some suggestions of Jeff Dike to avoid
    adding any instructions to the syscall fast path, plus some other little
    changes, by myself, to make it work even when the syscall is executed with
    SYSENTER (but I'm unsure about them). It has been widely tested for quite a
    lot of time.

* Various fixes were included to handle the switches between the
various states, i.e. when, for instance, a syscall entry is traced with one of
PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on exit.
    Basically, this is done by remembering which one of them was used even after
    the call to ptrace_notify().

    * We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP
    to make do_syscall_trace() notice that the current syscall was started with
    SYSEMU on entry, so that no notification ought to be done in the exit path;
    this is a bit of a hack, so this problem is solved in another way in next
    patches.

    * Also, the effects of the patch:
    "Ptrace - i386: fix Syscall Audit interaction with singlestep"
    are cancelled; they are restored back in the last patch of this series.

    Detailed descriptions of the patches doing this kind of processing follow (but
    I've already summed everything up).

    * Fix behaviour when changing interception kind #1.

In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag
only after doing the debugger notification; but the debugger might have
changed the status of this flag because it continued execution with
PTRACE_SYSCALL, so this is wrong. This patch fixes it by saving the flag
status before calling ptrace_notify().

    * Fix behaviour when changing interception kind #2:
    avoid intercepting syscall on return when using SYSCALL again.

    A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL
    crashes.

The problem is in arch/i386/kernel/entry.S. The current SYSEMU patch
inhibits the syscall handler from being called, but does not prevent
do_syscall_trace() from being called after this for syscall completion
interception.

    The appended patch fixes this. It reuses the flag TIF_SYSCALL_EMU to
    remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since
    the flag is unused in the depicted situation.

    * Fix behaviour when changing interception kind #3:
    avoid intercepting syscall on return when using SINGLESTEP.

When testing 2.6.9 with the skas3.v6 patch and my latest patch, I had
problems with singlestepping on UML in SKAS with SYSEMU. It looped,
receiving SIGTRAPs without moving forward; the EIP of the traced process
was the same for all SIGTRAPs.

    What's missing is to handle switching from PTRACE_SYSCALL_EMU to
    PTRACE_SINGLESTEP in a way very similar to what is done for the change from
    PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE.

I.e., after calling ptrace(PTRACE_SYSEMU), on the return path the debugger is
notified and then wakes up the process; the syscall is executed (or skipped,
when do_syscall_trace() returns 0, i.e. when using PTRACE_SYSEMU), and
do_syscall_trace() is called again. Since we are on the return path of a
SYSEMU'd syscall, if the wake-up is performed through ptrace(PTRACE_SYSCALL),
we must still avoid notifying the parent of the syscall exit. Now this
behaviour is extended even to resuming with PTRACE_SINGLESTEP.

    Signed-off-by: Paolo 'Blaisorblade' Giarrusso
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Vivier
     
The timers lack .suspend/.resume methods. Because of this, jiffies got a
big compensation after an S3 resume, and then the softlockup watchdog
reported an oops. This occurred with HPET enabled, but it's also possible
with other timers.

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • for_each_cpu walks through all processors in cpu_possible_map, which is
    defined as cpu_callout_map on i386 and isn't initialised until all
    processors have been booted. This breaks things which do for_each_cpu
    iterations early during boot. So, define cpu_possible_map as a bitmap with
NR_CPUS bits populated. This was triggered by a patch I'm working on which
does alloc_percpu before bringing up secondary processors.

    From: Alexander Nyberg

    i386-boottime-for_each_cpu-broken.patch
    i386-boottime-for_each_cpu-broken-fix.patch

    The SMP version of __alloc_percpu checks the cpu_possible_map before
    allocating memory for a certain cpu. With the above patches the BSP cpuid
    is never set in cpu_possible_map which breaks CONFIG_SMP on uniprocessor
    machines (as soon as someone tries to dereference something allocated via
    __alloc_percpu, which in fact is never allocated since the cpu is not set
    in cpu_possible_map).

    Signed-off-by: Zwane Mwaikambo
    Signed-off-by: Alexander Nyberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zwane Mwaikambo
     
  • Add a clone operation for pgd updates.

    This helps complete the encapsulation of updates to page tables (or pages
    about to become page tables) into accessor functions rather than using
    memcpy() to duplicate them. This is both generally good for consistency
    and also necessary for running in a hypervisor which requires explicit
    updates to page table entries.

    The new function is:

    clone_pgd_range(pgd_t *dst, pgd_t *src, int count);

dst - pointer to a pgd range anywhere on a pgd page
src - ditto
count - the number of pgds to copy.

    dst and src can be on the same page, but the range must not overlap
    and must not cross a page boundary.

Note that I omitted using this call to copy pgd entries into the
software-suspend page root, since this is not technically a live paging
structure; rather, it is used on resume from suspend. CC'ing Pavel in case
he has any feedback on this.

    Thanks to Chris Wright for noticing that this could be more optimal in
    PAE compiles by eliminating the memset.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
This patch adds a notification to die_nmi that the system is about to
be taken down. If the notification is handled with a NOTIFY_STOP return, the
system is given a new lease on life.

We also change the nmi watchdog to carry on if die_nmi returns.

This gives debug code a chance to a) catch watchdog timeouts and b) possibly
allow the system to continue, recognizing that the timeout may be due to
debugger activities such as single-stepping, which is usually done with
the "other" cpus held.

    Signed-off-by: George Anzinger
    Cc: Keith Owens
    Signed-off-by: George Anzinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Anzinger
     
Introduce a write accessor for updating the current LDT. This is required
for hypervisors like Xen that do not allow LDT pages to be directly
written.

    Testing - here's a fun little LDT test that can be trivially modified to
    test limits as well.

/*
 * Copyright (c) 2005, Zachary Amsden (zach@vmware.com)
 * This is licensed under the GPL.
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <asm/ldt.h>
#define __KERNEL__
#include <asm/unistd.h>

void main(void)
{
    struct user_desc desc;
    char *code;
    unsigned long long tsc;

    code = (char *)mmap(0, 8192, PROT_EXEC|PROT_READ|PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    desc.entry_number = 0;
    desc.base_addr = (unsigned long)code;
    desc.limit = 1;
    desc.seg_32bit = 1;
    desc.contents = MODIFY_LDT_CONTENTS_CODE;
    desc.read_exec_only = 0;
    desc.limit_in_pages = 1;
    desc.seg_not_present = 0;
    desc.useable = 1;
    if (modify_ldt(1, &desc, sizeof(desc)) != 0) {
        perror("modify_ldt");
    }
    printf("code base is 0x%08x\n", (unsigned)code);
    code[0x0ffe] = 0x0f;  /* rdtsc */
    code[0x0fff] = 0x31;
    code[0x1000] = 0xcb;  /* lret */
    __asm__ __volatile__("lcall $7,$0xffe" : "=A" (tsc));
    printf("TSC is 0x%016llx\n", tsc);
}

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • The pushf/popf in switch_to are ONLY used to switch IOPL. Making this
    explicit in C code is more clear. This pushf/popf pair was added as a
    bugfix for leaking IOPL to unprivileged processes when using
    sysenter/sysexit based system calls (sysexit does not restore flags).

    When requesting an IOPL change in sys_iopl(), it is just as easy to change
    the current flags and the flags in the stack image (in case an IRET is
    required), but there is no reason to force an IRET if we came in from the
    SYSENTER path.

    This change is the minimal solution for supporting a paravirtualized Linux
    kernel that allows user processes to run with I/O privilege. Other
    solutions require radical rewrites of part of the low level fault / system
    call handling code, or do not fully support sysenter based system calls.

    Unfortunately, this added one field to the thread_struct. But as a bonus,
    on P4, the fastest time measured for switch_to() went from 312 to 260
    cycles, a win of about 17% in the fast case through this performance
    critical path.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Privilege checking cleanup. Originally, these diffs were much greater, but
    recent cleanups in Linux have already done much of the cleanup. I added
    some explanatory comments in places where the reasoning behind certain
    tests is rather subtle.

Also, in traps.c, we can skip the user_mode check in handle_BUG(). The
reason is that there are only two call chains - one via die_if_kernel() and
one via do_page_fault(), both entering from die() - and both of these paths
already ensure that a kernel-mode failure has happened. Also, the original
check here, if (user_mode(regs)), was insufficient anyway, since it would
not rule out BUG faults from V8086 mode execution.

    Saving the %ss segment in show_regs() rather than assuming a fixed value
    also gives better information about the current kernel state in the
    register dump.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Some more assembler cleanups I noticed along the way.

    Signed-off-by: Zachary Amsden
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Noticed by Chuck Ebbert: the .ldt entry of the TSS was set up incorrectly.
    It never mattered since this was a leftover from old times, so remove it.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Also, setting PDPEs in PAE mode does not require atomic operations, since the
    PDPEs are cached by the processor, and only reloaded on an explicit or
    implicit reload of CR3.

    Since the four PDPEs must always be present in an active root, and the kernel
    PDPE is never updated, we are safe even from SMIs and interrupts / NMIs using
    task gates (which reload CR3). Actually, much of this is moot, since the user
    PDPEs are never updated either, and the only usage of task gates is by the
    doublefault handler. It appears the only place PGDs get updated in PAE mode
    is in init_low_mappings() / zap_low_mapping() for initial page table creation
    and recovery from ACPI sleep state, and these sites are safe by inspection.
    Getting rid of the cmpxchg8b saves code space and 720 cycles in pgd_alloc on
    P4.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • GCC can generate better code around descriptor update and access functions
    when there is not an explicit "eax" register constraint.

    Testing: You won't boot if this is messed up, since the TSS descriptor will be
    corrupted. Verified the assembler and booted.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • i386 inline assembler cleanup.

    This change encapsulates descriptor and task register management. Also,
    it is possible to improve assembler generation in two cases; savesegment
    may store the value in a register instead of a memory location, which
    allows GCC to optimize stack variables into registers, and MOV MEM, SEG
    is always a 16-bit write to memory, making the casting in math-emu
    unnecessary.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
i386 arch cleanup. Introduce the serialize macro to serialize processor
state. Why the microcode update needs it I am not quite sure, since wrmsr()
is already a serializing instruction, but it is a microcode update, so I will
keep the semantics the same, since this could be a timing workaround. As far
as I can tell, this has always been there since the original microcode update
source.

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • i386 Inline asm cleanup. Use cr/dr accessor functions.

    Also, a potential bugfix. Also, some CR accessors really should be volatile.
    Reads from CR0 (numeric state may change in an exception handler), writes to
    CR4 (flipping CR4.TSD) and reads from CR2 (page fault) prevent instruction
    re-ordering. I did not add memory clobber to CR3 / CR4 / CR0 updates, as it
    was not there to begin with, and in no case should kernel memory be clobbered,
    except when doing a TLB flush, which already has memory clobber.

    I noticed that page invalidation does not have a memory clobber. I can't find
    a bug as a result, but there is definitely a potential for a bug here:

    #define __flush_tlb_single(addr) \
    __asm__ __volatile__("invlpg %0": :"m" (*(char *) addr))

    Signed-off-by: Zachary Amsden
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
This is a subarch update for ES7000. I've modified the platform check code and
removed unnecessary OEM table parsing for newer systems that don't use OEM
information during boot. Parsing the table in fact causes problems,
and the platform doesn't get recognized. The patch only affects the ES7000
subarch.

    Signed-off-by:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Natalie.Protasevich@unisys.com
     
The i386 generic subarchitecture requires explicit DMI strings or a
command-line option to enable bigsmp mode. The patch below removes that
restriction, and uses bigsmp as soon as it finds more than 8 logical CPUs,
Intel processors and xAPIC support.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Venkatesh Pallipadi