10 May, 2007

40 commits

  • "reshape_position" records how much progress has been made on a "reshape"
    (adding drives, changing layout or chunksize).

    When it is set, the number of drives, layout and chunksize can each have
    two possible values: an old and a new.

    So allow these different values to be visible, and allow both old and new to
    be set: Set the old ones first, then the reshape_position, then the new
    values.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
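
    As a concrete illustration of the ordering described above, here is a minimal
    user-space sketch that writes md's sysfs attributes in that order. The device
    path and the particular values are made up for the example; only the ordering
    (old values, then reshape_position, then new values) comes from the text.

        #include <stdio.h>

        /* Write one value to an md sysfs attribute of md0 (path assumed). */
        static int md_set(const char *attr, const char *val)
        {
            char path[128];
            FILE *f;

            snprintf(path, sizeof(path), "/sys/block/md0/md/%s", attr);
            f = fopen(path, "w");
            if (!f)
                return -1;
            fprintf(f, "%s\n", val);
            return fclose(f);
        }

        int main(void)
        {
            /* Old geometry first ... */
            md_set("raid_disks", "4");
            md_set("chunk_size", "65536");
            md_set("layout", "2");
            /* ... then mark the reshape as in progress ... */
            md_set("reshape_position", "0");
            /* ... then the new geometry. */
            md_set("raid_disks", "5");
            md_set("chunk_size", "131072");
            return 0;
        }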
     
  • SLUB doesn't like slashes as it wants to use the cache name as the name of a
    directory (or symlink) in sysfs.

    Signed-off-by: Neil Brown
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
    use it. We shouldn't really be using csum_partial anyway as it is an
    internal-to-networking interface.

    So replace it with C code to do the same thing. Speed is not crucial here, so
    something simple and correct is best.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
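
    A rough idea of what "simple and correct" can look like; this is an
    illustrative stand-in written as a user-space program, not the actual
    replacement code that went into md.

        #include <stdint.h>
        #include <stddef.h>
        #include <stdio.h>

        /* Sum 32-bit words and fold the carries back in -- roughly the kind of
         * arithmetic csum_partial performs.  Speed is irrelevant here. */
        static uint32_t simple_csum(const uint32_t *buf, size_t words)
        {
            uint64_t sum = 0;
            size_t i;

            for (i = 0; i < words; i++)
                sum += buf[i];
            while (sum >> 32)
                sum = (sum & 0xffffffffULL) + (sum >> 32);
            return (uint32_t)sum;
        }

        int main(void)
        {
            uint32_t data[] = { 1, 2, 3, 0xffffffffUL };

            printf("csum = %#x\n", (unsigned)simple_csum(data, 4));
            return 0;
        }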
     
  • We need to check for internal consistency of the superblock in load_super;
    validate_super is for inter-device consistency.

    With the test in the wrong place, a badly created array will confuse md
    rather than produce sensible errors.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • We can save some lines of code by using seq_release_private().

    Signed-off-by: Martin Peschke
    Acked-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Peschke
     
  • Use the ARRAY_SIZE macro already defined in kernel.h.

    Signed-off-by: Ahmed S. Darwish
    Acked-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ahmed S. Darwish
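
    For illustration, the macro (as kernel.h defined it at the time) and the kind
    of open-coded expression it replaces, in a small standalone program:

        #include <stdio.h>

        /* As defined in include/linux/kernel.h. */
        #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

        static const int primes[] = { 2, 3, 5, 7, 11 };

        int main(void)
        {
            size_t i;

            /* Instead of: for (i = 0; i < sizeof(primes)/sizeof(primes[0]); i++) */
            for (i = 0; i < ARRAY_SIZE(primes); i++)
                printf("%d\n", primes[i]);
            return 0;
        }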
     
  • My GeForce isn't supported by the nvidia framebuffer driver.

    /sbin/lspci
    01:00.0 VGA compatible controller: nVidia Corporation Unknown device 02e2 (rev a2)

    /usr/sbin/fbset -i

    mode "1024x768-60"
    # D: 65.003 MHz, H: 48.365 kHz, V: 60.006 Hz
    geometry 1024 768 1024 32767 8
    timings 15384 160 24 29 3 136 6
    accel true
    rgba 8/0,8/0,8/0,0/0
    endmode

    Frame buffer device information:
    Name : NV2e
    Address : 0xe0000000
    Size : 134217728
    Type : PACKED PIXELS
    Visual : PSEUDOCOLOR
    XPanStep : 8
    YPanStep : 1
    YWrapStep : 0
    LineLength : 1024
    MMIO Address: 0xf6000000
    MMIO Size : 16777216
    Accelerator : Unknown (46)

    Here is a patch for this problem.

    Signed-off-by: Michal Piotrowski
    Cc: "Antonino A. Daplas"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Piotrowski
     
  • Provide framebuffer page protection flags and definitions of
    fb_readl/fb_writel for AVR32.

    Signed-off-by: Haavard Skinnemoen
    Cc: "Antonino A. Daplas"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haavard Skinnemoen
     
  • Move fb_get_caps() method to svgalib.c as svga_get_caps() so it can be used by
    s3fb, arkfb and vt8623fb.

    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Antonino A. Daplas
     
  • This patch adds an fbdev driver for graphics cards with the ARK Logic 2000PV
    graphics chip and ICS 5342 RAMDAC.

    [adaplas@gmail.com: build fixes]
    Signed-off-by: Ondrej Zajicek
    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ondrej Zajicek
     
  • This patch adds an fbdev driver for the graphics core in the VIA VT8623.

    [adaplas@gmail.com: build fixes]
    Signed-off-by: Ondrej Zajicek
    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ondrej Zajicek
     
  • Replace automatic variable instances of __attribute__ ((unused)) with
    __maybe_unused.

    Cc: Andy Whitcroft
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Replace automatic variable instances of __attribute__((unused)) with
    __maybe_unused in mca_nmi_hook().

    Cc: James Bottomley
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There is no such thing as labeling a variable as __attribute__((used)). Since
    ts_shift is not referenced in inline assembly, we assume that we're simply
    suppressing a warning here if the variable is declared but unreferenced.

    Cc: Paul Mundt
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Use the new __maybe_unused macro here.

    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • __used is defined to be __attribute__((unused)) for all pre-3.3 gcc
    compilers to suppress warnings for functions that may appear unused, perhaps
    because they are referenced only in inline assembly. It is defined to be
    __attribute__((used)) for gcc 3.3 and later so that the code is still
    emitted for such functions.

    __maybe_unused is defined to be __attribute__((unused)) for both function
    and variable use if it could possibly be unreferenced due to the evaluation
    of preprocessor macros. Function prototypes shall be marked with
    __maybe_unused if the actual definition of the function is dependent on
    preprocessor macros.

    No update to compiler-intel.h is necessary because ICC supports both
    __attribute__((used)) and __attribute__((unused)) as specified by the gcc
    manual.

    __attribute_used__ is deprecated and will be removed once all current
    code is converted to using __used.

    Cc: Rusty Russell
    Cc: Adrian Bunk
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
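
    A small self-contained example of the variable case. The fallback #define is
    only there so the sketch builds outside the kernel tree, where compiler.h
    provides the real definition.

        #include <stdio.h>

        #ifndef __maybe_unused                     /* fallback for user space */
        #define __maybe_unused __attribute__((unused))
        #endif

        int main(void)
        {
            /* Referenced only when DEBUG is defined; without the annotation,
             * gcc -Wall warns about an unused variable in the !DEBUG build. */
            int debug_level __maybe_unused = 3;

        #ifdef DEBUG
            printf("debug level %d\n", debug_level);
        #endif
            return 0;
        }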
     
  • This finally renames the thread_info field in the task structure to stack, so
    that the assumptions about this field are gone and archs have more freedom
    about placing the thread_info structure.

    Nonbroken archs which have a proper thread pointer can do the access to both
    current thread and task structure via a single pointer.

    It'll allow for a few more cleanups of the fork code, from which e.g. ia64
    could benefit.

    Signed-off-by: Roman Zippel
    [akpm@linux-foundation.org: build fix]
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Haavard Skinnemoen
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Andi Kleen
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
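
    For orientation, a sketch of what the rename means for the generic accessor;
    the real macro lives in include/linux/sched.h and may differ in detail.

        /* Before: generic code reached the thread_info directly,
         *
         *     #define task_thread_info(task)  ((task)->thread_info)
         *
         * After the rename the same wrapper casts the 'stack' pointer, so
         * callers do not care where an arch actually puts thread_info: */
        #define task_thread_info(task)  ((struct thread_info *)(task)->stack)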
     
  • Recently a few direct accesses to the thread_info in the task structure snuck
    back, so this wraps them with the appropriate wrapper.

    Signed-off-by: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • This will later allow an arch to add module specific information via linker
    generated tables instead of poking directly in the module object structure.

    Signed-off-by: Roman Zippel
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • We need to make sure that the clocksources are resumed when timekeeping is
    resumed. The current resume logic does not guarantee this.

    Add a resume function pointer to the clocksource struct, so clocksource
    drivers which need to reinitialize the clocksource can provide a resume
    function.

    Add a resume function, which calls the clocksource resume functions where
    available and resets the watchdog, so a stable TSC can be used across
    suspend/resume.

    Signed-off-by: Thomas Gleixner
    Cc: john stultz
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
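
    A kernel-style sketch of a driver using the new hook; the hardware-specific
    bodies are placeholders and the name/rating/mask values are arbitrary.

        #include <linux/clocksource.h>

        static cycle_t example_read(void)
        {
            return 0;                       /* read the hardware counter here */
        }

        static void example_resume(void)
        {
            /* reprogram the counter hardware after suspend */
        }

        static struct clocksource example_clocksource = {
            .name   = "example",
            .rating = 200,
            .read   = example_read,
            .mask   = CLOCKSOURCE_MASK(32),
            .resume = example_resume,       /* the new callback */
        };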
     
  • Currently the slab allocators contain callbacks into the page allocator to
    perform the draining of pagesets on remote nodes. This requires SLUB to have
    a whole subsystem in order to be compatible with SLAB. Moving node draining
    out of the slab allocators avoids a section of code in SLUB.

    Move the node draining so that it is done when the vm statistics are updated.
    At that point we are already touching all the cachelines with the pagesets of
    a processor.

    Add an expire counter there. If we have to update per zone or global vm
    statistics then assume that the pageset will require subsequent draining.

    The expire counter will be decremented on each vm stats update pass until it
    reaches zero. Then we will drain one batch from the pageset. The draining
    will cause vm counter updates which will then cause another expiration until
    the pcp is empty. So we will drain a batch every 3 seconds.

    Note that remote node draining is a somewhat esoteric feature that is required
    on large NUMA systems because otherwise significant portions of system memory
    can become trapped in pcp queues. The number of pcps is determined by the
    number of processors and nodes in a system. A system with 4 processors and 2
    nodes has 8 pcps, which is okay. But a system with 1024 processors and 512
    nodes has 512k pcps, with a high potential for large amounts of memory being
    caught in them.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
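
    The expire logic, reduced to a toy user-space model. This shows the shape of
    the idea only; the names and the batch size are not the mm/vmstat.c code.

        #include <stdio.h>

        struct pcp {
            int count;       /* pages currently held in this pageset */
            int expire;      /* update passes left before we drain a batch */
        };

        /* One periodic vm statistics update pass for a remote pageset. */
        static int vmstat_pass(struct pcp *p, int stats_changed)
        {
            if (stats_changed) {
                p->expire = 3;              /* assume a drain will be needed */
                return 0;
            }
            if (!p->count || !p->expire)
                return 0;
            if (--p->expire)
                return 0;
            /* Drain one batch.  The drain is itself a counter update, so the
             * next pass re-arms the expire counter and another batch goes out
             * a few passes later, until the pageset is empty. */
            p->count -= (p->count < 8) ? p->count : 8;
            return 1;
        }

        int main(void)
        {
            struct pcp p = { .count = 20, .expire = 0 };
            int pass, changed = 1;          /* pretend pass 0 saw activity */

            for (pass = 0; pass < 15; pass++) {
                changed = vmstat_pass(&p, changed);
                printf("pass %2d: count=%d expire=%d\n", pass, p.count, p.expire);
            }
            return 0;
        }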
     
  • Make the vm statistics update interval configurable. The code in mm now makes
    the vm statistics interval independent from the cache reaper, so use that
    opportunity to make it configurable.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • vmstat currently uses the cache reaper to periodically bring the statistics
    up to date. The cache reaper only exists in SLUB as a way to provide
    compatibility with SLAB. This patch removes the vmstat calls from the slab
    allocators and gives vmstat its own handling.

    The advantage is also that we can use a different frequency for the updates.
    Refreshing vm stats is a pretty fast job, so we can run this every second
    and stagger it by only one tick. This will lead to some overlap in large
    systems. For example, a system running at 250 HZ with 1024 processors will
    have 4 vm updates occurring at once.

    However, the vm stats update only accesses per node information. It is only
    necessary to stagger the vm statistics updates per processor in each node. Vm
    counter updates occurring on distant nodes will not cause cacheline
    contention.

    We could implement an alternate approach that runs the first processor on each
    node at the second and then each of the other processors on a node on a
    subsequent tick. That may be useful to keep a large amount of the second free
    of timer activity. Maybe the timer folks will have some feedback on this one?

    [jirislaby@gmail.com: add missing break]
    Cc: Arjan van de Ven
    Signed-off-by: Christoph Lameter
    Signed-off-by: Jiri Slaby
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
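
    A kernel-style sketch of the per-cpu staggering, assuming delayed work drives
    the periodic update; the symbol names here are illustrative, not the actual
    mm/vmstat.c ones.

        #include <linux/workqueue.h>
        #include <linux/percpu.h>
        #include <linux/jiffies.h>

        static DEFINE_PER_CPU(struct delayed_work, vmstat_work_sketch);

        static void vmstat_update_sketch(struct work_struct *w)
        {
            /* refresh this cpu's counters here ... */

            /* ... then re-arm one second out on the same cpu. */
            schedule_delayed_work(&__get_cpu_var(vmstat_work_sketch), HZ);
        }

        static void start_vmstat_timer(int cpu)
        {
            struct delayed_work *work = &per_cpu(vmstat_work_sketch, cpu);

            INIT_DELAYED_WORK(work, vmstat_update_sketch);
            /* Stagger cpus by one tick each so they do not all fire together. */
            schedule_delayed_work_on(cpu, work, HZ + cpu);
        }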
     
  • Make the microcode driver use the suspend-related CPU hotplug notifications
    to handle the CPU hotplug events occurring during system-wide suspend and
    resume transitions. Remove the global variable suspend_cpu_hotplug
    previously used for this purpose.

    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Since nonboot CPUs are now disabled after tasks and devices have been
    frozen and the CPU hotplug infrastructure is used for this purpose, we need
    special CPU hotplug notifications that will help the CPU-hotplug-aware
    subsystems distinguish normal CPU hotplug events from CPU hotplug events
    related to a system-wide suspend or resume operation in progress. This
    patch introduces such notifications and causes them to be used during
    suspend and resume transitions. It also changes all of the
    CPU-hotplug-aware subsystems to take these notifications into consideration
    (for now they are handled in the same way as the corresponding "normal"
    ones).

    [oleg@tv-sign.ru: cleanups]
    Signed-off-by: Rafael J. Wysocki
    Cc: Gautham R Shenoy
    Cc: Pavel Machek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
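
    A kernel-style sketch of how a CPU-hotplug-aware subsystem handles the new
    suspend/resume variants alongside the normal events; the per-cpu setup and
    teardown bodies are placeholders.

        #include <linux/cpu.h>
        #include <linux/notifier.h>

        static int example_cpu_callback(struct notifier_block *nb,
                                        unsigned long action, void *hcpu)
        {
            switch (action) {
            case CPU_UP_PREPARE:
            case CPU_UP_PREPARE_FROZEN:     /* same handling on the resume path */
                /* allocate per-cpu state for cpu (long)hcpu */
                break;
            case CPU_DEAD:
            case CPU_DEAD_FROZEN:           /* same handling on the suspend path */
                /* free per-cpu state for cpu (long)hcpu */
                break;
            }
            return NOTIFY_OK;
        }

        static struct notifier_block example_cpu_notifier = {
            .notifier_call = example_cpu_callback,
        };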
     
  • Now that all the in-tree users are converted over to zero_user_page(),
    deprecate the old memclear_highpage_flush() call.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • Use zero_user_page() instead of open-coding it.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • Use zero_user_page() instead of open-coding it.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • Use zero_user_page() instead of open-coding it.

    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • It's very common for file systems to need to zero part or all of a page;
    the simplest way is just to use kmap_atomic() and memset(). There's
    actually a library function in include/linux/highmem.h that does exactly
    that, but it's confusingly named memclear_highpage_flush(), which is
    descriptive of *how* it does the work rather than what the *purpose* is.
    So this patchset renames the function to zero_user_page(), and calls it
    from the various places that currently open code it.

    This first patch introduces the new function call, and converts all the
    core kernel callsites, both the open-coded ones and the old
    memclear_highpage_flush() ones. Following this patch is a series of
    conversions for each file system individually, per AKPM, and finally a
    patch deprecating the old call. The diffstat below shows the entire
    patchset.

    [akpm@linux-foundation.org: fix a few things]
    Signed-off-by: Nate Diller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
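
    In essence the helper is the same kmap_atomic()/memset() sequence under a
    name that states the purpose. A sketch close to the include/linux/highmem.h
    definition of that era (the _sketch suffix marks it as an illustration):

        #include <linux/highmem.h>
        #include <linux/string.h>

        static inline void zero_user_page_sketch(struct page *page,
                                                 unsigned int offset,
                                                 unsigned int size,
                                                 enum km_type km)
        {
            void *kaddr = kmap_atomic(page, km);

            memset((char *)kaddr + offset, 0, size);
            flush_dcache_page(page);
            kunmap_atomic(kaddr, km);
        }

        /* Typical call site, replacing an open-coded map/memset/unmap:
         *     zero_user_page_sketch(page, 0, PAGE_CACHE_SIZE, KM_USER0);
         */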
     
  • Signed-off-by: Jarek Poplawski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jarek Poplawski
     
  • Analysis of the current Linux futex code:
    -----------------------------------------

    A central hash table futex_queues[] holds all contexts (futex_q) of waiting
    threads.

    Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
    perform lookups or insert/deletion of a futex_q.

    When a futex_wait() is done, the calling thread has to:

    1) Obtain a read lock on mmap_sem to be able to validate the user pointer
    (calling find_vma()). This validation tells us if the futex uses
    an inode based store (mapped file) or an mm based store (anonymous memory).

    2) Compute a hash key.

    3) Atomically increment a reference counter on an inode or a mm_struct.

    4) Lock part of the futex_queues[] hash table.

    5) Perform the test on the value of the futex
    (rollback if value != expected_value, returns EWOULDBLOCK)
    (various loops if the test triggers mm faults).

    6) Queue the context into the hash table, releasing the lock taken in 4).

    7) Release the read lock on mmap_sem.

    8) Eventually unqueue the context (but rarely, as this part may be done
    by futex_wake()).

    Futexes were designed to improve scalability, but the current implementation
    has various problems:

    - Central hashtable:

    This means scalability problems if many processes/threads want to use
    futexes at the same time.
    It also means NUMA imbalance, because this hashtable is located on one node.

    - Using mmap_sem on every futex() syscall:

    Even if mmap_sem is a rw_semaphore, up_read()/down_read() do atomic ops on
    mmap_sem, dirtying its cache line:
    - lots of cache line ping-pongs on SMP configurations.

    mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()).
    Highly threaded processes might suffer from mmap_sem contention.

    mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
    programs because of contention on the mmap_sem cache line.

    - Using atomic_inc()/atomic_dec() on the inode or mm ref counter:
    This is also a cache line ping-pong on SMP. It also increases mmap_sem hold
    time because of cache misses.

    Most of these scalability problems come from the fact that futexes are in
    one global namespace. As we use a central hash table, we must make sure
    they are all using the same reference (given by the mm subsystem). We
    chose to force all futexes to be 'shared'. This has a cost.

    But the fact is that POSIX defined PRIVATE and SHARED, allowing a clear
    separation, and optimal performance if carefully implemented. The time has
    come for Linux to have better threading performance.

    The goal is to permit new futex commands to avoid:
    - Taking the mmap_sem semaphore, conflicting with other subsystems.
    - Modifying a ref_count on an mm or an inode, still conflicting with mm or fs.

    This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
    futexes, we only need to distinguish futexes by their virtual address, no
    matter what the underlying mm storage is.

    If glibc wants to exploit this new infrastructure, it should use the new
    _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be
    prepared to fall back on the old subcommands for old kernels. Using one
    global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

    PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

    Compatibility with old applications is preserved, they still hit the
    scalability problems, but new applications can fly :)

    Note: the same SHARED futex (mapped on a file) can be used by old binaries
    *and* new binaries, because both will use the old subcommands.

    Note: the vast majority of futexes should use the PROCESS_PRIVATE semantic,
    as this is the default. Almost all applications should benefit from this
    change (new kernel and updated libc).

    Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

    /* calling futex_wait(addr, value) with value != *addr */
    433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
    424 cycles per futex(FUTEX_WAIT) call (using one futex)
    334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
    334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
    For reference :
    187 cycles per getppid() call
    188 cycles per umask() call
    181 cycles per ni_syscall() call

    Signed-off-by: Eric Dumazet
    Pierre Peiffer
    Cc: "Ulrich Drepper"
    Cc: "Nick Piggin"
    Cc: "Ingo Molnar"
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
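
    A small user-space sketch of issuing the new private subcommand directly.
    The fallback #define covers headers that predate the change, and as noted
    above an old kernel would reject the flag, so real code (glibc) would retry
    with the plain subcommand.

        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>

        #ifndef FUTEX_PRIVATE_FLAG
        #define FUTEX_PRIVATE_FLAG 128      /* value introduced by this change */
        #endif

        static long futex_wait(int *uaddr, int expected, int flags)
        {
            return syscall(SYS_futex, uaddr, FUTEX_WAIT | flags, expected,
                           NULL, NULL, 0);
        }

        int main(void)
        {
            int futex_word = 1;

            /* The expected value (0) differs from *uaddr (1), so this returns
             * at once with EWOULDBLOCK -- the operation timed above. */
            if (futex_wait(&futex_word, 0, FUTEX_PRIVATE_FLAG) == -1)
                perror("futex(FUTEX_WAIT|FUTEX_PRIVATE_FLAG)");
            return 0;
        }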
     
  • This patch provides the futex_requeue_pi functionality, which allows some
    threads waiting on a normal futex to be requeued on the wait-queue of a
    PI-futex.

    This provides an optimization, already used for (normal) futexes, to be used
    with the PI-futexes.

    This optimization is currently used by glibc in pthread_cond_broadcast(),
    when using "normal" mutexes. With futex_requeue_pi, it can be used with
    PRIO_INHERIT mutexes too.

    Signed-off-by: Pierre Peiffer
    Cc: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • This patch modifies futex_wait() to use an hrtimer + schedule() in place of
    schedule_timeout().

    schedule_timeout() is tick based, therefore the timeout granularity is the
    tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution timer
    for timeout wakeup, we can attain a much finer timeout granularity (in the
    microsecond range). This parallels what is already done for futex_lock_pi().

    The timeout passed to the syscall is no longer converted to jiffies; it is
    passed to do_futex() and futex_wait() as an absolute ktime_t, keeping
    nanosecond resolution.

    Also this removes the need to pass the nanoseconds timeout part to
    futex_lock_pi() in val2.

    In futex_wait(), if there is no timeout then a regular schedule() is
    performed. Otherwise, an hrtimer is fired before schedule() is called.

    [akpm@linux-foundation.org: fix `make headers_check']
    Signed-off-by: Sebastien Dugue
    Signed-off-by: Pierre Peiffer
    Cc: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
     
  • Today, all threads waiting for a given futex are woken in FIFO order (first
    waiter woken first) instead of priority order.

    This patch makes use of plists (priority-ordered lists) instead of a simple
    list in futex_hash_bucket.

    All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
    woken last, in FIFO order (RT-threads are woken first, in priority order).

    Signed-off-by: Sebastien Dugue
    Signed-off-by: Pierre Peiffer
    Cc: Ingo Molnar
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pierre Peiffer
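
    A kernel-style sketch of the queueing idea: RT waiters are keyed by their
    priority while everyone else shares MAX_RT_PRIO, so non-RT waiters stay FIFO
    among themselves. The futex_q/hash-bucket shapes are simplified here; only
    the plist calls follow the real API.

        #include <linux/plist.h>
        #include <linux/sched.h>
        #include <linux/kernel.h>

        struct futex_q_sketch {
            struct plist_node list;
            /* ... key, task pointer, etc. ... */
        };

        static void queue_waiter(struct futex_q_sketch *q,
                                 struct plist_head *chain)
        {
            /* RT tasks sort by priority; all others collapse to MAX_RT_PRIO. */
            int prio = min(current->normal_prio, MAX_RT_PRIO);

            plist_node_init(&q->list, prio);
            plist_add(&q->list, chain);
        }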
     
  • Some smarty went and inflicted ktime_t as a typedef upon us, so we cannot
    forward declare it.

    Create a new `union ktime', map ktime_t onto that. Now we need to kill off
    this ktime_t thing.

    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
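
    The trick in miniature: a union tag can be forward-declared where a typedef
    name cannot, so mapping ktime_t onto a named union lets other headers say
    "union ktime;" without including ktime.h. Sketch of the 64-bit layout only.

        #include <linux/types.h>

        /* Elsewhere, a header can now get away with just: */
        union ktime;

        /* include/linux/ktime.h (64-bit case, simplified): */
        union ktime {
            s64 tv64;
        };
        typedef union ktime ktime_t;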
     
  • Stick an unlikely() around is_aio(): I assert that most IO is synchronous.

    Cc: Suparna Bhattacharya
    Cc: Ingo Molnar
    Cc: Benjamin LaHaise
    Cc: Zach Brown
    Cc: Ulrich Drepper
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When a lookup request arrives, nfsd uses information provided by userspace
    (mountd) to find the right filesystem.

    It then assumes that the same filehandle type as the incoming filehandle can
    be used to create an outgoing filehandle.

    However if mountd is buggy, or maybe just being creative, the filesystem may
    not support that filehandle type, and the kernel could oops, particularly if
    'ex_uuid' is NULL but a FSID_UUID* filehandle type is used.

    So add some proper checking that the fsid version/type from the incoming
    filehandle is actually supportable, and ignore that information if it isn't
    supportable.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • 1/ decode_sattr and decode_sattr3 never return NULL, so remove
    several checks for that. Ditto for xdr_decode_hyper.

    2/ replace some open-coded XDR_QUADLEN calculations with calls to the
    XDR_QUADLEN macro.

    3/ in decode_writeargs, simplify an 'if' to use a single calculation:
    .page_len is the length of that part of the packet that did
    not fit in the first page (the head), so the length of the data part
    is the remainder of the head, plus page_len.

    4/ other minor cleanups.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • kbuild directly interprets -y as objects to build into a module,
    no need to assign it to the old foo-objs variable.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig