31 Oct, 2005

40 commits

  • In the forthcoming task migration support, a key calculation will be
    mapping cpu and node numbers from the old set to the new set while
    preserving cpuset-relative offset.

    For example, if a task and its pages on nodes 8-11 are being migrated to
    nodes 24-27, then pages on node 9 (the 2nd node in the old set) should be
    moved to node 25 (the 2nd node in the new set.)

    As with other bitmap operations, the proper way to code this is to provide
    the underlying calculation in lib/bitmap.c, and then to provide the usual
    cpumask and nodemask wrappers.

    This patch provides that. These operations are termed 'remap' operations.
    Remapping both a single bit and a set of bits is supported.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
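
    The offset-preserving remap described in the entry above can be sketched
    in user space as follows. This is only an illustration on a single 64-bit
    word, not the code from lib/bitmap.c; the names remap_bit() and
    remap_mask() are hypothetical, and the choice to leave bits outside the
    old set unchanged is an assumption of this sketch.

```c
#include <stdint.h>

/* Count the set bits in a 64-bit mask. */
static int weight64(uint64_t mask)
{
    int n = 0;
    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Ordinal of 'bit' within 'mask': the number of set bits below it. */
static int bit_ordinal(uint64_t mask, int bit)
{
    return weight64(mask & ((UINT64_C(1) << bit) - 1));
}

/* Position of the set bit with ordinal 'ord' in 'mask', or -1 if none. */
static int ordinal_to_bit(uint64_t mask, int ord)
{
    for (int bit = 0; bit < 64; bit++) {
        if ((mask & (UINT64_C(1) << bit)) && ord-- == 0)
            return bit;
    }
    return -1;
}

/*
 * Map one bit from old_set to the bit with the same relative position in
 * new_set. Assumption of this sketch: a bit that is not in old_set, or any
 * bit when new_set is empty, maps to itself.
 */
static int remap_bit(int oldbit, uint64_t old_set, uint64_t new_set)
{
    int w = weight64(new_set);

    if (oldbit < 0 || oldbit >= 64 || w == 0 ||
        !(old_set & (UINT64_C(1) << oldbit)))
        return oldbit;
    return ordinal_to_bit(new_set, bit_ordinal(old_set, oldbit) % w);
}

/* Remap every set bit of 'src' with remap_bit(). */
static uint64_t remap_mask(uint64_t src, uint64_t old_set, uint64_t new_set)
{
    uint64_t dst = 0;

    for (int bit = 0; bit < 64; bit++) {
        if (src & (UINT64_C(1) << bit))
            dst |= UINT64_C(1) << remap_bit(bit, old_set, new_set);
    }
    return dst;
}
```

    With old_set covering nodes 8-11 and new_set covering nodes 24-27,
    remap_bit(9, old_set, new_set) yields 25, matching the example in the
    commit message.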
     
  • This patch keeps pdflush daemons on the same cpuset as their parent, the
    kthread daemon.

    Some large NUMA configurations put as much as they can of kernel threads
    and other classic Unix load in what's called a bootcpuset, keeping the rest
    of the system free for dedicated jobs.

    This effort is thwarted by pdflush, which dynamically destroys and
    recreates pdflush daemons depending on load.

    It's easy enough to force the originally created pdflush daemons into the
    bootcpuset at system boot time. But the pdflush threads created later were
    allowed to run freely across the system, due to the necessary line in their
    startup kthread():

    set_cpus_allowed(current, CPU_MASK_ALL);

    By simply coding pdflush to start its threads with the cpus_allowed
    restrictions of its cpuset (inherited from kthread, its parent) we can
    ensure that dynamically created pdflush threads are also kept in the
    bootcpuset.

    On systems w/o cpusets, or w/o a bootcpuset implementation, the following
    will have no effect, leaving pdflush to run on any CPU, as before.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Add support for renaming cpusets. Only allow simple rename of cpuset
    directories in place. Don't allow moving cpusets elsewhere in hierarchy or
    renaming the special cpuset files in each cpuset directory.

    The usefulness of this simple rename became apparent when developing task
    migration facilities. It allows building a second cpuset hierarchy using
    new names and containing new CPUs and Memory Nodes, moving tasks from the
    old to the new cpusets, removing the old cpusets, and then renaming the new
    cpusets to be just like the old names, so that any knowledge that the tasks
    had of their cpuset names will still be valid.

    Leaf node cpusets can be migrated to other CPUs or Memory Nodes by just
    updating their 'cpus' and 'mems' files, but because no cpuset can contain
    CPUs or Nodes not in its parent cpuset, one cannot do this in a cpuset
    hierarchy without first expanding all the non-leaf cpusets to contain the
    union of both the old and new CPUs and Nodes, which would obfuscate the
    one-to-one migration of a task from one cpuset to another required to
    correctly migrate the physical page frames currently allocated to that
    task.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Overhaul cpuset locking. Replace single semaphore with two semaphores.

    The suggestion to use two locks was made by Roman Zippel.

    Both locks are global. Code that wants to modify cpusets must first
    acquire the exclusive manage_sem, which allows them read-only access to
    cpusets, and holds off other would-be modifiers. Before making actual
    changes, the second semaphore, callback_sem must be acquired as well. Code
    that needs only to query cpusets must acquire callback_sem, which is also a
    global exclusive lock.

    The earlier problems with double tripping are avoided, because it is
    allowed for holders of manage_sem to nest the second callback_sem lock, and
    only callback_sem is needed by code called from within __alloc_pages(),
    where the double tripping had been possible.

    This is not quite the same as a normal read/write semaphore, because
    obtaining read-only access with intent to change must hold off other such
    attempts, while allowing read-only access w/o such intention. Changing
    cpusets involves several related checks and changes, which must be done
    while allowing read-only queries (to avoid the double trip), but while
    ensuring nothing changes (holding off other would-be modifiers.)

    This overhaul of cpuset locking also makes careful use of task_lock() to
    guard access to the task->cpuset pointer, closing a couple of race
    conditions noticed while reading this code (thanks, Roman). I've never
    seen these races fail in any use or test.

    See further the comments in the code.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
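
    The two-lock ordering described in the entry above might look roughly
    like this in user space. It is a sketch only: pthread mutexes stand in
    for the kernel semaphores, and toy_query_cpus(), toy_update_cpus() and
    the single toy_cpus_allowed word are hypothetical stand-ins for the real
    cpuset state and code paths.

```c
#include <pthread.h>

/* Stand-ins for the two global locks named in the commit message. */
static pthread_mutex_t manage_sem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t callback_sem = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state protected by the locks. */
static unsigned long toy_cpus_allowed = 0x1;

/* Query path: only callback_sem is needed. */
static unsigned long toy_query_cpus(void)
{
    unsigned long value;

    pthread_mutex_lock(&callback_sem);
    value = toy_cpus_allowed;
    pthread_mutex_unlock(&callback_sem);
    return value;
}

/*
 * Modify path: take manage_sem first. While only manage_sem is held, no
 * other modifier can run, but queries (including ones made from this
 * thread) still work because callback_sem is free. callback_sem is nested
 * inside manage_sem only around the actual change.
 */
static int toy_update_cpus(unsigned long new_mask)
{
    int ret = -1;

    pthread_mutex_lock(&manage_sem);

    /* Validation phase: the query path is safe to use here. */
    if (new_mask != 0 && new_mask != toy_query_cpus()) {
        pthread_mutex_lock(&callback_sem);
        toy_cpus_allowed = new_mask;
        pthread_mutex_unlock(&callback_sem);
        ret = 0;
    }

    pthread_mutex_unlock(&manage_sem);
    return ret;
}
```

    The point of the ordering is that the validation phase can call into
    code that takes callback_sem without tripping over a lock the modifier
    already holds.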
     
  • Remove a rather hackish depth counter on cpuset locking. The depth counter
    was avoiding a possible double trip on the global cpuset_sem semaphore. It
    worked, but now an improved version of cpuset locking is available, to come
    in the next patch, using two global semaphores.

    This patch reverses "cpuset semaphore depth check deadlock fix"

    The kernel still works, even after this patch, except for some rare and
    difficult to reproduce race conditions when aggressively creating and
    destroying cpusets marked with the notify_on_release option, on very large
    systems.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Remove one more useless line from cpuset_common_file_read().

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch fixes an incorrect error path in proc_get_inode(), taken when a
    reference to the module can't be obtained because it is being unloaded.
    When try_module_get() fails, this function puts de(!) and still returns an
    inode with a de on which no reference is held.

    There are still unresolved known bugs in proc yet to be fixed:
    - proc_dir_entry tree is managed without any serialization
    - create_proc_entry() doesn't set up de->owner at all,
    so setting it later manually is not atomic.
    - it looks like almost all modules do not care whether
    their de->owner is set...

    Signed-Off-By: Denis Lunev
    Signed-Off-By: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Remove last remains of NFS exportability support.

    The code is actually buggy (as reported by Akshat Aranya), since 'alias'
    will be leaked if it's non-null and alias->d_flags has DCACHE_DISCONNECTED.

    This is not an active bug, since there will never be any disconnected
    dentries. But it's better to get rid of the unnecessary complexity anyway.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • In the recent timer rework we lost the check for an add_timer() of an
    already-pending timer. That check was useful for networking, so put it back.

    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • If the requested I/O scheduler is already in place, elevator_switch simply
    leaves the queue alone, and returns. However, it forgets to call
    elevator_put, so

    'echo [current_sched] > /sys/block/[dev]/queue/scheduler'

    will leak a reference, causing the current_sched module to be permanently
    pinned in memory.

    Signed-off-by: Nate Diller
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
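
    The reference leak described in the entry above follows a common
    get/put pattern, sketched here with toy types. All names (toy_queue,
    toy_elevator_get() and so on) are hypothetical and do not reproduce the
    kernel's elevator code; the sketch only shows why the early return needs
    a matching put.

```c
#include <stddef.h>
#include <string.h>

/* Toy scheduler type with a reference count (stand-in for a module ref). */
struct toy_elevator_type {
    const char *name;
    int refcount;
};

static struct toy_elevator_type toy_noop = { "noop", 0 };
static struct toy_elevator_type toy_cfq = { "cfq", 0 };
static struct toy_elevator_type *toy_types[] = { &toy_noop, &toy_cfq };

/* A queue holds one reference on its current scheduler. */
struct toy_queue {
    struct toy_elevator_type *elevator;
};

/* Look up a scheduler by name and take a reference on it. */
static struct toy_elevator_type *toy_elevator_get(const char *name)
{
    for (size_t i = 0; i < sizeof(toy_types) / sizeof(toy_types[0]); i++) {
        if (strcmp(toy_types[i]->name, name) == 0) {
            toy_types[i]->refcount++;
            return toy_types[i];
        }
    }
    return NULL;
}

/* Drop a reference taken by toy_elevator_get(). */
static void toy_elevator_put(struct toy_elevator_type *e)
{
    if (e)
        e->refcount--;
}

static int toy_elevator_switch(struct toy_queue *q, const char *name)
{
    struct toy_elevator_type *e = toy_elevator_get(name);

    if (!e)
        return -1;
    if (q->elevator == e) {
        /* Already in place: drop the extra reference before returning. */
        toy_elevator_put(e);
        return 0;
    }
    toy_elevator_put(q->elevator);  /* release the old scheduler */
    q->elevator = e;                /* queue now owns the new reference */
    return 0;
}
```

    Without the put in the early-return branch, every repeated request for
    the current scheduler would raise its count by one and it could never
    drop back to zero.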
     
  • Typo fix: dots appearing after a newline in printk strings.

    Signed-off-by: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • Make sure we always return, as all syscalls should. Also move the common
    prototype to

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Make the pid argument a long as on every other architecture. Despite pid_t
    being a 32bit type even on 64bit parisc this is not an ABI change due to
    the parisc calling conventions. And even if it were, it wouldn't matter too
    much because 64bit userspace on parisc is in an embryonic stage.

    Acked-by: Matthew Wilcox
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • common_nsleep() reimplements schedule_timeout_interruptible() for no known
    reason.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • (akpm: I don't do typo patches, but one of these is in a printk string)

    Signed-off-by: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • Add a kconfig submenu to select the default I/O scheduler, in case
    anticipatory is not compiled in or another default is preferred. Also,
    since no-op is always available, we should use it whenever the selected
    default is not.

    Signed-off-by: Nate Diller
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nate Diller
     
  • The majority of the sys_tkill() and sys_tgkill() function code is
    duplicated between the two of them. This patch pulls the duplication out
    into a separate function -- do_tkill() -- and lets sys_tkill() and
    sys_tgkill() be simple wrappers around it. This should make it easier to
    maintain in light of future changes.

    Signed-off-by: Vadim Lobanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
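
    The refactoring described in the entry above has the usual "two thin
    wrappers, one shared helper" shape. The sketch below uses a toy task
    table rather than kernel structures; the names, the tgid == 0 "don't
    care" convention, and the error codes chosen are assumptions of this
    illustration, not a copy of the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Toy task record: thread id, thread group id, last signal delivered. */
struct toy_task {
    int pid;
    int tgid;
    int last_sig;
};

static struct toy_task toy_tasks[] = {
    { 100, 100, 0 },
    { 101, 100, 0 },
    { 200, 200, 0 },
};

static struct toy_task *toy_find_task(int pid)
{
    for (size_t i = 0; i < sizeof(toy_tasks) / sizeof(toy_tasks[0]); i++) {
        if (toy_tasks[i].pid == pid)
            return &toy_tasks[i];
    }
    return NULL;
}

/*
 * Shared helper. Assumption: tgid <= 0 means "do not check the thread
 * group", which lets the tkill-style wrapper reuse the same code.
 */
static int toy_do_tkill(int tgid, int pid, int sig)
{
    struct toy_task *p = toy_find_task(pid);

    if (!p || (tgid > 0 && p->tgid != tgid))
        return -ESRCH;
    p->last_sig = sig;
    return 0;
}

/* Wrapper that also checks the thread group. */
static int toy_tgkill(int tgid, int pid, int sig)
{
    if (pid <= 0 || tgid <= 0)
        return -EINVAL;
    return toy_do_tkill(tgid, pid, sig);
}

/* Wrapper that addresses a thread by pid alone. */
static int toy_tkill(int pid, int sig)
{
    if (pid <= 0)
        return -EINVAL;
    return toy_do_tkill(0, pid, sig);
}
```

    Each wrapper keeps only its own argument validation, so a later change
    to the lookup or delivery logic has to be made in one place.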
     
  • This lock is used in sigqueue_free(), but it is always equal to
    current->sighand->siglock, so we don't need to keep it in the struct
    sigqueue.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() calls del_timer_sync(->real_timer) under ->sighand->siglock.
    This can deadlock: it_real_fn sends a signal and needs this lock too.

    Also, delete unneeded ->real_timer.data assignment.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that RCU applied on 'struct file' seems stable, we can place f_rcuhead
    in a memory location that is no longer used at call_rcu(&f->f_rcuhead,
    file_free_rcu) time, to reduce the size of this critical kernel object.

    The trick I used is to move f_rcuhead and f_list into a union called f_u.

    The callers are changed so that f_rcuhead becomes f_u.fu_rcuhead and f_list
    becomes f_u.f_list

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
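
    The space saving from the union described in the entry above can be
    illustrated with toy structures. These are not the kernel's definitions:
    the toy_ types, the f_count member and the helper function are
    hypothetical, and the sketch only shows that two members which are never
    live at the same time can share storage.

```c
#include <stddef.h>

/* Toy equivalents of a list link and an RCU callback head. */
struct toy_list_head {
    struct toy_list_head *next, *prev;
};

struct toy_rcu_head {
    struct toy_rcu_head *next;
    void (*func)(struct toy_rcu_head *head);
};

/* Before: both members occupy their own storage. */
struct toy_file_before {
    struct toy_list_head f_list;
    struct toy_rcu_head f_rcuhead;
    long f_count;
};

/*
 * After: the list link is only needed while the object is alive, and the
 * RCU head only once it is being freed, so they can overlap in a union.
 */
struct toy_file_after {
    union {
        struct toy_list_head fu_list;
        struct toy_rcu_head fu_rcuhead;
    } f_u;
    long f_count;
};

/* Bytes saved per object by the union layout on the build platform. */
static size_t toy_bytes_saved(void)
{
    return sizeof(struct toy_file_before) - sizeof(struct toy_file_after);
}
```

    The exact number of bytes saved depends on the platform's pointer size
    and padding rules.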
     
  • Add explicit text about
    - where menuconfig '/' (search) searches for strings,
    - that substrings are allowed, and
    - that regular expressions are supported.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Cleanup trailing whitespace, blank lines, CodingStyle issues etc, for
    lib/idr.c

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • The first two hunks of the patch really belong in patch 1, but I missed
    them on the first pass and instead of redoing all 3 patches I stuck them in
    this one.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Removes a few pointless register keywords. register is merely a compiler
    hint that access to the variable should be optimized, but gcc (3.3.6 in my
    case) generates the exact same code with and without the keyword, and even
    if gcc did something different with register present I think it is doubtful
    we would want to optimize access to these variables - especially since this
    is generic library code and there are supposed to be optimized versions in
    asm/ for anything that really matters speed wise.

    (akpm: iirc, keyword register is a gcc no-op unless using -O0)

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Removes some blank lines, removes some trailing whitespace, adds spaces
    after commas and a few similar changes.

    Signed-off-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • The only call to ide_cdrom_capacity is in code protected by
    CONFIG_PROC_FS, so when that is not enabled, the compiler complains:

    drivers/ide/ide-cd.c:3259: warning: `ide_cdrom_capacity' defined but not used

    Here is a patch that fixes that. It provides some space savings for
    embedded systems that are not using procfs, as well:

    text data bss dec hex filename
    - 33540 6504 1032 41076 a074 drivers/ide/ide-cd.o
    + 33468 6480 1032 40980 a014 drivers/ide/ide-cd.o

    Signed-off-by: Amos Waterland
    Cc: Jens Axboe
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amos Waterland
     
  • lookup_flags() is only called from the non-create case, so it needn't check
    for O_CREAT|O_EXCL.

    Signed-off-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • task_struct is an internal structure to the kernel with a lot of good
    information, that is probably interesting in core dumps. However there is
    no way for user space to know what format that information is in, making it
    useless.

    I grepped the GDB 6.3 source code and NT_TASKSTRUCT while defined is not
    used anywhere else. So I would be surprised if anyone notices it is
    missing.

    In addition exporting kernel pointers to all the interesting kernel data
    structures sounds like the very definition of an information leak. I
    haven't a clue what someone with evil intentions could do with that
    information, but in any attack against the kernel it looks like this is the
    perfect tool for aiming that attack.

    So since NT_TASKSTRUCT is useless as currently defined and is potentially
    dangerous, let's just not export it.

    (akpm: Daniel Jacobowitz "would be amazed" if anything was
    using NT_TASKSTRUCT).

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Remove timer_list.magic and associated debugging code.

    I originally added this when a spinlock was added to timer_list - this meant
    that an all-zeroes timer became illegal and init_timer() was required.

    That spinlock isn't even there any more, although timer.base must now be
    initialised.

    I'll keep this debugging code in -mm.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This patch makes the workqueues use alloc_percpu instead of an array. The
    workqueues are placed on nodes local to each processor.

    The workqueue structure can grow to a significant size on a system with
    lots of processors if this patch is not applied. 64 bit architectures with
    all debugging features enabled and configured for 512 processors will not
    be able to boot without this patch.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Removed some more references to check_region().

    I checked these changes into the 'checkreg' branch of
    rsync://rsync.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git

    The only valid references remaining are in:
    drivers/scsi/advansys.c
    drivers/scsi/BusLogic.c
    drivers/cdrom/sbpcd.c
    sound/oss/pss.c

    Remove last vestiges of ide_check_region()
    drivers/char/specialix: trim trailing whitespace
    drivers/char/specialix: eliminate use of check_region()
    Remove outdated and unused references to check_region()
    [sound oss] remove check_region() usage from cs4232, wavfront
    [netdrvr eepro] trim trailing whitespace
    [netdrvr eepro] remove check_region() usage

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Garzik
     
  • Try to make the INIT_ENV_ARG_LIMIT help text more readable and
    understandable.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • The last patch from Jens Axboe for drivers/block/paride/pf.c introduced
    pf_end_request() which sets pf_req to NULL.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Norbert Kiesel
     
  • Fix bizarre 4-space coding style in the NTP code.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Create a macro shift_right() that avoids the numerous ugly conditionals in the
    NTP code that look like:

        if (a < 0)
                b = -(-a >> shift);
        else
                b = a >> shift;

    Replacing it with:

    b = shift_right(a, shift);

    This should have zero effect on the logic, however it should probably have
    a bit of testing just to be sure.

    Also replace open-coded min/max with the macros.

    Signed-off-by: John Stultz

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
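
    The conditional that shift_right() replaces in the entry above can be
    captured in a small helper. The sketch below uses a static inline
    function on long instead of a macro; that choice, and the fixed operand
    type, are assumptions of this illustration rather than the form used in
    the patch.

```c
/*
 * Shift a signed value right without ever shifting a negative number:
 * negative inputs are negated, shifted, and negated back. The result
 * rounds toward zero for both signs.
 *
 * Caveat: negating the most negative long overflows, so this sketch
 * assumes x > LONG_MIN and s smaller than the width of long.
 */
static inline long shift_right(long x, unsigned int s)
{
    return x < 0 ? -(-x >> s) : x >> s;
}
```

    For example, shift_right(-7, 1) gives -3, whereas a plain -7 >> 1 is
    implementation-defined in C and gives -4 on compilers that use an
    arithmetic shift.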
     
  • TIOCSTART and TIOCSTOP are defined in asm/ioctls.h and asm/termios.h by
    various architectures but not actually implemented anywhere but in the IRIX
    compatibility layer, so remove their COMPATIBLE_IOCTL from parisc, ppc64
    and sparc64.

    Move the TIOCSLTC COMPATIBLE_IOCTL to common code, guided by an ifdef to
    only show up on architectures that support it (same as the code handling it
    in tty_ioctl.c), as well as its brother TIOCGLTC that wasn't handled so
    far.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Enhance the kthread API by adding kthread_stop_sem, for use in stopping
    threads that spend their idle time waiting on a semaphore.

    Signed-off-by: Alan Stern
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Stern
     
  • We're trying to get rid of as many tasklist walks as possible, or at
    least move them to core code. This patch falls into the second
    category.

    Instead of walking the tasklist in cfq-iosched move that into
    elv_unregister. The added benefit is that with this change the as
    ioscheduler might be made unloadable more easily as well.

    The new code uses read_lock instead of read_lock_irq because the
    tasklist_lock only needs irq disabling for writers.

    Signed-off-by: Christoph Hellwig
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Every user of init_timer() also needs to initialize ->function and ->data
    fields. This patch adds a simple setup_timer() helper for that.

    The schedule_timeout() is patched as an example of usage.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
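
    The helper described in the entry above bundles three assignments that
    otherwise get repeated at every call site. The sketch below uses a toy
    timer structure; the toy_ names and the structure layout are
    hypothetical and only mirror the fields named in the commit message.

```c
/* Toy timer with the fields the commit message mentions. */
struct toy_timer {
    void (*function)(unsigned long);
    unsigned long data;
    unsigned long expires;
    int pending;
};

/* Basic initialisation: the timer starts out inactive. */
static inline void toy_init_timer(struct toy_timer *timer)
{
    timer->expires = 0;
    timer->pending = 0;
}

/* One call that sets the callback and its argument and initialises the timer. */
static inline void toy_setup_timer(struct toy_timer *timer,
                                   void (*function)(unsigned long),
                                   unsigned long data)
{
    timer->function = function;
    timer->data = data;
    toy_init_timer(timer);
}

/* Example callback that records the argument it was given. */
static unsigned long toy_last_data;

static void toy_record(unsigned long data)
{
    toy_last_data = data;
}
```

    A caller would then write toy_setup_timer(&t, toy_record, 42) instead of
    calling the init function and filling in the two fields by hand.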
     
  • Trivial, saves one 'if' branch in de_thread().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov