09 Jan, 2006

40 commits

  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    through which most kernel page allocations pass, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines took task_lock(), fetched the task's
    cpuset pointer, and checked for an out-of-date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Fix obscure, never seen in real life, cpuset fork race. The cpuset_fork()
    call in fork.c was setting up the correct task->cpuset pointer after the
    tasklist_lock was dropped, which briefly exposed the newly forked process with
    an unsafe (copied from parent without locks or usage counter increment) cpuset
    pointer.

    In theory, that exposed cpuset pointer could have been pointing at a cpuset
    that was already freed and removed, and in theory another task that had been
    sitting on the tasklist_lock waiting to scan the task list could have raced
    down the entire tasklist, found our new child at the far end, and dereferenced
    that bogus cpuset pointer.

    To fix, set up the correct cpuset pointer in the new child by calling
    cpuset_fork() before the new task is linked into the tasklist, and, along
    with that, add a fork-failure path that releases that cpuset reference if
    the fork fails after cpuset_fork() was called.

    Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid -
    the call to cpuset_exit() from a failed fork would not have PF_EXITING set.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Restructure code layout of the kernel/cpuset.c update_nodemask() routine,
    removing embedded returns and nested if's in favor of goto completion labels.
    This is being done in anticipation of adding more logic to this routine, which
    will favor the goto style structure.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Four trivial cpuset fixes: remove extra spaces, remove useless initializers,
    mark one __read_mostly.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Remove documentation for the cpuset 'marker_pid' feature that was in the
    patch "cpuset: change marker for relative numbering". That patch was
    previously pulled from *-mm at my (pj) request.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Document the additional cpuset features:
    notify_on_release
    marker_pid
    memory_pressure
    memory_pressure_enabled

    Rearrange and improve formatting of existing documentation for
    cpu_exclusive and mem_exclusive features.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
    at which the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.

    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect the level of memory pressure the job is causing.

    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, which may choose to terminate or reprioritize jobs that are
    trying to use more memory than allowed on the nodes assigned them, and with
    tightly coupled, long running, massively parallel scientific computing jobs
    that will dramatically fail to meet required performance goals if they
    start to use more memory than allowed to them.

    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure. It's up to the batch
    manager or other user code to decide what to do about it and take action.

    ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

    Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

    A per-cpuset simple digital filter (requires a spinlock and 3 words of data
    per-cpuset) is kept, and updated by any task attached to that cpuset, if it
    enters the synchronous (direct) page reclaim code.

    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.
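    The decaying-rate idea can be sketched as below. This is a toy model with
    invented names and floating-point arithmetic, not the kernel's fixed-point
    filter; only the 10-second half-life and the rate-times-1000 units come
    from the description above:

```python
import math

HALF_LIFE = 10.0  # seconds; the metric's stated half-life


class PressureMeter:
    """Toy model of a per-cpuset memory-pressure meter: a running,
    exponentially decaying average of the rate of direct-reclaim
    attempts, scaled by 1000 (names are illustrative)."""

    def __init__(self):
        self.value = 0.0   # reclaims/sec * 1000, decayed
        self.stamp = 0.0   # time of last update

    def _decay(self, now):
        self.value *= 0.5 ** ((now - self.stamp) / HALF_LIFE)
        self.stamp = now

    def record_reclaim(self, now):
        """Hook analogous to the one in __alloc_pages()' reclaim path."""
        self._decay(now)
        # Each event contributes total weight ln(2)/HALF_LIFE, so a steady
        # stream of R events/sec reads back as approximately R * 1000.
        self.value += 1000.0 * math.log(2) / HALF_LIFE

    def read(self, now):
        """A single read suffices: no tasklist scan, no accumulation."""
        self._decay(now)
        return int(self.value)
```

    A steady 5 reclaims/sec stream settles near a reading of 5000, and the
    reading halves every 10 seconds once the reclaims stop.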

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset(), or, if !CONFIG_CPUSET, returns (1).
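    The macro's semantics can be modelled as follows; the function name and
    set representation here are illustrative, not the kernel's:

```python
def current_mems_allowed_includes(nodes, mems_allowed, cpusets_enabled=True):
    """Model of the replacement macro: with cpusets configured it reduces
    to nodes_subset(nodes, mems_allowed); with !CONFIG_CPUSET it is the
    constant (1), i.e. always true."""
    if not cpusets_enabled:
        return True
    return nodes <= mems_allowed  # nodes_subset() on node sets
```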

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Fix the default behaviour for the remap operators in bitmap, cpumask and
    nodemask.

    As previously submitted, the pair of masks defined a map of the
    positions of the set bits in A to the corresponding bits in B. This is still
    true.

    The issue is how to map the other positions, corresponding to the unset (0)
    bits in A. As previously submitted, they were all mapped to the first set bit
    position in B, a constant map.

    When I tried to code per-vma mempolicy rebinding using these remap operators,
    I realized this was wrong.

    This patch changes the default to map all the unset bit positions in A to the
    same positions in B, the identity map.

    For example, if A has bits 4-7 set, and B has bits 9-12 set, then the map
    defined by the pair maps each bit position in the first 32 bits as
    follows:

    0 ==> 0
    ...
    3 ==> 3
    4 ==> 9
    ...
    7 ==> 12
    8 ==> 8
    9 ==> 9
    ...
    31 ==> 31

    This now corresponds to the typical behaviour desired when migrating pages and
    policies from one cpuset to another.

    The pages on nodes within the original cpuset, and the references in memory
    policies to nodes within the original cpuset, are migrated to the
    corresponding cpuset-relative nodes in the destination cpuset. Other pages
    and node references are left untouched.
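    The mapping rule can be modelled per bit position as below (illustrative
    Python; the real operators work on bitmaps, and the modulo wrap is an
    assumption for the case where B has fewer set bits than A):

```python
def remap_bit(oldbit, a, b):
    """Model of the remap default described above: positions set in A map,
    in order, to the positions set in B; every other position maps to
    itself (the identity), rather than to B's first set bit."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    if oldbit in a:
        return b_sorted[a_sorted.index(oldbit) % len(b_sorted)]
    return oldbit
```

    With A = {4..7} and B = {9..12} this reproduces the table above: 4 maps
    to 9, 7 maps to 12, and 0-3, 8, and everything from 13 up map to
    themselves.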

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Add a configurable replacement for the slab allocator.

    This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is
    disabled, the kernel falls back to using the 'SLOB' allocator.

    SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
    similar to the original Linux kmalloc allocator that SLAB replaced. Its
    code is significantly smaller and it is more memory efficient. But like
    all similar allocators, it scales poorly and suffers more from
    fragmentation than SLAB, so it's only appropriate for small systems.

    It's been tested extensively in the Linux-tiny tree. I've also
    stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
    recommended).
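    The first-fit free-list scheme such allocators use can be sketched as a
    toy model. The structure below is illustrative Python with invented
    names, not mm/slob.c's actual block encoding:

```python
class SlobModel:
    """Toy first-fit allocator over a flat arena, loosely illustrating
    the K&R-style scheme: a sorted free list of (offset, length) runs,
    first-fit allocation, and coalescing of adjacent runs on free."""

    def __init__(self, size):
        self.free = [(0, size)]   # free runs, sorted by offset
        self.allocated = {}       # offset -> length of live blocks

    def alloc(self, n):
        for i, (off, length) in enumerate(self.free):
            if length >= n:                      # first fit wins
                if length == n:
                    del self.free[i]
                else:
                    self.free[i] = (off + n, length - n)
                self.allocated[off] = n
                return off
        return None                              # arena exhausted

    def free_block(self, off):
        n = self.allocated.pop(off)
        self.free.append((off, n))
        self.free.sort()
        # Coalesce adjacent free runs to limit fragmentation.
        merged = []
        for o, length in self.free:
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((o, length))
        self.free = merged
```

    The model also shows why such allocators fragment: without coalescing,
    two adjacent freed 30-unit blocks could not satisfy a 50-unit request.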

    Here's a comparison for otherwise identical builds, showing SLOB saving
    nearly half a megabyte of RAM:

    $ size vmlinux*
       text    data     bss     dec    hex filename
    3336372  529360  190812 4056544 3de5e0 vmlinux-slab
    3323208  527948  190684 4041840 3dac70 vmlinux-slob

    $ size mm/{slab,slob}.o
       text    data     bss     dec    hex filename
      13221     752      48   14021   36c5 mm/slab.o
       1896      52       8    1956    7a4 mm/slob.o

    /proc/meminfo:
                        SLAB        SLOB     delta
    MemTotal:       27964 kB    27980 kB    +16 kB
    MemFree:        24596 kB    25092 kB   +496 kB
    Buffers:           36 kB       36 kB      0 kB
    Cached:          1188 kB     1188 kB      0 kB
    SwapCached:         0 kB        0 kB      0 kB
    Active:           608 kB      600 kB     -8 kB
    Inactive:         808 kB      812 kB     +4 kB
    HighTotal:          0 kB        0 kB      0 kB
    HighFree:           0 kB        0 kB      0 kB
    LowTotal:       27964 kB    27980 kB    +16 kB
    LowFree:        24596 kB    25092 kB   +496 kB
    SwapTotal:          0 kB        0 kB      0 kB
    SwapFree:           0 kB        0 kB      0 kB
    Dirty:              4 kB       12 kB     +8 kB
    Writeback:          0 kB        0 kB      0 kB
    Mapped:           560 kB      556 kB     -4 kB
    Slab:            1756 kB        0 kB  -1756 kB
    CommitLimit:    13980 kB    13988 kB     +8 kB
    Committed_AS:    4208 kB     4208 kB      0 kB
    PageTables:        28 kB       28 kB      0 kB
    VmallocTotal: 1007312 kB  1007312 kB      0 kB
    VmallocUsed:       48 kB       48 kB      0 kB
    VmallocChunk: 1007264 kB  1007264 kB      0 kB

    (this work has been sponsored in part by CELF)

    From: Ingo Molnar

    Fix 32-bitness bugs in mm/slob.c.

    Signed-off-by: Matt Mackall
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Add mm/util.c for functions common between SLAB and SLOB.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Make DEBUG_SLAB depend on SLAB.

    Signed-off-by: Ingo Molnar
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Shrink the height of a radix tree when it is partially truncated - at
    present we only shrink it on full truncation.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Correctly determine the tags to be cleared in radix_tree_delete() so we
    don't keep moving up the tree clearing tags that we don't need to. For
    example, if a tag is simply not set in the deleted item, nor anywhere up
    the tree, radix_tree_delete() would attempt to clear it up the entire
    height of the tree.

    Also, tag_set() was made conditional so as not to dirty too many cachelines
    high up in the radix tree. Instead, put this logic into
    radix_tree_tag_set().
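    The stop-early upward clearing can be sketched with a toy model. The
    structure below is illustrative Python; the real code walks the node's
    per-tag bitmaps rather than Python sets:

```python
class Node:
    """Minimal model of a radix-tree node: a parent link, this node's
    slot index within its parent, and per-tag sets of tagged slots."""

    def __init__(self, parent=None, index=0):
        self.parent = parent
        self.index = index
        self.tags = {}  # tag -> set of child slot indices with tag set


def delete_clear_tag(node, slot, tag):
    """Model of the fixed delete path: clear `tag` for the deleted slot
    and walk toward the root, stopping as soon as the tag is not set for
    this slot, or a sibling slot in the same node still carries it, so
    we don't dirty cachelines all the way up needlessly."""
    while node is not None:
        tagged = node.tags.get(tag, set())
        if slot not in tagged:
            return        # tag was never set here; nothing above to clear
        tagged.discard(slot)
        if tagged:
            return        # a sibling still has the tag; ancestors stay set
        slot, node = node.index, node.parent
```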

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Introduce helper any_tag_set() rather than repeat the same code sequence 4
    times.
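    A minimal model of the helper, assuming an illustrative per-slot
    representation rather than the node's real tag bitmaps:

```python
def any_tag_set(node_tags, tag):
    """Model of the helper: report whether any slot in a node has `tag`
    set, replacing four copies of the same open-coded scan."""
    return any(tag in slot_tags for slot_tags in node_tags)
```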

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The latest set of signal-RCU patches does not use get_task_struct_rcu().
    Attached is a patch that removes it.

    Signed-off-by: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • Some simplification in checking signal delivery against concurrent exit.
    Instead of using get_task_struct_rcu(), which increments the task_struct
    reference count, check the reference count after acquiring sighand lock.

    Signed-off-by: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
    instead of tasklist_lock read-locked. This is a scalability improvement on
    SMP and a preemption-latency improvement under PREEMPT_RCU.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar
    Acked-by: William Irwin
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Applying RCU to the task structure broke oprofile, because
    free_task_notify() can now be called from softirq. This means that the
    task_mortuary lock must be acquired with irq disabled in order to avoid
    intermittent self-deadlock. Since irq is now disabled, the critical
    section within process_task_mortuary() has been restructured to be O(1) in
    order to maximize scalability and minimize realtime latency degradation.

    Kudos to Wu Fengguang for finding this problem!

    CC: Wu Fengguang
    Cc: Philippe Elie
    Cc: John Levon
    Signed-off-by: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • If MODE_TT=n, MODE_SKAS must be y.

    Signed-off-by: Adrian Bunk
    Acked-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This fixes some mangled whitespace added by the earlier trap_user.c patch.

    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • Most of the architectures have the same asm/futex.h. This consolidates them
    into asm-generic, with the arches including it from their own asm/futex.h.

    In the case of UML, this reverts the old broken futex.h and goes back to using
    the same one as almost everyone else.

    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
     
  • Part of the serial UML OS-abstraction layer patch series (um/kernel dir).

    This joins the trap_user.c and trap_kernel.c files.

    Signed-off-by: Gennady Sharapov
    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gennady Sharapov
     
  • Part of the serial UML OS-abstraction layer patch series (um/kernel dir).

    This moves all system calls from the trap_user.c file under the os-Linux
    dir.

    Signed-off-by: Gennady Sharapov
    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gennady Sharapov
     
  • Part of the serial UML OS-abstraction layer patch series (um/kernel dir).

    This moves all system calls from the signal_user.c file under the os-Linux
    dir.

    Signed-off-by: Gennady Sharapov
    Signed-off-by: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gennady Sharapov
     
  • The ds1620 module uses the gpio_read symbol, so it currently works only if
    built in; the symbol needs to be exported from the kernel image.

    Signed-off-by: Woody Suwalski
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Woody Suwalski
     
  • Kill L1_CACHE_SHIFT_MAX from all arches, since it is no longer used after
    the introduction of INTERNODE_CACHE_SHIFT.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • ____cacheline_maxaligned_in_smp is currently used to align critical
    structures and avoid false sharing. It uses the per-arch L1_CACHE_SHIFT_MAX,
    and people find L1_CACHE_SHIFT_MAX useless.

    However, we have been using ____cacheline_maxaligned_in_smp to align
    structures on the internode cacheline size. As per Andi's suggestion,
    the following patch kills ____cacheline_maxaligned_in_smp and introduces
    INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
    Arches needing L3/internode cacheline alignment can define
    INTERNODE_CACHE_SHIFT in their arch asm/cache.h. The patch replaces
    ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp.

    With this patch, L1_CACHE_SHIFT_MAX can be killed.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Fix an uninitialised variable warning in the serverworks driver.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix an uninitialised variable warning in the atm nicstar driver.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix a number of miscellaneous items:

    (1) Declare lock sections in the linker script.

    (2) Recurse in the correct manner in the arch makefile.

    (3) asm/bug.h requires asm/linkage.h to be included first, but one C file
    includes asm/bug.h first.

    (4) Add an empty RTC header file to avoid missing header file errors.

    (5) sg_dma_address() should use the dma_address member of a scatter list.

    (6) Add trivial pci_unmap support.

    (7) Add pgprot_noncached()

    (8) Discard u_quad_t.

    (9) Use ~0UL rather than ULONG_MAX in unistd.h in case the latter isn't
    declared.

    (10) Add an empty VGA header file to avoid missing header file errors.

    (11) Add an XOR header file to use the generic XOR stuff.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Make the get_user macro cast the source pointer to an appropriate type for the
    specified size.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Force the 8250 serial driver to be built in if the on-CPU UARTs are to be
    used. It can't be used as a module because the arch setup code needs to
    call into it.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix PCMCIA configuration for FRV by including the stock PCMCIA configuration
    description file.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Implement pci_iomap() for FRV.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Add stubs for FRV module support.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Supply various I/O access primitives that are missing for the FRV arch:

    (*) mmiowb()

    (*) read*_relaxed()

    (*) ioport_*map()

    (*) ioread*(), iowrite*(), ioread*_rep() and iowrite*_rep()

    (*) pci_io*map()

    (*) check_signature()

    The patch also makes __is_PCI_addr() more efficient.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix the exception table handling so that module exceptions are dealt with.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Export a number of symbols required to build all the modules. Also
    implement the following simple features:

    (*) csum_partial_copy_from_user() for MMU as well as no-MMU.

    (*) __ucmpdi2().

    so that they can be exported too.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Drop support for debugging features that aren't supported on FRV:

    (*) EARLY_PRINTK

    The on-chip UARTs are set up early enough that this isn't required,
    and VGA support isn't available. There's also a gdbstub available.

    (*) DEBUG_PAGEALLOC

    This can't be easily be done since we use huge static mappings to
    cover the kernel, not pages.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells