24 Mar, 2006

40 commits

  • This patch provides the implementation and cpuset interface for an alternative
    memory allocation policy that can be applied to certain kinds of memory
    allocations, such as the page cache (file system buffers) and some slab caches
    (such as inode caches).

    The policy is called "memory spreading." If enabled, it spreads out these
    kinds of memory allocations over all the nodes allowed to a task, instead of
    preferring to place them on the node where the task is executing.

    All other kinds of allocations, including anonymous pages for a task's stack
    and data regions, are not affected by this policy choice, and continue to be
    allocated preferring the node local to execution, as modified by the NUMA
    mempolicy.

    There are two boolean flag files per cpuset that control where the kernel
    allocates pages for the file system buffers and related in-kernel data
    structures. They are called 'memory_spread_page' and 'memory_spread_slab'.

    If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
    kernel will spread the file system buffers (page cache) evenly over all the
    nodes that the faulting task is allowed to use, instead of preferring to put
    those pages on the node where the task is running.

    If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
    kernel will spread some file system related slab caches, such as those for
    inodes and dentries, evenly over all the nodes that the faulting task is
    allowed to use, instead of preferring to put those pages on the node where
    the task is running.

    The implementation is simple. Setting the cpuset flags 'memory_spread_page'
    or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
    PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
    subsequently joins that cpuset. In subsequent patches, the page allocation
    calls for the affected page cache and slab caches are modified to perform an
    inline check for these flags, and if set, a call to a new routine
    cpuset_mem_spread_node() returns the node to prefer for the allocation.

    The cpuset_mem_spread_node() routine is also simple. It uses the value of a
    per-task rotor cpuset_mem_spread_rotor to select the next node in the current
    task's mems_allowed to prefer for the allocation.
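
    The rotor idea can be sketched in user-space C. This is only an
    illustration, not the kernel routine: the struct, its field names, the
    64-node limit and the function name are all hypothetical stand-ins for the
    task structure, mems_allowed and cpuset_mem_spread_rotor.

```c
#include <stdint.h>

/* Hypothetical stand-in for a task: a 64-bit mask of allowed nodes
 * (playing the role of mems_allowed) and a per-task rotor. */
struct task_sketch {
    uint64_t mems_allowed;  /* bit n set => node n may be used */
    int spread_rotor;       /* last node handed out, -1 before first use */
};

/* Return the next allowed node after the rotor, wrapping around.
 * Returns -1 if no node is allowed at all. */
int mem_spread_node_sketch(struct task_sketch *t)
{
    if (t->mems_allowed == 0)
        return -1;

    for (int step = 1; step <= 64; step++) {
        int node = (t->spread_rotor + step) % 64;

        if (t->mems_allowed & (UINT64_C(1) << node)) {
            t->spread_rotor = node;  /* remember where we stopped */
            return node;
        }
    }
    return -1;  /* not reached when mems_allowed is non-zero */
}
```

    With nodes 0, 2 and 4 allowed, successive calls in this sketch yield 0, 2,
    4 and then 0 again, so consecutive allocations rotate over the allowed
    nodes.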

    This policy can provide substantial improvements for jobs that need to place
    thread local data on the corresponding node, but that also access large
    file system data sets which must be spread across the several nodes in the
    job's cpuset in order to fit. Without this patch, especially for jobs that
    might have one thread reading in the data set, the memory allocation across
    the nodes in the job's cpuset can become very uneven.

    A couple of Copyright year ranges are updated as well. And a couple of email
    addresses that can be found in the MAINTAINERS file are removed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Replace pairs of calls to atomic_inc() and atomic_read() with a single call
    to atomic_inc_return(), saving a few bytes of source and kernel text.
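
    A user-space analogue of folding an increment and a read-back into one
    operation, written with C11 <stdatomic.h> rather than the kernel's atomic_t
    API. The function names are hypothetical and chosen for illustration only.

```c
#include <stdatomic.h>

/* Two-step form: increment, then read the counter back. Another thread
 * may change the counter between the two calls. */
int bump_two_calls(atomic_int *v)
{
    atomic_fetch_add(v, 1);
    return atomic_load(v);
}

/* Combined form: one atomic read-modify-write that yields the new value,
 * which is the role atomic_inc_return() plays in the kernel. */
int bump_one_call(atomic_int *v)
{
    return atomic_fetch_add(v, 1) + 1;
}
```

    Besides being shorter, the combined form returns the value produced by
    this caller's own increment, which the two-step form cannot promise when
    other threads update the counter concurrently.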

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Since the test_bit() bit operator is boolean (returns 0 or 1), the double-not
    "!!" operations that convert a scalar (zero or non-zero) to a boolean are
    not needed.
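
    The distinction can be illustrated in user-space C. The helper below is a
    hypothetical sketch of a bit test that already yields 0 or 1; it is not
    the kernel's test_bit(), whose implementation is architecture specific.

```c
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Sketch of a boolean bit test: the result is exactly 0 or 1,
 * so wrapping it in "!!" would change nothing. */
int test_bit_sketch(unsigned int nr, const unsigned long *addr)
{
    return (int)((addr[nr / BITS_PER_LONG_SKETCH]
                  >> (nr % BITS_PER_LONG_SKETCH)) & 1UL);
}

/* A raw mask test yields zero or the mask value itself, so here
 * "!!" is what turns the scalar into 0 or 1. */
int raw_mask_as_bool(unsigned long flags, unsigned long mask)
{
    return !!(flags & mask);
}
```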

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • If we get under some memory pressure in a cpuset (we only scan zones that
    are in the cpuset for memory) then kswapd is currently woken up for all
    zones. This patch wakes up kswapd only in zones that are part of the
    current cpuset.

    Signed-off-by: Christoph Lameter
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • A long-running rcutorture test can overflow dmesg, so that the line
    containing the module parameters is lost. Although it is usually possible
    to retrieve this information from the log files, it is much better to just
    tag it onto the final success/failure line so that it may be easily found.
    This patch does just that.

    Signed-off-by: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • include/linux/platform.h contained nothing that was actually used except
    the default_idle() prototype, and is therefore removed by this patch.

    This patch does the following with the platform specific default_idle()
    functions on different architectures:
    - remove the unused function:
      - parisc
      - sparc64
    - make the needlessly global function static:
      - arm
      - h8300
      - m68k
      - m68knommu
      - s390
      - v850
      - x86_64
    - add a prototype in asm/system.h:
      - cris
      - i386
      - ia64

    Signed-off-by: Adrian Bunk
    Acked-by: Patrick Mochel
    Acked-by: Kyle McMartin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • With internal Xen-enabled kernels we see the kernel's static per-cpu data
    area exceed the limit of 32k on x86-64, and even native x86-64 kernels get
    fairly close to that limit. I generally question whether it is reasonable
    to have data structures several kb in size allocated as per-cpu data when
    the space there is rather limited.

    The biggest arch-independent consumer is tvec_bases (over 4k on 32-bit
    archs, over 8k on 64-bit ones), which now gets converted to use dynamically
    allocated memory instead.

    Signed-off-by: Jan Beulich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • Introduce a file fs/coda/coda_int.h with proper prototypes for some code.

    Signed-off-by: Adrian Bunk
    Acked-by: Jan Harkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Add a proper prototype for ext2_get_parent().

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • - mux.c: v9fs_poll_mux() was inline but not static, resulting in needless
      object size bloat
    - mux.c: remove all "inline"s: gcc should know best what to inline
    - #if 0 the following unused global functions:
      - 9p.c: v9fs_v9fs_t_flush()
      - conv.c: v9fs_create_tauth()
      - mux.c: v9fs_mux_rpcnb()

    Signed-off-by: Adrian Bunk
    Cc: Eric Van Hensbergen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • __rcu_process_callbacks() disables interrupts to protect itself from
    call_rcu() which adds new entries to ->nxtlist.

    However, we can check "->nxtlist != NULL" with interrupts enabled. We can't
    get "false positives" because call_rcu() can only change this condition
    from 0 to 1.

    Tested with rcutorture.ko.

    Signed-off-by: Oleg Nesterov
    Acked-by: Dipankar Sarma
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When (integer) sysctl values are in either seconds or centiseconds, but
    represented internally as jiffies, the allowable value range is decreased.
    This patch adds range checks to the conversion routines.

    For values in seconds: maximum LONG_MAX / HZ.

    For values in centiseconds: maximum (LONG_MAX / HZ) * USER_HZ.

    (BTW, does anyone else feel that an interface in seconds should not be
    accepting negative values?)
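
    The bound for values in seconds could be checked along the following
    lines. This is a user-space sketch, not the patched conversion routine:
    the function name and the HZ value of 250 are assumptions, and the lower
    bound check is added here only to keep the multiplication well defined.

```c
#include <limits.h>

#define HZ_SKETCH 250L  /* assumed tick rate, for illustration only */

/* Convert seconds to jiffies. Returns 0 on success, or -1 when the
 * result would not fit in a long. */
int seconds_to_jiffies_checked(long seconds, long *jiffies_out)
{
    if (seconds > LONG_MAX / HZ_SKETCH)
        return -1;  /* above the maximum LONG_MAX / HZ */
    if (seconds < LONG_MIN / HZ_SKETCH)
        return -1;  /* sketch-only guard against negative overflow */

    *jiffies_out = seconds * HZ_SKETCH;
    return 0;
}
```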

    Signed-off-by: Bart Samwel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bart Samwel
     
  • Store the internal value for /proc/sys/vm/laptop_mode as jiffies instead of
    seconds. Let the sysctl interface do the conversions, instead of doing
    on-the-fly conversions every time the value is used.

    Add a description of the fact that laptop_mode doubles as a flag and a
    timeout to the comment above the laptop_mode variable.

    Signed-off-by: Bart Samwel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bart Samwel
     
  • Store the internal values for:

    /proc/sys/vm/dirty_writeback_centisecs
    /proc/sys/vm/dirty_expire_centisecs

    as jiffies instead of centiseconds. Let the sysctl interface do the
    conversions with full precision using clock_t_to_jiffies, instead of
    doing overflow-sensitive on-the-fly conversions every time the values are
    used.

    Cons: apparent precision loss if HZ is not a multiple of 100, because of
    conversion back and forth. This is a common problem for all sysctl values
    that use proc_dointvec_userhz_jiffies. (There is only one other in-tree
    use, in net/core/neighbour.c.)
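
    The apparent precision loss can be seen with a small round-trip
    calculation. The sketch below uses plain truncating integer arithmetic,
    which is an assumption: the kernel helpers may round differently. The
    function name and the USER_HZ value of 100 are illustrative, and overflow
    is ignored for the small values used here.

```c
#define USER_HZ_SKETCH 100L  /* assumed user-visible tick rate */

/* Convert centiseconds to jiffies and back again using truncating
 * integer division, returning the value a reader would see afterwards. */
long roundtrip_centisecs(long centisecs, long hz)
{
    long jiffies = centisecs * hz / USER_HZ_SKETCH;

    return jiffies * USER_HZ_SKETCH / hz;
}
```

    In this sketch a value of 499 survives the round trip when hz is 1000,
    but comes back as 498 when hz is 1024, which is not a multiple of 100.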

    Signed-off-by: Bart Samwel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bart Samwel
     
  • Reduce lock hold times in free_uid().

    Cc: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Restructure the bitmap_*_region() operations, to avoid code duplication.

    Also reduces binary text size by about 100 bytes (ia64 arch). The original
    Bottomley bitmap_*_region patch added about 1000 bytes of compiled kernel text
    (ia64). The Mundt multiword extension added another 600 bytes, and this
    restructuring patch gets back about 100 bytes.

    But the real motivation was the reduced amount of duplicated code.

    Tested by Paul Mundt using
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Add support to the lib/bitmap.c bitmap_*_region() routines for bitmap
    regions larger than one word (nbits > BITS_PER_LONG). This removes a
    BUG_ON() in lib bitmap.
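
    What a region larger than one word means can be sketched in user-space C.
    The code below is a deliberately simple bit-at-a-time illustration with
    hypothetical names, not the lib/bitmap.c implementation: it finds a free,
    naturally aligned run of (1 << order) bits, which may span several words,
    and marks it as allocated.

```c
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

static int bit_is_set(const unsigned long *map, size_t pos)
{
    return (int)((map[pos / WORD_BITS] >> (pos % WORD_BITS)) & 1UL);
}

static void set_bit_at(unsigned long *map, size_t pos)
{
    map[pos / WORD_BITS] |= 1UL << (pos % WORD_BITS);
}

/* Find a free, naturally aligned run of (1 << order) bits in a bitmap of
 * nbits bits, mark it allocated and return its first bit, or -1 if no
 * such run exists. The run may cross word boundaries. */
long find_free_region_sketch(unsigned long *map, size_t nbits,
                             unsigned int order)
{
    size_t len = (size_t)1 << order;

    for (size_t start = 0; start + len <= nbits; start += len) {
        size_t i;

        for (i = 0; i < len; i++)
            if (bit_is_set(map, start + i))
                break;
        if (i < len)
            continue;  /* region partly in use, try the next one */

        for (i = 0; i < len; i++)
            set_bit_at(map, start + i);
        return (long)start;
    }
    return -1;
}
```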

    I have an updated store queue API for SH that is currently using this with
    relative success, and at first glance, it seems like this could be useful for
    x86 (arch/i386/kernel/pci-dma.c) as well. Particularly for anything using
    dma_declare_coherent_memory() on large areas and that attempts to allocate
    large buffers from that space.

    Paul Jackson also did some cleanup to this patch.

    Signed-off-by: Paul Mundt
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • Paul Mundt says:

    This patch set implements a number of patches to clean up and restructure the
    bitmap region code, in addition to extending the interface to support
    multiword spanning allocations.

    The current implementation (before this patch set) is limited to
    allocating at most BITS_PER_LONG pages, a limit enforced by a BUG_ON() in
    lib/bitmap.c.

    As I seem to have been the first person to trigger this, the result ends up
    being the following patch set with the help of Paul Jackson.

    The final patch in the series eliminates quite a bit of code duplication, so
    the bitmap code size ends up being smaller than the current implementation as
    an added bonus.

    After these are applied, it should already be possible to do multiword
    allocations with dma_alloc_coherent() out of ranges established by
    dma_declare_coherent_memory() on x86 without having to change any of the code,
    and the SH store queue API will follow up on this as the other user that needs
    support for this.

    This patch:

    Some code cleanup on the lib/bitmap.c bitmap_*_region() routines:

    * spacing
    * variable names
    * comments

    Has no change to code function.

    Signed-off-by: Paul Mundt
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • This patch removes the documentation of the ISA legacy functions.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • unused isa_...() helpers removed.

    Adrian Bunk:
    The asm-sh part was rediffed due to unrelated changes.

    Signed-off-by: Al Viro
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • switch to ioremap()

    Signed-off-by: Al Viro
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • switch to ioremap()

    Adrian Bunk:
    The order of the hunks in the patch was slightly rearranged due to an
    unrelated change in the driver.

    Signed-off-by: Al Viro
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • switched to ioremap(), cleaned the probing up a bit.

    Signed-off-by: Al Viro
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • switched CONFIG_SCSI_G_NCR5380_MEM code in g_NCR5380 to ioremap(); massaged
    g_NCR5380.h accordingly.

    Signed-off-by: Al Viro
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • switch from isa_read...() to ioremap() and read...()

    Signed-off-by: Al Viro
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Use ARRAY_SIZE macro instead of sizeof(x)/sizeof(x[0]) and remove a
    duplicate of ARRAY_SIZE. Some trailing whitespaces are also deleted.
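
    For reference, the idiom being replaced looks like this in a user-space
    sketch. The macro is given a distinct, hypothetical name so as not to
    suggest it is the kernel's own definition, and the array contents are
    made up for illustration.

```c
#include <stddef.h>

/* Element count of a true array (not a pointer): total size divided
 * by the size of one element. */
#define ARRAY_SIZE_SKETCH(x) (sizeof(x) / sizeof((x)[0]))

static const char *const fs_names[] = { "ext2", "coda", "9p" };

/* Open-coded form that the patch replaces. */
size_t count_fs_names_open_coded(void)
{
    return sizeof(fs_names) / sizeof(fs_names[0]);
}

/* Same result, with the intent spelled out by the macro name. */
size_t count_fs_names(void)
{
    return ARRAY_SIZE_SKETCH(fs_names);
}
```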

    Signed-off-by: Tobias Klauser
    Cc: David Howells
    Cc: Dave Kleikamp
    Acked-by: Trond Myklebust
    Cc: Neil Brown
    Cc: Chris Mason
    Cc: Jeff Mahoney
    Cc: Christoph Hellwig
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     
  • Cast the argument correctly.

    Cc: Christoph Hellwig
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bastian Blank
     
  • Convert all kmalloc + memset sequences in drivers/s390 to kzalloc usage.
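
    The shape of this conversion can be shown with a user-space analogue, in
    which malloc() followed by memset() stands in for kmalloc() + memset() and
    calloc() stands in for kzalloc(). The struct and function names are
    hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure, used only for illustration. */
struct dev_info_sketch {
    int devno;
    int flags;
    char name[16];
};

/* Before: allocate, then clear in a separate step. */
struct dev_info_sketch *alloc_two_step(void)
{
    struct dev_info_sketch *p = malloc(sizeof(*p));

    if (!p)
        return NULL;
    memset(p, 0, sizeof(*p));
    return p;
}

/* After: a single call that returns zeroed memory. */
struct dev_info_sketch *alloc_one_step(void)
{
    return calloc(1, sizeof(struct dev_info_sketch));
}
```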

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sesterhenn
     
  • Convert all kmalloc + memset sequences in arch/s390 to kzalloc usage.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sesterhenn
     
  • Undetected edge case for CRT messages to CEX2A caused length to be too short,
    thus truncating the message. The solution was to check a different variable
    which actually determines which key type is being used.

    Increment version number in z90main.c to correct level of 1.3.3, fix copyright
    year and add comment about bitlength limit of CEX2A.

    Signed-off-by: Eric Rossman
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Rossman
     
  • Michael Holzheu,
    Martin Schwidefsky

    Signed-off-by: Stefan Bader
    Signed-off-by: Michael Holzheu
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Bader
     
  • If a tape device is assigned to another host, the interrupt for the assign
    operation comes back with deferred condition code 1. Under some conditions
    this can lead to an endless loop of retries. Check if the current request is
    still in IO in deferred condition code handling and prevent retries when the
    request has already been cancelled.

    Signed-off-by: Michael Holzheu
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • When a request is aborted because of a signal, we currently stop the request
    via csh, but we do not wait for the interrupt of csh in any case. We free the
    request structure and therefore when the interrupt for the csh operation is
    presented, the request object is no longer valid and an invalid callback
    pointer is used.

    To fix this, wait until the interrupt for csh arrives and until
    wait_event_interruptible() no longer returns -ERESTARTSYS.

    Signed-off-by: Michael Holzheu
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • If a deferred CC happens there will be lots of messages, because the retry is
    done immediately in the interrupt handler, which can be too fast. To avoid
    this, requeue the request and schedule the queue to be processed.

    Signed-off-by: Stefan Bader
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Bader
     
  • The DASD extended error reporting is a facility that provides detailed
    information about certain problems in the DASD I/O. This information can be
    used to implement fail-over applications that can recover from these
    problems.

    Signed-off-by: Stefan Weinhuber
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Weinhuber
     
  • Use kzalloc to get a zeroed buffer for the structure returned to user space by
    the BIODASDINFO2 ioctl. Not all fields are set up, e.g. the read_devno is
    missing.

    Signed-off-by: Horst Hummel
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Horst Hummel
     
  • The dasd diag discipline has been tested on 64 bit and is no longer
    experimental.

    Signed-off-by: Peter Oberparleiter
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Oberparleiter