07 Jan, 2006

22 commits

  • Comment the new locking rules for page_state statistics.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Optimise page_state manipulations by introducing interrupt-unsafe accessors
    to page_state fields. Callers must provide their own locking (either by
    disabling interrupts or by never updating from interrupt context).

    Switch over the hot callsites that can easily be moved under interrupts-off
    sections.
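
    The intended usage pattern looks roughly like the sketch below (accessor
    and counter names are the era's page_state ones, but treat the fragment as
    illustrative rather than the literal patch): several updates are batched
    under one interrupts-off section instead of paying an irq save/restore per
    counter.

        unsigned long flags;

        local_irq_save(flags);           /* caller provides the locking */
        __inc_page_state(pgactivate);    /* interrupt-unsafe, no irq fiddling */
        __inc_page_state(pgdeactivate);  /* batching amortises the irq cost */
        local_irq_restore(flags);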

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Several counters already need 64-bit atomic variables on 64-bit platforms
    (see mm_counter_t in sched.h), and we have to resort to ugly ifdefs to
    fall back to 32-bit atomics on 32-bit platforms.

    The VM statistics patch that I am working on will also make more extensive
    use of atomic64.

    This patch introduces a new type, atomic_long_t, by providing definitions
    in asm-generic/atomic.h that behave like the C "long" type: 32 bits on
    32-bit platforms and 64 bits on 64-bit platforms.

    Also cleans up the determination of the mm_counter_t in sched.h.
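
    An abridged sketch of how such a generic header can be built (the merged
    file covers many more operations; this fragment is illustrative):

        #if BITS_PER_LONG == 64

        typedef atomic64_t atomic_long_t;

        static inline long atomic_long_read(atomic_long_t *l)
        {
                return (long)atomic64_read((atomic64_t *)l);
        }

        static inline void atomic_long_inc(atomic_long_t *l)
        {
                atomic64_inc((atomic64_t *)l);
        }

        #else /* BITS_PER_LONG == 32 */

        typedef atomic_t atomic_long_t;

        static inline long atomic_long_read(atomic_long_t *l)
        {
                return (long)atomic_read((atomic_t *)l);
        }

        static inline void atomic_long_inc(atomic_long_t *l)
        {
                atomic_inc((atomic_t *)l);
        }

        #endif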

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently the function that builds a zonelist for a BIND policy has the
    side effect of setting policy_zone. This seems a bit strange: policy_zone
    does not appear to be initialized anywhere else and is therefore 0. Do we
    police ZONE_DMA if no bind policy has been used yet?

    This patch moves the determination of the zone to apply policies to into
    the page allocator. We determine the zone while building the zonelist for
    nodes.
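
    A minimal sketch of the idea, assuming a helper along these lines: as the
    page allocator builds each node's zonelist, it records the highest zone it
    encounters, and that is the zone policies are applied to.

        extern int policy_zone;

        /* called for each zone added to a node's zonelist */
        static inline void check_highest_zone(int k)
        {
                if (k > policy_zone)
                        policy_zone = k;
        }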

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There are numerous places we check whether a zone is populated or not.

    Provide a helper function to check for populated zones and convert all
    checks for zone->present_pages.
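
    The helper is essentially a one-line test on zone->present_pages; a
    sketch:

        /* a zone is populated if any physical pages are present in it */
        static inline int populated_zone(struct zone *zone)
        {
                return !!zone->present_pages;
        }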

    Signed-off-by: Con Kolivas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • Optimise rmap functions by minimising atomic operations when we know there
    will be no concurrent modifications.
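
    For example, a brand-new anonymous page is not yet visible to any other
    CPU, so its mapcount can be raised with a plain store instead of an atomic
    read-modify-write. A sketch of that idea (helper names assumed, not
    necessarily the literal patch):

        void page_add_new_anon_rmap(struct page *page,
                                    struct vm_area_struct *vma,
                                    unsigned long address)
        {
                /* no concurrent mappers: lift _mapcount from -1 to 0 directly */
                atomic_set(&page->_mapcount, 0);
                __page_set_anon_rmap(page, vma, address);
        }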

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add dma32 to zone statistics. Also attempt to arrange struct page_state a
    bit better (visually).

    Signed-off-by: Nick Piggin
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove the last bits of Martin's ill-fated sys_set_zone_reclaim().

    Cc: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This patch cleans up the alloc_bootmem fix for swiotlb. It removes the
    alloc_bootmem_*_limit API and fixes the alloc_bootmem_*low API to do the
    right thing -- allocate from low 32-bit memory.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • struct per_cpu_pages.low is useless. Remove it.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Before SPARSEMEM is initialised we cannot provide an efficient pfn_to_nid()
    implementation; before initialisation is complete we use early_pfn_to_nid()
    to provide location information. Until recently there was no non-init user
    of this functionality. Provide a post-init pfn_to_nid() implementation.

    Note that this implementation assumes that the pfn passed has been
    validated with pfn_valid(). The current single user of this function
    already performs this check.
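
    A sketch of what a post-init SPARSEMEM implementation can look like, given
    a pfn already validated with pfn_valid() (illustrative, not necessarily
    the literal patch):

        #define pfn_to_nid(pfn)                                 \
        ({                                                      \
                unsigned long __pfn_to_nid_pfn = (pfn);         \
                page_to_nid(pfn_to_page(__pfn_to_nid_pfn));     \
        })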

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • There are three places we define pfn_to_nid(). Two in linux/mmzone.h and one
    in asm/mmzone.h. These in essence represent the three memory models. The
    definition in linux/mmzone.h under !NEED_MULTIPLE_NODES is both the FLATMEM
    definition and the optimisation for single NUMA nodes; the one under SPARSEMEM
    is the NUMA sparsemem one; the one in asm/mmzone.h under DISCONTIGMEM is the
    discontigmem one. This is not in the least bit obvious, particularly the
    connection between the non-NUMA optimisations and the memory models.

    Two patches:

    flatmem-split-out-memory-model: simplifies the selection of pfn_to_nid()
    implementations. The selection is based primarily on the memory model
    selected. Optimisations for non-NUMA are applied where needed.

    sparse-provide-pfn_to_nid: implement pfn_to_nid() for SPARSEMEM

    This patch:

    pfn_to_nid is memory model specific

    The pfn_to_nid() call is memory model specific. It represents the locality
    identifier for the memory passed. Classically this would be a NUMA node,
    but under DISCONTIGMEM it may merely identify a chunk of memory.

    The SPARSEMEM and FLATMEM memory model non-NUMA versions of pfn_to_nid()
    are folded together under NEED_MULTIPLE_NODES, while DISCONTIGMEM has its
    own optimisation. This is all very confusing.

    This patch splits out each implementation of pfn_to_nid() so that we can
    see them and the optimisations to each.
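
    After the split, the selection reads roughly like this (abridged and
    illustrative, not the literal diff):

        #if defined(CONFIG_FLATMEM)
        #  ifndef CONFIG_NUMA
        #  define pfn_to_nid(pfn)       (0)     /* single-node optimisation */
        #  endif
        #elif defined(CONFIG_DISCONTIGMEM)
        /* definition supplied by asm/mmzone.h */
        #elif defined(CONFIG_SPARSEMEM)
        #define pfn_to_nid(pfn)         page_to_nid(pfn_to_page(pfn))
        #endif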

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Fix two warnings in ipc/shm.c

    ipc/shm.c:122: warning: statement with no effect
    ipc/shm.c:560: warning: statement with no effect

    by converting the macros to empty inline functions. For safety, let's do
    all three. This also has the advantage that typechecking gets performed
    even without CONFIG_SHMEM enabled.
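
    The shape of the fix, with a hypothetical macro name for illustration: a
    macro that expands to a bare expression triggers "statement with no
    effect" when used as a statement, while an empty inline function compiles
    away and still typechecks its argument.

        /* before: "shm_stub(shp);" is an expression statement with no effect */
        #define shm_stub(shp)   (shp)

        /* after: no warning, and the argument type is checked even when
         * CONFIG_SHMEM is disabled */
        static inline void shm_stub(struct shmid_kernel *shp)
        {
        }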

    Signed-off-by: Russell King
    Cc: Manfred Spraul
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     
  • The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM
    could handle pSeries NUMA layouts. However, support for DISCONTIGMEM has
    been replaced by SPARSEMEM on powerpc. As a result, this config option
    and its supporting code are no longer needed.

    I have already sent a patch to Paul that removes the option from
    powerpc-specific code. This removes the arch-independent piece. It
    doesn't really matter which is applied first.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • pfn_to_pgdat() isn't used in common code. Remove definition.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • kvaddr_to_nid() isn't used in common code nor in i386 code. Remove these
    definitions.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • mempolicy.c contains a provisional interface for huge page allocation
    based on node numbers. This is in use in SLES9 but was never used (AFAIK)
    in upstream versions of Linux.

    Huge page allocations now use zonelists to figure out where to allocate
    pages. Zonelists allow us to find the closest huge page, which is how
    NUMA distance is taken into consideration for huge page allocations.

    Remove the obsolete functions.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The huge_zonelist() function in the memory policy layer provides a list of
    zones ordered by NUMA distance. The hugetlb layer will walk that list
    looking for a zone that has available huge pages and is also in the
    nodeset of the current cpuset.

    This patch does not contain the folding of find_or_alloc_huge_page() that was
    controversial in the earlier discussion.
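
    A sketch of that walk, modelled on the era's dequeue path (treat the
    details as illustrative rather than the exact patch):

        struct zonelist *zonelist = huge_zonelist(vma, addr);
        struct zone **z;

        for (z = zonelist->zones; *z; z++) {
                int nid = (*z)->zone_pgdat->node_id;

                /* nearest zones come first; take one only if the cpuset
                 * allows it and its node has a free huge page queued */
                if (cpuset_zone_allowed(*z, GFP_HIGHUSER) &&
                    !list_empty(&hugepage_freelists[nid]))
                        break;
        }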

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Here is the patch to implement madvise(MADV_REMOVE), which frees up a
    given range of pages and its associated backing store. The current
    implementation supports only shmfs/tmpfs; other filesystems return
    -ENOSYS.

    "Some app allocates large tmpfs files, then when some task quits and some
    client disconnect, some memory can be released. However the only way to
    release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli

    Databases want to use this feature to drop a section of their bufferpool
    (shared memory segments) - without writing back to disk/swap space.

    This feature is also useful for supporting hot-plug memory on UML.

    Concerns raised by Andrew Morton:

    - "We have no plan for holepunching! If we _do_ have such a plan (or
    might in the future) then what would the API look like? I think
    sys_holepunch(fd, start, len), so we should start out with that."

    - "Using madvise is very weird, because people will ask 'why do I need to
    mmap my file before I can stick a hole in it?'"

    - "None of the other madvise operations call into the filesystem in this
    manner. A broad question is: is this capability an MM operation or a
    filesystem operation? truncate, for example, is a filesystem operation
    which sometimes has MM side-effects. madvise is an mm operation and with
    this patch, it gains FS side-effects, only they're really, really
    significant ones."

    Comments:

    - Andrea suggested the fs operation too, but then it's more efficient to
    have it as an mm operation with fs side effects, because userland doesn't
    immediately know the fd and physical offset of the range. It's possible
    to fix that up in userland and use the fs operation, but it's more
    expensive; the vmas are already in the kernel and we can use them.

    Short term plan & Future Direction:

    - We seem to need this interface only for shmfs/tmpfs files in the short
    term. We have to add hooks into the filesystem for correctness and
    completeness. This is what this patch does.

    - In the future, the plan is to support the fs and mmap APIs as well.
    This also involves implementing (other) filesystem-specific functions.

    - The current patch doesn't support VM_NONLINEAR; this can be addressed
    in the future.
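
    A hedged userspace sketch of the interface (path and sizes are
    illustrative; error handling omitted):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 1 << 20;
                int fd = open("/dev/shm/example", O_RDWR | O_CREAT, 0600);
                char *p;

                ftruncate(fd, len);
                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

                /* ... fill and use the region ... */

                /* free the pages and their tmpfs backing store without
                 * writing anything back; other filesystems give -ENOSYS */
                madvise(p, len, MADV_REMOVE);

                munmap(p, len);
                close(fd);
                return 0;
        }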

    Signed-off-by: Badari Pulavarty
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • This patch creates truncate_inode_pages_range() from truncate_inode_pages();
    truncate_inode_pages() becomes a one-line call to
    truncate_inode_pages_range().

    Reiser4 needs truncate_inode_pages_range() because it tries to keep a
    correspondence between the existence of metadata pointing to data pages
    and the pages to which that metadata points. So when the metadata for a
    certain part of a file is removed from the filesystem tree, only the
    pages of the corresponding range are truncated.

    (Needed by the madvise(MADV_REMOVE) patch)
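
    The resulting one-liner looks like this (reading the (loff_t)-1 end
    offset as "to end of file"):

        void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
        {
                /* truncate everything from lstart to the end of the file */
                truncate_inode_pages_range(mapping, lstart, (loff_t)-1);
        }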

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans Reiser
     
  • Cc: Ivan Kokshaysky
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Janos Haar of First NetCenter Bt. reported numerous crashes involving the
    NBD driver. With his help, this was tracked down to bogus bio vectors,
    which in turn were the result of a race condition between the
    receive/transmit routines in the NBD driver.

    The bug manifests itself like this:

        CPU0                                CPU1
        do_nbd_request
          add req to queuelist
          nbd_send_request
            send req head
            for each bio
              kmap
              send
                                            nbd_read_stat
                                              nbd_find_request
                                              nbd_end_request
              kunmap

    When CPU1 finishes nbd_end_request, the request and all its associated
    bios are freed. So when CPU0 calls kunmap, whose argument is derived from
    the last bio, it may crash.

    Under normal circumstances, the race occurs only on the last bio. However,
    if an error is encountered on the remote NBD server (such as an incorrect
    magic number in the request), or if there were a bug in the server, it is
    possible for the nbd_end_request to occur any time after the request's
    addition to the queuelist.

    The following patch fixes this problem by making sure that requests are
    not added to the queuelist until after they have completed transmission.

    In order for the receiving side to be ready for responses involving
    requests still being transmitted, the patch introduces the concept of the
    active request.

    When a response matches the current active request, its processing is
    delayed until after the transmission has come to a stop.

    This has been tested by Janos and it has been successful in curing this
    race condition.

    From: Herbert Xu

    Here is an updated patch which removes the active_req wait in
    nbd_clear_queue and the associated memory barrier.

    I've also clarified this in the comment.
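
    A hedged sketch of the scheme (field and wait-queue names assumed from
    the description, not copied from the patch):

        /* sender: keep the request private while it is on the wire; only
         * queue it for the receiver once transmission has finished */
        lo->active_req = req;
        nbd_send_req(lo, req);                  /* may sleep mid-send */

        spin_lock(&lo->queue_lock);
        list_add(&req->queuelist, &lo->queue_head);
        spin_unlock(&lo->queue_lock);

        lo->active_req = NULL;
        wake_up_all(&lo->active_wq);

        /* receiver: a reply that matches the request still being sent must
         * wait for the sender to finish before completing it */
        if (req == lo->active_req)
                wait_event(lo->active_wq, lo->active_req != req);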

    Signed-off-by: Herbert Xu
    Cc: Paul Clements
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Herbert Xu
     

06 Jan, 2006

14 commits


05 Jan, 2006

4 commits