02 Feb, 2006

1 commit


19 Jan, 2006

8 commits

  • Move the interrupt check from slab_node into ___cache_alloc and add an
    "unlikely()" to avoid pipeline stalls on some architectures.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
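
    A minimal sketch of the pattern described above (the interrupt check hoisted
    into the allocation path behind unlikely()); the helper names are
    hypothetical, not the actual mm/slab.c code:

        /* Sketch only: not the upstream ___cache_alloc(). */
        static void *cache_alloc_sketch(struct my_cache *cachep, gfp_t flags)
        {
            /*
             * unlikely() keeps the rarely taken branch off the hot path, so
             * the common case does not pay for a mispredicted branch or
             * pipeline stall.
             */
            if (unlikely(!in_interrupt() && current_has_mempolicy()))  /* hypothetical check */
                return alloc_obeying_policy(cachep, flags);            /* hypothetical */

            return alloc_from_local_node(cachep, flags);               /* hypothetical */
        }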
     
  • This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
    imbalance in memory allocation during bootup.

    The slab allocator in 2.6.13 is not NUMA aware and simply calls
    alloc_pages(). This means that memory policies may control the behavior of
    alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE
    resulting in the spreading out of allocations during bootup over all
    available nodes. The slab allocator in 2.6.13 has only a single list of
    slab pages. As a result the per cpu slab cache and the spinlock controlled
    page lists may contain slab entries from off node memory. The slab
    allocator in 2.6.13 makes no effort to discern the locality of an entry on
    its lists.

    The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
    explicitly by calling alloc_pages_node(). The NUMA slab allocator manages
    slab entries by having lists of available slab pages for each node. The
    per cpu slab cache can only contain slab entries associated with the node
    local to the processor. This guarantees that the default allocation mode
    of the slab allocator always assigns local memory if available.

    Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
    anymore. In 2.6.14 all node unspecific slab allocations are performed on
    the boot processor. This means that most of the key data structures are
    allocated on one node. Most processors will have to refer to these
    structures making the boot node a potential bottleneck. This may reduce
    performance and cause unnecessary memory pressure on the boot node.

    This patch implements NUMA policies in the slab layer. NUMA memory policies
    must be applied explicitly by the slab allocator itself, since the NUMA slab
    allocator no longer lets the page allocator control locality.

    The check for policies is made directly at the beginning of __cache_alloc
    using current->mempolicy. The memory policy is already frequently checked
    by the page allocator (alloc_page_vma() and alloc_pages_current()). So it
    is highly likely that the cacheline is present. For MPOL_INTERLEAVE
    kmalloc() will spread out each request to one node after another so that an
    equal distribution of allocations can be obtained during bootup.

    It is not possible to push the policy check to lower layers of the NUMA
    slab allocator since the per cpu caches now only contain slab entries from
    the current node. If the policy says that the local node is not to be
    preferred, or is forbidden, then there is no point in checking the slab
    cache or the local list of slab pages. The allocation is better directed
    immediately to the lists containing slab entries for the allowed set of
    nodes.

    This way of applying policy also fixes another strange behavior in 2.6.13.
    alloc_pages() is controlled by the memory allocation policy of the current
    process. It could therefore be that one process is running with
    MPOL_INTERLEAVE and would, for example, obtain a new page following that
    policy since no slab entries are in the lists anymore. A page can typically
    be used for multiple slab entries, but let's say that the current process is
    only using one. The other entries are then added to the slab lists. These
    are now non-local entries in the slab lists despite the possible
    availability of local pages that would provide faster access and increase
    the performance of the application.

    Another process without MPOL_INTERLEAVE may now run and expect a local slab
    entry from kmalloc(). However, there are still these free slab entries
    from the off node page obtained from the other process via MPOL_INTERLEAVE
    in the cache. The process will then get an off-node slab entry although
    other slab entries may be available that are local to that process. This
    means that the policy of one process may contaminate the locality of the
    slab caches for other processes.

    This patch in effect ensures that a per-process policy is followed for the
    allocation of slab entries and that there cannot be a memory policy
    influence from one process to another. A process with the default policy
    will always get a local slab entry if one is available. And the process
    using memory policies will get its memory arranged as requested. Off-node
    slab allocation will require the use of spinlocks and makes the use of
    per-cpu caches impossible. A process using memory policies to redirect
    allocations off-node will have to cope with additional lock overhead in
    addition to the latency added by the need to access a remote slab entry.

    Changes V1->V2
    - Remove an #ifdef CONFIG_NUMA by moving the forward declaration into the
    prior #ifdef CONFIG_NUMA section.

    - Give the function determining the node number to use a saner
    name.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
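
    A minimal sketch of the allocation flow described above: the memory policy
    is checked at the top of the allocation path, and MPOL_INTERLEAVE picks the
    next node in round-robin fashion. Helper names are hypothetical; this is
    not the exact mm/slab.c code:

        /* Sketch: apply the caller's mempolicy before touching per-cpu caches. */
        static void *slab_alloc_sketch(struct my_cache *cachep, gfp_t flags)
        {
            if (unlikely(current->mempolicy && !in_interrupt())) {
                /*
                 * For MPOL_INTERLEAVE this advances to the next allowed node
                 * on every call, spreading boot-time kmalloc()s evenly
                 * instead of piling them onto the boot node.
                 */
                int nid = policy_pick_node(current->mempolicy);       /* hypothetical */

                if (nid != numa_node_id())
                    /* Off node: per-node lists, spinlock required. */
                    return alloc_from_node_lists(cachep, flags, nid); /* hypothetical */
            }
            /* Default: the per-cpu cache, which only holds node-local objects. */
            return alloc_from_percpu_cache(cachep, flags);            /* hypothetical */
        }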
     
  • Convert mm/swapfile.c's swapon_sem to swapon_mutex.

    Signed-off-by: Ingo Molnar
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
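
    The conversion follows the standard semaphore-to-mutex pattern of that era;
    a sketch of the before and after (the real change is confined to
    mm/swapfile.c):

        /* Before: a semaphore used purely for mutual exclusion. */
        static DECLARE_MUTEX(swapon_sem);
            down(&swapon_sem);
            /* ... manipulate swap state ... */
            up(&swapon_sem);

        /* After: the dedicated mutex type, which documents intent and hooks
         * into the mutex debugging infrastructure. */
        static DEFINE_MUTEX(swapon_mutex);
            mutex_lock(&swapon_mutex);
            /* ... manipulate swap state ... */
            mutex_unlock(&swapon_mutex);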
     
  • Some bits for zone reclaim exist in 2.6.15, but they are not usable. This
    patch fixes them up, removes unused code and makes zone reclaim usable.

    Zone reclaim allows the reclaiming of pages from a zone if the number of
    free pages falls below the watermarks even if other zones still have enough
    pages available. Zone reclaim is of particular importance for NUMA
    machines. It can be more beneficial to reclaim a page than taking the
    performance penalties that come with allocating a page on a remote zone.

    Zone reclaim is enabled if the maximum distance to another node is higher
    than RECLAIM_DISTANCE, which may be defined by an arch. By default
    RECLAIM_DISTANCE is 20, which on IA64 is the distance to another node in
    the same component (enclosure or motherboard). The meaning of the NUMA
    distance information seems to vary by arch.

    If zone reclaim is not successful then no further reclaim attempts will
    occur for a certain time period (ZONE_RECLAIM_INTERVAL).

    This patch was discussed before. See

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113519961504207&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113408418232531&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113389027420032&w=2
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113380938612205&w=2

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
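
    A sketch of how the distance check can gate zone reclaim; the names follow
    the commit text, but the exact upstream code differs:

        #ifndef RECLAIM_DISTANCE
        #define RECLAIM_DISTANCE 20    /* an arch may override this default */
        #endif

        /* Sketch: turn zone reclaim on for a node with a "far" neighbour. */
        static void consider_zone_reclaim(int nid)
        {
            int other;

            for_each_online_node(other)
                if (node_distance(nid, other) > RECLAIM_DISTANCE) {
                    zone_reclaim_mode = 1;  /* prefer reclaiming local pages */
                    return;
                }
        }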
     
  • Zone reclaim has a huge impact on NUMA performance (for example, our maximum
    throughput with XFS is raised from 4GB/sec to 6GB/sec; page cache
    contamination of NUMA nodes destroys locality if one just does a large copy
    operation, which results in performance dropping for good until reboot).

    This patch:

    Resurrect may_swap in struct scan_control

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simplify migrate_page_add after feedback from Hugh. This also allows us to
    drop one parameter from migrate_page_add.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Migration code currently does not take a reference to target page
    properly, so between unlocking the pte and trying to take a new
    reference to the page with isolate_lru_page, anything could happen to
    it.

    Fix this by holding the pte lock until we get a chance to elevate the
    refcount.

    Other small cleanups while we're here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
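
    A sketch of the ordering fix: elevate the page's refcount while the pte
    lock still pins it, and only then drop the lock; details and error handling
    are omitted, so this is not the literal upstream diff:

        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        struct page *page = vm_normal_page(vma, addr, *pte);

        if (page)
            get_page(page);             /* take the reference under the lock */
        pte_unmap_unlock(pte, ptl);     /* only now is it safe to unlock */

        if (page) {
            isolate_lru_page(page);     /* page cannot have been freed meanwhile
                                         * (return value handling omitted) */
            put_page(page);
        }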
     
  • Ravikiran reports that this variable is bouncing all around nodes on NUMA
    machines, causing measurable performance problems. Fix that up by only
    writing to it when it actually changed.

    And put it in a new cacheline to prevent it from sharing a cacheline with
    other things (which was happening).

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
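
    A sketch of the two ingredients of the fix: only write when the value
    actually changes, and give the variable a cacheline of its own. The
    variable name here is illustrative, not the one touched by the patch:

        /* Illustrative only. */
        static int shared_flag ____cacheline_aligned_in_smp;

        static void update_shared_flag(int new_value)
        {
            /*
             * Skip the store when nothing changed, so the cacheline is not
             * dirtied and does not bounce between NUMA nodes.
             */
            if (shared_flag != new_value)
                shared_flag = new_value;
        }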
     

17 Jan, 2006

1 commit

  • Add __meminit to the __init lineup to ensure functions default
    to __init when memory hotplug is not enabled. Replace __devinit
    with __meminit on functions that were changed when the memory
    hotplug code was introduced.

    Signed-off-by: Matt Tolentino
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Matt Tolentino
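
    A sketch of the idea behind the annotation (simplified; see
    include/linux/init.h for the real definitions): when memory hotplug is
    enabled the code must stay resident, otherwise it can be discarded along
    with the init text.

        #ifdef CONFIG_MEMORY_HOTPLUG
        #define __meminit
        #define __meminitdata
        #else
        #define __meminit       __init
        #define __meminitdata   __initdata
        #endif

        /* Hypothetical usage: discarded after boot unless hotplug may call it. */
        int __meminit setup_new_memory_section(unsigned long start_pfn,
                                               unsigned long nr_pages);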
     

15 Jan, 2006

2 commits

  • The problem, reported in:

    http://bugzilla.kernel.org/show_bug.cgi?id=5859

    and by various other email messages and lkml posts is that the cpuset hook
    in the oom (out of memory) code can try to take a cpuset semaphore while
    holding the tasklist_lock (a spinlock).

    One must not sleep while holding a spinlock.

    The fix seems easy enough - move the cpuset semaphore region outside the
    tasklist_lock region.

    This required a few lines of mechanism to implement. The oom code where
    the locking needs to be changed does not have access to the cpuset locks,
    which are internal to kernel/cpuset.c only. So I provided a couple more
    cpuset interface routines, available to the rest of the kernel, which
    simply take and drop the lock needed here (the cpuset's callback_sem).

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
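
    A sketch of the resulting lock ordering in the OOM path; the cpuset_lock()
    and cpuset_unlock() names follow the commit's description of the new
    interface routines, and everything else is condensed:

        cpuset_lock();                  /* may sleep: takes callback_sem inside cpuset.c */
        read_lock(&tasklist_lock);      /* spinlock: no sleeping allowed from here on */

        /* ... walk the tasklist and pick an OOM victim, consulting cpuset
         *     state without taking the cpuset semaphore here ... */

        read_unlock(&tasklist_lock);
        cpuset_unlock();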
     
  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy and nodes for
    that policy where they would prefer allocations.

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
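
    For illustration, a userspace sketch of mounting tmpfs with a memory
    policy. The exact option string is an assumption based on the commit text;
    check Documentation/filesystems/tmpfs.txt for the syntax of the kernel at
    hand:

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            /* Assumed syntax: interleave tmpfs pages across nodes 0-3. */
            if (mount("tmpfs", "/mnt/scratch", "tmpfs", 0,
                      "size=512m,mpol=interleave:0-3") != 0) {
                perror("mount");
                return 1;
            }
            return 0;
        }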
     

13 Jan, 2006

4 commits


12 Jan, 2006

5 commits


11 Jan, 2006

3 commits


10 Jan, 2006

3 commits


09 Jan, 2006

13 commits

  • The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
    fully POSIX-compliant.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valentine Barshak
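
    The change amounts to an early check in the fadvise path; a sketch of the
    kind of test involved (not the literal upstream code):

        /* POSIX: posix_fadvise() shall fail with ESPIPE when fd refers to a pipe. */
        if (S_ISFIFO(inode->i_mode))
            return -ESPIPE;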
     
  • This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.

    See mm/filemap.c.

    It also changes filemap_write_and_wait() and filemap_write_and_wait_range().

    The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
    returns an error. However, even if filemap_fdatawrite() returned an error,
    it may have already submitted part of the data pages to the device (e.g. in
    the case of -ENOSPC).

    Andrew Morton writes,

    If filemap_fdatawrite() returns an error, this might be due to some
    I/O problem: dead disk, unplugged cable, etc. Given the generally
    crappy quality of the kernel's handling of such exceptions, there's a
    good chance that the filemap_fdatawait() will get stuck in D state
    forever.

    So, this patch doesn't wait if filemap_fdatawrite() returns -EIO.

    Trond, could you please review the nfs part? In particular, I'm not sure
    whether nfs must use "filemap_fdatawrite(inode->i_mapping) == 0" or not.

    Acked-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
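
    A sketch of the resulting behaviour, close to (but not necessarily
    identical to) the merged helper:

        /*
         * Still wait after most write errors, since pages may already be in
         * flight (e.g. after -ENOSPC), but not after -EIO, where waiting
         * risks hanging in D state forever.
         */
        int filemap_write_and_wait_sketch(struct address_space *mapping)
        {
            int err = 0;

            if (mapping->nrpages) {
                err = filemap_fdatawrite(mapping);
                if (err != -EIO) {
                    int err2 = filemap_fdatawait(mapping);
                    if (!err)
                        err = err2;
                }
            }
            return err;
        }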
     
  • This exports and changes sync_page_range()/sync_page_range_nolock(). The
    fatfs needs sync_page_range()/sync_page_range_nolock() for expanding
    truncate, and the patch changes "size_t count" to "loff_t count".

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Fix more of longstanding bug in cpuset/mempolicy interaction.

    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
    to just the Memory Nodes allowed by that cpuset. The kernel maintains
    internal state for each mempolicy, tracking what nodes are used for the
    MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

    When a task's cpuset memory placement changes, whether because the cpuset
    changed, or because the task was attached to a different cpuset, then the
    task's mempolicies have to be rebound to the new cpuset placement, so as to
    preserve the cpuset-relative numbering of the nodes in that policy.

    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.

    This fix rebinds mempolicies attached to vma's (address ranges in a task's
    address space). Due to the need to hold the task->mm->mmap_sem semaphore while
    updating vma's, the rebinding of vma mempolicies has to be done when the
    cpuset memory placement is changed, at which time mmap_sem can be safely
    acquired. The task's mempolicy is rebound later, when the task next attempts
    to allocate memory and notices that its task->cpuset_mems_generation is
    out of date with its cpuset's mems_generation.

    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan. In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock). So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.

    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.

    When a task is moved to a different cpuset, it is easier, as there is only one
    task involved. Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.

    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice. This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
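
    A sketch of the two-phase scheme described above: collect mm references
    under the tasklist spinlock, then rebind each mm's vma policies under
    mmap_sem once the spinlock has been dropped. Several helpers here are
    hypothetical condensations of the cpuset.c code, not its actual
    identifiers:

        /* Phase 1: under tasklist_lock (a spinlock), only collect mm references. */
        read_lock(&tasklist_lock);
        for_each_process(p) {
            if (!task_in_cpuset(p, cs))         /* hypothetical */
                continue;
            mm = get_task_mm(p);                /* takes a reference, does not sleep */
            if (mm)
                mmarray[n++] = mm;
        }
        read_unlock(&tasklist_lock);

        /* Phase 2: sleeping locks are safe again; rebind each vma's policy. */
        for (i = 0; i < n; i++) {
            mm = mmarray[i];
            down_write(&mm->mmap_sem);
            for (vma = mm->mmap; vma; vma = vma->vm_next)
                mpol_rebind_policy(vma->vm_policy, &cs->mems_allowed);
            up_write(&mm->mmap_sem);
            mmput(mm);
        }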
     
  • Clean up, reorganize, and make more robust the mempolicy.c code that rebinds
    mempolicies relative to the containing cpuset after a task's memory placement
    changes.

    The real motivator for this cleanup patch is to lay more groundwork for the
    upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
    after the containing cpuset memory placement changes.

    NUMA mempolicies are constrained by the cpuset their task is a member of.
    When either (1) a task is moved to a different cpuset, or (2) the 'mems'
    mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
    node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
    be recalculated, relative to their new cpuset placement.

    The old code used an unreliable method of determining what was the old
    mems_allowed constraining the mempolicy. It just looked at the task's
    mems_allowed value. This sort of worked with the present code, which just
    rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
    referring to the old nodes. But in an upcoming patch, the vma mempolicies
    will be rebound as well. Then the order in which the various task and vma
    mempolicies are updated will no longer be deterministic, and one can no longer
    count on the task->mems_allowed holding the old value for as long as needed.
    It's not even clear if the current code was guaranteed to work reliably for
    task mempolicies.

    So I added a mems_allowed field to each mempolicy, stating exactly what
    mems_allowed the policy is relative to, and updated synchronously and reliably
    anytime that the mempolicy is rebound.

    Also removed a useless wrapper routine, numa_policy_rebind(), and had its
    caller, cpuset_update_task_memory_state(), call directly to the rewritten
    policy_rebind() routine, and made that rebind routine extern instead of
    static, and added a "mpol_" prefix to its name, making it
    mpol_rebind_policy().

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
    needed, to obtain the mems_allowed vector of a cpuset, and replace the
    workaround in sys_migrate_pages() with a call to this new method.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • The important code paths through alloc_pages_current() and alloc_page_vma(),
    by which most kernel page allocations go, both called
    cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
    -Both- of these latter two routines took the task lock, got the task's cpuset
    pointer, and checked for an out-of-date cpuset->mems_generation.

    That was a silly duplication of code and waste of CPU cycles on an important
    code path.

    Consolidated those two routines into a single routine, called
    cpuset_update_task_memory_state(), since it updates more than just
    mems_allowed.

    Changed all callers of either routine to call the new consolidated routine.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
    at which the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.

    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect what level of memory pressure that job is causing.

    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, which may choose to terminate or reprioritize jobs that are
    trying to use more memory than allowed on the nodes assigned them, and with
    tightly coupled, long running, massively parallel scientific computing jobs
    that will dramatically fail to meet required performance goals if they
    start to use more memory than allowed to them.

    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure. It's up to the batch
    manager or other user code to decide what to do about it and take action.

    ==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

    Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm, the
    system load imposed by a batch scheduler monitoring this metric is
    sharply reduced on large systems, because a scan of the tasklist can be
    avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a single
    read, instead of having to read and accumulate results for a period of
    time.

    Because this meter is per-cpuset rather than per-task or mm, the
    batch scheduler can obtain the key information, memory pressure in a
    cpuset, with a single read, rather than having to query and accumulate
    results over all the (dynamically changing) set of tasks in the cpuset.

    A per-cpuset simple digital filter (requires a spinlock and 3 words of data
    per-cpuset) is kept, and updated by any task attached to that cpuset, if it
    enters the synchronous (direct) page reclaim code.

    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
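
    The per-cpuset meter is essentially an exponentially decaying event-rate
    filter. Below is a self-contained userspace model of the idea (10-second
    half-life, reported as events per second times 1000); the constants, field
    names, and decay scheme are illustrative rather than the kernel's, and
    locking is omitted:

        #include <stdio.h>
        #include <time.h>

        #define FM_HALFLIFE 10   /* seconds for the measured rate to decay by half */
        #define FM_SCALE    1000 /* report events/sec multiplied by 1000 */

        struct fmeter {
            int    val;          /* decayed rate, in events/sec * FM_SCALE */
            int    pending;      /* events not yet folded into val */
            time_t updated;      /* last time val was decayed */
        };

        /* Halve val for every FM_HALFLIFE seconds elapsed, then fold in pending
         * events as if they all occurred within the last second. */
        static void fmeter_update(struct fmeter *fm)
        {
            time_t now = time(NULL);

            while (fm->updated + FM_HALFLIFE <= now) {
                fm->val /= 2;
                fm->updated += FM_HALFLIFE;
            }
            fm->val += fm->pending * FM_SCALE;
            fm->pending = 0;
        }

        /* Called from the direct-reclaim path in the kernel analogy. */
        static void fmeter_markevent(struct fmeter *fm)
        {
            fmeter_update(fm);
            fm->pending++;
        }

        /* What a batch manager would read from the per-cpuset file. */
        static int fmeter_getrate(struct fmeter *fm)
        {
            fmeter_update(fm);
            return fm->val;
        }

        int main(void)
        {
            struct fmeter fm = { 0, 0, time(NULL) };

            fmeter_markevent(&fm);
            fmeter_markevent(&fm);
            printf("rate = %d (events/sec * 1000)\n", fmeter_getrate(&fm));
            return 0;
        }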
     
  • Finish converting mm/mempolicy.c from bitmaps to nodemasks. The previous
    conversion had left one routine using bitmaps, since it involved a
    corresponding change to kernel/cpuset.c.

    Fix that interface by replacing it with a simple macro that calls
    nodes_subset(), or, if !CONFIG_CPUSET, returns (1).

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
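
    A sketch of the replacement macro's shape; the macro name here follows the
    commit's description and may not match the final identifier exactly:

        #ifdef CONFIG_CPUSETS
        #define cpuset_nodes_subset_current_mems_allowed(nodes) \
                nodes_subset((nodes), current->mems_allowed)
        #else
        #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
        #endif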
     
  • configurable replacement for slab allocator

    This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB is
    disabled, the kernel falls back to using the 'SLOB' allocator.

    SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
    similar to the original Linux kmalloc allocator that SLAB replaced. Its
    code is significantly smaller and it is more memory efficient. But like all
    similar allocators, it scales poorly and suffers more from fragmentation
    than SLAB, so it is only appropriate for small systems.

    It's been tested extensively in the Linux-tiny tree. I've also
    stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
    recommended).

    Here's a comparison for otherwise identical builds, showing SLOB saving
    nearly half a megabyte of RAM:

    $ size vmlinux*
          text    data     bss     dec    hex filename
       3336372  529360  190812 4056544 3de5e0 vmlinux-slab
       3323208  527948  190684 4041840 3dac70 vmlinux-slob

    $ size mm/{slab,slob}.o
          text    data     bss     dec    hex filename
         13221     752      48   14021   36c5 mm/slab.o
          1896      52       8    1956    7a4 mm/slob.o

    /proc/meminfo:
                        SLAB        SLOB      delta
    MemTotal:       27964 kB    27980 kB     +16 kB
    MemFree:        24596 kB    25092 kB    +496 kB
    Buffers:           36 kB       36 kB       0 kB
    Cached:          1188 kB     1188 kB       0 kB
    SwapCached:         0 kB        0 kB       0 kB
    Active:           608 kB      600 kB      -8 kB
    Inactive:         808 kB      812 kB      +4 kB
    HighTotal:          0 kB        0 kB       0 kB
    HighFree:           0 kB        0 kB       0 kB
    LowTotal:       27964 kB    27980 kB     +16 kB
    LowFree:        24596 kB    25092 kB    +496 kB
    SwapTotal:          0 kB        0 kB       0 kB
    SwapFree:           0 kB        0 kB       0 kB
    Dirty:              4 kB       12 kB      +8 kB
    Writeback:          0 kB        0 kB       0 kB
    Mapped:           560 kB      556 kB      -4 kB
    Slab:            1756 kB        0 kB   -1756 kB
    CommitLimit:    13980 kB    13988 kB      +8 kB
    Committed_AS:    4208 kB     4208 kB       0 kB
    PageTables:        28 kB       28 kB       0 kB
    VmallocTotal: 1007312 kB  1007312 kB       0 kB
    VmallocUsed:       48 kB       48 kB       0 kB
    VmallocChunk: 1007264 kB  1007264 kB       0 kB

    (this work has been sponsored in part by CELF)

    From: Ingo Molnar

    Fix 32-bitness bugs in mm/slob.c.

    Signed-off-by: Matt Mackall
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
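
    To illustrate what a "SLAB emulation layer" means, here is a self-contained
    userspace sketch: the cache-style API is preserved, but every allocation
    simply falls through to a general-purpose allocator (malloc standing in for
    SLOB's first-fit heap). This is a model of the concept, not the kernel
    code:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Minimal stand-in for struct kmem_cache: remembers only size and ctor. */
        struct kmem_cache {
            size_t size;
            void (*ctor)(void *obj);
        };

        struct kmem_cache *kmem_cache_create(size_t size, void (*ctor)(void *))
        {
            struct kmem_cache *c = malloc(sizeof(*c));

            if (c) {
                c->size = size;
                c->ctor = ctor;
            }
            return c;
        }

        /* "Emulation": no per-cache slabs or per-cpu queues; each object comes
         * straight from the underlying general-purpose allocator. */
        void *kmem_cache_alloc(struct kmem_cache *c)
        {
            void *obj = malloc(c->size);

            if (obj && c->ctor)
                c->ctor(obj);
            return obj;
        }

        void kmem_cache_free(struct kmem_cache *c, void *obj)
        {
            (void)c;
            free(obj);
        }

        int main(void)
        {
            struct kmem_cache *names = kmem_cache_create(32, NULL);
            char *n = names ? kmem_cache_alloc(names) : NULL;

            if (!n)
                return 1;
            strcpy(n, "slob");
            printf("%s\n", n);
            kmem_cache_free(names, n);
            free(names);
            return 0;
        }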
     
  • Add mm/util.c for functions common between SLAB and SLOB.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • ____cacheline_maxaligned_in_smp is currently used to align critical structures
    and avoid false sharing. It uses per-arch L1_CACHE_SHIFT_MAX and people find
    L1_CACHE_SHIFT_MAX useless.

    However, we have been using ____cacheline_maxaligned_in_smp to align
    structures on the internode cacheline size. As per Andi's suggestion, the
    following patch kills ____cacheline_maxaligned_in_smp and introduces
    INTERNODE_CACHE_SHIFT, which defaults to L1_CACHE_SHIFT for all arches.
    Arches needing L3/internode cacheline alignment can define
    INTERNODE_CACHE_SHIFT in the arch asm/cache.h. The patch replaces
    ____cacheline_maxaligned_in_smp with ____cacheline_internodealigned_in_smp.

    With this patch, L1_CACHE_SHIFT_MAX can be killed.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
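
    A sketch of the new macros' shape, following the commit text (the in-tree
    version has additional CONFIG_SMP conditionals):

        /* Default to the L1 shift; arches with a larger internode (e.g. L3)
         * cacheline define INTERNODE_CACHE_SHIFT in their asm/cache.h. */
        #ifndef INTERNODE_CACHE_SHIFT
        #define INTERNODE_CACHE_SHIFT L1_CACHE_SHIFT
        #endif

        #define INTERNODE_CACHE_BYTES (1 << INTERNODE_CACHE_SHIFT)

        #define ____cacheline_internodealigned_in_smp \
                __attribute__((__aligned__(INTERNODE_CACHE_BYTES)))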
     
  • When the oom_killer kills current there's no need to call
    schedule_timeout_interruptible() since the task must die ASAP.

    Signed-Off-By: Pavel Emelianov
    Signed-Off-By: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev