08 May, 2007

40 commits

  • On x86_64 this cuts allocation overhead for page table pages down to a
    fraction (kernel compile / editing load; TSC-based measurement of time spent
    in each function):

    no quicklist

    pte_alloc 1569048 4.3s(401ns/2.7us/179.7us)
    pmd_alloc 780988 2.1s(337ns/2.7us/86.1us)
    pud_alloc 780072 2.2s(424ns/2.8us/300.6us)
    pgd_alloc 260022 1s(920ns/4us/263.1us)

    quicklist:

    pte_alloc 452436 573.4ms(8ns/1.3us/121.1us)
    pmd_alloc 196204 174.5ms(7ns/889ns/46.1us)
    pud_alloc 195688 172.4ms(7ns/881ns/151.3us)
    pgd_alloc 65228 9.8ms(8ns/150ns/6.1us)

    pgd allocations are the most complex and there we see the most dramatic
    improvement (maybe we can cut down the number of pgds cached somewhat?). But
    even the pte allocations still see a doubling of performance.

    1. Proven code from the IA64 arch.

    The method used here has been fine tuned for years and
    is NUMA aware. It is based on the knowledge that accesses
    to page table pages are sparse in nature. Taking a page
    off the freelists instead of allocating a zeroed page
    allows a reduction of number of cachelines touched
    in addition to getting rid of the slab overhead. So
    performance improves. This is particularly useful if pgds
    contain standard mappings. We can save on the teardown
    and setup of such a page if we have some on the quicklists.
    This includes avoiding lists operations that are otherwise
    necessary on alloc and free to track pgds.

    2. Lightweight alternative to using slab to manage page-sized pages

    Slab overhead is significant and even page allocator use
    is pretty heavy weight. The use of a per cpu quicklist
    means that we touch only two cachelines for an allocation.
    There is no need to access the page_struct (unless arch code
    needs to fiddle around with it). So the fast path just
    means bringing in one cacheline at the beginning of the
    page. That same cacheline may then be used to store the
    page table entry. Or a second cacheline may be used
    if the page table entry is not in the first cacheline of
    the page. The current code will zero the page which means
    touching 32 cachelines (assuming 128 byte). We get down
    from 32 to 2 cachelines in the fast path.

    3. x86_64 gets lightweight page table page management.

    This will allow x86_64 arch code to repopulate pgds and other
    page table entries faster. The list operations for pgds
    are reduced in the same way as for i386 to the point where
    a pgd is allocated from the page allocator and when it is
    freed back to the page allocator. A pgd can pass through
    the quicklists without having to be reinitialized.

    4. Consolidation of code from multiple arches

    So far arches have their own implementation of quicklist
    management. This patch moves that feature into the core, allowing
    easier maintenance and consistent management of quicklists.

    Page table pages have the characteristics that they are typically zero or in a
    known state when they are freed. This is usually exactly the same state as
    needed after allocation. So it makes sense to build a list of freed page
    table pages and then consume the pages already in use first. Those pages have
    already been initialized correctly (thus no need to zero them) and are likely
    already cached in such a way that the MMU can use them most effectively. Page
    table pages are used in a sparse way so zeroing them on allocation is not too
    useful.

    Such an implementation already exists for ia64. However, that implementation
    did not support constructors and destructors as needed by i386 / x86_64. It
    also only supported a single quicklist. The implementation here has
    constructor and destructor support as well as the ability for an arch to
    specify how many quicklists are needed.

    Quicklists are defined by an arch defining CONFIG_QUICKLIST. If more than one
    quicklist is necessary then we can define NR_QUICK for additional lists. F.e.
    i386 needs two and thus has

    config NR_QUICK
    int
    default 2

    If an arch has requested quicklist support then pages can be allocated
    from the quicklist (or from the page allocator if the quicklist is
    empty) via:

    quicklist_alloc(, , )

    Page table pages can be freed using:

    quicklist_free(, , )

    Pages must have a definite state after allocation and before
    they are freed. If no constructor is specified then pages
    will be zeroed on allocation and must be zeroed before they are
    freed.

    If a constructor is used then the constructor will establish
    a definite page state. F.e. the i386 and x86_64 pgd constructors
    establish certain mappings.

    Constructors and destructors can also be used to track the pages.
    i386 and x86_64 use a list of pgds in order to be able to dynamically
    update standard mappings.
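
    The alloc/free behaviour described above can be sketched in user space
    roughly as follows. This is a minimal illustration, not the kernel code:
    struct quicklist, ql_alloc() and ql_free() are hypothetical names, calloc()
    stands in for the page allocator, and a single list is used instead of one
    per CPU. Freed pages are chained through their first word, which is reset
    on reuse so the page is back in its known state.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096            /* assumed page size for the sketch */

    /* Hypothetical stand-in for one per-CPU quicklist. */
    struct quicklist {
        void *head;                   /* first cached page, or NULL */
        int nr_pages;                 /* pages currently cached */
    };

    /* Take a page off the list; fall back to a fresh zeroed page. */
    static void *ql_alloc(struct quicklist *q, void (*ctor)(void *))
    {
        void *page = q->head;

        if (page) {
            /* The first word of a cached page links to the next one. */
            q->head = *(void **)page;
            *(void **)page = NULL;    /* restore the known state */
            q->nr_pages--;
            return page;
        }
        page = calloc(1, PAGE_SIZE);  /* stands in for the page allocator */
        if (page && ctor)
            ctor(page);               /* constructor runs only on fresh pages */
        return page;
    }

    /* Cache a page that the caller has already put back in its known state. */
    static void ql_free(struct quicklist *q, void *page)
    {
        assert(page != NULL);
        *(void **)page = q->head;
        q->head = page;
        q->nr_pages++;
    }
    ```

    A page that passes through ql_free() and then ql_alloc() is handed back
    without being zeroed again, so only the cacheline holding the link word
    is touched.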

    Signed-off-by: Christoph Lameter
    Cc: "David S. Miller"
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add the tool which gets reports about slabs to the VM documentation directory.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make sure that the check functions really only check things and do not
    perform activities. Extract the tracing and object seeding out of the two
    check functions and place them into slab_alloc and slab_free.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • At kmem_cache_shrink check if we have any empty slabs on the partial
    list; if so, remove them.

    Also--as an anti-fragmentation measure--sort the partial slabs so that
    the most fully allocated ones come first and the least allocated last.

    The next allocations may fill up the nearly full slabs. Having the
    least allocated slabs last gives them the maximum chance that their
    remaining objects may be freed. Thus we can hopefully minimize the
    partial slabs.

    I think this is the best one can do in terms of antifragmentation
    measures. Real defragmentation (meaning moving objects out of slabs with
    the least free objects to those that are almost full) can be implemented
    by reverse scanning through the list produced here but that would mean
    that we need to provide a callback at slab cache creation that allows
    the deletion or moving of an object. This will involve slab API
    changes, so defer for now.
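
    The shrink step can be pictured as a bucket sort on the number of
    allocated objects. The sketch below is an assumption-laden illustration,
    not the patch itself: struct slab, MAX_OBJECTS and shrink_partial() are
    made-up names, and a plain array stands in for the partial list.

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define MAX_OBJECTS 32            /* assumed objects per slab */

    struct slab { int inuse; };       /* hypothetical: allocated-object count */

    /*
     * Drop empty slabs and order the rest from most to least allocated.
     * Returns the number of slabs kept in out[].
     */
    static size_t shrink_partial(const struct slab *in, size_t n,
                                 struct slab *out)
    {
        size_t count[MAX_OBJECTS + 1] = { 0 };
        size_t start[MAX_OBJECTS + 1] = { 0 };
        size_t kept = 0;

        for (size_t i = 0; i < n; i++) {
            assert(in[i].inuse >= 0 && in[i].inuse <= MAX_OBJECTS);
            count[in[i].inuse]++;
        }
        /* Fullest bucket first; bucket 0 (empty slabs) is never placed. */
        for (int u = MAX_OBJECTS; u >= 1; u--) {
            start[u] = kept;
            kept += count[u];
        }
        for (size_t i = 0; i < n; i++)
            if (in[i].inuse > 0)
                out[start[in[i].inuse]++] = in[i];
        return kept;
    }
    ```

    Allocations that take slabs from the front of out[] fill the nearly full
    slabs first, which leaves the sparsely used ones at the tail the longest
    time to become empty.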

    Cc: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch enables listing the callers who allocated or freed objects in a
    cache.

    For example to list the allocators for kmalloc-128 do

    cat /sys/slab/kmalloc-128/alloc_calls
    7 sn_io_slot_fixup+0x40/0x700
    7 sn_io_slot_fixup+0x80/0x700
    9 sn_bus_fixup+0xe0/0x380
    6 param_sysfs_setup+0xf0/0x280
    276 percpu_populate+0xf0/0x1a0
    19 __register_chrdev_region+0x30/0x360
    8 expand_files+0x2e0/0x6e0
    1 sys_epoll_create+0x60/0x200
    1 __mounts_open+0x140/0x2c0
    65 kmem_alloc+0x110/0x280
    3 alloc_disk_node+0xe0/0x200
    33 as_get_io_context+0x90/0x280
    74 kobject_kset_add_dir+0x40/0x140
    12 pci_create_bus+0x2a0/0x5c0
    1 acpi_ev_create_gpe_block+0x120/0x9e0
    41 con_insert_unipair+0x100/0x1c0
    1 uart_open+0x1c0/0xba0
    1 dma_pool_create+0xe0/0x340
    2 neigh_table_init_no_netlink+0x260/0x4c0
    6 neigh_parms_alloc+0x30/0x200
    1 netlink_kernel_create+0x130/0x320
    5 fz_hash_alloc+0x50/0xe0
    2 sn_common_hubdev_init+0xd0/0x6e0
    28 kernel_param_sysfs_setup+0x30/0x180
    72 process_zones+0x70/0x2e0

    cat /sys/slab/kmalloc-128/free_calls
    558
    3 sn_io_slot_fixup+0x600/0x700
    84 free_fdtable_rcu+0x120/0x260
    2 seq_release+0x40/0x60
    6 kmem_free+0x70/0xc0
    24 free_as_io_context+0x20/0x200
    1 acpi_get_object_info+0x3a0/0x3e0
    1 acpi_add_single_object+0xcf0/0x1e40
    2 con_release_unimap+0x80/0x140
    1 free+0x20/0x40

    SLAB_STORE_USER must be enabled for a slab cache by either booting with
    "slub_debug" or enabling user tracking specifically for the slab of interest.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We leave a minimum of partial slabs on nodes when we search for
    partial slabs on other nodes. Define a constant for that value.

    Then modify slub to keep MIN_PARTIAL slabs around.

    This avoids bad situations where a function frees the last object
    in a slab (which results in the page being returned to the page
    allocator) only to then allocate one again (which requires getting
    a page back from the page allocator if the partial list was empty).
    Keeping a couple of slabs on the partial list reduces overhead.

    Empty slabs are added to the end of the partial list to ensure that
    partially allocated slabs are consumed first (defragmentation).
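
    The decision made when a slab becomes empty can be sketched as below.
    The value of MIN_PARTIAL, struct node and discard_or_keep() are
    assumptions made for illustration; counters replace the real list and
    page allocator calls.

    ```c
    #include <assert.h>

    #define MIN_PARTIAL 2             /* value assumed for this sketch */

    struct node {
        int nr_partial;               /* slabs on the partial list */
        int pages_returned;           /* slabs given back to the page allocator */
    };

    /* Called when the last object of a slab has been freed. */
    static void discard_or_keep(struct node *n)
    {
        assert(n->nr_partial >= 0);
        if (n->nr_partial < MIN_PARTIAL)
            n->nr_partial++;          /* keep the empty slab at the list tail */
        else
            n->pages_returned++;      /* enough slabs cached: free the page */
    }
    ```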

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This enables validation of slab. Validation means that all objects are
    checked to see if there are redzone violations, if padding has been
    overwritten or any pointers have been corrupted. Also checks the consistency
    of slab counters.

    Validation enables the detection of metadata corruption without the kernel
    having to execute code that actually uses (allocs/frees) an object. It
    allows one to make sure that the slab metainformation and the guard values
    around an object have not been compromised.

    A single slabcache can be checked by writing a 1 to the "validate" file.

    i.e.

    echo 1 >/sys/slab/kmalloc-128/validate

    or use the slabinfo tool to check all slabs

    slabinfo -v

    Error messages will show up in the syslog.

    Note that validation can only reach slabs that are on a list. This means that
    we are usually restricted to partial slabs and active slabs unless
    SLAB_STORE_USER is active which will build a full slab list and allows
    validation of slabs that are fully in use. Booting with "slub_debug" set will
    enable SLAB_STORE_USER and then full diagnostics are available.

    Note that we attempt to push cpu slabs back to the lists when we start the
    check. If the cpu slab is reactivated before we get to it (another processor
    grabs it before we get to it) then it cannot be checked.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If slab tracking is on then build a list of full slabs so that we can verify
    the integrity of all slabs and are also able to build a list of alloc/free
    callers.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Object tracking did not work the right way for several call chains. Fix this up
    by adding a new parameter to slub_alloc and slub_free that specifies the
    caller address explicitly.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The patch adds PageTail(page) and PageHead(page) to check if a page is the
    head or the tail of a compound page. This is done by masking the two bits
    describing the state of a compound page and then comparing them. So one
    comparison and a branch instead of two bit checks and two branches.
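
    The mask-and-compare idea can be sketched as follows. The bit positions
    and the names PG_COMPOUND, PG_TAIL, page_head() and page_tail() are
    placeholders chosen for this illustration and are not the kernel's
    actual flag values.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Placeholder bit positions, not the kernel's real ones. */
    #define PG_COMPOUND       (1UL << 0)
    #define PG_TAIL           (1UL << 1)
    #define PG_HEAD_TAIL_MASK (PG_COMPOUND | PG_TAIL)

    /* Head page: compound bit set, tail bit clear. */
    static bool page_head(unsigned long flags)
    {
        return (flags & PG_HEAD_TAIL_MASK) == PG_COMPOUND;
    }

    /* Tail page: both bits set. */
    static bool page_tail(unsigned long flags)
    {
        return (flags & PG_HEAD_TAIL_MASK) == PG_HEAD_TAIL_MASK;
    }
    ```

    Each test is a single mask plus one comparison, so a page that has only
    the tail bit set (and is not compound) is reported as neither head nor
    tail.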

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If we add a new flag so that we can distinguish between the first page and the
    tail pages then we can avoid using page->private in the first page.
    page->private == page for the first page, so there is no real information in
    there.

    Freeing up page->private makes the use of compound pages more transparent.
    They become more usable like real pages. Right now we have to be careful f.e.
    if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
    can then no longer use the private field. This is one of the issues that
    cause us not to support debugging for page size slabs in SLAB.

    Having page->private available for SLUB would allow more meta information in
    the page struct. I can probably avoid the 16 bit ints that I have in there
    right now.

    Also if page->private is available then a compound page may be equipped with
    buffer heads. This may free up the way for filesystems to support larger
    blocks than page size.

    We add PageTail as an alias of PageReclaim. Compound pages cannot currently
    be reclaimed. Because of the alias one needs to check PageCompound first.

    The RFC for this approach was discussed at
    http://marc.info/?t=117574302800001&r=1&w=2

    [nacc@us.ibm.com: fix hugetlbfs]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • PowerPC uses the slab allocator to manage the lowest level of the page
    table. In high cpu configurations we also use the page struct to split the
    page table lock. Disallow the selection of SLUB for that case.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Paul Mackerras
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Makes SLUB behave like SLAB in this area to avoid issues....

    Throw a stack dump to alert people.

    At some point the behavior should be switched back. NULL is no memory as
    far as I can tell and if the user asked for 0 bytes then he needs to get no
    memory.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Structures may contain u64 items on 32 bit platforms that are only able to
    address 64 bit items on 64 bit boundaries. Change the minimum alignment of
    slabs to conform to those expectations.

    ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structures
    are mixed in the general slabs.

    ARCH_SLAB_MINALIGN is changed because currently there is no consistent
    specification of object alignment. We may have that in the future when the
    KMEM_CACHE and related macros are used to generate slabs. These pass the
    alignment of the structure generated by the compiler to the slab.

    With KMEM_CACHE etc we could align structures that do not contain 64
    bit values to 32 bit boundaries potentially saving some memory.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This is a new slab allocator which was motivated by the complexity of the
    existing code in mm/slab.c. It attempts to address a variety of concerns
    with the existing implementation.

    A. Management of object queues

    A particular concern was the complex management of the numerous object
    queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
    each allocating CPU and use objects from a slab directly instead of
    queueing them up.

    B. Storage overhead of object queues

    SLAB Object queues exist per node, per CPU. The alien cache queue even
    has a queue array that contains a queue for each processor on each
    node. For very large systems the number of queues and the number of
    objects that may be caught in those queues grows exponentially. On our
    systems with 1k nodes / processors we have several gigabytes just tied up
    for storing references to objects for those queues. This does not include
    the objects that could be on those queues. One fears that the whole
    memory of the machine could one day be consumed by those queues.

    C. SLAB meta data overhead

    SLAB has overhead at the beginning of each slab. This means that data
    cannot be naturally aligned at the beginning of a slab block. SLUB keeps
    all meta data in the corresponding page_struct. Objects can be naturally
    aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
    boundaries and can fit tightly into a 4k page with no bytes left over.
    SLAB cannot do this.

    D. SLAB has a complex cache reaper

    SLUB does not need a cache reaper for UP systems. On SMP systems
    the per CPU slab may be pushed back into partial list but that
    operation is simple and does not require an iteration over a list
    of objects. SLAB expires per CPU, shared and alien object queues
    during cache reaping which may cause strange hold offs.

    E. SLAB has complex NUMA policy layer support

    SLUB pushes NUMA policy handling into the page allocator. This means that
    allocation is coarser (SLUB does interleave on a page level) but that
    situation was also present before 2.6.13. SLAB's application of
    policies to individual slab objects allocated in SLAB is
    certainly a performance concern due to the frequent references to
    memory policies which may lead a sequence of objects to come from
    one node after another. SLUB will get a slab full of objects
    from one node and then will switch to the next.

    F. Reduction of the size of partial slab lists

    SLAB has per node partial lists. This means that over time a large
    number of partial slabs may accumulate on those lists. These can
    only be reused if allocations occur on specific nodes. SLUB has a global
    pool of partial slabs and will consume slabs from that pool to
    decrease fragmentation.

    G. Tunables

    SLAB has sophisticated tuning abilities for each slab cache. One can
    manipulate the queue sizes in detail. However, filling the queues still
    requires the use of the spin lock to check out slabs. SLUB has a global
    parameter (min_slab_order) for tuning. Increasing the minimum slab
    order can decrease the locking overhead. The bigger the slab order, the
    fewer movements of pages between per CPU and partial lists occur and the
    better SLUB will scale.

    G. Slab merging

    We often have slab caches with similar parameters. SLUB detects those
    on boot up and merges them into the corresponding general caches. This
    leads to more effective memory use. About 50% of all caches can
    be eliminated through slab merging. This will also decrease
    slab fragmentation because partially allocated slabs can be filled
    up again. Slab merging can be switched off by specifying
    slub_nomerge on boot up.

    Note that merging can expose heretofore unknown bugs in the kernel
    because corrupted objects may now be placed differently and corrupt
    differing neighboring objects. Enable sanity checks to find those.

    H. Diagnostics

    The current slab diagnostics are difficult to use and require a
    recompilation of the kernel. SLUB contains debugging code that
    is always available (but is kept out of the hot code paths).
    SLUB diagnostics can be enabled via the "slub_debug" option.
    Parameters can be specified to select a single or a group of
    slab caches for diagnostics. This means that the system is running
    with the usual performance and it is much more likely that
    race conditions can be reproduced.

    I. Resiliency

    If basic sanity checks are on then SLUB is capable of detecting
    common error conditions and recovering as best as possible to allow the
    system to continue.

    J. Tracing

    Tracing can be enabled via the slub_debug=T, option
    during boot. SLUB will then log all actions on that slabcache
    and dump the object contents on free.

    K. On demand DMA cache creation.

    Generally DMA caches are not needed. If a kmalloc is used with
    __GFP_DMA then just create this single slabcache that is needed.
    For systems that have no ZONE_DMA requirement the support is
    completely eliminated.

    L. Performance increase

    Some benchmarks have shown speed improvements on kernbench in the
    range of 5-10%. The locking overhead of slub is based on the
    underlying base allocation size. If we can reliably allocate
    larger order pages then it is possible to increase slub
    performance much further. The anti-fragmentation patches may
    enable further performance increases.

    Tested on:
    i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

    SLUB Boot options

    slub_nomerge         Disable merging of slabs
    slub_min_order=x     Require a minimum order for slab caches. This
                         increases the managed chunk size and therefore
                         reduces meta data and locking overhead.
    slub_min_objects=x   Minimum objects per slab. Default is 8.
    slub_max_order=x     Avoid generating slabs larger than order specified.
    slub_debug           Enable all diagnostics for all caches
    slub_debug=          Enable selective options for all caches
    slub_debug=,         Enable selective options for a certain set of
                         caches

    Available Debug options
    F Double Free checking, sanity and resiliency
    R Red zoning
    P Object / padding poisoning
    U Track last free / alloc
    T Trace all allocs / frees (only use for individual slabs).

    To use SLUB: Apply this patch and then select SLUB as the default slab
    allocator.

    [hugh@veritas.com: fix an oops-causing locking error]
    [akpm@linux-foundation.org: various stupid cleanups and small fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • If device->num is zero we attempt to kmalloc() zero bytes. When SLUB is
    enabled this returns a null pointer, and we take that as an allocation
    failure and fail the device registration. Check for no devices and avoid
    the allocation.

    [akpm: opportunistic kzalloc() conversion]
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • i386 uses kmalloc to allocate the threadinfo structure assuming that the
    allocations result in a page sized aligned allocation. That has worked so
    far because SLAB exempts page sized slabs from debugging and aligns them in
    special ways that go beyond the restrictions imposed by
    ARCH_KMALLOC_MINALIGN valid for other slabs in the kmalloc array.

    SLUB also works fine without debugging since page sized allocations neatly
    align at page boundaries. However, if debugging is switched on then SLUB
    will extend the slab with debug information. The resulting slab is no
    longer of page size. It will only be aligned following the requirements
    imposed by ARCH_KMALLOC_MINALIGN. As a result the threadinfo structure may
    not be page aligned which makes i386 fail to boot with SLUB debug on.

    Replace the calls to kmalloc with calls into the page allocator.

    An alternate solution may be to create a custom slab cache where the
    alignment is set to PAGE_SIZE. That would allow slub debugging to be
    applied to the threadinfo structure.

    Signed-off-by: Christoph Lameter
    Cc: William Lee Irwin III
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • OOM killed tasks have access to memory reserves as specified by the
    TIF_MEMDIE flag in the hopes that they will quickly exit. If such a task has
    memory allocations constrained by cpusets, we may encounter a deadlock if a
    blocking task cannot exit because it cannot allocate the necessary memory.

    We allow a task that has the TIF_MEMDIE flag to allocate memory anywhere,
    including outside its cpuset restriction, so that it can quickly die
    regardless of whether the allocation is __GFP_HARDWALL.

    Cc: Andi Kleen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is only ever used prior to free_initmem().

    (It will cause a warning when we run the section checking, but that's a
    false-positive and it simply changes the source of an existing warning, which
    is also a false-positive)

    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
    read or write. This function iterates through every zone and calls
    spin_lock_irqsave() on the zone LRU lock. When reading min_free_kbytes,
    this is a total waste of time that disables interrupts on the local
    processor. It might even be noticeable on machines with large numbers of
    zones if a process started constantly reading min_free_kbytes.

    This patch calls setup_per_zone_pages_min() only on write. Tested on
    an x86 laptop and it did the right thing.
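
    The shape of the change, reduced to a user-space sketch: the expensive
    recalculation runs only when the value is written. The handler signature,
    the counter and recalc_zone_watermarks() are hypothetical stand-ins and do
    not mirror the real sysctl interface.

    ```c
    #include <assert.h>
    #include <stddef.h>

    static int min_free_kbytes = 1024;   /* example starting value */
    static int recalculations;           /* counts the expensive zone walks */

    /* Stand-in for setup_per_zone_pages_min(), which takes zone locks. */
    static void recalc_zone_watermarks(void)
    {
        recalculations++;
    }

    /* Hypothetical handler: write != 0 stores *value, otherwise reads into it. */
    static void min_free_kbytes_handler(int write, int *value)
    {
        assert(value != NULL);
        if (write) {
            min_free_kbytes = *value;
            recalc_zone_watermarks();    /* only a write can change the result */
        } else {
            *value = min_free_kbytes;    /* reads touch no zone locks */
        }
    }
    ```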

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
    possible nodes. This patch dynamically sizes the 'struct kmem_cache' to
    allocate only needed space.

    I moved the nodelists[] field to the end of struct kmem_cache, and use the
    following computation in kmem_cache_init()

    cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
                              nr_node_ids * sizeof(struct kmem_list3 *);

    On my two nodes x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
    (This is because on x86_64, MAX_NUMNODES is 64)

    On bigger NUMA setups, this might reduce the gfporder of "cache_cache"
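
    The sizing trick can be tried in isolation with a cut-down structure.
    struct cache and cache_size() below are hypothetical; only the
    offsetof() computation follows the commit message.

    ```c
    #include <assert.h>
    #include <stddef.h>

    #define MAX_NUMNODES 64           /* compile-time upper bound, as on x86_64 */

    struct list3;                     /* opaque stand-in for struct kmem_list3 */

    struct cache {                    /* hypothetical, far smaller than kmem_cache */
        unsigned int buffer_size;
        struct list3 *nodelists[MAX_NUMNODES];   /* must stay the last member */
    };

    /* Bytes needed when only nr_node_ids node slots can ever be used. */
    static size_t cache_size(size_t nr_node_ids)
    {
        assert(nr_node_ids >= 1 && nr_node_ids <= MAX_NUMNODES);
        return offsetof(struct cache, nodelists) +
               nr_node_ids * sizeof(struct list3 *);
    }
    ```

    This only works because nodelists[] is the last member: the unused tail
    of the array is simply never allocated, so no other field moves.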

    Signed-off-by: Eric Dumazet
    Cc: Pekka Enberg
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • We can avoid allocating empty shared caches and avoid an unnecessary check
    of cache->limit. We save some memory. We avoid bringing unnecessary cache
    lines into the CPU cache.

    All accesses to l3->shared are already checking NULL pointers so this patch is
    safe.

    Signed-off-by: Eric Dumazet
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • The existing comment in mm/slab.c is *perfect*, so I reproduce it :

    /*
     * CPU bound tasks (e.g. network routing) can exhibit cpu bound
     * allocation behaviour: Most allocs on one cpu, most free operations
     * on another cpu. For these cases, an efficient object passing between
     * cpus is necessary. This is provided by a shared array. The array
     * replaces Bonwick's magazine layer.
     * On uniprocessor, it's functionally equivalent (but less efficient)
     * to a larger limit. Thus disabled by default.
     */

    As most shipped Linux kernels are now compiled with CONFIG_SMP, there is no way
    a preprocessor #if can detect if the machine is UP or SMP. Better to use
    num_possible_cpus().

    This means on UP we allocate a 'size=0 shared array', to be more efficient.

    Another patch can later avoid the allocations of 'empty shared arrays', to
    save some memory.

    Signed-off-by: Eric Dumazet
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to
    prev_offset. Also update of prev_index in do_generic_mapping_read() is now
    moved close to the update of prev_offset.

    [wfg@mail.ustc.edu.cn: fix it]
    Signed-off-by: Jan Kara
    Cc: Nick Piggin
    Cc: WU Fengguang
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Introduce ra.offset and store in it an offset where the previous read
    ended. This way we can detect whether reads are really sequential (and
    thus we should not mark the page as accessed repeatedly) or whether they
    are random and just happen to be in the same page (and the page should
    really be marked accessed again).
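
    A user-space sketch of the check, under simplifying assumptions: every
    read stays within one page, and struct ra_state and read_marks_accessed()
    are hypothetical names. The two fields mirror prev_page and the new
    offset field.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    #define PAGE_SIZE 4096UL          /* assumed page size */

    struct ra_state {
        unsigned long prev_page;      /* index of the page read last */
        unsigned long offset;         /* offset in that page where the read ended */
    };

    /*
     * Returns true if the page should be marked accessed: the read does not
     * simply continue where the previous one stopped.
     */
    static bool read_marks_accessed(struct ra_state *ra, unsigned long index,
                                    unsigned long offset, unsigned long len)
    {
        bool continues;

        assert(offset + len <= PAGE_SIZE);
        continues = (index == ra->prev_page && offset == ra->offset);
        ra->prev_page = index;
        ra->offset = offset + len;
        return !continues;
    }
    ```

    Reading a page in small sequential pieces marks it accessed once, while
    jumping back to an earlier offset in the same page marks it again.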

    Signed-off-by: Jan Kara
    Acked-by: Nick Piggin
    Cc: WU Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Adds /proc/pid/clear_refs. When any non-zero number is written to this file,
    pte_mkold() and ClearPageReferenced() are called for each pte and its
    corresponding page, respectively, in that task's VMAs. This file is only
    writable by the user who owns the task.

    It is now possible to measure _approximately_ how much memory a task is using
    by clearing the reference bits with

    echo 1 > /proc/pid/clear_refs

    and checking the reference count for each VMA from the /proc/pid/smaps output
    at a measured time interval. For example, to observe the approximate change
    in memory footprint for a task, write a script that clears the references
    (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
    extracts the size in kB. Add the sizes for each VMA together for the total
    referenced footprint. Moments later, repeat the process and observe the
    difference.

    For example, using an efficient Mozilla:

    accumulated time    referenced memory
    ----------------    -----------------
                 0 s               408 kB
                 1 s               408 kB
                 2 s               556 kB
                 3 s              1028 kB
                 4 s               872 kB
                 5 s              1956 kB
                 6 s               416 kB
                 7 s              1560 kB
                 8 s              2336 kB
                 9 s              1044 kB
                10 s               416 kB

    This is a valuable tool to get an approximate measurement of the memory
    footprint for a task.
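
    The "add the sizes for each VMA together" step might look like the
    sketch below. It assumes each smaps entry carries a line of the form
    "Pgs_Referenced: <n> kB"; that layout and the helper name sum_kb() are
    assumptions, and the text is passed in as a string rather than read from
    /proc.

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Sum the kB values of every line in text that starts with key. */
    static unsigned long sum_kb(const char *text, const char *key)
    {
        unsigned long total = 0;
        size_t klen = strlen(key);

        assert(klen > 0);
        for (const char *line = text; line && *line; ) {
            unsigned long kb;

            if (strncmp(line, key, klen) == 0 &&
                sscanf(line + klen, " %lu kB", &kb) == 1)
                total += kb;
            line = strchr(line, '\n');   /* advance to the next line */
            if (line)
                line++;
        }
        return total;
    }
    ```

    Calling this on two snapshots taken a known interval apart, with a write
    to clear_refs in between, gives the per-interval figures shown in the
    table above.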

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    [akpm@linux-foundation.org: build fixes]
    [mpm@selenic.com: rename for_each_pmd]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Adds an additional unsigned long field to struct mem_size_stats called
    'referenced'. For each pte walked in the smaps code, this field is
    incremented by PAGE_SIZE if it has pte-reference bits.

    An additional line was added to the /proc/pid/smaps output for each VMA to
    indicate how many pages within it are currently marked as referenced or
    accessed.

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Extracts the pmd walker from smaps-specific code in fs/proc/task_mmu.c.

    The new struct pmd_walker includes the struct vm_area_struct of the memory to
    walk over. Iteration begins at the vma->vm_start and completes at
    vma->vm_end. A pointer to another data structure may be stored in the private
    field such as struct mem_size_stats, which acts as the smaps accumulator. For
    each pmd in the VMA, the action function is called with a pointer to its
    struct vm_area_struct, a pointer to the pmd_t, its start and end addresses,
    and the private field.

    The interface for walking pmd's in a VMA for fs/proc/task_mmu.c is now:

    void for_each_pmd(struct vm_area_struct *vma,
                      void (*action)(struct vm_area_struct *vma,
                                     pmd_t *pmd, unsigned long addr,
                                     unsigned long end,
                                     void *private),
                      void *private);

    Since the pmd walker is now extracted from the smaps code, smaps_one_pmd() is
    invoked for each pmd in the VMA. Its behavior and efficiency are identical to
    those of the existing implementation.
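
    The callback-plus-private-pointer shape of this interface can be
    sketched outside the kernel as follows. All names are hypothetical, the
    2 MiB pmd span is an assumption, and the pmd_t pointer is omitted since
    there are no real page tables to walk here.

```c
#include <stddef.h>

#define SKETCH_PMD_SIZE (1UL << 21)   /* assumed span of one pmd entry */

struct sketch_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

typedef void (*sketch_pmd_action)(struct sketch_vma *vma,
                                  unsigned long addr, unsigned long end,
                                  void *private);

/* Call action once per pmd-sized region between vm_start and vm_end. */
static void sketch_for_each_pmd(struct sketch_vma *vma,
                                sketch_pmd_action action, void *private)
{
    unsigned long addr = vma->vm_start;

    while (addr < vma->vm_end) {
        /* Next pmd boundary after addr. */
        unsigned long next = (addr + SKETCH_PMD_SIZE) & ~(SKETCH_PMD_SIZE - 1);

        /* Clamp to the end of the VMA; also guards address wrap-around. */
        if (next > vma->vm_end || next < addr)
            next = vma->vm_end;

        action(vma, addr, next, private);
        addr = next;
    }
}

/* Example action: accumulate the number of bytes visited in *private. */
static void sketch_count_bytes(struct sketch_vma *vma, unsigned long addr,
                               unsigned long end, void *private)
{
    (void)vma;
    *(unsigned long *)private += end - addr;
}
```

    Here the private pointer plays the role that struct mem_size_stats
    plays for smaps: the walker never looks inside it.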

    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If you actually clear the bit, you need to:

    + pte_update_defer(vma->vm_mm, addr, ptep);

    The reason is, when updating PTEs, the hypervisor must be notified. Using
    atomic operations to do this is fine for all hypervisors I am aware of.
    However, for hypervisors which shadow page tables, if these PTE
    modifications are not trapped, you need a post-modification call to fulfill
    the update of the shadow page table.

    Acked-by: Zachary Amsden
    Cc: Hugh Dickins
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Add ptep_test_and_clear_{dirty,young} to i386. They advertise that they
    have it and there is at least one place where it needs to be called without
    the page table lock: to clear the accessed bit on write to
    /proc/pid/clear_refs.

    ptep_clear_flush_{dirty,young} are updated to use the new functions. The
    overall net effect to current users of ptep_clear_flush_{dirty,young} is
    that we introduce an additional branch.
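
    The general test-and-clear pattern, and a flushing variant built on top
    of it, might look like the sketch below. It uses C11 atomics instead of
    the i386 bit operations; the names, the bit position and the flush
    counter (a stand-in for a TLB flush) are all assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_PTE_ACCESSED 0x20UL   /* hypothetical bit position */

/* Atomically clear the accessed bit and report whether it was set. */
static bool sketch_test_and_clear_young(_Atomic unsigned long *pte)
{
    unsigned long old = atomic_fetch_and(pte, ~SKETCH_PTE_ACCESSED);

    return (old & SKETCH_PTE_ACCESSED) != 0;
}

/* Flushing variant: only flush when the bit was actually cleared. */
static bool sketch_clear_flush_young(_Atomic unsigned long *pte,
                                     unsigned int *flushes)
{
    bool young = sketch_test_and_clear_young(pte);

    if (young)
        (*flushes)++;   /* stands in for flushing the page's TLB entry */
    return young;
}
```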

    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: David Rientjes
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Introduce a macro that stops gcc from warning about a variable which it
    believes may be used uninitialized.

    Example:

    - spinlock_t *ptl;
    + spinlock_t *uninitialized_var(ptl);

    Not a happy solution, but those warnings are obnoxious.

    - Using the usual pointlessly-set-it-to-zero approach wastes several
      bytes of text.

    - Using a macro means we can (hopefully) do something else if gcc changes
      cause the `x = x' hack to stop working.

    - Using a macro means that people who are worried about hiding true bugs
      can easily turn it off.
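
    A minimal sketch of such a macro, based on the `x = x' hack mentioned
    above. The off switch SKETCH_NO_UNINIT_HACK and the lookup function are
    hypothetical, added only to show how the macro is used and disabled.
    The self-assignment relies on gcc's behaviour; other compilers may warn
    about it.

```c
#include <stddef.h>

#ifdef SKETCH_NO_UNINIT_HACK
/* Hack turned off: plain declaration, so the compiler warns as usual. */
#define uninitialized_var(x) x
#else
/* Self-assignment: gcc stops warning and emits no extra store. */
#define uninitialized_var(x) x = x
#endif

/* Return the index of key in table, or -1 if it is absent. */
static int sketch_lookup(const int *table, size_t n, int key)
{
    size_t uninitialized_var(idx);   /* expands to: size_t idx = idx; */
    int found = 0;

    for (size_t i = 0; i < n; i++) {
        if (table[i] == key) {
            idx = i;
            found = 1;
            break;
        }
    }
    /* idx is only read when found is set, i.e. after it was assigned. */
    return found ? (int)idx : -1;
}
```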

    Signed-off-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • An identical block is duplicated: contrary to the comment, we have been
    re-reading the page *twice* in filemap_nopage rather than once.

    If any retry logic or anything is needed, it belongs in lower levels anyway.
    Only retry once. Linus agrees.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Generally we work under the assumption that the mem_map array is
    contiguous and valid out to a MAX_ORDER_NR_PAGES block of pages, i.e. that if
    we have validated any page within this MAX_ORDER_NR_PAGES block we need not
    check any other. This is not true when CONFIG_HOLES_IN_ZONE is set, and we
    must then check each and every reference we make from a pfn.

    Add a pfn_valid_within() helper which should be used when scanning pages
    within a MAX_ORDER_NR_PAGES block when we have already checked the validity
    of the block normally with pfn_valid(). This can then be optimised away when
    we do not have holes within a MAX_ORDER_NR_PAGES block of pages.
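
    The idea of a helper that compiles away can be sketched as follows. The
    sketch_ names, the SKETCH_HOLES_IN_ZONE switch and the toy validity map
    (pfns 5 and 6 form a hole) are assumptions for illustration, not the
    kernel definitions.

```c
#include <stdbool.h>

#define SKETCH_HOLES_IN_ZONE 1   /* remove to model a zone without holes */
#define SKETCH_NR_PFNS 16UL

/* Toy stand-in for pfn_valid(): pfns 5 and 6 fall into a hole. */
static bool sketch_pfn_valid(unsigned long pfn)
{
    return pfn < SKETCH_NR_PFNS && pfn != 5 && pfn != 6;
}

#ifdef SKETCH_HOLES_IN_ZONE
/* Holes possible: every pfn inside the block must be checked. */
#define sketch_pfn_valid_within(pfn) sketch_pfn_valid(pfn)
#else
/* No holes: the check is a constant and the compiler can drop it. */
#define sketch_pfn_valid_within(pfn) (true)
#endif

/* Count usable pages in [start, end), a block already validated as a whole. */
static unsigned long sketch_count_block(unsigned long start, unsigned long end)
{
    unsigned long n = 0;

    for (unsigned long pfn = start; pfn < end; pfn++) {
        if (!sketch_pfn_valid_within(pfn))
            continue;   /* skip pages that sit in a hole */
        n++;
    }
    return n;
}
```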

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Acked-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Add proper prototypes in include/linux/slab.h.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Architectures that don't support DMA can say so by adding a config NO_DMA
    to their Kconfig file. This will prevent compilation of some dma specific
    driver code. Also dma-mapping-broken.h isn't needed anymore on at least
    s390. This avoids compilation and linking of otherwise dead/broken code.

    Other architectures that include dma-mapping-broken.h are arm26, h8300,
    m68k, m68knommu and v850. If these could be converted as well we could get
    rid of the header file.

    Signed-off-by: Heiko Carstens
    "John W. Linville"
    Cc: Kyle McMartin
    Cc: Tejun Heo
    Cc: Jeff Garzik
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • If the badness of a process is zero then oom_adj>0 has no effect. This
    patch makes sure that the oom_adj shift actually increases badness points
    appropriately.
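
    Why a zero score defeats the adjustment, and one plausible way around
    it, can be sketched as below. The function name and the exact fix-up
    (bumping zero to one before shifting) are assumptions; the patch itself
    is not shown here.

```c
/*
 * Apply oom_adj to a badness score: a positive value shifts the score
 * left, a negative value shifts it right. Hypothetical helper.
 */
static unsigned long sketch_adjust_badness(unsigned long points, int oom_adj)
{
    if (oom_adj > 0) {
        if (points == 0)
            points = 1;          /* otherwise 0 << n would stay 0 */
        points <<= oom_adj;
    } else if (oom_adj < 0) {
        points >>= -oom_adj;
    }
    return points;
}
```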

    Signed-off-by: Joshua N. Pritikin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joshua N Pritikin
     
  • __block_write_full_page is calling SetPageUptodate without the page locked.
    This is unusual, but not incorrect, as PG_writeback is still set.

    However the next patch will require that SetPageUptodate always be called with
    the page locked. Simply don't bother setting the page uptodate in this case
    (it is unusual that the write path does such a thing anyway). Instead just
    leave it to the read side to bring the page uptodate when it notices that all
    buffers are uptodate.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Ensure pages are uptodate after returning from read_cache_page, which allows
    us to cut out most of the filesystem-internal PageUptodate calls.

    I didn't have a great look down the call chains, but this appears to fix 7
    possible use-before-uptodate cases in hfs, 2 in hfsplus, 1 in jfs, a few in
    ecryptfs, 1 in jffs2, and a possible case of cleared data being overwritten
    by readpage in block2mtd. All depend on whether the filler is async and/or
    can return with a !uptodate page.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin