18 Jul, 2007

38 commits

  • Allocate/release a chunk of vmalloc address space:
    alloc_vm_area reserves a chunk of address space, and makes sure all
    the pagetables are constructed for that address range - but no pages.

    free_vm_area releases the address space range.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Ian Pratt
    Signed-off-by: Christian Limpach
    Signed-off-by: Chris Wright
    Cc: "Jan Beulich"
    Cc: "Andi Kleen"

    Jeremy Fitzhardinge
     
  • Add a kstrndup function, modelled on strndup. Like strndup this
    returns a string copied into its own allocated memory, but it copies
    no more than the specified number of bytes from the source.

    Remove private strndup() from irda code.
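
    The copy semantics described above can be illustrated with a minimal
    userspace sketch. The name dup_at_most() is hypothetical and malloc()
    stands in for the kernel allocator; this is not the kernel implementation.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Userspace stand-in for the described behaviour: copy at most `max`
 * bytes of `s` into freshly allocated, NUL-terminated memory.
 */
char *dup_at_most(const char *s, size_t max)
{
    size_t len = 0;
    char *buf;

    if (!s)
        return NULL;

    /* Length of the source, capped at max; never reads past max bytes. */
    while (len < max && s[len] != '\0')
        len++;

    buf = malloc(len + 1);
    if (!buf)
        return NULL;

    memcpy(buf, s, len);
    buf[len] = '\0';    /* the copy is always terminated */
    return buf;
}
```

    The caller owns the returned buffer and releases it with free().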

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Chris Wright
    Cc: Andrew Morton
    Cc: Randy Dunlap
    Cc: YOSHIFUJI Hideaki
    Cc: Akinobu Mita
    Cc: Arnaldo Carvalho de Melo
    Cc: Al Viro
    Cc: Panagiotis Issaris
    Cc: Rene Scharfe

    Jeremy Fitzhardinge
     
  • Our original NFSv4 delegation policy was to give out a read delegation on any
    open when it was possible to.

    Since the lifetime of a delegation isn't limited to that of an open, a client
    may quite reasonably hang on to a delegation as long as it has the inode
    cached. This becomes an obvious problem the first time a client's inode cache
    approaches the size of the server's total memory.

    Our first quick solution was to add a hard-coded limit. This patch makes a
    mild incremental improvement by varying that limit according to the server's
    total memory size, allowing at most 4 delegations per megabyte of RAM.

    My quick back-of-the-envelope calculation finds that in the worst case (where
    every delegation is for a different inode), a delegation could take about
    1.5K, which would make the worst case usage about 0.6% of memory (4 x 1.5K =
    6K per megabyte). The new limit works out to be about the same as the old on
    a 1-gig server.
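
    The arithmetic behind the limit can be written out as a small sketch.
    The function names are hypothetical; only the two constants (4 delegations
    per megabyte, roughly 1.5K per delegation) come from the message above.

```c
#include <stdint.h>

#define DELEGATIONS_PER_MB 4u     /* limit stated in the commit message */
#define WORST_CASE_BYTES   1536u  /* the message's ~1.5K estimate */

/* Maximum number of delegations for a server with ram_bytes of memory. */
uint64_t max_delegations(uint64_t ram_bytes)
{
    return (ram_bytes >> 20) * DELEGATIONS_PER_MB;
}

/* Worst-case memory consumed if every delegation is for a different inode. */
uint64_t worst_case_usage(uint64_t ram_bytes)
{
    return max_delegations(ram_bytes) * WORST_CASE_BYTES;
}
```

    For a 1 GiB server this gives 4096 delegations and a worst case of 6 MiB.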

    [akpm@linux-foundation.org: Don't needlessly bloat vmlinux]
    [akpm@linux-foundation.org: Make it right for highmem machines]
    Signed-off-by: "J. Bruce Fields"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Meelap Shah
     
  • Currently the export_operation structure and helpers related to it are in
    fs.h. fs.h is already far too large and there are very few places needing the
    export bits, so split them off into a separate header.

    [akpm@linux-foundation.org: fix cifs build]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Neil Brown
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • KSYM_NAME_LEN is peculiar in that it does not include the space for the
    trailing '\0', forcing all users to use KSYM_NAME_LEN + 1 when allocating
    a buffer. This is nonsense and error-prone. Moreover, when the caller
    forgets this, it is very likely to bite back subtly by corrupting the stack,
    because the last position of the buffer is always cleared to zero.

    This patch increments KSYM_NAME_LEN by one and updates code accordingly.

    * off-by-one bug in asm-powerpc/kprobes.h::kprobe_lookup_name() macro
    is fixed.

    * Where MODULE_NAME_LEN and KSYM_NAME_LEN were used together,
    MODULE_NAME_LEN was treated as if it didn't include space for the
    trailing '\0'. Fix it.

    Signed-off-by: Tejun Heo
    Acked-by: Paulo Marques
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • The bounce buffer logic is included on systems that do not need it. If a
    system does not have zones like ZONE_DMA and ZONE_HIGHMEM that can lead to
    the use of bounce buffers then there is no need to reserve memory pools and
    so on. This is true, for example, of SGI Altix.

    Also nicifies the Makefile and gets rid of the tricky "and" there.

    Signed-off-by: Christoph Lameter
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently, the freezer treats all tasks as freezable, except for the kernel
    threads that explicitly set the PF_NOFREEZE flag for themselves. This
    approach is problematic, since it requires every kernel thread to either
    set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't
    care for the freezing of tasks at all.

    It seems better to only require the kernel threads that want to or need to
    be frozen to use some freezer-related code and to remove any
    freezer-related code from the other (nonfreezable) kernel threads, which is
    done in this patch.

    The patch causes all kernel threads to be nonfreezable by default (ie. to
    have PF_NOFREEZE set by default) and introduces the set_freezable()
    function that should be called by the freezable kernel threads in order to
    unset PF_NOFREEZE. It also makes all of the currently freezable kernel
    threads call set_freezable(), so it shouldn't cause any (intentional)
    change of behaviour to appear. Additionally, it updates documentation to
    describe the freezing of tasks more accurately.

    [akpm@linux-foundation.org: build fixes]
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Nigel Cunningham
    Cc: Pavel Machek
    Cc: Oleg Nesterov
    Cc: Gautham R Shenoy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • It is a bug to set a page dirty if it is not uptodate unless it has
    buffers. If the page has buffers, then the page may be dirty (some buffers
    dirty) but not uptodate (some buffers not uptodate). The exception to this
    rule is if the set_page_dirty caller is racing with truncate or invalidate.

    A buffer can not be set dirty if it is not uptodate.

    If either of these situations occurs, it indicates there could be some data
    loss problem. Some of these warnings could be harmless, where the
    page or buffer is set uptodate immediately after it is dirtied; however, we
    should fix those up, and enforce this ordering.

    Bring the order of operations for truncate into line with those of
    invalidate. This will prevent a page from being able to go !uptodate while
    we're holding the tree_lock, which is probably a good thing anyway.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • We currently cannot disable CONFIG_SLUB_DEBUG for CONFIG_NUMA. Now that
    embedded systems start to use NUMA we may need this.

    Put an #ifdef around places where NUMA only code uses fields only valid
    for CONFIG_SLUB_DEBUG.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Sysfs can do a gazillion things when called. Make sure that we do not call
    any sysfs functions while holding the slub_lock.

    Just protect the essentials:

    1. The list of all slab caches
    2. The kmalloc_dma array
    3. The ref counters of the slabs.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The objects per slab increase with the current patches in mm since we allow up
    to order 3 allocs by default. More patches in mm actually allow the use of
    slabs of 2M or larger. For slab validation we need per-object bitmaps in order
    to check a slab. We end up with up to 64k objects per slab resulting in a
    potential requirement of 8K stack space. That does not look good.

    Allocate the bit arrays via kmalloc.
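
    The 8K figure follows from one bit per object: 65536 objects / 8 bits per
    byte = 8192 bytes. A minimal userspace sketch of a heap-allocated
    per-object bitmap, with hypothetical helper names and calloc() standing in
    for kmalloc:

```c
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap holding one bit per object. */
size_t bitmap_bytes(size_t nr_objects)
{
    size_t words = (nr_objects + BITS_PER_WORD - 1) / BITS_PER_WORD;

    return words * sizeof(unsigned long);
}

/* Allocate a zeroed bitmap on the heap instead of on the stack. */
unsigned long *alloc_object_bitmap(size_t nr_objects)
{
    return calloc(1, bitmap_bytes(nr_objects));
}

void mark_object(unsigned long *map, size_t idx)
{
    map[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

int object_marked(const unsigned long *map, size_t idx)
{
    return (map[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}
```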

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
    variant in the past. But with __GFP_ZERO it is possible now to do zeroing
    while allocating.

    Use __GFP_ZERO to remove the explicit clearing of memory via memset wherever
    we can.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • It now becomes easy to support the zeroing allocs with generic inline
    functions in slab.h. Provide inline definitions to allow the continued use of
    kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing
    functions from the slab allocators and util.c.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We can get to the length of the object through the kmem_cache structure. The
    additional parameter does no good and causes the compiler to generate bad
    code.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Do proper spacing and we only need to do this in steps of 8.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Signed-off-by: Adrian Bunk
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • There is no need to calculate the dma slab size ourselves. We can simply
    look up the size of the corresponding non dma slab.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • kmalloc_index is a long series of comparisons. The attempt to replace
    kmalloc_index with something more efficient like ilog2 failed due to compiler
    issues with constant folding on gcc 3.3 / powerpc.

    kmalloc_index()'s long list of comparisons works fine for constant folding
    since all the comparisons are optimized away. However, SLUB also uses
    kmalloc_index to determine the slab to use for the __kmalloc_xxx functions.
    This leads to a large set of comparisons in get_slab().

    This patch gets rid of that list of comparisons in get_slab():

    1. If the requested size is larger than 192 then we can simply use
    fls to determine the slab index since all larger slabs are
    of the power of two type.

    2. If the requested size is smaller then we cannot use fls since there
    are non power of two caches to be considered. However, the sizes are
    in a manageable range. So we divide the size by 8. Then we have only
    24 possibilities left and then we simply look up the kmalloc index
    in a table.

    Code size of slub.o decreases by more than 200 bytes through this patch.
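
    The two-step lookup described above can be sketched as follows. This is a
    userspace illustration under stated assumptions: the index numbering
    (1 = 96 bytes, 2 = 192 bytes, n = 2^n bytes for n >= 3), the table
    contents, and the names fls_sketch() and kmalloc_index_sketch() are
    hypothetical and chosen only to show the technique.

```c
#include <stddef.h>

/* Table for sizes 1..192, indexed by (size - 1) / 8: 24 entries. */
static const signed char size_index[24] = {
    3,                      /* 8 */
    4,                      /* 16 */
    5, 5,                   /* 24, 32 */
    6, 6, 6, 6,             /* 40 .. 64 */
    1, 1, 1, 1,             /* 72 .. 96   (non power of two cache) */
    7, 7, 7, 7,             /* 104 .. 128 */
    2, 2, 2, 2, 2, 2, 2, 2  /* 136 .. 192 (non power of two cache) */
};

/* Position of the highest set bit, counting from 1; 0 if x is 0. */
static int fls_sketch(unsigned long x)
{
    int bit = 0;

    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Returns the cache index for a request of `size` bytes, 0 for size 0. */
int kmalloc_index_sketch(size_t size)
{
    if (size == 0)
        return 0;
    if (size <= 192)
        return size_index[(size - 1) / 8];  /* small: table lookup */
    return fls_sketch(size - 1);            /* large: power of two */
}
```

    Using fls on size - 1 rounds up to the next power of two, so a request of
    exactly 256 bytes maps to index 8 while 257 bytes maps to index 9.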

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We modify the kmalloc_cache_dma[] array without proper locking. Do the proper
    locking and undo the dma cache creation if another processor has already
    created it.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The rarely used dma functionality in get_slab() makes the function too
    complex. The compiler begins to spill variables from the working set onto the
    stack. The created function is only used in extremely rare cases so make sure
    that the compiler does not decide on its own to merge it back into get_slab().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add #ifdefs around data structures only needed if debugging is compiled into
    SLUB.

    Add inlines to small functions to reduce code size.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • A kernel convention for many allocators is that if __GFP_ZERO is passed to an
    allocator then the allocated memory should be zeroed.

    This is currently not supported by the slab allocators. The inconsistency
    makes it difficult to implement in derived allocators such as in the uncached
    allocator and the pool allocators.

    In addition, the support for zeroed allocations in the slab allocators does
    not have a consistent API. There are no zeroing allocator functions for NUMA
    node placement (kmalloc_node, kmem_cache_alloc_node). The zeroing allocations
    are only provided for default allocs (kzalloc, kmem_cache_zalloc).
    __GFP_ZERO will make zeroing universally available and does not require any
    additional functions.

    So add the necessary logic to all slab allocators to support __GFP_ZERO.

    The code is added to the hot path. The gfp flags are on the stack and so the
    cacheline is readily available for checking if we want a zeroed object.

    Zeroing while allocating is now a frequent operation and we seem to be
    gradually approaching a 1-1 parity between zeroing and not zeroing allocs.
    The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc.
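
    As a rough illustration of a flag-driven zeroing path, here is a userspace
    sketch. SKETCH_GFP_ZERO, sketch_alloc() and sketch_zalloc() are
    hypothetical names, and malloc() stands in for the slab allocator; the
    point is only that one flag check in the allocation path makes a separate
    family of zeroing functions unnecessary.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_GFP_ZERO 0x1u    /* hypothetical flag value */

/* Allocate `size` bytes; zero them only if the caller asked for it. */
void *sketch_alloc(size_t size, unsigned int flags)
{
    void *obj = malloc(size);

    if (obj && (flags & SKETCH_GFP_ZERO))
        memset(obj, 0, size);
    return obj;
}

/* A zeroing variant becomes a thin inline wrapper that ORs in the flag. */
static inline void *sketch_zalloc(size_t size, unsigned int flags)
{
    return sketch_alloc(size, flags | SKETCH_GFP_ZERO);
}
```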

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the
    allocators. Move ZERO_SIZE_PTR related stuff into slab.h.

    Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
    WARN_ON_ONCE(size == 0) that is still remaining in SLAB.

    Make slub return NULL like the other allocators if a too large memory segment
    is requested via __kmalloc.
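
    One way such a macro can work is to compare the pointer against a small
    sentinel address, so that a single test covers both NULL and the
    zero-size sentinel. The sketch below is a userspace illustration; the
    sentinel value 16 and the helper free_is_noop() are assumptions made for
    the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sentinel returned for zero-sized requests (value assumed here). */
#define ZERO_SIZE_PTR ((void *)16)

/* True for NULL and for the zero-size sentinel, with one comparison. */
#define ZERO_OR_NULL_PTR(x) ((uintptr_t)(x) <= (uintptr_t)ZERO_SIZE_PTR)

/* Hypothetical helper: would a free routine have nothing to do for p? */
int free_is_noop(const void *p)
{
    return ZERO_OR_NULL_PTR(p);
}
```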

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The size of a kmalloc object is readily available via ksize(). ksize is
    provided by all allocators and thus we can implement krealloc in a generic
    way.

    Implement krealloc in mm/util.c and drop slab specific implementations of
    krealloc.
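
    The idea of building a generic reallocation on top of a size query can be
    sketched in userspace. The toy allocator below (obj_alloc, obj_size,
    obj_free, obj_realloc) is hypothetical: it keeps the size in a header so
    that obj_size() can play the role ksize() plays in the text, and it does
    not handle a new size of zero specially.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Header placed in front of every object; the union keeps alignment. */
union obj_hdr {
    size_t size;
    max_align_t align;
};

void *obj_alloc(size_t size)
{
    union obj_hdr *h = malloc(sizeof(*h) + size);

    if (!h)
        return NULL;
    h->size = size;
    return h + 1;
}

/* Size query: the only allocator-specific knowledge realloc needs. */
size_t obj_size(const void *p)
{
    return p ? ((const union obj_hdr *)p - 1)->size : 0;
}

void obj_free(void *p)
{
    if (p)
        free((union obj_hdr *)p - 1);
}

/* Generic realloc built only from alloc, size query, copy and free. */
void *obj_realloc(void *p, size_t new_size)
{
    size_t old_size = obj_size(p);
    void *n;

    if (new_size <= old_size)
        return p;               /* existing object is large enough */

    n = obj_alloc(new_size);
    if (!n)
        return NULL;            /* original object is left untouched */
    if (p) {
        memcpy(n, p, old_size);
        obj_free(p);
    }
    return n;
}
```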

    Signed-off-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The function we are calling to initialize object debug state during early NUMA
    bootstrap sets up an inactive object giving it the wrong redzone signature.
    The bootstrap nodes are active objects and should have active redzone
    signatures.

    Currently slab validation complains and reverts the object to active state.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Currently SLUB has no provision to deal with too high page orders that may
    be specified on the kernel boot line. If an order higher than 6 (on a 4k
    platform) is generated then we will BUG() because slabs get more than 65535
    objects.

    Add some logic that decreases order for slabs that have too many objects.
    This allows booting with slab sizes up to MAX_ORDER.

    For example

    slub_min_order=10

    will boot with a default slab size of 4M and reduce slab sizes for small
    object sizes to lower orders if the number of objects becomes too big.
    Large slab sizes like that allow a concentration of objects of the same
    slab cache under as few as possible TLB entries and thus potentially
    reduces TLB pressure.
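
    The order reduction can be pictured with a short sketch, assuming 4k
    pages and the 65535 object limit mentioned above. The function name and
    the simple decrement loop are illustrative assumptions, not the actual
    SLUB code.

```c
#define SKETCH_PAGE_SIZE     4096UL   /* 4k platform, as in the text */
#define MAX_OBJECTS_PER_SLAB 65535UL  /* limit mentioned in the text */

/*
 * Lower the requested order until the slab no longer holds more
 * objects than the limit allows. object_size must be non-zero.
 */
int clamp_slab_order(int order, unsigned long object_size)
{
    while (order > 0 &&
           (SKETCH_PAGE_SIZE << order) / object_size > MAX_OBJECTS_PER_SLAB)
        order--;
    return order;
}
```

    With slub_min_order=10, 8-byte objects would end up at order 6 (32768
    objects per slab) while 4096-byte objects keep order 10 (1024 objects).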

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We currently have to do a GFP_ATOMIC allocation because the list_lock is
    already taken when we first allocate memory for tracking allocation
    information. It would be better if we could avoid atomic allocations.

    Allocate a size of the tracking table that is usually sufficient (one page)
    before we take the list lock. We will then only do the atomic allocation
    if we need to resize the table to become larger than a page (mostly only
    needed under large NUMA because of the tracking of cpus and nodes otherwise
    the table stays small).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Use list_for_each_entry() instead of list_for_each().

    Get rid of for_all_slabs(). It had only one user. So fold it into the
    callback. This also gets rid of cpu_slab_flush.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Changes the error reporting format to loosely follow lockdep.

    If data corruption is detected then we generate the following lines:

    ============================================
    BUG :
    --------------------------------------------

    INFO: [possibly multiple times]

    FIX :

    This also adds some more intelligence to the data corruption detection. It's
    now capable of figuring out the start and end.

    Add a comment on how to configure SLUB so that a production system may
    continue to operate even though occasional slab corruption occurs through
    a misbehaving kernel component. See "Emergency operations" in
    Documentation/vm/slub.txt.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • I can never remember what the function to register to receive VM pressure
    is called. I have to trace down from __alloc_pages() to find it.

    It's called "set_shrinker()", and it needs Your Help.

    1) Don't hide struct shrinker. It contains no magic.
    2) Don't allocate "struct shrinker". It's not helpful.
    3) Call them "register_shrinker" and "unregister_shrinker".
    4) Call the function "shrink" not "shrinker".
    5) Reduce the 17 lines of waffly comments to 13, but document it properly.
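
    The shape of the interface proposed in the list above can be modelled in
    userspace. This is a simplified sketch: the struct fields, the callback
    signature, shrink_all() and the demo cache are assumptions made for
    illustration and do not reproduce the kernel definitions.

```c
#include <stddef.h>

/* Caller-owned and fully visible: nothing hidden, nothing allocated. */
struct shrinker {
    int (*shrink)(int nr_to_scan);  /* simplified, hypothetical signature */
    struct shrinker *next;
};

static struct shrinker *shrinker_list;

void register_shrinker(struct shrinker *s)
{
    s->next = shrinker_list;
    shrinker_list = s;
}

void unregister_shrinker(struct shrinker *s)
{
    struct shrinker **pp;

    for (pp = &shrinker_list; *pp; pp = &(*pp)->next) {
        if (*pp == s) {
            *pp = s->next;
            s->next = NULL;
            return;
        }
    }
}

/* Stand-in for memory pressure: ask every registered shrinker to scan. */
int shrink_all(int nr_to_scan)
{
    struct shrinker *s;
    int remaining = 0;

    for (s = shrinker_list; s; s = s->next)
        remaining += s->shrink(nr_to_scan);
    return remaining;
}

/* Demo cache: drops up to nr_to_scan objects, returns how many are left. */
static int demo_cached = 10;

static int demo_shrink(int nr_to_scan)
{
    if (nr_to_scan > demo_cached)
        nr_to_scan = demo_cached;
    demo_cached -= nr_to_scan;
    return demo_cached;
}
```

    A user embeds or declares a struct shrinker, sets its shrink callback,
    and passes its address to register_shrinker(); no allocation is needed.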

    Signed-off-by: Rusty Russell
    Cc: David Chinner
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • When we are out of memory of a suitable size we enter reclaim. The current
    reclaim algorithm targets pages in LRU order, which is great for fairness at
    order-0 but highly unsuitable if you desire pages at higher orders. To get
    pages of higher order we must shoot down a very high proportion of memory;
    >95% in a lot of cases.

    This patch set adds a lumpy reclaim algorithm to the allocator. It targets
    groups of pages at the specified order anchored at the end of the active and
    inactive lists. This encourages groups of pages at the requested orders to
    move from active to inactive, and active to free lists. This behaviour is
    only triggered out of direct reclaim when higher order pages have been
    requested.

    This patch set is particularly effective when utilised with an
    anti-fragmentation scheme which groups pages of similar reclaimability
    together.

    This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
    the foundation. Credit to Mel Gorman for sanity checking.

    Mel said:

    The patches have an application with hugepage pool resizing.

    When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can
    be resized with greater reliability. Testing on a desktop machine with 2GB
    of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own
    was very slow as the success rate was quite low. Without lumpy-reclaim,
    each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
    With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.

    [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
    [bunk@stusta.de: static declarations for internal functions]
    [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
    Signed-off-by: Andy Whitcroft
    Acked-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • This patch adds a new parameter for sizing ZONE_MOVABLE called
    movablecore=. While kernelcore= is used to specify the minimum amount of
    memory that must be available for all allocation types, movablecore= is
    used to specify the minimum amount of memory that is used for migratable
    allocations. The amount of memory used for migratable allocations
    determines how large the huge page pool could be dynamically resized to at
    runtime for example.

    How movablecore is actually handled is that the total number of pages in
    the system is calculated and a value is set for kernelcore that is

    kernelcore == totalpages - movablecore

    Both kernelcore= and movablecore= can be safely specified at the same time.
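
    The relation above can be written as a small sketch. The function names
    are hypothetical, and two details are assumptions not stated in the
    message: clamping to zero when movablecore exceeds the total, and taking
    the larger kernelcore value when both parameters are given.

```c
/* kernelcore == totalpages - movablecore, clamped at zero (assumption). */
unsigned long kernelcore_from_movablecore(unsigned long totalpages,
                                          unsigned long movablecore)
{
    if (movablecore >= totalpages)
        return 0;
    return totalpages - movablecore;
}

/*
 * Combine both parameters (0 means "not specified"). Taking the larger
 * kernelcore is an assumed policy for this sketch.
 */
unsigned long effective_kernelcore(unsigned long totalpages,
                                   unsigned long kernelcore,
                                   unsigned long movablecore)
{
    unsigned long derived = 0;

    if (movablecore)
        derived = kernelcore_from_movablecore(totalpages, movablecore);
    return kernelcore > derived ? kernelcore : derived;
}
```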

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds the kernelcore= parameter for x86.

    Once all patches are applied, a new command-line parameter and a new
    sysctl exist. This patch adds the necessary documentation.

    From: Yasunori Goto

    When the "kernelcore" boot option is specified, the kernel can't boot on ia64
    because of an infinite loop. In addition, the parsing code can be handled
    in an architecture-independent manner.

    This patch uses common code to handle the kernelcore= parameter. It is
    only available to architectures that support arch-independent zone-sizing
    (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will
    ignore the boot parameter.

    [bunk@stusta.de: make cmdline_parse_kernelcore() static]
    Signed-off-by: Mel Gorman
    Signed-off-by: Yasunori Goto
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
    as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
    can be used to satisfy hugepage allocations even when the system has been
    running a long time. This allows an administrator to resize the hugepage pool
    at runtime depending on the size of ZONE_MOVABLE.

    This patch adds a new sysctl called hugepages_treat_as_movable. When a
    non-zero value is written to it, future allocations for the huge page pool
    will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
    introduce additional external fragmentation of note as huge pages are always
    the largest contiguous block we care about.

    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
    that is only usable by allocations that specify both __GFP_HIGHMEM and
    __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
    single memory partition while allowing movable allocations to be satisfied
    from either partition. The patches may be applied with the list-based
    anti-fragmentation patches that group pages together based on mobility.

    The size of the zone is determined by a kernelcore= parameter specified at
    boot-time. This specifies how much memory is usable by non-movable
    allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
    within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

    When selecting a zone to take pages from for ZONE_MOVABLE, there are two
    things to consider. First, only memory from the highest populated zone is
    used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
    but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
    the amount of memory usable by the kernel will be spread evenly throughout
    NUMA nodes where possible. If the nodes are not of equal size, the amount of
    memory usable by the kernel on some nodes may be greater than others.

    By default, the zone is not as useful for hugetlb allocations because they are
    pinned and non-migratable (currently at least). A sysctl is provided that
    allows huge pages to be allocated from that zone. This means that the huge
    page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
    the system assuming that pages are not mlocked. Despite huge pages being
    non-movable, we do not introduce additional external fragmentation of note as
    huge pages are always the largest contiguous block we care about.

    Credit goes to Andy Whitcroft for catching a large variety of problems during
    review of the patches.

    This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
    by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
    memory continues to be placed in its existing destination as there is no
    mechanism to redirect it to a specific zone.
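
    The "both flags" rule can be expressed as a simple mask test. The flag
    values and the function name below are hypothetical and chosen only for
    the sketch.

```c
#define SKETCH_GFP_HIGHMEM 0x02u  /* hypothetical bit values */
#define SKETCH_GFP_MOVABLE 0x08u

/* Only allocations carrying both flags may be served from ZONE_MOVABLE. */
int may_use_zone_movable(unsigned int gfp_flags)
{
    const unsigned int both = SKETCH_GFP_HIGHMEM | SKETCH_GFP_MOVABLE;

    return (gfp_flags & both) == both;
}
```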

    [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Yasunori Goto
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is often known at allocation time whether a page may be migrated or not.
    This patch adds a flag called __GFP_MOVABLE and a new mask called
    GFP_HIGH_MOVABLE. Allocations using __GFP_MOVABLE can be either migrated
    using the page migration mechanism or reclaimed by syncing with backing
    storage and discarding.

    An API function very similar to alloc_zeroed_user_highpage() is added for
    __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
    flags used by alloc_zeroed_user_highpage() are not changed because it would
    change the semantics of an existing API. After this patch is applied there
    are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
    be marked deprecated if this patch is merged.

    Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
    shmem.c to keep all flag modifications to inode->mapping in the
    shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
    Hugh Dickins.

    Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
    concept. Credit to Hugh Dickins for catching issues with shmem swap vector
    and ramfs allocations.

    [akpm@linux-foundation.org: build fix]
    [hugh@veritas.com: __GFP_ZERO cleanup]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • do_generic_mapping_read currently samples the i_size at the start and doesn't
    do so again unless it needs to call ->readpage to load a page. After
    ->readpage it has to re-sample i_size as a truncate may have caused that page
    to be filled with zeros, and the read() call should not see these.

    However there are other activities that might cause ->readpage to be called on
    a page between the time that do_generic_mapping_read samples i_size and when
    it finds that it has an uptodate page. These include at least read-ahead and
    possibly another thread performing a read.

    So do_generic_mapping_read must sample i_size *after* it has an uptodate page.
    Thus the current sampling at the start and after a read can be replaced with
    a sampling before the copy-out.

    The same change is applied to __generic_file_splice_read.

    Note that this fixes any race with truncate_complete_page, but does not fix a
    possible race with truncate_partial_page. If a partial truncate happens after
    do_generic_mapping_read samples i_size and before the copy_out, the nuls that
    truncate_partial_page places in the page could be copied out incorrectly.

    I think the best fix for that is to *not* zero out parts of the page in
    truncate_partial_page, but rather to zero out the tail of a page when
    increasing i_size.

    Signed-off-by: Neil Brown
    Cc: Jens Axboe
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

17 Jul, 2007

2 commits

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (209 commits)
    [POWERPC] Create add_rtc() function to enable the RTC CMOS driver
    [POWERPC] Add H_ILLAN_ATTRIBUTES hcall number
    [POWERPC] xilinxfb: Parameterize xilinxfb platform device registration
    [POWERPC] Oprofile support for Power 5++
    [POWERPC] Enable arbitrary speed tty ioctls and split input/output speed
    [POWERPC] Make drivers/char/hvc_console.c:khvcd() static
    [POWERPC] Remove dead code for preventing pread() and pwrite() calls
    [POWERPC] Remove unnecessary #undef printk from prom.c
    [POWERPC] Fix typo in Ebony default DTS
    [POWERPC] Check for NULL ppc_md.init_IRQ() before calling
    [POWERPC] Remove extra return statement
    [POWERPC] pasemi: Don't auto-select CONFIG_EMBEDDED
    [POWERPC] pasemi: Rename platform
    [POWERPC] arch/powerpc/kernel/sysfs.c: Move NUMA exports
    [POWERPC] Add __read_mostly support for powerpc
    [POWERPC] Modify sched_clock() to make CONFIG_PRINTK_TIME more sane
    [POWERPC] Create a dummy zImage if no valid platform has been selected
    [POWERPC] PS3: Bootwrapper support.
    [POWERPC] powermac i2c: Use mutex
    [POWERPC] Schedule removal of arch/ppc
    ...

    Fixed up conflicts manually in:

    Documentation/feature-removal-schedule.txt
    arch/powerpc/kernel/pci_32.c
    arch/powerpc/kernel/pci_64.c
    include/asm-powerpc/pci.h

    and asked the powerpc people to double-check the result.

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6: (68 commits)
    sh: sh-rtc support for SH7709.
    sh: Revert __xdiv64_32 size change.
    sh: Update r7785rp defconfig.
    sh: Export div symbols for GCC 4.2 and ST GCC.
    sh: fix race in parallel out-of-tree build
    sh: Kill off dead mach.c for hp6xx.
    sh: hd64461.h cleanup and added comments.
    sh: Update the alignment when 4K stacks are used.
    sh: Add a .bss.page_aligned section for 4K stacks.
    sh: Don't let SH-4A clobber SH-4 CFLAGS.
    sh: Add parport stub for SuperIO ports.
    sh: Drop -Wa,-dsp for DSP tuning.
    sh: Update dreamcast defconfig.
    fb: pvr2fb: A few more __devinit annotations for PCI.
    fb: pvr2fb: Fix up section mismatch warnings.
    sh: Select IPR-IRQ for SH7091.
    sh: Correct __xdiv64_32/div64_32 return value size.
    sh: Fix timer-tmu build for SH-3.
    sh: Add cpu and mach links to CLEAN_FILES.
    sh: Preliminary support for the SH-X3 CPU.
    ...

    Linus Torvalds