03 Jun, 2006

1 commit

  • mm/slab.c's offslab_limit logic is totally broken.

    Firstly, "offslab_limit" is a global variable while it should either be
    calculated in situ or should be passed in as a parameter.

    Secondly, the more serious problem with it is that the condition for
    calculating it:

    if (!(OFF_SLAB(sizes->cs_cachep))) {
            offslab_limit = sizes->cs_size - sizeof(struct slab);
            offslab_limit /= sizeof(kmem_bufctl_t);

    is in total disconnect with the condition that makes use of it:

    /* More than offslab_limit objects will cause problems */
    if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit)
            break;

    but due to offslab_limit being a global variable this breakage was
    hidden.

    It stayed hidden until lockdep came along and perturbed the slab sizes
    sufficiently, so that the first off-slab cache saw a (never-calculated)
    zero value for offslab_limit and panicked with:

    kmem_cache_create: couldn't create cache size-512.

    Call Trace:
    [] show_trace+0x96/0x1c8
    [] dump_stack+0x13/0x15
    [] panic+0x39/0x21a
    [] kmem_cache_create+0x5a0/0x5d0
    [] kmem_cache_init+0x193/0x379
    [] start_kernel+0x17f/0x218
    [] _sinittext+0x263/0x26a

    Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512'

    Paolo Ornati's config on x86_64 managed to trigger it.

    The fix is to move the calculation to the place that makes use of it.
    This also makes slab.o 54 bytes smaller.
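
    A sketch of what the in-place form looks like, recombining the two
    excerpts above (a sketch only, using the flags/num/size variables assumed
    to be in scope at that point; not necessarily the exact upstream change):

    if (flags & CFLGS_OFF_SLAB) {
            unsigned int offslab_limit;

            /* The slab descriptor is kept off-slab, so the number of
             * objects per slab is bounded by how many kmem_bufctl_t
             * entries fit alongside the struct slab in that allocation. */
            offslab_limit = size - sizeof(struct slab);
            offslab_limit /= sizeof(kmem_bufctl_t);

            if (num > offslab_limit)
                    break;
    }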

    Btw., the check itself is quite silly. Its intention is to test whether
    the number of objects per slab would exceed the number of slab control
    pointers that can fit. In theory it could still be triggered, e.g. if
    someone created a cache of 4-byte objects and explicitly requested
    CFLGS_OFF_SLAB, so I kept the check.

    Out of historic interest I checked how old this bug is, and it's
    ancient: 10 years old! It is the oldest hidden-and-then-truly-triggering
    bug I have ever seen fixed in the kernel.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jun, 2006

1 commit

  • From: Yasunori Goto

    If hot-added memory's address is lower than the existing zone's range,
    spanned_pages is not updated. This must be fixed.

    Example: the old zone_start_pfn is 0x60000 and spanned_pages is 0x10000.
    New memory is added with start_pfn = 0x50000 and end_pfn = 0x60000.

    With the old code the new spanned_pages remains 0x10000 (it should become
    0x20000), because old_zone_end_pfn is 0x70000 and the new end_pfn is
    smaller than that, so spanned_pages is never updated.

    In the current code, spanned_pages is updated only when end_pfn grows.
    Instead, it should be computed as the difference between the larger
    end_pfn and the new zone_start_pfn.
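
    A minimal, self-contained sketch of the corrected calculation (the struct
    and helper below are illustrative stand-ins, not the kernel's own
    definitions):

    #include <stdio.h>

    /* Hypothetical stand-in for the two struct zone fields involved. */
    struct zone_span {
            unsigned long zone_start_pfn;
            unsigned long spanned_pages;
    };

    /* Recompute spanned_pages from whichever end pfn is larger and the
     * (possibly lowered) zone start, so adding memory below the old start
     * also grows the span. */
    static void grow_zone_span(struct zone_span *z,
                               unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long old_end = z->zone_start_pfn + z->spanned_pages;

            if (start_pfn < z->zone_start_pfn)
                    z->zone_start_pfn = start_pfn;

            z->spanned_pages = (end_pfn > old_end ? end_pfn : old_end)
                               - z->zone_start_pfn;
    }

    int main(void)
    {
            /* The example from the description above. */
            struct zone_span z = { 0x60000, 0x10000 };

            grow_zone_span(&z, 0x50000, 0x60000);
            printf("start=0x%lx spanned=0x%lx\n",
                   z.zone_start_pfn, z.spanned_pages); /* 0x50000, 0x20000 */
            return 0;
    }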

    Signed-off-by: Yasunori Goto
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasunori Goto
     

22 May, 2006

3 commits

  • Andy added code to the buddy allocator which does not require the zone's
    endpoints to be aligned to MAX_ORDER. An issue is that the buddy
    allocator requires the node_mem_map's endpoints to be MAX_ORDER aligned.
    Otherwise __page_find_buddy could compute a buddy not in node_mem_map for
    partial MAX_ORDER regions at the zone's endpoints. page_is_buddy will
    detect that these pages at the endpoints are not PG_buddy (they were
    zeroed out by the bootmem allocator and are not part of the zone). Of
    course the negative here is that we could waste a little memory, but the
    positive is eliminating all the old checks for zone boundary conditions.
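
    The buddy of a page is found by flipping a single bit of its index within
    mem_map, which is why a partial MAX_ORDER block at an endpoint can yield
    a buddy index past the end of the map. A minimal illustration (not the
    kernel's actual code):

    #include <stdio.h>

    /* Buddy of the page at index page_idx for a given order: flip bit
     * 'order' of the index. */
    static unsigned long buddy_idx(unsigned long page_idx, unsigned int order)
    {
            return page_idx ^ (1UL << order);
    }

    int main(void)
    {
            /* A map covering pages 0..9: the order-3 buddy of page 2 is
             * page 10, which lies beyond the end of the map. */
            printf("%lu\n", buddy_idx(2, 3));
            return 0;
    }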

    SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
    when SPARSEMEM is configured. ia64 VIRTUAL_MEM_MAP doesn't need the logic
    either because the holes and endpoints are handled differently. This
    leaves checking alloc_remap and other arches which privately allocate for
    node_mem_map.

    Signed-off-by: Bob Picco
    Acked-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Picco
     
  • Fix a couple of infrequently encountered 'sleeping function called from
    invalid context' warnings in the cpuset hooks in __alloc_pages(), which
    could sleep while interrupts were disabled.

    The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c
    __alloc_pages() to determine if a zone is allowed in the current task's
    cpuset. This routine can sleep, for certain GFP_KERNEL allocations, if
    the zone is on a memory node not allowed in the current cpuset but
    possibly allowed in a parent cpuset.

    But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
    GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).

    The rule was intended to be:
    Don't call cpuset_zone_allowed() if you can't sleep, unless you
    pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
    the code that might scan up ancestor cpusets and sleep.
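
    Schematically, the rule amounts to a guard like the following at each
    call site (illustrative only, using the flag and function names above;
    not the exact __alloc_pages() code):

    int allowed;

    if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
            /* Atomic context: __GFP_HARDWALL keeps cpuset_zone_allowed()
             * from scanning up ancestor cpusets, so it cannot sleep. */
            allowed = cpuset_zone_allowed(zone, gfp_mask | __GFP_HARDWALL);
    else
            allowed = cpuset_zone_allowed(zone, gfp_mask);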

    This rule was being violated in a couple of places, due to a bogus change
    made (by myself, pj) to __alloc_pages() as part of the November 2005
    effort to clean up its logic, and also due to a later fix to constrain
    which swap daemons were awoken.

    The bogus change can be seen at:
    http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
    [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags

    This was first noticed on a tight memory system, in code that was disabling
    interrupts and doing allocation requests with __GFP_WAIT not set, which
    resulted in __might_sleep() writing complaints to the log "Debug: sleeping
    function called ...", when the code in cpuset_zone_allowed() tried to take
    the callback_sem cpuset semaphore.

    We haven't seen a system hang on this 'might_sleep' yet, but we are at
    decent risk of seeing it fairly soon, especially since the additional
    cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
    March 2006.

    Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
    to Nick Piggin who warned me of this back in Nov 2005, before I was ready
    to listen.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • A bad calculation/loop in __section_nr() could result in incorrect section
    information being put into sysfs memory entries. This primarily impacts
    memory add operations as the sysfs information is used while onlining new
    memory.

    Fix suggested by Dave Hansen.

    Note that the bug may not be obvious from the patch. It actually occurs in
    the function's return statement:

    return (root_nr * SECTIONS_PER_ROOT) + (ms - root);

    In the existing code, root_nr has already been multiplied by
    SECTIONS_PER_ROOT.
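
    A sketch of the corrected function, with root_nr kept as a plain root
    index so the scaling by SECTIONS_PER_ROOT happens exactly once, in the
    return statement (illustrative; details may differ from the actual
    patch):

    int __section_nr(struct mem_section *ms)
    {
            unsigned long root_nr;
            struct mem_section *root;

            for (root_nr = 0; root_nr < NR_SECTION_ROOTS; root_nr++) {
                    root = __nr_to_section(root_nr * SECTIONS_PER_ROOT);
                    if (!root)
                            continue;
                    if ((ms >= root) && (ms < (root + SECTIONS_PER_ROOT)))
                            break;
            }

            return (root_nr * SECTIONS_PER_ROOT) + (ms - root);
    }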

    Signed-off-by: Mike Kravetz
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

16 May, 2006

3 commits

  • With CONFIG_NUMA set, kmem_cache_destroy() may fail and say "Can't
    free all objects." The problem is caused by sequences such as the
    following (suppose we are on a NUMA machine with two nodes, 0 and 1):

    * Allocate an object from cache on node 0.
    * Free the object on node 1. The object is put into node 1's alien
    array_cache for node 0.
    * Call kmem_cache_destroy(), which ultimately ends up in __cache_shrink().
    * __cache_shrink() does drain_cpu_caches(), which loops through all nodes.
    For each node it drains the shared array_cache and then handles the
    alien array_cache for the other node.

    However this means that node 0's shared array_cache will be drained,
    and then node 1 will move the contents of its alien[0] array_cache
    into that same shared array_cache. node 0's shared array_cache is
    never looked at again, so the objects left there will appear to be in
    use when __cache_shrink() calls __node_shrink() for node 0. So
    __node_shrink() will return 1 and kmem_cache_destroy() will fail.

    This patch fixes this by having drain_cpu_caches() do
    drain_alien_cache() on every node before it does drain_array() on the
    nodes' shared array_caches.
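
    Schematically, the fixed drain order is two passes over the nodes (the
    helpers are named as in the description above, with simplified argument
    lists; a sketch, not the actual code):

    /* Pass 1: flush every node's alien caches first, so nothing is
     * pushed back into a shared array_cache after it has been drained. */
    for_each_online_node(node)
            drain_alien_cache(cachep, node);

    /* Pass 2: only now drain each node's shared array_cache. */
    for_each_online_node(node)
            drain_array(cachep, node);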

    The problem was originally reported by Or Gerlitz.

    Signed-off-by: Roland Dreier
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Roland Dreier
     
  • slab_is_available() indicates slab based allocators are available for use.
    SPARSEMEM code needs to know this as it can be called at various times
    during the boot process.
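
    A typical use in SPARSEMEM setup code looks roughly like this (an
    illustrative sketch, not the exact call sites):

    /* Pick the right allocator for the section's memmap depending on
     * how far boot has progressed. */
    if (slab_is_available())
            section = kmalloc_node(array_size, GFP_KERNEL, nid);
    else
            section = alloc_bootmem_node(NODE_DATA(nid), array_size);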

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=6490, this
    function can experience overflows on 32-bit machines, causing our response to
    changed values of min_free_kbytes to go whacky.

    Fixing it efficiently is all too hard, so fix it with 64-bit math instead.
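
    The shape of the fix is to widen the intermediate product to 64 bits,
    roughly as follows (an illustrative fragment with assumed variable
    names):

    u64 tmp;

    /* pages_min * zone->present_pages can overflow unsigned long on
     * 32-bit machines, so do the proportional split in 64 bits. */
    tmp = (u64)pages_min * zone->present_pages;
    do_div(tmp, lowmem_pages);
    zone->pages_min = tmp;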

    Cc: Ake Sandgren
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

02 May, 2006

3 commits

  • Based on an older patch from Mike Kravetz

    We need to have a mem_map for high addresses in order to make fops->no_page
    work on spufs mem and register files. So far, we have used the
    memory_present() function during early bootup, but that did not work when
    CONFIG_NUMA was enabled.

    We now use the __add_pages() function to add the mem_map when loading the
    spufs module, which is a lot nicer.

    Signed-off-by: Arnd Bergmann
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel H Schopp
     
  • This patch fixes two bugs with the way sparsemem interacts with memory add.
    They are:

    - memory leak if memmap for section already exists

    - calling alloc_bootmem_node() after boot

    These bugs were discovered, and a first cut at the fixes was provided, by
    Arnd Bergmann and Joel Schopp.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Joel Schopp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently we check PageDirty() in order to make the decision to swap out
    the page. However, the dirty information may only be contained in the
    ptes pointing to the page. We need to first unmap the ptes before
    checking for PageDirty(). If the unmap is successful then the page count
    of the page will also be decreased so that pageout() works properly.
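
    Schematically, the ordering in the migration path becomes (a fragment
    using the relevant helpers; not the exact patch):

    /* Unmap the ptes first: this transfers pte dirty bits into the
     * struct page and drops the pte references on the page. */
    if (page_mapped(page) && page->mapping)
            if (try_to_unmap(page, 1) != SWAP_SUCCESS)
                    goto unlock_both;       /* could not unmap, give up */

    /* Only now is PageDirty() a reliable basis for the pageout decision. */
    if (PageDirty(page)) {
            /* write the page out before migrating it */
    }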

    This is a fix necessary for 2.6.17. Without this fix we may migrate dirty
    pages for filesystems without migration functions. Filesystems may keep
    pointers to dirty pages. Migration of dirty pages can result in the
    filesystem keeping pointers to freed pages.

    Unmapping is currently not separated out from removing all the
    references to a page and moving the mapping. Therefore try_to_unmap will
    be called again in migrate_page() if the writeout is successful. However,
    it won't do anything since the ptes are already removed.

    The coming updates to the page migration code will restructure the code
    so that this is no longer necessary.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Apr, 2006

1 commit

  • Basic problem: pages of a shared memory segment can only be migrated once.

    In 2.6.16 through 2.6.17-rc1, shared memory mappings do not have a
    migratepage address space op. Therefore, migrate_pages() falls back to
    default processing. In this path, it will try to pageout() dirty pages.
    Once a shared memory page has been migrated it becomes dirty, so
    migrate_pages() will try to page it out. However, because the page count
    is 3 [cache + current + pte], pageout() will return PAGE_KEEP because
    is_page_cache_freeable() returns false. This will abort all subsequent
    migrations.

    This patch adds a migratepage address space op to shared memory segments to
    avoid taking the default path. We use the "migrate_page()" function
    because it knows how to migrate dirty pages. This allows shared memory
    segment pages to migrate, subject to other conditions such as # pte's
    referencing the page [page_mapcount(page)], when requested.
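
    Concretely, that amounts to wiring the existing generic helper into
    shmem's address_space_operations, roughly like this (a sketch of the
    idea with other fields elided, not the literal diff):

    static struct address_space_operations shmem_aops = {
            .writepage      = shmem_writepage,
            /* ... */
            .migratepage    = migrate_page, /* knows how to move dirty pages */
    };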

    I think this is safe. If we're migrating a shared memory page, then we
    found the page via a page table, so it must be in memory.

    Can be verified with memtoy and the shmem-mbind-test script, both
    available at: http://free.linux.hp.com/~lts/Tools/

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

11 Apr, 2006

11 commits

  • Signed-off-by: Coywolf Qi Hunt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coywolf Qi Hunt
     
  • EXPORT_SYMBOL'ing of a static function is not a good idea.

    Signed-off-by: Adrian Bunk
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch is an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory() in mm/nommu.c.

    When the OVERCOMMIT_GUESS algorithm calculates the number of free pages,
    it subtracts the number of reserved pages from the result of
    nr_free_pages().

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
     
  • This patch is an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory() in mm/mmap.c.

    When the OVERCOMMIT_GUESS algorithm calculates the number of free pages,
    it subtracts the number of reserved pages from the result of
    nr_free_pages().
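
    In other words, the free-page estimate now looks roughly like this (an
    illustrative fragment):

    free += nr_free_pages();
    /* Reserved pages are kept free by the kernel and cannot actually be
     * handed out, so do not count them as available. */
    free -= totalreserve_pages;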

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
     
  • These patches are an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory().

    - why the kernel needed patching

    When the kernel can't allocate anonymous pages in practice, the current
    OVERCOMMIT_GUESS could still return success. This implementation might be
    the cause of OOM kills under memory pressure.

    If Linux runs with page reservation features like
    /proc/sys/vm/lowmem_reserve_ratio and without a swap region, I think
    the OOM kill occurs easily.

    - the overall design approach in the patch

    When the OVERCOMMIT_GUESS algorithm calculates the number of free pages,
    the reserved free pages are regarded as non-free pages.

    This change helps to avoid the pitfall that the number of free pages
    becomes less than the number which the kernel tries to keep free.

    - testing results

    I tested the patches using my test kernel module.

    If the patches aren't applied to the kernel, __vm_enough_memory()
    returns success in this situation but the actual page allocation
    fails.

    On the other hand, if the patches are applied to the kernel, the memory
    allocation failure is avoided since __vm_enough_memory() returns
    failure in this situation.

    I checked that on an i386 SMP 16GB memory machine. I haven't tested in a
    nommu environment yet.

    This patch adds totalreserve_pages for __vm_enough_memory().

    calculate_totalreserve_pages() checks the maximum lowmem_reserve pages
    and pages_high in each zone. Finally, the function stores the sum over
    all zones in totalreserve_pages.

    totalreserve_pages is calculated when the VM is initialized, and the
    variable is updated when /proc/sys/vm/lowmem_reserve_ratio or
    /proc/sys/vm/min_free_kbytes is changed.
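
    A sketch of the per-zone contribution described above (illustrative
    only; the real function iterates over all nodes and zones):

    unsigned long max = 0;
    int j;

    /* The largest lowmem_reserve[] entry for this zone ... */
    for (j = 0; j < MAX_NR_ZONES; j++)
            if (zone->lowmem_reserve[j] > max)
                    max = zone->lowmem_reserve[j];

    /* ... plus its pages_high watermark counts as reserved. */
    totalreserve_pages += max + zone->pages_high;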

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
     
  • - Remove sparse comment

    - Remove duplicated include

    - Return the correct error condition in migrate_page_remove_references().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The code compares newbrk with oldbrk, both of which are page aligned,
    before checking the memory limit set for the data segment. If the memory
    limit is not page aligned, this bypasses the limit test whenever the
    requested allocation stays within the same page.
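
    A sketch of the reordered checks in sys_brk(): verify the RLIMIT_DATA
    limit before taking the same-page shortcut (illustrative; details may
    differ from the actual patch):

    /* Check against the data segment rlimit first ... */
    rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
    if (rlim < RLIM_INFINITY && brk - mm->start_data > rlim)
            goto out;

    /* ... and only then take the shortcut for requests that stay within
     * the currently mapped page. */
    newbrk = PAGE_ALIGN(brk);
    oldbrk = PAGE_ALIGN(mm->brk);
    if (oldbrk == newbrk)
            goto set_brk;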

    Signed-off-by: Ram Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ram Gupta
     
  • The earlier patch to consolidate mmu and nommu page allocation and
    refcounting by using compound pages for nommu allocations had a bug:
    kmalloc slabs whose pages were initially allocated by a non-__GFP_COMP
    allocator could be passed into mm/nommu.c kmalloc allocations which
    really wanted __GFP_COMP underlying pages. Fix that by having nommu pass
    __GFP_COMP to all higher order slab allocations.

    Signed-off-by: Luke Yang
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luke Yang
     
  • Add a statistics counter which is incremented every time the alien cache
    overflows. The alien_cache limit is hardcoded to 12 right now. We can use
    these statistics to tune the alien cache if needed in the future.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Allocate off-slab slab descriptors from node local memory.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Rohit found an obscure bug causing buddy list corruption.

    page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
    to determine whether or not a free page's buddy is itself free and in the
    buddy lists.

    Each of the conjuncts may be true at different times due to unrelated
    conditions, so the non-atomic page_is_buddy test may find each conjunct
    to be true even if they were not both true at the same time (i.e. the
    page was not on the buddy lists).
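
    The racy test, schematically (a fragment of the check described above,
    not the exact source):

    /* Each half of the condition can be true transiently for unrelated
     * reasons, so their conjunction does not prove the page is really
     * sitting on the buddy lists. */
    if (PagePrivate(page) && page_count(page) == 0)
            return 1;   /* may wrongly conclude the page is a free buddy */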

    Signed-off-by: Martin Bligh
    Signed-off-by: Rohit Seth
    Signed-off-by: Nick Piggin
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

10 Apr, 2006

1 commit

  • The node setup code would try to allocate the node metadata in the node
    itself, but that fails if there is no memory in there.

    This can happen with memory hotplug when the hotplug area defines a
    so-far empty node.

    Now use bootmem to try to allocate the mem_map in other nodes.

    And if it fails don't panic, but just ignore the node.

    To make this work I added a new __alloc_bootmem_nopanic function that
    does what its name implies.

    TBD should try to use nearby nodes here. Currently we just use any.
    It's hard to do it better because bootmem doesn't have proper fallback
    lists yet.
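
    A hypothetical call site, showing the intended fallback behaviour (map,
    size and nid are assumed local variables; not the exact x86-64 code):

    /* Unlike __alloc_bootmem(), this variant returns NULL instead of
     * panicking when nothing suitable can be found. */
    map = __alloc_bootmem_nopanic(size, PAGE_SIZE, 0);
    if (!map) {
            printk(KERN_ERR "Cannot allocate mem_map for node %d, ignoring it\n",
                   nid);
            return;
    }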

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

01 Apr, 2006

4 commits

  • This changes if() BUG(); constructs to BUG_ON(), which is cleaner,
    contains unlikely() and can be better optimized away.
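
    For example (illustrative):

    if (error)
            BUG();

    /* becomes */

    BUG_ON(error);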

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
     
  • This changes if() BUG(); constructs to BUG_ON(), which is cleaner,
    contains unlikely() and can be better optimized away.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
     
  • This changes if() BUG(); constructs to BUG_ON(), which is cleaner,
    contains unlikely() and can be better optimized away.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
     
  • Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
    fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
    Reasons:

    - It's more flexible. Things which would require two or three syscalls with
    fadvise() can be done in a single syscall.

    - Using fadvise() in this manner is something not covered by POSIX.

    The patch wires up the syscall for x86.

    The syscall is implemented in the new fs/sync.c. The intention is that
    we can move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into
    there later.

    Documentation for the syscall is in fs/sync.c.

    A test app (sync_file_range.c) is in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.
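
    A minimal userspace usage sketch along the same lines (assuming the C
    library exposes a sync_file_range() wrapper; flag names as described for
    the syscall):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <file>\n", argv[0]);
                    return 1;
            }
            fd = open(argv[1], O_RDONLY);   /* read-only fds work too */
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* Start writeback of the first 1MB and wait for it to finish. */
            if (sync_file_range(fd, 0, 1 << 20,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER) != 0)
                    perror("sync_file_range");
            return 0;
    }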

    The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
    say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
    NFS_DATA_SYNC which is hopefully the more common."

    Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
    the queue is congested. This is trivial to fix: add a new flag bit, set
    wbc->nonblocking. But I'm not sure that we want to expose implementation
    details down to that level.

    Note: it's notable that we can sync an fd which wasn't opened for
    writing. Same with fsync() and fdatasync().

    Note: the code takes some care to handle attempts to sync file contents
    outside the 16TB offset on 32-bit machines. It makes such attempts appear to
    succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
    requests fail...

    Cc: Nick Piggin
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton