16 May, 2006

3 commits

  • With CONFIG_NUMA set, kmem_cache_destroy() may fail and say "Can't
    free all objects." The problem is caused by sequences such as the
    following (suppose we are on a NUMA machine with two nodes, 0 and 1):

    * Allocate an object from cache on node 0.
    * Free the object on node 1. The object is put into node 1's alien
    array_cache for node 0.
    * Call kmem_cache_destroy(), which ultimately ends up in __cache_shrink().
    * __cache_shrink() does drain_cpu_caches(), which loops through all nodes.
    For each node it drains the shared array_cache and then handles the
    alien array_cache for the other node.

    However this means that node 0's shared array_cache will be drained,
    and then node 1 will move the contents of its alien[0] array_cache
    into that same shared array_cache. node 0's shared array_cache is
    never looked at again, so the objects left there will appear to be in
    use when __cache_shrink() calls __node_shrink() for node 0. So
    __node_shrink() will return 1 and kmem_cache_destroy() will fail.

    This patch fixes this by having drain_cpu_caches() do
    drain_alien_cache() on every node before it does drain_array() on the
    nodes' shared array_caches.

    The problem was originally reported by Or Gerlitz.

    Signed-off-by: Roland Dreier
    Acked-by: Christoph Lameter
    Acked-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Roland Dreier
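
    A sketch of the reordered drain loop, reconstructed from the
    description above (names follow 2.6.17-era mm/slab.c; treat this as
    an illustration, not the verbatim patch):

        static void drain_cpu_caches(struct kmem_cache *cachep)
        {
                struct kmem_list3 *l3;
                int node;

                on_each_cpu(do_drain, cachep, 1, 1);
                check_irq_on();

                /* Pass 1: flush every node's alien caches first, so
                 * objects freed remotely get pushed back to their home
                 * node's lists (possibly its shared array_cache). */
                for_each_online_node(node) {
                        l3 = cachep->nodelists[node];
                        if (l3 && l3->alien)
                                drain_alien_cache(cachep, l3->alien);
                }

                /* Pass 2: only now drain the shared array_caches, which
                 * may have just received objects in pass 1. */
                for_each_online_node(node) {
                        l3 = cachep->nodelists[node];
                        if (l3)
                                drain_array(cachep, l3, l3->shared, 1, node);
                }
        }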
     
  • slab_is_available() indicates slab based allocators are available for use.
    SPARSEMEM code needs to know this as it can be called at various times
    during the boot process.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
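
    The kind of call site this enables, modelled on 2.6.17-era
    mm/sparse.c (a sketch from memory, not a verbatim copy): before the
    slab bootstrap completes, only bootmem can satisfy the allocation.

        static struct mem_section *sparse_index_alloc(int nid)
        {
                struct mem_section *section = NULL;
                unsigned long array_size = SECTIONS_PER_ROOT *
                                           sizeof(struct mem_section);

                /* pick the allocator that is actually usable now */
                if (slab_is_available())
                        section = kmalloc_node(array_size, GFP_ATOMIC, nid);
                else
                        section = alloc_bootmem_node(NODE_DATA(nid),
                                                     array_size);

                if (section)
                        memset(section, 0, array_size);

                return section;
        }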
     
  • As pointed out in http://bugzilla.kernel.org/show_bug.cgi?id=6490, this
    function can experience overflows on 32-bit machines, causing our response to
    changed values of min_free_kbytes to go whacky.

    Fixing it efficiently is all too hard, so fix it with 64-bit math instead.

    Cc: Ake Sandgren
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
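
    The failure mode is plain C arithmetic: the intermediate product of
    two unsigned 32-bit quantities wraps before the division. A minimal
    standalone illustration of the pattern (the values and variable
    names are made up for the demo):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                /* ~4 million pages (16GB) times a sizeable pages_min:
                 * the product exceeds 2^32 */
                uint32_t zone_pages = 4000000;
                uint32_t pages_min  = 2000;
                uint32_t lowmem     = 4500000;

                /* 32-bit math: the product wraps, giving nonsense */
                uint32_t bad  = zone_pages * pages_min / lowmem;

                /* 64-bit math: widen first, divide after */
                uint32_t good = (uint64_t)zone_pages * pages_min / lowmem;

                printf("32-bit: %u, 64-bit: %u\n", bad, good); /* 823 vs 1777 */
                return 0;
        }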
     

02 May, 2006

3 commits

  • Based on an older patch from Mike Kravetz

    We need to have a mem_map for high addresses in order to make fops->nopage
    work on spufs mem and register files. So far, we have used the
    memory_present() function during early bootup, but that did not work when
    CONFIG_NUMA was enabled.

    We now use the __add_pages() function to add the mem_map when loading the
    spufs module, which is a lot nicer.

    Signed-off-by: Arnd Bergmann
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel H Schopp
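
    A hedged sketch of the module-load path. The __add_pages() signature
    matches 2.6.17-era mm/memory_hotplug.c; the helper name and the zone
    choice are illustrative assumptions, not the actual spufs code:

        /* create struct pages (the mem_map) for a physical range at
         * module load time rather than during early boot */
        static int spufs_add_mem_map(unsigned long start, unsigned long size)
        {
                unsigned long start_pfn = start >> PAGE_SHIFT;
                unsigned long nr_pages = size >> PAGE_SHIFT;
                struct zone *zone;

                /* grow a zone of the node owning this range (zone
                 * selection simplified for the sketch) */
                zone = NODE_DATA(pfn_to_nid(start_pfn))->node_zones;

                return __add_pages(zone, start_pfn, nr_pages);
        }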
     
  • This patch fixes two bugs with the way sparsemem interacts with memory add.
    They are:

    - memory leak if memmap for section already exists

    - calling alloc_bootmem_node() after boot

    These bugs were discovered, and a first cut at the fixes was
    provided, by Arnd Bergmann and Joel Schopp.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Joel Schopp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently we check PageDirty() in order to make the decision to swap out
    the page. However, the dirty information may only be contained in the
    ptes pointing to the page. We need to first unmap the ptes before checking
    for PageDirty(). If unmap is successful then the page count of the page
    will also be decreased so that pageout() works properly.

    This is a fix necessary for 2.6.17. Without this fix we may migrate dirty
    pages for filesystems without migration functions. Filesystems may keep
    pointers to dirty pages. Migration of dirty pages can result in the
    filesystem keeping pointers to freed pages.

    Unmapping is currently not separated out from removing all the
    references to a page and moving the mapping. Therefore try_to_unmap
    will be called again in migrate_page() if the writeout is successful.
    However, it won't do anything since the ptes are already removed.

    The coming updates to the page migration code will restructure the code
    so that this is no longer necessary.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
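
    A simplified sketch of the reordered checks in the fallback swapout
    path (signatures per 2.6.17-era mm as best I recall; the
    unlock_retry label is illustrative):

        /* try_to_unmap() moves per-pte dirty bits into the struct page
         * and drops the pte references, so both the PageDirty() test
         * and pageout()'s is_page_cache_freeable() work afterwards */
        if (try_to_unmap(page, 1) != SWAP_SUCCESS)
                goto unlock_retry;

        if (PageDirty(page)) {
                switch (pageout(page, mapping)) {
                case PAGE_KEEP:
                case PAGE_ACTIVATE:
                        goto unlock_retry;
                case PAGE_SUCCESS:
                        /* writeout started; retry the migration later */
                        goto unlock_retry;
                case PAGE_CLEAN:
                        break;
                }
        }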
     

23 Apr, 2006

1 commit

  • Basic problem: pages of a shared memory segment can only be migrated once.

    In 2.6.16 through 2.6.17-rc1, shared memory mappings do not have a
    migratepage address space op. Therefore, migrate_pages() falls back to
    default processing. In this path, it will try to pageout() dirty pages.
    Once a shared memory page has been migrated it becomes dirty, so
    migrate_pages() will try to page it out. However, because the page count
    is 3 [cache + current + pte], pageout() will return PAGE_KEEP because
    is_page_cache_freeable() returns false. This will abort all subsequent
    migrations.

    This patch adds a migratepage address space op to shared memory segments to
    avoid taking the default path. We use the "migrate_page()" function
    because it knows how to migrate dirty pages. This allows shared memory
    segment pages to migrate, subject to other conditions such as # pte's
    referencing the page [page_mapcount(page)], when requested.

    I think this is safe. If we're migrating a shared memory page, then we
    found the page via a page table, so it must be in memory.

    Can be verified with memtoy and the shmem-mbind-test script, both
    available at: http://free.linux.hp.com/~lts/Tools/

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
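
    The shape of the change, after 2.6.17-era mm/shmem.c (surrounding
    entries abbreviated; treat as a sketch):

        static struct address_space_operations shmem_aops = {
                .writepage      = shmem_writepage,
                .set_page_dirty = __set_page_dirty_nobuffers,
        #ifdef CONFIG_TMPFS
                .prepare_write  = shmem_prepare_write,
                .commit_write   = simple_commit_write,
        #endif
                /* the new op: reuse the generic dirty-capable helper so
                 * migrate_pages() never falls into the pageout() path */
                .migratepage    = migrate_page,
        };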
     

11 Apr, 2006

11 commits

  • Signed-off-by: Coywolf Qi Hunt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Coywolf Qi Hunt
     
  • EXPORT_SYMBOL'ing of a static function is not a good idea.

    Signed-off-by: Adrian Bunk
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch is an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory() in mm/nommu.c.

    When the OVERCOMMIT_GUESS algorithm calculates the number of free
    pages, it subtracts the number of reserved pages from the result of
    nr_free_pages().

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
     
  • This patch is an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory() in mm/mmap.c.

    When the OVERCOMMIT_GUESS algorithm calculates the number of free
    pages, it subtracts the number of reserved pages from the result of
    nr_free_pages().

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
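
    For both this and the mm/nommu.c variant above, the core of the
    change is a single subtraction in the GUESS branch (a sketch; the
    context loosely follows 2.6.17-era mm/mmap.c):

        if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
                unsigned long free;

                free = get_page_cache_size();
                free += nr_swap_pages;
                free += nr_free_pages();

                /* new: leave out the pages the kernel tries to keep
                 * free -- they are not usable for anonymous memory */
                free -= totalreserve_pages;

                if (free > pages)
                        return 0;       /* enough memory */
                /* (slab-reclaimable and root-reserve handling omitted) */
                return -ENOMEM;
        }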
     
  • These patches are an enhancement of the OVERCOMMIT_GUESS algorithm in
    __vm_enough_memory().

    - why the kernel needed patching

    When the kernel can't actually allocate anonymous pages, the current
    OVERCOMMIT_GUESS can still return success. This behavior can be the
    cause of OOM kills under memory pressure.

    If Linux runs with page reservation features like
    /proc/sys/vm/lowmem_reserve_ratio and without a swap region, I think
    OOM kills occur easily.

    - the overall design approach in the patch

    When the OVERCOMMIT_GUESS algorithm calculates the number of free
    pages, the reserved free pages are regarded as non-free.

    This change helps avoid the pitfall that the number of free pages
    becomes less than the number the kernel tries to keep free.

    - testing results

    I tested the patches using my test kernel module.

    If the patches aren't applied to the kernel, __vm_enough_memory()
    returns success in that situation, but the actual page allocation
    fails.

    On the other hand, if the patches are applied, the allocation
    failure is avoided since __vm_enough_memory() returns failure in
    that situation.

    I checked this on an i386 SMP machine with 16GB of memory. I haven't
    tested a nommu environment yet.

    This patch adds totalreserve_pages for __vm_enough_memory().

    calculate_totalreserve_pages() checks the maximum lowmem_reserve
    pages and pages_high in each zone, and stores the sum over all zones
    in totalreserve_pages; see the sketch below.

    totalreserve_pages is calculated when the VM is initialized, and the
    variable is updated whenever /proc/sys/vm/lowmem_reserve_ratio or
    /proc/sys/vm/min_free_kbytes is changed.

    Signed-off-by: Hideo Aoki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hideo AOKI
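
    A sketch of the helper, reconstructed to match the description above
    (compare 2.6.17-era mm/page_alloc.c):

        static void calculate_totalreserve_pages(void)
        {
                struct pglist_data *pgdat;
                unsigned long reserve_pages = 0;
                int i, j;

                for_each_online_pgdat(pgdat) {
                        for (i = 0; i < MAX_NR_ZONES; i++) {
                                struct zone *zone = pgdat->node_zones + i;
                                unsigned long max = 0;

                                /* largest lowmem_reserve for this zone */
                                for (j = i; j < MAX_NR_ZONES; j++)
                                        if (zone->lowmem_reserve[j] > max)
                                                max = zone->lowmem_reserve[j];

                                /* pages_high is treated as reserved too */
                                max += zone->pages_high;

                                if (max > zone->present_pages)
                                        max = zone->present_pages;
                                reserve_pages += max;
                        }
                }
                totalreserve_pages = reserve_pages;
        }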
     
  • - Remove sparse comment

    - Remove duplicated include

    - Return the correct error condition in migrate_page_remove_references().

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The code compares newbrk with oldbrk, which are page aligned, before
    checking the memory limit set for the data segment. If the memory
    limit is not page aligned, this bypasses the limit test whenever the
    allocation stays within the same page.

    Signed-off-by: Ram Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ram Gupta
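
    A sketch of the reordering in sys_brk(), after 2.6.17-era mm/mmap.c:
    the rlimit check now runs before the same-page short-circuit.

        /* check against rlimit *before* the page-granularity test, so
         * a same-page brk can't slip past an unaligned limit */
        rlim = current->signal->rlim[RLIMIT_DATA].rlim_cur;
        if (rlim < RLIM_INFINITY && brk - mm->start_data > rlim)
                goto out;

        newbrk = PAGE_ALIGN(brk);
        oldbrk = PAGE_ALIGN(mm->brk);
        if (oldbrk == newbrk)
                goto set_brk;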
     
  • The earlier patch to consolidate mmu and nommu page allocation and
    refcounting by using compound pages for nommu allocations had a bug:
    kmalloc slabs whose pages were initially allocated by a
    non-__GFP_COMP allocator could be passed into mm/nommu.c kmalloc
    allocations which really wanted __GFP_COMP underlying pages. Fix
    that by having nommu pass __GFP_COMP to all higher order slab
    allocations.

    Signed-off-by: Luke Yang
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luke Yang
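
    As described, the fix amounts to forcing __GFP_COMP for nommu slab
    backing pages. A hedged sketch of where that lands (the exact
    placement inside slab's page-allocation helper is from memory):

        static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
                                   int nodeid)
        {
                struct page *page;

                flags |= cachep->gfpflags;
        #ifndef CONFIG_MMU
                /* nommu refcounts kmalloc memory through its pages, so
                 * higher-order backing pages must be compound */
                flags |= __GFP_COMP;
        #endif
                page = alloc_pages_node(nodeid, flags, cachep->gfporder);
                if (!page)
                        return NULL;
                /* (statistics and page-state bookkeeping omitted) */
                return page_address(page);
        }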
     
  • Add a statistics counter which is incremented every time the alien
    cache overflows. The alien cache limit is hardcoded to 12 right now.
    We can use this statistic to tune the alien cache if needed in the
    future.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Allocate off-slab slab descriptors from node local memory.

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Shai Fultheim
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Rohit found an obscure bug causing buddy list corruption.

    page_is_buddy is using a non-atomic test (PagePrivate && page_count == 0)
    to determine whether or not a free page's buddy is itself free and in the
    buddy lists.

    Each of the conjuncts may be true at different times due to
    unrelated conditions, so the non-atomic page_is_buddy test may find
    each conjunct to be true even if they were not both true at the same
    time (i.e. the page was not on the buddy lists).

    Signed-off-by: Martin Bligh
    Signed-off-by: Rohit Seth
    Signed-off-by: Nick Piggin
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Nick Piggin
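
    For reference, the racy predicate looked roughly like this
    (2.6.16-era mm/page_alloc.c, from memory):

        /* Non-atomic: PagePrivate and page_count are read separately,
         * so each half can be observed true at different moments even
         * though the page was never free on the buddy lists. */
        static inline int page_is_buddy(struct page *page, int order)
        {
                if (PagePrivate(page) &&
                    (page_order(page) == order) &&
                    page_count(page) == 0)
                        return 1;
                return 0;
        }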
     

10 Apr, 2006

1 commit

  • The node setup code would try to allocate the node metadata in the node
    itself, but that fails if there is no memory in there.

    This can happen with memory hotplug when the hotplug area defines a
    so-far empty node.

    Now use bootmem to try to allocate the mem_map in other nodes.

    If that fails, don't panic; just ignore the node.

    To make this work I added a new __alloc_bootmem_nopanic function that
    does what its name implies.

    TBD: should try to use nearby nodes here. Currently we just use any.
    It's hard to do it better because bootmem doesn't have proper fallback
    lists yet.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
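
    A caller-side usage sketch (the __alloc_bootmem_nopanic() signature
    mirrors __alloc_bootmem(); the surrounding error handling is
    illustrative):

        /* try to place this node's mem_map anywhere; unlike
         * __alloc_bootmem(), failure returns NULL instead of panicking,
         * so we can simply skip the empty node */
        map = __alloc_bootmem_nopanic(size, SMP_CACHE_BYTES,
                                      __pa(MAX_DMA_ADDRESS));
        if (!map) {
                printk(KERN_ERR
                       "Node %d: no memory for mem_map, ignoring node\n",
                       nid);
                return;
        }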
     

01 Apr, 2006

8 commits

  • This changes if() BUG(); constructs to BUG_ON(), which is cleaner,
    contains unlikely(), and can be better optimized away.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
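
    The transformation, shown on a hypothetical condition:

        /* before: open-coded test, no unlikely() hint */
        if (page_count(page) != 0)
                BUG();

        /* after: one line, carries unlikely(), easier to optimize away
         * when CONFIG_BUG is disabled */
        BUG_ON(page_count(page) != 0);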
     
  • This changes if() BUG(); constructs to BUG_ON(), which is cleaner,
    contains unlikely(), and can be better optimized away.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
     
  • This changes if() BUG(); constructs to BUG_ON(), which is cleaner,
    contains unlikely(), and can be better optimized away.

    Signed-off-by: Eric Sesterhenn
    Signed-off-by: Adrian Bunk

    Eric Sesterhenn
     
  • Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
    fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
    Reasons:

    - It's more flexible. Things which would require two or three syscalls with
    fadvise() can be done in a single syscall.

    - Using fadvise() in this manner is something not covered by POSIX.

    The patch wires up the syscall for x86.

    The syscall is implemented in the new fs/sync.c. The intention is
    that we can move sys_fsync(), sys_fdatasync() and perhaps sys_sync()
    into there later.

    Documentation for the syscall is in fs/sync.c.

    A test app (sync_file_range.c) is in
    http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

    The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
    say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
    NFS_DATA_SYNC which is hopefully the more common."

    Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
    the queue is congested. This is trivial to fix: add a new flag bit, set
    wbc->nonblocking. But I'm not sure that we want to expose implementation
    details down to that level.

    Note: we can sync an fd which wasn't opened for writing. The same is
    true of fsync() and fdatasync().

    Note: the code takes some care to handle attempts to sync file contents
    outside the 16TB offset on 32-bit machines. It makes such attempts appear to
    succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
    requests fail...

    Cc: Nick Piggin
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
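
    A minimal userspace usage sketch of the new syscall (via the glibc
    wrapper that later shipped; flag names as in the merged API):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                if (argc < 2) {
                        fprintf(stderr, "usage: %s <file>\n", argv[0]);
                        return 1;
                }

                /* per the note above, write access is not required */
                int fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* wait for any writeout of the first 1MiB already in
                 * flight, start writeout of dirty pages in the range,
                 * then wait for completion: a synchronous range flush */
                if (sync_file_range(fd, 0, 1 << 20,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WRITE |
                                    SYNC_FILE_RANGE_WAIT_AFTER) != 0)
                        perror("sync_file_range");

                close(fd);
                return 0;
        }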
     
  • The boot cmdline is parsed in parse_early_param() and
    parse_args(,unknown_bootoption).

    And __setup() is used in obsolete_checksetup().

    start_kernel()
    -> parse_args()
    -> unknown_bootoption()
    -> obsolete_checksetup()

    If __setup()'s callback (->setup_func()) returns 1 in
    obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
    handled.

    If ->setup_func() returns 0, obsolete_checksetup() tries the other
    ->setup_func()s. If all the ->setup_func()s that matched the
    parameter return 0, the parameter is added to argv_init[].

    Then, when running /sbin/init (or the program given by init=),
    argv_init[] is passed to it. If the app doesn't ignore those
    arguments, it may warn and exit.

    This patch fixes wrong usages of this, though only the obvious ones.

    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
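
    A sketch of the convention being enforced, with a hypothetical
    option name:

        /* returning 1 tells obsolete_checksetup() the parameter was
         * consumed; returning 0 would let "noexample" leak through
         * into init's argv_init[] */
        static int example_disabled __initdata;

        static int __init noexample_setup(char *str)
        {
                example_disabled = 1;
                return 1;
        }
        __setup("noexample", noexample_setup);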
     
  • With strict page reservation, I think the kernel should enforce that
    the number of free hugetlb pages does not fall below the reserved
    count. Currently that is possible via the sysctl path. Add a proper
    check in sysctl to disallow it.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • git-commit: d5d4b0aa4e1430d73050babba999365593bdb9d2
    "[PATCH] optimize follow_hugetlb_page" breaks mlock on hugepage areas.

    I misinterpreted the pages argument and made get_page()
    unconditional. It should only take a ref count when the "pages"
    argument is non-NULL.

    Credit goes to Adam Litke who spotted the bug.

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
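
    The corrected pattern inside follow_hugetlb_page(), as described
    (sketch):

        /* take a reference only when the caller asked for the pages;
         * plain mlock()-style walks pass pages == NULL */
        if (pages) {
                get_page(page);
                pages[i] = page;
        }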
     
  • find_trylock_page() is an odd interface in that it doesn't take a reference
    like the others. Now that XFS no longer uses it, and its last remaining
    caller actually wants an elevated refcount, opencode that callsite and
    schedule find_trylock_page() for removal.

    Signed-off-by: Nick Piggin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
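
    The opencoded replacement at that call site looks roughly like this
    (reconstructed; TestSetPageLocked() was the trylock primitive of the
    day):

        /* like find_trylock_page(), but holding a reference: look the
         * page up with an elevated refcount, then try to lock it */
        page = find_get_page(&swapper_space, entry.val);
        if (page && unlikely(TestSetPageLocked(page))) {
                /* locked by someone else: drop our ref, treat as absent */
                page_cache_release(page);
                page = NULL;
        }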
     

29 Mar, 2006

1 commit