22 Mar, 2006

40 commits

  • Remove __put_page from outside the core mm/. It is dangerous because it does
    not handle compound pages nicely, and misses 1->0 transitions. If a user
    later appears that really needs the extra speed we can reevaluate.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
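
    A hedged sketch of the distinction driving this removal, using
    2.6.16-era names from memory rather than the exact source:

        /* __put_page blindly decrements: a 1->0 transition goes
         * unnoticed and the last reference leaks the page */
        #define __put_page(p)  atomic_dec(&(p)->_count)

        /* put_page sees the last reference and handles compound
         * pages before releasing */
        void put_page(struct page *page)
        {
                if (unlikely(PageCompound(page))) {
                        /* redirect to the head page; the real code
                         * also runs a compound destructor on last put */
                        page = (struct page *)page_private(page);
                }
                if (put_page_testzero(page))    /* catches 1->0 */
                        __page_cache_release(page);
        }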
     
  • Remove page_count and __put_page from x86-64 pageattr

    Signed-off-by: Nick Piggin
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Use page->lru.next to implement the singly linked list of pages rather than
    the struct deferred_page which needs to be allocated and freed for each
    page.

    Signed-off-by: Nick Piggin
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
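
    A minimal sketch of the technique (df_list and the helper names are
    illustrative, not necessarily the commit's own):

        /* thread the deferred pages through page->lru.next instead of
         * kmalloc'ing and freeing a struct deferred_page per page */
        static struct page *df_list;

        static void save_page(struct page *page)
        {
                page->lru.next = (struct list_head *)df_list;
                df_list = page;
        }

        static struct page *pop_page(void)
        {
                struct page *page = df_list;

                if (page)
                        df_list = (struct page *)page->lru.next;
                return page;
        }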
     
  • Stop using __put_page and page_count in i386 pageattr.c

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • sg increments the refcount of constituent pages in its higher order memory
    allocations when they are about to be mapped by userspace. This is done so
    that the subsequent get_page/put_page during mapping and unmapping do not
    free the page.

    Move over to the preferred way, that is, using compound pages instead. This
    fixes a whole class of possible obscure bugs where a get_user_pages on a
    constituent page may outlast the user mappings or even the driver.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Cc: Douglas Gilbert
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
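
    The preferred form amounts to one flag at allocation time; a hedged
    sketch (gfp_mask and order stand in for sg's actual values):

        /* the buffer becomes a single compound unit, so get_page()/
         * put_page() on any constituent page pins the whole thing */
        page = alloc_pages(gfp_mask | __GFP_COMP, order);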
     
  • Now that it's madvisable, remove two pieces of VM_DONTCOPY bogosity:

    1. There was and is no logical reason why VM_DONTCOPY should be in the
    list of flags which forbid vma merging (and those drivers which set
    it are also setting VM_IO, which itself forbids the merge).

    2. It's hard to understand the purpose of the VM_HUGETLB, VM_DONTCOPY
    block in vm_stat_account: but never mind, it's under CONFIG_HUGETLB,
    which (unlike CONFIG_HUGETLB_PAGE or CONFIG_HUGETLBFS) has never been
    defined.

    Signed-off-by: Hugh Dickins
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In shrink_inactive_list(), nr_scan is not accounted when nr_taken is 0.
    But 0 pages taken does not mean 0 pages scanned.

    Move the goto statement below the accounting code to fix it.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
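
    A sketch of the fix, with field names as in the 2.6.16 sources:

        /* before: the scan was never accounted when nothing was taken */
        if (nr_taken == 0)
                goto done;
        zone->pages_scanned += nr_scan;         /* unreachable for 0 */

        /* after: account first, then bail out */
        zone->pages_scanned += nr_scan;
        if (nr_taken == 0)
                goto done;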
     
  • In isolate_lru_pages(), *scanned over-reports by one because the scan
    counter is incremented once more on exit from the while-loop.

    Change the while-loop to for-loop to fix it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
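
    A sketch of why the loop shape matters:

        /* before: scan++ also fires on the failing exit test, so on
         * exit scan is one higher than the number of iterations */
        while (scan++ < nr_to_scan && !list_empty(src)) {
                /* isolate one page */
        }

        /* after: the counter advances only for completed iterations */
        for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
                /* isolate one page */
        }
        *scanned = scan;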
     
  • Add some comments to explain how zone reclaim works, and fix the
    following issues:

    - PF_SWAPWRITE needs to be set for RECLAIM_SWAP to be able to write
    out pages to swap. Currently RECLAIM_SWAP may not do that.

    - Remove the setting of nr_reclaimed after slab reclaim, since the slab
    shrinking code does not use it and nr_reclaimed is already right for the
    intended follow-up action.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
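
    A hedged sketch of the PF_SWAPWRITE point, following the usual
    reclaim-path pattern (the PF_MEMALLOC half is assumed):

        current->flags |= PF_MEMALLOC | PF_SWAPWRITE;
        /* ... shrink_zone(), shrink_slab() ... */
        current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);

    Without PF_SWAPWRITE the writeout checks in vmscan may decline to
    write dirty pages to swap, so RECLAIM_SWAP could quietly do nothing.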
     
  • We have:

    try_to_free_pages
    ->shrink_caches(struct zone **zones, ..)
    ->shrink_zone(struct zone *, ...)
    ->shrink_cache(struct zone *, ...)
    ->shrink_list(struct list_head *, ...)
    ->refill_inactive_zone(struct zone *, ...)

    which is fairly irrational.

    Rename things so that we have

    try_to_free_pages
    ->shrink_zones(struct zone **zones, ..)
    ->shrink_zone(struct zone *, ...)
    ->shrink_inactive_list(struct zone *, ...)
    ->shrink_page_list(struct list_head *, ...)
    ->shrink_active_list(struct zone *, ...)

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Change all the vmscan functions to return the number of reclaimed pages
    and remove scan_control.nr_reclaimed.

    Saves ten-odd bytes of text and makes things clearer and more consistent.

    The patch also changes the behaviour of zone_reclaim() when it falls back
    to slab shrinking. Christoph says:

    "Setting this to one means that we will rescan and shrink the slab for
    each allocation if we are out of zone memory and RECLAIM_SLAB is set. Plus
    if we do an order 0 allocation we do not go off node as intended.

    "We better set this to zero. This means the allocation will go offnode
    despite us having potentially freed lots of memory on the zone. Future
    allocations can then again be done from this zone."

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
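
    A hedged sketch of the shape of the change (signatures simplified):

        /* before: results funnelled through the control structure */
        sc->nr_reclaimed += nr_freed;

        /* after: every level hands its count back up the call chain */
        static unsigned long shrink_zone(struct zone *zone,
                                         struct scan_control *sc)
        {
                unsigned long nr_reclaimed = 0;

                /* ... shrink the active and inactive lists ... */
                return nr_reclaimed;
        }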
     
  • Turn basically everything in vmscan.c into `unsigned long'. This is to avoid
    the possibility that some piece of code in there might decide to operate upon
    more than 4G (or even 2G) of pages in one hit.

    This might be silly, but we'll need it one day.

    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Initialise as much of scan_control as possible at the declaration site. This
    tidies things up a bit and assures us that all unmentioned fields are zeroed
    out.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
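
    A sketch of the pattern; the fields shown exist in the 2.6.16
    scan_control, but treat the exact set as illustrative:

        struct scan_control sc = {
                .gfp_mask         = gfp_mask,
                .may_writepage    = !laptop_mode,
                .swap_cluster_max = SWAP_CLUSTER_MAX,
                .may_swap         = 1,
                /* all unmentioned fields are implicitly zeroed */
        };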
     
  • Make nr_to_scan and priority parameters instead of putting them into scan
    control. This allows various small optimizations and IMHO makes the code
    easier to read.

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Slab duplicates on_each_cpu().

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When on_each_cpu() runs the callback on other CPUs, it runs with local
    interrupts disabled. So we should run the function with local interrupts
    disabled on this CPU, too.

    And do the same for UP, so the callback is run in the same environment on both
    UP and SMP. (strictly it should do preempt_disable() too, but I think
    local_irq_disable is sufficiently equivalent).

    Also uninlines on_each_cpu(). softirq.c was the most appropriate file I could
    find, but it doesn't seem to justify creating a new file.

    Oh, and fix up that comment over (under?) x86's smp_call_function(). It
    drives me nuts.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
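
    A hedged sketch of the uninlined helper (the preempt_disable() pair
    is assumed from the usual smp_call_function() calling rules):

        int on_each_cpu(void (*func)(void *info), void *info,
                        int retry, int wait)
        {
                int ret;

                preempt_disable();
                ret = smp_call_function(func, info, retry, wait);
                local_irq_disable();
                func(info);     /* run locally in the same irqs-off
                                 * environment the other CPUs saw */
                local_irq_enable();
                preempt_enable();
                return ret;
        }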
     
  • SLAB_NO_REAP is documented as an option that will cause this slab not to be
    reaped under memory pressure. However, that is not what happens. The only
    thing that SLAB_NO_REAP controls at the moment is the reclaim of the unused
    slab elements that were allocated in batch in cache_reap(). Cache_reap()
    is run every few seconds independently of memory pressure.

    Could we remove the whole thing? It's only used by three slabs anyway and
    I cannot find a reason for having this option.

    There is an additional problem with SLAB_NO_REAP. If set then the recovery
    of objects from alien caches is switched off. Objects not freed on the
    same node where they were initially allocated will only be reused if a
    certain amount of objects accumulates from one alien node (not very likely)
    or if the cache is explicitly shrunk. (Strangely, __cache_shrink does not
    check for SLAB_NO_REAP.)

    Getting rid of SLAB_NO_REAP fixes the problems with alien cache freeing.

    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Fix kernel-doc warnings in mm/slab.c.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • We have struct kmem_cache now so use it instead of the old typedef.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Remove cachep->spinlock. Locking has moved to the kmem_list3, and most of
    the structures protected earlier by cachep->spinlock are now protected by
    l3->list_lock. Slab cache tunables like batchcount are always accessed
    with the cache_chain_mutex held.

    Patch tested on SMP and NUMA kernels with dbench processes running,
    constant onlining/offlining, and constant cache tuning, all at the same
    time.

    Signed-off-by: Ravikiran Thirumalai
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
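
    A hedged sketch of the resulting locking split (lock and field names
    as described above; the usage is illustrative):

        mutex_lock(&cache_chain_mutex);         /* tunables */
        cachep->batchcount = batchcount;
        mutex_unlock(&cache_chain_mutex);

        spin_lock_irq(&l3->list_lock);          /* per-node slab lists */
        list_add(&slabp->list, &l3->slabs_free);
        spin_unlock_irq(&l3->list_lock);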
     
  • slab.c has become a bit revolting again. Try to repair it.

    - Coding style fixes

    - Don't do assignments-in-if-statements.

    - Don't typecast assignments to/from void*

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
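
    For instance, the assignments-in-if pattern goes from something like
    (kmem_getpages is a real slab.c helper; the snippet is illustrative):

        /* before */
        if (!(objp = kmem_getpages(cachep, flags, nodeid)))
                return NULL;

        /* after: no assignment inside the test */
        objp = kmem_getpages(cachep, flags, nodeid);
        if (!objp)
                return NULL;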
     
  • Extract setup_cpu_cache() function from kmem_cache_create() to make the
    latter a little less complex.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Clean up the object to index mapping that has been spread around mm/slab.c.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
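
    The cleanup centralises the arithmetic in a helper pair; a sketch of
    their likely shape (field names from the 2.6.16 slab):

        static inline unsigned int obj_to_index(struct kmem_cache *cache,
                                                struct slab *slab, void *obj)
        {
                return (obj - slab->s_mem) / cache->buffer_size;
        }

        static inline void *index_to_obj(struct kmem_cache *cache,
                                         struct slab *slab, unsigned int idx)
        {
                return slab->s_mem + cache->buffer_size * idx;
        }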
     
  • Since size_t has the same size as a long on all architectures, it's enough
    for overflow checks to check against ULONG_MAX.

    This change could allow a compiler better optimization (especially in the
    n=1 case).

    The practical effect seems to be positive, but quite small:

        text     data      bss       dec      hex  filename
    21762380  5859870  1848928  29471178  1c1b1ca  vmlinux-old
    21762211  5859870  1848928  29471009  1c1b121  vmlinux-patched

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
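
    A sketch of what the check becomes, kcalloc being the obvious user
    (treat the body as illustrative):

        static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
        {
                /* size_t is as wide as unsigned long everywhere, so
                 * one ULONG_MAX test covers the multiplication */
                if (n != 0 && size > ULONG_MAX / n)
                        return NULL;
                return kzalloc(n * size, flags);
        }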
     
  • Insert "fresh" huge pages into the hugepage allocator by the same means as
    they are freed back into it. This reduces code size and allows
    enqueue_huge_page to be inlined into the hugepage free fastpath.

    Eliminate occurrences of hugepages on the free list with non-zero refcount.
    This can allow stricter refcount checks in future. Also required for
    lockless pagecache.

    Signed-off-by: Nick Piggin

    "This patch also eliminates a leak "cleaned up" by re-clobbering the
    refcount on every allocation from the hugepage freelists. With respect to
    the lockless pagecache, the crucial aspect is to eliminate unconditional
    set_page_count() to 0 on pages with potentially nonzero refcounts, though
    closer inspection suggests the assignments removed are entirely spurious."

    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The bootmem code added to page_alloc.c duplicated some page freeing code
    that it really doesn't need, because it is not so performance critical.

    While we're here, make prefetching work properly by actually prefetching
    the page we're about to use before prefetching ahead to the next one (i.e.
    get the most important transaction started first). Also prefetch just a
    single page ahead rather than leaving a gap of 16.

    Jack Steiner reported no problems with SGI's ia64 simulator.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
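
    A hedged sketch of the reordering inside the free loop (list
    direction and surrounding code are illustrative):

        struct page *page = list_entry(list->prev, struct page, lru);

        prefetchw(page);                /* the page we use right now */
        if (page->lru.prev != list)     /* then just one page ahead */
                prefetchw(list_entry(page->lru.prev, struct page, lru));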
     
  • Clarify that preemption needs to be guarded against with the
    __xxx_page_state functions.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
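
    A usage sketch (pgactivate is one example counter):

        preempt_disable();
        __inc_page_state(pgactivate);   /* non-atomic per-cpu update */
        preempt_enable();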
     
  • Have an explicit mm call to split higher order pages into individual pages.
    Should help to avoid bugs and be more explicit about the code's intention.

    Signed-off-by: Nick Piggin
    Cc: Russell King
    Cc: David Howells
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Zankel
    Signed-off-by: Yoichi Yuasa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
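
    A usage sketch of the new helper:

        struct page *page = alloc_pages(GFP_KERNEL, order);

        if (page) {
                split_page(page, order);
                /* the 1 << order pages are now independent and can be
                 * mapped and freed individually */
        }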
     
  • - Don't return uninitialised stack values in case of allocation failure

    - Don't bother clearing PageCompound because __GFP_COMP wasn't specified

    - Increment over the pte page rather than one pte entry in
    pte_alloc_one_kernel

    - Actually increment the page pointer in pte_alloc_one

    - Compile fixes, typos.

    Signed-off-by: Nick Piggin
    Acked-by: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • atomic_add_unless (atomic_inc_not_zero) no longer requires an offset refcount
    to function correctly.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The VM has an interesting race where a page refcount can drop to zero, but it
    is still on the LRU lists for a short time. This was solved by testing a 0->1
    refcount transition when picking up pages from the LRU, and dropping the
    refcount in that case.

    Instead, use atomic_add_unless to ensure we never pick up a 0 refcount page
    from the LRU, thus a 0 refcount page will never have its refcount elevated
    until it is allocated again.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
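
    A hedged sketch of the isolation test (the surrounding loop is
    elided):

        /* only take the page if its refcount is provably non-zero; a
         * zero count means it is on its way back to the allocator and
         * must never be resurrected */
        if (unlikely(!atomic_inc_not_zero(&page->_count)))
                continue;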
     
  • Atomic operation removal from slab

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • More atomic operation removal from page allocator

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • In the page release paths, we can be sure that nobody will mess with our
    page->flags because the refcount has dropped to 0. So no need for atomic
    operations here.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear
    operations.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • PG_lru is protected by zone->lru_lock. It does not need TestSet/TestClear
    operations.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
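
    A hedged sketch of the pattern this and the previous entry enable
    (helper names from 2.6.16 mm_inline.h):

        spin_lock_irq(&zone->lru_lock);
        if (PageLRU(page) && !PageActive(page)) {
                del_page_from_inactive_list(zone, page);
                SetPageActive(page);    /* plain, not TestSetPageActive:
                                         * lru_lock serialises the bit */
                add_page_to_active_list(zone, page);
        }
        spin_unlock_irq(&zone->lru_lock);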
     
  • If vmscan finds a zero refcount page on the lru list, never ClearPageLRU
    it. This means the release code need not hold ->lru_lock to stabilise
    PageLRU, so that lock may be skipped entirely when releasing !PageLRU pages
    (because we know PageLRU won't have been temporarily cleared by vmscan,
    which was previously guaranteed by holding the lock to synchronise against
    vmscan).

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • set_pgdir hasn't been needed for a very long time. Remove the leftover
    implementation on sh64 and the stub on s390.

    Signed-off-by: Christoph Hellwig
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Richard Curnow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Do not use platform_device_register_simple(), as it is going away; define
    dcdbas_driver and implement its ->probe() and ->remove() functions so that
    manual binding and unbinding will work with this driver.

    Also switch to using attribute_group when creating sysfs attributes, and
    make sure to check and handle errors; explicitly remove attributes when
    detaching the driver.

    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Torokhov
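
    A hedged sketch of the pattern (dcdbas_dev_attrs and the callback
    bodies are illustrative, not the driver's actual code):

        static struct attribute_group dcdbas_attr_group = {
                .attrs = dcdbas_dev_attrs,
        };

        static int dcdbas_probe(struct platform_device *dev)
        {
                /* grouped creation: one call, one error to check */
                return sysfs_create_group(&dev->dev.kobj,
                                          &dcdbas_attr_group);
        }

        static int dcdbas_remove(struct platform_device *dev)
        {
                sysfs_remove_group(&dev->dev.kobj, &dcdbas_attr_group);
                return 0;
        }

        static struct platform_driver dcdbas_driver = {
                .driver = {
                        .name  = "dcdbas",
                        .owner = THIS_MODULE,
                },
                .probe  = dcdbas_probe,
                .remove = dcdbas_remove,
        };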
     
  • Do not use platform_device_register_simple(), as it is going away.

    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Torokhov