24 Oct, 2008

1 commit

  • * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: (35 commits)
    proc: remove fs/proc/proc_misc.c
    proc: move /proc/vmcore creation to fs/proc/vmcore.c
    proc: move pagecount stuff to fs/proc/page.c
    proc: move all /proc/kcore stuff to fs/proc/kcore.c
    proc: move /proc/schedstat boilerplate to kernel/sched_stats.h
    proc: move /proc/modules boilerplate to kernel/module.c
    proc: move /proc/diskstats boilerplate to block/genhd.c
    proc: move /proc/zoneinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmstat boilerplate to mm/vmstat.c
    proc: move /proc/pagetypeinfo boilerplate to mm/vmstat.c
    proc: move /proc/buddyinfo boilerplate to mm/vmstat.c
    proc: move /proc/vmallocinfo to mm/vmalloc.c
    proc: move /proc/slabinfo boilerplate to mm/slub.c, mm/slab.c
    proc: move /proc/slab_allocators boilerplate to mm/slab.c
    proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c
    proc: move /proc/stat to fs/proc/stat.c
    proc: move rest of /proc/partitions code to block/genhd.c
    proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c
    proc: move /proc/devices code to fs/proc/devices.c
    proc: move rest of /proc/locks to fs/locks.c
    ...

    Linus Torvalds
     

20 Oct, 2008

26 commits

  • This patch makes the needlessly global anon_vma_cachep static.

    Signed-off-by: Adrian Bunk
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Allocate all page_cgroup at boot and remove the page_cgroup pointer from
    struct page. This patch adds the interface:

    struct page_cgroup *lookup_page_cgroup(struct page*)

    All of FLATMEM, DISCONTIGMEM, SPARSEMEM and MEMORY_HOTPLUG are supported.
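
    As a rough illustration of the lookup (not the literal patch; the struct
    layout and the base/base_pfn variables below are assumptions), the flat
    memory case can reduce to indexing a boot-allocated array by page frame
    number:

    /* Hedged sketch of a flat-array lookup; illustrative names only. */
    struct page_cgroup {
            unsigned long flags;
            struct mem_cgroup *mem_cgroup;
            struct page *page;
            struct list_head lru;
    };

    static struct page_cgroup *base_page_cgroup;    /* allocated at boot */
    static unsigned long base_pfn;                   /* first managed pfn */

    struct page_cgroup *lookup_page_cgroup(struct page *page)
    {
            unsigned long pfn = page_to_pfn(page);
            return base_page_cgroup + (pfn - base_pfn);
    }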

    Removing the page_cgroup pointer reduces memory usage by
    - 4 bytes per page on 32-bit,
    - 8 bytes per page on 64-bit,
    when the memory controller is disabled (even if it is configured in).

    On a typical 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
    On my x86-64 server with 48GB of memory, this saves 96MB of memory.
    I think this reduction makes sense.

    With pre-allocation, the kmalloc/kfree calls in charge/uncharge are removed.
    This means
    - we no longer need to worry about kmalloc failure
      (which can happen because of the gfp_mask type),
    - we avoid calling kmalloc/kfree,
    - we avoid allocating tons of small objects which can become fragmented,
    - we know how much memory will be used for this extra LRU handling.

    I added printk messages:

    "allocated %ld bytes of page_cgroup"
    "please try cgroup_disable=memory option if you don't want"

    which should be informative enough for users.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch converts page_cgroup->flags to atomic bit operations and defines
    functions (and macros) to access it.

    Before the memory resource controller can be reworked, atomic operations on
    the flags are necessary. Most of the flags in this patch are for the LRU
    and are modified under mz->lru_lock, but we will soon add other flags which
    are not LRU-related. For example, we will place a LOCK bit in the flags
    field, and we need atomic operations to modify the LRU bits without holding
    that lock.
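
    A hedged sketch of the accessor pattern (the flag and macro names here are
    assumptions, not necessarily those added by the patch): the flags word
    becomes a bitmap manipulated with the kernel's atomic bitops:

    /* Illustrative only: flag bits and macro names are assumed. */
    enum page_cgroup_flags {
            PCG_LOCK,       /* assumed: per-page_cgroup lock bit */
            PCG_CACHE,      /* assumed: charged as page cache */
            PCG_ACTIVE,     /* assumed: on the active LRU */
    };

    #define SetPageCgroupFlag(pc, bit)   set_bit(bit, &(pc)->flags)
    #define ClearPageCgroupFlag(pc, bit) clear_bit(bit, &(pc)->flags)
    #define TestPageCgroupFlag(pc, bit)  test_bit(bit, &(pc)->flags)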

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Some obvious optimizations to memcg.

    I found that mem_cgroup_charge_statistics() is a little big (in object
    size) and does unnecessary address calculation. This patch reduces the
    size of that function.

    It also annotates res_counter_charge() as 'likely' to succeed.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There are not-on-LRU pages which can be mapped, and they are not worth
    accounting (because we cannot shrink them, and handling the special case
    would need ugly code). We would like to use the usual objrmap/radix-tree
    protocol and avoid accounting pages that are outside of the VM's control.

    When special_mapping_fault() is called, page->mapping tends to be NULL and
    the page is charged as an anonymous page. insert_page() also handles some
    special pages from drivers.

    This patch avoids accounting such special pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch tries to make page->mapping NULL before
    mem_cgroup_uncharge_cache_page() is called.

    "page->mapping == NULL" is a good check for "whether the page is still in
    the radix-tree or not". This patch also adds a BUG_ON() to
    mem_cgroup_uncharge_cache_page().

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • While page-cache charge/uncharge is done under the page lock, swap-cache
    charge/uncharge is not. (An anonymous page is charged when it is newly
    allocated.)

    This patch moves do_swap_page()'s charge() call under the lock. I don't see
    any actual problem *now*, but this fix is good for the future, to avoid an
    unnecessarily racy state.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • To prepare for the chunking, move the sys_move_pages() code that is used
    when nodes != NULL into do_pages_move(), and rename do_move_pages() to
    do_move_page_to_node_array().

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • do_pages_stat() does not actually need any page_to_node entries. Just pass
    the pointers to the user-space page address array and to the user-space
    status array, and have do_pages_stat() traverse the former and fill the
    latter directly.
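
    In outline (a hedged sketch with assumed helper and variable names, not the
    literal diff), the loop walks the two user-space arrays directly:

    /* Sketch only: nr_pages, pages, status and lookup_node() are assumed
     * names; get_user()/put_user() are the real user-access helpers. */
    unsigned long i;

    for (i = 0; i < nr_pages; i++) {
            const void __user *addr;
            int err;

            if (get_user(addr, pages + i))
                    return -EFAULT;
            err = lookup_node(mm, (unsigned long)addr);   /* hypothetical */
            if (put_user(err, status + i))
                    return -EFAULT;
    }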

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • A patchset reworking sys_move_pages(). It removes the possibly large
    vmalloc by using multiple chunks when migrating large buffers. It also
    dramatically increases the throughput for large buffers, since the lookup
    in new_page_node() is now limited to a single chunk, which greatly reduces
    the impact of its quadratic complexity. There is no need to use any
    radix-tree-like structure to improve this lookup.

    sys_move_pages() duration on a machine with four quad-core Opteron 2347HE
    processors (1.9 GHz), migrating between nodes #2 and #3:

    length    move_pages (us)    move_pages+patch (us)
      4kB          126                   98
     40kB          198                  168
    400kB          963                  937
      4MB        12503                11930
     40MB       246867                11848

    Patches #1 and #4 are the important ones:
    1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
    2) don't vmalloc a huge page_to_node array for do_pages_stat()
    3) extract do_pages_move() out of sys_move_pages()
    4) rework do_pages_move() to work on page_sized chunks
    5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default

    This patch:

    There is no point in returning -ENOENT from sys_move_pages() if all pages
    were already on the right node, while we return 0 if only one page was not.
    Most applications don't know where their pages are allocated, so it's not
    an error to try to migrate them anyway.

    Just return 0 and let the status array in user-space be checked if the
    application needs details.

    It will make the upcoming chunked-move_pages() support much easier.
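
    From user space (a hedged sketch; the pages/nodes/status arrays are assumed
    to be set up by the caller), a return of 0 now simply means "inspect the
    status array"; pages that were already on the requested node are no longer
    an error for the call as a whole:

    #include <numaif.h>
    #include <stdio.h>

    long rc = move_pages(0 /* this process */, count, pages, nodes, status,
                         MPOL_MF_MOVE);
    if (rc == 0) {
            unsigned long i;

            for (i = 0; i < count; i++)
                    if (status[i] < 0)
                            fprintf(stderr, "page %lu not moved: %d\n",
                                    i, status[i]);
    }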

    Signed-off-by: Brice Goglin
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brice Goglin
     
  • During hotplug memory remove, memory regions should be released in
    PAGES_PER_SECTION sized chunks. This mirrors the code in add_memory(),
    where resources are requested in PAGES_PER_SECTION sized chunks.

    Attempting to release the entire memory region fails because there is not
    a single resource covering the total number of pages being removed.
    Instead, the resources for the pages are split into PAGES_PER_SECTION
    sized chunks, as they were requested during memory add.
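
    A hedged sketch of the per-section release (the pfn range variables are
    assumptions; release_mem_region() and PAGES_PER_SECTION are the real kernel
    symbols):

    unsigned long pfn;

    /* Release the region one memory section at a time, mirroring how the
     * resources were requested during add_memory(). */
    for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn += PAGES_PER_SECTION)
            release_mem_region((resource_size_t)pfn << PAGE_SHIFT,
                               PAGES_PER_SECTION << PAGE_SHIFT);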

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Badari Pulavarty
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Fontenot
     
  • This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
    There seems to be no need for the lru_lock anymore, but there is a need for
    zone->lock instead, because that function may call move_freepages() via
    setup_zone_migrate_reserve().

    Signed-off-by: Gerald Schaefer
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Presently hugepages don't use the zero page at all, because the zero page
    is only used for core dumping and hugepages couldn't be core dumped.

    However, we have now implemented hugepage core dumping. Therefore we should
    implement zero page support for hugepages.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?
    The normal page case is checked as:

    static inline int use_zero_page(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                    return 0;

            return !vma->vm_ops || !vma->vm_ops->fault;
    }

    First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED.

    Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
    doesn't have any file backing. Thus ops->fault checking is meaningless.

    o Why don't we use the zero page if !pte?

    !pte indicates that the {pud, pmd} doesn't exist or that some error
    happened, so we shouldn't return the zero page if any error occurred.
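
    Given the two observations above, the hugepage-side check can reduce to the
    VM_SHARED test alone. A hedged sketch (the helper name is an assumption,
    not necessarily what this patch adds):

    static inline int huge_use_zero_page(struct vm_area_struct *vma)
    {
            /* Hugepages are never mlock()ed and have no ->fault to consult,
             * so only the shared-mapping case rules the zero page out. */
            return !(vma->vm_flags & VM_SHARED);
    }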

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Improve debuggability of memory setup problems.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).
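
    A hedged usage sketch of the per-CPU interface, assuming the signatures
    introduced by this series (the page array and count are caller-provided):

    /* Map a caller-provided array of pages into a temporary linear range.
     * Signature assumed as introduced here: (pages, count, node, prot). */
    void *buf = vm_map_ram(pages, nr_pages, -1 /* any node */, PAGE_KERNEL);
    if (!buf)
            return -ENOMEM;

    /* ... use the linearly mapped buffer ... */

    vm_unmap_ram(buf, nr_pages);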

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping, then touching, then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4-core, 2-socket Opteron. Results are
    in nanoseconds per map+touch+unmap.

    threads    vanilla (ns)    vmap rewrite (ns)
       1          14700              2900
       2          33600              3000
       4          49500              2800
       8          70631              2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • The __vma_link_file and expand_downwards functions are not small, yet they
    are marked inline. They probably had one call site some time in the past,
    but now they have more. To prevent a similar thing from happening again, I
    also deinlined expand_upwards, despite it having only one call site.
    Nowadays gcc auto-inlines such static functions anyway. In find_extend_vma,
    I removed one extra level of indirection.

    Patch is deliberately generated with -U $BIGNUM to make
    it easier to see that functions are big.

    Result:

    # size */*/mmap.o */vmlinux
       text    data     bss      dec     hex  filename
       9514     188      16     9718    25f6  0.org/mm/mmap.o
       9237     188      16     9441    24e1  deinline/mm/mmap.o
    6124402  858996  389480  7372878  70804e  0.org/vmlinux
    6124113  858996  389480  7372589  707f2d  deinline/vmlinux

    Signed-off-by: Denys Vlasenko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • trylock_page and unlock_page open and close a critical section. Hence,
    we can use the lock bitops to get the desired memory ordering.

    Also, mark trylock as likely to succeed (and remove the annotation from
    callers).
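
    A hedged sketch of the pattern (simplified; the real functions also handle
    waiters and debug checks): acquire/release-ordered bitops provide the
    critical-section ordering.

    /* trylock: acquire semantics via test_and_set_bit_lock() */
    static inline int trylock_page(struct page *page)
    {
            return likely(!test_and_set_bit_lock(PG_locked, &page->flags));
    }

    /* unlock: release semantics via clear_bit_unlock(); the real
     * unlock_page() then wakes any waiters on PG_locked. */
    static inline void __unlock_page_sketch(struct page *page)
    {
            clear_bit_unlock(PG_locked, &page->flags);
    }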

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • unlock_page is fairly expensive. It can be avoided in the page reclaim
    success path. By definition, if we have any other references to the page
    there, it would be a bug anyway.

    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Setting and clearing the page's locked bit when inserting it into the
    swapcache / pagecache, while it has no other references, can use non-atomic
    page flag operations, because no other CPU may be operating on it at that
    time.

    This saves one atomic operation when inserting a page into pagecache.
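
    A hedged sketch of the idea (the helper names are assumptions): a page with
    no other users can take the non-atomic __set_bit/__clear_bit variants
    instead of the atomic ones.

    /* Safe only while no other CPU can see the page, e.g. right before it
     * is first inserted into the pagecache or swapcache. */
    static inline void __set_page_locked(struct page *page)
    {
            __set_bit(PG_locked, &page->flags);
    }

    static inline void __clear_page_locked(struct page *page)
    {
            __clear_bit(PG_locked, &page->flags);
    }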

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Rework POSIX error returns for mlock().

    POSIX requires error codes for the mlock*() system calls for some
    conditions that differ from what kernel low-level functions, such as
    get_user_pages(), return for those conditions. For more info, see:

    http://marc.info/?l=linux-kernel&m=121750892930775&w=2

    This patch provides the same translation of get_user_pages() error codes to
    the POSIX-specified error codes in the context of the mlock rework for the
    unevictable LRU.
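
    A hedged sketch of the kind of translation involved (simplified; the exact
    call sites differ, and make_pages_present() is named here only as the usual
    wrapper): get_user_pages() reports an unmapped range as -EFAULT, while
    POSIX wants mlock() to return ENOMEM for that case.

    /* Sketch only: translate the low-level error to the POSIX-mandated one. */
    ret = make_pages_present(start, end);   /* may call get_user_pages() */
    if (ret == -EFAULT)
            ret = -ENOMEM;   /* POSIX: some of the range was not mapped */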

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This change is intended to make mlock() error returns correct.
    make_pages_present() is a lower-level function used by more than just
    mlock(). Subsequent patch[es] will add this error-return fixup in an
    mlock-specific path.

    Cc: KOSAKI Motohiro
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • During each reclaim scan we accumulate scan pressure on unrelated lists,
    which will eventually result in bogus scans and unwanted reclaims.

    Scanning lists with few reclaim candidates results in a lot of rotation and
    therefore also disturbs the list balancing, putting even more pressure on
    the wrong lists.

    In a test case with a lot of streaming IO, and therefore a crowded inactive
    file page list, swapping started because

    a) anon pages were reclaimed after swap_cluster_max reclaim
    invocations -- nr_scan of this list has just accumulated

    b) active file pages were scanned because *their* nr_scan had also
    accumulated through the same logic. And this in turn created a
    lot of rotation for file pages and resulted in a decrease of file
    list priority, again increasing the pressure on anon pages.

    The result was an evicted working set of anon pages while there were
    tons of inactive file pages that should have been taken instead.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Allow freeing of mlock()ed pages. This shouldn't happen, but during
    development, it occasionally did.

    This patch allows us to survive that condition, while keeping the
    statistics and events correct for debug.

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch adds a function to scan individual or all zones' unevictable
    lists and move any pages that have become evictable onto the respective
    zone's inactive list, where shrink_inactive_list() will deal with them.

    Adds a sysctl to scan all nodes, and per-node attributes to scan individual
    nodes' zones.

    Kosaki: if an evictable page is found on the unevictable LRU when writing
    to /proc/sys/vm/scan_unevictable_pages, print the filename and file offset
    of those pages.

    [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
    [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In the fault paths that install new anonymous pages, check whether the
    page is evictable or not using lru_cache_add_active_or_unevictable(). If
    the page is evictable, just add it to the active lru list [via the pagevec
    cache], else add it to the unevictable list.

    This "proactive" culling in the fault path mimics the handling of mlocked
    pages in Nick Piggin's series to keep mlocked pages off the lru lists.
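
    A hedged sketch of the fault-path pattern described here (surrounding code
    simplified; the ordering relative to set_pte_at() follows note 4 below):

    /* Put the new anon page on the right LRU before exposing the mapping. */
    page_add_new_anon_rmap(page, vma, address);
    lru_cache_add_active_or_unevictable(page, vma);
    set_pte_at(mm, address, page_table, entry);   /* see note 4 below */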

    Notes:

    1) This patch is optional--e.g., if one is concerned about the
    additional test in the fault path. We can defer the moving of
    nonreclaimable pages until when vmscan [shrink_*_list()]
    encounters them. Vmscan will only need to handle such pages
    once, but if there are a lot of them it could impact system
    performance.

    2) The 'vma' argument to page_evictable() is required to notice that
    we're faulting a page into an mlock()ed vma without having to scan the
    page's rmap in the fault path. Culling mlock()ed anon pages is
    currently the only reason for this patch.

    3) We can't cull swap pages in read_swap_cache_async() because the
    vma argument doesn't necessarily correspond to the swap cache
    offset passed in by swapin_readahead(). This could [did!] result
    in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
    cull in this path.

    4) Move set_pte_at() to after where we add the page to the lru, to keep
    it hidden from other tasks that might walk the page table.
    We already do it in this order in do_anonymous_page(). And,
    these are COW'd anon pages. Is this safe?

    [riel@redhat.com: undo an overzealous code cleanup]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn