02 Nov, 2015

1 commit

  • It turns out that at least some versions of glibc end up reading
    /proc/meminfo at every single startup, because glibc wants to know the
    amount of memory the machine has. And while that's arguably insane,
    it's just how things are.

    And it turns out that it's not all that expensive most of the time, but
    the vmalloc information statistics (amount of virtual memory used in the
    vmalloc space, and the biggest remaining chunk) can be rather expensive
    to compute.

    The 'get_vmalloc_info()' function actually showed up on my profiles as
    4% of the CPU usage of "make test" in the git source repository, because
    the git tests are lots of very short-lived shell-scripts etc.

    It turns out that apparently this same silly vmalloc info gathering
    shows up on the facebook servers too, according to Dave Jones. So it's
    not just "make test" for git.

    We had two patches to just cache the information (one by me, one by
    Ingo) to mitigate this issue, but the whole vmalloc information is of
    rather dubious value to begin with, and people who *actually* want to
    know what the situation is wrt the vmalloc area should just look at the
    much more complete /proc/vmallocinfo instead.

    In fact, according to my testing - and perhaps more importantly,
    according to that big search engine in the sky: Google - there is
    nothing out there that actually cares about those two expensive fields:
    VmallocUsed and VmallocChunk.

    So let's try to just remove them entirely. Actually, this just removes
    the computation and reports the numbers as zero for now, just to try to
    be minimally intrusive.
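    A rough sketch of what "report the numbers as zero" amounts to in
    fs/proc/meminfo.c (the format string is an approximation of the real
    one; VmallocTotal stays as-is):

        seq_printf(m,
                "VmallocTotal:   %8lu kB\n"
                "VmallocUsed:    %8lu kB\n"
                "VmallocChunk:   %8lu kB\n",
                (unsigned long)VMALLOC_TOTAL >> 10,
                0ul,    /* was vmi.used >> 10 */
                0ul);   /* was vmi.largest_chunk >> 10 */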

    If this breaks anything, we'll obviously have to re-introduce the code
    to compute this all and add the caching patches on top. But if given
    the option, I'd really prefer to just remove this bad idea entirely
    rather than add even more code to work around our historical mistake
    that likely nobody really cares about.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Mar, 2015

1 commit

  • Current approach in handling shadow memory for modules is broken.

    Shadow memory can be freed only after the memory it shadows is no
    longer in use. vfree() called from interrupt context uses the memory it
    is freeing to store a 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
                struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
                if (llist_add((struct llist_node *)addr, &p->list))
                        schedule_work(&p->wq);

    This list node is later used in free_work(), which actually frees the
    memory. So module_memfree() called in interrupt context currently frees
    the shadow before freeing the module's memory, which can provoke a
    kernel crash.

    So shadow memory should be freed after module's memory. However, such
    deallocation order could race with kasan_module_alloc() in module_alloc().

    Free the shadow right before releasing the vm area. At that point the
    vfree()'d memory is no longer used but not yet available for other
    allocations. A new VM_KASAN flag is used to indicate that a vm area has
    dynamically allocated shadow memory, so kasan frees the shadow only if
    it was previously allocated.
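    A minimal sketch of the resulting free path, using the names mentioned
    above (the exact call site when the vm area is released is an
    assumption):

        void kasan_free_shadow(const struct vm_struct *vm)
        {
                /* only free shadow that was dynamically allocated for this area */
                if (vm->flags & VM_KASAN)
                        vfree(kasan_mem_to_shadow(vm->addr));
        }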

    Signed-off-by: Andrey Ryabinin
    Acked-by: Rusty Russell
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

14 Feb, 2015

2 commits

  • For instrumenting global variables, KASan will shadow the memory
    backing modules. So on module load we will need to allocate memory for
    the shadow and map it at the shadow address that corresponds to the
    address allocated by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it
    puts a guard hole after the allocated area. A guard hole in shadow
    memory would be a problem because at some future point we might need
    shadow memory at the address occupied by that guard hole, and so could
    fail to allocate shadow for module_alloc().

    Now that we have the VM_NO_GUARD flag for disabling the guard page, we
    need a way to pass it into __vmalloc_node_range(). Add a new 'vm_flags'
    parameter to __vmalloc_node_range().
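    With the new parameter the prototype would look roughly like this (a
    sketch; shadow allocations could then pass VM_NO_GUARD in vm_flags):

        void *__vmalloc_node_range(unsigned long size, unsigned long align,
                                   unsigned long start, unsigned long end,
                                   gfp_t gfp_mask, pgprot_t prot,
                                   unsigned long vm_flags, int node,
                                   const void *caller);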

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • For instrumenting global variables, KASan will shadow the memory
    backing modules. So on module load we will need to allocate memory for
    the shadow and map it at the shadow address that corresponds to the
    address allocated by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it
    puts a guard hole after the allocated area. A guard hole in shadow
    memory would be a problem because at some future point we might need
    shadow memory at the address occupied by that guard hole, and so could
    fail to allocate shadow for module_alloc().

    Add a new vm_struct flag 'VM_NO_GUARD' indicating that the vm area
    doesn't have a guard hole.
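    A sketch of how the flag would interact with size accounting (the flag
    value is illustrative):

        #define VM_NO_GUARD     0x00000040      /* don't add a guard page */

        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                if (!(area->flags & VM_NO_GUARD))
                        /* return the actual size without the guard page */
                        return area->size - PAGE_SIZE;
                return area->size;
        }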

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 Aug, 2014

1 commit

  • Currently map_vm_area() takes (struct page *** pages) as its third
    argument, and after mapping it advances (*pages) to point to (*pages +
    nr_mapped_pages).

    This increment is useless to its callers these days. The callers don't
    care about it and actually work around it by passing a separate copy to
    map_vm_area().

    The caller can always guarantee that all the pages can be mapped into
    the vm_area given in the first argument, and only cares about whether
    map_vm_area() fails or not.

    This patch cleans up the pointer movement in map_vm_area() and updates
    its callers accordingly.
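    The prototype change amounts to roughly this (a sketch):

        /* before: advances *pages past the mapped pages */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages);

        /* after: just maps the pages and reports success or failure */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages);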

    Signed-off-by: WANG Chao
    Cc: Zhang Yanfei
    Acked-by: Greg Kroah-Hartman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Chao
     

10 Jul, 2013

1 commit

  • VM_UNLIST was used to indicate that the vm_struct is not listed in
    vmlist.

    But after commit 4341fa454796 ("mm, vmalloc: remove list management of
    vmlist after initializing vmalloc"), the meaning of this flag changed.
    It now means the vm_struct is not fully initialized. So renaming it to
    VM_UNINITIALIZED seems more reasonable.

    Also change clear_vm_unlist to clear_vm_uninitialized_flag.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     

04 Jul, 2013

1 commit

  • We want to allocate the ELF note segment buffer on the 2nd kernel in
    vmalloc space and remap it to user-space, in order to reduce the risk
    that memory allocation fails on systems with a huge number of CPUs and
    hence a huge ELF note segment exceeding the order-11 block size.

    Although remap_vmalloc_range() already exists for remapping vmalloc
    memory to user-space, it requires the user-space range to be specified
    via a vma. Mmap on /proc/vmcore needs to remap a range across multiple
    objects, so an interface that requires the vma to cover the full range
    is problematic.

    This patch introduces remap_vmalloc_range_partial, which receives the
    user-space range as a pair of base address and size and can be used for
    the mmap-on-/proc/vmcore case.

    remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.
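    The new interface would look roughly like this, with remap_vmalloc_range()
    expressible on top of it (a sketch):

        int remap_vmalloc_range_partial(struct vm_area_struct *vma,
                                        unsigned long uaddr, void *kaddr,
                                        unsigned long size);

        int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
                                unsigned long pgoff)
        {
                return remap_vmalloc_range_partial(vma, vma->vm_start,
                                        addr + (pgoff << PAGE_SHIFT),
                                        vma->vm_end - vma->vm_start);
        }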

    [akpm@linux-foundation.org: use PAGE_ALIGNED()]
    Signed-off-by: HATAYAMA Daisuke
    Cc: KOSAKI Motohiro
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     

30 Apr, 2013

3 commits

  • Now vmap_area_list is exported as VMCOREINFO so that makedumpfile can
    get the start address of the vmalloc region (vmalloc_start). The
    address holding the vmalloc_start value is computed as below:

    vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)

    However, neither OFFSET(vmap_area.va_start) nor OFFSET(vmap_area.list)
    is exported as VMCOREINFO.

    So this patch exports them, with a small cleanup.
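    The export would amount to something like the following in the
    vmcoreinfo setup code (a sketch):

        VMCOREINFO_OFFSET(vmap_area, va_start);
        VMCOREINFO_OFFSET(vmap_area, list);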

    [akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
    Signed-off-by: Atsushi Kumagai
    Cc: Joonsoo Kim
    Cc: Joonsoo Kim
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Atsushi Kumagai
    Cc: Chris Metcalf
    Cc: Dave Anderson
    Cc: Eric Biederman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
     
  • Although our intention is to unexport the internal structures entirely,
    there is one exception for kexec: kexec dumps the address of vmlist,
    and makedumpfile uses this information.

    We are about to remove vmlist, so makedumpfile needs another way to
    retrieve information about the vmalloc layer. For this purpose, export
    vmap_area_list instead of vmlist.
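    In vmcoreinfo terms this is roughly (a sketch):

        VMCOREINFO_SYMBOL(vmap_area_list);      /* instead of vmlist */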

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Joonsoo Kim
    Cc: Eric Biederman
    Cc: Dave Anderson
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • get_vmalloc_info() currently lives in fs/proc/mmu.c. There is no reason
    this code must be there: its implementation needs vmlist_lock and
    iterates vmlist, which may be considered internal data structures of
    vmalloc.

    For maintainability it is preferable that vmlist_lock and vmlist be
    used only in vmalloc.c, so move the code there.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Joonsoo Kim
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Atsushi Kumagai
    Cc: Chris Metcalf
    Cc: Dave Anderson
    Cc: Eric Biederman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

30 Jul, 2012

2 commits

  • This patch changes dma-mapping subsystem to use generic vmalloc areas
    for all consistent dma allocations. This increases the total size limit
    of the consistent allocations and removes platform hacks and a lot of
    duplicated code.

    Atomic allocations are served from a special pool preallocated at boot,
    because vmalloc areas cannot be reliably created in atomic context.

    Signed-off-by: Marek Szyprowski
    Reviewed-by: Kyungmin Park
    Reviewed-by: Minchan Kim

    Marek Szyprowski
     
  • 'const void *' is a safer type for the caller function argument. This
    patch updates all references to the caller type accordingly.

    Signed-off-by: Marek Szyprowski
    Reviewed-by: Kyungmin Park
    Reviewed-by: Minchan Kim

    Marek Szyprowski
     

06 Dec, 2011

1 commit


19 Nov, 2011

1 commit

  • The existing vm_area_register_early() allows for early vmalloc space
    allocation. However upcoming cleanups in the ARM architecture require
    that some fixed locations in the vmalloc area be reserved also very early.

    The name "vm_area_register_early" would have been a good name for the
    reservation part without the allocation. Since it is already in use with
    different semantics, let's create vm_area_add_early() instead.

    Both vm_area_register_early() and vm_area_add_early() can be used
    together, meaning the former is now implemented using the latter, which
    ensures that no conflicting areas are added; but no attempt is made to
    make the allocation scheme in vm_area_register_early() more
    sophisticated. After all, you must know what you're doing when using
    those functions.
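    A minimal sketch of the reservation-only helper, assuming the vmlist
    linkage described above:

        void __init vm_area_add_early(struct vm_struct *vm)
        {
                struct vm_struct *tmp, **p;

                /* insert in address order, refusing overlapping areas */
                for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
                        if (tmp->addr >= vm->addr) {
                                BUG_ON(tmp->addr < vm->addr + vm->size);
                                break;
                        }
                        BUG_ON(tmp->addr + tmp->size > vm->addr);
                }
                vm->next = *p;
                *p = vm;
        }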

    Signed-off-by: Nicolas Pitre
    Acked-by: Andrew Morton
    Cc: linux-mm@kvack.org

    Nicolas Pitre
     

17 Nov, 2011

1 commit

  • When mapping a foreign page with xenbus_map_ring_valloc() with the
    GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
    pass a pointer to the PTE (in init_mm).

    After the page is mapped, the usual fault mechanism can be used to
    update additional MMs. This allows the vmalloc_sync_all() to be
    removed from alloc_vm_area().

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton
    [v1: Squashed fix by Michal for no-mmu case]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Michal Simek

    David Vrabel
     

01 Nov, 2011

1 commit

  • /proc/vmallocinfo shows information about vmalloc allocations in
    vmlist, which is a linked list of vm_struct. It may, however, access
    the pages field of a vm_struct for which no page has been allocated.
    This results in a null pointer access and leads to a kernel panic.

    Why this happens: in __vmalloc_node_range(), called from vmalloc(), the
    newly allocated vm_struct is added to vmlist in __get_vm_area_node(),
    and only then are fields such as nr_pages and pages set in
    __vmalloc_area_node(). In other words, it is added to vmlist before it
    is fully initialized. Meanwhile, when /proc/vmallocinfo is read, it
    accesses the pages field of the vm_struct according to the nr_pages
    field in show_numa_info(). Thus, a null pointer access happens.

    The patch adds the newly allocated vm_struct to vmlist *after* it is
    fully initialized, so show_numa_info() can no longer see a pages field
    whose pages have not yet been allocated.

    Signed-off-by: Mitsuo Hayasaka
    Cc: Andrew Morton
    Cc: David Rientjes
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Jeremy Fitzhardinge
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitsuo Hayasaka
     

28 Mar, 2011

1 commit

  • The percpu code requires more functions to be implemented in the mm core
    which nommu currently does not provide. So add inline implementations
    since these are largely meaningless on nommu systems.

    Signed-off-by: Graf Yang
    Signed-off-by: Mike Frysinger
    Signed-off-by: David Howells
    Acked-by: Greg Ungerer

    Graf Yang
     

14 Jan, 2011

3 commits

  • Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
    module_init(). Much of the code is duplicated and can be generalized in a
    globally accessible function, __vmalloc_node_range().

    __vmalloc_node() now calls into __vmalloc_node_range() with a range of
    [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.

    Each architecture may then use __vmalloc_node_range() directly to remove
    the duplication of code.
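    The wrapper then reduces to roughly (a sketch):

        static void *__vmalloc_node(unsigned long size, unsigned long align,
                                    gfp_t gfp_mask, pgprot_t prot,
                                    int node, void *caller)
        {
                return __vmalloc_node_range(size, align,
                                        VMALLOC_START, VMALLOC_END,
                                        gfp_mask, prot, node, caller);
        }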

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
    formal and use the mask internally.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • get_vm_area_node() is unused in the kernel and can thus be removed.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Dec, 2010

1 commit

  • On stock 2.6.37-rc4, running:

    # mount lilith:/export /mnt/lilith
    # find /mnt/lilith/ -type f -print0 | xargs -0 file

    crashes the machine fairly quickly under Xen. Often it results in oops
    messages, but the couple of times I tried just now, it just hung quietly
    and made Xen print some rude messages:

    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp 3000000000000000) for mfn 1d7058 (pfn 18fa7)
    (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp 1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
    (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04

    Which means the domain tried to map a pagetable page RW, which would
    allow it to map arbitrary memory, so Xen stopped it. This is because
    vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
    finished with them, and those pages got recycled as pagetable pages
    while still having these RW aliases.

    Removing those mappings immediately removes the Xen-visible aliases, and
    so it has no problem with those pages being reused as pagetable pages.
    Deferring the TLB flush doesn't upset Xen because it can flush the TLB
    itself as needed to maintain its invariants.

    When unmapping a region in the vmalloc space, clear the ptes
    immediately. There's no point in deferring this because there's no
    amortization benefit.

    The TLBs are left dirty, and they are flushed lazily to amortize the
    cost of the IPIs.

    The specific motivation for this patch is an oops-causing regression
    since 2.6.36 when using NFS under Xen, triggered by the NFS client's
    use of vm_map_ram() introduced in 56e4ebf877b60 ("NFS: readdir with
    vmapped pages"). XFS also uses vm_map_ram() and could cause similar
    problems.

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Nick Piggin
    Cc: Bryan Schumaker
    Cc: Trond Myklebust
    Cc: Alex Elder
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     

27 Oct, 2010

1 commit

  • Add vzalloc() and vzalloc_node() to encapsulate the
    vmalloc-then-memset-zero operation.

    Use __GFP_ZERO to zero fill the allocated memory.
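    A minimal sketch of what the wrapper amounts to, assuming the existing
    __vmalloc() entry point:

        void *vzalloc(unsigned long size)
        {
                /* __GFP_ZERO makes the allocator zero-fill the pages */
                return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
                                 PAGE_KERNEL);
        }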

    Signed-off-by: Dave Young
    Cc: Christoph Lameter
    Acked-by: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

08 Sep, 2010

1 commit


13 Aug, 2010

1 commit

  • * 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    x86: Detect whether we should use Xen SWIOTLB.
    pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.
    swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.
    xen/mmu: inhibit vmap aliases rather than trying to clear them out
    vmap: add flag to allow lazy unmap to be disabled at runtime
    xen: Add xen_create_contiguous_region
    xen: Rename the balloon lock
    xen: Allow unprivileged Xen domains to create iomap pages
    xen: use _PAGE_IOMAP in ioremap to do machine mappings

    Fix up trivial conflicts (adding both xen swiotlb and xen pci platform
    driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and
    include/xen/xen-ops.h

    Linus Torvalds
     

27 Jul, 2010

1 commit


10 Jul, 2010

1 commit

  • The current x86 ioremap() doesn't handle physical addresses above 32
    bits properly in X86_32 PAE mode. When such a physical address is
    passed to ioremap(), its upper 32 bits are wrongly cleared. Due to this
    bug, ioremap() can map the wrong address into the linear address space.
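    The failure mode is a 64-bit physical address being narrowed through a
    32-bit-wide mask; purely illustrative, not the actual ioremap code:

        u64 phys_addr = 0x100000000ULL;         /* MMIO resource above 4GB */
        u64 masked = phys_addr & PAGE_MASK;     /* PAGE_MASK is a 32-bit unsigned long */
        /* PAGE_MASK zero-extends to 64 bits, so the upper 32 bits of
         * phys_addr are cleared: masked == 0 here, not 0x100000000. */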

    In my case, 64-bit MMIO region was assigned to a PCI device (ioat
    device) on my system. Because of the ioremap()'s bug, wrong physical
    address (instead of MMIO region) was mapped to linear address space.
    Because of this, loading ioatdma driver caused unexpected behavior
    (kernel panic, kernel hangup, ...).

    Signed-off-by: Kenji Kaneshige
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Kenji Kaneshige
     

14 Aug, 2009

1 commit

  • To directly use spread NUMA memories for percpu units, the percpu
    allocator will be updated to allow sparsely mapping units in a chunk.
    As the distances between units can be very large, this makes allocating
    a single vmap area for each chunk undesirable. This patch implements
    pcpu_get_vm_areas() and pcpu_free_vm_areas(), which allocate and free
    sparse congruent vmap areas.

    pcpu_get_vm_areas() takes @offsets and @sizes arrays which define the
    distances and sizes of the vmap areas. It scans down from the top of
    the vmalloc area looking for the top-most address which can accommodate
    all the areas. The top-down scan is to avoid interacting with regular
    vmallocs, which could otherwise push these congruent areas up little by
    little, wasting address space and page tables.
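    The interface would look roughly like this (a sketch; the gfp_mask
    argument was dropped again later, see the 14 Jan, 2011 entry above):

        struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
                                             const unsigned long *sizes,
                                             int nr_vms, size_t align,
                                             gfp_t gfp_mask);
        void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);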

    To speed up top-down scan, the highest possible address hint is
    maintained. Although the scan is linear from the hint, given the
    usual large holes between memory addresses between NUMA nodes, the
    scanning is highly likely to finish after finding the first hole for
    the last unit which is scanned first.

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin

    Tejun Heo
     

25 Feb, 2009

1 commit


24 Feb, 2009

1 commit

  • Impact: allow larger alignment for early vmalloc area allocation

    Some early vmalloc users might want larger alignment, for example, for
    custom large page mapping. Add @align to vm_area_register_early().
    While at it, drop docbook comment on non-existent @size.
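    After the change the prototype would be roughly (a sketch):

        void __init vm_area_register_early(struct vm_struct *vm, size_t align);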

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin
    Cc: Ivan Kokshaysky

    Tejun Heo
     

20 Feb, 2009

2 commits

  • Impact: two more public map/unmap functions

    Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
    These functions respectively map and unmap an address range in the
    kernel VM area, but don't do any vcache or TLB flushing. They will be
    used by the new percpu allocator.
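    The prototypes would be roughly (a sketch):

        int map_kernel_range_noflush(unsigned long addr, unsigned long size,
                                     pgprot_t prot, struct page **pages);
        void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);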

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin

    Tejun Heo
     
  • Impact: allow multiple early vm areas

    There are places where a kernel VM area needs to be allocated before
    vmalloc is initialized. This is done by allocating a static vm_struct,
    initializing several fields and linking it into vmlist; vmalloc
    initialization later picks these up from vmlist. This is currently done
    manually and, if there is more than one such area, there is no defined
    way to arbitrate who gets which address.

    This patch implements vm_area_register_early(), which takes a vm_struct
    with its flags and size initialized, assigns an address to it and puts
    it on vmlist. This way, multiple early vm areas can determine which
    addresses they should use. The only current user - alpha mm init - is
    converted to use it.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

19 Feb, 2009

1 commit

  • We have get_vm_area_caller() and __get_vm_area() but not
    __get_vm_area_caller()

    On powerpc, I use __get_vm_area() to separate the ranges of addresses
    given to vmalloc vs. ioremap (various good reasons for that) so in order
    to be able to implement the new caller tracking in /proc/vmallocinfo, I
    need a "_caller" variant of it.

    (akpm: needed for ongoing powerpc development, so merge it early)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

07 Jan, 2009

1 commit

  • Sparse outputs the following warnings:

    mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static?
    mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static?

    However, these functions are used by /dev/kmem; fixed here.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

23 Oct, 2008

1 commit


20 Oct, 2008

1 commit

  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).
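    A rough sketch of the per-CPU interface described above (signatures are
    approximate for this era of the API):

        /* map 'count' pages into a contiguous kernel virtual range */
        void *vm_map_ram(struct page **pages, unsigned int count,
                         int node, pgprot_t prot);

        /* unmap it again; the TLB flush may be deferred (lazy) */
        void vm_unmap_ram(const void *mem, unsigned int count);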

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping, touching, then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4 core, 2 socket Opteron. Results
    are in nanoseconds per map+touch+unmap.

    threads    vanilla    vmap rewrite
    1          14700      2900
    2          33600      3000
    4          49500      2800
    8          70631      2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

17 Aug, 2008

1 commit

  • Try to comment away a little of the confusion between mm's vm_area_struct
    vm_flags and vmalloc's vm_struct flags: based on an idea by Ulrich Drepper.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Apr, 2008

2 commits

  • Add caller information so that /proc/vmallocinfo shows where the allocation
    request for a slice of vmalloc memory originated.

    Results in output like this:

    0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
    0xffffc20000801000-0xffffc20000806000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
    0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages
    0xffffc20000c07000-0xffffc20000c0a000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
    0xffffc20000c0a000-0xffffc20000c0c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c0c000-0xffffc20000c0f000 12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap
    0xffffc20000c10000-0xffffc20000c15000 20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap
    0xffffc20000c16000-0xffffc20000c18000 8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap
    0xffffc20000c18000-0xffffc20000c1a000 8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap
    0xffffc20000c1a000-0xffffc20000c1c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c1c000-0xffffc20000c1e000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c1e000-0xffffc20000c20000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c20000-0xffffc20000c22000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c22000-0xffffc20000c24000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c24000-0xffffc20000c26000 8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap
    0xffffc20000c26000-0xffffc20000c28000 8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap
    0xffffc20000c28000-0xffffc20000c2d000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
    0xffffc20000c2d000-0xffffc20000c31000 16384 tcp_init+0xd5/0x31c pages=3 vmalloc
    0xffffc20000c31000-0xffffc20000c34000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
    0xffffc20000c34000-0xffffc20000c36000 8192 init_vdso_vars+0xde/0x1f1
    0xffffc20000c36000-0xffffc20000c38000 8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap
    0xffffc20000c38000-0xffffc20000c3a000 8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap
    0xffffc20000c3a000-0xffffc20000c3e000 16384 sys_swapon+0x509/0xa15 pages=3 vmalloc
    0xffffc20000c40000-0xffffc20000c61000 135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap
    0xffffc20000c61000-0xffffc20000c6a000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c6a000-0xffffc20000c73000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c73000-0xffffc20000c7c000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c7c000-0xffffc20000c7f000 12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc
    0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap
    0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc
    0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
    0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages
    0xffffc20002204000-0xffffc2000220d000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc2000220d000-0xffffc20002216000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002216000-0xffffc2000221f000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc2000221f000-0xffffc20002228000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002228000-0xffffc20002231000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002231000-0xffffc20002234000 12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc
    0xffffc20002240000-0xffffc20002261000 135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap
    0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages
    0xffffffffa0000000-0xffffffffa0022000 139264 module_alloc+0x4f/0x55 pages=33 vmalloc
    0xffffffffa0022000-0xffffffffa0029000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc
    0xffffffffa002b000-0xffffffffa0034000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
    0xffffffffa0034000-0xffffffffa003d000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
    0xffffffffa003d000-0xffffffffa0049000 49152 module_alloc+0x4f/0x55 pages=11 vmalloc
    0xffffffffa0049000-0xffffffffa0050000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement a new proc file that allows the display of the currently allocated
    vmalloc memory.

    It lets one see the users of vmalloc, which is important if vmalloc
    space is scarce (i386, for example).

    And it's going to be important for the compound page fallback to
    vmalloc. Many of the current users can be switched to use compound
    pages with fallback. This means that the number of users of vmalloc is
    reduced and page tables are no longer necessary to access the memory.
    /proc/vmallocinfo allows one to review how that reduction occurs.

    If memory becomes fragmented and larger order allocations are no longer
    possible, then /proc/vmallocinfo lets one see which compound page
    allocations fell back to virtual compound pages. That is important for
    new users of virtual compound pages, such as order-1 stack allocations
    that may fall back to virtual compound pages in the future.

    /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible
    information leakage.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: CONFIG_MMU=n build fix]
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Feb, 2008

1 commit


22 Jul, 2007

1 commit

  • get_vm_area always returns an area with an adjacent guard page. That guard
    page is included in vm_struct.size. iounmap uses vm_struct.size to
    determine how much address space needs to have change_page_attr applied to
    it, which will BUG if applied to the guard page.

    This patch adds a helper function - get_vm_area_size() in linux/vmalloc.h -
    to return the actual size of a vm area, and uses it to make iounmap do the
    right thing. There are probably other places which should be using
    get_vm_area_size().
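    A minimal sketch of the helper, assuming the guard page is the trailing
    page counted in vm_struct.size:

        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                /* vm_struct.size includes the adjacent guard page */
                return area->size - PAGE_SIZE;
        }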

    Thanks to Dave Young for debugging the
    problem.

    [ Andi, it wasn't clear to me whether x86_64 needs the same fix. ]

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Dave Young
    Cc: Chuck Ebbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge