02 Nov, 2015

1 commit

  • It turns out that at least some versions of glibc end up reading
    /proc/meminfo at every single startup, because glibc wants to know the
    amount of memory the machine has. And while that's arguably insane,
    it's just how things are.

    And it turns out that it's not all that expensive most of the time, but
    the vmalloc information statistics (amount of virtual memory used in the
    vmalloc space, and the biggest remaining chunk) can be rather expensive
    to compute.

    The 'get_vmalloc_info()' function actually showed up on my profiles as
    4% of the CPU usage of "make test" in the git source repository, because
    the git tests are lots of very short-lived shell-scripts etc.

    It turns out that apparently this same silly vmalloc info gathering
    shows up on the facebook servers too, according to Dave Jones. So it's
    not just "make test" for git.

    We had two patches to just cache the information (one by me, one by
    Ingo) to mitigate this issue, but the whole vmalloc information is of
    rather dubious value to begin with, and people who *actually* want to
    know what the situation is wrt the vmalloc area should just look at the
    much more complete /proc/vmallocinfo instead.

    In fact, according to my testing - and perhaps more importantly,
    according to that big search engine in the sky: Google - there is
    nothing out there that actually cares about those two expensive fields:
    VmallocUsed and VmallocChunk.

    So let's try to just remove them entirely. Actually, this just removes
    the computation and reports the numbers as zero for now, just to try to
    be minimally intrusive.
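    A rough sketch of what "report the numbers as zero" amounts to in
    fs/proc/meminfo.c (the format string is an approximation of the real
    one; VmallocTotal stays as-is):

        seq_printf(m,
                "VmallocTotal:   %8lu kB\n"
                "VmallocUsed:    %8lu kB\n"
                "VmallocChunk:   %8lu kB\n",
                (unsigned long)VMALLOC_TOTAL >> 10,
                0ul,    /* was vmi.used >> 10 */
                0ul);   /* was vmi.largest_chunk >> 10 */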

    If this breaks anything, we'll obviously have to re-introduce the code
    to compute this all and add the caching patches on top. But if given
    the option, I'd really prefer to just remove this bad idea entirely
    rather than add even more code to work around our historical mistake
    that likely nobody really cares about.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Mar, 2015

1 commit

  • Current approach in handling shadow memory for modules is broken.

    Shadow memory can be freed only after the memory it shadows is no
    longer in use. vfree() called from interrupt context uses the memory it
    is freeing to store a 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
                struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
                if (llist_add((struct llist_node *)addr, &p->list))
                        schedule_work(&p->wq);

    This list node is later used in free_work(), which actually frees the
    memory. So module_memfree() called in interrupt context currently frees
    the shadow before freeing the module's memory, which can provoke a
    kernel crash.

    So shadow memory should be freed after module's memory. However, such
    deallocation order could race with kasan_module_alloc() in module_alloc().

    Free the shadow right before releasing the vm area. At that point the
    vfree()'d memory is no longer used but not yet available for other
    allocations. A new VM_KASAN flag is used to indicate that a vm area has
    dynamically allocated shadow memory, so kasan frees the shadow only if
    it was previously allocated.
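    A minimal sketch of the resulting free path, using the names mentioned
    above (the exact call site when the vm area is released is an
    assumption):

        void kasan_free_shadow(const struct vm_struct *vm)
        {
                /* only free shadow that was dynamically allocated for this area */
                if (vm->flags & VM_KASAN)
                        vfree(kasan_mem_to_shadow(vm->addr));
        }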

    Signed-off-by: Andrey Ryabinin
    Acked-by: Rusty Russell
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

14 Feb, 2015

2 commits

  • For instrumenting global variables, KASan will shadow the memory
    backing modules. So on module load we will need to allocate memory for
    the shadow and map it at the shadow address that corresponds to the
    address allocated by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it
    puts a guard hole after the allocated area. A guard hole in shadow
    memory would be a problem because at some future point we might need
    shadow memory at the address occupied by that guard hole, and so could
    fail to allocate shadow for module_alloc().

    Now that we have the VM_NO_GUARD flag for disabling the guard page, we
    need a way to pass it into __vmalloc_node_range(). Add a new 'vm_flags'
    parameter to __vmalloc_node_range().
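    With the new parameter the prototype would look roughly like this (a
    sketch; shadow allocations could then pass VM_NO_GUARD in vm_flags):

        void *__vmalloc_node_range(unsigned long size, unsigned long align,
                                   unsigned long start, unsigned long end,
                                   gfp_t gfp_mask, pgprot_t prot,
                                   unsigned long vm_flags, int node,
                                   const void *caller);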

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • For instrumenting global variables, KASan will shadow the memory
    backing modules. So on module load we will need to allocate memory for
    the shadow and map it at the shadow address that corresponds to the
    address allocated by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it
    puts a guard hole after the allocated area. A guard hole in shadow
    memory would be a problem because at some future point we might need
    shadow memory at the address occupied by that guard hole, and so could
    fail to allocate shadow for module_alloc().

    Add a new vm_struct flag 'VM_NO_GUARD' indicating that the vm area
    doesn't have a guard hole.
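    A sketch of how the flag would interact with size accounting (the flag
    value is illustrative):

        #define VM_NO_GUARD     0x00000040      /* don't add a guard page */

        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                if (!(area->flags & VM_NO_GUARD))
                        /* return the actual size without the guard page */
                        return area->size - PAGE_SIZE;
                return area->size;
        }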

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 Aug, 2014

1 commit

  • Currently map_vm_area() takes (struct page *** pages) as its third
    argument, and after mapping it advances (*pages) to point to (*pages +
    nr_mapped_pages).

    This increment is useless to its callers these days. The callers don't
    care about it and actually work around it by passing a separate copy to
    map_vm_area().

    The caller can always guarantee that all the pages can be mapped into
    the vm_area given in the first argument, and only cares about whether
    map_vm_area() fails or not.

    This patch cleans up the pointer movement in map_vm_area() and updates
    its callers accordingly.
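    The prototype change amounts to roughly this (a sketch):

        /* before: advances *pages past the mapped pages */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages);

        /* after: just maps the pages and reports success or failure */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages);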

    Signed-off-by: WANG Chao
    Cc: Zhang Yanfei
    Acked-by: Greg Kroah-Hartman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Chao
     

10 Jul, 2013

1 commit

  • VM_UNLIST was used to indicate that the vm_struct is not listed in
    vmlist.

    But after commit 4341fa454796 ("mm, vmalloc: remove list management of
    vmlist after initializing vmalloc"), the meaning of this flag changed.
    It now means the vm_struct is not fully initialized. So renaming it to
    VM_UNINITIALIZED seems more reasonable.

    Also change clear_vm_unlist to clear_vm_uninitialized_flag.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     

04 Jul, 2013

1 commit

  • We want to allocate the ELF note segment buffer on the 2nd kernel in
    vmalloc space and remap it to user-space, in order to reduce the risk
    that memory allocation fails on systems with a huge number of CPUs and
    hence a huge ELF note segment exceeding the order-11 block size.

    Although remap_vmalloc_range() already exists for remapping vmalloc
    memory to user-space, it requires the user-space range to be specified
    via a vma. Mmap on /proc/vmcore needs to remap a range across multiple
    objects, so an interface that requires the vma to cover the full range
    is problematic.

    This patch introduces remap_vmalloc_range_partial, which receives the
    user-space range as a pair of base address and size and can be used for
    the mmap-on-/proc/vmcore case.

    remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.
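    The new interface would look roughly like this, with remap_vmalloc_range()
    expressible on top of it (a sketch):

        int remap_vmalloc_range_partial(struct vm_area_struct *vma,
                                        unsigned long uaddr, void *kaddr,
                                        unsigned long size);

        int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
                                unsigned long pgoff)
        {
                return remap_vmalloc_range_partial(vma, vma->vm_start,
                                        addr + (pgoff << PAGE_SHIFT),
                                        vma->vm_end - vma->vm_start);
        }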

    [akpm@linux-foundation.org: use PAGE_ALIGNED()]
    Signed-off-by: HATAYAMA Daisuke
    Cc: KOSAKI Motohiro
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Lisa Mitchell
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HATAYAMA Daisuke
     

30 Apr, 2013

3 commits

  • Now vmap_area_list is exported as VMCOREINFO so that makedumpfile can
    get the start address of the vmalloc region (vmalloc_start). The
    address holding the vmalloc_start value is computed as below:

    vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)

    However, neither OFFSET(vmap_area.va_start) nor OFFSET(vmap_area.list)
    is exported as VMCOREINFO.

    So this patch exports them, with a small cleanup.
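    The export would amount to something like the following in the
    vmcoreinfo setup code (a sketch):

        VMCOREINFO_OFFSET(vmap_area, va_start);
        VMCOREINFO_OFFSET(vmap_area, list);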

    [akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
    Signed-off-by: Atsushi Kumagai
    Cc: Joonsoo Kim
    Cc: Joonsoo Kim
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Atsushi Kumagai
    Cc: Chris Metcalf
    Cc: Dave Anderson
    Cc: Eric Biederman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
     
  • Although our intention is to unexport the internal structures entirely,
    there is one exception for kexec: kexec dumps the address of vmlist,
    and makedumpfile uses this information.

    We are about to remove vmlist, so makedumpfile needs another way to
    retrieve information about the vmalloc layer. For this purpose, export
    vmap_area_list instead of vmlist.
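    In vmcoreinfo terms this is roughly (a sketch):

        VMCOREINFO_SYMBOL(vmap_area_list);      /* instead of vmlist */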

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Joonsoo Kim
    Cc: Eric Biederman
    Cc: Dave Anderson
    Cc: Vivek Goyal
    Cc: Atsushi Kumagai
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • get_vmalloc_info() currently lives in fs/proc/mmu.c. There is no reason
    this code must be there: its implementation needs vmlist_lock and
    iterates vmlist, which may be considered internal data structures of
    vmalloc.

    For maintainability it is preferable that vmlist_lock and vmlist be
    used only in vmalloc.c, so move the code there.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Joonsoo Kim
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Atsushi Kumagai
    Cc: Chris Metcalf
    Cc: Dave Anderson
    Cc: Eric Biederman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

30 Jul, 2012

2 commits

  • This patch changes dma-mapping subsystem to use generic vmalloc areas
    for all consistent dma allocations. This increases the total size limit
    of the consistent allocations and removes platform hacks and a lot of
    duplicated code.

    Atomic allocations are served from a special pool preallocated at boot,
    because vmalloc areas cannot be reliably created in atomic context.

    Signed-off-by: Marek Szyprowski
    Reviewed-by: Kyungmin Park
    Reviewed-by: Minchan Kim

    Marek Szyprowski
     
  • 'const void *' is a safer type for the caller function argument. This
    patch updates all references to the caller type accordingly.

    Signed-off-by: Marek Szyprowski
    Reviewed-by: Kyungmin Park
    Reviewed-by: Minchan Kim

    Marek Szyprowski
     

06 Dec, 2011

1 commit


19 Nov, 2011

1 commit

  • The existing vm_area_register_early() allows for early vmalloc space
    allocation. However upcoming cleanups in the ARM architecture require
    that some fixed locations in the vmalloc area be reserved also very early.

    The name "vm_area_register_early" would have been a good name for the
    reservation part without the allocation. Since it is already in use with
    different semantics, let's create vm_area_add_early() instead.

    Both vm_area_register_early() and vm_area_add_early() can be used
    together, meaning the former is now implemented using the latter, which
    ensures that no conflicting areas are added; but no attempt is made to
    make the allocation scheme in vm_area_register_early() more
    sophisticated. After all, you must know what you're doing when using
    those functions.
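    A minimal sketch of the reservation-only helper, assuming the vmlist
    linkage described above:

        void __init vm_area_add_early(struct vm_struct *vm)
        {
                struct vm_struct *tmp, **p;

                /* insert in address order, refusing overlapping areas */
                for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
                        if (tmp->addr >= vm->addr) {
                                BUG_ON(tmp->addr < vm->addr + vm->size);
                                break;
                        }
                        BUG_ON(tmp->addr + tmp->size > vm->addr);
                }
                vm->next = *p;
                *p = vm;
        }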

    Signed-off-by: Nicolas Pitre
    Acked-by: Andrew Morton
    Cc: linux-mm@kvack.org

    Nicolas Pitre
     

17 Nov, 2011

1 commit

  • When mapping a foreign page with xenbus_map_ring_valloc() with the
    GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
    pass a pointer to the PTE (in init_mm).

    After the page is mapped, the usual fault mechanism can be used to
    update additional MMs. This allows the vmalloc_sync_all() to be
    removed from alloc_vm_area().

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton
    [v1: Squashed fix by Michal for no-mmu case]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Michal Simek

    David Vrabel
     

01 Nov, 2011

1 commit

  • /proc/vmallocinfo shows information about vmalloc allocations in
    vmlist, which is a linked list of vm_struct. It may, however, access
    the pages field of a vm_struct for which no page has been allocated.
    This results in a null pointer access and leads to a kernel panic.

    Why this happens: in __vmalloc_node_range(), called from vmalloc(), the
    newly allocated vm_struct is added to vmlist in __get_vm_area_node(),
    and only then are fields such as nr_pages and pages set in
    __vmalloc_area_node(). In other words, it is added to vmlist before it
    is fully initialized. Meanwhile, when /proc/vmallocinfo is read, it
    accesses the pages field of the vm_struct according to the nr_pages
    field in show_numa_info(). Thus, a null pointer access happens.

    The patch adds the newly allocated vm_struct to vmlist *after* it is
    fully initialized, so show_numa_info() can no longer see a pages field
    whose pages have not yet been allocated.

    Signed-off-by: Mitsuo Hayasaka
    Cc: Andrew Morton
    Cc: David Rientjes
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Jeremy Fitzhardinge
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitsuo Hayasaka
     

28 Mar, 2011

1 commit

  • The percpu code requires more functions to be implemented in the mm core
    which nommu currently does not provide. So add inline implementations
    since these are largely meaningless on nommu systems.

    Signed-off-by: Graf Yang
    Signed-off-by: Mike Frysinger
    Signed-off-by: David Howells
    Acked-by: Greg Ungerer

    Graf Yang
     

14 Jan, 2011

3 commits

  • Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
    module_init(). Much of the code is duplicated and can be generalized in a
    globally accessible function, __vmalloc_node_range().

    __vmalloc_node() now calls into __vmalloc_node_range() with a range of
    [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.

    Each architecture may then use __vmalloc_node_range() directly to remove
    the duplication of code.
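    The wrapper then reduces to roughly (a sketch):

        static void *__vmalloc_node(unsigned long size, unsigned long align,
                                    gfp_t gfp_mask, pgprot_t prot,
                                    int node, void *caller)
        {
                return __vmalloc_node_range(size, align,
                                        VMALLOC_START, VMALLOC_END,
                                        gfp_mask, prot, node, caller);
        }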

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
    formal and use the mask internally.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • get_vm_area_node() is unused in the kernel and can thus be removed.

    Signed-off-by: David Rientjes
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Dec, 2010

1 commit

  • On stock 2.6.37-rc4, running:

    # mount lilith:/export /mnt/lilith
    # find /mnt/lilith/ -type f -print0 | xargs -0 file

    crashes the machine fairly quickly under Xen. Often it results in oops
    messages, but the couple of times I tried just now, it just hung quietly
    and made Xen print some rude messages:

    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp 3000000000000000) for mfn 1d7058 (pfn 18fa7)
    (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp 1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
    (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04

    Which means the domain tried to map a pagetable page RW, which would
    allow it to map arbitrary memory, so Xen stopped it. This is because
    vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
    finished with them, and those pages got recycled as pagetable pages
    while still having these RW aliases.

    Removing those mappings immediately removes the Xen-visible aliases, and
    so it has no problem with those pages being reused as pagetable pages.
    Deferring the TLB flush doesn't upset Xen because it can flush the TLB
    itself as needed to maintain its invariants.

    When unmapping a region in the vmalloc space, clear the ptes
    immediately. There's no point in deferring this because there's no
    amortization benefit.

    The TLBs are left dirty, and they are flushed lazily to amortize the
    cost of the IPIs.

    The specific motivation for this patch is an oops-causing regression
    since 2.6.36 when using NFS under Xen, triggered by the NFS client's
    use of vm_map_ram() introduced in 56e4ebf877b60 ("NFS: readdir with
    vmapped pages"). XFS also uses vm_map_ram() and could cause similar
    problems.

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Nick Piggin
    Cc: Bryan Schumaker
    Cc: Trond Myklebust
    Cc: Alex Elder
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     

27 Oct, 2010

1 commit

  • Add vzalloc() and vzalloc_node() to encapsulate the
    vmalloc-then-memset-zero operation.

    Use __GFP_ZERO to zero fill the allocated memory.
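    A minimal sketch of what the wrapper amounts to, assuming the existing
    __vmalloc() entry point:

        void *vzalloc(unsigned long size)
        {
                /* __GFP_ZERO makes the allocator zero-fill the pages */
                return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
                                 PAGE_KERNEL);
        }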

    Signed-off-by: Dave Young
    Cc: Christoph Lameter
    Acked-by: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

08 Sep, 2010

1 commit


13 Aug, 2010

1 commit

  • * 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    x86: Detect whether we should use Xen SWIOTLB.
    pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.
    swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.
    xen/mmu: inhibit vmap aliases rather than trying to clear them out
    vmap: add flag to allow lazy unmap to be disabled at runtime
    xen: Add xen_create_contiguous_region
    xen: Rename the balloon lock
    xen: Allow unprivileged Xen domains to create iomap pages
    xen: use _PAGE_IOMAP in ioremap to do machine mappings

    Fix up trivial conflicts (adding both xen swiotlb and xen pci platform
    driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and
    include/xen/xen-ops.h

    Linus Torvalds
     

27 Jul, 2010

1 commit


10 Jul, 2010

1 commit

  • The current x86 ioremap() doesn't handle physical addresses above 32
    bits properly in X86_32 PAE mode. When such a physical address is
    passed to ioremap(), its upper 32 bits are wrongly cleared. Due to this
    bug, ioremap() can map the wrong address into the linear address space.
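    The failure mode is a 64-bit physical address being narrowed through a
    32-bit-wide mask; purely illustrative, not the actual ioremap code:

        u64 phys_addr = 0x100000000ULL;         /* MMIO resource above 4GB */
        u64 masked = phys_addr & PAGE_MASK;     /* PAGE_MASK is a 32-bit unsigned long */
        /* PAGE_MASK zero-extends to 64 bits, so the upper 32 bits of
         * phys_addr are cleared: masked == 0 here, not 0x100000000. */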

    In my case, 64-bit MMIO region was assigned to a PCI device (ioat
    device) on my system. Because of the ioremap()'s bug, wrong physical
    address (instead of MMIO region) was mapped to linear address space.
    Because of this, loading ioatdma driver caused unexpected behavior
    (kernel panic, kernel hangup, ...).

    Signed-off-by: Kenji Kaneshige
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Kenji Kaneshige
     

14 Aug, 2009

1 commit

  • To directly use spread NUMA memories for percpu units, the percpu
    allocator will be updated to allow sparsely mapping units in a chunk.
    As the distances between units can be very large, this makes allocating
    a single vmap area for each chunk undesirable. This patch implements
    pcpu_get_vm_areas() and pcpu_free_vm_areas(), which allocate and free
    sparse congruent vmap areas.

    pcpu_get_vm_areas() takes @offsets and @sizes arrays which define the
    distances and sizes of the vmap areas. It scans down from the top of
    the vmalloc area looking for the top-most address which can accommodate
    all the areas. The top-down scan is to avoid interacting with regular
    vmallocs, which could otherwise push these congruent areas up little by
    little, wasting address space and page tables.
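    The interface would look roughly like this (a sketch; the gfp_mask
    argument was dropped again later, see the 14 Jan, 2011 entry above):

        struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
                                             const unsigned long *sizes,
                                             int nr_vms, size_t align,
                                             gfp_t gfp_mask);
        void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);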

    To speed up top-down scan, the highest possible address hint is
    maintained. Although the scan is linear from the hint, given the
    usual large holes between memory addresses between NUMA nodes, the
    scanning is highly likely to finish after finding the first hole for
    the last unit which is scanned first.

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin

    Tejun Heo
     

25 Feb, 2009

1 commit


24 Feb, 2009

1 commit

  • Impact: allow larger alignment for early vmalloc area allocation

    Some early vmalloc users might want larger alignment, for example, for
    custom large page mapping. Add @align to vm_area_register_early().
    While at it, drop docbook comment on non-existent @size.
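    After the change the prototype would be roughly (a sketch):

        void __init vm_area_register_early(struct vm_struct *vm, size_t align);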

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin
    Cc: Ivan Kokshaysky

    Tejun Heo
     

20 Feb, 2009

2 commits

  • Impact: two more public map/unmap functions

    Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
    These functions respectively map and unmap an address range in the
    kernel VM area, but don't do any vcache or TLB flushing. They will be
    used by the new percpu allocator.
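    The prototypes would be roughly (a sketch):

        int map_kernel_range_noflush(unsigned long addr, unsigned long size,
                                     pgprot_t prot, struct page **pages);
        void unmap_kernel_range_noflush(unsigned long addr, unsigned long size);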

    Signed-off-by: Tejun Heo
    Cc: Nick Piggin

    Tejun Heo
     
  • Impact: allow multiple early vm areas

    There are places where a kernel VM area needs to be allocated before
    vmalloc is initialized. This is done by allocating a static vm_struct,
    initializing several fields and linking it into vmlist; vmalloc
    initialization later picks these up from vmlist. This is currently done
    manually and, if there is more than one such area, there is no defined
    way to arbitrate who gets which address.

    This patch implements vm_area_register_early(), which takes a vm_struct
    with its flags and size initialized, assigns an address to it and puts
    it on vmlist. This way, multiple early vm areas can determine which
    addresses they should use. The only current user - alpha mm init - is
    converted to use it.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

19 Feb, 2009

1 commit

  • We have get_vm_area_caller() and __get_vm_area() but not
    __get_vm_area_caller()

    On powerpc, I use __get_vm_area() to separate the ranges of addresses
    given to vmalloc vs. ioremap (various good reasons for that) so in order
    to be able to implement the new caller tracking in /proc/vmallocinfo, I
    need a "_caller" variant of it.

    (akpm: needed for ongoing powerpc development, so merge it early)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

07 Jan, 2009

1 commit

  • Sparse outputs the following warnings:

    mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static?
    mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static?

    However, these functions are used by /dev/kmem; fixed here.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

23 Oct, 2008

1 commit


20 Oct, 2008

1 commit

  • Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
    provide a fast, scalable percpu frontend for small vmaps (requires a
    slightly different API, though).

    The biggest problem with vmap is actually vunmap. Presently this requires
    a global kernel TLB flush, which on most architectures is a broadcast IPI
    to all CPUs to flush the cache. This is all done under a global lock. As
    the number of CPUs increases, so will the number of vunmaps a scaled
    workload will want to perform, and so will the cost of a global TLB flush.
    This gives terrible quadratic scalability characteristics.

    Another problem is that the entire vmap subsystem works under a single
    lock. It is a rwlock, but it is actually taken for write in all the fast
    paths, and the read locking would likely never be run concurrently anyway,
    so it's just pointless.

    This is a rewrite of vmap subsystem to solve those problems. The existing
    vmalloc API is implemented on top of the rewritten subsystem.

    The TLB flushing problem is solved by using lazy TLB unmapping. vmap
    addresses do not have to be flushed immediately when they are vunmapped,
    because the kernel will not reuse them again (would be a use-after-free)
    until they are reallocated. So the addresses aren't allocated again until
    a subsequent TLB flush. A single TLB flush then can flush multiple
    vunmaps from each CPU.

    XEN and PAT and such do not like deferred TLB flushing because they can't
    always handle multiple aliasing virtual addresses to a physical address.
    They now call vm_unmap_aliases() in order to flush any deferred mappings.
    That call is very expensive (well, actually not a lot more expensive than
    a single vunmap under the old scheme), however it should be OK if not
    called too often.

    The virtual memory extent information is stored in an rbtree rather than a
    linked list to improve the algorithmic scalability.

    There is a per-CPU allocator for small vmaps, which amortizes or avoids
    global locking.

    To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
    must be used in place of vmap and vunmap. Vmalloc does not use these
    interfaces at the moment, so it will not be quite so scalable (although it
    will use lazy TLB flushing).
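    A rough sketch of the per-CPU interface described above (signatures are
    approximate for this era of the API):

        /* map 'count' pages into a contiguous kernel virtual range */
        void *vm_map_ram(struct page **pages, unsigned int count,
                         int node, pgprot_t prot);

        /* unmap it again; the TLB flush may be deferred (lazy) */
        void vm_unmap_ram(const void *mem, unsigned int count);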

    As a quick test of performance, I ran a test that loops in the kernel,
    linearly mapping, touching, then unmapping 4 pages. Different numbers
    of tests were run in parallel on a 4 core, 2 socket Opteron. Results
    are in nanoseconds per map+touch+unmap.

    threads    vanilla    vmap rewrite
    1          14700      2900
    2          33600      3000
    4          49500      2800
    8          70631      2900

    So with 8 cores, the rewritten version is already 25x faster.

    In a slightly more realistic test (although with an older and less
    scalable version of the patch), I ripped the not-very-good vunmap batching
    code out of XFS, and implemented the large buffer mapping with vm_map_ram
    and vm_unmap_ram... along with a couple of other tricks, I was able to
    speed up a large directory workload by 20x on a 64 CPU system. I believe
    vmap/vunmap is actually sped up a lot more than 20x on such a system, but
    I'm running into other locks now. vmap is pretty well blown off the
    profiles.

    Before:
    1352059 total 0.1401
    798784 _write_lock 8320.6667
    Cc: Jeremy Fitzhardinge
    Cc: Krzysztof Helt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

17 Aug, 2008

1 commit

  • Try to comment away a little of the confusion between mm's vm_area_struct
    vm_flags and vmalloc's vm_struct flags: based on an idea by Ulrich Drepper.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

28 Apr, 2008

2 commits

  • Add caller information so that /proc/vmallocinfo shows where the allocation
    request for a slice of vmalloc memory originated.

    Results in output like this:

    0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
    0xffffc20000801000-0xffffc20000806000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
    0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages
    0xffffc20000c07000-0xffffc20000c0a000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
    0xffffc20000c0a000-0xffffc20000c0c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c0c000-0xffffc20000c0f000 12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap
    0xffffc20000c10000-0xffffc20000c15000 20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap
    0xffffc20000c16000-0xffffc20000c18000 8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap
    0xffffc20000c18000-0xffffc20000c1a000 8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap
    0xffffc20000c1a000-0xffffc20000c1c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c1c000-0xffffc20000c1e000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c1e000-0xffffc20000c20000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c20000-0xffffc20000c22000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c22000-0xffffc20000c24000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
    0xffffc20000c24000-0xffffc20000c26000 8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap
    0xffffc20000c26000-0xffffc20000c28000 8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap
    0xffffc20000c28000-0xffffc20000c2d000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
    0xffffc20000c2d000-0xffffc20000c31000 16384 tcp_init+0xd5/0x31c pages=3 vmalloc
    0xffffc20000c31000-0xffffc20000c34000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
    0xffffc20000c34000-0xffffc20000c36000 8192 init_vdso_vars+0xde/0x1f1
    0xffffc20000c36000-0xffffc20000c38000 8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap
    0xffffc20000c38000-0xffffc20000c3a000 8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap
    0xffffc20000c3a000-0xffffc20000c3e000 16384 sys_swapon+0x509/0xa15 pages=3 vmalloc
    0xffffc20000c40000-0xffffc20000c61000 135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap
    0xffffc20000c61000-0xffffc20000c6a000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c6a000-0xffffc20000c73000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c73000-0xffffc20000c7c000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20000c7c000-0xffffc20000c7f000 12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc
    0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap
    0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc
    0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
    0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages
    0xffffc20002204000-0xffffc2000220d000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc2000220d000-0xffffc20002216000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002216000-0xffffc2000221f000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc2000221f000-0xffffc20002228000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002228000-0xffffc20002231000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
    0xffffc20002231000-0xffffc20002234000 12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc
    0xffffc20002240000-0xffffc20002261000 135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap
    0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages
    0xffffffffa0000000-0xffffffffa0022000 139264 module_alloc+0x4f/0x55 pages=33 vmalloc
    0xffffffffa0022000-0xffffffffa0029000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc
    0xffffffffa002b000-0xffffffffa0034000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
    0xffffffffa0034000-0xffffffffa003d000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
    0xffffffffa003d000-0xffffffffa0049000 49152 module_alloc+0x4f/0x55 pages=11 vmalloc
    0xffffffffa0049000-0xffffffffa0050000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement a new proc file that allows the display of the currently allocated
    vmalloc memory.

    It lets one see the users of vmalloc, which is important if vmalloc
    space is scarce (i386, for example).

    And it's going to be important for the compound page fallback to
    vmalloc. Many of the current users can be switched to use compound
    pages with fallback. This means that the number of users of vmalloc is
    reduced and page tables are no longer necessary to access the memory.
    /proc/vmallocinfo allows one to review how that reduction occurs.

    If memory becomes fragmented and larger order allocations are no longer
    possible, then /proc/vmallocinfo lets one see which compound page
    allocations fell back to virtual compound pages. That is important for
    new users of virtual compound pages, such as order-1 stack allocations
    that may fall back to virtual compound pages in the future.

    /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible
    information leakage.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: CONFIG_MMU=n build fix]
    Signed-off-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Feb, 2008

1 commit


22 Jul, 2007

1 commit

  • get_vm_area always returns an area with an adjacent guard page. That guard
    page is included in vm_struct.size. iounmap uses vm_struct.size to
    determine how much address space needs to have change_page_attr applied to
    it, which will BUG if applied to the guard page.

    This patch adds a helper function - get_vm_area_size() in linux/vmalloc.h -
    to return the actual size of a vm area, and uses it to make iounmap do the
    right thing. There are probably other places which should be using
    get_vm_area_size().
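    A minimal sketch of the helper, assuming the guard page is the trailing
    page counted in vm_struct.size:

        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                /* vm_struct.size includes the adjacent guard page */
                return area->size - PAGE_SIZE;
        }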

    Thanks to Dave Young for debugging the
    problem.

    [ Andi, it wasn't clear to me whether x86_64 needs the same fix. ]

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Dave Young
    Cc: Chuck Ebbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge