16 Apr, 2015

3 commits

  • In the original implementation of vm_map_ram by Nick Piggin there were
    two bitmaps: alloc_map and dirty_map. Neither was used as intended, i.e.
    for finding a suitable free hole for the next allocation in a block:
    vm_map_ram allocates space sequentially in a block and, on free, marks
    pages as dirty, so freed space can't be reused.

    It would be interesting to know the real purpose of those bitmaps; maybe
    the implementation was simply incomplete.

    Some time ago Zhang Yanfei removed alloc_map with these two commits:

    mm/vmalloc.c: remove dead code in vb_alloc
    3fcd76e8028e0be37b02a2002b4f56755daeda06
    mm/vmalloc.c: remove alloc_map from vmap_block
    b8e748b6c32999f221ea4786557b8e7e6c4e4e7a

    In this patch I replace dirty_map with two range variables: dirty min and
    max. These variables store the minimum and maximum positions of dirty
    space in a block, since we only need to know the dirty range, not the
    exact positions of dirty pages.

    Why was this done? For several reasons. First, at first glance it seems
    that the vm_map_ram allocator cares about fragmentation and therefore uses
    bitmaps to find free holes, but that is not true; to avoid complexity it
    is better to use something simple, like min/max range values. Second, the
    code becomes simpler: no iteration over a bitmap, just value comparisons
    in the min and max macros. Third, the bitmap occupies up to 1024 bits
    (4MB is the maximum size of a block); here the whole bitmap is replaced
    with two longs.

    Finally vm_unmap_aliases should be slightly faster and the whole
    vmap_block structure occupies less memory.

    Signed-off-by: Roman Pen
    Cc: Zhang Yanfei
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • The previous implementation allocates a new vmap block and then repeats
    the search for a free block from the very beginning, iterating over the
    CPU free list.

    Why is this worth changing?

    1. Allocation can happen on one CPU, but the search can be done on another
    CPU. In the worst case we preallocate as many vmap blocks as there are
    CPUs on the system.

    2. The previous patch adds a newly allocated block to the tail of the free
    list, to avoid early exhaustion of virtual space and to give blocks which
    were allocated long ago a chance to be occupied. Thus, to find the newly
    allocated block, the whole search sequence has to be repeated, which is
    not efficient.

    In this patch the newly allocated block is occupied right away and the
    virtual address is returned to the caller, so there is no need to repeat
    the search sequence: the allocation job is done (see the sketch below).
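
    A rough sketch of the new flow (hypothetical names and a simplified block
    structure; the real code lives in vb_alloc()/new_vmap_block() and also
    handles locking and the per-CPU free list):

        #include <linux/mm.h>             /* PAGE_SHIFT */

        /* hypothetical, simplified stand-in for struct vmap_block */
        struct vb_sketch {
                unsigned long va_start;   /* base address of the block's virtual range */
                unsigned long next_off;   /* offset, in pages, of the next free slot */
                unsigned long free;       /* free page slots remaining */
        };

        /*
         * Carve the request out of a freshly allocated block right away, so the
         * caller never has to re-walk the free list to find it.  Only after this
         * is the block queued on the tail of the per-CPU free list.
         */
        static void *vb_sketch_alloc(struct vb_sketch *vb, unsigned int order)
        {
                unsigned long npages = 1UL << order;
                void *vaddr;

                vaddr = (void *)(vb->va_start + (vb->next_off << PAGE_SHIFT));
                vb->next_off += npages;
                vb->free -= npages;
                return vaddr;
        }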

    Signed-off-by: Roman Pen
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • Recently I came across high fragmentation in the vm_map_ram allocator: a
    vmap_block has free space, but new blocks still keep appearing. Further
    investigation showed that certain mapping/unmapping sequences can exhaust
    vmalloc space. On small 32-bit systems that's not a big problem, because
    purging will be triggered soon by the first allocation failure
    (alloc_vmap_area), but on 64-bit machines, e.g. x86_64 with 45 bits of
    vmalloc space, it can be a disaster.

    1) I came up with a simple allocation sequence, which exhausts virtual
    space very quickly:

    while (iters) {

            /* Map/unmap big chunk */
            vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, 16);

            /*
             * Map/unmap small chunks.
             *
             * -1 for hole, which should be left at the end of each block
             * to keep it partially used, with some free space available
             */
            for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                    vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                    vm_unmap_ram(vaddr, 8);
            }
    }

    The idea behind it is simple:

    1. We have to map a big chunk, e.g. 16 pages.

    2. Then we occupy the remaining space with smaller chunks, i.e. 8 pages.
    At the end a small hole should remain, to keep the block in the free list
    but not let a big chunk occupy the remaining space.

    3. Go to 1: the allocation request of 16 pages can't be completed (only 8
    slots are left free in the block after step #2), so a new block will be
    allocated, and all further requests will land in the newly allocated
    block.

    To have some measurement numbers for all further tests I set up ftrace and
    enabled profiling of 4 basic calls:

    echo vm_map_ram > /sys/kernel/debug/tracing/set_ftrace_filter;
    echo alloc_vmap_area >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo vm_unmap_ram >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo free_vmap_block >> /sys/kernel/debug/tracing/set_ftrace_filter;

    So for this scenario I got these results:

    BEFORE (all new blocks are put to the head of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit      Time         Avg       s^2
    --------          ---      ----         ---       ---
    vm_map_ram        126000   30683.30 us  0.243 us  30819.36 us
    vm_unmap_ram      126000   22003.24 us  0.174 us  340.886 us
    alloc_vmap_area   1000     4132.065 us  4.132 us  0.903 us

    AFTER (all new blocks are put to the tail of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit      Time         Avg       s^2
    --------          ---      ----         ---       ---
    vm_map_ram        126000   28713.13 us  0.227 us  24944.70 us
    vm_unmap_ram      126000   20403.96 us  0.161 us  1429.872 us
    alloc_vmap_area   993      3916.795 us  3.944 us  29.370 us
    free_vmap_block   992      654.157 us   0.659 us  1.273 us

    SUMMARY:

    The most interesting numbers in those tables are the counts of block
    allocations and deallocations: the alloc_vmap_area and free_vmap_block
    calls show that before the change blocks were never freed, so virtual
    space and physical memory (vmap_block structure allocations, etc.) kept
    being consumed.

    The average time spent in vm_map_ram/vm_unmap_ram became slightly better.
    That can be explained by the free list keeping a reasonable number of
    blocks which we have to iterate to find a suitable free block.

    2) Another scenario is a random allocation:

    while (iters) {

            /* Randomly take number from a range [1..32/64] */
            nr = rand(1, VMAP_MAX_ALLOC);
            vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, nr);
    }

    I chose a Mersenne Twister PRNG with a persistent random state to
    guarantee that both runs see the same random sequence. For each vm_map_ram
    call a random number from [1..32/64] was taken to represent the number of
    pages to map.

    I did 10'000 vm_map_ram calls and got these two tables:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit     Time         Avg       s^2
    --------          ---     ----         ---       ---
    vm_map_ram        10000   10170.01 us  1.017 us  993.609 us
    vm_unmap_ram      10000   5321.823 us  0.532 us  59.789 us
    alloc_vmap_area   420     2150.239 us  5.119 us  3.307 us
    free_vmap_block   37      159.587 us   4.313 us  134.344 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit     Time         Avg       s^2
    --------          ---     ----         ---       ---
    vm_map_ram        10000   7745.637 us  0.774 us  395.229 us
    vm_unmap_ram      10000   5460.573 us  0.546 us  67.187 us
    alloc_vmap_area   414     2201.650 us  5.317 us  5.591 us
    free_vmap_block   412     574.421 us   1.394 us  15.138 us

    SUMMARY:

    The 'BEFORE' table shows that 420 blocks were allocated and only 37 were
    freed. The remaining 383 blocks are still in the free list, consuming
    virtual space and physical memory.

    The 'AFTER' table shows that 414 blocks were allocated and 412 were
    actually freed; 2 blocks remain in the free list.

    So fragmentation was dramatically reduced. Why? Because when we put a
    newly allocated block at the head, all further requests occupy the new
    block regardless of the space remaining in other blocks. In this scenario
    requests arrive with random sizes; eventually the remaining free space
    becomes smaller than the requested size, the free list is iterated, and it
    is possible that nothing suitable is found there, so a new block is
    created. Thus, in the random scenario, exhaustion happens at the maximum
    possible allocation size: 32 pages on a 32-bit system and 64 pages on a
    64-bit system.

    Also, the average cost of vm_map_ram was reduced from 1.017 us to
    0.774 us. Again this can be explained by iterating through a smaller list
    of free blocks.

    3) The next simple scenario is sequential allocation, where the allocation
    order is increased for each block. This scenario forces the allocator to
    keep the maximum number of partially free blocks in the free list:

    while (iters) {

            /* Populate free list with blocks with remaining space */
            for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                    nr = VMAP_BBMAP_BITS / (1 << order);

                    /* Leave a hole */
                    nr -= 1;

                    for (i = 0; i < nr; i++) {
                            vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                            vm_unmap_ram(vaddr, (1 << order));
                    }
            }

            /* Completely occupy blocks from a free list */
            for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                    vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                    vm_unmap_ram(vaddr, (1 << order));
            }
    }

    Results which I got:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit       Time         Avg       s^2
    --------          ---       ----         ---       ---
    vm_map_ram        2032000   399545.2 us  0.196 us  467123.7 us
    vm_unmap_ram      2032000   363225.7 us  0.178 us  111405.9 us
    alloc_vmap_area   7001      30627.76 us  4.374 us  495.755 us
    free_vmap_block   6993      7011.685 us  1.002 us  159.090 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit       Time         Avg       s^2
    --------          ---       ----         ---       ---
    vm_map_ram        2032000   394259.7 us  0.194 us  589395.9 us
    vm_unmap_ram      2032000   292500.7 us  0.143 us  94181.08 us
    alloc_vmap_area   7000      31103.11 us  4.443 us  703.225 us
    free_vmap_block   7000      6750.844 us  0.964 us  119.112 us

    SUMMARY:

    No surprises here, almost all numbers are the same.

    While fixing this fragmentation problem I also made some improvements to
    the allocation logic for a new vmap block: occupy the block immediately
    and get rid of the extra search in the free list.

    I also replaced the dirty bitmap with min/max dirty range values to make
    the logic simpler and slightly faster, since comparing two longs costs
    less than looping through a bitmap.

    This patchset raises several questions:

    Q: I think the problem you comment on is already known; that is why I
    wrote a comment about it saying "it could consume lots of address space
    through fragmentation". Could you tell me about your situation and the
    reason why it should be avoided?
    Gioh Kim

    A: Indeed, there was a commit 364376383 which adds an explicit comment
    about fragmentation. But the fragmentation described in that comment is
    caused by mixing long-lived and short-lived objects, when a whole block is
    pinned in memory because some page slots are still in use. Here I am
    talking about blocks which are free, used by nobody, yet kept alive
    forever by the allocator while it continuously allocates new blocks.

    Q: I think that if you put a newly allocated block at the tail of the free
    list, the example below would result in enormous performance degradation.

    new block: 1MB (256 pages)

        while (iters--) {
                vm_map_ram(3 or something else not divisible into 256) * 85
                vm_unmap_ram(3) * 85
        }

    On every iteration a newly allocated block is needed, and since it is put
    at the tail of the free list, finding it consumes a large amount of time.
    Joonsoo Kim

    A: The second patch in the current patchset gets rid of the extra search
    in the free list, so the new block will be occupied immediately.

    Also, the scenario above is impossible, because vm_map_ram allocates
    virtual ranges in orders, i.e. powers of two. Passing 3 to vm_map_ram
    allocates 4 slots in a block, and 256 slots (the capacity of a block) is
    of course divisible by 4, so the block will be completely occupied.

    But there is a worst case we can construct: each free block has a hole
    equal to one order size.

    The maximum allocation size is 64 pages on a 64-bit system (if you try to
    map more, the original alloc_vmap_area will be called).

    So the maximum order is 6. That means that in the worst case, before the
    allocator decides to allocate a new block, it has to iterate over 7 blocks:

    HEAD
    1st block - has 1 page slot free (order 0)
    2nd block - has 2 page slots free (order 1)
    3rd block - has 4 page slots free (order 2)
    4th block - has 8 page slots free (order 3)
    5th block - has 16 page slots free (order 4)
    6th block - has 32 page slots free (order 5)
    7th block - has 64 page slots free (order 6)
    TAIL

    So the worst scenario on a 64-bit system is that each CPU queue can have 7
    blocks in its free list.

    This can happen if and only if you allocate blocks with increasing order
    (as I did in the function described in the comment of the first patch).
    This is a weird and rare case, but it is still possible; afterwards you
    will have 7 blocks in the list.

    All further requests will either be placed in a newly allocated block or
    satisfied by free slots found in the free list. That does not look
    dramatically awful.

    This patch (of 3):

    If a suitable block can't be found, a new block is allocated and put at
    the head of the free list, so on the next iteration this new block will be
    found first.

    That's bad, because old blocks in the free list never get a chance to be
    fully used, so fragmentation grows.

    Let's consider this simple example:

    #1 We have one block in a free list which is partially used, and where only
    one page is free:

    HEAD |xxxxxxxxx-| TAIL
                   ^
                   free space for 1 page, order 0

    #2 A new allocation request of order 1 (2 pages) comes; a new block is
    allocated, since we do not have enough free space to complete this
    request. The new block is put at the head of the free list:

    HEAD |----------|xxxxxxxxx-| TAIL

    #3 Two pages were occupied in the newly found block:

    HEAD |xx--------|xxxxxxxxx-| TAIL
          ^^
          two pages mapped here

    #4 A new allocation request of order 0 (1 page) comes. The block which was
    created in step #2 is located at the beginning of the free list, so it
    will be found first:

    HEAD |xxX-------|xxxxxxxxx-| TAIL
            ^                 ^
            page mapped here  but better to use this hole

    It is obvious that it is better to complete the request of step #4 using
    the old block, where free space is left; otherwise fragmentation will
    increase greatly.

    But fragmentation is not the only problem. The worst thing is that I can
    easily create a scenario in which the whole vmalloc space is exhausted by
    blocks which are not used, but are already dirty and have several free
    pages.

    Let's consider this function, whose execution should be pinned to one CPU:

    static void exhaust_virtual_space(struct page *pages[16], int iters)
    {
            /*
             * Firstly we have to map a big chunk, e.g. 16 pages.
             * Then we have to occupy the remaining space with smaller
             * chunks, i.e. 8 pages.  At the end small hole should remain.
             * So at the end of our allocation sequence block looks like
             * this:
             *               XX  big chunk
             * |XXxxxxxxx-|   x  small chunk
             *                -  hole, which is enough for a small chunk,
             *                   but is not enough for a big chunk
             */
            while (iters--) {
                    int i;
                    void *vaddr;

                    /* Map/unmap big chunk */
                    vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                    vm_unmap_ram(vaddr, 16);

                    /*
                     * Map/unmap small chunks.
                     *
                     * -1 for hole, which should be left at the end of each
                     * block to keep it partially used, with some free space
                     * available
                     */
                    for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                            vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                            vm_unmap_ram(vaddr, 8);
                    }
            }
    }

    On every iteration a new block (1MB of vm area in my case) will be
    allocated and then occupied, without any attempt to satisfy the small
    allocation requests from previously allocated blocks in the free list.

    In the case of random allocation (size randomly taken from the range
    [1..64] on 64-bit or [1..32] on 32-bit) the situation is the same: new
    blocks keep appearing whenever the maximum possible allocation size (32 or
    64) is passed to the allocator, because none of the remaining blocks in
    the free list has enough free space to complete the request.

    In summary, if new blocks are put at the head of the free list, virtual
    space will eventually be exhausted.

    In the current patch I simply put the newly allocated block at the tail of
    the free list, thus reducing fragmentation and giving older blocks, with
    whatever holes they have left, a chance to satisfy allocation requests.

    Signed-off-by: Roman Pen
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     

15 Apr, 2015

2 commits

  • Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA
    mappings when they are set. pud_clear_huge() and pmd_clear_huge() return
    zero when no operation is performed, i.e. a huge page mapping was not used.

    These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined
    on the architecture.
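
    A minimal sketch of how the PUD-level teardown uses this (a simplified
    illustration of the idea, not a verbatim copy of the mm/vmalloc.c code):

        static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
        {
                pud_t *pud;
                unsigned long next;

                pud = pud_offset(pgd, addr);
                do {
                        next = pud_addr_end(addr, end);
                        /* nonzero: a huge mapping was present and has been cleared */
                        if (pud_clear_huge(pud))
                                continue;
                        if (pud_none_or_clear_bad(pud))
                                continue;
                        /* otherwise descend and tear down the PMD level as before */
                        vunmap_pmd_range(pud, addr, next);
                } while (pud++, addr = next, addr != end);
        }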

    [akpm@linux-foundation.org: use consistent code layout]
    Signed-off-by: Toshi Kani
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • ioremap() and its related interfaces are used to create I/O mappings to
    memory-mapped I/O devices. The mapping sizes of traditional I/O devices
    are relatively small. Non-volatile memory (NVM), however, has many GB and
    is going to have TB soon. It is not very efficient to create large I/O
    mappings with 4KB pages.

    This patchset extends the ioremap() interfaces to transparently create I/O
    mappings with huge pages whenever possible. ioremap() continues to use 4KB
    mappings when a huge page does not fit into a requested range. There is no
    change necessary to drivers using ioremap(). A requested physical address
    must be aligned to a huge page size (1GB or 2MB on x86) to use huge page
    mappings, though. Kernel huge I/O mappings will improve performance for
    NVM and other devices with large memory, and reduce the time to create
    their mappings as well.

    On x86, MTRRs can override PAT memory types with a 4KB granularity. When
    using a huge page, MTRRs can override the memory type of the huge page,
    which may lead to a performance penalty. The processor can also behave in
    an undefined manner if a huge page is mapped to a memory range that MTRRs
    have mapped with multiple different memory types. Therefore, the mapping
    code falls back to smaller page sizes, toward 4KB, when a mapping range is
    covered by non-WB type MTRRs. The WB type of MTRRs has no effect on the
    PAT memory types.

    The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch
    supports huge KVA mappings for ioremap(). Users may specify a new kernel
    option "nohugeiomap" to disable the huge I/O mapping capability of
    ioremap() when necessary.

    Patches 1-4 change common files to support huge I/O mappings. There is no
    functional change unless HAVE_ARCH_HUGE_VMAP is defined on the
    architecture of the system.

    Patches 5-6 implement the HAVE_ARCH_HUGE_VMAP functions on x86, and set
    HAVE_ARCH_HUGE_VMAP on x86.

    This patch (of 6):

    __get_vm_area_node() takes an unsigned long size, which is a 64-bit value
    on a 64-bit kernel. However, fls(size) simply ignores the upper 32 bits.
    Change it to use fls_long() to handle the size properly.
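
    A small illustration of the difference (a sketch; fls() and fls_long() are
    declared in <linux/bitops.h>, and the size value here is just an example):

        #include <linux/bitops.h>

        static unsigned int size_fls_example(unsigned long size)
        {
                /*
                 * For size = 8GB = 0x200000000 on a 64-bit kernel:
                 *   fls(size)      truncates the argument to 32 bits,
                 *                  sees 0 and returns 0;
                 *   fls_long(size) works on the full unsigned long
                 *                  and returns 34.
                 */
                return fls_long(size);
        }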

    Signed-off-by: Toshi Kani
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

13 Mar, 2015

1 commit

  • The current approach to handling shadow memory for modules is broken.

    Shadow memory can be freed only after the memory it shadows is no longer
    used. vfree() called from interrupt context uses the memory it is freeing
    to store a 'struct llist_node' in it:

    void vfree(const void *addr)
    {
            ...
            if (unlikely(in_interrupt())) {
                    struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
                    if (llist_add((struct llist_node *)addr, &p->list))
                            schedule_work(&p->wq);
    This list node is later used in free_work(), which actually frees the
    memory. Currently module_memfree() called in interrupt context will free
    the shadow before freeing the module's memory, which can provoke a kernel
    crash.

    So shadow memory should be freed after the module's memory. However, such
    a deallocation order could race with kasan_module_alloc() in
    module_alloc().

    Free the shadow right before releasing the vm area. At this point the
    vfree()'d memory is not used anymore, yet it is not available for other
    allocations either. A new VM_KASAN flag is used to indicate that the vm
    area has dynamically allocated shadow memory, so kasan frees the shadow
    only if it was previously allocated.
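
    The KASan side of that check looks roughly like this (a sketch of the
    helper implied above, assuming the usual kasan_mem_to_shadow() address
    translation):

        void kasan_free_shadow(const struct vm_struct *vm)
        {
                /* only free shadow that was dynamically allocated for this area */
                if (vm->flags & VM_KASAN)
                        vfree(kasan_mem_to_shadow(vm->addr));
        }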

    Signed-off-by: Andrey Ryabinin
    Acked-by: Rusty Russell
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

14 Feb, 2015

2 commits

  • To instrument global variables, KASan needs to shadow the memory backing
    modules. So on module load we need to allocate memory for the shadow and
    map it at the address in the shadow region that corresponds to the address
    returned by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it puts
    a guard hole after the allocated area. A guard hole in shadow memory would
    be a problem, because at some future point we might need shadow memory at
    the address occupied by the guard hole; we could then fail to allocate
    shadow for module_alloc().

    Now we have the VM_NO_GUARD flag for disabling the guard page, so we need
    a way to pass it into __vmalloc_node_range(). Add a new parameter,
    'vm_flags', to __vmalloc_node_range().
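
    The extended prototype then looks roughly like this (a sketch; the exact
    parameter order is as I recall it, so treat it as illustrative), and
    kasan_module_alloc() can pass VM_NO_GUARD through the new argument:

        extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
                                unsigned long start, unsigned long end,
                                gfp_t gfp_mask, pgprot_t prot,
                                unsigned long vm_flags, int node,
                                const void *caller);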

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • To instrument global variables, KASan needs to shadow the memory backing
    modules. So on module load we need to allocate memory for the shadow and
    map it at the address in the shadow region that corresponds to the address
    returned by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it puts
    a guard hole after the allocated area. A guard hole in shadow memory would
    be a problem, because at some future point we might need shadow memory at
    the address occupied by the guard hole; we could then fail to allocate
    shadow for module_alloc().

    Add a new vm_struct flag, 'VM_NO_GUARD', indicating that the vm area
    doesn't have a guard hole.
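
    A sketch of how such a flag is honored when computing the usable size of
    an area (close in spirit to the include/linux/vmalloc.h helper, shown here
    as an illustration):

        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                if (!(area->flags & VM_NO_GUARD))
                        /* area->size still accounts for the trailing guard page */
                        return area->size - PAGE_SIZE;
                return area->size;
        }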

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

10 Oct, 2014

1 commit

  • Using seq_open_private() removes boilerplate code from vmalloc_open().

    The resultant code is shorter and easier to follow.

    However, please note that seq_open_private() calls kzalloc() rather than
    kmalloc(), which may affect timing due to the memory initialisation
    overhead.
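
    The resulting open routine looks roughly like this (a sketch of the
    pattern; the size of the private per-node counter buffer is an
    assumption):

        static int vmalloc_open(struct inode *inode, struct file *file)
        {
                if (IS_ENABLED(CONFIG_NUMA))
                        /* private buffer allocated and zeroed by seq_open_private() */
                        return seq_open_private(file, &vmalloc_op,
                                                nr_node_ids * sizeof(unsigned int));
                else
                        return seq_open(file, &vmalloc_op);
        }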

    Signed-off-by: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     

07 Aug, 2014

4 commits

  • Currently map_vm_area() takes (struct page *** pages) as its third
    argument, and after mapping it moves (*pages) to point to (*pages +
    nr_mapped_pages).

    This kind of increment is useless to its callers these days. The callers
    don't care about the increment and actually try to avoid it by passing
    another copy to map_vm_area().

    The caller can always guarantee that all the pages can be mapped into the
    vm_area specified in the first argument, and the caller only cares whether
    map_vm_area() fails or not.

    This patch cleans up the pointer movement in map_vm_area() and updates
    its callers accordingly.
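
    The cleaned-up interface then looks roughly like this (a sketch of the
    before/after prototypes):

        /* before: the pages cursor is advanced by the number of mapped pages */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages);

        /* after: a plain page array, no pointer movement */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages);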

    Signed-off-by: WANG Chao
    Cc: Zhang Yanfei
    Acked-by: Greg Kroah-Hartman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Chao
     
  • tmp_mask in the __vmalloc_area_node() iteration never changes, so it can
    be moved to function scope and marked const. This causes the movl and orl
    to be done only once per call rather than area->nr_pages times.

    nested_gfp can also be marked const.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is not uncommon on busy servers to get stuck for hundreds of ms in
    vmalloc() calls (like file descriptor expansions).

    Add a cond_resched() to __vmalloc_area_node() to be gentle to
    other tasks.

    [akpm@linux-foundation.org: only do it for __GFP_WAIT, per David]
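
    A sketch of where this lands in the per-page allocation loop (simplified;
    the __GFP_WAIT gating follows the note above, and the fail label is just
    for this sketch):

        for (i = 0; i < area->nr_pages; i++) {
                struct page *page;

                /* yield to other tasks when the allocation is allowed to sleep */
                if (gfp_mask & __GFP_WAIT)
                        cond_resched();

                page = alloc_page(alloc_mask);
                if (!page)
                        goto fail;
                area->pages[i] = page;
        }
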
    Signed-off-by: Eric Dumazet
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Richard Yao reported a month ago that his system had trouble with
    vmap_area_lock contention during performance analysis via /proc/meminfo.
    Andrew asked why his analysis checks /proc/meminfo so stressfully, but he
    didn't answer.

    https://lkml.org/lkml/2014/4/10/416

    Although I'm not sure whether this is the right usage or not, there is a
    solution that reduces vmap_area_lock contention with no side effects: just
    use an RCU list iterator in get_vmalloc_info().

    RCU can be used in this function because the RCU protocol is already fully
    respected by writers, since Nick Piggin's commit db64fe02258f1 ("mm:
    rewrite vmap layer") back in linux-2.6.28.

    Specifically:
    insertions use list_add_rcu(),
    deletions use list_del_rcu() and kfree_rcu().

    Note that the rb tree is not used from an RCU reader (it would not be
    safe); only the vmap_area_list has full RCU protection.

    Note that __purge_vmap_area_lazy() already uses this RCU protection:

    rcu_read_lock();
    list_for_each_entry_rcu(va, &vmap_area_list, list) {
            if (va->flags & VM_LAZY_FREE) {
                    if (va->va_start < *start)
                            *start = va->va_start;
                    if (va->va_end > *end)
                            *end = va->va_end;
                    nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                    list_add_tail(&va->purge_list, &valist);
                    va->flags |= VM_LAZY_FREEING;
                    va->flags &= ~VM_LAZY_FREE;
            }
    }
    rcu_read_unlock();

    Peter:

    : While rcu list traversal over the vmap_area_list is safe, this may
    : arrive at different results than the spinlocked version. The rcu list
    : traversal version will not be a 'snapshot' of a single, valid instant
    : of the entire vmap_area_list, but rather a potential amalgam of
    : different list states.

    Joonsoo:

    : Yes, you are right, but I don't think that we should be strict here.
    : Meminfo is already not a 'snapshot' at specific time. While we try to get
    : certain stats, the other stats can change. And, although we may arrive at
    : different results than the spinlocked version, the difference would not be
    : large and would not make serious side-effect.

    [edumazet@google.com: add more commit description]
    Signed-off-by: Joonsoo Kim
    Reported-by: Richard Yao
    Acked-by: Eric Dumazet
    Cc: Peter Hurley
    Cc: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

3 commits

  • zsmalloc needs unmap_kernel_range exported in order to be built as a
    module. See https://lkml.org/lkml/2013/1/18/487

    I didn't send a patch to make unmap_kernel_range exportable at that time
    because zram was staging stuff, and I thought exporting VM functions for
    staging stuff made no sense.

    Now zsmalloc has been promoted. If we can't build zsmalloc as a module, we
    can't build zram as a module either. Additionally, its buddy map_vm_area
    is already exported, so let's export unmap_kernel_range to help its buddy.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Replace seq_printf where possible

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
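
    For example (using the vfree_deferred case quoted earlier in this log
    purely as an illustration of the pattern):

        /* before: __get_cpu_var() used only to compute a per-cpu address */
        struct vfree_deferred *p = &__get_cpu_var(vfree_deferred);

        /* after: the dedicated accessor for per-cpu pointers */
        struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);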

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 Apr, 2014

2 commits

  • vm_map_ram() has a fragmentation problem: it cannot purge a chunk (i.e. a
    4MB address space) if there is a pinned object in that address space, so
    it can easily consume all of the VMALLOC address space.

    We could fix the fragmentation problem by using vmap instead of
    vm_map_ram(), but vmap() is known to be slow compared to vm_map_ram().
    Minchan said vm_map_ram is 5 times faster than vmap in his tests. So I
    thought we should fix the fragmentation problem of vm_map_ram, because our
    proprietary GPU driver uses it heavily.

    On second thought, it's not easy, because solving the problem means
    reusing freed space, which could mean more IPIs and bitmap operations for
    searching holes. That would undermine the API's goal, which is very fast
    mapping. And the fragmentation problem wouldn't even show up on a 64-bit
    machine.

    Another option is for the user to separate long-lived and short-lived
    objects, and use vmap for the long-lived ones but vm_map_ram for the
    short-lived ones. If we inform users about this characteristic of
    vm_map_ram, they can choose according to the page lifetime.

    Let's add a notice to that effect for users.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Gioh Kim
    Reviewed-by: Zhang Yanfei
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gioh Kim
     
  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes with
    the right macro in the memory management (/mm) subsystem.
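
    For instance, a declaration like the one below is rewritten in terms of
    the compiler.h macro (the function name here is made up for illustration):

        #include <linux/compiler.h>

        /* before: raw gcc attribute syntax */
        void __attribute__((weak)) example_arch_hook(void);

        /* after: the portability macro provided via <linux/compiler.h> */
        void __weak example_arch_hook(void);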

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     

28 Jan, 2014

1 commit

  • Revert commit ece86e222db4, which was intended as a small performance
    improvement.

    Despite the claim that the patch doesn't introduce any functional changes,
    in fact it does.

    The "no page" path behaves differently now. Originally, vmalloc_to_page
    might return NULL under some conditions; with the new implementation it
    returns pfn_to_page(0), which is not the same as NULL.

    A simple test shows the difference.

    test.c

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/vmalloc.h>
    #include <linux/mm.h>

    int __init myi(void)
    {
            struct page *p;
            void *v;

            v = vmalloc(PAGE_SIZE);
            /* trigger the "no page" path in vmalloc_to_page */
            vfree(v);

            p = vmalloc_to_page(v);

            pr_err("expected val = NULL, returned val = %p", p);

            return -EBUSY;
    }

    void __exit mye(void)
    {
    }

    module_init(myi);
    module_exit(mye);

    Before interchange:
    expected val = NULL, returned val = (null)

    After interchange:
    expected val = NULL, returned val = c7ebe000

    Signed-off-by: Vladimir Murzin
    Cc: Jianyu Zhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    malc
     

22 Jan, 2014

1 commit

  • Currently we are implementing vmalloc_to_pfn() as a wrapper around
    vmalloc_to_page(), which is implemented as follows:

    1. walk the page tables to generate the corresponding pfn,
    2. convert the pfn to a struct page,
    3. return it.

    And vmalloc_to_pfn() re-wraps vmalloc_to_page() just to get the pfn back.

    This seems too circuitous, so this patch reverses the relationship:
    implement vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This
    makes vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient.

    No functional change.

    Signed-off-by: Jianyu Zhan
    Cc: Vladimir Murzin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     

13 Nov, 2013

6 commits

  • Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap
    blocks") had the side effect of making vmap_area.va_end member point to
    the next vmap_area.va_start. This was creating an artificial reference
    to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.

    This patch marks the vmap_area containing pointers explicitly and
    reduces the min ref_count to 2 as vm_struct still contains a reference
    to the vmalloc'ed object. The kmemleak add_scan_area() function has
    been improved to allow a SIZE_MAX argument covering the rest of the
    object (for simpler calling sites).

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Don't warn twice in __vmalloc_area_node() and __vmalloc_node_range() when
    the __vmalloc_area_node() allocation fails. This patch reverts commit
    46c001a2753f ("mm/vmalloc.c: emit the failure message before return").

    Signed-off-by: Wanpeng Li
    Reviewed-by: Zhang Yanfei
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e009d5b ("mm:
    avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to
    avoid accessing the pages field while the pages are still unallocated when
    show_numa_info() is called.

    This patch moves the check to just before show_numa_info, so that some
    messages can still be dumped via /proc/vmallocinfo. It reverts commit
    d157a55815ff ("mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show
    instead of show_numa_info").

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • There is a race window between vmap_area teardown and showing vmap_area
    information:

    A                                       B

    remove_vm_area
      spin_lock(&vmap_area_lock);
      va->vm = NULL;
      va->flags &= ~VM_VM_AREA;
      spin_unlock(&vmap_area_lock);
                                            spin_lock(&vmap_area_lock);
                                            if (va->flags & (VM_LAZY_FREE |
                                                             VM_LAZY_FREEING))
                                                    return 0;
                                            if (!(va->flags & VM_VM_AREA)) {
                                                    seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
                                                            (void *)va->va_start,
                                                            (void *)va->va_end,
                                                            va->va_end - va->va_start);
                                                    return 0;
                                            }
    free_unmap_vmap_area(va);
      flush_cache_vunmap
        free_unmap_vmap_area_noflush
          unmap_vmap_area
          free_vmap_area_noflush
            va->flags |= VM_LAZY_FREE

    The assumption that !VM_VM_AREA represents a vm_map_ram allocation was
    introduced by d4033afdf828 ("mm, vmalloc: iterate vmap_area_list,
    instead of vmlist, in vmallocinfo()").

    However, !VM_VM_AREA can also mean that the vmap_area is being torn down,
    in the race window shown above. This patch fixes it by not dumping any
    information for the !VM_VM_AREA case, and also removes the (VM_LAZY_FREE |
    VM_LAZY_FREEING) check, since those flags are not possible in the
    !VM_VM_AREA case.

    Suggested-by: Joonsoo Kim
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Zhang Yanfei
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The caller address has already been set in set_vmalloc_vm(), so there's no
    need to set it again in __vmalloc_area_node.

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Use more appropriate "if (node == NUMA_NO_NODE)" instead of "if (node < 0)"

    Signed-off-by: Jianguo Wu
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

10 Jul, 2013

9 commits

  • When searching for a vmap area in the vmalloc space, we use
    (addr + size - 1) to check whether the value is less than addr, i.e.
    whether it overflows. But we assign (addr + size) to vmap_area->va_end.

    So if we come across the case below:

    (addr + size - 1) : does not overflow
    (addr + size)     : overflows

    we will assign an overflowed value (e.g. 0) to vmap_area->va_end, and this
    will trigger a BUG in __insert_vmap_area, causing a system panic.

    So using (addr + size) to check for overflow is the correct behaviour, not
    (addr + size - 1).
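
    In alloc_vmap_area() the check then becomes, roughly (a before/after
    sketch consistent with the description above, not the exact hunk):

        /* before: misses the case where addr + size wraps around to exactly 0 */
        if (addr + size - 1 < addr)
                goto overflow;

        /* after: check the same quantity that is stored in va->va_end */
        if (addr + size < addr)
                goto overflow;

        va->va_start = addr;
        va->va_end = addr + size;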

    Signed-off-by: Zhang Yanfei
    Reported-by: Ghennadi Procopciuc
    Tested-by: Daniel Baluta
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • vfree() only needs schedule_work(&p->wq) if p->list was empty, otherwise
    vfree_deferred->wq is already pending or it is running and didn't do
    llist_del_all() yet.

    Signed-off-by: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We should check the VM_UNINITIALIZED flag in s_show(). If this flag is
    set, the vm_struct is not fully initialized, so it is pointless to try to
    show the information it contains.

    We checked this flag in show_numa_info(), but I think it's better to check
    it earlier.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • VM_UNLIST was used to indicate that the vm_struct is not listed in
    vmlist.

    But after commit 4341fa454796 ("mm, vmalloc: remove list management of
    vmlist after initializing vmalloc"), the meaning of this flag changed.
    It now means the vm_struct is not fully initialized. So renaming it to
    VM_UNINITIALIZED seems more reasonable.

    Also change clear_vm_unlist to clear_vm_uninitialized_flag.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Use goto to jump to the fail label to give a failure message before
    returning NULL. This makes the failure handling in this function
    consistent.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • As we have removed the dead code in vb_alloc(), there is no longer any
    user of alloc_map, so there is no reason to maintain it in vmap_block.

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • This function is nowhere used now, so remove it.

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Space in a vmap block that was once allocated is considered dirty and is
    not made available for allocation again before the whole block is
    recycled. The result is that free space within a vmap block is always
    contiguous.

    So if a vmap block has enough free space for an allocation, the allocation
    cannot fail. Thus, the fragmented-block purging was never invoked from
    vb_alloc(). Remove this dead code.

    [ Same patches also sent by:

    Chanho Min
    Johannes Weiner

    but git doesn't do "multiple authors" ]

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • There is an extra semi-colon so the function always returns.

    Signed-off-by: Dan Carpenter
    Acked-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter