16 Apr, 2015

3 commits

  • In the original implementation of vm_map_ram by Nick Piggin there were
    two bitmaps: alloc_map and dirty_map. Neither was used as intended, i.e.
    for finding a suitable free hole for the next allocation in a block:
    vm_map_ram allocates space sequentially in a block and, on free, marks
    pages as dirty, so freed space can't be reused.

    It would be interesting to know the real purpose of those bitmaps; maybe
    the implementation was simply incomplete.

    Some time ago Zhang Yanfei removed alloc_map with these two commits:

    mm/vmalloc.c: remove dead code in vb_alloc
    3fcd76e8028e0be37b02a2002b4f56755daeda06
    mm/vmalloc.c: remove alloc_map from vmap_block
    b8e748b6c32999f221ea4786557b8e7e6c4e4e7a

    In this patch I replace dirty_map with two range variables: dirty min and
    max. These variables store the minimum and maximum positions of dirty
    space in a block, since we only need to know the dirty range, not the
    exact positions of dirty pages.

    Why was this done? For several reasons. First, at first glance it seems
    that the vm_map_ram allocator cares about fragmentation and therefore uses
    bitmaps to find free holes, but that is not true; to avoid complexity it
    is better to use something simple, like min/max range values. Second, the
    code becomes simpler: no iteration over a bitmap, just value comparisons
    in the min and max macros. Third, the bitmap occupies up to 1024 bits
    (4MB is the maximum size of a block); here the whole bitmap is replaced
    with two longs.

    Finally vm_unmap_aliases should be slightly faster and the whole
    vmap_block structure occupies less memory.

    Signed-off-by: Roman Pen
    Cc: Zhang Yanfei
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • The previous implementation allocates a new vmap block and then repeats
    the search for a free block from the very beginning, iterating over the
    CPU free list.

    Why is this worth changing?

    1. Allocation can happen on one CPU, but the search can be done on another
    CPU. In the worst case we preallocate as many vmap blocks as there are
    CPUs on the system.

    2. The previous patch adds a newly allocated block to the tail of the free
    list, to avoid early exhaustion of virtual space and to give blocks which
    were allocated long ago a chance to be occupied. Thus, to find the newly
    allocated block, the whole search sequence has to be repeated, which is
    not efficient.

    In this patch the newly allocated block is occupied right away and the
    virtual address is returned to the caller, so there is no need to repeat
    the search sequence: the allocation job is done (see the sketch below).
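
    A rough sketch of the new flow (hypothetical names and a simplified block
    structure; the real code lives in vb_alloc()/new_vmap_block() and also
    handles locking and the per-CPU free list):

        #include <linux/mm.h>             /* PAGE_SHIFT */

        /* hypothetical, simplified stand-in for struct vmap_block */
        struct vb_sketch {
                unsigned long va_start;   /* base address of the block's virtual range */
                unsigned long next_off;   /* offset, in pages, of the next free slot */
                unsigned long free;       /* free page slots remaining */
        };

        /*
         * Carve the request out of a freshly allocated block right away, so the
         * caller never has to re-walk the free list to find it.  Only after this
         * is the block queued on the tail of the per-CPU free list.
         */
        static void *vb_sketch_alloc(struct vb_sketch *vb, unsigned int order)
        {
                unsigned long npages = 1UL << order;
                void *vaddr;

                vaddr = (void *)(vb->va_start + (vb->next_off << PAGE_SHIFT));
                vb->next_off += npages;
                vb->free -= npages;
                return vaddr;
        }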

    Signed-off-by: Roman Pen
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • Recently I came across high fragmentation in the vm_map_ram allocator: a
    vmap_block has free space, but new blocks still keep appearing. Further
    investigation showed that certain mapping/unmapping sequences can exhaust
    vmalloc space. On small 32-bit systems that's not a big problem, because
    purging will be triggered soon by the first allocation failure
    (alloc_vmap_area), but on 64-bit machines, e.g. x86_64 with 45 bits of
    vmalloc space, it can be a disaster.

    1) I came up with a simple allocation sequence, which exhausts virtual
    space very quickly:

    while (iters) {

            /* Map/unmap big chunk */
            vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, 16);

            /*
             * Map/unmap small chunks.
             *
             * -1 for hole, which should be left at the end of each block
             * to keep it partially used, with some free space available
             */
            for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                    vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                    vm_unmap_ram(vaddr, 8);
            }
    }

    The idea behind it is simple:

    1. We have to map a big chunk, e.g. 16 pages.

    2. Then we occupy the remaining space with smaller chunks, i.e. 8 pages.
    At the end a small hole should remain, to keep the block in the free list
    but not let a big chunk occupy the remaining space.

    3. Go to 1: the allocation request of 16 pages can't be completed (only 8
    slots are left free in the block after step #2), so a new block will be
    allocated, and all further requests will land in the newly allocated
    block.

    To have some measurement numbers for all further tests I set up ftrace and
    enabled profiling of 4 basic calls:

    echo vm_map_ram > /sys/kernel/debug/tracing/set_ftrace_filter;
    echo alloc_vmap_area >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo vm_unmap_ram >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo free_vmap_block >> /sys/kernel/debug/tracing/set_ftrace_filter;

    So for this scenario I got these results:

    BEFORE (all new blocks are put to the head of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit      Time         Avg       s^2
    --------          ---      ----         ---       ---
    vm_map_ram        126000   30683.30 us  0.243 us  30819.36 us
    vm_unmap_ram      126000   22003.24 us  0.174 us  340.886 us
    alloc_vmap_area   1000     4132.065 us  4.132 us  0.903 us

    AFTER (all new blocks are put to the tail of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit      Time         Avg       s^2
    --------          ---      ----         ---       ---
    vm_map_ram        126000   28713.13 us  0.227 us  24944.70 us
    vm_unmap_ram      126000   20403.96 us  0.161 us  1429.872 us
    alloc_vmap_area   993      3916.795 us  3.944 us  29.370 us
    free_vmap_block   992      654.157 us   0.659 us  1.273 us

    SUMMARY:

    The most interesting numbers in those tables are the counts of block
    allocations and deallocations: the alloc_vmap_area and free_vmap_block
    calls show that before the change blocks were never freed, so virtual
    space and physical memory (vmap_block structure allocations, etc.) kept
    being consumed.

    The average time spent in vm_map_ram/vm_unmap_ram became slightly better.
    That can be explained by the free list keeping a reasonable number of
    blocks which we have to iterate to find a suitable free block.

    2) Another scenario is a random allocation:

    while (iters) {

            /* Randomly take number from a range [1..32/64] */
            nr = rand(1, VMAP_MAX_ALLOC);
            vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, nr);
    }

    I chose a Mersenne Twister PRNG with a persistent random state to
    guarantee that both runs see the same random sequence. For each vm_map_ram
    call a random number from [1..32/64] was taken to represent the number of
    pages to map.

    I did 10'000 vm_map_ram calls and got these two tables:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit     Time         Avg       s^2
    --------          ---     ----         ---       ---
    vm_map_ram        10000   10170.01 us  1.017 us  993.609 us
    vm_unmap_ram      10000   5321.823 us  0.532 us  59.789 us
    alloc_vmap_area   420     2150.239 us  5.119 us  3.307 us
    free_vmap_block   37      159.587 us   4.313 us  134.344 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit     Time         Avg       s^2
    --------          ---     ----         ---       ---
    vm_map_ram        10000   7745.637 us  0.774 us  395.229 us
    vm_unmap_ram      10000   5460.573 us  0.546 us  67.187 us
    alloc_vmap_area   414     2201.650 us  5.317 us  5.591 us
    free_vmap_block   412     574.421 us   1.394 us  15.138 us

    SUMMARY:

    The 'BEFORE' table shows that 420 blocks were allocated and only 37 were
    freed. The remaining 383 blocks are still in the free list, consuming
    virtual space and physical memory.

    The 'AFTER' table shows that 414 blocks were allocated and 412 were
    actually freed; 2 blocks remain in the free list.

    So fragmentation was dramatically reduced. Why? Because when we put a
    newly allocated block at the head, all further requests occupy the new
    block regardless of the space remaining in other blocks. In this scenario
    requests arrive with random sizes; eventually the remaining free space
    becomes smaller than the requested size, the free list is iterated, and it
    is possible that nothing suitable is found there, so a new block is
    created. Thus, in the random scenario, exhaustion happens at the maximum
    possible allocation size: 32 pages on a 32-bit system and 64 pages on a
    64-bit system.

    Also, the average cost of vm_map_ram was reduced from 1.017 us to
    0.774 us. Again this can be explained by iterating through a smaller list
    of free blocks.

    3) The next simple scenario is sequential allocation, where the allocation
    order is increased for each block. This scenario forces the allocator to
    keep the maximum number of partially free blocks in the free list:

    while (iters) {

            /* Populate free list with blocks with remaining space */
            for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                    nr = VMAP_BBMAP_BITS / (1 << order);

                    /* Leave a hole */
                    nr -= 1;

                    for (i = 0; i < nr; i++) {
                            vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                            vm_unmap_ram(vaddr, (1 << order));
                    }
            }

            /* Completely occupy blocks from a free list */
            for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                    vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                    vm_unmap_ram(vaddr, (1 << order));
            }
    }

    Results which I got:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit       Time         Avg       s^2
    --------          ---       ----         ---       ---
    vm_map_ram        2032000   399545.2 us  0.196 us  467123.7 us
    vm_unmap_ram      2032000   363225.7 us  0.178 us  111405.9 us
    alloc_vmap_area   7001      30627.76 us  4.374 us  495.755 us
    free_vmap_block   6993      7011.685 us  1.002 us  159.090 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function          Hit       Time         Avg       s^2
    --------          ---       ----         ---       ---
    vm_map_ram        2032000   394259.7 us  0.194 us  589395.9 us
    vm_unmap_ram      2032000   292500.7 us  0.143 us  94181.08 us
    alloc_vmap_area   7000      31103.11 us  4.443 us  703.225 us
    free_vmap_block   7000      6750.844 us  0.964 us  119.112 us

    SUMMARY:

    No surprises here, almost all numbers are the same.

    While fixing this fragmentation problem I also made some improvements to
    the allocation logic for a new vmap block: occupy the block immediately
    and get rid of the extra search in the free list.

    I also replaced the dirty bitmap with min/max dirty range values to make
    the logic simpler and slightly faster, since comparing two longs costs
    less than looping through a bitmap.

    This patchset raises several questions:

    Q: I think the problem you comment on is already known; that is why I
    wrote a comment about it saying "it could consume lots of address space
    through fragmentation". Could you tell me about your situation and the
    reason why it should be avoided?
    Gioh Kim

    A: Indeed, there was a commit 364376383 which adds an explicit comment
    about fragmentation. But the fragmentation described in that comment is
    caused by mixing long-lived and short-lived objects, when a whole block is
    pinned in memory because some page slots are still in use. Here I am
    talking about blocks which are free, used by nobody, yet kept alive
    forever by the allocator while it continuously allocates new blocks.

    Q: I think that if you put a newly allocated block at the tail of the free
    list, the example below would result in enormous performance degradation.

    new block: 1MB (256 pages)

        while (iters--) {
                vm_map_ram(3 or something else not divisible into 256) * 85
                vm_unmap_ram(3) * 85
        }

    On every iteration a newly allocated block is needed, and since it is put
    at the tail of the free list, finding it consumes a large amount of time.
    Joonsoo Kim

    A: The second patch in the current patchset gets rid of the extra search
    in the free list, so the new block will be occupied immediately.

    Also, the scenario above is impossible, because vm_map_ram allocates
    virtual ranges in orders, i.e. powers of two. Passing 3 to vm_map_ram
    allocates 4 slots in a block, and 256 slots (the capacity of a block) is
    of course divisible by 4, so the block will be completely occupied.

    But there is a worst case we can construct: each free block has a hole
    equal to one order size.

    The maximum allocation size is 64 pages on a 64-bit system (if you try to
    map more, the original alloc_vmap_area will be called).

    So the maximum order is 6. That means that in the worst case, before the
    allocator decides to allocate a new block, it has to iterate over 7 blocks:

    HEAD
    1st block - has 1 page slot free (order 0)
    2nd block - has 2 page slots free (order 1)
    3rd block - has 4 page slots free (order 2)
    4th block - has 8 page slots free (order 3)
    5th block - has 16 page slots free (order 4)
    6th block - has 32 page slots free (order 5)
    7th block - has 64 page slots free (order 6)
    TAIL

    So the worst scenario on a 64-bit system is that each CPU queue can have 7
    blocks in its free list.

    This can happen if and only if you allocate blocks with increasing order
    (as I did in the function described in the comment of the first patch).
    This is a weird and rare case, but it is still possible; afterwards you
    will have 7 blocks in the list.

    All further requests will either be placed in a newly allocated block or
    satisfied by free slots found in the free list. That does not look
    dramatically awful.

    This patch (of 3):

    If a suitable block can't be found, a new block is allocated and put at
    the head of the free list, so on the next iteration this new block will be
    found first.

    That's bad, because old blocks in the free list never get a chance to be
    fully used, so fragmentation grows.

    Let's consider this simple example:

    #1 We have one block in a free list which is partially used, and where only
    one page is free:

    HEAD |xxxxxxxxx-| TAIL
                   ^
                   free space for 1 page, order 0

    #2 A new allocation request of order 1 (2 pages) comes; a new block is
    allocated, since we do not have enough free space to complete this
    request. The new block is put at the head of the free list:

    HEAD |----------|xxxxxxxxx-| TAIL

    #3 Two pages were occupied in the newly found block:

    HEAD |xx--------|xxxxxxxxx-| TAIL
          ^^
          two pages mapped here

    #4 A new allocation request of order 0 (1 page) comes. The block which was
    created in step #2 is located at the beginning of the free list, so it
    will be found first:

    HEAD |xxX-------|xxxxxxxxx-| TAIL
            ^                 ^
            page mapped here  but better to use this hole

    It is obvious that it is better to complete the request of step #4 using
    the old block, where free space is left; otherwise fragmentation will
    increase greatly.

    But fragmentation is not the only problem. The worst thing is that I can
    easily create a scenario in which the whole vmalloc space is exhausted by
    blocks which are not used, but are already dirty and have several free
    pages.

    Let's consider this function, whose execution should be pinned to one CPU:

    static void exhaust_virtual_space(struct page *pages[16], int iters)
    {
            /*
             * Firstly we have to map a big chunk, e.g. 16 pages.
             * Then we have to occupy the remaining space with smaller
             * chunks, i.e. 8 pages.  At the end small hole should remain.
             * So at the end of our allocation sequence block looks like
             * this:
             *               XX  big chunk
             * |XXxxxxxxx-|   x  small chunk
             *                -  hole, which is enough for a small chunk,
             *                   but is not enough for a big chunk
             */
            while (iters--) {
                    int i;
                    void *vaddr;

                    /* Map/unmap big chunk */
                    vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                    vm_unmap_ram(vaddr, 16);

                    /*
                     * Map/unmap small chunks.
                     *
                     * -1 for hole, which should be left at the end of each
                     * block to keep it partially used, with some free space
                     * available
                     */
                    for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                            vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                            vm_unmap_ram(vaddr, 8);
                    }
            }
    }

    On every iteration a new block (1MB of vm area in my case) will be
    allocated and then occupied, without any attempt to satisfy the small
    allocation requests from previously allocated blocks in the free list.

    In the case of random allocation (size randomly taken from the range
    [1..64] on 64-bit or [1..32] on 32-bit) the situation is the same: new
    blocks keep appearing whenever the maximum possible allocation size (32 or
    64) is passed to the allocator, because none of the remaining blocks in
    the free list has enough free space to complete the request.

    In summary, if new blocks are put at the head of the free list, virtual
    space will eventually be exhausted.

    In the current patch I simply put the newly allocated block at the tail of
    the free list, thus reducing fragmentation and giving older blocks, with
    whatever holes they have left, a chance to satisfy allocation requests.

    Signed-off-by: Roman Pen
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     

15 Apr, 2015

2 commits

  • Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA
    mappings when they are set. pud_clear_huge() and pmd_clear_huge() return
    zero when no operation is performed, i.e. a huge page mapping was not used.

    These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined
    on the architecture.
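
    A minimal sketch of how the PUD-level teardown uses this (a simplified
    illustration of the idea, not a verbatim copy of the mm/vmalloc.c code):

        static void vunmap_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end)
        {
                pud_t *pud;
                unsigned long next;

                pud = pud_offset(pgd, addr);
                do {
                        next = pud_addr_end(addr, end);
                        /* nonzero: a huge mapping was present and has been cleared */
                        if (pud_clear_huge(pud))
                                continue;
                        if (pud_none_or_clear_bad(pud))
                                continue;
                        /* otherwise descend and tear down the PMD level as before */
                        vunmap_pmd_range(pud, addr, next);
                } while (pud++, addr = next, addr != end);
        }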

    [akpm@linux-foundation.org: use consistent code layout]
    Signed-off-by: Toshi Kani
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • ioremap() and its related interfaces are used to create I/O mappings to
    memory-mapped I/O devices. The mapping sizes of traditional I/O devices
    are relatively small. Non-volatile memory (NVM), however, has many GB and
    is going to have TB soon. It is not very efficient to create large I/O
    mappings with 4KB pages.

    This patchset extends the ioremap() interfaces to transparently create I/O
    mappings with huge pages whenever possible. ioremap() continues to use 4KB
    mappings when a huge page does not fit into a requested range. There is no
    change necessary to drivers using ioremap(). A requested physical address
    must be aligned to a huge page size (1GB or 2MB on x86) to use huge page
    mappings, though. Kernel huge I/O mappings will improve performance for
    NVM and other devices with large memory, and reduce the time to create
    their mappings as well.

    On x86, MTRRs can override PAT memory types with a 4KB granularity. When
    using a huge page, MTRRs can override the memory type of the huge page,
    which may lead to a performance penalty. The processor can also behave in
    an undefined manner if a huge page is mapped to a memory range that MTRRs
    have mapped with multiple different memory types. Therefore, the mapping
    code falls back to smaller page sizes, toward 4KB, when a mapping range is
    covered by non-WB type MTRRs. The WB type of MTRRs has no effect on the
    PAT memory types.

    The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch
    supports huge KVA mappings for ioremap(). Users may specify a new kernel
    option "nohugeiomap" to disable the huge I/O mapping capability of
    ioremap() when necessary.

    Patches 1-4 change common files to support huge I/O mappings. There is no
    functional change unless HAVE_ARCH_HUGE_VMAP is defined on the
    architecture of the system.

    Patches 5-6 implement the HAVE_ARCH_HUGE_VMAP functions on x86, and set
    HAVE_ARCH_HUGE_VMAP on x86.

    This patch (of 6):

    __get_vm_area_node() takes an unsigned long size, which is a 64-bit value
    on a 64-bit kernel. However, fls(size) simply ignores the upper 32 bits.
    Change it to use fls_long() to handle the size properly.
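
    A small illustration of the difference (a sketch; fls() and fls_long() are
    declared in <linux/bitops.h>, and the size value here is just an example):

        #include <linux/bitops.h>

        static unsigned int size_fls_example(unsigned long size)
        {
                /*
                 * For size = 8GB = 0x200000000 on a 64-bit kernel:
                 *   fls(size)      truncates the argument to 32 bits,
                 *                  sees 0 and returns 0;
                 *   fls_long(size) works on the full unsigned long
                 *                  and returns 34.
                 */
                return fls_long(size);
        }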

    Signed-off-by: Toshi Kani
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

13 Mar, 2015

1 commit

  • The current approach to handling shadow memory for modules is broken.

    Shadow memory can be freed only after the memory it shadows is no longer
    used. vfree() called from interrupt context uses the memory it is freeing
    to store a 'struct llist_node' in it:

    void vfree(const void *addr)
    {
            ...
            if (unlikely(in_interrupt())) {
                    struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
                    if (llist_add((struct llist_node *)addr, &p->list))
                            schedule_work(&p->wq);
    This list node is later used in free_work(), which actually frees the
    memory. Currently module_memfree() called in interrupt context will free
    the shadow before freeing the module's memory, which can provoke a kernel
    crash.

    So shadow memory should be freed after the module's memory. However, such
    a deallocation order could race with kasan_module_alloc() in
    module_alloc().

    Free the shadow right before releasing the vm area. At this point the
    vfree()'d memory is not used anymore, yet it is not available for other
    allocations either. A new VM_KASAN flag is used to indicate that the vm
    area has dynamically allocated shadow memory, so kasan frees the shadow
    only if it was previously allocated.
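
    The KASan side of that check looks roughly like this (a sketch of the
    helper implied above, assuming the usual kasan_mem_to_shadow() address
    translation):

        void kasan_free_shadow(const struct vm_struct *vm)
        {
                /* only free shadow that was dynamically allocated for this area */
                if (vm->flags & VM_KASAN)
                        vfree(kasan_mem_to_shadow(vm->addr));
        }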

    Signed-off-by: Andrey Ryabinin
    Acked-by: Rusty Russell
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

14 Feb, 2015

2 commits

  • To instrument global variables, KASan needs to shadow the memory backing
    modules. So on module load we need to allocate memory for the shadow and
    map it at the address in the shadow region that corresponds to the address
    returned by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it puts
    a guard hole after the allocated area. A guard hole in shadow memory would
    be a problem, because at some future point we might need shadow memory at
    the address occupied by the guard hole; we could then fail to allocate
    shadow for module_alloc().

    Now we have the VM_NO_GUARD flag for disabling the guard page, so we need
    a way to pass it into __vmalloc_node_range(). Add a new parameter,
    'vm_flags', to __vmalloc_node_range().
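
    The extended prototype then looks roughly like this (a sketch; the exact
    parameter order is as I recall it, so treat it as illustrative), and
    kasan_module_alloc() can pass VM_NO_GUARD through the new argument:

        extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
                                unsigned long start, unsigned long end,
                                gfp_t gfp_mask, pgprot_t prot,
                                unsigned long vm_flags, int node,
                                const void *caller);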

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • To instrument global variables, KASan needs to shadow the memory backing
    modules. So on module load we need to allocate memory for the shadow and
    map it at the address in the shadow region that corresponds to the address
    returned by module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it puts
    a guard hole after the allocated area. A guard hole in shadow memory would
    be a problem, because at some future point we might need shadow memory at
    the address occupied by the guard hole; we could then fail to allocate
    shadow for module_alloc().

    Add a new vm_struct flag, 'VM_NO_GUARD', indicating that the vm area
    doesn't have a guard hole.
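
    A sketch of how such a flag is honored when computing the usable size of
    an area (close in spirit to the include/linux/vmalloc.h helper, shown here
    as an illustration):

        static inline size_t get_vm_area_size(const struct vm_struct *area)
        {
                if (!(area->flags & VM_NO_GUARD))
                        /* area->size still accounts for the trailing guard page */
                        return area->size - PAGE_SIZE;
                return area->size;
        }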

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

10 Oct, 2014

1 commit

  • Using seq_open_private() removes boilerplate code from vmalloc_open().

    The resultant code is shorter and easier to follow.

    However, please note that seq_open_private() calls kzalloc() rather than
    kmalloc(), which may affect timing due to the memory initialisation
    overhead.
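
    The resulting open routine looks roughly like this (a sketch of the
    pattern; the size of the private per-node counter buffer is an
    assumption):

        static int vmalloc_open(struct inode *inode, struct file *file)
        {
                if (IS_ENABLED(CONFIG_NUMA))
                        /* private buffer allocated and zeroed by seq_open_private() */
                        return seq_open_private(file, &vmalloc_op,
                                                nr_node_ids * sizeof(unsigned int));
                else
                        return seq_open(file, &vmalloc_op);
        }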

    Signed-off-by: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     

07 Aug, 2014

4 commits

  • Currently map_vm_area() takes (struct page *** pages) as its third
    argument, and after mapping it moves (*pages) to point to (*pages +
    nr_mapped_pages).

    This kind of increment is useless to its callers these days. The callers
    don't care about the increment and actually try to avoid it by passing
    another copy to map_vm_area().

    The caller can always guarantee that all the pages can be mapped into the
    vm_area specified in the first argument, and the caller only cares whether
    map_vm_area() fails or not.

    This patch cleans up the pointer movement in map_vm_area() and updates
    its callers accordingly.
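
    The cleaned-up interface then looks roughly like this (a sketch of the
    before/after prototypes):

        /* before: the pages cursor is advanced by the number of mapped pages */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages);

        /* after: a plain page array, no pointer movement */
        int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages);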

    Signed-off-by: WANG Chao
    Cc: Zhang Yanfei
    Acked-by: Greg Kroah-Hartman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Chao
     
  • tmp_mask in the __vmalloc_area_node() iteration never changes, so it can
    be moved to function scope and marked const. This causes the movl and orl
    to be done only once per call rather than area->nr_pages times.

    nested_gfp can also be marked const.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is not uncommon on busy servers to get stuck for hundreds of ms in
    vmalloc() calls (like file descriptor expansions).

    Add a cond_resched() to __vmalloc_area_node() to be gentle to
    other tasks.

    [akpm@linux-foundation.org: only do it for __GFP_WAIT, per David]
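
    A sketch of where this lands in the per-page allocation loop (simplified;
    the __GFP_WAIT gating follows the note above, and the fail label is just
    for this sketch):

        for (i = 0; i < area->nr_pages; i++) {
                struct page *page;

                /* yield to other tasks when the allocation is allowed to sleep */
                if (gfp_mask & __GFP_WAIT)
                        cond_resched();

                page = alloc_page(alloc_mask);
                if (!page)
                        goto fail;
                area->pages[i] = page;
        }
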
    Signed-off-by: Eric Dumazet
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Richard Yao reported a month ago that his system had trouble with
    vmap_area_lock contention during performance analysis via /proc/meminfo.
    Andrew asked why his analysis checks /proc/meminfo so stressfully, but he
    didn't answer.

    https://lkml.org/lkml/2014/4/10/416

    Although I'm not sure whether this is the right usage or not, there is a
    solution that reduces vmap_area_lock contention with no side effects: just
    use an RCU list iterator in get_vmalloc_info().

    RCU can be used in this function because the RCU protocol is already fully
    respected by writers, since Nick Piggin's commit db64fe02258f1 ("mm:
    rewrite vmap layer") back in linux-2.6.28.

    Specifically:
    insertions use list_add_rcu(),
    deletions use list_del_rcu() and kfree_rcu().

    Note that the rb tree is not used from an RCU reader (it would not be
    safe); only the vmap_area_list has full RCU protection.

    Note that __purge_vmap_area_lazy() already uses this RCU protection:

    rcu_read_lock();
    list_for_each_entry_rcu(va, &vmap_area_list, list) {
            if (va->flags & VM_LAZY_FREE) {
                    if (va->va_start < *start)
                            *start = va->va_start;
                    if (va->va_end > *end)
                            *end = va->va_end;
                    nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                    list_add_tail(&va->purge_list, &valist);
                    va->flags |= VM_LAZY_FREEING;
                    va->flags &= ~VM_LAZY_FREE;
            }
    }
    rcu_read_unlock();

    Peter:

    : While rcu list traversal over the vmap_area_list is safe, this may
    : arrive at different results than the spinlocked version. The rcu list
    : traversal version will not be a 'snapshot' of a single, valid instant
    : of the entire vmap_area_list, but rather a potential amalgam of
    : different list states.

    Joonsoo:

    : Yes, you are right, but I don't think that we should be strict here.
    : Meminfo is already not a 'snapshot' at specific time. While we try to get
    : certain stats, the other stats can change. And, although we may arrive at
    : different results than the spinlocked version, the difference would not be
    : large and would not make serious side-effect.

    [edumazet@google.com: add more commit description]
    Signed-off-by: Joonsoo Kim
    Reported-by: Richard Yao
    Acked-by: Eric Dumazet
    Cc: Peter Hurley
    Cc: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

3 commits

  • zsmalloc needs unmap_kernel_range exported in order to be built as a
    module. See https://lkml.org/lkml/2013/1/18/487

    I didn't send a patch to make unmap_kernel_range exportable at that time
    because zram was staging stuff, and I thought exporting VM functions for
    staging stuff made no sense.

    Now zsmalloc has been promoted. If we can't build zsmalloc as a module, we
    can't build zram as a module either. Additionally, its buddy map_vm_area
    is already exported, so let's export unmap_kernel_range to help its buddy.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Replace seq_printf where possible

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
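
    For example (using the vfree_deferred case quoted earlier in this log
    purely as an illustration of the pattern):

        /* before: __get_cpu_var() used only to compute a per-cpu address */
        struct vfree_deferred *p = &__get_cpu_var(vfree_deferred);

        /* after: the dedicated accessor for per-cpu pointers */
        struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);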

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 Apr, 2014

2 commits

  • vm_map_ram() has a fragmentation problem: it cannot purge a chunk (i.e. a
    4MB address space) if there is a pinned object in that address space, so
    it can easily consume all of the VMALLOC address space.

    We could fix the fragmentation problem by using vmap instead of
    vm_map_ram(), but vmap() is known to be slow compared to vm_map_ram().
    Minchan said vm_map_ram is 5 times faster than vmap in his tests. So I
    thought we should fix the fragmentation problem of vm_map_ram, because our
    proprietary GPU driver uses it heavily.

    On second thought, it's not easy, because solving the problem means
    reusing freed space, which could mean more IPIs and bitmap operations for
    searching holes. That would undermine the API's goal, which is very fast
    mapping. And the fragmentation problem wouldn't even show up on a 64-bit
    machine.

    Another option is for the user to separate long-lived and short-lived
    objects, and use vmap for the long-lived ones but vm_map_ram for the
    short-lived ones. If we inform users about this characteristic of
    vm_map_ram, they can choose according to the page lifetime.

    Let's add a notice to that effect for users.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Gioh Kim
    Reviewed-by: Zhang Yanfei
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gioh Kim
     
  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes with
    the right macro in the memory management (/mm) subsystem.
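
    For instance, a declaration like the one below is rewritten in terms of
    the compiler.h macro (the function name here is made up for illustration):

        #include <linux/compiler.h>

        /* before: raw gcc attribute syntax */
        void __attribute__((weak)) example_arch_hook(void);

        /* after: the portability macro provided via <linux/compiler.h> */
        void __weak example_arch_hook(void);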

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     

28 Jan, 2014

1 commit

  • Revert commit ece86e222db4, which was intended as a small performance
    improvement.

    Despite the claim that the patch doesn't introduce any functional changes,
    in fact it does.

    The "no page" path behaves differently now. Originally, vmalloc_to_page
    might return NULL under some conditions; with the new implementation it
    returns pfn_to_page(0), which is not the same as NULL.

    A simple test shows the difference.

    test.c

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/vmalloc.h>
    #include <linux/mm.h>

    int __init myi(void)
    {
            struct page *p;
            void *v;

            v = vmalloc(PAGE_SIZE);
            /* trigger the "no page" path in vmalloc_to_page */
            vfree(v);

            p = vmalloc_to_page(v);

            pr_err("expected val = NULL, returned val = %p", p);

            return -EBUSY;
    }

    void __exit mye(void)
    {
    }

    module_init(myi);
    module_exit(mye);

    Before interchange:
    expected val = NULL, returned val = (null)

    After interchange:
    expected val = NULL, returned val = c7ebe000

    Signed-off-by: Vladimir Murzin
    Cc: Jianyu Zhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    malc
     

22 Jan, 2014

1 commit

  • Currently we are implementing vmalloc_to_pfn() as a wrapper around
    vmalloc_to_page(), which is implemented as follows:

    1. walk the page tables to generate the corresponding pfn,
    2. convert the pfn to a struct page,
    3. return it.

    And vmalloc_to_pfn() re-wraps vmalloc_to_page() just to get the pfn back.

    This seems too circuitous, so this patch reverses the relationship:
    implement vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This
    makes vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient.

    No functional change.

    Signed-off-by: Jianyu Zhan
    Cc: Vladimir Murzin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     

13 Nov, 2013

6 commits

  • Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap
    blocks") had the side effect of making vmap_area.va_end member point to
    the next vmap_area.va_start. This was creating an artificial reference
    to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.

    This patch marks the vmap_area containing pointers explicitly and
    reduces the min ref_count to 2 as vm_struct still contains a reference
    to the vmalloc'ed object. The kmemleak add_scan_area() function has
    been improved to allow a SIZE_MAX argument covering the rest of the
    object (for simpler calling sites).

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Don't warn twice in __vmalloc_area_node() and __vmalloc_node_range() when
    the __vmalloc_area_node() allocation fails. This patch reverts commit
    46c001a2753f ("mm/vmalloc.c: emit the failure message before return").

    Signed-off-by: Wanpeng Li
    Reviewed-by: Zhang Yanfei
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e009d5b ("mm:
    avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to
    avoid accessing the pages field while the pages are still unallocated when
    show_numa_info() is called.

    This patch moves the check to just before show_numa_info, so that some
    messages can still be dumped via /proc/vmallocinfo. It reverts commit
    d157a55815ff ("mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show
    instead of show_numa_info").

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • There is a race window between vmap_area teardown and showing vmap_area
    information:

    A                                       B

    remove_vm_area
      spin_lock(&vmap_area_lock);
      va->vm = NULL;
      va->flags &= ~VM_VM_AREA;
      spin_unlock(&vmap_area_lock);
                                            spin_lock(&vmap_area_lock);
                                            if (va->flags & (VM_LAZY_FREE |
                                                             VM_LAZY_FREEING))
                                                    return 0;
                                            if (!(va->flags & VM_VM_AREA)) {
                                                    seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
                                                            (void *)va->va_start,
                                                            (void *)va->va_end,
                                                            va->va_end - va->va_start);
                                                    return 0;
                                            }
    free_unmap_vmap_area(va);
      flush_cache_vunmap
        free_unmap_vmap_area_noflush
          unmap_vmap_area
          free_vmap_area_noflush
            va->flags |= VM_LAZY_FREE

    The assumption that !VM_VM_AREA represents a vm_map_ram allocation was
    introduced by d4033afdf828 ("mm, vmalloc: iterate vmap_area_list,
    instead of vmlist, in vmallocinfo()").

    However, !VM_VM_AREA can also mean that the vmap_area is being torn down,
    in the race window shown above. This patch fixes it by not dumping any
    information for the !VM_VM_AREA case, and also removes the (VM_LAZY_FREE |
    VM_LAZY_FREEING) check, since those flags are not possible in the
    !VM_VM_AREA case.

    Suggested-by: Joonsoo Kim
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Zhang Yanfei
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The caller address has already been set in set_vmalloc_vm(), so there's no
    need to set it again in __vmalloc_area_node.

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Use more appropriate "if (node == NUMA_NO_NODE)" instead of "if (node < 0)"

    Signed-off-by: Jianguo Wu
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

10 Jul, 2013

9 commits

  • When searching for a vmap area in the vmalloc space, we use
    (addr + size - 1) to check whether the value is less than addr, i.e.
    whether it overflows. But we assign (addr + size) to vmap_area->va_end.

    So if we come across the case below:

    (addr + size - 1) : does not overflow
    (addr + size)     : overflows

    we will assign an overflowed value (e.g. 0) to vmap_area->va_end, and this
    will trigger a BUG in __insert_vmap_area, causing a system panic.

    So using (addr + size) to check for overflow is the correct behaviour, not
    (addr + size - 1).
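
    In alloc_vmap_area() the check then becomes, roughly (a before/after
    sketch consistent with the description above, not the exact hunk):

        /* before: misses the case where addr + size wraps around to exactly 0 */
        if (addr + size - 1 < addr)
                goto overflow;

        /* after: check the same quantity that is stored in va->va_end */
        if (addr + size < addr)
                goto overflow;

        va->va_start = addr;
        va->va_end = addr + size;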

    Signed-off-by: Zhang Yanfei
    Reported-by: Ghennadi Procopciuc
    Tested-by: Daniel Baluta
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • vfree() only needs schedule_work(&p->wq) if p->list was empty, otherwise
    vfree_deferred->wq is already pending or it is running and didn't do
    llist_del_all() yet.

    Signed-off-by: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We should check the VM_UNINITIALIZED flag in s_show(). If this flag is
    set, the vm_struct is not fully initialized, so it is pointless to try to
    show the information it contains.

    We checked this flag in show_numa_info(), but I think it's better to check
    it earlier.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • VM_UNLIST was used to indicate that the vm_struct is not listed in
    vmlist.

    But after commit 4341fa454796 ("mm, vmalloc: remove list management of
    vmlist after initializing vmalloc"), the meaning of this flag changed.
    It now means the vm_struct is not fully initialized. So renaming it to
    VM_UNINITIALIZED seems more reasonable.

    Also change clear_vm_unlist to clear_vm_uninitialized_flag.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Use goto to jump to the fail label to give a failure message before
    returning NULL. This makes the failure handling in this function
    consistent.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • As we have removed the dead code in vb_alloc(), there is no longer any
    user of alloc_map, so there is no reason to maintain it in vmap_block.

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • This function is nowhere used now, so remove it.

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Space in a vmap block that was once allocated is considered dirty and is
    not made available for allocation again before the whole block is
    recycled. The result is that free space within a vmap block is always
    contiguous.

    So if a vmap block has enough free space for an allocation, the allocation
    cannot fail. Thus, the fragmented-block purging was never invoked from
    vb_alloc(). Remove this dead code.

    [ Same patches also sent by:

    Chanho Min
    Johannes Weiner

    but git doesn't do "multiple authors" ]

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • There is an extra semi-colon so the function always returns.

    Signed-off-by: Dan Carpenter
    Acked-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter