08 Oct, 2016

2 commits

  • warn_alloc_failed is currently used from the page and vmalloc
    allocators. This is a good reuse of the code except that vmalloc would
    appreciate a slightly different warning message. This is already
    handled by the fmt parameter except that

    "%s: page allocation failure: order:%u, mode:%#x(%pGg)"

    is printed anyway. This might be quite misleading because it might be a
    vmalloc failure which leads to the warning while the page allocator is
    not the culprit here. Fix this by always using the fmt string and
    printing only the information that makes sense for the particular
    context (e.g. order makes very little sense for the vmalloc context).

    Rename the function so that no user is missed, and also because a later
    patch will reuse it for !failure cases.

    Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __get_vm_area_node() requests twice the required alignment if parameter
    size is a power of 2 and VM_IOREMAP is set in parameter flags, for
    example size=0x10000 -> fls_long(0x10000)=17 -> align=0x20000

    get_count_order_long() is implemented and can be used instead of
    fls_long() to fix the bug, for example size=0x10000 ->
    get_count_order_long(0x10000)=16 -> align=0x10000
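
    The arithmetic above can be checked with a small user-space sketch. The
    two helpers below are stand-ins written for illustration (the _model
    names are hypothetical); they only mirror the behaviour the message
    describes and are not the kernel's implementations.

```c
/* User-space stand-in for fls_long(): 1-based index of the most
 * significant set bit, 0 when no bit is set. */
static int fls_long_model(unsigned long x)
{
    int bit = 0;

    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Stand-in for get_count_order_long(): smallest order such that
 * (1UL << order) >= x. Returning -1 for 0 is an assumption of this sketch. */
static int count_order_long_model(unsigned long x)
{
    if (x == 0)
        return -1;
    if ((x & (x - 1)) == 0)      /* exact power of two: no rounding up */
        return fls_long_model(x) - 1;
    return fls_long_model(x);
}
```

    For size=0x10000 the first helper yields 17 (align=0x20000) and the
    second yields 16 (align=0x10000); sizes that are not a power of 2 give
    the same result from both.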

    [akpm@linux-foundation.org: s/get_order_long()/get_count_order_long()/]
    [zijun_hu@zoho.com: fixes]
    Link: http://lkml.kernel.org/r/57AABC8B.1040409@zoho.com
    [akpm@linux-foundation.org: locate get_count_order_long() next to get_count_order()]
    [akpm@linux-foundation.org: move get_count_order[_long] definitions to pick up fls_long()]
    [zijun_hu@htc.com: move out get_count_order[_long]() from __KERNEL__ scope]
    Link: http://lkml.kernel.org/r/57B2C4CE.80303@zoho.com
    Link: http://lkml.kernel.org/r/fc045ecf-20fa-0722-b3ac-9a6140488fad@zoho.com
    Signed-off-by: zijun_hu
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: David Rientjes
    Signed-off-by: zijun_hu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     

27 Jul, 2016

1 commit

  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it will be plus one gfp flags check on alloc and plus one
    page->_mapcount check on free, which shouldn't hurt performance, because
    the data accessed are hot.
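
    A toy user-space model of the scheme described above, under stated
    assumptions: the flag bit, the marker value and all _model names are
    made up for illustration and do not match the kernel's definitions.
    It only shows the shape of the idea: one flag check on alloc, one
    _mapcount check on free.

```c
#define GFP_ACCOUNT_MODEL  0x1u    /* hypothetical bit, not the kernel value */
#define KMEMCG_MARK_MODEL  (-512)  /* hypothetical _mapcount marker */

struct page_model {
    int mapcount;                  /* -1 means "not mapped, not marked" */
};

static long charged_pages;         /* stand-in for the memcg counter */

/* Allocation path: a single gfp flag check decides whether to charge. */
static void alloc_page_model(struct page_model *page, unsigned int gfp)
{
    page->mapcount = -1;
    if (gfp & GFP_ACCOUNT_MODEL) {
        charged_pages++;
        page->mapcount = KMEMCG_MARK_MODEL;
    }
}

/* Free path: a single _mapcount check decides whether to uncharge,
 * so the caller can use any free routine. */
static void free_page_model(struct page_model *page)
{
    if (page->mapcount == KMEMCG_MARK_MODEL) {
        charged_pages--;
        page->mapcount = -1;
    }
}
```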

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

04 Jun, 2016

1 commit

  • When remapping pages accounting for 4G or more memory space, the
    operation 'count << PAGE_SHIFT' overflows as it is performed on an
    integer. Solution: cast before doing the bitshift.
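
    A minimal sketch of the overflow, assuming 4 KiB pages and a 32-bit
    page count. The names are hypothetical; unsigned fixed-width types are
    used here so that the wraparound is well defined in user space.

```c
#include <stdint.h>

#define PAGE_SHIFT_MODEL 12   /* assumed 4 KiB pages */

/* Shift performed in 32-bit arithmetic: the result wraps once
 * count << PAGE_SHIFT no longer fits in 32 bits. */
static uint64_t bytes_truncated(uint32_t count)
{
    return (uint32_t)(count << PAGE_SHIFT_MODEL);
}

/* Widen first, then shift: the full byte count is preserved. */
static uint64_t bytes_widened(uint32_t count)
{
    return (uint64_t)count << PAGE_SHIFT_MODEL;
}
```

    With count = 1 << 20 pages (4G of address space) the first variant
    yields 0 while the second yields 1 << 32.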

    [akpm@linux-foundation.org: fix vm_unmap_ram() also]
    [akpm@linux-foundation.org: fix vmap() as well, per Guillermo]
    Link: http://lkml.kernel.org/r/etPan.57175fb3.7a271c6b.2bd@naudit.es
    Signed-off-by: Guillermo Julián Moreno
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guillermo Julián Moreno
     

24 May, 2016

1 commit

  • Pull drm updates from Dave Airlie:
    "Here's the main drm pull request for 4.7, it's been a busy one, and
    I've been a bit more distracted in real life this merge window. Lots
    more ARM drivers, not sure if it'll ever end. I think I've at least
    one more coming the next merge window.

    But changes are all over the place, support for AMD Polaris GPUs is in
    here, some missing GM108 support for nouveau (found in some Lenovos),
    a bunch of MST and skylake fixes.

    I've also noticed a few fixes from Arnd in my inbox, that I'll try and
    get in asap, but I didn't think they should hold this up.

    New drivers:
    - Hisilicon kirin display driver
    - Mediatek MT8173 display driver
    - ARC PGU - bitstreamer on Synopsys ARC SDP boards
    - Allwinner A13 initial RGB output driver
    - Analogix driver for DisplayPort IP found in exynos and rockchip

    DRM Core:
    - UAPI headers fixes and C++ safety
    - DRM connector reference counting
    - DisplayID mode parsing for Dell 5K monitors
    - Removal of struct_mutex from drivers
    - Connector registration cleanups
    - MST robustness fixes
    - MAINTAINERS updates
    - Lockless GEM object freeing
    - Generic fbdev deferred IO support

    panel:
    - Support for a bunch of new panels

    i915:
    - VBT refactoring
    - PLL computation cleanups
    - DSI support for BXT
    - Color manager support
    - More atomic patches
    - GEM improvements
    - GuC fw loading fixes
    - DP detection fixes
    - SKL GPU hang fixes
    - Lots of BXT fixes

    radeon/amdgpu:
    - Initial Polaris support
    - GPUVM/Scheduler/Clock/Power improvements
    - ASYNC pageflip support
    - New mesa feature support

    nouveau:
    - GM108 support
    - Power sensor support improvements
    - GR init + ucode fixes.
    - Use GPU provided topology information

    vmwgfx:
    - Add host messaging support

    gma500:
    - Some cleanups and fixes

    atmel:
    - Bridge support
    - Async atomic commit support

    fsl-dcu:
    - Timing controller for LCD support
    - Pixel clock polarity support

    rcar-du:
    - Misc fixes

    exynos:
    - Pipeline clock support
    - Exynos5433 SoC support
    - HW trigger mode support
    - export HDMI_PHY clock
    - DECON5433 fixes
    - Use generic prime functions
    - use DMA mapping APIs

    rockchip:
    - Lots of little fixes

    vc4:
    - Render node support
    - Gamma ramp support
    - DPI output support

    msm:
    - Mostly cleanups and fixes
    - Conversion to generic struct fence

    etnaviv:
    - Fix for prime buffer handling
    - Allow hangcheck to be coalesced with other wakeups

    tegra:
    - Gamma table size fix"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1050 commits)
    drm/edid: add displayid detailed 1 timings to the modelist. (v1.1)
    drm/edid: move displayid validation to it's own function.
    drm/displayid: Iterate over all DisplayID blocks
    drm/edid: move displayid tiled block parsing into separate function.
    drm: Nuke ->vblank_disable_allowed
    drm/vmwgfx: Report vmwgfx version to vmware.log
    drm/vmwgfx: Add VMWare host messaging capability
    drm/vmwgfx: Kill some lockdep warnings
    drm/nouveau/gr/gf100-: fix race condition in fecs/gpccs ucode
    drm/nouveau/core: recognise GM108 chipsets
    drm/nouveau/gr/gm107-: fix touching non-existent ppcs in attrib cb setup
    drm/nouveau/gr/gk104-: share implementation of ppc exception init
    drm/nouveau/gr/gk104-: move rop_active_fbps init to nonctx
    drm/nouveau/bios/pll: check BIT table version before trying to parse it
    drm/nouveau/bios/pll: prevent oops when limits table can't be parsed
    drm/nouveau/volt/gk104: round up in gk104_volt_set
    drm/nouveau/fb/gm200: setup mmu debug buffer registers at init()
    drm/nouveau/fb/gk20a,gm20b: setup mmu debug buffer registers at init()
    drm/nouveau/fb/gf100-: allocate mmu debug buffers
    drm/nouveau/fb: allow chipset-specific actions for oneinit()
    ...

    Linus Torvalds
     

21 May, 2016

1 commit

  • When mixing lots of vmallocs and set_memory_*() (which calls
    vm_unmap_aliases()) I encountered situations where the performance
    degraded severely due to the walking of the entire vmap_area list on
    each invocation.

    One simple improvement is to add the lazily freed vmap_area to a
    separate lockless free list, such that we then avoid having to walk the
    full list on each purge.
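
    The idea of a lockless free list can be sketched in user space with C11
    atomics. This is an illustration only: the lazy_* names are
    hypothetical, and the kernel has its own lock-less list helpers rather
    than the code below.

```c
#include <stdatomic.h>
#include <stddef.h>

struct lazy_node {
    struct lazy_node *next;
};

struct lazy_list {
    _Atomic(struct lazy_node *) first;
};

/* Push one lazily freed entry; safe against concurrent pushers. */
static void lazy_list_add(struct lazy_list *list, struct lazy_node *node)
{
    struct lazy_node *old = atomic_load(&list->first);

    do {
        node->next = old;   /* old is refreshed when the CAS fails */
    } while (!atomic_compare_exchange_weak(&list->first, &old, node));
}

/* Detach the whole list with one atomic exchange; the purger then walks
 * only the entries that were actually queued. */
static struct lazy_node *lazy_list_del_all(struct lazy_list *list)
{
    return atomic_exchange(&list->first, (struct lazy_node *)NULL);
}
```

    The purge path then costs time proportional to the number of lazily
    freed areas instead of the length of the full vmap_area list.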

    Signed-off-by: Chris Wilson
    Reviewed-by: Roman Pen
    Cc: Joonas Lahtinen
    Cc: Tvrtko Ursulin
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Roman Pen
    Cc: Mel Gorman
    Cc: Toshi Kani
    Cc: Shawn Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

12 Apr, 2016

1 commit

  • Linux 4.6-rc3

    Backmerge requested by Chris Wilson to make his patches apply cleanly.
    Tiny conflict in vmalloc.c with the (properly acked and all) patch in
    drm-intel-next:

    commit 4da56b99d99e5a7df2b7f11e87bfea935f909732
    Author: Chris Wilson
    Date: Mon Apr 4 14:46:42 2016 +0100

    mm/vmap: Add a notifier for when we run out of vmap address space

    and Linus' tree.

    Signed-off-by: Daniel Vetter

    Daniel Vetter
     

05 Apr, 2016

1 commit

  • vmaps are temporary kernel mappings that may be of long duration.
    Reusing a vmap on an object is preferable for a driver as the cost of
    setting up the vmap can otherwise dominate the operation on the object.
    However, the vmap address space is rather limited on 32bit systems and
    so we add a notification for vmap pressure in order for the driver to
    release any cached vmappings.

    The interface is styled after the oom-notifier where the callees are
    passed a pointer to an unsigned long counter for them to indicate if they
    have freed any space.
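
    The calling convention described above can be modelled in user space as
    follows. All names here are hypothetical and the fixed-size array is a
    simplification; the sketch only shows callees reporting progress
    through a shared unsigned long counter.

```c
#include <stddef.h>

/* Hypothetical callback type: each callee adds what it freed to *freed. */
typedef void (*purge_fn)(unsigned long *freed);

#define MAX_PURGE_CALLBACKS 4

static purge_fn purge_chain[MAX_PURGE_CALLBACKS];
static size_t purge_count;

/* Register a callback; returns 0 on success, -1 when the chain is full. */
static int purge_register(purge_fn fn)
{
    if (purge_count == MAX_PURGE_CALLBACKS)
        return -1;
    purge_chain[purge_count++] = fn;
    return 0;
}

/* Called on address-space pressure. A non-zero result tells the
 * allocator that a retry may now succeed. */
static unsigned long purge_notify_all(void)
{
    unsigned long freed = 0;

    for (size_t i = 0; i < purge_count; i++)
        purge_chain[i](&freed);
    return freed;
}

/* Example callee: pretends to drop three cached mappings. */
static void drop_cached_mappings(unsigned long *freed)
{
    *freed += 3;
}
```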

    v2: Guard the blocking notifier call with gfpflags_allow_blocking()
    v3: Correct typo in forward declaration and move to head of file

    Signed-off-by: Chris Wilson
    Cc: Andrew Morton
    Cc: David Rientjes
    Cc: Roman Peniaev
    Cc: Mel Gorman
    Cc: linux-mm@kvack.org
    Cc: linux-kernel@vger.kernel.org
    Acked-by: Andrew Morton # for inclusion via DRM
    Cc: Joonas Lahtinen
    Cc: Tvrtko Ursulin
    Link: http://patchwork.freedesktop.org/patch/msgid/1459777603-23618-3-git-send-email-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen

    Chris Wilson
     

18 Mar, 2016

3 commits

  • We have PAGE_ALIGNED() in mm.h, so let's use it instead of IS_ALIGNED()
    for checking PAGE_SIZE aligned case.
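
    For illustration, simplified user-space versions of the two checks. The
    _MODEL macros and the 4096-byte page size are assumptions of this
    sketch, not the kernel definitions.

```c
#include <stdint.h>

#define PAGE_SIZE_MODEL 4096UL   /* assumed page size */

/* Generic check: is x a multiple of the power-of-two a? */
#define IS_ALIGNED_MODEL(x, a)  (((x) & ((a) - 1)) == 0)

/* Page-specific wrapper, so call sites need not spell out the page size. */
#define PAGE_ALIGNED_MODEL(addr) \
    IS_ALIGNED_MODEL((uintptr_t)(addr), PAGE_SIZE_MODEL)
```

    Both forms give the same answer; the wrapper just states the intent at
    the call site.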

    Signed-off-by: Shawn Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Lin
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel parameters
    we can optimize some cases by checking the enablement state.

    This is follow-up work for Christian's Optimize CONFIG_DEBUG_PAGEALLOC:

    https://lkml.org/lkml/2016/1/27/194

    The remaining work is to make sparc aware of this, but that does not
    look easy to me, so I skip it in this series.

    This patch (of 5):

    We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not at runtime.

    [akpm@linux-foundation.org: update comment, per David. Adjust comment to use 80 cols]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Christian Borntraeger
    Acked-by: David Rientjes
    Cc: Benjamin Herrenschmidt
    Cc: Takashi Iwai
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

16 Jan, 2016

1 commit


15 Jan, 2016

3 commits

  • VM_VPAGES is unnecessary, it's easier to check is_vmalloc_addr() when
    reading /proc/vmallocinfo.

    [akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()]
    Signed-off-by: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • To make the intention clearer, use list_{next,first}_entry instead of
    list_entry.
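
    A user-space sketch of why the more specific helpers read better. The
    _model macros are simplified stand-ins (they take an explicit type
    argument to avoid compiler extensions, which is an assumption of this
    sketch), and struct area_model is hypothetical.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define container_of_model(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Same result as a list_entry() on (head)->next, but the name says
 * "first element of this list". */
#define list_first_entry_model(head, type, member) \
    container_of_model((head)->next, type, member)

/* Same result as a list_entry() on (pos)->member.next, but the name says
 * "element after pos". */
#define list_next_entry_model(pos, type, member) \
    container_of_model((pos)->member.next, type, member)

struct area_model {
    unsigned long start;
    struct list_head list;
};
```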

    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Make vmalloc family functions allocate vmalloc area pages with
    alloc_kmem_pages so that if __GFP_ACCOUNT is set they will be accounted
    to memcg. This is needed, at least, to account alloc_fdmem allocations.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

21 Nov, 2015

1 commit

  • Commit 71394fe50146 ("mm: vmalloc: add flag preventing guard hole
    allocation") missed a spot. Currently remove_vm_area() decreases vm->size
    to "remove" the guard hole page, even when it isn't present. All users
    but one just free the vm_struct right away and never access vm->size
    anyway.

    Don't touch the size in remove_vm_area() and have __vunmap() use the
    proper get_vm_area_size() helper.
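
    A sketch of what such a size helper has to do, assuming a 4096-byte
    page and a made-up flag value. The _model names are hypothetical; only
    the logic (subtract the guard page only when one exists) follows the
    description above.

```c
#include <stddef.h>

#define PAGE_SIZE_MODEL   4096UL  /* assumed page size */
#define VM_NO_GUARD_MODEL 0x40UL  /* hypothetical bit value */

struct vm_struct_model {
    size_t size;           /* includes the guard page when one exists */
    unsigned long flags;
};

/* Usable size: subtract the guard page only if the area really has one. */
static size_t area_size_model(const struct vm_struct_model *area)
{
    if (area->flags & VM_NO_GUARD_MODEL)
        return area->size;
    return area->size - PAGE_SIZE_MODEL;
}
```

    Leaving the stored size untouched and computing the usable size on
    demand avoids shrinking an area that never had a guard page.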

    Signed-off-by: Jerome Marchand
    Acked-by: Andrey Ryabinin
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

07 Nov, 2015

2 commits

  • Andrew stated the following

    We have quite a history of remote parts of the kernel using
    weird/wrong/inexplicable combinations of __GFP_ flags. I tend
    to think that this is because we didn't adequately explain the
    interface.

    And I don't think that gfp.h really improved much in this area as
    a result of this patchset. Could you go through it some time and
    decide if we've adequately documented all this stuff?

    This patch first moves some GFP flag combinations that are part of the MM
    internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
    bits under various headings and then documents the flag combinations. It
    will not help callers that are brain damaged but the clarity might motivate
    some fixes and avoid future mistakes.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
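
    The split between the two reclaim bits can be illustrated with a small
    user-space model. The bit values and the M_/model_ names are made up
    for this sketch and do not match the kernel's definitions; only the
    relationships between the flags follow the description above.

```c
#include <stdbool.h>

/* Hypothetical bit assignments for this sketch; the kernel's differ. */
#define M_GFP_DIRECT_RECLAIM  0x04u
#define M_GFP_KSWAPD_RECLAIM  0x08u

/* "Willing to enter direct reclaim and wake kswapd", per the text. */
#define M_GFP_WAIT  (M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)

/* A caller may block only if it is willing to enter direct reclaim. */
static bool model_allow_blocking(unsigned int gfp)
{
    return (gfp & M_GFP_DIRECT_RECLAIM) != 0;
}

/* Mempool-backed caller: clear direct reclaim, keep the kswapd bit. */
static unsigned int model_mempool_gfp(unsigned int gfp)
{
    return gfp & ~M_GFP_DIRECT_RECLAIM;
}
```

    After model_mempool_gfp(), a test of the form (gfp & M_GFP_WAIT) is
    still non-zero although the caller must not block, which is the kind
    of false positive that the helper-based check avoids.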

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

06 Nov, 2015

1 commit


02 Nov, 2015

1 commit

  • It turns out that at least some versions of glibc end up reading
    /proc/meminfo at every single startup, because glibc wants to know the
    amount of memory the machine has. And while that's arguably insane,
    it's just how things are.

    And it turns out that it's not all that expensive most of the time, but
    the vmalloc information statistics (amount of virtual memory used in the
    vmalloc space, and the biggest remaining chunk) can be rather expensive
    to compute.

    The 'get_vmalloc_info()' function actually showed up on my profiles as
    4% of the CPU usage of "make test" in the git source repository, because
    the git tests are lots of very short-lived shell-scripts etc.

    It turns out that apparently this same silly vmalloc info gathering
    shows up on the facebook servers too, according to Dave Jones. So it's
    not just "make test" for git.

    We had two patches to just cache the information (one by me, one by
    Ingo) to mitigate this issue, but the whole vmalloc information is of
    rather dubious value to begin with, and people who *actually* want to
    know what the situation is wrt the vmalloc area should just look at the
    much more complete /proc/vmallocinfo instead.

    In fact, according to my testing - and perhaps more importantly,
    according to that big search engine in the sky: Google - there is
    nothing out there that actually cares about those two expensive fields:
    VmallocUsed and VmallocChunk.

    So let's try to just remove them entirely. Actually, this just removes
    the computation and reports the numbers as zero for now, just to try to
    be minimally intrusive.

    If this breaks anything, we'll obviously have to re-introduce the code
    to compute this all and add the caching patches on top. But if given
    the option, I'd really prefer to just remove this bad idea entirely
    rather than add even more code to work around our historical mistake
    that likely nobody really cares about.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Apr, 2015

3 commits

  • In the original implementation of vm_map_ram made by Nick Piggin there
    were two bitmaps: alloc_map and dirty_map. Neither of them was used as
    it was supposed to be: for finding a suitable free hole for the next
    allocation in a block. vm_map_ram allocates space sequentially in a block
    and on free call marks pages as dirty, so freed space can't be reused
    anymore.

    Actually it would be very interesting to know the real meaning of those
    bitmaps, maybe implementation was incomplete, etc.

    But a long time ago Zhang Yanfei removed alloc_map by these two commits:

    mm/vmalloc.c: remove dead code in vb_alloc
    3fcd76e8028e0be37b02a2002b4f56755daeda06
    mm/vmalloc.c: remove alloc_map from vmap_block
    b8e748b6c32999f221ea4786557b8e7e6c4e4e7a

    In this patch I replaced dirty_map with two range variables: dirty min and
    max. These variables store minimum and maximum position of dirty space in
    a block, since we need only to know the dirty range, not exact position of
    dirty pages.

    Why was this done? Several reasons: at first glance it seems that the
    vm_map_ram allocator is concerned about fragmentation and thus uses
    bitmaps for finding a free hole, but that is not true. To avoid
    complexity it seems better to use something simple, like min or max range
    values. Secondly, the code also becomes simpler, without iteration over a
    bitmap, just comparing values in min and max macros. Thirdly, the bitmap
    occupies up to 1024 bits (4MB is the max size of a block). Here I
    replaced the whole bitmap with two longs.

    Finally vm_unmap_aliases should be slightly faster and the whole
    vmap_block structure occupies less memory.
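
    The min/max bookkeeping can be sketched as follows. This is a
    user-space illustration with hypothetical _model names; the capacity of
    1024 pages is taken from the bitmap size mentioned above, and the
    encoding of the empty range is an assumption of the sketch.

```c
#define BBMAP_BITS_MODEL 1024UL   /* block capacity in pages, per the text */

struct vmap_block_model {
    unsigned long dirty_min;   /* lowest dirty page index */
    unsigned long dirty_max;   /* one past the highest dirty page index */
};

/* An empty range is encoded as min = capacity, max = 0. */
static void block_init_model(struct vmap_block_model *vb)
{
    vb->dirty_min = BBMAP_BITS_MODEL;
    vb->dirty_max = 0;
}

/* Freeing pages [offset, offset + nr) can only widen the dirty range. */
static void block_mark_dirty_model(struct vmap_block_model *vb,
                                   unsigned long offset, unsigned long nr)
{
    if (offset < vb->dirty_min)
        vb->dirty_min = offset;
    if (offset + nr > vb->dirty_max)
        vb->dirty_max = offset + nr;
}
```

    Each free is two comparisons, and a later flush only needs the range
    [dirty_min, dirty_max) rather than a scan over a bitmap.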

    Signed-off-by: Roman Pen
    Cc: Zhang Yanfei
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • The previous implementation allocates a new vmap block and repeats the
    search for a free block from the very beginning, iterating over the CPU
    free list.

    Why can this be better?

    1. Allocation can happen on one CPU, but the search can be done on
    another CPU. In the worst case we preallocate a number of vmap blocks
    equal to the number of CPUs in the system.

    2. In the previous patch I added the newly allocated block to the tail of
    the free list to avoid early exhaustion of virtual space and to give a
    chance to occupy blocks which were allocated a long time ago. Thus to
    find the newly allocated block the whole search sequence has to be
    repeated, which does not seem efficient.

    In this patch the newly allocated block is occupied right away and the
    address of the virtual space is returned to the caller, so there is no
    need to repeat the search sequence: the allocation job is done.

    Signed-off-by: Roman Pen
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     
  • Recently I came across high fragmentation of the vm_map_ram allocator:
    vmap_block has free space, but still new blocks continue to appear.
    Further investigation showed that certain mapping/unmapping sequences
    can exhaust vmalloc space. On small 32bit systems that's not a big
    problem, because purging will be called soon on a first allocation
    failure (alloc_vmap_area), but on 64bit machines, e.g. x86_64 has 45 bits
    of vmalloc space, that can be a disaster.

    1) I came up with a simple allocation sequence, which exhausts virtual
    space very quickly:

    while (iters) {

        /* Map/unmap big chunk */
        vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
        vm_unmap_ram(vaddr, 16);

        /* Map/unmap small chunks.
         *
         * -1 for hole, which should be left at the end of each block
         * to keep it partially used, with some free space available */
        for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
            vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, 8);
        }
    }

    The idea behind it is simple:

    1. We have to map a big chunk, e.g. 16 pages.

    2. Then we have to occupy the remaining space with smaller chunks, i.e.
    8 pages. At the end a small hole should remain to keep the block in the
    free list, but not let a big chunk occupy the remaining space.

    3. Goto 1 - an allocation request of 16 pages can't be completed (only 8
    slots are left free in the block in step #2), a new block will be
    allocated, and all further requests will land in the newly allocated
    block.

    To have some measurement numbers for all further tests I setup ftrace and
    enabled 4 basic calls in a function profile:

    echo vm_map_ram > /sys/kernel/debug/tracing/set_ftrace_filter;
    echo alloc_vmap_area >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo vm_unmap_ram >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo free_vmap_block >> /sys/kernel/debug/tracing/set_ftrace_filter;

    So for this scenario I got these results:

    BEFORE (all new blocks are put to the head of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function              Hit    Time           Avg         s^2
    --------              ---    ----           ---         ---
    vm_map_ram         126000    30683.30 us    0.243 us    30819.36 us
    vm_unmap_ram       126000    22003.24 us    0.174 us    340.886 us
    alloc_vmap_area      1000    4132.065 us    4.132 us    0.903 us

    AFTER (all new blocks are put to the tail of a free list)
    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function              Hit    Time           Avg         s^2
    --------              ---    ----           ---         ---
    vm_map_ram         126000    28713.13 us    0.227 us    24944.70 us
    vm_unmap_ram       126000    20403.96 us    0.161 us    1429.872 us
    alloc_vmap_area       993    3916.795 us    3.944 us    29.370 us
    free_vmap_block       992    654.157 us     0.659 us    1.273 us

    SUMMARY:

    The most interesting numbers in those tables are numbers of block
    allocations and deallocations: alloc_vmap_area and free_vmap_block
    calls, which show that before the change blocks were not freed, and
    virtual space and physical memory (vmap_block structure allocations,
    etc) were consumed.

    The average time spent in vm_map_ram/vm_unmap_ram became slightly
    better. That can be explained by a reasonable number of blocks in the
    free list, which we need to iterate to find a suitable free block.
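
    As a cross-check of the numbers in scenario 1, the loop bounds can be
    worked through in a few lines. This sketch assumes a block capacity of
    1024 pages (VMAP_BBMAP_BITS is not stated in this message, so the value
    is an assumption), and the function names are hypothetical.

```c
#define BBMAP_BITS_MODEL 1024UL   /* assumed block capacity in pages */

/* vm_map_ram() calls made by one pass of the loop in scenario 1:
 * one 16-page mapping plus the 8-page mappings of the inner loop. */
static unsigned long maps_per_iteration(void)
{
    unsigned long small = (BBMAP_BITS_MODEL - 16) / 8 - 1;

    return 1 + small;
}

/* Pages of a block never handed out during one pass (the deliberate hole). */
static unsigned long hole_pages(void)
{
    unsigned long small = (BBMAP_BITS_MODEL - 16) / 8 - 1;

    return BBMAP_BITS_MODEL - 16 - small * 8;
}
```

    Under that assumption one pass makes 126 mappings and leaves an 8-page
    hole, which is consistent with the 126000 vm_map_ram hits and the 8
    free slots mentioned in step 3.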

    2) Another scenario is a random allocation:

    while (iters) {

        /* Randomly take number from a range [1..32/64] */
        nr = rand(1, VMAP_MAX_ALLOC);
        vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
        vm_unmap_ram(vaddr, nr);
    }

    I chose mersenne twister PRNG to generate persistent random state to
    guarantee that both runs have the same random sequence. For each
    vm_map_ram call random number from [1..32/64] was taken to represent
    amount of pages which I do map.

    I did 10'000 vm_map_ram calls and got these two tables:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function              Hit    Time           Avg         s^2
    --------              ---    ----           ---         ---
    vm_map_ram          10000    10170.01 us    1.017 us    993.609 us
    vm_unmap_ram        10000    5321.823 us    0.532 us    59.789 us
    alloc_vmap_area       420    2150.239 us    5.119 us    3.307 us
    free_vmap_block        37    159.587 us     4.313 us    134.344 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function              Hit    Time           Avg         s^2
    --------              ---    ----           ---         ---
    vm_map_ram          10000    7745.637 us    0.774 us    395.229 us
    vm_unmap_ram        10000    5460.573 us    0.546 us    67.187 us
    alloc_vmap_area       414    2201.650 us    5.317 us    5.591 us
    free_vmap_block       412    574.421 us     1.394 us    15.138 us

    SUMMARY:

    The 'BEFORE' table shows that 420 blocks were allocated and only 37 were
    freed. The remaining 383 blocks are still in a free list, consuming
    virtual space and physical memory.

    The 'AFTER' table shows that 414 blocks were allocated and 412 were
    really freed. 2 blocks remained in a free list.

    So fragmentation was dramatically reduced. Why? Because when we put a
    newly allocated block at the head, all further requests will occupy the
    new block, regardless of the space remaining in other blocks. In this
    scenario all requests come randomly. Eventually the remaining free space
    will be less than the requested size, the free list will be iterated and
    it is possible that nothing will be found there - finally a new block
    will be created. So exhaustion in the random scenario happens for the
    maximum possible allocation size: 32 pages for a 32-bit system and 64
    pages for a 64-bit system.

    Also average cost of vm_map_ram was reduced from 1.017 us to 0.774 us.
    Again this can be explained by iteration through smaller list of free
    blocks.

    3) Next simple scenario is a sequential allocation, when the allocation
    order is increased for each block. This scenario forces allocator to
    reach maximum amount of partially free blocks in a free list:

    while (iters) {

        /* Populate free list with blocks with remaining space */
        for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
            nr = VMAP_BBMAP_BITS / (1 << order);

            /* Leave a hole */
            nr -= 1;

            for (i = 0; i < nr; i++) {
                vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, (1 << order));
            }
        }

        /* Completely occupy blocks from a free list */
        for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
            vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, (1 << order));
        }
    }

    Results which I got:

    BEFORE (all new blocks are put to the head of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function              Hit    Time           Avg         s^2
    --------              ---    ----           ---         ---
    vm_map_ram        2032000    399545.2 us    0.196 us    467123.7 us
    vm_unmap_ram      2032000    363225.7 us    0.178 us    111405.9 us
    alloc_vmap_area      7001    30627.76 us    4.374 us    495.755 us
    free_vmap_block      6993    7011.685 us    1.002 us    159.090 us

    AFTER (all new blocks are put to the tail of a free list)

    # cat /sys/kernel/debug/tracing/trace_stat/function0
    Function              Hit    Time           Avg         s^2
    --------              ---    ----           ---         ---
    vm_map_ram        2032000    394259.7 us    0.194 us    589395.9 us
    vm_unmap_ram      2032000    292500.7 us    0.143 us    94181.08 us
    alloc_vmap_area      7000    31103.11 us    4.443 us    703.225 us
    free_vmap_block      7000    6750.844 us    0.964 us    119.112 us

    SUMMARY:

    No surprises here, almost all numbers are the same.

    While fixing this fragmentation problem I also made some improvements to
    the allocation logic of a new vmap block: occupy the block immediately
    and get rid of the extra search in a free list.

    Also I replaced the dirty bitmap with min/max dirty range values to make
    the logic simpler and slightly faster, since comparing two longs costs
    less than looping through a bitmap.

    This patchset raises several questions:

    Q: I think the problem you describe is already known, which is why I wrote
    the comment about it: "it could consume lots of address space through
    fragmentation". Could you tell me about your situation and the reason why
    it should be avoided?
    Gioh Kim

    A: Indeed, commit 364376383 adds an explicit comment about fragmentation.
    But the fragmentation described in that comment is caused by mixing
    long-lived and short-lived objects, where a whole block is pinned in
    memory because some page slots are still in use. Here I am talking about
    blocks which are free: nobody uses them, yet the allocator keeps them
    alive forever while continuously allocating new blocks.

    Q: I think that if you put a newly allocated block at the tail of the free
    list, the example below would result in enormous performance degradation.

    new block: 1MB (256 pages)

    while (iters--) {
    vm_map_ram(3 or some other size that does not divide 256) * 85
    vm_unmap_ram(3) * 85
    }

    On every iteration it needs a newly allocated block, and that block is put
    at the tail of the free list, so finding it consumes a large amount of time.
    Joonsoo Kim

    A: The second patch in this patchset gets rid of the extra search in the
    free list, so a new block will be occupied immediately.

    Also, the scenario above is impossible, because vm_map_ram allocates
    virtual ranges in orders, i.e. 2^n. Passing 3 to vm_map_ram allocates
    4 slots in a block, and 256 slots (the capacity of a block) is of course
    divisible by 4, so the block will be completely occupied.

    But there is a worst case which we can achieve: each free block has a hole
    equal to order size.

    The maximum size of allocation is 64 pages for 64-bit system
    (if you try to map more, original alloc_vmap_area will be called).

    So the maximum order is 6. That means the worst case, before the allocator
    decides to allocate a new block, is iterating over 7 blocks:

    HEAD
    1st block - has 1 page slot free (order 0)
    2nd block - has 2 page slots free (order 1)
    3rd block - has 4 page slots free (order 2)
    4th block - has 8 page slots free (order 3)
    5th block - has 16 page slots free (order 4)
    6th block - has 32 page slots free (order 5)
    7th block - has 64 page slots free (order 6)
    TAIL

    So the worst scenario on 64-bit system is that each CPU queue can have 7
    blocks in a free list.

    This can happen if and only if you allocate blocks in increasing order
    (as I did in the function written in the comment of the first patch).
    This is a weird and rare case, but it is still possible. Afterwards you
    will get 7 blocks in the list.

    All further requests will either be placed in a newly allocated block or
    be satisfied from free slots found in the free list.
    This does not look dramatically awful.

    This patch (of 3):

    If a suitable block can't be found, a new block is allocated and put at
    the head of the free list, so on the next iteration this new block will
    be found first.

    That's bad, because old blocks in the free list will not get a chance to
    be fully used, thus fragmentation will grow.

    Let's consider this simple example:

    #1 We have one block in a free list which is partially used, and where only
    one page is free:

    HEAD |xxxxxxxxx-| TAIL
                   ^
                   free space for 1 page, order 0

    #2 New allocation request of order 1 (2 pages) comes, new block is allocated
    since we do not have free space to complete this request. New block is put
    into a head of a free list:

    HEAD |----------|xxxxxxxxx-| TAIL

    #3 Two pages were occupied in a new found block:

    HEAD |xx--------|xxxxxxxxx-| TAIL
          ^
          two pages mapped here

    #4 New allocation request of order 0 (1 page) comes. Block, which was created
    on #2 step, is located at the beginning of a free list, so it will be found
    first:

    HEAD |xxX-------|xxxxxxxxx-| TAIL
            ^                 ^
            page mapped here, but better to use this hole

    It is obvious that it is better to complete the request of step #4 using
    the old block, where free space is left, because otherwise fragmentation
    will increase significantly.

    But fragmentation is not the only problem. The worst thing is that I can
    easily create a scenario in which the whole vmalloc space is exhausted by
    blocks which are not used, but are already dirty and have several free
    pages.

    Let's consider this function, whose execution should be pinned to one CPU:

    static void exhaust_virtual_space(struct page *pages[16], int iters)
    {
            /* Firstly we have to map a big chunk, e.g. 16 pages.
             * Then we have to occupy the remaining space with smaller
             * chunks, i.e. 8 pages. At the end small hole should remain.
             * So at the end of our allocation sequence block looks like
             * this:
             *                XX  big chunk
             * |XXxxxxxxx-|    x  small chunk
             *                 -  hole, which is enough for a small chunk,
             *                    but is not enough for a big chunk
             */
            while (iters--) {
                    int i;
                    void *vaddr;

                    /* Map/unmap big chunk */
                    vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                    vm_unmap_ram(vaddr, 16);

                    /* Map/unmap small chunks.
                     *
                     * -1 for hole, which should be left at the end of each block
                     * to keep it partially used, with some free space available */
                    for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                            vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                            vm_unmap_ram(vaddr, 8);
                    }
            }
    }

    On every iteration new block (1MB of vm area in my case) will be
    allocated and then will be occupied, without attempt to resolve small
    allocation request using previously allocated blocks in a free list.

    In case of random allocation (size should be randomly taken from the
    range [1..64] in 64-bit case or [1..32] in 32-bit case) situation is the
    same: new blocks continue to appear if maximum possible allocation size
    (32 or 64) passed to the allocator, because all remaining blocks in a
    free list do not have enough free space to complete this allocation
    request.

    In summary, if new blocks are put at the head of the free list, virtual
    space will eventually be exhausted.

    In this patch I simply put a newly allocated block at the tail of the
    free list, thus reducing fragmentation and giving a chance to resolve
    allocation requests using older blocks with possible holes left.

    Signed-off-by: Roman Pen
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Cc: David Rientjes
    Cc: WANG Chao
    Cc: Fabian Frederick
    Cc: Christoph Lameter
    Cc: Gioh Kim
    Cc: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Pen
     

15 Apr, 2015

2 commits

  • Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA
    mappings when they are set. pud_clear_huge() and pmd_clear_huge() return
    zero when no-operation is performed, i.e. huge page mapping was not used.

    These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined
    on the architecture.

    [akpm@linux-foundation.org: use consistent code layout]
    Signed-off-by: Toshi Kani
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • ioremap() and its related interfaces are used to create I/O mappings to
    memory-mapped I/O devices. The mapping sizes of the traditional I/O
    devices are relatively small. Non-volatile memory (NVM), however, has
    many GB and is going to have TB soon. It is not very efficient to create
    large I/O mappings with 4KB pages.

    This patchset extends the ioremap() interfaces to transparently create I/O
    mappings with huge pages whenever possible. ioremap() continues to use
    4KB mappings when a huge page does not fit into a requested range. There
    is no change necessary to the drivers using ioremap(). A requested
    physical address must be aligned by a huge page size (1GB or 2MB on x86)
    for using huge page mapping, though. The kernel huge I/O mapping will
    improve performance of NVM and other devices with large memory, and reduce
    the time to create their mappings as well.

    On x86, MTRRs can override PAT memory types with a 4KB granularity. When
    using a huge page, MTRRs can override the memory type of the huge page,
    which may lead to a performance penalty. The processor can also behave in
    an undefined manner if a huge page is mapped to a memory range that MTRRs
    have mapped with multiple different memory types. Therefore, the mapping
    code falls back to a smaller page size, down to 4KB, when a mapping range
    is covered by a non-WB type of MTRR. The WB type of MTRR has no effect on
    the PAT memory types.

    The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch
    supports huge KVA mappings for ioremap(). User may specify a new kernel
    option "nohugeiomap" to disable the huge I/O mapping capability of
    ioremap() when necessary.

    Patch 1-4 change common files to support huge I/O mappings. There is no
    change in functionality unless HAVE_ARCH_HUGE_VMAP is defined on the
    architecture of the system.

    Patch 5-6 implement the HAVE_ARCH_HUGE_VMAP funcs on x86, and set
    HAVE_ARCH_HUGE_VMAP on x86.

    This patch (of 6):

    __get_vm_area_node() takes unsigned long size, which is a 64-bit value on
    a 64-bit kernel. However, fls(size) simply ignores the upper 32 bits.
    Change to use fls_long() to handle the size properly.

    Signed-off-by: Toshi Kani
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

13 Mar, 2015

1 commit

  • Current approach in handling shadow memory for modules is broken.

    Shadow memory may be freed only after the memory it shadows is no longer
    used. vfree() called from interrupt context can use the memory it is
    freeing to store a 'struct llist_node' in it:

    void vfree(const void *addr)
    {
            ...
            if (unlikely(in_interrupt())) {
                    struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
                    if (llist_add((struct llist_node *)addr, &p->list))
                            schedule_work(&p->wq);

    Later this list node is used in free_work(), which actually frees the
    memory. Currently, module_memfree() called in interrupt context will free
    the shadow before freeing the module's memory, which could provoke a
    kernel crash.

    So shadow memory should be freed after module's memory. However, such
    deallocation order could race with kasan_module_alloc() in module_alloc().

    Free the shadow right before releasing the vm area. At this point the
    vfree()'d memory is not used anymore and is not yet available for other
    allocations. A new VM_KASAN flag is used to indicate that the vm area has
    dynamically allocated shadow memory, so kasan frees the shadow only if it
    was previously allocated.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Rusty Russell
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

14 Feb, 2015

2 commits

  • For instrumenting global variables KASan will shadow memory backing memory
    for modules. So on module loading we will need to allocate memory for
    shadow and map it at address in shadow that corresponds to the address
    allocated in module_alloc().

    __vmalloc_node_range() could be used for this purpose, except it puts a
    guard hole after the allocated area. A guard hole in shadow memory would
    be a problem because at some future point we might need to have shadow
    memory at the address occupied by the guard hole. So we could fail to
    allocate shadow for module_alloc().

    Now we have the VM_NO_GUARD flag disabling the guard page, so we need to
    pass it into __vmalloc_node_range(). Add a new parameter 'vm_flags' to
    the __vmalloc_node_range() function.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • For instrumenting global variables KASan will shadow memory backing memory
    for modules. So on module loading we will need to allocate memory for
    shadow and map it at address in shadow that corresponds to the address
    allocated in module_alloc().

    __vmalloc_node_range() could be used for this purpose, except it puts a
    guard hole after the allocated area. A guard hole in shadow memory would
    be a problem because at some future point we might need to have shadow
    memory at the address occupied by the guard hole. So we could fail to
    allocate shadow for module_alloc().

    Add a new vm_struct flag 'VM_NO_GUARD' indicating that vm area doesn't
    have a guard hole.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

14 Dec, 2014

1 commit


11 Dec, 2014

1 commit


10 Oct, 2014

1 commit

  • Using seq_open_private() removes boilerplate code from vmalloc_open().

    The resultant code is shorter and easier to follow.

    However, please note that seq_open_private() calls kzalloc() rather than
    kmalloc(), which may affect timing due to the memory initialisation
    overhead.

    Signed-off-by: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     

07 Aug, 2014

4 commits

  • Currently map_vm_area() takes (struct page *** pages) as third argument,
    and after mapping, it moves (*pages) to point to (*pages +
    nr_mapped_pages).

    It looks like this kind of increment is useless to its caller these
    days. The callers don't care about the increments and actually they're
    trying to avoid this by passing another copy to map_vm_area().

    The caller can always guarantee all the pages can be mapped into vm_area
    as specified in first argument and the caller only cares about whether
    map_vm_area() fails or not.

    This patch cleans up the pointer movement in map_vm_area() and updates
    its callers accordingly.

    Signed-off-by: WANG Chao
    Cc: Zhang Yanfei
    Acked-by: Greg Kroah-Hartman
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Chao
     
  • tmp_mask in the __vmalloc_area_node() iteration never changes so it can
    be moved into function scope and marked with const. This causes the
    movl and orl to only be done once per call rather than area->nr_pages
    times.

    nested_gfp can also be marked const.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It is not uncommon on busy servers to get stuck for hundreds of ms in
    vmalloc() calls (like file descriptor expansions).

    Add a cond_resched() to __vmalloc_area_node() to be gentle to
    other tasks.

    [akpm@linux-foundation.org: only do it for __GFP_WAIT, per David]
    Signed-off-by: Eric Dumazet
    Cc: Hugh Dickins
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Richard Yao reported a month ago that his system has trouble with
    vmap_area_lock contention during performance analysis via /proc/meminfo.
    Andrew asked why his analysis reads /proc/meminfo so heavily, but he
    didn't answer.

    https://lkml.org/lkml/2014/4/10/416

    Although I'm not sure whether this is the right usage, there is a
    solution that reduces vmap_area_lock contention with no side effects:
    just use the rcu list iterator in get_vmalloc_info().

    rcu can be used in this function because all RCU protocol is already
    respected by writers, since Nick Piggin commit db64fe02258f1 ("mm:
    rewrite vmap layer") back in linux-2.6.28

    Specifically :
    insertions use list_add_rcu(),
    deletions use list_del_rcu() and kfree_rcu().

    Note the rb tree is not used from rcu reader (it would not be safe),
    only the vmap_area_list has full RCU protection.

    Note that __purge_vmap_area_lazy() already uses this rcu protection.

    rcu_read_lock();
    list_for_each_entry_rcu(va, &vmap_area_list, list) {
            if (va->flags & VM_LAZY_FREE) {
                    if (va->va_start < *start)
                            *start = va->va_start;
                    if (va->va_end > *end)
                            *end = va->va_end;
                    nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                    list_add_tail(&va->purge_list, &valist);
                    va->flags |= VM_LAZY_FREEING;
                    va->flags &= ~VM_LAZY_FREE;
            }
    }
    rcu_read_unlock();

    Peter:

    : While rcu list traversal over the vmap_area_list is safe, this may
    : arrive at different results than the spinlocked version. The rcu list
    : traversal version will not be a 'snapshot' of a single, valid instant
    : of the entire vmap_area_list, but rather a potential amalgam of
    : different list states.

    Joonsoo:

    : Yes, you are right, but I don't think that we should be strict here.
    : Meminfo is already not a 'snapshot' at specific time. While we try to get
    : certain stats, the other stats can change. And, although we may arrive at
    : different results than the spinlocked version, the difference would not be
    : large and would not make serious side-effect.

    [edumazet@google.com: add more commit description]
    Signed-off-by: Joonsoo Kim
    Reported-by: Richard Yao
    Acked-by: Eric Dumazet
    Cc: Peter Hurley
    Cc: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Jun, 2014

3 commits

  • zsmalloc needs exported unmap_kernel_range for building as a module. See
    https://lkml.org/lkml/2013/1/18/487

    I didn't send a patch to make unmap_kernel_range exportable at that time
    because zram was staging stuff and I thought VM function exporting for
    staging stuff makes no sense.

    Now zsmalloc has been promoted. If we can't build zsmalloc as a module, it
    means we can't build zram as a module, either. Additionally, its buddy
    map_vm_area is already exported, so let's export unmap_kernel_range to
    help its buddy.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Replace seq_printf where possible

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 Apr, 2014

2 commits

  • vm_map_ram() has a fragmentation problem when it cannot purge a
    chunk (i.e., 4M of address space) if there is a pinning object in that
    address space. So it could consume all VMALLOC address space easily.

    We can fix the fragmentation problem by using vmap instead of
    vm_map_ram() but vmap() is known to be slow compared to vm_map_ram().
    Minchan said vm_map_ram is 5 times faster than vmap in his tests. So I
    thought we should fix the fragmentation problem of vm_map_ram because our
    proprietary GPU driver has used it heavily.

    On second thought, it's not easy, because we would have to reuse freed
    space to solve the problem, and that could mean more IPIs and bitmap
    operations to search for holes. It could undermine the API's goal, which
    is very fast mapping. And the fragmentation problem wouldn't even show
    up on a 64-bit machine.

    Another option is that the user should separate long-life and short-life
    objects and use vmap for long-life but vm_map_ram for short-life ones.
    If we inform the user about the characteristics of vm_map_ram, the user
    can choose one according to the page lifetime.

    Let's add some notice messages for the user.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Gioh Kim
    Reviewed-by: Zhang Yanfei
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gioh Kim
     
  • To increase compiler portability there is <linux/compiler.h> which
    provides convenience macros for various gcc constructs. Eg: __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes with
    the right macro in the memory management (/mm) subsystem.

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza