29 Aug, 2013

1 commit


15 Jul, 2013

2 commits

  • Pull slab update from Pekka Enberg:
    "Highlights:

    - Fix for boot-time problems on some architectures due to
    init_lock_keys() not respecting kmalloc_caches boundaries
    (Christoph Lameter)

    - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)

    - Fix for excessive slab freelist draining (Wanpeng Li)

    - SLUB and SLOB cleanups and fixes (various people)"

    I ended up editing the branch: this avoids two commits at the end
    that were immediately reverted, and instead I just applied the
    one-liner fix in between myself.

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux
    slub: Check for page NULL before doing the node_match check
    mm/slab: Give s_next and s_stop slab-specific names
    slob: Check for NULL pointer before calling ctor()
    slub: Make cpu partial slab support configurable
    slab: add kmalloc() to kernel API documentation
    slab: fix init_lock_keys
    slob: use DIV_ROUND_UP where possible
    slub: do not put a slab to cpu partial list when cpu_partial is 0
    mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
    mm/slub: Drop unnecessary nr_partials
    mm/slab: Fix /proc/slabinfo unwriteable for slab
    mm/slab: Sharing s_next and s_stop between slab and slub
    mm/slab: Fix drain freelist excessively
    slob: Rework #ifdeffery in slab.h
    mm, slab: moved kmem_cache_alloc_node comment to correct place

    Linus Torvalds
     
  • In the -rt kernel (mrg), we hit the following dump:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] kmem_cache_alloc_node+0x51/0x180
    PGD a2d39067 PUD b1641067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
    CPU 3
    Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
    RIP: 0010:[] [] kmem_cache_alloc_node+0x51/0x180
    RSP: 0018:ffff8800a9b17d70 EFLAGS: 00010213
    RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
    RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
    RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
    R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
    R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
    FS: 00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
    Stack:
    ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
    0000000001200011 0000000001200011 0000000000000000 0000000000000000
    00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
    Call Trace:
    [] ? current_has_perm+0x68/0x80
    [] copy_process+0xdd/0x15b0
    [] ? rt_up_read+0x25/0x30
    [] do_fork+0x5a/0x360
    [] ? migrate_enable+0xeb/0x220
    [] sys_clone+0x28/0x30
    [] stub_clone+0x13/0x20
    [] ? system_call_fastpath+0x16/0x1b
    Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
    RIP [] kmem_cache_alloc_node+0x51/0x180
    RSP
    CR2: 0000000000000000
    ---[ end trace 0000000000000002 ]---

    Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel
    with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do
    disable migration. But the SLUB code is relatively lockless, and the
    spin_locks there are raw_spin_locks (not converted to mutexes), thus I
    believe this bug can happen in mainline without -rt features. The -rt
    patch is just good at triggering mainline bugs ;-)

    Anyway, looking at where this crashed, it seems that the page variable
    can be NULL when passed to the node_match() function (which does not
    check if it is NULL). When this happens we get the above panic.

    As page is only used in slab_alloc() to check if the node matches, if
    it's NULL I'm assuming that we can say it doesn't and call the
    __slab_alloc() code. Is this a correct assumption?
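
    In code terms, the change amounts to something like the following (a
    simplified sketch of the check described above; the actual hunk may
    differ in detail):

    static inline int node_match(struct page *page, int node)
    {
    #ifdef CONFIG_NUMA
        /* Treat a NULL page as "no match" so that slab_alloc() falls
         * back to __slab_alloc() instead of dereferencing it. */
        if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
            return 0;
    #endif
        return 1;
    }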

    Acked-by: Christoph Lameter
    Signed-off-by: Steven Rostedt
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     

11 Jul, 2013

3 commits

  • Since all architectures have been converted to use vm_unmapped_area(),
    there is no remaining use for the free_area_cache.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: "James E.J. Bottomley"
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • zswap is a thin backend for frontswap that takes pages that are in the
    process of being swapped out and attempts to compress them and store
    them in a RAM-based memory pool. This can result in a significant I/O
    reduction on the swap device and, in the case where decompressing from
    RAM is faster than reading from the swap device, can also improve
    workload performance.

    It also has support for evicting swap pages that are currently
    compressed in zswap to the swap device on an LRU(ish) basis. This
    functionality makes zswap a true cache in that, once the cache is full,
    the oldest pages can be moved out of zswap to the swap device so newer
    pages can be compressed and stored in zswap.

    This patch adds the zswap driver to mm/

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • zbud is a special purpose allocator for storing compressed pages. It
    is designed to store up to two compressed pages per physical page.
    While this design limits storage density, it has simple and
    deterministic reclaim properties that make it preferable to a higher
    density approach when reclaim will be used.

    zbud works by storing compressed pages, or "zpages", together in pairs
    in a single memory page called a "zbud page". The first buddy is "left
    justifed" at the beginning of the zbud page, and the last buddy is
    "right justified" at the end of the zbud page. The benefit is that if
    either buddy is freed, the freed buddy space, coalesced with whatever
    slack space that existed between the buddies, results in the largest
    possible free region within the zbud page.

    zbud also provides an attractive lower bound on density. The ratio of
    zpages to zbud pages can not be less than 1. This ensures that zbud can
    never "do harm" by using more pages to store zpages than the
    uncompressed zpages would have used on their own.
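
    As an illustration of the layout (a simplified sketch only, not the
    actual in-kernel structure or API), the per-page metadata only needs
    to record how much of the page each buddy occupies:

    /* Illustrative only: one zbud page holds at most two zpages. */
    struct zbud_page_meta {
        unsigned int first_chunks;   /* size of the left-justified buddy */
        unsigned int last_chunks;    /* size of the right-justified buddy */
        /* Free space is whatever lies between the two buddies, so it is
         * always a single contiguous region. */
    };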

    This implementation is a rewrite of the zbud allocator internally used
    by zcache in the drivers/staging tree. The rewrite was necessary to
    remove some of the zcache-specific elements that were ingrained
    throughout and provide a generic allocation interface that can later be
    used by zsmalloc and others.

    This patch adds zbud to mm/ for later use by zswap.

    Signed-off-by: Seth Jennings
    Acked-by: Rik van Riel
    Cc: Greg Kroah-Hartman
    Cc: Nitin Gupta
    Cc: Minchan Kim
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Magenheimer
    Cc: Robert Jennings
    Cc: Jenifer Hopper
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Larry Woodman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Hansen
    Cc: Joe Perches
    Cc: Joonsoo Kim
    Cc: Cody P Schafer
    Cc: Hugh Dickens
    Cc: Paul Mackerras
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

10 Jul, 2013

34 commits

  • Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • online_pages() is called from memory_block_action() when a user requests
    to online a memory block via sysfs. This function needs to return a
    proper error value in case of error.

    Signed-off-by: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • min_free_kbytes is currently updated during memory hotplug (by
    init_per_zone_wmark_min), which is the right thing to do in most
    cases, but this could be unexpected if the admin has increased the
    value to prevent allocation failures and the new min_free_kbytes
    would be decreased as a result of memory hotadd.

    This patch saves the user defined value and allows updating
    min_free_kbytes only if it is higher than the saved one.

    A warning is printed when the new value is ignored.
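
    A minimal sketch of that policy (variable and function names here are
    illustrative, not necessarily the ones used in the patch):

    int min_free_kbytes;
    static int user_min_free_kbytes = -1;    /* -1: never set by the admin */

    /* Called when memory hotplug recalculates the watermarks: only allow
     * the automatic value to raise min_free_kbytes above the user's. */
    static void update_min_free_kbytes(int new_min_free_kbytes)
    {
        if (new_min_free_kbytes > user_min_free_kbytes)
            min_free_kbytes = new_min_free_kbytes;
        else
            pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
                    new_min_free_kbytes, user_min_free_kbytes);
    }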

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Acked-by: Zhang Yanfei
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Now memcg has the same life cycle as its corresponding cgroup, and a
    cgroup is freed via RCU and then mem_cgroup_css_free() will be called in
    a work function, so we can simply call __mem_cgroup_free() in
    mem_cgroup_css_free().

    This actually reverts commit 59927fb984d ("memcg: free mem_cgroup by RCU
    to fix oops").

    Signed-off-by: Li Zefan
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Now memcg has the same life cycle as its corresponding cgroup. Kill the
    useless refcnt.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • The cgroup core guarantees it's always safe to access the parent.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/put instead of mem_cgroup_get/put. A simple replacement
    will do.

    The historical reason that memcg has its own refcnt instead of always
    using css_get/put is that a cgroup couldn't be removed while there
    were still css refs, so css refs couldn't be used as long-lived
    references. The situation has changed: rmdir of a cgroup now succeeds
    regardless of css refs, but the cgroup won't be freed until the css
    refcount drops to 0.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/put instead of mem_cgroup_get/put.

    We can't do a simple replacement, because here mem_cgroup_put() is
    called during mem_cgroup_css_free(), while mem_cgroup_css_free() won't
    be called until css refcnt goes down to 0.

    Instead we increment the css refcnt in mem_cgroup_css_offline(), and
    then check whether there are still kmem charges. If not, the css
    refcnt is decremented immediately; otherwise the refcnt is released
    after the last kmem allocation is uncharged.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Tejun Heo
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().

    There are two things being done in the current code:

    First, we acquired a css_ref to make sure that the underlying cgroup
    would not go away. That is a short lived reference, and it is put as
    soon as the cache is created.

    At this point, we acquire a long-lived per-cache memcg reference count
    to guarantee that the memcg will still be alive.

    so it is:

    enqueue: css_get
    create : memcg_get, css_put
    destroy: memcg_put

    So we only need to get rid of the memcg_get, change the memcg_put to
    css_put, and get rid of the now extra css_put.

    (This changelog is mostly written by Glauber)

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/css_put instead of mem_cgroup_get/put.

    Note, if at the same time someone is moving @current to a different
    cgroup and removing the old cgroup, css_tryget() may return false, and
    sock->sk_cgrp won't be initialized, which is fine.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
    This is not correct because only memcg_propagate_kmem takes an
    additional reference, while mem_cgroup_sockets_init is allowed to fail
    as well (although no current implementation does) but doesn't take any
    reference. This all suggests that memcg_propagate_kmem should be the
    one to clean up after itself, so this patch moves mem_cgroup_put over
    there.

    Unfortunately this is not that easy (as pointed out by Li Zefan),
    because memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD)
    if it is marked active (KMEM_ACCOUNTED_ACTIVE), which is the case even
    if memcg_propagate_kmem fails, so the additional reference is dropped
    in that case in kmem_cgroup_destroy as well, which means the reference
    would be dropped twice.

    The easiest way then would be to simply remove mem_cgroup_put from
    mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right
    thing.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.8]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This reverts commit e4715f01be697a.

    mem_cgroup_put is hierarchy aware, so mem_cgroup_put(memcg) already
    drops an additional reference from all parents; the additional
    mem_cgroup_put(parent) therefore potentially causes a use-after-free.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • It is counterintuitive at best that mmap'ing a hugetlbfs file with
    MAP_HUGETLB fails, while mmap'ing it without MAP_HUGETLB will a)
    succeed and b) return huge pages.

    v2: use is_file_hugepages(), as suggested by Jianguo
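
    A simplified sketch of the resulting check in the mmap path (not the
    exact hunk; is_file_hugepages(), hstate_file() and huge_page_size()
    are existing helpers):

    if (!(flags & MAP_ANONYMOUS)) {
        if (is_file_hugepages(file))
            len = ALIGN(len, huge_page_size(hstate_file(file)));
        else if (flags & MAP_HUGETLB)
            return -EINVAL;    /* MAP_HUGETLB on a non-hugetlbfs file */
    }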

    Signed-off-by: Joern Engel
    Cc: Jianguo Wu
    Signed-off-by: Linus Torvalds

    Jörn Engel
     
  • After the patch "mm: vmscan: Flatten kswapd priority loop" was merged
    the scanning priority of kswapd changed.

    The priority now rises until it is scanning enough pages to meet the
    high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number of
    pages were encountered under writeback but this value is scaled based on
    the priority. As kswapd frequently scans with a higher priority now it
    is relatively easy to set ZONE_WRITEBACK. This patch removes the
    scaling and treats writeback pages similarly to how it treats unqueued
    dirty pages and congested pages. The user-visible effect should be that
    kswapd will write back fewer pages from reclaim context.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Direct reclaim is not aborting to allow compaction to go ahead properly.
    do_try_to_free_pages is told to abort reclaim, which it happily ignores
    and instead keeps increasing priority until it reaches 0 and starts
    shrinking file/anon equally. This patch corrects the situation by
    aborting reclaim when requested instead of raising priority.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Dave Chinner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Remove one redundant "nid" in the comment.

    Signed-off-by: Tang Chen
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • When searching for a vmap area in the vmalloc space, we use
    (addr + size - 1) to check whether the value is less than addr, i.e.
    whether it has overflowed. But we assign (addr + size) to
    vmap_area->va_end.

    So if we come across the below case:

    (addr + size - 1) : does not overflow
    (addr + size) : overflows

    we will assign an overflowed value (e.g. 0) to vmap_area->va_end, and
    this will trigger a BUG in __insert_vmap_area, causing a system panic.

    So using (addr + size) to check for overflow is the correct behaviour,
    not (addr + size - 1).
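
    A tiny user-space illustration of the corner case (not kernel code):
    with addr just below the top of the address space, (addr + size - 1)
    does not wrap, but (addr + size) does, so va_end would be assigned a
    wrapped value such as 0:

    #include <stdio.h>

    int main(void)
    {
        unsigned long addr = ~0UL - 15;    /* 16 bytes below the top */
        unsigned long size = 16;

        /* prints 0: the old check misses the overflow */
        printf("%d\n", addr + size - 1 < addr);
        /* prints 1: the new check catches it */
        printf("%d\n", addr + size < addr);
        return 0;
    }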

    Signed-off-by: Zhang Yanfei
    Reported-by: Ghennadi Procopciuc
    Tested-by: Daniel Baluta
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • These VM_ macros aren't used very often and three of them
    aren't used at all.

    Expand the ones that are used in-place, and remove all the now unused
    #define VM_ macros.

    VM_READHINTMASK, VM_NormalReadHint and VM_ClearReadHint were added just
    before 2.4 and appear to have never been used.

    Signed-off-by: Joe Perches
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • With CONFIG_MEMORY_HOTREMOVE unset, there is a compile warning:

    mm/sparse.c:755: warning: `clear_hwpoisoned_pages' defined but not used

    Bisecting it ended up pointing to 4edd7ceff ("mm, hotplug: avoid
    compiling memory hotremove functions when disabled").

    This is because the commit above put sparse_remove_one_section() within
    the protection of CONFIG_MEMORY_HOTREMOVE, but clear_hwpoisoned_pages(),
    whose only user is sparse_remove_one_section(), is not within that
    protection.

    So putting clear_hwpoisoned_pages() within CONFIG_MEMORY_HOTREMOVE
    should fix the warning.

    Signed-off-by: Zhang Yanfei
    Cc: David Rientjes
    Acked-by: Toshi Kani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • This function is used nowhere, and its name is easily confused with
    put_page() in mm/swap.c, so it is better to remove it.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • vfree() only needs schedule_work(&p->wq) if p->list was empty, otherwise
    vfree_deferred->wq is already pending or it is running and didn't do
    llist_del_all() yet.
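
    In code terms the pattern is roughly the following (a simplified
    sketch, not the exact mm/vmalloc.c hunk; "entry" and "llnode" are
    illustrative names). llist_add() returns true only when the list was
    previously empty, so its return value is enough to decide whether the
    work item must be scheduled:

    if (llist_add(&entry->llnode, &p->list))
        schedule_work(&p->wq);    /* first entry of a new batch */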

    Signed-off-by: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
    the order passed. MAX_ORDER is typically 11 and pageblock_order is
    typically 9 on x86. Integer division truncates, so pageblock_order / 2
    is 4. For the first eight iterations, it's guaranteed that
    current_order >= pageblock_order / 2 if it even gets that far!

    So just remove the unlikely(), it's completely bogus.

    Signed-off-by: Zhang Yanfei
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The callers of build_zonelists_node always pass MAX_NR_ZONES - 1 as the
    zone_type argument, so we can use the value directly in
    build_zonelists_node and remove the zone_type argument.

    Signed-off-by: Zhang Yanfei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The memory we used to hold the memcg arrays is currently accounted to
    the current memcg. But that creates a problem, because that memory can
    only be freed after the last user is gone. Our only way to know who
    the last user is, is to hook into freeing time, but the fact that we
    still have some in-flight kmallocs will prevent freeing from
    happening. I therefore believe it is just easier to account this
    memory as global overhead.

    Signed-off-by: Glauber Costa
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • The memory we used to hold the memcg arrays is currently accounted to
    the current memcg. But that creates a problem, because that memory can
    only be freed after the last user is gone. Our only way to know who
    the last user is, is to hook into freeing time, but the fact that we
    still have some in-flight kmallocs will prevent freeing from
    happening. I therefore believe it is just easier to account this
    memory as global overhead.

    This patch (of 2):

    Disabling accounting is only relevant for some specific memcg internal
    allocations. Therefore we would initially not have such a check at
    memcg_kmem_newpage_charge, since direct calls to the page allocator that
    are marked with GFP_KMEMCG only happen outside memcg core. We are
    mostly concerned with cache allocations and by having this test at
    memcg_kmem_get_cache we are already able to relay the allocation to the
    root cache and bypass the memcg caches altogether.

    There is one exception, though: the SLUB allocator does not create large
    order caches, but rather services large kmallocs directly from the page
    allocator. Therefore, the following sequence, when backed by the SLUB
    allocator:

    memcg_stop_kmem_account();
    kmalloc()
    memcg_resume_kmem_account();

    would effectively ignore the fact that we should skip accounting, since
    it will drive us directly to this function without passing through the
    cache selector memcg_kmem_get_cache. Such large allocations are
    extremely rare but can happen, for instance, for the cache arrays.

    This was never a problem in practice, because we weren't skipping
    accounting for the cache arrays. All the allocations we were skipping
    were fairly small. However, the fact that we were not skipping those
    allocations is a problem and can prevent the memcgs from going away.
    As we fix that, we need to make sure that the fix will also work with
    the SLUB allocator.

    Signed-off-by: Glauber Costa
    Reported-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • We should check the VM_UNINITIALIZED flag in s_show(). If this flag is
    set, it means the vm_struct is not fully initialized, so it is
    unnecessary to try to show the information it contains.

    We checked this flag in show_numa_info(), but I think it's better to
    check it earlier.
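
    A minimal sketch of the early check (v being the vm_struct shown by
    s_show(); simplified from the actual patch):

    if (v->flags & VM_UNINITIALIZED)
        return 0;    /* still being set up; nothing sensible to print */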

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • VM_UNLIST was used to indicate that the vm_struct is not listed in
    vmlist.

    But after commit 4341fa454796 ("mm, vmalloc: remove list management of
    vmlist after initializing vmalloc"), the meaning of this flag changed.
    It now means the vm_struct is not fully initialized. So renaming it to
    VM_UNINITIALIZED seems more reasonable.

    Also change clear_vm_unlist to clear_vm_uninitialized_flag.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Use goto to jump to the fail label to give a failure message before
    returning NULL. This makes the failure handling in this function
    consistent.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • As we have removed the dead code in vb_alloc(), there no longer seems
    to be any user of the alloc_map. So there is no reason to maintain the
    alloc_map in vmap_block.

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • This function is no longer used anywhere, so remove it.

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Space in a vmap block that was once allocated is considered dirty and
    not made available for allocation again before the whole block is
    recycled. The result is that free space within a vmap block is always
    contiguous.

    So if a vmap block has enough free space for an allocation, the
    allocation cannot fail. Thus, the fragmented block purging was never
    invoked from vb_alloc(), so remove this dead code.

    [ Same patches also sent by:

    Chanho Min
    Johannes Weiner

    but git doesn't do "multiple authors" ]

    Signed-off-by: Zhang Yanfei
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • There is an extra semi-colon so the function always returns.
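
    An illustration of the bug class being fixed (placeholder names, not
    the actual hunk): the stray semicolon terminates the if statement, so
    the intended body runs unconditionally:

    if (err);          /* BUG: the ';' makes this an empty statement */
        return ret;    /* always executed, regardless of err */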

    Signed-off-by: Dan Carpenter
    Acked-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • When calculating pages in a node, for each zone in that node, we will
    have

    zone_spanned_pages_in_node
    --> get_pfn_range_for_nid
    zone_absent_pages_in_node
    --> get_pfn_range_for_nid

    That is to say, we call get_pfn_range_for_nid to get the node's
    start_pfn and end_pfn MAX_NR_ZONES * 2 times, which is totally
    unnecessary. If we call get_pfn_range_for_nid once before the
    zone_*_pages_in_node helpers and add two extra arguments,
    node_start_pfn and node_end_pfn, to zone_*_pages_in_node, then we can
    remove the get_pfn_range_for_nid calls inside zone_*_pages_in_node.
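
    In code terms the refactoring looks roughly like this (an illustrative
    sketch; the exact argument lists differ in the patch):

    unsigned long start_pfn, end_pfn;

    get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
    for (j = 0; j < MAX_NR_ZONES; j++) {
        spanned = zone_spanned_pages_in_node(nid, j, start_pfn, end_pfn,
                                             zones_size);
        absent  = zone_absent_pages_in_node(nid, j, start_pfn, end_pfn,
                                            zholes_size);
    }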

    [akpm@linux-foundation.org: make definitions more readable]
    Signed-off-by: Zhang Yanfei
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei