14 Dec, 2014

40 commits

  • Read memory barriers must follow the read operations.
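
    The changelog is terse, so as a generic illustration of the rule (not the
    patched code itself, and with made-up names shared_data/data_ready), the
    reader's smp_rmb() has to sit after the read it orders, pairing with the
    writer's smp_wmb():

        /* writer: publish the data, then the flag */
        shared_data = compute();
        smp_wmb();                      /* order the data store before the flag store */
        ACCESS_ONCE(data_ready) = 1;

        /* reader: the read barrier follows the read of the flag */
        if (ACCESS_ONCE(data_ready)) {
                smp_rmb();              /* pairs with the writer's smp_wmb() */
                consume(shared_data);
        }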

    Signed-off-by: Dmitry Vyukov
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • After the previous patch we can remove the PT_TRACE_EXIT check in
    oom_scan_process_thread(); it was added to handle the case where the
    coredump was "frozen" by ptrace, but it doesn't really work. If
    nothing else, we would need to check all threads which could share the
    same ->mm to make it more or less correct.

    Signed-off-by: Oleg Nesterov
    Cc: Cong Wang
    Cc: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • oom_kill.c assumes that a PF_EXITING task should exit and free its memory
    soon. This is wrong in many ways, and one important case is the coredump:
    a task can sleep in exit_mm() "forever" while the coredumping sub-thread
    may need more memory.

    Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account;
    a new trivial helper is added for that (see the sketch below).

    Note: this is only the first step, and this patch doesn't try to solve
    other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy: a task
    can join a coredump after it was already observed in the PF_EXITING state,
    so TIF_MEMDIE (which also blocks the oom-killer) can still be wrongly set.
    fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP, so
    out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
    And even the name/usage of the new helper is confusing: an exiting thread
    can only free its ->mm if it is the only/last task in the thread group.
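
    The helper isn't named above; a minimal sketch of its likely shape (in
    mainline this shape is known as task_will_free_mem()) is:

        /*
         * Sketch: an exiting task can be expected to free its memory soon
         * only if its thread group is not in the middle of a coredump.
         */
        static inline bool task_will_free_mem(struct task_struct *task)
        {
                return (task->flags & PF_EXITING) &&
                       !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
        }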

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Oleg Nesterov
    Cc: Cong Wang
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Since commit 01cefaef40c4 ("mm: provide more accurate estimation
    of pages occupied by memmap"), the memmap pages for the highmem zones are
    allocated from lowmem, so there is no need to reserve memmap pages in the
    highmem zones themselves.

    On an ARM platform with 2G of DDR3, before the change:
    On node 0 totalpages: 524288
    free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
    DMA zone: 3568 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 456704 pages, LIFO batch:31
    HighMem zone: 528 pages used for memmap
    HighMem zone: 67584 pages, LIFO batch:15

    After the change:

    On node 0 totalpages: 524288
    free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
    DMA zone: 3568 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 456704 pages, LIFO batch:31
    HighMem zone: 67584 pages, LIFO batch:15

    Signed-off-by: Hongbo Zhong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhong Hongbo
     
  • Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
    created for use on pages almost certainly mapped into userspace. But
    nowadays compaction often applies them to unmapped page cache pages: which
    may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
    try_to_unmap_file() makes no preliminary page_mapped() check.

    Now check page_mapped() in __unmap_and_move(); and avoid repeating the
    same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
    never inserted any.
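
    As a hedged sketch (flag names as used by the migration code of this era,
    exact call site simplified), the check amounts to:

        /* skip the rmap walk entirely for pages with no userspace mappings */
        if (page_mapped(page)) {
                page_was_mapped = 1;
                try_to_unmap(page,
                             TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
        }
        /* ... and only call remove_migration_ptes() if page_was_mapped */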

    (The PageAnon(page) comment blocks now look even sillier than before, but
    clean that up on some other occasion. And note in passing that
    try_to_unmap_one() does not use a migration entry when PageSwapCache, so
    remove_migration_ptes() will then not update that swap entry to newpage
    pte: not a big deal, but something else to clean up later.)

    Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
    conversion to i_mmap_rwsem, that "The biggest winner of these changes is
    migration": a part of the reason might be all of that unnecessary taking
    of i_mmap_mutex in page migration; and it's rather a shame that I didn't
    get around to sending this patch in before his - this one is much less
    useful after Davidlohr's conversion to rwsem, but still good.

    Signed-off-by: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Since commit 058504edd026 ("fs/seq_file: fallback to vmalloc allocation"),
    seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous
    memory fails. This was done to address order-4 slab allocations for
    reading /proc/stat on large machines, and it was noticed because
    PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page
    allocator when allocating a new slab for such high-order allocations.

    Contiguous memory isn't necessary for the caller of seq_buf_alloc(),
    however. Other GFP_KERNEL high-order allocations that are
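
    The changelog is cut off above; the fallback pattern it is describing
    looks roughly like this sketch of seq_buf_alloc() (simplified):

        static void *seq_buf_alloc(unsigned long size)
        {
                void *buf;

                /* try a physically contiguous buffer first ... */
                buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
                /* ... and fall back to vmalloc() for larger requests */
                if (!buf && size > PAGE_SIZE)
                        buf = vmalloc(size);
                return buf;
        }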
    Cc: Heiko Carstens
    Cc: Christoph Hellwig
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These flushes deal with sequence number overflows, such as for long lived
    threads. These are rare, but interesting from a debugging PoV. As such,
    display the number of flushes when vmacache debugging is enabled.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • page owner is for tracking who allocated each page. This document
    explains what the page owner feature is and what its merits are, and also
    gives a simple HOW-TO. See the document for detailed information.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The extended memory used to store page owner information is initialized
    some time after the page allocator starts. Until that initialization,
    many pages can be allocated and they have no owner information. This
    makes debugging with page owner harder, so some fixup is helpful.

    This patch fixes up the situation by setting fake owner information
    immediately after page extension is initialized. The information doesn't
    identify the real owner, but at least it tells more accurately whether a
    page is allocated or not.

    In my testing, this patch catches 13343 early-allocated pages, although
    they are mostly allocated by the page extension feature itself. After
    that point, no page is left that is allocated but has no page owner flag.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is the page owner tracking code, which was introduced quite a while
    ago. It has been resident in Andrew's tree, but nobody tried to upstream
    it, so it remained as is. Our company uses this feature actively to debug
    memory leaks and to find memory hoggers, so I decided to upstream it.

    This functionality helps us to know who allocated each page. When
    allocating a page, we store some information about the allocation in
    extra memory. Later, if we need to know the status of all pages, we can
    get and analyze it from this stored information.

    In the previous version of this feature, the extra memory was statically
    defined in struct page, but in this version it is allocated outside of
    struct page. This enables us to turn the feature on/off at boot time
    without considerable memory waste.

    Although we already have tracepoints for page allocation/free, using them
    to analyze page ownership is rather complex. We would need to enlarge the
    trace buffer to prevent it from wrapping before the userspace program is
    launched, and the launched program would have to continually dump out the
    trace buffer for later analysis, which changes system behaviour far more
    than just keeping the information in memory, so it is bad for debugging.

    Moreover, the page_owner feature can be used for various further
    purposes, for example the fragmentation statistics implemented in this
    patch. I also plan to implement a CMA failure debugging feature using
    this interface.

    I'd like to give credit to all the developers who contributed to this
    feature, but that's not easy because I don't know the exact history.
    Sorry about that. Below are the people who have "Signed-off-by" in the
    patches in Andrew's tree.

    Contributor:
    Alexander Nyberg
    Mel Gorman
    Dave Hansen
    Minchan Kim
    Michal Nazarewicz
    Andrew Morton
    Jungsoo Son

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The current stacktrace code only has a function for console output.
    page_owner, which will be introduced in a following patch, needs to print
    the stacktrace into a buffer in its own output format, so a new function,
    snprint_stack_trace(), is needed.
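
    For reference, the new helper has an snprintf-like shape; a sketch of its
    declaration and of an illustrative call (buffer names are made up):

        /* print a struct stack_trace into buf, writing at most size bytes */
        int snprint_stack_trace(char *buf, size_t size,
                                struct stack_trace *trace, int spaces);

        /* illustrative use from a client such as page_owner */
        ret += snprint_stack_trace(kbuf + ret, count - ret, &trace, 0);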

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • do_mmap_private() in nommu.c tries, in some cases, to allocate physically
    contiguous pages of arbitrary size, and we now have a good abstraction
    that does exactly that, alloc_pages_exact(). So change it to use that.

    There is no functional change. This is a preparation step for supporting
    the page owner feature accurately.
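
    A hedged sketch of the pattern (the helper names here are made up; only
    alloc_pages_exact()/free_pages_exact() are the real API):

        /* request exactly `len` bytes of physically contiguous memory
         * instead of open-coding alloc_pages() plus trimming the tail */
        static void *mmap_private_alloc(unsigned long len)
        {
                return alloc_pages_exact(len, GFP_KERNEL);
        }

        static void mmap_private_free(void *base, unsigned long len)
        {
                free_pages_exact(base, len);
        }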

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we are prepared to avoid using debug-pagealloc at boot time, so
    introduce a new kernel parameter so that debug-pagealloc can be disabled
    at boot time, and make the related functions no-ops in that case.

    The only non-intuitive part is the change to the guard page functions.
    Because guard pages are effective only if debug-pagealloc is enabled,
    turning them off along with debug-pagealloc is the reasonable thing to do.
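
    For reference, the knob is the debug_pagealloc= kernel command line
    option; on a CONFIG_DEBUG_PAGEALLOC kernel the checks are active only
    when it is passed as:

        debug_pagealloc=on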

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Until now, debug-pagealloc has needed extra flags in struct page, so we
    had to recompile the whole source tree whenever we decided to use it.
    This is really painful, because recompiling takes time and sometimes a
    rebuild is not possible at all due to third-party modules depending on
    struct page. So we couldn't use this good feature in many cases.

    Now we have the page extension feature, which allows us to keep the extra
    flags outside of struct page. This gets rid of the third-party module
    issue mentioned above, and it allows us to determine at boot time whether
    the extra memory for the page extension is needed. With these properties
    we can avoid enabling debug-pagealloc at boot time, with low computational
    overhead, in a kernel built with CONFIG_DEBUG_PAGEALLOC. This will help
    our development process greatly.

    This patch is the preparation step to achieve the above goal.
    debug-pagealloc originally uses an extra field of struct page, but after
    this patch it uses a field of struct page_ext. Because the memory for
    page_ext is allocated later than the page allocator is initialized with
    CONFIG_SPARSEMEM, we must temporarily disable the debug-pagealloc feature
    until page_ext has been initialized. This patch implements that.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • When we debug something, we'd like to attach some information to every
    page. For this purpose, we sometimes modify struct page itself, but this
    has drawbacks. First, it requires a recompile, which makes us hesitate to
    use a powerful debug feature, so the development process is slowed down.
    Second, it is sometimes impossible to rebuild the kernel due to
    third-party module dependencies. Third, system behaviour can be largely
    different after a recompile, because it greatly changes the size of
    struct page, and this structure is accessed by every part of the kernel;
    keeping it as it is makes erroneous situations easier to reproduce.

    This feature is intended to overcome the problems mentioned above. It
    allocates memory for extended data per page somewhere other than struct
    page itself, and this memory can be accessed through the accessor
    functions provided by this code. During the boot process, it checks
    whether allocating this huge chunk of memory is needed or not; if not, it
    avoids allocating the memory at all. With this advantage, we can include
    this feature in the kernel by default and can avoid rebuilds and the
    related problems.

    Until now, memcg used this technique. But memcg has since decided to
    embed its variable in struct page itself, and its code to extend struct
    page has been removed. I'd like to use this code to develop debug
    features, so this patch resurrects it.

    To make this work well, this patch introduces two callbacks for clients
    (see the sketch below). One is the need callback, which is mandatory if
    the user wants to avoid useless memory allocation at boot time. The
    other, the optional init callback, is used to do proper initialization
    after the memory is allocated. A detailed explanation of the purpose of
    these functions is in the code comments; please refer to them.

    Everything else is the same as the previous extension code in memcg.
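
    A hedged sketch of a client of this interface (the client name and its
    enabling variable are made up; only the two callbacks and the ops
    structure follow the interface described above):

        static bool my_feature_enabled __initdata;

        static bool __init need_my_feature(void)
        {
                /* returning false skips the large boot-time allocation entirely */
                return my_feature_enabled;
        }

        static void __init init_my_feature(void)
        {
                /* one-time setup, run after the extended memory is allocated */
        }

        struct page_ext_operations my_feature_page_ext_ops = {
                .need = need_my_feature,
                .init = init_my_feature,
        };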

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are defined in an
    escalating manner, as their names imply:

    GFP_USER
    GFP_USER + __GFP_HIGHMEM = GFP_HIGHUSER
    GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE = GFP_HIGHUSER_MOVABLE

    So just define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE in terms of the
    previous level to reflect this fact. It also makes the definitions
    clearer and gives a textual warning against any future break-up of this
    escalating relationship.
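
    The escalating definitions then read as follows (GFP_USER's own expansion
    shown as in kernels of this era):

        #define GFP_USER                (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
        #define GFP_HIGHUSER            (GFP_USER | __GFP_HIGHMEM)
        #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)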

    Signed-off-by: Jianyu Zhan
    Acked-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive':
    include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function)

    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The gfp was passed in but never used in this function.

    Signed-off-by: Zhang Zhen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     
  • swp_entry_t being defined in include/linux/swap.h instead of
    include/linux/mm_types.h causes a cyclic include dependency later, when
    include/linux/page_cgroup.h is included from the writeback path. Move the
    definition to include/linux/mm_types.h.

    While at it, reformat the comment above it.
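
    For reference, the definition being moved is tiny (shown here with its
    comment roughly as it appears in the header):

        /*
         * A swap entry has to fit into a "unsigned long", as the entry is
         * hidden in the "index" field of the swapper address space.
         */
        typedef struct {
                unsigned long val;
        } swp_entry_t;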

    Signed-off-by: Tejun Heo
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • This is just a small optimization. The start_pfn can be obtained directly
    as phys_index << PFN_SECTION_SHIFT, so the call to page_to_pfn() is
    redundant; remove it.

    Signed-off-by: Zhang Zhen
    Acked-by: Yasuaki Ishimatsu
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Cc: Wang Nan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     
  • This could be useful for debug in the future if we want to track
    major/minor faults more closely, and also avoids the put_page trick we
    used with gup.

    In order to do this, we also track the task struct in the PASID state
    structure. This lets us update the appropriate task stats after the fault
    has been handled, and may aid with debug in the future as well.

    Signed-off-by: Jesse Barnes
    Tested-by: Oded Gabbay
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesse Barnes
     
  • This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
    simply, rather than doing tricks with page refs and get_user_pages().
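
    A hedged sketch of the flow this export enables in an IOMMU driver (the
    function name and locals are made up; the handle_mm_fault() signature is
    the 3.18-era one):

        static int sva_handle_fault(struct mm_struct *mm, unsigned long address,
                                    bool write)
        {
                struct vm_area_struct *vma;
                int ret = VM_FAULT_SIGBUS;

                down_read(&mm->mmap_sem);
                vma = find_extend_vma(mm, address);
                if (vma)
                        ret = handle_mm_fault(mm, vma, address,
                                              write ? FAULT_FLAG_WRITE : 0);
                up_read(&mm->mmap_sem);
                return ret;
        }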

    Signed-off-by: Jesse Barnes
    Cc: Oded Gabbay
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesse Barnes
     
  • This function is only called during initialization.

    Signed-off-by: Luiz Capitulino
    Cc: Andi Kleen
    Acked-by: David Rientjes
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Davidlohr Bueso
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • No reason to duplicate the code of an existing macro.

    Signed-off-by: Luiz Capitulino
    Cc: Andi Kleen
    Acked-by: David Rientjes
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Davidlohr Bueso
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • The hugepages= entry in kernel-parameters.txt states that 1GB pages can
    only be allocated at boot time and not freed afterwards. This is not
    true since commit 944d9fec8d7a ("hugetlb: add support for gigantic page
    allocation at runtime"), at least for x86_64.

    Instead of adding arch-specific observations to the hugepages= entry,
    this commit just drops the out of date information. Further information
    about arch-specific support and available features can be obtained in
    the hugetlb documentation.

    Signed-off-by: Luiz Capitulino
    Cc: Andi Kleen
    Acked-by: David Rientjes
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Davidlohr Bueso
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • It isn't supposed to stack, so turn it into a bit-field to save 4 bytes
    in the task_struct.

    Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer
    to set/clear the flag inline. As for the lengthy comment on the helpers,
    which this patch removes too: we already have a compact yet accurate
    explanation in memcg_schedule_cache_create, so there is no need for yet
    another one.
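
    A sketch of the task_struct change being described (surrounding fields
    elided):

        struct task_struct {
                /* ... */
                unsigned memcg_kmem_skip_account:1;     /* previously a 4-byte field */
                /* ... */
        };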

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • __memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if
    the cgroup's kmem cache doesn't exist), because kmalloc may call
    __memcg_kmem_get_cache internally again. To avoid the recursion, we use
    the task_struct->memcg_kmem_skip_account flag.

    However, there's no need to check the flag in memcg_kmem_newpage_charge,
    because there is no way this function could result in recursion when
    called from memcg_kmem_get_cache. So let's remove the redundant code.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff
    mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of
    having a separate flags field.
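
    Roughly, the check then becomes a sketch like this (the helper name here
    follows the existing memcg_kmem_is_active(); exact mainline code may
    differ):

        static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
        {
                return memcg->kmemcg_id >= 0;   /* initialized iff kmem accounting is active */
        }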

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When the encountered pte is a swap entry, the current code handles two
    cases: migration entries and normal swap entries, but we have a third
    case: hwpoison pages.

    This patch adds handling for hwpoison pages, considering a hwpoison page
    to be in core, the same as a migration entry.
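
    A hedged, simplified sketch of the swap-entry branch in mincore's pte
    walk after this change (with CONFIG_SWAP):

        if (non_swap_entry(entry)) {
                /* migration and hwpoison entries both mean "in core" */
                *vec = 1;
        } else {
                *vec = mincore_page(swap_address_space(entry), entry.val);
        }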

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Weijie Yang
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Call page_to_pgoff() to get the page offset only once we are sure we
    actually need it, and any very obvious initial function checks have
    passed. A trivial micro-optimization that potentially saves some cycles.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The page guard is used by the debug-pagealloc feature. Currently it is
    open-coded, but I think that abstracting it makes the core page allocator
    code more readable.

    There is no functional difference.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Gioh Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a lot of duplication in the rubric around actually setting or
    clearing a mem region flag. Create a new helper function to do this and
    reduce each of memblock_mark_hotplug() and memblock_clear_hotplug() to a
    single line.

    This will be useful if someone were to add a new mem region flag - which
    I hope to be doing some day soon. But it looks like a plausible cleanup
    even without that - so I'd like to get it out of the way now.
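
    A hedged sketch of the shared helper and the two one-line wrappers it
    enables (simplified from what mainline ended up with):

        static int memblock_setclr_flag(phys_addr_t base, phys_addr_t size,
                                        int set, int flag)
        {
                struct memblock_type *type = &memblock.memory;
                int i, ret, start_rgn, end_rgn;

                ret = memblock_isolate_range(type, base, size,
                                             &start_rgn, &end_rgn);
                if (ret)
                        return ret;

                for (i = start_rgn; i < end_rgn; i++) {
                        if (set)
                                memblock_set_region_flags(&type->regions[i], flag);
                        else
                                memblock_clear_region_flags(&type->regions[i], flag);
                }

                memblock_merge_regions(type);
                return 0;
        }

        int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
        {
                return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
        }

        int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
        {
                return memblock_setclr_flag(base, size, 0, MEMBLOCK_HOTPLUG);
        }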

    Signed-off-by: Tony Luck
    Cc: Santosh Shilimkar
    Cc: Tang Chen
    Cc: Grygorii Strashko
    Cc: Zhang Yanfei
    Cc: Philipp Hachtmann
    Cc: Yinghai Lu
    Cc: Emil Medve
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • task_struct->memcg_kmem_skip_account was initially introduced to avoid
    recursion during kmem cache creation: memcg_kmem_get_cache, which is
    called by kmem_cache_alloc to determine the per-memcg cache to account
    allocation to, may issue lazy cache creation if the needed cache doesn't
    exist, which means issuing yet another kmem_cache_alloc. We can't just
    pass a flag to the nested kmem_cache_alloc disabling kmem accounting,
    because there are hidden allocations, e.g. in INIT_WORK. So we
    introduced a flag on the task_struct, memcg_kmem_skip_account, making
    memcg_kmem_get_cache return immediately.

    By its nature, the flag may also be used to disable accounting for
    allocations shared among different cgroups, and currently it is used this
    way in memcg_activate_kmem. Using it like this looks like an abuse of the
    flag to me. If we want to disable accounting for some allocations (which
    we will definitely want one day), we should add either a GFP_NO_MEMCG or
    a GFP_MEMCG flag in order to blacklist/whitelist such allocations.

    For now, let's simply remove memcg_stop/resume_kmem_account from
    memcg_activate_kmem.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have already ensured in memcg_kmem_should_charge that the current task
    has an mm, so there is no need to double check.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • cpuset code stopped using cgroup_lock in favor of cpuset_mutex long ago.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The alignment in cma_alloc() was done w.r.t. the bitmap. This is a
    problem when, for example:

    - a device requires 16M (order 12) alignment
    - the CMA region is not 16M aligned

    In such a case, the CMA region may start at, say, 0x2f800000, but any
    allocation you make from it will only be aligned relative to that base.
    Requesting an allocation of 32M with 16M alignment will result in an
    allocation from 0x2f800000 to 0x31800000, which doesn't work very well if
    your strange device requires 16M alignment.

    Change to use bitmap_find_next_zero_area_off() to account for the
    difference in alignment at reserve-time and alloc-time.

    Signed-off-by: Gregory Fong
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: Kukjin Kim
    Cc: Laurent Pinchart
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gregory Fong
     
  • Add a bitmap_find_next_zero_area_off() function which works like the
    bitmap_find_next_zero_area() function, except that it allows an offset to
    be specified when alignment is checked. This lets the caller request a
    bit such that its number plus the offset is aligned according to the
    mask.
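
    For reference, the new primitive's shape, with the existing helper
    expressed in terms of it (prototypes as described above; treat minor
    details as approximate):

        unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
                                                     unsigned long size,
                                                     unsigned long start,
                                                     unsigned int nr,
                                                     unsigned long align_mask,
                                                     unsigned long align_offset);

        static inline unsigned long
        bitmap_find_next_zero_area(unsigned long *map, unsigned long size,
                                   unsigned long start, unsigned int nr,
                                   unsigned long align_mask)
        {
                return bitmap_find_next_zero_area_off(map, size, start, nr,
                                                      align_mask, 0);
        }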

    [gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation]
    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Gregory Fong
    Cc: Joonsoo Kim
    Cc: Kukjin Kim
    Cc: Laurent Pinchart
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     
  • The unmap_mapping_range family of functions do the unmapping of user
    pages (ultimately via zap_page_range_single) without touching the actual
    interval tree, so they can share the lock, i.e. take i_mmap_rwsem for
    reading.
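
    A minimal sketch of the locking change being described (the walk itself
    is elided):

        i_mmap_lock_read(mapping);      /* reader: the interval tree is not modified */
        /* vma_interval_tree_foreach() ... zap_page_range_single() ... */
        i_mmap_unlock_read(mapping);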

    Signed-off-by: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The shrinking/truncate logic can call nommu_shrink_inode_mappings() to
    verify that any shared mappings of the inode in question aren't broken
    (dead zone). As far as I can tell, the only user is ramfs, handling the
    size-change attribute.

    Sharing the lock here is pretty much a no-brainer.

    Signed-off-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso