09 Oct, 2012

40 commits

  • remove_memory() is called when hot-removing a memory device, but
    userspace cannot notice that the memory has been offlined. So the patch
    updates the memory block's state and sends a notification to userspace.

    Additionally, the memory device may contain more than one memory block.
    If the memory block has been offlined, __offline_pages() will fail. So we
    should try to offline one memory block at a time.

    Thus remove_memory() also checks each memory block's state, so there is
    no need to check the memory block's state before calling remove_memory().
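
    The per-block behaviour described above can be sketched as follows. This
    is a simplified user-space model, not the kernel code: struct mem_block,
    offline_block() and remove_device_memory() are hypothetical names chosen
    for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel state; not the kernel API. */
enum block_state { BLOCK_ONLINE, BLOCK_OFFLINE };

struct mem_block {
    enum block_state state;
};

/* Pretend to offline one block; returns 0 on success, -1 if already offline. */
static int offline_block(struct mem_block *blk)
{
    if (blk->state == BLOCK_OFFLINE)
        return -1;
    blk->state = BLOCK_OFFLINE;
    return 0;
}

/*
 * Walk every block of a device, skip blocks that are already offline,
 * and offline the rest one at a time. Returns the number of blocks
 * that this call actually offlined.
 */
static size_t remove_device_memory(struct mem_block *blocks, size_t n)
{
    size_t offlined = 0;

    for (size_t i = 0; i < n; i++) {
        if (blocks[i].state == BLOCK_OFFLINE)
            continue;           /* already offline: nothing to do */
        if (offline_block(&blocks[i]) == 0)
            offlined++;
    }
    return offlined;
}
```

    Because the state check lives inside the loop, a caller of this model
    does not need to inspect block state before asking for removal.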

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • remove_memory() is called in two cases:
    1. echo offline >/sys/devices/system/memory/memoryXX/state
    2. hot remove a memory device

    In the 1st case, the memory block's state is changed and a notification
    of the change is sent to userland after calling remove_memory(), so the
    user can notice that the memory block has changed.

    But in the 2nd case, the memory block's state is not changed and no
    notification is sent to userspace even though remove_memory() is called,
    so the user cannot notice that the memory block has changed.

    To prepare for adding the notification at memory hot remove, the patch
    changes the two cases as follows:
    1st case uses offline_pages() for offlining memory.
    2nd case uses remove_memory() for offlining memory, changing the memory
    block's state, and sending the notification.

    The patch does not implement the notification in remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • The following section mismatch warning is thrown during build:

    WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
    The function memblock_type_name() references
    the variable __meminitdata memblock.
    This is often because memblock_type_name lacks a __meminitdata
    annotation or the annotation of memblock is wrong.

    This is because memblock_type_name references the memblock variable,
    which has the __meminitdata attribute. Hence the warning (even if the
    function is inline).

    [akpm@linux-foundation.org: remove inline]
    Signed-off-by: Raghavendra D Prabhu
    Cc: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra D Prabhu
     
  • There was a general sentiment in a recent discussion (See
    https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be
    defined unconditionally. Currently, the only offender is GFP_NOTRACK,
    which is conditional on KMEMCHECK.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • reclaim_clean_pages_from_list() reclaims clean pages before migration so
    cc.nr_migratepages should be updated. Currently, there is no problem but
    it can be wrong if we try to use the value in future.

    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch makes mlocked pages be migrated out. Of course, it can affect
    realtime processes but in CMA usecase, contiguous memory allocation failing
    is far worse than access latency to an mlocked page being variable while
    CMA is running. If someone wants to make the system realtime, he shouldn't
    enable CMA because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • KPF_THP can be set on non-huge compound pages (like slab pages or pages
    allocated by drivers with __GFP_COMP) because PageTransCompound only
    checks PG_head and PG_tail. Obviously this is a bug and breaks user space
    applications which look for thp via /proc/kpageflags.

    This patch rules out setting KPF_THP wrongly by additionally checking
    PageLRU on the head pages.
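
    The difference between the old and new check can be sketched with a small
    user-space model. The flag bits, struct page_model and both helper
    functions are assumptions made for illustration; they are not the kernel
    definitions of PG_head, PG_tail or PageLRU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits; the values are assumptions, not kernel constants. */
#define F_HEAD (1u << 0)   /* compound head page */
#define F_TAIL (1u << 1)   /* compound tail page */
#define F_LRU  (1u << 2)   /* page is on an LRU list */

struct page_model {
    unsigned int flags;
    const struct page_model *head;  /* head page for tails, NULL otherwise */
};

/* Old behaviour: any compound page is reported as THP. */
static bool looks_like_thp_old(const struct page_model *p)
{
    return (p->flags & (F_HEAD | F_TAIL)) != 0;
}

/* New behaviour: additionally require the head page to be on the LRU. */
static bool looks_like_thp_new(const struct page_model *p)
{
    const struct page_model *head;

    if (!(p->flags & (F_HEAD | F_TAIL)))
        return false;
    head = (p->flags & F_TAIL) ? p->head : p;
    return head != NULL && (head->flags & F_LRU) != 0;
}
```

    In this model a slab-like compound page (head not on the LRU) passes the
    old check but is rejected by the new one.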

    Signed-off-by: Naoya Horiguchi
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Reviewed-by: Fengguang Wu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The parameter 'wb' is never used in this function.

    Signed-off-by: Yan Hong
    Acked-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yan Hong
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
    from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
    have been used by any tool, and of course we can restore it easily enough
    if that turns out to be wrong.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
    causing the kernel to hang. When the system doesn't have enough free
    pages, it enters reclaim but never reclaims any pages due to
    too_many_isolated()==true and loops forever.

    The cause is that when we do memory-hotadd after memory-remove,
    __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
    although the vm_stat_diff of all CPUs still have values.

    In addition, when we offline all pages of the zone, we reset them in
    zone_pcp_reset without draining, so we lose some zone stat items.
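
    Why resetting without draining loses counts can be shown with a toy
    model of a global counter plus per-CPU deltas. struct zone_model, the
    CPU count and both functions are hypothetical and only mirror the idea;
    they are not the kernel's vmstat implementation.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4   /* hypothetical CPU count for the model */

struct zone_model {
    long vm_stat;                       /* global counter */
    long vm_stat_diff[NR_CPUS_MODEL];   /* per-CPU pending deltas */
};

/* Fold every per-CPU delta into the global counter, then clear the deltas. */
static void drain_zonestat(struct zone_model *z)
{
    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        z->vm_stat += z->vm_stat_diff[cpu];
        z->vm_stat_diff[cpu] = 0;
    }
}

/* Resetting without draining: pending deltas are silently discarded. */
static void reset_without_drain(struct zone_model *z)
{
    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        z->vm_stat_diff[cpu] = 0;
}
```

    With a global value of 10 and pending deltas summing to -10, draining
    yields 0, while a plain reset leaves the counter stuck at 10. A counter
    that is stuck too high is the kind of state that keeps a check such as
    too_many_isolated() true.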

    Reviewed-by: Wen Congyang
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Yasuaki Ishimatsu
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Revert commit 0def08e3acc2 because check_range can't fail in
    migrate_to_node, considering the current use cases.

    Quote from Johannes

    : I think it makes sense to revert. Not because of the semantics, but I
    : just don't see how check_range() could even fail for this callsite:
    :
    : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
    : find_vma()
    :
    : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
    : and so can not fail
    :
    : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
    : continue until addr == end, so we never fail with -EIO

    And I added a new VM_BUG_ON to catch a future migrate_to_node use case
    that might pass MPOL_MF_STRICT.

    Suggested-by: Johannes Weiner
    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vasiliy Kulikov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.
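
    The rule for change_pte clients can be sketched as a counter of
    outstanding range invalidations. struct client_model and its functions
    are hypothetical; a real client would hook these into its notifier
    callbacks and would need its own locking.

```c
#include <assert.h>

/* Hypothetical notifier client state; not a real mmu notifier user. */
struct client_model {
    int active_ranges;   /* range_start calls without a matching range_end */
    int ptes_installed;  /* change_pte events the client acted on */
};

static void client_range_start(struct client_model *c)
{
    c->active_ranges++;
}

static void client_range_end(struct client_model *c)
{
    assert(c->active_ranges > 0);
    c->active_ranges--;
}

/*
 * Act on change_pte only when the enclosing range is the sole outstanding
 * invalidation; otherwise skip it (how the client recovers later is outside
 * the scope of this sketch).
 */
static void client_change_pte(struct client_model *c)
{
    if (c->active_ranges == 1)
        c->ptes_installed++;
}
```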

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran
     
  • In order to allow sleeping during mmu notifier calls, we need to avoid
    invoking them under the page table spinlock. This patch solves the
    problem by calling invalidate_page notification after releasing the lock
    (but before freeing the page itself), or by wrapping the page invalidation
    with calls to invalidate_range_begin and invalidate_range_end.

    To prevent accidental changes to the invalidate_range_end arguments after
    the call to invalidate_range_begin, the patch introduces a convention of
    saving the arguments in consistently named locals:

    unsigned long mmun_start; /* For mmu_notifiers */
    unsigned long mmun_end; /* For mmu_notifiers */

    ...

    mmun_start = ...
    mmun_end = ...
    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

    ...

    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

    The patch changes code to use this convention for all calls to
    mmu_notifier_invalidate_range_start/end, except those where the calls are
    close enough so that anyone who glances at the code can see the values
    aren't changing.

    This patchset is a preliminary step towards on-demand paging design to be
    added to the RDMA stack.

    Why do we want on-demand paging for Infiniband?

    Applications register memory with an RDMA adapter using system calls,
    and subsequently post IO operations that refer to the corresponding
    virtual addresses directly to HW. Until now, this was achieved by
    pinning the memory during the registration calls. The goal of on demand
    paging is to avoid pinning the pages of registered memory regions (MRs).
    This will allow users the same flexibility they get when swapping any
    other part of their processes' address spaces. Instead of requiring the
    entire MR to fit in physical memory, we can allow the MR to be larger,
    and only fit the current working set in physical memory.

    Why should anyone care? What problems are users currently experiencing?

    This can make programming with RDMA much simpler. Today, developers
    that are working with more data than their RAM can hold need either to
    deregister and reregister memory regions throughout their process's
    life, or keep a single memory region and copy the data to it. On demand
    paging will allow these developers to register a single MR at the
    beginning of their process's life, and let the operating system manage
    which pages need to be fetched at a given time. In the future, we
    might be able to provide a single memory access key for each process
    that would provide the entire process's address space as one large memory
    region, and the developers wouldn't need to register memory regions at
    all.

    Is there any prospect that any other subsystems will utilise these
    infrastructural changes? If so, which and how, etc?

    As for other subsystems, I understand that XPMEM wanted to sleep in
    MMU notifiers, as Christoph Lameter wrote at
    http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
    perhaps Andrea knows about other use cases.

    Scheduling in mmu notifications is required since we need to sync the
    hardware with the secondary page tables change. A TLB flush of an IO
    device is inherently slower than a CPU TLB flush, so our design works by
    sending the invalidation request to the device, and waiting for an
    interrupt before exiting the mmu notifier handler.

    Avi said:

    kvm may be a buyer. kvm::mmu_lock, which serializes guest page
    faults, also protects long operations such as destroying large ranges.
    It would be good to convert it into a spinlock, but as it is used inside
    mmu notifiers, this cannot be done.

    (there are alternatives, such as keeping the spinlock and using a
    generation counter to do the teardown in O(1), which is what the "may"
    is doing up there).

    [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Haggai Eran
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping
    page from vma") fixed the pgoff calculation but replaced it with
    vma_hugecache_offset(), which is not appropriate for offsets used for
    vma_prio_tree_foreach() because that one expects the index in page units
    rather than in huge_page_shift units.

    Johannes said:

    : The resulting index may not be too big, but it can be too small: assume
    : hpage size of 2M and the address to unmap to be 0x200000. This is regular
    : page index 512 and hpage index 1. If you have a VMA that maps the file
    : only starting at the second huge page, that VMAs vm_pgoff will be 512 but
    : you ask for offset 1 and miss it even though it does map the page of
    : interest. hugetlb_cow() will try to unmap, miss the vma, and retry the
    : cow until the allocation succeeds or the skipped vma(s) go away.
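
    The arithmetic in the quote can be checked with a short sketch. It
    assumes 4 KiB base pages and 2 MiB huge pages; the macro and function
    names are made up for the example and are not the kernel helpers.

```c
#include <assert.h>

#define PAGE_SHIFT_MODEL   12UL   /* 4 KiB base pages (assumed) */
#define HPAGE_SHIFT_MODEL  21UL   /* 2 MiB huge pages, as in the example */

/* Index in base-page units, which the prio tree lookup expects. */
static unsigned long page_index(unsigned long file_offset)
{
    return file_offset >> PAGE_SHIFT_MODEL;
}

/* Index in huge-page units, which is too small for that lookup. */
static unsigned long hpage_index(unsigned long file_offset)
{
    return file_offset >> HPAGE_SHIFT_MODEL;
}

/* Does a VMA starting at vm_pgoff and spanning nr_pages cover index idx? */
static int vma_covers(unsigned long vm_pgoff, unsigned long nr_pages,
                      unsigned long idx)
{
    return idx >= vm_pgoff && idx < vm_pgoff + nr_pages;
}
```

    For offset 0x200000 the base-page index is 512 and the huge-page index
    is 1, so a VMA whose vm_pgoff is 512 is found with the former and missed
    with the latter.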

    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Acked-by: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In many places !pmd_present has been converted to pmd_none. For pmds
    that's equivalent and pmd_none is quicker so using pmd_none is better.

    However (unless we delete pmd_present) we should provide an accurate
    pmd_present too. This will avoid the risk of code thinking the pmd is non
    present because it's under __split_huge_page_map, see the pmd_mknotpresent
    there and the comment above it.

    If the page has been mprotected as PROT_NONE, it would also lead to a
    pmd_present false negative in the same way as the race with
    split_huge_page.

    Because the PSE bit stays on at all times (both during split_huge_page and
    when the _PAGE_PROTNONE bit gets set), we could check only for the PSE
    bit, but checking the PROTNONE bit too is still a good reminder that
    pmd_present must always take PROT_NONE into account.

    This explains a non-reproducible BUG_ON that was seldom reported on the
    lists.

    The same issue exists in pmd_large: it would go wrong both with PROT_NONE
    and if it races with split_huge_page.
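
    A minimal sketch of the false negative, using plain bit masks. The bit
    positions and function names below are assumptions for illustration and
    should not be read as the architecture's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions; treat them as assumptions for this sketch. */
#define PG_PRESENT  (1UL << 0)
#define PG_PSE      (1UL << 7)
#define PG_PROTNONE (1UL << 8)

/* Naive check: only the hardware present bit. */
static bool pmd_present_naive(unsigned long pmd)
{
    return (pmd & PG_PRESENT) != 0;
}

/* Check described above: present, PROT_NONE or PSE all count as present. */
static bool pmd_present_accurate(unsigned long pmd)
{
    return (pmd & (PG_PRESENT | PG_PROTNONE | PG_PSE)) != 0;
}
```

    A huge pmd whose present bit is temporarily cleared still has PSE set,
    so only the second check reports it as present.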

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Andi removed some outdated documentation from Documentation/memory.txt
    back in 2009 by commit 3b2b9a875ddc ("Documentation/memory.txt: remove
    some very outdated recommendations"), but the resulting document is not
    in good shape either.

    It seems to me like we are not losing anything by completely removing the
    file now.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • RECLAIM_DISTANCE represents the distance between nodes at which it is
    deemed too costly to allocate from; it's preferred to try to reclaim from
    a local zone before falling back to allocating on a remote node with such
    a distance.

    To do this, zone_reclaim_mode is set if the distance between any two
    nodes on the system is greater than this distance. This, however, ends
    up causing the page allocator to reclaim from every zone regardless of
    its affinity.

    What we really want is to reclaim only from zones that are closer than
    RECLAIM_DISTANCE. This patch adds a nodemask to each node that
    represents the set of nodes that are within this distance. During the
    zone iteration, if the bit for a zone's node is set for the local node,
    then reclaim is attempted; otherwise, the zone is skipped.
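
    The per-node mask can be sketched with a plain bitmask over a small
    distance table. The node count, the threshold value, the use of "<=" as
    the meaning of "within this distance", and both function names are
    assumptions of this sketch, not the kernel's nodemask API.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES_MODEL 4            /* hypothetical node count */
#define RECLAIM_DISTANCE_MODEL 30   /* assumed threshold for this sketch */

/*
 * Build the mask for node nid: bit j is set when node j is within the
 * reclaim distance of nid, i.e. reclaim from node j's zones is allowed.
 */
static unsigned int build_reclaim_mask(int (*dist)[NR_NODES_MODEL], int nid)
{
    unsigned int mask = 0;

    for (int j = 0; j < NR_NODES_MODEL; j++)
        if (dist[nid][j] <= RECLAIM_DISTANCE_MODEL)
            mask |= 1u << j;
    return mask;
}

/* During zone iteration: attempt reclaim only if the zone's node bit is set. */
static bool zone_allows_reclaim(unsigned int local_mask, int zone_nid)
{
    return (local_mask >> zone_nid) & 1u;
}
```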

    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
    remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
    already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
    checking it, reporting "BUG: Bad page state" if it's ever found set.
    Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In fuzzing with trinity, lockdep protested "possible irq lock inversion
    dependency detected" when isolate_lru_page() reenabled interrupts while
    still holding the supposedly irq-safe tree_lock:

    invalidate_inode_pages2
      invalidate_complete_page2
        spin_lock_irq(&mapping->tree_lock)
        clear_page_mlock
          isolate_lru_page
            spin_unlock_irq(&zone->lru_lock)

    isolate_lru_page() is correct to enable interrupts unconditionally:
    invalidate_complete_page2() is incorrect to call clear_page_mlock() while
    holding tree_lock, which is supposed to nest inside lru_lock.

    Both truncate_complete_page() and invalidate_complete_page() call
    clear_page_mlock() before taking tree_lock to remove page from radix_tree.
    I guess invalidate_complete_page2() preferred to test PageDirty (again)
    under tree_lock before committing to the munlock; but since the page has
    already been unmapped, its state is already somewhat inconsistent, and no
    worse if clear_page_mlock() moved up.

    Reported-by: Sasha Levin
    Deciphered-by: Andrew Morton
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • kmem code uses this function and it is better to not use forward
    declarations for static inline functions as some (older) compilers don't
    like it:

    gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)

    mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
    mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here

    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Sachin Kamat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
    the code is not used if !CONFIG_INET so we should rather test for both.
    The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
    let's keep those outside of any ifdefs because it is considered safer wrt.
    future maintainability.

    Tested with
    - CONFIG_INET && CONFIG_MEMCG_KMEM
    - !CONFIG_INET && CONFIG_MEMCG_KMEM
    - CONFIG_INET && !CONFIG_MEMCG_KMEM
    - !CONFIG_INET && !CONFIG_MEMCG_KMEM

    Signed-off-by: Sachin Kamat
    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • While reading through Documentation/cgroups/memory.txt, I found a number
    of minor wordos and typos. The patch below is a conservative handling of
    some of these: it provides just a number of "obviously correct" fixes to
    the English that improve the readability of the document somewhat.
    Obviously some more significant fixes need to be made to the document, but
    some of those may not be in the "obviously correct" category.

    Signed-off-by: Michael Kerrisk
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk
     
  • I think zone->present_pages indicates the pages that the buddy system can
    manage, so it should be:

    zone->present_pages = spanned pages - absent pages - bootmem pages,

    but is now:
    zone->present_pages = spanned pages - absent pages - memmap pages.

    spanned pages: total size, including holes.
    absent pages: holes.
    bootmem pages: pages used in system boot, managed by bootmem allocator.
    memmap pages: pages used by page structs.

    This may make zone->present_pages smaller than it should be. For example,
    if NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE, its memmap and other
    bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's
    present_pages should be spanned pages - absent pages, but currently the
    memmap pages are also subtracted (free_area_init_core()), even though
    they are actually allocated from ZONE_MOVABLE. When offlining all memory
    of a zone, this makes zone->present_pages drop below 0; because
    present_pages is of type unsigned long, it actually becomes a very large
    integer. That indirectly makes zone->watermark[WMARK_MIN] a large integer
    (setup_per_zone_wmarks()), which in turn makes totalreserve_pages a large
    integer (calculate_totalreserve_pages()), and finally causes memory
    allocation failures when forking a process (__vm_enough_memory()).
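
    The wrap-around itself is ordinary unsigned arithmetic and can be
    reproduced in user space. The function names are hypothetical, and the
    clamped variant is shown only for contrast; it is not what the patch
    does.

```c
#include <assert.h>
#include <limits.h>

/* Subtracting more pages than are present wraps around for unsigned long. */
static unsigned long sub_pages(unsigned long present, unsigned long removed)
{
    return present - removed;
}

/* Clamped alternative, shown only for contrast with the wrapping version. */
static unsigned long sub_pages_clamped(unsigned long present,
                                       unsigned long removed)
{
    return removed > present ? 0 : present - removed;
}
```

    Removing 101 pages from a count of 100 gives ULONG_MAX rather than -1,
    which is how an under-counted present_pages turns into a huge value.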

    [root@localhost ~]# dmesg
    -bash: fork: Cannot allocate memory

    I think the bug described in

    http://marc.info/?l=linux-mm&m=134502182714186&w=2

    is also caused by wrong zone present pages.

    This patch intends to fix up zone->present_pages when memory is freed to
    the buddy system on x86_64 and IA64 platforms.

    Signed-off-by: Jianguo Wu
    Signed-off-by: Jiang Liu
    Reported-by: Petr Tesarik
    Tested-by: Petr Tesarik
    Cc: "Luck, Tony"
    Cc: Mel Gorman
    Cc: Yinghai Lu
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Now that lumpy reclaim has been removed, compaction is the only way left
    to free up contiguous memory areas. It is time to just enable
    CONFIG_COMPACTION by default.

    Signed-off-by: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Rafael Aquini
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The update_mmu_cache() takes a pointer (to pte_t by default) as the last
    argument but the huge_memory.c passes a pmd_t value. The patch changes
    the argument to the pmd_t * pointer.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Steve Capper
    Signed-off-by: Will Deacon
    Cc: Arnd Bergmann
    Reviewed-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Gerald Schaefer
    Reviewed-by: Andrea Arcangeli
    Cc: Chris Metcalf
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • The CONFIG_TRANSPARENT_HUGEPAGE implementation of pmdp_get_and_clear()
    calls pmd_clear() with 3 arguments instead of 1.

    This happens only for !__HAVE_ARCH_PMDP_GET_AND_CLEAR which doesn't seem
    to happen because x86 defines this and it uses pmd_update.

    [mhocko@suse.cz: changelog addition]
    Signed-off-by: Catalin Marinas
    Signed-off-by: Steve Capper
    Signed-off-by: Will Deacon
    Cc: Arnd Bergmann
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Gerald Schaefer
    Reviewed-by: Andrea Arcangeli
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • If NUMA is enabled, the indicator is not reset if the previous page
    request failed, causing us to trigger the BUG_ON() in
    khugepaged_alloc_page().

    Signed-off-by: Xiao Guangrong
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • The changelog for commit 6a6dccba2fdc ("mm: cma: don't replace lowmem
    pages with highmem") mentioned that lowmem pages can be replaced by
    highmem pages during CMA migration. 6a6dccba2fdc fixed that issue.

    Quote from that changelog:

    : The filesystem layer expects pages in the block device's mapping to not
    : be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
    : currently replace lowmem pages with highmem pages, leading to crashes in
    : filesystem code such as the one below:
    :
    : Unable to handle kernel NULL pointer dereference at virtual address 00000400
    : pgd = c0c98000
    : [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
    : Internal error: Oops: 817 [#1] PREEMPT SMP ARM
    : CPU: 0 Not tainted (3.5.0-rc5+ #80)
    : PC is at __memzero+0x24/0x80
    : ...
    : Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
    : Backtrace:
    : [] (ext4_getblk+0x0/0x180) from [] (ext4_bread+0x1c/0x98)
    : [] (ext4_bread+0x0/0x98) from [] (ext4_mkdir+0x160/0x3bc)
    : r4:c15337f0
    : [] (ext4_mkdir+0x0/0x3bc) from [] (vfs_mkdir+0x8c/0x98)
    : [] (vfs_mkdir+0x0/0x98) from [] (sys_mkdirat+0x74/0xac)
    : r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
    : [] (sys_mkdirat+0x0/0xac) from [] (sys_mkdir+0x20/0x24)
    : r6:beccdcf0 r5:00074000 r4:beccdbbc
    : [] (sys_mkdir+0x0/0x24) from [] (ret_fast_syscall+0x0/0x30)

    Memory-hotplug has the same problem as CMA, so the same fix can be
    applied to memory-hotplug as well.

    Fix it by reusing.

    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • __alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
    it out (move + rename as a common name) into page_isolation.c.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Signed-off-by: Sachin Kamat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sachin Kamat
     
  • Compaction caches if a pageblock was scanned and no pages were isolated so
    that the pageblocks can be skipped in the future to reduce scanning. This
    information is not cleared by the page allocator based on activity due to
    the impact it would have to the page allocator fast paths. Hence there is
    a requirement that something clear the cache or pageblocks will be skipped
    forever. Currently the cache is cleared if there were a number of recent
    allocation failures and it has not been cleared within the last 5 seconds.
    Time-based decisions like this are terrible as they have no relationship
    to VM activity and are basically a big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (migrate and free
    scanner meet), the zone is marked compact_blockskip_flush. When kswapd
    goes to sleep, it will clear the cache. This is expected to be the
    common case where the cache is cleared. It does not really matter if
    kswapd happens to be asleep or going to sleep when the flag is set as
    it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy but the race is harmless. For allocations that can fail such as THP,
    they will simply fail. For requests that cannot fail, they will retry the
    allocation. Tests indicated that scanning rates were roughly similar to
    when the time-based heuristic was used and the allocation success rates
    were similar.
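
    The two clearing cases can be sketched as a small state model. Only
    compact_blockskip_flush is a name taken from the text; struct zone_model,
    skip_cache_valid and the three functions are hypothetical stand-ins for
    the real compaction and kswapd paths.

```c
#include <assert.h>
#include <stdbool.h>

struct zone_model {
    bool compact_blockskip_flush;  /* flag named in the text above */
    bool skip_cache_valid;         /* hypothetical: cached skip bits exist */
};

/* Case 1, first half: a non-kswapd compaction cycle completed. */
static void compaction_cycle_complete(struct zone_model *z, bool is_kswapd)
{
    if (!is_kswapd)
        z->compact_blockskip_flush = true;
}

/* Case 1, second half: kswapd goes to sleep and clears the cache if asked. */
static void kswapd_going_to_sleep(struct zone_model *z)
{
    if (z->compact_blockskip_flush) {
        z->skip_cache_valid = false;
        z->compact_blockskip_flush = false;
    }
}

/* Case 2: repeated failures and deferral just ended, so clear right away. */
static void deferral_finished(struct zone_model *z, bool had_recent_failures)
{
    if (had_recent_failures)
        z->skip_cache_valid = false;
}
```

    Neither path depends on wall-clock time, which is the point of replacing
    the 5 second rule.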

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is almost entirely based on Rik's previous patches and discussions
    with him about how this might be implemented.

    Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.
    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    This patch caches where the migration and free scanner should start from
    on subsequent compaction invocations using the pageblock-skip information.
    When compaction starts it begins from the cached restart points and will
    update the cached restart points until a page is isolated or a pageblock
    is skipped that would have been scanned by synchronous compaction.
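
    A rough userspace sketch of the free-scanner half of this idea follows.
    It assumes a toy zone of NR_BLOCKS pageblocks and models only the "update
    the restart point until a page is isolated" rule; struct zone_sketch,
    isolate_free_sketch() and the field names are hypothetical and the
    synchronous-compaction condition is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BLOCKS 8

/* Hypothetical toy zone: a handful of pageblocks with free-page counts. */
struct zone_sketch {
    unsigned long blocks_scanned;  /* total pageblocks visited, for illustration */
    size_t cached_free_block;      /* where the free scanner restarts */
    int free_pages[NR_BLOCKS];     /* free pages remaining in each pageblock */
};

/*
 * Walk from the cached restart point toward the start of the zone and take
 * up to `want` free pages. The restart point follows the scanner only until
 * the first page is isolated, so exhausted blocks at the end of the zone are
 * not rescanned by the next invocation.
 */
static int isolate_free_sketch(struct zone_sketch *z, int want)
{
    int got = 0;
    int update_cache = 1;
    size_t b = z->cached_free_block;

    assert(b < NR_BLOCKS);
    for (;;) {
        z->blocks_scanned++;
        while (got < want && z->free_pages[b] > 0) {
            z->free_pages[b]--;
            got++;
        }
        if (got > 0)
            update_cache = 0;  /* a page was isolated: stop moving the restart point */
        if (got == want || b == 0)
            break;
        b--;
        if (update_cache)
            z->cached_free_block = b;
    }
    return got;
}
```

    With free pages only in block 4 of 8, the first call visits four blocks
    and the second call visits one, instead of starting from the end of the
    zone again.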

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction was implemented it was known that scanning could
    potentially be excessive. The ideal was that a counter be maintained for
    each pageblock but maintaining this information would incur a severe
    penalty due to a shared writable cache line. It has reached the point
    where the scanning costs are a serious problem, particularly on
    long-lived systems where a large process starts and allocates a large
    number of THPs at the same time.

    Instead of using a shared counter, this patch adds another bit to the
    pageblock flags called PG_migrate_skip. If a pageblock is scanned by
    either migrate or free scanner and 0 pages were isolated, the pageblock is
    marked to be skipped in the future. When scanning, this bit is checked
    before any scanning takes place and the block skipped if set.
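
    The skip bit can be illustrated with a small userspace model. Everything
    here is a sketch under assumptions: struct zone_sketch, the skip[] array
    standing in for the per-pageblock bit, and the ignore_skip_hint parameter
    (modelling a caller that disregards the cached information) are
    hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BLOCKS 4
#define PAGES_PER_BLOCK 4

/* Hypothetical toy zone with one skip flag per pageblock. */
struct zone_sketch {
    unsigned long pages_scanned;                  /* total pages examined */
    bool skip[NR_BLOCKS];                         /* stand-in for the skip bit */
    bool isolatable[NR_BLOCKS][PAGES_PER_BLOCK];  /* pages a scanner could take */
};

/* Scan every pageblock once, honouring the skip flags unless told not to. */
static int scan_zone_sketch(struct zone_sketch *z, bool ignore_skip_hint)
{
    int isolated_total = 0;

    assert(z);
    for (size_t b = 0; b < NR_BLOCKS; b++) {
        int isolated = 0;

        /* The flag is checked before any page in the block is touched. */
        if (z->skip[b] && !ignore_skip_hint)
            continue;

        for (size_t p = 0; p < PAGES_PER_BLOCK; p++) {
            z->pages_scanned++;
            if (z->isolatable[b][p]) {
                z->isolatable[b][p] = false;
                isolated++;
            }
        }

        /* Nothing isolated: mark the block so later scans skip it. */
        if (isolated == 0)
            z->skip[b] = true;
        isolated_total += isolated;
    }
    return isolated_total;
}
```

    Once every block has been marked, a further scan costs nothing in this
    model, which is also why the flags must be cleared at some point.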

    The main difficulty with a patch like this is "when to ignore the cached
    information?" If it's ignored too often, the scanning rates will still be
    excessive. If the information is too stale then allocations will fail
    that might have otherwise succeeded. In this patch:

    o CMA always ignores the information
    o If the migrate and free scanner meet then the cached information will
      be discarded if it's at least 5 seconds since the last time the cache
      was discarded
    o If there are a large number of allocation failures, discard the cache.

    The time-based heuristic is very clumsy but there are few choices for a
    better event. Depending solely on multiple allocation failures still
    allows excessive scanning when THP allocations are failing in quick
    succession due to memory pressure. Waiting until memory pressure is
    relieved would cause compaction to continually fail instead of using
    reclaim/compaction to try to allocate the page. The time-based mechanism is
    clumsy but a better option is not obvious.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
    off where it left") and commit de74f1cc ("mm: have order > 0 compaction
    start near a pageblock with free pages"). These patches were a good
    idea and tests confirmed that they massively reduced the amount of
    scanning but the implementation is complex and tricky to understand. A
    later patch will cache what pageblocks should be skipped and
    reimplement the concept of compact_cached_free_pfn on top for both
    migration and free scanners.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Compaction's free scanner acquires the zone->lock when checking for
    PageBuddy pages and isolating them. It does this even if there are no
    PageBuddy pages in the range.

    This patch defers acquiring the zone lock for as long as possible. In the
    event there are no free pages in the pageblock then the lock will not be
    acquired at all which reduces contention on zone->lock.
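
    The deferral pattern can be sketched as follows. This is a
    single-threaded userspace model, not the kernel implementation: struct
    lock_sketch merely counts acquisitions in place of a real spinlock, and
    isolate_block_sketch() and the page_is_free[] array are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock model that records how often it was taken. */
struct lock_sketch {
    bool held;
    unsigned long acquisitions;
};

static void lock_sketch_acquire(struct lock_sketch *l)
{
    assert(!l->held);
    l->held = true;
    l->acquisitions++;
}

static void lock_sketch_release(struct lock_sketch *l)
{
    assert(l->held);
    l->held = false;
}

/*
 * Isolate free pages from one pageblock. The lock is taken only once a
 * candidate free page has been seen, so a block with no free pages never
 * touches it.
 */
static int isolate_block_sketch(bool *page_is_free, size_t nr_pages,
                                struct lock_sketch *zone_lock)
{
    int isolated = 0;

    for (size_t i = 0; i < nr_pages; i++) {
        if (!page_is_free[i])
            continue;  /* cheap check made without the lock */

        if (!zone_lock->held)
            lock_sketch_acquire(zone_lock);

        /*
         * Recheck under the lock. Redundant in this single-threaded model,
         * but with concurrent allocators the page may have been taken
         * between the unlocked check and the acquisition.
         */
        if (!page_is_free[i])
            continue;

        page_is_free[i] = false;
        isolated++;
    }

    if (zone_lock->held)
        lock_sketch_release(zone_lock);
    return isolated;
}
```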

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Tested-by: Peter Ujfalusi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Richard Davies and Shaohua Li have both reported lock contention problems
    in compaction on the zone and LRU locks as well as significant amounts of
    time being spent in compaction. This series aims to reduce lock
    contention and scanning rates to reduce that CPU usage. Richard reported
    at https://lkml.org/lkml/2012/9/21/91 that this series made a big
    difference to a problem he reported in August:

    http://marc.info/?l=kvm&m=134511507015614&w=2

    Patch 1 defers acquiring the zone->lru_lock as long as possible.

    Patch 2 defers acquiring the zone->lock as long as possible.

    Patch 3 reverts Rik's "skip-free" patches as the core concept gets
            reimplemented later and the remaining patches are easier to
            understand if this is reverted first.

    Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
            pageblocks should be skipped by the migrate and free scanners.
            This drastically reduces the amount of scanning compaction has
            to do.

    Patch 5 reimplements something similar to Rik's idea except it uses the
            pageblock-skip information to decide where the scanners should
            restart from and does not need to wrap around.

    I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were

    akpm-20120920  3.6-rc6 + linux-next/akpm as of September 20th, 2012
    lesslock       Patches 1-6
    revert         Patches 1-7
    cachefail      Patches 1-8
    skipuseless    Patches 1-9

    Stress high-order allocation tests looked ok. Success rates are more or
    less the same with the full series applied but there is an expectation
    that there is less opportunity to race with other allocation requests if
    there is less scanning. The time to complete the tests did not vary that
    much and is uninteresting, as were the vmstat statistics, so I will not
    present them here.

    Using ftrace I recorded how much scanning was done by compaction and got this

                                  3.6.0-rc6    3.6.0-rc6    3.6.0-rc6    3.6.0-rc6    3.6.0-rc6
                              akpm-20120920     lockless  revert-v2r2    cachefail  skipuseless

    Total free scanned            360753976    515414028    565479007     17103281     18916589
    Total free isolated             2852429      3597369      4048601       670493       727840
    Total free efficiency           0.0079%      0.0070%      0.0072%      0.0392%      0.0385%
    Total migrate scanned         247728664    822729112   1004645830     17946827     14118903
    Total migrate isolated          2555324      3245937      3437501       616359       658616
    Total migrate efficiency        0.0103%      0.0039%      0.0034%      0.0343%      0.0466%

    The efficiency is worthless because of the nature of the test and the
    number of failures. The really interesting point as far as this patch
    series is concerned is the number of pages scanned. Note that reverting
    Rik's patches massively increases the number of pages scanned indicating
    that those patches really did make a difference to CPU usage.

    However, caching what pageblocks should be skipped has a much higher
    impact. With patches 1-8 applied, free page and migrate page scanning are
    both reduced by 95% in comparison to the akpm kernel. If the basic
    concept of Rik's patches is implemented on top then free scanning
    barely changed but migrate scanning was further reduced. That
    said, tests on 3.6-rc5 indicated that the last patch had greater impact
    than what was measured here so it is a bit variable.
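
    The reduction can be recomputed from the "scanned" rows of the table. The
    helper below is a hypothetical convenience, not part of the series; it
    just expresses how far one count is below another as a percentage.

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before`. */
static double reduction_percent(unsigned long before, unsigned long after)
{
    assert(before > 0 && after <= before);
    return 100.0 * (double)(before - after) / (double)before;
}
```

    Fed the akpm-20120920 and cachefail columns, this gives roughly 95% for
    free pages scanned (360753976 down to 17103281) and roughly 93% for
    migrate pages scanned (247728664 down to 17946827).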

    One way or the other, this series has a large impact on the amount of
    scanning compaction does when there is a storm of THP allocations.

    This patch:

    Compaction's migrate scanner acquires the zone->lru_lock when scanning a
    range of pages looking for LRU pages to acquire. It does this even if
    there are no LRU pages in the range. If multiple processes are compacting
    then this can cause severe locking contention. To make matters worse
    commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
    while isolating pages for migration") releases the lru_lock every
    SWAP_CLUSTER_MAX pages that are scanned.

    This patch makes two changes to how the migrate scanner acquires the LRU
    lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
    if the lock is contended. This reduces the number of times it
    unnecessarily disables and re-enables IRQs. The second is that it defers
    acquiring the LRU lock for as long as possible. If there are no LRU pages
    or the only LRU pages are transhuge then the LRU lock will not be acquired
    at all which reduces contention on zone->lru_lock.
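
    Both changes can be modelled in a few lines of userspace C. This is a
    sketch under stated assumptions rather than the kernel code: struct
    lru_lock_sketch, its contended flag, scan_for_migration_sketch() and the
    SWAP_CLUSTER_MAX_SKETCH value of 32 are all illustrative, and IRQ
    handling and transhuge pages are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SWAP_CLUSTER_MAX_SKETCH 32  /* illustrative batch size */

/* Hypothetical lock model that records how often it was taken. */
struct lru_lock_sketch {
    bool held;
    bool contended;              /* true if another CPU is waiting for the lock */
    unsigned long acquisitions;
};

/* Scan a range of pages and "isolate" those that are on the LRU. */
static int scan_for_migration_sketch(const bool *page_on_lru, size_t nr_pages,
                                     struct lru_lock_sketch *lock)
{
    int isolated = 0;

    assert(page_on_lru && lock);
    for (size_t i = 0; i < nr_pages; i++) {
        /* Change 1: at each batch boundary, drop the lock only if contended. */
        if (i > 0 && i % SWAP_CLUSTER_MAX_SKETCH == 0 &&
            lock->held && lock->contended)
            lock->held = false;

        if (!page_on_lru[i])
            continue;

        /* Change 2: take the lock only when an LRU page is actually found. */
        if (!lock->held) {
            lock->held = true;
            lock->acquisitions++;
        }
        isolated++;
    }

    lock->held = false;
    return isolated;
}
```

    A range with no LRU pages never takes the lock in this model, and an
    uncontended scan of 64 LRU pages takes it once rather than once per
    batch.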

    [minchan@kernel.org: augment comment]
    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Parameters were added without documentation, tut tut.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman