27 Jul, 2016

40 commits

  • Provide a shmem_get_unmapped_area method in file_operations, called at
    mmap time to decide the mapping address. It could be conditional on
    CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
    it unconditional.

    shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
    (which we treat as a black box, highly dependent on architecture and
    config and executable layout). Lots of conditions, and in most cases it
    just goes with the address that was chosen; but when our huge stars are
    rightly aligned, yet that did not provide a suitably aligned address, it
    goes back to ask for a larger arena, within which to align the mapping
    suitably.

    There have to be some direct calls to shmem_get_unmapped_area(), not via
    the file_operations: because of the way shmem_zero_setup() is called to
    create a shmem object late in the mmap sequence, when MAP_SHARED is
    requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
    when /proc/sys/vm/shmem_huge has been set.
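
    A minimal C sketch of the alignment dance described above (not the exact
    kernel code: the function name is made up, error handling is trimmed and
    the many eligibility checks on file, length and shmem_huge are omitted):

        static unsigned long shmem_huge_area_sketch(struct file *file,
                        unsigned long addr, unsigned long len,
                        unsigned long pgoff, unsigned long flags)
        {
                unsigned long (*get_area)(struct file *, unsigned long,
                        unsigned long, unsigned long, unsigned long);
                unsigned long inflated;

                get_area = current->mm->get_unmapped_area;

                /* first ask the usual black box */
                addr = get_area(file, addr, len, pgoff, flags);
                if (IS_ERR_VALUE(addr) || IS_ALIGNED(addr, HPAGE_PMD_SIZE))
                        return addr;

                /* ask for a larger arena and align within it */
                inflated = get_area(NULL, 0, len + HPAGE_PMD_SIZE, 0, flags);
                if (IS_ERR_VALUE(inflated))
                        return addr;
                return round_up(inflated, HPAGE_PMD_SIZE);
        }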

    Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Kirill A. Shutemov

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch adds a new mount option "huge=". It can have the following values:

    - "always":
    Attempt to allocate huge pages every time we need a new page;

    - "never":
    Do not allocate huge pages;

    - "within_size":
    Only allocate a huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

    - "advise:
    Only allocate huge pages if requested with fadvise()/madvise();

    Default is "never" for now.

    "mount -o remount,huge= /mountpoint" works fine after mount: remounting
    huge=never will not attempt to break up huge pages at all, just stop
    more from being allocated.

    No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which
    is the appropriate option to protect those who don't want the new bloat,
    and with which we shall share some pmd code.

    Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
    invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
    explicit).

    Allow enabling THP only if the machine has_transparent_hugepage().

    But what about Shmem with no user-visible mount? SysV SHM, memfds,
    shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM
    objects, Ashmem. Though unlikely to suit all usages, provide sysfs knob
    /sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with
    huge on those.

    And allow shmem_enabled two further values:

    - "deny":
    For use in emergencies, to force the huge option off from
    all mounts;
    - "force":
    Force the huge option on for all - very useful for testing;

    Based on patch by Hugh Dickins.
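
    For illustration only (not part of the patch), here is how the new option
    might be exercised from a C program; the mount point path is made up and
    must already exist, and the program needs the usual mount privileges:

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
                /* mount a tmpfs instance that uses huge pages within i_size */
                if (mount("tmpfs", "/mnt/hugetmp", "tmpfs", 0,
                          "huge=within_size,size=1G")) {
                        perror("mount");
                        return 1;
                }
                /* remount huge=never: stops new huge allocations, does not
                   split the huge pages that already exist */
                if (mount(NULL, "/mnt/hugetmp", NULL, MS_REMOUNT,
                          "huge=never,size=1G")) {
                        perror("remount");
                        return 1;
                }
                return 0;
        }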

    Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
    smaps. They show how much memory is allocated and PMD-mapped as shmem THP.

    NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

    Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For shmem/tmpfs we only need to tweak truncate_inode_page() and
    invalidate_mapping_pages().

    truncate_inode_pages_range() and invalidate_inode_pages2_range() are
    adjusted to use page_to_pgoff().

    Link: http://lkml.kernel.org/r/1466021202-61880-26-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For now, we have HPAGE_PMD_NR entries in the radix tree for every huge
    page. That's suboptimal; it will be changed to use Matthew's
    multi-order entries later.

    The 'add' operation is not changed, because we don't need it to
    implement huge tmpfs: shmem uses its own implementation.

    Link: http://lkml.kernel.org/r/1466021202-61880-25-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The new helper is similar to radix_tree_maybe_preload(), but tries to
    preload the number of nodes required to insert (1 << order) contiguous,
    naturally aligned elements.

    This is required to push huge pages into pagecache.
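
    A sketch of the intended caller-side usage pattern (illustrative, not
    taken from the patch; GFP_RECLAIM_MASK is the mm-internal mask):

        /* reserve enough nodes for HPAGE_PMD_NR naturally aligned slots */
        error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
                                               HPAGE_PMD_ORDER);
        if (error)
                return error;

        spin_lock_irq(&mapping->tree_lock);
        /* ... insert the (1 << HPAGE_PMD_ORDER) entries here ... */
        spin_unlock_irq(&mapping->tree_lock);

        radix_tree_preload_end();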

    Link: http://lkml.kernel.org/r/1466021202-61880-24-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • These flags are in use for file THP.

    Link: http://lkml.kernel.org/r/1466021202-61880-23-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This prepares vmscan for file huge pages. We cannot write out huge
    pages, so we need to split them on the way out.

    Link: http://lkml.kernel.org/r/1466021202-61880-22-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with anon THP, we only mlock file huge pages if we can prove that the
    page is not mapped with PTE. This way we can avoid mlock leak into
    non-mlocked vma on split.

    We rely on PageDoubleMap() under lock_page() to check if the page may
    be PTE mapped. PG_double_map is set by page_add_file_rmap() when the
    page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Basic scheme is the same as for anon THP.

    Main differences:

    - File pages are in the radix tree, so head->_count is offset by
    HPAGE_PMD_NR; the count gets distributed to the small pages during
    split.

    - mapping->tree_lock prevents non-lockless access to pages under split
    over the radix tree;

    - Lockless access is prevented by setting head->_count to 0 during
    split;

    - After split, some pages can be beyond i_size. We drop them from the
    radix tree.

    - We don't set up migration entries, just unmap the pages. This helps
    handle the case when i_size falls in the middle of the page: there is
    no need to unmap pages beyond i_size manually.

    Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
    During split we munlock the huge page, which requires an rmap walk, and
    rmap wants to take the lock on its own.

    Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.

    Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • change_huge_pmd() has an assert which is not relevant for file pages.
    For a shared mapping it's perfectly fine to have a writable page table
    entry without an explicit mkwrite.

    Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • copy_page_range() has a check for "Don't copy ptes where a page fault
    will fill them correctly." It works on VMA level. We still copy all
    page table entries from private mappings, even if they map page cache.

    We can simplify copy_huge_pmd() a bit by skipping file PMDs.

    We don't map file private pages with PMDs, so they can only map page
    cache. It's safe to skip them as they can be re-faulted later.
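
    The core of the change amounts to an early bail-out along these lines
    (simplified sketch):

        /* in copy_huge_pmd(): file-backed pmds are not copied; the child
           simply re-faults them from page cache later */
        if (!vma_is_anonymous(vma))
                return 0;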

    Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • File COW for THP is handled on pte level: just split the pmd.

    It's not clear how beneficial allocating huge pages on COW faults would
    be, and it would require some code to make them work.

    I think at some point we can consider teaching khugepaged to collapse
    pages in COW mappings, but allocating huge on fault is probably
    overkill.
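
    A simplified C sketch of that COW path (the function name and the
    surrounding plumbing are abbreviated here):

        static int wp_huge_pmd_sketch(struct fault_env *fe, pmd_t orig_pmd)
        {
                if (vma_is_anonymous(fe->vma))
                        return do_huge_pmd_wp_page(fe, orig_pmd);

                /* file COW: split to ptes and let pte-level COW handle it */
                split_huge_pmd(fe->vma, fe->pmd, fe->address);
                return VM_FAULT_FALLBACK;
        }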

    Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Splitting a file THP PMD is simple: just unmap it as in the DAX case.
    This way we can avoid the memory overhead of allocating a page table to
    deposit.

    It's probably a good idea to try to allocate a page table with
    GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area,
    but clearing the pmd should be good enough for now.

    Unlike DAX, we also remove the page from rmap and drop the reference.
    pmd_young() is transferred to PageReferenced().
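
    Roughly, the non-anonymous branch of __split_huge_pmd_locked() described
    above looks like this (a simplified sketch, counter updates omitted):

        if (!vma_is_anonymous(vma)) {
                _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
                if (vma_is_dax(vma))
                        return;                 /* DAX: nothing more to do */
                page = pmd_page(_pmd);
                if (!PageReferenced(page) && pmd_young(_pmd))
                        SetPageReferenced(page);
                page_remove_rmap(page, true);   /* drop compound rmap */
                put_page(page);
                return;
        }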

    Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • split_huge_pmd() for file mappings (and DAX too) is implemented by just
    clearing pmd entry as we can re-fill this area from page cache on pte
    level later.

    This means we don't need to deposit page tables when file THP is
    mapped, and therefore we shouldn't try to withdraw a page table when
    zap_huge_pmd() hits a file THP PMD.

    Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • THP_FILE_ALLOC: how many times a huge page was allocated and put into
    page cache.

    THP_FILE_MAPPED: how many times a file huge page was mapped.

    Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With postponed page table allocation we have a chance to set up huge
    pages: do_set_pte() calls do_set_pmd() if the following criteria are
    met:

    - the page is compound;
    - the pmd entry is pmd_none();
    - the vma has suitable size and alignment;
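
    In code, the dispatch looks roughly like this (simplified from the fault
    path; the fallback to a pte mapping follows after this check):

        if (pmd_none(*fe->pmd) && PageTransCompound(page) &&
            IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
                ret = do_set_pmd(fe, page);
                if (ret != VM_FAULT_FALLBACK)
                        return ret;     /* mapped with a pmd, or failed */
        }
        /* otherwise map the page with a pte as before */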

    Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Naive approach: on mapping/unmapping the page as compound we update
    ->_mapcount on each 4k page. That's not efficient, but it's not obvious
    how we can optimize this. We can look into optimization later.

    PG_double_map optimization doesn't work for file pages since the
    lifecycle of file pages is different compared to anon pages: a file
    page can be mapped again at any time.

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea (and most of the code) is borrowed again from Hugh's patchset
    on huge tmpfs[1].

    Instead of allocating the pte page table upfront, we postpone this
    until we have a page to map in hand. This approach opens the
    possibility of mapping the page as huge if the filesystem supports it.

    Compared to Hugh's patch, I've pushed page table allocation a bit
    further: into do_set_pte(). This way we can postpone allocation even in
    the faultaround case without moving do_fault_around() after
    __do_fault().

    do_set_pte() got renamed to alloc_set_pte() as it can allocate a page
    table if required.

    [1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils

    Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea is borrowed from Peter's patch in his patchset on speculative
    page faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical, with the exception of the
    faultaround code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org
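
    For reference, the context structure introduced here looks approximately
    like this (field list abbreviated, comments paraphrased):

        struct fault_env {
                struct vm_area_struct *vma;     /* target VMA */
                unsigned long address;          /* faulting virtual address */
                unsigned int flags;             /* FAULT_FLAG_* */
                pmd_t *pmd;                     /* pmd entry for 'address' */
                pte_t *pte;                     /* pte, NULL until allocated */
                spinlock_t *ptl;                /* page table lock */
                pgtable_t prealloc_pte;         /* preallocated pte table */
        };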

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We always have vma->vm_mm around.

    Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add description of THP handling into unevictable-lru.txt.

    Link: http://lkml.kernel.org/r/1466021202-61880-7-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Vlastimil noted[1] that the pmd may no longer be valid after we drop
    mmap_sem. We need to recheck it once mmap_sem is taken again.

    [1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz
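
    The fix boils down to an extra check after mmap_sem is re-taken, along
    these lines (sketch; surrounding khugepaged code omitted):

        /* after re-acquiring mmap_sem and revalidating the vma */
        if (mm_find_pmd(mm, address) != pmd)
                goto out;       /* pmd went away, abort the collapse */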

    Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After creating the revalidate vma function, a locking inconsistency
    occurred due to directing the code path to the wrong label. This patch
    directs it to the correct label and fixes the inconsistency.

    Related commit that caused inconsistency:
    http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

    Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Kirill A. Shutemov
    Cc: Stephen Rothwell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Currently khugepaged makes swapin readahead under down_write. This
    patch makes swapin readahead happen under down_read instead of
    down_write.

    The patch was tested with a test program that allocates 800MB of memory,
    writes to it, and then sleeps. The system was forced to swap it all
    out. Afterwards, the test program touches the area by writing again,
    skipping one page in every 20 pages of the area.

    [akpm@linux-foundation.org: update comment to match new code]
    [kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revalidate() is useless]
    Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com
    Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com
    Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • This patch performs swapin readahead to improve the THP collapse rate.
    When khugepaged scans pages, a few of them may be in the swap area.

    With the patch, THP can collapse 4kB pages into a THP when there are up
    to max_ptes_swap swap ptes in a 2MB range.

    The patch was tested with a test program that allocates 400B of memory,
    writes to it, and then sleeps. I force the system to swap it all out.
    Afterwards, the test program touches the area by writing again,
    skipping one page in every 20 pages of the area.

    Without the patch, the system did not do swapin readahead. The THP rate
    was 65% of the program's memory, and it did not change over time.

    With this patch, after 10 minutes of waiting khugepaged had collapsed
    99% of the program's memory.

    [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
    [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
    [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Introduce a new sysfs integer knob
    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap which makes
    an optimistic check for swapin readahead to increase the THP collapse
    rate. Before bringing swapped-out pages back to memory, it checks them
    and allows up to a certain number of swap ptes. It also prints out the
    amount of unmapped ptes using tracepoints.
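
    The scan-side check is essentially the following (sketch; variable and
    counter names abbreviated from the khugepaged scan loop):

        if (is_swap_pte(pteval)) {
                if (++unmapped > khugepaged_max_ptes_swap) {
                        result = SCAN_EXCEED_SWAP_PTE;
                        goto out_unmap;
                }
                continue;       /* still within the allowed budget */
        }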

    [vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE]
    [sfr@canb.auug.org.au: build fix]
    Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • If nr_new is 0, which means there is no region to be added, just
    return to the caller.
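
    The change is essentially an early exit in memblock_add_range() (sketch):

        /* nothing to insert: don't bother resizing the region array */
        if (!nr_new)
                return 0;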

    Signed-off-by: nimisolo
    Cc: Alexander Kuleshov
    Cc: Pekka Enberg
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    nimisolo
     
  • Vladimir has noticed that we might declare memcg oom even during
    readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
    restriction) while __do_page_cache_readahead uses
    page_cache_alloc_readahead which adds __GFP_NORETRY to prevent OOMs.
    This gfp mask discrepancy is really unfortunate and easily
    fixable. Drop page_cache_alloc_readahead() which only has one user and
    outsource the gfp_mask logic into readahead_gfp_mask and propagate this
    mask from __do_page_cache_readahead down to read_pages.

    This alone would have only very limited impact as most filesystems are
    implementing ->readpages and the common implementation mpage_readpages
    does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
    use readahead_gfp_mask instead as this function is called only during
    readahead as well. The same applies to read_cache_pages.

    ext4 has its own ext4_mpage_readpages but the path which has pages !=
    NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
    doing a very similar pattern to mpage_readpages so the same can be
    applied to them as well.
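
    The new helper is essentially:

        static inline gfp_t readahead_gfp_mask(struct address_space *x)
        {
                return mapping_gfp_mask(x) |
                        __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
        }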

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@suse.com: restrict gfp mask in mpage_alloc]
    Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Chris Mason
    Cc: Steve French
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Cc: Mike Marshall
    Cc: Jaegeuk Kim
    Cc: Changman Lee
    Cc: Chao Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Tetsuo is worried that mmput_async might still lead to a premature new
    oom victim selection due to the following race:

    __oom_reap_task                       exit_mm
      find_lock_task_mm
      atomic_inc(mm->mm_users) # = 2
      task_unlock
                                            task_lock
                                            task->mm = NULL
                                            up_read(&mm->mmap_sem)
                  < somebody write locks mmap_sem >
                                            task_unlock
                                            mmput
                                              atomic_dec_and_test # = 1
                                            exit_oom_victim
      down_read_trylock # failed - no reclaim
      mmput_async # Takes unpredictable amount of time
                  < new OOM situation >

    the final __mmput will be executed in the delayed context which might
    happen far in the future. Such a race is highly unlikely because the
    write holder of mmap_sem would have to be an external task (all direct
    holders are already killed or exiting) and it usually has to pin
    mm_users in order to do anything reasonable.

    We can, however, make sure that the mmput_async is only called when we
    do not back off and reap some memory. That would reduce the impact of
    the delayed __mmput because the real content would be already freed.
    Pin mm_count to keep it alive after we drop task_lock and before we try
    to take mmap_sem. If taking mmap_sem succeeds we can try to grab an
    mm_users reference and then go on with unmapping the address space.

    It is not clear whether this race is possible at all but it is better to
    be more robust and do not pin mm_users unless we are sure we are
    actually doing some real work during __oom_reap_task.
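
    A sketch of the resulting ordering in __oom_reap_task() (a fragment with
    error paths trimmed and the actual unmapping elided):

        p = find_lock_task_mm(tsk);
        if (!p)
                return true;
        mm = p->mm;
        atomic_inc(&mm->mm_count);      /* pin mm_count, not mm_users */
        task_unlock(p);

        if (!down_read_trylock(&mm->mmap_sem)) {
                ret = false;            /* retry later */
                goto drop;
        }

        /* only now, when we know we will reap, pin mm_users */
        if (!atomic_inc_not_zero(&mm->mm_users)) {
                up_read(&mm->mmap_sem);
                goto drop;
        }

        /* ... unmap the address space ... */

        up_read(&mm->mmap_sem);
        mmput_async(mm);        /* delayed __mmput is fine: memory was reaped */
drop:
        mmdrop(mm);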

    Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Correct the function parameter alignment: the original code already
    uses both tabs and white spaces together for the incorrectly aligned
    parameters of several functions.

    If one line can hold one statement within 80 columns, keep it on one
    line (the original code did not account for the tabs/spaces on the 2nd
    line when a statement is split across 2 lines).

    Try to keep the '\' line continuations aligned within each macro, since
    all related lines are short enough.

    Remove useless statement "idx = 0;", and always assign rgn within the
    'for' statement.

    Link: http://lkml.kernel.org/r/1464904899-1714-1-git-send-email-chengang@emindsoft.com.cn
    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • zram is very popular in parts of the embedded world (e.g., TVs, mobile
    phones). On those systems, zsmalloc's memory consumption is never
    trivial (one example from a real production system: total memory 800M,
    zsmalloc consumed 150M), so we have used this out-of-tree patch to
    monitor system memory behavior via /proc/vmstat.

    With zsmalloc in vmstat, it helps in tracking down system behavior due
    to memory usage.

    [minchan@kernel.org: zsmalloc: follow up zsmalloc vmstat]
    Link: http://lkml.kernel.org/r/20160607091737.GC23435@bbox
    [akpm@linux-foundation.org: fix build with CONFIG_ZSMALLOC=m]
    Link: http://lkml.kernel.org/r/1464919731-13255-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Sangseok Lee
    Cc: Chanho Min
    Cc: Chan Gyun Jeong
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • I have noticed that frontswap.h first declares "frontswap_enabled" as
    an extern bool variable, and then overrides it with "#define
    frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y or (0) when disabled. The
    bool variable isn't actually instantiated anywhere.

    This all looks like an unfinished attempt to make frontswap_enabled
    reflect whether a backend is instantiated. But in the current state,
    all frontswap hooks call unconditionally into frontswap.c just to check
    if frontswap_ops is non-NULL. This should at least be checked inline,
    but we can further eliminate the overhead when CONFIG_FRONTSWAP is
    enabled and no backend registered, using a static key that is initially
    disabled, and gets enabled only upon first backend registration.

    Thus, checks for "frontswap_enabled" are replaced with
    "frontswap_enabled()" wrapping the static key check. There are two
    exceptions:

    - xen's selfballoon_process() was testing frontswap_enabled in code guarded
    by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable.
    The patch just removes this check. Using frontswap_enabled() does not sound
    correct here, as this can be true even without xen's own backend being
    registered.

    - in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP)
    as it seems the bitmap allocation cannot currently be postponed until a
    backend is registered. This means that frontswap will still have some
    memory overhead by being configured, but without a backend.

    After the patch, we can expect that some functions in frontswap.c are
    called only when frontswap_ops is non-NULL. Change the checks there to
    VM_BUG_ONs. While at it, convert other BUG_ONs to VM_BUG_ONs as
    frontswap has been stable for some time.
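
    The wrapper is essentially a static-key check, along these lines
    (sketch of the header side; the key name matches the description above):

        extern struct static_key_false frontswap_enabled_key;

        static inline bool frontswap_enabled(void)
        {
                return static_branch_unlikely(&frontswap_enabled_key);
        }

        /* backend registration flips the key on first use, e.g.:
           static_branch_inc(&frontswap_enabled_key); */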

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Juergen Gross
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • oom_scan_process_thread() does not use its totalpages argument;
    oom_badness() does use it.

    Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Unix sockets can consume a significant amount of system memory, hence
    they should be accounted to kmemcg.

    Since unix socket buffers are always allocated from process context, all
    we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
    sock->sk_allocation mask.
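
    The mechanism amounts to a one-liner at socket creation time (sketch;
    the exact spot in af_unix is abbreviated here):

        /* charge unix socket buffer allocations to kmemcg */
        sk->sk_allocation = GFP_KERNEL_ACCOUNT;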

    Eric asked:

    > 1) What happens when a buffer, allocated from socket lands in a
    > different socket , maybe owned by another user/process.
    >
    > Who owns it now, in term of kmemcg accounting ?

    We never move memcg charges. E.g. if two processes from different
    cgroups are sharing a memory region, each page will be charged to the
    process which touched it first. Or if two processes are working with
    the same directory tree, inodes and dentries will be charged to the
    first user. The same is fair for unix socket buffers - they will be
    charged to the sender.

    > 2) Has performance impact been evaluated ?

    I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
    x2 HT box. The results are below:

    # clients      bandwidth (10^6 bits/sec)
                   base                patched
    1               67643 +-  725       64874 +-  353     (- 4.0 %)
    4              193585 +- 2516      186715 +- 1460     (- 3.5 %)
    8              194820 +-  377      187443 +- 1229     (- 3.7 %)

    So the accounting doesn't come for free - it takes ~4% of performance.
    I believe we could optimize it by using per cpu batching not only on
    charge, but also on uncharge in memcg core, but that's beyond the scope
    of this patch set - I'll take a look at this later.

    Anyway, if performance impact is found to be unacceptable, it is always
    possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
    or not use memory cgroups at runtime at all (thanks to jump labels
    there'll be no overhead even if they are compiled in).

    Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: "David S. Miller"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pipes can consume a significant amount of system memory, hence they
    should be accounted to kmemcg.

    This patch marks pipe_inode_info and anonymous pipe buffer page
    allocations as __GFP_ACCOUNT so that they would be charged to kmemcg.
    Note, since a pipe buffer page can be "stolen" and get reused for other
    purposes, including mapping to userspace, we clear PageKmemcg thus
    resetting page->_mapcount and uncharge it in anon_pipe_buf_steal, which
    is introduced by this patch.

    A note regarding the anon_pipe_buf_steal implementation. We allow
    stealing the page if its ref count equals 1. It looks racy, but it is
    correct for anonymous pipe buffer pages, because:

    - We lock out all other pipe users, because ->steal is called with
    pipe_lock held, so the page can't be spliced to another pipe from
    under us.

    - The page is not on LRU and it never was.

    - Thus a parallel thread can access it only by PFN. Although this is
    quite possible (e.g. see page_idle_get_page and balloon_page_isolate)
    this is not dangerous, because all such functions do is increase page
    ref count, check if the page is the one they are looking for, and
    decrease ref count if it isn't. Since our page is clean except for
    PageKmemcg mark, which doesn't conflict with other _mapcount users,
    the worst that can happen is we see page_count > 2 due to a transient
    ref, in which case we false-positively abort ->steal, which is still
    fine, because ->steal is not guaranteed to succeed.
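
    A sketch of the ->steal implementation described above:

        static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
                                       struct pipe_buffer *buf)
        {
                struct page *page = buf->page;

                if (page_count(page) == 1) {
                        if (memcg_kmem_enabled())
                                memcg_kmem_uncharge(page, 0);
                        __SetPageLocked(page);
                        return 0;       /* stolen */
                }
                return 1;               /* extra (maybe transient) ref: don't steal */
        }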

    Link: http://lkml.kernel.org/r/20160527150313.GD26059@esperanza
    Signed-off-by: Vladimir Davydov
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Page tables can bite a relatively big chunk off system memory and their
    allocations are easy to trigger from userspace, so they should be
    accounted to kmemcg.

    This patch marks page table allocations as __GFP_ACCOUNT for x86. Note
    we must not charge allocations of kernel page tables, because they can
    be shared among processes from different cgroups, so accounting them to
    a particular one could pin other cgroups indefinitely. So we clear the
    __GFP_ACCOUNT flag if a page table is allocated for the kernel.
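
    On the x86 side this amounts to roughly the following (sketch; only the
    pte-level kernel allocator is shown):

        #define PGALLOC_GFP (GFP_KERNEL_ACCOUNT | __GFP_NOTRACK | __GFP_ZERO)

        /* kernel page tables must not be charged to any cgroup */
        pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
        {
                return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
        }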

    Link: http://lkml.kernel.org/r/7d5c54f6a2bcbe76f03171689440003d87e6c742.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Page table pages are batched-freed in release_pages on most
    architectures. If we want to charge them to kmemcg (this is what is
    done later in this series), we need to teach mem_cgroup_uncharge_list to
    handle kmem pages.

    Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as it is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just like in case of kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it will be plus one gfp flags check on alloc and plus one
    page->_mapcount check on free, which shouldn't hurt performance, because
    the data accessed are hot.
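
    The hooks in the generic allocator paths are roughly these two fragments
    (sketch; placement within the functions is simplified):

        /* __alloc_pages_nodemask(), after a successful allocation */
        if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
            unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
                __free_pages(page, order);
                page = NULL;
        }

        /* free_pages_prepare(): uncharge pages marked as kmemcg */
        if (memcg_kmem_enabled() && PageKmemcg(page))
                memcg_kmem_uncharge(page, order);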

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov