15 May, 2019

3 commits

  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.
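
    As an illustration, here is a sketch of an event-specific call after this
    series (assuming the post-series mmu_notifier_range_init() signature; the
    event chosen depends on the call path, e.g. MMU_NOTIFY_CLEAR,
    MMU_NOTIFY_PROTECTION_VMA, MMU_NOTIFY_SOFT_DIRTY, ...):

    /* mprotect()-style path: protection changes, the mapping stays */
    mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
                            vma, vma->vm_mm, start, end);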

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compaction, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions when they happen. However, the current API only
    provides the range of virtual addresses affected by the change, not why
    the change is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init() to also provide the default
    MMU_NOTIFY_UNMAP event, as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that everything in the range is going away when that event
    happens. A later patch converts mm call paths to use more appropriate
    events for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %<----------------------------------------------------------------------
    @@
    expression E1, E3, E4;
    identifier I1;
    @@
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, I1,
    I1->vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, NULL,
    E2, E3, E4)
    ...>
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place
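
    For reference, a minimal sketch of what the mechanical pass does to a
    call site (the exact arguments at any real site vary):

    /* before: mmu_notifier_range_init(&range, mm, start, end);
     * after, with the default event and the vma (or NULL) passed down: */
    mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, mm, start, end);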

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.
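
    A minimal sketch of the lookup convention this enables (illustrative
    index arithmetic, not the exact upstream helper):

    /* i_pages now holds the head page at each subpage index */
    struct page *page = xa_load(&mapping->i_pages, index);
    if (page && PageTransHuge(page))
            page += index & (HPAGE_PMD_NR - 1);     /* derive the subpage */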

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    [willy@infradead.org: fix swapcache pages]
    Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
    [kirill@shutemov.name: hugetlb stores pages in page cache differently]
    Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
    Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-and-tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Cc: Hugh Dickins
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Mar, 2019

1 commit

  • Currently, THP allocation event data is fairly opaque, since you can
    only get it system-wide. This patch makes it easier to reason about
    transparent hugepage behaviour on a per-memcg basis.

    For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1,
    which is used for v1's rss_huge [sic]. This is reused here as it's
    fairly involved to untangle NR_ANON_THPS right now to make it per-memcg,
    since some of this is delegated to rmap before we have any memcg
    actually assigned to the page. It's a good idea to rework that, but
    let's leave untangling THP allocation for a future patch.
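
    A sketch of the per-memcg accounting this enables (assumed shape of a
    THP fault-path call site; names follow mainline conventions):

    count_vm_event(THP_FAULT_ALLOC);                   /* system-wide, as before */
    count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC); /* now also per-memcg */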

    [akpm@linux-foundation.org: fix build]
    [chris@chrisdown.name: fix memcontrol build when THP is disabled]
    Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name
    Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
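
    A minimal sketch of the idea, with the fields grouped at this point in
    the series (later patches add more context to the structure):

    struct mmu_notifier_range {
            struct mm_struct *mm;
            unsigned long start;
            unsigned long end;
            bool blockable;
    };

    /* callers fill one struct instead of passing a long argument list */
    mmu_notifier_invalidate_range_start(&range);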

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

04 Dec, 2018

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU changes from Paul E. McKenney:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions
    to their vanilla RCU counterparts. This series is a step
    towards complete removal of the RCU-bh and RCU-sched update-side
    functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for
    rcutorture testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein
    for a bag-on-head-class bug.

    - RCU torture-test updates.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

01 Dec, 2018

7 commits

  • collapse_shmem()'s xas_nomem() is very unlikely to fail, but it is
    rightly given a failure path, so move the whole xas_create_range() block
    up before __SetPageLocked(new_page), so that the failure path does not
    need to remember to unlock_page(new_page).

    Add the missing mem_cgroup_cancel_charge(), and set the (currently
    unused) result to SCAN_FAIL rather than SCAN_SUCCEED.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261531200.2275@eggly.anvils
    Fixes: 77da9389b9d5 ("mm: Convert collapse_shmem to XArray")
    Signed-off-by: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
    it holds the page lock of the first page, racing truncation then
    extension might conceivably have inserted a hugepage there already. Fail
    with the SCAN_PAGE_COMPOUND result instead of crashing (CONFIG_DEBUG_VM=y)
    or otherwise mishandling the unexpected hugepage - though later we might
    code up a more constructive way of handling it, with SCAN_SUCCESS.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • khugepaged's collapse_shmem() does almost all of its work, to assemble
    the huge new_page from 512 scattered old pages, with the new_page's
    refcount frozen to 0 (and refcounts of all old pages so far also frozen
    to 0). Including shmem_getpage() to read in any which were out on swap,
    memory reclaim if necessary to allocate their intermediate pages, and
    copying over all the data from old to new.

    Imagine the frozen refcount as a spinlock held, but without any lock
    debugging to highlight the abuse: it's not good, and under serious load
    heads into lockups - speculative getters of the page are not expecting
    to spin while khugepaged is rescheduled.

    One can get a little further under load by hacking around elsewhere; but
    fortunately, freezing the new_page turns out to have been entirely
    unnecessary, with no hacks needed elsewhere.

    The huge new_page lock is already held throughout, and guards all its
    subpages as they are brought one by one into the page cache tree; and
    anything reading the data in that page, without the lock, before it has
    been marked PageUptodate, would already be in the wrong. So simply
    eliminate the freezing of the new_page.

    Each of the old pages remains frozen with refcount 0 after it has been
    replaced by a new_page subpage in the page cache tree, until they are
    all unfrozen on success or failure: just as before. They could be
    unfrozen sooner, but cause no problem once no longer visible to
    find_get_entry(), filemap_map_pages() and other speculative lookups.
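
    For reference, the generic freeze/unfreeze pattern being discussed (a
    sketch of the page-ref API, not the khugepaged code itself):

    if (!page_ref_freeze(page, 1))  /* succeeds only if refcount is exactly 1 */
            goto fail;
    /* refcount now reads as 0: speculative getters back off and retry */
    page_ref_unfreeze(page, 2);     /* republish with a known refcount */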

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Several cleanups in collapse_shmem(), most of which probably do not
    really matter beyond doing things in a more familiar and reassuring
    order. Simplify the failure gotos in the main loop, and on success
    update stats while interrupts are still disabled from the last iteration.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
    flags khugepaged uses to allocate a huge page - in all common cases it
    would just be a waste of effort - so collapse_shmem() must remember to
    clear out any holes that it instantiates.

    The obvious place to do so, where they are put into the page cache tree,
    is not a good choice, because interrupts are disabled there. Leave it
    until further down, once success is assured, where the other pages are
    copied (before setting PageUptodate).
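
    The fix, in sketch form (index arithmetic as described above, with the
    surrounding loop omitted):

    /* zero any subpage that backs a hole, before SetPageUptodate() */
    clear_highpage(new_page + (index % HPAGE_PMD_NR));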

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
    hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
    clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
    closed and unlinked.

    khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
    on the rollback path, after it had added, but then needed to undo, some
    holes.

    There is indeed an irritating asymmetry between shmem_charge(), whose
    callers want it to increment nrpages after successfully accounting
    blocks, and shmem_uncharge(), when __delete_from_page_cache() already
    decremented nrpages itself: oh well, just add a comment on that to them
    both.

    And shmem_recalc_inode() is supposed to be called when the accounting is
    expected to be in balance (so it can deduce from imbalance that reclaim
    discarded some pages): so change shmem_charge() to update nrpages
    earlier (though it's rare for the difference to matter at all).

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
    Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Huge tmpfs testing showed that although collapse_shmem() recognizes a
    concurrently truncated or hole-punched page correctly, its handling of
    holes was liable to refill an emptied extent. Add a check to stop that.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Matthew Wilcox
    Cc: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Nov, 2018

1 commit

  • lockdep_assert_held() is better suited to checking locking requirements,
    since it only checks if the current thread holds the lock regardless of
    whether someone else does. This is also a step towards possibly removing
    spin_is_locked().
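
    The conversion pattern, sketched on an illustrative lock:

    /* before: only proves that *somebody* holds the lock */
    VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
    /* after: proves the *current* thread holds it, and compiles away
     * when lockdep is disabled */
    lockdep_assert_held(&khugepaged_mm_lock);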

    Signed-off-by: Lance Roy
    Cc: Andrew Morton
    Cc: "Kirill A. Shutemov"
    Cc: Yang Shi
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Shakeel Butt
    Cc:
    Signed-off-by: Paul E. McKenney

    Lance Roy
     

21 Oct, 2018

2 commits


30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.
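
    A minimal sketch of the value-entry API this introduces:

    unsigned long val;
    void *entry = xa_mk_value(42);  /* encode a small integer as an entry */

    if (xa_is_value(entry))
            val = xa_to_value(entry);       /* decode: back to 42 */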

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

24 Aug, 2018

1 commit

  • Use the new return type vm_fault_t for fault handlers. For now, this is
    just documenting that the function returns a VM_FAULT value rather than
    an errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    type of all other recursively called functions has been changed to
    vm_fault_t as well.

    The places from where handle_mm_fault() is invoked will be changed to
    vm_fault_t in a separate patch.

    vmf_error() is the newly introduced inline function, added in 4.17-rc6.
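
    A sketch of the target shape (the handler and helper names here are
    hypothetical):

    static vm_fault_t my_fault(struct vm_fault *vmf)
    {
            int err = my_do_fault(vmf);     /* hypothetical helper, returns errno */

            if (err)
                    return vmf_error(err);  /* e.g. -ENOMEM becomes VM_FAULT_OOM */
            return VM_FAULT_NOPAGE;
    }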

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

18 Aug, 2018

3 commits

  • khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
    hugepage_vma_check(). The argument vm_flags contains the latest value.
    Therefore, it is necessary to pass this vm_flags into
    hugepage_vma_check().
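
    In sketch form, the fix is essentially this one-argument change:

    /* was: hugepage_vma_check(vma, vma->vm_flags) -- stale flags */
    if (!hugepage_vma_check(vma, vm_flags))
            return 0;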

    With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to
    put memory in huge pages. Here is an example of failed madvise():

    /* mount /dev/shm with huge=advise:
     * mount -o remount,huge=advise /dev/shm */
    /* create file /dev/shm/huge */
    #include <fcntl.h>
    #include <sys/mman.h>

    #define HUGE_FILE "/dev/shm/huge"
    #define FILE_SIZE (2UL << 20)   /* assumed size: one 2MB huge page */

    int fd = open(HUGE_FILE, O_RDONLY);
    void *ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
    int ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);

    madvise() will return 0, but this memory region is never put in huge
    page (check from /proc/meminfo: ShmemHugePages).

    Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
    Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
    Signed-off-by: Song Liu
    Reviewed-by: Rik van Riel
    Reviewed-by: Yang Shi
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used
    to record the counter of collapsed THP, but it only gets incremented in
    the anonymous THP collapse path; do this for shmem THP collapse too.

    Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When merging an anonymous page vma, if the size of the vma can fit at
    least one hugepage, the mm will be registered with khugepaged for
    collapsing THP in the future.

    But this currently skips shmem vmas. Do the same for shmem too (though
    not for file-private mappings) when merging a vma, in order to increase
    the odds of collapsing a hugepage via khugepaged.

    hugepage_vma_check() sounds like a good fit for doing the check. Also
    move its definition before khugepaged_enter_vma_merge() to avoid a
    build error.

    Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

12 Apr, 2018

3 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • There was a regression report for "mm/cma: manage the memory of the CMA
    area by using the ZONE_MOVABLE" [1] and I think that it is related to
    this problem. The CMA patchset makes the system use one more zone
    (ZONE_MOVABLE) and then increases min_free_kbytes. That reduces usable
    memory and could cause a regression.

    ZONE_MOVABLE only has movable pages, so we don't need to keep enough
    free pages there to avoid or deal with fragmentation. So, don't count it.

    This greatly changes min_free_kbytes and thus the min watermark when
    ZONE_MOVABLE is used, letting the user use more memory.

    System:
    22GB ram, fakenuma, 2 nodes. 5 zones are used.

    Before:
    min_free_kbytes: 112640

    zone_info (min_watermark):
    Node 0, zone DMA        min 19
    Node 0, zone DMA32      min 3778
    Node 0, zone Normal     min 10191
    Node 0, zone Movable    min 0
    Node 0, zone Device     min 0
    Node 1, zone DMA        min 0
    Node 1, zone DMA32      min 0
    Node 1, zone Normal     min 14043
    Node 1, zone Movable    min 127
    Node 1, zone Device     min 0

    After:
    min_free_kbytes: 90112

    zone_info (min_watermark):
    Node 0, zone DMA        min 15
    Node 0, zone DMA32      min 3022
    Node 0, zone Normal     min 8152
    Node 0, zone Movable    min 0
    Node 0, zone Device     min 0
    Node 1, zone DMA        min 0
    Node 1, zone DMA32      min 0
    Node 1, zone Normal     min 11234
    Node 1, zone Movable    min 102
    Node 1, zone Device     min 0

    [1] lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

    Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm,
    thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
    We previously used an explicit __GFP_NORETRY, which ruled out the OOM
    killer automagically.

    The memcg charge path should be semantically compliant with the
    allocation path, and that means that if we do not trigger the OOM killer
    for costly orders we should do the same in the memcg charge path as
    well. Otherwise we are forcing callers to distinguish the two and use
    different gfp masks, which is both non-intuitive and bug prone. As soon
    as we get a costly high order kmalloc user, we do not even have any
    means to tell the memcg-specific gfp mask to prevent OOM, because the
    charging is deep within the guts of the slab allocator.

    The unexpected memcg OOM on THP has already been fixed upstream by
    9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp"), but this is a
    one-off fix rather than a generic solution. Teach mem_cgroup_oom to
    bail out on costly order requests, to fix the THP issue as well as any
    other costly OOM-eligible allocations added in the future.

    Also revert 9d3c3354bb85, because the special gfp for THP is no longer
    needed.
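
    The core of the idea, sketched (exact placement and return convention in
    the real function may differ):

    /* memcg charge path: mirror the page allocator and never OOM-kill for
     * costly orders; the charge simply fails and the caller falls back */
    if (order > PAGE_ALLOC_COSTLY_ORDER)
            return;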

    Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2018

2 commits

  • Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
    madvised allocations") changed the page allocator to no longer detect
    thp allocations based on __GFP_NORETRY.

    It did not, however, modify the mem cgroup try_charge() path to avoid
    oom kill for either khugepaged collapsing or thp faulting. It is never
    expected to oom kill a process to allocate a hugepage for thp; reclaim
    is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
    (and charging) should fall back instead of oom killing processes.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • khugepaged is not yet able to convert PTE-mapped huge pages back to
    PMD-mapped, so we do not collapse such pages; see the check in
    khugepaged_scan_pmd().

    But if, between khugepaged_scan_pmd() and __collapse_huge_page_isolate(),
    somebody managed to instantiate a THP in the range and then split the
    PMD back to PTEs, we would have a problem:
    VM_BUG_ON_PAGE(PageCompound(page)) would get triggered.

    This is possible because we drop mmap_sem during collapse to re-take it
    for write.

    Replace the VM_BUG_ON() with a graceful collapse failure.
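
    Sketched, the graceful-fail check looks like this (using the
    SCAN_PAGE_COMPOUND result code; exact placement in
    __collapse_huge_page_isolate() omitted):

    if (PageCompound(page)) {
            result = SCAN_PAGE_COMPOUND;    /* fail the collapse, don't BUG */
            goto out;
    }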

    Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
    Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
    Signed-off-by: Kirill A. Shutemov
    Cc: Laura Abbott
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

01 Feb, 2018

2 commits

  • In the current design, khugepaged needs to acquire mmap_sem before
    scanning an mm. But in some corner cases, khugepaged may scan a process
    which is modifying its memory mapping, so khugepaged blocks in an
    uninterruptible state. The process might hold mmap_sem for a long time
    when modifying a huge memory space, which may trigger the khugepaged
    hang below:

    INFO: task khugepaged:270 blocked for more than 120 seconds.
    Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    khugepaged D 0 270 2 0x00000000 
    ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
    ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
    0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
    Call Trace:
    schedule+0x36/0x80
    rwsem_down_read_failed+0xf0/0x150
    call_rwsem_down_read_failed+0x18/0x30
    down_read+0x20/0x40
    khugepaged+0x476/0x11d0
    kthread+0xe6/0x100
    ret_from_fork+0x25/0x30

    So it seems pointless to just block khugepaged waiting for the
    semaphore. Replace down_read() with down_read_trylock() and move on to
    scan the next mm quickly instead of blocking, so that other processes
    get more chances to install THP. khugepaged can then come back to scan
    the skipped mm when it has finished the current round of full_scan.

    The change also appears to improve khugepaged efficiency a little bit.
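
    The change, in sketch form (labels as in khugepaged's scan loop):

    if (unlikely(!down_read_trylock(&mm->mmap_sem)))
            goto breakouterloop_mmap_sem;   /* skip this mm, revisit next round */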

    Below are the test results from running LTP on a 24-core, 4GB-memory,
    2-node NUMA VM:

                                  pristine    w/ trylock
    full_scan                          197           187
    pages_collapsed                     21            26
    thp_fault_alloc                  40818         44466
    thp_fault_fallback               18413         16679
    thp_collapse_alloc                  21           150
    thp_collapse_alloc_failed           14            16
    thp_file_alloc                     369           369

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: tweak comment]
    [arnd@arndb.de: avoid uninitialized variable use]
    Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Several users of unmap_mapping_range() would prefer to express their
    range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you
    have to remember to cast your page number to a 64-bit type before
    shifting it, and four places in the current tree didn't remember to do
    that. That's a sign of a bad interface.

    Conveniently, unmap_mapping_range() actually converts from bytes into
    pages, so hoist the guts of unmap_mapping_range() into a new function
    unmap_mapping_pages() and convert the callers which want to use pages.
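
    A sketch of the interface difference for a 32-bit-safe caller:

    /* old: bytes, with an easy-to-forget 64-bit cast on 32-bit kernels */
    unmap_mapping_range(mapping, (loff_t)pgoff << PAGE_SHIFT,
                        (loff_t)nr << PAGE_SHIFT, 0);
    /* new: pages, no cast needed */
    unmap_mapping_pages(mapping, pgoff, nr, false);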

    Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: "zhangyi (F)"
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

30 Nov, 2017

1 commit

  • This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.

    It was a nice cleanup in theory, but as Nicolai Stange points out, we do
    need to make the page dirty for the copy-on-write case even when we
    didn't end up making it writable, since the dirty bit is what we use to
    check that we've gone through a COW cycle.

    Reported-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit

  • Currently we make page table entries dirty all the time regardless of
    access type and don't even consider if the mapping is write-protected.
    The reasoning is that we don't really need dirty tracking on THP and
    making the entry dirty upfront may save some time on first write to the
    page.

    Unfortunately, such an approach may result in false positives from
    can_follow_write_pmd() for the huge zero page or a read-only shmem file.

    Let's make the page dirty only if we are about to write to the page
    anyway (as we do for small pages).

    I've restructured the code to make the entry dirty inside
    maybe_p[mu]d_mkwrite(). It also takes into account whether the vma is
    write-protected.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Nov, 2017

1 commit

  • Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
    and nr_pud.

    The patch also makes nr_ptes accounting dependent on CONFIG_MMU. Page
    table accounting doesn't make sense if you don't have page tables.

    It's preparation for consolidation of page-table counters in mm_struct.
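
    A sketch of the wrapper shape, consistent with the nr_pmd/nr_pud
    helpers (details may differ from the patch):

    static inline void mm_inc_nr_ptes(struct mm_struct *mm)
    {
            atomic_long_inc(&mm->nr_ptes);
    }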

    Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
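
    For a C source file, the tag is a single line at the top of the file:

    // SPDX-License-Identifier: GPL-2.0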

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be
    applied to a file was done in a spreadsheet of side-by-side results from
    the output of two independent scanners (ScanCode & Windriver) producing
    SPDX tag:value files created by Philippe Ombredanne. Philippe prepared
    the base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

11 Jul, 2017

1 commit

  • PR_SET_THP_DISABLE has rather subtle semantics. It doesn't affect any
    existing mapping because it only updates mm->def_flags, which is a
    template for new mappings.

    The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
    flag set. This can be quite surprising for all those applications which
    do not do prctl(); fork() & exec() and want to control their own THP
    behavior.

    Another usecase when the immediate semantic of the prctl might be useful
    is a combination of pre- and post-copy migration of containers with
    CRIU. In this case CRIU populates a part of a memory region with data
    that was saved during the pre-copy stage. Afterwards, the region is
    registered with userfaultfd and CRIU expects to get page faults for the
    parts of the region that were not yet populated. However, khugepaged
    collapses the pages and the expected page faults do not occur.

    In the more general case, prctl(PR_SET_THP_DISABLE) could be used as a
    temporary mechanism for enabling/disabling THP process-wide.

    Implementation-wise, a new MMF_DISABLE_THP flag is added. This flag is
    tested when the decision whether to use huge pages is made, either
    during page fault or at the time of THP collapse.

    It should be noted that the new implementation makes PR_SET_THP_DISABLE
    a master override of any per-VMA setting, which was not the case
    previously.
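
    The new check, sketched (MMF_DISABLE_THP is the flag this patch adds;
    the exact call sites vary):

    if (test_bit(MMF_DISABLE_THP, &mm->flags))
            return false;   /* THP disabled process-wide via prctl() */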

    Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
    Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

24 Jun, 2017

1 commit

  • This is a partial revert of commit 338a16ba1549 ("mm, thp: copying user
    pages must schedule on collapse") which added a cond_resched() to
    __collapse_huge_page_copy().

    On x86 with CONFIG_HIGHPTE, __collapse_huge_page_copy is called in
    atomic context and thus scheduling is not possible. This is only a
    possible config on arm and i386.

    Although need_resched has been shown to be set for over 100 jiffies
    while doing the iteration in __collapse_huge_page_copy, this is better
    than doing

    if (in_atomic())
    cond_resched()

    to cover only non-CONFIG_HIGHPTE configs.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706191341550.97821@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reported-by: Larry Finger
    Tested-by: Larry Finger
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 May, 2017

2 commits

  • We have encountered need_resched warnings in __collapse_huge_page_copy()
    while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.

    mm->mmap_sem is held for write, but the iteration is well bounded.

    Reschedule as needed.
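
    The shape of the fix, sketched (the real loop walks ptes alongside the
    subpages):

    for (i = 0; i < HPAGE_PMD_NR; i++) {
            copy_user_highpage(dst + i, src[i], address + i * PAGE_SIZE, vma);
            cond_resched();         /* bounded loop, mmap_sem held for write */
    }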

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • One return case of __collapse_huge_page_swapin() does not invoke the
    tracepoint, while every other return case does. This commit adds a
    tracepoint invocation for that case.

    Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.com
    Signed-off-by: SeongJae Park
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     

04 May, 2017

2 commits

  • Following the discussion [1], it seems that every PageFlags function
    returns bool at this moment, so we don't need double negation any more.
    Although it's not a problem to keep it, it could confuse future users
    into using double negation too.

    Remove that possibility.
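
    The cleanup, sketched:

    bool uptodate;

    uptodate = !!PageUptodate(page);  /* before: redundant double negation */
    uptodate = PageUptodate(page);    /* after: PageFlags already return bool */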

    [1] https://marc.info/?l=linux-kernel&m=148881578820434

    Frankly speaking, I would like every PageFlags function to return bool
    instead of int; it would make things clearer. AFAIR, Chen Gang tried
    that, but I don't know why it was not merged at the time.

    http://lkml.kernel.org/r/1469336184-1904-1-git-send-email-chengang@emindsoft.com.cn

    Link: http://lkml.kernel.org/r/1488868597-32222-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Chen Gang
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There are a few places where the code assumes anonymous pages should
    have the SwapBacked flag set. MADV_FREE pages are anonymous pages, but
    we are going to add them to the LRU_INACTIVE_FILE list and clear the
    SwapBacked flag for them. The assumption doesn't hold any more, so fix
    them.
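
    A sketch of the kind of assumption being fixed:

    /* previously assumed: anonymous implies swap-backed */
    VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
    /* MADV_FREE pages are PageAnon(page) && !PageSwapBacked(page), so
     * such checks must now tolerate that combination */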

    Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

02 Mar, 2017

2 commits

  • We are going to split <linux/sched/coredump.h> out of <linux/sched.h>,
    which will have to be picked up from other headers and a couple of .c
    files.

    Create a trivial placeholder <linux/sched/coredump.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar