15 May, 2019
3 commits
-
This updates each existing invalidation to use the correct mmu notifier
event that represents what is happening to the CPU page table. See the
patch which introduced the events for the rationale behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
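For illustration, this is roughly what a listener gains from the new
events: a driver's invalidate_range_start() callback can branch on
range->event (sketch only; my_invalidate_range_start is a hypothetical
listener):

	static int my_invalidate_range_start(struct mmu_notifier *mn,
			const struct mmu_notifier_range *range)
	{
		switch (range->event) {
		case MMU_NOTIFY_PROTECTION_VMA:
			/* mapping stays intact, only protection changes */
			break;
		default:
			/* MMU_NOTIFY_UNMAP and others: assume the pages
			 * in the range are going away */
			break;
		}
		return 0;
	}

-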
CPU page table updates can happen for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compaction, reclaim, migration,
...).

Users of the mmu notifier API track changes to the CPU page table and take
specific action for them. The current API only provides the range of virtual
addresses affected by the change, not why the change is happening.

This patchset does the initial mechanical conversion of all the places that
call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is known (most invalidations happen against
a given vma). Passing down the vma allows the users of the mmu notifier to
inspect the new vma page protection.

MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu notifier
should assume that everything in the range is going away when that event
happens. A later patch converts the mm call paths to use a more
appropriate event for each call.

This is done as two patches so that no call site is forgotten, especially
as it uses the following coccinelle patch:

%<----------------------------------------------------------------------
@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
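For illustration, the mechanical conversion turns a call such as the
following (sketch; variable names are hypothetical) from:

	mmu_notifier_range_init(&range, mm, start, end);

into:

	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma,
				mm, start, end);

-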
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

[willy@infradead.org: fix swapcache pages]
Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
[kirill@shutemov.name: hugetlb stores pages in page cache differently]
Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jan Kara
Reviewed-by: Kirill Shutemov
Reviewed-and-tested-by: Song Liu
Tested-by: William Kucharski
Reviewed-by: William Kucharski
Tested-by: Qian Cai
Cc: Hugh Dickins
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
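For illustration, after this change a lookup fetches the head page from
i_pages and computes the subpage itself, along these lines (a simplified
sketch of the kind of helper lookups use; not the exact code):

	static inline struct page *find_subpage(struct page *head, pgoff_t offset)
	{
		/* every slot of the THP now points at the head page */
		unsigned long mask = (1UL << compound_order(head)) - 1;

		return head + (offset & mask);
	}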
06 Mar, 2019
1 commit
-
Currently THP allocation events data is fairly opaque, since you can
only get it system-wide. This patch makes it easier to reason about
transparent hugepage behaviour on a per-memcg basis.

For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1,
which is used for v1's rss_huge [sic]. This is reused here as it's
fairly involved to untangle NR_ANON_THPS right now to make it per-memcg,
since right now some of this is delegated to rmap before we have any
memcg actually assigned to the page. It's a good idea to rework that,
but let's leave untangling THP allocation for a future patch.

[akpm@linux-foundation.org: fix build]
[chris@chrisdown.name: fix memcontrol build when THP is disabled]
Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name
Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name
Signed-off-by: Chris Down
Acked-by: Johannes Weiner
Cc: Tejun Heo
Cc: Roman Gushchin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Dec, 2018
1 commit
-
To avoid having to change many call sites every time we want to add a
parameter, use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end calls. No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Acked-by: Christian König
Acked-by: Jan Kara
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Felix Kuehling
Cc: Ralph Campbell
Cc: John Hubbard
From: Jérôme Glisse
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
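For illustration, the change has this shape at a call site (sketch; at
this point in the series the range carries only mm, start and end):

	struct mmu_notifier_range range;

	mmu_notifier_range_init(&range, mm, start, end);
	mmu_notifier_invalidate_range_start(&range);
	/* ... update the page table ... */
	mmu_notifier_invalidate_range_end(&range);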
04 Dec, 2018
1 commit
-
…k/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions
to their vanilla RCU counterparts. This series is a step
towards complete removal of the RCU-bh and RCU-sched update-side
functions. ( Note that some of these conversions are going upstream
via their respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for
rcutorture testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein
for a bag-on-head-class bug.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
01 Dec, 2018
7 commits
-
collapse_shmem()'s xas_nomem() is very unlikely to fail, but it is
rightly given a failure path, so move the whole xas_create_range() block
up before __SetPageLocked(new_page): so that it does not need to
remember to unlock_page(new_page).

Add the missing mem_cgroup_cancel_charge(), and set (currently unused)
result to SCAN_FAIL rather than SCAN_SUCCEED.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261531200.2275@eggly.anvils
Fixes: 77da9389b9d5 ("mm: Convert collapse_shmem to XArray")
Signed-off-by: Hugh Dickins
Cc: Matthew Wilcox
Cc: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
it holds the page lock of the first page, racing truncation then extension
might conceivably have inserted a hugepage there already. Fail with the
SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
otherwise mishandling the unexpected hugepage - though later we might
code up a more constructive way of handling it, with SCAN_SUCCESS.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Cc: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
khugepaged's collapse_shmem() does almost all of its work, to assemble
the huge new_page from 512 scattered old pages, with the new_page's
refcount frozen to 0 (and refcounts of all old pages so far also frozen
to 0). Including shmem_getpage() to read in any which were out on swap,
memory reclaim if necessary to allocate their intermediate pages, and
copying over all the data from old to new.

Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting
to spin while khugepaged is rescheduled.

One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.

The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong. So simply
eliminate the freezing of the new_page.

Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are
all unfrozen on success or failure: just as before. They could be
unfrozen sooner, but cause no problem once no longer visible to
find_get_entry(), filemap_map_pages() and other speculative lookups.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Several cleanups in collapse_shmem(): most of which probably do not
really matter, beyond doing things in a more familiar and reassuring
order. Simplify the failure gotos in the main loop, and on success
update stats while interrupts are still disabled from the last iteration.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
flags khugepaged uses to allocate a huge page - in all common cases it
would just be a waste of effort - so collapse_shmem() must remember to
clear out any holes that it instantiates.

The obvious place to do so, where they are put into the page cache tree,
is not a good choice: because interrupts are disabled there. Leave it
until further down, once success is assured, where the other pages are
copied (before setting PageUptodate).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
closed and unlinked.

khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
on the rollback path, after it had added but then needs to undo some
holes.

There is indeed an irritating asymmetry between shmem_charge(), whose
callers want it to increment nrpages after successfully accounting
blocks, and shmem_uncharge(), when __delete_from_page_cache() already
decremented nrpages itself: oh well, just add a comment on that to them
both.

And shmem_recalc_inode() is supposed to be called when the accounting is
expected to be in balance (so it can deduce from imbalance that reclaim
discarded some pages): so change shmem_charge() to update nrpages
earlier (though it's rare for the difference to matter at all).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Huge tmpfs testing showed that although collapse_shmem() recognizes a
concurrently truncated or hole-punched page correctly, its handling of
holes was liable to refill an emptied extent. Add a check to stop that.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Reviewed-by: Matthew Wilcox
Cc: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2018
1 commit
-
lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy
Cc: Andrew Morton
Cc: "Kirill A. Shutemov"
Cc: Yang Shi
Cc: Matthew Wilcox
Cc: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Jan Kara
Cc: Shakeel Butt
Cc:
Signed-off-by: Paul E. McKenney
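For illustration, the conversion has this shape (sketch; my_lock is a
hypothetical spinlock):

	/* before: only checks that *someone* holds the lock */
	VM_BUG_ON(!spin_is_locked(&my_lock));

	/* after: checks that the current thread holds it, and compiles
	 * away entirely when lockdep is disabled */
	lockdep_assert_held(&my_lock);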
21 Oct, 2018
2 commits
-
Slightly shorter and easier to read code.
Signed-off-by: Matthew Wilcox
-
I found another victim of the radix tree being hard to use. Because
there was no call to radix_tree_preload(), khugepaged was allocating
radix_tree_nodes using GFP_ATOMIC.

I also converted a local_irq_save()/restore() pair to
disable()/enable().

Signed-off-by: Matthew Wilcox
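For illustration, preloading is what avoids GFP_ATOMIC node allocations
in the radix tree API (sketch; lock, tree, index and item are
hypothetical, error handling elided):

	/* may sleep: fill the per-cpu node pool before taking the lock */
	radix_tree_preload(GFP_KERNEL);
	spin_lock_irq(&lock);
	radix_tree_insert(&tree, index, item);
	spin_unlock_irq(&lock);
	radix_tree_preload_end();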
30 Sep, 2018
1 commit
-
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries. This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry). It is also a change in emphasis; exceptional entries are
intimidating and different. As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.

Signed-off-by: Matthew Wilcox
Reviewed-by: Josef Bacik
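For illustration (sketch; xa and index are hypothetical):

	void *entry = xa_mk_value(42);	/* encode an integer as a value entry */

	xa_store(&xa, index, entry, GFP_KERNEL);
	entry = xa_load(&xa, index);
	if (xa_is_value(entry))
		pr_info("stored %lu\n", xa_to_value(entry));	/* 42 */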
24 Aug, 2018
1 commit
-
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that cleanup, the
return types of all other recursively called functions have been changed
to vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
changed to vm_fault_t type, but in a separate patch.

vmf_error() is the newly introduced inline function in 4.17-rc6.
[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder
Reviewed-by: Matthew Wilcox
Reviewed-by: Andrew Morton
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
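For illustration (sketch; my_fault and do_setup are hypothetical):

	static vm_fault_t my_fault(struct vm_fault *vmf)
	{
		int err = do_setup(vmf);	/* returns 0 or -errno */

		if (err)
			return vmf_error(err);	/* -errno -> VM_FAULT_* */
		return VM_FAULT_NOPAGE;
	}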
18 Aug, 2018
3 commits
-
khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
hugepage_vma_check(). The argument vm_flags contains the latest value.
Therefore, it is necessary to pass this vm_flags into
hugepage_vma_check().

With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to
put memory in huge pages. Here is an example of failed madvise():

/* mount /dev/shm with huge=advise:
 *     mount -o remount,huge=advise /dev/shm */
/* create file /dev/shm/huge */
#define HUGE_FILE "/dev/shm/huge"

fd = open(HUGE_FILE, O_RDONLY);
ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);

madvise() will return 0, but this memory region is never put in huge
pages (check from /proc/meminfo: ShmemHugePages).

Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
Signed-off-by: Song Liu
Reviewed-by: Rik van Riel
Reviewed-by: Yang Shi
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used
to record the counter of collapsed THP, but it only gets incremented in the
anonymous THP collapse path. Do this for the shmem THP collapse path too.

Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi
Acked-by: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When merging anonymous page vma, if the size of the vma can fit in at
least one hugepage, the mm will be registered for khugepaged for
collapsing THP in the future.

But it skips shmem vmas. Do so for shmem also, but not for file-private
mappings when merging a vma in order to increase the odds of collapsing
a hugepage via khugepaged.

hugepage_vma_check() sounds like a good fit to do the check. And move
the definition of it before khugepaged_enter_vma_merge() to avoid a
build error.

Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi
Acked-by: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Apr, 2018
3 commits
-
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org

Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jeff Layton
Cc: Darrick J. Wong
Cc: Dave Chinner
Cc: Ryusuke Konishi
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
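For illustration, the rename has this shape at a call site (sketch):

	/* before */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_insert(&mapping->page_tree, index, page);
	spin_unlock_irq(&mapping->tree_lock);

	/* after: the lock now lives inside the radix tree root */
	xa_lock_irq(&mapping->i_pages);
	radix_tree_insert(&mapping->i_pages, index, page);
	xa_unlock_irq(&mapping->i_pages);

-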
There was a regression report for "mm/cma: manage the memory of the CMA
area by using the ZONE_MOVABLE" [1] and I think that it is related to
this problem. CMA patchset makes the system use one more zone
(ZONE_MOVABLE) and then increases min_free_kbytes. It reduces usable
memory and it could cause a regression.

ZONE_MOVABLE only has movable pages, so we don't need to keep enough
freepages to avoid or deal with fragmentation. So, don't count it.

This changes min_free_kbytes and thus min_watermark greatly if
ZONE_MOVABLE is used. It will let the user use more memory.

System:
22GB ram, fakenuma, 2 nodes. 5 zones are used.

Before:
min_free_kbytes: 112640

zone_info (min_watermark):
Node 0, zone DMA
min 19
Node 0, zone DMA32
min 3778
Node 0, zone Normal
min 10191
Node 0, zone Movable
min 0
Node 0, zone Device
min 0
Node 1, zone DMA
min 0
Node 1, zone DMA32
min 0
Node 1, zone Normal
min 14043
Node 1, zone Movable
min 127
Node 1, zone Device
min 0

After:
min_free_kbytes: 90112

zone_info (min_watermark):
Node 0, zone DMA
min 15
Node 0, zone DMA32
min 3022
Node 0, zone Normal
min 8152
Node 0, zone Movable
min 0
Node 0, zone Device
min 0
Node 1, zone DMA
min 0
Node 1, zone DMA32
min 0
Node 1, zone Normal
min 11234
Node 1, zone Movable
min 102
Node 1, zone Device
min 0

[1] lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim
Cc: Michal Hocko
Cc: "Kirill A . Shutemov"
Cc: Mel Gorman
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm,
thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
We have used an explicit __GFP_NORETRY previously, which ruled out the OOM
killer automagically.

The memcg charge path should be semantically compliant with the allocation
path, and that means that if we do not trigger the OOM killer for costly
orders, we should do the same in the memcg charge path as well.
Otherwise we are forcing callers to distinguish the two and use
Otherwise we are forcing callers to distinguish the two and use
different gfp masks which is both non-intuitive and bug prone. As soon
as we get a costly high order kmalloc user we even do not have any means
to tell the memcg specific gfp mask to prevent from OOM because the
charging is deep within the guts of the slab allocator.

The unexpected memcg OOM on THP has already been fixed upstream by
9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
one-off fix rather than a generic solution. Teach mem_cgroup_oom to
bail out on costly order requests to fix the THP issue as well as any
other costly OOM eligible allocations to be added in future.

Also revert 9d3c3354bb85 because special gfp for THP is no longer
needed.

Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: Michal Hocko
Acked-by: Johannes Weiner
Cc: "Kirill A. Shutemov"
Cc: Vlastimil Babka
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
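For illustration, the bail-out has this shape (sketch of the idea, not
the exact hunk):

	static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask,
				   int order)
	{
		/* as in the page allocator: costly orders fail rather
		 * than OOM-kill */
		if (order > PAGE_ALLOC_COSTLY_ORDER)
			return;
		/* ... invoke the memcg OOM killer as before ... */
	}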
23 Mar, 2018
2 commits
-
Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations") changed the page allocator to no longer detect
thp allocations based on __GFP_NORETRY.

It did not, however, modify the mem cgroup try_charge() path to avoid
oom kill for either khugepaged collapsing or thp faulting. It is never
expected to oom kill a process to allocate a hugepage for thp; reclaim
is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
(and charging) should fall back instead of oom killing processes.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: David Rientjes
Cc: "Kirill A. Shutemov"
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
khugepaged is not yet able to convert PTE-mapped huge pages back to PMD
mapped. We do not collapse such pages. See the check in
khugepaged_scan_pmd().

But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
somebody managed to instantiate THP in the range and then split the PMD
back to PTEs, we would have a problem:
VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.

It's possible since we drop mmap_sem during collapse to re-take it for
write.

Replace the VM_BUG_ON() with a graceful collapse failure.
Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
Signed-off-by: Kirill A. Shutemov
Cc: Laura Abbott
Cc: Jerome Marchand
Cc: Vlastimil Babka
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
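For illustration, the graceful failure has this shape in
__collapse_huge_page_isolate() (sketch of the idea, not the exact hunk):

	/* was: VM_BUG_ON_PAGE(PageCompound(page), page); */
	if (PageCompound(page)) {
		result = SCAN_PAGE_COMPOUND;
		goto out;
	}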
01 Feb, 2018
2 commits
-
In the current design, khugepaged needs to acquire mmap_sem before
scanning an mm. But in some corner cases, khugepaged may scan a process
which is modifying its memory mapping, so khugepaged blocks in
uninterruptible state. But the process might hold the mmap_sem for a
long time when modifying a huge memory space and it may trigger the
below khugepaged hung issue:

INFO: task khugepaged:270 blocked for more than 120 seconds.
Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged D 0 270 2 0x00000000
ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
khugepaged+0x476/0x11d0
kthread+0xe6/0x100
ret_from_fork+0x25/0x30

So it sounds pointless to just block khugepaged waiting for the
semaphore. Replace down_read() with down_read_trylock() to move on to
scan the next mm quickly instead of blocking on the semaphore, so that
other processes can get more chances to install THP. Then khugepaged
can come back to scan the skipped mm when it has finished the current
round of full_scan.

And it appears that the change can improve khugepaged efficiency a
little bit.

Below is the test result when running LTP on a 24-core, 4GB-memory,
2-node NUMA VM:

                           pristine    w/ trylock
full_scan                       197           187
pages_collapsed                  21            26
thp_fault_alloc               40818         44466
thp_fault_fallback            18413         16679
thp_collapse_alloc               21           150
thp_collapse_alloc_failed        14            16
thp_file_alloc                  369           369

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: tweak comment]
[arnd@arndb.de: avoid uninitialized variable use]
Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@alibaba-inc.com
Signed-off-by: Yang Shi
Acked-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Cc: Hugh Dickins
Cc: Andrea Arcangeli
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
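For illustration, the scan loop change has this shape (sketch of the
idea, not the exact hunk):

	/* was: down_read(&mm->mmap_sem), which could block for a long time */
	if (!down_read_trylock(&mm->mmap_sem)) {
		/* contended: skip this mm, revisit it next full_scan */
		goto breakouterloop_mmap_sem;
	}

-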
Several users of unmap_mapping_range() would prefer to express their
range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you
have to remember to cast your page number to a 64-bit type before
shifting it, and four places in the current tree didn't remember to do
that. That's a sign of a bad interface.

Conveniently, unmap_mapping_range() actually converts from bytes into
pages, so hoist the guts of unmap_mapping_range() into a new function
unmap_mapping_pages() and convert the callers which want to use pages.

Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
Signed-off-by: Matthew Wilcox
Reported-by: "zhangyi (F)"
Reviewed-by: Ross Zwisler
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
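For illustration (sketch; pgoff and nr are hypothetical):

	/* before: easy to forget the 64-bit cast on 32-bit kernels */
	unmap_mapping_range(mapping, (loff_t)pgoff << PAGE_SHIFT,
			    (loff_t)nr << PAGE_SHIFT, 0);

	/* after: the range is expressed in pages directly */
	unmap_mapping_pages(mapping, pgoff, nr, false);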
30 Nov, 2017
1 commit
-
This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.
It was a nice cleanup in theory, but as Nicolai Stange points out, we do
need to make the page dirty for the copy-on-write case even when we
didn't end up making it writable, since the dirty bit is what we use to
check that we've gone through a COW cycle.

Reported-by: Michal Hocko
Acked-by: Kirill A. Shutemov
Signed-off-by: Linus Torvalds
28 Nov, 2017
1 commit
-
Currently we make page table entries dirty all the time regardless of
access type and don't even consider if the mapping is write-protected.
The reasoning is that we don't really need dirty tracking on THP and
making the entry dirty upfront may save some time on first write to the
page.

Unfortunately, such an approach may result in a false-positive
can_follow_write_pmd() for the huge zero page or a read-only shmem file.

Let's only make the page dirty if we are about to write to the page anyway
(as we do for small pages).

I've restructured the code to make the entry dirty inside
maybe_p[mu]d_mkwrite(). It also takes into account whether the vma is
write-protected.

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Cc: Hugh Dickins
Signed-off-by: Linus Torvalds
16 Nov, 2017
1 commit
-
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
and nr_pud.

The patch also makes nr_ptes accounting dependent on CONFIG_MMU. Page
table accounting doesn't make sense if you don't have page tables.

It's preparation for consolidation of page-table counters in mm_struct.
Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
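For illustration, the wrappers follow the nr_pmd/nr_pud style (sketch):

	static inline void mm_inc_nr_ptes(struct mm_struct *mm)
	{
		atomic_long_inc(&mm->nr_ptes);
	}

	static inline void mm_dec_nr_ptes(struct mm_struct *mm)
	{
		atomic_long_dec(&mm->nr_ptes);
	}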
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side-by-side results of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
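For illustration, the added line is simply the first line of each file,
in the comment style of that file, e.g.:

	// SPDX-License-Identifier: GPL-2.0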
11 Jul, 2017
1 commit
-
PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
existing mapping because it only updates mm->def_flags, which is a
template for new mappings.

The mappings created after prctl(PR_SET_THP_DISABLE) have the VM_NOHUGEPAGE
flag set. This can be quite surprising for all those applications which
do not do prctl(); fork() & exec() and want to control their own THP
behavior.

Another use case where the immediate semantic of the prctl might be useful
is a combination of pre- and post-copy migration of containers with
CRIU. In this case CRIU populates a part of a memory region with data
that was saved during the pre-copy stage. Afterwards, the region is
registered with userfaultfd and CRIU expects to get page faults for the
parts of the region that were not yet populated. However, khugepaged
collapses the pages and the expected page faults do not occur.

In the more general case, prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.

Implementation-wise, a new MMF_DISABLE_THP flag is added. This flag is
tested when the decision whether to use huge pages is taken, either during
page fault or at the time of THP collapse.

It should be noted that the new implementation makes PR_SET_THP_DISABLE a
master override of any per-VMA setting, which was not the case
previously.

Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Michal Hocko
Signed-off-by: Mike Rapoport
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Arnd Bergmann
Cc: "Kirill A. Shutemov"
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
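For illustration, a process opts out of THP like this (userspace sketch):

	#include <stdio.h>
	#include <sys/prctl.h>

	/* with this patch the effect is immediate and process-wide,
	 * not just a template for future mappings */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("prctl");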
24 Jun, 2017
1 commit
-
This is a partial revert of commit 338a16ba1549 ("mm, thp: copying user
pages must schedule on collapse") which added a cond_resched() to
__collapse_huge_page_copy().

On x86 with CONFIG_HIGHPTE, __collapse_huge_page_copy is called in
atomic context and thus scheduling is not possible. This is only a
possible config on arm and i386.

Although need_resched has been shown to be set for over 100 jiffies
while doing the iteration in __collapse_huge_page_copy, this is better
than doing

	if (in_atomic())
		cond_resched()

to cover only non-CONFIG_HIGHPTE configs.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706191341550.97821@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Reported-by: Larry Finger
Tested-by: Larry Finger
Acked-by: Michal Hocko
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 May, 2017
2 commits
-
We have encountered need_resched warnings in __collapse_huge_page_copy()
while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.

mm->mmap_sem is held for write, but the iteration is well bounded.
Reschedule as needed.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: "Kirill A. Shutemov"
Cc: Johannes Weiner
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
One return case of `__collapse_huge_page_swapin()` does not invoke the
tracepoint while every other return case does. This commit adds a
tracepoint invocation for the case.

Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.com
Signed-off-by: SeongJae Park
Cc: Kirill A. Shutemov
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 May, 2017
2 commits
-
From the discussion in [1], it seems that every PageFlags function
returns bool at this moment, so we don't need double negation any more.
Although it's not a problem to keep it, it makes future users confused
into using double negation for them, too. Remove that possibility.

[1] https://marc.info/?l=linux-kernel&m=148881578820434

Frankly speaking, I would like every PageFlags function to return bool
instead of int. It will make things clear. AFAIR, Chen Gang had tried it
but I don't know why it was not merged at that time.

http://lkml.kernel.org/r/1469336184-1904-1-git-send-email-chengang@emindsoft.com.cn
Link: http://lkml.kernel.org/r/1488868597-32222-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Kirill A. Shutemov
Cc: Chen Gang
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
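For illustration (sketch; any PageFlags test works the same way):

	/* before: double negation to squeeze the test into 0/1 */
	return !!PageSwapCache(page);

	/* after: PageFlags tests already return bool */
	return PageSwapCache(page);

-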
There are a few places where the code assumes anonymous pages should have
the SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are
going to add them to the LRU_INACTIVE_FILE list and clear the SwapBacked
flag for them. The assumption doesn't hold any more, so fix them.

Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li
Acked-by: Johannes Weiner
Acked-by: Hillf Danton
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Mar, 2017
2 commits
-
We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/coredump.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
-
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
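For illustration, the placeholder is as trivial as it sounds (sketch):

	/* include/linux/sched/mm.h -- placeholder for the upcoming split */
	#ifndef _LINUX_SCHED_MM_H
	#define _LINUX_SCHED_MM_H

	#include <linux/sched.h>

	#endif /* _LINUX_SCHED_MM_H */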