13 Oct, 2018
1 commit
-
commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.
The page migration code employs try_to_unmap() to try and unmap the source
page. This is accomplished by using rmap_walk to find all vmas where the
page is mapped. This search stops when page mapcount is zero. For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings. Shared mappings are tracked via the reference count of the PMD
page. Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page. Hence, data is lost.

This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors. DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages. A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page. After this, flush caches and TLB.

mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked. Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.

Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz
Acked-by: Kirill A. Shutemov
Reviewed-by: Naoya Horiguchi
Acked-by: Michal Hocko
Cc: Vlastimil Babka
Cc: Davidlohr Bueso
Cc: Jerome Glisse
Cc: Mike Kravetz
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Greg Kroah-Hartman
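
The "prepare for the worst possible case" step in the hugetlb PMD sharing fix above can be pictured with a small userspace model. This is only a sketch under stated assumptions: struct vma_model, range_in_vma_model(), widen_range_for_pmd_sharing() and the 1GB PUD_SIZE_MODEL constant are names made up for illustration, not the kernel's, and the real code depends on hugetlb and VMA state that is left out here.

```c
#include <stdbool.h>

/* Assumed for illustration: 1GB PUD span, as on x86. */
#define PUD_SIZE_MODEL (1UL << 30)
#define PUD_MASK_MODEL (~(PUD_SIZE_MODEL - 1))

/* Hypothetical stand-in for a VMA: [start, end) plus a "shared" flag. */
struct vma_model {
	unsigned long start;
	unsigned long end;
	bool may_share;
};

/* True if [start, end) lies entirely inside the VMA. */
bool range_in_vma_model(const struct vma_model *vma,
			unsigned long start, unsigned long end)
{
	return start >= vma->start && end <= vma->end;
}

/*
 * Widen [*start, *end) to PUD boundaries wherever a whole PUD-aligned,
 * PUD-sized block fits in the VMA, since only such blocks can use a
 * shared PMD page. Ranges that cannot be shared are left untouched.
 */
void widen_range_for_pmd_sharing(const struct vma_model *vma,
				 unsigned long *start, unsigned long *end)
{
	unsigned long addr;

	if (!vma->may_share)
		return;

	for (addr = *start; addr < *end; addr += PUD_SIZE_MODEL) {
		unsigned long a_start = addr & PUD_MASK_MODEL;
		unsigned long a_end = a_start + PUD_SIZE_MODEL;

		if (range_in_vma_model(vma, a_start, a_end)) {
			if (a_start < *start)
				*start = a_start;
			if (a_end > *end)
				*end = a_end;
		}
	}
}
```

In this model a notifier range covering a single 2MB huge page inside a 2GB shared mapping grows to the enclosing 1GB block, because unsharing the PMD page would affect every mapping in that block.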
17 Jul, 2018
1 commit
-
commit bce73e4842390f7b7309c8e253e139db71288ac3 upstream.
KVM guests on s390 can notify the host of unused pages. This can result
in pte_unused callbacks returning true for KVM guest memory.

If a page is unused (checked with pte_unused) we might drop this page
instead of paging it. This can have side-effects on userfaultfd, when
the page in question was already migrated:

The next access of that page will trigger a fault and a user fault
instead of faulting in a new and empty zero page. As QEMU does not
expect a userfault on an already migrated page this migration will fail.

The most straightforward solution is to ignore the pte_unused hint if a
userfault context is active for this VMA.

Link: http://lkml.kernel.org/r/20180703171854.63981-1-borntraeger@de.ibm.com
Signed-off-by: Christian Borntraeger
Cc: Martin Schwidefsky
Cc: Andrea Arcangeli
Cc: Mike Rapoport
Cc: Janosch Frank
Cc: David Hildenbrand
Cc: Cornelia Huck
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
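
The decision described in the s390 pte_unused fix above reduces to one condition. A minimal sketch follows, assuming a simplified state struct; struct unmap_state and may_drop_unused_page() are hypothetical names used only for illustration and are not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the state the check depends on. */
struct unmap_state {
	bool pte_unused;      /* guest told the host the page is unused */
	bool userfault_armed; /* a userfault context is active on the VMA */
};

/*
 * Returns true when the page may simply be dropped instead of being
 * paged out. The unused hint is ignored while a userfault context is
 * active, so the page is paged out as usual in that case.
 */
bool may_drop_unused_page(const struct unmap_state *s)
{
	return s->pte_unused && !s->userfault_armed;
}
```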
09 Sep, 2017
4 commits
-
Allow interval trees to quickly check for overlaps to avoid unnecessary
tree lookups in interval_tree_iter_first().

As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available. While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().

[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Signed-off-by: Jérôme Glisse
Acked-by: Christian König
Acked-by: Peter Zijlstra (Intel)
Acked-by: Doug Ledford
Acked-by: Michael S. Tsirkin
Cc: David Airlie
Cc: Jason Wang
Cc: Christian Benvenuti
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Allow unmapping and restoring the special swap entry of un-addressable
ZONE_DEVICE memory.

Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Cc: Kirill A. Shutemov
Cc: Aneesh Kumar
Cc: Balbir Singh
Cc: Benjamin Herrenschmidt
Cc: Dan Williams
Cc: David Nellans
Cc: Evgeny Baskakov
Cc: Johannes Weiner
Cc: John Hubbard
Cc: Mark Hairgrove
Cc: Michal Hocko
Cc: Paul E. McKenney
Cc: Ross Zwisler
Cc: Sherry Cheung
Cc: Subhash Gutti
Cc: Vladimir Davydov
Cc: Bob Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add thp migration's core code, including conversions between a PMD entry
and a swap entry, setting PMD migration entry, removing PMD migration
entry, and waiting on PMD migration entries.

This patch makes it possible to support thp migration. If you fail to
allocate a destination page as a thp, you just split the source thp as
we do now, and then enter the normal page migration. If you succeed in
allocating a destination thp, you enter thp migration. Subsequent patches
actually enable thp migration for each caller of page migration by
allowing its get_new_page() callback to allocate thps.

[zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
[akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
Signed-off-by: Zi Yan
Acked-by: Kirill A. Shutemov
Cc: "H. Peter Anvin"
Cc: Anshuman Khandual
Cc: Dave Hansen
Cc: David Nellans
Cc: Ingo Molnar
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Naoya Horiguchi
Cc: Thomas Gleixner
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
TTU_MIGRATION is used to convert pte into migration entry until thp
split completes. This behavior conflicts with thp migration added in later
patches, so let's introduce a new TTU flag specifically for freezing.

try_to_unmap() is used both for thp split (via freeze_page()) and page
migration (via __unmap_and_move()). In freeze_page(), ttu_flag given
for head page is like below (assuming anonymous thp):

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

and ttu_flag given for tail pages is:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
     TTU_MIGRATION)

__unmap_and_move() calls try_to_unmap() with ttu_flag:

    (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

Now I'm trying to insert a branch for thp migration at the top of
try_to_unmap_one() like below

    static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                                unsigned long address, void *arg)
    {
            ...
            /* PMD-mapped THP migration entry */
            if (!pvmw.pte && (flags & TTU_MIGRATION)) {
                    if (!PageAnon(page))
                            continue;

                    set_pmd_migration_entry(&pvmw, page);
                    continue;
            }
            ...
    }

so try_to_unmap() for tail pages called by thp split can go into thp
migration code path (which converts *pmd* into migration entry), while
the expectation is to freeze thp (which converts *pte* into migration
entry.)

I detected this failure as a "bad page state" error in a testcase where
split_huge_page() is called from queue_pages_pte_range().

Link: http://lkml.kernel.org/r/20170717193955.20207-4-zi.yan@sent.com
Signed-off-by: Naoya Horiguchi
Signed-off-by: Zi Yan
Acked-by: Kirill A. Shutemov
Cc: "H. Peter Anvin"
Cc: Anshuman Khandual
Cc: Dave Hansen
Cc: David Nellans
Cc: Ingo Molnar
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Thomas Gleixner
Cc: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
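
Why a separate freeze flag removes the ambiguity in the TTU_MIGRATION commit above can be shown with a toy dispatcher. This is a sketch under assumptions: the flag values, the enum names and choose_unmap_path() are illustrative only, and the model folds the kernel's "split the PMD first" step for the freeze case into a single branch.

```c
#include <stdbool.h>

/* Illustrative flag values only; the kernel's actual values may differ. */
enum ttu_model_flags {
	TTU_MODEL_MIGRATION    = 1 << 0, /* page migration */
	TTU_MODEL_SPLIT_FREEZE = 1 << 1, /* freezing for thp split */
};

enum unmap_path {
	PATH_PMD_MIGRATION_ENTRY, /* convert the *pmd* to a migration entry */
	PATH_PTE_MIGRATION_ENTRY, /* convert the *pte* to a migration entry */
	PATH_PLAIN_UNMAP,
};

/*
 * Pick the unmap path for one mapping. pmd_mapped is true when the walk
 * found a PMD mapping and no pte. In this model only a real migration
 * request may take the PMD migration path; a freeze request always ends
 * up installing pte migration entries.
 */
enum unmap_path choose_unmap_path(unsigned int flags, bool pmd_mapped)
{
	if (pmd_mapped && (flags & TTU_MODEL_MIGRATION))
		return PATH_PMD_MIGRATION_ENTRY;
	if (flags & (TTU_MODEL_MIGRATION | TTU_MODEL_SPLIT_FREEZE))
		return PATH_PTE_MIGRATION_ENTRY;
	return PATH_PLAIN_UNMAP;
}
```

With a single shared flag the first branch cannot tell a thp split apart from a migration, which is the conflict the commit message describes.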
01 Sep, 2017
1 commit
-
Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
and make sure it is bracketed by calls to *_invalidate_range_start()/end().

Note that because we can not presume the pmd value or pte value we have
to assume the worst and unconditionally report an invalidation as
happening.

Changed since v2:
- try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
- compute end with PAGE_SIZE << compound_order(page)
- fix PageHuge() case in try_to_unmap_one()

Signed-off-by: Jérôme Glisse
Reviewed-by: Andrea Arcangeli
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Bernhard Held
Cc: Adam Borowski
Cc: Radim Krčmář
Cc: Wanpeng Li
Cc: Paolo Bonzini
Cc: Takashi Iwai
Cc: Nadav Amit
Cc: Mike Galbraith
Cc: Kirill A. Shutemov
Cc: axie
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
30 Aug, 2017
1 commit
-
This reverts commit aac2fea94f7a3df8ad1eeb477eb2643f81fd5393.

It turns out that that patch was complete and utter garbage, and broke
KVM, resulting in odd oopses.

Quoting Andrea Arcangeli:

"The aforementioned commit has 3 bugs.

1) mmu_notifier_invalidate_range cannot be used in replacement of
mmu_notifier_invalidate_range_start/end.

For KVM mmu_notifier_invalidate_range is a noop and rightfully so.
A MMU notifier implementation has to implement either
->invalidate_range method or the invalidate_range_start/end
methods, not both. And if you implement invalidate_range_start/end
like KVM is forced to do, calling mmu_notifier_invalidate_range in
common code is a noop for KVM.

For those MMU notifiers that can get away only implementing
->invalidate_range, the ->invalidate_range is implicitly called by
mmu_notifier_invalidate_range_end(). And only those secondary MMUs
that share the same pagetable with the primary MMU (like AMD
iommuv2) can get away only implementing ->invalidate_range.

So all cases (THP on/off) are broken right now.

To fix this is enough to replace mmu_notifier_invalidate_range with
mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end.
Either that or call multiple mmu_notifier_invalidate_page like
before.

2) address + (1UL << compound_order(page) is buggy, it should be
PAGE_SIZE << compound_order(page), it's bytes not pages, 2M not
512.

3) The whole invalidate_range thing was an attempt to call a single
invalidate while walking multiple 4k ptes that maps the same THP
(after a pmd virtual split without physical compound page THP
split).

It's unclear if the rmap_walk will always provide an address that
is 2M aligned as parameter to try_to_unmap_one, in presence of THP.
I think it needs also an address &= (PAGE_SIZE <<
compound_order(page)) - 1 to be safe"

In general, we should stop making excuses for horrible MMU notifier
users. It's much more important that the core VM is sane and safe, than
letting MMU notifiers sleep.

So if some MMU notifier is sleeping under a spinlock, we need to fix the
notifier, not try to make excuses for that garbage in the core VM.

Reported-and-tested-by: Bernhard Held
Reported-and-tested-by: Adam Borowski
Cc: Andrea Arcangeli
Cc: Radim Krčmář
Cc: Wanpeng Li
Cc: Paolo Bonzini
Cc: Takashi Iwai
Cc: Nadav Amit
Cc: Mike Galbraith
Cc: Kirill A. Shutemov
Cc: Jérôme Glisse
Cc: axie
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
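
The "bytes not pages, 2M not 512" point in bug 2) of the revert above is plain arithmetic, worked through below. The sketch assumes 4KB base pages; PAGE_SIZE_MODEL and the three helper names are made up for illustration. The rounding helper uses `& ~(size - 1)`, on the assumption that the intent of bug 3) is to round the address down to the start of the compound page.

```c
/* Assumed for illustration: 4KB base pages, as on x86. */
#define PAGE_SIZE_MODEL 4096UL

/* Size in bytes of a compound page of the given order. */
unsigned long compound_bytes(unsigned int order)
{
	return PAGE_SIZE_MODEL << order;
}

/* The buggy form: this yields a page count, not a byte length. */
unsigned long compound_pages(unsigned int order)
{
	return 1UL << order;
}

/* Round an address down to the start of its compound page. */
unsigned long compound_start(unsigned long address, unsigned int order)
{
	return address & ~(compound_bytes(order) - 1);
}
```

For an order-9 THP, compound_bytes(9) is 2097152 (2M) while compound_pages(9) is 512, so adding the latter to an address covers only 512 bytes of the range that must be invalidated.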
11 Aug, 2017
1 commit
-
MMU notifiers can sleep, but in page_mkclean_one() we call
mmu_notifier_invalidate_page() under page table lock.

Let's instead use mmu_notifier_invalidate_range() outside
page_vma_mapped_walk() loop.

[jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl]
Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com
Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Jérôme Glisse
Reported-by: axie
Cc: Alex Deucher
Cc: "Writer, Tim"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Aug, 2017
1 commit
-
Nadav Amit identified a theoretical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.

He described the race as follows:

    CPU0                            CPU1
    ----                            ----
                                    user accesses memory using RW PTE
                                    [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                    mprotect(addr, PROT_READ)
                                    ==> change_pte_range()
                                    ==> [ PTE non-present - no flush ]

                                    user writes using cached RW PTE
    ...

    try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.

For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access. For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB. However, it's
theoretically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch. Other users such as gup as a race with
reclaim looks just at PTEs. huge page variants should be ok as they
don't race with reclaim. mincore only looks at PTEs. userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.

Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected. His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.

[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit
Signed-off-by: Mel Gorman
Cc: Andy Lutomirski
Cc: [v4.4+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
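
The "flush the mm if reclaim is potentially batching" idea from the TLB race fix above can be modelled with a flag and a counter. This is a single-threaded sketch: struct mm_model and both function names are hypothetical, a counter stands in for the real TLB flush, and the memory ordering a real implementation needs is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state used by the fix. */
struct mm_model {
	bool tlb_flush_batched;    /* reclaim has deferred a flush for this mm */
	unsigned int full_flushes; /* counts flushes, for demonstration only */
};

/* Reclaim side: a pte was cleared and its flush deferred to a batch. */
void reclaim_defer_flush(struct mm_model *mm)
{
	mm->tlb_flush_batched = true;
}

/*
 * mprotect/munmap/mremap/madvise side: before relying on "pte not
 * present, so no flush needed", flush the whole mm if reclaim may
 * still have a batched flush outstanding.
 */
void flush_if_batched_pending(struct mm_model *mm)
{
	if (mm->tlb_flush_batched) {
		mm->full_flushes++; /* stands in for a full mm flush */
		mm->tlb_flush_batched = false;
	}
}
```

In the race diagram, CPU1 would call flush_if_batched_pending() on entry to change_pte_range(), which removes the stale RW entry before userspace can write through it.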
07 Jul, 2017
2 commits
-
lruvecs are at the intersection of the NUMA node and memcg, which is the
scope for most paging activity.

Introduce a convenient accounting infrastructure that maintains
statistics per node, per memcg, and the lruvec itself.

Then convert over accounting sites for statistics that are already
tracked in both nodes and memcgs and can be easily switched.

[hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
[hannes@cmpxchg.org: don't track uncharged pages at all]
Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
[hannes@cmpxchg.org: add missing free_percpu()]
Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
[linux@roeck-us.net: hexagon: fix build error caused by include file order]
Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Signed-off-by: Guenter Roeck
Acked-by: Vladimir Davydov
Cc: Josef Bacik
Cc: Michal Hocko
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Using set_pte_at() does not do the right thing when putting down
HWPOISON swap entries for hugepages on architectures that support
contiguous ptes.

Fix this problem by using set_huge_swap_pte_at() which was introduced to
fix exactly this problem.

Link: http://lkml.kernel.org/r/20170522133604.11392-7-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal
Acked-by: Steve Capper
Cc: "Kirill A. Shutemov"
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Naoya Horiguchi
Cc: Mike Kravetz
Cc: Mark Rutland
Cc: Hillf Danton
Cc: Aneesh Kumar K.V
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 May, 2017
1 commit
-
try_to_unmap_flush() used to open-code a rather x86-centric flush
sequence: local_flush_tlb() + flush_tlb_others(). Rearrange the
code so that the arch (only x86 for now) provides
arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code
calls those functions instead.

I'll want this for x86 because, to enable address space ids, I can't
support the flush_tlb_others() mode used by the existing
try_to_unmap_flush() implementation with good performance. I can
support the new API fairly easily, though.

I imagine that other architectures may be in a similar position.
Architectures with strong remote flush primitives (arm64?) may have
even worse performance problems with flush_tlb_others() the way that
try_to_unmap_flush() uses it.

Signed-off-by: Andy Lutomirski
Acked-by: Kees Cook
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Nadav Amit
Cc: Nadav Amit
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Sasha Levin
Cc: Thomas Gleixner
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/19f25a8581f9fb77876b7ff3b001f89835e34ea3.1495492063.git.luto@kernel.org
Signed-off-by: Ingo Molnar
11 May, 2017
1 commit
-
Pull RCU updates from Ingo Molnar:
"The main changes are:

- Debloat RCU headers
- Parallelize SRCU callback handling (plus overlapping patches)
- Improve the performance of Tree SRCU on a CPU-hotplug stress test
- Documentation updates
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
rcu: Open-code the rcu_cblist_n_lazy_cbs() function
rcu: Open-code the rcu_cblist_n_cbs() function
rcu: Open-code the rcu_cblist_empty() function
rcu: Separately compile large rcu_segcblist functions
srcu: Debloat the header
srcu: Adjust default auto-expediting holdoff
srcu: Specify auto-expedite holdoff time
srcu: Expedite first synchronize_srcu() when idle
srcu: Expedited grace periods with reduced memory contention
srcu: Make rcutorture writer stalls print SRCU GP state
srcu: Exact tracking of srcu_data structures containing callbacks
srcu: Make SRCU be built by default
srcu: Fix Kconfig botch when SRCU not selected
rcu: Make non-preemptive schedule be Tasks RCU quiescent state
srcu: Expedite srcu_schedule_cbs_snp() callback invocation
srcu: Parallelize callback handling
kvm: Move srcu_struct fields to end of struct kvm
rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
rcu: Use true/false in assignment to bool
rcu: Use bool value directly
...
04 May, 2017
16 commits
-
The memory controller's stat function names are awkwardly long and
arbitrarily different from the zone and node stat functions.

The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

This patch renames it to match the corresponding node stat functions:

    memcg_page_state()        [node_page_state()]
    mod_memcg_state()         [mod_node_state()]
    inc_memcg_state()         [inc_node_state()]
    dec_memcg_state()         [dec_node_state()]
    mod_memcg_page_state()    [mod_node_page_state()]
    inc_memcg_page_state()    [inc_node_page_state()]
    dec_memcg_page_state()    [dec_node_page_state()]

Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The current duplication is a high-maintenance mess, and it's painful to
add new items or query memcg state from the rest of the VM.

This increases the size of the stat array marginally, but we should aim
to track all these stats on a per-cgroup level anyway.

Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Vladimir Davydov
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There is no user for it. Remove it.
[minchan@kernel.org: use false instead of SWAP_FAIL]
Link: http://lkml.kernel.org/r/20170316053313.GA19241@bbox
Link: http://lkml.kernel.org/r/1489555493-14659-11-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
rmap_one's return value controls whether rmap_walk should continue to
scan other ptes or not, so it's a target for changing to boolean. Return
true if the scan should be continued. Otherwise, return false to stop
the scanning.

This patch changes rmap_one's return value to boolean.

Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There is no user of the return value from rmap_walk() and friends so
this patch makes them void-returning functions.

Link: http://lkml.kernel.org/r/1489555493-14659-9-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for
boolean return. This patch changes it.

Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Naoya Horiguchi
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In 2002, [1] introduced SWAP_AGAIN. At that time, try_to_unmap_one used
spin_trylock(&mm->page_table_lock), so it was really easy to contend and
fail to take the lock, and returning SWAP_AGAIN to keep the LRU status
made sense.

However, we have since changed it to a mutex-based lock and are able to
block without skipping the pte, so there is only a small window in which
SWAP_AGAIN would be returned. Remove SWAP_AGAIN and just return SWAP_FAIL.

[1] c48c43e, minimal rmap
Link: http://lkml.kernel.org/r/1489555493-14659-7-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL
because it means the page is not swappable, so it should move to another
LRU list (active or unevictable). The putback friends will move it to the
right list depending on the page's LRU flag.

Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
try_to_munlock returns SWAP_MLOCK if one of the VMAs mapping the page has
the VM_LOCKED flag. At that time, the VM sets PG_mlocked on the page if the
page is not a pte-mapped THP, which cannot be mlocked either.

With that, __munlock_isolated_page can use PageMlocked to check whether
try_to_munlock is successful or not without relying on try_to_munlock's
retval. It helps to make try_to_unmap/try_to_unmap_one simple with
upcoming patches.

[minchan@kernel.org: remove PG_Mlocked VM_BUG_ON check]
Link: http://lkml.kernel.org/r/20170411025615.GA6545@bbox
Link: http://lkml.kernel.org/r/1489555493-14659-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Kirill A. Shutemov
Acked-by: Vlastimil Babka
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
If the page is mapped and rescued in try_to_unmap_one, the
page_mapcount() of a page cannot be zero, so the page_mapcount check in
try_to_unmap is enough to return SWAP_SUCCESS. IOW, SWAP_MLOCK check is
redundant so remove it.

Link: http://lkml.kernel.org/r/1489555493-14659-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Cc: Anshuman Khandual
Cc: Hillf Danton
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
If we find that a lazyfree page is dirty, try_to_unmap_one can just
SetPageSwapBacked in there like PG_mlocked page and just return with
SWAP_FAIL which is very natural because the page is not swappable right
now so that vmscan can activate it. There is no point to introduce new
return value SWAP_DIRTY in try_to_unmap at the moment.

Link: http://lkml.kernel.org/r/1489555493-14659-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Hillf Danton
Acked-by: Kirill A. Shutemov
Cc: Anshuman Khandual
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Nobody uses ret variable. Remove it.
Link: http://lkml.kernel.org/r/1489555493-14659-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Hillf Danton
Acked-by: Kirill A. Shutemov
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Kirill A. Shutemov
Cc: Anshuman Khandual
Cc: Vlastimil Babka
Cc: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
If a page is swapbacked, it means it should be in swapcache in
try_to_unmap_one's path.

If a page is !swapbacked, it means it shouldn't be in swapcache in
try_to_unmap_one's path.

Check both cases at once and if the check fails, warn and return
SWAP_FAIL. Such a bug never means we should shut down the kernel.

[minchan@kernel.org: do not use VM_WARN_ON_ONCE as if condition]
Link: http://lkml.kernel.org/r/20170309060226.GB854@bbox
Link: http://lkml.kernel.org/r/20170307055551.GC29458@bbox
Signed-off-by: Minchan Kim
Suggested-by: Johannes Weiner
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Cc: Shaohua Li
Cc: Hillf Danton
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When memory pressure is high, we free MADV_FREE pages. If the pages are
not dirty in pte, the pages could be freed immediately. Otherwise we
can't reclaim them. We put the pages back to anonymous LRU list (by
setting SwapBacked flag) and the pages will be reclaimed in normal
swapout way.

We use normal page reclaim policy. Since MADV_FREE pages are put into
inactive file list, such pages and inactive file pages are reclaimed
according to their age. This is expected, because we don't want to
reclaim too many MADV_FREE pages before used once pages.

Based on Minchan's original patch
[minchan@kernel.org: clean up lazyfree page handling]
Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li
Signed-off-by: Minchan Kim
Acked-by: Minchan Kim
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Acked-by: Hillf Danton
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There are a few places the code assumes anonymous pages should have
SwapBacked flag set. MADV_FREE pages are anonymous pages but we are
going to add them to LRU_INACTIVE_FILE list and clear SwapBacked flag
for them. The assumption doesn't hold any more, so fix them.

Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li
Acked-by: Johannes Weiner
Acked-by: Hillf Danton
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Patch series "mm: fix some MADV_FREE issues", v5.

We are trying to use MADV_FREE in jemalloc. Several issues are found.
Without solving the issues, jemalloc can't use the MADV_FREE feature.

- Doesn't support system without swap enabled. Because if swap is off,
  we can't or can't efficiently age anonymous pages. And since
  MADV_FREE pages are mixed with other anonymous pages, we can't
  reclaim MADV_FREE pages. In current implementation, MADV_FREE will
  fallback to MADV_DONTNEED without swap enabled. But in our
  environment, a lot of machines don't enable swap. This will prevent
  our setup using MADV_FREE.

- Increases memory pressure. page reclaim bias file pages reclaim
  against anonymous pages. This doesn't make sense for MADV_FREE pages,
  because those pages could be freed easily and refilled with very
  slight penalty. Even if page reclaim doesn't bias file pages, there is
  still an issue, because MADV_FREE pages and other anonymous pages are
  mixed together. To reclaim a MADV_FREE page, we probably must scan a
  lot of other anonymous pages, which is inefficient. In our test, we
  usually see oom with MADV_FREE enabled and nothing without it.

- Accounting. There are two accounting problems. We don't have a global
  accounting. If the system is abnormal, we don't know if it's a
  problem from MADV_FREE side. The other problem is RSS accounting.
  MADV_FREE pages are accounted as normal anon pages and reclaimed
  lazily, so application's RSS becomes bigger. This confuses our
  workloads. We have monitoring daemon running and if it finds
  applications' RSS becomes abnormal, the daemon will kill the
  applications even kernel can reclaim the memory easily.

To address the first two issues, we can either put MADV_FREE pages
into a separate LRU list (Minchan's previous patches and V1 patches), or
put them into LRU_INACTIVE_FILE list (suggested by Johannes). The
patchset uses the second idea. The reason is LRU_INACTIVE_FILE list is
tiny nowadays and should be full of used once file pages. So we can
still efficiently reclaim MADV_FREE pages there without interference
with other anon and active file pages. Putting the pages into inactive
file list also has an advantage which allows page reclaim to prioritize
MADV_FREE pages and used once file pages. MADV_FREE pages are put into
the lru list and clear SwapBacked flag, so PageAnon(page) &&
!PageSwapBacked(page) will indicate a MADV_FREE page. These pages will
be directly freed without pageout if they are clean, otherwise normal swap
will reclaim them.

For the third issue, the previous post adds global accounting and a
separate RSS count for MADV_FREE pages. The problem is we never get
accurate accounting for MADV_FREE pages. The pages are mapped to
userspace, can be dirtied without notice from kernel side. To get
accurate accounting, we could write protect the page, but then there is
extra page fault overhead, which people don't want to pay. Jemalloc
guys have concerns about the inaccurate accounting, so this post drops
the accounting patches temporarily. The info exported to
/proc/pid/smaps for MADV_FREE pages is kept, which is the only place we
can get accurate accounting right now.

This patch (of 6):

Johannes pointed out TTU_LZFREE is unnecessary. It's true because we
always have the flag set if we want to do an unmap. For cases we don't
do an unmap, the TTU_LZFREE part of code should never run.

Also the TTU_UNMAP is unnecessary. If no other flags set (for example,
TTU_MIGRATION), an unmap is implied.

The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
code

Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li
Suggested-by: Johannes Weiner
Acked-by: Johannes Weiner
Acked-by: Minchan Kim
Acked-by: Hillf Danton
Acked-by: Michal Hocko
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
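
The MADV_FREE series above identifies lazyfree pages as PageAnon(page) && !PageSwapBacked(page), frees them directly when clean, and hands dirty ones back to normal swap by setting SwapBacked. A minimal sketch of that decision follows; struct page_model, enum reclaim_action and both function names are hypothetical, and real reclaim involves locking and reference checks not shown here.

```c
#include <stdbool.h>

/* Hypothetical, simplified page state. */
struct page_model {
	bool anon;        /* models PageAnon() */
	bool swap_backed; /* models PageSwapBacked() */
	bool dirty;       /* page was written after MADV_FREE */
};

enum reclaim_action {
	RECLAIM_FREE_NOW,     /* clean lazyfree page: drop without pageout */
	RECLAIM_BACK_TO_ANON, /* dirty lazyfree page: mark SwapBacked again */
	RECLAIM_NORMAL,       /* not a lazyfree page: usual reclaim path */
};

/* An anonymous page without SwapBacked is a MADV_FREE (lazyfree) page. */
bool is_lazyfree(const struct page_model *p)
{
	return p->anon && !p->swap_backed;
}

enum reclaim_action lazyfree_reclaim(struct page_model *p)
{
	if (!is_lazyfree(p))
		return RECLAIM_NORMAL;
	if (!p->dirty)
		return RECLAIM_FREE_NOW;
	p->swap_backed = true; /* page returns to the anonymous LRU */
	return RECLAIM_BACK_TO_ANON;
}
```

After a dirty lazyfree page has been handed back, is_lazyfree() no longer matches it, so the next reclaim pass treats it like any other anonymous page.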
23 Apr, 2017
1 commit
-
…k/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- Parallelize SRCU callback handling (plus overlapping patches).
Signed-off-by: Ingo Molnar <mingo@kernel.org>
19 Apr, 2017
1 commit
-
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Andrew Morton
Cc:
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes
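
Note: because a type-safe slab gives no existence guarantee, a reader that found an object must check that it is still the object it was looking for. The following is a single-threaded userspace sketch of that revalidation idea only. It uses no real RCU, no atomics and no slab allocator, and struct obj, obj_get_unless_zero(), obj_put() and obj_validate() are hypothetical names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object; its memory always stays a struct obj (type safety). */
struct obj {
	int key;    /* identity of the object currently living in this block */
	int refcnt; /* 0 means the block is currently free */
};

/* Take a reference only if the block currently holds a live object. */
static inline bool obj_get_unless_zero(struct obj *o)
{
	if (o->refcnt == 0)
		return false;
	o->refcnt++;
	return true;
}

static inline void obj_put(struct obj *o)
{
	o->refcnt--;
}

/*
 * The caller looked up 'o' under key 'key' earlier. The block may have been
 * freed and reused for another object of the same type since then, so the
 * key is re-checked after the reference has been taken.
 */
static inline struct obj *obj_validate(struct obj *o, int key)
{
	if (!obj_get_unless_zero(o))
		return NULL;
	if (o->key != key) {
		obj_put(o);
		return NULL;
	}
	return o;
}
```

The point of the sketch is that reading o->key is always safe, since the block is still a struct obj, but the value read may belong to a different object than the one originally found.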
01 Apr, 2017
1 commit
-
Huge pages are accounted as single units in the memcg's "file_mapped"
counter. Account the correct number of base pages, like we do in the
corresponding node counter.

Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reviewed-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Cc: Vladimir Davydov
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
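
Note: the accounting rule above amounts to adding the number of base pages rather than 1 for a huge page. Below is a small userspace sketch, assuming 4 KiB base pages and 2 MiB PMD-sized huge pages (the usual x86-64 values). The MODEL_* macros, struct mapped_counters and both functions are hypothetical names for illustration, not the kernel's own.

```c
#include <stdbool.h>

/* Assumed sizes: 4 KiB base pages, 2 MiB huge pages. */
#define MODEL_PAGE_SHIFT      12
#define MODEL_HPAGE_PMD_SHIFT 21
#define MODEL_HPAGE_PMD_NR    (1L << (MODEL_HPAGE_PMD_SHIFT - MODEL_PAGE_SHIFT))

/* Hypothetical pair of counters that should always move together. */
struct mapped_counters {
	long node_file_mapped;
	long memcg_file_mapped;
};

/* Number of base pages covered by one mapping unit. */
static inline long nr_base_pages(bool huge)
{
	return huge ? MODEL_HPAGE_PMD_NR : 1;
}

/* Account a newly mapped page in both counters with the same unit. */
static inline void account_file_mapped(struct mapped_counters *c, bool huge)
{
	long nr = nr_base_pages(huge);

	c->node_file_mapped += nr;
	c->memcg_file_mapped += nr;
}
```

With these assumed sizes, one huge page adds 512 to each counter, so the node and memcg views stay comparable.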
11 Mar, 2017
1 commit
-
Merge 5-level page table prep from Kirill Shutemov:
"Here's relatively low-risk part of 5-level paging patchset. Merging it
now will make x86 5-level paging enabling in v4.12 easier.

The first patch is actually x86-specific: detect 5-level paging
support. It boils down to single define.

The rest of patchset converts Linux MMU abstraction from 4- to 5-level
paging.

Enabling of new abstraction in most cases requires adding single line
of code in arch-specific code. The rest is taken care by asm-generic/.

Changes to mm/ code are mostly mechanical: add support for new page
table level -- p4d_t -- where we deal with pud_t now.

v2:
- fix build on microblaze (Michal);
- comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
- acks from Michal"

* emailed patches from Kirill A Shutemov :
mm: introduce __p4d_alloc()
mm: convert generic code to 5-level paging
asm-generic: introduce
arch, mm: convert all architectures to use 5level-fixup.h
asm-generic: introduce __ARCH_USE_5LEVEL_HACK
asm-generic: introduce 5level-fixup.h
x86/cpufeature: Add 5-level paging detection
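
Note: the extra p4d level sits between pgd and pud. As a rough illustration, the sketch below splits a virtual address into its five table indices, assuming the x86-64 layout with 4 KiB pages and 9 index bits per level. The enum, the MODEL_* macros and pt_index() are hypothetical names for this example and are not the kernel's helpers.

```c
#include <stdint.h>

/* Assumed x86-64 layout: 12-bit page offset, 9 index bits per level. */
#define MODEL_PAGE_SHIFT     12
#define MODEL_BITS_PER_LEVEL 9
#define MODEL_INDEX_MASK     ((1u << MODEL_BITS_PER_LEVEL) - 1)

/* Levels ordered from the lowest table (pte) to the top one (pgd). */
enum pt_level {
	LVL_PTE = 0,
	LVL_PMD,
	LVL_PUD,
	LVL_P4D, /* the level added by 5-level paging */
	LVL_PGD
};

/* Index into the table at 'level' for virtual address 'vaddr'. */
static inline unsigned int pt_index(uint64_t vaddr, enum pt_level level)
{
	unsigned int shift = MODEL_PAGE_SHIFT +
			     MODEL_BITS_PER_LEVEL * (unsigned int)level;

	return (unsigned int)((vaddr >> shift) & MODEL_INDEX_MASK);
}
```

Under these assumptions the p4d index occupies address bits 39-47 and the pgd index moves up to bits 48-56.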
10 Mar, 2017
2 commits
-
The following test case triggers a NULL-pointer dereference in
try_to_unmap_one():

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main(int argc, char *argv[])
	{
		int fd;

		system("mount -t tmpfs -o huge=always none /mnt");
		fd = open("/mnt/test", O_CREAT | O_RDWR);
		ftruncate(fd, 2UL << 20);
		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
		mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_LOCKED, fd, 0);
		munlockall();

		return 0;
	}

Apparently, there's a case when we call try_to_unmap() on huge PMDs:
it's TTU_MUNLOCK.

Let's handle this case correctly.
Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
Link: http://lkml.kernel.org/r/20170302151159.30592-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Convert all non-architecture-specific code to 5-level paging.
It's mostly mechanical: handling for one more page table level is added in
places where we deal with pud_t.

Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
02 Mar, 2017
2 commits
-
We are going to split out of , which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder file that just
maps to to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
-
We are going to split out of , which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder file that just
maps to to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
25 Feb, 2017
2 commits
-
All users are gone. Let's drop them.
Link: http://lkml.kernel.org/r/20170129173858.45174-12-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Hillf Danton
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Srikar Dronamraju
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
For consistency, it is worth converting all page_check_address() callers to
page_vma_mapped_walk(), so we can drop the former.

Link: http://lkml.kernel.org/r/20170129173858.45174-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Acked-by: Hillf Danton
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Srikar Dronamraju
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds