10 Sep, 2020
1 commit
-
commit e5a59d308f52bb0052af5790c22173651b187465 upstream.
collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to
be read to page_cache_sync_readahead(). The intent was probably to read
a single page. Fix it to use the number of pages to the end of the
window instead.

Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: David Howells
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Andrew Morton
Reviewed-by: Matthew Wilcox (Oracle)
Acked-by: Song Liu
Acked-by: Yang Shi
Acked-by: Pankaj Gupta
Cc: Eric Biggers
Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
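
The difference between the two counts described in this commit can be sketched in plain C. This is an illustrative userspace model, not the kernel code: `MODEL_PAGE_SIZE`, `buggy_nr_to_read()` and `pages_to_window_end()` are hypothetical names, and the 4 KiB page size is an assumption.

```c
#include <assert.h>

/* Assumed page size for illustration; the real value depends on the architecture. */
#define MODEL_PAGE_SIZE 4096UL

/* What the buggy call effectively requested: PAGE_SIZE pages, not one page. */
unsigned long buggy_nr_to_read(void)
{
	return MODEL_PAGE_SIZE;
}

/*
 * What the commit message describes instead: the number of pages from
 * the current index up to the end of the window being collapsed.
 */
unsigned long pages_to_window_end(unsigned long index, unsigned long end)
{
	assert(index <= end);
	return end - index;
}
```

With a 512-page window starting at index 512, a read issued at index 1000 would ask for 24 pages under this model, where the buggy form asked for 4096.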
26 Aug, 2020
2 commits
-
[ Upstream commit f3f99d63a8156c7a4a6b20aac22b53c5579c7dc1 ]
syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
__khugepaged_enter(): yes, when one thread is about to dump core, has set
core_state, and is waiting for others, another might do something calling
__khugepaged_enter(), which now crashes because I lumped the core_state
test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
think it's best to lump them together, so just in this exceptional case,
check mm->mm_users directly instead of khugepaged_test_exit().

Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
Reported-by: syzbot
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Acked-by: Yang Shi
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Song Liu
Cc: Mike Kravetz
Cc: Eric Dumazet
Cc: [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
-
[ Upstream commit bbe98f9cadff58cdd6a4acaeba0efa8565dabe65 ]
Move collapse_huge_page()'s mmget_still_valid() check into
khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP
only, and earned its mmget_still_valid() check because it inserts a huge
pmd entry in place of the page table's pmd entry; whereas
collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
merely clears the page table's pmd entry. But core dumping without mmap
lock must have been as open to mistaking a racily cleared pmd entry for a
page table at physical page 0, as exit_mmap() was. And we certainly have
no interest in mapping as a THP once dumping core.

Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping")
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Cc: Andrea Arcangeli
Cc: Song Liu
Cc: Mike Kravetz
Cc: Kirill A. Shutemov
Cc: [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
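
The combined check described in this commit can be modelled as a single predicate. The sketch below is a simplified userspace stand-in, assuming only what the message states (the exit test looks at mm_users, and the core dump test looks at core_state); `struct model_mm` and the `model_*` functions are hypothetical names, not kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the two mm fields involved. */
struct model_mm {
	int mm_users;           /* 0 once the last user of the mm has gone */
	const void *core_state; /* non-NULL while a core dump is in progress */
};

/* Models mmget_still_valid(): false once core dumping has started. */
bool model_mmget_still_valid(const struct model_mm *mm)
{
	return mm->core_state == NULL;
}

/* Models khugepaged_test_exit() with the core dump test lumped in. */
bool model_khugepaged_test_exit(const struct model_mm *mm)
{
	return mm->mm_users == 0 || !model_mmget_still_valid(mm);
}
```

Under this model a caller that only cares about mm_users has to test that field directly, because the predicate also returns true for a live mm that is dumping core.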
21 Aug, 2020
3 commits
-
commit 18e77600f7a1ed69f8ce46c9e11cad0985712dfa upstream.
Only once have I seen this scenario (and forgot even to notice what forced
the eventual crash): a sequence of "BUG: Bad page map" alerts from
vm_normal_page(), from zap_pte_range() servicing exit_mmap();
pmd:00000000, pte values corresponding to data in physical page 0.

The pte mappings being zapped in this case were supposed to be from a huge
page of ext4 text (but could as well have been shmem): my belief is that
it was racing with collapse_file()'s retract_page_tables(), found *pmd
pointing to a page table, locked it, but *pmd had become 0 by the time
start_pte was decided.

In most cases, that possibility is excluded by holding mmap lock; but
exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
checks khugepaged_test_exit() after acquiring mmap lock:
khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
for example. But retract_page_tables() did not: fix that.

The fix is for retract_page_tables() to check khugepaged_test_exit(),
after acquiring mmap lock, before doing anything to the page table.
Getting the mmap lock serializes with __mmput(), which briefly takes and
drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
mm_users makes sure we don't touch the page table once exit_mmap() might
reach it, since exit_mmap() will be proceeding without mmap lock, not
expecting anyone to be racing with it.

Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Acked-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Song Liu
Cc: [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit 119a5fc16105b2b9383a6e2a7800b2ef861b2975 upstream.
When retract_page_tables() removes a page table to make way for a huge
pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and
pmd lock; but when collapse_pte_mapped_thp() does the same (to handle the
case when the original mmap_write_trylock had failed), only
mmap_write_trylock and pmd lock are held.

That's not enough. One machine has twice crashed under load, with "BUG:
spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the second
crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl (serving
page_referenced() on a file THP, that had found a page table at *pmd)
discovers that the page table page and its lock have already been freed by
the time it comes to unlock.

Follow the example of retract_page_tables(), but we only need one of huge
page lock or i_mmap_lock_write to secure against this: because it's the
narrower lock, and because it simplifies collapse_pte_mapped_thp() to know
the hpage earlier, choose to rely on huge page lock here.

Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Acked-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Song Liu
Cc: [5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021213070.27773@eggly.anvils
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit 723a80dafed5c95889d48baab9aa433a6ffa0b4e upstream.
pmdp_collapse_flush() should be given the start address at which the huge
page is mapped, haddr: it was given addr, which at that point has been
used as a local variable, incremented to the end address of the extent.

Found by source inspection while chasing a hugepage locking bug, which I
then could not explain by this. At first I thought this was very bad;
then saw that all of the page translations that were not flushed would
actually still point to the right pages afterwards, so harmless; then
realized that I know nothing of how different architectures and models
cache intermediate paging structures, so maybe it matters after all -
particularly since the page table concerned is immediately freed.

Much easier to fix than to think about.

Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Acked-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Mike Kravetz
Cc: Song Liu
Cc: [5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021204390.27773@eggly.anvils
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
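
Why addr is the wrong argument after the loop can be seen in a small model. This is a sketch under stated assumptions: 4 KiB base pages and 2 MiB huge pages, with `model_walk_extent()` as a hypothetical stand-in for the per-pte loop that uses addr as its loop variable.

```c
#include <assert.h>

/* Assumed sizes for illustration: 4 KiB base pages, 2 MiB huge pages. */
#define MODEL_PAGE_SIZE      4096UL
#define MODEL_HPAGE_PMD_SIZE (512UL * MODEL_PAGE_SIZE)

/*
 * Walk every base page of the extent with addr as the loop variable,
 * then return the value addr holds once the loop has finished.
 */
unsigned long model_walk_extent(unsigned long haddr)
{
	unsigned long addr;

	assert(haddr % MODEL_HPAGE_PMD_SIZE == 0);
	for (addr = haddr; addr < haddr + MODEL_HPAGE_PMD_SIZE;
	     addr += MODEL_PAGE_SIZE)
		; /* per-pte work would happen here */

	return addr;
}
```

For an extent starting at 0x200000 the loop leaves addr at 0x400000, the end of the extent, so a flush keyed on addr would target the following huge page range rather than the one at haddr.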
29 Jul, 2020
1 commit
-
commit 594cced14ad3903166c8b091ff96adac7552f0b3 upstream.
khugepaged has to drop mmap lock several times while collapsing a page.
The situation can change while the lock is dropped and we need to
re-validate that the VMA is still in place and the PMD is still subject
for collapse.

But we miss one corner case: while collapsing an anonymous page, the VMA
could be replaced with a file VMA. If the file VMA doesn't have any
private pages, we get a NULL pointer dereference:

general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
anon_vma_lock_write include/linux/rmap.h:120 [inline]
collapse_huge_page mm/khugepaged.c:1110 [inline]
khugepaged_scan_pmd mm/khugepaged.c:1349 [inline]
khugepaged_scan_mm_slot mm/khugepaged.c:2110 [inline]
khugepaged_do_scan mm/khugepaged.c:2193 [inline]
khugepaged+0x3bba/0x5a10 mm/khugepaged.c:2238

The fix is to make sure that the VMA is anonymous in
hugepage_vma_revalidate(). The helper is only used for collapsing
anonymous pages.

Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Reported-by: syzbot+ed318e8b790ca72c5ad0@syzkaller.appspotmail.com
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: Yang Shi
Cc:
Link: http://lkml.kernel.org/r/20200722121439.44328-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
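
The revalidation idea can be sketched as a predicate on a simplified VMA. The commit message does not quote the exact condition, so the test below (an anon_vma is present and no vm_ops are set) is an assumption, and `struct model_vma` and `model_vma_is_anon_collapsible()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct model_vma {
	const void *anon_vma; /* NULL if the VMA has no private anonymous pages */
	const void *vm_ops;   /* non-NULL for file-backed or special mappings */
};

/*
 * True only for a VMA that the anonymous collapse path can handle:
 * it must have an anon_vma to lock and must not be file-backed.
 */
bool model_vma_is_anon_collapsible(const struct model_vma *vma)
{
	return vma->anon_vma != NULL && vma->vm_ops == NULL;
}
```

A file VMA without private pages has a NULL anon_vma in this model, which is the case that led to the dereference in the trace above.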
03 Jun, 2020
1 commit
-
[ Upstream commit 2f33a706027c94cd4f70fcd3e3f4a17c1ce4ea4b ]
When collapse_file() calls try_to_release_page(), it has already isolated
the page: so if releasing buffers happens to fail (as it sometimes does),
remember to putback_lru_page(): otherwise that page is left unreclaimable
and unfreeable, and the file extent uncollapsible.

Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Acked-by: Song Liu
Acked-by: Kirill A. Shutemov
Acked-by: Johannes Weiner
Cc: Rik van Riel
Cc: [5.4+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005231837500.1766@eggly.anvils
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
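
The missing step can be illustrated with a toy model of page isolation. Everything here is hypothetical (`struct model_page`, the `model_*` helpers, and the `release_succeeds` flag that stands in for the outcome of try_to_release_page()); it only shows that a failure path after isolation must put the page back.

```c
#include <stdbool.h>

/* Hypothetical, simplified page state. */
struct model_page {
	bool on_lru;      /* true while the page is on an LRU list */
	bool has_buffers; /* true if buffers must be released first */
};

/* Take the page off the LRU; fails if it is not there. */
bool model_isolate(struct model_page *page)
{
	if (!page->on_lru)
		return false;
	page->on_lru = false;
	return true;
}

/* Return an isolated page to the LRU. */
void model_putback(struct model_page *page)
{
	page->on_lru = true;
}

/*
 * Returns true if the page is ready to be collapsed. On any failure
 * after isolation the page is put back, so it stays reclaimable.
 */
bool model_prepare_page(struct model_page *page, bool release_succeeds)
{
	if (!model_isolate(page))
		return false;
	if (page->has_buffers && !release_succeeds) {
		model_putback(page); /* the step the fix adds */
		return false;
	}
	return true;
}
```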
16 Nov, 2019
1 commit
-
In collapse_file(), for !is_shmem case, current check cannot guarantee
the locked page is up-to-date. Specifically, xas_unlock_irq() should
not be called before lock_page() and get_page(); and it is necessary to
recheck PageUptodate() after locking the page.

With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
may contain corrupted data. This is because khugepaged mistakenly
collapses some not up-to-date sub pages into a huge page, and assumes
the huge page is up-to-date. This will NOT corrupt data in the disk,
because the page is read-only and never written back. Fix this by
properly checking PageUptodate() after locking the page. This check
replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".

Also, move PageDirty() check after locking the page. Current khugepaged
should not try to collapse dirty file THP, because it is limited to
read-only .text. The only case we hit a dirty page here is when the
page hasn't been written back since it was written. Bail out and retry
when this happens.

syzbot reported bug on previous version of this patch.

Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Song Liu
Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
Cc: Johannes Weiner
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: William Kucharski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Nov, 2019
1 commit
-
I got some khugepaged spew on a 32bit x86:
BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
INFO: lockdep is turned off.
CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203 07/08/2009
Call Trace:
dump_stack+0x66/0x8e
___might_sleep.cold.96+0x95/0xa6
__might_sleep+0x2e/0x80
collapse_huge_page.isra.51+0x5ac/0x1360
khugepaged+0x9a9/0x20f0
kthread+0xf5/0x110
ret_from_fork+0x2e/0x38

Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
vs. mmu_notifier_invalidate_range_start(). Let's do the naive approach
and just reorder the two operations.

Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
Fixes: 810e24e009cf71 ("mm/mmu_notifiers: annotate with might_sleep()")
Signed-off-by: Ville Syrjälä
Reviewed-by: Andrew Morton
Acked-by: Kirill A. Shutemov
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Jérôme Glisse
Cc: Ralph Campbell
Cc: Ira Weiny
Cc: Jason Gunthorpe
Cc: Daniel Vetter
Cc: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
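
The ordering problem in this commit can be modelled with two counters. This is a userspace sketch, not kernel code: `model_map_pte()` stands in for pte_offset_map() with CONFIG_HIGHPTE=y (which enters an atomic section), and `model_invalidate_range_start()` stands in for a function annotated with might_sleep(). All names are hypothetical.

```c
/* Hypothetical counters standing in for kernel preemption state. */
int model_atomic_depth;     /* > 0 while inside an atomic mapping section */
int model_sleep_violations; /* times a sleeping call ran in atomic context */

/* Enter and leave the atomic section created by mapping the pte. */
void model_map_pte(void)   { model_atomic_depth++; }
void model_unmap_pte(void) { model_atomic_depth--; }

/* A call that may sleep: record a violation if made in atomic context. */
void model_invalidate_range_start(void)
{
	if (model_atomic_depth > 0)
		model_sleep_violations++;
}

/* Order before the fix: map first, then call the sleeping function. */
void model_old_order(void)
{
	model_map_pte();
	model_invalidate_range_start();
	model_unmap_pte();
}

/* Order after the fix: call the sleeping function before mapping. */
void model_new_order(void)
{
	model_invalidate_range_start();
	model_map_pte();
	model_unmap_pte();
}
```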
25 Sep, 2019
5 commits
-
khugepaged needs exclusive mmap_sem to access page table. When it fails
to lock mmap_sem, the page will fault in as pte-mapped THP. As the page
is already a THP, khugepaged will not handle this pmd again.

This patch enables khugepaged to retry collapsing the page table.
struct mm_slot (in khugepaged.c) is extended with an array, containing
addresses of pte-mapped THPs. We use array here for simplicity. We can
easily replace it with more advanced data structures when needed.

In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to
collapse the page table.

Since collapse may happen at a later time, some pages may already fault
in. collapse_pte_mapped_thp() is added to properly handle these pages.
collapse_pte_mapped_thp() also double checks whether all ptes in this pmd
are mapping to the same THP. This is necessary because some subpage of
the THP may be replaced, for example by uprobe. In such cases, it is not
possible to collapse the pmd.

[kirill.shutemov@linux.intel.com: add comments for retract_page_tables()]
Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box
Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com
Signed-off-by: Song Liu
Signed-off-by: Kirill A. Shutemov
Acked-by: Kirill A. Shutemov
Suggested-by: Johannes Weiner
Reviewed-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In previous patch, an application could put part of its text section in
THP via madvise(). These THPs will be protected from writes when the
application is still running (TXTBSY). However, after the application
exits, the file is available for writes.

This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write. A new counter nr_thps is added to struct
address_space. In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.

Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
Signed-off-by: Song Liu
Reported-by: kbuild test robot
Acked-by: Rik van Riel
Acked-by: Kirill A. Shutemov
Acked-by: Johannes Weiner
Cc: Hillf Danton
Cc: Hugh Dickins
Cc: William Kucharski
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch is (hopefully) the first step to enable THP for non-shmem
filesystems.

This patch enables an application to put part of its text sections to THP
via madvise, for example:

madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

We tried to reuse the logic for THP on tmpfs.
Currently, write is not supported for non-shmem THP. khugepaged will only
process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
(see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
execve(). This requirement limits non-shmem THP to text sections.

The next patch will handle writes, which would only happen when all
the vmas with VM_DENYWRITE are unmapped.

An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
feature.

[songliubraving@fb.com: fix build without CONFIG_SHMEM]
Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
[songliubraving@fb.com: fix double unlock in collapse_file()]
Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
Signed-off-by: Song Liu
Acked-by: Rik van Riel
Acked-by: Kirill A. Shutemov
Acked-by: Johannes Weiner
Cc: Stephen Rothwell
Cc: Dan Carpenter
Cc: Hillf Danton
Cc: Hugh Dickins
Cc: William Kucharski
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Next patch will add khugepaged support of non-shmem files. This patch
renames these two functions to reflect the new functionality:

collapse_shmem() => collapse_file()
khugepaged_scan_shmem() => khugepaged_scan_file()

Link: http://lkml.kernel.org/r/20190801184244.3169074-6-songliubraving@fb.com
Signed-off-by: Song Liu
Acked-by: Rik van Riel
Acked-by: Kirill A. Shutemov
Acked-by: Johannes Weiner
Cc: Hillf Danton
Cc: Hugh Dickins
Cc: William Kucharski
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

Kirill and Huang Ying contributed several fixes.

[willy@infradead.org: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jan Kara
Reviewed-by: Kirill Shutemov
Reviewed-by: Song Liu
Tested-by: Song Liu
Tested-by: William Kucharski
Reviewed-by: William Kucharski
Tested-by: Qian Cai
Tested-by: Mikhail Gavrilov
Cc: Hugh Dickins
Cc: Chris Wilson
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
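
The storage change in this commit, from pointers to consecutive subpages to repeated pointers to the head page, can be sketched with a plain array in place of i_pages. This is a toy model: `MODEL_NR_SUBPAGES`, `struct model_page` and both functions are hypothetical, and a 4-page "huge page" is used only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny "huge page" for illustration; must be a power of two. */
#define MODEL_NR_SUBPAGES 4

struct model_page {
	int id;
};

/* Store the head pointer in every slot covered by the compound page. */
void model_store_head(struct model_page *slots[], size_t first_index,
		      struct model_page *head)
{
	size_t i;

	assert(first_index % MODEL_NR_SUBPAGES == 0);
	for (i = 0; i < MODEL_NR_SUBPAGES; i++)
		slots[first_index + i] = head;
}

/* Recover the subpage for an index from the stored head pointer. */
struct model_page *model_find_subpage(struct model_page *head, size_t index)
{
	return head + (index & (MODEL_NR_SUBPAGES - 1));
}
```

A lookup then costs one extra masking step, in exchange for every slot of the range holding the same value, which is what makes a more compact representation possible later.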
03 Sep, 2019
1 commit
-
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")
the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

$ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Mel Gorman
Cc: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Suravee.Suthikulpanit@amd.com
Cc: Thomas Gleixner
Cc: Thomas.Lendacky@amd.com
Cc: Tony Luck
Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar
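
The effect of the threshold on the EPYC distance table quoted in this commit can be checked with a short sketch. The table values come from the message; the default threshold of 30 is taken from the title of commit 32e45ff43eaf cited above. The override value of 32 is an assumption (the message does not state the new value), and `model_pairs_beyond()` is a hypothetical helper.

```c
#define MODEL_NR_NODES 8

/* Node distance table quoted in the commit message. */
const int model_epyc_distance[MODEL_NR_NODES][MODEL_NR_NODES] = {
	{ 10, 16, 16, 16, 32, 32, 32, 32 },
	{ 16, 10, 16, 16, 32, 32, 32, 32 },
	{ 16, 16, 10, 16, 32, 32, 32, 32 },
	{ 16, 16, 16, 10, 32, 32, 32, 32 },
	{ 32, 32, 32, 32, 10, 16, 16, 16 },
	{ 32, 32, 32, 32, 16, 10, 16, 16 },
	{ 32, 32, 32, 32, 16, 16, 10, 16 },
	{ 32, 32, 32, 32, 16, 16, 16, 10 },
};

/*
 * Count ordered node pairs whose distance is strictly greater than the
 * reclaim distance, i.e. pairs treated as too far apart to balance.
 */
int model_pairs_beyond(int reclaim_distance)
{
	int i, j, n = 0;

	for (i = 0; i < MODEL_NR_NODES; i++)
		for (j = 0; j < MODEL_NR_NODES; j++)
			if (model_epyc_distance[i][j] > reclaim_distance)
				n++;
	return n;
}
```

With a threshold of 30, all 32 cross-socket pairs count as too distant; with an assumed threshold of 32, none do.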
06 Jul, 2019
1 commit
-
This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.
Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:

page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
anon
flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
page dumped because: VM_BUG_ON_PAGE(entry != page)
page->mem_cgroup:ffff978433ace000
------------[ cut here ]------------
kernel BUG at mm/swap_state.c:170!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
RIP: 0010:__delete_from_swap_cache+0x20d/0x240
Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
Call Trace:
delete_from_swap_cache+0x46/0xa0
try_to_free_swap+0xbc/0x110
swap_writepage+0x13/0x70
pageout.isra.0+0x13c/0x350
shrink_page_list+0xc14/0xdf0
shrink_inactive_list+0x1e5/0x3c0
shrink_node_memcg+0x202/0x760
shrink_node+0xe0/0x470
balance_pgdat+0x2d1/0x510
kswapd+0x220/0x420
kthread+0xfb/0x130
ret_from_fork+0x22/0x40

and it's not immediately obvious why it happens. It's too late in the
rc cycle to do anything but revert for now.

Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov
Suggested-by: Jan Kara
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Matthew Wilcox
Cc: Kirill Shutemov
Cc: William Kucharski
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
14 Jun, 2019
1 commit
-
When fixing the race conditions between the coredump and the mmap_sem
holders outside the context of the process, we focused on
mmget_not_zero()/get_task_mm() callers in 04f5866e41fb70 ("coredump: fix
race condition between mmget_not_zero()/get_task_mm() and core
dumping"), but those aren't the only cases where the mmap_sem can be
taken outside of the context of the process as Michal Hocko noticed
while backporting that commit to older -stable kernels.

If mmgrab() is called in the context of the process, but then the
mm_count reference is transferred outside the context of the process,
that can also be a problem if the mmap_sem has to be taken for writing
through that mm_count reference.

khugepaged registration calls mmgrab() in the context of the process,
but the mmap_sem for writing is taken later in the context of the
khugepaged kernel thread.

collapse_huge_page() after taking the mmap_sem for writing doesn't
modify any vma, so it's not obvious that it could cause a problem to the
coredump, but it happens to modify the pmd in a way that breaks an
invariant that pmd_trans_huge_lock() relies upon. collapse_huge_page()
needs the mmap_sem for writing just to block concurrent page faults that
call pmd_trans_huge_lock().

Specifically the invariant that "!pmd_trans_huge()" cannot become a
"pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.

The coredump will call __get_user_pages() without mmap_sem for reading,
which eventually can invoke a lockless page fault which will need a
functional pmd_trans_huge_lock().

So collapse_huge_page() needs to use mmget_still_valid() to check it's
not running concurrently with the coredump... as long as the coredump
can invoke page faults without holding the mmap_sem for reading.

This has "Fixes: khugepaged" to facilitate backporting, but in my view
it's more a bug in the coredump code that will eventually have to be
rewritten to stop invoking page faults without the mmap_sem for reading.
So the long term plan is still to drop all mmget_still_valid().

Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
Fixes: ba76149f47d8 ("thp: khugepaged")
Signed-off-by: Andrea Arcangeli
Reported-by: Michal Hocko
Acked-by: Michal Hocko
Acked-by: Kirill A. Shutemov
Cc: Oleg Nesterov
Cc: Jann Horn
Cc: Hugh Dickins
Cc: Mike Rapoport
Cc: Mike Kravetz
Cc: Peter Xu
Cc: Jason Gunthorpe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 May, 2019
3 commits
-
This updates each existing invalidation to use the correct mmu notifier
event that represents what is happening to the CPU page table. See the
patch which introduced the events to see the rationale behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
CPU page table updates can happen for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of the mmu notifier API track changes to the CPU page table and take
specific actions for them. However, the current API only provides the range
of virtual addresses affected by the change, not why the change is happening.

This patchset does the initial mechanical conversion of all the places that
call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is known (most invalidations happen against
a given vma). Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
should assume that everything in the range is going away when that event
happens. A later patch converts the mm call paths to use a more appropriate
event for each call.

This is done as 2 patches so that no call site is forgotten, especially
as it uses this following coccinelle patch:

%vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
}

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
}

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
}
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages. This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

[willy@infradead.org: fix swapcache pages]
Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
[kirill@shutemov.name: hugetlb stores pages in page cache differently]
Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jan Kara
Reviewed-by: Kirill Shutemov
Reviewed-and-tested-by: Song Liu
Tested-by: William Kucharski
Reviewed-by: William Kucharski
Tested-by: Qian Cai
Cc: Hugh Dickins
Cc: Song Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Mar, 2019
1 commit
-
Currently THP allocation events data is fairly opaque, since you can
only get it system-wide. This patch makes it easier to reason about
transparent hugepage behaviour on a per-memcg basis.

For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1,
which is used for v1's rss_huge [sic]. This is reused here as it's
fairly involved to untangle NR_ANON_THPS right now to make it per-memcg,
since right now some of this is delegated to rmap before we have any
memcg actually assigned to the page. It's a good idea to rework that,
but let's leave untangling THP allocation for a future patch.

[akpm@linux-foundation.org: fix build]
[chris@chrisdown.name: fix memcontrol build when THP is disabled]
Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name
Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name
Signed-off-by: Chris Down
Acked-by: Johannes Weiner
Cc: Tejun Heo
Cc: Roman Gushchin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Dec, 2018
1 commit
-
To avoid having to change many call sites every time we want to add a
parameter, use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end calls. No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Acked-by: Christian König
Acked-by: Jan Kara
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Felix Kuehling
Cc: Ralph Campbell
Cc: John Hubbard
From: Jérôme Glisse
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
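
The motivation for grouping the parameters can be shown with a small sketch. The structure and functions below are hypothetical stand-ins (`struct model_notifier_range`, `model_range_init()`, `model_range_len()`), not the kernel's definitions; the field names simply mirror the parameters the message says are being grouped.

```c
#include <stddef.h>

struct model_mm; /* opaque, hypothetical stand-in for struct mm_struct */

/* All invalidation parameters travel together in one structure. */
struct model_notifier_range {
	struct model_mm *mm;
	unsigned long start;
	unsigned long end;
};

/* Single place where a range is filled in. */
void model_range_init(struct model_notifier_range *range,
		      struct model_mm *mm,
		      unsigned long start, unsigned long end)
{
	range->mm = mm;
	range->start = start;
	range->end = end;
}

/*
 * Consumers take one pointer, so adding a field to the structure later
 * does not change this signature.
 */
unsigned long model_range_len(const struct model_notifier_range *range)
{
	return range->end - range->start;
}
```

In this arrangement a new parameter means one new field and one updated init helper, and consumers that take the structure pointer keep their signatures.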
04 Dec, 2018
1 commit
-
…k/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

- Replace calls of RCU-bh and RCU-sched update-side functions
  to their vanilla RCU counterparts. This series is a step
  towards complete removal of the RCU-bh and RCU-sched update-side
  functions.

  ( Note that some of these conversions are going upstream via their
  respective maintainers. )

- Documentation updates, including a number of flavor-consolidation
  updates from Joel Fernandes.

- Miscellaneous fixes.

- Automate generation of the initrd filesystem used for
  rcutorture testing.

- Convert spin_is_locked() assertions to instead use lockdep.

  ( Note that some of these conversions are going upstream via their
  respective maintainers. )

- SRCU updates, especially including a fix from Dennis Krein
  for a bag-on-head-class bug.

- RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
01 Dec, 2018
7 commits
-
collapse_shmem()'s xas_nomem() is very unlikely to fail, but it is
rightly given a failure path, so move the whole xas_create_range() block
up before __SetPageLocked(new_page): so that it does not need to
remember to unlock_page(new_page).

Add the missing mem_cgroup_cancel_charge(), and set (currently unused)
result to SCAN_FAIL rather than SCAN_SUCCEED.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261531200.2275@eggly.anvils
Fixes: 77da9389b9d5 ("mm: Convert collapse_shmem to XArray")
Signed-off-by: Hugh Dickins
Cc: Matthew Wilcox
Cc: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
collapse_shmem()'s VM_BUG_ON_PAGE(PageTransCompound) was unsafe: before
it holds page lock of the first page, racing truncation then extension
might conceivably have inserted a hugepage there already. Fail with the
SCAN_PAGE_COMPOUND result, instead of crashing (CONFIG_DEBUG_VM=y) or
otherwise mishandling the unexpected hugepage - though later we might
code up a more constructive way of handling it, with SCAN_SUCCESS.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261529310.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Cc: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
khugepaged's collapse_shmem() does almost all of its work, to assemble
the huge new_page from 512 scattered old pages, with the new_page's
refcount frozen to 0 (and refcounts of all old pages so far also frozen
to 0). Including shmem_getpage() to read in any which were out on swap,
memory reclaim if necessary to allocate their intermediate pages, and
copying over all the data from old to new.

Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting
to spin while khugepaged is rescheduled.

One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.

The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong. So simply
eliminate the freezing of the new_page.

Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are
all unfrozen on success or failure: just as before. They could be
unfrozen sooner, but cause no problem once no longer visible to
find_get_entry(), filemap_map_pages() and other speculative lookups.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Several cleanups in collapse_shmem(): most of which probably do not
really matter, beyond doing things in a more familiar and reassuring
order. Simplify the failure gotos in the main loop, and on success
update stats while interrupts still disabled from the last iteration.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261526400.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Huge tmpfs testing reminds us that there is no __GFP_ZERO in the gfp
flags khugepaged uses to allocate a huge page - in all common cases it
would just be a waste of effort - so collapse_shmem() must remember to
clear out any holes that it instantiates.

The obvious place to do so, where they are put into the page cache tree,
is not a good choice: because interrupts are disabled there. Leave it
until further down, once success is assured, where the other pages are
copied (before setting PageUptodate).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261525080.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
closed and unlinked.

khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
on the rollback path, after it had added but then needs to undo some
holes.

There is indeed an irritating asymmetry between shmem_charge(), whose
callers want it to increment nrpages after successfully accounting
blocks, and shmem_uncharge(), when __delete_from_page_cache() already
decremented nrpages itself: oh well, just add a comment on that to them
both.

And shmem_recalc_inode() is supposed to be called when the accounting is
expected to be in balance (so it can deduce from imbalance that reclaim
discarded some pages): so change shmem_charge() to update nrpages
earlier (though it's rare for the difference to matter at all).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Huge tmpfs testing showed that although collapse_shmem() recognizes a
concurrently truncated or hole-punched page correctly, its handling of
holes was liable to refill an emptied extent. Add a check to stop that.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261522040.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Reviewed-by: Matthew Wilcox
Cc: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2018
1 commit
-
lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().

Signed-off-by: Lance Roy
Cc: Andrew Morton
Cc: "Kirill A. Shutemov"
Cc: Yang Shi
Cc: Matthew Wilcox
Cc: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Jan Kara
Cc: Shakeel Butt
Cc:
Signed-off-by: Paul E. McKenney
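
The distinction drawn in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_lock` and every `toy_*` function are hypothetical names invented for illustration, thread ids are plain integers, and nothing here is the kernel's spinlock or lockdep implementation. It shows why an "is anyone holding it?" check can pass even though the caller does not hold the lock, while an ownership check cannot.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that records its owner; NOT the kernel's spinlock_t. */
struct toy_lock {
    bool locked;
    int owner;      /* hypothetical thread id, meaningful only while locked */
};

void toy_lock_acquire(struct toy_lock *l, int tid)
{
    assert(!l->locked);     /* single-threaded model: no contention handling */
    l->locked = true;
    l->owner = tid;
}

void toy_lock_release(struct toy_lock *l, int tid)
{
    assert(l->locked && l->owner == tid);
    l->locked = false;
}

/* spin_is_locked()-style question: does anyone hold the lock? */
bool toy_is_locked(const struct toy_lock *l)
{
    return l->locked;
}

/* lockdep_assert_held()-style question: does 'tid' itself hold the lock? */
bool toy_held_by(const struct toy_lock *l, int tid)
{
    return l->locked && l->owner == tid;
}
```

If thread 2 holds the lock, `toy_is_locked()` is true for every caller, so an assertion built on it would wrongly pass in thread 1; `toy_held_by(&l, 1)` is false, which is the property the commit message relies on.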
21 Oct, 2018
2 commits
-
Slightly shorter and easier to read code.
Signed-off-by: Matthew Wilcox
-
I found another victim of the radix tree being hard to use. Because
there was no call to radix_tree_preload(), khugepaged was allocating
radix_tree_nodes using GFP_ATOMIC.

I also converted a local_irq_save()/restore() pair to
disable()/enable().

Signed-off-by: Matthew Wilcox
30 Sep, 2018
1 commit
-
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries. This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry). It is also a change in emphasis; exceptional entries are
intimidating and different. As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.

Signed-off-by: Matthew Wilcox
Reviewed-by: Josef Bacik
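
A value entry of the kind described above can be modeled in userspace as an integer shifted left by one with the low bit set, which is why only BITS_PER_LONG - 1 bits are available. The sketch below is modeled on the kernel's xa_mk_value(), xa_to_value() and xa_is_value() helpers, but the `toy_*` names are hypothetical and the code assumes that real pointers stored alongside values are at least 2-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encode an integer as a tagged entry: shift left by one, set the low bit. */
void *toy_mk_value(unsigned long v)
{
    /* The top bit is lost by the shift, so it must already be clear. */
    assert((long)v >= 0);
    return (void *)(uintptr_t)((v << 1) | 1);
}

/* Decode a tagged entry back into the integer it carries. */
unsigned long toy_to_value(const void *entry)
{
    return (unsigned long)(uintptr_t)entry >> 1;
}

/* An entry is a value if its low bit is set; aligned pointers never are. */
bool toy_is_value(const void *entry)
{
    return ((uintptr_t)entry & 1) != 0;
}
```

For example, `toy_to_value(toy_mk_value(42))` gives back 42, and the address of any ordinary `int` is not mistaken for a value because its low bit is clear.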
24 Aug, 2018
1 commit
-
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that cleanup, the return
types of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is invoked will be
changed to vm_fault_t type, but in a separate patch.

vmf_error() is the newly introduced inline function in 4.17-rc6.
[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder
Reviewed-by: Matthew Wilcox
Reviewed-by: Andrew Morton
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
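
The difference between returning an errno and returning a VM_FAULT value can be sketched in userspace. This is an illustration, not kernel code: `toy_vm_fault_t`, the `TOY_VM_FAULT_*` constants and `toy_vmf_error()` are hypothetical stand-ins, and the mapping shown (out-of-memory becomes an OOM fault code, every other error becomes SIGBUS) is an assumption about how vmf_error() behaves.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's vm_fault_t; a dedicated name documents intent. */
typedef unsigned int toy_vm_fault_t;

/* Illustrative bit values only; the real ones live in the kernel headers. */
#define TOY_VM_FAULT_OOM    0x1u
#define TOY_VM_FAULT_SIGBUS 0x2u

/* Translate a negative errno into a fault code, in the spirit of vmf_error(). */
toy_vm_fault_t toy_vmf_error(int err)
{
    assert(err < 0);            /* callers pass a negative errno */
    if (err == -ENOMEM)
        return TOY_VM_FAULT_OOM;
    return TOY_VM_FAULT_SIGBUS;
}
```

A fault handler written this way never leaks a raw negative errno to its caller, which is the confusion the new return type is meant to make visible.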
18 Aug, 2018
3 commits
-
khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
hugepage_vma_check(). The argument vm_flags contains the latest value.
Therefore, it is necessary to pass this vm_flags into
hugepage_vma_check().

With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to
put memory in huge pages. Here is an example of failed madvise():

   /* mount /dev/shm with huge=advise:
    *     mount -o remount,huge=advise /dev/shm */
   /* create file /dev/shm/huge */
   #define HUGE_FILE "/dev/shm/huge"

   fd = open(HUGE_FILE, O_RDONLY);
   ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
   ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);

madvise() will return 0, but this memory region is never put in huge
page (check from /proc/meminfo: ShmemHugePages).

Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
Signed-off-by: Song Liu
Reviewed-by: Rik van Riel
Reviewed-by: Yang Shi
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used
to record the count of collapsed THPs, but it is only incremented in the
anonymous THP collapse path. Do this for shmem THP collapse too.

Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi
Acked-by: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
When merging anonymous page vma, if the size of the vma can fit in at
least one hugepage, the mm will be registered for khugepaged for
collapsing THP in the future.

But it skips shmem vmas. Do so for shmem also, but not for file-private
mappings when merging a vma in order to increase the odds of collapsing
a hugepage via khugepaged.

hugepage_vma_check() sounds like a good fit to do the check. And move
the definition of it before khugepaged_enter_vma_merge() to avoid a
build error.

Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi
Acked-by: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
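
The "can fit in at least one hugepage" condition mentioned above comes down to rounding the start of the vma up and its end down to a huge-page boundary, then checking that something is left in between. The sketch below is a userspace approximation: the 2 MiB size is an assumption (typical for x86-64 PMD-sized pages), and `TOY_HPAGE_SIZE`, `TOY_HPAGE_MASK` and `toy_vma_fits_hugepage()` are hypothetical names, not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HPAGE_SIZE ((uint64_t)2 << 20)          /* assumed 2 MiB huge page */
#define TOY_HPAGE_MASK (~(TOY_HPAGE_SIZE - 1))

/* True if [vm_start, vm_end) contains at least one aligned huge-page extent. */
bool toy_vma_fits_hugepage(uint64_t vm_start, uint64_t vm_end)
{
    uint64_t hstart, hend;

    assert(vm_start <= vm_end);
    hstart = (vm_start + ~TOY_HPAGE_MASK) & TOY_HPAGE_MASK;  /* round up */
    hend = vm_end & TOY_HPAGE_MASK;                           /* round down */
    return hstart < hend;
}
```

Note that size alone is not enough: a 2 MiB range starting at 1 MiB covers no aligned extent, so it does not qualify, while an aligned 2 MiB range does.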
12 Apr, 2018
1 commit
-
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Jeff Layton
Cc: Darrick J. Wong
Cc: Dave Chinner
Cc: Ryusuke Konishi
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds