10 Jun, 2020
2 commits
-
Convert comments that reference mmap_sem to reference mmap_lock instead.
[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]
Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Reviewed-by: Daniel Jordan
Cc: Davidlohr Bueso
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Laurent Dufour
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds
-
Add new APIs to assert that mmap_sem is held.
Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
makes the assertions more tolerant of future changes to the lock type.
Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Reviewed-by: Daniel Jordan
Cc: Davidlohr Bueso
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Laurent Dufour
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
Signed-off-by: Linus Torvalds
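As a rough illustration (not the exact patch), such assertion helpers can be built on top of lockdep and rwsem_is_locked(); this sketch assumes the mm's lock field is named mmap_lock:
static inline void mmap_assert_locked(struct mm_struct *mm)
{
	/* Held for read or write. */
	lockdep_assert_held(&mm->mmap_lock);
	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}

static inline void mmap_assert_write_locked(struct mm_struct *mm)
{
	/* Held exclusively (for write). */
	lockdep_assert_held_write(&mm->mmap_lock);
	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}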
04 Feb, 2020
6 commits
-
struct mm_struct is quite large (~1664 bytes) and so allocating on the
stack may cause problems as the kernel stack size is small.
Since ptdump_walk_pgd_level_core() was only allocating the structure so
that it could modify the pgd argument we can instead introduce a pgd
override in struct mm_walk and pass this down the call stack to where it
is needed.
Since the correct mm_struct is now being passed down, it is now also
unnecessary to take the mmap_sem semaphore because ptdump_walk_pgd() will
now take the semaphore on the real mm.
[steven.price@arm.com: restore missed arm64 changes]
Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
Signed-off-by: Steven Price
Reported-by: Stephen Rothwell
Cc: Catalin Marinas
Cc: Albert Ou
Cc: Alexandre Ghiti
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Christian Borntraeger
Cc: Dave Hansen
Cc: David S. Miller
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: James Hogan
Cc: James Morse
Cc: Jerome Glisse
Cc: "Liang, Kan"
Cc: Mark Rutland
Cc: Michael Ellerman
Cc: Paul Burton
Cc: Paul Mackerras
Cc: Paul Walmsley
Cc: Peter Zijlstra
Cc: Ralf Baechle
Cc: Russell King
Cc: Thomas Gleixner
Cc: Vasily Gorbik
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Zong Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know at what depth the
missing entry is. Add this as an extra parameter to pte_hole(). When the
depth isn't known (e.g. processing a vma) then -1 is passed.
The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e. any levels where
PTRS_PER_P?D is set to 1 are ignored.
Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.
Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price
Tested-by: Zong Li
Cc: Albert Ou
Cc: Alexandre Ghiti
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christian Borntraeger
Cc: Dave Hansen
Cc: David S. Miller
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: James Hogan
Cc: James Morse
Cc: Jerome Glisse
Cc: "Liang, Kan"
Cc: Mark Rutland
Cc: Michael Ellerman
Cc: Paul Burton
Cc: Paul Mackerras
Cc: Paul Walmsley
Cc: Peter Zijlstra
Cc: Ralf Baechle
Cc: Russell King
Cc: Thomas Gleixner
Cc: Vasily Gorbik
Cc: Vineet Gupta
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
If walk_pte_range() is called with an 'end' argument that is beyond the
last page of memory (e.g. ~0UL) then the comparison between 'addr' and
'end' will always fail and the loop will be infinite. Instead change the
comparison to >= while accounting for overflow.
Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
Signed-off-by: Steven Price
Cc: Albert Ou
Cc: Alexandre Ghiti
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christian Borntraeger
Cc: Dave Hansen
Cc: David S. Miller
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: James Hogan
Cc: James Morse
Cc: Jerome Glisse
Cc: "Liang, Kan"
Cc: Mark Rutland
Cc: Michael Ellerman
Cc: Paul Burton
Cc: Paul Mackerras
Cc: Paul Walmsley
Cc: Peter Zijlstra
Cc: Ralf Baechle
Cc: Russell King
Cc: Thomas Gleixner
Cc: Vasily Gorbik
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Zong Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
walk_page_range_novma() can be used to walk page tables of the kernel or
for firmware. These page tables may contain entries that are not backed
by a struct page and so it isn't (in general) possible to take the PTE
lock for the pte_entry() callback. So update walk_pte_range() to only
take the lock when no_vma==false by splitting out the inner loop to a
separate function and add a comment explaining the difference to
walk_page_range_novma().
Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com
Signed-off-by: Steven Price
Cc: Albert Ou
Cc: Alexandre Ghiti
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christian Borntraeger
Cc: Dave Hansen
Cc: David S. Miller
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: James Hogan
Cc: James Morse
Cc: Jerome Glisse
Cc: "Liang, Kan"
Cc: Mark Rutland
Cc: Michael Ellerman
Cc: Paul Burton
Cc: Paul Mackerras
Cc: Paul Walmsley
Cc: Peter Zijlstra
Cc: Ralf Baechle
Cc: Russell King
Cc: Thomas Gleixner
Cc: Vasily Gorbik
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Zong Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range for
vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole,
because it lacks a vma.
This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.
Remove the requirement to have a vma in the generic code and add a new
function walk_page_range_novma() which ignores the VMAs and simply walks
the page tables.
Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
Signed-off-by: Steven Price
Cc: Albert Ou
Cc: Alexandre Ghiti
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christian Borntraeger
Cc: Dave Hansen
Cc: David S. Miller
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: James Hogan
Cc: James Morse
Cc: Jerome Glisse
Cc: "Liang, Kan"
Cc: Mark Rutland
Cc: Michael Ellerman
Cc: Paul Burton
Cc: Paul Mackerras
Cc: Paul Walmsley
Cc: Peter Zijlstra
Cc: Ralf Baechle
Cc: Russell King
Cc: Thomas Gleixner
Cc: Vasily Gorbik
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Zong Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
users. We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.
Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
transparent hugepages") already re-added pud_entry() but with different
semantics to the other callbacks. This commit reverts the semantics back
to match the other callbacks.
To support hmm.c which now uses the new semantics of pud_entry() a new
member ('action') of struct mm_walk is added which allows the callbacks to
either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
repeat the callback (ACTION_AGAIN). hmm.c is then updated to call
pud_trans_huge_lock() itself and make use of the splitting/retry logic of
the core code.
After this change pud_entry() is called for all entries, not just
transparent huge pages.
[arnd@arndb.de: fix unused variable warning]
Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
Signed-off-by: Steven Price
Signed-off-by: Arnd Bergmann
Cc: Albert Ou
Cc: Alexandre Ghiti
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christian Borntraeger
Cc: Dave Hansen
Cc: David S. Miller
Cc: Heiko Carstens
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: James Hogan
Cc: James Morse
Cc: Jerome Glisse
Cc: "Liang, Kan"
Cc: Mark Rutland
Cc: Michael Ellerman
Cc: Paul Burton
Cc: Paul Mackerras
Cc: Paul Walmsley
Cc: Peter Zijlstra
Cc: Ralf Baechle
Cc: Russell King
Cc: Thomas Gleixner
Cc: Vasily Gorbik
Cc: Vineet Gupta
Cc: Will Deacon
Cc: Zong Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
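A hedged sketch of what a pud_entry() callback using the new 'action' field could look like (the callback name here is invented for illustration):
static int example_pud_entry(pud_t *pud, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	if (pud_none(*pud)) {
		/* Nothing mapped here: don't descend into lower levels. */
		walk->action = ACTION_CONTINUE;
		return 0;
	}
	/*
	 * Leaving walk->action at ACTION_SUBTREE (the default) lets the
	 * core code descend to the pmd level; ACTION_AGAIN asks it to
	 * re-invoke this callback, e.g. after a huge pud was split.
	 */
	return 0;
}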
06 Nov, 2019
2 commits
-
For users that want to traverse all page table entries pointing into a
region of a struct address_space mapping, introduce a walk_page_mapping()
function.
The walk_page_mapping() function will initially be used for dirty-
tracking in virtual graphics drivers.
Cc: Andrew Morton
Cc: Matthew Wilcox
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Michal Hocko
Cc: Huang Ying
Cc: Jérôme Glisse
Cc: Kirill A. Shutemov
Signed-off-by: Thomas Hellstrom
Reviewed-by: Andrew Morton
-
Without the lock, anybody modifying a pte from within this function might
have it concurrently modified by someone else.
Cc: Matthew Wilcox
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Minchan Kim
Cc: Michal Hocko
Cc: Huang Ying
Cc: Jérôme Glisse
Cc: Kirill A. Shutemov
Suggested-by: Linus Torvalds
Signed-off-by: Thomas Hellstrom
Acked-by: Kirill A. Shutemov
07 Sep, 2019
3 commits
-
Use lockdep to check for held locks instead of using home grown asserts.
Link: https://lore.kernel.org/r/20190828141955.22210-4-hch@lst.de
Signed-off-by: Christoph Hellwig
Reviewed-by: Thomas Hellstrom
Reviewed-by: Steven Price
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
-
The mm_walk structure currently mixes data and code. Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.
Based on patch from Linus Torvalds.
Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig
Reviewed-by: Thomas Hellstrom
Reviewed-by: Steven Price
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
-
Add a new header for the two handful of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.
Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig
Reviewed-by: Thomas Hellstrom
Reviewed-by: Steven Price
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
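To show how the split-out interface is meant to be consumed after this series, here is a hedged sketch of a caller: include the new header, define a const mm_walk_ops, and hand it to walk_page_range() (the callback and ops names are illustrative):
#include <linux/pagewalk.h>

static int dump_pte(pte_t *pte, unsigned long addr, unsigned long next,
		    struct mm_walk *walk)
{
	/* Print each present pte in the walked range. */
	if (pte_present(*pte))
		pr_info("%#lx: %#llx\n", addr, (unsigned long long)pte_val(*pte));
	return 0;
}

static const struct mm_walk_ops dump_ops = {
	.pte_entry = dump_pte,
};

static void dump_range(struct mm_struct *mm, unsigned long start,
		       unsigned long end)
{
	/* The caller must hold mmap_sem for read. */
	walk_page_range(mm, start, end, &dump_ops, NULL);
}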
06 Apr, 2018
1 commit
-
Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Feb, 2018
1 commit
-
Link: http://lkml.kernel.org/r/1516700871-22279-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Nov, 2017
1 commit
-
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn
Cc:
Signed-off-by: Linus Torvalds
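Roughly, the fixed walk_hugetlb_range() loop reports the hole instead of silently skipping it; a hedged sketch of the shape of the change (not the exact diff):
	pte = huge_pte_offset(walk->mm, addr & hmask, sz);
	if (pte)
		err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
	else if (walk->pte_hole)
		/* Report the unmapped range so callers such as mincore
		 * fill their output instead of leaking stale data. */
		err = walk->pte_hole(addr, next, walk);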
02 Nov, 2017
1 commit
-
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
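For a C source file such as mm/pagewalk.c, the tagging amounts to a single new first line:
// SPDX-License-Identifier: GPL-2.0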
07 Jul, 2017
1 commit
-
A poisoned or migrated hugepage is stored as a swap entry in the page
tables. On architectures that support hugepages consisting of
contiguous page table entries (such as on arm64) this leads to ambiguity
in determining the page table entry to return in huge_pte_offset() when
a poisoned entry is encountered.
Let's remove the ambiguity by adding a size parameter to convey
additional information about the requested address. Also fixup the
definition/usage of huge_pte_offset() throughout the tree.
Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal
Acked-by: Steve Capper
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Tony Luck
Cc: Fenghua Yu
Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
Cc: Ralf Baechle (supporter:MIPS)
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Yoshinori Sato
Cc: Rich Felker
Cc: "David S. Miller"
Cc: Chris Metcalf
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Alexander Viro
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: "Aneesh Kumar K.V"
Cc: "Kirill A. Shutemov"
Cc: Hillf Danton
Cc: Mark Rutland
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
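The visible interface change is the extra size argument; a hedged sketch of the new prototype and a typical call site in a hugetlb page walker:
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
		       unsigned long sz);

/* e.g. inside walk_hugetlb_range(), with h = hstate_vma(vma): */
pte = huge_pte_offset(walk->mm, addr & hmask, huge_page_size(h));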
10 Mar, 2017
1 commit
-
Convert all non-architecture-specific code to 5-level paging.
It's mostly mechanical adding handling one more page table level in
places where we deal with pud_t.
Signed-off-by: Kirill A. Shutemov
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
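In practice the mechanical change inserts a p4d step between the pgd and pud lookups; a hedged sketch of the pattern (the function name is invented for illustration):
static void visit_pud(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	p4d_t *p4d = p4d_offset(pgd, addr);	/* the new, fifth level */
	pud_t *pud;

	if (p4d_none_or_clear_bad(p4d))
		return;				/* treat as a hole */
	pud = pud_offset(p4d, addr);		/* pud_offset() now takes a p4d */
	/* ... continue down to pmd/pte as before ... */
}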
25 Feb, 2017
1 commit
-
The current transparent hugepage code only supports PMDs. This patch
adds support for transparent use of PUDs with DAX. It does not include
support for anonymous pages. x86 support code also added.
Most of this patch simply parallels the work that was done for huge
PMDs. The only major difference is how the new ->pud_entry method in
mm_walk works. The ->pmd_entry method replaces the ->pte_entry method,
whereas the ->pud_entry method works along with either ->pmd_entry or
->pte_entry. The pagewalk code takes care of locking the PUD before
calling ->pud_walk, so handlers do not need to worry whether the PUD is
stable.
[dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
[dave.jiang@intel.com: native_pud_clear missing on i386 build]
Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox
Signed-off-by: Dave Jiang
Tested-by: Alexander Kapshuk
Cc: Dave Hansen
Cc: Vlastimil Babka
Cc: Jan Kara
Cc: Dan Williams
Cc: Ross Zwisler
Cc: Kirill A. Shutemov
Cc: Nilesh Choudhury
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Jan, 2016
1 commit
-
We are going to decouple splitting THP PMD from splitting underlying
compound page.
This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
to reflect the fact that it doesn't imply page splitting, only PMD.
Signed-off-by: Kirill A. Shutemov
Tested-by: Sasha Levin
Tested-by: Aneesh Kumar K.V
Acked-by: Jerome Marchand
Acked-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Steve Capper
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Christoph Lameter
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
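At a call site the rename is mechanical; a hedged before/after sketch inside a pmd-level handler (note the pmd and address arguments also swap places):
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);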
26 Mar, 2015
1 commit
-
walk_page_test() is purely pagewalk's internal stuff, and its positive
return values are not intended to be passed to the callers of pagewalk.
However, in the current code if the last vma in the do-while loop in
walk_page_range() happens to return a positive value, it leaks outside
walk_page_range(). So the user visible effect is an invalid/unexpected
return value (according to the reporter, mbind() causes it.)
This patch fixes it simply by reinitializing the return value after it
has been checked.
Another exposed interface, walk_page_vma(), already returns 0 for such
cases so no problem.
Fixes: fafaa4264eba ("pagewalk: improve vma handling")
Signed-off-by: Naoya Horiguchi
Signed-off-by: Kazutomo Yoshii
Reported-by: Kazutomo Yoshii
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
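The fix is, roughly, a reset inside the walk_page_range() loop so positive control values never escape (sketch; comments are mine):
	err = walk_page_test(start, next, walk);
	if (err > 0) {
		/*
		 * Positive return values only steer the walk (skip this
		 * vma); never let them leak to the caller.
		 */
		err = 0;
		continue;
	}
	if (err < 0)
		break;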
12 Feb, 2015
4 commits
-
walk_page_range() silently skips vma having VM_PFNMAP set, which leads to
undesirable behaviour at client end (who called walk_page_range). For
example for pagemap_read(), when no callbacks are called against VM_PFNMAP
vma, pagemap_read() may prepare pagemap data for next virtual address
range at wrong index. That could confuse and/or break userspace
applications.
This patch avoids this misbehavior caused by vma(VM_PFNMAP) as follows:
- for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
over vma(VM_PFNMAP),
- for clear_refs and queue_pages which have their own ->tests_walk,
just return 1 and skip vma(VM_PFNMAP). This is no problem because
these are not interested in hole regions,
- for other callers, just skip the vma(VM_PFNMAP) as a default behavior.
Signed-off-by: Naoya Horiguchi
Signed-off-by: Shiraz Hashim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Introduce walk_page_vma(), which is useful for the callers which want to
walk over a given vma. It's used by later patches.
Signed-off-by: Naoya Horiguchi
Acked-by: Kirill A. Shutemov
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Cyrill Gorcunov
Cc: Dave Hansen
Cc: Pavel Emelyanov
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Current implementation of page table walker has a fundamental problem in
vma handling, which started when we tried to handle vma(VM_HUGETLB).
Because it's done in the pgd loop, considering the vma boundary makes the
code complicated and bug-prone.
From the user's viewpoint, some user checks some vma-related condition to
determine whether the user really does page walk over the vma.
In order to solve these, this patch moves the vma check outside the pgd
loop and introduces a new callback ->test_walk().
Signed-off-by: Naoya Horiguchi
Acked-by: Kirill A. Shutemov
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Cyrill Gorcunov
Cc: Dave Hansen
Cc: Pavel Emelyanov
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently no user of page table walker sets ->pgd_entry() or
->pud_entry(), so checking their existence in each loop is just wasting
CPU cycles. So let's remove it to reduce overhead.
Signed-off-by: Naoya Horiguchi
Acked-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Cyrill Gorcunov
Cc: Dave Hansen
Cc: Kirill A. Shutemov
Cc: Pavel Emelyanov
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Feb, 2015
1 commit
-
walk_page_range() silently skips vma having VM_PFNMAP set, which leads
to undesirable behaviour at client end (who called walk_page_range).
Userspace applications get the wrong data, so the effect is like just
confusing users (if the applications just display the data) or sometimes
killing the processes (if the applications do something with
misunderstanding virtual addresses due to the wrong data).
For example for pagemap_read, when no callbacks are called against
VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual
address range at wrong index.
Eventually userspace may get wrong pagemap data for a task.
Corresponding to a VM_PFNMAP marked vma region, kernel may report
mappings from subsequent vma regions. User space in turn may account
more pages (than really are) to the task.
In my case I was using procmem, procrank (Android utility) which uses
pagemap interface to account RSS pages of a task. Due to this bug it
was giving a wrong picture for vmas (with VM_PFNMAP set).
Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas")
Signed-off-by: Shiraz Hashim
Acked-by: Naoya Horiguchi
Cc: [3.10+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Oct, 2014
1 commit
-
Dump the contents of the relevant mm_struct when we hit the bug condition.
Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Oct, 2013
1 commit
-
When walk_page_range() walks a memory map's page tables, it'll skip
VM_PFNMAP areas, and then the variable 'next' will be assigned to
vma->vm_end, which may be larger than 'end'. In the next loop, 'addr'
will be larger than 'next'. Then in the /proc/XXXX/pagemap file reading
procedure, 'addr' will grow forever in pagemap_pte_range, and
pte_to_pagemap_entry will access the wrong pte.
BUG: Bad page map in process procrank pte:8437526f pmd:785de067
addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping: (null) index:9108d
CPU: 1 PID: 4974 Comm: procrank Tainted: G B W O 3.10.1+ #1
Call Trace:
dump_stack+0x16/0x18
print_bad_pte+0x114/0x1b0
vm_normal_page+0x56/0x60
pagemap_pte_range+0x17a/0x1d0
walk_page_range+0x19e/0x2c0
pagemap_read+0x16e/0x200
vfs_read+0x84/0x150
SyS_read+0x4a/0x80
syscall_call+0x7/0xb
Signed-off-by: Liu ShuoX
Signed-off-by: Chen LinX
Acked-by: Kirill A. Shutemov
Reviewed-by: Naoya Horiguchi
Cc: [3.10.x+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
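The fix boils down to clamping 'next' to the requested end when a VM_PFNMAP vma is skipped; a hedged sketch of the change in walk_page_range():
-		next = vma->vm_end;
+		next = min(end, vma->vm_end);	/* never walk past the caller's end */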
25 May, 2013
1 commit
-
A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
application has a VM_PFNMAP range. It happened in-house when a
benchmarker was trying to decipher the memory layout of his program.
/proc/<pid>/smaps and similar walks through a user page table should not
be looking at VM_PFNMAP areas.
Certain tests in walk_page_range() (specifically split_huge_page_pmd())
assume that all the mapped PFN's are backed with page structures. And
this is not usually true for VM_PFNMAP areas. This can result in panics
on kernel page faults when attempting to address those page structures.
There are a half dozen callers of walk_page_range() that walk through a
task's entire page table (as N. Horiguchi pointed out). So rather than
change all of them, this patch changes just walk_page_range() to ignore
VM_PFNMAP areas.
The logic of hugetlb_vma() is moved back into walk_page_range(), as we
want to test any vma in the range.
VM_PFNMAP areas are used by:
- graphics memory manager gpu/drm/drm_gem.c
- global reference unit sgi-gru/grufile.c
- sgi special memory char/mspec.c
- and probably several out-of-tree modules
[akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
Signed-off-by: Cliff Wickman
Reviewed-by: Naoya Horiguchi
Cc: Mel Gorman
Cc: Andrea Arcangeli
Cc: Dave Hansen
Cc: David Sterba
Cc: Johannes Weiner
Cc: KOSAKI Motohiro
Cc: "Kirill A. Shutemov"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
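A hedged sketch of the resulting check in walk_page_range(): look up the vma for the current address and step over it when VM_PFNMAP is set:
	vma = find_vma(walk->mm, addr);
	if (vma && (vma->vm_flags & VM_PFNMAP)) {
		/* No struct pages behind these PFNs: skip the whole vma. */
		next = min(end, vma->vm_end);
		continue;
	}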
13 Dec, 2012
1 commit
-
Pass vma instead of mm and add address parameter.
In most cases we already have the vma on the stack. We provide
split_huge_page_pmd_mm() for the few cases when we have the mm, but not
the vma.
This change is preparation for the huge zero pmd splitting implementation.
Signed-off-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: "H. Peter Anvin"
Cc: Mel Gorman
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Jun, 2012
1 commit
-
Fix kernel-doc warnings such as
Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'
Signed-off-by: Wanpeng Li
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Mar, 2012
1 commit
-
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem held in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
[ASCII diagram of the virtual address space: a 2 MB huge page region,
with Thread A's madvise() target range "A(range)" and Thread B's faulting
address "B(fault)" both inside that same huge page range.]
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(&current->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell
Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Dave Jones
Acked-by: Larry Woodman
Acked-by: Rik van Riel
Cc: [2.6.38+]
Cc: Mark Salter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
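A simplified sketch of the kind of helper this approach leads to, roughly along the lines of pmd_none_or_trans_huge_or_clear_bad(): copy the pmd to the stack behind a compiler barrier and test the stable local value (the real helper also deals with 32-bit PAE atomicity):
static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
{
	pmd_t pmdval = *pmd;

	/* Compiler barrier only: a hugepmd materializing after this read
	 * is fine, we keep operating on the stable stack copy. */
	barrier();
	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
		return 1;
	if (unlikely(pmd_bad(pmdval))) {
		pmd_clear_bad(pmd);
		return 1;
	}
	return 0;
}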
26 Jul, 2011
4 commits
-
Commit bae9c19bf1 ("thp: split_huge_page_mm/vma") changed locking behavior
of walk_page_range(). Thus this patch changes the comment too.
Signed-off-by: KOSAKI Motohiro
Cc: Naoya Horiguchi
Cc: Hiroyuki Kamezawa
Cc: Andrea Arcangeli
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Originally, walk_hugetlb_range() didn't require a caller take any lock.
But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in
walk_page_range") changed its rule. Because it added find_vma() call in
walk_hugetlb_range().
Any locking-rule change commit should write a doc too.
[akpm@linux-foundation.org: clarify comment]
Signed-off-by: KOSAKI Motohiro
Cc: Naoya Horiguchi
Cc: Hiroyuki Kamezawa
Cc: Andrea Arcangeli
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Currently, walk_page_range() calls find_vma() on every page table walk
iteration, but it's completely unnecessary if walk->hugetlb_entry is
unused. And we don't have to assume find_vma() is a lightweight
operation. So this patch checks the walk->hugetlb_entry and avoids the
find_vma() call if possible.
This patch also makes some cleanups: 1) remove ugly uninitialized_var()
and 2) remove the #ifdef in the function body.
Signed-off-by: KOSAKI Motohiro
Cc: Naoya Horiguchi
Cc: Hiroyuki Kamezawa
Cc: Andrea Arcangeli
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The doc of find_vma() says,
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
(snip)
Thus, caller should confirm whether the returned vma matches a desired one.
Signed-off-by: KOSAKI Motohiro
Cc: Naoya Horiguchi
Cc: Hiroyuki Kamezawa
Cc: Andrea Arcangeli
Cc: Matt Mackall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Mar, 2011
1 commit
-
Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
unconditionally split any transparent huge pages it runs into. In
practice, that means that anyone doing a
cat /proc/$pid/smaps
will unconditionally break down every huge page in the process and depend
on khugepaged to re-collapse it later. This is fairly suboptimal.
This patch changes that behavior. It teaches each ->pmd_entry handler
(there are five) that they must break down the THPs themselves. Also, the
_generic_ code will never break down a THP unless a ->pte_entry handler is
actually set.
This means that the ->pmd_entry handlers can now choose to deal with THPs
without breaking them down.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Hansen
Acked-by: Mel Gorman
Acked-by: David Rientjes
Reviewed-by: Eric B Munson
Tested-by: Eric B Munson
Cc: Michael J Wolf
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc: Matt Mackall
Cc: Jeremy Fitzhardinge
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
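So after this change a ->pmd_entry handler chooses its own policy; a hedged sketch (handler name invented, and split_huge_page_pmd() took (mm, pmd) at this point in history):
static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	if (pmd_trans_huge(*pmd)) {
		/* Either account the whole 2MB mapping here in one go ... */
		/* ... or split it and let the pte level handle it: */
		split_huge_page_pmd(walk->mm, pmd);
	}
	return 0;
}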
14 Jan, 2011
1 commit
-
split_huge_page_pmd compat code. Each one of those would need to be
expanded to hundreds of lines of complex code without a fully reliable
split_huge_page_pmd design.
Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 Nov, 2010
1 commit
-
Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
walk_page_range()") introduces a check if a vma is a hugetlbfs one and
later in 5dc37642 ("mm hugetlb: add hugepage support to pagemap") it is
moved under #ifdef CONFIG_HUGETLB_PAGE but a needless find_vma call is
left behind and its result is not used anywhere else in the function.
The side-effect of caching vma for @addr inside walk->mm is neither
utilized in walk_page_range() nor in called functions.
Signed-off-by: David Sterba
Reviewed-by: Naoya Horiguchi
Acked-by: Andi Kleen
Cc: Andy Whitcroft
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Lee Schermerhorn
Cc: Matt Mackall
Acked-by: Mel Gorman
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds