10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Add new APIs to assert that mmap_sem is held.

    Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
    makes the assertions more tolerant of future changes to the lock type.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
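
    A minimal sketch of how the new assertions might be used by code that
    requires the caller to hold the mm lock; the two helper functions below
    are made up for illustration, and the assertion helpers named here exist
    in current kernels as mmap_assert_locked() and mmap_assert_write_locked():

        #include <linux/mm_types.h>
        #include <linux/mmap_lock.h>

        /* Caller must hold the mm lock (read or write). */
        static void sketch_read_side(struct mm_struct *mm)
        {
                mmap_assert_locked(mm);
                /* ... inspect VMAs without modifying them ... */
        }

        /* Caller must hold the mm lock for write. */
        static void sketch_write_side(struct mm_struct *mm)
        {
                mmap_assert_write_locked(mm);
                /* ... modify the VMA tree ... */
        }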
     

04 Feb, 2020

6 commits

  • struct mm_struct is quite large (~1664 bytes) and so allocating on the
    stack may cause problems as the kernel stack size is small.

    Since ptdump_walk_pgd_level_core() was only allocating the structure so
    that it could modify the pgd argument, we can instead introduce a pgd
    override in struct mm_walk and pass this down the call stack to where it
    is needed.

    Since the correct mm_struct is now being passed down, it is now also
    unnecessary to take the mmap_sem semaphore because ptdump_walk_pgd() will
    now take the semaphore on the real mm.

    [steven.price@arm.com: restore missed arm64 changes]
    Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
    Signed-off-by: Steven Price
    Reported-by: Stephen Rothwell
    Cc: Catalin Marinas
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • The pte_hole() callback is called at multiple levels of the page tables.
    Code dumping the kernel page tables needs to know at what depth the
    missing entry is. Add this as an extra parameter to pte_hole(). When the
    depth isn't known (e.g. when processing a vma), -1 is passed.

    The depth that is reported is the actual level where the entry is missing
    (ignoring any folding that is in place), i.e. any levels where
    PTRS_PER_P?D is set to 1 are ignored.

    Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
    natural numbers as levels 2/3/4.

    Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
    Signed-off-by: Steven Price
    Tested-by: Zong Li
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
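
    A minimal sketch of a pte_hole() callback using the new depth parameter
    (the callback, ops table, and debug message are illustrative only):

        #include <linux/pagewalk.h>
        #include <linux/printk.h>

        /* depth: 0=PGD, 1=P4D, 2=PUD, 3=PMD, 4=PTE, or -1 when unknown. */
        static int sketch_pte_hole(unsigned long addr, unsigned long next,
                                   int depth, struct mm_walk *walk)
        {
                pr_debug("hole %#lx-%#lx at depth %d\n", addr, next, depth);
                return 0;
        }

        static const struct mm_walk_ops sketch_hole_ops = {
                .pte_hole = sketch_pte_hole,
        };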
     
  • If walk_pte_range() is called with an 'end' argument that is beyond the
    last page of memory (e.g. ~0UL), the comparison between 'addr' and 'end'
    will always fail and the loop will be infinite. Instead change the
    comparison to >= while accounting for overflow.

    Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
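
    Roughly the shape of the fixed inner loop (a sketch, not the exact kernel
    diff): the termination check is done before advancing, using >=, so an
    'end' at or near the top of the address space cannot make 'addr' wrap
    around and loop forever:

        #include <linux/pagewalk.h>

        static int sketch_walk_pte_loop(pte_t *pte, unsigned long addr,
                                        unsigned long end, struct mm_walk *walk)
        {
                const struct mm_walk_ops *ops = walk->ops;
                int err = 0;

                for (;;) {
                        err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
                        if (err)
                                break;
                        /* '>=' (not '==') so e.g. end == ~0UL still
                         * terminates on the last page without overflow. */
                        if (addr >= end - PAGE_SIZE)
                                break;
                        addr += PAGE_SIZE;
                        pte++;
                }
                return err;
        }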
     
  • walk_page_range_novma() can be used to walk the page tables of the kernel
    or of firmware. These page tables may contain entries that are not backed
    by a struct page, so it isn't (in general) possible to take the PTE lock
    for the pte_entry() callback. So update walk_pte_range() to only take the
    lock when no_vma==false, by splitting out the inner loop into a separate
    function, and add a comment explaining the difference to
    walk_page_range_novma().

    Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range for
    vma(VM_PFNMAP)", page_table_walk() will report any kernel area as a hole,
    because it lacks a vma.

    This means each arch has re-implemented page table walking when needed,
    for example in the per-arch ptdump walker.

    Remove the requirement to have a vma in the generic code and add a new
    function walk_page_range_novma() which ignores the VMAs and simply walks
    the page tables.

    Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
    Signed-off-by: Steven Price
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
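
    A sketch of how a caller might walk a kernel address range with the new
    interface. Note the pgd argument shown here was added by a later patch in
    this series, and newer kernels expect the mm's mmap lock to be held for
    write; the callback and helper function are illustrative:

        #include <linux/pagewalk.h>
        #include <linux/mmap_lock.h>

        static int sketch_kernel_pte(pte_t *pte, unsigned long addr,
                                     unsigned long next, struct mm_walk *walk)
        {
                /* Inspect one kernel PTE; there is no VMA here. */
                return 0;
        }

        static const struct mm_walk_ops sketch_novma_ops = {
                .pte_entry = sketch_kernel_pte,
        };

        static void sketch_walk_kernel_range(unsigned long start, unsigned long end)
        {
                mmap_write_lock(&init_mm);
                /* NULL pgd means "use init_mm's own pgd". */
                walk_page_range_novma(&init_mm, start, end, &sketch_novma_ops,
                                      NULL, NULL);
                mmap_write_unlock(&init_mm);
        }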
     
  • pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
    ("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
    users. We're about to add users so reintroduce them, along with
    p4d_entry() as we now have 5 levels of tables.

    Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
    transparent hugepages") already re-added pud_entry() but with different
    semantics to the other callbacks. This commit reverts the semantics back
    to match the other callbacks.

    To support hmm.c which now uses the new semantics of pud_entry() a new
    member ('action') of struct mm_walk is added which allows the callbacks to
    either descend (ACTION_SUBTREE, the default), skip (ACTION_CONTINUE) or
    repeat the callback (ACTION_AGAIN). hmm.c is then updated to call
    pud_trans_huge_lock() itself and make use of the splitting/retry logic of
    the core code.

    After this change pud_entry() is called for all entries, not just
    transparent huge pages.

    [arnd@arndb.de: fix unused variable warning]
    Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
    Signed-off-by: Steven Price
    Signed-off-by: Arnd Bergmann
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Zong Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
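
    A minimal sketch of a pud_entry() callback driving the walk via the new
    'action' field (the policy shown is illustrative):

        #include <linux/pagewalk.h>

        static int sketch_pud_entry(pud_t *pud, unsigned long addr,
                                    unsigned long next, struct mm_walk *walk)
        {
                if (pud_none(*pud)) {
                        /* Nothing mapped: don't descend below this PUD. */
                        walk->action = ACTION_CONTINUE;
                        return 0;
                }
                /* Leaving walk->action at ACTION_SUBTREE (the default)
                 * descends to PMDs/PTEs; ACTION_AGAIN would re-run this
                 * callback, e.g. after splitting a huge PUD. */
                return 0;
        }

        static const struct mm_walk_ops sketch_pud_ops = {
                .pud_entry = sketch_pud_entry,
        };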
     

06 Nov, 2019

2 commits

  • For users that want to traverse all page table entries pointing into a
    region of a struct address_space mapping, introduce a walk_page_mapping()
    function.

    The walk_page_mapping() function will initially be used for dirty
    tracking in virtual graphics drivers.

    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Signed-off-by: Thomas Hellstrom
    Reviewed-by: Andrew Morton

    Thomas Hellstrom
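
    Roughly how a driver might use the new interface (a sketch: the callback
    and helper are illustrative, and the caller is assumed here to hold the
    mapping's i_mmap lock across the walk):

        #include <linux/fs.h>
        #include <linux/pagewalk.h>

        static int sketch_note_pte(pte_t *pte, unsigned long addr,
                                   unsigned long next, struct mm_walk *walk)
        {
                /* e.g. record whether this PTE is dirty or writable */
                return 0;
        }

        static const struct mm_walk_ops sketch_mapping_ops = {
                .pte_entry = sketch_note_pte,
        };

        /* Visit every PTE, in every VMA, that maps file pages
         * [first, first + nr) of 'mapping'. */
        static int sketch_scan_mapping(struct address_space *mapping,
                                       pgoff_t first, pgoff_t nr)
        {
                int ret;

                i_mmap_lock_read(mapping);
                ret = walk_page_mapping(mapping, first, nr,
                                        &sketch_mapping_ops, NULL);
                i_mmap_unlock_read(mapping);
                return ret;
        }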
     
  • Without the lock, anybody modifying a pte from within this function might
    have it concurrently modified by someone else.

    Cc: Matthew Wilcox
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Signed-off-by: Thomas Hellstrom
    Acked-by: Kirill A. Shutemov

    Thomas Hellstrom
     

07 Sep, 2019

3 commits

  • Use lockdep to check for held locks instead of using home grown asserts.

    Link: https://lore.kernel.org/r/20190828141955.22210-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
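
    A sketch of the two assertion styles involved, wrapped in a hypothetical
    helper (at the time of this commit the lock was still named mmap_sem):

        #include <linux/lockdep.h>
        #include <linux/mmdebug.h>
        #include <linux/mm_types.h>
        #include <linux/rwsem.h>

        static void sketch_assert_mmap_sem_held(struct mm_struct *mm)
        {
                /* Home-grown style: always compiled in, checks the rwsem
                 * state directly. */
                VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));

                /* Lockdep style (what this commit switches to): compiles
                 * away unless CONFIG_LOCKDEP is enabled. */
                lockdep_assert_held(&mm->mmap_sem);
        }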
     
  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
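
    After this split, a caller defines a const operations table and passes
    it, together with the mm, to walk_page_range(); a minimal sketch with an
    illustrative callback:

        #include <linux/pagewalk.h>

        static int sketch_pmd_cb(pmd_t *pmd, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk)
        {
                /* per-PMD work goes here; walk->private carries caller data */
                return 0;
        }

        static const struct mm_walk_ops sketch_ops = {
                .pmd_entry = sketch_pmd_cb,
        };

        /* Caller must hold mm's mmap lock. */
        static int sketch_walk(struct mm_struct *mm, unsigned long start,
                               unsigned long end, void *private)
        {
                return walk_page_range(mm, start, end, &sketch_ops, private);
        }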
     
  • Add a new header for the two handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

06 Apr, 2018

1 commit


07 Feb, 2018

1 commit


16 Nov, 2017

1 commit

  • This matters at least for the mincore syscall, which will otherwise copy
    uninitialized memory from the page allocator to userspace. It is
    probably also a correctness error for /proc/$pid/pagemap, but I haven't
    tested that.

    Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
    no effect because the caller already checks for that.

    This only reports holes in hugetlb ranges to callers who have specified
    a hugetlb_entry callback.

    This issue was found using an AFL-based fuzzer.

    v2:
    - don't crash on ->pte_hole==NULL (Andrew Morton)
    - add Cc stable (Andrew Morton)

    Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
    Signed-off-by: Jann Horn
    Cc:
    Signed-off-by: Linus Torvalds

    Jann Horn
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <SPDX>).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Jul, 2017

1 commit

  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fixup the
    definition/usage of huge_pte_offset() throughout the tree.

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
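
    A sketch of the updated prototype and a typical hugetlb call site (the
    wrapper function is illustrative):

        #include <linux/hugetlb.h>
        #include <linux/mm.h>

        /*
         * The size of the huge page mapped at 'addr' is now passed in, so
         * architectures with contiguous-PTE hugepages (e.g. arm64) can tell
         * which level of entry to return even for a poisoned/migration entry.
         */
        pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
                               unsigned long sz);

        static pte_t *sketch_hugetlb_lookup(struct vm_area_struct *vma,
                                            unsigned long addr)
        {
                struct hstate *h = hstate_vma(vma);

                return huge_pte_offset(vma->vm_mm, addr, huge_page_size(h));
        }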
     

10 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • The current transparent hugepage code only supports PMDs. This patch
    adds support for transparent use of PUDs with DAX. It does not include
    support for anonymous pages. x86 support code also added.

    Most of this patch simply parallels the work that was done for huge
    PMDs. The only major difference is how the new ->pud_entry method in
    mm_walk works. The ->pmd_entry method replaces the ->pte_entry method,
    whereas the ->pud_entry method works along with either ->pmd_entry or
    ->pte_entry. The pagewalk code takes care of locking the PUD before
    calling ->pud_entry, so handlers do not need to worry whether the PUD is
    stable.

    [dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
    Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
    [dave.jiang@intel.com: native_pud_clear missing on i386 build]
    Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dave Jiang
    Tested-by: Alexander Kapshuk
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Jan, 2016

1 commit

  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
    to reflect the fact that it doesn't imply page splitting, only PMD.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2015

1 commit

  • walk_page_test() is purely pagewalk's internal stuff, and its positive
    return values are not intended to be passed to the callers of pagewalk.

    However, in the current code, if the last vma in the do-while loop in
    walk_page_range() happens to return a positive value, it leaks outside
    walk_page_range(). So the user-visible effect is an invalid/unexpected
    return value (according to the reporter, mbind() causes it).

    This patch fixes it simply by reinitializing the return value after it
    has been checked.

    Another exposed interface, walk_page_vma(), already returns 0 for such
    cases, so there is no problem there.

    Fixes: fafaa4264eba ("pagewalk: improve vma handling")
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kazutomo Yoshii
    Reported-by: Kazutomo Yoshii
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

12 Feb, 2015

4 commits

  • walk_page_range() silently skips any vma having VM_PFNMAP set, which leads
    to undesirable behaviour at the client end (whoever called
    walk_page_range()). For example, for pagemap_read(), when no callbacks are
    called against a VM_PFNMAP vma, pagemap_read() may prepare pagemap data
    for the next virtual address range at the wrong index. That could confuse
    and/or break userspace applications.

    This patch avoids this misbehaviour caused by vma(VM_PFNMAP) as follows:
    - for pagemap_read(), which has its own ->pte_hole(), call the ->pte_hole()
      over vma(VM_PFNMAP),
    - for clear_refs and queue_pages, which have their own ->test_walk,
      just return 1 and skip vma(VM_PFNMAP). This is no problem because
      these are not interested in hole regions,
    - for other callers, just skip the vma(VM_PFNMAP) as the default behaviour.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Shiraz Hashim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Introduce walk_page_vma(), which is useful for the callers which want to
    walk over a given vma. It's used by later patches.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The current implementation of the page table walker has a fundamental
    problem in vma handling, which started when we tried to handle
    vma(VM_HUGETLB). Because it's done in the pgd loop, considering vma
    boundaries makes the code complicated and bug-prone.

    From the user's viewpoint, a user checks some vma-related condition to
    determine whether it really should walk the pages of that vma.

    In order to solve these problems, this patch moves the vma check outside
    the pgd loop and introduces a new callback ->test_walk().

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
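
    A minimal sketch of a ->test_walk() callback in the modern mm_walk_ops
    form (the skip policy shown is just an example): returning a positive
    value skips the vma without reporting an error, 0 walks it, and a
    negative value aborts the walk.

        #include <linux/mm.h>
        #include <linux/pagewalk.h>

        static int sketch_test_walk(unsigned long start, unsigned long end,
                                    struct mm_walk *walk)
        {
                struct vm_area_struct *vma = walk->vma;

                if (vma->vm_flags & VM_PFNMAP)
                        return 1;       /* skip this vma, keep walking */
                return 0;               /* walk this vma normally */
        }

        static const struct mm_walk_ops sketch_filter_ops = {
                .test_walk = sketch_test_walk,
        };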
     
  • Currently no user of the page table walker sets ->pgd_entry() or
    ->pud_entry(), so checking their existence in each loop iteration just
    wastes CPU cycles. So let's remove them to reduce the overhead.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Feb, 2015

1 commit

  • walk_page_range() silently skips any vma having VM_PFNMAP set, which leads
    to undesirable behaviour at the client end (whoever called
    walk_page_range()). Userspace applications get the wrong data, so the
    effect ranges from merely confusing users (if the applications just
    display the data) to sometimes killing processes (if the applications act
    on virtual addresses misinterpreted because of the wrong data).

    For example, for pagemap_read, when no callbacks are called against a
    VM_PFNMAP vma, pagemap_read may prepare pagemap data for the next virtual
    address range at the wrong index.

    Eventually userspace may get wrong pagemap data for a task.
    Corresponding to a VM_PFNMAP-marked vma region, the kernel may report
    mappings from subsequent vma regions. User space in turn may account
    more pages (than there really are) to the task.

    In my case I was using procmem and procrank (Android utilities), which use
    the pagemap interface to account RSS pages of a task. Due to this bug they
    gave a wrong picture for vmas (with VM_PFNMAP set).

    Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas")
    Signed-off-by: Shiraz Hashim
    Acked-by: Naoya Horiguchi
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shiraz Hashim
     

10 Oct, 2014

1 commit


31 Oct, 2013

1 commit

  • When walk_page_range() walks a memory map's page tables, it skips
    VM_PFNMAP areas and sets the variable 'next' to vma->vm_end, which may be
    larger than 'end'. In the next loop iteration 'addr' will then be larger
    than 'next'. In the /proc/XXXX/pagemap read path this makes 'addr' grow
    forever in pagemap_pte_range(), and pte_to_pagemap_entry() accesses the
    wrong pte.

    BUG: Bad page map in process procrank pte:8437526f pmd:785de067
    addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping: (null) index:9108d
    CPU: 1 PID: 4974 Comm: procrank Tainted: G B W O 3.10.1+ #1
    Call Trace:
    dump_stack+0x16/0x18
    print_bad_pte+0x114/0x1b0
    vm_normal_page+0x56/0x60
    pagemap_pte_range+0x17a/0x1d0
    walk_page_range+0x19e/0x2c0
    pagemap_read+0x16e/0x200
    vfs_read+0x84/0x150
    SyS_read+0x4a/0x80
    syscall_call+0x7/0xb

    Signed-off-by: Liu ShuoX
    Signed-off-by: Chen LinX
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Cc: [3.10.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen LinX
     

25 May, 2013

1 commit

  • A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
    application has a VM_PFNMAP range. It happened in-house when a
    benchmarker was trying to decipher the memory layout of his program.

    /proc/<pid>/smaps and similar walks through a user page table should not
    be looking at VM_PFNMAP areas.

    Certain tests in walk_page_range() (specifically split_huge_page_pmd())
    assume that all the mapped PFNs are backed with page structures. And
    this is not usually true for VM_PFNMAP areas. This can result in panics
    on kernel page faults when attempting to address those page structures.

    There are a half dozen callers of walk_page_range() that walk through a
    task's entire page table (as N. Horiguchi pointed out). So rather than
    change all of them, this patch changes just walk_page_range() to ignore
    VM_PFNMAP areas.

    The logic of hugetlb_vma() is moved back into walk_page_range(), as we
    want to test any vma in the range.

    VM_PFNMAP areas are used by:
    - graphics memory manager gpu/drm/drm_gem.c
    - global reference unit sgi-gru/grufile.c
    - sgi special memory char/mspec.c
    - and probably several out-of-tree modules

    [akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
    Signed-off-by: Cliff Wickman
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: David Sterba
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     

13 Dec, 2012

1 commit

  • Pass the vma instead of the mm, and add an address parameter.

    In most cases we already have the vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases when we have the mm but not
    the vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jun, 2012

1 commit

  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

22 Mar, 2012

1 commit

  • In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad() that will not like to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem, khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables to go away from under them, during code review it
    seems vm86 mode on 32bit kernels requires that too unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd became a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next - addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
    -> 145         pmd_ERROR(*pmd);
    146         pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381         if (mapcount != page_mapcount(page))
    1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383                        mapcount, page_mapcount(page));
    -> 1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

    [ASCII diagram, garbled in this extract: it showed the process's virtual
    address space with one 2 MB (huge page) region highlighted; Thread A's
    madvise() range "A(range)" lies within that region, as does the address
    "B(fault)" that Thread B faults on below.]

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
    // Acquire the semaphore in shared mode.
    down_read(&current->mm->mmap_sem)
    ...
    madvise_vma
    switch (behavior)
    case MADV_DONTNEED:
    madvise_dontneed
    zap_page_range
    unmap_vmas
    unmap_page_range
    zap_pud_range
    zap_pmd_range
    //
    // Assume that this huge page has never been accessed.
    // I.e. content of the PMD entry is zero (not mapped).
    //
    if (pmd_trans_huge(*pmd)) {
    // We don't get here due to the above assumption.
    }
    //
    // Assume that Thread B incurred a page fault and
    .---------> // sneaks in here as shown below.
    | //
    | if (pmd_none_or_clear_bad(pmd))
    | {
    | if (unlikely(pmd_bad(*pmd)))
    | pmd_clear_bad
    | {
    | pmd_ERROR
    | // Log "bad pmd ..." message here.
    | pmd_clear
    | // Clear the page's PMD entry.
    | // Thread B incremented the map count
    | // in page_add_new_anon_rmap(), but
    | // now the page is no longer mapped
    | // by a PMD entry (-> inconsistency).
    | }
    | }
    |
    v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
    __do_page_fault
    // Acquire the semaphore in shared mode.
    down_read_trylock(&mm->mmap_sem)
    ...
    handle_mm_fault
    if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
    // We get here due to the above assumption (PMD entry is zero).
    do_huge_pmd_anonymous_page
    alloc_hugepage_vma
    // Allocate a new transparent huge page here.
    ...
    __do_huge_pmd_anonymous_page
    ...
    spin_lock(&mm->page_table_lock)
    ...
    page_add_new_anon_rmap
    // Here we increment the page's map count (starts at -1).
    atomic_set(&page->_mapcount, 0)
    set_pmd_at
    // Here we set the page's PMD entry which will be cleared
    // when Thread A calls pmd_clear_bad().
    ...
    spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
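
    A minimal sketch of the idea this commit describes: read the pmd once
    into a local, put a compiler barrier between the read and the checks, and
    test only the snapshot (this is illustrative, not the exact helper the
    patch adds):

        #include <linux/mm.h>

        static inline int sketch_pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                pmd_t pmdval = *pmd;

                /* Compiler barrier so the checks below really use the
                 * snapshot and *pmd is not silently re-read. */
                barrier();

                if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                        return 1;
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }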
     

26 Jul, 2011

4 commits

  • Commit bae9c19bf1 ("thp: split_huge_page_mm/vma") changed the locking
    behavior of walk_page_range(). Thus this patch updates the comment too.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Originally, walk_hugetlb_range() didn't require the caller to take any
    lock. But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range") changed that rule, because it added a find_vma() call
    in walk_hugetlb_range().

    Any commit that changes the locking rules should also update the
    documentation.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, walk_page_range() calls find_vma() in every page table walk
    iteration, but that is completely unnecessary if walk->hugetlb_entry is
    unused. And we can't assume find_vma() is a lightweight operation. So
    this patch checks walk->hugetlb_entry and avoids the find_vma() call
    where possible.

    This patch also makes some cleanups: 1) remove the ugly
    uninitialized_var() and 2) remove the #ifdef in the function body.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The doc of find_vma() says,

    /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
    (snip)

    Thus, the caller should confirm whether the returned vma matches the
    desired one.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

23 Mar, 2011

1 commit

  • Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
    unconditionally split any transparent huge pages it runs into. In
    practice, that means that anyone doing a

    cat /proc/$pid/smaps

    will unconditionally break down every huge page in the process and depend
    on khugepaged to re-collapse it later. This is fairly suboptimal.

    This patch changes that behavior. It teaches each ->pmd_entry handler
    (there are five) that they must break down the THPs themselves. Also, the
    _generic_ code will never break down a THP unless a ->pte_entry handler is
    actually set.

    This means that the ->pmd_entry handlers can now choose to deal with THPs
    without breaking them down.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
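
    A minimal sketch of a ->pmd_entry handler that deals with THPs itself,
    written against the modern mm_walk_ops API and the current
    pmd_trans_huge_lock() helper (the handler body is illustrative):

        #include <linux/huge_mm.h>
        #include <linux/pagewalk.h>
        #include <linux/spinlock.h>

        static int sketch_pmd_entry(pmd_t *pmd, unsigned long addr,
                                    unsigned long next, struct mm_walk *walk)
        {
                spinlock_t *ptl;

                ptl = pmd_trans_huge_lock(pmd, walk->vma);
                if (ptl) {
                        /* Account the whole huge PMD here rather than
                         * having the generic walker split it. */
                        spin_unlock(ptl);
                        return 0;
                }
                /* Not a THP: the walker descends to the PTE level if a
                 * ->pte_entry callback is set. */
                return 0;
        }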
     

14 Jan, 2011

1 commit

  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundreds of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Nov, 2010

1 commit

  • Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range()") introduces a check if a vma is a hugetlbfs one and
    later in 5dc37642 ("mm hugetlb: add hugepage support to pagemap") it is
    moved under #ifdef CONFIG_HUGETLB_PAGE but a needless find_vma call is
    left behind and its result is not used anywhere else in the function.

    The side-effect of caching vma for @addr inside walk->mm is neither
    utilized in walk_page_range() nor in called functions.

    Signed-off-by: David Sterba
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Andy Whitcroft
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Matt Mackall
    Acked-by: Mel Gorman
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Sterba