24 Nov, 2017

1 commit

  • commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.

    This matters at least for the mincore syscall, which will otherwise copy
    uninitialized memory from the page allocator to userspace. It is
    probably also a correctness error for /proc/$pid/pagemap, but I haven't
    tested that.

    Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
    no effect because the caller already checks for that.

    This only reports holes in hugetlb ranges to callers who have specified
    a hugetlb_entry callback.

    This issue was found using an AFL-based fuzzer.
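
    For reference, the fixed walk loop in walk_hugetlb_range() ends up looking
    roughly like this (a sketch of the change, not the exact upstream diff):

        do {
                next = hugetlb_entry_end(h, addr, end);
                pte = huge_pte_offset(walk->mm, addr & hmask, sz);

                if (pte)
                        err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
                else if (walk->pte_hole)
                        err = walk->pte_hole(addr, next, walk);

                if (err)
                        break;
        } while (addr = next, addr != end);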

    v2:
    - don't crash on ->pte_hole==NULL (Andrew Morton)
    - add Cc stable (Andrew Morton)

    Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
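
    For illustration, the identifier added at the top of a C source file looks
    like this (the exact comment style depends on the file type):

        // SPDX-License-Identifier: GPL-2.0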

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Jul, 2017

1 commit

  • A poisoned or migrated hugepage is stored as a swap entry in the page
    tables. On architectures that support hugepages consisting of
    contiguous page table entries (such as on arm64) this leads to ambiguity
    in determining the page table entry to return in huge_pte_offset() when
    a poisoned entry is encountered.

    Let's remove the ambiguity by adding a size parameter to convey
    additional information about the requested address. Also fix up the
    definition and usage of huge_pte_offset() throughout the tree.
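
    The interface change is roughly (a sketch of the prototypes):

        /* before */
        pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);

        /* after: callers also pass the size of the mapping they expect */
        pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
                               unsigned long sz);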

    Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: James Hogan (odd fixer:METAG ARCHITECTURE)
    Cc: Ralf Baechle (supporter:MIPS)
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     

25 Feb, 2017

1 commit

  • The current transparent hugepage code only supports PMDs. This patch
    adds support for transparent use of PUDs with DAX. It does not include
    support for anonymous pages. x86 support code is also added.

    Most of this patch simply parallels the work that was done for huge
    PMDs. The only major difference is how the new ->pud_entry method in
    mm_walk works. The ->pmd_entry method replaces the ->pte_entry method,
    whereas the ->pud_entry method works along with either ->pmd_entry or
    ->pte_entry. The pagewalk code takes care of locking the PUD before
    calling ->pud_entry, so handlers do not need to worry about whether the
    PUD is stable.
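
    A minimal sketch of the new callback and a hypothetical handler (the
    handler name and body are illustrative assumptions):

        int (*pud_entry)(pud_t *pud, unsigned long addr,
                         unsigned long next, struct mm_walk *walk);

        static int my_pud_entry(pud_t *pud, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
        {
                /* the core pagewalk already holds the PUD lock here */
                if (pud_trans_huge(*pud)) {
                        /* handle the huge PUD in one go ... */
                        return 0;
                }
                /* ... otherwise fall through to ->pmd_entry/->pte_entry */
                return 0;
        }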

    [dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
    Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
    [dave.jiang@intel.com: native_pud_clear missing on i386 build]
    Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
    Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Dave Jiang
    Tested-by: Alexander Kapshuk
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Nilesh Choudhury
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Jan, 2016

1 commit

  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
    to reflect the fact that it doesn't imply page splitting, only PMD
    splitting.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2015

1 commit

  • walk_page_test() is purely pagewalk's internal stuff, and its positive
    return values are not intended to be passed to the callers of pagewalk.

    However, in the current code, if walk_page_test() on the last vma in the
    do-while loop in walk_page_range() happens to return a positive value, it
    leaks outside walk_page_range(). So the user-visible effect is an
    invalid/unexpected return value (according to the reporter, mbind()
    triggers it).

    This patch fixes it simply by reinitializing the return value after it
    has been checked.
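
    The fix is roughly the following in walk_page_range()'s do-while loop
    (a sketch, not the exact upstream diff):

        err = walk_page_test(start, next, walk);
        if (err > 0) {
                /*
                 * Positive values only steer the walk (skip this vma);
                 * they must never leak to the caller.
                 */
                err = 0;
                continue;
        }
        if (err < 0)
                break;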

    Another exposed interface, walk_page_vma(), already returns 0 for such
    cases, so there is no problem there.

    Fixes: fafaa4264eba ("pagewalk: improve vma handling")
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kazutomo Yoshii
    Reported-by: Kazutomo Yoshii
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

12 Feb, 2015

4 commits

  • walk_page_range() silently skips vmas that have VM_PFNMAP set, which
    leads to undesirable behaviour for the caller of walk_page_range(). For
    example, for pagemap_read(), when no callbacks are called against a
    VM_PFNMAP vma, pagemap_read() may prepare pagemap data for the next
    virtual address range at the wrong index. That could confuse and/or break
    userspace applications.

    This patch avoids this misbehavior caused by vma(VM_PFNMAP) as follows
    (see the sketch after this list):
    - for pagemap_read(), which has its own ->pte_hole(), call ->pte_hole()
    over vma(VM_PFNMAP),
    - for clear_refs and queue_pages, which have their own ->test_walk,
    just return 1 and skip vma(VM_PFNMAP). This is no problem because
    these are not interested in hole regions,
    - for other callers, just skip vma(VM_PFNMAP) as the default behavior.
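
    A sketch of the resulting default handling in walk_page_test() (details
    may differ from the actual patch):

        static int walk_page_test(unsigned long start, unsigned long end,
                                  struct mm_walk *walk)
        {
                struct vm_area_struct *vma = walk->vma;

                if (walk->test_walk)
                        return walk->test_walk(start, end, walk);

                /*
                 * vma(VM_PFNMAP) default: report the range as a hole if the
                 * caller has ->pte_hole(), then skip it (return 1).
                 */
                if (vma->vm_flags & VM_PFNMAP) {
                        int err = 1;

                        if (walk->pte_hole)
                                err = walk->pte_hole(start, end, walk);
                        return err ? err : 1;
                }
                return 0;
        }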

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Shiraz Hashim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Introduce walk_page_vma(), which is useful for callers that want to
    walk over a single given vma. It is used by later patches.
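
    The assumed shape of the new helper and a hypothetical use:

        int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);

        /* e.g. a caller that wants to walk exactly one vma
         * (caller holds mmap_sem): */
        walk.mm = vma->vm_mm;
        err = walk_page_vma(vma, &walk);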

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The current implementation of the page table walker has a fundamental
    problem in vma handling, which started when we tried to handle
    vma(VM_HUGETLB). Because it is done inside the pgd loop, taking vma
    boundaries into account makes the code complicated and bug-prone.

    From the user's viewpoint, a caller checks some vma-related condition to
    determine whether it really wants to walk the page tables over that vma.

    In order to solve these problems, this patch moves the vma check outside
    the pgd loop and introduces a new callback, ->test_walk().
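
    A hypothetical ->test_walk() handler, to show the intended semantics
    (a positive return skips the vma and keeps walking, 0 walks the vma, as
    in the VM_PFNMAP handling described above):

        static int my_test_walk(unsigned long start, unsigned long end,
                                struct mm_walk *walk)
        {
                if (walk->vma->vm_flags & VM_PFNMAP)
                        return 1;       /* skip this vma, keep walking */
                return 0;               /* walk this vma */
        }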

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently no user of the page table walker sets ->pgd_entry() or
    ->pud_entry(), so checking for their existence in each loop iteration
    just wastes CPU cycles. So let's remove the checks to reduce overhead.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Feb, 2015

1 commit

  • walk_page_range() silently skips vmas that have VM_PFNMAP set, which
    leads to undesirable behaviour for the caller of walk_page_range().
    Userspace applications get wrong data, so the effect ranges from merely
    confusing users (if the applications just display the data) to sometimes
    killing processes (if the applications act on virtual addresses they
    misinterpret because of the wrong data).

    For example, for pagemap_read(), when no callbacks are called against a
    VM_PFNMAP vma, pagemap_read() may prepare pagemap data for the next
    virtual address range at the wrong index.

    Eventually userspace may get wrong pagemap data for a task: for a
    VM_PFNMAP-marked vma region, the kernel may report mappings from
    subsequent vma regions, and userspace in turn may account more pages to
    the task than there really are.

    In my case I was using procmem and procrank (Android utilities), which
    use the pagemap interface to account RSS pages of a task. Due to this
    bug they were giving a wrong picture for vmas with VM_PFNMAP set.

    Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas")
    Signed-off-by: Shiraz Hashim
    Acked-by: Naoya Horiguchi
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shiraz Hashim
     

31 Oct, 2013

1 commit

  • When walk_page_range() walks a memory map's page tables, it skips
    VM_PFNMAP areas and assigns vma->vm_end to the variable 'next', which may
    be larger than 'end'. In the next loop iteration 'addr' will then be
    larger than 'next'. Then, in the /proc/XXXX/pagemap file reading
    procedure, 'addr' keeps growing without bound in pagemap_pte_range(), and
    pte_to_pagemap_entry() accesses the wrong pte.

    BUG: Bad page map in process procrank pte:8437526f pmd:785de067
    addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping: (null) index:9108d
    CPU: 1 PID: 4974 Comm: procrank Tainted: G B W O 3.10.1+ #1
    Call Trace:
    dump_stack+0x16/0x18
    print_bad_pte+0x114/0x1b0
    vm_normal_page+0x56/0x60
    pagemap_pte_range+0x17a/0x1d0
    walk_page_range+0x19e/0x2c0
    pagemap_read+0x16e/0x200
    vfs_read+0x84/0x150
    SyS_read+0x4a/0x80
    syscall_call+0x7/0xb
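
    The fix is presumably to clamp 'next' so it can never exceed 'end' when
    such a vma is skipped, roughly:

        next = min(end, vma->vm_end);   /* instead of: next = vma->vm_end; */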

    Signed-off-by: Liu ShuoX
    Signed-off-by: Chen LinX
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Cc: [3.10.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen LinX
     

25 May, 2013

1 commit

  • A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
    application has a VM_PFNMAP range. It happened in-house when a
    benchmarker was trying to decipher the memory layout of his program.

    /proc/<pid>/smaps and similar walks through a user page table should not
    be looking at VM_PFNMAP areas.

    Certain tests in walk_page_range() (specifically split_huge_page_pmd())
    assume that all the mapped PFNs are backed with page structures. And
    this is not usually true for VM_PFNMAP areas. This can result in panics
    on kernel page faults when attempting to address those page structures.

    There are a half dozen callers of walk_page_range() that walk through a
    task's entire page table (as N. Horiguchi pointed out). So rather than
    change all of them, this patch changes just walk_page_range() to ignore
    VM_PFNMAP areas.
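
    The check added inside walk_page_range()'s loop is roughly (a sketch,
    not the exact upstream diff):

        vma = find_vma(walk->mm, addr);
        if (vma && vma->vm_start <= addr && (vma->vm_flags & VM_PFNMAP)) {
                /* no struct pages behind a VM_PFNMAP range; skip it */
                next = vma->vm_end;
                pgd = pgd_offset(walk->mm, next);
                continue;
        }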

    The logic of hugetlb_vma() is moved back into walk_page_range(), as we
    want to test any vma in the range.

    VM_PFNMAP areas are used by:
    - graphics memory manager gpu/drm/drm_gem.c
    - global reference unit sgi-gru/grufile.c
    - sgi special memory char/mspec.c
    - and probably several out-of-tree modules

    [akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
    Signed-off-by: Cliff Wickman
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: David Sterba
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     

13 Dec, 2012

1 commit

  • Pass vma instead of mm and add address parameter.

    In most cases we already have the vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases where we have the mm but not
    the vma.

    This change is preparation for the huge zero pmd splitting implementation.
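
    The interface change is roughly (prototypes sketched from the description
    above):

        /* old */
        void split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);

        /* new: callers pass the vma and address; an _mm variant covers the
         * few call sites that only have the mm */
        void split_huge_page_pmd(struct vm_area_struct *vma,
                                 unsigned long address, pmd_t *pmd);
        void split_huge_page_pmd_mm(struct mm_struct *mm,
                                    unsigned long address, pmd_t *pmd);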

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jun, 2012

1 commit

  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

22 Mar, 2012

1 commit

  • In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad(), which does not expect to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem, khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables from going away from under them; during code
    review it seems vm86 mode on 32bit kernels requires that too, unless it
    is restricted to 1 thread per process or UP builds). The race is only
    with the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example, if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped; if the page fault runs first, it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd became a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that, in a code
    path like the one below, pmd_trans_huge can be false but
    pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier()
    is just a compiler tweak and should not be measurable (I only added it
    for THP builds). I don't exclude that different compiler versions may
    have prevented the race too by caching the value of *pmd on the stack
    (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
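
    A simplified sketch of the helper described above (the exact upstream
    helper, pmd_none_or_trans_huge_or_clear_bad(), may differ in details):

        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                /* read once into a local; all checks use this stable copy */
                pmd_t pmdval = *pmd;
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                barrier();      /* keep the compiler from re-reading *pmd */
        #endif
                if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                        return 1;
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }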

    Because this race condition could be exercised without special
    privileges, it was reported as CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

        143 void pmd_clear_bad(pmd_t *pmd)
        144 {
    ->  145         pmd_ERROR(*pmd);
        146         pmd_clear(pmd);
        147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

        1381         if (mapcount != page_mapcount(page))
        1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
        1383                        mapcount, page_mapcount(page));
    ->  1384         BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

                          virtual address space
              .---------------------.
              |                     |
              |                     |
            .-|---------------------|
            | |                     |
            | |                     |<-- B(fault)
            | |/////////////////////|-.
      huge  | |/////////////////////|  > A(range)
      page  | |/////////////////////|-'
            | |                     |
            | |                     |
            '-|---------------------|
              |                     |
              |                     |
              '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
          .---------> // sneaks in here as shown below.
          |           //
          |           if (pmd_none_or_clear_bad(pmd))
          |           {
          |               if (unlikely(pmd_bad(*pmd)))
          |                   pmd_clear_bad
          |                   {
          |                       pmd_ERROR
          |                         // Log "bad pmd ..." message here.
          |                       pmd_clear
          |                         // Clear the page's PMD entry.
          |                         // Thread B incremented the map count
          |                         // in page_add_new_anon_rmap(), but
          |                         // now the page is no longer mapped
          |                         // by a PMD entry (-> inconsistency).
          |                   }
          |           }
          |
          v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Jul, 2011

4 commits

  • Commit bae9c19bf1 ("thp: split_huge_page_mm/vma") changed the locking
    behavior of walk_page_range(), so this patch updates the comment
    accordingly.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Originally, walk_hugetlb_range() didn't require the caller to take any
    lock. But commit d33b9f45bd ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range") changed that rule, because it added a find_vma() call
    in walk_hugetlb_range().

    Any commit that changes a locking rule should update the documentation
    too.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, walk_page_range() calls find_vma() on every page table walk
    iteration, but that is completely unnecessary if walk->hugetlb_entry is
    unused. And we shouldn't assume find_vma() is a lightweight operation.
    So this patch checks walk->hugetlb_entry and avoids the find_vma() call
    if possible.

    This patch also makes some cleanups: 1) remove the ugly
    uninitialized_var() and 2) remove the #ifdef in the function body.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The doc of find_vma() says,

    /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
    (snip)

    Thus, the caller should confirm that the returned vma matches the
    desired one.

    Signed-off-by: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc: Hiroyuki Kamezawa
    Cc: Andrea Arcangeli
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

23 Mar, 2011

1 commit

  • Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will
    unconditionally split any transparent huge pages it runs into. In
    practice, that means that anyone doing a

    cat /proc/$pid/smaps

    will unconditionally break down every huge page in the process and depend
    on khugepaged to re-collapse it later. This is fairly suboptimal.

    This patch changes that behavior. It teaches each ->pmd_entry handler
    (there are five) that it must break down THPs itself. Also, the
    _generic_ code will never break down a THP unless a ->pte_entry handler is
    actually set.

    This means that the ->pmd_entry handlers can now choose to deal with THPs
    without breaking them down.
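
    A schematic, hypothetical ->pmd_entry handler illustrating the choice
    (real handlers also need the appropriate page table locking):

        static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
        {
                if (pmd_trans_huge(*pmd)) {
                        /*
                         * Either account/handle the whole huge pmd here and
                         * return, or split it and let the pte-level code run:
                         */
                        split_huge_page_pmd(walk->mm, pmd);
                }
                return 0;
        }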

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Eric B Munson
    Tested-by: Eric B Munson
    Cc: Michael J Wolf
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Matt Mackall
    Cc: Jeremy Fitzhardinge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

14 Jan, 2011

1 commit

  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundreds of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Nov, 2010

1 commit

  • Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range()") introduced a check whether a vma is a hugetlbfs one,
    and later 5dc37642 ("mm hugetlb: add hugepage support to pagemap") moved
    it under #ifdef CONFIG_HUGETLB_PAGE, but a needless find_vma() call was
    left behind and its result is not used anywhere else in the function.

    The side-effect of caching vma for @addr inside walk->mm is neither
    utilized in walk_page_range() nor in called functions.

    Signed-off-by: David Sterba
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Andy Whitcroft
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Matt Mackall
    Acked-by: Mel Gorman
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Sterba
     

07 Apr, 2010

1 commit

  • When we look into pagemap using page-types with option -p, the pfn value
    for hugepages looks wrong (see below). This is because the pte was
    evaluated only once per vma although it should be updated for each
    hugepage. This patch fixes it.

    $ page-types -p 3277 -Nl -b huge
    voffset    offset  len  flags
    7f21e8a00  11e400  1    ___U___________H_G________________
    7f21e8a01  11e401  1ff  ________________TG________________
                  ^^^
    7f21e8c00  11e400  1    ___U___________H_G________________
    7f21e8c01  11e401  1ff  ________________TG________________
                  ^^^

    On x86_64 one hugepage contains 1 head page and 511 tail pages, and each
    pair of lines represents one hugepage. Voffset and offset mean the
    virtual address and the physical address in page units, respectively.
    Different hugepages should not have the same offset value.

    With this patch applied:

    $ page-types -p 3386 -Nl -b huge
    voffset    offset  len  flags
    7fec7a600  112c00  1    ___UD__________H_G________________
    7fec7a601  112c01  1ff  ________________TG________________
                  ^^^
    7fec7a800  113200  1    ___UD__________H_G________________
    7fec7a801  113201  1ff  ________________TG________________
                  ^^^
    OK

    More info:

    - This patch modifies walk_page_range()'s hugepage walker. But the
    change only affects pagemap_read(), which is the only caller of the
    hugepage callback.

    - Without this patch, the hugetlb_entry() callback is called per vma,
    which doesn't match the natural expectation from its name.

    - With this patch, hugetlb_entry() is called per hugepte entry and the
    callback can become much simpler.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Dec, 2009

2 commits

  • This patch enables extraction of the pfn of a hugepage from
    /proc/pid/pagemap in an architecture independent manner.

    Details
    -------
    My test program (leak_pagemap) works as follows:
    - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages),
    - read()/write() something on it,
    - call page-types with option -p,
    - munmap() and unlink() the file on hugetlbfs.

    Without my patches
    ------------------
    $ ./leak_pagemap
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 1 0 __________________________________
    0x0000000000000804 1 0 __R________M______________________ referenced,mmap
    0x000000000000086c 81 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
    0x0000000000005808 5 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 101 0

    The output of page-types doesn't show any hugepages.

    With my patches
    ---------------
    $ ./leak_pagemap
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 1 0 __________________________________
    0x0000000000030000 51100 199 ________________TG________________ compound_tail,huge
    0x0000000000028018 100 0 ___UD__________H_G________________ uptodate,dirty,compound_head,huge
    0x0000000000000804 1 0 __R________M______________________ referenced,mmap
    0x000000000000080c 1 0 __RU_______M______________________ referenced,uptodate,mmap
    0x000000000000086c 80 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
    0x0000000000005808 4 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 51300 200

    The output of page-types shows 51200 pages contributing to hugepages,
    containing 100 head pages and 51100 tail pages as expected.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wu Fengguang
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Andy Whitcroft
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Most callers of pmd_none_or_clear_bad() check whether the target page is
    in a hugepage or not, but walk_page_range() does not check it. So if we
    read /proc/pid/pagemap for a hugepage on an x86 machine, the hugepage
    memory is leaked as shown below. This patch fixes it.

    Details
    =======
    My test program (leak_pagemap) works as follows:
    - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages),
    - read()/write() something on it,
    - call page-types with option -p (which walks the page tables),
    - munmap() and unlink() the file on hugetlbfs.

    Without my patches
    ------------------
    $ cat /proc/meminfo |grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 1000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ./leak_pagemap
    [snip output]
    $ cat /proc/meminfo |grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 900
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ls /hugetlbfs/
    $

    100 hugepages are accounted as used while there is no file on hugetlbfs.

    With my patches
    ---------------
    $ cat /proc/meminfo |grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 1000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ./leak_pagemap
    [snip output]
    $ cat /proc/meminfo |grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 1000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ls /hugetlbfs
    $

    No memory leaks.
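
    For reference, the kind of check the fix adds to walk_page_range()'s loop
    is roughly this (a sketch; details differ from the actual patch):

        vma = find_vma(walk->mm, addr);
        if (vma && is_vm_hugetlb_page(vma)) {
                /*
                 * Skip the hugetlb vma here: pmd_none_or_clear_bad() must
                 * not see (and clear) its huge PMD entries.
                 */
                if (vma->vm_end < next)
                        next = vma->vm_end;
                continue;
        }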

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wu Fengguang
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Andy Whitcroft
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

13 Jun, 2008

1 commit

  • We need this at least for huge page detection for now, because powerpc
    needs the vm_area_struct to be able to determine whether a virtual address
    is referring to a huge page (its pmd_huge() doesn't work).

    It might also come in handy for some of the other users.

    Signed-off-by: Dave Hansen
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

28 Apr, 2008

1 commit

  • After the loop in walk_pte_range() pte might point to the first address after
    the pmd it walks. The pte_unmap() is then applied to something bad.

    Spotted by Roel Kluin and Andreas Schwab.
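
    One way to avoid it, roughly matching the description above (a sketch,
    not the exact upstream diff; the pte_entry callback signature of this
    era may differ):

        pte = pte_offset_map(pmd, addr);
        for (;;) {
                err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
                if (err)
                        break;
                addr += PAGE_SIZE;
                if (addr == end)
                        break;  /* stop before stepping past the last entry */
                pte++;
        }
        pte_unmap(pte);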

    Signed-off-by: Johannes Weiner
    Cc: Roel Kluin
    Cc: Andreas Schwab
    Acked-by: Matt Mackall
    Acked-by: Mikael Pettersson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

20 Mar, 2008

1 commit

  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

06 Feb, 2008

1 commit