14 Oct, 2020

1 commit

  • Patch series "Return head pages from find_*_entry", v2.

    This patch series started out as part of the THP patch set, but it has
    some nice effects along the way and it seems worth splitting it out and
    submitting separately.

    Currently find_get_entry() and find_lock_entry() return the page
    corresponding to the requested index, but the first thing most callers do
    is find the head page, which we just threw away. As part of auditing all
    the callers, I found some misuses of the APIs and some plain
    inefficiencies that I've fixed.

    The diffstat is unflattering, but I added more kernel-doc and a new wrapper.

    This patch (of 8):

    Provide this functionality from the swap cache. It's useful for
    more than just mincore().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Jani Nikula
    Cc: Alexey Dobriyan
    Cc: Johannes Weiner
    Cc: Chris Wilson
    Cc: Matthew Auld
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200910183318.20139-1-willy@infradead.org
    Link: https://lkml.kernel.org/r/20200910183318.20139-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
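
    A minimal sketch of the kind of "find the page even if it lives in the
    swap cache" helper this patch factors out of mincore_page(); the function
    name and exact flow below are illustrative assumptions, not the patch
    itself.

    #include <linux/pagemap.h>
    #include <linux/shmem_fs.h>
    #include <linux/swap.h>
    #include <linux/swapops.h>

    /* Illustrative only: return the page for @index whether it sits in the
     * page cache or, for shmem, in the swap cache. */
    static struct page *example_find_incore_page(struct address_space *mapping,
                                                 pgoff_t index)
    {
            struct page *page = find_get_entry(mapping, index);
            swp_entry_t swp;

            /* Plain page cache hit (or NULL): nothing more to do. */
            if (!xa_is_value(page))
                    return page;

            /* Only shmem stores swap entries in its page cache tree. */
            if (!shmem_mapping(mapping))
                    return NULL;

            /* Chase the swap entry into the swap cache. */
            swp = radix_to_swp_entry(page);
            return find_get_page(swap_address_space(swp), swp_offset(swp));
    }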
     

10 Jun, 2020

3 commits

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
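
    A before/after illustration of what the rule above does to a call site;
    the example functions are made up, but down_read()/up_read() and
    mmap_read_lock()/mmap_read_unlock() are the real old and new calls.

    #include <linux/mm_types.h>
    #include <linux/mmap_lock.h>
    #include <linux/rwsem.h>

    /* Old style: raw rwsem operations on mm->mmap_sem. */
    static void example_before(struct mm_struct *mm)
    {
            down_read(&mm->mmap_sem);
            /* ... inspect VMAs ... */
            up_read(&mm->mmap_sem);
    }

    /* New style: the mmap locking API introduced by this series. */
    static void example_after(struct mm_struct *mm)
    {
            mmap_read_lock(mm);
            /* ... inspect VMAs ... */
            mmap_read_unlock(mm);
    }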
     
  • The replacement of <asm/pgtable.h> with <linux/pgtable.h> made the include
    of the latter land in the middle of the asm includes. Fix this up with the
    aid of the below script and manual adjustments here and there.

    import sys
    import re

    if len(sys.argv) is not 3:
        print "USAGE: %s <file> <header>" % (sys.argv[0])
        sys.exit(1)

    hdr_to_move="#include <linux/%s>" % sys.argv[2]
    moved = False
    in_hdrs = False

    with open(sys.argv[1], "r") as f:
        lines = f.readlines()
        for _line in lines:
            line = _line.rstrip('\n')
            if line == hdr_to_move:
                continue
            if line.startswith("#include <linux/"):
                in_hdrs = True
            elif not moved and in_hdrs:
                moved = True
                print hdr_to_move
            print line

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
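
    An illustration (not taken from the patch) of the reordering the script
    performs on an affected file, assuming a file that already has another
    <linux/...> include:

    /* Before: linux/pgtable.h landed among the asm includes. */
    #include <linux/mm.h>
    #include <asm/processor.h>
    #include <linux/pgtable.h>
    #include <asm/io.h>

    /* After: it is grouped with the other <linux/...> includes. */
    #include <linux/mm.h>
    #include <linux/pgtable.h>
    #include <asm/processor.h>
    #include <asm/io.h>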
     
  • The include/linux/pgtable.h is going to be the home of generic page table
    manipulation functions.

    Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
    make the latter include asm/pgtable.h.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Feb, 2020

1 commit

  • The pte_hole() callback is called at multiple levels of the page tables.
    Code dumping the kernel page tables needs to know at what depth the
    missing entry is. Add this as an extra parameter to pte_hole(). When the
    depth isn't known (e.g. when processing a vma), -1 is passed.

    The depth that is reported is the actual level where the entry is missing
    (ignoring any folding that is in place), i.e. any levels where
    PTRS_PER_P?D is set to 1 are ignored.

    Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
    natural numbers as levels 2/3/4.

    Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
    Signed-off-by: Steven Price
    Tested-by: Zong Li
    Cc: Albert Ou
    Cc: Alexandre Ghiti
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: James Morse
    Cc: Jerome Glisse
    Cc: "Liang, Kan"
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Price
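
    A hedged sketch of a walker callback using the new parameter; the
    callback name and the printout are illustrative, while the signature
    follows the description above.

    #include <linux/pagewalk.h>
    #include <linux/printk.h>

    /* Report each hole together with the level it was found at:
     * depth 0 is the PGD, 4 is the PTE level, -1 means unknown. */
    static int example_pte_hole(unsigned long addr, unsigned long next,
                                int depth, struct mm_walk *walk)
    {
            static const char * const level[] = { "pgd", "p4d", "pud", "pmd", "pte" };

            if (depth >= 0 && depth <= 4)
                    pr_info("hole %#lx-%#lx at %s level\n", addr, next, level[depth]);
            else
                    pr_info("hole %#lx-%#lx at unknown depth\n", addr, next);
            return 0;
    }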
     

26 Sep, 2019

1 commit

  • This patch is part of a series that extends the kernel ABI to allow
    tagged user pointers (with the top byte set to something other than 0x00)
    to be passed as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixes data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
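
    A hedged sketch of the resulting calling convention; the callback and
    variable names are illustrative, while struct mm_walk_ops and
    walk_page_range() are the interface this commit introduces.

    #include <linux/mm.h>
    #include <linux/pagewalk.h>

    static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk)
    {
            /* examine the pmd covering [addr, next) here */
            return 0;
    }

    /* The operations vector is constant data, separate from the per-walk
     * state that the core now declares internally. */
    static const struct mm_walk_ops example_ops = {
            .pmd_entry = example_pmd_entry,
    };

    static int example_walk(struct mm_struct *mm, unsigned long start,
                            unsigned long end)
    {
            /* the caller must hold the mmap lock for reading */
            return walk_page_range(mm, start, end, &example_ops, NULL);
    }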
     
  • Add a new header for the handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

13 Jul, 2019

1 commit

  • Since commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"),
    the address_space associated with a swap device is freed after swapoff.
    So swap_address_space() users that touch the address_space need some kind
    of mechanism to prevent it from being freed while it is being accessed.

    When mincore processes an unmapped range for swapped shmem pages, it
    doesn't hold a lock to prevent the swap device from being swapped off, so
    the following race is possible:

    CPU1                                CPU2
    do_mincore()                        swapoff()
      walk_page_range()
        mincore_unmapped_range()
          __mincore_unmapped_range
            mincore_page
              as = swap_address_space()
              ...                         exit_swap_address_space()
              ...                           kvfree(spaces)
              find_get_page(as)

    The address space may be accessed after being freed.

    To fix the race, get_swap_device()/put_swap_device() is used to enclose
    find_get_page(): this checks whether the swap entry is valid and prevents
    the swap device from being swapped off during the access.

    Link: http://lkml.kernel.org/r/20190611020510.28251-1-ying.huang@intel.com
    Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Paul E. McKenney
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Tim Chen
    Cc: Mel Gorman
    Cc: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Yang Shi
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Jan Kara
    Cc: Dave Jiang
    Cc: Daniel Jordan
    Cc: Andrea Parri
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
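
    A hedged sketch of the fix's pattern, with an illustrative wrapper name;
    get_swap_device()/put_swap_device() are the real interface.

    #include <linux/pagemap.h>
    #include <linux/swap.h>

    /* Pin the swap device for the duration of the swap cache lookup so a
     * concurrent swapoff cannot free the address_space under us. */
    static struct page *example_lookup_swap_cache(swp_entry_t entry)
    {
            struct swap_info_struct *si;
            struct page *page;

            si = get_swap_device(entry);    /* NULL if the entry is stale */
            if (!si)
                    return NULL;

            page = find_get_page(swap_address_space(entry), swp_offset(entry));
            put_swap_device(si);
            return page;
    }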
     

15 May, 2019

1 commit

  • The semantics of what mincore() considers to be resident is not
    completely clear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache".

    That's potentially a problem, as that [in]directly exposes
    meta-information about pagecache / memory mapping state even about
    memory not strictly belonging to the process executing the syscall,
    opening possibilities for sidechannel attacks.

    Change the semantics of mincore() so that it only reveals pagecache
    information for non-anonymous mappings that belong to files that the
    calling process could (if it tried to) successfully open for writing;
    otherwise we'd be including shared non-exclusive mappings, which

    - is the sidechannel

    - is not the usecase for mincore(), as that's primarily used for data,
    not (shared) text

    [jkosina@suse.cz: v2]
    Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
    [mhocko@suse.com: restructure can_do_mincore() conditions]
    Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
    Signed-off-by: Jiri Kosina
    Signed-off-by: Vlastimil Babka
    Acked-by: Josh Snyder
    Acked-by: Michal Hocko
    Originally-by: Linus Torvalds
    Originally-by: Dominique Martinet
    Cc: Andy Lutomirski
    Cc: Dave Chinner
    Cc: Kevin Easton
    Cc: Matthew Wilcox
    Cc: Cyril Hrubis
    Cc: Tejun Heo
    Cc: Kirill A. Shutemov
    Cc: Daniel Gruss
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
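
    A hedged sketch of the permission test this describes, written from the
    description above rather than copied from the patch; the helper name is
    illustrative.

    #include <linux/fs.h>
    #include <linux/mm.h>

    static bool example_can_do_mincore(struct vm_area_struct *vma)
    {
            /* Anonymous memory is always the caller's own business. */
            if (vma_is_anonymous(vma))
                    return true;
            if (!vma->vm_file)
                    return false;
            /* Reveal residency only if the caller owns the file or could
             * open it for writing. */
            return inode_owner_or_capable(file_inode(vma->vm_file)) ||
                   inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
    }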
     

24 Jan, 2019

1 commit

  • This reverts commit 574823bfab82d9d8fa47f422778043fbb4b4f50e.

    It turns out that my hope that we could just remove the code that
    exposes the cache residency status from mincore() was too optimistic.

    There are various random users that want it, and one example would be
    the Netflix database cluster maintenance. To quote Josh Snyder:

    "For Netflix, losing accurate information from the mincore syscall
    would lengthen database cluster maintenance operations from days to
    months. We rely on cross-process mincore to migrate the contents of a
    page cache from machine to machine, and across reboots.

    To do this, I wrote and maintain happycache [1], a page cache
    dumper/loader tool. It is quite similar in architecture to pgfincore,
    except that it is agnostic to workload. The gist of happycache's
    operation is "produce a dump of residence status for each page, do
    some operation, then reload exactly the same pages which were present
    before." happycache is entirely dependent on accurate reporting of the
    in-core status of file-backed pages, as accessed by another process.

    We primarily use happycache with Cassandra, which (like Postgres +
    pgfincore) relies heavily on OS page cache to reduce disk accesses.
    Because our workloads never experience a cold page cache, we are able
    to provision hardware for a peak utilization level that is far lower
    than the hypothetical "every query is a cache miss" peak.

    A database warmed by happycache can be ready for service in seconds
    (bounded only by the performance of the drives and the I/O subsystem),
    with no period of in-service degradation. By contrast, putting a
    database in service without a page cache entails a potentially
    unbounded period of degradation (at Netflix, the time to populate a
    single node's cache via natural cache misses varies by workload from
    hours to weeks). If a single node upgrade were to take weeks, then
    upgrading an entire cluster would take months. Since we want to apply
    security upgrades (and other things) on a somewhat tighter schedule,
    we would have to develop more complex solutions to provide the same
    functionality already provided by mincore.

    At the bottom line, happycache is designed to benignly exploit the
    same information leak documented in the paper [2]. I think it makes
    perfect sense to remove cross-process mincore functionality from
    unprivileged users, but not to remove it entirely"

    We do have an alternate approach that limits the cache residency
    reporting only to processes that have write permissions to the file, so
    we can fix the original information leak issue that way. It involves
    _adding_ code rather than removing it, which is sad, but hey, at least
    we haven't found any users that would find the restrictions
    unacceptable.

    So revert the optimistic first approach to make room for that alternate
    fix instead.

    Reported-by: Josh Snyder
    Cc: Jiri Kosina
    Cc: Dominique Martinet
    Cc: Andy Lutomirski
    Cc: Dave Chinner
    Cc: Kevin Easton
    Cc: Matthew Wilcox
    Cc: Cyril Hrubis
    Cc: Vlastimil Babka
    Cc: Tejun Heo
    Cc: Kirill A. Shutemov
    Cc: Daniel Gruss
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Jan, 2019

1 commit

  • The semantics of what "in core" means for the mincore() system call are
    somewhat unclear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache" rather than "page is mapped in the mapping".

    The problem with that traditional semantic is that it exposes a lot of
    system cache state that it really probably shouldn't, and that users
    shouldn't really even care about.

    So let's try to avoid that information leak by simply changing the
    semantics to be that mincore() counts actual mapped pages, not pages
    that might be cheaply mapped if they were faulted (note the "might be"
    part of the old semantics: being in the cache doesn't actually guarantee
    that you can access them without IO anyway, since things like network
    filesystems may have to revalidate the cache before use).

    In many ways the old semantics were somewhat insane even aside from the
    information leak issue. From the very beginning (and that beginning is
    a long time ago: 2.3.52 was released in March 2000, I think), the code
    had a comment saying

    Later we can get more picky about what "in core" means precisely.

    and this is that "later". Admittedly it is much later than is really
    comfortable.

    NOTE! This is a real semantic change, and it is for example known to
    change the output of "fincore", since that program literally does a
    mmap without populating it, and then does "mincore()" on that mapping
    that doesn't actually have any pages in it.

    I'm hoping that nobody actually has any workflow that cares, and the
    info leak is real.

    We may have to do something different if it turns out that people have
    valid reasons to want the old semantics, and if we can limit the
    information leak sanely.

    Cc: Kevin Easton
    Cc: Jiri Kosina
    Cc: Masatake YAMATO
    Cc: Andrew Morton
    Cc: Greg KH
    Cc: Peter Zijlstra
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
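
    A before/after illustration of the interface change; the surrounding
    function is an editor's example, access_ok() is the real interface.

    #include <linux/errno.h>
    #include <linux/uaccess.h>

    static int example_check(const void __user *buf, size_t len)
    {
            /* Before this patch:
             *      if (!access_ok(VERIFY_READ, buf, len))
             *              return -EFAULT;
             */
            if (!access_ok(buf, len))       /* after: no VERIFY_* argument */
                    return -EFAULT;
            return 0;
    }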
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
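
    A small hedged example of the value-entry encoding; the function is
    illustrative, while xa_mk_value()/xa_is_value()/xa_to_value() are the
    real helpers.

    #include <linux/gfp.h>
    #include <linux/printk.h>
    #include <linux/xarray.h>

    /* @xa must already be initialized, e.g. with DEFINE_XARRAY(). */
    static void example_value_entry(struct xarray *xa)
    {
            void *entry;

            /* Store a small integer directly in the slot; up to
             * BITS_PER_LONG - 1 bits fit in a value entry. */
            xa_store(xa, 0, xa_mk_value(42), GFP_KERNEL);

            entry = xa_load(xa, 0);
            if (xa_is_value(entry))
                    pr_info("stored value %lu\n", xa_to_value(entry));
    }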
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
      lines of source.
    - File already had some variant of a license header in it (even if <=5
      lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

25 Feb, 2017

1 commit

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more .c files now need to #include it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 Dec, 2016

1 commit


08 Oct, 2016

1 commit

  • This patch is to improve the performance of swap cache operations when
    the type of the swap device is not 0. Originally, the whole swap entry
    value is used as the key of the swap cache, even though there is one
    radix tree for each swap device. If the type of the swap device is not
    0, the height of the radix tree of the swap cache is increased
    unnecessarily, especially on 64-bit architectures. For example, for a 1GB
    swap device on the x86_64 architecture, the height of the radix tree of
    the swap cache is 11. But if the offset of the swap entry is used as
    the key of the swap cache, the height of the radix tree of the swap
    cache is 4. The increased height causes unnecessary radix tree
    descending and increased cache footprint.

    This patch reduces the height of the radix tree of the swap cache by
    using the offset of the swap entry instead of the whole swap entry value
    as the key of the swap cache. In a 32-process sequential swap-out test
    case on a Xeon E5 v3 system with a RAM disk as swap, the lock contention
    on the swap cache spinlock is reduced from 20.15% to 12.19% when the type
    of the swap device is 1.

    Use the whole swap entry as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

    Use the swap offset as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,

    Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
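
    A hedged sketch of the resulting lookup, with an illustrative wrapper
    name: the address_space is still chosen per swap device, but the tree is
    keyed by the swap offset only.

    #include <linux/pagemap.h>
    #include <linux/swap.h>

    static struct page *example_swap_cache_lookup(swp_entry_t entry)
    {
            struct address_space *as = swap_address_space(entry);

            /* Key by the offset alone, keeping the radix tree shallow. */
            return find_get_page(as, swp_offset(entry));
    }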
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to implement
    the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether a PAGE_CACHE_*
    or PAGE_* constant should be used in a particular case, especially on the
    border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using the
    script below. For some reason, coccinelle doesn't patch header files;
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jan, 2016

1 commit

  • After THP refcounting rework we have only two possible return values
    from pmd_trans_huge_lock(): success and failure. Return-by-pointer for
    ptl doesn't make much sense in this case.

    Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
    failure.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: Linus Torvalds
    Cc: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
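
    A hedged sketch of the new calling convention; the caller is
    illustrative, the return-the-ptl-or-NULL contract is the one described
    above.

    #include <linux/huge_mm.h>
    #include <linux/mm.h>
    #include <linux/spinlock.h>

    static bool example_is_stable_huge_pmd(pmd_t *pmd, struct vm_area_struct *vma)
    {
            spinlock_t *ptl = pmd_trans_huge_lock(pmd, vma);

            if (!ptl)
                    return false;   /* not a (stable) huge pmd */

            /* The pmd is huge and cannot be split while ptl is held. */
            spin_unlock(ptl);
            return true;
    }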
     

16 Jan, 2016

1 commit

  • With the new refcounting we don't need to mark PMDs as splitting. Let's
    drop the code that handles this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit


12 Feb, 2015

1 commit

  • This patch makes do_mincore() use walk_page_vma(), which reduces many
    lines of code by using common page table walk code.

    [daeseok.youn@gmail.com: remove unneeded variable 'err']
    Signed-off-by: Naoya Horiguchi
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Daeseok Youn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

11 Feb, 2015

1 commit


14 Dec, 2014

1 commit

  • When the encountered pte is a swap entry, the current code handles two
    cases: migration entries and normal swap entries, but there is a third
    case: hwpoison pages.

    This patch adds handling for hwpoison pages, treating a hwpoison page as
    in core, the same as a migration entry.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Weijie Yang
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
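
    A hedged sketch of the swap-entry cases as described above, written from
    the description rather than copied from the patch:

    #include <linux/swap.h>
    #include <linux/swapops.h>

    /* Return 1 if a non-present pte's swap entry should count as "in core". */
    static unsigned char example_swap_entry_incore(swp_entry_t entry)
    {
            /* Pages under migration are resident, just on the move. */
            if (is_migration_entry(entry))
                    return 1;
            /* The new case: treat hwpoison pages the same as migration. */
            if (is_hwpoison_entry(entry))
                    return 1;
            /* Otherwise it is a real swap entry; the caller must consult
             * the swap cache to decide. */
            return 0;
    }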
     

04 Apr, 2014

1 commit

  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

24 Jan, 2014

1 commit

  • Two cleanups:
    1. Remove redundant code for hugetlb pages.
    2. end = pmd_addr_end(addr, end) restricts [addr, end) to within PMD_SIZE,
       which may increase the number of do_mincore() calls; remove it.

    Signed-off-by: Jianguo Wu
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: qiuxishi
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

24 Feb, 2013

1 commit

  • When several fast SSDs are used for swap, swapper_space.tree_lock is
    heavily contended. This patch gives each swap partition its own
    address_space to reduce the lock contention. There is an array of
    address_spaces for swap, and the swap entry type is the index into the
    array.

    In my test with 3 SSDs, this increases the swapout throughput by 20%.

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

22 Mar, 2012

1 commit

  • In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem held in read mode. In those cases huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad(), which does not expect to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem; khugepaged holds the mmap_sem in
    write mode (and all those sites must hold the mmap_sem in read mode to
    prevent pagetables from going away from under them; during code review it
    seems vm86 mode on 32-bit kernels requires that too, unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd become a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
    -> 145 pmd_ERROR(*pmd);
    146 pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381 if (mapcount != page_mapcount(page))
    1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383 mapcount, page_mapcount(page));
    -> 1384 BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

    [Diagram: the process virtual address space containing a single 2 MB huge
    page; the madvise() range "A(range)" and the page-fault address "B(fault)"
    both fall within that same huge page.]

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
          .---------> // sneaks in here as shown below.
          |           //
          |           if (pmd_none_or_clear_bad(pmd))
          |           {
          |               if (unlikely(pmd_bad(*pmd)))
          |                   pmd_clear_bad
          |                   {
          |                       pmd_ERROR
          |                           // Log "bad pmd ..." message here.
          |                       pmd_clear
          |                           // Clear the page's PMD entry.
          |                           // Thread B incremented the map count
          |                           // in page_add_new_anon_rmap(), but
          |                           // now the page is no longer mapped
          |                           // by a PMD entry (-> inconsistency).
          |                   }
          |           }
          |
          v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: [2.6.38+]
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
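
    A hedged sketch of the helper the fix introduces, reconstructed from the
    description above rather than quoted from the patch:

    #include <linux/compiler.h>
    #include <linux/mm.h>

    static inline int example_pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
    {
            /* Read the pmd once onto the local stack; only a compiler
             * barrier is needed to keep the value from being re-read. */
            pmd_t pmdval = *pmd;

    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
            barrier();
    #endif
            if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                    return 1;
            if (unlikely(pmd_bad(pmdval))) {
                    pmd_clear_bad(pmd);
                    return 1;
            }
            return 0;
    }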
     

04 Aug, 2011

2 commits

  • Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

    It's hard to devise a suitable snappy name that illuminates the use by
    shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
    generality. And akpm points out that /* radix_tree_deref_retry(page) */
    comments look like calls that have been commented out for unknown
    reason.

    Skirt the naming difficulty by rearranging these blocks to handle the
    transient radix_tree_deref_retry(page) case first; then just explain the
    remaining shmem/tmpfs swap case in a comment.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove PageSwapBacked (!page_is_file_cache) cases from
    add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
    go through shmem_add_to_page_cache().

    Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
    and add a comment on swap entries to invalidate_mapping_pages().

    And mincore_page() uses find_get_page() on what might be shmem or a
    tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
    find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Jan, 2011

2 commits

  • Handle transparent huge page pmd entries natively instead of splitting
    them into subpages.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundreds of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2010

4 commits

  • Do page table walks with the well-known nested loops we use in several
    other places already.

    This avoids doing full page table walks after every pte range and also
    allows handling unmapped areas bigger than one pte range in one go.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
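
    A hedged sketch of the "well-known nested loops" pattern referred to
    here; only the outermost level is shown and the function is an editor's
    example.

    #include <linux/mm.h>

    static void example_walk(struct mm_struct *mm, unsigned long addr,
                             unsigned long end)
    {
            pgd_t *pgd = pgd_offset(mm, addr);
            unsigned long next;

            do {
                    next = pgd_addr_end(addr, end);
                    if (pgd_none_or_clear_bad(pgd))
                            continue;       /* whole pgd range is a hole */
                    /* descend into pud, pmd and pte ranges the same way */
            } while (pgd++, addr = next, addr != end);
    }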
     
  • Instead of passing a start address and a number of pages into the helper
    functions, convert them to use a start and an end address.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Split out functions to handle hugetlb ranges, pte ranges and unmapped
    ranges, to improve readability but also to prepare the file structure for
    nested page table walks.

    No semantic changes intended.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This fixes some minor issues that bugged me while going over the code:

    o adjust argument order of do_mincore() to match the syscall
    o simplify range length calculation
    o drop superfluous shift in huge tlb calculation, address is page aligned
    o drop dead nr_huge calculation
    o check pte_none() before pte_present()
    o comment and whitespace fixes

    No semantic changes intended.

    Signed-off-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

30 Mar, 2010

1 commit

  • Update gfp.h and slab.h includes to prepare for breaking implicit slab.h
    inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

16 Dec, 2009

1 commit

  • Most callers of pmd_none_or_clear_bad() check whether the target page is
    in a hugepage or not, but mincore() and walk_page_range() do not check it.
    So if we use mincore() on a hugepage on an x86 machine, the hugepage
    memory is leaked as shown below. This patch fixes it by extending the
    mincore() system call to support hugepages.

    Details
    =======
    My test program (leak_mincore) works as follows:
    - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages),
    - read()/write() something on it,
    - call mincore() for first ten pages and printf() the values of *vec
    - munmap() and unlink() the file on hugetlbfs

    Without my patch
    ----------------
    $ cat /proc/meminfo| grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 1000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ./leak_mincore
    vec[0] 0
    vec[1] 0
    vec[2] 0
    vec[3] 0
    vec[4] 0
    vec[5] 0
    vec[6] 0
    vec[7] 0
    vec[8] 0
    vec[9] 0
    $ cat /proc/meminfo |grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 999
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ls /hugetlbfs/
    $

    Return values in *vec from mincore() are set to 0, while the hugepage
    should be in memory, and 1 hugepage is still accounted as used while
    there is no file on hugetlbfs.

    With my patch
    -------------
    $ cat /proc/meminfo| grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 1000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ./leak_mincore
    vec[0] 1
    vec[1] 1
    vec[2] 1
    vec[3] 1
    vec[4] 1
    vec[5] 1
    vec[6] 1
    vec[7] 1
    vec[8] 1
    vec[9] 1
    $ cat /proc/meminfo |grep "HugePage"
    HugePages_Total: 1000
    HugePages_Free: 1000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    $ ls /hugetlbfs/
    $

    Return value in *vec set to 1 and no memory leaks.

    [akpm@linux-foundation.org: cleanup]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wu Fengguang
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Andy Whitcroft
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi