07 Jun, 2020

1 commit

  • commit 5bfea2d9b17f1034a68147a8b03b9789af5700f9 upstream.

    The original code in mm/mremap.c checks huge pmd by:

    if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {

    However, a DAX mapped nvdimm is mapped as a huge page (by default) but it
    is not a transparent huge page (_PAGE_PSE | PAGE_DEVMAP). This commit
    changes the condition to include that case.
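
    As a rough illustration of the changed condition, the sketch below models
    the kinds of PMD entry with a hypothetical enum. The names pmd_kind,
    old_check and new_check are illustrative only and are not kernel symbols;
    the sketch assumes the fix simply adds the device-mapped (DAX) case to the
    existing test.

```c
#include <stdbool.h>

/* Hypothetical, simplified classification of a PMD entry. The real kernel
 * derives these properties from architecture-specific page table bits. */
enum pmd_kind {
    PMD_KIND_PTE_TABLE, /* points to a page of PTEs */
    PMD_KIND_SWAP,      /* non-present swap or migration entry */
    PMD_KIND_THP,       /* transparent huge page */
    PMD_KIND_DEVMAP     /* huge DAX (device) mapping, not a THP */
};

/* Models the original test: only swap and THP entries count as huge. */
static bool old_check(enum pmd_kind k)
{
    return k == PMD_KIND_SWAP || k == PMD_KIND_THP;
}

/* Models the assumed fix: device-mapped huge entries are included too. */
static bool new_check(enum pmd_kind k)
{
    return old_check(k) || k == PMD_KIND_DEVMAP;
}
```

    With this model, old_check(PMD_KIND_DEVMAP) is false while
    new_check(PMD_KIND_DEVMAP) is true, which is the case the commit message
    describes as previously missed.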

    This addresses CVE-2020-10757.

    Fixes: 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd")
    Cc:
    Reported-by: Fan Yang
    Signed-off-by: Fan Yang
    Tested-by: Fan Yang
    Tested-by: Dan Williams
    Reviewed-by: Dan Williams
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Fan Yang
     

10 May, 2020

1 commit

  • commit b2a84de2a2deb76a6a51609845341f508c518c03 upstream.

    Commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
    brk()/mmap()/mremap()") changed mremap() so that only the 'old' address
    is untagged, leaving the 'new' address in the form it was passed from
    userspace. This prevents the unexpected creation of aliasing virtual
    mappings in userspace, but looks a bit odd when you read the code.

    Add a comment justifying the untagging behaviour in mremap().

    Reported-by: Linus Torvalds
    Acked-by: Linus Torvalds
    Reviewed-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

29 Feb, 2020

1 commit

  • commit dcde237319e626d1ec3c9d8b7613032f0fd4663a upstream.

    Currently the arm64 kernel ignores the top address byte passed to brk(),
    mmap() and mremap(). When the user is not aware of the 56-bit address
    limit or relies on the kernel to return an error, untagging such
    pointers has the potential to create address aliases in user-space.
    Passing a tagged address to munmap(), madvise() is permitted since the
    tagged pointer is expected to be inside an existing mapping.

    The current behaviour breaks the existing glibc malloc() implementation
    which relies on brk() with an address beyond 56-bit to be rejected by
    the kernel.

    Remove untagging in the above functions by partially reverting commit
    ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In
    addition, update the arm64 tagged-address-abi.rst document accordingly.
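
    The aliasing concern can be sketched in a few lines. This is a simplified
    model, assuming the tag occupies bits 63:56 and that user addresses have
    bit 55 clear; untag_user_addr and would_alias are hypothetical helpers,
    not the kernel's own macros.

```c
#include <stdint.h>

/* Assumption: the tag is the top byte and user addresses fit in 56 bits,
 * so clearing bits 63:56 yields the untagged address. */
static uint64_t untag_user_addr(uint64_t addr)
{
    return addr & ((UINT64_C(1) << 56) - 1);
}

/* Two distinct pointer values alias if they untag to the same address. */
static int would_alias(uint64_t a, uint64_t b)
{
    return a != b && untag_user_addr(a) == untag_user_addr(b);
}
```

    If brk() or mmap() untagged their arguments, a request carrying tag 0x01
    and another carrying tag 0x02 would land on the same virtual address,
    where the caller may have expected an error instead.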

    Link: https://bugzilla.redhat.com/1797052
    Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk")
    Cc: # 5.4.x-
    Cc: Florian Weimer
    Reviewed-by: Andrew Morton
    Reported-by: Victor Stinner
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     

26 Sep, 2019

2 commits

  • There isn't a good reason to differentiate between the user address space
    layout modification syscalls and the other memory permission/attributes
    ones (e.g. mprotect, madvise) w.r.t. the tagged address ABI. Untag the
    user addresses on entry to these functions.

    Link: http://lkml.kernel.org/r/20190821164730.47450-2-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas
    Acked-by: Will Deacon
    Acked-by: Andrey Konovalov
    Cc: Vincenzo Frascino
    Cc: Szabolcs Nagy
    Cc: Kevin Brodsky
    Cc: Dave P Martin
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • This patch is part of a series that extends the kernel ABI to allow
    tagged user pointers (with the top byte set to something other than
    0x00) to be passed as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

15 May, 2019

1 commit

  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and take
    specific actions for them. However, the current API only provides the
    range of virtual addresses affected by the change, not why the change is
    happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that everything in the range is going away when that event
    happens. A later patch converts the mm call paths to use a more
    appropriate event for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

06 Mar, 2019

1 commit

  • When the mremap() syscall is used with the MREMAP_FIXED flag, mremap()
    calls mremap_to() which does the following:

    1) unmaps the destination region where we are going to move the map
    2) If the new region is going to be smaller, we unmap the last part
    of the old region

    Then, we will eventually call move_vma() to do the actual move.

    move_vma() checks whether we are at least 4 maps below max_map_count
    before going further, otherwise it bails out with -ENOMEM. The problem
    is that we might have already unmapped the vma's in steps 1) and 2), so
    it is not possible for userspace to figure out the state of the vmas
    after it gets -ENOMEM, and it gets tricky for userspace to clean up
    properly on error path.

    While it is true that we can return -ENOMEM for more reasons (e.g: see
    may_expand_vm() or move_page_tables()), I think that we can avoid this
    scenario if we check early in mremap_to() if the operation has high
    chances to succeed map-wise.

    Should that not be the case, we can bail out before we even try to unmap
    anything, so we make sure the vma's are left untouched in case we are
    likely to be short of maps.

    The rule of thumb now is to rely on the worst-case scenario we can have.
    That is when both vma's (old region and new region) are going to be
    split in 3, so we get two more maps on top of the ones we already hold
    (one per each). If current map count + 2 maps still leaves us 4 maps
    below the threshold, we are going to pass the check in move_vma().
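
    The rule described above can be sketched as a small predicate. This is a
    userspace model under stated assumptions: mremap_to_precheck is a
    hypothetical name, and map_count and max_map_count stand in for the
    per-process map count and the max_map_count limit.

```c
#include <errno.h>

/* Hypothetical stand-in for the early check described for mremap_to().
 * Worst case: the old and new vmas are each split in three, which adds
 * two maps in total. The later check in move_vma() wants the count to be
 * at least 4 below the limit, i.e. it fails when count >= max - 3. */
static int mremap_to_precheck(int map_count, int max_map_count)
{
    if (map_count + 2 >= max_map_count - 3)
        return -ENOMEM;
    return 0;
}
```

    For a limit of 65530, a process holding 65524 maps passes the check,
    while one holding 65525 is rejected before anything is unmapped.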

    Of course, this is not free, as it might generate false positives when
    it is true that we are tight map-wise, but the unmap operation can
    release several vma's leading us to a good state.

    Another approach was also investigated [1], but it may be too much
    hassle for what it brings.

    [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/

    Link: http://lkml.kernel.org/r/20190226091314.18446-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Joel Fernandes (Google)
    Cc: Yang Shi
    Cc: Mel Gorman
    Cc: Joel Fernandes
    Cc: Cyril Hrubis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

05 Jan, 2019

2 commits

  • Android needs to mremap large regions of memory during memory management
    related operations. The mremap system call can be really slow if THP is
    not enabled. The bottleneck is move_page_tables, which copies one pte
    at a time, and can be really slow across a large map. Turning on
    THP may not be a viable option, and is not for us. This patch speeds up
    mremap on non-THP systems by copying at the PMD level when possible.

    The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap,
    the mremap completion time drops from 3.4-3.6 milliseconds to 144-160
    microseconds.

    Before:
    Total mremap time for 1GB data: 3521942 nanoseconds.
    Total mremap time for 1GB data: 3449229 nanoseconds.
    Total mremap time for 1GB data: 3488230 nanoseconds.

    After:
    Total mremap time for 1GB data: 150279 nanoseconds.
    Total mremap time for 1GB data: 144665 nanoseconds.
    Total mremap time for 1GB data: 158708 nanoseconds.

    If THP is enabled the optimization is mostly skipped except in certain
    situations.
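
    To give a feel for why moving at the PMD level helps, the sketch below
    counts how many page table entries have to be moved for a region at a
    given granule. It assumes 4 KiB base pages and a 2 MiB PMD span (typical
    for x86-64); the macro and function names are illustrative only.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE (UINT64_C(4) << 10) /* assumed 4 KiB base page */
#define SKETCH_PMD_SIZE  (UINT64_C(2) << 20) /* assumed 2 MiB PMD span  */

/* Number of entries that must be moved to cover len bytes when each
 * entry maps granule bytes (rounded up). */
static uint64_t entries_to_move(uint64_t len, uint64_t granule)
{
    return (len + granule - 1) / granule;
}
```

    For a 1GB region this gives 262144 entries at the pte level against 512
    at the PMD level. The measured speedup is smaller than that ratio, since
    the syscall has other fixed costs.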

    [joel@joelfernandes.org: fix 'move_normal_pmd' unused function warning]
    Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com
    Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Acked-by: Kirill A. Shutemov
    Reviewed-by: William Kucharski
    Cc: Julia Lawall
    Cc: Michal Hocko
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     
  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtle architecture related in the future that may make the scheme not
    work. Also we find that there is no point in passing the 'address' to
    pte_alloc since it's unused. This patch therefore removes this argument
    tree-wide resulting in a nice negative diff as well. Also ensuring
    along the way that the enabled architectures do not do anything funky
    with the 'address' argument that goes unnoticed by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    Following fix ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
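
    The idea of grouping the parameters can be sketched as follows. This is
    a simplified model and not the kernel's definition: the _sketch suffix
    marks every name here as hypothetical, and the field set is an
    assumption based on the mm/start/end arguments the calls take.

```c
struct mm_struct; /* opaque in this sketch */

/* Groups what used to be separate arguments, so that adding a field
 * later does not require touching every call site. */
struct notifier_range_sketch {
    struct mm_struct *mm;
    unsigned long start;
    unsigned long end;
};

static void range_init_sketch(struct notifier_range_sketch *range,
                              struct mm_struct *mm,
                              unsigned long start, unsigned long end)
{
    range->mm = mm;
    range->start = start;
    range->end = end;
}

/* Callbacks receive one pointer instead of a growing argument list. */
static unsigned long range_len_sketch(const struct notifier_range_sketch *range)
{
    return range->end - range->start;
}
```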

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

27 Oct, 2018

1 commit

  • Besides munmap, mremap might be used to shrink a memory mapping too.
    So, it may hold the write mmap_sem for a long time when shrinking a large
    mapping, as described in commit ("mm: mmap: zap pages with read mmap_sem
    in munmap").

    mremap() will not manipulate vmas anymore after the __do_munmap() call in
    the mapping shrink use case, so it is safe to downgrade to read mmap_sem.

    So, the same optimization, which downgrades mmap_sem to read for zapping
    pages, is also feasible and reasonable for this case.

    The period of holding the exclusive mmap_sem when shrinking a large
    mapping would be reduced significantly with this optimization.

    MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adapt to this
    optimization since they need to manipulate vmas after do_munmap(), and
    downgrading mmap_sem may create a race window.

    Simple mapping shrink is the low hanging fruit, and together with munmap
    it may cover most cases of unmap.

    [akpm@linux-foundation.org: tweak comment]
    [yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
    Link: http://lkml.kernel.org/r/1538687672-17795-2-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1538067582-60038-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Laurent Dufour
    Cc: Colin Ian King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

18 Oct, 2018

1 commit

  • Jann Horn points out that our TLB flushing was subtly wrong for the
    mremap() case. What makes mremap() special is that we don't follow the
    usual "add page to list of pages to be freed, then flush tlb, and then
    free pages". No, mremap() obviously just _moves_ the page from one page
    table location to another.

    That matters, because mremap() thus doesn't directly control the
    lifetime of the moved page with a freelist: instead, the lifetime of the
    page is controlled by the page table locking, that serializes access to
    the entry.

    As a result, we need to flush the TLB not just before releasing the lock
    for the source location (to avoid any concurrent accesses to the entry),
    but also before we release the destination page table lock (to avoid the
    TLB being flushed after somebody else has already done something to that
    page).

    This also makes the whole "need_flush" logic unnecessary, since we now
    always end up flushing the TLB for every valid entry.

    Reported-and-tested-by: Jann Horn
    Acked-by: Will Deacon
    Tested-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • Commit 5d1904204c99 ("mremap: fix race between mremap() and page
    cleanning") fixed races between mremap and other operations for both
    file-backed and anonymous mappings. The file-backed was the most
    critical as it allowed the possibility that data could be changed on a
    physical page after page_mkclean returned which could trigger data loss
    or data integrity issues.

    A customer reported that the cost of the TLB flushes for anonymous
    mappings was excessive, resulting in a 30-50% drop in overall performance
    on a microbenchmark since this commit. Unfortunately I neither have
    access to the test-case nor can I describe what it does other than
    saying that mremap operations dominate heavily.

    This patch removes the LATENCY_LIMIT to handle TLB flushes on a PMD
    boundary instead of every 64 pages to reduce the number of TLB
    shootdowns by a factor of 8 in the ideal case. LATENCY_LIMIT was almost
    certainly used originally to limit the PTL hold times but the latency
    savings are likely offset by the cost of IPIs in many cases. This patch
    is not reported to completely restore performance but gets it within an
    acceptable percentage. The given metric here is simply described as
    "higher is better".

    Baseline that was known good
    002: Metric: 91.05
    004: Metric: 109.45
    008: Metric: 73.08
    016: Metric: 58.14
    032: Metric: 61.09
    064: Metric: 57.76
    128: Metric: 55.43

    Current
    001: Metric: 54.98
    002: Metric: 56.56
    004: Metric: 41.22
    008: Metric: 35.96
    016: Metric: 36.45
    032: Metric: 35.71
    064: Metric: 35.73
    128: Metric: 34.96

    With patch
    001: Metric: 61.43
    002: Metric: 81.64
    004: Metric: 67.92
    008: Metric: 51.67
    016: Metric: 50.47
    032: Metric: 52.29
    064: Metric: 50.01
    128: Metric: 49.04

    So for low thread counts, it's not restored, but for larger numbers of
    threads, it's closer to the "known good" baseline.
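
    The "factor of 8" quoted for the ideal case can be checked with simple
    arithmetic. The sketch assumes a PMD spans 512 base pages (x86-64 with
    4 KiB pages) and that the removed limit forced a flush every 64 pages;
    flushes_for_range is a hypothetical helper.

```c
/* Number of TLB flushes needed to move `pages` pages when a flush is
 * issued after every `batch_pages` pages (rounded up). */
static unsigned long flushes_for_range(unsigned long pages,
                                       unsigned long batch_pages)
{
    return (pages + batch_pages - 1) / batch_pages;
}
```

    Moving one full PMD worth of pages takes flushes_for_range(512, 64) = 8
    flushes with the old limit and flushes_for_range(512, 512) = 1 when
    flushing on the PMD boundary.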

    Using a different mremap-intensive workload that is not representative
    of the real workload there is little difference observed outside of
    noise in the headline metrics. However, the TLB shootdowns are reduced by
    11% on average and at the peak, TLB shootdowns were reduced by 21%.
    Interrupts were sampled every second while the workload ran to get those
    figures. It's known that the figures will vary as the
    non-representative load is non-deterministic.

    An alternative patch was posted that should have significantly reduced
    the TLB flushes but unfortunately it does not perform as well as this
    version on the customer test case. If revisited, the two patches can
    stack on top of each other.

    Link: http://lkml.kernel.org/r/20180606183803.k7qaw2xnbvzshv34@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Andrew Morton
    Cc: Nadav Amit
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

1 commit

  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Sep, 2017

1 commit

  • mremap will attempt to create a 'duplicate' mapping if old_size == 0 is
    specified. In the case of private mappings, mremap will actually create
    a fresh separate private mapping unrelated to the original. This does
    not fit with the design semantics of mremap as the intention is to
    create a new mapping based on the original.

    Therefore, return EINVAL in the case where an attempt is made to
    duplicate a private mapping. Also, print a warning message (once) if
    such an attempt is made.
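
    A small userspace probe can show the behaviour described above. This is
    a sketch, not a test from the original commit:
    try_duplicate_private_mapping is a hypothetical helper, and the result
    depends on the running kernel (older kernels are expected to succeed,
    kernels with this change are expected to fail with EINVAL).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* mremap() is a Linux-specific extension in glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks in case _GNU_SOURCE was defined too late for <sys/mman.h>. */
#ifndef MREMAP_MAYMOVE
#define MREMAP_MAYMOVE 1
#endif
void *mremap(void *old_address, size_t old_size, size_t new_size,
             int flags, ...);

/* Tries to "duplicate" a private anonymous mapping with old_size == 0.
 * Returns 0 if mremap() succeeded, the errno value if it failed, or -1
 * if the initial mmap() itself failed. */
static int try_duplicate_private_mapping(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED)
        return -1;

    void *dup = mremap(old, 0, len, MREMAP_MAYMOVE);
    int err = (dup == MAP_FAILED) ? errno : 0;

    if (dup != MAP_FAILED)
        munmap(dup, len);
    munmap(old, len);
    return err;
}
```

    A caller can compare the return value against EINVAL to tell whether the
    running kernel rejects the duplication attempt.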

    Link: http://lkml.kernel.org/r/cb9d9f6a-7095-582f-15a5-62643d65c736@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Aaron Lu
    Cc: "Kirill A . Shutemov"
    Cc: Vlastimil Babka
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

03 Aug, 2017

2 commits

  • When mremap is called with MREMAP_FIXED it unmaps memory at the
    destination address without notifying userfaultfd monitor.

    If the destination were registered with userfaultfd, the monitor has no
    way to distinguish between the old and new ranges and to properly relate
    the page faults that would occur in the destination region.

    Fixes: 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps")
    Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                            CPU1
    ----                            ----
                                    user accesses memory using RW PTE
                                    [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                    mprotect(addr, PROT_READ)
                                    ==> change_pte_range()
                                    ==> [ PTE non-present - no flush ]

                                    user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible, so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup, which can race with
    reclaim, look just at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

10 Mar, 2017

1 commit


25 Feb, 2017

1 commit

  • When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped.
    Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
    changes in the virtual memory layout.

    Since there might be different uffd contexts for the affected VMAs, we
    first should create a temporary representation for the unmap event for
    each uffd context and then notify them one by one to the appropriate
    userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Feb, 2017

2 commits

  • Optimize the mremap_userfaultfd_complete() interface to pass only the
    vm_userfaultfd_ctx pointer through the stack as a microoptimization.

    Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Hillf Danton
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The event denotes that an area [start:end] moves to a different location.
    A length change isn't reported as "new" addresses; if they appear on the
    uffd reader side, they will not contain any data and the reader can just
    zeromap them.

    Waiting for the event ACK is also done outside of mmap_sem, as for the
    fork event.

    Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

30 Nov, 2016

1 commit

  • Linus found there still is a race in mremap after commit 5d1904204c99
    ("mremap: fix race between mremap() and page cleanning").

    As described by Linus:
    "the issue is that another thread might make the pte be dirty (in the
    hardware walker, so no locking of ours will make any difference)
    *after* we checked whether it was dirty, but *before* we removed it
    from the page tables"

    Fix it by moving the dirty check to after the pte has been removed from
    the page table.
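
    A minimal userspace sketch of the reordered check, assuming C11 atomics
    as a stand-in for the kernel's pte accessors. The names
    sk_take_pte_and_test_dirty and SK_PTE_DIRTY, and the bit position, are
    hypothetical and chosen for illustration only; this is not the patch
    itself:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative dirty bit; the real position is architecture specific. */
#define SK_PTE_DIRTY ((uint64_t)1 << 6)

/*
 * Remove the entry first, then inspect the value that was removed.
 * Any dirty bit set by the hardware walker before the exchange is
 * captured in 'old', so there is no window between "check" and "clear".
 */
static bool sk_take_pte_and_test_dirty(_Atomic uint64_t *pte)
{
    assert(pte != NULL);
    uint64_t old = atomic_exchange(pte, 0);
    return (old & SK_PTE_DIRTY) != 0;
}
```

    Testing the bit on the live entry and clearing it afterwards would leave
    exactly the window Linus describes above.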

    Suggested-by: Linus Torvalds
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window with
    a msleep(), but have not been able to hit it without modifying the code.
    So, we think it's a real issue, but is difficult or impossible to hit in
    practice.

    The zap_pte_range() issue is fixed by commit 1cf35d47712d ("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). This
    patch fixes the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED, and it has two threads bound to two
    different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread 1
    wrote to addr X so that CPU1 now has a writable TLB entry for addr X.
    Thread 2 starts mremapping from addr X to Y while thread 1 cleans the
    page and then writes to the old addr X again. The second write from
    thread 1 could succeed but the value will get lost.

        thread 1                            thread 2
    (bound to CPU1)                     (bound to CPU2)

    1: write 1 to addr X to get a
       writeable TLB on this CPU

                                            2: mremap starts

                                            3: move_ptes emptied PTE for addr X
                                               and setup new PTE for addr Y and
                                               then dropped PTL for X and Y

    4: page laundering for N by doing
       fadvise FADV_DONTNEED. When done,
       pageframe N is deemed clean.

    5: *write 2 to addr X

                                            6: tlb flush for addr X

    7: munmap (Y, pagesize) to make the
       page unmapped

    8: fadvise with FADV_DONTNEED again
       to kick the page off the pagecache

    9: pread the page from file to verify
       the value. If 1 is there, it means
       we have lost the written 2.

    * The write may or may not cause a segmentation fault, depending on
    whether the TLB entry is still on the CPU.

    Please note that this is only one specific way the race could occur. It
    does not mean that the race can only occur in exactly the above
    configuration, e.g. more than 2 threads could be involved and fadvise()
    could be done in another thread, etc.

    Anonymous pages could race between mremap() and page reclaim:
    THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
    PMD gets unmapped/split/paged out before the TLB flush happens for the
    old huge PMD in move_page_tables(), so we could still write data to it.
    Normal anonymous pages have a similar situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
    if one is found, do the flush before dropping the PTL. If the flush was
    done for every move_ptes()/move_huge_pmd() call then we do not need to
    flush the whole range in move_page_tables(). But if it was not, we
    still need to do the whole range flush.

    Alternatively, we could track which parts of the range were flushed in
    move_ptes()/move_huge_pmd() and which were not, to avoid flushing the
    whole range in move_page_tables(). But that would require multiple TLB
    flushes for the different sub-ranges and should be less efficient than
    the single whole range flush.

    KBuild test on my Sandybridge desktop doesn't show any noticeable change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

27 Jul, 2016

1 commit

  • split_huge_pmd() doesn't guarantee that the pmd is a normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers make this assertion, some do it differently and some not at
    all, so let's do it in a unified manner.

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 May, 2016

1 commit

  • This is a follow up work for oom_reaper [1]. As the async OOM killing
    depends on mmap_sem for read we would really appreciate if a holder for
    write didn't stand in the way. This patchset is changing many of
    down_write calls to be killable to help those cases when the writer is
    blocked and waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to the userspace as
    the task has fatal signal pending). Others seem to be easy as well as
    the callers are already handling fatal errors and bail and return to
    userspace which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to exit
    without blocking the mmap_sem, which might be required to make forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.
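
    The calling pattern can be sketched with a small userspace model. All
    sk_ names below are hypothetical stand-ins, not kernel definitions, and
    the model cannot sleep, so it only covers the two cases that matter
    here: the lock is free, or the waiter is killed while the lock is held:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the kernel's rw_semaphore and task state (hypothetical). */
struct sk_rwsem { bool held; };
struct sk_task  { bool fatal_signal_pending; };

/* Models down_write_killable(): 0 with the lock held, or -EINTR. */
static int sk_down_write_killable(struct sk_rwsem *sem,
                                  const struct sk_task *task)
{
    if (!sem->held) {
        sem->held = true;
        return 0;
    }
    /* The real primitive would sleep; the model only handles a kill. */
    assert(task->fatal_signal_pending);
    return -EINTR;
}

/* A "trivial" syscall: takes the lock early, before changing any state. */
static long sk_syscall(struct sk_rwsem *mmap_sem, const struct sk_task *task)
{
    if (sk_down_write_killable(mmap_sem, task))
        return -EINTR;      /* nothing to undo, so just bail out */

    /* ... modify the address space here ... */

    mmap_sem->held = false; /* models up_write() */
    return 0;
}
```

    Because no state has been changed before the lock attempt, the error
    path needs no cleanup, which is what makes these call sites easy to
    convert.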

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

2 commits

  • Whatever huge pagecache implementation we go with, file rmap locking
    must be added to anon rmap locking, when mremap's move_page_tables()
    finds a pmd_trans_huge pmd entry: a simple change, let's do it now.

    Factor out take_rmap_locks() and drop_rmap_locks() to handle the locking
    for both move_ptes() and move_page_tables(), and delete the
    VM_BUG_ON_VMA which rejected vm_file and required anon_vma.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
    a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
    the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
    it was in the old vma, alignment and size permitting.

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

18 Mar, 2016

2 commits

  • There are a few things about *pte_alloc*() helpers worth cleaning up:

    - 'vma' argument is unused, let's drop it;

    - most __pte_alloc() callers do a speculative check for pmd_none(),
    before taking ptl: let's introduce pte_alloc() macro which does
    the check.

    The only direct user of __pte_alloc left is userfaultfd, which has
    a different expectation about atomicity wrt pmd.

    - pte_alloc_map() and pte_alloc_map_lock() are redefined using
    pte_alloc().

    [sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
    [sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Signed-off-by: Sudeep Holla
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The max_map_count sysctl is unrelated to the scheduler. Move its bits
    from include/linux/sched/sysctl.h to include/linux/mm.h.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

12 Feb, 2016

1 commit

  • DAX implements split_huge_pmd() by clearing pmd. This simple approach
    reduces memory overhead, as we don't need to deposit a page table on
    huge page mapping to make split_huge_pmd() never-fail. The PTE table can
    be allocated and populated later on page fault from backing store.

    But one side effect is that we have to check if pmd is pmd_none() after
    split_huge_pmd(). In most places we do this already to deal with
    parallel MADV_DONTNEED.

    But I found two call sites which are not affected by MADV_DONTNEED (due
    to down_write(mmap_sem)), but need to have the check to work with DAX
    properly.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

2 commits

  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
    to reflect the fact that they don't imply page splitting, only PMD
    splitting.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out if we're allowed to
    assign new @start_brk, @brk, @start_data, @end_data from mm_struct) it
    was noticed that RLIMIT_DATA, in the form it is implemented now, doesn't
    do anything useful because most user-space libraries use the mmap()
    syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something
    suitable for anonymous memory accounting. But in this patch we go
    further, and
    the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be a strange beast like a readonly-private or
    VM_IO area.
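
    A rough sketch of that classification as a standalone function. The
    SK_VM_* values and the sk_ names are illustrative placeholders, not the
    kernel's definitions, and the order of the tests (code, then stack, then
    data) is an assumption made for this example:

```c
#include <assert.h>

/* Illustrative flag values for this sketch only. */
enum {
    SK_VM_WRITE     = 0x02,
    SK_VM_EXEC      = 0x04,
    SK_VM_SHARED    = 0x08,
    SK_VM_GROWSDOWN = 0x10,
    SK_VM_GROWSUP   = 0x20,
    SK_VM_ALL       = 0x3e,
};

enum sk_vm_class { SK_CODE, SK_STACK, SK_DATA, SK_OTHER };

static enum sk_vm_class sk_classify(unsigned long flags)
{
    const unsigned long stack = SK_VM_GROWSUP | SK_VM_GROWSDOWN;

    assert((flags & ~(unsigned long)SK_VM_ALL) == 0);

    /* VM_EXEC & ~VM_WRITE -> code */
    if ((flags & (SK_VM_EXEC | SK_VM_WRITE)) == SK_VM_EXEC)
        return SK_CODE;
    /* VM_GROWSUP | VM_GROWSDOWN -> stack */
    if (flags & stack)
        return SK_STACK;
    /* VM_WRITE & ~VM_SHARED & !stack -> data */
    if ((flags & (SK_VM_WRITE | SK_VM_SHARED)) == SK_VM_WRITE)
        return SK_DATA;
    /* Everything else: the "shared" remainder described above. */
    return SK_OTHER;
}
```

    Note that the result depends on the flags alone; no struct file is
    consulted.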

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Jan, 2016

1 commit

  • mremap() with MREMAP_FIXED on a VM_PFNMAP range causes the following
    WARN_ON_ONCE() message in untrack_pfn().

    WARNING: CPU: 1 PID: 3493 at arch/x86/mm/pat.c:985 untrack_pfn+0xbd/0xd0()
    Call Trace:
    [] dump_stack+0x45/0x57
    [] warn_slowpath_common+0x86/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] untrack_pfn+0xbd/0xd0
    [] unmap_single_vma+0x80e/0x860
    [] unmap_vmas+0x55/0xb0
    [] unmap_region+0xac/0x120
    [] do_munmap+0x28a/0x460
    [] move_vma+0x1b3/0x2e0
    [] SyS_mremap+0x3b3/0x510
    [] entry_SYSCALL_64_fastpath+0x12/0x71

    MREMAP_FIXED moves a pfnmap from old vma to new vma. untrack_pfn() is
    called with the old vma after its pfnmap page table has been removed,
    which causes follow_phys() to fail. The new vma has a new pfnmap to
    the same pfn & cache type with VM_PAT set. Therefore, we only need to
    clear VM_PAT from the old vma in this case.

    Add untrack_pfn_moved(), which clears VM_PAT from a given old vma.
    move_vma() is changed to call this function with the old vma when
    VM_PFNMAP is set. move_vma() then calls do_munmap(), and untrack_pfn()
    is a no-op since VM_PAT is cleared.

    Reported-by: Stas Sergeev
    Signed-off-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Borislav Petkov
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/1450832064-10093-2-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Thomas Gleixner

    Toshi Kani
     

06 Nov, 2015

1 commit


05 Sep, 2015

4 commits

  • Minor, but this check is overcomplicated. Two half-intervals do NOT
    overlap if END1 <= START2 or END2 <= START1.
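
    The non-overlap condition for two half-open ranges can be sketched as
    below. sk_ranges_overlap is a hypothetical helper written for this
    example, not the code in mremap_to():

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True if [start1, end1) and [start2, end2) share at least one address.
 * The ranges do NOT overlap if end1 <= start2 or end2 <= start1, so the
 * overlap test is simply the negation of that.
 */
static bool sk_ranges_overlap(unsigned long start1, unsigned long end1,
                              unsigned long start2, unsigned long end2)
{
    assert(start1 <= end1 && start2 <= end2);
    return !(end1 <= start2 || end2 <= start1);
}
```

    Adjacent ranges, where one ends exactly where the other starts, count as
    non-overlapping under this definition.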
    Acked-by: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The "new_len > old_len" branch in vma_to_resize() looks very confusing.
    It only covers the VM_DONTEXPAND/pgoff checks but everything below is
    equally unneeded if new_len == old_len.

    Change this code to return if "new_len == old_len". new_len < old_len is
    not possible; if it were, the code below would be wrong anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • move_vma() sets *locked even if move_page_tables() or ->mremap() fails;
    change sys_mremap() to check "ret & ~PAGE_MASK".
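
    Why that test works: a successful return is a page-aligned address,
    while a failure is a negative errno cast to unsigned long, which always
    has some of the low bits set. A small sketch, assuming 4 KiB pages; the
    SK_ macros and sk_ret_is_error are names invented for this example:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed page size for this sketch; real values are per architecture. */
#define SK_PAGE_SIZE 4096UL
#define SK_PAGE_MASK (~(SK_PAGE_SIZE - 1))

static_assert((SK_PAGE_SIZE & (SK_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Non-zero offset-in-page bits mean 'ret' cannot be a mapped address. */
static bool sk_ret_is_error(unsigned long ret)
{
    return (ret & ~SK_PAGE_MASK) != 0;
}
```

    With this check the caller can skip the mlock follow-up whenever the
    move did not actually succeed.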

    I think we should simply remove the VM_LOCKED code in move_vma(), that is
    why this patch doesn't change move_vma(). But this needs more cleanups.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
    way ->mremap() can have more users. Say, vdso.

    While at it, s/aio_ring_remap/aio_ring_mremap/.

    Note: this is the minimal change before ->mremap() finds another user in
    file_operations; this method should have more arguments, and it can be
    used to kill arch_remap().

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Benjamin LaHaise
    Cc: Hugh Dickins
    Cc: Jeff Moyer
    Cc: Laurent Dufour
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov