17 Apr, 2019

1 commit

  • commit c6f3c5ee40c10bb65725047a220570f718507001 upstream.

    With some architectures like ppc64, set_pmd_at() cannot cope with a
    situation where there is already some (different) valid entry present.

    Use pmdp_set_access_flags() instead to modify the pfn which is built to
    deal with modifying existing PMD entries.

    This is similar to commit cae85cb8add3 ("mm/memory.c: fix modifying of
    page protection by insert_pfn()")

    We also make a similar update for insert_pfn_pud(), even though ppc64
    doesn't support pud pfn entries yet.

    Without this patch we also see the below message in kernel log "BUG:
    non-zero pgtables_bytes on freeing mm:"
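
    A condensed sketch of the change in insert_pfn_pmd() (simplified; the
    helper names are the mm/huge_memory.c ones):

    if (!pmd_none(*pmd)) {
            if (write) {
                    /* modify the existing entry, don't set_pmd_at() over it */
                    entry = pmd_mkyoung(*pmd);
                    entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
                    if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
                            update_mmu_cache_pmd(vma, addr, pmd);
            }
            goto out_unlock;
    }
    entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
    set_pmd_at(mm, addr, pmd, entry);       /* only for a previously empty pmd */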

    Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Chandan Rajendra
    Reviewed-by: Jan Kara
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

29 Dec, 2018

1 commit

  • commit 2e83ee1d8694a61d0d95a5b694f2e61e8dde8627 upstream.

    When splitting a huge migrating PMD, we'll transfer all the existing PMD
    bits and apply them again onto the small PTEs. However, we are fetching
    the bits unconditionally via pmd_soft_dirty(), pmd_write() or
    pmd_young(), while they don't make sense at all when it's a migration
    entry. Fix them up. While at it, drop the ifdef as well, since it is not
    needed.

    Note that, if my understanding of the problem is correct, then without
    the patch there is a chance of losing some of the dirty bits in the
    migrating pmd pages (on x86_64 we're fetching bit 11, which is part of
    the swap offset, instead of bit 2), and it could potentially corrupt the
    memory of a userspace program which depends on the dirty bit.
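
    A sketch of the fixed bit transfer in __split_huge_pmd_locked() (names as
    in mm/huge_memory.c, lightly simplified):

    pmd_t old_pmd = *pmd;
    bool pmd_migration = is_pmd_migration_entry(old_pmd);
    bool write, young, soft_dirty;

    if (unlikely(pmd_migration)) {
            swp_entry_t entry = pmd_to_swp_entry(old_pmd);

            write = is_write_migration_entry(entry);
            young = false;          /* a swap entry carries no young bit */
            soft_dirty = pmd_swp_soft_dirty(old_pmd);
    } else {
            write = pmd_write(old_pmd);
            young = pmd_young(old_pmd);
            soft_dirty = pmd_soft_dirty(old_pmd);
    }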

    Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: William Kucharski
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Dave Jiang
    Cc: "Aneesh Kumar K.V"
    Cc: Souptick Joarder
    Cc: Konstantin Khlebnikov
    Cc: Zi Yan
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Peter Xu
     

06 Dec, 2018

3 commits

  • commit 006d3ff27e884f80bd7d306b041afc415f63598f upstream.

    Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
    __split_huge_page() was using i_size_read() while holding the irq-safe
    lru_lock and page tree lock, but the 32-bit i_size_read() uses an
    irq-unsafe seqlock which should not be nested inside them.

    Instead, read the i_size earlier in split_huge_page_to_list(), and pass
    the end offset down to __split_huge_page(): all while holding head page
    lock, which is enough to prevent truncation of that extent before the
    page tree lock has been taken.
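
    In outline (assuming the mm/huge_memory.c names), the end offset is read
    early and passed down:

    /* in split_huge_page_to_list(), head page locked, no irq-safe locks yet */
    end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);

    /* later, with lru_lock and the page tree lock held */
    __split_huge_page(page, list, end, flags);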

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
    Fixes: baa355fd33142 ("thp: file pages support for split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 173d9d9fd3ddae84c110fea8aedf1f26af6be9ec upstream.

    Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
    VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).

    Move the setting of mapping and index up before the page_ref_unfreeze()
    in __split_huge_page_tail() to fix this: so that a page cache lookup
    cannot get a reference while the tail's mapping and index are unstable.

    In fact, might as well move them up before the smp_wmb(): I don't see an
    actual need for that, but if I'm missing something, this way round is
    safer than the other, and no less efficient.

    You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page)
    is misplaced, and should be left until after the trylock_page(); but left
    as is, it has not crashed since, and gives more stringent assurance.
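
    Roughly, the reordering in __split_huge_page_tail() looks like this (a
    sketch, not the exact diff):

    page_tail->mapping = head->mapping;
    page_tail->index = head->index + tail;
    smp_wmb();
    /* mapping and index are stable before a lookup can take a reference */
    page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) || PageSwapCache(head)));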

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Requires: 605ca5ede764 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Jerome Glisse
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 906f9cdfc2a0800f13683f9e4ebdfd08c12ee81b upstream.

    The term "freeze" is used in several ways in the kernel, and in mm it
    has the particular meaning of forcing page refcount temporarily to 0.
    freeze_page() is just too confusing a name for a function that unmaps a
    page: rename it unmap_page(), and rename unfreeze_page() to remap_page().

    Went to change the mention of freeze_page() added later in mm/rmap.c,
    but found it to be incorrect: ordinary page reclaim reaches there too;
    but the substance of the comment still seems correct, so edit it down.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

18 Oct, 2018

1 commit

  • Jann Horn points out that our TLB flushing was subtly wrong for the
    mremap() case. What makes mremap() special is that we don't follow the
    usual "add page to list of pages to be freed, then flush tlb, and then
    free pages". No, mremap() obviously just _moves_ the page from one page
    table location to another.

    That matters, because mremap() thus doesn't directly control the
    lifetime of the moved page with a freelist: instead, the lifetime of the
    page is controlled by the page table locking, that serializes access to
    the entry.

    As a result, we need to flush the TLB not just before releasing the lock
    for the source location (to avoid any concurrent accesses to the entry),
    but also before we release the destination page table lock (to avoid the
    TLB being flushed after somebody else has already done something to that
    page).

    This also makes the whole "need_flush" logic unnecessary, since we now
    always end up flushing the TLB for every valid entry.
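
    The shape of the fix in move_ptes() (sketch; the huge-pmd path is handled
    similarly):

    arch_leave_lazy_mmu_mode();
    if (force_flush)
            flush_tlb_range(vma, old_end - len, old_end);   /* before the locks drop */
    if (new_ptl != old_ptl)
            spin_unlock(new_ptl);
    pte_unmap_unlock(old_pte - 1, old_ptl);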

    Reported-and-tested-by: Jann Horn
    Acked-by: Will Deacon
    Tested-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

13 Oct, 2018

1 commit

    Inside set_pmd_migration_entry() we are holding page table locks and thus
    we cannot sleep, so we cannot call invalidate_range_start/end().

    So remove the calls to mmu_notifier_invalidate_range_start/end(), because
    they are already made inside the function calling
    set_pmd_migration_entry() (see try_to_unmap_one()).
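
    In other words, the notifier bracketing already lives in the caller
    (sketch, using the 4.19-era mmu notifier API):

    /* try_to_unmap_one() */
    mmu_notifier_invalidate_range_start(mm, start, end);
    /* the rmap walk takes the page table lock here;
       set_pmd_migration_entry() runs under it and must not sleep */
    mmu_notifier_invalidate_range_end(mm, start, end);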

    Link: http://lkml.kernel.org/r/20181012181056.7864-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reported-by: Andrea Arcangeli
    Reviewed-by: Zi Yan
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jérôme Glisse
     

06 Oct, 2018

1 commit

  • A transparent huge page is represented by a single entry on an LRU list.
    Therefore, we can only make unevictable an entire compound page, not
    individual subpages.

    If a user tries to mlock() part of a huge page, we want the rest of the
    page to be reclaimable.

    We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
    PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

    Introduction of THP migration breaks[1] the rules around mlocking THP
    pages. If we have a single PMD mapping of the page in an mlocked VMA, the
    page will get mlocked, regardless of PTE mappings of the page.

    For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
    remove_migration_pmd().

    Anon THP pages can only be shared between processes via fork(). An
    mlocked page can only be shared if the parent mlocked it before forking;
    otherwise CoW will be triggered on mlock().

    For Anon-THP, we can fix the issue by munlocking the page when removing
    the PTE migration entry for the page. PTEs for the page will always come
    after the mlocked PMD: rmap walks VMAs from oldest to newest.

    Test-case:

    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <linux/mempolicy.h>
    #include <numaif.h>

    int main(void)
    {
            unsigned long nodemask = 4;
            void *addr;

            addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

            if (fork()) {
                    wait(NULL);
                    return 0;
            }

            mlock(addr, 4UL << 10);
            mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
                  &nodemask, 4, MPOL_MF_MOVE);

            return 0;
    }

    [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

    Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
    Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Reviewed-by: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

05 Sep, 2018

1 commit

  • It looks like I missed the PUD path when doing VM_MIXEDMAP removal.
    This can be triggered by:
    1. Boot with memmap=4G!8G
    2. build ndctl with destructive flag on
    3. make TESTS=device-dax check

    [ +0.000675] kernel BUG at mm/huge_memory.c:824!

    Applying the same change that was applied to vmf_insert_pfn_pmd() in the
    original patch.
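
    Presumably vmf_insert_pfn_pud() ends up with the same check as the pmd
    side (a sketch, not the verified diff):

    BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
           !pfn_t_devmap(pfn));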

    Link: http://lkml.kernel.org/r/153565957352.35524.1005746906902065126.stgit@djiang5-desk3.ch.intel.com
    Fixes: e1fb4a08649 ("dax: remove VM_MIXEDMAP for fsdax and device dax")
    Signed-off-by: Dave Jiang
    Reported-by: Vishal Verma
    Tested-by: Vishal Verma
    Acked-by: Jeff Moyer
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

26 Aug, 2018

1 commit

  • Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

    Pull libnvdimm memory-failure update from Dave Jiang:
    "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
    backed mappings. The recovery code has specific enabling for several
    possible page states and needs new enabling to handle poison in dax
    mappings.

    In order to support reliable reverse mapping of user space addresses:

    1/ Add new locking in the memory_failure() rmap path to prevent races
    that would typically be handled by the page lock.

    2/ Since dev_pagemap pages are hidden from the page allocator and the
    "compound page" accounting machinery, add a mechanism to determine
    the size of the mapping that encompasses a given poisoned pfn.

    3/ Given pmem errors can be repaired, change the speculatively
    accessed poison protection, mce_unmap_kpfn(), to be reversible and
    otherwise allow ongoing access from the kernel.

    A side effect of this enabling is that MADV_HWPOISON becomes usable
    for dax mappings, however the primary motivation is to allow the
    system to survive userspace consumption of hardware-poison via dax.
    Specifically the current behavior is:

    mce: Uncorrected hardware memory error in user-access at af34214200
    {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    mce: [Hardware Error]: Machine check events logged
    {1}[Hardware Error]: event severity: corrected
    Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
    [..]
    Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
    mce: Memory error not recovered
    <reboot>

    ...and with these changes:

    Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
    Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
    Memory failure: 0x20cb00: recovery action for dax page: Recovered

    Given all the cross dependencies I propose taking this through
    nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
    folks"

    * tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm, pmem: Restore page attributes when clearing errors
    x86/memory_failure: Introduce {set, clear}_mce_nospec()
    x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
    mm, memory_failure: Teach memory_failure() about dev_pagemap pages
    filesystem-dax: Introduce dax_lock_mapping_entry()
    mm, memory_failure: Collect mapping size in collect_procs()
    mm, madvise_inject_error: Let memory_failure() optionally take a page reference
    mm, dev_pagemap: Do not clear ->mapping on final put
    mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
    filesystem-dax: Set page->index
    device-dax: Set page->index
    device-dax: Enable page_mapping()
    device-dax: Convert to vmf_insert_mixed and vm_fault_t

    Linus Torvalds
     

24 Aug, 2018

1 commit

    Use the new return type vm_fault_t for the fault handler. For now, this
    is just documenting that the function returns a VM_FAULT value rather
    than an errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t as well.

    The places from where handle_mm_fault() is invoked will be changed to
    vm_fault_t in a separate patch.

    vmf_error() is the newly introduced inline function in 4.17-rc6.
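
    For illustration, the conversion is mechanical (hypothetical handler and
    helper shown):

    typedef unsigned int vm_fault_t;        /* becomes a distinct __bitwise type later */

    /* was: static int my_fault(struct vm_fault *vmf) */
    static vm_fault_t my_fault(struct vm_fault *vmf)
    {
            struct page *page = my_lookup_page(vmf);        /* hypothetical */

            if (!page)
                    return VM_FAULT_SIGBUS;         /* a VM_FAULT code, never an errno */
            vmf->page = page;
            return 0;
    }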

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

18 Aug, 2018

3 commits

    Huge pages help reduce the TLB miss rate, but they have a higher cache
    footprint, which may sometimes cause problems. For example, when copying
    a huge page on the x86_64 platform, the cache footprint is 4M. But on a
    Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
    (last level cache). That is, on average, there is 2.5M LLC for each core
    and 1.25M LLC for each thread.

    If cache contention is heavy when copying the huge page, and we copy the
    huge page from beginning to end, it is possible that the beginning of the
    huge page is evicted from the cache after we finish copying its end. And
    it is possible for the application to access the beginning of the huge
    page right after the copy.

    In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
    huge page"), to keep the cache lines of the target subpage hot, the order
    in which clear_huge_page() clears the subpages of a huge page was changed
    to clear the subpage furthest from the target subpage first, and the
    target subpage last. A similar ordering change helps huge page copying
    too, and that is what this patch implements. Because the ordering
    algorithm has been put into a separate function, the implementation is
    quite simple.
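
    A minimal user-space illustration of the ordering (hypothetical helper,
    not the kernel function):

    #include <string.h>

    #define SUBPAGE_SZ 4096UL

    /* Copy subpages in decreasing distance from 'target', the target itself
     * last, so the target's cache lines are hottest when the copy finishes. */
    static void copy_huge_target_last(char *dst, const char *src,
                                      unsigned long nr, unsigned long target)
    {
            unsigned long d;

            if (!nr)
                    return;
            for (d = nr - 1; d >= 1; d--) {
                    if (target >= d)
                            memcpy(dst + (target - d) * SUBPAGE_SZ,
                                   src + (target - d) * SUBPAGE_SZ, SUBPAGE_SZ);
                    if (target + d < nr)
                            memcpy(dst + (target + d) * SUBPAGE_SZ,
                                   src + (target + d) * SUBPAGE_SZ, SUBPAGE_SZ);
            }
            memcpy(dst + target * SUBPAGE_SZ,
                   src + target * SUBPAGE_SZ, SUBPAGE_SZ);
    }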

    The patch is a generic optimization which should benefit quite some
    workloads, not for a specific use case. To demonstrate the performance
    benefit of the patch, we tested it with vm-scalability run on
    transparent huge page.

    With this patch, throughput increases ~16.6% in the vm-scalability
    anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
    system (36 cores, 72 threads). The test case sets
    /sys/kernel/mm/transparent_hugepage/enabled to always, mmap()s a big
    anonymous memory area, populates it, and then forks 36 child processes,
    each of which writes to the anonymous memory area from beginning to end,
    causing copy-on-write. For each child process, the other child processes
    can be seen as other workloads which generate heavy cache pressure. At
    the same time, the IPC (instructions per cycle) increased from 0.63 to
    0.78, and the time spent in user space was reduced by ~7.2%.

    Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Mike Kravetz
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    Since commit eca56ff906bd ("mm, shmem: add internal shmem resident
    memory accounting"), MM_SHMEMPAGES has been added to separate the shmem
    accounting from regular files. So, all shmem pages should be accounted
    to MM_SHMEMPAGES instead of MM_FILEPAGES.

    And, since normal 4K shmem pages have been accounted to MM_SHMEMPAGES,
    shmem thp pages should not be treated differently. Account them to
    MM_SHMEMPAGES via mm_counter_file(): shmem pages are swap-backed, so
    this keeps them consistent with normal 4K shmem pages.
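
    The accounting site then becomes (sketch; mm_counter_file() returns
    MM_SHMEMPAGES for swap-backed pages):

    add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);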

    This will not change the rss counter of processes since shmem pages are
    still a part of it.

    The /proc/pid/status and /proc/pid/statm counters will however be more
    accurate wrt shmem usage, as originally intended. And as eca56ff906bd
    ("mm, shmem: add internal shmem resident memory accounting") mentioned,
    oom also could report more accurate "shmem-rss".

    Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • This patch is reworked from an earlier patch that Dan has posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
    the memory page it is dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult, we fall back to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.
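
    For reference, vma_is_dax() is just an inode check, roughly:

    static inline bool vma_is_dax(struct vm_area_struct *vma)
    {
            return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
    }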

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against ndctl unit test. It has also been
    tested against xfstests commit: 625515d using fake pmem created by
    memmap and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

22 Jul, 2018

1 commit

  • __split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
    and propagate that to PageDirty: otherwise, data may be lost when a huge
    tmpfs page is modified then split then reclaimed.

    How has this taken so long to be noticed? Because there was no problem
    when the huge page is written by a write system call (shmem_write_end()
    calls set_page_dirty()), nor when the page is allocated for a write fault
    (fault_dirty_shared_page() calls set_page_dirty()); but when it is
    allocated for a read fault (which MAP_POPULATE simulates), there is no
    set_page_dirty().
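
    The essence of the fix, in __split_huge_pmd_locked() after the huge pmd
    is cleared (sketch):

    if (pmd_dirty(old_pmd))
            SetPageDirty(page);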

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
    Fixes: d21b9e57c74c ("thp: handle file pages in split_huge_pmd()")
    Signed-off-by: Hugh Dickins
    Reported-by: Ashwin Chaugule
    Reviewed-by: Yang Shi
    Reviewed-by: Kirill A. Shutemov
    Cc: "Huang, Ying"
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jul, 2018

1 commit

    Use the new return type vm_fault_t for the fault and huge_fault handlers.
    For now, this is just documenting that the function returns a VM_FAULT
    value rather than an errno. Once all instances are converted, vm_fault_t
    will become a distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    Previously vm_insert_mixed() returned an error code which the driver
    mapped into a VM_FAULT_* type. The new function vmf_insert_mixed()
    removes this inefficiency by returning a VM_FAULT_* type directly.
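
    A sketch of the wrapper and the errno-to-VM_FAULT mapping it subsumes
    (assuming the usual convention of the vmf_* helpers):

    vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma,
                                unsigned long addr, pfn_t pfn)
    {
            int err = vm_insert_mixed(vma, addr, pfn);

            if (err == -ENOMEM)
                    return VM_FAULT_OOM;
            if (err < 0 && err != -EBUSY)
                    return VM_FAULT_SIGBUS;
            return VM_FAULT_NOPAGE;
    }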

    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Dave Jiang

    Dan Williams
     

09 Jul, 2018

1 commit

    Memory allocations can induce swapping via kswapd or direct reclaim. If
    we are having IO done for us by kswapd and don't actually go into direct
    reclaim, we may never get scheduled for throttling. So instead, check
    whether our cgroup is congested, and if so, schedule the throttling.
    Before we return to user space, the throttling machinery will only
    throttle if we actually required it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a, b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jun, 2018

1 commit

  • Now that we can represent the location of 'deferred_list' in C instead of
    comments, make use of that ability.
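
    The struct page union gains an explicitly named member along these lines
    (sketch of the mm_types.h layout):

    struct {        /* second tail page of a compound page */
            unsigned long _compound_pad_1;  /* compound_head */
            unsigned long _compound_pad_2;
            struct list_head deferred_list;
    };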

    Link: http://lkml.kernel.org/r/20180518194519.3820-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

05 Jun, 2018

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "There's been a fair amount of work in the docs tree this time around,
    including:

    - Extensive RST conversions and organizational work in the
    memory-management docs thanks to Mike Rapoport.

    - An update of Documentation/features from Andrea Parri and a script
    to keep it updated.

    - Various LICENSES updates from Thomas, along with a script to check
    SPDX tags.

    - Work to fix dangling references to documentation files; this
    involved a fair number of one-liner comment changes outside of
    Documentation/

    ... and the usual list of documentation improvements, typo fixes, etc"

    * tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
    Documentation: document hung_task_panic kernel parameter
    docs/admin-guide/mm: add high level concepts overview
    docs/vm: move ksm and transhuge from "user" to "internals" section.
    docs: Use the kerneldoc comments for memalloc_no*()
    doc: document scope NOFS, NOIO APIs
    docs: update kernel versions and dates in tables
    docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
    docs/vm: transhuge: minor updates
    docs/vm: transhuge: change sections order
    Documentation: arm: clean up Marvell Berlin family info
    Documentation: gpio: driver: Fix a typo and some odd grammar
    docs: ranoops.rst: fix location of ramoops.txt
    scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
    docs: uio-howto.rst: use a code block to solve a warning
    mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
    w1: w1_io.c: fix a kernel-doc warning
    Documentation/process/posting: wrap text at 80 cols
    docs: admin-guide: add cgroup-v2 documentation
    Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
    Documentation: refcount-vs-atomic: Update reference to LKMM doc.
    ...

    Linus Torvalds
     

03 Jun, 2018

1 commit

    Swapping load on huge=always tmpfs (with khugepaged tuned up to be very
    eager, but I'm not sure that is relevant) soon hung uninterruptibly,
    waiting for the page lock in shmem_getpage_gfp()'s find_lock_entry(),
    most often when "cp -a" was trying to write to a smallish file. Debugging
    showed that the page in question was not locked, and page->mapping was
    NULL by then, but page->index was consistent with having been in a huge
    page before.

    Reproduced in minutes on a 4.15 kernel, even with 4.17's 605ca5ede764
    ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()") added
    in; but took hours to reproduce on a 4.17 kernel (no idea why).

    The culprit proved to be the __ClearPageDirty() on tails beyond i_size in
    __split_huge_page(): the non-atomic bit operation may have been safe when
    4.8's baa355fd3314 ("thp: file pages support for split_huge_page()")
    introduced it, but is liable to erase PageWaiters after 4.10's
    62906027091f ("mm: add PageWaiters indicating tasks are waiting for a
    page bit").

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805291841070.3197@eggly.anvils
    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Nicholas Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Apr, 2018

1 commit

    My testing of the latest kernel supporting thp migration showed an
    infinite loop when offlining a memory block that is filled with shmem
    thps. We can get out of the loop with a signal, but the kernel should
    return with failure in this case.

    What happens in the loop is that scan_movable_pages() repeats returning
    the same pfn without any progress. That's because page migration always
    fails for shmem thps.

    In the memory offline code, memory blocks containing unmovable pages
    should be prevented from becoming offline targets by
    has_unmovable_pages() inside start_isolate_page_range(). So it would be
    possible to change migratability for non-anonymous thps to avoid the
    issue, but that would introduce more complex and thp-specific handling in
    the migration code, so it might not be good.

    So this patch suggests fixing the issue by enabling thp migration for
    shmem thp. Both anon and shmem thps are migratable, so we don't need a
    precheck on the type of thp.

    Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
    Fixes: 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except a few spelling
    fixes. The relatively large diffstat stems from the indentation and
    paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I could
    miss some places that needed markup and add some markup where it was not
    necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

12 Apr, 2018

3 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
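
    The mechanical shape of the conversion (sketch):

    -       spin_lock_irq(&mapping->tree_lock);
    +       xa_lock_irq(&mapping->i_pages);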

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    THP migration is hacked into the generic migration code with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and, if that is not the
    case, it allocates a simple page to migrate. unmap_and_move then fixes
    that up by splitting the THP into small pages while moving the head page
    to the newly allocated order-0 page. The remaining pages are moved to the
    LRU list by split_huge_page. The same happens if the THP allocation
    fails. This is really ugly and error prone [1].

    I also believe that split_huge_page to the LRU lists is inherently wrong,
    because the tail pages are not migrated. Some callers will just work
    around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken, though: e.g. madvise_inject_error will
    migrate the head and then advance the next pfn by the huge page size.
    do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
    will simply split the THP before migration if THP migration is not
    supported, then fall back to single page migration; but they don't handle
    tail pages if the THP migration path is unable to allocate a fresh THP,
    so we end up with ENOMEM and fail the whole migration, which is
    questionable behavior. Page compaction doesn't try to migrate large
    pages, so it should be immune.

    This patch tries to unclutter the situation by moving the special THP
    handling up to the migrate_pages layer, where it actually belongs. We
    simply split the THP back onto the existing list if unmap_and_move fails
    with ENOMEM, and retry. So we will _always_ migrate all THP subpages, and
    specific migrate_pages users do not have to deal with this case in a
    special way.

    [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
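
    The retry path is roughly (sketch of the migrate_pages() error handling):

    case -ENOMEM:
            /*
             * THP migration is unsupported or a fresh THP could not be
             * allocated: split the page back onto the list and retry.
             */
            if (PageTransHuge(page)) {
                    lock_page(page);
                    rc = split_huge_page_to_list(page, from);
                    unlock_page(page);
                    if (!rc) {
                            list_safe_reset_next(page, page2, lru);
                            goto retry;
                    }
            }
            nr_failed++;
            goto out;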

    Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm,
    thp: remove __GFP_NORETRY from khugepaged and madvised allocations"). We
    previously used an explicit __GFP_NORETRY, which ruled out the OOM killer
    automagically.

    The memcg charge path should be semantically compliant with the
    allocation path, and that means that if we do not trigger the OOM killer
    for costly orders there, we should do the same in the memcg charge path
    as well. Otherwise we are forcing callers to distinguish the two and use
    different gfp masks, which is both non-intuitive and bug prone. As soon
    as we get a costly high order kmalloc user, we won't even have any means
    to tell the memcg-specific gfp mask to prevent OOM, because the charging
    is deep within the guts of the slab allocator.

    The unexpected memcg OOM on THP has already been fixed upstream by
    9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
    one-off fix rather than a generic solution. Teach mem_cgroup_oom to
    bail out on costly order requests to fix the THP issue as well as any
    other costly OOM eligible allocations to be added in future.

    Also revert 9d3c3354bb85 because special gfp for THP is no longer
    needed.
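
    The gist, as a sketch of the mem_cgroup_oom() bail-out:

    if (order > PAGE_ALLOC_COSTLY_ORDER)
            return;         /* costly orders never OOM-kill, as in the page allocator */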

    Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Apr, 2018

1 commit

    THP split makes a non-atomic change of tail page flags. This is almost
    ok, because tail pages are locked and isolated, but it breaks recent
    changes in page locking: the non-atomic operation could clear the
    PG_waiters bit.

    As a result, the concurrent sequence get_page_unless_zero() ->
    lock_page() might block forever, especially if this page was truncated
    later.

    Fix is trivial: clone flags before unfreezing page reference counter.

    This race has existed since commit 62906027091f ("mm: add PageWaiters
    indicating tasks are waiting for a page bit"), while the unsafe unfreeze
    itself was added in commit 8df651c7059e ("thp: cleanup
    split_huge_page()").

    clear_compound_head() also must be called before unfreezing the page
    reference, because after a successful get_page_unless_zero() a put_page()
    might follow, which needs a correct compound_head().

    And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(), which
    is made especially for that and has the semantics of smp_store_release().
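
    Condensed, the safe ordering looks like this (sketch; the flag mask is
    hypothetical shorthand for the bits the kernel actually preserves):

    /* clone flags while the tail's refcount is still frozen at zero */
    page_tail->flags = (head->flags & PRESERVED_FLAGS) | tail_flags;
    clear_compound_head(page_tail);
    /* page_ref_unfreeze() has smp_store_release() semantics */
    page_ref_unfreeze(page_tail, 1);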

    Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

23 Mar, 2018

2 commits

  • Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
    madvised allocations") changed the page allocator to no longer detect
    thp allocations based on __GFP_NORETRY.

    It did not, however, modify the mem cgroup try_charge() path to avoid
    oom kill for either khugepaged collapsing or thp faulting. It is never
    expected to oom kill a process to allocate a hugepage for thp; reclaim
    is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
    (and charging) should fallback instead of oom killing processes.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    deferred_split_scan() gets called from the reclaim path. Waiting for the
    page lock there may lead to deadlock.

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.
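
    The loop then becomes (sketch matching the description above):

    if (!trylock_page(page))
            goto next;
    /* split_huge_page() removes the page from this list on success */
    if (!split_huge_page(page))
            split++;
    unlock_page(page);
    next:
    put_page(page);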

    Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
    Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

01 Feb, 2018

3 commits

    Instead of marking the pmd ready for split, invalidate the pmd. This
    should take care of the powerpc requirement. The only side effect is that
    we mark the pmd invalid early. This can result in us blocking access to
    the page a bit longer if we race against a thp split.

    [kirill.shutemov@linux.intel.com: rebased, dirty THP once]
    Link: http://lkml.kernel.org/r/20171213105756.69879-13-kirill.shutemov@linux.intel.com
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Catalin Marinas
    Cc: David Daney
    Cc: David Miller
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michal Hocko
    Cc: Nitin Gupta
    Cc: Ralf Baechle
    Cc: Thomas Gleixner
    Cc: Vineet Gupta
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    Use the modified pmdp_invalidate(), which returns the previous value of
    the pmd, to transfer the dirty and accessed bits.
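
    Sketch of the use in __split_huge_pmd_locked():

    pmd_t old_pmd = pmdp_invalidate(vma, haddr, pmd);

    write = pmd_write(old_pmd);
    young = pmd_young(old_pmd);
    dirty = pmd_dirty(old_pmd);     /* carried over to the PTEs / PageDirty */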

    Link: http://lkml.kernel.org/r/20171213105756.69879-12-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Aneesh Kumar K.V
    Cc: Catalin Marinas
    Cc: David Daney
    Cc: David Miller
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michal Hocko
    Cc: Nitin Gupta
    Cc: Ralf Baechle
    Cc: Thomas Gleixner
    Cc: Vineet Gupta
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    pmd_trans_splitting() was removed in the THP refcounting redesign;
    therefore the related comment should be updated.

    Link: http://lkml.kernel.org/r/1512625745-59451-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

16 Dec, 2017

1 commit

  • This reverts commits 5c9d2d5c269c, c7da82b894e9, and e7fe7b5cae90.

    We'll probably need to revisit this, but basically we should not
    complicate the get_user_pages_fast() case, and checking the actual page
    table protection key bits will require more care anyway, since the
    protection keys depend on the exact state of the VM in question.

    Particularly when doing a "remote" page lookup (ie in somebody else's VM,
    not your own), you need to be much more careful than this was. Dave
    Hansen says:

    "So, the underlying bug here is that we now a get_user_pages_remote()
    and then go ahead and do the p*_access_permitted() checks against the
    current PKRU. This was introduced recently with the addition of the
    new p??_access_permitted() calls.

    We have checks in the VMA path for the "remote" gups and we avoid
    consulting PKRU for them. This got missed in the pkeys selftests
    because I did a ptrace read, but not a *write*. I also didn't
    explicitly test it against something where a COW needed to be done"

    It's also not entirely clear that it makes sense to check the protection
    key bits at this level at all. But one possible eventual solution is to
    make the get_user_pages_fast() case just abort if it sees protection key
    bits set, which makes us fall back to the regular get_user_pages() case,
    which then has a vma and can do the check there if we want to.

    We'll see.

    Somewhat related to this all: what we _do_ want to do some day is to
    check the PAGE_USER bit - it should obviously always be set for user
    pages, but it would be a good check to have back. Because we have no
    generic way to test for it, we lost it as part of moving over from the
    architecture-specific x86 GUP implementation to the generic one in
    commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
    get_user_page_fast() implementation").

    Cc: Peter Zijlstra
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Cc: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Nov, 2017

4 commits

  • Merge misc fixes from Andrew Morton:
    "28 fixes"

    * emailed patches from Andrew Morton : (28 commits)
    fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate()
    mm/hugetlb: fix NULL-pointer dereference on 5-level paging machine
    autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored"
    autofs: revert "autofs: take more care to not update last_used on path walk"
    fs/fat/inode.c: fix sb_rdonly() change
    mm, memcg: fix mem_cgroup_swapout() for THPs
    mm: migrate: fix an incorrect call of prep_transhuge_page()
    kmemleak: add scheduling point to kmemleak_scan()
    scripts/bloat-o-meter: don't fail with division by 0
    fs/mbcache.c: make count_objects() more robust
    Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical"
    mm/madvise.c: fix madvise() infinite loop under special circumstances
    exec: avoid RLIMIT_STACK races with prlimit()
    IB/core: disable memory registration of filesystem-dax vmas
    v4l2: disable filesystem-dax mapping support
    mm: fail get_vaddr_frames() for filesystem-dax mappings
    mm: introduce get_user_pages_longterm
    device-dax: implement ->split() to catch invalid munmap attempts
    mm, hugetlbfs: introduce ->split() to vm_operations_struct
    scripts/faddr2line: extend usage on generic arch
    ...

    Linus Torvalds
     
  • The 'access_permitted' helper is used in the gup-fast path and goes
    beyond the simple _PAGE_RW check to also:

    - validate that the mapping is writable from a protection keys
    standpoint

    - validate that the pte has _PAGE_USER set, since all fault paths where
    pmd_write() is checked must be referencing user memory.
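
    On x86 this boils down to (sketch of the arch/x86/include/asm/pgtable.h
    helper, as I read it):

    static inline bool __pte_access_permitted(unsigned long pteval, bool write)
    {
            unsigned long need_pte_bits = _PAGE_PRESENT | _PAGE_USER;

            if (write)
                    need_pte_bits |= _PAGE_RW;
            if ((pteval & need_pte_bits) != need_pte_bits)
                    return false;
            return __pkru_allows_pkey(pte_flags_pkey(pteval), write);
    }

    static inline bool pmd_access_permitted(pmd_t pmd, bool write)
    {
            return __pte_access_permitted(pmd_val(pmd), write);
    }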

    Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The 'access_permitted' helper is used in the gup-fast path and goes
    beyond the simple _PAGE_RW check to also:

    - validate that the mapping is writable from a protection keys
    standpoint

    - validate that the pte has _PAGE_USER set, since all fault paths where
    pud_write() is checked must be referencing user memory.

    [dan.j.williams@intel.com: fix powerpc compile error]
    Link: http://lkml.kernel.org/r/151129127237.37405.16073414520854722485.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: http://lkml.kernel.org/r/151043110453.2842.2166049702068628177.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: "David S. Miller"
    Cc: Kirill A. Shutemov
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.

    It was a nice cleanup in theory, but as Nicolai Stange points out, we do
    need to make the page dirty for the copy-on-write case even when we
    didn't end up making it writable, since the dirty bit is what we use to
    check that we've gone through a COW cycle.

    Reported-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds