15 May, 2019

2 commits

  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions in response. However, the current API only provides
    the range of virtual addresses affected by the change, not why the change
    is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event, as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that everything in the range is going away when that event
    happens. A later patch converts the mm call paths to use a more
    appropriate event for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

03 Apr, 2019

1 commit

  • Move the mmu_gather::page_size things into the generic code instead of
    PowerPC specific bits.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

06 Oct, 2018

1 commit

  • Reproducer, assuming 2M of hugetlbfs available:

    Hugetlbfs mounted, size=2M and option user=testuser

    # mount | grep ^hugetlbfs
    hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
    # sysctl vm.nr_hugepages=1
    vm.nr_hugepages = 1
    # grep Huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 1
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 2048 kB

    Code:

    #include <stddef.h>
    #include <sys/mman.h>

    #define SIZE (2 * 1024 * 1024)

    int main(void)
    {
        void *ptr;

        ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
        madvise(ptr, SIZE, MADV_DONTDUMP);
        madvise(ptr, SIZE, MADV_DODUMP);
    }

    Compile and strace:

    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
    madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
    madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

    Based on the author's testing, with analysis from Florian Weimer[1],
    hugetlbfs pages have VM_DONTEXPAND set in VmFlags, like driver pages.

    The inclusion of VM_DONTEXPAND in the VM_SPECIAL definition was a
    consequence of the large usage of VM_DONTEXPAND in device drivers.

    A consequence of [2] is that VM_DONTEXPAND-marked pages cannot be
    marked DODUMP.

    A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
    memory for a while and later request madvise(MADV_DODUMP) on the same
    memory. We correct this omission by allowing madvise(MADV_DODUMP) on
    hugetlbfs pages.

    [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
    [2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")

    Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
    Link: https://lists.launchpad.net/maria-discuss/msg05245.html
    Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")
    Reported-by: Kenneth Penza
    Signed-off-by: Daniel Black
    Reviewed-by: Mike Kravetz
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Daniel Black
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

24 Jul, 2018

1 commit

  • The madvise_inject_error() routine uses get_user_pages() to look up the
    pfn and other information for the injected error, but it does not release
    that pin. The assumption is that failed pages should be taken out of
    circulation.

    However, for dax mappings it is not possible to take pages out of
    circulation since they are 1:1 physically mapped as filesystem blocks,
    or device-dax capacity. They also typically represent persistent memory
    which has an error clearing capability.

    In preparation for adding a special handler for dax mappings, shift the
    responsibility of taking the page reference to memory_failure(). I.e.
    drop the page reference and do not specify MF_COUNT_INCREASED to
    memory_failure().

    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Signed-off-by: Dan Williams
    Acked-by: Naoya Horiguchi
    Signed-off-by: Dave Jiang

    Dan Williams
     

24 Jan, 2018

1 commit


30 Nov, 2017

1 commit

  • MADV_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
    Unfortunately madvise_willneed() doesn't communicate this information
    properly to the generic madvise syscall implementation. The calling
    convention is quite subtle there. madvise_vma() is supposed to either
    return an error or update &prev; otherwise the main loop will never
    advance to the next vma and it will keep looping forever without a way
    to get out of the kernel.

    It seems this has been broken since its introduction. Nobody has noticed
    because nobody seems to be using MADV_WILLNEED on these DAX mappings.

    [mhocko@suse.com: rewrite changelog]
    Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
    Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place")
    Signed-off-by: chenjie
    Signed-off-by: guoxuenan
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: zhangyi (F)
    Cc: Miao Xie
    Cc: Mike Rapoport
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Rik van Riel
    Cc: Carsten Otte
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    chenjie
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
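    As an illustration, the identifier replaces the full boilerplate with a
    single comment on the first line of a C source file:

```
// SPDX-License-Identifier: GPL-2.0
```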

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results of the output
    of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Oct, 2017

1 commit

  • mm/madvise.c has a brief description about all MADV_ flags. Add a
    description for the newly added MADV_WIPEONFORK and MADV_KEEPONFORK.

    Although the man page has similar information, it's better to keep this
    consistent with the other flags.

    Link: http://lkml.kernel.org/r/1506117328-88228-1-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

04 Oct, 2017

1 commit

  • This fixes a bug in madvise() where, if you tried to soft offline a
    hugepage via madvise(), you'd end up using the wrong page offset while
    walking the address range, due to attempting to get the compound order
    of a formerly compound page that had since been dissolved (since commit
    c3114a84f7f9: "mm: hugetlb: soft-offline: dissolve
    source hugepage after successful migration").

    As a result I ended up with all my free pages except one being offlined.

    Link: http://lkml.kernel.org/r/20170912204306.GA12053@gmail.com
    Fixes: c3114a84f7f9 ("mm: hugetlb: soft-offline: dissolve source hugepage after successful migration")
    Signed-off-by: Alexandru Moise
    Cc: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Shaohua Li
    Cc: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     

09 Sep, 2017

1 commit

  • Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for the un-addressable device memory but without all the corner
    cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

1 commit

  • Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
    in the child process after fork. This differs from MADV_DONTFORK in one
    important way.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    MADV_WIPEONFORK only works on private, anonymous VMAs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

    [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
    Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Reported-by: Florian Weimer
    Reported-by: Colm MacCártaigh
    Reviewed-by: Mike Kravetz
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Helge Deller
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Will Drewry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

01 Sep, 2017

1 commit

  • Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
    and bisected it to the commit 479f854a207c ("mm, page_alloc: defer
    debugging checks of pages allocated from the PCP").

    The problem is that a page that was poisoned with madvise() is reused.
    The commit removed a check that would trigger if DEBUG_VM was enabled
    but re-enabling the check only fixes the problem as a side-effect by
    printing a bad_page warning and recovering.

    The root of the problem is that an madvise() can leave a poisoned page
    on the per-cpu list. This patch drains all per-cpu lists after pages
    are poisoned so that they will not be reused. Wendy reports that the
    test case in question passes with this patch applied. While this could
    be done in a targeted fashion, it is over-complicated for such a rare
    operation.

    Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Signed-off-by: Mel Gorman
    Reported-by: Wang, Wendy
    Tested-by: Wang, Wendy
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Hansen, Dave"
    Cc: "Luck, Tony"
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

26 Aug, 2017

1 commit

  • If madvise(..., MADV_FREE) split a transparent hugepage, it called
    put_page() before unlock_page().

    This was wrong because put_page() can free the page, e.g. if a
    concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
    mapping. put_page() then rightfully complained about freeing a locked
    page.

    Fix this by moving the unlock_page() before put_page().

    This bug was found by syzkaller, which encountered the following splat:

    BUG: Bad page state in process syzkaller412798 pfn:1bd800
    page:ffffea0006f60000 count:0 mapcount:0 mapping: (null) index:0x20a00
    flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
    raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
    raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in:
    CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    bad_page+0x230/0x2b0 mm/page_alloc.c:565
    free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
    free_pages_check mm/page_alloc.c:952 [inline]
    free_pages_prepare mm/page_alloc.c:1043 [inline]
    free_pcp_prepare mm/page_alloc.c:1068 [inline]
    free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
    __put_single_page mm/swap.c:79 [inline]
    __put_page+0xfb/0x160 mm/swap.c:113
    put_page include/linux/mm.h:814 [inline]
    madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:108 [inline]
    walk_p4d_range mm/pagewalk.c:134 [inline]
    walk_pgd_range mm/pagewalk.c:160 [inline]
    __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
    walk_page_range+0x200/0x470 mm/pagewalk.c:326
    madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
    madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
    madvise_dontneed_free mm/madvise.c:555 [inline]
    madvise_vma mm/madvise.c:664 [inline]
    SYSC_madvise mm/madvise.c:832 [inline]
    SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Here is a C reproducer:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define MADV_FREE 8
    #define PAGE_SIZE 4096

    static void *mapping;
    static const size_t mapping_size = 0x1000000;

    static void *madvise_thrproc(void *arg)
    {
        madvise(mapping, mapping_size, (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];

        for (;;) {
            mapping = mmap(NULL, mapping_size, PROT_WRITE,
                           MAP_POPULATE | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

            munmap((char *)mapping + mapping_size / 2, PAGE_SIZE);

            pthread_create(&t[0], 0, madvise_thrproc, (void *)MADV_DONTNEED);
            pthread_create(&t[1], 0, madvise_thrproc, (void *)MADV_FREE);
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            munmap(mapping, mapping_size);
        }
    }

    Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
    CONFIG_DEBUG_VM=y are needed.

    Google Bug Id: 64696096

    Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
    Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
    Signed-off-by: Eric Biggers
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: [v4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                                CPU1
    ----                                ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible, so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users, such as gup, which can race with
    reclaim, look just at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as, if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs,
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Jul, 2017

2 commits

  • MADV_FREE is identical to MADV_DONTNEED from the point of view of a uffd
    monitor. The monitor has to stop handling #PF events in the range being
    freed. We are reusing the userfaultfd_remove callback along with the
    logic required to re-get and re-validate the VMA, which may change or
    disappear because userfaultfd_remove releases mmap_sem.

    Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For fast flash disks, async IO can introduce overhead because of context
    switches. block-mq now supports IO poll, which improves performance and
    latency a lot. swapin is a good place to use this technique, because the
    task is waiting for the swapped-in page to continue execution.

    In my virtual machine, directly read 4k data from a NVMe with iopoll is
    about 60% better than that without poll. With iopoll support in swapin
    patch, my microbenchmark (a task does random memory write) is about
    10%~25% faster. CPU utilization increases a lot though, 2x and even 3x
    CPU utilization. This will depend on disk speed.

    While iopoll in swapin isn't intended for all use cases, it's a win
    for latency-sensitive workloads with a high speed swap disk. The block
    layer has a knob to control poll at runtime. If poll isn't enabled in
    the block layer, there should be no noticeable change in swapin.

    I got a chance to run the same test on an NVMe device with DRAM as the
    media. In a simple fio IO test, blkpoll boosts 50% performance in the
    single thread test and ~20% in the 8 threads test. So this is the
    baseline. In the above
    swap test, blkpoll boosts ~27% performance in single thread test.
    blkpoll uses 2x CPU time though.

    If we enable hybrid polling, the performance gain drops very slightly
    but CPU time is only 50% worse than that without blkpoll. We can also
    adjust the parameters of hybrid poll; with that, the CPU time penalty
    is reduced further. In the 8 threads test, blkpoll doesn't help though.
    The performance is similar to that without blkpoll, but cpu utilization
    is similar too. There is lock contention in the swap path. The cpu time
    spent on blkpoll isn't high. So overall, blkpoll swapin isn't worse
    than that without it.

    The swapin readahead might read several pages in at the same time and
    form a big IO request. Since the IO will take a longer time, it doesn't
    make sense to poll, so the patch only does iopoll for single page
    swapin.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
    Signed-off-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 May, 2017

5 commits

  • madvise_behavior_valid() should be called before acting upon the
    behavior parameter. Hence move up the function. This also includes
    MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior parameter
    for the system call madvise().

    Link: http://lkml.kernel.org/r/20170418052844.24891-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: David Rientjes
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • This cleans up the handling of MADV_SOFT_OFFLINE and MADV_HWPOISON called
    through the madvise() system call.

    * madvise_memory_failure() was a misleading name for a function that
    accommodates handling of both the memory_failure() and
    soft_offline_page() functions. Basically it handles memory error
    injection from user space, which can go either way: memory failure or
    soft offline. Renamed madvise_inject_error() instead.

    * Renamed struct page pointer 'p' to 'page'.

    * pr_info() was essentially printing a PFN value but said 'page', which
    was misleading. Made the process virtual address explicit.

    Before the patch:

    Soft offlining page 0x15e3e at 0x3fff8c230000
    Soft offlining page 0x1f3 at 0x3fffa0da0000
    Soft offlining page 0x744 at 0x3fff7d200000
    Soft offlining page 0x1634d at 0x3fff95e20000
    Soft offlining page 0x16349 at 0x3fff95e30000
    Soft offlining page 0x1d6 at 0x3fff9e8b0000
    Soft offlining page 0x5f3 at 0x3fff91bd0000

    Injecting memory failure for page 0x15c8b at 0x3fff83280000
    Injecting memory failure for page 0x16190 at 0x3fff83290000
    Injecting memory failure for page 0x740 at 0x3fff9a2e0000
    Injecting memory failure for page 0x741 at 0x3fff9a2f0000

    After the patch:

    Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000
    Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000
    Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000
    Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000
    Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000
    Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000
    Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000
    Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000

    Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000
    Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000
    Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000
    Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000
    Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000
    Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000
    Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000

    Link: http://lkml.kernel.org/r/20170410084701.11248-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Now that MADV_FREE pages can easily be reclaimed even on swapless
    systems, we can safely enable MADV_FREE for all systems.

    Link: http://lkml.kernel.org/r/155648585589300bfae1d45078e7aebb3d988b87.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When memory pressure is high, we free MADV_FREE pages. If the pages
    are not dirty in the pte, they can be freed immediately; otherwise we
    can't reclaim them, so we put the pages back on the anonymous LRU list
    (by setting the SwapBacked flag) and they will be reclaimed by the
    normal swapout path.

    We use the normal page reclaim policy. Since MADV_FREE pages are put
    into the inactive file list, such pages and inactive file pages are
    reclaimed according to their age. This is expected, because we don't
    want to reclaim too many MADV_FREE pages before used-once file pages.
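    The reclaim decision above can be sketched as a minimal userspace model. The struct and field names are illustrative stand-ins (not struct page), assuming only the two bits of state the commit message describes:

```c
#include <stdbool.h>

/* Simplified model of the reclaim decision described above: a lazyfree
 * (MADV_FREE) page that is still clean can be dropped outright; if it
 * was redirtied, it regains SwapBacked status and goes through normal
 * swapout.  Fields are illustrative, not struct page's. */
struct fake_page {
    bool swap_backed;   /* models PG_swapbacked */
    bool pte_dirty;     /* models pte_dirty() on the mapping pte */
};

/* Returns true if the page was freed immediately. */
static bool try_reclaim_lazyfree(struct fake_page *page)
{
    if (!page->pte_dirty)
        return true;            /* clean lazyfree page: free it now */
    page->swap_backed = true;   /* redirtied: back to the anon LRU */
    return false;
}
```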

    Based on Minchan's original patch

    [minchan@kernel.org: clean up lazyfree page handling]
    Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
    Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Signed-off-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • madvise()'s MADV_FREE indicates pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    them from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them
    once there is memory pressure. Also, while it might be unfair to
    always reclaim MADV_FREE pages before used-once file pages, we
    definitely want to reclaim them before other anonymous and file pages.

    To speed up reclaim of MADV_FREE pages, we put them on the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
    list is tiny nowadays and should be full of used-once file pages, so
    reclaiming MADV_FREE pages there will not interfere much with
    anonymous and active file pages. The inactive file pages and MADV_FREE
    pages will be reclaimed according to their age, so we don't reclaim
    too many MADV_FREE pages either. Putting MADV_FREE pages on the
    LRU_INACTIVE_FILE list also means we can reclaim them without swap
    support. This idea was suggested by Johannes.
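    The marking step described above can be sketched in userspace. The enum, struct, and function here are stand-ins, not kernel definitions, assuming only the two effects the commit message names (clear SwapBacked, target the inactive file LRU):

```c
#include <stdbool.h>

/* Illustrative model of marking a page lazyfree on madvise(MADV_FREE):
 * clear the SwapBacked flag and move the page to the inactive file LRU,
 * where it ages alongside used-once file pages.  All names here are
 * stand-ins for the kernel's. */
enum fake_lru { LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
                LRU_INACTIVE_FILE, LRU_ACTIVE_FILE };

struct fake_page {
    bool swap_backed;   /* models PG_swapbacked */
    enum fake_lru lru;
};

static void mark_page_lazyfree(struct fake_page *page)
{
    page->swap_backed = false;      /* distinguishes it from normal anon */
    page->lru = LRU_INACTIVE_FILE;  /* reclaimable without swap support */
}
```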

    This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
    yet, to avoid a bisect failure; the next patch will do that.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

10 Mar, 2017

1 commit

  • userfaultfd_remove() has to execute before zapping the page tables, or
    UFFDIO_COPY could keep filling pages after zap_page_range() returned,
    which would result in non-zero data after a MADV_DONTNEED.

    However, userfaultfd_remove() may have to release the mmap_sem. This
    was handled correctly for MADV_REMOVE, but MADV_DONTNEED accessed a
    potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

    The fix consists in revalidating the vma in case userfaultfd_remove()
    had to release the mmap_sem.

    This also optimizes away an unnecessary down_read/up_read in the
    MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

    It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
    userfaultfd_remove() will be defined as "true" at build time.
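    The revalidation pattern can be modeled in userspace. This is a toy sketch, not the kernel code: the flag and both function parameters are illustrative stand-ins for "did userfaultfd_remove() drop mmap_sem" and "the vma found by a fresh lookup":

```c
#include <stdbool.h>

/* Toy model of the fix described above.  If userfaultfd_remove() had to
 * drop mmap_sem, the cached vma pointer may be stale and must be looked
 * up again (via find_vma() in the kernel) before zap_page_range() runs.
 * All names here are illustrative. */
struct fake_vma { unsigned long start, end; };

static struct fake_vma *madvise_dontneed_fixed(struct fake_vma *cached,
                                               struct fake_vma *fresh_lookup,
                                               bool mmap_sem_was_dropped)
{
    if (mmap_sem_was_dropped)
        return fresh_lookup;  /* revalidate: never zap a stale vma */
    return cached;            /* lock was never dropped: vma still valid */
}
```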

    Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Feb, 2017

4 commits

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more C files now need to #include
    it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If madvise(2) advice will result in the underlying vma being split and
    the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED, which is often used by malloc implementations to free
    memory back to the system: we really do want that memory freed back,
    so userspace should retry when madvise(2) returns EAGAIN because slab
    allocations (for vmas, anon_vmas, or mempolicies) transiently failed.

    Encountering /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.
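    The error-code policy above can be sketched as a small model. The two boolean parameters and the function name are illustrative; the errno values are the standard Linux ones:

```c
#include <stdbool.h>

/* Sketch of the error-code choice described above: exceeding
 * /proc/sys/vm/max_map_count while splitting a vma is a hard limit, so
 * report -ENOMEM; a failed slab allocation is transient and stays
 * -EAGAIN so userspace retries.  Values are the standard Linux errnos. */
#define EAGAIN 11
#define ENOMEM 12

static int madvise_split_errno(bool exceeded_max_map_count,
                               bool slab_alloc_failed)
{
    if (exceeded_max_map_count)
        return -ENOMEM;   /* hard limit: don't ask userspace to retry */
    if (slab_alloc_failed)
        return -EAGAIN;   /* transient: retry in the near future */
    return 0;
}
```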

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a page is removed from a shared mapping, the uffd reader should be
    notified, so that it won't attempt to handle #PF events for the removed
    pages.

    We can reuse UFFD_EVENT_REMOVE because, from the uffd monitor's point
    of view, the semantics of madvise(MADV_DONTNEED) and
    madvise(MADV_REMOVE) are exactly the same.

    Link: http://lkml.kernel.org/r/1484814154-1557-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hillf Danton
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "userfaultfd: non-cooperative: add madvise() event for
    MADV_REMOVE request".

    These patches add notification of madvise(MADV_REMOVE) event to
    non-cooperative userfaultfd monitor.

    The first patch renames EVENT_MADVDONTNEED to EVENT_REMOVE along with
    the relevant functions and structures. Using _REMOVE instead of
    _MADVDONTNEED describes the event semantics more clearly, and I hope
    it's not too late for such a change in the ABI.

    This patch (of 3):

    The purpose of UFFD_EVENT_MADVDONTNEED is to notify the uffd monitor
    about removal of a certain range from the address space tracked by
    userfaultfd. Hence, UFFD_EVENT_REMOVE better reflects the operation
    semantics. Correspondingly, the 'madv_dn' field of uffd_msg is renamed
    to 'remove' and the madvise_userfault_dontneed callback is renamed to
    userfaultfd_remove.

    Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Feb, 2017

4 commits

  • The logic for whether we can reap pages from a VMA should match what
    we have in madvise_dontneed(). In particular, we should skip VM_PFNMAP
    VMAs, but we don't now.

    Let's extract the condition under which we can shoot down pages from a
    VMA with MADV_DONTNEED into a separate function and use it in both
    places.
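    The extracted predicate can be sketched as follows. The flag bit values mirror the kernel's, but the struct and the exact flag set checked here are reproduced only for illustration:

```c
#include <stdbool.h>

/* Sketch of the extracted helper: both the OOM reaper and
 * madvise(MADV_DONTNEED) must skip locked, hugetlb, and pfn-mapped vmas.
 * Flag values mirror the kernel's but are shown here for illustration
 * only. */
#define VM_PFNMAP  0x00000400ul
#define VM_LOCKED  0x00002000ul
#define VM_HUGETLB 0x00400000ul

struct fake_vma { unsigned long vm_flags; };

static bool can_madv_dontneed_vma(struct fake_vma *vma)
{
    return !(vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP));
}
```

    A single predicate used in both call sites keeps the reaper and madvise_dontneed() from drifting apart again.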

    Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There are no users of zap_page_range() who want a non-NULL 'details'.
    Let's drop it.

    Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • MADV_DONTNEED must be notified to userland before the pages are zapped.

    This allows userland to immediately stop adding pages to the
    userfaultfd ranges before the pages are actually zapped, or there
    could be non-zeropage leftovers as a result of a concurrent
    UFFDIO_COPY running in between zap_page_range and
    madvise_userfault_dontneed (both MADV_DONTNEED and UFFDIO_COPY run
    under the mmap_sem for reading, so they can run concurrently).

    Link: http://lkml.kernel.org/r/20161216144821.5183-15-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If the page is punched out of the address space the uffd reader should
    know this and zeromap the respective area in case of the #PF event.

    Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

13 Dec, 2016

1 commit

  • With commit e77b0852b551 ("mm/mmu_gather: track page size with mmu
    gather and force flush if page size change") we added the ability to
    force a tlb flush when the page size changes in a mmu_gather loop. We
    did that by checking for a page size change every time we added a page
    to mmu_gather for lazy flush/remove. We can improve that by moving the
    page size change check earlier instead of doing it on every page.

    This also helps us do a tlb flush when invalidating a range covering a
    dax mapping. With a dax mapping we don't have a backing struct page
    and hence don't call tlb_remove_page, which earlier forced the tlb
    flush on a page size change. Moving the page size change check earlier
    means we will do the same even for dax mappings.

    We also avoid doing this check on architectures other than powerpc.

    In a later patch we will remove page size check from tlb_remove_page().

    Link: http://lkml.kernel.org/r/20161026084839.27299-5-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Cc: Dan Williams
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

24 May, 2016

1 commit

  • This is a follow-up to the oom_reaper work [1]. As the async OOM
    killing depends on taking oom_sem for read, we would really appreciate
    it if a holder for write didn't stand in the way. This patchset
    changes many of the down_write calls to be killable, to help the cases
    when the writer is blocked waiting for readers to release the lock,
    and so help __oom_reap_task process the oom victim.

    Most of the patches are really trivial, because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow
    the current task to die (note that EINTR will never get to userspace
    as the task has a fatal signal pending). Others seem easy as well, as
    the callers already handle fatal errors and bail out to userspace,
    which should be sufficient to handle the failure gracefully. I am not
    familiar with all those code paths, so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and depends on
    down_write_killable for rw_semaphores, which got merged into the tip
    locking/rwsem branch and is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree, but if the respective maintainers prefer
    another way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.
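    The conversion pattern described in this patch can be modeled in userspace. This is a minimal sketch, assuming a simplified down_write_killable() that fails when a fatal signal is pending instead of sleeping uninterruptibly; all names here are illustrative stand-ins, not the kernel's:

```c
#include <stdbool.h>

/* Models down_write_killable(): fails when a fatal signal is pending
 * instead of blocking on mmap_sem forever. */
enum { EINTR_ = 4 };

static int fake_down_write_killable(bool fatal_signal_pending)
{
    return fatal_signal_pending ? -EINTR_ : 0;
}

/* The syscall-entry pattern the series converts callers to: take the
 * lock killably and return -EINTR on failure, so the dying task gets
 * out of the way of e.g. the oom reaper. */
static int fake_syscall_entry(bool fatal_signal_pending)
{
    if (fake_down_write_killable(fatal_signal_pending))
        return -EINTR_;
    /* ... modify the address space, then up_write() ... */
    return 0;
}
```

    Since the task already has a fatal signal pending when this triggers, the -EINTR never reaches userspace, as the cover letter notes.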

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

2 commits

  • Some new MADV_* advices are not documented in the sys_madvise()
    comment, so let's update it.

    [akpm@linux-foundation.org: modifications suggested by Michal]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jason Baron
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently the return value of memory_failure() is not passed to
    userspace when madvise(MADV_HWPOISON) is used. This is inconvenient for
    test programs that want to know the result of error handling. So let's
    return it to the caller as we already do in the MADV_SOFT_OFFLINE case.

    Signed-off-by: Naoya Horiguchi
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Jan, 2016

1 commit

  • We don't need to split a THP page when the MADV_FREE syscall is called
    if [start, len] is aligned with THP size. The split can be done later,
    when the VM decides to free the page in the reclaim path under heavy
    memory pressure. With that, we avoid unnecessary THP splits.

    For this feature, this patch changes the pte dirtiness marking logic
    of THP. Currently, splitting marks every pte of the page dirty
    unconditionally, which makes MADV_FREE void. So, instead, this patch
    propagates pmd dirtiness to all pages via PG_dirty and restores pte
    dirtiness from PG_dirty. With this, if the pmd is clean (i.e.,
    MADV_FREEed) when a split happens (e.g., in shrink_page_list), all of
    the pages are clean too, so we can discard them.
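    The dirtiness propagation described above can be sketched as a toy userspace model, assuming only the two bits of state the commit message names (the huge pmd's dirty bit and PG_dirty); the struct and function are illustrative, not kernel code:

```c
#include <stdbool.h>

/* Toy model of the split-time dirtiness handling described above:
 * instead of marking every pte dirty unconditionally (which would
 * defeat MADV_FREE), pmd dirtiness is folded into PG_dirty and each
 * pte then inherits it back.  All names are illustrative stand-ins. */
struct fake_thp {
    bool pmd_dirty;   /* dirtiness of the huge pmd */
    bool pg_dirty;    /* models PG_dirty on the page */
};

static void split_huge_pmd_dirty(struct fake_thp *thp, bool *pte_dirty,
                                 int nr_ptes)
{
    if (thp->pmd_dirty)
        thp->pg_dirty = true;          /* propagate pmd -> PG_dirty */
    for (int i = 0; i < nr_ptes; i++)
        pte_dirty[i] = thp->pg_dirty;  /* restore pte dirtiness from PG_dirty */
}
```

    A clean (MADV_FREEed) pmd thus yields all-clean ptes after the split, so reclaim can discard the pages.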

    Signed-off-by: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim