10 Oct, 2018

1 commit

  • commit d41aa5252394c065d1f04d1ceea885b70d00c9c6 upstream.

    Reproducer, assuming 2M of hugetlbfs available:

    Hugetlbfs mounted, size=2M and option user=testuser

    # mount | grep ^hugetlbfs
    hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
    # sysctl vm.nr_hugepages=1
    vm.nr_hugepages = 1
    # grep Huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 1
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 2048 kB

    Code:

    #include <stddef.h>
    #include <sys/mman.h>

    #define SIZE 2*1024*1024

    int main()
    {
            void *ptr;

            ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
            madvise(ptr, SIZE, MADV_DONTDUMP);
            madvise(ptr, SIZE, MADV_DODUMP);
    }

    Compile and strace:

    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
    madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
    madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

    hugetlbfs pages have VM_DONTEXPAND set in their VmFlags, based on the
    author's testing and on analysis from Florian Weimer [1].

    The inclusion of VM_DONTEXPAND in the VM_SPECIAL definition was a
    consequence of the large usage of VM_DONTEXPAND in device drivers.

    A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be
    marked DODUMP.

    A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
    memory for a while and later request madvise(MADV_DODUMP) on the same
    memory. We correct this omission by allowing madvise(MADV_DODUMP) on
    hugetlbfs pages.

    [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
    [2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")

    Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
    Link: https://lists.launchpad.net/maria-discuss/msg05245.html
    Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")
    Reported-by: Kenneth Penza
    Signed-off-by: Daniel Black
    Reviewed-by: Mike Kravetz
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Daniel Black
     

05 Dec, 2017

1 commit

  • commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91 upstream.

    MADV_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
    Unfortunately madvise_willneed() doesn't communicate this information
    properly to the generic madvise syscall implementation. The calling
    convention is quite subtle there. madvise_vma() is supposed to either
    return an error or update &prev; otherwise the main loop will never
    advance to the next vma and it will keep looping forever without a way
    to get out of the kernel.

    It seems this has been broken since its introduction. Nobody has noticed
    because nobody seems to be using MADV_WILLNEED on these DAX mappings.
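
    A minimal reproducer sketch of the hang described above (the file path
    and the DAX mount are hypothetical; error handling is omitted, and on an
    affected kernel the madvise() call simply never returns):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* hypothetical file on a filesystem mounted with -o dax */
            int fd = open("/mnt/pmem/testfile", O_RDWR);
            char *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

            /* on a broken kernel this loops forever inside the syscall */
            madvise(p, 1 << 20, MADV_WILLNEED);
            return 0;
    }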

    [mhocko@suse.com: rewrite changelog]
    Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
    Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place")
    Signed-off-by: chenjie
    Signed-off-by: guoxuenan
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: zhangyi (F)
    Cc: Miao Xie
    Cc: Mike Rapoport
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Rik van Riel
    Cc: Carsten Otte
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    chenjie
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Oct, 2017

1 commit

  • mm/madvise.c has a brief description about all MADV_ flags. Add a
    description for the newly added MADV_WIPEONFORK and MADV_KEEPONFORK.

    Although the man page has similar information, it'd be better to keep
    this consistent with the other flags.

    Link: http://lkml.kernel.org/r/1506117328-88228-1-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

04 Oct, 2017

1 commit

  • This fixes a bug in madvise(): if you tried to soft offline a hugepage
    via madvise(), then while walking the address range you'd end up using
    the wrong page offset, due to attempting to get the compound order of a
    former but presently not compound page, because the huge page had been
    dissolved (since commit c3114a84f7f9: "mm: hugetlb: soft-offline: dissolve
    source hugepage after successful migration").

    As a result I ended up with all my free pages except one being offlined.

    Link: http://lkml.kernel.org/r/20170912204306.GA12053@gmail.com
    Fixes: c3114a84f7f9 ("mm: hugetlb: soft-offline: dissolve source hugepage after successful migration")
    Signed-off-by: Alexandru Moise
    Cc: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Shaohua Li
    Cc: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     

09 Sep, 2017

1 commit

  • Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for the un-addressable device memory but without all the corner
    cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 Sep, 2017

1 commit

  • Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
    in the child process after fork. This differs from MADV_DONTFORK in one
    important way.

    If a child process accesses memory that was MADV_WIPEONFORK, it will get
    zeroes. The address ranges are still valid, they are just empty.

    If a child process accesses memory that was MADV_DONTFORK, it will get a
    segmentation fault, since those address ranges are no longer valid in
    the child after fork.

    Since MADV_DONTFORK also seems to be used to allow very large programs
    to fork in systems with strict memory overcommit restrictions, changing
    the semantics of MADV_DONTFORK might break existing programs.

    MADV_WIPEONFORK only works on private, anonymous VMAs.

    The use case is libraries that store or cache information, and want to
    know that they need to regenerate it in the child process after fork.

    Examples of this would be:
    - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
    check, which is too slow without a PID cache)
    - PKCS#11 API reinitialization check (mandated by specification)
    - glibc's upcoming PRNG (reseed after fork)
    - OpenSSL PRNG (reseed after fork)

    The security benefits of a forking server having a re-initialized PRNG in
    every child process are pretty obvious. However, due to libraries
    having all kinds of internal state, and programs getting compiled with
    many different versions of each library, it is unreasonable to expect
    calling programs to re-initialize everything manually after fork.

    A further complication is the proliferation of clone flags, programs
    bypassing glibc's functions to call clone directly, and programs calling
    unshare, causing the glibc pthread_atfork hook to not get called.

    It would be better to have the kernel take care of this automatically.

    The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
    MADV_WIPEONFORK.

    This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2
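
    A small userspace sketch of the semantics described above (assuming a
    kernel and libc that already expose MADV_WIPEONFORK; the child sees
    zeroes while the parent keeps its data):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            strcpy(p, "secret");
            /* only valid on private, anonymous VMAs, as noted above */
            if (madvise(p, 4096, MADV_WIPEONFORK))
                    perror("MADV_WIPEONFORK");

            if (fork() == 0) {
                    /* the range is still mapped but reads back as zeroes */
                    printf("child : '%s'\n", p);
                    _exit(0);
            }
            wait(NULL);
            printf("parent: '%s'\n", p);   /* still "secret" */
            return 0;
    }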

    [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
    Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
    Signed-off-by: Rik van Riel
    Reported-by: Florian Weimer
    Reported-by: Colm MacCártaigh
    Reviewed-by: Mike Kravetz
    Cc: "H. Peter Anvin"
    Cc: "Kirill A. Shutemov"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Helge Deller
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Will Drewry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

01 Sep, 2017

1 commit

  • Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
    and bisected it to the commit 479f854a207c ("mm, page_alloc: defer
    debugging checks of pages allocated from the PCP").

    The problem is that a page that was poisoned with madvise() is reused.
    The commit removed a check that would trigger if DEBUG_VM was enabled
    but re-enabling the check only fixes the problem as a side-effect by
    printing a bad_page warning and recovering.

    The root of the problem is that an madvise() can leave a poisoned page
    on the per-cpu list. This patch drains all per-cpu lists after pages
    are poisoned so that they will not be reused. Wendy reports that the
    test case in question passes with this patch applied. While this could
    be done in a targeted fashion, it is over-complicated for such a rare
    operation.

    Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Signed-off-by: Mel Gorman
    Reported-by: Wang, Wendy
    Tested-by: Wang, Wendy
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Hansen, Dave"
    Cc: "Luck, Tony"
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

26 Aug, 2017

1 commit

  • If madvise(..., MADV_FREE) split a transparent hugepage, it called
    put_page() before unlock_page().

    This was wrong because put_page() can free the page, e.g. if a
    concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
    mapping. put_page() then rightfully complained about freeing a locked
    page.

    Fix this by moving the unlock_page() before put_page().

    This bug was found by syzkaller, which encountered the following splat:

    BUG: Bad page state in process syzkaller412798 pfn:1bd800
    page:ffffea0006f60000 count:0 mapcount:0 mapping: (null) index:0x20a00
    flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
    raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
    raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in:
    CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    bad_page+0x230/0x2b0 mm/page_alloc.c:565
    free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
    free_pages_check mm/page_alloc.c:952 [inline]
    free_pages_prepare mm/page_alloc.c:1043 [inline]
    free_pcp_prepare mm/page_alloc.c:1068 [inline]
    free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
    __put_single_page mm/swap.c:79 [inline]
    __put_page+0xfb/0x160 mm/swap.c:113
    put_page include/linux/mm.h:814 [inline]
    madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:108 [inline]
    walk_p4d_range mm/pagewalk.c:134 [inline]
    walk_pgd_range mm/pagewalk.c:160 [inline]
    __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
    walk_page_range+0x200/0x470 mm/pagewalk.c:326
    madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
    madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
    madvise_dontneed_free mm/madvise.c:555 [inline]
    madvise_vma mm/madvise.c:664 [inline]
    SYSC_madvise mm/madvise.c:832 [inline]
    SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Here is a C reproducer:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define MADV_FREE 8
    #define PAGE_SIZE 4096

    static void *mapping;
    static const size_t mapping_size = 0x1000000;

    static void *madvise_thrproc(void *arg)
    {
            madvise(mapping, mapping_size, (long)arg);
            return NULL;
    }

    int main(void)
    {
            pthread_t t[2];

            for (;;) {
                    mapping = mmap(NULL, mapping_size, PROT_WRITE,
                                   MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

                    munmap(mapping + mapping_size / 2, PAGE_SIZE);

                    pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
                    pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
                    pthread_join(t[0], NULL);
                    pthread_join(t[1], NULL);
                    munmap(mapping, mapping_size);
            }
    }

    Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
    CONFIG_DEBUG_VM=y are needed.

    Google Bug Id: 64696096

    Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
    Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
    Signed-off-by: Eric Biggers
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: [v4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                                CPU1
    ----                                ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue although the race is massive as an
    munmap, mmap and return to userspace must all complete between the
    window when reclaim drops the PTL and flushes the TLB. However, it's
    theoretically possible so handle this issue by flushing the mm if
    reclaim is potentially currently batching TLB flushes.

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup racing with reclaim
    look just at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault also
    should be ok as if a parallel reclaim takes place, it will either fault
    the page back in or read some of the data before the flush occurs
    triggering a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Jul, 2017

2 commits

  • MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd
    monitor. The monitor has to stop handling #PF events in the range being
    freed. We are reusing userfaultfd_remove callback along with the logic
    required to re-get and re-validate the VMA which may change or disappear
    because userfaultfd_remove releases mmap_sem.

    Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • For fast flash disk, async IO could introduce overhead because of
    context switch. block-mq now supports IO poll, which improves
    performance and latency a lot. swapin is a good place to use this
    technique, because the task is waiting for the swapin page to continue
    execution.

    In my virtual machine, directly reading 4k data from an NVMe with iopoll
    is about 60% faster than without poll. With iopoll support in swapin,
    my microbenchmark (a task doing random memory writes) is about
    10%~25% faster. CPU utilization increases a lot though, to 2x or even 3x
    CPU utilization. This will depend on disk speed.

    While iopoll in swapin isn't intended for all use cases, it's a win
    for latency-sensitive workloads with a high speed swap disk. The block
    layer has a knob to control poll at runtime. If poll isn't enabled in the
    block layer, there should be no noticeable change in swapin.

    I got a chance to run the same test in a NVMe with DRAM as the media.
    In simple fio IO test, blkpoll boosts 50% performance in single thread
    test and ~20% in 8 threads test. So this is the base line. In above
    swap test, blkpoll boosts ~27% performance in single thread test.
    blkpoll uses 2x CPU time though.

    If we enable hybrid polling, the performance gain drops very slightly
    but CPU time is only 50% worse than that without blkpoll. Also we can
    adjust the parameters of hybrid poll; with them, the CPU time penalty is
    reduced further. In the 8 threads test, blkpoll doesn't help though. The
    performance is similar to that without blkpoll, but cpu utilization is
    similar too. There is lock contention in swap path. The cpu time
    spending on blkpoll isn't high. So overall, blkpoll swapin isn't worse
    than that without it.

    The swapin readahead might read in several pages at the same time and
    form a big IO request. Since the IO will take longer time, it doesn't
    make sense to do poll, so the patch only does iopoll for single page
    swapin.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
    Signed-off-by: Shaohua Li
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Jens Axboe
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 May, 2017

5 commits

  • madvise_behavior_valid() should be called before acting upon the
    behavior parameter, hence move the function up. This also includes the
    MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior parameters
    for the system call madvise().
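
    A hedged userspace sketch of the injection path this validates (needs
    root, a kernel built with CONFIG_MEMORY_FAILURE, and libc headers that
    expose MADV_SOFT_OFFLINE):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            p[0] = 1;                      /* make sure the page is present */
            if (madvise(p, psz, MADV_SOFT_OFFLINE))
                    perror("MADV_SOFT_OFFLINE");
            return 0;
    }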

    Link: http://lkml.kernel.org/r/20170418052844.24891-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Acked-by: David Rientjes
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • This cleans up handling MADV_SOFT_OFFLINE and MADV_HWPOISON called
    through madvise() system call.

    * The name madvise_memory_failure() was misleading, given that it
    accommodates handling of both the memory_failure() and
    soft_offline_page() functions. Basically it handles memory error
    injection from user space, which can go either way as memory failure or
    soft offline. Rename it to madvise_inject_error() instead.

    * Renamed struct page pointer 'p' to 'page'.

    * pr_info() was essentially printing a PFN value but it said 'page',
    which was misleading. Make the process virtual address explicit.

    Before the patch:

    Soft offlining page 0x15e3e at 0x3fff8c230000
    Soft offlining page 0x1f3 at 0x3fffa0da0000
    Soft offlining page 0x744 at 0x3fff7d200000
    Soft offlining page 0x1634d at 0x3fff95e20000
    Soft offlining page 0x16349 at 0x3fff95e30000
    Soft offlining page 0x1d6 at 0x3fff9e8b0000
    Soft offlining page 0x5f3 at 0x3fff91bd0000

    Injecting memory failure for page 0x15c8b at 0x3fff83280000
    Injecting memory failure for page 0x16190 at 0x3fff83290000
    Injecting memory failure for page 0x740 at 0x3fff9a2e0000
    Injecting memory failure for page 0x741 at 0x3fff9a2f0000

    After the patch:

    Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000
    Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000
    Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000
    Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000
    Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000
    Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000
    Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000
    Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000

    Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000
    Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000
    Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000
    Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000
    Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000
    Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000
    Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000

    Link: http://lkml.kernel.org/r/20170410084701.11248-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Now MADV_FREE pages can be easily reclaimed even for swapless system.
    We can safely enable MADV_FREE for all systems.

    Link: http://lkml.kernel.org/r/155648585589300bfae1d45078e7aebb3d988b87.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When memory pressure is high, we free MADV_FREE pages. If the pages are
    not dirty in the pte, the pages can be freed immediately. Otherwise we
    can't reclaim them. We put the pages back on the anonymous LRU list (by
    setting the SwapBacked flag) and the pages will be reclaimed in the
    normal swapout way.

    We use normal page reclaim policy. Since MADV_FREE pages are put into
    inactive file list, such pages and inactive file pages are reclaimed
    according to their age. This is expected, because we don't want to
    reclaim too many MADV_FREE pages before used once pages.

    Based on Minchan's original patch

    [minchan@kernel.org: clean up lazyfree page handling]
    Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
    Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Signed-off-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • madvise()'s MADV_FREE indicates pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    them from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them once
    there is memory pressure. Also, it might be unfair to always reclaim
    MADV_FREE pages before used-once file pages, and we definitely want to
    reclaim these pages before other anonymous and file pages.

    To speed up MADV_FREE pages reclaim, we put the pages into
    LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny
    nowadays and should be full of used once file pages. Reclaiming
    MADV_FREE pages will not interfere much with anonymous and active
    file pages. And the inactive file pages and MADV_FREE pages will be
    reclaimed according to their age, so we don't reclaim too many MADV_FREE
    pages either. Putting the MADV_FREE pages into the LRU_INACTIVE_FILE list
    also
    means we can reclaim the pages without swap support. This idea is
    suggested by Johannes.

    This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
    avoid bisect failure, next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

10 Mar, 2017

1 commit

  • userfaultfd_remove() has to be executed before zapping the pagetables,
    or UFFDIO_COPY could keep filling pages after zap_page_range returned,
    which would result in non-zero data after a MADV_DONTNEED.

    However userfaultfd_remove() may have to release the mmap_sem. This was
    handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
    potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

    The fix consists in revalidating the vma in case userfaultfd_remove()
    had to release the mmap_sem.

    This also optimizes away an unnecessary down_read/up_read in the
    MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

    It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
    userfaultfd_remove() will be defined as "true" at build time.

    Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Feb, 2017

4 commits

  • Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
    linux/mm.h, since they are already provided in linux/shmem_fs.h. But
    shmem_fs.h must then provide the inline stub for shmem_mapping() when
    CONFIG_SHMEM is not set, and a few more .c files now need to #include it.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Simek
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If madvise(2) advice will result in the underlying vma being split and
    the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED which is often used by malloc implementations to free
    memory back to the system: we really do want to free memory back when
    madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
    or mempolicies) cannot be allocated.

    Encountering /proc/sys/vm/max_map_count is not a temporary failure,
    however, so return ENOMEM to indicate this is a more serious issue. A
    followup patch to the man page will specify this behavior.
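
    A sketch of the situation described above (the sizes are hypothetical;
    each madvise() call that changes vm_flags splits the VMA, and once the
    process exceeds /proc/sys/vm/max_map_count the call now fails with
    ENOMEM rather than EAGAIN):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            size_t len = 1UL << 30;                 /* 1 GiB anonymous mapping */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* marking every other page with MADV_DONTFORK keeps splitting VMAs */
            for (size_t off = 0; off + page <= len; off += 2 * page) {
                    if (madvise(p + off, page, MADV_DONTFORK)) {
                            printf("stopped after %zu splits: %s\n",
                                   off / (2 * (size_t)page), strerror(errno));
                            break;
                    }
            }
            return 0;
    }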

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a page is removed from a shared mapping, the uffd reader should be
    notified, so that it won't attempt to handle #PF events for the removed
    pages.

    We can reuse the UFFD_EVENT_REMOVE because from the uffd monitor point
    of view, the semantics of madvise(MADV_DONTNEED) and
    madvise(MADV_REMOVE) are exactly the same.

    Link: http://lkml.kernel.org/r/1484814154-1557-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hillf Danton
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "userfaultfd: non-cooperative: add madvise() event for
    MADV_REMOVE request".

    These patches add notification of madvise(MADV_REMOVE) event to
    non-cooperative userfaultfd monitor.

    The first patch renames EVENT_MADVDONTNEED to EVENT_REMOVE along with
    relevant functions and structures. Using _REMOVE instead of
    _MADVDONTNEED describes the event semantics more clearly, and I hope it's
    not too late for such a change in the ABI.

    This patch (of 3):

    The UFFD_EVENT_MADVDONTNEED purpose is to notify uffd monitor about
    removal of certain range from address space tracked by userfaultfd.
    Hence, UFFD_EVENT_REMOVE seems to better reflect the operation
    semantics. Respectively, 'madv_dn' field of uffd_msg is renamed to
    'remove' and the madvise_userfault_dontneed callback is renamed to
    userfaultfd_remove.

    Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Feb, 2017

4 commits

  • Logic on whether we can reap pages from the VMA should match what we
    have in madvise_dontneed(). In particular, we should skip VM_PFNMAP
    VMAs, but we don't now.

    Let's just extract the condition under which we can shoot down pages
    from a VMA with MADV_DONTNEED into a separate function and use it in
    both places.

    Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • There are no users of zap_page_range() who want a non-NULL 'details'.
    Let's drop it.

    Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • MADV_DONTNEED must be notified to userland before the pages are zapped.

    This allows userland to immediately stop adding pages to the userfaultfd
    ranges before the pages are actually zapped or there could be
    non-zeropage leftovers as a result of a concurrent UFFDIO_COPY run in
    between zap_page_range and madvise_userfault_dontneed (both
    MADV_DONTNEED and UFFDIO_COPY runs under the mmap_sem for reading, so
    they can run concurrently).

    Link: http://lkml.kernel.org/r/20161216144821.5183-15-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If the page is punched out of the address space the uffd reader should
    know this and zeromap the respective area in case of the #PF event.

    Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

13 Dec, 2016

1 commit

  • With commit e77b0852b551 ("mm/mmu_gather: track page size with mmu
    gather and force flush if page size change") we added the ability to
    force a tlb flush when the page size changes in a mmu_gather loop. We
    did that by checking for a page size change every time we added a page
    to mmu_gather for lazy flush/remove. We can improve that by moving the
    page size change check early and not doing it every time we add a page.

    This also helps us to do tlb flush when invalidating a range covering
    dax mapping. Wrt dax mapping we don't have a backing struct page and
    hence we don't call tlb_remove_page, which earlier forced the tlb flush
    on page size change. Moving the page size change check earlier means we
    will do the same even for dax mapping.

    We also avoid doing this check on architectures other than powerpc.

    In a later patch we will remove page size check from tlb_remove_page().

    Link: http://lkml.kernel.org/r/20161026084839.27299-5-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Cc: Dan Williams
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

24 May, 2016

1 commit

  • This is a follow-up work for the oom_reaper [1]. As the async OOM
    killing depends on oom_sem for read, we would really appreciate it if a
    holder for write didn't stand in the way. This patchset changes many of
    the down_write calls to be killable to help those cases when the writer
    is blocked and waiting for readers to release the lock, and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to the userspace as
    the task has fatal signal pending). Others seem to be easy as well as
    the callers are already handling fatal errors and bail and return to
    userspace which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

2 commits

  • Some new MADV_* advices are not documented in the sys_madvise() comment,
    so let's update it.

    [akpm@linux-foundation.org: modifications suggested by Michal]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jason Baron
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently the return value of memory_failure() is not passed to
    userspace when madvise(MADV_HWPOISON) is used. This is inconvenient for
    test programs that want to know the result of error handling. So let's
    return it to the caller as we already do in the MADV_SOFT_OFFLINE case.
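
    A sketch of how a test program can now observe the result (needs
    CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE; the reported errno is whatever
    memory_failure() returned):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            p[0] = 1;                                /* fault the page in */
            if (madvise(p, psz, MADV_HWPOISON))
                    /* with this patch the error reflects memory_failure()'s result */
                    fprintf(stderr, "MADV_HWPOISON: %s\n", strerror(errno));
            else
                    puts("page poisoned");
            return 0;
    }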

    Signed-off-by: Naoya Horiguchi
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Jan, 2016

4 commits

  • We don't need to split THP page when MADV_FREE syscall is called if
    [start, len] is aligned with THP size. The split could be done when VM
    decide to free it in reclaim path if memory pressure is heavy. With
    that, we could avoid unnecessary THP split.

    For the feature, this patch changes the pte dirtiness marking logic of
    THP. Now, it marks every pte of the pages dirty unconditionally in
    splitting, which makes MADV_FREE void. So, instead, this patch propagates
    pmd dirtiness to all pages via PG_dirty and restores pte dirtiness from
    PG_dirty. With this, if the pmd is clean (i.e., MADV_FREEed) when a split
    happens (e.g., in shrink_page_list), all of the pages are clean too, so
    we can discard them.

    Signed-off-by: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • MADV_FREE is a hint that it's okay to discard pages if there is memory
    pressure, and we use reclaimers (i.e., kswapd and direct reclaim) to free
    them, so there is no value in keeping them on the active anonymous LRU
    list; this patch moves them to the inactive LRU list's head.

    This means that MADV_FREE-ed pages which were living on the inactive
    list are reclaimed first because they are more likely to be cold rather
    than recently active pages.

    An arguable issue for the approach is whether we should put the page at
    the head or the tail of the inactive list. I chose head because the
    kernel cannot make sure it's really cold or warm for every MADV_FREE
    usecase, but at least we know it's not *hot*, so landing at the inactive
    head would be a compromise for various usecases.

    This fixes suboptimal behavior of MADV_FREE when pages living on the
    active list will sit there for a long time even under memory pressure
    while the inactive list is reclaimed heavily. This basically breaks the
    whole purpose of using MADV_FREE to help the system to free memory which
    might not be used.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Kerrisk
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When I tested the below piece of code with 12 processes (i.e., 512M * 12
    = 6G consumed) on my machine (3G RAM + 12 CPUs + 8G swap), madvise_free
    was significantly slower (about 2x) than madvise_dontneed.

    loop = 5;
    mmap(512M);
    while (loop--) {
            memset(512M);
            madvise(MADV_FREE or MADV_DONTNEED);
    }

    The reason is lots of swapin.

    1) dontneed: 1,612 swapin
    2) madvfree: 879,585 swapin

    If we find hinted pages were already swapped out when syscall is called,
    it's pointless to keep the swapped-out pages in pte. Instead, let's
    free the cold page because swapin is more expensive than (alloc page +
    zeroing).

    With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time

    1) dontneed: 6.10user 233.50system 0:50.44elapsed
    2) madvfree: 6.03user 401.17system 1:30.67elapsed
    2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
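
    A runnable version of the benchmark loop above (a sketch; MADV_FREE needs
    Linux 4.5+ and may be missing from older libc headers, hence the fallback
    define):

    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8                  /* uapi value, for older headers */
    #endif

    #define SIZE (512UL << 20)           /* 512M */

    int main(void)
    {
            int loop = 5;
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            while (loop--) {
                    memset(p, 1, SIZE);
                    madvise(p, SIZE, MADV_FREE); /* or MADV_DONTNEED */
            }
            return 0;
    }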

    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Linux doesn't have the ability to free pages lazily, while other OSes
    have already supported it via madvise(MADV_FREE).

    The gain is clear: the kernel can discard freed pages rather than
    swapping them out or OOMing if memory pressure happens.

    Without memory pressure, freed pages would be reused by userspace
    without additional overhead (e.g., page fault + allocation + zeroing).

    Jason Evans said:

    : Facebook has been using MAP_UNINITIALIZED
    : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
    : several years, but there are operational costs to maintaining this
    : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
    : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
    : increased throughput for much of our workload by ~5%, and although the
    : benefit has decreased using newer hardware and kernels, there is still
    : enough benefit that we cannot reasonably retire it without a replacement.
    :
    : Aside from Facebook operations, there are numerous broadly used
    : applications that would benefit from MADV_FREE. The ones that immediately
    : come to mind are redis, varnish, and MariaDB. I don't have much insight
    : into Android internals and development process, but I would hope to see
    : MADV_FREE support eventually end up there as well to benefit applications
    : linked with the integrated jemalloc.
    :
    : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
    : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
    : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
    : (and AIX, but I'm not sure it even compiles on AIX). The lack of
    : MADV_FREE on Linux forced me down a long series of increasingly
    : sophisticated heuristics for madvise() volume reduction, and even so this
    : remains a common performance issue for people using jemalloc on Linux.
    : Please integrate MADV_FREE; many people will benefit substantially.

    How it works:

    When the madvise syscall is called, the VM clears the dirty bit of the
    ptes in the range. If memory pressure happens, the VM checks the dirty
    bit in the page table and, if it finds it still "clean", it means it's a
    "lazyfree" page, so the VM can discard the page instead of swapping it
    out. If there was a store operation to the page before the VM picks the
    page to reclaim, the dirty bit is set, so the VM swaps out the page
    instead of discarding it.

    One thing we should notice is that, basically, MADV_FREE relies on the
    dirty bit in the page table entry to decide whether the VM is allowed to
    discard the page or not. IOW, if the page table entry has the dirty bit
    set, the VM shouldn't discard the page.

    However, as an example, if swap-in by read fault happens, the page table
    entry doesn't have the dirty bit set, so MADV_FREE could discard the page
    wrongly.

    To avoid the problem, MADV_FREE did more checks with PageDirty and
    PageSwapCache. It worked out because a swapped-in page lives in the swap
    cache, and once it is evicted from the swap cache, the page has the
    PG_dirty flag. So both page flag checks effectively prevent wrong
    discarding by MADV_FREE.

    However, a problem with the above logic is that a swapped-in page still
    has PG_dirty after it is removed from the swap cache, so the VM cannot
    consider the page freeable any more even if madvise_free is called in
    the future.

    Look at below example for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
    swapcache. Then, page table doesn't mark
    dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

    To solve the problem, this patch clears PG_dirty only if the page is
    owned exclusively by the current process when madvise is called, because
    PG_dirty represents the ptes' dirtiness in several processes, so we can
    clear it only if we own it exclusively.

    Firstly, heavy users would be general allocators (e.g., jemalloc,
    tcmalloc, and hopefully glibc supports it too), and jemalloc/tcmalloc
    already support the feature on other OSes (e.g., FreeBSD).

    barrios@blaptop:~/benchmark/ebizzy$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 12
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 2
    Stepping: 3
    CPU MHz: 3200.185
    BogoMIPS: 6400.53
    Virtualization: VT-x
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 4096K
    NUMA node0 CPU(s): 0-11
    ebizzy benchmark(./ebizzy -S 10 -n 512)

    Higher avg is better.

    vanilla-jemalloc MADV_free-jemalloc

    1 thread
    records: 10 records: 10
    avg: 2961.90 avg: 12069.70
    std: 71.96(2.43%) std: 186.68(1.55%)
    max: 3070.00 max: 12385.00
    min: 2796.00 min: 11746.00

    2 thread
    records: 10 records: 10
    avg: 5020.00 avg: 17827.00
    std: 264.87(5.28%) std: 358.52(2.01%)
    max: 5244.00 max: 18760.00
    min: 4251.00 min: 17382.00

    4 thread
    records: 10 records: 10
    avg: 8988.80 avg: 27930.80
    std: 1175.33(13.08%) std: 3317.33(11.88%)
    max: 9508.00 max: 30879.00
    min: 5477.00 min: 21024.00

    8 thread
    records: 10 records: 10
    avg: 13036.50 avg: 33739.40
    std: 170.67(1.31%) std: 5146.22(15.25%)
    max: 13371.00 max: 40572.00
    min: 12785.00 min: 24088.00

    16 thread
    records: 10 records: 10
    avg: 11092.40 avg: 31424.20
    std: 710.60(6.41%) std: 3763.89(11.98%)
    max: 12446.00 max: 36635.00
    min: 9949.00 min: 25669.00

    32 thread
    records: 10 records: 10
    avg: 11067.00 avg: 34495.80
    std: 971.06(8.77%) std: 2721.36(7.89%)
    max: 12010.00 max: 38598.00
    min: 9002.00 min: 30636.00

    In summary, MADV_FREE is much faster than MADV_DONTNEED.

    This patch (of 12):

    Add core MADV_FREE implementation.

    [akpm@linux-foundation.org: small cleanups]
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: Mika Penttil
    Cc: Michael Kerrisk
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Jason Evans
    Cc: Daniel Micay
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: "Shaohua Li"
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

09 Sep, 2015

1 commit

  • Now that we have hole punching support for hugetlbfs, we can also
    support the MADV_REMOVE interface to it.
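
    A userspace sketch of the new capability (the mount point and the 2M
    huge page size are assumptions; MADV_REMOVE on the mapping punches a
    hole in the underlying hugetlbfs file):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HPAGE (2UL << 20)            /* assume 2M huge pages */

    int main(void)
    {
            /* hypothetical file on a hugetlbfs mount */
            int fd = open("/dev/hugepages/file", O_CREAT | O_RDWR, 0600);
            char *p = mmap(NULL, 4 * HPAGE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

            p[0] = 1;                    /* fault in the first huge page */
            if (madvise(p, HPAGE, MADV_REMOVE))
                    perror("MADV_REMOVE"); /* fails on kernels without this support */

            munmap(p, 4 * HPAGE);
            close(fd);
            return 0;
    }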

    Signed-off-by: Dave Hansen
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

05 Sep, 2015

2 commits

  • This makes the madvise_behavior_valid() function return bool, since this
    particular function only ever returns a value of one or zero.

    Signed-off-by: Nicholas Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Krause
     
  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
    must be aware about so that we can merge vmas back like they were
    originally before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

02 Jun, 2015

1 commit

  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Feb, 2015

1 commit

  • All callers of get_xip_mem() are now gone. Remove checks for it,
    initialisers of it, documentation of it and the only implementation of it.
    Also remove mm/filemap_xip.c as it is now empty. Also remove
    documentation of the long-gone get_xip_page().

    Signed-off-by: Matthew Wilcox
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox