21 Sep, 2020

1 commit


20 Sep, 2020

1 commit

  • 5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has
    established that vm_events should count every subpage of a THP, including
    unevictable_pgs_culled and unevictable_pgs_rescued; but
    lru_cache_add_inactive_or_unevictable() was not doing so for
    unevictable_pgs_mlocked, and mm/mlock.c was not doing so for
    unevictable_pgs mlocked, munlocked, cleared and stranded.

    Fix them; but THPs don't go the pagevec way in mlock.c, so no fixes are
    needed on that path. (A sketch of the subpage counting follows this entry.)

    Fixes: 5d91f31faf8e ("mm: swap: fix vmstats for huge page")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Yang Shi
    Cc: Alex Shi
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301408230.5954@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
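    As an illustration of what "count every subpage" means, here is a hedged
    sketch in the spirit of the 5.9-era mlock_vma_page(), not the literal diff
    (page is the page being mlocked):

    if (!TestSetPageMlocked(page)) {
            int nr_pages = thp_nr_pages(page);  /* 1 for small pages, HPAGE_PMD_NR for a THP */

            mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
            count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
    }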
     

17 Aug, 2020

1 commit


15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

24 Jun, 2020

1 commit


10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead (a before/after sketch follows this entry).

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
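    Concretely, the conversion looks like this at a call site. This is an
    illustrative before/after, not a specific hunk from the patch:

    /* before */
    down_write(&mm->mmap_sem);
    ...
    up_write(&mm->mmap_sem);

    /* after */
    mmap_write_lock(mm);
    ...
    mmap_write_unlock(mm);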
     

02 Oct, 2019

1 commit


26 Sep, 2019

1 commit

  • This patch is part of a series that extends the kernel ABI to allow passing
    tagged user pointers (with the top byte set to something other than 0x00)
    as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages. (A userspace sketch follows this
    entry.)

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
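    A minimal userspace sketch, assuming an arm64 kernel with this series
    applied; the tag value 0x56 is arbitrary, and the arm64 tagged-address ABI
    opt-in (prctl(PR_SET_TAGGED_ADDR_CTRL, ...)) is omitted for brevity:

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4096;
            void *p = aligned_alloc(len, len);
            /* set the top byte of the address to a non-zero tag */
            void *tagged = (void *)((uintptr_t)p | (0x56UL << 56));

            /* with tagged-address support this behaves like mlock(p, len) */
            mlock(tagged, len);
            munlock(tagged, len);
            free(p);
            return 0;
    }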
     

17 Jun, 2019

1 commit


14 Jun, 2019

2 commits

  • On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may become
    negative when stored in a 32-bit int, and the result of
    "count >> PAGE_SHIFT" will then be wrong. So change the local variable and
    return value to unsigned long to fix the problem. (A small demonstration
    follows this entry.)

    Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com
    Fixes: 0cf2f6f6dc60 ("mm: mlock: check against vma for actual mlock() size")
    Signed-off-by: swkhack
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    swkhack
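    A small standalone demonstration of the truncation, using hypothetical
    addresses; on an LP64 machine a VMA larger than 2GB makes the int go
    negative:

    #include <stdio.h>

    int main(void)
    {
            unsigned long vm_start = 0x7f0000000000UL;
            unsigned long vm_end   = vm_start + (3UL << 30);        /* a 3GB VMA */

            int bad = vm_end - vm_start;            /* truncated to a negative value */
            unsigned long good = vm_end - vm_start; /* 3221225472, as intended */

            printf("int: %d  unsigned long: %lu\n", bad, good);
            return 0;
    }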
     
  • If mlockall() is called with only MCL_ONFAULT as flag, it removes any
    previously applied lockings and does nothing else.

    This behavior is counter-intuitive and doesn't match the Linux man page.

    For mlockall():

    EINVAL Unknown flags were specified or MCL_ONFAULT was specified
    without either MCL_FUTURE or MCL_CURRENT.

    Consequently, return the error EINVAL if only MCL_ONFAULT is passed. That
    way, applications will at least detect that they are calling mlockall()
    incorrectly. (A userspace check follows this entry.)

    Link: http://lkml.kernel.org/r/20190527075333.GA6339@er01809n.ebgroup.elektrobit.com
    Fixes: b0f205c2a308 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
    Signed-off-by: Stefan Potyra
    Reviewed-by: Daniel Jordan
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Potyra, Stefan
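    A userspace check of the new behaviour (a sketch; the MCL_ONFAULT fallback
    value below is the asm-generic/x86 one and is only an assumption for libcs
    that do not define it):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MCL_ONFAULT
    #define MCL_ONFAULT 4   /* asm-generic/x86 value; some arches differ */
    #endif

    int main(void)
    {
            if (mlockall(MCL_ONFAULT) == -1 && errno == EINVAL)
                    printf("MCL_ONFAULT alone rejected, as the man page documents\n");
            else
                    printf("MCL_ONFAULT alone accepted (pre-fix behaviour)\n");
            return 0;
    }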
     

04 May, 2019

2 commits

  • Change-Id: I4380c68c3474026a42ffa9f95c525f9a563ba7a3

    Todd Kjos
     
  • Userspace processes often have multiple allocators that each do
    anonymous mmaps to get memory. When examining memory usage of
    individual processes or systems as a whole, it is useful to be
    able to break down the various heaps that were allocated by
    each layer and examine their size, RSS, and physical memory
    usage.

    This patch adds a user pointer to the shared union in
    vm_area_struct that points to a null terminated string inside
    the user process containing a name for the vma. vmas that
    point to the same address will be merged, but vmas that
    point to equivalent strings at different addresses will
    not be merged.

    Userspace can set the name for a region of memory by calling
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
    Setting the name to NULL clears it. (A usage sketch follows this entry.)

    The names of named anonymous vmas are shown in /proc/pid/maps
    as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
    that is only present for named vmas. If the userspace pointer
    is no longer valid, all or part of the name will be replaced
    with <fault>.

    The idea to store a userspace pointer to reduce the complexity
    within mm (at the expense of the complexity of reading
    /proc/pid/mem) came from Dave Hansen. This results in no
    runtime overhead in the mm subsystem other than comparing
    the anon_name pointers when considering vma merging. The pointer
    is stored in a union with fields that are only used on file-backed
    mappings, so it does not increase memory usage.

    Includes a fix from Jed Davis for a typo in
    prctl_set_vma_anon_name which could attempt to set the name
    across two vmas at the same time and might corrupt the vma
    list. Fix it to use tmp instead of end to limit the name
    setting to a single vma at a time.

    Bug: 120441514
    Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
    Signed-off-by: Dmitry Shmidt
    [AmitP: Fix get_user_pages_remote() call to align with upstream commit
    5b56d49fc31d ("mm: add locked parameter to get_user_pages_remote()")]
    Signed-off-by: Amit Pundir

    Colin Cross
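    A usage sketch; the PR_SET_VMA constants below match this patch and are
    defined locally as an assumption, since a stock libc of that era does not
    provide them:

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA              0x53564d41
    #define PR_SET_VMA_ANON_NAME    0
    #endif

    int main(void)
    {
            size_t len = 1 << 20;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            /* the string must stay valid: the kernel keeps the user pointer */
            static const char name[] = "my allocator heap";

            prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, (unsigned long)name);
            /* /proc/self/maps now shows this region as [anon:my allocator heap] */
            return 0;
    }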
     

06 Mar, 2019

1 commit

  • We have common pattern to access lru_lock from a page pointer:
    zone_lru_lock(page_zone(page))

    Which is silly, because it unfolds to this:
    &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
    while we can simply do
    &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates things. Use
    the 'page_pgdat(page)->lru_lock' pattern instead (see the before/after
    sketch following this entry).

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
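    The change at a typical call site (an illustrative sketch, with page being
    the page whose LRU lock is taken):

    /* before */
    spin_lock_irq(zone_lru_lock(page_zone(page)));
    ...
    spin_unlock_irq(zone_lru_lock(page_zone(page)));

    /* after */
    spin_lock_irq(&page_pgdat(page)->lru_lock);
    ...
    spin_unlock_irq(&page_pgdat(page)->lru_lock);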
     

18 Aug, 2018

1 commit

  • This patch is reworked from an earlier patch that Dan has posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
    the memory page it is dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult we fallback to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against ndctl unit test. It has also been
    tested against xfstests commit: 625515d using fake pmem created by
    memmap and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

22 Feb, 2018

1 commit

  • When a thread mlocks an address space backed either by file pages which
    are currently not present in memory or swapped out anon pages (not in
    swapcache), a new page is allocated and added to the local pagevec
    (lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
    On I/O completion, the thread can wake on a different CPU; the mlock
    syscall will then set the PageMlocked() bit of the page but will not be
    able to put that page on the unevictable LRU, as the page is on the pagevec
    of a different CPU. Even on drain, that page will go to the evictable LRU
    because the PageMlocked() bit is not checked on pagevec drain.

    The page will eventually go to right LRU on reclaim but the LRU stats
    will remain skewed for a long time.

    This patch puts all the pages, even unevictable, to the pagevecs and on
    the drain, the pages will be added on their LRUs correctly by checking
    their evictability. This resolves the mlocked pages on pagevec of other
    CPUs issue because when those pagevecs will be drained, the mlocked file
    pages will go to unevictable LRU. Also this makes the race with munlock
    easier to resolve because the pagevec drains happen in LRU lock.

    However, there is still one place which makes a page evictable and does the
    PageLRU check on that page without the LRU lock, and it needs special
    attention: TestClearPageMlocked() and isolate_lru_page() in
    clear_page_mlock().

    #0: __pagevec_lru_add_fn            #1: clear_page_mlock

    SetPageLRU()                        if (!TestClearPageMlocked())
                                          return
    smp_mb()
    Acked-by: Vlastimil Babka
    Cc: Jérôme Glisse
    Cc: Huang Ying
    Cc: Tim Chen
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Jan Kara
    Cc: Nicholas Piggin
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

07 Feb, 2018

1 commit

  • so that kernel-doc will properly recognize the parameter and function
    descriptions (the expected comment shape is sketched after this entry).

    Link: http://lkml.kernel.org/r/1516700871-22279-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
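    For reference, the comment shape kernel-doc expects; do_something() and its
    parameter are hypothetical:

    /**
     * do_something - one-line summary of what the function does
     * @arg: description of the parameter
     *
     * Longer description. kernel-doc only parses comments that open with the
     * double-star marker and describe parameters with "@name:" lines.
     *
     * Return: what the caller gets back.
     */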
     

29 Nov, 2017

1 commit


16 Nov, 2017

2 commits

  • lru_add_drain_all() is not required by mlock() and it will drain
    everything that has been cached at the time mlock is called. And that
    is not really related to the memory which will be faulted in (and
    cached) and mlocked by the syscall itself.

    If anything lru_add_drain_all() should be called _after_ pages have been
    mlocked and faulted in but even that is not strictly needed because
    those pages would get to the appropriate LRUs lazily during the reclaim
    path. Moreover follow_page_pte (gup) will drain the local pcp LRU
    cache.

    On larger machines the overhead of lru_add_drain_all() in mlock() can be
    significant when mlocking data already in memory. We have observed high
    latency in mlock() due to lru_add_drain_all() when the users were
    mlocking in memory tmpfs files.

    [mhocko@suse.com: changelog fix]
    Link: http://lkml.kernel.org/r/20171019222507.2894-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Balbir Singh
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Yisheng Xie
    Cc: Ingo Molnar
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere. (A before/after sketch follows this entry.)

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
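    The resulting API change at call sites (an illustrative before/after):

    struct pagevec pvec;

    /* before: every caller passed a "cold" hint nobody consumed */
    pagevec_init(&pvec, 0);

    /* after */
    pagevec_init(&pvec);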
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier (see the example line after this entry). The SPDX
    identifier is a legally binding shorthand which can be used instead of the
    full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
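    For a C source file such as mm/mlock.c, the tag added by this change is a
    single new first line:

    // SPDX-License-Identifier: GPL-2.0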
     

09 Sep, 2017

1 commit

  • page_zone_id() is a specialized function to compare the zone for pages that
    are within the same section range. If the sections of the pages differ,
    page_zone_id() can differ even when their zone is the same. This wrong
    usage doesn't cause any actual problem, since __munlock_pagevec_fill()
    would be called again with the failed index. However, it's better to use a
    more appropriate function here (a sketch of the substitution follows this
    entry).

    Link: http://lkml.kernel.org/r/1503559211-10259-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
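    A sketch of the substitution in __munlock_pagevec_fill(), not the exact
    hunk; zone and zoneid stand for the caller's zone and its id:

    /* before: compares an id that is only meaningful within one section */
    if (!page || page_zone_id(page) != zoneid)
            break;

    /* after: compare the zone itself */
    if (!page || page_zone(page) != zone)
            break;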
     

03 Jun, 2017

1 commit

  • Kefeng reported that when running the following test, the Mlocked count in
    /proc/meminfo increases permanently:

    [1] testcase
    linux:~ # cat test_mlockal
    grep Mlocked /proc/meminfo
    for j in `seq 0 10`
    do
            for i in `seq 4 15`
            do
                    ./p_mlockall >> log &
            done
            sleep 0.2
    done
    # wait some time to let mlock counter decrease and 5s may not enough
    sleep 5
    grep Mlocked /proc/meminfo

    linux:~ # cat p_mlockall.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SPACE_LEN 4096

    int main(int argc, char ** argv)
    {
            int ret;
            void *adr = malloc(SPACE_LEN);
            if (!adr)
                    return -1;

            ret = mlockall(MCL_CURRENT | MCL_FUTURE);
            printf("mlcokall ret = %d\n", ret);

            ret = munlockall();
            printf("munlcokall ret = %d\n", ret);

            free(adr);
            return 0;
    }

    In __munlock_pagevec() we should decrement NR_MLOCK for each page where
    we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch
    NR_MLOCK zone state updates") introduced a bug where we don't
    decrement NR_MLOCK for pages where we clear the flag but fail to
    isolate them from the lru list (e.g. when the pages are on some other
    cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK
    accounting gets permanently disrupted by this.

    Fix it by counting the number of pages whose PageMlocked flag is cleared.
    (A sketch of the idea follows this entry.)

    Fixes: 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates")
    Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Reported-by: Kefeng Wang
    Tested-by: Kefeng Wang
    Cc: Vlastimil Babka
    Cc: Joern Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: zhongjiang
    Cc: Hanjun Guo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
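    A sketch of the idea, not the literal diff: only pages whose PageMlocked
    bit was actually cleared contribute to the NR_MLOCK decrement, whether or
    not LRU isolation succeeds afterwards (pvec and zone are the function's
    pagevec and zone):

    int munlocked = 0;

    for (i = 0; i < pagevec_count(pvec); i++) {
            struct page *page = pvec->pages[i];

            if (TestClearPageMlocked(page))
                    munlocked++;    /* counted even if isolation fails later */
    }
    __mod_zone_page_state(zone, NR_MLOCK, -munlocked);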
     

04 May, 2017

1 commit

  • try_to_munlock returns SWAP_MLOCK if one of the VMAs mapping the page has
    the VM_LOCKED flag. In that case, the VM sets PG_mlocked on the page unless
    the page is a pte-mapped THP, which cannot be mlocked either.

    With that, __munlock_isolated_page can use PageMlocked to check whether
    try_to_munlock succeeded, without relying on try_to_munlock's return
    value. It helps to make try_to_unmap/try_to_unmap_one simple with
    upcoming patches.

    [minchan@kernel.org: remove PG_Mlocked VM_BUG_ON check]
    Link: http://lkml.kernel.org/r/20170411025615.GA6545@bbox
    Link: http://lkml.kernel.org/r/1489555493-14659-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

11 Mar, 2017

1 commit

  • Merge 5-level page table prep from Kirill Shutemov:
    "Here's relatively low-risk part of 5-level paging patchset. Merging it
    now will make x86 5-level paging enabling in v4.12 easier.

    The first patch is actually x86-specific: detect 5-level paging
    support. It boils down to single define.

    The rest of patchset converts Linux MMU abstraction from 4- to 5-level
    paging.

    Enabling the new abstraction in most cases requires adding a single line
    of code in arch-specific code. The rest is taken care of by asm-generic/.

    Changes to mm/ code are mostly mechanical: add support for the new page
    table level -- p4d_t -- where we deal with pud_t now. (A sketch of the
    resulting page-table walk follows this entry.)

    v2:
    - fix build on microblaze (Michal);
    - comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
    - acks from Michal"

    * emailed patches from Kirill A Shutemov :
    mm: introduce __p4d_alloc()
    mm: convert generic code to 5-level paging
    asm-generic: introduce
    arch, mm: convert all architectures to use 5level-fixup.h
    asm-generic: introduce __ARCH_USE_5LEVEL_HACK
    asm-generic: introduce 5level-fixup.h
    x86/cpufeature: Add 5-level paging detection

    Linus Torvalds
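    The extra level shows up in generic code as one more step in the
    page-table walk. A sketch only; mm, addr, and the pgd/p4d/pud/pmd/pte
    locals are the usual walk variables, and on architectures without hardware
    5-level support the p4d level folds back onto the pgd:

    pgd = pgd_offset(mm, addr);
    p4d = p4d_offset(pgd, addr);
    pud = pud_offset(p4d, addr);
    pmd = pmd_offset(pud, addr);
    pte = pte_offset_map(pmd, addr);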
     

10 Mar, 2017

2 commits

  • The following test case triggers BUG() in munlock_vma_pages_range():

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            int fd;

            system("mount -t tmpfs -o huge=always none /mnt");
            fd = open("/mnt/test", O_CREAT | O_RDWR);
            ftruncate(fd, 4UL << 20);
            mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
            mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_LOCKED, fd, 0);
            munlockall();
            return 0;
    }

    The second mmap() creates a PTE-mapping of the first huge page in the file.
    That makes the kernel munlock the page, as we never keep a PTE-mapped page
    mlocked.

    On munlockall(), when we handle the vma created by the first mmap(),
    munlock_vma_page() returns page_mask == 0, as the page is not mlocked
    anymore. On the next iteration follow_page_mask() returns the tail page,
    but page_mask is HPAGE_NR_PAGES - 1. That makes us skip to the first tail
    page of the next huge page and step on
    VM_BUG_ON_PAGE(PageMlocked(page)).

    The fix is to not use the page_mask from follow_page_mask() at all; it has
    no use for us.

    Link: http://lkml.kernel.org/r/20170302150252.34120-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Convert all non-architecture-specific code to 5-level paging.

    It's mostly mechanical: add handling of one more page table level in
    places where we deal with pud_t.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Mar, 2017

1 commit


01 Dec, 2016

1 commit

  • The following program triggers BUG() in munlock_vma_pages_range():

    // autogenerated by syzkaller (http://github.com/google/syzkaller)
    #define _GNU_SOURCE
    #include <sys/mman.h>

    int main()
    {
            mmap((void*)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0);
            mremap((void*)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul, 0x203f0000ul);
            return 0;
    }

    The test case constructs a situation where munlock_vma_pages_range()
    finds a PTE-mapped THP head in the middle of a page table and, by mistake,
    skips HPAGE_PMD_NR pages after that.

    As a result, on the next iteration it hits the middle of a PMD-mapped THP
    and gets upset seeing a mlocked tail page.

    The solution is to skip HPAGE_PMD_NR pages only if the THP was mlocked
    during munlock_vma_page(). That guarantees that the page is PMD-mapped,
    as we never mlock PTE-mapped THPs.

    Fixes: e90309c9f772 ("thp: allow mlocked THP again")
    Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Andrey Ryabinin
    Cc: syzkaller
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Oct, 2016

2 commits

  • When a vma has the VM_LOCKED|VM_LOCKONFAULT flags (set by invoking
    mlock2(, MLOCK_ONFAULT)), it can be populated again with mlock() with the
    VM_LOCKED flag only.

    There is a hole in mlock_fixup() which increases mm->locked_vm twice even
    though the two operations are on the same vma and both carry VM_LOCKED.

    The issue can be reproduced by the following code:

    mlock2(p, 1024 * 64, MLOCK_ONFAULT); //VM_LOCKED|VM_LOCKONFAULT
    mlock(p, 1024 * 64); //VM_LOCKED

    Then check the VmLck field in /proc/pid/status: it grows to 128k.

    When a vma is set with different vm_flags and the new vm_flags include
    VM_LOCKED, it is not necessarily a "new locked" vma. This patch corrects
    the bug by preventing mm->locked_vm from being incremented when the old
    vm_flags already contain VM_LOCKED. (A sketch of the idea follows this
    entry.)

    Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Acked-by: Kirill A. Shutemov
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
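    A sketch of the idea in mlock_fixup(), not the exact hunk; lock means the
    new flags contain VM_LOCKED, old_flags denotes the vma's flags before the
    update, and nr_pages is the vma's page count:

    if (lock) {
            if (!(old_flags & VM_LOCKED))
                    mm->locked_vm += nr_pages;      /* only newly locked pages count */
    } else {
            mm->locked_vm -= nr_pages;
    }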
     
  • In do_mlock(), the check against the locked memory limit has a hole which
    makes the following sequence fail at step 3):

    1) User has a 50k memory chunk starting at addressA, and the user's
    memlock rlimit is 64k.
    2) mlock(addressA, 30k)
    3) mlock(addressA, 40k)

    The 3rd step should have been allowed, since the 40k request intersects
    with the previous 30k from step 2), so the 3rd step actually mlocks only
    the extra 10k of memory.

    This patch checks the vma to calculate the actual "new" mlock size, if
    necessary, and adjusts the logic to fix this issue.

    [akpm@linux-foundation.org: clean up comment layout]
    [wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()]
    Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com
    Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     

29 Jul, 2016

2 commits

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate with the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 May, 2016

1 commit

  • This is follow-up work for the oom_reaper [1]. As the async OOM killing
    depends on oom_sem for read, we would really appreciate it if a holder for
    write didn't stand in the way. This patchset changes many of the
    down_write calls to be killable, to help those cases when the writer is
    blocked waiting for readers to release the lock, and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to userspace as
    the task has a fatal signal pending). Others seem to be easy as well, as
    userspace which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the mmap_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which take the lock early after entering the
    syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR (see the sketch after this entry). This
    allows the waiter to pass away without blocking the mmap_sem, which might
    be required to make forward progress. E.g. the oom reaper will need the
    lock for reading to dismantle the OOM victim's address space.

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
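    The shape of each conversion (an illustrative sketch, not a particular
    hunk):

    /* before */
    down_write(&mm->mmap_sem);

    /* after: bail out if a fatal signal arrives while waiting for the lock */
    if (down_write_killable(&mm->mmap_sem))
            return -EINTR;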
     

22 Jan, 2016

1 commit

  • Tetsuo Handa reported underflow of NR_MLOCK on munlock.

    Testcase:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BASE ((void *)0x400000000000)
    #define SIZE (1UL << 21)

    int main(int argc, char *argv[])
    {
            void *addr;

            system("grep Mlocked /proc/meminfo");
            addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
                        -1, 0);
            if (addr == MAP_FAILED)
                    printf("mmap() failed\n"), exit(1);
            munmap(addr, SIZE);
            system("grep Mlocked /proc/meminfo");
            return 0;
    }

    It happens on munlock_vma_page() due to unfortunate choice of nr_pages
    data type:

    __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

    For unsigned int nr_pages, implicitly cast to long in
    __mod_zone_page_state(), it becomes something around UINT_MAX (a
    standalone demonstration follows this entry).

    munlock_vma_page() is usually called for THPs, as small pages go through
    the pagevec path.

    Let's make nr_pages a signed int.

    Similar fixes in 6cdb18ad98a4 ("mm/vmstat: fix overflow in
    mod_zone_page_state()") used `long' type, but `int' here is OK for a
    count of the number of sub-pages in a huge page.

    Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Cc: Michel Lespinasse
    Acked-by: Michal Hocko
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
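    A standalone demonstration of the underflow on an LP64 machine; 512 stands
    in for HPAGE_PMD_NR:

    #include <stdio.h>

    int main(void)
    {
            unsigned int nr_pages = 512;    /* sub-pages of a 2MB THP */
            long delta = -nr_pages;         /* negation happens in unsigned arithmetic */

            printf("%ld\n", delta);         /* prints 4294966784 on LP64, not -512 */

            int signed_nr = 512;
            printf("%ld\n", (long)-signed_nr);      /* prints -512, as intended */
            return 0;
    }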
     

16 Jan, 2016

3 commits

  • Since can_do_mlock() only returns 1 or 0, make it boolean.

    No functional change.

    [akpm@linux-foundation.org: update declaration in mm.h]
    Signed-off-by: Wang Xiaoqiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Xiaoqiang
     
  • Before THP refcounting rework, THP was not allowed to cross VMA
    boundary. So, if we have THP and we split it, PG_mlocked can be safely
    transferred to small pages.

    With new THP refcounting and naive approach to mlocking we can end up
    with this scenario:
    1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
    2. the process does munlock() on *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - the huge PMD is split into a PTE table;
    - the THP is still mlocked;
    3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether
    they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we have an accounting issue already at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit situation like described above.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With the new refcounting, a THP can belong to several VMAs. This makes it
    tricky to track THP pages when they are partially mlocked. It can lead to
    leaking mlocked pages to non-VM_LOCKED vmas and other problems.

    With this patch we will split all pages on mlock and avoid
    fault-in/collapse new THP in VM_LOCKED vmas.

    I've tried an alternative approach: do not mark THP pages mlocked and keep
    them on the normal LRUs. This way vmscan could try to split huge pages
    under memory pressure and free up subpages which don't belong to VM_LOCKED
    vmas. But this is a user-visible change: we screw up the Mlocked accounting
    reported in meminfo, so I had to leave this approach aside.

    We can bring something better later, but this should be good enough for
    now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit