09 Jan, 2019

1 commit

  • LTP proc01 testcase has been observed to rarely trigger crashes
    on arm64:
    page_mapped+0x78/0xb4
    stable_page_flags+0x27c/0x338
    kpageflags_read+0xfc/0x164
    proc_reg_read+0x7c/0xb8
    __vfs_read+0x58/0x178
    vfs_read+0x90/0x14c
    SyS_read+0x60/0xc0

    The issue is that page_mapped() assumes that if a compound page is not
    huge, then it must be THP. But if this is a 'normal' compound page
    (COMPOUND_PAGE_DTOR), then the following loop can keep running (for
    HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
    mapped and triggers a panic:

    for (i = 0; i < hpage_nr_pages(page); i++) {
            if (atomic_read(&page[i]._mapcount) >= 0)
                    return true;
    }

    I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
    with a custom kernel module [1] which:
    - allocates a compound page (PAGEC) of order 1
    - allocates 2 normal pages (COPY), which are initialized to 0xff (to
    satisfy _mapcount >= 0)
    - copies the 2 PAGEC page structs to the address of the first COPY page
    - marks the second COPY page as not present
    - a call to page_mapped(COPY) then faults on access to the 2nd COPY
    page at offset 0x30 (_mapcount)

    [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c

    Fix the loop to iterate for "1 << compound_order" pages.
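
    A minimal sketch of that fix (not the verbatim upstream diff): bound
    the scan by the page's actual compound order instead of assuming a
    THP-sized page:

    /* Scan exactly the subpages this compound page covers. */
    for (i = 0; i < (1 << compound_order(page)); i++) {
            if (atomic_read(&page[i]._mapcount) >= 0)
                    return true;
    }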

    Kirill said "IIRC, sound subsystem can produce custom mapped compound
    pages".

    Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com
    Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
    Signed-off-by: Jan Stancek
    Debugged-by: Laszlo Ersek
    Suggested-by: "Kirill A. Shutemov"
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: David Hildenbrand
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

29 Dec, 2018

1 commit

  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemed
    better to remove the lock and convert the variables to atomic, with
    prevention of potential store-to-read tearing as a bonus.
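
    A sketch of the resulting shape, assuming an atomic_long_t backing
    variable (illustrative, not necessarily the exact upstream naming):

    extern atomic_long_t _totalram_pages;

    static inline unsigned long totalram_pages(void)
    {
            /* Plain atomic read; no managed_page_count_lock required. */
            return (unsigned long)atomic_long_read(&_totalram_pages);
    }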

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

27 Oct, 2018

2 commits

  • vfree() might sleep if not called from interrupt context, and so might
    kvfree(). Fix kvfree()'s misleading comment about the allowed context.

    Link: http://lkml.kernel.org/r/20180914130512.10394-1-aryabinin@virtuozzo.com
    Fixes: 04b8e946075d ("mm/util.c: improve kvfree() kerneldoc")
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
    commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
    the goal of accounting objects that can be reclaimed, but cannot be
    allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
    kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
    is converted.

    The counter is however still useful for accounting direct page allocations
    (i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
    and:

    - change granularity to pages to be more like other counters; sub-page
    allocations should be able to use kmalloc
    - rename the counter to NR_KERNEL_MISC_RECLAIMABLE
    - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
    again remove the check for not printing "hidden" counters
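
    For a driver-side consumer such as the ION page pool, accounting under
    the renamed counter would look roughly like this (a sketch; the page
    and order variables are illustrative):

    /* page enters the shrinker-reclaimable pool */
    mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
                        1 << order);
    /* page leaves the pool again */
    mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
                        -(1 << order));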

    Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Vijayanand Jitta
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

24 Aug, 2018

2 commits

  • Link: http://lkml.kernel.org/r/1532626360-16650-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "memory management documentation updates", v3.

    Here are several updates to the mm documentation.

    Aside from really minor changes in the first three patches, the updates
    are:

    * move the documentation of kstrdup and friends to the "String
    Manipulation" section
    * split the memory management API into a separate .rst file
    * adjust the formatting of the GFP flags description and include it in
    the reference documentation.

    This patch (of 7):

    The description of the strndup_user function is missing the '*'
    character at the beginning of the comment that proper kernel-doc
    requires. Add the missing character.

    Link: http://lkml.kernel.org/r/1532626360-16650-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Jun, 2018

1 commit

  • kvmalloc warned about an incompatible gfp_mask to catch abusers (mostly
    GFP_NOFS), with the intention that this would motivate authors of the
    code to fix those. Linus argues that this just motivates people to do
    even more hacks like

    if (gfp == GFP_KERNEL)
            kvmalloc
    else
            kmalloc

    I haven't seen this happening much (Linus pointed to bucket_lock, which
    special-cases an atomic allocation, but my git foo hasn't found much
    more), but it is true that we could grow those in the future. Therefore
    Linus suggested to simply not fall back to vmalloc for incompatible gfp
    flags and rather stick with the kmalloc path.
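
    A minimal sketch of the suggested guard in kvmalloc_node() (the actual
    check may differ in detail):

    /* vmalloc is GFP_KERNEL-only; for more restrictive masks (e.g.
     * GFP_NOFS) skip the fallback and stay on the kmalloc path. */
    if ((flags & GFP_KERNEL) != GFP_KERNEL)
            return kmalloc_node(size, flags, node);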

    Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Linus Torvalds
    Cc: Tom Herbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except a few
    spelling fixes. The relatively large diffstat stems from the
    indentation and paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I may
    have missed some places that needed markup or added markup where it was
    not necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

14 Apr, 2018

1 commit

  • __get_user_pages_fast handles errors differently from
    get_user_pages_fast: the former always returns the number of pages
    pinned, the latter might return a negative error code.
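
    An illustrative caller coping with both conventions (signatures as of
    this era, where 'write' is still a plain int flag):

    int n = get_user_pages_fast(start, nr_pages, 1 /* write */, pages);
    if (n < 0)
            return n;       /* hard error, nothing was pinned */
    if (n < nr_pages) {
            /* partial pin: drop what we got before retrying or bailing */
            while (n--)
                    put_page(pages[n]);
    }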

    Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

12 Apr, 2018

2 commits

  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Indirectly reclaimable memory can consume a significant part of total
    memory and it's actually reclaimable (it will be released under actual
    memory pressure).

    So, the overcommit logic should treat it as free.

    Otherwise, it's possible to cause random system-wide memory allocation
    failures by consuming a significant amount of memory in indirectly
    reclaimable allocations, e.g. dentry external names.

    If the GUESS overcommit policy is used, this might be exploited for a
    denial-of-service attack under some conditions.

    The following program illustrates the approach. It causes the kernel
    to allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
    that at some point the overcommit logic may start blocking large
    allocations system-wide.

    #include <stdio.h>
    #include <sys/stat.h>

    int main()
    {
            char buf[256];
            unsigned long i;
            struct stat statbuf;

            buf[0] = '/';
            for (i = 1; i < sizeof(buf); i++)
                    buf[i] = '_';
            buf[sizeof(buf) - 1] = '\0';

            for (i = 0; 1; i++) {
                    /* overwrite the tail, keeping the NUL inside buf */
                    snprintf(&buf[247], 9, "%8lu", i);
                    stat(buf, &statbuf);
            }

            return 0;
    }

    This patch in combination with related indirectly reclaimable memory
    patches closes this issue.
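
    The overcommit-side change sketched (assuming the byte-granularity
    counter introduced earlier in the series):

    /* __vm_enough_memory(): treat indirectly reclaimable memory as free */
    free += global_node_page_state(NR_INDIRECTLY_RECLAIMABLE_BYTES) >>
            PAGE_SHIFT;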

    Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

06 Apr, 2018

1 commit

  • Thanks to commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
    trunks"), after swapoff the address_space associated with the swap
    device will be freed. So page_mapping() users which may touch the
    address_space need some kind of mechanism to prevent the address_space
    from being freed while accessing it.

    The dcache flushing functions (flush_dcache_page(), etc) in
    architecture-specific code may access the address_space of the swap
    device for anonymous pages in the swap cache via the page_mapping()
    function. But in some cases there is no mechanism to prevent the swap
    device from being swapped off, for example,

    CPU1                                  CPU2
    __get_user_pages()                    swapoff()
      flush_dcache_page()
        mapping = page_mapping()
        ...                               exit_swap_address_space()
        ...                                 kvfree(spaces)
        mapping_mapped(mapping)

    The address space may be accessed after being freed.

    But per cachetlb.txt and Russell King, flush_dcache_page() only cares
    about file cache pages; for anonymous pages, flush_anon_page() should
    be used. The implementation of flush_dcache_page() in all
    architectures follows this too. They check whether page_mapping() is
    NULL and whether mapping_mapped() is true to determine whether to
    flush the dcache immediately, and they use the interval tree
    (mapping->i_mmap) to find all user space mappings. Meanwhile,
    mapping_mapped() and mapping->i_mmap aren't used by anonymous pages in
    the swap cache at all.

    So, to fix the race between swapoff and dcache flushing,
    page_mapping_file() is added to return the address_space for file
    cache pages and NULL otherwise. All page_mapping() calls in the dcache
    flushing functions are replaced with page_mapping_file().
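
    With the simplification noted below, the new helper is essentially
    (a sketch):

    static inline struct address_space *page_mapping_file(struct page *page)
    {
            /* anonymous pages in swap cache get NULL, file pages their
             * mapping */
            if (unlikely(PageSwapCache(page)))
                    return NULL;
            return page_mapping(page);
    }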

    [akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
    Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Chen Liqin
    Cc: Russell King
    Cc: Yoshinori Sato
    Cc: "James E.J. Bottomley"
    Cc: Guan Xuetao
    Cc: "David S. Miller"
    Cc: Chris Zankel
    Cc: Vineet Gupta
    Cc: Ley Foon Tan
    Cc: Ralf Baechle
    Cc: Andi Kleen
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

07 Sep, 2017

1 commit

  • global_page_state is error prone, as a recent bug report pointed out
    [1]. It only returns proper values for zone-based counters, as the
    enum it takes suggests. We already have global_node_page_state so
    let's rename global_page_state to global_zone_page_state to be more
    explicit here. All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Aug, 2017

1 commit

  • As Tetsuo points out:
    "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
    node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
    0kB"

    In addition to /proc/meminfo, this problem also affects the slab
    counters in OOM/allocation failure info dumps, can cause early -ENOMEM
    from overcommit protection, and miscalculates image size requirements
    during suspend-to-disk.

    This is because the patch in question switched the slab counters from
    the zone level to the node level, but forgot to update the global
    accessor functions to read the aggregate node data instead of the
    aggregate zone data.

    Use global_node_page_state() to access the global slab counters.
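
    The gist of the fix (a sketch of the meminfo-style read):

    /* slab counters live at node level now, so sum the node aggregates */
    unsigned long slab = global_node_page_state(NR_SLAB_RECLAIMABLE) +
                         global_node_page_state(NR_SLAB_UNRECLAIMABLE);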

    Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
    Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Cc: Stefan Agner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data

    Linus Torvalds
     

13 Jul, 2017

2 commits

  • Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
    request size we can drop the hackish implementation for !costly orders.
    __GFP_RETRY_MAYFAIL retries as long as the reclaim makes forward
    progress and backs off when we are out of memory for the requested
    size. Therefore we no longer need to enforce __GFP_NORETRY for !costly
    orders just to silence the OOM killer.

    Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT was designed to add a retry-but-eventually-fail semantic
    to the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of the __GFP_REPEAT flag has been removed for !costly requests,
    we can give the original flag a better name and, more importantly, a
    more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which
    tells the user that the allocator will try really hard but there is no
    promise of success. This works independently of the order and
    overrides the default allocator behavior. Page allocator users have
    several levels of guarantee vs. cost options (take GFP_KERNEL as an
    example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which
    doesn't even kick the background reclaim. Should be used carefully
    because it might deplete the memory and the next user might hit the
    more aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context, but it can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    a non-sleeping allocation which can access some portion of memory
    reserves. Usually used from interrupt/bh context with an expensive
    slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because that is the semantic they already relied on. No new users are
    added. __alloc_pages_slowpath is changed to bail out for
    __GFP_RETRY_MAYFAIL if there is no progress and we have already passed
    the OOM point.

    This means that all the reclaim opportunities have been exhausted
    except the most disruptive one (the OOM killer) and a user-defined
    fallback behavior is more sensible than keeping on retrying in the
    page allocator.
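
    A typical call site under the new semantic (illustrative):

    /* try hard, but let the caller handle failure instead of the OOM
     * killer */
    buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
    if (!buf)
            return -ENOMEM; /* caller-defined fallback */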

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Jun, 2017

1 commit

  • While converting the drm_[cm]alloc* helpers to kvmalloc* variants, Chris
    Wilson wondered why we want to try kmalloc before the vmalloc fallback
    even for larger allocation requests. Let's clarify that one larger
    physically contiguous block is less likely to fragment memory than many
    scattered pages, which can prevent more large blocks from being created.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170517080932.21423-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Chris Wilson
    Reviewed-by: Chris Wilson
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 May, 2017

1 commit

  • Commit 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users") pulled
    the asm/pgtable.h include dependency into linux/vmalloc.h and that
    turned out to be a bad idea for some architectures. E.g. m68k fails
    with

    In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
    from arch/m68k/include/asm/pgtable.h:4,
    from include/linux/vmalloc.h:9,
    from arch/m68k/kernel/module.c:9:
    arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
    >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
    #define pgd_offset_k(address) pgd_offset(&init_mm, address)

    as spotted by the kernel build bot. nios2 fails for another reason:

    In file included from include/asm-generic/io.h:767:0,
    from arch/nios2/include/asm/io.h:61,
    from include/linux/io.h:25,
    from arch/nios2/include/asm/pgtable.h:18,
    from include/linux/mm.h:70,
    from include/linux/pid_namespace.h:6,
    from include/linux/ptrace.h:9,
    from arch/nios2/include/uapi/asm/elf.h:23,
    from arch/nios2/include/asm/elf.h:22,
    from include/linux/elf.h:4,
    from include/linux/module.h:15,
    from init/main.c:16:
    include/linux/vmalloc.h: In function '__vmalloc_node_flags':
    include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?

    which is due to the newly added #include <asm/pgtable.h>, which on
    nios2 includes <asm/io.h> and thus <asm-generic/io.h>, which again
    includes <linux/vmalloc.h>.

    Tweaking that around just turns out to be a bigger headache than
    necessary. This patch reverts 1f5307b1e094 and reimplements the
    original fix in a different way. __vmalloc_node_flags can stay static
    inline, which will cover the vmalloc* functions. We only have one
    external user (kvmalloc_node), so we can export
    __vmalloc_node_flags_caller and provide the caller directly. This is
    much simpler and it doesn't really need any games with header files.
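
    A sketch of the reimplementation as described (signature per the text
    above; details may differ from the final patch):

    /* exported out-of-line variant taking the caller explicitly */
    void *__vmalloc_node_flags_caller(unsigned long size, int node,
                                      gfp_t flags, void *caller);

    /* kvmalloc_node() then does roughly: */
    ret = __vmalloc_node_flags_caller(size, node, flags,
                                      __builtin_return_address(0));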

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: revert old comment]
    Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
    Fixes: 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users")
    Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Tobias Klauser
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 May, 2017

3 commits

  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many people are not aware that they really
    want to pass __GFP_HIGHMEM along with the other flags, because there is
    really no reason to consume precious low memory on CONFIG_HIGHMEM
    systems for pages which are mapped to the kernel vmalloc space. About
    half of the users don't use this flag, though. This signals that we
    have made the API unnecessarily complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.
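
    The effect at call sites (illustrative):

    /* before: each caller had to remember the flag */
    ptr = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
    /* after: __GFP_HIGHMEM is implied for the backing page allocations */
    ptr = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);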

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Cristopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The vhost code uses __GFP_REPEAT when allocating vhost_virtqueue and
    vhost_vsock because it would really like to prefer kmalloc to the
    vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
    allocation to vmalloc") for more context. Michael Tsirkin has also
    noted:

    "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
    all accesses are slowed down. Allocation is not on data path, accesses
    are."

    The same applies to the other vhost_kvzalloc users.

    Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are
    two things to be careful about. First, we should protect against the
    OOM killer and so have to involve __GFP_NORETRY by default; secondly,
    we must override __GFP_REPEAT for !costly order requests, as
    __GFP_REPEAT is ignored for !costly orders anyway.
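
    The gist of the kvmalloc_node() change (a sketch close to the logic
    described above):

    if (size > PAGE_SIZE) {
            kmalloc_flags |= __GFP_NOWARN;
            /* suppress retries unless the caller asked to try hard for a
             * costly order, where __GFP_REPEAT actually has an effect */
            if (!(kmalloc_flags & __GFP_REPEAT) ||
                (size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
                    kmalloc_flags |= __GFP_NORETRY;
    }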

    Supporting a __GFP_REPEAT-like semantic for !costly requests is
    possible, but it would require changes in the page allocator. This is
    out of scope of this patch.

    This patch shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Michael S. Tsirkin
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "kvmalloc", v5.

    There are many open-coded kmalloc-with-vmalloc-fallback instances in
    the tree. Most of them are not careful enough or simply do not care
    about the underlying semantics of the kmalloc/page allocator, which
    means that a) some vmalloc fallbacks are basically unreachable because
    the kmalloc part will keep retrying until it succeeds, and b) the page
    allocator can invoke really disruptive steps like the OOM killer to
    move forward, which doesn't sound appropriate when we consider that
    the vmalloc fallback is available.

    As can be seen, implementing kvmalloc requires quite intimate
    knowledge of the page allocator and the memory reclaim internals,
    which strongly suggests that a helper should be implemented in the
    memory subsystem proper.

    Most of the callers I could find have been converted to use the helper
    instead. This is patch 6. There are some more relying on __GFP_REPEAT
    in the networking stack, which I have converted as well, and Eric
    Dumazet was not opposed [2] to converting them.

    [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

    This patch (of 9):

    Using kmalloc with a vmalloc fallback for larger allocations is a
    common pattern in the kernel code. Yet we do not have any common
    helper for it, so users have invented their own helpers. Some of them
    are really creative when doing so. Let's just add kv[mz]alloc and make
    sure it is implemented properly. This implementation makes sure not to
    create large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY)
    and also not to warn about allocation failures. This also rules out
    the OOM killer, as the vmalloc is a more appropriate fallback than a
    disruptive user-visible action.

    This patch also changes some existing users and removes helpers which
    are specific to them. In some cases this is not possible (e.g.
    ext4_kvmalloc, libcfs_kvzalloc) because those seem to be broken and
    require GFP_NO{FS,IO} context, which is not vmalloc compatible in
    general (note that the page table allocation is GFP_KERNEL). Those
    need to be fixed separately.

    While we are at it, document the unsupported gfp mask in
    __vmalloc{_node}, because there seems to be a lot of confusion out
    there. kvmalloc_node will warn about flags incompatible with
    GFP_KERNEL (i.e. not a superset of it) to catch new abusers. Existing
    ones will have to die slowly.
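
    The resulting API surface, roughly (thin wrappers around
    kvmalloc_node):

    void *kvmalloc_node(size_t size, gfp_t flags, int node);

    static inline void *kvmalloc(size_t size, gfp_t flags)
    {
            return kvmalloc_node(size, flags, NUMA_NO_NODE);
    }

    static inline void *kvzalloc(size_t size, gfp_t flags)
    {
            return kvmalloc(size, flags | __GFP_ZERO);
    }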

    [sfr@canb.auug.org.au: f2fs fixup]
    Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Reviewed-by: Andreas Dilger [ext4 part]
    Acked-by: Vlastimil Babka
    Cc: John Hubbard
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Mar, 2017

2 commits

  • We are going to split a new header out of <linux/sched.h>, which will
    have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just maps back to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Feb, 2017

1 commit

  • When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped. The
    addition of UFFD_EVENT_UNMAP allows the uffd monitor to precisely
    track changes in the virtual memory layout.

    Since there might be different uffd contexts for the affected VMAs, we
    first create a temporary representation of the unmap event for each
    uffd context and then notify them one by one through the appropriate
    userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.
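
    An illustrative monitor-side fragment consuming the new event (the
    range is reported via the msg's remove argument; handle_unmap() is a
    hypothetical placeholder):

    struct uffd_msg msg;

    if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
        msg.event == UFFD_EVENT_UNMAP)
            /* stop tracking pages in [start, end) */
            handle_unmap(msg.arg.remove.start, msg.arg.remove.end);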

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

1 commit

  • Asking for a non-current task's stack can't be done without races
    unless the task is frozen in kernel mode. As far as I know,
    vm_is_stack_for_task() never had a safe non-current use case.

    The __unused annotation is because some KSTK_ESP implementations
    ignore their parameter, which IMO is further justification for this
    patch.

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

2 commits

  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
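
    A representative call-site conversion (illustrative):

    /* before: 'write = 1' silently implied FOLL_FORCE */
    ret = access_process_vm(tsk, addr, buf, len, 1);
    /* after: the intent is spelled out by the caller */
    ret = access_process_vm(tsk, addr, buf, len, FOLL_FORCE | FOLL_WRITE);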

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

29 Jul, 2016

1 commit

  • There are now a number of accounting oddities, such as mapped file
    pages being accounted for on the node while the total number of file
    pages is accounted on the zone. This can be coped with to some extent,
    but it's confusing, so this patch moves the relevant file-based
    accounting to the node. Due to throttling logic in the page allocator
    for reliable OOM detection, it is still necessary to track dirty and
    writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Naive approach: on mapping/unmapping the page as compound we update
    ->_mapcount on each 4k page. That's not efficient, but it's not obvious
    how we can optimize this. We can look into optimization later.
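
    The naive scheme in code form (a sketch of the compound file-page rmap
    path):

    /* mapping/unmapping as compound: touch every subpage's _mapcount */
    for (i = 0; i < HPAGE_PMD_NR; i++)
            atomic_inc(&page[i]._mapcount);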

    The PG_double_map optimization doesn't work for file pages since the
    lifecycle of file pages is different compared to anon pages: a file
    page can be mapped again at any time.

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have allowed migration for only LRU pages until now, and it was
    enough to make high-order pages. But recently, embedded systems (e.g.,
    webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory)
    so we have seen several reports about troubles with small high-order
    allocations. To fix the problem, there were several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to 0-order pages,
    reserved memory, vmalloc and so on), but if there are lots of
    non-movable pages in the system, these solutions are void in the long
    run.

    So, this patch is to support a facility to turn non-movable pages into
    movable ones. For the feature, this patch introduces
    migration-related functions in address_space_operations as well as
    some page flags.

    If a driver wants to make its own pages movable, it should define
    three functions, which are function pointers of struct
    address_space_operations (a combined sketch follows item 3 below).

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects of a driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning
    true, the VM marks the page as PG_isolated so concurrent isolation on
    several CPUs skips the page. If a driver cannot isolate the page, it
    should return *false*.

    Once a page is successfully isolated, the VM uses the page.lru fields,
    so the driver shouldn't expect to preserve values in those fields.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the
    isolated page. The job of migratepage is to move the content of the
    old page to the new page and set up the fields of struct page newpage.
    Keep in mind that you should indicate to the VM that the oldpage is no
    longer movable via __ClearPageMovable() under page_lock if you
    migrated the oldpage successfully and return 0. If the driver cannot
    migrate the page at the moment, it can return -EAGAIN. On -EAGAIN,
    the VM will retry page migration in a short time because the VM
    interprets -EAGAIN as "temporary migration failure". On returning any
    error except -EAGAIN, the VM will give up on migrating the page
    without retrying this time.

    The driver shouldn't touch the page.lru field the VM is using in these
    functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's putback_page
    with the migration-failed page. In this function, the driver should
    put the isolated page back into its own data structure.
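
    Pulled together, a driver's wiring might look like this (the foo_*
    names are hypothetical placeholders for the three callbacks described
    above):

    static const struct address_space_operations foo_aops = {
            .isolate_page = foo_isolate_page,       /* 1 */
            .migratepage  = foo_migratepage,        /* 2 */
            .putback_page = foo_putback_page,       /* 3 */
    };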

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    A driver should use the function below to make a page movable under
    page_lock:

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument for registering the migration
    family of functions which will be called by the VM. Strictly
    speaking, PG_movable is not a real flag of struct page. Rather, the
    VM reuses the lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so a driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping, which masks off the low two bits of
    page->mapping so it can get the right struct address_space.

    For testing of non-lru movable pages, the VM supports the __PageMovable
    function. However, it doesn't guarantee to identify a non-lru movable
    page because the page->mapping field is unified with other variables
    in struct page. Also, if the driver releases the page after isolation
    by the VM, page->mapping doesn't have a stable value although it has
    PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But
    __PageMovable is cheap for checking whether a page is LRU or non-lru
    movable once the page has been isolated, because LRU pages can never
    have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just
    peeking to test for non-lru movable pages before the more expensive
    checking with lock_page in pfn scanning to select a victim.

    For guaranteeing a non-lru movable page, the VM provides the
    PageMovable function. Unlike __PageMovable, the PageMovable function
    validates page->mapping and mapping->a_ops->isolate_page under
    lock_page. The lock_page prevents sudden destruction of
    page->mapping.

    Drivers using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters
    a PG_isolated non-lru movable page, it can skip it. The driver doesn't
    need to manipulate the flag because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated
    page, it means the page has been isolated by the VM, so it shouldn't
    touch the page.lru field. PG_isolated is an alias of the PG_reclaim
    flag, so the driver shouldn't use that flag for its own purposes.

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

24 May, 2016

1 commit

  • All the callers of vm_mmap seem to check for failure already and bail
    out in one way or another on error, which means that we can change it
    to use the killable version of vm_mmap_pgoff and return -EINTR if the
    current task gets killed while waiting for mmap_sem. This also means
    that vm_mmap_pgoff can be killable by default and drop the additional
    parameter.

    This will help in OOM conditions when the OOM victim might be stuck
    waiting for the mmap_sem for write, which in turn can block the
    oom_reaper, which relies on the mmap_sem for read to make forward
    progress and reclaim the address space of the victim.

    Please note that load_elf_binary ignores the vm_mmap error in the
    current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
    problem because the address is not used anywhere and we never return
    to userspace if we got killed.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko