17 Jun, 2020

1 commit

  • [ Upstream commit d4eaa2837851db2bfed572898bfc17f9a9f9151e ]

    For kvmalloc'ed data objects that contain sensitive information such as
    cryptographic keys, we need to make sure that the buffer is always cleared
    before freeing it. Using memset() alone for buffer clearing may not
    provide certainty, as the compiler may optimize it away. To be sure, the
    special memzero_explicit() has to be used.

    This patch introduces a new kvfree_sensitive() for freeing those sensitive
    data objects allocated by kvmalloc(). The relevant places where
    kvfree_sensitive() can be used are modified to use it.
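
    A sketch of the intended semantics (close to the upstream helper, though
    the exact guards may differ):

        void kvfree_sensitive(const void *addr, size_t len)
        {
                if (likely(!ZERO_OR_NULL_PTR(addr))) {
                        /* explicit clear that the compiler cannot elide */
                        memzero_explicit((void *)addr, len);
                        kvfree(addr);
                }
        }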

    Fixes: 4f0882491a14 ("KEYS: Avoid false positive ENOMEM error on key read")
    Suggested-by: Linus Torvalds
    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Eric Biggers
    Acked-by: David Howells
    Cc: Jarkko Sakkinen
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Joe Perches
    Cc: Matthew Wilcox
    Cc: David Rientjes
    Cc: Uladzislau Rezki
    Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     

25 Sep, 2019

5 commits

  • This commit selects ARCH_HAS_ELF_RANDOMIZE when an arch uses the generic
    topdown mmap layout functions so that this security feature is on by
    default.

    Note that this commit also removes the possibility for arm64 to have ELF
    randomization without an MMU: without an MMU, the security added by
    randomization is worth nothing.

    Link: http://lkml.kernel.org/r/20190730055113.23635-6-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • arm64 handles top-down mmap layout in a way that can be easily reused by
    other architectures, so make it available in mm. Then introduce a new
    config, ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT, that other architectures
    can select to benefit from those functions. Note that this new config
    depends on MMU being enabled; if it is selected without MMU support, a
    warning will be thrown.
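
    A sketch of the resulting generic helper in mm/util.c (mmap_is_legacy(),
    mmap_base() and arch_mmap_rnd() follow the arm64 code this was lifted
    from):

        void arch_pick_mmap_layout(struct mm_struct *mm,
                                   struct rlimit *rlim_stack)
        {
                unsigned long random_factor = 0UL;

                if (current->flags & PF_RANDOMIZE)
                        random_factor = arch_mmap_rnd();

                if (mmap_is_legacy(rlim_stack)) {
                        mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
                        mm->get_unmapped_area = arch_get_unmapped_area;
                } else {
                        mm->mmap_base = mmap_base(random_factor, rlim_stack);
                        mm->get_unmapped_area = arch_get_unmapped_area_topdown;
                }
        }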

    Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Suggested-by: Christoph Hellwig
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Patch series "Provide generic top-down mmap layout functions", v6.

    This series introduces generic functions to make top-down mmap layout
    easily accessible to architectures, in particular riscv which was the
    initial goal of this series. The generic implementation was taken from
    arm64 and used successively by arm, mips and finally riscv.

    Note that in addition the series fixes 2 issues:

    - stack randomization was taken into account even if not necessary.

    - [1] fixed an issue with the mmap base, which did not take randomization
    into account, but the fix was not propagated to arm and mips; by moving
    the arm64 code into a generic library, this problem is now fixed for both
    architectures.

    This work is an effort to factorize architecture functions to avoid code
    duplication and oversights as in [1].

    [1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html

    This patch (of 14):

    This preparatory commit moves this function so that further introduction
    of generic topdown mmap layout is contained only in mm/util.c.

    Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Patch series "THP aware uprobe", v13.

    This patchset makes uprobe aware of THPs.

    Currently, when uprobe is attached to text on THP, the page is split by
    FOLL_SPLIT. As a result, uprobe eliminates the performance benefit of
    THP.

    This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduce
    FOLL_SPLIT_PMD, which only splits the PMD for uprobe.

    After all uprobes within the THP are removed, the PTE-mapped pages are
    regrouped into a huge PMD.

    This set (plus a few THP patches) is also available at

    https://github.com/liu-song-6/linux/tree/uprobe-thp

    This patch (of 6):

    Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
    can use them in other files.
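
    For reference, the moved helpers look roughly like this (sketch; this
    mirrors the code as it lived in KSM):

        int memcmp_pages(struct page *page1, struct page *page2)
        {
                char *addr1, *addr2;
                int ret;

                addr1 = kmap_atomic(page1);
                addr2 = kmap_atomic(page2);
                ret = memcmp(addr1, addr2, PAGE_SIZE);
                kunmap_atomic(addr2);
                kunmap_atomic(addr1);
                return ret;
        }

        /* in mm.h */
        static inline int pages_identical(struct page *page1,
                                          struct page *page2)
        {
                return !memcmp_pages(page1, page2);
        }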

    Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
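
    I.e. conversions of the form:

        /* before */
        nr_pages = 1 << compound_order(page);

        /* after */
        nr_pages = compound_nr(page);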

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

17 Jul, 2019

1 commit

  • locked_vm accounting is done roughly the same way in five places, so
    unify them in a helper.

    Include the helper's caller in the debug print to distinguish between
    callsites.

    Error codes stay the same, so user-visible behavior does too. The one
    exception is that the -EPERM case in tce_account_locked_vm is removed
    because Alexey has never seen it triggered.
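
    A sketch of a caller using the new helper (assuming the
    account_locked_vm(mm, pages, inc) signature this series introduces):

        ret = account_locked_vm(current->mm, npages, true);   /* charge */
        if (ret)
                return ret;     /* would exceed RLIMIT_MEMLOCK */

        /* ... pin and use the pages ... */

        account_locked_vm(current->mm, npages, false);        /* uncharge */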

    [daniel.m.jordan@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
    [sfr@canb.auug.org.au: fix mm/util.c]
    Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
    Signed-off-by: Daniel Jordan
    Signed-off-by: Stephen Rothwell
    Tested-by: Alexey Kardashevskiy
    Acked-by: Alex Williamson
    Cc: Alan Tull
    Cc: Alex Williamson
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Christophe Leroy
    Cc: Davidlohr Bueso
    Cc: Jason Gunthorpe
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Moritz Fischer
    Cc: Paul Mackerras
    Cc: Steve Sistare
    Cc: Wu Hao
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     

13 Jul, 2019

1 commit

  • Always build mm/gup.c so that we don't have to provide separate nommu
    stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs
    used when HAVE_FAST_GUP is not set into the main implementations, which
    will never call the fast path in that case.

    This also ensures the new put_user_pages* helpers are available for nommu,
    as those are currently missing, which would create a problem as soon as we
    actually grew users for it.

    Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

02 Jun, 2019

1 commit

  • The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
    semaphore taken.") added synchronization of reading argument/environment
    boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
    arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
    the coarse use of mmap_sem in similar situations. But there still
    remained two places that (mis)use mmap_sem.

    get_cmdline should also use arg_lock instead of mmap_sem when it reads the
    boundaries.

    The second place that should use arg_lock is in prctl_set_mm. By
    protecting the boundaries fields with the arg_lock, we can downgrade
    mmap_sem to reader lock (analogous to what we already do in
    prctl_set_mm_map).
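
    In get_cmdline() the change is essentially (sketch):

        spin_lock(&mm->arg_lock);       /* was: down_read(&mm->mmap_sem) */
        arg_start = mm->arg_start;
        arg_end = mm->arg_end;
        env_start = mm->env_start;
        env_end = mm->env_end;
        spin_unlock(&mm->arg_lock);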

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
    Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
    Signed-off-by: Michal Koutný
    Signed-off-by: Laurent Dufour
    Co-developed-by: Laurent Dufour
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Yang Shi
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Koutný
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
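
    Each affected C source file gains a first line of the form:

        // SPDX-License-Identifier: GPL-2.0-only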

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • With the default overcommit==guess we occasionally run into mmap
    rejections despite plenty of memory that would get dropped under
    pressure but just isn't accounted reclaimable. One example of this is
    dying cgroups pinned by some page cache. A previous case was auxiliary
    path name memory associated with dentries; we have since annotated
    those allocations to avoid overcommit failures (see d79f7aa496fc ("mm:
    treat indirectly reclaimable memory as free in overcommit logic")).

    But trying to classify all allocated memory reliably as reclaimable
    and unreclaimable is a bit of a fool's errand. There could be a myriad
    of dependencies that constantly change with kernel versions.

    It becomes an even more questionable effort when considering how
    this estimate of available memory is used: it's not compared to the
    system-wide allocated virtual memory in any way. It's not even
    compared to the allocating process's address space. It's compared to
    the single allocation request at hand!

    So we have an elaborate left-hand side of the equation that tries to
    assess the exact breathing room the system has available down to a
    page - and then compare it to an isolated allocation request with no
    additional context. We could fail an allocation of N bytes, but for
    two allocations of N/2 bytes we'd do this elaborate dance twice in a
    row and then still let N bytes of virtual memory through. This doesn't
    make a whole lot of sense.

    Let's take a step back and look at the actual goal of the
    heuristic. From the documentation:

    Heuristic overcommit handling. Obvious overcommits of address
    space are refused. Used for a typical system. It ensures a
    seriously wild allocation fails while allowing overcommit to
    reduce swap usage. root is allowed to allocate slightly more
    memory in this mode. This is the default.

    If all we want to do is catch clearly bogus allocation requests
    irrespective of the general virtual memory situation, the physical
    memory counter-part doesn't need to be that complicated, either.

    When in GUESS mode, catch wild allocations by comparing their request
    size to total amount of ram and swap in the system.
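
    The resulting check in __vm_enough_memory() is roughly (sketch):

        if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
                if (pages > totalram_pages() + total_swap_pages)
                        goto error;
                return 0;
        }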

    Link: http://lkml.kernel.org/r/20190412191418.26333-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • To facilitate additional options to get_user_pages_fast(), change the
    singular write parameter to be gup_flags.

    This patch does not change any functionality. New functionality will
    follow in subsequent patches.

    Some of the get_user_pages_fast() call sites were unchanged because they
    already passed FOLL_WRITE or 0 for the write parameter.

    NOTE: It was suggested to change the ordering of the get_user_pages_fast()
    arguments to ensure that callers were converted. This breaks the current
    GUP call site convention of having the returned pages be the final
    parameter. So the suggestion was rejected.
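
    A typical call-site conversion (sketch):

        /* before: boolean write parameter */
        ret = get_user_pages_fast(start, nr_pages, 1, pages);

        /* after: gup_flags parameter */
        ret = get_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);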

    Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Mike Marshall
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Michal Hocko
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

06 Mar, 2019

1 commit

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
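
    The fix adds properly formatted "Return:" sections, e.g. for kstrdup():

        /**
         * kstrdup - allocate space for and copy an existing string
         * @s: the string to duplicate
         * @gfp: the GFP mask used in the kmalloc() call when allocating memory
         *
         * Return: newly allocated copy of @s or %NULL in case of error
         */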

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

22 Feb, 2019

1 commit

  • memdup_user() usually gets fed unchecked userspace input. Blasting a
    full backtrace into dmesg every time is a bit excessive - I'm not sure
    on the kernel rule in general, but at least in drm we're trying not to
    let unprivileged userspace spam the logs freely. Definitely not entire
    warning backtraces.

    It also means more filtering for our CI, because our testsuite exercises
    these corner cases and so hits these a lot.
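
    The fix (sketch) is to pass __GFP_NOWARN so a failed allocation no
    longer warns:

        p = kmalloc_track_caller(len, GFP_USER | __GFP_NOWARN);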

    Link: http://lkml.kernel.org/r/20190220204058.11676-1-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Reviewed-by: Kees Cook
    Cc: Mike Rapoport
    Cc: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Jan Stancek
    Cc: Andrey Ryabinin
    Cc: "Michael S. Tsirkin"
    Cc: Huang Ying
    Cc: Bartosz Golaszewski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vetter
     

09 Jan, 2019

1 commit

  • LTP proc01 testcase has been observed to rarely trigger crashes
    on arm64:
    page_mapped+0x78/0xb4
    stable_page_flags+0x27c/0x338
    kpageflags_read+0xfc/0x164
    proc_reg_read+0x7c/0xb8
    __vfs_read+0x58/0x178
    vfs_read+0x90/0x14c
    SyS_read+0x60/0xc0

    The issue is that page_mapped() assumes that if compound page is not
    huge, then it must be THP. But if this is 'normal' compound page
    (COMPOUND_PAGE_DTOR), then following loop can keep running (for
    HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
    mapped and triggers a panic:

    for (i = 0; i < hpage_nr_pages(page); i++) {
            if (atomic_read(&page[i]._mapcount) >= 0)
                    return true;
    }

    I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
    with a custom kernel module [1] which:
    - allocates compound page (PAGEC) of order 1
    - allocates 2 normal pages (COPY), which are initialized to 0xff (to
    satisfy _mapcount >= 0)
    - 2 PAGEC page structs are copied to address of first COPY page
    - second page of COPY is marked as not present
    - call to page_mapped(COPY) now triggers fault on access to 2nd COPY
    page at offset 0x30 (_mapcount)

    [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c

    Fix the loop to iterate for "1 << compound_order" pages.
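
    I.e. the fixed loop becomes:

        for (i = 0; i < (1 << compound_order(page)); i++) {
                if (atomic_read(&page[i]._mapcount) >= 0)
                        return true;
        }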

    Kirill said "IIRC, sound subsystem can produce custom mapped compound
    pages".

    Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com
    Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
    Signed-off-by: Jan Stancek
    Debugged-by: Laszlo Ersek
    Suggested-by: "Kirill A. Shutemov"
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: David Hildenbrand
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

29 Dec, 2018

1 commit

  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
    better to remove the lock and convert the variables to atomic, with
    preventing potential store-to-read tearing as a bonus.
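
    The accessor becomes roughly:

        extern atomic_long_t _totalram_pages;

        static inline unsigned long totalram_pages(void)
        {
                return (unsigned long)atomic_long_read(&_totalram_pages);
        }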

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

27 Oct, 2018

2 commits

  • vfree() might sleep if called from non-interrupt context, and so might
    kvfree(). Fix kvfree()'s misleading comment about the allowed context.

    Link: http://lkml.kernel.org/r/20180914130512.10394-1-aryabinin@virtuozzo.com
    Fixes: 04b8e946075d ("mm/util.c: improve kvfree() kerneldoc")
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
    commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
    the goal of accounting objects that can be reclaimed, but cannot be
    allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
    kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
    is converted.

    The counter is however still useful for accounting direct page allocations
    (i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
    and:

    - change granularity to pages to be more like other counters; sub-page
    allocations should be able to use kmalloc
    - rename the counter to NR_KERNEL_MISC_RECLAIMABLE
    - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
    again remove the check for not printing "hidden" counters
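
    A shrinker-backed page pool can then account its pages along these lines
    (illustrative sketch; "pool->order" is a hypothetical field of such a
    pool):

        /* page added to the pool */
        mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
                            1 << pool->order);

        /* page removed from the pool */
        mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
                            -(1 << pool->order));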

    Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Vijayanand Jitta
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

24 Aug, 2018

2 commits

  • Link: http://lkml.kernel.org/r/1532626360-16650-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "memory management documentation updates", v3.

    Here are several updates to the mm documentation.

    Aside from really minor changes in the first three patches, the updates
    are:

    * move the documentation of kstrdup and friends to the "String
    Manipulation" section
    * split the memory management API into a separate .rst file
    * adjust formatting of the GFP flags description and include it in the
    reference documentation.

    This patch (of 7):

    The description of the strndup_user function is missing the '*'
    character at the beginning of the comment required for proper kernel-doc.
    Add the missing character.
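
    I.e. the comment must open with the kernel-doc "/**" marker, roughly:

        /**
         * strndup_user - duplicate an existing string from user space
         * @s: the string to duplicate
         * @n: maximum number of bytes to copy, including the trailing NUL
         */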

    Link: http://lkml.kernel.org/r/1532626360-16650-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Jun, 2018

1 commit

  • kvmalloc warned about incompatible gfp_mask to catch abusers (mostly
    GFP_NOFS) with an intention that this will motivate authors of the code
    to fix those. Linus argues that this just motivates people to do even
    more hacks like

    if (gfp == GFP_KERNEL)
            kvmalloc
    else
            kmalloc

    I haven't seen this happening much (Linus pointed to bucket_lock, which
    special-cases an atomic allocation, but my git foo hasn't found much
    more) but it is true that more of these could grow in the future.
    Therefore Linus suggested to simply not fall back to vmalloc for
    incompatible gfp flags and rather stick with the kmalloc path.
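
    The resulting check at the top of kvmalloc_node() is roughly:

        /* vmalloc only supports GFP_KERNEL-compatible contexts */
        if ((flags & GFP_KERNEL) != GFP_KERNEL)
                return kmalloc_node(size, flags, node);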

    Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Linus Torvalds
    Cc: Tom Herbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no contents changes in the documentation, except few spelling
    fixes. The relatively large diffstat stems from the indentation and
    paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I could
    miss some places that needed markup and add some markup where it was not
    necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

14 Apr, 2018

1 commit

  • __get_user_pages_fast handles errors differently from
    get_user_pages_fast: the former always returns the number of pages
    pinned, the latter might return a negative error code.
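
    So callers must handle the return values differently (sketch):

        /* __get_user_pages_fast: returns 0..nr_pages, never negative */
        pinned = __get_user_pages_fast(start, nr_pages, 1, pages);

        /* get_user_pages_fast: may return -errno on failure */
        pinned = get_user_pages_fast(start, nr_pages, 1, pages);
        if (pinned < 0)
                return pinned;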

    Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     

12 Apr, 2018

2 commits

  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.
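
    Concretely, the per-architecture hook gains an rlimit argument so exec
    can pass the pinned value from the bprm (sketch):

        /* was: void arch_pick_mmap_layout(struct mm_struct *mm); */
        void arch_pick_mmap_layout(struct mm_struct *mm,
                                   struct rlimit *rlim_stack);

        /* in setup_new_exec(): */
        arch_pick_mmap_layout(current->mm, &bprm->rlim_stack);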

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Indirectly reclaimable memory can consume a significant part of total
    memory and it's actually reclaimable (it will be released under actual
    memory pressure).

    So, the overcommit logic should treat it as free.

    Otherwise, it's possible to cause random system-wide memory allocation
    failures by consuming a significant amount of memory by indirectly
    reclaimable memory, e.g. dentry external names.

    If overcommit policy GUESS is used, it might be used for denial of
    service attack under some conditions.

    The following program illustrates the approach. It causes the kernel to
    allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
    that at some point the overcommit logic may start blocking large
    allocations system-wide.

    int main()
    {
            char buf[256];
            unsigned long i;
            struct stat statbuf;

            buf[0] = '/';
            for (i = 1; i < sizeof(buf); i++)
                    buf[i] = '_';

            for (i = 0; 1; i++) {
                    sprintf(&buf[248], "%8lu", i);
                    stat(buf, &statbuf);
            }

            return 0;
    }

    This patch in combination with related indirectly reclaimable memory
    patches closes this issue.
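
    The change itself is essentially one line in __vm_enough_memory()
    (sketch; the counter is in bytes, hence the shift):

        free += global_node_page_state(NR_INDIRECTLY_RECLAIMABLE_BYTES)
                        >> PAGE_SHIFT;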

    Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

06 Apr, 2018

1 commit

  • Thanks to commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
    trunks"), after swapoff the address_space associated with the swap
    device will be freed. So page_mapping() users which may touch the
    address_space need some kind of mechanism to prevent the address_space
    from being freed during accessing.

    The dcache flushing functions (flush_dcache_page(), etc.) in
    architecture-specific code may access the address_space of the swap
    device for anonymous pages in swap cache via the page_mapping() function.
    But in some cases there are no mechanisms to prevent the swap device
    from being swapped off, for example,

    CPU1                                CPU2
    __get_user_pages()                  swapoff()
      flush_dcache_page()
        mapping = page_mapping()
        ...                             exit_swap_address_space()
        ...                               kvfree(spaces)
        mapping_mapped(mapping)

    The address space may be accessed after being freed.

    But per cachetlb.txt and Russell King, flush_dcache_page() only cares
    about file cache pages; for anonymous pages, flush_anon_page() should be
    used. The implementation of flush_dcache_page() in all architectures
    follows this too. They will check whether page_mapping() is NULL and
    whether mapping_mapped() is true to determine whether to flush the
    dcache immediately, and they will use the interval tree (mapping->i_mmap)
    to find all user space mappings, while mapping_mapped() and
    mapping->i_mmap aren't used by anonymous pages in swap cache at all.

    So, to fix the race between swapoff and dcache flushing,
    page_mapping_file() is added to return the address_space for file cache
    pages and NULL otherwise. All page_mapping() calls in the dcache flushing
    functions are replaced with page_mapping_file().
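
    The new helper is essentially:

        /*
         * For file cache pages, return the address_space, otherwise NULL
         */
        struct address_space *page_mapping_file(struct page *page)
        {
                if (unlikely(PageSwapCache(page)))
                        return NULL;
                return page_mapping(page);
        }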

    [akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
    Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Chen Liqin
    Cc: Russell King
    Cc: Yoshinori Sato
    Cc: "James E.J. Bottomley"
    Cc: Guan Xuetao
    Cc: "David S. Miller"
    Cc: Chris Zankel
    Cc: Vineet Gupta
    Cc: Ley Foon Tan
    Cc: Ralf Baechle
    Cc: Andi Kleen
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

07 Sep, 2017

1 commit

  • global_page_state is error prone, as a recent bug report pointed out
    [1]. It only returns proper values for zone-based counters, as the enum
    it takes suggests. We already have global_node_page_state, so let's
    rename global_page_state to global_zone_page_state to be more explicit
    here. All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.
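
    A typical conversion (sketch):

        /* before */
        free = global_page_state(NR_FREE_PAGES);

        /* after: same zone-based counter, clearer name */
        free = global_zone_page_state(NR_FREE_PAGES);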

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Aug, 2017

1 commit

  • As Tetsuo points out:
    "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
    node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
    0kB"

    In addition to /proc/meminfo, this problem also affects the slab
    counters in OOM/allocation failure info dumps, can cause early -ENOMEM
    from overcommit protection, and miscalculate image size requirements
    during suspend-to-disk.

    This is because the patch in question switched the slab counters from
    the zone level to the node level, but forgot to update the global
    accessor functions to read the aggregate node data instead of the
    aggregate zone data.

    Use global_node_page_state() to access the global slab counters.
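
    The conversion is of the form (sketch):

        /* before: reads the stale zone-level aggregate (now ~0) */
        slab_reclaimable = global_page_state(NR_SLAB_RECLAIMABLE);

        /* after: read the node-level aggregate the counters moved to */
        slab_reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE);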

    Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
    Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Cc: Stefan Agner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Jul, 2017

1 commit

  • Pull ->s_options removal from Al Viro:
    "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
    gets moved to explicit ->show_options(), killing ->s_options off +
    some cosmetic bits around fs/namespace.c and friends. Basically, the
    stuff needed to work with fsmount series with minimum of conflicts
    with other work.

    It's not strictly required for this merge window, but it would reduce
    the PITA during the coming cycle, so it would be nice to have those
    bits and pieces out of the way"

    * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    isofs: Fix isofs_show_options()
    VFS: Kill off s_options and helpers
    orangefs: Implement show_options
    9p: Implement show_options
    isofs: Implement show_options
    afs: Implement show_options
    affs: Implement show_options
    befs: Implement show_options
    spufs: Implement show_options
    bpf: Implement show_options
    ramfs: Implement show_options
    pstore: Implement show_options
    omfs: Implement show_options
    hugetlbfs: Implement show_options
    VFS: Don't use save/replace_mount_options if not using generic_show_options
    VFS: Provide empty name qstr
    VFS: Make get_filesystem() return the affected filesystem
    VFS: Clean up whitespace in fs/namespace.c and fs/super.c
    Provide a function to create a NUL-terminated string from unterminated data
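
    The last item adds kmemdup_nul() to mm/util.c, essentially:

        char *kmemdup_nul(const char *s, size_t len, gfp_t gfp)
        {
                char *buf;

                if (!s)
                        return NULL;

                buf = kmalloc_track_caller(len + 1, gfp);
                if (buf) {
                        memcpy(buf, s, len);
                        buf[len] = '\0';
                }
                return buf;
        }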

    Linus Torvalds
     

13 Jul, 2017

2 commits

  • Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
    request size we can drop the hackish implementation for !costly orders.
    __GFP_RETRY_MAYFAIL retries as long as the reclaim makes forward
    progress and backs off when we are out of memory for the requested size.
    Therefore we do not need to enforce __GFP_NORETRY for !costly orders just
    to silence the oom killer anymore.
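
    The kmalloc fast path in kvmalloc_node() then reduces to roughly:

        if (size > PAGE_SIZE) {
                kmalloc_flags |= __GFP_NOWARN;

                if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
                        kmalloc_flags |= __GFP_NORETRY;
        }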

    Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
    the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests and they are
    considered too important to fail so they might end up looping in the
    page allocator for ever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
    that the allocator would try really hard but there is no promise of a
    success. This will work independent of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example)

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which doesn't
    even kick background reclaim. Should be used carefully because it might
    deplete the memory and the next user might hit the more aggressive
    reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keep retrying in the page allocator.
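
    A typical user then looks like (sketch):

        /* try hard, but fail rather than trigger the OOM killer */
        buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
        if (!buf)
                return -ENOMEM;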

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Jun, 2017

1 commit

  • While converting the drm_[cm]alloc* helpers to kvmalloc* variants, Chris
    Wilson wondered why we want to try kmalloc before the vmalloc fallback
    even for larger allocation requests. Let's clarify that one larger
    physically contiguous block is less likely to fragment memory than many
    scattered pages, which can prevent more large blocks from being created.
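
    The clarification lands as a comment in kvmalloc_node(), along these
    lines:

        /*
         * We want to attempt a large physically contiguous block first because
         * it is less likely to fragment multiple larger blocks and therefore
         * contribute to a long term fragmentation less than vmalloc fallback.
         * However make sure that larger requests are not too disruptive - no
         * OOM killer and no allocation failure warnings as we have a fallback.
         */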

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170517080932.21423-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Chris Wilson
    Reviewed-by: Chris Wilson
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 May, 2017

1 commit

  • Commit 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users") has
    pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
    turned out to be a bad idea for some architectures. E.g. m68k fails
    with

    In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
    from arch/m68k/include/asm/pgtable.h:4,
    from include/linux/vmalloc.h:9,
    from arch/m68k/kernel/module.c:9:
    arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
    >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
    #define pgd_offset_k(address) pgd_offset(&init_mm, address)

    as spotted by kernel build bot. nios2 fails for other reason

    In file included from include/asm-generic/io.h:767:0,
    from arch/nios2/include/asm/io.h:61,
    from include/linux/io.h:25,
    from arch/nios2/include/asm/pgtable.h:18,
    from include/linux/mm.h:70,
    from include/linux/pid_namespace.h:6,
    from include/linux/ptrace.h:9,
    from arch/nios2/include/uapi/asm/elf.h:23,
    from arch/nios2/include/asm/elf.h:22,
    from include/linux/elf.h:4,
    from include/linux/module.h:15,
    from init/main.c:16:
    include/linux/vmalloc.h: In function '__vmalloc_node_flags':
    include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?

    which is due to the newly added #include <asm/pgtable.h>, which on nios2
    pulls in <linux/io.h> and thus <asm-generic/io.h>, which again includes
    <linux/vmalloc.h>.

    Tweaking that around just turns out to be a bigger headache than necessary.
    This patch reverts 1f5307b1e094 and reimplements the original fix in a
    different way. __vmalloc_node_flags can stay static inline which will
    cover vmalloc* functions. We only have one external user
    (kvmalloc_node) and we can export __vmalloc_node_flags_caller and
    provide the caller directly. This is much simpler and it doesn't really
    need any games with header files.
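
    The exported helper and its one external user look roughly like:

        extern void *__vmalloc_node_flags_caller(unsigned long size, int node,
                                                 gfp_t flags, void *caller);

        /* in kvmalloc_node(), after the kmalloc attempt fails: */
        return __vmalloc_node_flags_caller(size, node, flags,
                                           __builtin_return_address(0));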

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: revert old comment]
    Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
    Fixes: 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users")
    Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Tobias Klauser
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko