17 Oct, 2020

1 commit

  • Memory allocated with kstrdup_const() must not be passed to regular
    krealloc() as it is not aware of the possibility of the chunk residing in
    .rodata. Since there are no potential users of krealloc_const() at the
    moment, let's just update the doc to make it explicit.
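
    A minimal sketch of the hazard (kfree_const() is the matching free
    helper; details simplified):

    /*
     * kstrdup_const() may return a pointer into .rodata instead of a
     * heap copy when the source is a compile-time constant.
     */
    const char *label = kstrdup_const("fixed-label", GFP_KERNEL);

    /* WRONG: krealloc() assumes a kmalloc'ed chunk and would try to
     * resize (and eventually kfree()) .rodata: */
    /* label = krealloc(label, 32, GFP_KERNEL); */

    /* RIGHT: release with the matching helper, which detects .rodata. */
    kfree_const(label);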

    Signed-off-by: Bartosz Golaszewski
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200817173927.23389-1-brgl@bgdev.pl
    Signed-off-by: Linus Torvalds

    Bartosz Golaszewski
     

04 Sep, 2020

1 commit

  • When the Memory Tagging Extension is enabled, two pages are identical
    only if both their data and tags are identical.

    Make the generic memcmp_pages() a __weak function and add an
    arm64-specific implementation which returns non-zero if either of the
    two pages contains valid MTE tags (PG_mte_tagged set). There isn't much
    benefit in comparing the tags of two pages, since these are normally
    used for heap allocations and are likely to differ anyway.
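
    In sketch form (simplified; the generic version lives in mm/util.c and
    the override in arch/arm64 code, and the exact bodies may differ):

    /* Generic version: __weak, so an architecture may override it. */
    int __weak memcmp_pages(struct page *page1, struct page *page2)
    {
            char *addr1, *addr2;
            int ret;

            addr1 = kmap_atomic(page1);
            addr2 = kmap_atomic(page2);
            ret = memcmp(addr1, addr2, PAGE_SIZE);
            kunmap_atomic(addr2);
            kunmap_atomic(addr1);
            return ret;
    }

    /* arm64 override (sketch): even if the data is identical, report a
     * difference as soon as either page carries valid MTE tags. */
    int memcmp_pages(struct page *page1, struct page *page2)
    {
            int ret = /* plain data comparison as above */;

            if (ret == 0 && (test_bit(PG_mte_tagged, &page1->flags) ||
                             test_bit(PG_mte_tagged, &page2->flags)))
                    ret = 1;        /* tagged pages are never identical */
            return ret;
    }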

    Co-developed-by: Vincenzo Frascino
    Signed-off-by: Vincenzo Frascino
    Signed-off-by: Catalin Marinas
    Cc: Will Deacon

    Catalin Marinas
     

08 Aug, 2020

3 commits

  • The current split between do_mmap() and do_mmap_pgoff() was introduced in
    commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
    do_mmap_pgoff()") to support MPX.

    The wrapper function do_mmap_pgoff() always passed 0 as the value of the
    vm_flags argument to do_mmap(). However, MPX support has subsequently
    been removed from the kernel and there were no more direct callers of
    do_mmap(); all calls were going via do_mmap_pgoff().

    Simplify the code by removing do_mmap_pgoff() and changing all callers to
    directly call do_mmap(), which now no longer takes a vm_flags argument.
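
    In sketch form (argument lists abbreviated here; the real functions
    take several more parameters):

    /* Before: the wrapper existed only to zero-fill vm_flags. */
    static inline unsigned long
    do_mmap_pgoff(struct file *file, unsigned long addr /* , ... */)
    {
            return do_mmap(file, addr, /* ..., */ 0 /* vm_flags */);
    }

    /* After: do_mmap() drops the vm_flags parameter and the former
     * do_mmap_pgoff() callers invoke do_mmap() directly. */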

    Signed-off-by: Peter Collingbourne
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
    Signed-off-by: Linus Torvalds

    Peter Collingbourne
     
  • While checking a performance change with the will-it-scale mmap
    scalability test [1], we found very high contention on the spinlock of
    the percpu counter 'vm_committed_as':

    94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

    This heavy lock contention is not always necessary. 'vm_committed_as'
    only needs to be very precise when the strict OVERCOMMIT_NEVER policy
    is set, which requires a rather small batch number for the percpu
    counter.

    So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
    policy, and lift it to 64x for the OVERCOMMIT_ALWAYS and
    OVERCOMMIT_GUESS policies. Also add a sysctl handler to adjust it when
    the policy is reconfigured.

    A benchmark with the same test case as in [1] shows a 53% improvement
    on an 8C/16T desktop and a 2097% (20x) improvement on a 4S/72C/144T
    server. We tested on the 0day test platforms (server, desktop and
    laptop), and more than 80% of them showed improvements with this test.
    Whether a platform improves depends on whether the test's mmap size
    exceeds the computed batch number.

    If the lift is only 16x, a third of the platforms show improvements,
    though the change should help mmap/munmap usage generally, as Michal
    Hocko noted:

    : I believe that there are non-synthetic workloads which would benefit from
    : a larger batch. E.g. large in memory databases which do large mmaps
    : during startups from multiple threads.

    [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
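
    A hedged sketch of the policy-dependent batch (the helper name and the
    exact formula are illustrative; the 64x factor is from the description
    above):

    /* Hypothetical sketch: pick vm_committed_as's percpu batch from the
     * overcommit policy. */
    static s32 compute_vm_committed_as_batch(int overcommit_policy)
    {
            if (overcommit_policy == OVERCOMMIT_NEVER)
                    return percpu_counter_batch;       /* stay precise */
            return percpu_counter_batch * 64;          /* loose policies */
    }

    The sysctl handler then recomputes the batch whenever
    vm.overcommit_memory is written.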

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Kees Cook
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Haiyang Zhang
    Cc: kernel test robot
    Cc: "K. Y. Srinivasan"
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • percpu_counter_sum_positive() provides a more accurate value.

    With percpu_counter_read_positive(), the worst-case deviation can be
    'batch * nr_cpus', which is totalram_pages/256 for now and will grow
    as the batch is enlarged.

    The time cost of the sum is about 800 nanoseconds on a 2C/4T platform
    and 2~3 microseconds on a 2S/36C/72T Skylake server in the normal
    case; in the worst case, where vm_committed_as's spinlock is under
    severe contention, it costs 30~40 microseconds on the same Skylake
    server. That should be fine for its only two users: /proc/meminfo and
    the Hyper-V balloon driver's once-per-second status trace.
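
    In sketch form (vm_memory_committed() is the mm/util.c accessor these
    users go through; body simplified):

    unsigned long vm_memory_committed(void)
    {
            /* Before: percpu_counter_read_positive() just reads the
             * shared count - fast, but off by up to batch * nr_cpus. */

            /* After: sum the per-cpu deltas under the spinlock; slower
             * (hundreds of ns to tens of us) but accurate. */
            return percpu_counter_sum_positive(&vm_committed_as);
    }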

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko # for /proc/meminfo
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Kees Cook
    Cc: kernel test robot
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1592725000-73486-3-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-3-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Add new APIs to assert that mmap_sem is held.

    Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
    makes the assertions more tolerant of future changes to the lock type.
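
    The new assertions look roughly like this (sketch; the exact bodies in
    the kernel may differ, and the field is named mmap_lock only after the
    rename later in this series):

    static inline void mmap_assert_locked(struct mm_struct *mm)
    {
            lockdep_assert_held(&mm->mmap_lock);
            VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
    }

    static inline void mmap_assert_write_locked(struct mm_struct *mm)
    {
            lockdep_assert_held_write(&mm->mmap_lock);
            VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
    }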

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
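
    Applied to a typical call site, the rule produces a conversion like
    this (illustrative example, not a specific hunk from the patch):

    /* Before */
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    up_read(&mm->mmap_sem);

    /* After */
    mmap_read_lock(mm);
    vma = find_vma(mm, addr);
    mmap_read_unlock(mm);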

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

05 Jun, 2020

2 commits

  • For a kvmalloc'ed data object that contains sensitive information like
    cryptographic keys, we need to make sure that the buffer is always
    cleared before freeing it. Using memset() alone for buffer clearing may
    not provide certainty, as the compiler may optimize it away. To be
    sure, the special memzero_explicit() has to be used.

    This patch introduces a new kvfree_sensitive() for freeing those sensitive
    data objects allocated by kvmalloc(). The relevant places where
    kvfree_sensitive() can be used are modified to use it.
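
    The new helper is essentially (sketch based on the description above):

    void kvfree_sensitive(const void *addr, size_t len)
    {
            if (likely(!ZERO_OR_NULL_PTR(addr))) {
                    /* memzero_explicit() cannot be optimized away. */
                    memzero_explicit((void *)addr, len);
                    kvfree(addr);
            }
    }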

    Fixes: 4f0882491a14 ("KEYS: Avoid false positive ENOMEM error on key read")
    Suggested-by: Linus Torvalds
    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Eric Biggers
    Acked-by: David Howells
    Cc: Jarkko Sakkinen
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Joe Perches
    Cc: Matthew Wilcox
    Cc: David Rientjes
    Cc: Uladzislau Rezki
    Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • This check was added by commit 82f71ae4a2b8 ("mm: catch memory
    commitment underflow") in 2014 as a safety check for issues which had
    already been fixed, and few reports have been caught by it since, as
    its commit log describes:

    : This shouldn't happen any more - the previous two patches fixed
    : the committed_as underflow issues.

    But it was actually triggered when Qian Cai used the LTP memory stress
    suite to test an RFC patchset which tries to improve the scalability of
    the per-cpu counter 'vm_committed_as' by choosing a bigger 'batch'
    number for the loose overcommit policies (OVERCOMMIT_ALWAYS and
    OVERCOMMIT_GUESS), while keeping the current number for
    OVERCOMMIT_NEVER.

    With that patchset, when the system first uses a loose policy, the
    'vm_committed_as' count can become a large negative value, as its big
    'batch' number allows a big deviation; when the policy is then changed
    to OVERCOMMIT_NEVER, the 'batch' is decreased to a much smaller value
    and this WARN check fires.

    To mitigate this, one proposed solution is to queue work on all online
    CPUs to do a local sync of 'vm_committed_as' when changing the policy
    to OVERCOMMIT_NEVER, plus some global syncing to guarantee the case
    won't be hit.

    But that solution is costly and slow. Given that this check has shown
    neither real trouble nor benefit, simply drop it from this hot MM path;
    perf stats show a tiny saving from removing it.

    Reported-by: Qian Cai
    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Reviewed-by: Qian Cai
    Acked-by: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andi Kleen
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Kees Cook
    Link: http://lkml.kernel.org/r/20200603094804.GB89848@shbuild999.sh.intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     

04 Jun, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     

03 Jun, 2020

1 commit

  • Just use __vmalloc_node instead, which takes an extra argument. To be
    able to use __vmalloc_node in all callers, make it available outside of
    vmalloc.c and implement it in nommu.c.
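
    The resulting signature, as best recalled from this series (treat as a
    sketch):

    void *__vmalloc_node(unsigned long size, unsigned long align,
                         gfp_t gfp_mask, int node, const void *caller);

    /* e.g. a plain node-agnostic allocation: */
    void *buf = __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
                               __builtin_return_address(0));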

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200414131348.444715-25-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.
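
    The mechanical part of the conversion is visible in the handler
    prototype (sketch):

    /* Before: each handler copied from/to userspace itself. */
    int proc_dointvec(struct ctl_table *table, int write,
                      void __user *buffer, size_t *lenp, loff_t *ppos);

    /* After: common code does the copying and NUL-termination, and
     * handlers receive a plain kernel buffer. */
    int proc_dointvec(struct ctl_table *table, int write,
                      void *buffer, size_t *lenp, loff_t *ppos);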

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

01 Dec, 2019

2 commits

  • Currently we use rb_parent to get next, but this is not necessary.

    When prev is NULL, vma will be the first element in the list, so next
    should be the current first one (mm->mmap), regardless of whether the
    rb node has a parent.

    After removing it, the code shows the beauty of symmetry.
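
    In sketch form (from the list-linking helper; simplified):

    /* Before, 'next' was derived via the rb-tree parent; after, it is
     * derived from 'prev' alone, symmetrically: */
    if (prev)
            next = prev->vm_next;
    else
            next = mm->mmap;        /* vma becomes the first element */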

    Link: http://lkml.kernel.org/r/20190813032656.16625-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Just make the code a little easier to read.

    Link: http://lkml.kernel.org/r/20191006012636.31521-3-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox (Oracle)
    Cc: Mel Gorman
    Cc: Oscar Salvador
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

25 Sep, 2019

5 commits

  • This commit selects ARCH_HAS_ELF_RANDOMIZE when an arch uses the generic
    topdown mmap layout functions so that this security feature is on by
    default.

    Note that this commit also removes the possibility for arm64 to have elf
    randomization and no MMU: without MMU, the security added by randomization
    is worth nothing.

    Link: http://lkml.kernel.org/r/20190730055113.23635-6-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • arm64 handles top-down mmap layout in a way that can be easily reused
    by other architectures, so make it available in mm. Then introduce a
    new config option, ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT, that other
    architectures can select to benefit from those functions. Note that
    this new config option depends on MMU being enabled; if it is selected
    without MMU support, a warning is thrown.

    Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Suggested-by: Christoph Hellwig
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Patch series "Provide generic top-down mmap layout functions", v6.

    This series introduces generic functions to make top-down mmap layout
    easily accessible to architectures, in particular riscv which was the
    initial goal of this series. The generic implementation was taken from
    arm64 and used successively by arm, mips and finally riscv.

    Note that in addition the series fixes 2 issues:

    - stack randomization was taken into account even when not necessary.

    - [1] fixed an issue with the mmap base which did not take
    randomization into account, but that fix was not propagated to arm and
    mips; by moving the arm64 code into a generic library, this problem is
    now fixed for both of those architectures.

    This work is an effort to factorize architecture functions to avoid code
    duplication and oversights as in [1].

    [1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html

    This patch (of 14):

    This preparatory commit moves this function so that further introduction
    of generic topdown mmap layout is contained only in mm/util.c.

    Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • Patch series "THP aware uprobe", v13.

    This patchset makes uprobe aware of THPs.

    Currently, when uprobe is attached to text on THP, the page is split by
    FOLL_SPLIT. As a result, uprobe eliminates the performance benefit of
    THP.

    This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduce
    FOLL_SPLIT_PMD, which only splits the PMD for uprobe.

    After all uprobes within the THP are removed, the PTE-mapped pages are
    regrouped as a huge PMD.

    This set (plus a few THP patches) is also available at

    https://github.com/liu-song-6/linux/tree/uprobe-thp

    This patch (of 6):

    Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
    can use them in other files.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
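
    The conversion is a one-liner per call site:

    /* Before */
    nr_pages = 1 << compound_order(page);
    /* After */
    nr_pages = compound_nr(page);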

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

17 Jul, 2019

1 commit

  • locked_vm accounting is done roughly the same way in five places, so
    unify them in a helper.

    Include the helper's caller in the debug print to distinguish between
    callsites.

    Error codes stay the same, so user-visible behavior does too. The one
    exception is that the -EPERM case in tce_account_locked_vm is removed
    because Alexey has never seen it triggered.
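
    The unified helper's shape (sketch; the real mm/util.c version also has
    a __account_locked_vm() variant taking an explicit task and an
    rlimit-bypass flag):

    int account_locked_vm(struct mm_struct *mm, unsigned long pages,
                          bool inc);

    /* Typical caller pattern: */
    if (account_locked_vm(current->mm, npages, true))
            return -ENOMEM;
    /* ... pin the memory ... */
    account_locked_vm(current->mm, npages, false);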

    [daniel.m.jordan@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
    [sfr@canb.auug.org.au: fix mm/util.c]
    Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
    Signed-off-by: Daniel Jordan
    Signed-off-by: Stephen Rothwell
    Tested-by: Alexey Kardashevskiy
    Acked-by: Alex Williamson
    Cc: Alan Tull
    Cc: Alex Williamson
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Christophe Leroy
    Cc: Davidlohr Bueso
    Cc: Jason Gunthorpe
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Moritz Fischer
    Cc: Paul Mackerras
    Cc: Steve Sistare
    Cc: Wu Hao
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     

13 Jul, 2019

1 commit

  • Always build mm/gup.c so that we don't have to provide separate nommu
    stubs. Also merge the get_user_pages_fast and __get_user_pages_fast
    stubs for the !HAVE_FAST_GUP case into the main implementations, which
    will never call the fast path if HAVE_FAST_GUP is not set.

    This also ensures the new put_user_pages* helpers are available for nommu,
    as those are currently missing, which would create a problem as soon as we
    actually grew users for it.

    Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

02 Jun, 2019

1 commit

  • The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
    semaphore taken.") added synchronization of reading argument/environment
    boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
    arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
    the coarse use of mmap_sem in similar situations. But there still
    remained two places that (mis)use mmap_sem.

    get_cmdline should also use arg_lock instead of mmap_sem when it reads the
    boundaries.
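
    In sketch form, get_cmdline's read side becomes:

    /* The boundaries are protected by the dedicated spinlock, so the
     * heavyweight mmap_sem is no longer needed here. */
    spin_lock(&mm->arg_lock);
    arg_start = mm->arg_start;
    arg_end   = mm->arg_end;
    env_start = mm->env_start;
    env_end   = mm->env_end;
    spin_unlock(&mm->arg_lock);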

    The second place that should use arg_lock is in prctl_set_mm. By
    protecting the boundaries fields with the arg_lock, we can downgrade
    mmap_sem to reader lock (analogous to what we already do in
    prctl_set_mm_map).

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
    Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
    Signed-off-by: Michal Koutný
    Signed-off-by: Laurent Dufour
    Co-developed-by: Laurent Dufour
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Yang Shi
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Koutný
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • With the default overcommit==guess we occasionally run into mmap
    rejections despite plenty of memory that would get dropped under
    pressure but just isn't accounted reclaimable. One example of this is
    dying cgroups pinned by some page cache. A previous case was auxiliary
    path name memory associated with dentries; we have since annotated
    those allocations to avoid overcommit failures (see d79f7aa496fc ("mm:
    treat indirectly reclaimable memory as free in overcommit logic")).

    But trying to classify all allocated memory reliably as reclaimable
    and unreclaimable is a bit of a fool's errand. There could be a myriad
    of dependencies that constantly change with kernel versions.

    It becomes even more questionable of an effort when considering how
    this estimate of available memory is used: it's not compared to the
    system-wide allocated virtual memory in any way. It's not even
    compared to the allocating process's address space. It's compared to
    the single allocation request at hand!

    So we have an elaborate left-hand side of the equation that tries to
    assess the exact breathing room the system has available down to a
    page - and then compare it to an isolated allocation request with no
    additional context. We could fail an allocation of N bytes, but for
    two allocations of N/2 bytes we'd do this elaborate dance twice in a
    row and then still let N bytes of virtual memory through. This doesn't
    make a whole lot of sense.

    Let's take a step back and look at the actual goal of the
    heuristic. From the documentation:

    Heuristic overcommit handling. Obvious overcommits of address
    space are refused. Used for a typical system. It ensures a
    seriously wild allocation fails while allowing overcommit to
    reduce swap usage. root is allowed to allocate slightly more
    memory in this mode. This is the default.

    If all we want to do is catch clearly bogus allocation requests
    irrespective of the general virtual memory situation, the physical
    memory counter-part doesn't need to be that complicated, either.

    When in GUESS mode, catch wild allocations by comparing their request
    size to the total amount of RAM and swap in the system.
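
    The simplified check then reduces to something like (sketch of the
    GUESS branch in __vm_enough_memory()):

    if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
            if (pages > totalram_pages() + total_swap_pages)
                    goto error;
            return 0;
    }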

    Link: http://lkml.kernel.org/r/20190412191418.26333-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • To facilitate passing additional options to get_user_pages_fast(),
    change the singular 'write' parameter to 'gup_flags'.

    This patch does not change any functionality. New functionality will
    follow in subsequent patches.

    Some of the get_user_pages_fast() call sites were unchanged because they
    already passed FOLL_WRITE or 0 for the write parameter.

    NOTE: It was suggested to change the ordering of the get_user_pages_fast()
    arguments to ensure that callers were converted. This breaks the current
    GUP call site convention of having the returned pages be the final
    parameter. So the suggestion was rejected.
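
    The prototype change (sketch):

    /* Before */
    int get_user_pages_fast(unsigned long start, int nr_pages,
                            int write, struct page **pages);

    /* After: callers pass FOLL_WRITE (or 0), and future flags can be
     * OR'ed in without another signature change. */
    int get_user_pages_fast(unsigned long start, int nr_pages,
                            unsigned int gup_flags, struct page **pages);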

    Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Mike Marshall
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Michal Hocko
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

06 Mar, 2019

1 commit

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted entirely, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
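
    The fix is the standard kernel-doc "Return:" section, e.g. (wording
    approximate):

    /**
     * kstrdup - allocate space for and copy an existing string
     * @s: the string to duplicate
     * @gfp: the GFP mask used in the kmalloc() call when allocating memory
     *
     * Return: newly allocated copy of @s or %NULL in case of error
     */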

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

22 Feb, 2019

1 commit

  • memdup_user() usually gets fed unchecked userspace input. Blasting a
    full backtrace into dmesg every time is a bit excessive - I'm not sure
    about the kernel rules in general, but at least in drm we're trying not
    to let unprivileged userspace spam the logs freely. Definitely not
    entire warning backtraces.

    It also means more filtering for our CI, because our testsuite exercises
    these corner cases and so hits these a lot.

    Link: http://lkml.kernel.org/r/20190220204058.11676-1-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Reviewed-by: Kees Cook
    Cc: Mike Rapoport
    Cc: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Jan Stancek
    Cc: Andrey Ryabinin
    Cc: "Michael S. Tsirkin"
    Cc: Huang Ying
    Cc: Bartosz Golaszewski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vetter
     

09 Jan, 2019

1 commit

  • LTP proc01 testcase has been observed to rarely trigger crashes
    on arm64:
    page_mapped+0x78/0xb4
    stable_page_flags+0x27c/0x338
    kpageflags_read+0xfc/0x164
    proc_reg_read+0x7c/0xb8
    __vfs_read+0x58/0x178
    vfs_read+0x90/0x14c
    SyS_read+0x60/0xc0

    The issue is that page_mapped() assumes that if a compound page is not
    huge, then it must be THP. But if this is a 'normal' compound page
    (COMPOUND_PAGE_DTOR), then the following loop can keep running (for
    HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
    mapped and triggers a panic:

    for (i = 0; i < hpage_nr_pages(page); i++) {
            if (atomic_read(&page[i]._mapcount) >= 0)
                    return true;
    }

    I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only with a
    custom kernel module [1] which:

    - allocates a compound page (PAGEC) of order 1
    - allocates 2 normal pages (COPY), which are initialized to 0xff (to
      satisfy _mapcount >= 0)
    - copies the 2 PAGEC page structs to the address of the first COPY page
    - marks the second COPY page as not present
    - calls page_mapped(COPY), which now triggers a fault on access to the
      2nd COPY page at offset 0x30 (_mapcount)

    [1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c

    Fix the loop to iterate for "1 << compound_order" pages.

    Kirill said "IIRC, the sound subsystem can produce custom mapped
    compound pages".

    Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com
    Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
    Signed-off-by: Jan Stancek
    Debugged-by: Laszlo Ersek
    Suggested-by: "Kirill A. Shutemov"
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: David Hildenbrand
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

29 Dec, 2018

1 commit

  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemed
    better to remove the lock and convert the variables to atomics, with
    preventing potential store-to-read tearing as a bonus.
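
    The variables become atomics behind small accessors (sketch of the
    include/linux/mm.h shape):

    extern atomic_long_t _totalram_pages;

    static inline unsigned long totalram_pages(void)
    {
            /* Single atomic read: no lock, no load tearing. */
            return (unsigned long)atomic_long_read(&_totalram_pages);
    }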

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

27 Oct, 2018

2 commits

  • vfree() might sleep if not called from interrupt context, and so does
    kvfree(). Fix kvfree()'s misleading comment about the allowed context.

    Link: http://lkml.kernel.org/r/20180914130512.10394-1-aryabinin@virtuozzo.com
    Fixes: 04b8e946075d ("mm/util.c: improve kvfree() kerneldoc")
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
    commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
    the goal of accounting objects that can be reclaimed, but cannot be
    allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
    kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
    is converted.

    The counter is however still useful for accounting direct page allocations
    (i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
    and:

    - change granularity to pages to be more like other counters;
      sub-page allocations should be able to use kmalloc
    - rename the counter to NR_KERNEL_MISC_RECLAIMABLE
    - expose the counter again in vmstat as "nr_kernel_misc_reclaimable";
      we can again remove the check for not printing "hidden" counters

    Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Vijayanand Jitta
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

24 Aug, 2018

2 commits

  • Link: http://lkml.kernel.org/r/1532626360-16650-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "memory management documentation updates", v3.

    Here are several updates to the mm documentation.

    Aside from really minor changes in the first three patches, the
    updates are:

    * move the documentation of kstrdup and friends to the "String
      Manipulation" section
    * split the memory management API into a separate .rst file
    * adjust the formatting of the GFP flags description and include it in
      the reference documentation.

    This patch (of 7):

    The description of the strndup_user function is missing the '*'
    character at the beginning of the comment required for proper
    kernel-doc. Add the missing character.

    Link: http://lkml.kernel.org/r/1532626360-16650-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

08 Jun, 2018

1 commit

  • kvmalloc warned about an incompatible gfp_mask to catch abusers
    (mostly GFP_NOFS), with the intention that this would motivate the
    authors of the code to fix those call sites. Linus argues that this
    just motivates people to do even more hacks like

    if (gfp == GFP_KERNEL)
            kvmalloc
    else
            kmalloc

    I haven't seen this happening much (Linus pointed to bucket_lock, which
    special-cases an atomic allocation, but my git foo hasn't found much
    more), but it is true that such cases could grow in the future.
    Therefore Linus suggested to simply not fall back to vmalloc for
    incompatible gfp flags and rather stick with the kmalloc path.
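
    The resulting kvmalloc_node() logic, in sketch form:

    /* vmalloc itself uses GFP_KERNEL for internal allocations (e.g. page
     * tables), so for any other mask skip the vmalloc fallback entirely
     * rather than warning about it. */
    if ((flags & GFP_KERNEL) != GFP_KERNEL)
            return kmalloc_node(size, flags, node);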

    Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Linus Torvalds
    Cc: Tom Herbert
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except for a few
    spelling fixes. The relatively large diffstat stems from the
    indentation and paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I may
    have missed some places that needed markup or added markup where it was
    not necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport