05 Mar, 2020

1 commit

  • commit 5b57b8f22709f07c0ab5921c94fd66e8c59c3e11 upstream.

    Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    inadvertently removed printing of page flags for pages that are neither
    anon nor ksm nor have a mapping. Fix that.

    Using pr_cont() again would be a solution, but the commit explicitly
    removed its use. Avoiding the danger of mixing up split lines from
    multiple CPUs might be beneficial for near-panic dumps like this, so fix
    this without reintroducing pr_cont().
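
    A minimal sketch of the approach described above, not the literal patch:
    the page-type prefix is collected into a local buffer so the type and the
    flags can be emitted in a single pr_warn() call. The helper name and
    buffer size are illustrative.

    static void dump_type_and_flags(struct page *page)
    {
            char type[8] = "";

            if (PageKsm(page))
                    strcpy(type, "ksm ");
            else if (PageAnon(page))
                    strcpy(type, "anon ");
            /* file-backed pages print their a_ops/dentry separately */

            pr_warn("%sflags: %#lx(%pGp)\n", type, page->flags, &page->flags);
    }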

    Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
    Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
    Signed-off-by: Vlastimil Babka
    Reported-by: Anshuman Khandual
    Reported-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Cc: Qian Cai
    Cc: Oscar Salvador
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Ralph Campbell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

16 Nov, 2019

2 commits

  • PageAnon() and PageKsm() use the low two bits of the page->mapping
    pointer to indicate the page type. PageAnon() checks only the LSB, while
    PageKsm() checks that the two least significant bits equal 3.

    Therefore, PageAnon() is also true for KSM pages, so __dump_page() never
    prints "ksm" because it checks PageAnon() first. Fix this by checking
    PageKsm() first.

    Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com
    Fixes: 1c6fb1d89e73 ("mm: print more information about mapping in __dump_page")
    Signed-off-by: Ralph Campbell
    Acked-by: Michal Hocko
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • When dumping struct page information, __dump_page() prints the page type
    with a trailing blank followed by the page flags on a separate line:

    anon
    flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)

    It looks like the intent was to use pr_cont() for printing "flags:", but
    pr_cont() usage is discouraged, so fix this by extending the format so
    that the flags are printed on the same line:

    anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)

    If the page is file backed, the name might be long so use two lines:

    shmem_aops name:"dev/zero"
    flags: 0x10000000008000c(uptodate|dirty|swapbacked)

    Also eliminate the pr_cont() usage for appending compound_mapcount.

    Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Andrew Morton
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

15 May, 2019

1 commit

  • Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
    _refcount") left out a couple of references to the old field name. Fix
    that.

    Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
    Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
    Signed-off-by: Baruch Siach
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach
     

30 Mar, 2019

2 commits

  • While debugging something, I added a dump_page() into do_swap_page(),
    and I got the splat from below. The issue happens when dereferencing
    mapping->host in __dump_page():

    ...
    else if (mapping) {
            pr_warn("%ps ", mapping->a_ops);
            if (mapping->host->i_dentry.first) {
                    struct dentry *dentry;

                    dentry = container_of(mapping->host->i_dentry.first,
                                          struct dentry, d_u.d_alias);
                    pr_warn("name:\"%pd\" ", dentry);
            }
    }
    ...

    Swap address space does not contain inode information, so mapping->host
    is NULL.

    Although the dump_page() call was added artificially into do_swap_page(),
    I am not sure whether we can hit this from any other path, so it looks
    worth fixing. We can easily do that by checking mapping->host first.
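
    A minimal sketch of the guard described above (based on the snippet
    quoted earlier): check mapping->host, which is NULL for the swap address
    space, before dereferencing it.

    else if (mapping) {
            pr_warn("%ps ", mapping->a_ops);
            if (mapping->host && mapping->host->i_dentry.first) {
                    struct dentry *dentry;

                    dentry = container_of(mapping->host->i_dentry.first,
                                          struct dentry, d_u.d_alias);
                    pr_warn("name:\"%pd\" ", dentry);
            }
    }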

    Link: http://lkml.kernel.org/r/20190318072931.29094-1-osalvador@suse.de
    Fixes: 1c6fb1d89e73c ("mm: print more information about mapping in __dump_page")
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • atomic64_read() on ppc64le returns "long int", so fix the same way as
    commit d549f545e690 ("drm/virtio: use %llu format string form
    atomic64_t") by adding a cast to u64, which makes it work on all arches.

    In file included from ./include/linux/printk.h:7,
                     from ./include/linux/kernel.h:15,
                     from mm/debug.c:9:
    mm/debug.c: In function 'dump_mm':
    ./include/linux/kern_levels.h:5:18: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 19 has type 'long int' [-Wformat=]
     #define KERN_SOH "\001" /* ASCII Start Of Header */
                      ^~~~~~
    ./include/linux/kern_levels.h:8:20: note: in expansion of macro 'KERN_SOH'
     #define KERN_EMERG KERN_SOH "0" /* system is unusable */
                        ^~~~~~~~
    ./include/linux/printk.h:297:9: note: in expansion of macro 'KERN_EMERG'
      printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
             ^~~~~~~~~~
    mm/debug.c:133:2: note: in expansion of macro 'pr_emerg'
      pr_emerg("mm %px mmap %px seqnum %llu task_size %lu"
      ^~~~~~~~
    mm/debug.c:140:17: note: format string is defined here
        "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx"
                   ~~~^
                   %lx
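
    A one-line sketch of the cast the changelog describes, with the format
    string shortened for illustration:

    pr_emerg("pinned_vm %llx\n", (u64)atomic64_read(&mm->pinned_vm));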

    Link: http://lkml.kernel.org/r/20190310183051.87303-1-cai@lca.pw
    Fixes: 70f8a3ca68d3 ("mm: make mm->pinned_vm an atomic64 counter")
    Signed-off-by: Qian Cai
    Acked-by: Davidlohr Bueso
    Cc: Jason Gunthorpe
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

10 Mar, 2019

1 commit

  • Pull rdma updates from Jason Gunthorpe:
    "This has been a slightly more active cycle than normal with ongoing
    core changes and quite a lot of collected driver updates.

    - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe

    - A new data transfer mode for HFI1 giving higher performance

    - Significant functional and bug fix update to the mlx5
    On-Demand-Paging MR feature

    - A chip hang reset recovery system for hns

    - Change mm->pinned_vm to an atomic64

    - Update bnxt_re to support a new 57500 chip

    - A sane netlink 'rdma link add' method for creating rxe devices and
    fixing the various unregistration race conditions in rxe's
    unregister flow

    - Allow looking up objects by an ID over netlink

    - Various reworking of the core to driver interface:
    - drivers should not assume umem SGLs are in PAGE_SIZE chunks
    - ucontext is accessed via udata not other means
    - start to make the core code responsible for object memory
    allocation
    - drivers should convert struct device to struct ib_device via a
    helper
    - drivers have more tools to avoid use after unregister problems"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits)
    net/mlx5: ODP support for XRC transport is not enabled by default in FW
    IB/hfi1: Close race condition on user context disable and close
    RDMA/umem: Revert broken 'off by one' fix
    RDMA/umem: minor bug fix in error handling path
    RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
    cxgb4: kfree mhp after the debug print
    IB/rdmavt: Fix concurrency panics in QP post_send and modify to error
    IB/rdmavt: Fix loopback send with invalidate ordering
    IB/iser: Fix dma_nents type definition
    IB/mlx5: Set correct write permissions for implicit ODP MR
    bnxt_re: Clean cq for kernel consumers only
    RDMA/uverbs: Don't do double free of allocated PD
    RDMA: Handle ucontext allocations by IB/core
    RDMA/core: Fix a WARN() message
    bnxt_re: fix the regression due to changes in alloc_pbl
    IB/mlx4: Increase the timeout for CM cache
    IB/core: Abort page fault handler silently during owning process exit
    IB/mlx5: Validate correct PD before prefetch MR
    IB/mlx5: Protect against prefetch of invalid MR
    RDMA/uverbs: Store PR pointer before it is overwritten
    ...

    Linus Torvalds
     

22 Feb, 2019

1 commit

  • Evaluating page_mapping() on a poisoned page ends up dereferencing junk
    and making PF_POISONED_CHECK() considerably crashier than intended:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
    Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c2f6ac38
    [0000000000000006] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 2 PID: 491 Comm: bash Not tainted 5.0.0-rc1+ #1
    Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
    pstate: 00000005 (nzcv daif -PAN -UAO)
    pc : page_mapping+0x18/0x118
    lr : __dump_page+0x1c/0x398
    Process bash (pid: 491, stack limit = 0x000000004ebd4ecd)
    Call trace:
    page_mapping+0x18/0x118
    __dump_page+0x1c/0x398
    dump_page+0xc/0x18
    remove_store+0xbc/0x120
    dev_attr_store+0x18/0x28
    sysfs_kf_write+0x40/0x50
    kernfs_fop_write+0x130/0x1d8
    __vfs_write+0x30/0x180
    vfs_write+0xb4/0x1a0
    ksys_write+0x60/0xd0
    __arm64_sys_write+0x18/0x20
    el0_svc_common+0x94/0xf8
    el0_svc_handler+0x68/0x70
    el0_svc+0x8/0xc
    Code: f9400401 d1000422 f240003f 9a801040 (f9400402)
    ---[ end trace cdb5eb5bf435cecb ]---

    Fix that by not inspecting the mapping until we've determined that it's
    likely to be valid. Now the above condition still ends up stopping the
    kernel, but in the correct manner:

    page:ffffffbf20000000 is uninitialized and poisoned
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    ------------[ cut here ]------------
    kernel BUG at ./include/linux/mm.h:1006!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 1 PID: 483 Comm: bash Not tainted 5.0.0-rc1+ #3
    Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
    pstate: 40000005 (nZcv daif -PAN -UAO)
    pc : remove_store+0xbc/0x120
    lr : remove_store+0xbc/0x120
    ...

    Link: http://lkml.kernel.org/r/03b53ee9d7e76cda4b9b5e1e31eea080db033396.1550071778.git.robin.murphy@arm.com
    Fixes: 1c6fb1d89e73 ("mm: print more information about mapping in __dump_page")
    Signed-off-by: Robin Murphy
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Murphy
     

08 Feb, 2019

1 commit

  • Taking a sleeping lock to _only_ increment a variable is quite the
    overkill, and pretty much all users do this. Furthermore, some drivers
    (i.e. infiniband and scif) that need pinned semantics go to quite some
    trouble to delay the (un)accounting of pinned pages via a workqueue when
    the lock cannot be acquired.

    By making the counter atomic we no longer need to hold the mmap_sem and
    can simplify some code around it for pinned_vm users. The counter is
    64-bit so that we need not worry about overflows, such as with rdma user
    input controlled from userspace.
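
    A sketch of the usage pattern this change enables (illustrative helper
    names, not any particular driver's code): the pinned-page count is bumped
    with atomic64 operations instead of under a writer-held mmap_sem.

    static void account_pinned(struct mm_struct *mm, long npages)
    {
            /* previously: down_write(&mm->mmap_sem); mm->pinned_vm += npages; up_write(...); */
            atomic64_add(npages, &mm->pinned_vm);
    }

    static void unaccount_pinned(struct mm_struct *mm, long npages)
    {
            atomic64_sub(npages, &mm->pinned_vm);
    }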

    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Lameter
    Reviewed-by: Daniel Jordan
    Reviewed-by: Jan Kara
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jason Gunthorpe

    Davidlohr Bueso
     

29 Dec, 2018

3 commits

  • Those strings are immutable as well.

    Link: http://lkml.kernel.org/r/20181124090508.GB10877@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • __dump_page() messages use the KERN_EMERG and KERN_ALERT loglevels
    respectively (this has been the case since 2004). Most callers of this
    function are really detecting a critical page state and BUG right after.
    On the other hand, the function is also called from contexts which just
    want to inform about the page state and those would rather not disrupt
    logs that much (e.g. some systems route these messages to the normal
    console).

    Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse in
    other contexts, while those messages will still make it to the kernel log
    in most setups. Even if the loglevel setup filters warnings away, the
    paths that are really critical already print a more targeted error or
    panic, and that should make it to the kernel log.

    [mhocko@kernel.org: fix __dump_page()]
    Link: http://lkml.kernel.org/r/20181212142540.GA7378@dhcp22.suse.cz
    [akpm@linux-foundation.org: s/KERN_WARN/KERN_WARNING/, per Michal]
    Link: http://lkml.kernel.org/r/20181107101830.17405-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: Oscar Salvador
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • I have been promising to improve debugging of memory offlining failures
    for quite some time. As things stand now we get only very limited
    information in the kernel log when the offlining fails. It is usually only

    [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed

    with no further details. We do not know what exactly fails and for what
    reason. Whenever I was forced to debug such a failure I've always had to
    do a debugging patch to tell me more. We can enable some tracepoints but
    it would be much better to get a better picture without using them.

    This patch series does two things. The first is to make dump_page more
    usable by printing more information about the mapping (patch 1). Then it
    reduces the log level from emerg to warning so that this function is
    usable from less critical contexts (patch 2). I have then added more
    detailed information about the offlining failure (patch 4) and finally
    added dump_page to the isolation and offlining migration paths. Patch 3
    is a trivial cleanup.

    This patch (of 6):

    __dump_page prints the mapping pointer but that is quite unhelpful for
    many reports because the pointer itself only helps to distinguish
    anon/ksm mappings from other ones (because of the lowest bits being set).
    Sometimes it would be much more helpful to know what kind of mapping it
    actually is, and if we know this is a file mapping, to also try to
    resolve the dentry name.

    [dan.carpenter@oracle.com: fix a width vs precision bug in printk]
    Link: http://lkml.kernel.org/r/20181123072135.gqvblm2vdujbvfjs@kili.mountain
    [mhocko@kernel.org: use %pd to print dentry]
    Link: http://lkml.kernel.org/r/20181125080834.GB12455@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181107101830.17405-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Reviewed-by: William Kucharski
    Cc: Oscar Salvador
    Cc: Baoquan He
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Oct, 2018

1 commit

  • Patch series "Address issues slowing persistent memory initialization", v5.

    The main thing this patch set achieves is that it allows us to initialize
    each node's worth of persistent memory independently. As a result we reduce
    page init time by about 2 minutes because instead of taking 30 to 40
    seconds per node and going through each node one at a time, we process all
    4 nodes in parallel in the case of a 12TB persistent memory setup spread
    evenly over 4 nodes.

    This patch (of 3):

    On systems with a large amount of memory it can take a significant amount
    of time to initialize all of the page structs with the PAGE_POISON_PATTERN
    value. I have seen it take over 2 minutes to initialize a system with
    over 12TB of RAM.

    In order to work around the issue I had to disable CONFIG_DEBUG_VM and
    then the boot time returned to something much more reasonable as the
    arch_add_memory call completed in milliseconds versus seconds. However in
    doing that I had to disable all of the other VM debugging on the system.

    In order to work around a kernel that might have CONFIG_DEBUG_VM enabled
    on a system that has a large amount of memory I have added a new kernel
    parameter named "vm_debug" that can be set to "-" in order to disable it.
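
    A hedged sketch of how such a boot parameter can be wired up with
    early_param(); the flag name below is illustrative, and the real option
    accepts more values than just "-".

    static bool page_init_poison_enabled __read_mostly = true;

    static int __init vm_debug_setup(char *str)
    {
            if (str && *str == '-')  /* "vm_debug=-" switches the poison checks off */
                    page_init_poison_enabled = false;
            return 0;
    }
    early_param("vm_debug", vm_debug_setup);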

    Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Alexander Duyck
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     

14 Sep, 2018

1 commit

  • Jann Horn points out that the vmacache_flush_all() function is not only
    potentially expensive, it's buggy too. It also happens to be entirely
    unnecessary, because the sequence number overflow case can be avoided by
    simply making the sequence number be 64-bit. That doesn't even grow the
    data structures in question, because the other adjacent fields are
    already 64-bit.

    So simplify the whole thing by just making the sequence number overflow
    case go away entirely, which gets rid of all the complications and makes
    the code faster too. Win-win.

    [ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
    also just goes away entirely with this ]
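
    A sketch of the resulting invariant (field and struct names illustrative):
    with a 64-bit generation number the overflow case never occurs in
    practice, so a stale per-thread cache is simply refilled lazily instead
    of ever needing a global flush.

    struct thread_vmacache {
            u64 seqnum;                     /* generation the cache was filled at */
            struct vm_area_struct *vmas[4];
    };

    static bool vmacache_generation_valid(struct thread_vmacache *cache,
                                          u64 mm_seqnum)
    {
            return cache->seqnum == mm_seqnum;
    }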

    Reported-by: Jann Horn
    Suggested-by: Will Deacon
    Acked-by: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Jul, 2018

1 commit

  • If a struct page is poisoned and uninitialized access is detected via
    PF_POISONED_CHECK(page), dump_page() is called to output the page. But
    dump_page() itself accesses struct page to determine how to print it,
    and therefore gets into a recursive loop.

    For example:

    dump_page()
      __dump_page()
        PageSlab(page)
          PF_POISONED_CHECK(page)
            VM_BUG_ON_PGFLAGS(PagePoisoned(page), page)
              dump_page() recursion loop.
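
    A hedged sketch of one way to break the recursion described above: bail
    out of the dump before calling any page-flag helper that goes through
    PF_POISONED_CHECK(). The function name is illustrative.

    static void dump_page_safe(struct page *page)
    {
            if (PagePoisoned(page)) {
                    /* nothing but the raw words can be trusted here */
                    pr_warn("page:%px is uninitialized and poisoned\n", page);
                    return;
            }
            /* ... normal, flag-based dumping continues here ... */
    }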

    Link: http://lkml.kernel.org/r/20180702180536.2552-1-pasha.tatashin@oracle.com
    Fixes: f165b378bbdf ("mm: uninitialized struct page poisoning sanity checking")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

05 Jan, 2018

1 commit

  • With the recent addition of hashed kernel pointers, places which need to
    produce useful debug output have to specify %px, not %p. This patch
    fixes all the VM debug to use %px. This is appropriate because it's
    debug output that the user should never be able to trigger, and kernel
    developers need to see the actual pointers.
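
    An illustrative line showing the kind of conversion this covers (the
    fields chosen here are just an example):

    pr_alert("page:%px mapping:%px index:%#lx\n",
             page, page->mapping, page->index);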

    Link: http://lkml.kernel.org/r/20171219133236.GE13680@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Michal Hocko
    Cc: "Tobin C. Harding"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

16 Nov, 2017

3 commits

  • Currently, we account page tables separately for each page table level,
    but that's redundant -- we only make use of total memory allocated to
    page tables for oom_badness calculation. We also provide the
    information to userspace, but it has dubious value there too.

    This patch switches page table accounting to a single counter.

    mm->pgtables_bytes is now used to account all page table levels. We use
    bytes, because page table size for different levels of page table tree
    may be different.

    The change has user-visible effect: we don't have VmPMD and VmPUD
    reported in /proc/[pid]/status. Not sure if anybody uses them. (As
    alternative, we can always report 0 kB for them.)

    OOM-killer report is also slightly changed: we now report pgtables_bytes
    instead of nr_ptes, nr_pmd, nr_puds.

    Apart from reducing the number of counters per mm, the benefit is that we
    now calculate oom_badness() more correctly for machines which have
    different page table sizes depending on the level, or where page tables
    are smaller than a page.

    The only downside is debuggability, because we do not know which page
    table level could leak. But I do not remember many bugs that would have
    been caught by separate counters, so I wouldn't lose sleep over this.
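
    A sketch of the byte-based helpers this change introduces (the _sketch
    suffix marks these as illustrative rather than the exact kernel
    definitions):

    static inline void inc_nr_ptes_sketch(struct mm_struct *mm)
    {
            atomic_long_add(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes);
    }

    static inline unsigned long pgtables_bytes_sketch(const struct mm_struct *mm)
    {
            return atomic_long_read(&mm->pgtables_bytes);
    }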

    [akpm@linux-foundation.org: fix mm/huge_memory.c]
    Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    [kirill.shutemov@linux.intel.com: fix build]
    Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
    and nr_pud.

    The patch also makes nr_ptes accounting dependent on CONFIG_MMU. Page
    table accounting doesn't make sense if you don't have page tables.

    It's preparation for consolidation of page-table counters in mm_struct.

    Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • On a machine with 5-level paging support a process can allocate a
    significant amount of memory and stay unnoticed by the oom-killer and
    memory cgroup. The trick is to allocate a lot of PUD page tables. We
    don't account PUD page tables, only PMD and PTE tables.

    We already addressed the same issue for PMD page tables, see commit
    dc6c9a35b66b ("mm: account pmd page tables to the process").
    Introduction of 5-level paging brings the same issue for PUD page
    tables.

    The patch expands accounting to PUD level.

    [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
    Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
    [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
    Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
    Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Heiko Carstens
    Acked-by: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

11 Aug, 2017

2 commits

  • Currently, tlb_flush_pending is used only for
    CONFIG_[NUMA_BALANCING|COMPACTION], but upcoming patches to solve a
    subtle TLB flush batching problem will use it regardless of
    compaction/NUMA, so this patch removes the dependency.

    [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
    Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com
    Signed-off-by: Minchan Kim
    Signed-off-by: Nadav Amit
    Acked-by: Mel Gorman
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "fixes of TLB batching races", v6.

    It turns out that Linux TLB batching mechanism suffers from various
    races. Races that are caused due to batching during reclamation were
    recently handled by Mel and this patch-set deals with others. The more
    fundamental issue is that concurrent updates of the page-tables allow
    for TLB flushes to be batched on one core, while another core changes
    the page-tables. This other core may assume a PTE change does not
    require a flush based on the updated PTE value, while it is unaware that
    TLB flushes are still pending.

    This behavior affects KSM (which may result in memory corruption) and
    MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
    proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
    Memory corruption in KSM is harder to produce in practice, but was
    observed by hacking the kernel and adding a delay before flushing and
    replacing the KSM page.

    Finally, there is also one memory barrier missing, which may affect
    architectures with a weak memory model.

    This patch (of 7):

    Setting and clearing mm->tlb_flush_pending can be performed by multiple
    threads, since mmap_sem may only be acquired for read in
    task_numa_work(). If this happens, tlb_flush_pending might be cleared
    while one of the threads still changes PTEs and batches TLB flushes.

    This can lead to the same race between migration and
    change_protection_range() that led to the introduction of
    tlb_flush_pending. The result of this race was data corruption, which
    means that this patch also addresses a theoretically possible data
    corruption.

    An actual data corruption was not observed, yet the race was confirmed by
    adding an assertion to check that tlb_flush_pending is not set by two
    threads, adding artificial latency in change_protection_range() and
    using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
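
    A sketch of the shape of the fix (illustrative names with a _sketch
    suffix; the real accessors are separate helpers in the mm headers): the
    flag becomes a counter so concurrent setters cannot clear each other's
    pending state.

    static inline void inc_tlb_flush_pending_sketch(struct mm_struct *mm)
    {
            atomic_inc(&mm->tlb_flush_pending);
    }

    static inline bool tlb_flush_pending_sketch(struct mm_struct *mm)
    {
            return atomic_read(&mm->tlb_flush_pending) > 0;
    }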

    Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
    Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range")
    Signed-off-by: Nadav Amit
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

13 Dec, 2016

1 commit

  • __dump_page() is used when a page metadata inconsistency is detected,
    either by standard runtime checks, or extra checks in CONFIG_DEBUG_VM
    builds. It prints some of the relevant metadata, but not the whole
    struct page, which is based on unions and interpretation is dependent on
    the context.

    This means that sometimes e.g. a VM_BUG_ON_PAGE() checks a certain field,
    which is however not printed by __dump_page(), and the resulting bug
    report may then lack clues that could help in determining the root
    cause. This patch solves the problem by simply printing the whole
    struct page word by word, so no part is missing, but the interpretation
    of the data is left to developers. This is similar to e.g. x86_64 raw
    stack dumps.

    Example output:

    page:ffffea00000475c0 count:1 mapcount:0 mapping: (null) index:0x0
    flags: 0x100000000000400(reserved)
    raw: 0100000000000400 0000000000000000 0000000000000000 00000001ffffffff
    raw: ffffea00000475e0 ffffea00000475e0 0000000000000000 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(1)
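
    A sketch of how the "raw:" lines above can be produced; print_hex_dump()
    with an unsigned-long group size walks the whole struct page word by
    word:

    print_hex_dump(KERN_ALERT, "raw: ", DUMP_PREFIX_NONE, 32,
                   sizeof(unsigned long), page, sizeof(struct page), false);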

    [aryabinin@virtuozzo.com: suggested print_hex_dump()]
    Link: http://lkml.kernel.org/r/2ff83214-70fe-741e-bf05-fe4a4073ec3e@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

08 Oct, 2016

1 commit


20 Sep, 2016

1 commit

  • dump_page() uses page_mapcount() to get the mapcount of the page.
    page_mapcount() has VM_BUG_ON_PAGE(PageSlab(page)), as mapcount doesn't
    make sense for slab pages and the field in struct page is used for other
    information.

    This leads to recursion if dump_page() is called for a slub page and
    DEBUG_VM is enabled:

    dump_page() -> page_mapcount() -> VM_BUG_ON_PAGE() -> dump_page -> ...

    Let's avoid calling page_mapcount() for slab pages in dump_page().
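
    A minimal sketch of the guard described above (illustrative, not the
    exact hunk):

    int mapcount = PageSlab(page) ? 0 : page_mapcount(page);

    pr_emerg("page:%p count:%d mapcount:%d\n",
             page, page_ref_count(page), mapcount);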

    Link: http://lkml.kernel.org/r/20160908082137.131076-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

1 commit

  • The success of CMA allocation largely depends on the success of
    migration, and a key factor of that is the page reference count. Until
    now, page references have been manipulated by directly calling atomic
    functions, so we cannot track who manipulates them and where. That makes
    it hard to find the actual reason for a CMA allocation failure. CMA
    allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to the newly introduced wrapper functions. This is a
    preparation step for adding a tracepoint to each page reference
    manipulation function. With this facility, we can easily find the reason
    for a CMA allocation failure. There is no functional change in this
    patch.

    In addition, this patch also converts reference read sites. It will help
    a second step that renames page._count to something else and prevents
    later attempts to access it directly (suggested by Andrew).
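
    A sketch of the wrapper style this patch introduces (suffixed names mark
    these as illustrative; at this point the field is still called _count):

    static inline int page_ref_count_sketch(struct page *page)
    {
            return atomic_read(&page->_count);
    }

    static inline void page_ref_inc_sketch(struct page *page)
    {
            atomic_inc(&page->_count);
            /* a follow-up patch can hook a tracepoint in here */
    }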

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

16 Mar, 2016

6 commits

  • Since bad_page() is the only user of the badflags parameter of
    dump_page_badflags(), we can move the code to bad_page() and simplify a
    bit.

    The dump_page_badflags() function is renamed to __dump_page() and can
    still be called separately from dump_page() for temporary debug prints
    where page_owner info is not desired.

    The only user-visible change is that page->mem_cgroup is printed before
    the bad flags.

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The page_owner mechanism is useful for dealing with memory leaks. By
    reading /sys/kernel/debug/page_owner one can determine the stack traces
    leading to allocations of all pages, and find e.g. a buggy driver.

    This information might also be useful for debugging, for example for the
    VM_BUG_ON_PAGE() calls to dump_page(). So let's print the stored info
    from dump_page().

    Example output:

    page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d
    flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk)
    page dumped because: VM_BUG_ON_PAGE(1)
    page->mem_cgroup:ffff8801392c5000
    page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
    [] __alloc_pages_nodemask+0x134/0x230
    [] alloc_pages_current+0x88/0x120
    [] __page_cache_alloc+0xe6/0x120
    [] __do_page_cache_readahead+0xdc/0x240
    [] ondemand_readahead+0x135/0x260
    [] page_cache_async_readahead+0x6c/0x70
    [] generic_file_read_iter+0x3f2/0x760
    [] __vfs_read+0xa7/0xd0
    page has been migrated, last migrate reason: compaction

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • During migration, page_owner info is now copied with the rest of the
    page, so the stacktrace leading to free page allocation during migration
    is overwritten. For debugging purposes, it might be however useful to
    know that the page has been migrated since its initial allocation. This
    might happen many times during the page's lifetime for different reasons,
    and fully tracking this, especially with stacktraces, would incur extra
    memory costs. As a compromise, store and print the migrate_reason of
    the last migration that occurred to the page. This is enough to
    distinguish compaction, numa balancing etc.

    Example page_owner entry after the patch:

    Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
    PFN 628753 type Movable Block 1228 type Movable Flags 0x1fffff80040030(dirty|lru|swapbacked)
    [] __alloc_pages_nodemask+0x134/0x230
    [] alloc_pages_vma+0xb5/0x250
    [] shmem_alloc_page+0x61/0x90
    [] shmem_getpage_gfp+0x678/0x960
    [] shmem_fallocate+0x329/0x440
    [] vfs_fallocate+0x140/0x230
    [] SyS_fallocate+0x44/0x70
    [] entry_SYSCALL_64_fastpath+0x12/0x71
    Page has been migrated, last migrate reason: compaction

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • With the new printk format strings for flags, we can get rid of
    dump_flags() in mm/debug.c.

    This also fixes dump_vma() which used dump_flags() for printing vma
    flags. However dump_flags() did a page-flags specific filtering of bits
    higher than NR_PAGEFLAGS in order to remove the zone id part. For
    dump_vma() this resulted in removing several VM_* flags from the
    symbolic translation.

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Rasmus Villemoes
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In mm we use several kinds of flags bitfields that are sometimes printed
    for debugging purposes, or exported to userspace via sysfs. To make
    them easier to interpret independently of kernel version and config, we
    want to dump also the symbolic flag names. So far this has been done
    with repeated calls to pr_cont(), which is unreliable on SMP, and not
    usable for e.g. sysfs export.

    To get a more reliable and universal solution, this patch extends
    printk() format string for pointers to handle the page flags (%pGp),
    gfp_flags (%pGg) and vma flags (%pGv). Existing users of
    dump_flag_names() are converted and simplified.

    It would be possible to pass flags by value instead of pointer, but the
    %p format string for pointers already has extensions for various kernel
    structures, so it's a good fit, and the extra indirection in a
    non-critical path is negligible.
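
    Example usage of the new specifiers (page flags shown; %pGg and %pGv work
    the same way, taking a pointer to the flags word):

    pr_alert("flags: %#lx(%pGp)\n", page->flags, &page->flags);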

    [linux@rasmusvillemoes.dk: lots of good implementation suggestions]
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Rasmus Villemoes
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In tracepoints, it's possible to print gfp flags in a human-friendly
    format through a macro show_gfp_flags(), which defines a translation
    array and passes it to __print_flags(). Since the following patch will
    introduce support for gfp flags printing in printk(), it would be nice
    to reuse the array. This is not straightforward, since __print_flags()
    can't simply reference an array defined in a .c file such as mm/debug.c
    - it has to be a macro to allow the macro magic to communicate the
    format to userspace tools such as trace-cmd.

    The solution is to create a macro __def_gfpflag_names which is used both
    in show_gfp_flags(), and to define the gfpflag_names[] array in
    mm/debug.c.

    On the other hand, mm/debug.c also defines translation tables for page
    flags and vma flags, and desire was expressed (but not implemented in
    this series) to use these also from tracepoints. Thus, this patch also
    renames the events/gfpflags.h file to events/mmflags.h and moves the
    table definitions there, using the same macro approach as for gfpflags.
    This allows translating all three kinds of mm-specific flags both in
    tracepoints and printk.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Michal Hocko
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Rasmus Villemoes
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Jan, 2016

2 commits

  • We're going to allow mapping of individual 4k pages of a THP compound
    page. It means we need to track the mapcount on a per-small-page basis.

    The straightforward approach is to use ->_mapcount in all subpages to
    track how many times the subpage is mapped with PMDs or PTEs combined.
    But this is rather expensive: mapping or unmapping a THP page with a PMD
    would require HPAGE_PMD_NR atomic operations instead of the single one
    we have now.

    The idea is to store separately how many times the page was mapped as a
    whole -- compound_mapcount. This frees up ->_mapcount in subpages to
    track the PTE mapcount.

    We use the same approach as with the compound page destructor and
    compound order to store compound_mapcount: use space in the first tail
    page, ->mapping this time.

    Any time we map/unmap a whole compound page (THP or hugetlb) we
    increment/decrement compound_mapcount. When we map part of a compound
    page with a PTE we operate on ->_mapcount of the subpage.

    page_mapcount() counts both: PTE and PMD mappings of the page.

    Basically, we have the mapcount for a subpage spread over two counters.
    This makes it tricky to detect when the last mapcount for a page goes away.

    We introduced PageDoubleMap() for this. When we split a THP PMD for the
    first time and there's another PMD mapping left, we bump ->_mapcount in
    all subpages by one and set PG_double_map on the compound page.
    These additional references go away with the last compound_mapcount.

    This approach provides a way to detect when the last mapcount goes away
    on a per-small-page basis without introducing new overhead for the most
    common cases.

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to use migration entries to stabilize page counts. It
    means we don't need compound_lock() for that.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • While inspecting the vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out if we're allowed to
    assign new @start_brk, @brk, @start_data, @end_data in mm_struct) it
    became clear that RLIMIT_DATA, in the form it's implemented now, doesn't
    do anything useful because most user-space libraries use the mmap()
    syscall for dynamic memory allocations.
    Linus suggested to convert RLIMIT_DATA rlimit into something suitable
    for anonymous memory accounting. But in this patch we go further, and
    the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be strange beast like readonly-private or VM_IO
    area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
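
    A minimal sketch of the vm_flags classification listed above (hypothetical
    helper names, not the kernel's exact predicates):

    static bool counts_as_code(unsigned long flags)
    {
            return (flags & VM_EXEC) && !(flags & VM_WRITE);        /* VmExe/VmLib */
    }

    static bool counts_as_stack(unsigned long flags)
    {
            return flags & (VM_GROWSUP | VM_GROWSDOWN);             /* VmStk */
    }

    static bool counts_as_data(unsigned long flags, bool is_stack)
    {
            return (flags & VM_WRITE) && !(flags & VM_SHARED) && !is_stack; /* VmData */
    }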

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Nov, 2015

1 commit

  • Hugh has pointed that compound_head() call can be unsafe in some
    context. There's one example:

    CPU0                                  CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                          put_page()
                                            tail->first_page = NULL
          head = tail->first_page
                                          alloc_pages(__GFP_COMP)
                                            prep_compound_page()
                                              tail->first_page = head
                                              __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page so that we can update them in one shot.

    The patch introduces page->compound_head in the third double word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail()
    and the rest of the bits are a pointer to the head page if bit zero is
    set.
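
    A sketch of the bit 0 encoding described above (the function name is
    illustrative; the real helper is compound_head() in the page-flags
    headers):

    static inline struct page *compound_head_sketch(struct page *page)
    {
            unsigned long head = READ_ONCE(page->compound_head);

            if (head & 1)           /* bit 0 set: this is a tail page */
                    return (struct page *)(head - 1);
            return page;
    }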

    The patch moves page->pmd_huge_pte out of this word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is
    removed from the union.

    The patch also opens the possibility to remove the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
    tail page to store the struct hugetlb_cgroup pointer. But that's out of
    scope of this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of the word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed as it makes use of the bit and we
    could get a false positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit

  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (this probably applies to other statistical or
    graphical models as well). For the security example, consider any
    application transacting in data that cannot be swapped out (credit card
    data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

11 Sep, 2015

1 commit

  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace, by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the amount
    of pages that are not used by the workload.

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

14 May, 2015

1 commit


12 Feb, 2015

1 commit

  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD page
    tables. The Linux kernel doesn't account PMD tables to the process, only
    PTE tables.

    The use case below uses a few tricks to allocate a lot of PMD page tables
    while keeping VmRSS and VmPTE low. The oom_score for the process will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>
    #include <sys/types.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            prctl(PR_SET_THP_DISABLE);
            for (i = 0; i < NR_PUD ; i++) {
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    *addr = 'x';
                    munmap(addr, PMD_SIZE);
                    mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                         MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                    if (addr == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                   getpid(), i * 4096 >> 10);
            return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process the
    same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
      accounting the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
      sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table level
    is present (PMD is not folded). As with nr_ptes we use a per-mm counter.
    The counter value is used to calculate a baseline for the badness score
    by the oom-killer.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov