18 Jan, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - more MM stuff:

    - Kirill's page-flags rework

    - Kirill's now-allegedly-fixed THP rework

    - MADV_FREE implementation

    - DAX feature work (msync/fsync). This isn't quite complete but DAX
    is new and it's good enough and the guys have a handle on what
    needs to be done - I expect this to be wrapped in the next week or
    two.

    - some vsprintf maintenance work

    - various other misc bits

    * emailed patches from Andrew Morton: (145 commits)
    printk: change recursion_bug type to bool
    lib/vsprintf: factor out %pN[F] handler as netdev_bits()
    lib/vsprintf: refactor duplicate code to special_hex_number()
    printk-formats.txt: remove unimplemented %pT
    printk: help pr_debug and pr_devel to optimize out arguments
    lib/test_printf.c: test dentry printing
    lib/test_printf.c: add test for large bitmaps
    lib/test_printf.c: account for kvasprintf tests
    lib/test_printf.c: add a few number() tests
    lib/test_printf.c: test precision quirks
    lib/test_printf.c: check for out-of-bound writes
    lib/test_printf.c: don't BUG
    lib/kasprintf.c: add sanity check to kvasprintf
    lib/vsprintf.c: warn about too large precisions and field widths
    lib/vsprintf.c: help gcc make number() smaller
    lib/vsprintf.c: expand field_width to 24 bits
    lib/vsprintf.c: eliminate potential race in string()
    lib/vsprintf.c: move string() below widen_string()
    lib/vsprintf.c: pull out padding code from dentry_name()
    printk: do cond_resched() between lines while outputting to consoles
    ...

    Linus Torvalds
     

16 Jan, 2016

1 commit

  • The patch updates Documentation/vm/transhuge.txt to reflect changes in
    THP design.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Dec, 2015

1 commit


07 Nov, 2015

2 commits

  • Hugh has pointed out that the compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that they can be updated in one shot.

    The patch introduces page->compound_head into the third double word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
    the remaining bits are a pointer to the head page if bit zero is set.
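
    A sketch of the resulting decode, in kernel context (this follows the
    description above rather than quoting the patch verbatim):

    static inline struct page *compound_head(struct page *page)
    {
        unsigned long head = READ_ONCE(page->compound_head);

        /* Bit 0 set: this is a tail page and the remaining bits hold the
         * head page pointer; bit 0 clear: the page is its own head. */
        if (unlikely(head & 1))
            return (struct page *) (head - 1);
        return page;
    }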

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens up the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
    tail page to store the struct hugetlb_cgroup pointer. But that's outside
    the scope of this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
    call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
    call_rcu_lazy() would not be allowed, as it makes use of that bit and we
    could get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep, and have no alternative. High-priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.
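
    A sketch of the resulting flag relationships, in kernel context and using
    only the names described above (exact bit values omitted):

    /* __GFP_WAIT becomes "may enter direct reclaim and may wake kswapd". */
    #define __GFP_WAIT (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

    /* Callers should test blocking ability with this helper rather
     * than checking __GFP_WAIT directly. */
    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
        return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }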

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT, as was done historically, can now trigger
    false positives. Some exceptions like dm-crypt.c exist, where the code
    intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper
    due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

06 Nov, 2015

5 commits

  • We have had trouble in the past from the way in which page migration's
    newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
    fix direct reclaim writeback regression") which proposed a cleanup.

    We have no actual problem now, but I think the procedure would be clearer
    (and alternative get_new_page pools safer to implement) if we assert that
    newpage is not touched until we are sure that it's going to be used -
    except for taking the trylock on it in __unmap_and_move().

    So shift the early initializations from move_to_new_page() into
    migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
    migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
    deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.

    Adjust stages 3 to 8 in the Documentation file accordingly.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
    mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
    found mapped in a VM_LOCKED vma) is ineffective against races with
    exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
    tearing down an mm.

    But that's okay, those races are benign; and although we've believed for
    years in that ugly down_read_trylock(), it's unsuitable for the job, and
    frustrates the good intention of setting PageMlocked when it fails.

    It just doesn't matter if here we read vm_flags an instant before or after
    a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
    syscalls (or exit) work their way up the address space (taking pt locks
    after updating vm_flags) to establish the final state.

    We do still need to be careful never to mark a page Mlocked (hence
    unevictable) by any race that will not be corrected shortly after. The
    page lock protects from many of the races, but not all (a page is not
    necessarily locked when it's unmapped). But the pte lock we just dropped
    is good to cover the rest (and serializes even with
    munlock_vma_pages_all(), so no special barriers required): now hold on to
    the pte lock while calling mlock_vma_page(). Is that lock ordering safe?
    Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
    calls the complementary clear_page_mlock().

    This fixes the following case (though not a case which anyone has
    complained of), which mmap_sem did not: truncation's preliminary
    unmap_mapping_range() is supposed to remove even the anonymous COWs of
    filecache pages, and that might race with try_to_unmap_one() on a
    VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
    zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
    freed. The pte lock protects against this.

    You could say that it also protects against the more ordinary case, racing
    with the preliminary unmapping of a filecache page itself: but in our
    current tree, that's independently protected by i_mmap_rwsem; and that
    race would be why "Bad page state (mlocked)" was seen before commit
    48ec833b7851 ("Revert mm/memory.c: share the i_mmap_rwsem").

    Vlastimil Babka points out another race which this patch protects against.
    try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
    moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
    leaving PageMlocked and unevictable when it should be evictable. mmap_sem
    is ineffective because exit_mmap() does not hold it; page lock ineffective
    because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
    is effective because __munlock_pagevec_fill() takes it to get the page,
    after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.

    Kirill Shutemov points out that if the compiler chooses to implement a
    "vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
    with an intermediate store of unrelated bits set, since I'm here foregoing
    its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
    a spurious VM_LOCKED in vm_flags, and make the wrong decision. This does
    not appear to be an immediate problem, but we may want to define vm_flags
    accessors in future, to guard against such a possibility.

    While we're here, make a related optimization in try_to_unmap_one(): if
    it's doing TTU_MUNLOCK, then there's no point at all in descending the
    page tables and getting the pt lock, unless the vma is VM_LOCKED. Yes,
    that can change racily, but it can change racily even without the
    optimization: it's not critical. Far better not to waste time here.

    Stopped short of separating try_to_munlock_one() from try_to_unmap_one()
    on this occasion, but that's probably the sensible next step - with a
    rename, given that try_to_munlock()'s business is to try to set Mlocked.

    Updated the unevictable-lru Documentation, to remove its reference to mmap
    semaphore, but found a few more updates needed in just that area.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While updating some mm Documentation, I came across a few straggling
    references to the non-linear vmas which were happily removed in v4.0.
    Delete them.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • max_ptes_swap specifies how many pages can be brought in from swap when
    collapsing a group of pages into a transparent huge page.

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

    A higher value can cause excessive swap IO and waste memory. A lower
    value can prevent THPs from being collapsed, resulting in fewer pages
    being collapsed into THPs and lower memory access performance.

    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Add documentation on how to use slabinfo-gnuplot.sh script.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

11 Sep, 2015

4 commits

  • As noted by Minchan, a benefit of reading the idle flag from
    /proc/kpageflags is that one can easily filter dirty and/or unevictable
    pages while estimating the size of unused memory.

    Note that idle flag read from /proc/kpageflags may be stale in case the
    page was accessed via a PTE, because it would be too costly to iterate
    over all page mappings on each /proc/kpageflags read to provide an
    up-to-date value. To make sure the flag is up-to-date one has to read
    /sys/kernel/mm/page_idle/bitmap first.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace, by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the
    number of pages that are not used by the workload.
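
    A minimal sketch of that procedure for a single page, assuming the bitmap
    layout of one 64-bit word per 64 pages indexed by PFN (error handling
    abbreviated):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Mark the page at 'pfn' idle; returns 0 on success. */
    static int set_page_idle(unsigned long pfn)
    {
        uint64_t bits = 1ULL << (pfn % 64);
        off_t off = pfn / 64 * sizeof(bits);
        int fd = open("/sys/kernel/mm/page_idle/bitmap", O_WRONLY);
        int ok;

        if (fd < 0)
            return -1;
        ok = pwrite(fd, &bits, sizeof(bits), off) == sizeof(bits);
        close(fd);
        return ok ? 0 : -1;
    }

    Reading the same word back later (O_RDONLY/pread) tells whether the page
    stayed idle.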

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • /proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
    page is charged to, indexed by PFN. Having this information is useful for
    estimating a cgroup working set size.

    The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
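
    For illustration, a lookup for one PFN might be sketched like this (each
    entry is one 64-bit inode number, as described above):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Return the memcg inode that 'pfn' is charged to, or 0 on error. */
    static uint64_t kpagecgroup_ino(unsigned long pfn)
    {
        uint64_t ino = 0;
        int fd = open("/proc/kpagecgroup", O_RDONLY);

        if (fd < 0)
            return 0;
        if (pread(fd, &ino, sizeof(ino), pfn * sizeof(ino)) != sizeof(ino))
            ino = 0;
        close(fd);
        return ino;
    }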

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Change the Documentation/vm/zswap.txt doc to indicate that the "zpool" and
    "compressor" params are now changeable at runtime.

    Signed-off-by: Dan Streetman
    Cc: Seth Jennings
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

09 Sep, 2015

3 commits

  • The URL for libhugetlbfs has changed. Also, put a stronger emphasis on
    using libhugetlbfs for hugetlb regression testing.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Notes about recent changes.

    [akpm@linux-foundation.org: various tweaks]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Mark Williamson
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch sets bit 56 in pagemap if this page is mapped only once. It
    allows detecting exclusively used pages without exposing the PFN:

    present file exclusive   state
    0       0    0           non-present
    1       1    0           file page mapped somewhere else
    1       1    1           file page mapped only here
    1       0    0           anon non-CoWed page (shared with parent/child)
    1       0    1           anon CoWed page (or never forked)

    CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.

    The mmap-exclusive bit doesn't reflect potential page-sharing via the
    swapcache: a page could be mapped once but have several swap-ptes pointing
    to it. An application can detect that via the swap bit in the pagemap
    entry and touch the pte via /proc/pid/mem to get the real information.
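
    A sketch of testing the new bit for an address in the calling process
    (the PM_MMAP_EXCLUSIVE name below is ours, for the example):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PM_MMAP_EXCLUSIVE (1ULL << 56)  /* the bit added by this patch */

    /* 1 if the page backing 'addr' is mapped only here, 0 if not, -1 on error. */
    static int page_exclusive(const void *addr)
    {
        uint64_t ent;
        off_t off = (uintptr_t)addr / sysconf(_SC_PAGESIZE) * sizeof(ent);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        int ok;

        if (fd < 0)
            return -1;
        ok = pread(fd, &ent, sizeof(ent), off) == sizeof(ent);
        close(fd);
        return ok ? !!(ent & PM_MMAP_EXCLUSIVE) : -1;
    }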

    See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

    Requested by Mark Williamson.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Sep, 2015

2 commits

  • I had requests to return the full address (not the page aligned one) to
    userland.

    It's not entirely clear how the page offset could be relevant, because
    userfaults aren't like SIGBUS, which can sigjump to a different place and
    actually skip resolving the fault depending on a page offset. There's
    currently no real way to skip the fault, especially because after a
    UFFDIO_COPY|ZEROPAGE the fault is optimized to be retried within the
    kernel without having to return to userland first (not even self-modifying
    code replacing the .text that touched the faulting address would prevent
    the fault from being repeated). Userland cannot skip repeating the fault
    even more so if the fault was triggered by a KVM secondary page fault, or
    any get_user_pages or copy-user inside some syscall that will return to
    kernel code. The second time, FAULT_FLAG_RETRY_NOWAIT won't be set,
    leading to a SIGBUS being raised because the userfault can't wait if it
    cannot release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is required
    for that).

    Still, returning a proper structure to userland during the read() on the
    uffd allows using the current UFFD_API for the future non-cooperative
    extensions too, and it looks cleaner as well. Once we get additional
    fields, there's no point in returning the fault address page-aligned
    anymore just to reuse the bits below PAGE_SHIFT.

    The only downside is that the read() syscall will read 32 bytes instead of
    8 bytes, but that's not going to be measurable overhead.

    The total number of new events that can be added, or of new future bits
    for already-shipped events, is limited to 64 by the features field of the
    uffdio_api structure. If more are needed, a bump of UFFD_API will be
    required.
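
    A sketch of consuming the structure, assuming the uffd_msg layout from
    <linux/userfaultfd.h>:

    #include <linux/userfaultfd.h>
    #include <unistd.h>

    /* Read one 32-byte event; returns the event type, or -1 on short read. */
    static int read_uffd_event(int uffd, unsigned long long *fault_addr)
    {
        struct uffd_msg msg;

        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            return -1;
        if (msg.event == UFFD_EVENT_PAGEFAULT)
            *fault_addr = msg.arg.pagefault.address; /* full, not page-aligned */
        return msg.event;
    }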

    [akpm@linux-foundation.org: use __packed]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This is the latest userfaultfd patchset. The postcopy live migration
    feature on the qemu side is mostly ready to be merged and it entirely
    depends on the userfaultfd syscall to be merged as well. So it'd be great
    if this patchset could be reviewed for merging in -mm.

    Userfaults allow implementing on-demand paging from userland and, more
    generally, they allow userland to take control of the behavior of page
    faults more efficiently than what was available before (PROT_NONE +
    SIGSEGV trap).

    The use cases are:

    1) KVM postcopy live migration (one form of cloud memory
    externalization).

    KVM postcopy live migration is the primary driver of this work:

    http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
    http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

    2) postcopy live migration of binaries inside linux containers:

    http://thread.gmane.org/gmane.linux.kernel.mm/132662

    3) KVM postcopy live snapshotting (allowing to limit/throttle the
    memory usage, unlike fork would, plus the avoidance of fork
    overhead in the first place).

    While the wrprotect tracking is not implemented yet, the syscall API
    already contemplates wrprotect fault tracking and is generic enough to
    allow its later implementation in a backwards compatible fashion.

    4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
    should be extended to work also on tmpfs and then the
    uffdio_register.ioctls will notify userland that UFFDIO_COPY is
    available even when the registered virtual memory range is tmpfs
    backed.

    5) alternate mechanism to notify web browsers or apps on embedded
    devices that volatile pages have been reclaimed. This basically
    avoids the need to run a syscall before the app can access with the
    CPU the virtual regions marked volatile. This depends on point 4)
    to be fulfilled first, as volatile pages happily apply to tmpfs.

    Even though there wasn't a real use case requesting it yet, it also
    allows implementing distributed shared memory in a way that read-only
    shared mappings can exist simultaneously on different hosts and become
    exclusive at the first wrprotect fault.

    This patch (of 22):

    Add documentation.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Jun, 2015

1 commit

  • Change the "enabled" parameter to be configurable at runtime. Remove the
    enabled check from init(), and move it to the frontswap store() function;
    when enabled, pages will be stored, and when disabled, pages won't be
    stored.

    This is almost identical to Seth's patch from 2 years ago:
    http://lkml.iu.edu/hypermail/linux/kernel/1307.2/04289.html

    [akpm@linux-foundation.org: tweak documentation]
    Signed-off-by: Dan Streetman
    Suggested-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

25 Jun, 2015

1 commit

  • There is a very subtle difference between mmap()+mlock() and
    mmap(MAP_LOCKED) semantics. The former fails if population of the area
    fails, while the latter doesn't. This basically means that
    mmap(MAP_LOCKED) areas might see a major fault after the mmap syscall
    returns, which is not the case for mlock(). The mmap man page has already
    been altered, but Documentation/vm/unevictable-lru.txt deserves a
    clarification as well.
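
    A sketch of the pattern to use when population failure must be detected:

    #include <stddef.h>
    #include <sys/mman.h>

    /* mlock() reports population failure; MAP_LOCKED alone would not. */
    static void *map_locked_or_null(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return NULL;
        if (mlock(p, len)) { /* fails if the pages can't be populated */
            munmap(p, len);
            return NULL;
        }
        return p;
    }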

    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Apr, 2015

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "Numerous fixes, the overdue removal of the i2o docs, some new Chinese
    translations, and, hopefully, the README fix that will end the flow of
    identical patches to that file"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (34 commits)
    Documentation/memcg: update memcg/kmem status
    Documentation: blackfin: Makefile: Typo building issue
    Documentation/vm/pagemap.txt: correct location of page-types tool
    Documentation/memory-barriers.txt: typo fix
    doc: Add guest_nice column to example output of `cat /proc/stat'
    Documentation/kernel-parameters: Move "eagerfpu" to its right place
    Documentation: gpio: Update ACPI part of the document to mention _DSD
    docs/completion.txt: Various tweaks and corrections
    doc: completion: context, scope and language fixes
    Documentation:Update Documentation/zh_CN/arm64/memory.txt
    Documentation:Update Documentation/zh_CN/arm64/booting.txt
    Documentation: Chinese translation of arm64/legacy_instructions.txt
    DocBook media: fix broken EIA hyperlink
    Documentation: tweak the maintainers entry
    README: Change gzip/bzip2 to xz compression format
    README: Update version number reference
    doc:pci: Fix typo in Documentation/PCI
    Documentation: drm: Use '->' when describing access through pointers.
    Documentation: Remove mentioning of block barriers
    Documentation/email-clients.txt: Fix one grammar mistake, add extra info about TB
    ...

    Linus Torvalds
     

16 Apr, 2015

4 commits

  • Create zsmalloc doc which explains design concept and stat information.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • munmap(2) of hugetlb memory requires a length that is hugepage aligned,
    otherwise it may fail. Add this to the documentation.
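
    For example, a sketch that rounds the length up before unmapping
    (assuming 2MB huge pages; a real program should query the actual huge
    page size):

    #include <stddef.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL << 20) /* assumed huge page size */

    static int munmap_hugetlb(void *addr, size_t len)
    {
        /* munmap() of hugetlb memory wants a hugepage-aligned length */
        len = (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
        return munmap(addr, len);
    }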

    This also cleans up the documentation and separates it into logical units:
    one part refers to MAP_HUGETLB and another part refers to requirements for
    shared memory segments.

    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Davide Libenzi
    Cc: Luiz Capitulino
    Cc: Shuah Khan
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Joern Engel
    Cc: Jianguo Wu
    Cc: Eric B Munson
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add the min_size mount option to the hugetlbfs documentation. Also, add
    the missing pagesize option and mention that size can be specified as
    bytes or as a percentage of the huge page pool.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • …d the unevictable LRU

    The memory compaction code uses the migration code to do most of the
    work in compaction. However, the compaction code interacts with the
    unevictable LRU differently than migration code and this difference
    should be noted in the documentation.

    [akpm@linux-foundation.org: identify /proc/sys/vm/compact_unevictable directly]
    Signed-off-by: Eric B Munson <emunson@akamai.com>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Eric B Munson
     

15 Apr, 2015

2 commits

  • Currently, cleancache_register_ops returns the previous value of
    cleancache_ops to allow chaining. However, chaining, as it is
    implemented now, is extremely dangerous due to possible pool id
    collisions. Suppose a new cleancache driver is registered after the
    previous one assigned an id to a super block. If the new driver assigns
    the same id to another super block, which is perfectly possible, we will
    have two different filesystems using the same id. No matter if the new
    driver implements chaining or not, we are likely to get data corruption
    with such a configuration eventually.

    This patch therefore disables the ability to override cleancache_ops
    altogether, as potentially dangerous. If a cleancache driver is already
    registered, all further calls to cleancache_register_ops will return
    EBUSY. Since no user of cleancache implements chaining, we only need to
    make minor changes to the code outside the cleancache core.
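
    The core of the change can be sketched, in kernel context, as an atomic
    claim of the single ops slot (a sketch, not the verbatim patch):

    /* Only the first registration wins; chaining is no longer possible. */
    int cleancache_register_ops(struct cleancache_ops *ops)
    {
        if (cmpxchg(&cleancache_ops, NULL, ops))
            return -EBUSY;
        return 0;
    }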

    Signed-off-by: Vladimir Davydov
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Stefan Hengelein
    Cc: Florian Schmaus
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on
    vma flags. The same codepath is used for MAP_POPULATE.

    Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

    This patch also drops mlock_vma_pages_range() references from the
    documentation. It went away in cea10a19b797 ("mm: directly use
    __mlock_vma_pages_range() in find_extend_vma()").

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Apr, 2015

1 commit


20 Mar, 2015

1 commit

  • max_ptes_none specifies how many extra small pages (that are
    not already mapped) can be allocated when collapsing a group
    of small pages into one large page.

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

    A higher value causes programs to use additional memory. A lower
    value yields less THP performance gain. The CPU time spent on
    max_ptes_none itself is negligible, so it can be ignored.

    Signed-off-by: Ebru Akagunduz
    Reviewed-by: Rik van Riel
    Signed-off-by: Jonathan Corbet

    Ebru Akagunduz
     

12 Feb, 2015

1 commit

  • Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
    detect zero_page in /proc/kpageflags, and then do memory analysis more
    accurately.

    Signed-off-by: Yalin Wang
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang, Yalin
     

11 Feb, 2015

2 commits

  • Pull trivial tree changes from Jiri Kosina:
    "Patches from trivial.git that keep the world turning around.

    Mostly documentation and comment fixes, and two corner-case code
    fixes from Alan Cox"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    kexec, Kconfig: spell "architecture" properly
    mm: fix cleancache debugfs directory path
    blackfin: mach-common: ints-priority: remove unused function
    doubletalk: probe failure causes OOPS
    ARM: cache-l2x0.c: Make it clear that cache-l2x0 handles L310 cache controller
    msdos_fs.h: fix 'fields' in comment
    scsi: aic7xxx: fix comment
    ARM: l2c: fix comment
    ibmraid: fix writeable attribute with no store method
    dynamic_debug: fix comment
    doc: usbmon: fix spelling s/unpriviledged/unprivileged/
    x86: init_mem_mapping(): use capital BIOS in comment

    Linus Torvalds
     
  • remap_file_pages(2) was invented to efficiently map parts of a huge file
    into a limited 32-bit virtual address space, such as in database
    workloads.

    Nonlinear mappings are a pain to support, and it seems there are no
    legitimate use-cases nowadays since 64-bit systems are widely available.

    Let's drop it and get rid of all this special-cased code.

    The patch replaces the syscall with an emulation which creates a new VMA
    on each remap_file_pages(), unless it can be merged with an adjacent one.

    I didn't find *any* real code that uses remap_file_pages(2) to test the
    emulation impact on. I've checked Debian code search and the source of
    all packages in ALT Linux. No real users: libc wrappers, mentions in
    strace, gdb, valgrind, and that kind of stuff.

    There are a few basic tests for the syscall in LTP. They work just fine
    with the emulation.

    To test the performance impact, I've written a small test case which
    demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
    write the page's pgoff at the beginning of every page, remap the pages in
    reverse order, read every page.

    The test creates 1 million VMAs when the emulation is in use, so I had to
    set vm.max_map_count to 1100000 to avoid -ENOMEM.

    Before: 23.3 ( +- 4.31% ) seconds
    After: 43.9 ( +- 0.85% ) seconds
    Slowdown: 1.88x

    I believe we can live with that.

    Test case:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
        unsigned long *p;
        long i, pass;

        for (pass = 0; pass < 10; pass++) {
            p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
            }

            /* write each page's pgoff at its start */
            for (i = 0; i < SIZE / 4096; i++)
                p[i * 4096 / sizeof(*p)] = i;

            /* remap the pages in reverse order */
            for (i = 0; i < SIZE / 4096; i++) {
                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                     0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                    perror("remap_file_pages");
                    return -1;
                }
            }

            /* read every page back and check the reversal */
            for (i = SIZE / 4096 - 1; i >= 0; i--)
                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

            munmap(p, SIZE);
        }

        return 0;
    }

    [akpm@linux-foundation.org: fix spello]
    [sasha.levin@oracle.com: initialize populate before usage]
    [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
    Signed-off-by: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Armin Rigo
    Signed-off-by: Sasha Levin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

20 Jan, 2015

1 commit


14 Dec, 2014

2 commits

  • Merge second patchbomb from Andrew Morton:
    - the rest of MM
    - misc fs fixes
    - add execveat() syscall
    - new ratelimit feature for fault-injection
    - decompressor updates
    - ipc/ updates
    - fallocate feature creep
    - fsnotify cleanups
    - a few other misc things

    * emailed patches from Andrew Morton: (99 commits)
    cgroups: Documentation: fix trivial typos and wrong paragraph numberings
    parisc: percpu: update comments referring to __get_cpu_var
    percpu: update local_ops.txt to reflect this_cpu operations
    percpu: remove __get_cpu_var and __raw_get_cpu_var macros
    fsnotify: remove destroy_list from fsnotify_mark
    fsnotify: unify inode and mount marks handling
    fallocate: create FAN_MODIFY and IN_MODIFY events
    mm/cma: make kmemleak ignore CMA regions
    slub: fix cpuset check in get_any_partial
    slab: fix cpuset check in fallback_alloc
    shmdt: use i_size_read() instead of ->i_size
    ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
    ipc/msg: increase MSGMNI, remove scaling
    ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
    ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
    lib/decompress.c: consistency of compress formats for kernel image
    decompress_bunzip2: off by one in get_next_block()
    usr/Kconfig: make initrd compression algorithm selection not expert
    fault-inject: add ratelimit option
    ratelimit: add initialization macro
    ...

    Linus Torvalds
     
  • Page owner is for tracking who allocated each page. This document
    explains what the page owner feature is and what its merits are. A
    simple HOW-TO is also included. See the document for detailed
    information.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Nov, 2014

1 commit


23 Oct, 2014

1 commit


07 Jun, 2014

1 commit

  • The remap_file_pages() system call is used to create a nonlinear
    mapping, that is, a mapping in which the pages of the file are mapped
    into a nonsequential order in memory. The advantage of using
    remap_file_pages() over using repeated calls to mmap(2) is that the
    former approach does not require the kernel to create additional VMA
    (Virtual Memory Area) data structures.

    Supporting nonlinear mappings requires a significant amount of
    non-trivial code in the kernel virtual memory subsystem, including hot
    paths. Also, to make nonlinear mappings work, the kernel needs a way to
    distinguish normal page table entries from entries with a file offset
    (pte_file). The kernel reserves a flag in the PTE for this purpose. PTE
    flags are a scarce resource, especially on some CPU architectures. It
    would be nice to free up the flag for other usage.

    Fortunately, there are not many users of remap_file_pages() in the wild.
    It's only known that one enterprise RDBMS implementation uses the
    syscall on 32-bit systems to map files bigger than can linearly fit into
    32-bit virtual address space. This use-case is not critical anymore
    since 64-bit systems are widely available.

    The plan is to deprecate the syscall and replace it with an emulation.
    The emulation will create new VMAs instead of nonlinear mappings. It's
    going to work slower for rare users of remap_file_pages() but ABI is
    preserved.

    One side effect of the emulation (apart from performance) is that a user
    can hit the vm.max_map_count limit more easily due to the additional
    VMAs. See the comment at DEFAULT_MAX_MAP_COUNT for more details on the
    limit.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Armin Rigo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jun, 2014

1 commit

  • Currently the memory error handler handles action-optional errors in a
    deferred manner by default. And if a recovery-aware application wants to
    handle it immediately, it can do so by setting the PF_MCE_EARLY flag.
    However, such a signal can be sent only to the main thread, so it's
    problematic if the application wants to have a dedicated thread to
    handle such signals.

    So this patch adds dedicated thread support to the memory error handler.
    We have PF_MCE_EARLY flags for each thread separately, so with this patch
    the AO signal is sent to the thread with the PF_MCE_EARLY flag set, not
    the main thread. If you want to implement a dedicated thread, you call
    prctl() to set PF_MCE_EARLY on that thread.

    The memory error handler collects processes to be killed, so this patch
    lets it check the PF_MCE_EARLY flag on each thread in the collecting
    routines.

    No behavioral change for all non-early kill cases.
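
    A sketch of what such a dedicated thread would do, using the existing
    PR_MCE_KILL prctl interface for this flag:

    #include <sys/prctl.h>

    /* Mark the calling thread to receive early (AO) memory-error signals. */
    static int enable_early_mce_kill(void)
    {
        return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
    }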

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi