08 Jan, 2017

2 commits

  • Several people reported warnings about inconsistent radix tree nodes
    followed by crashes in the workingset code, all of which looked like
    use-after-free access from the shadow node shrinker.

    Dave Jones managed to reproduce the issue with a debug patch applied,
    which confirmed that the radix tree shrinking indeed frees shadow nodes
    while they are still linked to the shadow LRU:

    WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
    CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
    Call Trace:
    delete_node+0x1e4/0x200
    __radix_tree_delete_node+0xd/0x10
    shadow_lru_isolate+0xe6/0x220
    __list_lru_walk_one.isra.4+0x9b/0x190
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x2e/0x40
    shrink_slab.part.44+0x23d/0x5d0
    shrink_node+0x22c/0x330
    kswapd+0x392/0x8f0

    This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
    inlined radix_tree_shrink().

    The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
    tracking to radix tree exceptional tracking"), which passes an update
    callback into the radix tree to link and unlink shadow leaf nodes when
    tree entries change, but forgot to pass the callback when reclaiming a
    shadow node.

    While the reclaimed shadow node itself is unlinked by the shrinker, its
    deletion from the tree can cause the left-most leaf node in the tree to
    be shrunk. If that happens to be a shadow node as well, we don't unlink
    it from the LRU as we should.

    Consider this tree, where the s are shadow entries:

           root->rnode
                |
           [0       n]
            |       |
         [s    ] [sssss]

    Now the shadow node shrinker reclaims the rightmost leaf node through
    the shadow node LRU:

           root->rnode
                |
           [0        ]
            |
        [s     ]

    Because the parent of the deleted node is the first level below the
    root and has only one child in the left-most slot, the intermediate
    level is shrunk and the node containing the single shadow is put in
    its place:

           root->rnode
                |
           [s        ]

    The shrinker again sees a single left-most slot in a first level node
    and thus decides to store the shadow in root->rnode directly and free
    the node - which is a leaf node on the shadow node LRU.

           root->rnode
                |
                s

    Without the update callback, the freed node remains on the shadow LRU,
    where it causes later shrinker runs to crash.

    Pass the node updater callback into __radix_tree_delete_node() in case
    the deletion causes the left-most branch in the tree to collapse too.
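
    A minimal sketch of that call-site change in the shadow shrinker,
    assuming the post-fix __radix_tree_delete_node() takes the node
    updater and its private data (treat the exact prototype as an
    assumption, not a quote of the tree):

        /* sketch: reclaiming one shadow node; the callback unlinks any
         * other leaf freed by the resulting left-most branch collapse */
        static void shadow_node_delete(struct address_space *mapping,
                                       struct radix_tree_node *node)
        {
            /* node->private_list was already unlinked by the shrinker */
            __radix_tree_delete_node(&mapping->page_tree, node,
                                     workingset_update_node, mapping);
        }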

    Also add warnings when linked nodes are freed right away, rather than
    waiting for the use-after-free when the list is scanned much later.

    Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
    Reported-by: Dave Chinner
    Reported-by: Hugh Dickins
    Reported-by: Andrea Arcangeli
    Reported-and-tested-by: Dave Jones
    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Chris Leech
    Cc: Lee Duncan
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • 4.10-rc loadtest (even on x86, and even without THPCache) fails with
    "fork: Cannot allocate memory" or some such; and /proc/meminfo shows
    PageTables growing.

    Commit 953c66c2b22a ("mm: THP page cache support for ppc64"), which was
    merged in rc1, removed the freeing of an unused preallocated pagetable
    after do_fault_around() has called map_pages().

    This is usually a good optimization, so that the followup doesn't have
    to reallocate one; but it's not sufficient to shift the freeing into
    alloc_set_pte(), since there are failure cases (most commonly
    VM_FAULT_RETRY) which never reach finish_fault().

    Check and free it at the outer level in do_fault(), then we don't need
    to worry in alloc_set_pte(), and can restore that to how it was (I
    cannot find any reason to pte_free() under lock as it was doing).
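
    A minimal sketch of that outer-level cleanup (the helper name below is
    made up for illustration; the actual fix does this inline at the end of
    do_fault()):

        /* free a preallocated page table that the fault never consumed */
        static void fault_free_prealloc_pte(struct vm_fault *vmf)
        {
            if (vmf->prealloc_pte) {
                pte_free(vmf->vma->vm_mm, vmf->prealloc_pte);
                vmf->prealloc_pte = NULL;
            }
        }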

    And fix a separate pagetable leak, or crash, introduced by the same
    change, that could only show up on some ppc64: why does do_set_pmd()'s
    failure case attempt to withdraw a pagetable when it never deposited
    one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
    Residue of an earlier implementation, perhaps? Delete it.

    Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64")
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Balbir Singh
    Cc: Andrew Morton
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Jan, 2017

1 commit

  • Pull DAX updates from Dan Williams:
    "The completion of Jan's DAX work for 4.10.

    As I mentioned in the libnvdimm-for-4.10 pull request, these are some
    final fixes for the DAX dirty-cacheline-tracking invalidation work
    that was merged through the -mm, ext4, and xfs trees in -rc1. These
    patches were prepared prior to the merge window, but we waited for
    4.10-rc1 to have a stable merge base after all the prerequisites were
    merged.

    Quoting Jan on the overall changes in these patches:

    "So I'd like all these 6 patches to go for rc2. The first three
    patches fix invalidation of exceptional DAX entries (a bug which
    is there for a long time) - without these patches data loss can
    occur on power failure even though user called fsync(2). The other
    three patches change locking of DAX faults so that ->iomap_begin()
    is called in a more relaxed locking context and we are safe to
    start a transaction there for ext4"

    These have received a build success notification from the kbuild
    robot, and pass the latest libnvdimm unit tests. There have not been
    any -next releases since -rc1, so they have not appeared there"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    ext4: Simplify DAX fault path
    dax: Call ->iomap_begin without entry lock during dax fault
    dax: Finish fault completely when loading holes
    dax: Avoid page invalidation races and unnecessary radix tree traversals
    mm: Invalidate DAX radix tree entries only if appropriate
    ext2: Return BH_New buffers for zeroed blocks

    Linus Torvalds
     

30 Dec, 2016

2 commits

  • mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
    mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
    return test_bit(PG_waiters);
    ^~~~~~~~

    Fixes: b91e1302ad9b ('mm: optimize PageWaiters bit use for unlock_page()')
    Signed-off-by: Olof Johansson
    Brown-paper-bag-by: Linus Torvalds
    Signed-off-by: Linus Torvalds

    Olof Johansson
     
  • In commit 62906027091f ("mm: add PageWaiters indicating tasks are
    waiting for a page bit") Nick Piggin made our page locking no longer
    unconditionally touch the hashed page waitqueue, which not only helps
    performance in general, but is particularly helpful on NUMA machines
    where the hashed wait queues can bounce around a lot.

    However, the "clear lock bit atomically and then test the waiters bit"
    sequence turns out to be much more expensive than it needs to be,
    because you get a nasty stall when trying to access the same word that
    just got updated atomically.

    On architectures where locking is done with LL/SC, this would be trivial
    to fix with a new primitive that clears one bit and tests another
    atomically, but that ends up not working on x86, where the only atomic
    operations that return the result end up being cmpxchg and xadd. The
    atomic bit operations return the old value of the same bit we changed,
    not the value of an unrelated bit.

    On x86, we could put the lock bit in the high bit of the byte, and use
    "xadd" with that bit (where the overflow ends up not touching other
    bits), and look at the other bits of the result. However, an even
    simpler model is to just use a regular atomic "and" to clear the lock
    bit, and then the sign bit in eflags will indicate the resulting state
    of the unrelated bit #7.

    So by moving the PageWaiters bit up to bit #7, we can atomically clear
    the lock bit and test the waiters bit on x86 too. And for architectures
    with LL/SC (which is all the usual RISC suspects), the particular bit
    doesn't matter, so they are fine with this approach too.

    This avoids the extra access to the same atomic word, and thus avoids
    the costly stall at page unlock time.

    The only downside is that the interface ends up being a bit odd and
    specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
    love the resulting name of the new primitive, but I'd rather make the
    name be descriptive and very clear about the limitation imposed by
    trying to work across all relevant architectures than make it be some
    generic thing that doesn't make the odd semantics explicit.

    So this introduces the new architecture primitive

    clear_bit_unlock_is_negative_byte();

    and adds the trivial implementation for x86. We have a generic
    non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
    combination) which can be overridden by any architecture that can do
    better. According to Nick, Power has the same hiccup x86 has, for
    example, but some other architectures may not even care.
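
    A sketch of that generic fallback, assuming PG_waiters sits at bit #7
    of the flags word (names and placement per the description above, not
    a quote of the header):

        /* non-optimized fallback: clear the lock bit with release
         * semantics, then report the state of the unrelated bit #7 */
        static inline bool clear_bit_unlock_is_negative_byte(long nr,
                                            volatile unsigned long *mem)
        {
            clear_bit_unlock(nr, mem);
            return test_bit(7, mem);
        }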

    All these optimizations mean that my page locking stress-test (which is
    just executing a lot of small short-lived shell scripts: "make test" in
    the git source tree) no longer makes our page locking look horribly bad.
    Before all these optimizations, the unlock_page() costs alone were just
    over 3% of all CPU overhead on "make test". After this, it's down to
    0.66%, so just a quarter of the cost it used to be.

    (The difference on NUMA is bigger, but there this micro-optimization is
    likely less noticeable, since the big issue on NUMA was not the accesses
    to 'struct page', but the waitqueue accesses that were already removed
    by Nick's earlier commit).

    Acked-by: Nick Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Dec, 2016

1 commit

  • Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
    just delete all exceptional radix tree entries they find. For DAX this
    is not desirable as we track cache dirtiness in these entries and when
    they are evicted, we may not flush caches although it is necessary. This
    can for example manifest when we write to the same block both via mmap
    and via write(2) (to different offsets) and fsync(2) then does not
    properly flush CPU caches when modification via write(2) was the last
    one.

    Create appropriate DAX functions to handle invalidation of DAX entries
    for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
    wire them up into the corresponding mm functions.
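
    A rough sketch of the dispatch this adds (both helper names below are
    placeholders for illustration, not the new API):

        static bool invalidate_exceptional_entry(struct address_space *mapping,
                                                 pgoff_t index, void *entry)
        {
            if (dax_mapping(mapping))
                /* DAX-aware path: keep (or flush) dirty entries */
                return dax_invalidate_entry(mapping, index, entry);
            /* shadow/swap exceptional entries can be dropped directly */
            return drop_exceptional_entry(mapping, index, entry);
        }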

    Acked-by: Johannes Weiner
    Reviewed-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Jan Kara
     

26 Dec, 2016

2 commits

  • Add a new page flag, PageWaiters, to indicate the page waitqueue has
    tasks waiting. This can be tested rather than testing waitqueue_active
    which requires another cacheline load.

    This bit is always set when the page has tasks on page_waitqueue(page),
    and is set and cleared under the waitqueue lock. It may be set when
    there are no tasks on the waitqueue, which will cause a harmless extra
    wakeup check that will clear the bit.

    The generic bit-waitqueue infrastructure is no longer used for pages.
    Instead, waitqueues are used directly with a custom key type. The
    generic code was not flexible enough to have PageWaiters manipulation
    under the waitqueue lock (which simplifies concurrency).
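
    A simplified sketch of the wake-up side this enables (compound-page
    handling and memory barriers omitted; helper names follow the new
    waitqueue code but are not quoted verbatim):

        static void wake_up_page(struct page *page, int bit)
        {
            /* common case: no sleepers, so never touch the hashed
             * waitqueue and its potentially remote cacheline */
            if (!PageWaiters(page))
                return;
            wake_up_page_bit(page, bit);  /* walks page_waitqueue(page) */
        }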

    This improves the performance of page lock intensive microbenchmarks by
    2-3%.

    Putting two bits in the same word opens the opportunity to remove the
    memory barrier between clearing the lock bit and testing the waiters
    bit, after some work on the arch primitives (e.g., ensuring memory
    operand widths match and cover both bits).

    Signed-off-by: Nicholas Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • A page is not added to the swap cache without being swap backed,
    so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.

    Signed-off-by: Nicholas Piggin
    Acked-by: Hugh Dickins
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

21 Dec, 2016

1 commit

  • When FADV_DONTNEED cannot drop all pages in the range, it observes that
    some pages might still be on per-cpu LRU caches after recent
    instantiation and so initiates remote calls to all CPUs to flush their
    local caches. However, in most cases, the fadvise happens from the same
    context that instantiated the pages, and any pre-LRU pages in the
    specified range are most likely sitting on the local CPU's LRU cache,
    and so in many cases this results in unnecessary remote calls, which, in
    a loaded system, can hold up the fadvise() call significantly.

    [ I didn't record it in the extreme case we observed at Facebook,
    unfortunately. We had a slow-to-respond system and noticed
    lru_add_drain_all() leading the profile during fadvise calls. This
    patch came out of thinking about the code and how we commonly call
    FADV_DONTNEED.

    FWIW, I wrote a silly directory tree walker/searcher that recurses
    through /usr to read and FADV_DONTNEED each file it finds. On a 2
    socket 40 ht machine, over 1% is spent in lru_add_drain_all(). With
    the patch, that cost is gone; the local drain cost shows at 0.09%. ]

    Try to avoid the remote call by flushing the local LRU cache before even
    attempting to invalidate anything. It's a cheap operation, and the
    local LRU cache is the most likely to hold any pre-LRU pages in the
    specified fadvise range.
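
    A minimal sketch of the resulting order of operations (the helper name
    is illustrative; the real change lives in the FADV_DONTNEED branch of
    mm/fadvise.c):

        static void dontneed_range(struct address_space *mapping,
                                   pgoff_t start_index, pgoff_t end_index)
        {
            /* cheap, local-CPU only: catches pages this context just
             * instantiated but that have not reached the LRU yet */
            lru_add_drain();
            invalidate_mapping_pages(mapping, start_index, end_index);
            /* only if pages survive is the expensive all-CPU
             * lru_add_drain_all() plus second pass worth doing */
        }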

    Link: http://lkml.kernel.org/r/20161214210017.GA1465@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Dec, 2016

1 commit

  • …/linux/kernel/git/mszeredi/vfs

    Pull partial readlink cleanups from Miklos Szeredi.

    This is the uncontroversial part of the readlink cleanup patch-set that
    simplifies the default readlink handling.

    Miklos and Al are still discussing the rest of the series.

    * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    vfs: make generic_readlink() static
    vfs: remove ".readlink = generic_readlink" assignments
    vfs: default to generic_readlink()
    vfs: replace calling i_op->readlink with vfs_readlink()
    proc/self: use generic_readlink
    ecryptfs: use vfs_get_link()
    bad_inode: add missing i_op initializers

    Linus Torvalds
     

15 Dec, 2016

29 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton: (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • This is an exceptionally complicated function with just one caller
    (tag_pages_for_writeback). We devote a large portion of the runtime of
    the test suite to testing this one function which has one caller. By
    introducing the new function radix_tree_iter_tag_set(), we can eliminate
    all of the complexity while keeping the performance. The caller can now
    use a fairly standard radix_tree_for_each() loop, and it doesn't need to
    worry about tricksy things like 'start' wrapping.
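
    A sketch of the caller's new shape (simplified from the description of
    tag_pages_for_writeback(); locking and rescheduling details omitted):

        void tag_pages_for_writeback(struct address_space *mapping,
                                     pgoff_t start, pgoff_t end)
        {
            struct radix_tree_iter iter;
            void **slot;

            spin_lock_irq(&mapping->tree_lock);
            radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
                                       start, PAGECACHE_TAG_DIRTY) {
                if (iter.index > end)
                    break;
                radix_tree_iter_tag_set(&mapping->page_tree, &iter,
                                        PAGECACHE_TAG_TOWRITE);
            }
            spin_unlock_irq(&mapping->tree_lock);
        }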

    The test suite continues to spend a large amount of time investigating
    this function, but now it's testing the underlying primitives such as
    radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator
    which are also used by other parts of the kernel.

    Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This rather complicated function can be better implemented as an
    iterator. It has only one caller, so move the functionality to the only
    place that needs it. Update the test suite to follow the same pattern.

    Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Acked-by: Konstantin Khlebnikov
    Tested-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This fixes several interlinked problems with the iterators in the
    presence of multiorder entries.

    1. radix_tree_iter_next() would only advance by one slot, which would
    result in the iterators returning the same entry more than once if
    there were sibling entries.

    2. radix_tree_next_slot() could return an internal pointer instead of
    a user pointer if a tagged multiorder entry was immediately followed by
    an entry of lower order.

    3. radix_tree_next_slot() expanded to a lot more code than it used to
    when multiorder support was compiled in. And I wasn't comfortable with
    entry_to_node() being in a header file.

    Fixing radix_tree_iter_next() for the presence of sibling entries
    necessarily involves examining the contents of the radix tree, so we now
    need to pass 'slot' to radix_tree_iter_next(), and we need to change the
    calling convention so it is called *before* dropping the lock which
    protects the tree. Also rename it to radix_tree_iter_resume(), as some
    people thought it was necessary to call radix_tree_iter_next() each time
    around the loop.
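
    An illustrative loop with the renamed primitive and its new calling
    convention (simplified; not a quote of any in-tree caller):

        static void walk_tree(struct address_space *mapping, pgoff_t start)
        {
            struct radix_tree_iter iter;
            void **slot;

            spin_lock_irq(&mapping->tree_lock);
            radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                /* ... examine *slot under the lock ... */
                if (need_resched()) {
                    /* must run before dropping the tree lock */
                    slot = radix_tree_iter_resume(slot, &iter);
                    spin_unlock_irq(&mapping->tree_lock);
                    cond_resched();
                    spin_lock_irq(&mapping->tree_lock);
                }
            }
            spin_unlock_irq(&mapping->tree_lock);
        }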

    radix_tree_next_slot() becomes closer to how it looked before multiorder
    support was introduced. It only checks to see if the next entry in the
    chunk is a sibling entry or a pointer to a node; this should be rare
    enough that handling this case out of line is not a performance impact
    (and such impact is amortised by the fact that the entry we just
    processed was a multiorder entry). Also, radix_tree_next_slot() used to
    force a new chunk lookup for untagged entries, which is more expensive
    than the out of line sibling entry skipping.

    Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite()
    has released corresponding radix tree entry lock. When we want to
    writeprotect PTE on cache flush, we need PTE modification to happen
    under radix tree entry lock to ensure consistent updates of PTE and
    radix tree (standard faults use page lock to ensure this consistency).
    So move update of PTE bit into dax_pfn_mkwrite().

    Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • DAX will need to implement its own version of page_check_address(). To
    avoid duplicating page table walking code, export follow_pte() which
    does what we need.

    Link: http://lkml.kernel.org/r/1479460644-25076-18-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently finish_mkwrite_fault() returns 0 when PTE got changed before
    we acquired PTE lock and VM_FAULT_WRITE when we succeeded in modifying
    the PTE. This is somewhat confusing since 0 generally means success; it
    is also inconsistent with finish_fault() which returns 0 on success.
    Change finish_mkwrite_fault() to return 0 on success and VM_FAULT_NOPAGE
    when PTE changed. Practically, there should be no behavioral difference
    since we bail out from the fault the same way regardless of whether we
    return 0, VM_FAULT_NOPAGE, or VM_FAULT_WRITE. Also note that
    VM_FAULT_WRITE has no effect for shared mappings since the only two
    places that check it - KSM and GUP - care about private mappings only.
    Generally the meaning of VM_FAULT_WRITE for shared mappings is not well
    defined and we should probably clean that up.

    Link: http://lkml.kernel.org/r/1479460644-25076-17-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Provide a helper function for finishing write faults due to PTE being
    read-only. The helper will be used by DAX to avoid complicating
    generic MM code with DAX locking specifics.

    Link: http://lkml.kernel.org/r/1479460644-25076-16-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
    wp_page_reuse() contains handling of write-shared faults that is needed
    only by wp_page_shared(). Move that handling into wp_page_shared() to
    make wp_page_reuse() simpler and to avoid the odd situation where we
    sometimes pass in a locked page and sometimes an unlocked one, etc.

    Link: http://lkml.kernel.org/r/1479460644-25076-15-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • So far we set vmf->page during WP faults only when we needed to pass it
    to the ->page_mkwrite handler. Set it in all cases now and use it
    instead of passing the page pointer around explicitly.

    Link: http://lkml.kernel.org/r/1479460644-25076-14-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We will need more information in the ->page_mkwrite() helper for DAX to
    be able to fully finish faults there. Pass vm_fault structure to
    do_page_mkwrite() and use it there so that information propagates
    properly from upper layers.

    Link: http://lkml.kernel.org/r/1479460644-25076-13-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently we duplicate handling of shared write faults in
    wp_page_reuse() and do_shared_fault(). Factor them out into a common
    function.

    Link: http://lkml.kernel.org/r/1479460644-25076-12-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move final handling of COW faults from generic code into DAX fault
    handler. That way generic code doesn't have to be aware of
    peculiarities of DAX locking, so remove that knowledge and make the
    locking functions private to fs/dax.c.

    Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Introduce finish_fault() as a helper function for finishing page faults.
    It is a rather thin wrapper around alloc_set_pte(), but since we'd want to
    call this from DAX code or filesystems, it is still useful to avoid some
    boilerplate code.
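
    A sketch of that wrapper, assuming the vm_fault fields introduced
    earlier in this series (page, cow_page, memcg, pte, ptl); the in-tree
    helper covers a few more corner cases:

        int finish_fault(struct vm_fault *vmf)
        {
            struct page *page;
            int ret;

            /* a COW fault installs the private copy, other faults
             * install the page the handler returned */
            if ((vmf->flags & FAULT_FLAG_WRITE) &&
                !(vmf->vma->vm_flags & VM_SHARED))
                page = vmf->cow_page;
            else
                page = vmf->page;

            ret = alloc_set_pte(vmf, vmf->memcg, page);
            if (vmf->pte)
                pte_unmap_unlock(vmf->pte, vmf->ptl);
            return ret;
        }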

    Link: http://lkml.kernel.org/r/1479460644-25076-10-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "dax: Clear dirty bits after flushing caches", v5.

    Patchset to clear dirty bits from the radix tree of DAX inodes when caches
    for corresponding pfns have been flushed. In principle, these patches
    enable handlers to easily update PTEs and do other work necessary to
    finish the fault without duplicating the functionality present in the
    generic code. I'd like to thank Kirill and Ross for reviews of the
    series!

    This patch (of 20):

    To allow full handling of COW faults, add a memcg field to struct
    vm_fault and a ->fault() handler return value meaning that the COW
    fault is fully handled and the memcg charge must not be canceled. This
    will allow us to
    remove knowledge about special DAX locking from the generic fault code.

    Link: http://lkml.kernel.org/r/1479460644-25076-9-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Add an orig_pte field to the vm_fault structure to allow ->page_mkwrite
    handlers to fully handle the fault.

    This also allows us to save some passing of extra arguments around.

    Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Instead of creating another vm_fault structure, use the one passed to
    wp_pfn_shared() for passing arguments into the pfn_mkwrite handler.

    Link: http://lkml.kernel.org/r/1479460644-25076-7-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Use vm_fault structure to pass cow_page, page, and entry in and out of
    the function.

    That reduces number of __do_fault() arguments from 4 to 1.

    Link: http://lkml.kernel.org/r/1479460644-25076-6-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Instead of creating another vm_fault structure, use the one passed to
    __do_fault() for passing arguments into the fault handler.

    Link: http://lkml.kernel.org/r/1479460644-25076-5-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • struct vm_fault already has a pgoff entry. Use it instead of passing
    pgoff as a separate argument and then assigning it later.

    Link: http://lkml.kernel.org/r/1479460644-25076-4-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Every single user of vmf->virtual_address cast that entry to unsigned
    long before doing anything with it, so the type of virtual_address does
    not really provide us any additional safety. Just use masked
    vmf->address which already has the appropriate type.

    Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently we have two different structures for passing fault information
    around - struct vm_fault and struct fault_env. DAX will need more
    information in struct vm_fault to handle its faults so the content of
    that structure would become even closer to fault_env. Furthermore it
    would need to generate struct fault_env to be able to call some of the
    generic functions. So at this point I don't think there's much use in
    keeping these two structures separate. Just embed into struct vm_fault
    all that is needed to use it for both purposes.
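
    An illustrative (not authoritative) view of the consolidated structure,
    collecting only the fields this patch series talks about; the real
    definition in include/linux/mm.h has more detail:

        struct vm_fault {
            struct vm_area_struct *vma;  /* target VMA (from fault_env) */
            unsigned int flags;          /* FAULT_FLAG_xxx */
            pgoff_t pgoff;               /* logical offset in the mapping */
            unsigned long address;       /* faulting virtual address */
            pmd_t *pmd;                  /* PMD the address maps through */
            pte_t orig_pte;              /* PTE value at fault time */
            struct page *cow_page;       /* handler's private COW copy */
            struct mem_cgroup *memcg;    /* charge for cow_page */
            struct page *page;           /* page to install, if any */
            pte_t *pte;                  /* mapped PTE, if any */
            spinlock_t *ptl;             /* its page table lock */
            pgtable_t prealloc_pte;      /* preallocated page table */
        };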

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Unexport the low-level __get_user_pages_unlocked() function and replace
    invocations with calls to more appropriate higher-level functions.

    In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
    with get_user_pages_unlocked() since we can now pass gup_flags.

    In async_pf_execute() and process_vm_rw_single_vec() we need to pass
    different tsk, mm arguments so get_user_pages_remote() is the sane
    replacement in these cases (having added manual acquisition and release
    of mmap_sem.)

    Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
    flag. However, this flag was originally silently dropped by commit
    1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
    appears to have been unintentional and reintroducing it is therefore not
    an issue.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Patch series "mm: unexport __get_user_pages_unlocked()".

    This patch series continues the cleanup of get_user_pages*() functions
    taking advantage of the fact we can now pass gup_flags as we please.

    It firstly adds an additional 'locked' parameter to
    get_user_pages_remote() to allow for its callers to utilise
    VM_FAULT_RETRY functionality. This is necessary as the invocation of
    __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
    this and no other existing higher level function would allow it to do
    so.

    Secondly existing callers of __get_user_pages_unlocked() are replaced
    with the appropriate higher-level replacement -
    get_user_pages_unlocked() if the current task and memory descriptor are
    referenced, or get_user_pages_remote() if other task/memory descriptors
    are referenced (having acquired mmap_sem).

    This patch (of 2):

    Add an int *locked parameter to get_user_pages_remote() to allow
    VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

    Taking into account the previous adjustments to get_user_pages*()
    functions allowing for the passing of gup_flags, we are now in a
    position where __get_user_pages_unlocked() need only be exported for its
    ability to allow VM_FAULT_RETRY behaviour. This adjustment allows us to
    subsequently unexport __get_user_pages_unlocked() as well as allowing
    for future flexibility in the use of get_user_pages_remote().
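
    A usage sketch for the new parameter, assuming the post-change argument
    order (tsk, mm, start, nr_pages, gup_flags, pages, vmas, locked); check
    the actual prototype before relying on the ordering shown here:

        static long pin_one_remote_page(struct task_struct *tsk,
                                        struct mm_struct *mm,
                                        unsigned long addr,
                                        unsigned int gup_flags,
                                        struct page **page)
        {
            int locked = 1;
            long ret;

            down_read(&mm->mmap_sem);
            ret = get_user_pages_remote(tsk, mm, addr, 1, gup_flags,
                                        page, NULL, &locked);
            /* on VM_FAULT_RETRY the callee may have dropped mmap_sem */
            if (locked)
                up_read(&mm->mmap_sem);
            return ret;
        }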

    [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
    Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Add a function that allows us to batch free a page that has multiple
    references outstanding. Specifically this function can be used to drop
    a page being used in the page frag alloc cache. With this drivers can
    make use of functionality similar to the page frag alloc cache without
    having to do any workarounds for the fact that there is no function that
    frees multiple references.
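
    For reference, a sketch of the slow equivalent such a helper replaces,
    i.e. dropping each outstanding reference one at a time (the new
    function's name and signature are not spelled out in this text):

        static void drain_frag_page_slow(struct page *page, unsigned int count)
        {
            /* the page is freed once the last reference goes away */
            while (count--)
                put_page(page);
        }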

    Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com
    Signed-off-by: Alexander Duyck
    Cc: "David S. Miller"
    Cc: "James E.J. Bottomley"
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Hans-Christian Noren Egtvedt
    Cc: Helge Deller
    Cc: James Hogan
    Cc: Jeff Kirsher
    Cc: Jonas Bonn
    Cc: Keguang Zhang
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Richard Kuo
    Cc: Russell King
    Cc: Steven Miao
    Cc: Tobias Klauser
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
    the direct compaction was introduced by commit 56de7263fcf3 ("mm:
    compaction: direct compact when a high-order allocation fails"). The
    main reason is that the migration of page cache pages might recurse back
    to fs/io layer and we could potentially deadlock. This is overly
    conservative because all the anonymous memory is migrateable in the
    GFP_NOFS context just fine. This might be a large portion of the memory
    in many/most workloads.

    Remove the GFP_NOFS restriction and make sure that we skip all fs pages
    (those with a mapping) while isolating pages to be migrated. We cannot
    consider even clean fs pages, because migrating them might need a
    metadata update, so for nofs requests we only isolate pages without any
    mapping.
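
    A minimal sketch of that isolation-time check, assuming the compaction
    control structure carries the caller's gfp_mask (the helper below is
    illustrative; the in-tree test may be phrased differently):

        static bool can_isolate_mapping(const struct compact_control *cc,
                                        struct page *page)
        {
            /* nofs callers must skip pages with a mapping: migrating
             * them could require fs/io work (metadata, writeback) */
            if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
                return false;
            return true;
        }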

    The effect of this patch will be probably very limited in many/most
    workloads because higher order GFP_NOFS requests are quite rare,
    although different configurations might lead to very different results.
    David Chinner has mentioned a heavy metadata workload with a 64kB block
    size which, to quote him:

    : Unfortunately, there was an era of cargo cult configuration tweaks in the
    : Ceph community that has resulted in a large number of production machines
    : with XFS filesystems configured this way. And a lot of them store large
    : numbers of small files and run under significant sustained memory
    : pressure.
    :
    : I am slowly working towards getting rid of these high order allocations and
    : replacing them with the equivalent number of single page allocations, but
    : I haven't got that (complex) change working yet.

    We can do the following to simulate that workload:
    $ mkfs.xfs -f -n size=64k
    $ mount /mnt/scratch
    $ time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
    -d /mnt/scratch/0 -d /mnt/scratch/1 \
    -d /mnt/scratch/2 -d /mnt/scratch/3 \
    -d /mnt/scratch/4 -d /mnt/scratch/5 \
    -d /mnt/scratch/6 -d /mnt/scratch/7 \
    -d /mnt/scratch/8 -d /mnt/scratch/9 \
    -d /mnt/scratch/10 -d /mnt/scratch/11 \
    -d /mnt/scratch/12 -d /mnt/scratch/13 \
    -d /mnt/scratch/14 -d /mnt/scratch/15

    and indeed it hammers the system with many high order GFP_NOFS requests,
    as per a simple tracepoint during the load:
    $ echo '!(gfp_flags & 0x80) && (gfp_flags &0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter
    I am getting
    5287609 order=0
    37 order=1
    1594905 order=2
    3048439 order=3
    6699207 order=4
    66645 order=5

    My testing was done in a kvm guest so performance numbers should be
    taken with a grain of salt but there seems to be a difference when the
    patch is applied:

    * Original kernel
    FSUse% Count Size Files/sec App Overhead
    1 1600000 0 4300.1 20745838
    3 3200000 0 4239.9 23849857
    5 4800000 0 4243.4 25939543
    6 6400000 0 4248.4 19514050
    8 8000000 0 4262.1 20796169
    9 9600000 0 4257.6 21288675
    11 11200000 0 4259.7 19375120
    13 12800000 0 4220.7 22734141
    14 14400000 0 4238.5 31936458
    16 16000000 0 4231.5 23409901
    18 17600000 0 4045.3 23577700
    19 19200000 0 2783.4 58299526
    21 20800000 0 2678.2 40616302
    23 22400000 0 2693.5 83973996

    and xfs complaining about memory allocation not making progress
    [ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)
    [ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)
    [ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240)
    [ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240)
    [ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)

    * Patched kernel
    FSUse% Count Size Files/sec App Overhead
    1 1600000 0 4289.1 19243934
    3 3200000 0 4241.6 32828865
    5 4800000 0 4248.7 32884693
    6 6400000 0 4314.4 19608921
    8 8000000 0 4269.9 24953292
    9 9600000 0 4270.7 33235572
    11 11200000 0 4346.4 40817101
    13 12800000 0 4285.3 29972397
    14 14400000 0 4297.2 20539765
    16 16000000 0 4219.6 18596767
    18 17600000 0 4273.8 49611187
    19 19200000 0 4300.4 27944451
    21 20800000 0 4270.6 22324585
    22 22400000 0 4317.6 22650382
    24 24000000 0 4065.2 22297964

    So the drop at Count 19200000 didn't happen and there was only a
    single warning about allocation not making progress
    [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)

    This suggests that the patch has helped even though there is not all
    that much anonymous memory, as the workload mostly generates fs
    metadata. I assume the success rate would be higher with more anonymous
    memory, which should be the case in many workloads.

    [akpm@linux-foundation.org: fix comment]
    Link: http://lkml.kernel.org/r/20161012114721.31853-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull namespace updates from Eric Biederman:
    "After a lot of discussion and work we have finally reachanged a basic
    understanding of what is necessary to make unprivileged mounts safe in
    the presence of EVM and IMA xattrs which the last commit in this
    series reflects. While technically it is a revert, the comments it adds
    are important for people not getting confused in the future. Clearing
    up that confusion allows us to seriously work on unprivileged mounts
    of fuse in the next development cycle.

    The rest of the fixes in this set are in the intersection of user
    namespaces, ptrace, and exec. I started with the first fix which
    started a feedback cycle of finding additional issues during review
    and fixing them, culminating in a fix for a bug that has been present
    since at least Linux v1.0.

    Potentially these fixes were candidates for being merged during the rc
    cycle, and are certainly backport candidates but enough little things
    turned up during review and testing that I decided they should be
    handled as part of the normal development process just to be certain
    there were not any great surprises when it came time to backport some
    of these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
    exec: Ensure mm->user_ns contains the execed files
    ptrace: Don't allow accessing an undumpable mm
    ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
    mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

    Linus Torvalds
     
  • We truncated the possible read iterator to s_maxbytes in commit
    c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"),
    but our end condition handling was wrong: it's not an error to try to
    read at the end of the file.

    Reading past the end should return EOF (0), not EINVAL.

    See for example

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342
    http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html

    where an md5sum of a maximally sized file fails because the final read is
    exactly at s_maxbytes.
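
    A sketch of the corrected boundary handling (the helper name and its
    exact placement in the generic read path are illustrative):

        static ssize_t bound_read_to_s_maxbytes(struct inode *inode, loff_t pos,
                                                struct iov_iter *iter)
        {
            /* at or beyond the limit: that's EOF, not an error */
            if (unlikely(pos >= inode->i_sb->s_maxbytes))
                return 0;
            /* otherwise never let the read run past the limit */
            iov_iter_truncate(iter, inode->i_sb->s_maxbytes - pos);
            return iov_iter_count(iter);
        }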

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-by: Joseph Salisbury
    Cc: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge request includes the dax-4.0-iomap-pmd branch which is
    needed for both ext4 and xfs dax changes to use iomap for DAX. It also
    includes the fscrypt branch which is needed for ubifs encryption work
    as well as ext4 encryption and fscrypt cleanups.

    Lots of cleanups and bug fixes, especially making sure ext4 is robust
    against maliciously corrupted file systems --- especially maliciously
    corrupted xattr blocks and a maliciously corrupted superblock. Also
    fix ext4 support for 64k block sizes so it works well on ppc64le. Fixed
    mbcache so we don't miss some common xattr blocks that can be merged"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    dax: Fix sleep in atomic contex in grab_mapping_entry()
    fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
    fscrypt: Delay bounce page pool allocation until needed
    fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
    fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
    fscrypt: Never allocate fscrypt_ctx on in-place encryption
    fscrypt: Use correct index in decrypt path.
    fscrypt: move the policy flags and encryption mode definitions to uapi header
    fscrypt: move non-public structures and constants to fscrypt_private.h
    fscrypt: unexport fscrypt_initialize()
    fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
    fscrypto: move ioctl processing more fully into common code
    fscrypto: remove unneeded Kconfig dependencies
    MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
    ext4: do not perform data journaling when data is encrypted
    ext4: return -ENOMEM instead of success
    ext4: reject inodes with negative size
    ext4: remove another test in ext4_alloc_file_blocks()
    Documentation: fix description of ext4's block_validity mount option
    ext4: fix checks for data=ordered and journal_async_commit options
    ...

    Linus Torvalds