26 Dec, 2016

2 commits

  • Add a new page flag, PageWaiters, to indicate the page waitqueue has
    tasks waiting. This can be tested rather than testing waitqueue_active
    which requires another cacheline load.

    This bit is always set when the page has tasks on page_waitqueue(page),
    and is set and cleared under the waitqueue lock. It may be set when
    there are no tasks on the waitqueue, which will cause a harmless extra
    wakeup check that will clear the bit.

    The generic bit-waitqueue infrastructure is no longer used for pages.
    Instead, waitqueues are used directly with a custom key type. The
    generic code was not flexible enough to have PageWaiters manipulation
    under the waitqueue lock (which simplifies concurrency).

    This improves the performance of page lock intensive microbenchmarks by
    2-3%.

    Putting two bits in the same word opens the opportunity to remove the
    memory barrier between clearing the lock bit and testing the waiters
    bit, after some work on the arch primitives (e.g., ensuring memory
    operand widths match and cover both bits).

    Signed-off-by: Nicholas Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
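
    A minimal userspace sketch of the idea described above, assuming C11
    atomics and hypothetical names (demo_page, DEMO_LOCKED, DEMO_WAITERS);
    it is not the kernel implementation. It keeps the lock bit and the
    waiters bit in one word so that unlock can clear the lock and learn
    whether a wakeup scan is needed from a single atomic operation. The
    real code sets and clears the bit under the waitqueue lock, which this
    toy leaves out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions; the kernel's page flag layout differs. */
enum { DEMO_LOCKED = 1u << 0, DEMO_WAITERS = 1u << 1 };

struct demo_page {
    atomic_uint flags;  /* lock bit and waiters bit share one word */
    unsigned wakeups;   /* counts slow-path wakeup scans */
};

/* Try to take the page lock; returns true on success. */
static bool demo_trylock(struct demo_page *p)
{
    unsigned old = atomic_fetch_or(&p->flags, DEMO_LOCKED);
    return !(old & DEMO_LOCKED);
}

/* A would-be sleeper marks the page before queueing itself. */
static void demo_mark_waiter(struct demo_page *p)
{
    atomic_fetch_or(&p->flags, DEMO_WAITERS);
}

/* Slow path: scan the (imaginary) waitqueue, clear the bit if it is empty. */
static void demo_wake(struct demo_page *p, bool queue_empty_after)
{
    p->wakeups++;
    if (queue_empty_after)
        atomic_fetch_and(&p->flags, ~(unsigned)DEMO_WAITERS);
}

/* Unlock: clear the lock bit; only visit the waitqueue if waiters were flagged. */
static void demo_unlock(struct demo_page *p)
{
    unsigned old = atomic_fetch_and(&p->flags, ~(unsigned)DEMO_LOCKED);
    assert(old & DEMO_LOCKED);
    if (old & DEMO_WAITERS)
        demo_wake(p, true);
}
```

    With no waiter flagged, demo_unlock() never touches the waitqueue
    stand-in, which is the cacheline load the commit is trying to avoid.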
     
  • A page is not added to the swap cache without being swap backed,
    so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.

    Signed-off-by: Nicholas Piggin
    Acked-by: Hugh Dickins
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

25 Dec, 2016

1 commit


21 Dec, 2016

1 commit

  • When FADV_DONTNEED cannot drop all pages in the range, it observes that
    some pages might still be on per-cpu LRU caches after recent
    instantiation and so initiates remote calls to all CPUs to flush their
    local caches. However, in most cases, the fadvise happens from the same
    context that instantiated the pages, and any pre-LRU pages in the
    specified range are most likely sitting on the local CPU's LRU cache,
    and so in many cases this results in unnecessary remote calls, which, in
    a loaded system, can hold up the fadvise() call significantly.

    [ I didn't record it in the extreme case we observed at Facebook,
    unfortunately. We had a slow-to-respond system and noticed
    lru_add_drain_all() leading the profile during fadvise calls. This
    patch came out of thinking about the code and how we commonly call
    FADV_DONTNEED.

    FWIW, I wrote a silly directory tree walker/searcher that recurses
    through /usr to read and FADV_DONTNEED each file it finds. On a 2
    socket 40 ht machine, over 1% is spent in lru_add_drain_all(). With
    the patch, that cost is gone; the local drain cost shows at 0.09%. ]

    Try to avoid the remote call by flushing the local LRU cache before even
    attempting to invalidate anything. It's a cheap operation, and the
    local LRU cache is the most likely to hold any pre-LRU pages in the
    specified fadvise range.

    Link: http://lkml.kernel.org/r/20161214210017.GA1465@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
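
    For context, the calling pattern the message refers to (read a file,
    then issue FADV_DONTNEED from the same context) might look like the
    userspace sketch below. It assumes a Linux system with
    posix_fadvise(); the helper names and the temporary file path are
    illustrative only, and nothing here shows the kernel-side change
    itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read the whole file behind fd, then advise the kernel that its cached
 * pages are no longer needed. Returns 0 on success, -1 on a seek/read
 * failure, or the positive error number from posix_fadvise().
 */
static int read_and_drop(int fd)
{
    char buf[4096];
    ssize_t n;

    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;  /* the data itself is not interesting here */
    if (n < 0)
        return -1;
    /* offset 0, len 0 means "from the start to the end of the file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

/* Exercise read_and_drop() on a small temporary file (hypothetical path). */
static int demo_on_temp_file(void)
{
    char path[] = "/tmp/fadv_demo_XXXXXX";
    int fd = mkstemp(path);
    int ret;

    if (fd < 0)
        return -1;
    unlink(path);  /* the file disappears once fd is closed */
    if (write(fd, "hello\n", 6) != 6) {
        close(fd);
        return -1;
    }
    ret = read_and_drop(fd);
    close(fd);
    return ret;
}
```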
     

18 Dec, 2016

1 commit

  • …/linux/kernel/git/mszeredi/vfs

    Pull partial readlink cleanups from Miklos Szeredi.

    This is the uncontroversial part of the readlink cleanup patch-set that
    simplifies the default readlink handling.

    Miklos and Al are still discussing the rest of the series.

    * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    vfs: make generic_readlink() static
    vfs: remove ".readlink = generic_readlink" assignments
    vfs: default to generic_readlink()
    vfs: replace calling i_op->readlink with vfs_readlink()
    proc/self: use generic_readlink
    ecryptfs: use vfs_get_link()
    bad_inode: add missing i_op initializers

    Linus Torvalds
     

15 Dec, 2016

29 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton: (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • This is an exceptionally complicated function with just one caller
    (tag_pages_for_writeback). We devote a large portion of the runtime of
    the test suite to testing this one function which has one caller. By
    introducing the new function radix_tree_iter_tag_set(), we can eliminate
    all of the complexity while keeping the performance. The caller can now
    use a fairly standard radix_tree_for_each() loop, and it doesn't need to
    worry about tricksy things like 'start' wrapping.

    The test suite continues to spend a large amount of time investigating
    this function, but now it's testing the underlying primitives such as
    radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator
    which are also used by other parts of the kernel.

    Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
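
    The shape of the simplified caller can be illustrated with a toy model.
    The sketch below is an assumption-laden stand-in: a flat array of tag
    bytes replaces the radix tree, and demo_tree, DEMO_BATCH and the
    resched_points counter are hypothetical. It shows only the loop
    structure the message describes: walk the dirty entries in a range, set
    a second tag on each, and pause every so often.

```c
#include <assert.h>
#include <stddef.h>

enum { TAG_DIRTY = 1 << 0, TAG_TOWRITE = 1 << 1 };

#define DEMO_BATCH 4  /* hypothetical batch size between pauses */

struct demo_tree {
    unsigned char *tags;      /* one tag byte per index; stands in for the tree */
    size_t nr;                /* number of indices */
    unsigned resched_points;  /* how often the walk would have paused */
};

/* Tag every dirty entry in [start, end] for writeback; returns how many. */
static size_t demo_tag_for_writeback(struct demo_tree *t, size_t start, size_t end)
{
    size_t tagged = 0;

    assert(start <= end);
    /* Bounding by t->nr first means an 'end' of SIZE_MAX cannot wrap 'i'. */
    for (size_t i = start; i < t->nr && i <= end; i++) {
        if (!(t->tags[i] & TAG_DIRTY))
            continue;
        t->tags[i] |= TAG_TOWRITE;
        if (++tagged % DEMO_BATCH == 0)
            t->resched_points++;  /* where a real walk would drop the lock and resume */
    }
    return tagged;
}
```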
     
  • This rather complicated function can be better implemented as an
    iterator. It has only one caller, so move the functionality to the only
    place that needs it. Update the test suite to follow the same pattern.

    Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Acked-by: Konstantin Khlebnikov
    Tested-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This fixes several interlinked problems with the iterators in the
    presence of multiorder entries.

    1. radix_tree_iter_next() would only advance by one slot, which would
    result in the iterators returning the same entry more than once if
    there were sibling entries.

    2. radix_tree_next_slot() could return an internal pointer instead of
    a user pointer if a tagged multiorder entry was immediately followed by
    an entry of lower order.

    3. radix_tree_next_slot() expanded to a lot more code than it used to
    when multiorder support was compiled in. And I wasn't comfortable with
    entry_to_node() being in a header file.

    Fixing radix_tree_iter_next() for the presence of sibling entries
    necessarily involves examining the contents of the radix tree, so we now
    need to pass 'slot' to radix_tree_iter_next(), and we need to change the
    calling convention so it is called *before* dropping the lock which
    protects the tree. Also rename it to radix_tree_iter_resume(), as some
    people thought it was necessary to call radix_tree_iter_next() each time
    around the loop.

    radix_tree_next_slot() becomes closer to how it looked before multiorder
    support was introduced. It only checks to see if the next entry in the
    chunk is a sibling entry or a pointer to a node; this should be rare
    enough that handling this case out of line is not a performance impact
    (and such impact is amortised by the fact that the entry we just
    processed was a multiorder entry). Also, radix_tree_next_slot() used to
    force a new chunk lookup for untagged entries, which is more expensive
    than the out of line sibling entry skipping.

    Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite()
    has released corresponding radix tree entry lock. When we want to
    writeprotect PTE on cache flush, we need PTE modification to happen
    under radix tree entry lock to ensure consistent updates of PTE and
    radix tree (standard faults use page lock to ensure this consistency).
    So move update of PTE bit into dax_pfn_mkwrite().

    Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • DAX will need to implement its own version of page_check_address(). To
    avoid duplicating page table walking code, export follow_pte() which
    does what we need.

    Link: http://lkml.kernel.org/r/1479460644-25076-18-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently finish_mkwrite_fault() returns 0 when PTE got changed before
    we acquired PTE lock and VM_FAULT_WRITE when we succeeded in modifying
    the PTE. This is somewhat confusing since 0 generally means success; it
    is also inconsistent with finish_fault() which returns 0 on success.
    Change finish_mkwrite_fault() to return 0 on success and VM_FAULT_NOPAGE
    when PTE changed. Practically, there should be no behavioral difference
    since we bail out from the fault the same way regardless whether we
    return 0, VM_FAULT_NOPAGE, or VM_FAULT_WRITE. Also note that
    VM_FAULT_WRITE has no effect for shared mappings since the only two
    places that check it - KSM and GUP - care about private mappings only.
    Generally the meaning of VM_FAULT_WRITE for shared mappings is not well
    defined and we should probably clean that up.

    Link: http://lkml.kernel.org/r/1479460644-25076-17-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
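
    The new return convention can be sketched in a few lines. This is a toy
    model under stated assumptions: demo_fault, DEMO_PTE_WRITE and
    DEMO_FAULT_NOPAGE are hypothetical stand-ins, a plain unsigned long
    plays the PTE, and no locking is shown. It only illustrates "0 on
    success, a NOPAGE-style code when the PTE changed underneath us".

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PTE_WRITE    0x2ul   /* hypothetical "writable" bit */
#define DEMO_FAULT_NOPAGE 0x100   /* hypothetical stand-in for VM_FAULT_NOPAGE */

struct demo_fault {
    unsigned long *ptep;     /* the live page table entry */
    unsigned long orig_pte;  /* its value when the fault started */
};

/* Returns 0 if the PTE was made writable, DEMO_FAULT_NOPAGE if it had changed. */
static int demo_finish_mkwrite(struct demo_fault *vmf)
{
    assert(vmf->ptep != NULL);
    if (*vmf->ptep != vmf->orig_pte)
        return DEMO_FAULT_NOPAGE;  /* someone else handled or changed it */
    *vmf->ptep |= DEMO_PTE_WRITE;
    return 0;
}
```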
     
  • Provide a helper function for finishing write faults due to PTE being
    read-only. The helper will be used by DAX to avoid the need of
    complicating generic MM code with DAX locking specifics.

    Link: http://lkml.kernel.org/r/1479460644-25076-16-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • wp_page_reuse() handles write shared faults which is needed only in
    wp_page_shared(). Move the handling only into that location to make
    wp_page_reuse() simpler and avoid a strange situation when we sometimes
    pass in locked page, sometimes unlocked etc.

    Link: http://lkml.kernel.org/r/1479460644-25076-15-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • So far we set vmf->page during WP faults only when we needed to pass it
    to the ->page_mkwrite handler. Set it in all the cases now and use that
    instead of passing page pointer explicitly around.

    Link: http://lkml.kernel.org/r/1479460644-25076-14-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We will need more information in the ->page_mkwrite() helper for DAX to
    be able to fully finish faults there. Pass vm_fault structure to
    do_page_mkwrite() and use it there so that information propagates
    properly from upper layers.

    Link: http://lkml.kernel.org/r/1479460644-25076-13-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently we duplicate handling of shared write faults in
    wp_page_reuse() and do_shared_fault(). Factor them out into a common
    function.

    Link: http://lkml.kernel.org/r/1479460644-25076-12-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Move final handling of COW faults from generic code into DAX fault
    handler. That way generic code doesn't have to be aware of
    peculiarities of DAX locking so remove that knowledge and make locking
    functions private to fs/dax.c.

    Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Introduce finish_fault() as a helper function for finishing page faults.
    It is a rather thin wrapper around alloc_set_pte() but since we'd want to
    call this from DAX code or filesystems, it is still useful to avoid some
    boilerplate code.

    Link: http://lkml.kernel.org/r/1479460644-25076-10-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "dax: Clear dirty bits after flushing caches", v5.

    Patchset to clear dirty bits from radix tree of DAX inodes when caches
    for corresponding pfns have been flushed. In principle, these patches
    enable handlers to easily update PTEs and do other work necessary to
    finish the fault without duplicating the functionality present in the
    generic code. I'd like to thank Kirill and Ross for reviews of the
    series!

    This patch (of 20):

    To allow full handling of COW faults add memcg field to struct vm_fault
    and a return value of ->fault() handler meaning that COW fault is fully
    handled and memcg charge must not be canceled. This will allow us to
    remove knowledge about special DAX locking from the generic fault code.

    Link: http://lkml.kernel.org/r/1479460644-25076-9-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Add orig_pte field to vm_fault structure to allow ->page_mkwrite
    handlers to fully handle the fault.

    This also allows us to save some passing of extra arguments around.

    Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Instead of creating another vm_fault structure, use the one passed to
    wp_pfn_shared() for passing arguments into pfn_mkwrite handler.

    Link: http://lkml.kernel.org/r/1479460644-25076-7-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Use vm_fault structure to pass cow_page, page, and entry in and out of
    the function.

    That reduces number of __do_fault() arguments from 4 to 1.

    Link: http://lkml.kernel.org/r/1479460644-25076-6-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Instead of creating another vm_fault structure, use the one passed to
    __do_fault() for passing arguments into fault handler.

    Link: http://lkml.kernel.org/r/1479460644-25076-5-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • struct vm_fault already has a pgoff entry. Use it instead of passing
    pgoff as a separate argument and then assigning it later.

    Link: http://lkml.kernel.org/r/1479460644-25076-4-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Every single user of vmf->virtual_address cast that entry to unsigned
    long before doing anything with it so the type of virtual_address does
    not really provide us any additional safety. Just use masked
    vmf->address which already has the appropriate type.

    Link: http://lkml.kernel.org/r/1479460644-25076-3-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently we have two different structures for passing fault information
    around - struct vm_fault and struct fault_env. DAX will need more
    information in struct vm_fault to handle its faults so the content of
    that structure would become even closer to fault_env. Furthermore it
    would need to generate struct fault_env to be able to call some of the
    generic functions. So at this point I don't think there's much use in
    keeping these two structures separate. Just embed into struct vm_fault
    all that is needed to use it for both purposes.

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Unexport the low-level __get_user_pages_unlocked() function and replace
    invocations with calls to more appropriate higher-level functions.

    In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
    with get_user_pages_unlocked() since we can now pass gup_flags.

    In async_pf_execute() and process_vm_rw_single_vec() we need to pass
    different tsk, mm arguments so get_user_pages_remote() is the sane
    replacement in these cases (having added manual acquisition and release
    of mmap_sem.)

    Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
    flag. However, this flag was originally silently dropped by commit
    1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
    appears to have been unintentional and reintroducing it is therefore not
    an issue.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Patch series "mm: unexport __get_user_pages_unlocked()".

    This patch series continues the cleanup of get_user_pages*() functions
    taking advantage of the fact we can now pass gup_flags as we please.

    It firstly adds an additional 'locked' parameter to
    get_user_pages_remote() to allow for its callers to utilise
    VM_FAULT_RETRY functionality. This is necessary as the invocation of
    __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
    this and no other existing higher level function would allow it to do
    so.

    Secondly existing callers of __get_user_pages_unlocked() are replaced
    with the appropriate higher-level replacement -
    get_user_pages_unlocked() if the current task and memory descriptor are
    referenced, or get_user_pages_remote() if other task/memory descriptors
    are referenced (having acquired mmap_sem.)

    This patch (of 2):

    Add an int *locked parameter to get_user_pages_remote() to allow
    VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

    Taking into account the previous adjustments to get_user_pages*()
    functions allowing for the passing of gup_flags, we are now in a
    position where __get_user_pages_unlocked() need only be exported for its
    ability to allow VM_FAULT_RETRY behaviour. This adjustment allows us to
    subsequently unexport __get_user_pages_unlocked() as well as allowing
    for future flexibility in the use of get_user_pages_remote().

    [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
    Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
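
    The 'locked' hand-off can be illustrated with a small toy. Everything
    here is hypothetical (demo_mm, demo_get_pages, the lock_depth counter
    standing in for mmap_sem); it only shows the protocol: the callee may
    drop the lock and reports that through *locked, and the caller releases
    the lock only if it is still held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_mm { int lock_depth; };  /* stand-in for a held/not-held lock */

static void demo_lock(struct demo_mm *mm)
{
    assert(mm->lock_depth == 0);
    mm->lock_depth = 1;
}

static void demo_unlock(struct demo_mm *mm)
{
    assert(mm->lock_depth == 1);
    mm->lock_depth = 0;
}

/*
 * Hypothetical callee. If it has to sleep and the caller passed a non-NULL
 * 'locked', it drops the lock itself and records that in *locked.
 */
static long demo_get_pages(struct demo_mm *mm, bool must_sleep, int *locked)
{
    if (must_sleep && locked != NULL) {
        demo_unlock(mm);
        *locked = 0;
    }
    return 1;  /* pretend one page was pinned */
}

/* Caller side: take the lock, call, and unlock only if still locked. */
static long demo_caller(struct demo_mm *mm, bool must_sleep)
{
    int locked = 1;
    long pinned;

    demo_lock(mm);
    pinned = demo_get_pages(mm, must_sleep, &locked);
    if (locked)
        demo_unlock(mm);
    return pinned;
}
```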
     
  • Add a function that allows us to batch free a page that has multiple
    references outstanding. Specifically this function can be used to drop
    a page being used in the page frag alloc cache. With this drivers can
    make use of functionality similar to the page frag alloc cache without
    having to do any workarounds for the fact that there is no function that
    frees multiple references.

    Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com
    Signed-off-by: Alexander Duyck
    Cc: "David S. Miller"
    Cc: "James E.J. Bottomley"
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Hans-Christian Noren Egtvedt
    Cc: Helge Deller
    Cc: James Hogan
    Cc: Jeff Kirsher
    Cc: Jonas Bonn
    Cc: Keguang Zhang
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Richard Kuo
    Cc: Russell King
    Cc: Steven Miao
    Cc: Tobias Klauser
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
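
    As an illustration of dropping several references in one step, here is
    a userspace sketch using C11 atomics. The names (demo_frag_page,
    demo_frag_drain) are hypothetical and the 'freed' flag merely stands in
    for handing the page back to the allocator; the kernel function added
    by this commit is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_frag_page {
    atomic_int refcount;  /* outstanding references to the page */
    bool freed;           /* stand-in for returning the page to the allocator */
};

/* Drop 'count' references at once; returns true if that freed the page. */
static bool demo_frag_drain(struct demo_frag_page *page, int count)
{
    int old;

    assert(count > 0);
    old = atomic_fetch_sub(&page->refcount, count);
    assert(old >= count);  /* never drop more references than are held */
    if (old == count) {
        page->freed = true;
        return true;
    }
    return false;
}
```

    A driver that pre-charged a page with many references can then release
    the unused remainder with one call instead of looping over single puts.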
     
  • compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
    the direct compaction was introduced by commit 56de7263fcf3 ("mm:
    compaction: direct compact when a high-order allocation fails"). The
    main reason is that the migration of page cache pages might recurse back
    to fs/io layer and we could potentially deadlock. This is overly
    conservative because all the anonymous memory is migrateable in the
    GFP_NOFS context just fine. This might be a large portion of the memory
    in many/most workloads.

    Remove the GFP_NOFS restriction and make sure that we skip all fs pages
    (those with a mapping) while isolating pages to be migrated. We cannot
    consider clean fs pages because they might need a metadata update so
    only isolate pages without any mapping for nofs requests.

    The effect of this patch will probably be very limited in many/most
    workloads because higher order GFP_NOFS requests are quite rare,
    although different configurations might lead to very different results.
    David Chinner has mentioned a heavy metadata workload with 64kB block
    which to quote him:

    : Unfortunately, there was an era of cargo cult configuration tweaks in the
    : Ceph community that has resulted in a large number of production machines
    : with XFS filesystems configured this way. And a lot of them store large
    : numbers of small files and run under significant sustained memory
    : pressure.
    :
    : I slowly working towards getting rid of these high order allocations and
    : replacing them with the equivalent number of single page allocations, but
    : I haven't got that (complex) change working yet.

    We can do the following to simulate that workload:
    $ mkfs.xfs -f -n size=64k
    $ mount /mnt/scratch
    $ time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
    -d /mnt/scratch/0 -d /mnt/scratch/1 \
    -d /mnt/scratch/2 -d /mnt/scratch/3 \
    -d /mnt/scratch/4 -d /mnt/scratch/5 \
    -d /mnt/scratch/6 -d /mnt/scratch/7 \
    -d /mnt/scratch/8 -d /mnt/scratch/9 \
    -d /mnt/scratch/10 -d /mnt/scratch/11 \
    -d /mnt/scratch/12 -d /mnt/scratch/13 \
    -d /mnt/scratch/14 -d /mnt/scratch/15

    and indeed it hammers the system with many high order GFP_NOFS requests as
    per a simple tracepoint during the load:
    $ echo '!(gfp_flags & 0x80) && (gfp_flags &0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter
    I am getting
    5287609 order=0
    37 order=1
    1594905 order=2
    3048439 order=3
    6699207 order=4
    66645 order=5

    My testing was done in a kvm guest so performance numbers should be
    taken with a grain of salt but there seems to be a difference when the
    patch is applied:

    * Original kernel
    FSUse%     Count  Size  Files/sec  App Overhead
         1   1600000     0     4300.1      20745838
         3   3200000     0     4239.9      23849857
         5   4800000     0     4243.4      25939543
         6   6400000     0     4248.4      19514050
         8   8000000     0     4262.1      20796169
         9   9600000     0     4257.6      21288675
        11  11200000     0     4259.7      19375120
        13  12800000     0     4220.7      22734141
        14  14400000     0     4238.5      31936458
        16  16000000     0     4231.5      23409901
        18  17600000     0     4045.3      23577700
        19  19200000     0     2783.4      58299526
        21  20800000     0     2678.2      40616302
        23  22400000     0     2693.5      83973996

    and xfs complaining about memory allocation not making progress
    [ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)
    [ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)
    [ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240)
    [ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240)
    [ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)

    * Patched kernel
    FSUse%     Count  Size  Files/sec  App Overhead
         1   1600000     0     4289.1      19243934
         3   3200000     0     4241.6      32828865
         5   4800000     0     4248.7      32884693
         6   6400000     0     4314.4      19608921
         8   8000000     0     4269.9      24953292
         9   9600000     0     4270.7      33235572
        11  11200000     0     4346.4      40817101
        13  12800000     0     4285.3      29972397
        14  14400000     0     4297.2      20539765
        16  16000000     0     4219.6      18596767
        18  17600000     0     4273.8      49611187
        19  19200000     0     4300.4      27944451
        21  20800000     0     4270.6      22324585
        22  22400000     0     4317.6      22650382
        24  24000000     0     4065.2      22297964

    So the dropdown at Count 19200000 didn't happen and there was only a
    single warning about allocation not making progress
    [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)

    This suggests that the patch has helped even though there is not all that
    much anonymous memory as the workload mostly generates fs metadata. I
    assume the success rate would be higher with more anonymous memory which
    should be the case in many workloads.

    [akpm@linux-foundation.org: fix comment]
    Link: http://lkml.kernel.org/r/20161012114721.31853-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
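
    Two pieces of the message lend themselves to a small sketch: the
    tracepoint filter and the "skip pages with a mapping" rule. The code
    below is a toy under stated assumptions. The two bit values are copied
    from the filter expression quoted in the message and are assumed to be
    that kernel's FS and direct-reclaim bits; demo_cpage and its 'mapping'
    field (NULL meaning "no fs mapping") are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Taken from the quoted filter; assumed meanings, not authoritative. */
#define DEMO_GFP_FS             0x80u
#define DEMO_GFP_DIRECT_RECLAIM 0x400000u

/* Mirrors the filter: a request that may direct reclaim but not enter the fs. */
static bool demo_trace_filter(unsigned gfp_flags)
{
    return !(gfp_flags & DEMO_GFP_FS) && (gfp_flags & DEMO_GFP_DIRECT_RECLAIM);
}

struct demo_cpage {
    const void *mapping;  /* non-NULL for fs-backed pages in this toy */
};

/* For nofs requests, only pages without any mapping may be isolated. */
static bool demo_may_isolate(const struct demo_cpage *page, unsigned gfp_mask)
{
    assert(page != NULL);
    if (!(gfp_mask & DEMO_GFP_FS) && page->mapping != NULL)
        return false;
    return true;
}
```

    Under those assumptions, the allocation mode printed in the XFS
    warnings above (0x2408240) passes the filter, since it has the 0x400000
    bit set and the 0x80 bit clear.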
     
  • Pull namespace updates from Eric Biederman:
    "After a lot of discussion and work we have finally reached a basic
    understanding of what is necessary to make unprivileged mounts safe in
    the presence of EVM and IMA xattrs which the last commit in this
    series reflects. While technically it is a revert the comments it adds
    are important for people not getting confused in the future. Clearing
    up that confusion allows us to seriously work on unprivileged mounts
    of fuse in the next development cycle.

    The rest of the fixes in this set are in the intersection of user
    namespaces, ptrace, and exec. I started with the first fix which
    started a feedback cycle of finding additional issues during review
    and fixing them. Culminating in a fix for a bug that has been present
    since at least Linux v1.0.

    Potentially these fixes were candidates for being merged during the rc
    cycle, and are certainly backport candidates but enough little things
    turned up during review and testing that I decided they should be
    handled as part of the normal development process just to be certain
    there were not any great surprises when it came time to backport some
    of these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
    exec: Ensure mm->user_ns contains the execed files
    ptrace: Don't allow accessing an undumpable mm
    ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
    mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

    Linus Torvalds
     
  • We truncated the possible read iterator to s_maxbytes in commit
    c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"),
    but our end condition handling was wrong: it's not an error to try to
    read at the end of the file.

    Reading past the end should return EOF (0), not EINVAL.

    See for example

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342
    http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html

    where a md5sum of a maximally sized file fails because the final read is
    exactly at s_maxbytes.

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-by: Joseph Salisbury
    Cc: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
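
The end condition described in the commit above can be illustrated with a small user-space sketch. This is not the kernel code: `clamp_read_len()` and its parameters are hypothetical names chosen for illustration, and `maxbytes` merely stands in for the filesystem's `s_maxbytes` limit. The point is only that a read starting at or beyond the limit reports EOF (0) rather than an error, while a read that straddles the limit is shortened.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel function): decide how many bytes a
 * read of `count` bytes at offset `pos` may return when the file size
 * is capped at `maxbytes`.
 *
 * Returns a negative errno for an invalid offset, 0 for EOF, or the
 * possibly shortened byte count otherwise.
 */
static long long clamp_read_len(long long pos, size_t count, long long maxbytes)
{
    if (pos < 0)
        return -EINVAL;         /* a negative offset is a real error */

    if (pos >= maxbytes)
        return 0;               /* at or past the limit: EOF, not EINVAL */

    /* Truncate the request so it never extends past the limit. */
    long long remaining = maxbytes - pos;
    if ((unsigned long long)remaining < (unsigned long long)count)
        return remaining;

    return (long long)count;
}
```

In the md5sum case mentioned above, the final read starts exactly at the limit, which corresponds to `pos == maxbytes` here and yields 0.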
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge request includes the dax-4.0-iomap-pmd branch which is
    needed for both ext4 and xfs dax changes to use iomap for DAX. It also
    includes the fscrypt branch which is needed for ubifs encryption work
    as well as ext4 encryption and fscrypt cleanups.

    Lots of cleanups and bug fixes, especially making sure ext4 is robust
    against maliciously corrupted file systems --- especially maliciously
    corrupted xattr blocks and a maliciously corrupted superblock. Also
    fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
    mbcache so we don't miss some common xattr blocks that can be merged"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    dax: Fix sleep in atomic contex in grab_mapping_entry()
    fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
    fscrypt: Delay bounce page pool allocation until needed
    fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
    fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
    fscrypt: Never allocate fscrypt_ctx on in-place encryption
    fscrypt: Use correct index in decrypt path.
    fscrypt: move the policy flags and encryption mode definitions to uapi header
    fscrypt: move non-public structures and constants to fscrypt_private.h
    fscrypt: unexport fscrypt_initialize()
    fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
    fscrypto: move ioctl processing more fully into common code
    fscrypto: remove unneeded Kconfig dependencies
    MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
    ext4: do not perform data journaling when data is encrypted
    ext4: return -ENOMEM instead of success
    ext4: reject inodes with negative size
    ext4: remove another test in ext4_alloc_file_blocks()
    Documentation: fix description of ext4's block_validity mount option
    ext4: fix checks for data=ordered and journal_async_commit options
    ...

    Linus Torvalds
     

14 Dec, 2016

5 commits

  • Pull workqueue updates from Tejun Heo:
    "Mostly patches to initialize workqueue subsystem earlier and get rid
    of keventd_up().

    The patches were headed for the last merge cycle but got delayed due
    to a bug found at the last minute, which is fixed now.

    Also, to help debugging, destroy_workqueue() is more chatty now on a
    sanity check failure."

    * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: move wq_numa_init() to workqueue_init()
    workqueue: remove keventd_up()
    debugobj, workqueue: remove keventd_up() usage
    slab, workqueue: remove keventd_up() usage
    power, workqueue: remove keventd_up() usage
    tty, workqueue: remove keventd_up() usage
    mce, workqueue: remove keventd_up() usage
    workqueue: make workqueue available early during boot
    workqueue: dump workqueue state on sanity check failures in destroy_workqueue()

    Linus Torvalds
     
  • Pull percpu update from Tejun Heo:
    "This includes just one patch to reject non-power-of-2 alignments and
    trigger a warning. Interestingly, this actually caught a bug in XEN
    ARM64"

    * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: ensure the requested alignment is power of two

    Linus Torvalds
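
A minimal sketch of the kind of check the percpu patch above describes, written as stand-alone C rather than kernel code. The names `is_power_of_two()` and `alignment_is_valid()` and the `max_align` parameter are assumptions made for this example; the actual allocator's conditions and warning mechanism are not shown in the log.

```c
#include <stdbool.h>
#include <stddef.h>

/* A power of two has exactly one bit set; zero is not a power of two. */
static bool is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical validation step: accept an alignment only if it is a
 * power of two and does not exceed an assumed upper bound `max_align`.
 * A caller would reject the allocation (and warn) when this is false.
 */
static bool alignment_is_valid(size_t align, size_t max_align)
{
    return is_power_of_two(align) && align <= max_align;
}
```

The power-of-two requirement matters because alignment arithmetic usually relies on masking with `align - 1`, which only works when a single bit is set.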
     
  • Pull char/misc driver updates from Greg KH:
    "Here's the big char/misc driver patches for 4.10-rc1. Lots of tiny
    changes over lots of "minor" driver subsystems, the largest being some
    new FPGA drivers. Other than that, a few other new drivers, but no new
    driver subsystems added for this kernel cycle, a nice change.

    All of these have been in linux-next with no reported issues"

    * tag 'char-misc-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (107 commits)
    uio-hv-generic: store physical addresses instead of virtual
    Tools: hv: kvp: configurable external scripts path
    uio-hv-generic: new userspace i/o driver for VMBus
    vmbus: add support for dynamic device id's
    hv: change clockevents unbind tactics
    hv: acquire vmbus_connection.channel_mutex in vmbus_free_channels()
    hyperv: Fix spelling of HV_UNKOWN
    mei: bus: enable non-blocking RX
    mei: fix the back to back interrupt handling
    mei: synchronize irq before initiating a reset.
    VME: Remove shutdown entry from vme_driver
    auxdisplay: ht16k33: select framebuffer helper modules
    MAINTAINERS: add git url for fpga
    fpga: Clarify how write_init works streaming modes
    fpga zynq: Fix incorrect ISR state on bootup
    fpga zynq: Remove priv->dev
    fpga zynq: Add missing \n to messages
    fpga: Add COMPILE_TEST to all drivers
    uio: pruss: add clk_disable()
    char/pcmcia: add some error checking in scr24x_read()
    ...

    Linus Torvalds
     
  • Pull power management updates from Rafael Wysocki:
    "Again, cpufreq gets more changes than the other parts this time (one
    new driver, one old driver less, a bunch of enhancements of the
    existing code, new CPU IDs, fixes, cleanups)

    There also are some changes in cpuidle (idle injection rework, a
    couple of new CPU IDs, online/offline rework in intel_idle, fixes and
    cleanups), in the generic power domains framework (mostly related to
    supporting power domains containing CPUs), and in the Operating
    Performance Points (OPP) library (mostly related to supporting devices
    with multiple voltage regulators)

    In addition to that, the system sleep state selection interface is
    modified to make it easier for distributions with unchanged user space
    to support suspend-to-idle as the default system suspend method, some
    issues are fixed in the PM core, the latency tolerance PM QoS
    framework is improved a bit, the Intel RAPL power capping driver is
    cleaned up and there are some fixes and cleanups in the devfreq
    subsystem

    Specifics:

    - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding
    for it (Markus Mayer)

    - Support for ARM Integrator/AP and Integrator/CP in the generic DT
    cpufreq driver and elimination of the old Integrator cpufreq driver
    (Linus Walleij)

    - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier,
    and PXA SoCs in the generic DT cpufreq driver (Baoyou Xie,
    Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik)

    - cpufreq core fix to eliminate races that may lead to using inactive
    policy objects and related cleanups (Rafael Wysocki)

    - cpufreq schedutil governor update to make it use SCHED_FIFO kernel
    threads (instead of regular workqueues) for doing delayed work (to
    reduce the response latency in some cases) and related cleanups
    (Viresh Kumar)

    - New cpufreq sysfs attribute for resetting statistics (Markus Mayer)

    - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis,
    Viresh Kumar)

    - Support for using generic cpufreq governors in the intel_pstate
    driver (Rafael Wysocki)

    - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy
    Performance Preference/Energy Performance Bias) knobs in the
    intel_pstate driver (Srinivas Pandruvada)

    - New CPU ID for Knights Mill in intel_pstate (Piotr Luc)

    - intel_pstate driver modification to use the P-state selection
    algorithm based on CPU load on platforms with the system profile in
    the ACPI tables set to "mobile" (Srinivas Pandruvada)

    - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki,
    Srinivas Pandruvada)

    - cpufreq powernv driver updates including fast switching support
    (for the schedutil governor), fixes and cleanups (Akshay Adiga,
    Andrew Donnellan, Denis Kirjanov)

    - acpi-cpufreq driver rework to switch it over to the new CPU
    offline/online state machine (Sebastian Andrzej Siewior)

    - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth
    Prakash)

    - Idle injection rework (to make it use the regular idle path instead
    of a home-grown custom one) and related powerclamp thermal driver
    updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej
    Siewior)

    - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy
    Shevchenko, Piotr Luc)

    - intel_idle driver cleanups and switch over to using the new CPU
    offline/online state machine (Anna-Maria Gleixner, Sebastian
    Andrzej Siewior)

    - cpuidle DT driver update to support suspend-to-idle properly
    (Sudeep Holla)

    - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian,
    Rafael Wysocki)

    - Preliminary support for power domains including CPUs in the generic
    power domains (genpd) framework and related DT bindings (Lina Iyer)

    - Assorted fixes and cleanups in the generic power domains (genpd)
    framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven)

    - Preliminary support for devices with multiple voltage regulators
    and related fixes and cleanups in the Operating Performance Points
    (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd)

    - System sleep state selection interface rework to make it easier to
    support suspend-to-idle as the default system suspend method
    (Rafael Wysocki)

    - PM core fixes and cleanups, mostly related to the interactions
    between the system suspend and runtime PM frameworks (Ulf Hansson,
    Sahitya Tummala, Tony Lindgren)

    - Latency tolerance PM QoS framework improvements (Andrew Lutomirski)

    - New Knights Mill CPU ID for the Intel RAPL power capping driver
    (Piotr Luc)

    - Intel RAPL power capping driver fixes, cleanups and switch over to
    using the new CPU offline/online state machine (Jacob Pan, Thomas
    Gleixner, Sebastian Andrzej Siewior)

    - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc,
    rockchip-dfi devfreq drivers and the devfreq core (Axel Lin,
    Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar)

    - Fix for false-positive KASAN warnings during resume from ACPI S3
    (suspend-to-RAM) on x86 (Josh Poimboeuf)

    - Memory map verification during resume from hibernation on x86 to
    ensure a consistent address space layout (Chen Yu)

    - Wakeup sources debugging enhancement (Xing Wei)

    - rockchip-io AVS driver cleanup (Shawn Lin)"

    * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits)
    devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks
    devfreq: rk3399_dmc: Remove dangling rcu_read_unlock()
    devfreq: exynos: Don't use OPP structures outside of RCU locks
    Documentation: intel_pstate: Document HWP energy/performance hints
    cpufreq: intel_pstate: Support for energy performance hints with HWP
    cpufreq: intel_pstate: Add locking around HWP requests
    PM / sleep: Print active wakeup sources when blocking on wakeup_count reads
    PM / core: Fix bug in the error handling of async suspend
    PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend
    PM / Domains: Fix compatible for domain idle state
    PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators()
    PM / OPP: Allow platform specific custom set_opp() callbacks
    PM / OPP: Separate out _generic_set_opp()
    PM / OPP: Add infrastructure to manage multiple regulators
    PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage()
    PM / OPP: Manage supply's voltage/current in a separate structure
    PM / OPP: Don't use OPP structure outside of rcu protected section
    PM / OPP: Reword binding supporting multiple regulators per device
    PM / OPP: Fix incorrect cpu-supply property in binding
    cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state()
    ...

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the main block pull request for this series. Contrary to previous
    releases, I've kept the core and driver changes in the same branch. We
    always ended up having dependencies between the two for obvious
    reasons, so it makes more sense to keep them together. That said, I'll
    probably try and keep more topical branches going forward, especially
    for cycles that end up being as busy as this one.

    The major parts of this pull request are:

    - Improved support for O_DIRECT on block devices, with a small
    private implementation instead of using the pig that is
    fs/direct-io.c. From Christoph.

    - Request completion tracking in a scalable fashion. This is utilized
    by two components in this pull, the new hybrid polling and the
    writeback queue throttling code.

    - Improved support for polling with O_DIRECT, adding a hybrid mode
    that combines pure polling with an initial sleep. From me.

    - Support for automatic throttling of writeback queues on the block
    side. This uses feedback from the device completion latencies to
    scale the queue on the block side up or down. From me.

    - Support for SMR drives in the block layer and for SD. From Hannes
    and Shaun.

    - Multi-connection support for nbd. From Josef.

    - Cleanup of request and bio flags, so we have a clear split between
    which are bio (or rq) private, and which ones are shared. From
    Christoph.

    - A set of patches from Bart, that improve how we handle queue
    stopping and starting in blk-mq.

    - Support for WRITE_ZEROES from Chaitanya.

    - Lightnvm updates from Javier/Matias.

    - Support for FC for the nvme-over-fabrics code. From James Smart.

    - A bunch of fixes from a whole slew of people, too many to name
    here"
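
The hybrid polling mode mentioned in the list above (an initial sleep followed by pure polling) can be sketched as a small simulation. This is not the block layer's implementation: the types, the function `hybrid_poll()`, and the simulated clock are all hypothetical, and how the kernel chooses the sleep length is not described in the log. The sketch only shows the trade-off: a longer initial sleep means fewer polls, but sleeping past the completion time adds latency.

```c
#include <stdint.h>

/* Simulated request: completes when the simulated clock reaches done_at_ns. */
struct sim_request {
    uint64_t done_at_ns;
};

struct poll_result {
    uint64_t finished_ns;   /* simulated time at which completion was seen */
    unsigned polls;         /* number of busy-poll iterations spent */
};

/*
 * Hypothetical hybrid poll: sleep for sleep_ns first, then poll every
 * poll_step_ns until the request is complete. poll_step_ns must be > 0.
 */
static struct poll_result hybrid_poll(const struct sim_request *rq,
                                      uint64_t now_ns,
                                      uint64_t sleep_ns,
                                      uint64_t poll_step_ns)
{
    struct poll_result r = { 0, 0 };

    now_ns += sleep_ns;                 /* phase 1: initial sleep */

    while (now_ns < rq->done_at_ns) {   /* phase 2: pure polling */
        now_ns += poll_step_ns;
        r.polls++;
    }

    r.finished_ns = now_ns;
    return r;
}
```

With `sleep_ns` set to 0 this degenerates to pure polling; with a sleep longer than the request's service time it never polls at all but notices the completion late.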

    * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
    blk-stat: fix a few cases of missing batch flushing
    blk-flush: run the queue when inserting blk-mq flush
    elevator: make the rqhash helpers exported
    blk-mq: abstract out blk_mq_dispatch_rq_list() helper
    blk-mq: add blk_mq_start_stopped_hw_queue()
    block: improve handling of the magic discard payload
    blk-wbt: don't throttle discard or write zeroes
    nbd: use dev_err_ratelimited in io path
    nbd: reset the setup task for NBD_CLEAR_SOCK
    nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
    nvme-fabrics: Add target support for FC transport
    nvme-fabrics: Add host support for FC transport
    nvme-fabrics: Add FC transport LLDD api definitions
    nvme-fabrics: Add FC transport FC-NVME definitions
    nvme-fabrics: Add FC transport error codes to nvme.h
    Add type 0x28 NVME type code to scsi fc headers
    nvme-fabrics: patch target code in prep for FC transport support
    nvme-fabrics: set sqe.command_id in core not transports
    parser: add u64 number parser
    nvme-rdma: align to generic ib_event logging helper
    ...

    Linus Torvalds
     

13 Dec, 2016

1 commit

  • Merge updates from Andrew Morton:

    - various misc bits

    - most of MM (quite a lot of MM material is awaiting the merge of
    linux-next dependencies)

    - kasan

    - printk updates

    - procfs updates

    - MAINTAINERS

    - /lib updates

    - checkpatch updates

    * emailed patches from Andrew Morton: (123 commits)
    init: reduce rootwait polling interval time to 5ms
    binfmt_elf: use vmalloc() for allocation of vma_filesz
    checkpatch: don't emit unified-diff error for rename-only patches
    checkpatch: don't check c99 types like uint8_t under tools
    checkpatch: avoid multiple line dereferences
    checkpatch: don't check .pl files, improve absolute path commit log test
    scripts/checkpatch.pl: fix spelling
    checkpatch: don't try to get maintained status when --no-tree is given
    lib/ida: document locking requirements a bit better
    lib/rbtree.c: fix typo in comment of ____rb_erase_color
    lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
    MAINTAINERS: add drm and drm/i915 irc channels
    MAINTAINERS: add "C:" for URI for chat where developers hang out
    MAINTAINERS: add drm and drm/i915 bug filing info
    MAINTAINERS: add "B:" for URI where to file bugs
    get_maintainer: look for arbitrary letter prefixes in sections
    printk: add Kconfig option to set default console loglevel
    printk/sound: handle more message headers
    printk/btrfs: handle more message headers
    printk/kdb: handle more message headers
    ...

    Linus Torvalds