09 May, 2017

2 commits

  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced the AOP_FLAG_UNINTERRUPTIBLE flag,
    which was checked in pagecache_write_begin(), but that check was
    removed by commit 4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, the AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere.
    Remove it. This patch makes no functional changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • The iov_iter_revert() argument had the wrong sign. Unfortunately, it
    slipped through testing, since most of the time we don't do anything
    to the iterator afterwards, and a potential oops from walking
    iter->iov too far backwards is too infrequent to be easily triggered.

    Add a sanity check in iov_iter_revert() to catch bugs like this one;
    fortunately, the same braino hadn't happened in other callers, but
    we'd better have a warning if such a thing crops up.
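
    As a rough sketch (not the exact kernel code), the check presumably
    sits at the top of iov_iter_revert(), with MAX_RW_COUNT as an assumed
    upper bound on any single I/O:

    void iov_iter_revert(struct iov_iter *i, size_t unroll)
    {
            if (!unroll)
                    return;
            /* reverting more than any possible I/O is a caller bug */
            if (WARN_ON(unroll > MAX_RW_COUNT))
                    return;
            /* ... walk the iterator backwards by 'unroll' bytes ... */
    }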

    Signed-off-by: Al Viro

    Al Viro
     

04 May, 2017

2 commits

  • Patch series "Properly invalidate data in the cleancache", v2.

    We've noticed that after a direct IO write, a buffered read sometimes
    gets stale data coming from the cleancache. The reason for this is
    that some direct write hooks call invalidate_inode_pages2[_range]()
    only if mapping->nrpages is non-zero, so we may not invalidate data
    in the cleancache.

    Another odd thing is that we check only ->nrpages and not
    ->nrexceptional, even though invalidate_inode_pages2[_range]()
    invalidates exceptional entries as well. So we invalidate exceptional
    entries only if ->nrpages != 0? This doesn't feel right.

    - Patch 1 fixes direct IO writes by removing the ->nrpages check.
    - Patch 2 fixes a similar case in invalidate_bdev().
    Note: I only fixed the conditional cleancache_invalidate_inode() here.
    Do we also need to add a ->nrexceptional check into invalidate_bdev()?

    - Patches 3-4: some optimizations.

    This patch (of 4):

    Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
    only if mapping->nrpages is non-zero. This can't be right, because
    invalidate_inode_pages2[_range]() also invalidates data in the
    cleancache via a cleancache_invalidate_inode() call. So if the page
    cache is empty but there is some data in the cleancache, a buffered
    read after a direct IO write would get stale data from the
    cleancache.

    Also it doesn't feel right to check only ->nrpages, because
    invalidate_inode_pages2[_range]() invalidates exceptional entries as
    well.

    Fix this by calling invalidate_inode_pages2[_range]() regardless of
    nrpages state.
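
    A minimal sketch of the pattern being fixed (illustrative, not the
    exact diff in any one filesystem hook):

    /* before: skips cleancache_invalidate_inode(), which is hidden
     * inside invalidate_inode_pages2_range(), whenever the page cache
     * happens to be empty */
    if (mapping->nrpages)
            invalidate_inode_pages2_range(mapping, start, end);

    /* after: invalidate unconditionally */
    invalidate_inode_pages2_range(mapping, start, end);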

    Note: nfs, cifs, and 9p don't need a similar fix because they never
    call cleancache_get_page() (neither directly nor via
    mpage_readpage[s]()), so they are not affected by this bug.

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The round_up() macro generates a couple of unnecessary instructions
    in this usage:

    48cd: 49 8b 47 50 mov 0x50(%r15),%rax
    48d1: 48 83 e8 01 sub $0x1,%rax
    48d5: 48 0d ff 0f 00 00 or $0xfff,%rax
    48db: 48 83 c0 01 add $0x1,%rax
    48df: 48 c1 f8 0c sar $0xc,%rax
    48e3: 48 39 c3 cmp %rax,%rbx
    48e6: 72 2e jb 4916

    If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
    then GCC can see through it and remove the mask (because that would be
    dead code given the subsequent shift):

    48cd: 49 8b 47 50 mov 0x50(%r15),%rax
    48d1: 48 05 ff 0f 00 00 add $0xfff,%rax
    48d7: 48 c1 e8 0c shr $0xc,%rax
    48db: 48 39 c3 cmp %rax,%rbx
    48de: 72 2e jb 490e

    But that's problematic because we'd evaluate 'y' twice. Converting
    round_up into an inline function prevents it from being used in other
    definitions. The easiest thing to do is just change these three usages
    of round_up to use DIV_ROUND_UP. Also add an unlikely() because GCC's
    heuristic is wrong in this case.
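
    For reference, a sketch of the macros involved (paraphrased from the
    kernel headers of that era):

    #define __round_mask(x, y)  ((__typeof__(x))((y) - 1))
    #define round_up(x, y)      ((((x) - 1) | __round_mask(x, y)) + 1)
    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

    so a use like round_up(pos, PAGE_SIZE) >> PAGE_SHIFT becomes
    DIV_ROUND_UP(pos, PAGE_SIZE), which GCC compiles to the shorter
    add+shr sequence shown above.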

    Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

03 May, 2017

1 commit

  • Pull iov_iter updates from Al Viro:
    "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

    There's more iov_iter work in my local tree, but I'd prefer to push
    the stuff that had been in -next first"

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    iov_iter: don't revert iov buffer if csum error
    generic_file_read_iter(): make use of iov_iter_revert()
    generic_file_direct_write(): make use of iov_iter_revert()
    orangefs: use iov_iter_revert()
    sctp: switch to copy_from_iter_full()
    net/9p: switch to copy_from_iter_full()
    switch memcpy_from_msg() to copy_from_iter_full()
    rds: make use of iov_iter_revert()

    Linus Torvalds
     

03 Apr, 2017

1 commit

  • ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
    ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/filemap.c:1283: ERROR: Unexpected indentation.
    ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
    ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
    ./ipc/util.c:676: ERROR: Unexpected indentation.
    ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./security/security.c:109: ERROR: Unexpected indentation.
    ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
    ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
    ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
    ./ipc/util.c:477: ERROR: Unknown target name: "s".

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Bjorn Helgaas
    Signed-off-by: Jonathan Corbet

    mchehab@s-opensource.com
     

25 Feb, 2017

2 commits

  • With rw_page, page_endio is used for completing IO on a page, and it
    propagates a write error to the address space if the IO fails. The
    problem is that it accesses page->mapping directly, which might be
    okay for file-backed pages but not for an anonymous page. Otherwise,
    it can corrupt one of the fields of the anon_vma under us and the
    system panics randomly.

    swap_writepage
      bdev_writepage
        ops->rw_page

    I encountered the BUG while developing a new zram feature, and it was
    really hard to figure out because it crashed randomly: sometimes an
    mmap_sem lockdep splat, sometimes crashes in places never related to
    zram/zsmalloc, and it was not reproducible with some configurations.

    Considering how subtle the bug is and that people do fast-swap
    testing with brd, I think it's worth a stable tag.

    Fixes: dd6bd0d9c7db ("swap: use bdev_read_page() / bdev_write_page()")
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
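
    The resulting change to vm_operations_struct looks roughly like this
    (sketch; surrounding members omitted):

    /* before */
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* after: handlers reach the VMA through vmf->vma instead */
    int (*fault)(struct vm_fault *vmf);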

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

2 commits

  • Fix kernel-doc warnings in mm/filemap.c:

    mm/filemap.c:993: warning: No description found for parameter '__page'
    mm/filemap.c:993: warning: Excess function parameter 'page' description in '__lock_page'

    Link: http://lkml.kernel.org/r/a66fe492-518c-ad6c-5f03-5e8b721fb451@infradead.org
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • These are no longer used outside mm/filemap.c, so un-export them and
    make them static where possible. These were exported specifically for
    NFS use in commit a4796e37c12e ("MM: export page_wakeup functions").

    Link: http://lkml.kernel.org/r/20170103182234.30141-3-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Cc: Trond Myklebust
    Cc: Anna Schumaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

04 Feb, 2017

1 commit

  • do_generic_file_read() can be told to perform a large request from
    userspace. If the system is under OOM and the reading task is the OOM
    victim, then it has access to memory reserves, and finishing the full
    request can lead to full memory depletion, which is dangerous. Make
    sure we instead go with a short read and allow the killed task to
    terminate.
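
    A hedged sketch of the change (simplified): a fatal-signal check in
    the do_generic_file_read() read loop, so an OOM-killed reader bails
    out with whatever it has copied so far:

    for (;;) {
            /* ... find/read the next page ... */
            if (fatal_signal_pending(current)) {
                    error = -EINTR;
                    goto out;  /* becomes a short read if some data
                                * was already copied */
            }
            /* ... copy the page to userspace ... */
    }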

    Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Christoph Hellwig
    Cc: Tetsuo Handa
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jan, 2017

1 commit

  • Currently in DAX if we have three read faults on the same hole address we
    can end up with the following:

    Thread 0                Thread 1                Thread 2
    --------                --------                --------
    dax_iomap_fault
     grab_mapping_entry
      lock_slot
       <locks empty DAX entry>

                            dax_iomap_fault
                             grab_mapping_entry
                              get_unlocked_mapping_entry
                               <sleeps on empty DAX entry>

                                                    dax_iomap_fault
                                                     grab_mapping_entry
                                                      get_unlocked_mapping_entry
                                                       <sleeps on empty DAX entry>
    dax_load_hole
     find_or_create_page
     ...
      page_cache_tree_insert
       dax_wake_mapping_entry_waiter
        <wakes one waiter>
       __radix_tree_replace
        <swaps empty DAX entry with the 4k zero page>

                            <woken; locking is now against the zero page>
                            get_page
                            lock_page
                            ...
                            put_locked_mapping_entry
                            unlock_page
                            put_page

                                                    <sleeps forever on the
                                                     empty DAX entry>

    The crux of the problem is that once we insert a 4k zero page, all
    locking from then on is done in terms of that 4k zero page and any
    additional threads sleeping on the empty DAX entry will never be woken.

    Fix this by waking all sleepers when we replace the DAX radix tree entry
    with a 4k zero page. This will allow all sleeping threads to
    successfully transition from locking based on the DAX empty entry to
    locking on the 4k zero page.
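
    Presumably the fix boils down to passing a wake-all flag at the
    replacement site, something like (parameter names assumed):

    /* in page_cache_tree_insert(): wake every waiter on the old DAX
     * empty entry 'p', not just one, when swapping it for the zero page */
    dax_wake_mapping_entry_waiter(mapping, page->index, p, true);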

    With the test case reported by Xiong this happens very regularly in my
    test setup, with some runs resulting in 9+ threads in this deadlocked
    state. With this fix I've been able to run that same test dozens of
    times in a loop without issue.

    Fixes: ac401cc78242 ("dax: New fault locking")
    Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Xiong Zhou
    Reviewed-by: Jan Kara
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

30 Dec, 2016

2 commits

  • mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
    mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
    return test_bit(PG_waiters);
    ^~~~~~~~

    Fixes: b91e1302ad9b ('mm: optimize PageWaiters bit use for unlock_page()')
    Signed-off-by: Olof Johansson
    Brown-paper-bag-by: Linus Torvalds
    Signed-off-by: Linus Torvalds

    Olof Johansson
     
  • In commit 62906027091f ("mm: add PageWaiters indicating tasks are
    waiting for a page bit") Nick Piggin made our page locking no longer
    unconditionally touch the hashed page waitqueue, which not only helps
    performance in general, but is particularly helpful on NUMA machines
    where the hashed wait queues can bounce around a lot.

    However, the "clear lock bit atomically and then test the waiters bit"
    sequence turns out to be much more expensive than it needs to be,
    because you get a nasty stall when trying to access the same word that
    just got updated atomically.

    On architectures where locking is done with LL/SC, this would be trivial
    to fix with a new primitive that clears one bit and tests another
    atomically, but that ends up not working on x86, where the only atomic
    operations that return the result end up being cmpxchg and xadd. The
    atomic bit operations return the old value of the same bit we changed,
    not the value of an unrelated bit.

    On x86, we could put the lock bit in the high bit of the byte, and use
    "xadd" with that bit (where the overflow ends up not touching other
    bits), and look at the other bits of the result. However, an even
    simpler model is to just use a regular atomic "and" to clear the lock
    bit, and then the sign bit in eflags will indicate the resulting state
    of the unrelated bit #7.

    So by moving the PageWaiters bit up to bit #7, we can atomically clear
    the lock bit and test the waiters bit on x86 too. And on
    architectures with LL/SC (which is all the usual RISC suspects), the
    particular bit doesn't matter, so they are fine with this approach
    too.

    This avoids the extra access to the same atomic word, and thus avoids
    the costly stall at page unlock time.

    The only downside is that the interface ends up being a bit odd and
    specialized: clear a bit in a byte, and test the sign bit. Nick doesn't
    love the resulting name of the new primitive, but I'd rather make the
    name be descriptive and very clear about the limitation imposed by
    trying to work across all relevant architectures than make it be some
    generic thing that doesn't make the odd semantics explicit.

    So this introduces the new architecture primitive

    clear_bit_unlock_is_negative_byte();

    and adds the trivial implementation for x86. We have a generic
    non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
    combination) which can be overridden by any architecture that can do
    better. According to Nick, Power has the same hiccup x86 has, for
    example, but some other architectures may not even care.
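
    A sketch of that generic fallback, assuming the shape described above
    (simplified; the real definition may differ in details):

    #ifndef clear_bit_unlock_is_negative_byte
    static inline bool clear_bit_unlock_is_negative_byte(long nr,
                                                volatile unsigned long *mem)
    {
            clear_bit_unlock(nr, mem);
            /* bit #7 is the sign bit of the low byte */
            return test_bit(7, mem);
    }
    #endif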

    All these optimizations mean that my page locking stress-test (which is
    just executing a lot of small short-lived shell scripts: "make test" in
    the git source tree) no longer makes our page locking look horribly bad.
    Before all these optimizations, the unlock_page() costs alone were
    just over 3% of all CPU overhead on "make test". After this, it's
    down to
    0.66%, so just a quarter of the cost it used to be.

    (The difference on NUMA is bigger, but there this micro-optimization is
    likely less noticeable, since the big issue on NUMA was not the accesses
    to 'struct page', but the waitqueue accesses that were already removed
    by Nick's earlier commit).

    Acked-by: Nick Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Dec, 2016

1 commit

  • Add a new page flag, PageWaiters, to indicate the page waitqueue has
    tasks waiting. This can be tested rather than testing waitqueue_active
    which requires another cacheline load.

    This bit is always set when the page has tasks on page_waitqueue(page),
    and is set and cleared under the waitqueue lock. It may be set when
    there are no tasks on the waitqueue, which will cause a harmless extra
    wakeup check that will clear the bit.

    The generic bit-waitqueue infrastructure is no longer used for pages.
    Instead, waitqueues are used directly with a custom key type. The
    generic code was not flexible enough to have PageWaiters manipulation
    under the waitqueue lock (which simplifies concurrency).

    This improves the performance of page lock intensive microbenchmarks by
    2-3%.

    Putting two bits in the same word opens the opportunity to remove the
    memory barrier between clearing the lock bit and testing the waiters
    bit, after some work on the arch primitives (e.g., ensuring memory
    operand widths match and cover both bits).
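
    A minimal sketch of the unlock fast path this enables (simplified;
    wake_up_page_bit() is assumed as the custom-key waker, and the
    explicit barrier is what the arch work above would remove):

    void unlock_page(struct page *page)
    {
            page = compound_head(page);
            clear_bit_unlock(PG_locked, &page->flags);
            smp_mb__after_atomic();
            /* only touch the hashed waitqueue if someone is waiting */
            if (PageWaiters(page))
                    wake_up_page_bit(page, PG_locked);
    }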

    Signed-off-by: Nicholas Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

15 Dec, 2016

4 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton : (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Currently we have two different structures for passing fault information
    around - struct vm_fault and struct fault_env. DAX will need more
    information in struct vm_fault to handle its faults so the content of
    that structure would become even closer to fault_env. Furthermore it
    would need to generate struct fault_env to be able to call some of the
    generic functions. So at this point I don't think there's much use in
    keeping these two structures separate. Just embed into struct vm_fault
    all that is needed to use it for both purposes.

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • We truncated the possible read iterator to s_maxbytes in commit
    c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()"),
    but our end condition handling was wrong: it's not an error to try to
    read at the end of the file.

    Reading past the end should return EOF (0), not EINVAL.

    See for example

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649342
    http://lists.gnu.org/archive/html/bug-coreutils/2016-12/msg00008.html

    where an md5sum of a maximally sized file fails because the final
    read is exactly at s_maxbytes.
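
    A hedged sketch of the fix (simplified): treat a read starting at or
    past s_maxbytes as EOF rather than an error:

    if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
            return 0;       /* previously: return -EINVAL; */
    iov_iter_truncate(iter, inode->i_sb->s_maxbytes);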

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-by: Joseph Salisbury
    Cc: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull ext4 updates from Ted Ts'o:
    "This merge request includes the dax-4.0-iomap-pmd branch which is
    needed for both ext4 and xfs dax changes to use iomap for DAX. It also
    includes the fscrypt branch which is needed for ubifs encryption work
    as well as ext4 encryption and fscrypt cleanups.

    Lots of cleanups and bug fixes, especially making sure ext4 is robust
    against maliciously corrupted file systems --- especially maliciously
    corrupted xattr blocks and a maliciously corrupted superblock. Also
    fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
    mbcache so we don't miss some common xattr blocks that can be merged"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    dax: Fix sleep in atomic contex in grab_mapping_entry()
    fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
    fscrypt: Delay bounce page pool allocation until needed
    fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
    fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
    fscrypt: Never allocate fscrypt_ctx on in-place encryption
    fscrypt: Use correct index in decrypt path.
    fscrypt: move the policy flags and encryption mode definitions to uapi header
    fscrypt: move non-public structures and constants to fscrypt_private.h
    fscrypt: unexport fscrypt_initialize()
    fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
    fscrypto: move ioctl processing more fully into common code
    fscrypto: remove unneeded Kconfig dependencies
    MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
    ext4: do not perform data journaling when data is encrypted
    ext4: return -ENOMEM instead of success
    ext4: reject inodes with negative size
    ext4: remove another test in ext4_alloc_file_blocks()
    Documentation: fix description of ext4's block_validity mount option
    ext4: fix checks for data=ordered and journal_async_commit options
    ...

    Linus Torvalds
     

13 Dec, 2016

4 commits

  • Shadow entries in the page cache used to be accounted behind the radix
    tree implementation's back in the upper bits of node->count, and the
    radix tree code extending a single-entry tree with a shadow entry in
    root->rnode would corrupt that counter. As a result, we could not put
    shadow entries at index 0 if the tree didn't have any other entries, and
    that means no refault detection for any single-page file.

    Now that the shadow entries are tracked natively in the radix tree's
    exceptional counter, this is no longer necessary. Extending and
    shrinking the tree from and to single entries in root->rnode now does
    the right thing when the entry is exceptional; remove that limitation.

    Link: http://lkml.kernel.org/r/20161117193244.GF23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, we track the shadow entries in the page cache in the upper
    bits of the radix_tree_node->count, behind the back of the radix tree
    implementation. Because the radix tree code has no awareness of them,
    we rely on random subtleties throughout the implementation (such as the
    node->count != 1 check in the shrinking code, which is meant to exclude
    multi-entry nodes but also happens to skip nodes with only one shadow
    entry, as that's accounted in the upper bits). This is error prone and
    has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't
    plant shadow entries without radix tree node").

    To remove these subtleties, this patch moves shadow entry tracking from
    the upper bits of node->count to the existing counter for exceptional
    entries. node->count goes back to being a simple counter of valid
    entries in the tree node and can be shrunk to a single byte.

    This vastly simplifies the page cache code. All accounting happens
    natively inside the radix tree implementation, and maintaining the LRU
    linkage of shadow nodes is consolidated into a single function in the
    workingset code that is called for leaf nodes affected by a change in
    the page cache tree.

    This also removes the last user of the __radix_delete_node() return
    value. Eliminate it.

    Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The bug in khugepaged fixed earlier in this series shows that radix tree
    slot replacement is fragile; and it will become more so when not only
    NULL <-> !NULL transitions need to be caught but transitions from and to
    exceptional entries as well. We need checks.

    Re-implement radix_tree_replace_slot() on top of the sanity-checked
    __radix_tree_replace(). This requires existing callers to also pass the
    radix tree root, but it'll warn us when somebody replaces slots with
    contents that need proper accounting (transitions between NULL entries,
    real entries, exceptional entries) and where a replacement through the
    slot pointer would corrupt the radix tree node counts.

    Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Unlike THP, hugetlb pages are represented by one entry in the
    radix-tree.

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/20161110163640.126124-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Nov, 2016

2 commits

  • DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
    locking. This patch allows DAX PMDs to participate in the DAX radix tree
    based locking scheme so that they can be re-enabled using the new struct
    iomap based fault handlers.

    There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
    mappings that have an associated block allocation, and 4k DAX empty
    entries. The empty entries exist to provide locking for the duration of a
    given page fault.

    This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
    entries, PMD DAX entries that have associated block allocations, and 2 MiB
    DAX empty entries.

    Unlike the 4k case where we insert a struct page* into the radix tree for
    4k zero pages, for HZP we insert a DAX exceptional entry with the new
    RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in
    every 2MiB hole mapping, and it doesn't make sense to have that same struct
    page* with multiple entries in multiple trees. This would cause contention
    on the single page lock for the one Huge Zero Page, and it would break the
    page->index and page->mapping associations that are assumed to be valid in
    many other places in the kernel.

    One difficult use case is when one thread is trying to use 4k entries in
    radix tree for a given offset, and another thread is using 2 MiB entries
    for that same offset. The current code handles this by making the 2 MiB
    user fall back to 4k entries for most cases. This was done because it is
    the simplest solution, and because the use of 2MiB pages is already
    opportunistic.

    If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
    we run into the problem of how we lock out 4k page faults for the entire
    2MiB range while we clean out the radix tree so we can insert the 2MiB
    entry. We can solve this problem if we need to, but I think that the cases
    where both 2MiB entries and 4K entries are being used for the same range
    will be rare enough and the gain small enough that it probably won't be
    worth the complexity.

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Signed-off-by: Dave Chinner

    Ross Zwisler
     
  • DAX radix tree locking currently locks entries based on the unique
    combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
    This works for PTEs, but as we move to PMDs we will need to have all the
    offsets within the range covered by the PMD to map to the same bit lock.
    To accomplish this, for ranges covered by a PMD entry we will instead lock
    based on the page offset of the beginning of the PMD entry. The 'mapping'
    pointer is still used in the same way.
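
    As a hypothetical illustration (the mask name below is assumed for
    this sketch): every index inside a 2 MiB PMD range locks on the same
    PMD-aligned key.

    #define PG_PMD_COLOUR   ((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)
    pgoff_t lock_index = index & ~PG_PMD_COLOUR;  /* start of the PMD */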

    Signed-off-by: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Dave Chinner

    Ross Zwisler
     

07 Nov, 2016

1 commit

  • Starting from the 4.9-rc1 kernel, I noticed some test failures of
    sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
    testing on sub-page block size filesystems (tested both XFS and
    ext4): these syscalls started returning EIO in the tests. e.g.

    sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
    sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
    sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
    sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1

    This is because in sub-page block size cases we don't need the whole
    page to be uptodate; it's OK if only the part we care about is
    uptodate (when the fs has ->is_partially_uptodate defined). But
    page_cache_pipe_buf_confirm() doesn't have the ability to check the
    partially-uptodate case; it needs the whole page to be uptodate. So
    it returns EIO in this case.

    This is a regression introduced by commit 82c156f85384 ("switch
    generic_file_splice_read() to use of ->read_iter()"). Prior to that
    change, generic_file_splice_read() didn't allow partially-uptodate
    pages either, so it worked fine.

    Fix it by skipping the partially-uptodate check if we're working on
    a pipe in do_generic_file_read(), so we read the whole page from
    disk as long as the page is not uptodate.
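
    A hedged sketch of the fix in do_generic_file_read() (simplified):

    /* pipes can't handle partially uptodate pages */
    if (unlikely(iter->type & ITER_PIPE))
            goto page_not_up_to_date;
    if (!mapping->a_ops->is_partially_uptodate)
            goto page_not_up_to_date;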

    Signed-off-by: Eryu Guan
    Signed-off-by: Al Viro

    Eryu Guan
     

28 Oct, 2016

1 commit

  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • Pull splice fixups from Al Viro:
    "A couple of fixups for interaction of pipe-backed iov_iter with
    O_DIRECT reads + constification of a couple of primitives in uio.h
    missed by previous rounds.

    Kudos to davej - his fuzzing has caught those bugs"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [btrfs] fix check_direct_IO() for non-iovec iterators
    constify iov_iter_count() and iter_is_iovec()
    fix ITER_PIPE interaction with direct_IO

    Linus Torvalds
     
  • Fix ITER_PIPE interaction with direct_IO by making sure we call
    iov_iter_advance() on the original iov_iter even if direct_IO (done
    on its copy) has returned 0. It's a no-op for old iov_iter flavours
    and does the right thing (i.e. truncation of the stuff we'd
    allocated but not filled) in the ITER_PIPE case. Failures (e.g.
    -EIO) get caught and dealt with by the cleanup in
    generic_file_read_iter().

    Signed-off-by: Al Viro

    Al Viro
     

08 Oct, 2016

2 commits

  • We triggered a dead loop in truncate_inode_pages_range() on a 32-bit
    architecture with the test case below:

    ...
    fd = open();
    write(fd, buf, 4096);
    preadv64(fd, &iovec, 1, 0xffffffff000);
    ftruncate(fd, 0);
    ...

    Then ftruncate() will never return.

    The filesystem used in this case is ubifs, but it can be triggered on
    many other filesystems.

    When preadv64() is called with offset=0xffffffff000, a page with
    index=0xffffffff will be added to the radix tree of ->mapping. Then
    this page can be found in ->mapping with pagevec_lookup(). After that,
    truncate_inode_pages_range(), which is called in ftruncate(), will fall
    into an infinite loop:

    - find a page with index=0xffffffff, since index>=end, this page won't
    be truncated

    - index++, and index become 0

    - the page with index=0xffffffff will be found again

    The data type of index is unsigned long, so index won't overflow to 0
    on a 64-bit architecture in this case, and the dead loop won't happen.

    Since truncate_inode_pages_range() is executed while holding
    inode->i_rwsem, any operation related to this lock will be blocked,
    and a hung task will happen, e.g.:

    INFO: task truncate_test:3364 blocked for more than 120 seconds.
    ...
    call_rwsem_down_write_failed+0x17/0x30
    generic_file_write_iter+0x32/0x1c0
    ubifs_write_iter+0xcc/0x170
    __vfs_write+0xc4/0x120
    vfs_write+0xb2/0x1b0
    SyS_write+0x46/0xa0

    The page with index=0xffffffff added to ->mapping is useless. Fix this
    by checking the read position before allocating pages.
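
    A hedged sketch of the fix in do_generic_file_read() (simplified):
    reject reads starting beyond s_maxbytes before any page is allocated,
    and clamp the iterator so a page with an overflowing index is never
    inserted into the mapping:

    if (unlikely(*ppos >= inode->i_sb->s_maxbytes))
            return -EINVAL;
    iov_iter_truncate(iter, inode->i_sb->s_maxbytes);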

    Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
    Signed-off-by: Wei Fang
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Fang
     
  • If a fatal signal has been received, fail immediately instead of trying
    to read more data.

    If wait_on_page_locked_killable() was interrupted, then this page
    most likely is not PageUptodate(), and in this case
    do_generic_file_read() will fail after lock_page_killable().

    See also commit ebded02788b5 ("mm: filemap: avoid unnecessary calls to
    lock_page when waiting for IO to complete during a read")
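
    A hedged sketch of the change in do_generic_file_read() (simplified):

    /* was: wait_on_page_locked(page); */
    error = wait_on_page_locked_killable(page);
    if (unlikely(error))
            goto readpage_error;    /* fail instead of reading more */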

    [oleg@redhat.com: changelog addition]
    Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
    Signed-off-by: Bart Van Assche
    Reviewed-by: Jan Kara
    Acked-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bart Van Assche
     

06 Oct, 2016

3 commits

  • Pull xfs and iomap updates from Dave Chinner:
    "The main things in this update are the iomap-based DAX infrastructure,
    an XFS delalloc rework, and a chunk of fixes to how log recovery
    schedules writeback to prevent spurious corruption detections when
    recovery of certain items was not required.

    The other main chunk of code is some preparation for the upcoming
    reflink functionality. Most of it is generic and cleanups that stand
    alone, but they were ready and reviewed so are in this pull request.

    Speaking of reflink, I'm currently planning to send you another pull
    request next week containing all the new reflink functionality. I'm
    working through a similar process to the last cycle, where I sent the
    reverse mapping code in a separate request because of how large it
    was. The reflink code merge is even bigger than reverse mapping, so
    I'll be doing the same thing again....

    Summary for this update:

    - change of XFS mailing list to linux-xfs@vger.kernel.org

    - iomap-based DAX infrastructure w/ XFS and ext2 support

    - small iomap fixes and additions

    - more efficient XFS delayed allocation infrastructure based on iomap

    - a rework of log recovery writeback scheduling to ensure we don't
    fail recovery when trying to replay items that are already on disk

    - some preparation patches for upcoming reflink support

    - configurable error handling fixes and documentation

    - aio access time update race fixes for XFS and
    generic_file_read_iter"

    * tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
    fs: update atime before I/O in generic_file_read_iter
    xfs: update atime before I/O in xfs_file_dio_aio_read
    ext2: fix possible integer truncation in ext2_iomap_begin
    xfs: log recovery tracepoints to track current lsn and buffer submission
    xfs: update metadata LSN in buffers during log recovery
    xfs: don't warn on buffers not being recovered due to LSN
    xfs: pass current lsn to log recovery buffer validation
    xfs: rework log recovery to submit buffers on LSN boundaries
    xfs: quiesce the filesystem after recovery on readonly mount
    xfs: remote attribute blocks aren't really userdata
    ext2: use iomap to implement DAX
    ext2: stop passing buffer_head to ext2_get_blocks
    xfs: use iomap to implement DAX
    xfs: refactor xfs_setfilesize
    xfs: take the ilock shared if possible in xfs_file_iomap_begin
    xfs: fix locking for DAX writes
    dax: provide an iomap based fault handler
    dax: provide an iomap based dax read/write path
    dax: don't pass buffer_head to copy_user_dax
    dax: don't pass buffer_head to dax_insert_mapping
    ...

    Linus Torvalds
     
  • Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
    caused by replace_page_cache_page()") switched replace_page_cache() from
    raw radix tree operations to page_cache_tree_insert() but didn't take
    into account that the latter function, unlike the raw radix tree op,
    handles mapping->nrpages. As a result, that counter is bumped for
    each page replacement rather than kept balanced.

    The mapping->nrpages counter is used to skip needless radix tree walks
    when invalidating, truncating, syncing inodes without pages, as well as
    statistics for userspace. Since the error is positive, we'll do more
    page cache tree walks than necessary; we won't miss a necessary one.
    And we'll report more buffer pages to userspace than there are. The
    error is limited to fuse inodes.

    Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
    Signed-off-by: Johannes Weiner
    Cc: Andrew Morton
    Cc: Miklos Szeredi
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When the underflow checks were added to workingset_node_shadow_dec(),
    they triggered immediately:

    kernel BUG at ./include/linux/swap.h:276!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
    soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
    CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
    Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
    task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
    RIP: page_cache_tree_insert+0xf1/0x100
    Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
    Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 0b e8 88 68 ef ff 0f 1f 84 00
    RIP page_cache_tree_insert+0xf1/0x100

    This is a long-standing bug in the way shadow entries are accounted in
    the radix tree nodes. The shrinker needs to know when radix tree nodes
    contain only shadow entries, no pages, so node->count is split in half
    to count shadows in the upper bits and pages in the lower bits.
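
    For reference, a sketch of the split accounting as it looked at the
    time (paraphrased; details may differ from the exact headers):

    #define RADIX_TREE_COUNT_SHIFT  (RADIX_TREE_MAP_SHIFT + 1)
    #define RADIX_TREE_COUNT_MASK   ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)

    /* pages counted in the low bits, shadow entries in the high bits */
    static inline unsigned workingset_node_pages(struct radix_tree_node *node)
    {
            return node->count & RADIX_TREE_COUNT_MASK;
    }

    static inline unsigned workingset_node_shadows(struct radix_tree_node *node)
    {
            return node->count >> RADIX_TREE_COUNT_SHIFT;
    }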

    Unfortunately, the radix tree implementation doesn't know of this and
    assumes all entries are in node->count. When there is a shadow entry
    directly in root->rnode and the tree is later extended, the radix tree
    implementation will copy that entry into the new node and bump its
    node->count, i.e. increase the page count bits. Once the shadow gets
    removed and we subtract from the upper counter, node->count underflows
    and triggers the warning. Afterwards, without node->count reaching 0
    again, the radix tree node is leaked.

    Limit shadow entries to when we have actual radix tree nodes and can
    count them properly. That means we lose the ability to detect refaults
    from files that had only the first page faulted in at eviction time.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Linus Torvalds
    Reviewed-by: Jan Kara
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Oct, 2016

1 commit

  • After the call to ->direct_IO the final reference to the file might have
    been dropped by aio_complete already, and the call to file_accessed might
    cause a use after free.

    Instead update the access time before the I/O, similar to how we
    update the time stamps before writes.
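
    A minimal sketch of the reordering in generic_file_read_iter()
    (simplified):

    /* previously done after ->direct_IO returned, by which point
     * aio_complete may already have dropped the last file reference */
    file_accessed(file);
    retval = mapping->a_ops->direct_IO(iocb, &data);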

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner

    Christoph Hellwig