25 Aug, 2016

1 commit

  • When reading from a loop device backed by a fuse file it deadlocks on
    lock_page().

    This is because the page is already locked by the read() operation done on
    the loop device. In this case we don't want to either lock the page or
    dirty it.

    So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.

    Reported-by: Sheng Yang
    Fixes: aa4d86163e4e ("block: loop: switch to VFS ITER_BVEC")
    Signed-off-by: Miklos Szeredi
    Cc: # v4.1+
    Reviewed-by: Sheng Yang
    Reviewed-by: Ashish Samant
    Tested-by: Sheng Yang
    Tested-by: Ashish Samant

    Miklos Szeredi
     

06 Aug, 2016

1 commit

  • Pull qstr constification updates from Al Viro:
    "Fairly self-contained bunch - surprising lot of places passes struct
    qstr * as an argument when const struct qstr * would suffice; it
    complicates analysis for no good reason.

    I'd prefer to feed that separately from the assorted fixes (those are
    in #for-linus and with somewhat trickier topology)"

    * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    qstr: constify instances in adfs
    qstr: constify instances in lustre
    qstr: constify instances in f2fs
    qstr: constify instances in ext2
    qstr: constify instances in vfat
    qstr: constify instances in procfs
    qstr: constify instances in fuse
    qstr constify instances in fs/dcache.c
    qstr: constify instances in nfs
    qstr: constify instances in ocfs2
    qstr: constify instances in autofs4
    qstr: constify instances in hfs
    qstr: constify instances in hfsplus
    qstr: constify instances in logfs
    qstr: constify dentry_init_security

    Linus Torvalds
     

31 Jul, 2016

1 commit


30 Jul, 2016

1 commit

  • Pull fuse updates from Miklos Szeredi:
    "This fixes error propagation from writeback to fsync/close for
    writeback cache mode as well as adding a missing capability flag to
    the INIT message. The rest are cleanups.

    (The commits are recent but all the code actually sat in -next for a
    while now. The recommits are due to conflict avoidance and the
    addition of Cc: stable@...)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: use filemap_check_errors()
    mm: export filemap_check_errors() to modules
    fuse: fix wrong assignment of ->flags in fuse_send_init()
    fuse: fuse_flush must check mapping->flags for errors
    fuse: fsync() did not return IO errors
    fuse: don't mess with blocking signals
    new helper: wait_event_killable_exclusive()
    fuse: improve aio directIO write performance for size extending writes

    Linus Torvalds
     

29 Jul, 2016

7 commits

  • Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo.

    Signed-off-by: Wei Fang
    Signed-off-by: Miklos Szeredi
    Fixes: 69fe05c90ed5 ("fuse: add missing INIT flags")
    Cc:

    Wei Fang
     
  • fuse_flush() calls write_inode_now() that triggers writeback, but actual
    writeback will happen later, on fuse_sync_writes(). If an error happens,
    fuse_writepage_end() will set error bit in mapping->flags. So, we have to
    check mapping->flags after fuse_sync_writes().

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
    Cc: # v3.15+

    Maxim Patlasov
     
  • Due to implementation of fuse writeback filemap_write_and_wait_range() does
    not catch errors. We have to do this directly after fuse_sync_writes()

    Signed-off-by: Alexey Kuznetsov
    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
    Cc: # v3.15+

    Alexey Kuznetsov
     
  • Merge more updates from Andrew Morton:
    "The rest of MM"

    * emailed patches from Andrew Morton : (101 commits)
    mm, compaction: simplify contended compaction handling
    mm, compaction: introduce direct compaction priority
    mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
    mm, page_alloc: make THP-specific decisions more generic
    mm, page_alloc: restructure direct compaction handling in slowpath
    mm, page_alloc: don't retry initial attempt in slowpath
    mm, page_alloc: set alloc_flags only once in slowpath
    lib/stackdepot.c: use __GFP_NOWARN for stack allocations
    mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
    mm, kasan: account for object redzone in SLUB's nearest_obj()
    mm: fix use-after-free if memory allocation failed in vma_adjust()
    zsmalloc: Delete an unnecessary check before the function call "iput"
    mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
    mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
    mm: optimize copy_page_to/from_iter_iovec
    mm: add cond_resched() to generic_swapfile_activate()
    Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
    mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
    mm: hwpoison: remove incorrect comments
    make __section_nr() more efficient
    ...

    Linus Torvalds
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing so this patch moves the relevant file-based accounted. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This changes the vfs dentry hashing to mix in the parent pointer at the
    _beginning_ of the hash, rather than at the end.

    That actually improves both the hash and the code generation, because we
    can move more of the computation to the "static" part of the dcache
    setup, and do less at lookup runtime.

    It turns out that a lot of other hash users also really wanted to mix in
    a base pointer as a 'salt' for the hash, and so the slightly extended
    interface ends up working well for other cases too.

    Users that want a string hash that is purely about the string pass in a
    'salt' pointer of NULL.

    * merge branch 'salted-string-hash':
    fs/dcache.c: Save one 32-bit multiply in dcache lookup
    vfs: make the string hashes salt the hash

    Linus Torvalds
     

21 Jul, 2016

1 commit


19 Jul, 2016

1 commit


06 Jul, 2016

1 commit


30 Jun, 2016

2 commits

  • While sending the blocking directIO in fuse, the write request is broken
    into sub-requests, each of default size 128k and all the requests are sent
    in non-blocking background mode if async_dio mode is supported by libfuse.
    The process which issue the write wait for the completion of all the
    sub-requests. Sending multiple requests parallely gives a chance to perform
    parallel writes in the user space fuse implementation if it is
    multi-threaded and hence improves the performance.

    When there is a size extending aio dio write, we switch to blocking mode so
    that we can properly update the size of the file after completion of the
    writes. However, in this situation all the sub-requests are sent in
    serialized manner where the next request is sent only after receiving the
    reply of the current request. Hence the multi-threaded user space
    implementation is not utilized properly.

    This patch changes the size extending aio dio behavior to exactly follow
    blocking dio. For multi threaded fuse implementation having 10 threads and
    using buffer size of 64MB to perform async directIO, we are getting double
    the speed.

    Signed-off-by: Ashish Sangwan
    Signed-off-by: Miklos Szeredi

    Ashish Sangwan
     
  • Negotiate with userspace filesystems whether they support parallel readdir
    and lookup. Disable parallelism by default for fear of breaking fuse
    filesystems.

    Signed-off-by: Miklos Szeredi
    Fixes: 9902af79c01a ("parallel lookups: actual switch to rwsem")
    Fixes: d9b3dbdcfd62 ("fuse: switch to ->iterate_shared()")

    Miklos Szeredi
     

11 Jun, 2016

1 commit

  • We always mixed in the parent pointer into the dentry name hash, but we
    did it late at lookup time. It turns out that we can simplify that
    lookup-time action by salting the hash with the parent pointer early
    instead of late.

    A few other users of our string hashes also wanted to mix in their own
    pointers into the hash, and those are updated to use the same mechanism.

    Hash users that don't have any particular initial salt can just use the
    NULL pointer as a no-salt.

    Cc: Vegard Nossum
    Cc: George Spelvin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 May, 2016

1 commit

  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we'd hashed the new dentry and attached it to inode, we need ->setxattr()
    instances getting the inode as an explicit argument rather than obtaining
    it from dentry.

    Similar change for ->getxattr() had been done in commit ce23e64. Unlike
    ->getxattr() (which is used by both selinux and smack instances of
    ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
    it got missed back then.

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro
     

18 May, 2016

1 commit

  • Pull vfs cleanups from Al Viro:
    "More cleanups from Christoph"

    * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nfsd: use RWF_SYNC
    fs: add RWF_DSYNC aand RWF_SYNC
    ceph: use generic_write_sync
    fs: simplify the generic_write_sync prototype
    fs: add IOCB_SYNC and IOCB_DSYNC
    direct-io: remove the offset argument to dio_complete
    direct-io: eliminate the offset argument to ->direct_IO
    xfs: eliminate the pos variable in xfs_file_dio_aio_write
    filemap: remove the pos argument to generic_file_direct_write
    filemap: remove pos variables in generic_file_read_iter

    Linus Torvalds
     

17 May, 2016

1 commit


03 May, 2016

2 commits


02 May, 2016

2 commits


25 Apr, 2016

1 commit

  • fuse_get_user_pages() should return error or 0. Otherwise fuse_direct_io
    read will not return 0 to indicate that read has completed.

    Fixes: 742f992708df ("fuse: return patrial success from fuse_direct_io()")
    Signed-off-by: Ashish Samant
    Signed-off-by: Seth Forshee
    Signed-off-by: Miklos Szeredi

    Ashish Samant
     

11 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

1 commit

  • If a user calls writev/readv in direct io mode with partially valid data
    in the iovec array such that any vector other than the first one in the
    array contains invalid data, we currently return the error for the invalid
    iovec.

    Instead, we should return the number of bytes already written/read and not
    the error as we do in the non direct_io case.

    Reported-by: Alexey Kodanev
    Signed-off-by: Ashish Samant
    Signed-off-by: Miklos Szeredi

    Ashish Samant
     

14 Mar, 2016

2 commits

  • The 'reqs' member of fuse_io_priv serves two purposes. First is to track
    the number of oustanding async requests to the server and to signal that
    the io request is completed. The second is to be a reference count on the
    structure to know when it can be freed.

    For sync io requests these purposes can be at odds. fuse_direct_IO() wants
    to block until the request is done, and since the signal is sent when
    'reqs' reaches 0 it cannot keep a reference to the object. Yet it needs to
    use the object after the userspace server has completed processing
    requests. This leads to some handshaking and special casing that it
    needlessly complicated and responsible for at least one race condition.

    It's much cleaner and safer to maintain a separate reference count for the
    object lifecycle and to let 'reqs' just be a count of outstanding requests
    to the userspace server. Then we can know for sure when it is safe to free
    the object without any handshaking or special cases.

    The catch here is that most of the time these objects are stack allocated
    and should not be freed. Initializing these objects with a single reference
    that is never released prevents accidental attempts to free the objects.

    Fixes: 9d5722b7777e ("fuse: handle synchronous iocbs internally")
    Cc: stable@vger.kernel.org # v4.1+
    Signed-off-by: Seth Forshee
    Signed-off-by: Miklos Szeredi

    Seth Forshee
     
  • There's a race in fuse_direct_IO(), whereby is_sync_kiocb() is called on an
    iocb that could have been freed if async io has already completed. The fix
    in this case is simple and obvious: cache the result before starting io.

    It was discovered by KASan:

    kernel: ==================================================================
    kernel: BUG: KASan: use after free in fuse_direct_IO+0xb1a/0xcc0 at addr ffff88036c414390

    Signed-off-by: Robert Doebbelin
    Signed-off-by: Miklos Szeredi
    Fixes: bcba24ccdc82 ("fuse: enable asynchronous processing direct IO")
    Cc: # 3.10+

    Robert Doebbelin
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

22 Jan, 2016

1 commit


15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Jan, 2016

1 commit

  • Pull vfs RCU symlink updates from Al Viro:
    "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
    even if the symlink is not an embedded one.

    No changes since the mailbomb on Jan 1"

    * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->get_link() to delayed_call, kill ->put_link()
    kill free_page_put_link()
    teach nfs_get_link() to work in RCU mode
    teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
    teach shmem_get_link() to work in RCU mode
    teach page_get_link() to work in RCU mode
    replace ->follow_link() with new method that could stay in RCU mode
    don't put symlink bodies in pagecache into highmem
    namei: page_getlink() and page_follow_link_light() are the same thing
    ufs: get rid of ->setattr() for symlinks
    udf: don't duplicate page_symlink_inode_operations
    logfs: don't duplicate page_symlink_inode_operations
    switch befs long symlinks to page_symlink_operations

    Linus Torvalds
     

31 Dec, 2015

1 commit


30 Dec, 2015

1 commit


12 Dec, 2015

1 commit


09 Dec, 2015

1 commit

  • new method: ->get_link(); replacement of ->follow_link(). The differences
    are:
    * inode and dentry are passed separately
    * might be called both in RCU and non-RCU mode;
    the former is indicated by passing it a NULL dentry.
    * when called that way it isn't allowed to block
    and should return ERR_PTR(-ECHILD) if it needs to be called
    in non-RCU mode.

    It's a flagday change - the old method is gone, all in-tree instances
    converted. Conversion isn't hard; said that, so far very few instances
    do not immediately bail out when called in RCU mode. That'll change
    in the next commits.

    Signed-off-by: Al Viro

    Al Viro
     

10 Nov, 2015

2 commits

  • A useful performance improvement for accessing virtual machine images
    via FUSE mount.

    See https://bugzilla.redhat.com/show_bug.cgi?id=1220173 for a use-case
    for glusterFS.

    Signed-off-by: Ravishankar N
    Signed-off-by: Miklos Szeredi

    Ravishankar N
     
  • I got a report about unkillable task eating CPU. Further
    investigation shows, that the problem is in the fuse_fill_write_pages()
    function. If iov's first segment has zero length, we get an infinite
    loop, because we never reach iov_iter_advance() call.

    Fix this by calling iov_iter_advance() before repeating an attempt to
    copy data from userspace.

    A similar problem is described in 124d3b7041f ("fix writev regression:
    pan hanging unkillable and un-straceable"). If zero-length segmend
    is followed by segment with invalid address,
    iov_iter_fault_in_readable() checks only first segment (zero-length),
    iov_iter_copy_from_user_atomic() skips it, fails at second and
    returns zero -> goto again without skipping zero-length segment.

    Patch calls iov_iter_advance() before goto again: we'll skip zero-length
    segment at second iteraction and iov_iter_fault_in_readable() will detect
    invalid address.

    Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
    description.

    Cc: Andrew Morton
    Cc: Maxim Patlasov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Roman Gushchin
    Signed-off-by: Miklos Szeredi
    Fixes: ea9b9907b82a ("fuse: implement perform_write")
    Cc:

    Roman Gushchin