18 Dec, 2014

1 commit

  • Pull fuse update from Miklos Szeredi:
    "The first part makes sure we don't hold up umount with pending async
    requests. In addition to being a cleanup, this is a small behavioral
    change (for the better) and unlikely to break anything.

    The second part prepares for a cleanup of the fuse device I/O code by
    adding a helper for simple request submission, with some savings in
    line numbers already realized"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: use file_inode() in fuse_file_fallocate()
    fuse: introduce fuse_simple_request() helper
    fuse: reduce max out args
    fuse: hold inode instead of path after release
    fuse: flush requests on umount
    fuse: don't wake up reserved req in fuse_conn_kill()

    Linus Torvalds
     

12 Dec, 2014

6 commits

  • Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • The following pattern is repeated many times:

    req = fuse_get_req_nopages(fc);
    /* Initialize req->(in|out).args */
    fuse_request_send(fc, req);
    err = req->out.h.error;
    fuse_put_request(fc, req);

    Create a new replacement helper:

    /* Initialize args */
    err = fuse_simple_request(fc, &args);

    In addition to reducing the code size, this will ease moving from the
    complex arg-based to a simpler page-based I/O on the fuse device.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
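
    For orientation, a rough sketch of what a converted call site looks like
    with the new helper. The FUSE_ARGS() macro and the fuse_args field names
    follow the fuse code of this era; the fsync-style request itself is only
    an illustration, not an excerpt from the patch:

    FUSE_ARGS(args);
    struct fuse_fsync_in inarg = { .fh = ff->fh, .fsync_flags = datasync };
    int err;

    args.in.h.opcode = FUSE_FSYNC;
    args.in.h.nodeid = get_node_id(inode);
    args.in.numargs = 1;
    args.in.args[0].size = sizeof(inarg);
    args.in.args[0].value = &inarg;
    /* allocates the request, sends it, waits, and returns out.h.error */
    err = fuse_simple_request(fc, &args);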
     
  • The third out-arg is never actually used.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • path_put() in release could trigger a DESTROY request in fuseblk. The
    possible deadlock was worked around by doing the path_put() with
    schedule_work().

    This complexity isn't needed if we just hold the inode instead of the path.
    Since we now flush all requests before destroying the super block we can be
    sure that all held inodes will be dropped.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Use fuse_abort_conn() instead of fuse_conn_kill() in fuse_put_super().
    This flushes and aborts requests still on any queues. But since we've
    already reset fc->connected, those requests would not be useful anyway and
    would be flushed when the fuse device is closed.

    Next patches will rely on requests being flushed before the superblock is
    destroyed.

    Use fuse_abort_conn() in cuse_process_init_reply() too, since it makes no
    difference there, and we can get rid of fuse_conn_kill().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Waking up reserved_req_waitq from fuse_conn_kill() doesn't make sense since
    we aren't changing ff->reserved_req here, which is what this waitqueue
    signals.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

20 Nov, 2014

2 commits


09 Oct, 2014

2 commits


27 Sep, 2014

1 commit

  • The third argument of fuse_get_user_pages(), "nbytesp", refers to the number
    of bytes the caller asked to pack into the fuse request. This value may be
    less than the capacity of the fuse request or of the iov_iter, so
    fuse_get_user_pages() must ensure that *nbytesp never grows.

    Now that the iov_iter_get_pages() helper does all the hard work of
    extracting pages from the iov_iter, this can be achieved by passing a
    properly calculated "maxsize" to the helper (sketched below).

    The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
    this capability, so pass LONG_MAX as the maxsize argument here.

    Fixes: c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()")
    Reported-by: Werner Baumann
    Tested-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
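
    A compressed sketch of the idea, approximating the fuse_get_user_pages()
    loop after the fix (field names such as req->pages and req->max_pages
    follow the fuse request structure of this era; this is not the verbatim
    diff):

    while (nbytes < *nbytesp && req->num_pages < req->max_pages) {
        size_t start;
        /* maxsize = *nbytesp - nbytes: never pull in more than the caller
         * asked for, so *nbytesp can only shrink, never grow */
        ssize_t ret = iov_iter_get_pages(ii, &req->pages[req->num_pages],
                                         *nbytesp - nbytes,
                                         req->max_pages - req->num_pages,
                                         &start);
        if (ret < 0)
            return ret;

        iov_iter_advance(ii, ret);
        nbytes += ret;
        /* ... fill in the page descriptors for the pages just pinned ... */
    }
    *nbytesp = nbytes;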
     

08 Aug, 2014

2 commits


22 Jul, 2014

2 commits

  • Here are some additional changes that set a capability flag so that clients
    can detect when it's appropriate to return -ENOSYS from open (sketched below).

    This amends the following commit introduced in 3.14:

    7678ac50615d fuse: support clients that don't implement 'open'

    However we can only add the flag to 3.15 and later since there was no
    protocol version update in 3.14.

    Signed-off-by: Miklos Szeredi
    Cc: # v3.15+

    Andrew Gallagher
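
    For illustration, a userspace filesystem would check the capability during
    the INIT handshake before relying on the -ENOSYS convention. The flag name
    below (FUSE_NO_OPEN_SUPPORT, from the uapi fuse header) is what this series
    adds; the surrounding code is a sketch, not part of the patch:

    #include <linux/fuse.h>

    static int no_open_supported;

    static void handle_init(const struct fuse_init_in *in)
    {
        /* only answer OPEN with -ENOSYS if the kernel understands it */
        no_open_supported = !!(in->flags & FUSE_NO_OPEN_SUPPORT);
    }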
     
  • The default s_time_gran is 1; don't overwrite it if userspace didn't
    explicitly specify one.

    Signed-off-by: Miklos Szeredi
    Cc: # v3.15+

    Miklos Szeredi
     

15 Jul, 2014

1 commit

  • Pull fuse fixes from Miklos Szeredi:
    "This contains miscellaneous fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: replace count*size kzalloc by kcalloc
    fuse: release temporary page if fuse_writepage_locked() failed
    fuse: restructure ->rename2()
    fuse: avoid scheduling while atomic
    fuse: handle large user and group ID
    fuse: inode: drop cast
    fuse: ignore entry-timeout on LOOKUP_REVAL
    fuse: timeout comparison fix

    Linus Torvalds
     

14 Jul, 2014

2 commits


10 Jul, 2014

1 commit


07 Jul, 2014

5 commits

  • As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry()
    triggers complaints about scheduling while atomic:

    BUG: scheduling while atomic: fuse.hf/13976/0x10000001

    This happens because fuse_notify_inval_entry() attempts to allocate memory
    with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic().

    Introduced by commit 58bda1da4b3c "fuse/dev: use atomic maps"

    Fix by moving the map/unmap to just cover the actual memcpy operation
    (sketched below).

    Original patch from Maxim Patlasov

    Reported-by: Richard Sharpe
    Signed-off-by: Miklos Szeredi
    Cc: # v3.15+

    Miklos Szeredi
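
    Roughly, the shape of the fix in the copy helper (a sketch of
    fuse_copy_do() after the change; details may differ from the actual
    patch):

    static int fuse_copy_do(struct fuse_copy_state *cs, void **val,
                            unsigned *size)
    {
        unsigned ncpy = min(*size, cs->len);

        if (val) {
            /* map only around the memcpy, so nothing that can sleep
             * (like a GFP_KERNEL allocation) runs inside the atomic map */
            void *pgaddr = kmap_atomic(cs->pg);
            void *buf = pgaddr + cs->offset;

            if (cs->write)
                memcpy(buf, *val, ncpy);
            else
                memcpy(*val, buf, ncpy);

            kunmap_atomic(pgaddr);
            *val += ncpy;
        }
        *size -= ncpy;
        cs->len -= ncpy;
        cs->offset += ncpy;
        return ncpy;
    }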
     
  • If the number in the "user_id=N" or "group_id=N" mount option was larger
    than INT_MAX, fuse returned EINVAL.

    Fix this to handle all valid uid/gid values (sketched below).

    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Miklos Szeredi
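
    A sketch of the approach (the helper name and the option-parsing fragment
    are illustrative rather than the exact patch): parse the value as an
    unsigned int instead of a signed one, then validate the resulting kuid as
    before.

    static int fuse_match_uint(substring_t *s, unsigned int *res)
    {
        int err = -ENOMEM;
        char *buf = match_strdup(s);

        if (buf) {
            err = kstrtouint(buf, 10, res);
            kfree(buf);
        }
        return err;
    }

    /* ... in the mount option parser ... */
    case OPT_USER_ID:
        if (fuse_match_uint(&args[0], &uv))
            return 0;
        d->user_id = make_kuid(current_user_ns(), uv);
        if (!uid_valid(d->user_id))
            return 0;
        break;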
     
  • This patch removes the cast on data of type void * as it is not needed.
    The following Coccinelle semantic patch was used for making the change:

    @r@
    expression x;
    void* e;
    type T;
    identifier f;
    @@

    (
    *((T *)e)
    |
    ((T *)x)[...]
    |
    ((T *)x)->f
    |
    - (T *)
    e
    )

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: Miklos Szeredi

    Himangi Saraogi
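
    The effect of the semantic patch on a made-up example: assigning a void
    pointer to a typed pointer needs no cast in C, so the cast is simply
    dropped (get_context() is hypothetical, used only to produce a void *):

    void *data = get_context();

    struct foo *p1 = (struct foo *)data;  /* before: explicit cast */
    struct foo *p2 = data;                /* after: implicit conversion suffices */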
     
  • The following test case demonstrates the bug:

    sh# mount -t glusterfs localhost:meta-test /mnt/one

    sh# mount -t glusterfs localhost:meta-test /mnt/two

    sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
    bash: /mnt/one/file: Stale file handle

    sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file

    On the second open() on /mnt/one, FUSE would have used the old
    nodeid (file handle) to try to re-open it, and Gluster returns
    -ESTALE. The ESTALE propagates back to namei.c:filename_lookup(),
    where the lookup is re-attempted with LOOKUP_REVAL. The right
    behavior now would be for FUSE to ignore the entry-timeout and
    do the up-call revalidation. Instead FUSE ignores LOOKUP_REVAL,
    the revalidation succeeds (because the entry-timeout has not
    passed), open() is retried on the old file handle, and the ESTALE
    finally goes back to the application.

    Fix: if revalidation is happening with LOOKUP_REVAL, then ignore the
    entry-timeout and always do the up-call (see the sketch below).

    Signed-off-by: Anand Avati
    Reviewed-by: Niels de Vos
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Anand Avati
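
    Schematically, the fix makes the revalidation path treat LOOKUP_REVAL like
    an already-expired entry-timeout (a sketch of the condition in
    fuse_dentry_revalidate(), not the verbatim diff):

    if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
        (flags & LOOKUP_REVAL)) {
        /* treat the dentry as stale: send FUSE_LOOKUP to userspace and
         * verify that the nodeid (file handle) is still the same */
    }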
     
  • As suggested by checkpatch.pl, use time_before64() instead of direct
    comparison of jiffies64 values.

    Signed-off-by: Miklos Szeredi
    Cc:

    Miklos Szeredi
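
    Illustrative before/after of the comparison:

    /* before: direct comparison of jiffies64 values */
    if (fuse_dentry_time(entry) < get_jiffies_64())
        goto invalid;

    /* after: wrap-around-safe helper */
    if (time_before64(fuse_dentry_time(entry), get_jiffies_64()))
        goto invalid;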
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

05 Jun, 2014

2 commits

  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may be
    called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core helper,
    pagecache_get_page(), which takes a flags parameter that affects its
    behaviour, such as whether the page should be marked accessed or not. The
    old API is preserved but is basically a thin wrapper around this core
    function.

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of
    mark_page_accessed() has now changed, so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced whereas previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about in case they see a problem with the
    timing change. It is also the case that some filesystems may now be marking
    pages accessed where previously they did not, but it makes sense that
    filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk and will have variable results, but it highlights the
    impact of mark_page_accessed for async IO. The sync results are expected
    to be more stable. The exception is tmpfs, where the normal case is for
    the "IO" to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async, as the granularity reported by dd and the
    variability are unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd                   3.15.0-rc3          3.15.0-rc3
                                  vanilla         accessed-v2
    ext3   Max elapsed   13.9900 (  0.00%)  11.5900 ( 17.16%)
    tmpfs  Max elapsed    0.5100 (  0.00%)   0.4900 (  3.92%)
    btrfs  Max elapsed   12.8100 (  0.00%)  12.7800 (  0.23%)
    ext4   Max elapsed   18.6000 (  0.00%)  13.3400 ( 28.28%)
    xfs    Max elapsed   12.5600 (  0.00%)   2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    fs     samples  percentage  kernel image                        symbol
    ext3     86107      0.9783  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    ext3     23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    ext3      5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    ext4     64566      0.8961  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    ext4      5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    ext4      2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    xfs      62126      1.7675  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    xfs       1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    xfs        103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    btrfs    10655      0.1338  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    btrfs     2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    btrfs      587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed
    tmpfs    59562      3.2628  vmlinux-3.15.0-rc4-vanilla          mark_page_accessed
    tmpfs     1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25   init_page_accessed
    tmpfs       94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25   mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
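
    A condensed sketch of the wrapper idea described in the entry above. The
    FGP_* flag names follow the convention this patch introduces, but the
    pagecache_get_page() prototype is simplified here and should not be read
    as the real signature:

    static inline struct page *find_lock_page(struct address_space *mapping,
                                              pgoff_t index)
    {
        /* plain lookup: lock the page if found, no accessed marking here */
        return pagecache_get_page(mapping, index, FGP_LOCK, 0);
    }

    static inline struct page *find_or_create_page(struct address_space *mapping,
                                                   pgoff_t index, gfp_t gfp)
    {
        /* may allocate: a newly allocated page can have its accessed
         * information set with init_page_accessed() before it is visible */
        return pagecache_get_page(mapping, index,
                                  FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
    }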
     
  • cold is a bool, so make it one. Make the likely case the "if" part of the
    block instead of the else, as according to the optimisation manual this is
    preferred (illustrated below).

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
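
    Illustrative only (generic list handling, not the patched mm hot/cold
    path): a bool parameter with the common case in the "if" branch.

    static void add_page_to_list(struct list_head *entry,
                                 struct list_head *head, bool cold)
    {
        if (!cold)                  /* hot pages are the common, likely case */
            list_add(entry, head);
        else
            list_add_tail(entry, head);
    }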
     

02 Jun, 2014

1 commit

  • Currently, the fl_owner isn't set for flock locks. Some filesystems use
    byte-range locks to simulate flock locks and there is a common idiom in
    those that does:

    fl->fl_owner = (fl_owner_t)filp;
    fl->fl_start = 0;
    fl->fl_end = OFFSET_MAX;

    Since flock locks are generally "owned" by the open file description,
    move this into the common flock lock setup code. The fl_start and fl_end
    fields are already set appropriately, so remove the unneeded setting of
    that in flock ops in those filesystems as well.

    Finally, the lease code also sets fl_owner as if leases were owned by
    the process and not by the open file description. This is incorrect, as
    leases have the same ownership semantics as flock locks. Set them the
    same way. The lease code doesn't actually use the fl_owner value for
    anything, so this is more for consistency's sake than a bugfix.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: Greg Kroah-Hartman (Staging portion)
    Acked-by: J. Bruce Fields

    Jeff Layton
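
    Schematically, the common flock setup helper now initializes the lock
    along these lines (a sketch in the spirit of flock_make_lock(); not the
    exact diff):

    fl->fl_file = filp;
    fl->fl_owner = (fl_owner_t)filp;    /* owned by the open file description */
    fl->fl_pid = current->tgid;
    fl->fl_flags = FL_FLOCK;
    fl->fl_type = type;
    fl->fl_end = OFFSET_MAX;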
     

07 May, 2014

11 commits