13 Dec, 2019

4 commits

  • commit eb59bd17d2fa6e5e84fba61a5ebdea984222e6d5 upstream.

    If a filesystem returns negative inode sizes, future reads on the file were
    causing the cpu to spin on truncate_pagecache.

    Create a helper to validate the attributes. This now does two things:

    - check the file mode
    - check if the file size fits in i_size without overflowing
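
    The two checks can be sketched in userspace roughly as follows. This is
    an illustration only, not the kernel helper: struct demo_attr and the
    demo_* names are hypothetical, and the type constants are the usual Linux
    S_IF* values written out so the block stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IFMT 0170000u  /* mask for the file type bits (Linux S_IFMT) */

/* Hypothetical stand-in for the attributes reported by the filesystem. */
struct demo_attr {
    uint32_t mode;  /* file type and permission bits */
    uint64_t size;  /* size in bytes, unsigned on the wire */
};

/* True if the mode names exactly one known file type. */
static bool demo_valid_mode(uint32_t mode)
{
    switch (mode & DEMO_IFMT) {
    case 0100000u: /* regular file */
    case 0040000u: /* directory */
    case 0120000u: /* symlink */
    case 0020000u: /* character device */
    case 0060000u: /* block device */
    case 0010000u: /* FIFO */
    case 0140000u: /* socket */
        return true;
    default:
        return false;
    }
}

/* True if the size fits in a signed 64-bit offset without going negative. */
static bool demo_valid_size(uint64_t size)
{
    return size <= (uint64_t)INT64_MAX;
}

/* Both checks together; attributes failing either one should be rejected. */
static bool demo_valid_attr(const struct demo_attr *attr)
{
    return demo_valid_mode(attr->mode) && demo_valid_size(attr->size);
}
```

    A size with the top bit set is what would show up as a negative value
    once stored in a signed offset, so it is refused up front.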

    Reported-by: Arijit Banerjee
    Fixes: d8a5ba45457e ("[PATCH] FUSE - core")
    Cc: # v2.6.14
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 8aab336b14c115c6bf1d4baeb9247e41ed9ce6de upstream.

    Make sure filesystem is not returning a bogus number of bytes written.
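
    A minimal sketch of such a check, assuming the reply carries a 32-bit
    byte count. The function name and the choice of -EIO as the error are
    assumptions made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/*
 * 'requested' is how many bytes the client asked to write and 'reported'
 * is what the reply claims was written. Returns the accepted byte count,
 * or -EIO when the reply claims more than was requested.
 */
static int64_t demo_check_written(uint32_t requested, uint32_t reported)
{
    if (reported > requested)
        return -EIO;
    return (int64_t)reported;
}
```

    A short write (fewer bytes than requested) is still accepted; only a
    count larger than the request is treated as bogus.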

    Fixes: ea9b9907b82a ("fuse: implement perform_write")
    Cc: # v2.6.26
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit c634da718db9b2fac201df2ae1b1b095344ce5eb upstream.

    When adding a new hard link, make sure that i_nlink doesn't overflow.
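
    The guard amounts to refusing the increment once the counter is at its
    maximum. A sketch with a hypothetical struct demo_inode, assuming an
    unsigned int link count:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical inode carrying only the link count. */
struct demo_inode {
    unsigned int nlink;
};

/* Bumps the link count unless that would wrap; returns false if saturated. */
static bool demo_inc_nlink(struct demo_inode *inode)
{
    if (inode->nlink == UINT_MAX)
        return false;
    inode->nlink++;
    return true;
}
```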

    Fixes: ac45d61357e8 ("fuse: fix nlink after unlink")
    Cc: # v3.4
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit f1ebdeffc6f325e30e0ddb9f7a70f1370fa4b851 upstream.

    exit_aio() is sometimes stuck in wait_for_completion() after aio is issued
    with direct IO and the task receives a signal.

    The reason is failure to call ->ki_complete() due to a leaked reference to
    fuse_io_priv. This happens in fuse_async_req_send() if
    fuse_simple_background() returns an error (e.g. -EINTR).

    In this case the error value is propagated via io->err, so return success
    to not confuse callers.

    This issue is tracked as a virtio-fs issue:
    https://gitlab.com/virtio-fs/qemu/issues/14

    Reported-by: Masayoshi Mizuma
    Fixes: 45ac96ed7c36 ("fuse: convert direct_io to simple api")
    Cc: # v5.4
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     

23 Oct, 2019

4 commits

  • Currently fuse_writepages_fill() calls get_fuse_inode() a few times with
    the same argument.

    Signed-off-by: Vasily Averin
    Signed-off-by: Miklos Szeredi

    Vasily Averin
     
  • Make sure cached writes are not reordered around open(..., O_TRUNC), with
    the obvious wrong results.

    Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
    Cc: # v3.15+
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • If writeback cache is enabled, then writes might get reordered with
    chmod/chown/utimes. The problem with this is that performing the write in
    the fuse daemon might itself change some of these attributes. In such case
    the following sequence of operations will result in file ending up with the
    wrong mode, for example:

    int fd = open ("suid", O_WRONLY|O_CREAT|O_EXCL, 0644);
    write (fd, "1", 1);
    fchown (fd, 0, 0);
    fchmod (fd, 04755);
    close (fd);

    This patch fixes this by flushing pending writes before performing
    chown/chmod/utimes.

    Reported-by: Giuseppe Scrivano
    Tested-by: Giuseppe Scrivano
    Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on")
    Cc: # v3.15+
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Fixes gcc '-Wunused-but-set-variable' warning:

    fs/fuse/virtio_fs.c: In function virtio_fs_wake_pending_and_unlock:
    fs/fuse/virtio_fs.c:983:20: warning: variable fc set but not used [-Wunused-but-set-variable]

    It is not used since commit 7ee1e2e631db ("virtiofs: No need to check
    fpq->connected state")

    Reported-by: Hulk Robot
    Signed-off-by: zhengbin
    Signed-off-by: Miklos Szeredi

    zhengbin
     

21 Oct, 2019

7 commits

  • If regular request queue gets full, currently we sleep for a bit and
    retry submission in submitter's context. This assumes submitter is not
    holding any spin lock. But this assumption is not true for background
    requests. For background requests, we are called with fc->bg_lock held.

    This can lead to deadlock where one thread is trying submission with
    fc->bg_lock held while request completion thread has called
    fuse_request_end() which tries to acquire fc->bg_lock and gets blocked. As
    request completion thread gets blocked, it does not make further progress
    and that means queue does not get empty and submitter can't submit more
    requests.

    To solve this issue, retry submission with the help of a worker, instead of
    retrying in submitter's context. We already do this for hiprio/forget
    requests.
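
    The idea can be modelled single-threaded as below. This is a sketch, not
    the virtio-fs code: struct demo_queue, the tiny ring size and all demo_*
    names are hypothetical. The point is that the submit path never waits; it
    parks the request and returns, and a separate worker step moves parked
    requests into the ring once completions have freed slots.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RING_SLOTS  2  /* tiny ring so it fills quickly */
#define DEMO_MAX_PENDING 8

struct demo_queue {
    int ring[DEMO_RING_SLOTS];      /* stands in for the virtqueue */
    size_t ring_used;
    int pending[DEMO_MAX_PENDING];  /* requests parked for the worker */
    size_t pending_used;
};

/* Places a request in the ring; fails when the ring is full. */
static bool demo_ring_add(struct demo_queue *q, int req)
{
    if (q->ring_used == DEMO_RING_SLOTS)
        return false;
    q->ring[q->ring_used++] = req;
    return true;
}

/*
 * Submitter's context: never sleeps or retries, so it is safe to call
 * with a spinlock held. Returns false only if the pending list is full.
 */
static bool demo_submit(struct demo_queue *q, int req)
{
    /* Keep FIFO order: bypass the pending list only when it is empty. */
    if (q->pending_used == 0 && demo_ring_add(q, req))
        return true;
    if (q->pending_used == DEMO_MAX_PENDING)
        return false;
    q->pending[q->pending_used++] = req;
    return true;
}

/* Completion path: retires the oldest ring entry, freeing one slot. */
static void demo_complete_one(struct demo_queue *q)
{
    if (q->ring_used == 0)
        return;
    for (size_t i = 1; i < q->ring_used; i++)
        q->ring[i - 1] = q->ring[i];
    q->ring_used--;
}

/* Worker context: moves parked requests into the ring, oldest first. */
static size_t demo_worker_run(struct demo_queue *q)
{
    size_t moved = 0;

    while (moved < q->pending_used && demo_ring_add(q, q->pending[moved]))
        moved++;
    for (size_t i = moved; i < q->pending_used; i++)
        q->pending[i - moved] = q->pending[i];
    q->pending_used -= moved;
    return moved;
}
```

    Because demo_submit() returns immediately, the completion side is never
    blocked behind a submitter that is waiting for space.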

    Reported-by: Chirantan Ekbote
    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • If virtqueue is full, we put forget requests on a list and these forgets
    are dispatched later using a worker. As of now we don't count these forgets
    in fsvq->in_flight variable. This means when queue is being drained, we
    have to have special logic to first drain these pending requests and then
    wait for fsvq->in_flight to go to zero.

    By counting pending forgets in fsvq->in_flight, we can get rid of special
    logic and just wait for in_flight to go to zero. Worker thread will kick
    and drain all the forgets anyway, leading in_flight to zero.

    I also need similar logic for normal request queue in next patch where I am
    about to defer request submission in the worker context if queue is full.

    This simplifies the code a bit.

    Also add two helper functions to inc/dec in_flight. Decrement in_flight
    helper will later be used to call completion when in_flight reaches zero.
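
    A sketch of what such a pair of helpers could look like, including the
    later step of signalling when the count reaches zero. struct demo_vq and
    its 'drained' flag are hypothetical; the flag merely stands in for a
    completion.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_vq {
    unsigned int in_flight;  /* ring requests plus parked forgets */
    bool drained;            /* stands in for a completion being signalled */
};

/* Call whenever a request is queued, whether in the ring or on the list. */
static void demo_inc_in_flight(struct demo_vq *vq)
{
    vq->in_flight++;
    vq->drained = false;
}

/* Call when a request finishes; signals once nothing is left in flight. */
static void demo_dec_in_flight(struct demo_vq *vq)
{
    assert(vq->in_flight > 0);  /* must pair with an earlier increment */
    if (--vq->in_flight == 0)
        vq->drained = true;
}
```

    With parked forgets counted as well, draining reduces to waiting for the
    single counter to hit zero.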

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • FR_SENT flag should be set when request has been successfully sent
    over virtqueue. This is used by interrupt logic to figure out if interrupt
    request should be sent or not.

    Also add it to fpq->processing list after sending it successfully.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • In virtiofs we keep per queue connected state in virtio_fs_vq->connected
    and use that to end request if queue is not connected. And virtiofs does
    not even touch fpq->connected state.

    We probably need to merge these two at some point of time. For now,
    simplify the code a bit and do not worry about checking state of
    fpq->connected.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Submission context can hold some locks which end request code tries to hold
    again and deadlock can occur. For example, fc->bg_lock. If a background
    request is being submitted, it might hold fc->bg_lock and if we could not
    submit request (because device went away) and tried to end request, then
    deadlock happens. During testing, I also got a warning from deadlock
    detection code.

    So put requests on a list and end requests from a worker thread.

    I got following warning from deadlock detector.

    [ 603.137138] WARNING: possible recursive locking detected
    [ 603.137142] --------------------------------------------
    [ 603.137144] blogbench/2036 is trying to acquire lock:
    [ 603.137149] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_request_end+0xdf/0x1c0 [fuse]
    [ 603.140701]
    [ 603.140701] but task is already holding lock:
    [ 603.140703] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_simple_background+0x92/0x1d0 [fuse]
    [ 603.140713]
    [ 603.140713] other info that might help us debug this:
    [ 603.140714] Possible unsafe locking scenario:
    [ 603.140714]
    [ 603.140715] CPU0
    [ 603.140716] ----
    [ 603.140716] lock(&(&fc->bg_lock)->rlock);
    [ 603.140718] lock(&(&fc->bg_lock)->rlock);
    [ 603.140719]
    [ 603.140719] *** DEADLOCK ***

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • If the FUSE_READDIRPLUS_AUTO feature is enabled, then lookups on a
    directory before/during readdir are used as an indication that READDIRPLUS
    should be used instead of READDIR. However if the lookup turns out to be
    negative, then selecting READDIRPLUS makes no sense.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Move the check for async request after check for the request being already
    finished and done with.

    Reported-by: syzbot+ae0bb7aae3de6b4594e2@syzkaller.appspotmail.com
    Fixes: d49937749fef ("fuse: stop copying args to fuse_req")
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

15 Oct, 2019

1 commit


14 Oct, 2019

1 commit

  • We have been calling it virtio_fs and even file name is virtio_fs.c. Module
    name is virtio_fs.ko but when registering file system user is supposed to
    specify filesystem type as "virtiofs".

    Masayoshi Mizuma reported that he specified filesystem type as "virtio_fs"
    and got this warning on console.

    ------------[ cut here ]------------
    request_module fs-virtio_fs succeeded, but still no fs?
    WARNING: CPU: 1 PID: 1234 at fs/filesystems.c:274 get_fs_type+0x12c/0x138
    Modules linked in: ... virtio_fs fuse virtio_net net_failover ...
    CPU: 1 PID: 1234 Comm: mount Not tainted 5.4.0-rc1 #1

    So looks like kernel could find the module virtio_fs.ko but could not find
    filesystem type after that.

    It probably is better to rename module name to virtiofs.ko so that above
    warning goes away in case user ends up specifying wrong fs name.

    Reported-by: Masayoshi Mizuma
    Suggested-by: Stefan Hajnoczi
    Signed-off-by: Vivek Goyal
    Tested-by: Masayoshi Mizuma
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     

28 Sep, 2019

1 commit

  • Pull fuse virtio-fs support from Miklos Szeredi:
    "Virtio-fs allows exporting directory trees on the host and mounting
    them in guest(s).

    This isn't actually a new filesystem, but a glue layer between the
    fuse filesystem and a virtio based back-end.

    It's similar in functionality to the existing virtio-9p solution, but
    significantly faster in benchmarks and has better POSIX compliance.
    Further performance improvements can be achieved by sharing the page
    cache between host and guest, allowing for faster I/O and reduced
    memory use.

    Kata Containers have been including the out-of-tree virtio-fs (with
    the shared page cache patches as well) since version 1.7 as an
    experimental feature. They have been active in development and plan to
    switch from virtio-9p to virtio-fs as their default solution. There
    has been interest from other sources as well.

    The userspace infrastructure is slated to be merged into qemu once the
    kernel part hits mainline.

    This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi"

    * tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    virtio-fs: add virtiofs filesystem
    virtio-fs: add Documentation/filesystems/virtiofs.rst
    fuse: reserve values for mapping protocol

    Linus Torvalds
     

24 Sep, 2019

7 commits

  • Fix sparse warning:

    fs/fuse/dev.c:468:6: warning: symbol 'fuse_args_to_req' was not declared. Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Fixes: 68583165f962 ("fuse: add pages to fuse_args")
    Signed-off-by: Miklos Szeredi

    YueHaibing
     
  • If cuse_send_init fails, need to fuse_conn_put cc->fc.

    cuse_channel_open->fuse_conn_init->refcount_set(&fc->count, 1)
    ->fuse_dev_alloc->fuse_conn_get
    ->fuse_dev_free->fuse_conn_put

    Fixes: cc080e9e9be1 ("fuse: introduce per-instance fuse_dev structure")
    Reported-by: Hulk Robot
    Signed-off-by: zhengbin
    Signed-off-by: Miklos Szeredi

    zhengbin
     
  • With DEBUG_PAGEALLOC on, the following triggers.

    BUG: unable to handle page fault for address: ffff88859367c000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 3001067 P4D 3001067 PUD 406d3a8067 PMD 406d30c067 PTE 800ffffa6c983060
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 38 PID: 3110657 Comm: python2.7
    RIP: 0010:fuse_readdir+0x88f/0xe7a [fuse]
    Code: 49 8b 4d 08 49 39 4e 60 0f 84 44 04 00 00 48 8b 43 08 43 8d 1c 3c 4d 01 7e 68 49 89 dc 48 03 5c 24 38 49 89 46 60 8b 44 24 30 4b 10 44 29 e0 48 89 ca 48 83 c1 1f 48 83 e1 f8 83 f8 17 49 89
    RSP: 0018:ffffc90035edbde0 EFLAGS: 00010286
    RAX: 0000000000001000 RBX: ffff88859367bff0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff88859367bfed RDI: 0000000000920907
    RBP: ffffc90035edbe90 R08: 000000000000014b R09: 0000000000000004
    R10: ffff88859367b000 R11: 0000000000000000 R12: 0000000000000ff0
    R13: ffffc90035edbee0 R14: ffff889fb8546180 R15: 0000000000000020
    FS: 00007f80b5f4a740(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff88859367c000 CR3: 0000001c170c2001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    iterate_dir+0x122/0x180
    __x64_sys_getdents+0xa6/0x140
    do_syscall_64+0x42/0x100
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    It's in fuse_parse_cache(). %rbx (ffff88859367bff0) is fuse_dirent
    pointer - addr + offset. FUSE_DIRENT_SIZE() is trying to dereference
    namelen off of it but that derefs into the next page which is disabled
    by pagealloc debug causing a PF.

    This is caused by dirent->namelen being accessed before ensuring that
    there's enough bytes in the page for the dirent. Fix it by pushing
    down reclen calculation.
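
    The ordering of the checks can be illustrated with a userspace parser.
    This is a sketch under assumptions: struct demo_dirent only mirrors the
    general shape of a directory record (fixed header followed by a
    variable-length name, records aligned to 8 bytes), and the demo_* names
    are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Demo record header; the variable-length name follows it in the buffer. */
struct demo_dirent {
    uint64_t ino;
    uint64_t off;
    uint32_t namelen;
    uint32_t type;
};

#define DEMO_NAME_OFFSET sizeof(struct demo_dirent)
#define DEMO_ALIGN(x)    (((x) + 7u) & ~(size_t)7u)

/*
 * Counts complete records in buf[0..len). Stops at the first record that
 * would run past the end of the buffer.
 */
static size_t demo_count_dirents(const unsigned char *buf, size_t len)
{
    size_t offset = 0, count = 0;

    /* First make sure the whole header lies inside the buffer. */
    while (len - offset >= DEMO_NAME_OFFSET) {
        struct demo_dirent d;
        size_t reclen;

        /* Only now is it safe to read namelen. */
        memcpy(&d, buf + offset, sizeof(d));
        reclen = DEMO_ALIGN(DEMO_NAME_OFFSET + (size_t)d.namelen);
        if (d.namelen == 0 || reclen > len - offset)
            break;
        offset += reclen;
        count++;
    }
    return count;
}
```

    Computing the record length before the header check is what would read
    namelen from beyond the buffer when a record starts near the end.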

    Signed-off-by: Tejun Heo
    Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache")
    Cc: stable@vger.kernel.org # v4.20+
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     
  • This function has been made static, which now causes a compile-time
    warning:

    WARNING: "fuse_put_request" [vmlinux] is a static EXPORT_SYMBOL_GPL

    Remove the unneeded export.

    Fixes: 66abc3599c3c ("fuse: unexport request ops")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Arnd Bergmann
     
  • account per-file, dentry, and inode data

    blockdev/superblock and temporary per-request data was left alone, as
    this usually isn't accounted

    Reviewed-by: Shakeel Butt
    Signed-off-by: Khazhismel Kumykov
    Signed-off-by: Miklos Szeredi

    Khazhismel Kumykov
     
  • Implements the optimization noted in commit f75fdf22b0a8 ("fuse: don't
    use ->d_time"), as the additional memory can be significant. (In
    particular, on SLAB configurations this 8-byte alloc becomes 32 bytes).
    Per-dentry, this can consume significant memory.
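
    One way to avoid a separate 8-byte allocation is to keep the value in
    the pointer-sized private slot itself when pointers are wide enough. The
    sketch below assumes that is the optimization meant; struct demo_dentry
    and the demo_* helpers are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical dentry with one pointer-sized private slot. */
struct demo_dentry {
    void *fsdata;
};

#if UINTPTR_MAX >= UINT64_MAX
/* 64-bit pointers: the timestamp lives in the slot, nothing is allocated. */
static int demo_init(struct demo_dentry *d)
{
    d->fsdata = NULL;
    return 0;
}
static void demo_release(struct demo_dentry *d)
{
    (void)d;
}
static void demo_set_time(struct demo_dentry *d, uint64_t t)
{
    d->fsdata = (void *)(uintptr_t)t;
}
static uint64_t demo_get_time(const struct demo_dentry *d)
{
    return (uint64_t)(uintptr_t)d->fsdata;
}
#else
/* Narrower pointers: fall back to a separately allocated 64-bit value. */
static int demo_init(struct demo_dentry *d)
{
    d->fsdata = calloc(1, sizeof(uint64_t));
    return d->fsdata ? 0 : -1;
}
static void demo_release(struct demo_dentry *d)
{
    free(d->fsdata);
}
static void demo_set_time(struct demo_dentry *d, uint64_t t)
{
    *(uint64_t *)d->fsdata = t;
}
static uint64_t demo_get_time(const struct demo_dentry *d)
{
    return *(const uint64_t *)d->fsdata;
}
#endif
```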

    Reviewed-by: Shakeel Butt
    Signed-off-by: Khazhismel Kumykov
    Signed-off-by: Miklos Szeredi

    Khazhismel Kumykov
     
  • unlock_page() was missing in case of an already in-flight write against the
    same page.

    Signed-off-by: Vasily Averin
    Fixes: ff17be086477 ("fuse: writepage: skip already in flight")
    Cc: # v3.13
    Signed-off-by: Miklos Szeredi

    Vasily Averin
     

19 Sep, 2019

1 commit

  • Add a basic file system module for virtio-fs. This does not yet contain
    shared data support between host and guest or metadata coherency speedups.
    However it is already significantly faster than virtio-9p.

    Design Overview
    ===============

    With the goal of designing something with better performance and local file
    system semantics, a bunch of ideas were proposed.

    - Use fuse protocol (instead of 9p) for communication between guest and
    host. Guest kernel will be fuse client and a fuse server will run on
    host to serve the requests.

    - For data access inside guest, mmap portion of file in QEMU address space
    and guest accesses this memory using dax. That way guest page cache is
    bypassed and there is only one copy of data (on host). This will also
    enable mmap(MAP_SHARED) between guests.

    - For metadata coherency, there is a shared memory region which contains
    version number associated with metadata and any guest changing metadata
    updates version number and other guests refresh metadata on next access.
    This is yet to be implemented.

    How virtio-fs differs from existing approaches
    ==============================================

    The unique idea behind virtio-fs is to take advantage of the co-location of
    the virtual machine and hypervisor to avoid communication (vmexits).

    DAX allows file contents to be accessed without communication with the
    hypervisor. The shared memory region for metadata avoids communication in
    the common case where metadata is unchanged.

    By replacing expensive communication with cheaper shared memory accesses,
    we expect to achieve better performance than approaches based on network
    file system protocols. In addition, this also makes it easier to achieve
    local file system semantics (coherency).

    These techniques are not applicable to network file system protocols since
    the communications channel is bypassed by taking advantage of shared memory
    on a local machine. This is why we decided to build virtio-fs rather than
    focus on 9P or NFS.

    Caching Modes
    =============

    Like virtio-9p, different caching modes are supported which determine the
    coherency level as well. The “cache=FOO” and “writeback” options control
    the level of coherence between the guest and host filesystems.

    - cache=none
    metadata, data and pathname lookup are not cached in guest. They are
    always fetched from host and any changes are immediately pushed to host.

    - cache=always
    metadata, data and pathname lookup are cached in guest and never expire.

    - cache=auto
    metadata and pathname lookup cache expires after a configured amount of
    time (default is 1 second). Data is cached while the file is open
    (close to open consistency).

    - writeback/no_writeback
    These options control the writeback strategy. If writeback is disabled,
    then normal writes will immediately be synchronized with the host fs.
    If writeback is enabled, then writes may be cached in the guest until
    the file is closed or an fsync(2) performed. This option has no effect
    on mmap-ed writes or writes going through the DAX mechanism.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Vivek Goyal
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     

12 Sep, 2019

12 commits

  • virtio-fs does not support aborting requests which are being
    processed. That is, requests which have been sent to fuse daemon on host.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • Allow virtio-fs to also send DESTROY request.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Don't hold onto a dentry in the lru list if it needs to be looked up again
    anyway at next access. Only do this if explicitly enabled, otherwise it
    could result in
    performance regression.

    More advanced version of this patch would periodically flush out dentries
    from the lru which have gone stale.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • As of now fuse_dev_alloc() both allocates a fuse device and installs it in
    fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation fails.

    virtio-fs needs to initialize multiple fuse devices (one per virtio queue).
    It initializes one fuse device as part of call to fuse_fill_super_common()
    and rest of the devices are allocated and installed after that.

    But, we can't afford to fail after calling fuse_fill_super_common() as we
    don't have a way to undo all the actions done by fuse_fill_super_common().
    So to avoid failures after the call to fuse_fill_super_common(),
    pre-allocate all fuse devices early and install them into fuse connection
    later.

    This patch provides two separate helpers for fuse device allocation and
    fuse device installation in fuse_conn.
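
    The split can be pictured as a two-phase setup: everything that can fail
    happens first, and the second phase only links already-allocated objects.
    A sketch with hypothetical demo_* types and a fixed-size device table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_MAX_DEVS 4

struct demo_dev {
    int queue_index;
};

struct demo_conn {
    struct demo_dev *devices[DEMO_MAX_DEVS];
    size_t nr_devices;
};

/* Phase 1: may fail, but touches nothing shared. */
static struct demo_dev *demo_dev_alloc(int queue_index)
{
    struct demo_dev *dev = malloc(sizeof(*dev));

    if (dev)
        dev->queue_index = queue_index;
    return dev;
}

/* Phase 2: only links the device into the connection, so it cannot fail. */
static void demo_dev_install(struct demo_conn *conn, struct demo_dev *dev)
{
    assert(conn->nr_devices < DEMO_MAX_DEVS);
    conn->devices[conn->nr_devices++] = dev;
}

/* Pre-allocates one device per queue, then installs them all. */
static bool demo_setup(struct demo_conn *conn, size_t nr_queues)
{
    struct demo_dev *devs[DEMO_MAX_DEVS];
    size_t i;

    if (nr_queues > DEMO_MAX_DEVS)
        return false;
    for (i = 0; i < nr_queues; i++) {
        devs[i] = demo_dev_alloc((int)i);
        if (!devs[i]) {
            while (i--)           /* undo is trivial before the commit point */
                free(devs[i]);
            return false;
        }
    }
    /* Past this point nothing can fail any more. */
    for (i = 0; i < nr_queues; i++)
        demo_dev_install(conn, devs[i]);
    return true;
}
```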

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • The /dev/fuse device uses fiq->waitq and fasync to signal that requests are
    available. These mechanisms do not apply to virtio-fs. This patch
    introduces callbacks so alternative behavior can be used.

    Note that queue_interrupt() changes along these lines:

    spin_lock(&fiq->waitq.lock);
    wake_up_locked(&fiq->waitq);
    + kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
    spin_unlock(&fiq->waitq.lock);
    - kill_fasync(&fiq->fasync, SIGIO, POLL_IN);

    Since queue_request() and queue_forget() also call kill_fasync() inside
    the spinlock this should be safe.
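
    The callback approach can be sketched as an ops table that each
    transport fills in. The names below are hypothetical and the "wakeups"
    are plain counters rather than wait queues or virtqueue kicks.

```c
#include <stddef.h>

struct demo_iqueue;

/* Hypothetical ops table: each transport decides how to announce work. */
struct demo_iqueue_ops {
    void (*wake_pending)(struct demo_iqueue *fiq);
};

struct demo_iqueue {
    const struct demo_iqueue_ops *ops;
    unsigned int pending;  /* requests queued so far */
    unsigned int wakeups;  /* character-device style wakeups */
    unsigned int kicks;    /* virtqueue style notifications */
};

/* Character-device transport: wake whoever waits on the device. */
static void demo_dev_wake(struct demo_iqueue *fiq)
{
    fiq->wakeups++;
}

/* Virtio transport: notify the device side instead. */
static void demo_virtio_wake(struct demo_iqueue *fiq)
{
    fiq->kicks++;
}

static const struct demo_iqueue_ops demo_dev_ops = {
    .wake_pending = demo_dev_wake,
};

static const struct demo_iqueue_ops demo_virtio_ops = {
    .wake_pending = demo_virtio_wake,
};

/* Core code queues the request and leaves signalling to the callback. */
static void demo_queue_request(struct demo_iqueue *fiq)
{
    fiq->pending++;
    fiq->ops->wake_pending(fiq);
}
```

    The core queueing code stays identical for both transports; only the
    ops pointer differs.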

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     
  • fuse_fill_super() includes code to process the fd= option and link the
    struct fuse_dev to the fd's struct file. In virtio-fs there is no file
    descriptor because /dev/fuse is not used.

    This patch extracts fuse_fill_super_common() so that both classic fuse and
    virtio-fs can share the code to initialize a mount.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     
  • File systems like virtio-fs should not have to play directly with forget
    list data structures. There is a helper function; use that instead.

    Rename dequeue_forget() to fuse_dequeue_forget() and export it so that
    stacked filesystems can use it.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • virtio-fs will need unique IDs for FORGET requests from outside
    fs/fuse/dev.c. Make the symbol visible.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     
  • This will be used by virtio-fs to send init request to fuse server after
    initialization of virt queues.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi

    Vivek Goyal
     
  • virtio-fs will need to query the length of fuse_arg lists. Make the symbol
    visible.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     
  • virtio-fs will need to complete requests from outside fs/fuse/dev.c. Make
    the symbol visible.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Miklos Szeredi

    Stefan Hajnoczi
     
  • The size of struct fuse_req was reduced from 392B to 144B on a non-debug
    config, thus the sanitize_global_limit() helper was setting a larger
    default limit. This doesn't really reflect reduction in the memory used by
    requests, since the fields removed from fuse_req were added to fuse_args
    derived structs; e.g. sizeof(struct fuse_writepages_args) is 248B, thus
    resulting in slightly more memory being used for writepage requests
    overall (due to using 256B slabs).

    Make the calculation ignore the size of fuse_req and use the old 392B
    value.
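
    To see why the struct size matters, here is a sketch of a limit derived
    from total memory divided by a per-request size. Only the 392B and 144B
    figures come from the text above; the memory fraction (1/2^13) and the
    cap are assumptions picked for illustration, and the demo_* names are
    hypothetical.

```c
#include <stdint.h>

#define DEMO_REQ_SIZE  392u    /* fixed per-request size, as in the text */
#define DEMO_MEM_SHIFT 13      /* assumption: budget 1/2^13 of total memory */
#define DEMO_LIMIT_CAP 65535u  /* assumption: upper bound on the limit */

/* Default limit for a machine with total_ram_bytes of memory. */
static unsigned int demo_default_limit(uint64_t total_ram_bytes,
                                       unsigned int req_size)
{
    uint64_t limit = (total_ram_bytes >> DEMO_MEM_SHIFT) / req_size;

    return limit > DEMO_LIMIT_CAP ? DEMO_LIMIT_CAP : (unsigned int)limit;
}
```

    Under these assumptions an 8 GiB machine gets a limit of 2674 with the
    392B value, while plugging in 144B would raise it to 7281 without any
    real saving in memory per request.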

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

10 Sep, 2019

2 commits