25 Aug, 2011

1 commit


08 Aug, 2011

1 commit

  • Commit a9ff4f87 "fuse: support BSD locking semantics" overlooked a
    number of issues with supporting flock locks over existing POSIX
    locking infrastructure:

    - it's not backward compatible, passing flock(2) calls to userspace
    unconditionally (if userspace sets FUSE_POSIX_LOCKS)

    - it doesn't cater for the fact that flock locks are automatically
    unlocked on file release

    - it doesn't take into account the fact that flock exclusive locks
    (write locks) don't need an fd opened for write.

    The last one invalidates the original premise of the patch that flock
    locks can be emulated with POSIX locks.

    This patch fixes the first two issues. The last one needs to be fixed
    in userspace if the filesystem assumed that a write lock will happen
    only on a file opened for write (as in the case of the current fuse
    library).

    Reported-by: Sebastian Pipping
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
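
The second and third issues listed above can be observed from userspace. The following is a minimal Python sketch, assuming a Linux system and a local (non-FUSE, non-NFS) temporary directory; the helper names are made up for illustration. It takes an exclusive flock lock through a descriptor opened read-only, shows that a POSIX write lock through the same descriptor is refused, and shows that closing the descriptor drops the flock lock.

```python
import fcntl
import os
import tempfile


def exclusive_locks_on_readonly_fd(path):
    """Return (flock_ok, posix_ok) for exclusive locks on an O_RDONLY fd."""
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            flock_ok = True
        except OSError:
            flock_ok = False
        try:
            # lockf() uses fcntl() record locks, i.e. POSIX locks.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            posix_ok = True
        except OSError:
            posix_ok = False
        return flock_ok, posix_ok
    finally:
        os.close(fd)


def flock_released_on_close(path):
    """Return True if closing the locking fd lets another open take the lock."""
    holder = os.open(path, os.O_RDONLY)
    waiter = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(holder, fcntl.LOCK_EX)
        try:
            fcntl.flock(waiter, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return False  # the first lock was not exclusive after all
        except OSError:
            pass  # expected: the lock is held through the other open file
        os.close(holder)  # no explicit LOCK_UN: release happens on close
        holder = None
        fcntl.flock(waiter, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    finally:
        if holder is not None:
            os.close(holder)
        os.close(waiter)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(exclusive_locks_on_readonly_fd(tmp.name))
        print(flock_released_on_close(tmp.name))
```

A filesystem that maps flock onto POSIX locks has to cope with the first case, which is why the commit message says the original premise does not hold.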
     

21 Jul, 2011

1 commit

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push down taking i_mutex and
    the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems can drop taking the i_mutex altogether it seems, like ext3 and
    ocfs2. For correctness' sake I just pushed everything down in all cases to make
    sure that we keep the current behavior the same for everybody, and then each
    individual fs maintainer can make up their mind about what to do from there.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

21 Mar, 2011

1 commit


25 Feb, 2011

1 commit

  • Single threaded NTFS-3G could get stuck if a delayed RELEASE reply
    triggered a DESTROY request via path_put().

    Fix this by

    a) making RELEASE requests synchronous, whenever possible, on fuseblk
    filesystems

    b) if not possible (triggered by an asynchronous read/write) then do
    the path_put() in a separate thread with schedule_work().

    Reported-by: Oliver Neukum
    Cc: stable@kernel.org
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

08 Dec, 2010

2 commits

  • Terje Malmedal reports that a fuse filesystem with 32 million inodes
    on a machine with lots of memory can take up to 30 minutes to process
    FORGET requests when all those inodes are evicted from the icache.

    To solve this, create a BATCH_FORGET request that allows up to about
    8000 FORGET requests to be sent in a single message.

    This request is only sent if userspace supports interface version 7.16
    or later, otherwise fall back to sending individual FORGET messages.

    Reported-by: Terje Malmedal
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
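
To see where "about 8000" can come from, here is a rough Python sketch of the message layout. The struct layouts, the opcode value and the 135168-byte read buffer are assumptions taken from the FUSE protocol header and a typical library buffer size, not from the commit message; little-endian packing is also assumed.

```python
import struct

# Assumed layouts (not stated in the commit message):
# fuse_in_header: len, opcode, unique, nodeid, uid, gid, pid, padding
IN_HEADER = struct.Struct("<IIQQIIII")
# fuse_batch_forget_in: count, dummy
BATCH_FORGET_IN = struct.Struct("<II")
# fuse_forget_one: nodeid, nlookup
FORGET_ONE = struct.Struct("<QQ")
FUSE_BATCH_FORGET = 42  # assumed opcode value


def max_forgets(read_buffer_size):
    """How many forget entries fit in one message of the given size."""
    fixed = IN_HEADER.size + BATCH_FORGET_IN.size
    if read_buffer_size < fixed:
        return 0
    return (read_buffer_size - fixed) // FORGET_ONE.size


def pack_batch_forget(unique, entries):
    """Pack (nodeid, nlookup) pairs into a single BATCH_FORGET message."""
    body = b"".join(FORGET_ONE.pack(nodeid, nlookup)
                    for nodeid, nlookup in entries)
    total = IN_HEADER.size + BATCH_FORGET_IN.size + len(body)
    header = IN_HEADER.pack(total, FUSE_BATCH_FORGET, unique, 0, 0, 0, 0, 0)
    return header + BATCH_FORGET_IN.pack(len(entries), 0) + body


if __name__ == "__main__":
    # Hypothetical userspace buffer: 128 KiB of payload plus one 4 KiB page.
    print(max_forgets(128 * 1024 + 4096))
```

Under these assumptions the limit is set by the size of the buffer userspace reads with, so a larger or smaller buffer changes the batch size accordingly.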
     
  • Terje Malmedal reports that a fuse filesystem with 32 million inodes
    on a machine with lots of memory can go unresponsive for up to 30
    minutes when all those inodes are evicted from the icache.

    The reason is that FORGET messages, sent when the inode is evicted,
    are queued up together with regular filesystem requests, and while the
    huge queue of FORGET messages is processed no other filesystem
    operation can proceed.

    Since a full fuse request structure is allocated for each inode, these
    take up quite a bit of memory as well.

    To solve these issues, create a slim 'fuse_forget_link' structure
    containing just the minimum of information required to send the FORGET
    request and chain these on a separate queue.

    When userspace is asking for a request make sure that FORGET and
    non-FORGET requests are selected fairly: for each 8 non-FORGET allow
    16 FORGET requests. This will make sure FORGETs do not pile up, yet
    other requests are also allowed to proceed while the queued FORGETs
    are processed.

    Reported-by: Terje Malmedal
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
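
The 16:8 selection described above can be modelled in a few lines. This is a simplified Python sketch of the idea, not the kernel's implementation; the function and constant names are invented for illustration.

```python
from collections import deque

FORGET_BURST = 16   # FORGET requests allowed per round
REGULAR_BURST = 8   # non-FORGET requests allowed per round


def schedule(forgets, regular):
    """Return the order in which requests would be handed to userspace."""
    forgets, regular = deque(forgets), deque(regular)
    order = []
    while forgets or regular:
        # Up to 16 FORGETs, then up to 8 regular requests; if one queue is
        # empty the other simply drains.
        for _ in range(FORGET_BURST):
            if not forgets:
                break
            order.append(forgets.popleft())
        for _ in range(REGULAR_BURST):
            if not regular:
                break
            order.append(regular.popleft())
    return order


if __name__ == "__main__":
    order = schedule([("F", i) for i in range(40)],
                     [("R", i) for i in range(10)])
    print("".join(kind for kind, _ in order))
```

The point of the ratio is that neither queue can starve the other: regular requests wait for at most 16 FORGETs, and FORGETs still drain twice as fast as regular traffic.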
     

12 Jul, 2010

2 commits

  • Userspace filesystem can request data to be retrieved from the inode's
    mapping. This request is synchronous and the retrieved data is queued
    as a new request. If the write to the fuse device returns an error
    then the retrieve request was not completed and a reply will not be
    sent.

    Only present pages are returned in the retrieve reply. Retrieving
    stops when it finds a non-present page and only data prior to that is
    returned.

    This request doesn't change the dirty state of pages.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
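
The "stop at the first non-present page" rule can be illustrated with a toy page cache. This is a Python model under simplifying assumptions (a dict of 4096-byte pages, no file size check); the names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size


def retrieve(cache, offset, size):
    """Return cached bytes from offset, stopping at the first missing page.

    cache maps page index -> bytes of length PAGE_SIZE.
    """
    out = bytearray()
    pos, end = offset, offset + size
    while pos < end:
        index, start = divmod(pos, PAGE_SIZE)
        page = cache.get(index)
        if page is None:
            break  # non-present page: only data before it is returned
        chunk = min(PAGE_SIZE - start, end - pos)
        out += page[start:start + chunk]
        pos += chunk
    return bytes(out)


if __name__ == "__main__":
    cache = {0: b"a" * PAGE_SIZE, 1: b"b" * PAGE_SIZE, 3: b"d" * PAGE_SIZE}
    print(len(retrieve(cache, 4000, 10000)))
```

In the example, page 2 is missing, so the reply contains the tail of page 0 and all of page 1, and page 3 is never reached.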
     
  • Userspace filesystem can request data to be stored in the inode's
    mapping. This request is synchronous and has no reply. If the write
    to the fuse device returns an error then the store request was not
    fully completed (but may have updated some pages).

    If the stored data overflows the current file size, then the size is
    extended, similarly to a write(2) on the filesystem.

    Pages which have been completely stored are marked uptodate.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
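
A toy model of the store semantics, written in Python under simplifying assumptions (4096-byte pages, no error handling): data is copied into the mapping page by page, the file size grows if the data runs past it, and only pages that were written in full are marked uptodate. The class and function names are invented for this sketch.

```python
PAGE_SIZE = 4096  # assumed page size


class Mapping:
    """Minimal stand-in for an inode's page cache."""

    def __init__(self, size=0):
        self.size = size
        self.pages = {}        # page index -> bytearray(PAGE_SIZE)
        self.uptodate = set()  # indexes of pages known to be fully valid


def store(mapping, offset, data):
    """Copy data into the mapping at offset, extending the size if needed."""
    pos, done = offset, 0
    while done < len(data):
        index, start = divmod(pos, PAGE_SIZE)
        chunk = min(PAGE_SIZE - start, len(data) - done)
        page = mapping.pages.setdefault(index, bytearray(PAGE_SIZE))
        page[start:start + chunk] = data[done:done + chunk]
        if chunk == PAGE_SIZE:
            mapping.uptodate.add(index)  # completely stored page
        pos += chunk
        done += chunk
    mapping.size = max(mapping.size, offset + len(data))


if __name__ == "__main__":
    m = Mapping(size=100)
    store(m, PAGE_SIZE, b"x" * 5000)
    print(m.size, sorted(m.uptodate))
```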
     

31 May, 2010

1 commit


28 May, 2010

1 commit


25 May, 2010

1 commit

  • When splicing buffers to the fuse device with SPLICE_F_MOVE, try to
    move pages from the pipe buffer into the page cache. This allows
    populating the fuse filesystem's cache without ever touching the page
    contents, i.e. zero copy read capability.

    The following steps are performed when trying to move a page into the
    page cache:

    - buf->ops->confirm() to make sure the new page is uptodate
    - buf->ops->steal() to try to remove the new page from its previous place
    - remove_from_page_cache() on the old page
    - add_to_page_cache_locked() on the new page

    If any of the above steps fail (non fatally) then the code falls back
    to copying the page. In particular ->steal() will fail if there are
    external references (other than the page cache and the pipe buffer) to
    the page.

    Also since the remove_from_page_cache() + add_to_page_cache_locked()
    are non-atomic it is possible that the page cache is repopulated in
    between the two and add_to_page_cache_locked() will fail. This could
    be fixed by creating a new atomic replace_page_cache_page() function.

    fuse_readpages_end() needed to be reworked so it works even if
    page->mapping is NULL for some or all pages which can happen if the
    add_to_page_cache_locked() failed.

    A number of sanity checks were added to make sure the stolen pages
    don't have weird flags set, etc... These could be moved into generic
    splice/steal code.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

24 Sep, 2009

1 commit

  • Update some fs code to make use of new helper functions introduced
    in the previous patch. Should be no significant change in behaviour
    (except CIFS now calls send_sig under i_lock, via inode_newsize_ok).

    Reviewed-by: Christoph Hellwig
    Acked-by: Miklos Szeredi
    Cc: linux-nfs@vger.kernel.org
    Cc: Trond.Myklebust@netapp.com
    Cc: linux-cifs-client@lists.samba.org
    Cc: sfrench@samba.org
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

16 Sep, 2009

1 commit


07 Jul, 2009

1 commit


01 Jul, 2009

2 commits

  • Add notification messages that allow the filesystem to invalidate VFS
    caches.

    Two notifications are added:

    1) inode invalidation

    - invalidate cached attributes
    - invalidate a range of pages in the page cache (this is optional)

    2) dentry invalidation

    - try to invalidate a subtree in the dentry cache

    Care must be taken while accessing the 'struct super_block' for the
    mount, as it can go away while an invalidation is in progress. To
    prevent this, introduce a rw-semaphore, that is taken for read during
    the invalidation and taken for write in the ->kill_sb callback.

    Cc: Csaba Henk
    Cc: Anand Avati
    Signed-off-by: Miklos Szeredi

    John Muir
     
  • This patch lets filesystems handle masking the file mode on creation.
    This is needed if filesystem is using ACLs.

    - The CREATE, MKDIR and MKNOD requests are extended with a "umask"
    parameter.

    - A new FUSE_DONT_MASK flag is added to the INIT request/reply. With
    this the filesystem may request that the create mode is not masked.

    CC: Jean-Pierre André
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
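
The effect of FUSE_DONT_MASK on the mode carried in a create-type request can be sketched as follows. This is a Python model with a hypothetical function name and a plain dict standing in for the request; it only illustrates the masking rule described above.

```python
def build_create_request(mode, umask, dont_mask):
    """Model the mode and umask fields of a CREATE/MKDIR/MKNOD request.

    dont_mask stands for the filesystem having negotiated FUSE_DONT_MASK.
    """
    if not dont_mask:
        mode &= ~umask  # the kernel applies the umask itself
    # The umask is passed along either way so the filesystem can apply it,
    # for example only when the parent directory has no default ACL.
    return {"mode": mode, "umask": umask}


if __name__ == "__main__":
    print(build_create_request(0o666, 0o022, dont_mask=False))
    print(build_create_request(0o666, 0o022, dont_mask=True))
```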
     

28 Apr, 2009

8 commits


28 Mar, 2009

1 commit


26 Nov, 2008

6 commits

  • Add fuse_conn->release() so that fuse_conn can be embedded in other
    structures.

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     
  • Separate out fuse_conn_init() from new_conn() and while at it
    initialize fuse_conn->entry during conn initialization.

    This will be used by CUSE.

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     
  • Add fuse_ prefix to request_send*() and get_root_inode() as some of
    those functions will be exported for CUSE. With or without CUSE
    export, having the function names scoped is a good idea for
    debuggability.

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     
  • Implement poll support. Polled files are indexed using kh in a RB
    tree rooted at fuse_conn->polled_files.

    Client should send FUSE_NOTIFY_POLL notification once after processing
    FUSE_POLL which has FUSE_POLL_SCHEDULE_NOTIFY set. Sending
    notification unconditionally after the latest poll or every time file
    content might have changed is inefficient but won't cause malfunction.

    fuse_file_poll() can sleep and requires patches from the following
    thread which allows f_op->poll() to sleep.

    http://thread.gmane.org/gmane.linux.kernel/726176

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     
  • The file handle, fuse_file->fh, is an opaque value supplied by the
    userland FUSE server, and its uniqueness is not guaranteed. Add a kernel
    file handle, fuse_file->kh, which is allocated by the kernel on file
    allocation and is guaranteed to be unique.

    This will be used by poll to match a notification to the respective file
    but can be used for other purposes where a unique file handle is
    necessary.

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
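
Why a kernel-allocated handle is needed for poll can be shown with a small model. This Python sketch uses a dict where the kernel uses an RB tree, and a simple counter for kh; all class and method names are hypothetical.

```python
import itertools


class File:
    def __init__(self, fh, kh):
        self.fh = fh      # opaque value chosen by the userspace server
        self.kh = kh      # unique value chosen by the kernel
        self.wakeups = 0


class Connection:
    def __init__(self):
        self._kh_counter = itertools.count(1)
        self.polled_files = {}  # kh -> File (an RB tree in the kernel)

    def open_file(self, fh):
        return File(fh, next(self._kh_counter))

    def poll(self, f):
        # Remember the file so a later notification can find it by kh.
        self.polled_files[f.kh] = f

    def notify_poll(self, kh):
        f = self.polled_files.get(kh)
        if f is None:
            return False  # unknown kh: nothing to wake
        f.wakeups += 1
        return True


if __name__ == "__main__":
    conn = Connection()
    a = conn.open_file(fh=7)
    b = conn.open_file(fh=7)   # the server reused the same fh
    conn.poll(b)
    conn.notify_poll(b.kh)
    print(a.kh, b.kh, a.wakeups, b.wakeups)
```

Both files share fh 7 in the example, so fh could not tell them apart; the notification reaches only the file whose kh matches.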
     
  • Fix coding style errors reported by checkpatch and others. Update
    copyright date to 2008.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

16 Oct, 2008

1 commit


26 Jul, 2008

2 commits

  • Implement the get_parent export operation by sending a LOOKUP request with
    ".." as the name.

    Implement looking up an inode by node ID after it has been evicted from
    the cache. This is done by sending a LOOKUP request with "." as the name
    (for all file types, not just directories).

    The filesystem can set the FUSE_EXPORT_SUPPORT flag in the INIT reply, to
    indicate that it supports these special lookups.

    Thanks to John Muir for the original implementation of this feature.

    Signed-off-by: Miklos Szeredi
    Cc: "J. Bruce Fields"
    Cc: Trond Myklebust
    Cc: Matthew Wilcox
    Cc: David Teigland
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
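
On the userspace side, a filesystem that sets FUSE_EXPORT_SUPPORT has to answer these two special names. A toy Python sketch of such a lookup handler follows; the in-memory node table, the method names and the use of 1 as the root node ID are assumptions for illustration.

```python
class ToyFilesystem:
    ROOT = 1  # assumed root node ID

    def __init__(self):
        self.parent = {self.ROOT: self.ROOT}
        self.children = {self.ROOT: {}}
        self._next_id = 2

    def add(self, parent, name):
        """Create a node (file or directory) and return its node ID."""
        nodeid = self._next_id
        self._next_id += 1
        self.parent[nodeid] = parent
        self.children[nodeid] = {}
        self.children[parent][name] = nodeid
        return nodeid

    def lookup(self, nodeid, name):
        """Return the node ID for name under nodeid, or None if absent."""
        if nodeid not in self.parent:
            return None
        if name == ".":
            return nodeid               # look up a node by its own ID
        if name == "..":
            return self.parent[nodeid]  # used for get_parent
        return self.children[nodeid].get(name)


if __name__ == "__main__":
    fs = ToyFilesystem()
    d = fs.add(ToyFilesystem.ROOT, "dir")
    f = fs.add(d, "file")
    print(fs.lookup(f, "."), fs.lookup(d, ".."))
```

Note that "." is answered for the plain file node as well, matching the "all file types" remark above.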
     
  • Implement export_operations, to allow fuse filesystems to be exported to
    NFS. This feature has been in the out-of-tree fuse module, and is widely
    used and tested.

    It was not originally merged into mainline, because doing the NFS
    export in userspace was thought to be a cleaner and more efficient way of
    doing it, than through the kernel.

    While that is true, it would also have involved a lot of duplicated effort
    at reimplementing NFS exporting (all the different versions of the
    protocol). This effort was unfortunately not undertaken by anyone, so we
    are left with doing it the easy but less efficient way.

    If this feature goes in, the out-of-tree fuse module can go away,
    which would have several advantages:

    - not having to maintain two versions
    - less confusion for users
    - no bugs due to kernel API changes

    Comment from hch:
    - Use the same fh_type values as XFS, since we use the same fh encoding.

    Signed-off-by: Miklos Szeredi
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

13 May, 2008

1 commit

  • Prior to 2.6.26 fuse only supported single page write requests. In theory all
    fuse filesystems should be able to support writes bigger than 4k, as there's
    nothing in the API to prevent it. Unfortunately there's a known case in
    NTFS-3G where big writes cause filesystem corruption. There could also be
    other filesystems, where the lack of testing with big write requests would
    result in bugs.

    To prevent such problems on a kernel upgrade, disable big writes by default,
    but let filesystems set a flag to turn it on.

    Signed-off-by: Miklos Szeredi
    Cc: Szabolcs Szakacsits
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
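
What the flag changes from the filesystem's point of view is the size of the WRITE requests it receives. A simplified Python sketch, assuming 4096-byte pages and ignoring offset alignment; the function and parameter names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumed page size


def split_write(length, max_write, big_writes):
    """Return the sizes of the WRITE requests for one write of `length` bytes."""
    # Without the flag each request carries at most one page.
    limit = max_write if big_writes else min(max_write, PAGE_SIZE)
    return [min(limit, length - off) for off in range(0, length, limit)]


if __name__ == "__main__":
    print(split_write(10000, max_write=131072, big_writes=False))
    print(split_write(10000, max_write=131072, big_writes=True))
```

A filesystem that was only ever tested with the first pattern keeps seeing it after a kernel upgrade unless it opts in.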
     

30 Apr, 2008

4 commits

  • Node ID is 64bit but it is passed as unsigned long to some functions. This
    breakage wasn't noticed, because libfuse uses unsigned long too.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
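
The breakage only shows on systems where unsigned long is 32 bits wide. A short Python illustration of the truncation, with a hypothetical helper name:

```python
def pass_as_unsigned_long(value, long_bits):
    """Model passing a 64-bit value through an unsigned long parameter."""
    return value & ((1 << long_bits) - 1)


if __name__ == "__main__":
    nodeid = (1 << 32) + 5
    print(pass_as_unsigned_long(nodeid, 32))  # upper bits are lost
    print(pass_as_unsigned_long(nodeid, 64))  # value survives
```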
     
  • If the READ request returned a short count, then either

    - cached size is incorrect
    - filesystem is buggy, as short reads are only allowed on EOF

    So assume that the size is wrong and refresh it, so that cached read() doesn't
    zero fill the missing chunk.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Quoting Linus (3 years ago, FUSE inclusion discussions):

    "User-space filesystems are hard to get right. I'd claim that they
    are almost impossible, unless you limit them somehow (shared
    writable mappings are the nastiest part - if you don't have those,
    you can reasonably limit your problems by limiting the number of
    dirty pages you accept through normal "write()" calls)."

    Instead of attempting the impossible, I've just waited for the dirty page
    accounting infrastructure to materialize (thanks to Peter Zijlstra and
    others). This nicely solved the biggest problem: limiting the number of pages
    used for write caching.

    Some small details remained, however, which this largish patch attempts to
    address. It provides a page writeback implementation for fuse, which is
    completely safe against VM related deadlocks. Performance may not be very
    good for certain usage patterns, but generally it should be acceptable.

    It has been tested extensively with fsx-linux and bash-shared-mapping.

    Fuse page writeback design
    --------------------------

    fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
    It copies the contents of the original page, and queues a WRITE request to the
    userspace filesystem using this temp page.

    The writeback is finished instantly from the MM's point of view: the page is
    removed from the radix trees, and the PageDirty and PageWriteback flags are
    cleared.

    For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
    incremented. The per-bdi writeback count is not decremented until the actual
    write completes.

    On dirtying the page, fuse waits for a previous write to finish before
    proceeding. This makes sure there can only be one temporary page used at a
    time for one cached page.

    This approach is wasteful in both memory and CPU bandwidth, so why is this
    complication needed?

    The basic problem is that there can be no guarantee about the time in which
    the userspace filesystem will complete a write. It may be buggy or even
    malicious, and fail to complete WRITE requests. We don't want unrelated parts
    of the system to grind to a halt in such cases.

    Also a filesystem may need additional resources (particularly memory) to
    complete a WRITE request. There's a great danger of a deadlock if that
    allocation may wait for the writepage to finish.

    Currently there are several cases where the kernel can block on page
    writeback:

    - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
    - page migration
    - throttle_vm_writeout (through NR_WRITEBACK)
    - sync(2)

    Of course in some cases (fsync, msync) we explicitly want to allow blocking.
    So for these cases new code has to be added to fuse, since the VM is not
    tracking writeback pages for us any more.

    As an extra safety measure, the maximum dirty ratio allocated to a single
    fuse filesystem is set to 1% by default. This way one (or several) buggy or
    malicious fuse filesystems cannot slow down the rest of the system by hogging
    dirty memory.

    With appropriate privileges, this limit can be raised through
    '/sys/class/bdi//max_ratio'.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
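
The temporary-page scheme can be modelled in userspace. This Python sketch is a toy under stated assumptions: the class names are invented, a counter stands in for NR_WRITEBACK_TEMP, and an exception stands in for the sleep the kernel would do when a page is dirtied while its previous write is still in flight.

```python
class CachedPage:
    def __init__(self, data):
        self.data = bytearray(data)
        self.dirty = False
        self.temp_copy = None  # set while a WRITE request is in flight


class WritebackModel:
    def __init__(self):
        self.nr_writeback_temp = 0
        self.sent = []  # payloads handed to the userspace filesystem

    def set_dirty(self, page, data):
        if page.temp_copy is not None:
            # The kernel would wait here until the earlier write finishes.
            raise BlockingIOError("previous write still in flight")
        page.data[:] = data
        page.dirty = True

    def writepage(self, page):
        assert page.dirty and page.temp_copy is None
        page.temp_copy = bytes(page.data)  # copy into a temporary page
        page.dirty = False                 # writeback is finished for the MM
        self.nr_writeback_temp += 1
        self.sent.append(page.temp_copy)

    def end_write(self, page):
        # Called when the userspace filesystem replies to the WRITE.
        page.temp_copy = None
        self.nr_writeback_temp -= 1


if __name__ == "__main__":
    model = WritebackModel()
    page = CachedPage(b"\0" * 8)
    model.set_dirty(page, b"AAAAAAAA")
    model.writepage(page)
    print(page.dirty, model.nr_writeback_temp)
    model.end_write(page)
    print(model.nr_writeback_temp)
```

Because the cached page is clean as soon as the copy is made, a filesystem that never replies only pins its temporary copies, not the page cache pages themselves.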
     
  • Register FUSE's backing_dev_info under sysfs with the name "fuse-MAJOR:MINOR"

    Make the fuse control filesystem use s_dev instead of a fuse specific ID.
    This makes it easier to match directories under /sys/fs/fuse/connections/ with
    directories under /sys/class/bdi, and with actual mounts.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
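
With this naming, the sysfs entry for a mount can be derived from the device number that stat(2) reports. A small Python sketch, assuming the "fuse-MAJOR:MINOR" scheme described above (other kernel versions may name the entry differently); the helper names are hypothetical.

```python
import os


def bdi_name(dev):
    """Build the bdi name for a FUSE mount from its device number."""
    return "fuse-%d:%d" % (os.major(dev), os.minor(dev))


def bdi_sysfs_dir(path):
    """Guess the /sys/class/bdi directory for the filesystem holding path."""
    return "/sys/class/bdi/" + bdi_name(os.stat(path).st_dev)


if __name__ == "__main__":
    print(bdi_name(os.makedev(0, 25)))
```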