13 Dec, 2011

2 commits

  • Allows a FUSE file-system to tell the kernel when a file or directory is
    deleted. If the specified dentry has the specified inode number, the kernel will
    unhash it.

    The current 'fuse_notify_inval_entry' does not cause the kernel to clean up
    directories that are in use properly, and as a result the users of those
    directories see incorrect semantics from the file-system. The error condition
    seen when 'fuse_notify_inval_entry' is used to notify of a deleted directory is
    avoided when 'fuse_notify_delete' is used instead.

    The following scenario demonstrates the difference:
    1. User A chdirs into 'testdir' and starts reading 'testfile'.
    2. User B rm -rf 'testdir'.
    3. User B creates 'testdir'.
    4. User C chdirs into 'testdir'.

    If you run the above within the same machine on any file-system (including fuse
    file-systems), there is no problem: user C is able to chdir into the new
    testdir. The old testdir is removed from the dentry tree, but still open by user
    A.

    If operations 2 and 3 are performed via the network such that the fuse
    file-system uses one of the notify functions to tell the kernel that the nodes
    are gone, then the following error occurs for user C while user A holds the
    original directory open:

    muirj@empacher:~> ls /test/testdir
    ls: cannot access /test/testdir: No such file or directory

    The issue here is that the kernel still has a dentry for testdir, and so it is
    requesting the attributes for the old directory, while the file-system is
    responding that the directory no longer exists.

    If, on the other hand, the file-system can notify the kernel that the
    directory is deleted using the new 'fuse_notify_delete' function, then the above
    ls will find the new directory as expected.
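
    The "unhash only if the inode number matches" rule can be sketched in
    plain C. This is a toy userspace model, not kernel code: the toy_dentry
    structure and the toy_notify_delete() and toy_lookup() helpers are
    hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a cached directory entry; not the kernel's struct dentry. */
struct toy_dentry {
    const char *name;   /* entry name within its parent */
    uint64_t ino;       /* inode number the entry points at */
    bool hashed;        /* true while name lookups can still find it */
    int users;          /* open references, e.g. a process's cwd */
};

/*
 * Model of a "delete" notification: unhash the entry only when both the
 * name and the inode number match. Returns true if an entry was unhashed.
 */
bool toy_notify_delete(struct toy_dentry *cache, size_t n,
                       const char *name, uint64_t child_ino)
{
    assert(cache != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct toy_dentry *d = &cache[i];
        if (d->hashed && d->ino == child_ino && strcmp(d->name, name) == 0) {
            d->hashed = false;  /* lookups now miss; existing users keep it */
            return true;
        }
    }
    return false;
}

/* Name lookup only sees hashed entries. */
struct toy_dentry *toy_lookup(struct toy_dentry *cache, size_t n,
                              const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (cache[i].hashed && strcmp(cache[i].name, name) == 0)
            return &cache[i];
    return NULL;
}
```

    In this model user A's reference to the old entry survives the unhash,
    while a later lookup of the same name is free to find a new entry.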

    Signed-off-by: John Muir
    Signed-off-by: Miklos Szeredi

    John Muir
     
  • Fix two bugs in fuse_retrieve():

    - retrieving more than one page would yield repeated instances of the
    first page

    - if more than FUSE_MAX_PAGES_PER_REQ pages were requested then the
    request page array would overflow

    fuse_retrieve() was added in 2.6.36 and these bugs had been there since the
    beginning.
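
    The two fixes amount to advancing the page index on every iteration and
    capping the number of pages collected. A minimal userspace sketch,
    assuming hypothetical names (toy_collect_pages, TOY_MAX_PAGES as a
    stand-in for FUSE_MAX_PAGES_PER_REQ with an arbitrary value):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 32u   /* stand-in for FUSE_MAX_PAGES_PER_REQ */

/*
 * Store the page indices covering [offset, offset + size) in out[] and
 * return how many were stored. Illustrative only; not the kernel function.
 */
size_t toy_collect_pages(unsigned long long offset, size_t size,
                         unsigned long out[TOY_MAX_PAGES])
{
    unsigned long index = (unsigned long)(offset / TOY_PAGE_SIZE);
    size_t in_page = (size_t)(offset % TOY_PAGE_SIZE);
    size_t num = 0;

    assert(out != NULL);
    while (size > 0 && num < TOY_MAX_PAGES) {   /* cap: no array overflow */
        size_t chunk = TOY_PAGE_SIZE - in_page;

        if (chunk > size)
            chunk = size;
        out[num++] = index;
        index++;        /* advance, so the first page is not repeated */
        size -= chunk;
        in_page = 0;    /* later pages start at their beginning */
    }
    return num;
}
```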

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org

    Miklos Szeredi
     

13 Sep, 2011

1 commit

  • kmemleak is reporting that 32 bytes are being leaked by FUSE:

    unreferenced object 0xe373b270 (size 32):
    comm "fusermount", pid 1207, jiffies 4294707026 (age 2675.187s)
    hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x27/0x50
    [] kmem_cache_alloc+0xc5/0x180
    [] fuse_alloc_forget+0x1e/0x20
    [] fuse_alloc_inode+0xb0/0xd0
    [] alloc_inode+0x1c/0x80
    [] iget5_locked+0x8f/0x1a0
    [] fuse_iget+0x72/0x1a0
    [] fuse_get_root_inode+0x8a/0x90
    [] fuse_fill_super+0x3ef/0x590
    [] mount_nodev+0x3f/0x90
    [] fuse_mount+0x15/0x20
    [] mount_fs+0x1c/0xc0
    [] vfs_kern_mount+0x41/0x90
    [] do_kern_mount+0x39/0xd0
    [] do_mount+0x2e5/0x660
    [] sys_mount+0x66/0xa0

    This leak report is consistent and happens once per boot on
    3.1.0-rc5-dirty.

    This happens if a FORGET request is queued after the fuse device was
    released.

    Reported-by: Sitsofe Wheeler
    Signed-off-by: Miklos Szeredi
    Tested-by: Sitsofe Wheeler
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

24 Aug, 2011

1 commit


23 Mar, 2011

1 commit

  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

21 Mar, 2011

1 commit


08 Dec, 2010

2 commits

  • Terje Malmedal reports that a fuse filesystem with 32 million inodes
    on a machine with lots of memory can take up to 30 minutes to process
    FORGET requests when all those inodes are evicted from the icache.

    To solve this, create a BATCH_FORGET request that allows up to about
    8000 FORGET requests to be sent in a single message.

    This request is only sent if userspace supports interface version 7.16
    or later, otherwise fall back to sending individual FORGET messages.
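
    The batching idea can be sketched as packing fixed-size entries behind a
    small count header. The struct layouts and function names below are
    assumptions made for illustration, not copied from the FUSE headers.
    With a 128 KiB buffer (an assumed size, not stated in the commit) the
    capacity comes to 8191 entries, in line with "about 8000".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire layout; field names are assumptions. */
struct toy_forget_one {
    uint64_t nodeid;    /* inode being forgotten */
    uint64_t nlookup;   /* lookup count to drop */
};

struct toy_batch_header {
    uint32_t count;     /* number of entries that follow */
    uint32_t pad;
};

/* How many forget entries fit in a message buffer of buf_size bytes. */
size_t toy_batch_capacity(size_t buf_size)
{
    if (buf_size < sizeof(struct toy_batch_header))
        return 0;
    return (buf_size - sizeof(struct toy_batch_header)) /
           sizeof(struct toy_forget_one);
}

/* Pack as many pending entries as fit; returns the number packed. */
size_t toy_pack_batch(unsigned char *buf, size_t buf_size,
                      const struct toy_forget_one *pending, size_t n_pending)
{
    struct toy_batch_header hdr = { 0, 0 };
    size_t n = toy_batch_capacity(buf_size);

    assert(buf != NULL);
    if (buf_size < sizeof(hdr))
        return 0;               /* not even room for the header */
    if (n > n_pending)
        n = n_pending;
    hdr.count = (uint32_t)n;
    memcpy(buf, &hdr, sizeof(hdr));
    if (n > 0)
        memcpy(buf + sizeof(hdr), pending, n * sizeof(*pending));
    return n;
}
```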

    Reported-by: Terje Malmedal
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Terje Malmedal reports that a fuse filesystem with 32 million inodes
    on a machine with lots of memory can go unresponsive for up to 30
    minutes when all those inodes are evicted from the icache.

    The reason is that FORGET messages, sent when the inode is evicted,
    are queued up together with regular filesystem requests, and while the
    huge queue of FORGET messages is processed no other filesystem
    operation can proceed.

    Since a full fuse request structure is allocated for each inode, these
    take up quite a bit of memory as well.

    To solve these issues, create a slim 'fuse_forget_link' structure
    containing just the minimum of information required to send the FORGET
    request and chain these on a separate queue.

    When userspace is asking for a request make sure that FORGET and
    non-FORGET requests are selected fairly: for each 8 non-FORGET allow
    16 FORGET requests. This will make sure FORGETs do not pile up, yet
    other requests are also allowed to proceed while the queued FORGETs
    are processed.

    Reported-by: Terje Malmedal
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

28 Oct, 2010

1 commit

  • Replace iterated page_cache_release() with release_pages(), which is
    faster and shorter.

    Needs release_pages() to be exported to modules.

    Suggested-by: Andrew Morton
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

27 Oct, 2010

2 commits


04 Oct, 2010

1 commit


07 Sep, 2010

2 commits

  • Sparse doesn't understand lock annotations of the form
    __releases(&foo->lock). Change them to __releases(foo->lock). Same
    for __acquires().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • David Bartley reported that fuse can hang in fuse_get_req_nofail() when
    the connection to the filesystem server is no longer active.

    If bg_queue is not empty then flush_bg_queue() called from
    request_end() can put more requests on to the pending queue. If this
    happens while ending requests on the processing queue then those
    background requests will be queued to the pending list and never
    ended.

    Another problem is that fuse_dev_release() didn't wake up processes
    sleeping on blocked_waitq.

    Solve this by:

    a) flushing the background queue before calling end_requests() on the
    pending and processing queues

    b) setting blocked = 0 and waking up processes waiting on
    blocked_waitq

    Thanks to David for an excellent bug report.

    Reported-by: David Bartley
    Signed-off-by: Miklos Szeredi
    CC: stable@kernel.org

    Miklos Szeredi
     

12 Jul, 2010

3 commits

  • Userspace filesystem can request data to be retrieved from the inode's
    mapping. This request is synchronous and the retrieved data is queued
    as a new request. If the write to the fuse device returns an error
    then the retrieve request was not completed and a reply will not be
    sent.

    Only present pages are returned in the retrieve reply. Retrieving
    stops when it finds a non-present page and only data prior to that is
    returned.

    This request doesn't change the dirty state of pages.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Userspace filesystem can request data to be stored in the inode's
    mapping. This request is synchronous and has no reply. If the write
    to the fuse device returns an error then the store request was not
    fully completed (but may have updated some pages).

    If the stored data overflows the current file size, then the size is
    extended, similarly to a write(2) on the filesystem.

    Pages which have been completely stored are marked uptodate.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Don't use atomic kmap for mapping userspace buffers in device
    read/write/splice.

    This is necessary because the next patch (adding store notify)
    requires that caller of fuse_copy_page() may sleep between
    invocations. The simplest way to ensure this is to change the atomic
    kmaps to non-atomic ones.

    Thankfully architectures where kmap() is not a no-op are going out of
    fashion, so we can ignore the (probably negligible) performance impact
    of this change.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

31 May, 2010

1 commit


26 May, 2010

1 commit

  • This adds:
    alias: devname:
    to some common kernel modules, which will allow the on-demand loading
    of the kernel module when the device node is accessed.

    Ideally all these modules would be compiled-in, but distros seem too
    much in love with their modularization that we need to cover the common
    cases with this new facility. It will allow us to remove a bunch of pretty
    useless init scripts and modprobes from init scripts.

    The static device node aliases will be carried in the module itself. The
    program depmod will extract this information to a file in the module directory:
    $ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
    # Device nodes to trigger on-demand module loading.
    microcode cpu/microcode c10:184
    fuse fuse c10:229
    ppp_generic ppp c108:0
    tun net/tun c10:200
    dm_mod mapper/control c10:235

    Udev will pick up the depmod created file on startup and create all the
    static device nodes which the kernel modules specify, so that these modules
    get automatically loaded when the device node is accessed:
    $ /sbin/udevd --debug
    ...
    static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
    static_dev_create_from_modules: mknod '/dev/fuse' c10:229
    static_dev_create_from_modules: mknod '/dev/ppp' c108:0
    static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
    static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
    udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
    udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666

    A few device nodes are switched to statically allocated numbers, to allow
    the static nodes to work. This might also be useful for systems which still run
    a plain static /dev, which is completely unsafe to use with any dynamic minor
    numbers.

    Note:
    The devname aliases must be limited to the *common* and *single*instance*
    device nodes, like the misc devices, and never be used for conceptually limited
    systems like the loop devices, which should rather get fixed properly and get a
    control node for losetup to talk to, instead of creating a random number of
    device nodes in advance, regardless if they are ever used.

    This facility is to hide the mess distros are creating with too modularized
    kernels, and just to hide that these modules are not compiled-in, and not to
    paper-over broken concepts. Thanks! :)

    Cc: Greg Kroah-Hartman
    Cc: David S. Miller
    Cc: Miklos Szeredi
    Cc: Chris Mason
    Cc: Alasdair G Kergon
    Cc: Tigran Aivazian
    Cc: Ian Kent
    Signed-Off-By: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

25 May, 2010

4 commits

  • Allow userspace filesystem implementation to use splice() to read from
    the fuse device.

    The userspace filesystem can now transfer data coming from a WRITE
    request to an arbitrary file descriptor (regular file, block device or
    socket) without having to go through a userspace buffer.

    The semantics of using splice() to read messages are:

    1) with a single splice() call move the whole message from the fuse
    device to a temporary pipe
    2) read the header from the pipe and determine the message type
    3a) if message is a WRITE then splice data from pipe to destination
    3b) else read rest of message to userspace buffer

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • When splicing buffers to the fuse device with SPLICE_F_MOVE, try to
    move pages from the pipe buffer into the page cache. This allows
    populating the fuse filesystem's cache without ever touching the page
    contents, i.e. zero copy read capability.

    The following steps are performed when trying to move a page into the
    page cache:

    - buf->ops->confirm() to make sure the new page is uptodate
    - buf->ops->steal() to try to remove the new page from its previous place
    - remove_from_page_cache() on the old page
    - add_to_page_cache_locked() on the new page

    If any of the above steps fail (non fatally) then the code falls back
    to copying the page. In particular ->steal() will fail if there are
    external references (other than the page cache and the pipe buffer) to
    the page.

    Also since the remove_from_page_cache() + add_to_page_cache_locked()
    are non-atomic it is possible that the page cache is repopulated in
    between the two and add_to_page_cache_locked() will fail. This could
    be fixed by creating a new atomic replace_page_cache_page() function.

    fuse_readpages_end() needed to be reworked so it works even if
    page->mapping is NULL for some or all pages which can happen if the
    add_to_page_cache_locked() failed.

    A number of sanity checks were added to make sure the stolen pages
    don't have weird flags set, etc... These could be moved into generic
    splice/steal code.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Allow userspace filesystem implementation to use splice() to write to
    the fuse device. The semantics of using splice() are:

    1) buffer the message header and data in a temporary pipe
    2) with a *single* splice() call move the message from the temporary pipe
    to the fuse device

    The READ reply message has the most interesting use for this, since
    now the data from an arbitrary file descriptor (which could be a
    regular file, a block device or a socket) can be transferred into the
    fuse device without having to go through a userspace buffer. It will
    also allow zero copy moving of pages.

    One caveat is that the protocol on the fuse device requires the length
    of the whole message to be written into the header. But the length of
    the data transferred into the temporary pipe may not be known in
    advance. The current library implementation works around this by
    using vmsplice to write the header and modifying the header after
    splicing the data into the pipe (error handling omitted):

    struct fuse_out_header out;

    iov.iov_base = &out;
    iov.iov_len = sizeof(struct fuse_out_header);
    vmsplice(pip[1], &iov, 1, 0);
    len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
    /* retrospectively modify the header: */
    out.len = len + sizeof(struct fuse_out_header);
    splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);

    This works since vmsplice only saves a pointer to the data, it does
    not copy the data itself.

    Since pipes are currently limited to 16 pages and messages need to be
    spliced atomically, the length of the data is limited to 15 pages (or
    60kB for 4k pages).
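
    The arithmetic behind the limit is small enough to write out. This
    sketch assumes the header occupies one of the pipe's page slots, which
    is how the 16 versus 15 figures above read; toy_max_spliced_data() is a
    hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* One pipe slot is taken by the header; the rest can carry data. */
size_t toy_max_spliced_data(size_t pipe_pages, size_t page_size)
{
    assert(pipe_pages >= 1);
    return (pipe_pages - 1) * page_size;
}
```

    For 16 pipe pages of 4096 bytes this gives 15 * 4096 = 61440 bytes,
    which is the 60kB mentioned above.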

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Replace uses of get_user_pages() with get_user_pages_fast(). It looks
    nicer and should be faster in most cases.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

05 Feb, 2010

2 commits

  • gcc 4.4 warns about:
    fs/fuse/dev.c: In function ‘fuse_notify_inval_entry’:
    fs/fuse/dev.c:925: warning: the frame size of 1060 bytes is larger than 1024 bytes

    The problem is that we declare two structures and a large array on the stack.
    I move the array away from the stack and allocate memory for it dynamically.
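
    The general pattern, shown here in userspace C with malloc() standing in
    for the kernel allocator, looks roughly like this. TOY_NAME_MAX and
    toy_copy_name() are hypothetical and only illustrate the shape of the
    change.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NAME_MAX 1024   /* stand-in for a large name-length limit */

/*
 * Copy namelen bytes of src into a freshly allocated, NUL-terminated
 * buffer. Returns NULL if the name is too long or allocation fails.
 * The caller frees the result.
 */
char *toy_copy_name(const char *src, size_t namelen)
{
    char *buf;

    assert(src != NULL);
    if (namelen > TOY_NAME_MAX)
        return NULL;
    /* Previously this would have been: char buf[TOY_NAME_MAX + 1]; */
    buf = malloc(TOY_NAME_MAX + 1);
    if (buf == NULL)
        return NULL;
    memcpy(buf, src, namelen);
    buf[namelen] = '\0';
    return buf;
}
```

    The cost is an extra failure path for the allocation, in exchange for
    a stack frame that no longer carries the large array.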

    Signed-off-by: Fang Wenqi
    Signed-off-by: Miklos Szeredi

    Fang Wenqi
     
  • Small cleanup in fuse_notify_inval_inode() and
    fuse_notify_inval_entry().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

19 Sep, 2009

1 commit


12 Jul, 2009

1 commit


11 Jul, 2009

2 commits

  • When building v2.6.31-rc2-344-g69ca06c, the following build errors are
    found due to missing includes:

    CC [M] fs/fuse/dev.o
    fs/fuse/dev.c: In function ‘request_end’:
    fs/fuse/dev.c:289: error: ‘BLK_RW_SYNC’ undeclared (first use in this function)
    ...
    fs/nfs/write.c: In function ‘nfs_set_page_writeback’:
    fs/nfs/write.c:207: error: ‘BLK_RW_ASYNC’ undeclared (first use in this function)

    Signed-off-by: Larry Finger
    Signed-off-by: Linus Torvalds

    Larry Finger
     
  • Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
    the bdi congestion wait queue logic, causing us to wait on congestion
    for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
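
    The mix-up is easy to reproduce when two unrelated constants share the
    same small integer range. A toy illustration, with hypothetical TOY_
    names mirroring the values quoted above (BLK_RW_SYNC == 1 and READ == 0
    are filled in from the usual kernel definitions):

```c
#include <assert.h>

enum { TOY_RW_ASYNC = 0, TOY_RW_SYNC = 1 };  /* mirrors BLK_RW_ASYNC/SYNC */
enum { TOY_READ = 0, TOY_WRITE = 1 };        /* mirrors READ/WRITE */

struct toy_bdi {
    int waiters[2];     /* indexed by sync flag: [0] async, [1] sync */
};

/* Record a waiter on the congestion queue selected by the sync flag. */
void toy_congestion_wait(struct toy_bdi *bdi, int sync)
{
    assert(sync == TOY_RW_ASYNC || sync == TOY_RW_SYNC);
    bdi->waiters[sync]++;
}
```

    Passing TOY_WRITE compiles without complaint but lands on the sync
    queue, whereas the intended TOY_RW_ASYNC selects the async one.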

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Jul, 2009

1 commit


01 Jul, 2009

2 commits

  • Add notification messages that allow the filesystem to invalidate VFS
    caches.

    Two notifications are added:

    1) inode invalidation

    - invalidate cached attributes
    - invalidate a range of pages in the page cache (this is optional)

    2) dentry invalidation

    - try to invalidate a subtree in the dentry cache

    Care must be taken while accessing the 'struct super_block' for the
    mount, as it can go away while an invalidation is in progress. To
    prevent this, introduce a rw-semaphore, that is taken for read during
    the invalidation and taken for write in the ->kill_sb callback.

    Cc: Csaba Henk
    Cc: Anand Avati
    Signed-off-by: Miklos Szeredi

    John Muir
     
  • On 64 bit systems -- where sizeof(ssize_t) > sizeof(int) -- the following test
    exposes a bug due to a non-careful return of an int or unsigned value:

    implement a FUSE filesystem which sends an unsolicited notification to
    the kernel with invalid opcode. The respective write to /dev/fuse
    will return (1 << 32) - EINVAL with errno == 0 instead of -1 with
    errno == EINVAL.
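
    The effect can be reproduced in userspace. In this sketch the function
    names are hypothetical; the point is only that a negative error code
    passed through an unsigned int is zero-extended when widened to a
    64-bit ssize_t.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Buggy shape: the error passes through an unsigned int before widening. */
ssize_t toy_write_buggy(void)
{
    unsigned int err = (unsigned int)-EINVAL;

    return err;     /* zero-extended where ssize_t is wider than int */
}

/* Fixed shape: keep the error in a signed type all the way. */
ssize_t toy_write_fixed(void)
{
    int err = -EINVAL;

    return err;     /* sign-extended: still -EINVAL */
}
```

    Where ssize_t is 64 bits wide, toy_write_buggy() returns
    (1 << 32) - EINVAL, a large positive value that a caller would take
    for a successful write.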

    Signed-off-by: Miklos Szeredi
    CC: stable@kernel.org

    Csaba Henk
     

28 Apr, 2009

2 commits

  • Export the following symbols for CUSE.

    fuse_conn_put()
    fuse_conn_get()
    fuse_conn_kill()
    fuse_send_init()
    fuse_do_open()
    fuse_sync_release()
    fuse_direct_io()
    fuse_do_ioctl()
    fuse_file_poll()
    fuse_request_alloc()
    fuse_get_req()
    fuse_put_request()
    fuse_request_send()
    fuse_abort_conn()
    fuse_dev_release()
    fuse_dev_operations

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     
  • Update fuse_conn_init() such that it doesn't take @sb and move bdi
    registration into a separate function. Also separate out
    fuse_conn_kill() from fuse_put_super().

    These will be used to implement cuse.

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     

26 Jan, 2009

2 commits

  • Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
    This is not a big issue because fuse_notify_poll_wakeup() should be
    atomic, but it's cleaner this way, and later uses of notification will
    need to be able to finish the copying before performing some actions.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • If a fuse filesystem is unmounted but the device file descriptor
    remains open and a new mount reuses the old device number, then the
    mount fails with EEXIST and the following warning is printed in the
    kernel log:

    WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
    sysfs: duplicate filename '0:15' can not be created

    The cause is that the bdi belonging to the fuse filesystem was
    destroyed only after the device file was released. Fix this by
    calling bdi_destroy() from fuse_put_super() instead.

    Signed-off-by: Miklos Szeredi
    CC: stable@kernel.org

    Miklos Szeredi
     

07 Jan, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
    fuse: clean up annotations of fc->lock
    fuse: fix sparse warning in ioctl
    fuse: update interface version
    fuse: add fuse_conn->release()
    fuse: separate out fuse_conn_init() from new_conn()
    fuse: add fuse_ prefix to several functions
    fuse: implement poll support
    fuse: implement unsolicited notification
    fuse: add file kernel handle
    fuse: implement ioctl support
    fuse: don't let fuse_req->end() put the base reference
    fuse: move FUSE_MINOR to miscdevice.h
    fuse: style fixes

    Linus Torvalds
     

02 Dec, 2008

1 commit


26 Nov, 2008

2 commits

  • Add fuse_ prefix to request_send*() and get_root_inode() as some of
    those functions will be exported for CUSE. With or without CUSE
    export, having the function names scoped is a good idea for
    debuggability.

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo
     
  • Implement poll support. Polled files are indexed using kh in a RB
    tree rooted at fuse_conn->polled_files.

    Client should send FUSE_NOTIFY_POLL notification once after processing
    FUSE_POLL which has FUSE_POLL_SCHEDULE_NOTIFY set. Sending
    notification unconditionally after the latest poll or every time file
    content might have changed is inefficient but won't cause a malfunction.

    fuse_file_poll() can sleep and requires patches from the following
    thread which allows f_op->poll() to sleep.

    http://thread.gmane.org/gmane.linux.kernel/726176

    Signed-off-by: Tejun Heo
    Signed-off-by: Miklos Szeredi

    Tejun Heo