29 Jan, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
    assorted cleanups and fixes all over the place...

    There will be another pile later this week"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
    __dentry_path() fixes
    vfs: Remove second variable named error in __dentry_path
    vfs: Is mounted should be testing mnt_ns for NULL or error.
    Fix race when checking i_size on direct i/o read
    hfsplus: remove can_set_xattr
    nfsd: use get_acl and ->set_acl
    fs: remove generic_acl
    nfs: use generic posix ACL infrastructure for v3 Posix ACLs
    gfs2: use generic posix ACL infrastructure
    jfs: use generic posix ACL infrastructure
    xfs: use generic posix ACL infrastructure
    reiserfs: use generic posix ACL infrastructure
    ocfs2: use generic posix ACL infrastructure
    jffs2: use generic posix ACL infrastructure
    hfsplus: use generic posix ACL infrastructure
    f2fs: use generic posix ACL infrastructure
    ext2/3/4: use generic posix ACL infrastructure
    btrfs: use generic posix ACL infrastructure
    fs: make posix_acl_create more useful
    fs: make posix_acl_chmod more useful
    ...

    Linus Torvalds
     

26 Jan, 2014

1 commit

  • So far I've had one ACK for this, and no other comments. So I think it
    is probably time to send this via some suitable tree. I'm guessing that
    the vfs tree would be the most appropriate route, but not sure that
    there is one at the moment (don't see anything recent at kernel.org)
    so in that case I think -mm is the "back up plan". Al, please let me
    know if you will take this?

    Steve.

    ---------------------

    Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
    reads with append dio writes" thread on linux-fsdevel, this patch is my
    current version of the fix proposed as option (b) in that thread.

    Removing the i_size test from the direct i/o read path at vfs level
    means that filesystems now have to deal with requests which are beyond
    i_size themselves. These I've divided into three sets:

    a) Those with "no op" ->direct_IO (9p, cifs, ceph)
    These are obviously not going to be an issue

    b) Those with "home brew" ->direct_IO (nfs, fuse)
    I've been told that NFS should not have any problem with the larger
    i_size, however I've added an extra test to FUSE to duplicate the
    original behaviour just to be on the safe side.

    c) Those using __blockdev_direct_IO()
    These call through to ->get_block() which should deal with the EOF
    condition correctly. I've verified that with GFS2 and I believe that
    Zheng has verified it for ext4. I've also run the test on XFS and it
    passes both before and after this change.

    The part of the patch in filemap.c looks a lot larger than it really is
    - there are only two lines of real change. The rest is just indentation
    of the contained code.

    There remains a test of i_size though, which was added for btrfs. It
    doesn't cause the other filesystems a problem as the test is performed
    after ->direct_IO has been called. It is possible that there is a race
    that does matter to btrfs, however this patch doesn't change that, so
    it's still an overall improvement.

    Signed-off-by: Steven Whitehouse
    Reported-by: Zheng Liu
    Cc: Jan Kara
    Cc: Dave Chinner
    Acked-by: Miklos Szeredi
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Signed-off-by: Al Viro

    Steven Whitehouse
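
    As an illustration of what "dealing with requests beyond i_size" can
    mean for a filesystem in set (b), here is a minimal userspace sketch.
    The helper name dio_read_clamp() and its signature are hypothetical and
    are not taken from the patch; it only models the idea of returning
    nothing for a read that starts at or past EOF and cropping a read that
    straddles EOF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: clamp a direct read
 * request so that it never extends past i_size. Returns the number
 * of bytes the filesystem should actually try to read.
 */
static size_t dio_read_clamp(uint64_t offset, size_t count, uint64_t i_size)
{
    if (offset >= i_size)
        return 0;                           /* entirely beyond EOF */
    if (count > i_size - offset)
        count = (size_t)(i_size - offset);  /* crop the tail at EOF */
    assert(count <= i_size - offset);
    return count;
}
```

    With i_size == 10000, a 4096-byte read at offset 8192 is cropped to
    1808 bytes, and any read starting at offset 10000 or later yields 0.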
     

23 Jan, 2014

4 commits

  • open/release operations require userspace transitions to keep track
    of the open count and to perform any FS-specific setup. However,
    for some purely read-only FSs which don't need to perform any setup
    at open/release time, we can avoid the performance overhead of
    calling into userspace for open/release calls.

    This patch adds the necessary support to the fuse kernel module to prevent
    open/release operations from reaching userspace. When the client returns
    ENOSYS for an open, we avoid sending the subsequent release to userspace,
    and also remember this so that future opens don't trigger a userspace
    operation either.

    Signed-off-by: Miklos Szeredi

    Andrew Gallagher
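
    The behaviour described above can be modelled in userspace roughly as
    follows. This is only a sketch: struct conn, model_open(),
    model_release() and the handler type are hypothetical names invented for
    the illustration, not the kernel's data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a connection; only the fields needed here. */
struct conn {
    bool no_open;        /* daemon answered ENOSYS to an open request */
    int  requests_sent;  /* round trips to userspace so far */
};

/* Stand-in for the userspace daemon's open handler. */
typedef int (*open_handler)(void);

/*
 * Open a file. Sets *needs_release to true only when the daemon
 * actually handled the open, so that a later release must be sent too.
 */
static int model_open(struct conn *c, open_handler daemon, bool *needs_release)
{
    assert(c != NULL && daemon != NULL && needs_release != NULL);
    *needs_release = false;
    if (c->no_open)
        return 0;              /* remembered: skip the round trip */

    c->requests_sent++;
    int err = daemon();
    if (err == -ENOSYS) {
        c->no_open = true;     /* remember for all future opens */
        return 0;              /* the open itself still succeeds */
    }
    if (err == 0)
        *needs_release = true;
    return err;
}

/* Release sends a request only if the matching open reached the daemon. */
static void model_release(struct conn *c, bool needs_release)
{
    if (needs_release)
        c->requests_sent++;
}

/* Example daemon that does not implement open. */
static int daemon_without_open(void)
{
    return -ENOSYS;
}
```

    In this model the first open costs one round trip; the matching release
    and every later open/release pair cost none.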
     
  • Various read operations (e.g. readlink, readdir) invalidate the cached
    attrs for atime changes. This patch adds a new function
    'fuse_invalidate_atime', which checks for a read-only super block and
    avoids the attr invalidation in that case.

    Signed-off-by: Andrew Gallagher
    Signed-off-by: Miklos Szeredi

    Andrew Gallagher
     
  • As noticed by Coverity the "num != 0" condition never triggers. Instead it
    should check for a complete page.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Having this struct in module memory could Oops if the module is
    unloaded while the buffer still persists in a pipe.

    Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
    merge them into nosteal_pipe_buf_ops (this is the same as
    default_pipe_buf_ops except stealing the page from the buffer is not
    allowed).

    Reported-by: Al Viro
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Miklos Szeredi
     

13 Nov, 2013

1 commit

  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     

05 Nov, 2013

4 commits

  • All async fuse requests must be supplied with an extra reference to a
    fuse file. This is necessary to ensure that the fuse file is not released
    until all in-flight requests are completed. Fuse secondary writeback
    requests must obey this rule as well.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi

    Maxim Patlasov
     
  • The BDI_WRITTEN counter is used to estimate bdi bandwidth. It must be
    incremented every time the bdi ends page writeback, no matter whether it
    was fulfilled by an actual write or by discarding the request (e.g. due to
    shrunk i_size).

    Note that even before writepages patches, the case "Got truncated off
    completely" was handled in fuse_send_writepage() by calling
    fuse_writepage_finish() which updated BDI_WRITTEN unconditionally.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi

    Maxim Patlasov
     
  • If writeback happens while fuse is in FUSE_NOWRITE condition, the request
    will be queued but not processed immediately (see fuse_flush_writepages()).
    Until FUSE_NOWRITE becomes relaxed, more writebacks can happen. They will
    be queued as "secondary" requests to that first ("primary") request.

    The existing implementation crops only the primary request. This is not
    correct because a subsequent extending write(2) may increase i_size and
    then secondary requests won't be cropped properly. The result would be
    stale data written to the server at a file offset where zeros must be.

    A similar problem may happen if secondary requests are attached to an
    in-flight request that was already cropped.

    The patch solves the issue by cropping all secondary requests in
    fuse_writepage_end(). Thanks to Miklos for the idea.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi

    Maxim Patlasov
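
    The idea of cropping every queued request, not just the primary one, can
    be sketched in userspace like this. struct wreq, crop_one() and
    crop_chain() are hypothetical names for the illustration; the real fix
    lives in fuse_writepage_end() and operates on fuse's own request
    structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical write request: a byte range plus a chain of secondaries. */
struct wreq {
    uint64_t     offset;
    size_t       size;
    struct wreq *next;   /* next secondary request, or NULL */
};

/* Crop one request so that it does not extend past i_size. */
static void crop_one(struct wreq *r, uint64_t i_size)
{
    assert(r != NULL);
    if (r->offset >= i_size)
        r->size = 0;                             /* truncated off completely */
    else if (r->size > i_size - r->offset)
        r->size = (size_t)(i_size - r->offset);  /* keep only the part before EOF */
}

/* Crop the primary request and every secondary chained to it. */
static void crop_chain(struct wreq *primary, uint64_t i_size)
{
    for (struct wreq *r = primary; r != NULL; r = r->next)
        crop_one(r, i_size);
}
```

    Cropping only the head of the chain would leave the secondaries with
    their original sizes, which is the stale-data case described above.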
     
  • fuse_writepage_in_flight() returns false if it fails to find a request
    with the given index in fi->writepages. Then the caller proceeds with
    populating data->orig_pages[] and incrementing req->num_pages. Hence,
    fuse_writepage_in_flight() must revert any changes it made to the request
    before returning false.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi

    Maxim Patlasov
     

25 Oct, 2013

2 commits


01 Oct, 2013

13 commits

  • This allows udev (or more recently systemd-tmpfiles) to create /dev/cuse on
    boot, in the same way as /dev/fuse is currently created, and the corresponding
    module to be loaded on first access.

    The corresponding functionality was introduced for fuse in commit 578454f.

    Signed-off-by: Tom Gundersen
    Cc: Kay Sievers
    Signed-off-by: Miklos Szeredi

    Tom Gundersen
     
  • If ->writepage() tries to write back a page whose copy is still in flight,
    then just skip by calling redirty_page_for_writepage().

    This is OK, since now ->writepage() should never be called for data
    integrity sync.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • As Maxim Patlasov pointed out, it's possible to get a dirty page while its
    copy is still under writeback, despite fuse_page_mkwrite() doing its thing
    (direct IO).

    This could result in two concurrent write requests for the same offset,
    with data corruption if they get mixed up.

    To prevent this, fuse needs to check and delay such writes. This
    implementation does this by:

    1. check if page is still under writeout, if so create a new, single page
    secondary request for it

    2. chain this secondary request onto the in-flight request

    2/a. if a secondary request for the same offset was already chained to the
    in-flight request, then just copy the contents of the page and discard
    the new secondary request. This makes sure that each page will
    have at most two requests associated with it

    3. when the in-flight request finishes, send off all secondary requests
    chained onto it

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
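
    Steps 2, 2/a and 3 above can be sketched as a small userspace model.
    struct page_req, chain_secondary() and take_secondaries() are
    hypothetical names, and the 'data' field merely stands in for the page
    contents; the kernel code works on fuse requests and real pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-page request; 'data' stands in for page contents. */
struct page_req {
    unsigned long    index;      /* page offset in the file */
    int              data;
    struct page_req *secondary;  /* chain of delayed writes */
};

/*
 * Called when page 'fresh->index' is written back while 'inflight' is
 * still being sent. Returns true if 'fresh' was chained, false if its
 * data was merged into an existing secondary and 'fresh' can be dropped.
 */
static bool chain_secondary(struct page_req *inflight, struct page_req *fresh)
{
    assert(inflight != NULL && fresh != NULL);
    for (struct page_req *s = inflight->secondary; s; s = s->secondary) {
        if (s->index == fresh->index) {
            s->data = fresh->data;  /* step 2/a: copy contents, drop fresh */
            return false;
        }
    }
    fresh->secondary = inflight->secondary;  /* step 2: chain it */
    inflight->secondary = fresh;
    return true;
}

/* Step 3: detach the chain when the in-flight request finishes. */
static struct page_req *take_secondaries(struct page_req *inflight)
{
    struct page_req *chain = inflight->secondary;
    inflight->secondary = NULL;
    return chain;
}
```

    Because a second write to the same page only overwrites the data of the
    already chained secondary, no page ever has more than the in-flight
    request plus one secondary request in this model.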
     
  • Checking against tmp-page indexes is not very useful, and results in one
    (or rarely two) page requests. Which is not much of an improvement...

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • The patch fixes a race between ftruncate(2), mmap-ed write and write(2):

    1) A user makes a page dirty via mmap-ed write.
    2) The user performs shrinking truncate(2) intended to purge the page.
    3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
    writeback. fuse_writepages_fill attaches a new page to FUSE_WRITE request,
    then releases the original page by end_page_writeback and unlocks it.
    4) fuse_do_setattr completes and successfully returns. From this point on,
    i_mutex is free.
    5) Ordinary write(2) extends i_size back to cover the page. Note that
    fuse_send_write_pages does wait for fuse writeback, but for another
    page->index.
    6) fuse_writepages_fill attaches more pages to the request (if any), then
    fuse_writepages_send is eventually called. It is supposed to crop
    inarg->size of the request, but it doesn't because i_size has already been
    extended back.

    Moving end_page_writeback behind fuse_writepages_send guarantees that
    __fuse_release_nowrite (called from fuse_do_setattr) will crop inarg->size
    of the request before write(2) gets the chance to extend i_size.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi

    Maxim Patlasov
     
  • The .writepages callback is required to make each writeback request carry
    more than one page. The patch enables optimized behaviour unconditionally,
    i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.

    [SzM: simplify, add comments]

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi

    Pavel Emelyanov
     
  • Don't bug if there are no writable files found for page writeback. If ever
    this is triggered, a WARN_ON helps debugging it much better than a BUG_ON.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Lock the page in fuse_page_mkwrite() to protect against a race with
    fuse_writepage() where the page is redirtied before the actual writeback
    begins.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • The .writepages callback will issue writeback requests with more than one
    page aboard. Make existing end/check code be aware of this.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi

    Pavel Emelyanov
     
  • There will be a .writepageS callback implementation which will need to
    get a fuse_file out of a fuse_inode, thus make a helper for this.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Miklos Szeredi

    Pavel Emelyanov
     
  • fuse_access() is never called in RCU walk, only on the final component of
    access(2) and chdir(2)...

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Doing dput(parent) is not valid in RCU walk mode. In RCU mode it would
    probably be okay to update the parent flags, but it's actually not
    necessary most of the time...

    So only set the FUSE_I_ADVISE_RDPLUS flag on the parent when the entry was
    recently initialized by READDIRPLUS.

    This is achieved by setting FUSE_I_INIT_RDPLUS on entries added by
    READDIRPLUS and only dropping out of RCU mode if this flag is set.
    FUSE_I_INIT_RDPLUS is cleared once the FUSE_I_ADVISE_RDPLUS flag is set in
    the parent.

    Reported-by: Al Viro
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Miklos Szeredi
     
  • If revalidate finds an invalid dentry in RCU walk mode, let the VFS deal
    with it instead of calling check_submounts_and_drop() which is not prepared
    for being called from RCU walk.

    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Miklos Szeredi
     

18 Sep, 2013

2 commits

  • An earlier patch introducing the FUSE_I_SIZE_UNSTABLE flag provided a
    detailed description of races between ftruncate and anyone who can extend
    i_size:

    > 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size
    > changed (due to our own fuse_do_setattr()) and is going to call
    > truncate_pagecache() for some 'new_size' it believes valid right now. But by
    > the time that particular truncate_pagecache() is called ...
    > 2. fuse_do_setattr() returns (either having called truncate_pagecache() or
    > not -- it doesn't matter).
    > 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
    > 4. mmap-ed write makes a page in the extended region dirty.

    This patch adds the necessary bits to fuse_file_fallocate() to protect
    against that race.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Maxim Patlasov
     
  • The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):

    1) A user makes a page dirty via mmap-ed write.
    2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
    and covering the page.
    3) Before truncate_pagecache_range call from fuse_file_fallocate,
    the page goes to write-back. The page is fully processed by fuse_writepage
    (including end_page_writeback on the page), but fuse_flush_writepages did
    nothing because fi->writectr < 0.
    4) truncate_pagecache_range is called and fuse_file_fallocate finishes
    by calling fuse_release_nowrite. The latter triggers processing of the
    queued write-back request, which will soon write stale data to the hole.

    Changed in v2 (thanks to Brian for suggestion):
    - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
    we can end up returning -ENOTSUPP while user data is already punched
    from page cache. Use filemap_write_and_wait_range() instead.
    Changed in v3 (thanks to Miklos for suggestion):
    - fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite()
    instead. Since we need a dirty-page barrier only, fuse_sync_writes()
    should be enough.
    - rebased to for-linus branch of fuse.git

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Maxim Patlasov
     

13 Sep, 2013

1 commit


12 Sep, 2013

1 commit

  • The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
    unprivileged users) from growing a large number of dirty pages before
    throttling. For such filesystems balance_dirty_pages always checks bdi
    counters against bdi limits. I.e. even if global "nr_dirty" is under
    "freerun", it's not allowed to skip bdi checks. The only use case for now
    is fuse: it sets bdi max_ratio to 1% by default and system administrators
    are supposed to expect that this limit won't be exceeded.

    The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
    filesystem may set the flag when it initializes its BDI.

    The problematic scenario comes from the fact that nobody pays attention to
    the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
    writeback). The implementation of fuse writeback releases the original
    page (by calling end_page_writeback) almost immediately. A fuse request
    queued for real processing bears a copy of the original page. Hence, if
    the userspace fuse daemon doesn't finalize write requests in a timely
    manner, an aggressive mmap writer can pollute virtually all memory with
    those temporary fuse page copies. They are carefully accounted in
    NR_WRITEBACK_TEMP, but nobody cares.

    To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
    problem" as a shortcut for "a possibility of uncontrolled growth of the
    amount of RAM consumed by temporary pages allocated by kernel fuse to
    process writeback".

    The problem was very easy to reproduce. There is a trivial example
    filesystem implementation in the fuse userspace distribution:
    fusexmp_fh.c. I added "sleep(1);" to the write methods, then recompiled
    and mounted it. Then I created a huge file on the mount point and ran a
    simple program which mmap-ed the file to a memory region, then wrote data
    to the region. An hour later I observed almost all RAM consumed by fuse
    writeback. Since then some unrelated changes in kernel fuse made it more
    difficult to reproduce, but it is still possible now.

    Putting this theoretical happens-in-the-lab thing aside, there is another
    thing that really hurts real world (FUSE) users. This is the write-through
    page cache policy FUSE currently uses. I.e. when handling write(2), kernel
    fuse populates the page cache and flushes user data to the server
    synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
    ("writeback cache policy") solve the problem, but they also make resolving
    the NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply
    copying a huge file to a fuse mount would result in memory starvation.
    Miklos, the maintainer of FUSE, believes the strictlimit feature is the
    way to go.

    And finally, putting FUSE topics aside, there is one more use case for
    the strictlimit feature. Using a slow USB stick (mass storage) in a
    machine with a huge amount of RAM installed is a well-known pain. Let's
    make a simple computation. Assuming 64GB of RAM installed, the existing
    implementation of balance_dirty_pages will start throttling only after
    9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the command
    "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a
    subsequent "umount /media/my-usb-storage/" will take more than two hours
    if the effective throughput of the storage is, say, 1MB/sec.

    After inclusion of the strictlimit feature, it will be trivial to add a
    knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
    demand, manually or via a udev rule. Maybe I'm wrong, but it seems to be
    quite a natural desire to limit the amount of dirty memory for some
    devices we do not fully trust (in the sense of sustainable throughput).

    [akpm@linux-foundation.org: fix warning in page-writeback.c]
    Signed-off-by: Maxim Patlasov
    Cc: Jan Kara
    Cc: Miklos Szeredi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
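
    The USB stick computation in the message can be checked with a few lines
    of arithmetic. The helper names freerun_mb() and flush_seconds() are
    made up for this sketch, sizes are in MB with 1GB taken as 1024MB, and
    the 15% freerun figure is the one quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Dirty threshold in MB before throttling starts, for a freerun percentage. */
static uint64_t freerun_mb(uint64_t ram_mb, unsigned percent)
{
    assert(percent <= 100);
    return ram_mb * percent / 100;
}

/* Seconds needed to flush 'dirty_mb' of data at 'mb_per_sec'. */
static uint64_t flush_seconds(uint64_t dirty_mb, uint64_t mb_per_sec)
{
    assert(mb_per_sec > 0);
    return dirty_mb / mb_per_sec;
}
```

    freerun_mb(64 * 1024, 15) gives 9830MB (about 9.6GB), so a 9GB copy
    (9216MB) stays below the threshold and is not throttled, while
    flush_seconds(9 * 1024, 1) gives 9216 seconds, roughly 2.5 hours, for the
    later flush at 1MB/sec.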
     

10 Sep, 2013

1 commit


06 Sep, 2013

3 commits

  • Drop a subtree when we find that it has moved or been deleted. This can be
    done as long as there are no submounts under this location.

    If the directory was moved and we come across the same directory in a
    future lookup it will be reconnected by d_materialise_unique().

    Signed-off-by: Anand Avati
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Anand Avati
     
  • On errors unrelated to the filesystem's state (ENOMEM, ENOTCONN) return the
    error itself from ->d_revalidate() instead of returning zero (invalid).

    Also make a common label for invalidating the dentry. This will be used by
    the next patch.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Use d_materialise_unique() instead of d_splice_alias(). This allows dentry
    subtrees to be moved to a new place if they moved, even if something is
    referencing a dentry in the subtree (open fd, cwd, etc.).

    This will also allow us to drop a subtree if it is found to be replaced by
    something else. In this case the disconnected subtree can later be
    reconnected to its new location.

    d_materialise_unique() ensures that a directory entry only ever has one
    alias. We keep fc->inst_mutex around the calls for d_materialise_unique()
    on directories to prevent a race with mkdir "stealing" the inode.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

04 Sep, 2013

1 commit


03 Sep, 2013

4 commits

  • Userspace can add names containing a slash character to the directory
    listing. Don't allow this as it could cause all sorts of trouble.

    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Miklos Szeredi
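
    A check of this kind can be sketched in userspace as below. The function
    name dirent_name_ok() is hypothetical, and rejecting empty names is an
    extra assumption of the sketch rather than something stated in the
    message. Names are treated as length-delimited byte strings, so the scan
    uses memchr() instead of strchr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical validator for a directory entry name supplied by the
 * userspace filesystem. The name is not assumed to be NUL-terminated.
 */
static bool dirent_name_ok(const char *name, size_t namelen)
{
    assert(name != NULL);
    if (namelen == 0)
        return false;                           /* assumption: empty names are invalid */
    return memchr(name, '/', namelen) == NULL;  /* no slash within namelen bytes */
}
```

    A caller would fail the whole directory read with an I/O error when the
    validator returns false, instead of passing the name on to the VFS.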
     
  • The way fuse calls truncate_pagecache() from fuse_change_attributes()
    is completely wrong, because without i_mutex held we are never sure whether
    'oldsize' and 'attr->size' are valid by the time of execution of
    truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
    release fc->lock in the middle of fuse_change_attributes(), we completely
    lose control of actions which may happen with the given inode until we
    reach truncate_pagecache. The list of potentially dangerous actions
    includes mmap-ed reads and writes, ftruncate(2) and write(2) extending
    file size.

    The typical outcome of doing truncate_pagecache() with outdated arguments
    is data corruption from the user's point of view. This is (in some sense)
    acceptable in cases when the issue is triggered by a change of the file on
    the server (i.e. externally wrt fuse operation), but it is absolutely
    intolerable in scenarios when a single fuse client modifies a file without
    any external intervention. A real life case I discovered with fsx-linux
    looked like this:

    1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
    FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
    2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
    to the server synchronously, then calls fuse_change_attributes(). The
    latter updates i_size, releases fc->lock, but before comparing oldsize vs
    attr->size..
    3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
    updating attributes and i_size, but now oldsize is equal to
    outarg.attr.size because i_size has just been updated (step 2). Hence,
    fuse_do_setattr() returns w/o calling truncate_pagecache().
    4. As soon as ftruncate(2) completes, the user extends file size by
    write(2) making a hole in the middle of file, then reads data from the hole
    either by read(2) or mmap-ed read. The user expects to get zero data from
    the hole, but gets stale data because truncate_pagecache() is not executed
    yet.

    The scenario above illustrates one side of the problem: not truncating the
    page cache even though we should. Another side corresponds to truncating
    page cache too late, when the state of inode changed significantly.
    Theoretically, the following is possible:

    1. As in the previous scenario fuse_dentry_revalidate() discovered that
    i_size changed (due to our own fuse_do_setattr()) and is going to call
    truncate_pagecache() for some 'new_size' it believes valid right now. But
    by the time that particular truncate_pagecache() is called ...
    2. fuse_do_setattr() returns (either having called truncate_pagecache() or
    not -- it doesn't matter).
    3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
    4. mmap-ed write makes a page in the extended region dirty.

    The result will be the loss of the data the user wrote in the fourth step.

    The patch is a hotfix resolving the issue in a simplistic way: let's skip
    dangerous i_size update and truncate_pagecache if an operation changing
    file size is in progress. This simplistic approach looks correct for the
    cases w/o external changes. And to handle them properly, more sophisticated
    and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
    postpone it until the issue is well discussed on the mailing list(s).

    Changed in v2:
    - improved patch description to cover both sides of the issue.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Maxim Patlasov
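
    The "skip the update while a size-changing operation is in progress"
    rule can be modelled in userspace as follows. struct model_inode and
    model_change_attributes() are hypothetical; the size_unstable field
    stands in for the FUSE_I_SIZE_UNSTABLE flag mentioned elsewhere in this
    log, and the truncations counter only records where a
    truncate_pagecache() call would happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inode model; only what the sketch needs. */
struct model_inode {
    uint64_t i_size;
    bool     size_unstable;  /* a size-changing operation is in progress */
    int      truncations;    /* modelled truncate_pagecache() calls */
};

/* Apply a size reported by the server, unless the size is unstable. */
static void model_change_attributes(struct model_inode *in, uint64_t new_size)
{
    assert(in != NULL);
    if (in->size_unstable)
        return;              /* skip both the i_size update and the truncation */

    uint64_t oldsize = in->i_size;
    in->i_size = new_size;
    if (new_size < oldsize)
        in->truncations++;   /* simplification: count shrinking updates only */
}
```

    While size_unstable is set, a stale size from a concurrent lookup leaves
    both i_size and the page cache alone; the operation that owns the size
    change is then responsible for updating them itself.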
     
  • Calls like setxattr and removexattr result in an update of ctime.
    Therefore invalidate inode attributes to force a refresh.

    Signed-off-by: Anand Avati
    Reviewed-by: Brian Foster
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Anand Avati
     
  • The patch fixes a race between ftruncate(2), mmap-ed write and write(2):

    1) A user makes a page dirty via mmap-ed write.
    2) The user performs shrinking truncate(2) intended to purge the page.
    3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
    writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
    the original page by end_page_writeback.
    4) fuse_do_setattr() completes and successfully returns. From this point
    on, i_mutex is free.
    5) Ordinary write(2) extends i_size back to cover the page. Note that
    fuse_send_write_pages does wait for fuse writeback, but for another
    page->index.
    6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
    fuse_send_writepage is supposed to crop inarg->size of the request,
    but it doesn't because i_size has already been extended back.

    Moving end_page_writeback to the end of fuse_writepage_locked fixes the
    race because now the fact that truncate_pagecache has successfully returned
    implies that fuse_writepage_locked has already called end_page_writeback.
    And this, in turn, implies that fuse_flush_writepages has already called
    fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
    could not extend it because of i_mutex held by ftruncate(2).

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Miklos Szeredi
    Cc: stable@vger.kernel.org

    Maxim Patlasov
     

30 Jul, 2013

1 commit