Commit 3be5a52b30aa5cf9d795b7634f728f612197b1c4

Authored by Miklos Szeredi
Committed by Linus Torvalds
1 parent b88473f73e

fuse: support writable mmap

Quoting Linus (3 years ago, FUSE inclusion discussions):

  "User-space filesystems are hard to get right. I'd claim that they
   are almost impossible, unless you limit them somehow (shared
   writable mappings are the nastiest part - if you don't have those,
   you can reasonably limit your problems by limiting the number of
   dirty pages you accept through normal "write()" calls)."

Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others).  This nicely solved the biggest problem: limiting the number of pages
used for write caching.

Some small details remained, however, which this largish patch attempts to
address.  It provides a page writeback implementation for fuse, which is
completely safe against VM-related deadlocks.  Performance may not be very
good for certain usage patterns, but generally it should be acceptable.

It has been tested extensively with fsx-linux and bash-shared-mapping.

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.

The writeback is finished instantly from the MM's point of view: the page's
dirty and writeback tags are cleared in the radix tree, and the PageDirty and
PageWriteback flags are cleared.

For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented.  The per-bdi writeback count is not decremented until the actual
write completes.

On dirtying the page, fuse waits for any previous write to the same page to
finish before proceeding.  This ensures that at most one temporary page is in
use at a time for any given cached page.
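
In code, the scheme boils down to roughly the following (a condensed sketch of
fuse_writepage_locked() from the patch below; error handling, locking and the
fuse_write_fill() setup are trimmed):

  static int fuse_writepage_sketch(struct page *page)
  {
          struct inode *inode = page->mapping->host;
          struct fuse_inode *fi = get_fuse_inode(inode);
          struct fuse_req *req = fuse_request_alloc_nofs();
          struct page *tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);

          set_page_writeback(page);
          copy_highpage(tmp_page, page);          /* snapshot the data */
          req->num_pages = 1;
          req->pages[0] = tmp_page;
          req->end = fuse_writepage_end;          /* final accounting */

          /* account the temp page, then tell the MM the page is clean */
          inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
          end_page_writeback(page);

          /* queue the WRITE; it is sent once fi->writectr >= 0 */
          list_add(&req->writepages_entry, &fi->writepages);
          list_add_tail(&req->list, &fi->queued_writes);
          fuse_flush_writepages(inode);
          return 0;
  }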

This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?

The basic problem is that there is no guarantee about how long the userspace
filesystem will take to complete a write.  It may be buggy or even
malicious, and fail to complete WRITE requests.  We don't want unrelated parts
of the system to grind to a halt in such cases.

Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request.  There's a real danger of deadlock if that
allocation ends up waiting for the writepage to finish.

Currently there are several cases where the kernel can block on page
writeback:

  - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
  - page migration
  - throttle_vm_writeout (through NR_WRITEBACK)
  - sync(2)

Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM no longer
tracks writeback pages for us.
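
The tracking is a per-inode list of in-flight writepage requests plus a
waitqueue; readpage, page_mkwrite, launder_page and fsync all wait on it.  A
trimmed sketch of the new helpers (the real versions in the patch below take
fc->lock around the list walk):

  static bool page_is_writeback_sketch(struct inode *inode, pgoff_t index)
  {
          struct fuse_inode *fi = get_fuse_inode(inode);
          struct fuse_req *req;

          /* walk the in-flight writepage requests for this inode */
          list_for_each_entry(req, &fi->writepages, writepages_entry)
                  if (req->misc.write.in.offset >> PAGE_CACHE_SHIFT == index)
                          return true;
          return false;
  }

  static void wait_on_writeback_sketch(struct inode *inode, pgoff_t index)
  {
          struct fuse_inode *fi = get_fuse_inode(inode);

          /* woken from fuse_writepage_finish() when a request completes */
          wait_event(fi->page_waitq, !page_is_writeback_sketch(inode, index));
  }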

As an extra safety measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default.  This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.

With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.
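For example, writing 5 to this file raises that filesystem's share to 5% of
the dirty threshold.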

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Showing 5 changed files with 481 additions and 29 deletions

... ... @@ -47,6 +47,14 @@
47 47 return req;
48 48 }
49 49  
  50 +struct fuse_req *fuse_request_alloc_nofs(void)
  51 +{
  52 + struct fuse_req *req = kmem_cache_alloc(fuse_req_cachep, GFP_NOFS);
  53 + if (req)
  54 + fuse_request_init(req);
  55 + return req;
  56 +}
  57 +
50 58 void fuse_request_free(struct fuse_req *req)
51 59 {
52 60 kmem_cache_free(fuse_req_cachep, req);
... ... @@ -427,6 +435,17 @@
427 435 {
428 436 req->isreply = 1;
429 437 request_send_nowait(fc, req);
  438 +}
  439 +
  440 +/*
  441 + * Called under fc->lock
  442 + *
  443 + * fc->connected must have been checked previously
  444 + */
  445 +void request_send_background_locked(struct fuse_conn *fc, struct fuse_req *req)
  446 +{
  447 + req->isreply = 1;
  448 + request_send_nowait_locked(fc, req);
430 449 }
431 450  
432 451 /*
... ... @@ -1107,6 +1107,50 @@
1107 1107 }
1108 1108  
1109 1109 /*
  1110 + * Prevent concurrent writepages on inode
  1111 + *
  1112 + * This is done by adding a negative bias to the inode write counter
  1113 + * and waiting for all pending writes to finish.
  1114 + */
  1115 +void fuse_set_nowrite(struct inode *inode)
  1116 +{
  1117 + struct fuse_conn *fc = get_fuse_conn(inode);
  1118 + struct fuse_inode *fi = get_fuse_inode(inode);
  1119 +
  1120 + BUG_ON(!mutex_is_locked(&inode->i_mutex));
  1121 +
  1122 + spin_lock(&fc->lock);
  1123 + BUG_ON(fi->writectr < 0);
  1124 + fi->writectr += FUSE_NOWRITE;
  1125 + spin_unlock(&fc->lock);
  1126 + wait_event(fi->page_waitq, fi->writectr == FUSE_NOWRITE);
  1127 +}
  1128 +
  1129 +/*
  1130 + * Allow writepages on inode
  1131 + *
  1132 + * Remove the bias from the writecounter and send any queued
  1133 + * writepages.
  1134 + */
  1135 +static void __fuse_release_nowrite(struct inode *inode)
  1136 +{
  1137 + struct fuse_inode *fi = get_fuse_inode(inode);
  1138 +
  1139 + BUG_ON(fi->writectr != FUSE_NOWRITE);
  1140 + fi->writectr = 0;
  1141 + fuse_flush_writepages(inode);
  1142 +}
  1143 +
  1144 +void fuse_release_nowrite(struct inode *inode)
  1145 +{
  1146 + struct fuse_conn *fc = get_fuse_conn(inode);
  1147 +
  1148 + spin_lock(&fc->lock);
  1149 + __fuse_release_nowrite(inode);
  1150 + spin_unlock(&fc->lock);
  1151 +}
  1152 +
  1153 +/*
1110 1154 * Set attributes, and at the same time refresh them.
1111 1155 *
1112 1156 * Truncation is slightly complicated, because the 'truncate' request
... ... @@ -1122,6 +1166,8 @@
1122 1166 struct fuse_req *req;
1123 1167 struct fuse_setattr_in inarg;
1124 1168 struct fuse_attr_out outarg;
  1169 + bool is_truncate = false;
  1170 + loff_t oldsize;
1125 1171 int err;
1126 1172  
1127 1173 if (!fuse_allow_task(fc, current))
... ... @@ -1145,12 +1191,16 @@
1145 1191 send_sig(SIGXFSZ, current, 0);
1146 1192 return -EFBIG;
1147 1193 }
  1194 + is_truncate = true;
1148 1195 }
1149 1196  
1150 1197 req = fuse_get_req(fc);
1151 1198 if (IS_ERR(req))
1152 1199 return PTR_ERR(req);
1153 1200  
  1201 + if (is_truncate)
  1202 + fuse_set_nowrite(inode);
  1203 +
1154 1204 memset(&inarg, 0, sizeof(inarg));
1155 1205 memset(&outarg, 0, sizeof(outarg));
1156 1206 iattr_to_fattr(attr, &inarg);
... ... @@ -1181,16 +1231,44 @@
1181 1231 if (err) {
1182 1232 if (err == -EINTR)
1183 1233 fuse_invalidate_attr(inode);
1184   - return err;
  1234 + goto error;
1185 1235 }
1186 1236  
1187 1237 if ((inode->i_mode ^ outarg.attr.mode) & S_IFMT) {
1188 1238 make_bad_inode(inode);
1189   - return -EIO;
  1239 + err = -EIO;
  1240 + goto error;
1190 1241 }
1191 1242  
1192   - fuse_change_attributes(inode, &outarg.attr, attr_timeout(&outarg), 0);
  1243 + spin_lock(&fc->lock);
  1244 + fuse_change_attributes_common(inode, &outarg.attr,
  1245 + attr_timeout(&outarg));
  1246 + oldsize = inode->i_size;
  1247 + i_size_write(inode, outarg.attr.size);
  1248 +
  1249 + if (is_truncate) {
  1250 + /* NOTE: this may release/reacquire fc->lock */
  1251 + __fuse_release_nowrite(inode);
  1252 + }
  1253 + spin_unlock(&fc->lock);
  1254 +
  1255 + /*
  1256 + * Only call invalidate_inode_pages2() after removing
  1257 + * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
  1258 + */
  1259 + if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
  1260 + if (outarg.attr.size < oldsize)
  1261 + fuse_truncate(inode->i_mapping, outarg.attr.size);
  1262 + invalidate_inode_pages2(inode->i_mapping);
  1263 + }
  1264 +
1193 1265 return 0;
  1266 +
  1267 +error:
  1268 + if (is_truncate)
  1269 + fuse_release_nowrite(inode);
  1270 +
  1271 + return err;
1194 1272 }
1195 1273  
1196 1274 static int fuse_setattr(struct dentry *entry, struct iattr *attr)
... ... @@ -210,6 +210,49 @@
210 210 return (u64) v0 + ((u64) v1 << 32);
211 211 }
212 212  
  213 +/*
  214 + * Check if page is under writeback
  215 + *
  216 + * This is currently done by walking the list of writepage requests
  217 + * for the inode, which can be pretty inefficient.
  218 + */
  219 +static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
  220 +{
  221 + struct fuse_conn *fc = get_fuse_conn(inode);
  222 + struct fuse_inode *fi = get_fuse_inode(inode);
  223 + struct fuse_req *req;
  224 + bool found = false;
  225 +
  226 + spin_lock(&fc->lock);
  227 + list_for_each_entry(req, &fi->writepages, writepages_entry) {
  228 + pgoff_t curr_index;
  229 +
  230 + BUG_ON(req->inode != inode);
  231 + curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
  232 + if (curr_index == index) {
  233 + found = true;
  234 + break;
  235 + }
  236 + }
  237 + spin_unlock(&fc->lock);
  238 +
  239 + return found;
  240 +}
  241 +
  242 +/*
  243 + * Wait for page writeback to be completed.
  244 + *
  245 + * Since fuse doesn't rely on the VM writeback tracking, this has to
  246 + * use some other means.
  247 + */
  248 +static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
  249 +{
  250 + struct fuse_inode *fi = get_fuse_inode(inode);
  251 +
  252 + wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
  253 + return 0;
  254 +}
  255 +
213 256 static int fuse_flush(struct file *file, fl_owner_t id)
214 257 {
215 258 struct inode *inode = file->f_path.dentry->d_inode;
... ... @@ -245,6 +288,21 @@
245 288 return err;
246 289 }
247 290  
  291 +/*
  292 + * Wait for all pending writepages on the inode to finish.
  293 + *
  294 + * This is currently done by blocking further writes with FUSE_NOWRITE
  295 + * and waiting for all sent writes to complete.
  296 + *
  297 + * This must be called under i_mutex, otherwise the FUSE_NOWRITE usage
  298 + * could conflict with truncation.
  299 + */
  300 +static void fuse_sync_writes(struct inode *inode)
  301 +{
  302 + fuse_set_nowrite(inode);
  303 + fuse_release_nowrite(inode);
  304 +}
  305 +
248 306 int fuse_fsync_common(struct file *file, struct dentry *de, int datasync,
249 307 int isdir)
250 308 {
... ... @@ -261,6 +319,17 @@
261 319 if ((!isdir && fc->no_fsync) || (isdir && fc->no_fsyncdir))
262 320 return 0;
263 321  
  322 + /*
  323 + * Start writeback against all dirty pages of the inode, then
  324 + * wait for all outstanding writes, before sending the FSYNC
  325 + * request.
  326 + */
  327 + err = write_inode_now(inode, 0);
  328 + if (err)
  329 + return err;
  330 +
  331 + fuse_sync_writes(inode);
  332 +
264 333 req = fuse_get_req(fc);
265 334 if (IS_ERR(req))
266 335 return PTR_ERR(req);
... ... @@ -340,6 +409,13 @@
340 409 if (is_bad_inode(inode))
341 410 goto out;
342 411  
  412 + /*
  413 + * Page writeback can extend beyond the lifetime of the
  414 + * page-cache page, so make sure we read a properly synced
  415 + * page.
  416 + */
  417 + fuse_wait_on_page_writeback(inode, page->index);
  418 +
343 419 req = fuse_get_req(fc);
344 420 err = PTR_ERR(req);
345 421 if (IS_ERR(req))
... ... @@ -411,6 +487,8 @@
411 487 struct inode *inode = data->inode;
412 488 struct fuse_conn *fc = get_fuse_conn(inode);
413 489  
  490 + fuse_wait_on_page_writeback(inode, page->index);
  491 +
414 492 if (req->num_pages &&
415 493 (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
416 494 (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_read ||
... ... @@ -477,11 +555,10 @@
477 555 }
478 556  
479 557 static void fuse_write_fill(struct fuse_req *req, struct file *file,
480   - struct inode *inode, loff_t pos, size_t count,
481   - int writepage)
  558 + struct fuse_file *ff, struct inode *inode,
  559 + loff_t pos, size_t count, int writepage)
482 560 {
483 561 struct fuse_conn *fc = get_fuse_conn(inode);
484   - struct fuse_file *ff = file->private_data;
485 562 struct fuse_write_in *inarg = &req->misc.write.in;
486 563 struct fuse_write_out *outarg = &req->misc.write.out;
487 564  
... ... @@ -490,7 +567,7 @@
490 567 inarg->offset = pos;
491 568 inarg->size = count;
492 569 inarg->write_flags = writepage ? FUSE_WRITE_CACHE : 0;
493   - inarg->flags = file->f_flags;
  570 + inarg->flags = file ? file->f_flags : 0;
494 571 req->in.h.opcode = FUSE_WRITE;
495 572 req->in.h.nodeid = get_node_id(inode);
496 573 req->in.argpages = 1;
... ... @@ -511,7 +588,7 @@
511 588 fl_owner_t owner)
512 589 {
513 590 struct fuse_conn *fc = get_fuse_conn(inode);
514   - fuse_write_fill(req, file, inode, pos, count, 0);
  591 + fuse_write_fill(req, file, file->private_data, inode, pos, count, 0);
515 592 if (owner != NULL) {
516 593 struct fuse_write_in *inarg = &req->misc.write.in;
517 594 inarg->write_flags |= FUSE_WRITE_LOCKOWNER;
... ... @@ -546,6 +623,12 @@
546 623 if (is_bad_inode(inode))
547 624 return -EIO;
548 625  
  626 + /*
  627 + * Make sure writepages on the same page are not mixed up with
  628 + * plain writes.
  629 + */
  630 + fuse_wait_on_page_writeback(inode, page->index);
  631 +
549 632 req = fuse_get_req(fc);
550 633 if (IS_ERR(req))
551 634 return PTR_ERR(req);
... ... @@ -716,24 +799,228 @@
716 799 return res;
717 800 }
718 801  
719   -static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
  802 +static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
720 803 {
721   - if ((vma->vm_flags & VM_SHARED)) {
722   - if ((vma->vm_flags & VM_WRITE))
723   - return -ENODEV;
724   - else
725   - vma->vm_flags &= ~VM_MAYWRITE;
  804 + __free_page(req->pages[0]);
  805 + fuse_file_put(req->ff);
  806 + fuse_put_request(fc, req);
  807 +}
  808 +
  809 +static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
  810 +{
  811 + struct inode *inode = req->inode;
  812 + struct fuse_inode *fi = get_fuse_inode(inode);
  813 + struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
  814 +
  815 + list_del(&req->writepages_entry);
  816 + dec_bdi_stat(bdi, BDI_WRITEBACK);
  817 + dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
  818 + bdi_writeout_inc(bdi);
  819 + wake_up(&fi->page_waitq);
  820 +}
  821 +
  822 +/* Called under fc->lock, may release and reacquire it */
  823 +static void fuse_send_writepage(struct fuse_conn *fc, struct fuse_req *req)
  824 +{
  825 + struct fuse_inode *fi = get_fuse_inode(req->inode);
  826 + loff_t size = i_size_read(req->inode);
  827 + struct fuse_write_in *inarg = &req->misc.write.in;
  828 +
  829 + if (!fc->connected)
  830 + goto out_free;
  831 +
  832 + if (inarg->offset + PAGE_CACHE_SIZE <= size) {
  833 + inarg->size = PAGE_CACHE_SIZE;
  834 + } else if (inarg->offset < size) {
  835 + inarg->size = size & (PAGE_CACHE_SIZE - 1);
  836 + } else {
  837 + /* Got truncated off completely */
  838 + goto out_free;
726 839 }
727   - return generic_file_mmap(file, vma);
  840 +
  841 + req->in.args[1].size = inarg->size;
  842 + fi->writectr++;
  843 + request_send_background_locked(fc, req);
  844 + return;
  845 +
  846 + out_free:
  847 + fuse_writepage_finish(fc, req);
  848 + spin_unlock(&fc->lock);
  849 + fuse_writepage_free(fc, req);
  850 + spin_lock(&fc->lock);
728 851 }
729 852  
730   -static int fuse_set_page_dirty(struct page *page)
  853 +/*
  854 + * If fi->writectr is positive (no truncate or fsync going on) send
  855 + * all queued writepage requests.
  856 + *
  857 + * Called with fc->lock
  858 + */
  859 +void fuse_flush_writepages(struct inode *inode)
731 860 {
732   - printk("fuse_set_page_dirty: should not happen\n");
733   - dump_stack();
  861 + struct fuse_conn *fc = get_fuse_conn(inode);
  862 + struct fuse_inode *fi = get_fuse_inode(inode);
  863 + struct fuse_req *req;
  864 +
  865 + while (fi->writectr >= 0 && !list_empty(&fi->queued_writes)) {
  866 + req = list_entry(fi->queued_writes.next, struct fuse_req, list);
  867 + list_del_init(&req->list);
  868 + fuse_send_writepage(fc, req);
  869 + }
  870 +}
  871 +
  872 +static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
  873 +{
  874 + struct inode *inode = req->inode;
  875 + struct fuse_inode *fi = get_fuse_inode(inode);
  876 +
  877 + mapping_set_error(inode->i_mapping, req->out.h.error);
  878 + spin_lock(&fc->lock);
  879 + fi->writectr--;
  880 + fuse_writepage_finish(fc, req);
  881 + spin_unlock(&fc->lock);
  882 + fuse_writepage_free(fc, req);
  883 +}
  884 +
  885 +static int fuse_writepage_locked(struct page *page)
  886 +{
  887 + struct address_space *mapping = page->mapping;
  888 + struct inode *inode = mapping->host;
  889 + struct fuse_conn *fc = get_fuse_conn(inode);
  890 + struct fuse_inode *fi = get_fuse_inode(inode);
  891 + struct fuse_req *req;
  892 + struct fuse_file *ff;
  893 + struct page *tmp_page;
  894 +
  895 + set_page_writeback(page);
  896 +
  897 + req = fuse_request_alloc_nofs();
  898 + if (!req)
  899 + goto err;
  900 +
  901 + tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
  902 + if (!tmp_page)
  903 + goto err_free;
  904 +
  905 + spin_lock(&fc->lock);
  906 + BUG_ON(list_empty(&fi->write_files));
  907 + ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
  908 + req->ff = fuse_file_get(ff);
  909 + spin_unlock(&fc->lock);
  910 +
  911 + fuse_write_fill(req, NULL, ff, inode, page_offset(page), 0, 1);
  912 +
  913 + copy_highpage(tmp_page, page);
  914 + req->num_pages = 1;
  915 + req->pages[0] = tmp_page;
  916 + req->page_offset = 0;
  917 + req->end = fuse_writepage_end;
  918 + req->inode = inode;
  919 +
  920 + inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
  921 + inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
  922 + end_page_writeback(page);
  923 +
  924 + spin_lock(&fc->lock);
  925 + list_add(&req->writepages_entry, &fi->writepages);
  926 + list_add_tail(&req->list, &fi->queued_writes);
  927 + fuse_flush_writepages(inode);
  928 + spin_unlock(&fc->lock);
  929 +
734 930 return 0;
  931 +
  932 +err_free:
  933 + fuse_request_free(req);
  934 +err:
  935 + end_page_writeback(page);
  936 + return -ENOMEM;
735 937 }
736 938  
  939 +static int fuse_writepage(struct page *page, struct writeback_control *wbc)
  940 +{
  941 + int err;
  942 +
  943 + err = fuse_writepage_locked(page);
  944 + unlock_page(page);
  945 +
  946 + return err;
  947 +}
  948 +
  949 +static int fuse_launder_page(struct page *page)
  950 +{
  951 + int err = 0;
  952 + if (clear_page_dirty_for_io(page)) {
  953 + struct inode *inode = page->mapping->host;
  954 + err = fuse_writepage_locked(page);
  955 + if (!err)
  956 + fuse_wait_on_page_writeback(inode, page->index);
  957 + }
  958 + return err;
  959 +}
  960 +
  961 +/*
  962 + * Write back dirty pages now, because there may not be any suitable
  963 + * open files later
  964 + */
  965 +static void fuse_vma_close(struct vm_area_struct *vma)
  966 +{
  967 + filemap_write_and_wait(vma->vm_file->f_mapping);
  968 +}
  969 +
  970 +/*
  971 + * Wait for writeback against this page to complete before allowing it
  972 + * to be marked dirty again, and hence written back again, possibly
  973 + * before the previous writepage completed.
  974 + *
  975 + * Block here, instead of in ->writepage(), so that the userspace fs
  976 + * can only block processes actually operating on the filesystem.
  977 + *
  978 + * Otherwise unprivileged userspace fs would be able to block
  979 + * unrelated:
  980 + *
  981 + * - page migration
  982 + * - sync(2)
  983 + * - try_to_free_pages() with order > PAGE_ALLOC_COSTLY_ORDER
  984 + */
  985 +static int fuse_page_mkwrite(struct vm_area_struct *vma, struct page *page)
  986 +{
  987 + /*
  988 + * Don't use page->mapping as it may become NULL from a
  989 + * concurrent truncate.
  990 + */
  991 + struct inode *inode = vma->vm_file->f_mapping->host;
  992 +
  993 + fuse_wait_on_page_writeback(inode, page->index);
  994 + return 0;
  995 +}
  996 +
  997 +static struct vm_operations_struct fuse_file_vm_ops = {
  998 + .close = fuse_vma_close,
  999 + .fault = filemap_fault,
  1000 + .page_mkwrite = fuse_page_mkwrite,
  1001 +};
  1002 +
  1003 +static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
  1004 +{
  1005 + if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
  1006 + struct inode *inode = file->f_dentry->d_inode;
  1007 + struct fuse_conn *fc = get_fuse_conn(inode);
  1008 + struct fuse_inode *fi = get_fuse_inode(inode);
  1009 + struct fuse_file *ff = file->private_data;
  1010 + /*
  1011 + * file may be written through mmap, so chain it onto the
  1012 + * inode's write_files list
  1013 + */
  1014 + spin_lock(&fc->lock);
  1015 + if (list_empty(&ff->write_entry))
  1016 + list_add(&ff->write_entry, &fi->write_files);
  1017 + spin_unlock(&fc->lock);
  1018 + }
  1019 + file_accessed(file);
  1020 + vma->vm_ops = &fuse_file_vm_ops;
  1021 + return 0;
  1022 +}
  1023 +
737 1024 static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
738 1025 struct file_lock *fl)
739 1026 {
... ... @@ -940,10 +1227,12 @@
940 1227  
941 1228 static const struct address_space_operations fuse_file_aops = {
942 1229 .readpage = fuse_readpage,
  1230 + .writepage = fuse_writepage,
  1231 + .launder_page = fuse_launder_page,
943 1232 .write_begin = fuse_write_begin,
944 1233 .write_end = fuse_write_end,
945 1234 .readpages = fuse_readpages,
946   - .set_page_dirty = fuse_set_page_dirty,
  1235 + .set_page_dirty = __set_page_dirty_nobuffers,
947 1236 .bmap = fuse_bmap,
948 1237 };
949 1238  
... ... @@ -15,6 +15,7 @@
15 15 #include <linux/mm.h>
16 16 #include <linux/backing-dev.h>
17 17 #include <linux/mutex.h>
  18 +#include <linux/rwsem.h>
18 19  
19 20 /** Max number of pages that can be used in a single read request */
20 21 #define FUSE_MAX_PAGES_PER_REQ 32
... ... @@ -25,6 +26,9 @@
25 26 /** Congestion starts at 75% of maximum */
26 27 #define FUSE_CONGESTION_THRESHOLD (FUSE_MAX_BACKGROUND * 75 / 100)
27 28  
  29 +/** Bias for fi->writectr, meaning new writepages must not be sent */
  30 +#define FUSE_NOWRITE INT_MIN
  31 +
28 32 /** It could be as large as PATH_MAX, but would that have any uses? */
29 33 #define FUSE_NAME_MAX 1024
30 34  
... ... @@ -73,6 +77,19 @@
73 77  
74 78 /** Files usable in writepage. Protected by fc->lock */
75 79 struct list_head write_files;
  80 +
  81 + /** Writepages pending on truncate or fsync */
  82 + struct list_head queued_writes;
  83 +
  84 + /** Number of sent writes, a negative bias (FUSE_NOWRITE)
  85 + * means more writes are blocked */
  86 + int writectr;
  87 +
  88 + /** Waitq for writepage completion */
  89 + wait_queue_head_t page_waitq;
  90 +
  91 + /** List of writepage requests (pending or sent) */
  92 + struct list_head writepages;
76 93 };
77 94  
78 95 /** FUSE specific file data */
... ... @@ -242,6 +259,12 @@
242 259 /** File used in the request (or NULL) */
243 260 struct fuse_file *ff;
244 261  
  262 + /** Inode used in the request or NULL */
  263 + struct inode *inode;
  264 +
  265 + /** Link on fi->writepages */
  266 + struct list_head writepages_entry;
  267 +
245 268 /** Request completion callback */
246 269 void (*end)(struct fuse_conn *, struct fuse_req *);
247 270  
... ... @@ -504,6 +527,11 @@
504 527 void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
505 528 u64 attr_valid, u64 attr_version);
506 529  
  530 +void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
  531 + u64 attr_valid);
  532 +
  533 +void fuse_truncate(struct address_space *mapping, loff_t offset);
  534 +
507 535 /**
508 536 * Initialize the client device
509 537 */
... ... @@ -522,6 +550,8 @@
522 550 */
523 551 struct fuse_req *fuse_request_alloc(void);
524 552  
  553 +struct fuse_req *fuse_request_alloc_nofs(void);
  554 +
525 555 /**
526 556 * Free a request
527 557 */
... ... @@ -558,6 +588,8 @@
558 588 */
559 589 void request_send_background(struct fuse_conn *fc, struct fuse_req *req);
560 590  
  591 +void request_send_background_locked(struct fuse_conn *fc, struct fuse_req *req);
  592 +
561 593 /* Abort all requests */
562 594 void fuse_abort_conn(struct fuse_conn *fc);
563 595  
... ... @@ -600,4 +632,9 @@
600 632  
601 633 int fuse_update_attributes(struct inode *inode, struct kstat *stat,
602 634 struct file *file, bool *refreshed);
  635 +
  636 +void fuse_flush_writepages(struct inode *inode);
  637 +
  638 +void fuse_set_nowrite(struct inode *inode);
  639 +void fuse_release_nowrite(struct inode *inode);
... ... @@ -59,7 +59,11 @@
59 59 fi->nodeid = 0;
60 60 fi->nlookup = 0;
61 61 fi->attr_version = 0;
  62 + fi->writectr = 0;
62 63 INIT_LIST_HEAD(&fi->write_files);
  64 + INIT_LIST_HEAD(&fi->queued_writes);
  65 + INIT_LIST_HEAD(&fi->writepages);
  66 + init_waitqueue_head(&fi->page_waitq);
63 67 fi->forget_req = fuse_request_alloc();
64 68 if (!fi->forget_req) {
65 69 kmem_cache_free(fuse_inode_cachep, inode);
... ... @@ -73,6 +77,7 @@
73 77 {
74 78 struct fuse_inode *fi = get_fuse_inode(inode);
75 79 BUG_ON(!list_empty(&fi->write_files));
  80 + BUG_ON(!list_empty(&fi->queued_writes));
76 81 if (fi->forget_req)
77 82 fuse_request_free(fi->forget_req);
78 83 kmem_cache_free(fuse_inode_cachep, inode);
... ... @@ -109,7 +114,7 @@
109 114 return 0;
110 115 }
111 116  
112   -static void fuse_truncate(struct address_space *mapping, loff_t offset)
  117 +void fuse_truncate(struct address_space *mapping, loff_t offset)
113 118 {
114 119 /* See vmtruncate() */
115 120 unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
... ... @@ -117,19 +122,12 @@
117 122 unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
118 123 }
119 124  
120   -
121   -void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
122   - u64 attr_valid, u64 attr_version)
  125 +void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
  126 + u64 attr_valid)
123 127 {
124 128 struct fuse_conn *fc = get_fuse_conn(inode);
125 129 struct fuse_inode *fi = get_fuse_inode(inode);
126   - loff_t oldsize;
127 130  
128   - spin_lock(&fc->lock);
129   - if (attr_version != 0 && fi->attr_version > attr_version) {
130   - spin_unlock(&fc->lock);
131   - return;
132   - }
133 131 fi->attr_version = ++fc->attr_version;
134 132 fi->i_time = attr_valid;
135 133  
136 134  
... ... @@ -159,7 +157,23 @@
159 157 fi->orig_i_mode = inode->i_mode;
160 158 if (!(fc->flags & FUSE_DEFAULT_PERMISSIONS))
161 159 inode->i_mode &= ~S_ISVTX;
  160 +}
162 161  
  162 +void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
  163 + u64 attr_valid, u64 attr_version)
  164 +{
  165 + struct fuse_conn *fc = get_fuse_conn(inode);
  166 + struct fuse_inode *fi = get_fuse_inode(inode);
  167 + loff_t oldsize;
  168 +
  169 + spin_lock(&fc->lock);
  170 + if (attr_version != 0 && fi->attr_version > attr_version) {
  171 + spin_unlock(&fc->lock);
  172 + return;
  173 + }
  174 +
  175 + fuse_change_attributes_common(inode, attr, attr_valid);
  176 +
163 177 oldsize = inode->i_size;
164 178 i_size_write(inode, attr->size);
165 179 spin_unlock(&fc->lock);
... ... @@ -468,6 +482,8 @@
468 482 atomic_set(&fc->num_waiting, 0);
469 483 fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
470 484 fc->bdi.unplug_io_fn = default_unplug_io_fn;
  485 + /* fuse does its own writeback accounting */
  486 + fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
471 487 fc->dev = sb->s_dev;
472 488 err = bdi_init(&fc->bdi);
473 489 if (err)
... ... @@ -475,6 +491,19 @@
475 491 err = bdi_register_dev(&fc->bdi, fc->dev);
476 492 if (err)
477 493 goto error_bdi_destroy;
  494 + /*
  495 + * For a single fuse filesystem use max 1% of dirty +
  496 + * writeback threshold.
  497 + *
  498 + * This gives about 1M of write buffer for memory maps on a
  499 + * machine with 1G and 10% dirty_ratio, which should be more
  500 + * than enough.
  501 + *
  502 + * Privileged users can raise it by writing to
  503 + *
  504 + * /sys/class/bdi/<bdi>/max_ratio
  505 + */
  506 + bdi_set_max_ratio(&fc->bdi, 1);
478 507 fc->reqctr = 0;
479 508 fc->blocked = 1;
480 509 fc->attr_version = 1;