18 Oct, 2007

7 commits

  • This patch moves transport dynamic registration and matching to the net
    module to prevent a bad Kconfig dependency between the net and fs 9p modules.

    Signed-off-by: Eric Van Hensbergen

    Eric Van Hensbergen
     
  • Loose mode in 9p utilizes the page cache without respecting coherency with
    the server. Any writes previously invalidated the entire mapping for a file.
    This patch softens the behavior to only invalidate the region of the actual
    write.
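
    A minimal sketch of the idea (only invalidate_inode_pages2_range() is a real
    kernel API here; the wrapper and its arguments are illustrative, not the
    actual v9fs code):

    /*
     * Sketch: drop only the page range covered by a write of 'count'
     * bytes at 'offset', instead of the file's whole mapping.
     * invalidate_inode_pages2_range() takes inclusive page indices.
     */
    static void invalidate_written_range(struct address_space *mapping,
                                         loff_t offset, size_t count)
    {
            if (count) {
                    pgoff_t first = offset >> PAGE_CACHE_SHIFT;
                    pgoff_t last = (offset + count - 1) >> PAGE_CACHE_SHIFT;

                    invalidate_inode_pages2_range(mapping, first, last);
            }
    }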

    Signed-off-by: Eric Van Hensbergen

    Eric Van Hensbergen
     
  • The 9P2000 protocol requires the authentication and permission checks to be
    done in the file server. For that reason every user that accesses the file
    server tree has to authenticate and attach to the server separately.
    Multiple users can share the same connection to the server.

    Currently v9fs does a single attach and executes all I/O operations as a
    single user. This makes using v9fs in multiuser environment unsafe as it
    depends on the client doing the permission checking.

    This patch improves the 9P2000 support by allowing every user to attach
    separately. The patch defines three modes of access (new mount option
    'access'):

    - attach-per-user (access=user) (default mode for 9P2000.u)
    If a user tries to access a file served by v9fs for the first time, v9fs
    sends an attach command to the server (Tattach) specifying the user. If
    the attach succeeds, the user can access the v9fs tree.
    As there is no uname->uid (string->integer) mapping yet, this mode works
    only with the 9P2000.u dialect.

    - allow only one user to access the tree (access=<uid>)
    Only the user with the given uid can access the v9fs tree. Other users that
    attempt to access it will get an EPERM error.

    - do all operations as a single user (access=any) (default for 9P2000)
    V9fs does a single attach and all operations are done as a single user.
    If this mode is selected, the v9fs behavior is identical with the current
    one.
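
    A rough sketch of how the three modes above could drive an access check (the
    enum, struct and helper names are illustrative, not the actual v9fs
    identifiers):

    /* Illustrative only: how the 'access' option might gate per-user access. */
    enum access_mode { ACCESS_USER, ACCESS_SINGLE, ACCESS_ANY };

    struct session {
            enum access_mode mode;
            uid_t allowed_uid;              /* only used for access=<uid> */
    };

    static int check_access(struct session *s, uid_t uid)
    {
            switch (s->mode) {
            case ACCESS_ANY:        /* one attach, all I/O done as one user */
                    return 0;
            case ACCESS_SINGLE:     /* only the named uid may use the mount */
                    return uid == s->allowed_uid ? 0 : -EPERM;
            case ACCESS_USER:       /* attach-per-user: look up or create a
                                       per-uid attach (fid) before proceeding */
                    return 0;
            }
            return -EPERM;
    }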

    Signed-off-by: Latchesar Ionkov
    Signed-off-by: Eric Van Hensbergen

    Latchesar Ionkov
     
  • Change the names of 'uid' and 'gid' parameters to the more appropriate
    'dfltuid' and 'dfltgid'. This also sets the default uid/gid to -2
    (aka nfsnobody)

    Signed-off-by: Latchesar Ionkov
    Signed-off-by: Eric Van Hensbergen

    Latchesar Ionkov
     
  • Create more general flags field in the v9fs_session_info struct and move the
    'extended' flag as a bit in the flags.
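
    The pattern is the usual dedicated-member-to-flag-bit conversion; a sketch
    with assumed names:

    /* Illustrative names only: 'extended' becomes one bit of a flags word. */
    #define V9FS_EXTENDED   0x01

    struct v9fs_session_info {
            unsigned int flags;             /* room for future option bits */
            /* ... */
    };

    /* test:  if (v9ses->flags & V9FS_EXTENDED) ...   */
    /* set:   v9ses->flags |= V9FS_EXTENDED;          */
    /* clear: v9ses->flags &= ~V9FS_EXTENDED;         */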

    Signed-off-by: Latchesar Ionkov
    Signed-off-by: Eric Van Hensbergen

    Latchesar Ionkov
     
  • This patch abstracts out the interfaces to underlying transports so that
    new transports can be added as modules. This should also allow kernel
    configuration of transports without ifdef-hell.
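
    A hedged sketch of what such a pluggable transport interface can look like
    (structure and function names below are illustrative, not necessarily the
    ones introduced by the patch):

    /*
     * Illustrative transport abstraction: each transport module fills in a
     * descriptor and registers it with the core, which matches on the name
     * given at mount time (e.g. trans=tcp or trans=fd).
     */
    struct transport_module {
            const char *name;                       /* "tcp", "fd", ... */
            int (*create)(const char *devname, char *options, void **priv);
            struct list_head list;                  /* chained on a global list */
    };

    void register_transport(struct transport_module *t);   /* from module_init */
    void unregister_transport(struct transport_module *t); /* from module_exit */
    struct transport_module *find_transport(const char *name);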

    Signed-off-by: Eric Van Hensbergen

    Eric Van Hensbergen
     
  • * 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (59 commits)
    [XFS] eagerly remove vmap mappings to avoid upsetting Xen
    [XFS] simplify validata_fields
    [XFS] no longer using io_vnode, as was remaining from 23 cherrypick
    [XFS] Remove STATIC which was missing from prior manual merge
    [XFS] Put back the QUEUE_ORDERED_NONE test in the barrier check.
    [XFS] Turn off XBF_ASYNC flag before re-reading superblock.
    [XFS] avoid race in sync_inodes() that can fail to write out all dirty data
    [XFS] This fix prevents bulkstat from spinning in an infinite loop.
    [XFS] simplify xfs_create/mknod/symlink prototype
    [XFS] avoid xfs_getattr in XFS_IOC_FSGETXATTR ioctl
    [XFS] get_bulkall() could return incorrect inode state
    [XFS] Kill unused IOMAP_EOF flag
    [XFS] fix when DMAPI mount option processing happens
    [XFS] ensure file size is logged on synchronous writes
    [XFS] growlock should be a mutex
    [XFS] replace some large xfs_log_priv.h macros by proper functions
    [XFS] kill struct bhv_vfs
    [XFS] move syncing related members from struct bhv_vfs to struct xfs_mount
    [XFS] kill the vfs_flags member in struct bhv_vfs
    [XFS] kill the vfs_fsid and vfs_altfsid members in struct bhv_vfs
    ...

    Linus Torvalds
     

17 Oct, 2007

33 commits

  • This patch contains the following cleanups that are now possible:
    - remove the unused security_operations->inode_xattr_getsuffix
    - remove the no longer used security_operations->unregister_security
    - remove some no longer required exit code
    - remove a bunch of no longer used exports

    Signed-off-by: Adrian Bunk
    Acked-by: James Morris
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Implement file posix capabilities. This allows programs to be given a
    subset of root's powers regardless of who runs them, without having to use
    setuid and giving the binary all of root's powers.

    This version works with Kaigai Kohei's userspace tools, found at
    http://www.kaigai.gr.jp/index.php. For more information on how to use this
    patch, Chris Friedhoff has posted a nice page at
    http://www.friedhoff.org/fscaps.html.
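
    File capabilities live in the security.capability extended attribute, so a
    quick way to check whether a binary carries them is to read that xattr
    directly. A minimal userspace sketch (it only reports the attribute's size
    and does not decode the structure):

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
            ssize_t n;

            if (argc < 2)
                    return 1;
            /* size query: a zero-length buffer returns the attribute size */
            n = getxattr(argv[1], "security.capability", NULL, 0);
            if (n < 0)
                    perror("getxattr");     /* ENODATA: no file caps set */
            else
                    printf("%s: %zd bytes of capability data\n", argv[1], n);
            return 0;
    }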

    Changelog:
    Nov 27:
    Incorporate fixes from Andrew Morton
    (security-introduce-file-caps-tweaks and
    security-introduce-file-caps-warning-fix)
    Fix Kconfig dependency.
    Fix change signaling behavior when file caps are not compiled in.

    Nov 13:
    Integrate comments from Alexey: Remove CONFIG_ ifdef from
    capability.h, and use %zd for printing a size_t.

    Nov 13:
    Fix endianness warnings by sparse as suggested by Alexey
    Dobriyan.

    Nov 09:
    Address warnings of unused variables at cap_bprm_set_security
    when file capabilities are disabled, and simultaneously clean
    up the code a little, by pulling the new code into a helper
    function.

    Nov 08:
    For pointers to required userspace tools and how to use
    them, see http://www.friedhoff.org/fscaps.html.

    Nov 07:
    Fix the calculation of the highest bit checked in
    check_cap_sanity().

    Nov 07:
    Allow file caps to be enabled without CONFIG_SECURITY, since
    capabilities are the default.
    Hook cap_task_setscheduler when !CONFIG_SECURITY.
    Move capable(TASK_KILL) to end of cap_task_kill to reduce
    audit messages.

    Nov 05:
    Add secondary calls in selinux/hooks.c to task_setioprio and
    task_setscheduler so that selinux and capabilities with file
    cap support can be stacked.

    Sep 05:
    As Seth Arnold points out, uid checks are out of place
    for capability code.

    Sep 01:
    Define task_setscheduler, task_setioprio, cap_task_kill, and
    task_setnice to make sure a user cannot affect a process in which
    they called a program with some fscaps.

    One remaining question is the note under task_setscheduler: are we
    ok with CAP_SYS_NICE being sufficient to confine a process to a
    cpuset?

    It is a semantic change, as without fscaps, attach_task doesn't
    allow CAP_SYS_NICE to override the uid equivalence check. But since
    it uses security_task_setscheduler, which elsewhere is used where
    CAP_SYS_NICE can be used to override the uid equivalence check,
    fixing it might be tough.

    task_setscheduler
    note: this also controls cpuset:attach_task. Are we ok with
    CAP_SYS_NICE being used to confine to a cpuset?
    task_setioprio
    task_setnice
    sys_setpriority uses this (through set_one_prio) for another
    process. Need same checks as setrlimit

    Aug 21:
    Updated secureexec implementation to reflect the fact that
    euid and uid might be the same and nonzero, but the process
    might still have elevated caps.

    Aug 15:
    Handle endianness of xattrs.
    Enforce capability version match between kernel and disk.
    Enforce that no bits beyond the known max capability are
    set, else return -EPERM.
    With this extra processing, it may be worth reconsidering
    doing all the work at bprm_set_security rather than
    d_instantiate.

    Aug 10:
    Always call getxattr at bprm_set_security, rather than
    caching it at d_instantiate.

    [morgan@kernel.org: file-caps clean up for linux/capability.h]
    [bunk@kernel.org: unexport cap_inode_killpriv]
    Signed-off-by: Serge E. Hallyn
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morgan
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • I'm going to be modifying nfsd_rename() shortly to support read-only bind
    mounts. This #ifdef is around the area I'm patching, and it starts to get
    really ugly if I just try to add my new code by itself. Using this little
    helper makes things a lot cleaner.

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • First of all, this makes the structure jumping look a little bit cleaner. So,
    this stands alone as a tiny cleanup. But, we also need 'mnt' by itself a few
    more times later in this series, so this isn't _just_ a cleanup.

    Signed-off-by: Dave Hansen
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • may_open() calls vfs_permission() before it does checks for IS_RDONLY(inode).
    It checks _again_ inside of vfs_permission().

    The check inside of vfs_permission() is going away eventually. With the
    mnt_want/drop_write() functions, all of the r/o checks (except for this one)
    are consistently done before calling permission(). Because of this, I'd like
    to use permission() to hold a debugging check to make sure that the
    mnt_want/drop_write() calls are actually being made.

    So, to do this:
    1. remove the IS_RDONLY() check from permission()
    2. enforce that you must mnt_want_write() before
    even calling permission()
    3. actually add the debugging check to permission()

    We need to rearrange may_open() to do r/o checks before calling permission().
    Here's the patch.
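
    The ordering the series converges on looks roughly like this (a sketch of
    the caller pattern, not the exact may_open() code):

    /*
     * Sketch: take the write reference on the mount first -- that is where
     * the r/o check now lives -- and only then call the permission code,
     * which no longer checks IS_RDONLY() itself.
     */
    error = mnt_want_write(nd->mnt);        /* -EROFS on a read-only mount */
    if (error)
            return error;
    error = vfs_permission(nd, MAY_WRITE);
    if (error) {
            mnt_drop_write(nd->mnt);
            return error;
    }
    /* ... perform the write-side work ... */
    mnt_drop_write(nd->mnt);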

    Signed-off-by: Dave Hansen
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Why do we need r/o bind mounts?

    This feature allows a read-only view into a read-write filesystem. In the
    process of doing that, it also provides infrastructure for keeping track of
    the number of writers to any given mount.

    This has a number of uses. It allows chroots to have parts of filesystems
    writable. It will be useful for containers in the future because users may
    have root inside a container, but should not be allowed to write to
    some filesystems. This also replaces patches that vserver has had out of the
    tree for several years.

    It allows security enhancement by keeping parts of your filesystem
    read-only (such as when you don't trust your FTP server), when you don't want
    to have entire new filesystems mounted, or when you want atime selectively
    updated. I've been using the following script to test that the feature is
    working as desired. It takes a directory and makes a regular bind and a r/o
    bind mount of it. It then performs some normal filesystem operations on the
    three directories, including ones that are expected to fail, like creating a
    file on the r/o mount.

    This patch:

    Some filesystems forego the vfs and may_open() and create their own 'struct
    file's.

    This patch creates a couple of helper functions which can be used by these
    filesystems, and will provide a unified place which the r/o bind mount code
    may patch.

    Also, rename an existing, static-scope init_file() to a less generic name.

    Signed-off-by: Dave Hansen
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Define a new function fuse_refresh_attributes() that conditionally refreshes
    the attributes based on the validity timeout.

    In fuse_permission() only refresh the attributes for checking the execute bits
    if necessary.
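
    Conceptually the helper reduces to a validity check in front of a GETATTR
    round trip (field and helper names here are paraphrased; treat it as a
    sketch):

    /* Sketch: only go to userspace when the cached attributes have expired. */
    static int refresh_attributes(struct inode *inode)
    {
            struct fuse_inode *fi = get_fuse_inode(inode);

            if (time_before64(fi->i_time, get_jiffies_64()))
                    return getattr_from_server(inode);  /* issue FUSE_GETATTR */
            return 0;                                   /* cache still valid */
    }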

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Don't return -ENOENT for a read() on the fuse device when the request was
    aborted. Instead return -ENODEV, meaning the filesystem has been
    force-umounted or aborted.

    Previously ENOENT meant that the request was interrupted, but now the
    'aborted' flag is not set in case of interrupts.
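
    For a userspace filesystem that reads requests from /dev/fuse directly, the
    distinction now looks like this (a fragment, error reporting trimmed):

    /* Sketch of a request-read loop in a fuse daemon. */
    for (;;) {
            ssize_t n = read(fuse_fd, buf, bufsize);

            if (n < 0) {
                    if (errno == EINTR)
                            continue;       /* interrupted: just retry */
                    if (errno == ENODEV)
                            break;          /* aborted or force-umounted */
                    break;                  /* anything else: bail out */
            }
            /* process the request found in buf[0..n) */
    }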

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Don't set 'aborted' flag on a request if it's interrupted. We have to wait
    for the answer anyway, and this would only save a very little time while
    copying the reply.

    This means that write() on the fuse device will not return -ENOENT during
    normal operation, only if the filesystem is aborted by a forced umount or
    through the fusectl interface.

    This could simplify userspace code somewhat when backward compatibility with
    earlier kernel versions is not required.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Move dput/mntput pair from request_end() to fuse_release_end(), because
    there's no other place they are used.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • The VFS checks sticky bits on the parent directory even if the filesystem
    defines its own ->permission(). In some situations (sshfs, mountlo, etc) the
    user does have permission to delete a file even if the attribute based
    checking would not allow it.

    So work around this by storing the permission bits separately and returning
    them in stat(), but cutting the permission bits off from inode->i_mode.

    This is slightly hackish, but it's probably not worth it to add new
    infrastructure in VFS and a slight performance penalty for all filesystems,
    just for the sake of fuse.

    [Jan Engelhardt] cosmetic fixes
    Signed-off-by: Miklos Szeredi
    Cc: Jan Engelhardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • fuse_permission() didn't refresh inode attributes before using them, even if
    the validity has already expired.

    Thanks to Junjiro Okajima for spotting this.

    Also remove some old code to unconditionally refresh the attributes on the
    root inode.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Aufs seems to depend on a positive i_nlink value. So fill in a dummy but sane
    value for the root inode at mount time.

    The inode attributes are refreshed with the correct values at the first
    opportunity.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
    Other than truncate, there are two cases when fuse tries to get rid
    of cached pages:

    a) in open, if KEEP_CACHE flag is not set
    b) in getattr, if file size changed spontaneously

    Until now invalidate_mapping_pages() was used, which didn't get rid
    of mapped pages. This is wrong, and becomes more wrong as dirty pages
    are introduced. So instead properly invalidate all pages with
    invalidate_inode_pages2().
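
    The change amounts to swapping one call for the other (a sketch):

    /* Before: leaves mapped pages in place */
    invalidate_mapping_pages(inode->i_mapping, 0, -1);

    /* After: also unmaps and invalidates mapped pages, and reports failure
     * if a page could not be invalidated */
    invalidate_inode_pages2(inode->i_mapping);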

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Memory mappings were only truncated on an explicit truncate, but not when the
    file size was changed externally.

    Fix this by moving the truncation code from fuse_setattr to
    fuse_change_attributes.

    Yes, there are races between write and external truncation, but we can't
    really do anything about them.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Make lifetime of 'struct fuse_file' independent from 'struct file' by adding a
    reference counter and destructor.

    This will enable asynchronous page writeback, where it cannot be guaranteed
    that the file is not released while a request with this file handle is being
    served.

    The actual RELEASE request is only sent when there are no more references to
    the fuse_file.
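
    The lifetime change can be pictured as a small reference count around the
    server-side file handle (illustrative names, not the exact fuse code):

    /* Illustrative refcounting for a fuse file handle. */
    struct ff_handle {
            atomic_t count;         /* refs: struct file + in-flight requests */
            u64 fh;                 /* handle returned by the userspace server */
    };

    static void ff_get(struct ff_handle *ff)
    {
            atomic_inc(&ff->count);
    }

    static void ff_put(struct ff_handle *ff)
    {
            if (atomic_dec_and_test(&ff->count)) {
                    /* last reference gone: now it is safe to send RELEASE */
                    send_release_request(ff);       /* hypothetical helper */
                    kfree(ff);
            }
    }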

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Use wake_up_all instead of wake_up in put_reserved_req(), otherwise it is
    possible that the right task is not woken up.

    Also create a separate reserved_req_waitq in addition to the blocked_waitq,
    since they fulfill totally separate functions.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Set the read and write congestion state if the request queue is close to
    blocking, and clear it when it's not.

    This prevents unnecessary blocking in readahead and (when writable mmaps are
    allowed) writeback.
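
    Roughly, the background queue length is compared against a threshold and
    the connection's backing-dev congestion bits are flipped accordingly (a
    sketch assuming the set_bdi_congested()/clear_bdi_congested() helpers of
    that era; the threshold name is made up):

    /* When queueing a background request: */
    if (fc->num_background == CONGESTION_THRESHOLD) {
            set_bdi_congested(&fc->bdi, READ);
            set_bdi_congested(&fc->bdi, WRITE);
    }

    /* When a background request completes: */
    if (fc->num_background == CONGESTION_THRESHOLD) {
            clear_bdi_congested(&fc->bdi, READ);
            clear_bdi_congested(&fc->bdi, WRITE);
    }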

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Val's cross-port of the ext3 reservations code into ext2.

    [mbligh@mbligh.org: Small type error for printk]
    [akpm@linux-foundation.org: fix types, sync with ext3]
    [mbligh@mbligh.org: Bring ext2 reservations code in line with latest ext3]
    [akpm@linux-foundation.org: kill noisy printk]
    [akpm@linux-foundation.org: remember to dirty the gdp's block]
    [akpm@linux-foundation.org: cross-port the missed 5dea5176e5c32ef9f0d1a41d28427b3bf6881b3a]
    [akpm@linux-foundation.org: cross-port e6022603b9aa7d61d20b392e69edcdbbc1789969]
    [akpm@linux-foundation.org: Port the omitted 08fb306fe63d98eb86e3b16f4cc21816fa47f18e]
    [akpm@linux-foundation.org: Backport the missed 20acaa18d0c002fec180956f87adeb3f11f635a6]
    [akpm@linux-foundation.org: fixes]
    [cmm@us.ibm.com: fix reservation extension]
    [bunk@stusta.de: make ext2_get_blocks() static]
    [hugh@veritas.com: fix hang]
    [hugh@veritas.com: ext2_new_blocks should reset the reservation window size]
    [hugh@veritas.com: ext2 balloc: fix off-by-one against rsv_end]
    [hugh@veritas.com: grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv should treat it as such]
    [hugh@veritas.com: rbtree usage cleanup]
    [pbadari@us.ibm.com: Fix for ext2 reservation]
    [bunk@kernel.org: remove fs/ext2/balloc.c:reserve_blocks()]
    [hugh@veritas.com: ext2 balloc: use io_error label]
    Cc: "Martin J. Bligh"
    Cc: Valerie Henson
    Cc: Mingming Cao
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Adrian Bunk
    Signed-off-by: Hugh Dickins
    Signed-off-by: Badari Pulavarty
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin J. Bligh
     
  • I_LOCK was used for several unrelated purposes, which caused deadlock
    situations in certain filesystems as a side effect. One of the purposes
    now uses the new I_SYNC bit.

    Also document the various bits and change their order from historical to
    logical.

    [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
    Signed-off-by: Joern Engel
    Cc: Dave Kleikamp
    Cc: David Chinner
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joern Engel
     
    After dirtying a 100M file, the normal behavior is to start the writeback
    for all data after a 30s delay. But sometimes the following happens instead:

    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: all remaining 92M

    Some analysis shows that the internal io dispatch queues go like this:

           s_io        s_more_io
           -------------------------
    1)     100M,1K     0
    2)     1K          96M
    3)     0           96M

    1) initial state with a 100M file and a 1K file
    2) 4M written, nr_to_write <= 0, so write more
    3) 1K written, nr_to_write > 0, no more writes (BUG)

    nr_to_write > 0 in (3) fools the upper layer into thinking that all data has
    been written out. The big dirty file is actually still sitting in s_more_io. We
    cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
    let the loop in generic_sync_sb_inodes() continue: this may starve newly
    expired inodes in s_dirty. It is also not an option to draw inodes from both
    s_more_io and s_dirty, and let the loop go on: this might lead to livelocks,
    and might also starve other superblocks in sync time (well, kupdate may still
    starve some superblocks, but that's another bug).

    We have to return when a full scan of s_io completes. So nr_to_write > 0 does
    not necessarily mean that "all data are written". This patch introduces a
    flag writeback_control.more_io to indicate this situation. With it the big
    dirty file no longer has to wait for the next kupdate invocation 5s later.
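
    In code terms the idea is a new flag in writeback_control that the per-sb
    sync loop sets whenever it leaves work parked on s_more_io (a sketch, not
    the exact patch):

    struct writeback_control {
            /* ... existing fields ... */
            unsigned more_io:1;     /* new: more io is queued on this sb */
    };

    /* in the per-sb sync loop, before returning: */
    if (!list_empty(&sb->s_more_io))
            wbc->more_io = 1;

    /* in the caller (e.g. the kupdate loop): */
    if (wbc.nr_to_write > 0 && !wbc.more_io)
            break;                  /* really nothing left: stop early */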

    Cc: David Chinner
    Cc: Ken Chen
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
    Miklos Szeredi and I identified a writeback bug:

    > The following strange behavior can be observed:
    >
    > 1. large file is written
    > 2. after 30 seconds, nr_dirty goes down by 1024
    > 3. then for some time (< 30 sec) nothing happens (disk idle)
    > 4. then nr_dirty again goes down by 1024
    > 5. repeat from 3. until whole file is written
    >
    > So basically a 4Mbyte chunk of the file is written every 30 seconds.
    > I'm quite sure this is not the intended behavior.

    It can be produced by the following test scheme:

    # cat bin/test-writeback.sh
    grep nr_dirty /proc/vmstat
    echo 1 > /proc/sys/fs/inode_debug
    dd if=/dev/zero of=/var/x bs=1K count=204800&
    while true; do grep nr_dirty /proc/vmstat; sleep 1; done

    # bin/test-writeback.sh
    nr_dirty 19207
    nr_dirty 19207
    nr_dirty 30924
    204800+0 records in
    204800+0 records out
    209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
    nr_dirty 47150
    nr_dirty 47141
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47205
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47214
    nr_dirty 47215
    nr_dirty 47216
    nr_dirty 47216
    nr_dirty 47216
    nr_dirty 47154
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47143
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47142
    nr_dirty 47134
    nr_dirty 47134
    nr_dirty 47135
    nr_dirty 47135
    nr_dirty 47135
    nr_dirty 46097

    and the code in generic_sync_sb_inodes():

    543                 if (wbc->pages_skipped != pages_skipped) {
    544                         /*
    545                          * writeback is not making progress due to locked
    546                          * buffers. Skip this inode for now.
    547                          */
    548                         redirty_tail(inode);
    549                 }

    Further debugging shows that __block_write_full_page()
    never has the chance to call submit_bh() for that big dirty file:
    the buffer head is *clean*. So basically no page io is issued by
    __block_write_full_page(), hence pages_skipped goes up.

    Also the comment in generic_sync_sb_inodes():

    544 /*
    545 * writeback is not making progress due to locked
    546 * buffers. Skip this inode for now.
    547 */

    and the comment in __block_write_full_page():

    1713 /*
    1714 * The page was marked dirty, but the buffers were
    1715 * clean. Someone wrote them back by hand with
    1716 * ll_rw_block/submit_bh. A rare case.
    1717 */

    do not quite agree with each other. The page writeback should be skipped for
    'locked buffer', but here it is 'clean buffer'!

    This patch fixes this bug. Though I'm not sure why __block_write_full_page()
    is called only to do nothing and who actually issued the writeback for us.

    These are the two possible new behaviors after the patch:

    1) pretty nice: wait 30s and write ALL:)
    2) not so good:
    - during the dd: ~16M
    - after 30s: ~4M
    - after 5s: ~4M
    - after 5s: ~176M

    The next patch will fix case (2).

    Cc: David Chinner
    Cc: Ken Chen
    Signed-off-by: Fengguang Wu
    Signed-off-by: David Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • NTFS's if-condition on dirty inodes is not complete. Fix it with
    sb_has_dirty_inodes().

    Cc: Anton Altaparmakov
    Cc: Ken Chen
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Streamline the management of dirty inode lists and fix time ordering bugs.

    The writeback logic used to move not-yet-expired dirty inodes from s_dirty to
    s_io, *only to* move them back. The move-inodes-back-and-forth thing is a
    mess, which is eliminated by this patch.

    The new scheme is:
    - s_dirty acts as a time ordered io delaying queue;
    - s_io/s_more_io together acts as an io dispatching queue.

    On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
    every full scan of s_io. Otherwise (i.e. for sync/throttle/background
    writeback), we always pull from s_dirty on each run (a partial scan).

    Note that the line
    list_splice_init(&sb->s_more_io, &sb->s_io);
    is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
    sit in s_io for a long time, preventing new expired inodes from getting in.
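
    A sketch of the refill step described above (close to, but not necessarily
    identical with, the committed queue_io(); move_expired_inodes() stands for
    whatever pulls the expired inodes off s_dirty):

    static void queue_io(struct super_block *sb, unsigned long *older_than_this)
    {
            /* requeue whatever was parked on s_more_io ... */
            list_splice_init(&sb->s_more_io, &sb->s_io);
            /* ... then pull the newly expired inodes off s_dirty */
            move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
    }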

    Cc: Ken Chen
    Cc: Andrew Morton
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Current -mm tree has bucketful of bug fixes in periodic writeback path.
    However, we still hit a glitch where dirty pages on a given inode aren't
    completely flushed to the disk, and the system will accumulate a large amount of
    dirty pages beyond what dirty_expire_interval is designed for.

    The problem is __sync_single_inode() will move an inode to sb->s_dirty list
    even when there are more pending dirty pages on that inode. If there is
    another inode with a small number of dirty pages, we hit a case where the loop
    iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
    Thus leaving the inode that has large amount of dirty pages behind and it has
    to wait for another dirty_writeback_interval before we flush it again. We
    effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
    If the rate of dirtying is sufficiently high, the system will start to
    accumulate a large number of dirty pages.

    So fix it by having another sb->s_more_io list on which to park the inode
    while we iterate through sb->s_io and to allow each dirty inode which resides
    on that sb to have an equal chance of flushing some amount of dirty pages.

    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • This one fixes four bugs.

    There are a few situations in there where writeback decides it is going to skip
    over a blockdev inode on the kernel-internal blockdev superblock. It
    presently does this by moving the blockdev inode onto the tail of the blockdev
    superblock's s_dirty. But

    a) this screws up s_dirty's reverse-time-orderedness and

    b) refiling the blockdev for writeback in another 30 seconds is rude. We
    should try again sooner than that.

    Fix all this up by using redirty_head(): move the blockdev inode onto the head
    of the blockdev superblock's s_dirty list for prompt writeback.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Recycling the previous changelog:

    When the writeback function is operating in writeback-for-flushing mode
    (as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode,
    it will skip writing that inode. This is done for throughput and latency:
    move on to another inode rather than blocking for this one.

    Writeback skips this inode by moving it off s_io and onto s_dirty, so that
    writeback can proceed with the other inodes on s_io.

    However that inode movement can corrupt s_dirty's
    reverse-time-orderedness. Fix that by using the new redirty_tail(), which
    will update the refiled inode's dirtied_when field.

    Note: the behaviour in here is a bit rude: if kupdate happens to come
    across a locked inode then it will defer writeback of that inode for another
    30 seconds. We'll address that in the next patch.

    Address that here. What we do is to move the skipped inode to the _head_ of
    s_dirty, immediately eligible for writeout again. Instead of deferring that
    writeout for another 30 seconds.

    One would think that this might cause a livelock: we keep on trying to write
    the same locked inode. But it won't because:

    a) if that was the case, it would _already_ be happening on the
    balance_dirty_pages codepath. Because balance_dirty_pages() doesn't care
    about inode timestamps.

    b) if we skipped this inode then we won't have done any writeback. The
    higher-level writeback paths will see that wbc.nr_to_write didn't change
    and they'll then back off and take a nap.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When the writeback function is operating in writeback-for-flushing mode (as
    opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it
    will skip writing that inode. This is done for throughput and latency: move
    on to another inode rather than blocking for this one.

    Writeback skips this inode by moving it off s_io and onto s_dirty, so that
    writeback can proceed with the other inodes on s_io.

    However that inode movement can corrupt s_dirty's reverse-time-orderedness.
    Fix that by using the new redirty_tail(), which will update the refiled
    inode's dirtied_when field.

    Note: the behaviour in here is a bit rude: if kupdate happens to come across a
    locked inode then it will defer writeback of that inode for another 30
    seconds. We'll address that in the next patch.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • There's a comment in there which claims that the inode is left on s_io
    if nfs chickened out of writing some data.

    But that's not been true for three years.
    9290280ced13c85689adeffa587e9a53bd3a5873 fixed a livelock by moving these
    inodes back onto s_dirty. Fix the comment.

    In the second leg of the `if', use redirty_tail() rather than open-coding it.

    Add weaselly comment indicating lack of confidence in the code and lack of the
    fortitude which would be needed to fiddle with it.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When the kupdate function has tried to write back an expired inode it will
    then check to see whether some of the inode's pages are still dirty.

    This can happen when the filesystem decided to not write a page for some
    reason. But it does _not_ occur due to redirtyings: a redirtying will set
    I_DIRTY_PAGES.

    What we need to do here is to set I_DIRTY_PAGES to reflect reality and to then
    put the inode onto the _head_ of s_dirty for consideration on the next kupdate
    pass, in five seconds time.

    Problem is, the code failed to modify the inode's timestamp when pushing the
    inode onto the head of s_dirty.

    The patch:

    If there are no other inodes on s_dirty then we leave the inode's timestamp
    alone: it is already expired.

    If there _are_ other inodes on s_dirty then we arrange for this inode to get
    the same timestamp as the inode which is at the head of s_dirty, thus
    preserving the s_dirty ordering. But we only need to do this if this inode
    purports to have been dirtied before the one at head-of-list.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • While writeback is working against a dirty inode it does a check after trying
    to write some of the inode's pages:

    "did the lower layers skip some of the inode's dirty pages because they were
    locked (or under writeback, or whatever)"

    If this turns out to be true, we must move the inode back onto s_dirty and
    redirty it. The reason for doing this is that fsync() and friends only check
    the s_dirty list, and those functions want to know about those pages which
    were locked, so they can be waited upon and, if necessary, rewritten.

    Problem is, that redirtying was putting the inode onto the tail of s_dirty
    without updating its timestamp. This causes a violation of s_dirty ordering.

    Fix this by updating inode->dirtied_when when moving the inode onto s_dirty.

    But the code is still a bit buggy? If the inode was _already_ dirty then we
    don't need to move it at all. Oh well, hopefully it doesn't matter too much,
    as that was a redirtying, which was very recent anyway.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • For reasons which escape me, inodes which are dirty against a ram-backed
    filesystem are managed in the same way as inodes which are backed by real
    devices.

    Probably we could optimise things here. But given that we skip the entire
    superblock as soon as we hit the first dirty inode, there's not a lot to be
    gained.

    And the code does need to handle one particular non-backed superblock: the
    kernel's fake internal superblock which holds all the blockdevs.

    Still. At present when the code encounters an inode which is dirty against a
    memory-backed filesystem it will skip that inode by refiling it back onto
    s_dirty. But it fails to update the inode's timestamp when doing so which at
    least makes the debugging code upset.

    Fix.

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When writeback has finished writing back an inode it looks to see if that
    inode is still dirty. If it is, that means that a process redirtied the inode
    while its writeback was in progress.

    What we need to do here is to refile the redirtied inode onto the s_dirty
    list.

    But we're doing that wrongly: it could be that this inode was redirtied
    _before_ the last inode on s_dirty. We're blindly appending this inode to the
    list, after an inode which might be less-recently-dirtied, thus violating the
    list's ordering.

    So we must either insertion-sort this inode into the correct place, or we must
    update this inode's dirtied_when field when appending it to the reverse-sorted
    s_dirty list, to preserve the reverse-time-ordering.

    This patch does the latter: if this inode was dirtied less recently than the
    tail inode then copy the tail inode's timestamp into this inode.

    This means that in rare circumstances, some inodes will be written back later
    than they should have been. But the time slip will be small.
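
    Expressed as code, the rule from the last two paragraphs is roughly (a
    sketch; the committed helper may differ in detail, e.g. in list orientation
    or in using jiffies instead of the neighbour's value):

    /*
     * Sketch: when refiling a redirtied inode onto s_dirty, never let it
     * claim to be older than the inode it is queued next to, so the
     * reverse-time ordering of the list is preserved.
     */
    if (!list_empty(&sb->s_dirty)) {
            struct inode *neighbour =
                    list_entry(sb->s_dirty.next, struct inode, i_list);

            if (time_before(inode->dirtied_when, neighbour->dirtied_when))
                    inode->dirtied_when = neighbour->dirtied_when;
    }
    list_move(&inode->i_list, &sb->s_dirty);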

    Cc: Mike Waychison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton