25 Jul, 2008

1 commit

  • [Summary]

    Split LRU-list of unused dentries to one per superblock to avoid soft
    lock up during NFS mounts and remounting of any filesystem.

    Previously I posted here:
    http://lkml.org/lkml/2008/3/5/590

    [Descriptions]

    - background

    dentry_unused is a list of dentries which are not referenced.
    dentry_unused grows up when references on directories or files are
    released. This list can be very long if there is huge free memory.

    - the problem

    When shrink_dcache_sb() is called, it scans all dentry_unused linearly
    under spin_lock(), and if dentry->d_sb is differnt from given
    superblock, scan next dentry. This scan costs very much if there are
    many entries, and very ineffective if there are many superblocks.

    IOW, When we need to shrink unused dentries on one dentry, but scans
    unused dentries on all superblocks in the system. For example, we scan
    500 dentries to unmount a filesystem, but scans 1,000,000 or more unused
    dentries on other superblocks.

    In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink
    unused dentries on NFS, but scans 100,000,000 unused dentries on
    superblocks in the system such as local ext3 filesystems. I hear NFS
    mounting took 1 min on some system in use.

    * : NFS uses virtual filesystem in rpc layer, so NFS is affected by
    this problem.

    100,000,000 is possible number on large systems.

    Per-superblock LRU of unused dentried can reduce the cost in
    reasonable manner.

    - How to fix

    I found this problem is solved by David Chinner's "Per-superblock
    unused dentry LRU lists V3"(1), so I rebase it and add some fix to
    reclaim with fairness, which is in Andrew Morton's comments(2).

    1) http://lkml.org/lkml/2006/5/25/318
    2) http://lkml.org/lkml/2006/5/25/320

    Split LRU-list of unused dentries to each superblocks. Then, NFS
    mounting will check dentries under a superblock instead of all. But
    this spliting will break LRU of dentry-unused. So, I've attempted to
    make reclaim unused dentrins with fairness by calculate number of
    dentries to scan on this sb based on following way

    number of dentries to scan on this sb =
    count * (number of dentries on this sb / number of dentries in the machine)

    - ToDo
    - I have to measuring performance number and do stress tests.

    - When unmount occurs during prune_dcache(), scanning on same
    superblock, It is unable to reach next superblock because it is gone
    away. We restart scannig superblock from first one, it causes
    unfairness of reclaim unused dentries on first superblock. But I think
    this happens very rarely.

    - Test Results

    Result on 6GB boxes with excessive unused dentries.

    Without patch:

    $ cat /proc/sys/fs/dentry-state
    10181835 10180203 45 0 0 0
    # mount -t nfs 10.124.60.70:/work/kernel-src nfs
    real 0m1.830s
    user 0m0.001s
    sys 0m1.653s

    With this patch:
    $ cat /proc/sys/fs/dentry-state
    10236610 10234751 45 0 0 0
    # mount -t nfs 10.124.60.70:/work/kernel-src nfs
    real 0m0.106s
    user 0m0.002s
    sys 0m0.032s

    [akpm@linux-foundation.org: fix comments]
    Signed-off-by: Kentaro Makita
    Cc: Neil Brown
    Cc: Trond Myklebust
    Cc: David Chinner
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kentaro Makita
     

29 Apr, 2008

1 commit


28 Apr, 2008

1 commit

  • Currently, we just turn quotas off on remount of filesystem to read-only
    state. The patch below adds necessary framework so that we can turn quotas
    off on remount RO but we are able to automatically reenable them again when
    filesystem is remounted to RW state. All we need to do is to keep references
    to inodes of quota files when remounting RO and using these references to
    reenable quotas when remounting RW.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

22 Apr, 2008

1 commit


19 Apr, 2008

2 commits

  • There have been a few oopses caused by 'struct file's with NULL f_vfsmnts.
    There was also a set of potentially missed mnt_want_write()s from
    dentry_open() calls.

    This patch provides a very simple debugging framework to catch these kinds of
    bugs. It will WARN_ON() them, but should stop us from having any oopses or
    mnt_writer count imbalances.

    I'm quite convinced that this is a good thing because it found bugs in the
    stuff I was working on as soon as I wrote it.

    [hch: made it conditional on a debug option.
    But it's still a little bit too ugly]

    [hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]

    Signed-off-by: Dave Hansen
    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Hansen
     
  • The emergency remount code forcibly removes FMODE_WRITE from
    filps. The r/o bind mount code notices that this was done
    without a proper mnt_drop_write() and properly gives a
    warning.

    This patch does a mnt_drop_write() to keep everything
    balanced.

    Acked-by: Al Viro
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Hansen
    Signed-off-by: Al Viro

    Dave Hansen
     

25 Mar, 2008

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    [PATCH] get stack footprint of pathname resolution back to relative sanity
    [PATCH] double iput() on failure exit in hugetlb
    [PATCH] double dput() on failure exit in tiny-shmem
    [PATCH] fix up new filp allocators
    [PATCH] check for null vfsmount in dentry_open()
    [PATCH] reiserfs: eliminate private use of struct file in xattr
    [PATCH] sanitize hppfs
    hppfs pass vfsmount to dentry_open()
    [PATCH] restore export of do_kern_mount()

    Linus Torvalds
     

20 Mar, 2008

1 commit

  • Fix kernel-doc notation warnings in fs/.

    Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line:
    * mark_files_ro
    Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
    * lease_get_mtime
    Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
    * lease_get_mtime
    Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line:
    * lookup_one_len: filesystem helper to lookup single pathname component
    Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line:
    * bh_uptodate_or_lock: Test whether the buffer is uptodate
    Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line:
    * bh_submit_read: Submit a locked buffer for reading
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line:
    * writeback_acquire: attempt to get exclusive writeback access to a device
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line:
    * writeback_in_progress: determine whether there is writeback in progress
    Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line:
    * writeback_release: relinquish exclusive writeback access against a device.
    Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections
    Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections
    Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line:
    * void journal_invalidatepage()

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

18 Mar, 2008

1 commit

  • vfs_kern_mount() requires having a reference to fs type, which
    makes it impossible for module to create procfs, etc. private
    mount. Open-coding is not an option, since e.g. put_filesystem()
    is _not_ exported, and for a good reason.

    Signed-off-by: Al Viro

    Al Viro
     

06 Mar, 2008

1 commit

  • Introduce new LSM interfaces to allow an FS to deal with their own mount
    options. This includes a new string parsing function exported from the
    LSM that an FS can use to get a security data blob and a new security
    data blob. This is particularly useful for an FS which uses binary
    mount data, like NFS, which does not pass strings into the vfs to be
    handled by the loaded LSM. Also fix a BUG() in both SELinux and SMACK
    when dealing with binary mount data. If the binary mount data is less
    than one page the copy_page() in security_sb_copy_data() can cause an
    illegal page fault and boom. Remove all NFSisms from the SELinux code
    since they were broken by past NFS changes.

    Signed-off-by: Eric Paris
    Acked-by: Stephen Smalley
    Acked-by: Casey Schaufler
    Signed-off-by: James Morris

    Eric Paris
     

09 Feb, 2008

2 commits

  • Turn off quotas before filesystem is remounted read only. Otherwise quota
    will try to write to read-only filesystem which does no good... We could
    also just refuse to remount ro when quota is enabled but turning quota off
    is consistent with what we do on umount.

    Signed-off-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Add a new s_options field to struct super_block. Filesystems can save
    mount options passed to them in mount or remount. It is automatically
    freed when the superblock is destroyed.

    A new helper function, generic_show_options() is introduced, which uses
    this field to display the mount options in /proc/mounts.

    Another helper function, save_mount_options() may be used by
    filesystems to save the options in the super block.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

20 Oct, 2007

5 commits

  • * Convert files to UTF-8.

    * Also correct some people's names
    (one example is Eißfeldt, which was found in a source file.
    Given that the author used an ß at all in a source file
    indicates that the real name has in fact a 'ß' and not an 'ss',
    which is commonly used as a substitute for 'ß' when limited to
    7bit.)

    * Correct town names (Goettingen -> Göttingen)

    * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Adrian Bunk

    Jan Engelhardt
     
  • Fix the various misspellings of "system", controller", "interrupt" and
    "[un]necessary".

    Signed-off-by: Robert P. J. Day
    Signed-off-by: Adrian Bunk

    Robert P. J. Day
     
  • This flag tells the .get_sb callback that this is a kern_mount() call so that
    it can trust *data pointer to be valid in-kernel one. If this flag is passed
    from the user process, it is cleared since the *data pointer is not a valid
    kernel object.

    Running a few steps forward - this will be needed for proc to create the
    superblock and store a valid pid namespace on it during the namespace
    creation. The reason, why the namespace cannot live without proc mount is
    described in the appropriate patch.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • fs/super.c: use list_for_each_entry() instead of list_for_each() in
    sget()

    [akpm@linux-foundation.org: clean up some crap while we're there]
    Signed-off-by: Matthias Kaehlcke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
  • Declarations go into headers.

    Signed-off-by: Miklos Szeredi
    Cc: Ram Pai
    Acked-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

17 Oct, 2007

1 commit

  • Current -mm tree has bucketful of bug fixes in periodic writeback path.
    However, we still hit a glitch where dirty pages on a given inode aren't
    completely flushed to the disk, and system will accumulate large amount of
    dirty pages beyond what dirty_expire_interval is designed for.

    The problem is __sync_single_inode() will move an inode to sb->s_dirty list
    even when there are more pending dirty pages on that inode. If there is
    another inode with a small number of dirty pages, we hit a case where the loop
    iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
    Thus leaving the inode that has large amount of dirty pages behind and it has
    to wait for another dirty_writeback_interval before we flush it again. We
    effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
    If the rate of dirtying is sufficiently high, the system will start
    accumulate a large number of dirty pages.

    So fix it by having another sb->s_more_io list on which to park the inode
    while we iterate through sb->s_io and to allow each dirty inode which resides
    on that sb to have an equal chance of flushing some amount of dirty pages.

    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

17 Jul, 2007

1 commit

  • I was seeing a null pointer deref in fs/super.c:vfs_kern_mount().
    Some file system get_sb() handler was returning NULL mnt_sb with
    a non-negative return value. I also noticed a "hugetlbfs: Bad
    mount option:" message in the log.

    Turns out that hugetlbfs_parse_options() was not checking for an
    empty option string after call to strsep(). On failure,
    hugetlbfs_parse_options() returns 1. hugetlbfs_fill_super() just
    passed this return code back up the call stack where
    vfs_kern_mount() missed the error and proceeded with a NULL mnt_sb.

    Apparently introduced by patch:
    hugetlbfs-use-lib-parser-fix-docs.patch

    The problem was exposed by this line in my fstab:

    none /huge hugetlbfs defaults 0 0

    It can also be demonstrated by invoking mount of hugetlbfs
    directly with no options or a bogus option.

    This patch:

    1) adds the check for empty option to hugetlbfs_parse_options(),
    2) enhances the error message to bracket any unrecognized
    option with quotes ,
    3) modifies hugetlbfs_parse_options() to return -EINVAL on any
    unrecognized option,
    4) adds a BUG_ON() to vfs_kern_mount() to catch any get_sb()
    handler that returns a NULL mnt->mnt_sb with a return value
    >= 0.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

09 May, 2007

1 commit

  • There's a slight problem with filesystem type representation in fuse
    based filesystems.

    From the kernel's view, there are just two filesystem types: fuse and
    fuseblk. From the user's view there are lots of different filesystem
    types. The user is not even much concerned if the filesystem is fuse based
    or not. So there's a conflict of interest in how this should be
    represented in fstab, mtab and /proc/mounts.

    The current scheme is to encode the real filesystem type in the mount
    source. So an sshfs mount looks like this:

    sshfs#user@server:/ /mnt/server fuse rw,nosuid,nodev,...

    This url-ish syntax works OK for sshfs and similar filesystems. However
    for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
    the kernel expects the mount source to be a real device name.

    A possibly better scheme would be to encode the real type in the type
    field as "type.subtype". So fuse mounts would look like this:

    /dev/hda1 /mnt/windows fuseblk.ntfs-3g rw,...
    user@server:/ /mnt/server fuse.sshfs rw,nosuid,nodev,...

    This patch adds the necessary code to the kernel so that this can be
    correctly displayed in /proc/mounts.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

28 Apr, 2007

1 commit


13 Feb, 2007

1 commit


12 Jan, 2007

1 commit

  • Revert bd_mount_mutex back to a semaphore so that xfs_freeze -f /mnt/newtest;
    xfs_freeze -u /mnt/newtest works safely and doesn't produce lockdep warnings.

    (XFS unlocks the semaphore from a different task, by design. The mutex
    code warns about this)

    Signed-off-by: Dave Chinner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Chinner
     

09 Dec, 2006

1 commit

  • This patch changes struct file to use struct path instead of having
    independent pointers to struct dentry and struct vfsmount, and converts all
    users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

    Additionally, it adds two #define's to make the transition easier for users of
    the f_dentry and f_vfsmnt.

    Signed-off-by: Josef "Jeff" Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef "Jeff" Sipek
     

04 Dec, 2006

1 commit


12 Oct, 2006

1 commit

  • The attached patch destroys all the dentries attached to a superblock in one go
    by:

    (1) Destroying the tree rooted at s_root.

    (2) Destroying every entry in the anon list, one at a time.

    (3) Each entry in the anon list has its subtree consumed from the leaves
    inwards.

    This reduces the amount of work generic_shutdown_super() does, and avoids
    iterating through the dentry_unused list.

    Note that locking is almost entirely absent in the shrink_dcache_for_umount*()
    functions added by this patch. This is because:

    (1) at the point the filesystem calls generic_shutdown_super(), it is not
    permitted to further touch the superblock's set of dentries, and nor may
    it remove aliases from inodes;

    (2) the dcache memory shrinker now skips dentries that are being unmounted;
    and

    (3) the superblock no longer has any external references through which the VFS
    can reach it.

    Given these points, the only locking we need to do is when we remove dentries
    from the unused list and the name hashes, which we do a directory's worth at a
    time.

    We also don't need to guard against reference counts going to zero unexpectedly
    and removing bits of the tree we're working on as nothing else can call dput().

    A cut down version of dentry_iput() has been folded into
    shrink_dcache_for_umount_subtree() function. Apart from not needing to unlock
    things, it also doesn't need to check for inotify watches.

    In this version of the patch, the complaint about a dentry still being in use
    has been expanded from a single BUG_ON() and now gives much more information.

    Signed-off-by: David Howells
    Acked-by: NeilBrown
    Acked-by: Ian Kent
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     

01 Oct, 2006

2 commits

  • Make it possible to disable the block layer. Not all embedded devices require
    it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
    the block layer to be present.

    This patch does the following:

    (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
    support.

    (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
    an item that uses the block layer. This includes:

    (*) Block I/O tracing.

    (*) Disk partition code.

    (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

    (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
    block layer to do scheduling. Some drivers that use SCSI facilities -
    such as USB storage - end up disabled indirectly from this.

    (*) Various block-based device drivers, such as IDE and the old CDROM
    drivers.

    (*) MTD blockdev handling and FTL.

    (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
    taking a leaf out of JFFS2's book.

    (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
    linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
    however, still used in places, and so is still available.

    (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
    parts of linux/fs.h.

    (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

    (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

    (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
    is not enabled.

    (*) fs/no-block.c is created to hold out-of-line stubs and things that are
    required when CONFIG_BLOCK is not set:

    (*) Default blockdev file operations (to give error ENODEV on opening).

    (*) Makes some /proc changes:

    (*) /proc/devices does not list any blockdevs.

    (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

    (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

    (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
    given command other than Q_SYNC or if a special device is specified.

    (*) In init/do_mounts.c, no reference is made to the blockdev routines if
    CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.

    (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
    error ENOSYS by way of cond_syscall if so).

    (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
    CONFIG_BLOCK is not set, since they can't then happen.

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     
  • Move some functions out of the buffering code that aren't strictly buffering
    specific. This is a precursor to being able to disable the block layer.

    (*) Moved some stuff out of fs/buffer.c:

    (*) The file sync and general sync stuff moved to fs/sync.c.

    (*) The superblock sync stuff moved to fs/super.c.

    (*) do_invalidatepage() moved to mm/truncate.c.

    (*) try_to_release_page() moved to mm/filemap.c.

    (*) Moved some related declarations between header files:

    (*) declarations for do_invalidatepage() and try_to_release_page() moved
    to linux/mm.h.

    (*) __set_page_dirty_buffers() moved to linux/buffer_head.h.

    Signed-Off-By: David Howells
    Signed-off-by: Jens Axboe

    David Howells
     

30 Sep, 2006

1 commit

  • grab_super gets called with sb_lock held, and releases it. Add a lock
    annotation to this function so that sparse can check callers for lock
    pairing, and so that sparse will not complain about this function since it
    intentionally uses the lock in this manner.

    Signed-off-by: Josh Triplett
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     

07 Sep, 2006

1 commit


04 Jul, 2006

2 commits

  • The s_umount rwsem needs to be classified as per-superblock since it's
    perfectly legit to keep multiple of those recursively in the VFS locking
    rules.

    Has no effect on non-lockdep kernels.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • Teach special (per-filesystem) locking code to the lock validator.

    Minimal effect on non-lockdep kernels: one extra parameter to alloc_super().

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jul, 2006

1 commit


25 Jun, 2006

1 commit


23 Jun, 2006

3 commits

  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's now no need to return the
    superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience function is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing are currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional bit of
    changing ext2_* into foo_* in the documentation.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The race is that the shrink_dcache_memory shrinker could get called while a
    filesystem is being unmounted, and could try to prune a dentry belonging to
    that filesystem.

    If it does, then it will call in to iput on the inode while the dentry is
    no longer able to be found by the umounting process. If iput takes a
    while, generic_shutdown_super could get all the way though
    shrink_dcache_parent and shrink_dcache_anon and invalidate_inodes without
    ever waiting on this particular inode.

    Eventually the superblock gets freed anyway and if the iput tried to touch
    it (which some filesystems certainly do), it will lose. The promised
    "Self-destruct in 5 seconds" doesn't lead to a nice day.

    The race is closed by holding s_umount while calling prune_one_dentry on
    someone else's dentry. As a down_read_trylock is used,
    shrink_dcache_memory will no longer try to prune the dentry of a filesystem
    that is being unmounted, and unmount will not be able to start until any
    such active prune_one_dentry completes.

    This requires that prune_dcache *knows* which filesystem (if any) it is
    doing the prune on behalf of so that it can be careful of other
    filesystems. shrink_dcache_memory isn't called it on behalf of any
    filesystem, and so is careful of everything.

    shrink_dcache_anon is now passed a super_block rather than the s_anon list
    out of the superblock, so it can get the s_anon list itself, and can pass
    the superblock down to prune_dcache.

    If prune_dcache finds a dentry that it cannot free, it leaves it where it
    is (at the tail of the list) and exits, on the assumption that some other
    thread will be removing that dentry soon. To try to make sure that some
    work gets done, a limited number of dnetries which are untouchable are
    skipped over while choosing the dentry to work on.

    I believe this race was first found by Kirill Korotaev.

    Cc: Jan Blunck
    Acked-by: Kirill Korotaev
    Cc: Olaf Hering
    Acked-by: Balbir Singh
    Signed-off-by: Neil Brown
    Signed-off-by: Balbir Singh
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

09 Jun, 2006

2 commits


27 Mar, 2006

1 commit

  • Semaphore to mutex conversion.

    The conversion was generated via scripts, and the result was validated
    automatically via a script as well.

    Signed-off-by: Ingo Molnar
    Cc: Eric Van Hensbergen
    Cc: Robert Love
    Cc: Thomas Gleixner
    Cc: David Woodhouse
    Cc: Neil Brown
    Cc: Trond Myklebust
    Cc: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar