20 Jun, 2011

2 commits


16 Jun, 2011

2 commits

  • [Kudos to dhowells for tracking that crap down]

    If two processes attempt to cause automounting on the same mountpoint at the
    same time, the vfsmount holding the mountpoint will be left with one too few
    references on it, causing a BUG when the kernel tries to clean up.

    The problem is that lock_mount() drops the caller's reference to the
    mountpoint's vfsmount in the case where it finds something already mounted on
    the mountpoint as it transits to the mounted filesystem and replaces path->mnt
    with the new mountpoint vfsmount.

    During a pathwalk, however, we don't take a reference on the vfsmount if it is
    the same as the one in the nameidata struct, but do_add_mount() doesn't know
    this.

    The fix is to make sure we have a ref on the vfsmount of the mountpoint before
    calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
    left with an extra ref on the mountpoint vfsmount which needs releasing.
    We can handle that in follow_managed() by not making assumptions about what
    we can and what we cannot get from lookup_mnt() as the current code does.

    The callers of follow_managed() expect that reference to path->mnt will be
    grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
    keep track of whether such reference has been grabbed and assume that it'll
    happen in those and only those cases that'll have us return with changed
    path->mnt. That assumption is almost correct - it breaks in case of
    racing automounts and in even harder to hit race between following a mountpoint
    and a couple of mount --move. The thing is, we don't need to make that
    assumption at all - after the end of loop in follow_manage() we can check
    if path->mnt has ended up unchanged and do mntput() if needed.

    The BUG can be reproduced with the following test program:

    #include
    #include
    #include
    #include
    #include
    int main(int argc, char **argv)
    {
    int pid, ws;
    struct stat buf;
    pid = fork();
    stat(argv[1], &buf);
    if (pid > 0) wait(&ws);
    return 0;
    }

    and the following procedure:

    (1) Mount an NFS volume that on the server has something else mounted on a
    subdirectory. For instance, I can mount / from my server:

    mount warthog:/ /mnt -t nfs4 -r

    On the server /data has another filesystem mounted on it, so NFS will see
    a change in FSID as it walks down the path, and will mark /mnt/data as
    being a mountpoint. This will cause the automount code to be triggered.

    !!! Do not look inside the mounted fs at this point !!!

    (2) Run the above program on a file within the submount to generate two
    simultaneous automount requests:

    /tmp/forkstat /mnt/data/testfile

    (3) Unmount the automounted submount:

    umount /mnt/data

    (4) Unmount the original mount:

    umount /mnt

    At this point the kernel should throw a BUG with something like the
    following:

    BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]

    Note that the bug appears on the root dentry of the original mount, not the
    mountpoint and not the submount because sys_umount() hasn't got to its final
    mntput_no_expire() yet, but this isn't so obvious from the call trace:

    [] shrink_dcache_for_umount+0x69/0x82
    [] generic_shutdown_super+0x37/0x15b
    [] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
    [] kill_anon_super+0x1d/0x7e
    [] nfs4_kill_super+0x60/0xb6 [nfs]
    [] deactivate_locked_super+0x34/0x83
    [] deactivate_super+0x6f/0x7b
    [] mntput_no_expire+0x18d/0x199
    [] mntput+0x3b/0x44
    [] release_mounts+0xa2/0xbf
    [] sys_umount+0x47a/0x4ba
    [] ? trace_hardirqs_on_caller+0x1fd/0x22f
    [] system_call_fastpath+0x16/0x1b

    as do_umount() is inlined. However, you can see release_mounts() in there.

    Note also that it may be necessary to have multiple CPU cores to be able to
    trigger this bug.

    Tested-by: Jeff Layton
    Tested-by: Ian Kent
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Al Viro
     
  • Git bisection shows that commit e6bc45d65df8599fdbae73be9cec4ceed274db53 causes
    BUG_ONs under high I/O load:

    kernel BUG at fs/inode.c:1368!
    [ 2862.501007] Call Trace:
    [ 2862.501007] [] d_kill+0xf8/0x140
    [ 2862.501007] [] dput+0xc9/0x190
    [ 2862.501007] [] fput+0x15f/0x210
    [ 2862.501007] [] filp_close+0x61/0x90
    [ 2862.501007] [] sys_close+0xb1/0x110
    [ 2862.501007] [] system_call_fastpath+0x16/0x1b

    A reliable way to reproduce this bug is:
    Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk,
    and apt-get remove openjdk-6-jdk.

    The buggy part of the patch is this:
    struct inode *inode = NULL;
    .....
    - if (nd.last.name[nd.last.len])
    - goto slashes;
    inode = dentry->d_inode;
    - if (inode)
    - ihold(inode);
    + if (nd.last.name[nd.last.len] || !inode)
    + goto slashes;
    + ihold(inode)
    ...
    if (inode)
    iput(inode); /* truncate the inode here */

    If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken),
    and dentry->d_inode is non-NULL, then this code now does an additional iput on
    the inode, which is wrong.

    Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0.

    Reference: https://lkml.org/lkml/2011/6/15/50
    Reported-by: Norbert Preining
    Reported-by: Török Edwin
    Cc: "Theodore Ts'o"
    Cc: Al Viro
    Signed-off-by: Török Edwin
    Signed-off-by: Al Viro

    Török Edwin
     

07 Jun, 2011

1 commit


30 May, 2011

1 commit


27 May, 2011

4 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • ... and kill a useless local variable in follow_dotdot_rcu(), while
    we are at it - follow_mount_rcu(nd, path, inode) *always* assigned
    value to *inode, and always it had been path->dentry->d_inode (aka
    nd->path.dentry->d_inode, since it always got &nd->path as the second
    argument).

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (25 commits)
    cifs: remove unnecessary dentry_unhash on rmdir/rename_dir
    ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir
    exofs: remove unnecessary dentry_unhash on rmdir/rename_dir
    nfs: remove unnecessary dentry_unhash on rmdir/rename_dir
    ext2: remove unnecessary dentry_unhash on rmdir/rename_dir
    ext3: remove unnecessary dentry_unhash on rmdir/rename_dir
    ext4: remove unnecessary dentry_unhash on rmdir/rename_dir
    btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
    ceph: remove unnecessary dentry_unhash calls
    vfs: clean up vfs_rename_other
    vfs: clean up vfs_rename_dir
    vfs: clean up vfs_rmdir
    vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems
    libfs: drop unneeded dentry_unhash
    vfs: update dentry_unhash() comment
    vfs: push dentry_unhash on rename_dir into file systems
    vfs: push dentry_unhash on rmdir into file systems
    vfs: remove dget() from dentry_unhash()
    vfs: dentry_unhash immediately prior to rmdir
    vfs: Block mmapped writes while the fs is frozen
    ...

    Linus Torvalds
     

26 May, 2011

11 commits

  • Simplify control flow to match vfs_rename_dir.

    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • Simplify control flow through vfs_rename_dir.

    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • Simplify the control flow with an out label.

    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • vfs_rename_dir() doesn't properly account for filesystems with
    FS_RENAME_DOES_D_MOVE. If new_dentry has a target inode attached, it
    unhashes the new_dentry prior to the rename() iop and rehashes it after,
    but doesn't account for the possibility that rename() may have swapped
    {old,new}_dentry. For FS_RENAME_DOES_D_MOVE filesystems, it rehashes
    new_dentry (now the old renamed-from name, which d_move() expected to go
    away), such that a subsequent lookup will find it. Currently all
    FS_RENAME_DOES_D_MOVE filesystems compensate for this by failing in
    d_revalidate.

    The bug was introduced by: commit 349457ccf2592c14bdf13b6706170ae2e94931b1
    "[PATCH] Allow file systems to manually d_move() inside of ->rename()"

    Fix by not rehashing the new dentry. Rehashing used to be needed by
    d_move() but isn't anymore.

    Reported-by: Sage Weil
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • The helper is now only called by file systems, not the VFS.

    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • Only a few file systems need this. Start by pushing it down into each
    rename method (except gfs2 and xfs) so that it can be dealt with on a
    per-fs basis.

    Acked-by: Christoph Hellwig
    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • Only a few file systems need this. Start by pushing it down into each
    fs rmdir method (except gfs2 and xfs) so it can be dealt with on a per-fs
    basis.

    This does not change behavior for any in-tree file systems.

    Acked-by: Christoph Hellwig
    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • This serves no useful purpose that I can discern. All callers (rename,
    rmdir) hold their own reference to the dentry.

    A quick audit of all file systems showed no relevant checks on the value
    of d_count in vfs_rmdir/vfs_rename_dir paths.

    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • This presumes that there is no reason to unhash a dentry if we fail because
    it is a mountpoint or the LSM check fails, and that the LSM checks do not
    depend on the dentry being unhashed.

    Signed-off-by: Sage Weil
    Signed-off-by: Al Viro

    Sage Weil
     
  • new helper: complete_walk(). Done on successful completion
    of walk, drops out of RCU mode, does d_revalidate of final
    result if that hadn't been done already.

    handle_reval_dot() and nameidata_drop_rcu_last() subsumed into
    that one; callers converted to use of complete_walk().

    Signed-off-by: Al Viro

    Al Viro
     
  • Merge these into a single function (unlazy_walk(nd, dentry)),
    kill ..._maybe variants

    Signed-off-by: Al Viro

    Al Viro
     

21 May, 2011

1 commit


14 May, 2011

1 commit

  • It's a hot function, and we're better off not mixing types in the mask
    calculations. The compiler just ends up mixing 16-bit and 32-bit
    operations, for no good reason.

    So do everything in 'unsigned int' rather than mixing 'unsigned int'
    masking with a 'umode_t' (16-bit) mode variable.

    This, together with the parent commit (47a150edc2ae: "Cache user_ns in
    struct cred") makes acl_permission_check() much nicer.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Apr, 2011

1 commit

  • During RCU walk in path_lookupat and path_openat, the rcu lookup
    frequently failed if looking up an absolute path, because when root
    directory was looked up, seq number was not properly set in nameidata.

    We dropped out of RCU walk in nameidata_drop_rcu due to mismatch in
    directory entry's seq number. We reverted to slow path walk that need
    to take references.

    With the following patch, I saw a 50% increase in an exim mail server
    benchmark throughput on a 4-socket Nehalem-EX system.

    Signed-off-by: Tim Chen
    Reviewed-by: Andi Kleen
    Cc: stable@kernel.org (v2.6.38)
    Signed-off-by: Linus Torvalds

    Tim Chen
     

31 Mar, 2011

1 commit


25 Mar, 2011

1 commit


24 Mar, 2011

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    deal with races in /proc/*/{syscall,stack,personality}
    proc: enable writing to /proc/pid/mem
    proc: make check_mem_permission() return an mm_struct on success
    proc: hold cred_guard_mutex in check_mem_permission()
    proc: disable mem_write after exec
    mm: implement access_remote_vm
    mm: factor out main logic of access_process_vm
    mm: use mm_struct to resolve gate vma's in __get_user_pages
    mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
    mm: arch: make in_gate_area take an mm_struct instead of a task_struct
    mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
    x86: mark associated mm when running a task in 32 bit compatibility mode
    x86: add context tag to mark mm when running a task in 32-bit compatibility mode
    auxv: require the target to be tracable (or yourself)
    close race in /proc/*/environ
    report errors in /proc/*/*map* sanely
    pagemap: close races with suid execve
    make sessionid permissions in /proc/*/task/* match those in /proc/*
    fix leaks in path_lookupat()

    Fix up trivial conflicts in fs/proc/base.c

    Linus Torvalds
     
  • And give it a kernel-doc comment.

    [akpm@linux-foundation.org: btrfs changed in linux-next]
    Signed-off-by: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Cheat for now and say all files belong to init_user_ns. Next step will be
    to let superblocks belong to a user_ns, and derive inode_userns(inode)
    from inode->i_sb->s_user_ns. Finally we'll introduce more flexible
    arrangements.

    Changelog:
    Feb 15: make is_owner_or_cap take const struct inode
    Feb 23: make is_owner_or_cap bool

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

23 Mar, 2011

1 commit


18 Mar, 2011

2 commits


16 Mar, 2011

8 commits