18 Sep, 2013

1 commit


17 Sep, 2013

1 commit

  • If O_CREAT|O_EXCL are passed to open, then we know that either

    - the file is successfully created, or
    - the operation fails in some way.

    So previously we set FILE_CREATED before calling ->atomic_open() so the
    filesystem doesn't have to. This, however, led to bugs in the
    implementation that went unnoticed when the filesystem didn't check for
    existence, yet returned success. To prevent this kind of bug, require
    filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and
    verify this in the VFS.

    Also added a couple more verifications for the result of atomic_open():

    - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT.
    - Warn if filesystem set FILE_CREATED but gave a negative dentry.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

11 Sep, 2013

5 commits

  • Signed-off-by: Dave Jones
    Signed-off-by: Al Viro

    Dave Jones
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • For a long time no filesystem has been using vfs_follow_link, and as seen
    by recent filesystem submissions any new use is accidental as well.

    Remove vfs_follow_link, document the replacement in
    Documentation/filesystems/porting and also rename __vfs_follow_link
    to match its only caller better.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Pull vfs pile 3 (of many) from Al Viro:
    "Waiman's conversion of d_path() and bits related to it,
    kern_path_mountpoint(), several cleanups and fixes (exportfs
    one is -stable fodder, IMO).

    There definitely will be more... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
    dcache: Translating dentry into pathname without taking rename_lock
    autofs4 - fix device ioctl mount lookup
    introduce kern_path_mountpoint()
    rename user_path_umountat() to user_path_mountpoint_at()
    take unlazy_walk() into umount_lookup_last()
    Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
    prune_super(): sb->s_op is never NULL
    exportfs: don't assume that ->iterate() won't feed us too long entries
    afs: get rid of redundant ->d_name.len checks

    Linus Torvalds
     
  • When I moved the RCU walk termination into unlazy_walk(), I didn't copy
    quite all of it: for the successful RCU termination we properly add the
    necessary reference counts to our temporary copy of the root path, but
    for the failure case we need to make sure that any temporary root path
    information is cleared out (since it does _not_ have the proper
    reference counts from the RCU lookup).

    We could clean up this mess by just always dropping the temporary root
    information, but Al points out that that would mean that a single lookup
    through symlinks could see multiple different root entries if it races
    with another thread doing chroot. Not that I think we should really
    care (we had that before too, back before we had a copy of the root path
    in the nameidata).

    Al says he has a cunning plan. In the meantime, this is the minimal fix
    for the problem, even if it's not all that pretty.

    Reported-by: Mace Moneta
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Sep, 2013

5 commits

  • This is the fix that the last two commits indirectly led up to - making
    sure that we don't call dput() in a bad context on the dentries we've
    looked up in RCU mode after the sequence count validation fails.

    This basically expands d_rcu_to_refcount() into the callers, and then
    fixes the callers to delay the dput() in the failure case until _after_
    we've dropped all locks and are no longer in an RCU-locked region.

    The case of 'complete_walk()' was trivial, since its failure case did
    the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
    and as such that is just a pure expansion of the function with a trivial
    movement of the resulting dput() to after 'unlock_rcu_walk()'.

    In contrast, the unlazy_walk() case was much more complicated, because
    not only does convert two different dentries from RCU to be reference
    counted, but it used to not call unlock_rcu_walk() at all, and instead
    just returned an error and let the caller clean everything up in
    "terminate_walk()".

    Happily, one of the dentries in question (called "parent" inside
    unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
    a refcount to anyway for the non-RCU case.

    So what the new and improved unlazy_walk() does is to first turn that
    dentry into a refcounted one, and once that is set up, the error cases
    can continue to use the terminate_walk() helper for cleanup, but for the
    non-RCU case. Which makes it possible to drop out of RCU mode if we
    actually hit the sequence number failure case.

    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • ... and move the extern from linux/namei.h to fs/internal.h,
    along with that of vfs_path_lookup().

    Signed-off-by: Al Viro

    Al Viro
     
  • ... and massage it a bit to reduce nesting

    Signed-off-by: Al Viro

    Al Viro
     
  • This simplifies the RCU to refcounting code in particular.

    I was originally intending to leave this for later, but walking through
    all the dput() logic (see previous commit), I realized that the dput()
    "might_sleep()" check was misleadingly weak. And I removed it as
    misleading, both for performance profiling and for debugging.

    However, the might_sleep() debugging case is actually true: the final
    dput() can indeed sleep, if the inode of the dentry that you are
    releasing ends up sleeping at iput time (see dentry_iput()). So the
    problem with the might_sleep() in dput() wasn't that it wasn't true, it
    was that it wasn't actually testing and triggering on the interesting
    case.

    In particular, just about *any* dput() can indeed sleep, if you happen
    to race with another thread deleting the file in question, and you then
    lose the race to the be the last dput() for that file. But because it's
    a very rare race, the debugging code would never trigger it in practice.

    Why is this problematic? The new d_rcu_to_refcount() (see commit
    15570086b590: "vfs: reimplement d_rcu_to_refcount() using
    lockref_get_or_lock()") does a dput() for the failure case, and it does
    it under the RCU lock. So potentially sleeping really is a bug.

    But there's no way I'm going to fix this with the previous complicated
    "lockref_get_or_lock()" interface. And rather than revert to the old
    and crufty nested dentry locking code (which did get this right by
    delaying the reference count updates until they were verified to be
    safe), let's make forward progress.

    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Sep, 2013

1 commit

  • Pull vfs pile 1 from Al Viro:
    "Unfortunately, this merge window it'll have a be a lot of small piles -
    my fault, actually, for not keeping #for-next in anything that would
    resemble a sane shape ;-/

    This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
    cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
    components) + several long-standing patches from various folks.

    There definitely will be a lot more (starting with Miklos'
    check_submount_and_drop() series)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    direct-io: Handle O_(D)SYNC AIO
    direct-io: Implement generic deferred AIO completions
    add formats for dentry/file pathnames
    kvm eventfd: switch to fdget
    powerpc kvm: use fdget
    switch fchmod() to fdget
    switch epoll_ctl() to fdget
    switch copy_module_from_fd() to fdget
    git simplify nilfs check for busy subtree
    ibmasmfs: don't bother passing superblock when not needed
    don't pass superblock to hypfs_{mkdir,create*}
    don't pass superblock to hypfs_diag_create_files
    don't pass superblock to hypfs_vm_create_files()
    oprofile: get rid of pointless forward declarations of struct super_block
    oprofilefs_create_...() do not need superblock argument
    oprofilefs_mkdir() doesn't need superblock argument
    don't bother with passing superblock to oprofile_create_stats_files()
    oprofile: don't bother with passing superblock to ->create_files()
    don't bother passing sb to oprofile_create_files()
    coh901318: don't open-code simple_read_from_buffer()
    ...

    Linus Torvalds
     

04 Sep, 2013

1 commit

  • Christopher reported a regression where he was unable to unmount a NFS
    filesystem where the root had gone stale. The problem is that
    d_revalidate handles the root of the filesystem differently from other
    dentries, but d_weak_revalidate does not. We could simply fix this by
    making d_weak_revalidate return success on IS_ROOT dentries, but there
    are cases where we do want to revalidate the root of the fs.

    A umount is really a special case. We generally aren't interested in
    anything but the dentry and vfsmount that's attached at that point. If
    the inode turns out to be stale we just don't care since the intent is
    to stop using it anyway.

    Try to handle this situation better by treating umount as a special
    case in the lookup code. Have it resolve the parent using normal
    means, and then do a lookup of the final dentry without revalidating
    it. In most cases, the final lookup will come out of the dcache, but
    the case where there's a trailing symlink or !LAST_NORM entry on the
    end complicates things a bit.

    Cc: Neil Brown
    Reported-by: Christopher T Vogan
    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

03 Sep, 2013

1 commit

  • This moves __d_rcu_to_refcount() from into fs/namei.c
    and re-implements it using the lockref infrastructure instead. It also
    adds a lot of comments about what is actually going on, because turning
    a dentry that was looked up using RCU into a long-lived reference
    counted entry is one of the more subtle parts of the rcu walk.

    We also used to be _particularly_ subtle in unlazy_walk() where we
    re-validate both the dentry and its parent using the same sequence
    count. We used to do it by nesting the locks and then verifying the
    sequence count just once.

    That was silly, because nested locking is expensive, but the sequence
    count check is not. So this just re-validates the dentry and the parent
    separately, avoiding the nested locking, and making the lockref lookup
    possible.

    Acked-by: Waiman Long
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Aug, 2013

2 commits

  • This just replaces the dentry count/lock combination with the lockref
    structure that contains both a count and a spinlock, and does the
    mechanical conversion to use the lockref infrastructure.

    There are no semantic changes here, it's purely syntactic. The
    reference lockref implementation uses the spinlock exactly the same way
    that the old dcache code did, and the bulk of this patch is just
    expanding the internal "d_count" use in the dcache code to use
    "d_lockref.count" instead.

    This is purely preparation for the real change to make the reference
    count updates be lockless during the 3.12 merge window.

    [ As with the previous commit, this is a rewritten version of a concept
    originally from Waiman, so credit goes to him, blame for any errors
    goes to me.

    Waiman's patch had some semantic differences for taking advantage of
    the lockless update in dget_parent(), while this patch is
    intentionally a pure search-and-replace change with no semantic
    changes. - Linus ]

    Signed-off-by: Waiman Long
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • This reverts commit bb2314b47996491bbc5add73633905c3120b6268.

    It wasn't necessarily wrong per se, but we're still busily discussing
    the exact details of this all, so I'm going to revert it for now.

    It's true that you can already do flink() through /proc and that flink()
    isn't new. But as Brad Spengler points out, some secure environments do
    not mount proc, and flink adds a new interface that can avoid path
    lookup of the source for those kinds of environments.

    We may re-do this (and even mark it for stable backporting back in 3.11
    and possibly earlier) once the whole discussion about the interface is done.

    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: Brad Spengler
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Aug, 2013

1 commit

  • Every now and then someone proposes a new flink syscall, and this spawns
    a long discussion of whether it would be a security problem. I think
    that this is missing the point: flink is *already* allowed without
    privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.

    Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
    write it, and link it in is very convenient. The only problem is that
    it requires that /proc be mounted so that you can do:

    linkat(AT_FDCWD, "/proc/self/fd/", dfd, path, AT_SYMLINK_NOFOLLOW)

    This sucks -- it's much nicer to do:

    linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)

    Let's allow it.

    If this turns out to be excessively scary, it we could instead require
    that the inode in question be I_LINKABLE, but this seems pointless given
    the /proc situation

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Al Viro

    Andy Lutomirski
     

13 Jul, 2013

1 commit

  • [suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE;
    that will fail on old kernels in a lot more cases than what I came up with.
    And make sure O_CREAT doesn't get there...

    Signed-off-by: Al Viro

    Al Viro
     

29 Jun, 2013

5 commits


15 Jun, 2013

1 commit


12 May, 2013

1 commit

  • Pull audit changes from Eric Paris:
    "Al used to send pull requests every couple of years but he told me to
    just start pushing them to you directly.

    Our touching outside of core audit code is pretty straight forward. A
    couple of interface changes which hit net/. A simple argument bug
    calling audit functions in namei.c and the removal of some assembly
    branch prediction code on ppc"

    * git://git.infradead.org/users/eparis/audit: (31 commits)
    audit: fix message spacing printing auid
    Revert "audit: move kaudit thread start from auditd registration to kaudit init"
    audit: vfs: fix audit_inode call in O_CREAT case of do_last
    audit: Make testing for a valid loginuid explicit.
    audit: fix event coverage of AUDIT_ANOM_LINK
    audit: use spin_lock in audit_receive_msg to process tty logging
    audit: do not needlessly take a lock in tty_audit_exit
    audit: do not needlessly take a spinlock in copy_signal
    audit: add an option to control logging of passwords with pam_tty_audit
    audit: use spin_lock_irqsave/restore in audit tty code
    helper for some session id stuff
    audit: use a consistent audit helper to log lsm information
    audit: push loginuid and sessionid processing down
    audit: stop pushing loginid, uid, sessionid as arguments
    audit: remove the old depricated kernel interface
    audit: make validity checking generic
    audit: allow checking the type of audit message in the user filter
    audit: fix build break when AUDIT_DEBUG == 2
    audit: remove duplicate export of audit_enabled
    Audit: do not print error when LSMs disabled
    ...

    Linus Torvalds
     

08 May, 2013

1 commit

  • Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
    In older kernels, creating a file with open(..., O_CREAT) created
    audit_name records that looked like this:

    type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
    type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

    ...in recent kernels though, they look like this:

    type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
    type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
    type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

    Richard bisected to determine that the problems started with commit
    bfcec708, but the log messages have changed with some later
    audit-related patches.

    The problem is that this audit_inode call is passing in the parent of
    the dentry being opened, but audit_inode is being called with the parent
    flag false. This causes later audit_inode and audit_inode_child calls to
    match the wrong entry in the audit_names list.

    This patch simply sets the flag to properly indicate that this inode
    represents the parent. With this, the audit_names entries are back to
    looking like they did before.

    Cc: # v3.7+
    Reported-by: Jiri Jaburek
    Signed-off-by: Jeff Layton
    Test By: Richard Guy Briggs
    Signed-off-by: Eric Paris

    Jeff Layton
     

09 Mar, 2013

1 commit


02 Mar, 2013

1 commit


26 Feb, 2013

1 commit

  • The following set of operations on a NFS client and server will cause

    server# mkdir a
    client# cd a
    server# mv a a.bak
    client# sleep 30 # (or whatever the dir attrcache timeout is)
    client# stat .
    stat: cannot stat `.': Stale NFS file handle

    Obviously, we should not be getting an ESTALE error back there since the
    inode still exists on the server. The problem is that the lookup code
    will call d_revalidate on the dentry that "." refers to, because NFS has
    FS_REVAL_DOT set.

    nfs_lookup_revalidate will see that the parent directory has changed and
    will try to reverify the dentry by redoing a LOOKUP. That of course
    fails, so the lookup code returns ESTALE.

    The problem here is that d_revalidate is really a bad fit for this case.
    What we really want to know at this point is whether the inode is still
    good or not, but we don't really care what name it goes by or whether
    the dcache is still valid.

    Add a new d_op->d_weak_revalidate operation and have complete_walk call
    that instead of d_revalidate. The intent there is to allow for a
    "weaker" d_revalidate that just checks to see whether the inode is still
    good. This is also gives us an opportunity to kill off the FS_REVAL_DOT
    special casing.

    [AV: changed method name, added note in porting, fixed confusion re
    having it possibly called from RCU mode (it won't be)]

    Cc: NeilBrown
    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

23 Feb, 2013

6 commits


21 Dec, 2012

4 commits