13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

05 Apr, 2014

1 commit

  • Pull file locking updates from Jeff Layton:
    "Highlights:

    - maintainership change for fs/locks.c. Willy's not interested in
    maintaining it these days, and is OK with Bruce and me taking it.
    - fix for open vs setlease race that Al ID'ed
    - cleanup and consolidation of file locking code
    - eliminate unneeded BUG() call
    - merge of file-private lock implementation"

    * 'locks-3.15' of git://git.samba.org/jlayton/linux:
    locks: make locks_mandatory_area check for file-private locks
    locks: fix locks_mandatory_locked to respect file-private locks
    locks: require that flock->l_pid be set to 0 for file-private locks
    locks: add new fcntl cmd values for handling file private locks
    locks: skip deadlock detection on FL_FILE_PVT locks
    locks: pass the cmd value to fcntl_getlk/getlk64
    locks: report l_pid as -1 for FL_FILE_PVT locks
    locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
    locks: rename locks_remove_flock to locks_remove_file
    locks: consolidate checks for compatible filp->f_mode values in setlk handlers
    locks: fix posix lock range overflow handling
    locks: eliminate BUG() call when there's an unexpected lock on file close
    locks: add __acquires and __releases annotations to locks_start and locks_stop
    locks: remove "inline" qualifier from fl_link manipulation functions
    locks: clean up comment typo
    locks: close potential race between setlease and open
    MAINTAINERS: update entry for fs/locks.c

    Linus Torvalds
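
To make the file-private lock series above more concrete, here is a minimal userspace sketch. It assumes the interface as it later appeared in mainline, where these locks are called open file description (OFD) locks and the command is named F_OFD_SETLK; the numeric fallback and the helper name try_ofd_write_lock are assumptions made for illustration, not part of the commits listed here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37 /* assumption: generic Linux value, for older libc headers */
#endif

/*
 * Try to take a non-blocking, whole-file write lock that is owned by the
 * open file description behind fd rather than by the calling process.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int try_ofd_write_lock(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl)); /* leaves l_pid at 0, as the series requires */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    return fcntl(fd, F_OFD_SETLK, &fl);
}
```

Because the lock belongs to the open file description, two descriptors obtained from separate open() calls on the same file should conflict with each other even inside one process, which is not the case for classic POSIX locks.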
     

02 Apr, 2014

3 commits


01 Apr, 2014

7 commits

  • If flags contain RENAME_EXCHANGE then exchange source and destination files.
    There's no restriction on the type of the files; e.g. a directory can be
    exchanged with a symlink.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: Jan Kara
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
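
A minimal userspace sketch of the exchange described above, assuming a kernel and filesystem that implement the flag. It calls the renameat2 syscall directly through syscall(2), since not every C library ships a wrapper; the helper name exchange_paths and the fallback flag definition are illustrative assumptions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>       /* AT_FDCWD */
#include <stdio.h>       /* RENAME_EXCHANGE on newer C libraries */
#include <sys/syscall.h> /* SYS_renameat2 */
#include <unistd.h>      /* syscall() */

#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1) /* same value as the kernel's uapi header */
#endif

/*
 * Atomically swap two existing paths (hypothetical helper).
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int exchange_paths(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
#ifdef SYS_renameat2
    return (int)syscall(SYS_renameat2, AT_FDCWD, a, AT_FDCWD, b,
                        RENAME_EXCHANGE);
#else
    errno = ENOSYS; /* headers too old to know the syscall number */
    return -1;
#endif
}
```

Both paths must already exist. A filesystem that does not handle the flag is expected to fail the call (typically with EINVAL) rather than fall back to a plain rename, so callers should be prepared for that.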
     
  • Add flags to security_path_rename() and security_inode_rename() hooks.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
     
  • If this flag is specified and the target of the rename exists then the
    rename syscall fails with EEXIST.

    The VFS does the existence checking, so it is trivial to enable for most
    local filesystems. This patch only enables it in ext4.

    For network filesystems the VFS check is not enough as there may be a race
    between a remote create and the rename, so these filesystems need to handle
    this flag in their ->rename() implementations to ensure atomicity.

    Andy writes about why this is useful:

    "The trivial answer: to eliminate the race condition from 'mv -i'.

    Another answer: there's a common pattern to atomically create a file
    with contents: open a temporary file, write to it, optionally fsync
    it, close it, then link(2) it to the final name, then unlink the
    temporary file.

    The reason to use link(2) is because it won't silently clobber the destination.

    This is annoying:
    - It requires an extra system call that shouldn't be necessary.
    - It doesn't work on (IMO sensible) filesystems that don't support
    hard links (e.g. vfat).
    - It's not atomic -- there's an intermediate state where both files exist.
    - It's ugly.

    The new rename flag will make this totally sensible.

    To be fair, on new enough kernels, you can also use O_TMPFILE and
    linkat to achieve the same thing even more cleanly."

    Suggested-by: Andy Lutomirski
    Signed-off-by: Miklos Szeredi
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
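
The "create a file with contents without clobbering" pattern Andy describes can be sketched as follows, replacing the link(2)/unlink(2) pair with a single rename that refuses to overwrite. This is a sketch under assumptions: the helper name publish_new_file is hypothetical, the flag value is taken from the kernel's uapi header when the C library does not define it, and error handling is kept minimal.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>       /* RENAME_NOREPLACE on newer C libraries */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0) /* same value as the kernel's uapi header */
#endif

/*
 * Write data to tmp, then move it to final only if final does not exist.
 * Returns 0 on success; on failure removes tmp and returns -1 with errno
 * set (EEXIST if final was already there).
 */
static int publish_new_file(const char *tmp, const char *final,
                            const void *data, size_t len)
{
    int fd, saved;
    ssize_t n;

    assert(tmp != NULL && final != NULL);
    fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    n = write(fd, data, len);
    if (n < 0 || (size_t)n != len || fsync(fd) != 0) {
        saved = (n >= 0 && (size_t)n != len) ? EIO : errno;
        close(fd);
        unlink(tmp);
        errno = saved;
        return -1;
    }
    close(fd);

#ifdef SYS_renameat2
    if (syscall(SYS_renameat2, AT_FDCWD, tmp, AT_FDCWD, final,
                RENAME_NOREPLACE) == 0)
        return 0;
    saved = errno;
#else
    saved = ENOSYS; /* headers too old to know the syscall number */
#endif
    unlink(tmp);    /* do not leave the temporary file behind */
    errno = saved;
    return -1;
}
```

Compared with the link(2) approach, there is no intermediate state in which both names exist, and the call also works on filesystems without hard links, provided they implement the flag.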
     
  • Add new renameat2 syscall, which is the same as renameat with an added
    flags argument.

    Pass flags to vfs_rename() and to i_op->rename() as well.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
     
  • There's actually very little difference between vfs_rename_dir() and
    vfs_rename_other() so move both inline into vfs_rename() which still stays
    reasonably readable.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
     
  • Move the d_move() in vfs_rename_dir() up, similarly to how it's done in
    vfs_rename_other(). The next patch will consolidate these two functions
    and this is the only structural difference between them.

    I'm not sure whether doing the d_move() after the dput() is even valid,
    though there may be a logical explanation for it. Either way, moving the
    d_move() before the dput() (and the mutex_unlock()) should definitely not hurt.

    Signed-off-by: Miklos Szeredi
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
     
  • Add d_is_dir(dentry) helper which is analogous to S_ISDIR().

    To avoid confusion, rename d_is_directory() to d_can_lookup().

    Signed-off-by: Miklos Szeredi
    Reviewed-by: J. Bruce Fields

    Miklos Szeredi
     

31 Mar, 2014

1 commit

  • As Trond pointed out, you can currently deadlock yourself by setting a
    file-private lock on a file that requires mandatory locking and then
    trying to do I/O on it.

    Avoid this problem by plumbing some knowledge of file-private locks into
    the mandatory locking code. In order to do this, we must pass down
    information about the struct file that's being used to
    locks_verify_locked.

    Reported-by: Trond Myklebust
    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields

    Jeff Layton
     

23 Mar, 2014

1 commit

  • We can get a false negative from __lookup_mnt() if an unrelated vfsmount
    gets moved. In that case legitimize_mnt() is guaranteed to fail,
    and we will fall back to non-RCU walk... unless we end up running
    into a hard error on a filesystem object we wouldn't have reached
    if not for that false negative. IOW, delaying that check until
    the end of pathname resolution is wrong - we should recheck right
    after we attempt to cross the mountpoint. We don't need to recheck
    unless we see d_mountpoint() being true - in that case even if
    we have just raced with mount/umount, we can simply go on as if
    we'd come at the moment when the sucker wasn't a mountpoint; if we
    run into a hard error as the result, it was a legitimate outcome.
    __lookup_mnt() returning NULL is different in that respect, since
    it might've happened due to operation on completely unrelated
    mountpoint.

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

10 Mar, 2014

1 commit

  • Our write() system call has always been atomic in the sense that you get
    the expected thread-safe contiguous write, but we haven't actually
    guaranteed that concurrent writes are serialized wrt f_pos accesses, so
    threads (or processes) that share a file descriptor and use "write()"
    concurrently would quite likely overwrite each other's data.

    This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

    "2.9.7 Thread Interactions with Regular File Operations

    All of the following functions shall be atomic with respect to each
    other in the effects specified in POSIX.1-2008 when they operate on
    regular files or symbolic links: [...]"

    and one of the effects is the file position update.

    This unprotected file position behavior is not new behavior, and nobody
    has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
    Michael Kerrisk that was due to this.

    This resolves the issue with a f_pos-specific lock that is taken by
    read/write/lseek on file descriptors that may be shared across threads
    or processes.

    Reported-by: Yongzhi Pan
    Reported-by: Michael Kerrisk
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
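
One way to observe the behavior this commit addresses is to let several processes share a single descriptor and write through it concurrently. The sketch below is only an illustration: the helper name, the record size, and the use of fork(2) rather than threads are choices made here. On a kernel that serializes f_pos updates as described, no record should land on top of another, so the final file size should equal the number of records written times the record size.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RECORD_LEN 64 /* arbitrary record size for the experiment */

/*
 * Fork nprocs children that all inherit one descriptor (and therefore one
 * shared file position) and each write nwrites records through it.
 * Returns the resulting file size, or -1 if anything failed.
 */
static long shared_fd_write_total(const char *path, int nprocs, int nwrites)
{
    struct stat st;
    int fd, status, rc, failed = 0;

    assert(path != NULL && nprocs > 0 && nwrites > 0);
    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            char rec[RECORD_LEN];
            memset(rec, 'a' + p % 26, sizeof(rec));
            for (int i = 0; i < nwrites; i++)
                if (write(fd, rec, sizeof(rec)) != (ssize_t)sizeof(rec))
                    _exit(1);
            _exit(0);
        }
    }

    /* Reap every child and note any that did not finish cleanly. */
    while (wait(&status) > 0)
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;

    rc = fstat(fd, &st);
    close(fd);
    if (failed || rc < 0)
        return -1;
    return (long)st.st_size;
}
```

A result smaller than nprocs * nwrites * RECORD_LEN would mean that two writers picked up the same file position and one record overwrote another, which is the symptom reported to Michael Kerrisk.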
     

06 Feb, 2014

1 commit

  • This changes 'do_execve()' to get the executable name as a 'struct
    filename', and to free it when it is done. This is what the normal
    users want, and it simplifies and streamlines their error handling.

    The controlled lifetime of the executable name also fixes a
    use-after-free problem with the trace_sched_process_exec tracepoint: the
    lifetime of the passed-in string for kernel users was not at all
    obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
    the pathname allocation lifetime with the execve() having finished,
    which in turn meant that the trace point that happened after
    mm_release() of the old process VM ended up using already free'd memory.

    To solve the kernel string lifetime issue, this simply introduces
    "getname_kernel()" that works like the normal user-space getname()
    function, except with the source coming from kernel memory.

    As Oleg points out, this also means that we could drop the tcomm[] array
    from 'struct linux_binprm', since the pathname lifetime now covers
    setup_new_exec(). That would be a separate cleanup.

    Reported-by: Igor Zhbanov
    Tested-by: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2014

2 commits

  • Recent changes to retry on ESTALE in linkat
    (commit 442e31ca5a49e398351b2954b51f578353fdf210)
    introduced a mountpoint reference leak and a small memory
    leak when a filesystem link operation returns ESTALE,
    which is pretty normal for distributed filesystems such as
    Lustre and NFS.
    Free old_path in such a case.

    [AV: there was another missing path_put() nearby - on the previous
    goto retry]

    Signed-off-by: Oleg Drokin
    Signed-off-by: Al Viro

    Oleg Drokin
     
  • Leaving getname() exported when putname() isn't is a bad idea.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

26 Jan, 2014

1 commit

  • Factor out the code to get an ACL either from the inode or disk from
    check_acl, so that it can be used elsewhere later on.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
     

13 Dec, 2013

1 commit

  • When explicitly hashing the end of a string with the word-at-a-time
    interface, we have to be careful which end of the word we pick up.

    On big-endian CPUs, the upper bits will contain the data we're after, so
    ensure we generate our masks accordingly (and avoid hashing whatever
    random junk may have been sitting after the string).

    This patch adds a new dcache helper, bytemask_from_count, which creates
    a mask appropriate for the CPU endianness.

    Cc: Al Viro
    Signed-off-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Will Deacon
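
The endianness point can be illustrated with two small functions that build a mask covering the first count bytes of a 64-bit word as it would be loaded from memory. The commit names a single helper, bytemask_from_count; the two function names below are hypothetical and exist only to show both cases side by side on any host. The kernel's actual definition may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Little-endian load: the first bytes of the string end up in the
 * low-order bits of the word, so the mask grows from the bottom.
 */
static uint64_t bytemask_le(unsigned int count)
{
    assert(count < 8); /* shifting a 64-bit value by 64 is undefined */
    return ~(~UINT64_C(0) << (count * 8));
}

/*
 * Big-endian load: the first bytes of the string end up in the
 * high-order bits of the word, so the mask grows from the top.
 */
static uint64_t bytemask_be(unsigned int count)
{
    assert(count < 8);
    return ~(~UINT64_C(0) >> (count * 8));
}
```

For a 3-byte tail, bytemask_le(3) is 0x0000000000ffffff while bytemask_be(3) is 0xffffff0000000000. Applying the little-endian mask on a big-endian CPU would keep exactly the junk bytes that follow the string, which is the bug being fixed.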
     

29 Nov, 2013

1 commit


22 Nov, 2013

1 commit

  • Pull audit updates from Eric Paris:
    "Nothing amazing. Formatting, small bug fixes, couple of fixes where
    we didn't get records due to some old VFS changes, and a change to how
    we collect execve info..."

    Fixed conflict in fs/exec.c as per Eric and linux-next.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    audit: fix type of sessionid in audit_set_loginuid()
    audit: call audit_bprm() only once to add AUDIT_EXECVE information
    audit: move audit_aux_data_execve contents into audit_context union
    audit: remove unused envc member of audit_aux_data_execve
    audit: Kill the unused struct audit_aux_data_capset
    audit: do not reject all AUDIT_INODE filter types
    audit: suppress stock memalloc failure warnings since already managed
    audit: log the audit_names record type
    audit: add child record before the create to handle case where create fails
    audit: use given values in tty_audit enable api
    audit: use nlmsg_len() to get message payload length
    audit: use memset instead of trying to initialize field by field
    audit: fix info leak in AUDIT_GET requests
    audit: update AUDIT_INODE filter rule to comparator function
    audit: audit feature to set loginuid immutable
    audit: audit feature to only allow unsetting the loginuid
    audit: allow unsetting the loginuid (with priv)
    audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
    audit: loginuid functions coding style
    selinux: apply selinux checks on new audit message types
    ...

    Linus Torvalds
     

13 Nov, 2013

1 commit

  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     

09 Nov, 2013

10 commits

  • Cc: Tyler Hicks
    Cc: Dustin Kirkland
    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Cc: David Howells
    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • We'll need the same logic for rename and link.

    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • We need to break delegations on any operation that changes the set of
    links pointing to an inode. Start with unlink.

    Such operations also hold the i_mutex on a parent directory. Breaking a
    delegation may require waiting for a timeout (by default 90 seconds) in
    the case of an unresponsive NFS client. To avoid blocking all directory
    operations, we therefore drop locks before waiting for the delegation.
    The logic then looks like:

    acquire locks
    ...
    test for delegation; if found:
        take reference on inode
        release locks
        wait for delegation break
        drop reference on inode
        retry

    It is possible this could never terminate. (Even if we take precautions
    to prevent another delegation being acquired on the same inode, we could
    get a different inode on each retry.) But this seems very unlikely.

    The initial test for a delegation happens after the lock on the target
    inode is acquired, but the directory inode may have been acquired
    further up the call stack. We therefore add a "struct inode **"
    argument to any intervening functions, which we use to pass the inode
    back up to the caller in the case it needs a delegation synchronously
    broken.

    Cc: David Howells
    Cc: Tyler Hicks
    Cc: Dustin Kirkland
    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
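
The retry logic and the "struct inode **" out-parameter described above can be modeled in a few lines of plain C. Everything in this sketch is a toy: the struct, the function names, and the way a delegation is "broken" are assumptions made for illustration, and the real locking is reduced to comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an inode that may carry a delegation. */
struct toy_inode {
    int refcount;
    bool delegated;
};

/*
 * Inner operation, called with locks held. If the target has a delegation,
 * take a reference, hand the inode back through *delegated_inode, and ask
 * the caller to retry once the delegation is gone.
 */
static int toy_unlink_locked(struct toy_inode *target,
                             struct toy_inode **delegated_inode)
{
    assert(delegated_inode != NULL);
    if (target->delegated) {
        target->refcount++;
        *delegated_inode = target;
        return -EWOULDBLOCK;
    }
    return 0; /* the actual unlink would happen here */
}

/* Stand-in for waiting until the delegation holder gives it up. */
static int toy_break_deleg_wait(struct toy_inode *inode)
{
    inode->delegated = false;
    return 0;
}

/* Outer operation: owns the locks and the retry loop. */
static int toy_unlink(struct toy_inode *target, int *attempts)
{
    struct toy_inode *delegated_inode;
    int error;

retry:
    delegated_inode = NULL;
    (*attempts)++;
    /* acquire locks (elided) */
    error = toy_unlink_locked(target, &delegated_inode);
    /* release locks (elided), so the wait below blocks nobody else */
    if (delegated_inode) {
        error = toy_break_deleg_wait(delegated_inode);
        delegated_inode->refcount--; /* drop the reference taken above */
        if (!error)
            goto retry;
    }
    return error;
}
```

The point of the out-parameter is that the function that notices the delegation is not the one that holds the outermost lock, so the inode has to travel back up to a caller that can safely drop everything before it waits.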
     
  • We'll be using dentry->d_inode in one more place.

    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • A read delegation is used by NFSv4 as a guarantee that a client can
    perform local read opens without informing the server.

    The open operation takes the last component of the pathname as an
    argument, thus is also a lookup operation, and giving the client the
    above guarantee means informing the client before we allow anything that
    would change the set of names pointing to the inode.

    Therefore, we need to break delegations on rename, link, and unlink.

    We also need to prevent new delegations from being acquired while one of
    these operations is in progress.

    We could add some completely new locking for that purpose, but it's
    simpler to use the i_mutex, since that's already taken by all the
    operations we care about.

    The single exception is rename. So, modify rename to take the i_mutex
    on the file that is being renamed.

    Also fix up lockdep and Documentation/filesystems/directory-locking to
    reflect the change.

    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • The DCACHE_NEED_LOOKUP case referred to here was removed with
    39e3c9553f34381a1b664c27b0c696a266a5735e "vfs: remove
    DCACHE_NEED_LOOKUP".

    There are only four real_lookup() callers and all of them pass in an
    unhashed dentry just returned from d_alloc.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     
  • Put a type field into struct dentry::d_flags to indicate if the dentry is one
    of the following types that relate particularly to pathwalk:

    Miss (negative dentry)
    Directory
    "Automount" directory (defective - no i_op->lookup())
    Symlink
    Other (regular, socket, fifo, device)

    The type field is set to one of the first five types on a dentry by calls to
    __d_instantiate() and d_obtain_alias() from information in the inode (if one is
    given).

    The type is cleared by dentry_unlink_inode() when it reconstitutes an existing
    dentry as a negative dentry.

    Accessors provided are:

    d_set_type(dentry, type)
    d_is_directory(dentry)
    d_is_autodir(dentry)
    d_is_symlink(dentry)
    d_is_file(dentry)
    d_is_negative(dentry)
    d_is_positive(dentry)

    A bunch of checks in pathname resolution switched to those.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
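
As a rough model of what "a type field in d_flags" means, the sketch below packs a small type code into a few bits of a flags word and tests it with accessors. The bit positions, numeric values, and toy_ names are all assumptions for illustration; they are not the kernel's definitions, and the exact semantics of each real accessor are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative layout only: three bits of the flags word hold the type. */
#define TOY_TYPE_SHIFT   20
#define TOY_TYPE_MASK    (0x7u << TOY_TYPE_SHIFT)
#define TOY_MISS_TYPE    (0x0u << TOY_TYPE_SHIFT) /* negative dentry */
#define TOY_DIR_TYPE     (0x1u << TOY_TYPE_SHIFT)
#define TOY_AUTODIR_TYPE (0x2u << TOY_TYPE_SHIFT)
#define TOY_SYMLINK_TYPE (0x3u << TOY_TYPE_SHIFT)
#define TOY_FILE_TYPE    (0x4u << TOY_TYPE_SHIFT)

struct toy_dentry {
    unsigned int d_flags;
};

/* Replace the type bits while leaving every other flag untouched. */
static void toy_set_type(struct toy_dentry *d, unsigned int type)
{
    assert((type & ~TOY_TYPE_MASK) == 0);
    d->d_flags = (d->d_flags & ~TOY_TYPE_MASK) | type;
}

static unsigned int toy_type(const struct toy_dentry *d)
{
    return d->d_flags & TOY_TYPE_MASK;
}

static bool toy_is_symlink(const struct toy_dentry *d)
{
    return toy_type(d) == TOY_SYMLINK_TYPE;
}

static bool toy_is_negative(const struct toy_dentry *d)
{
    return toy_type(d) == TOY_MISS_TYPE;
}

static bool toy_is_positive(const struct toy_dentry *d)
{
    return !toy_is_negative(d);
}
```

Keeping the type in the flags word lets pathname resolution classify a dentry with one masked comparison instead of dereferencing the inode.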
     
  • those have become aliases for rcu_read_{lock,unlock}()

    Signed-off-by: Al Viro

    Al Viro
     
  • * RCU-delayed freeing of vfsmounts
    * vfsmount_lock replaced with a seqlock (mount_lock)
    * sequence number from mount_lock is stored in nameidata->m_seq and
    used when we exit RCU mode
    * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its
    caller knows that vfsmount will have no surviving references.
    * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
    and doing pending mntput().
    * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence
    number against seq, then grabs reference to mnt. Then it rechecks mount_lock
    again to close the race and either returns success or drops the reference it
    has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can
    simply decrement the refcount and sod off - aforementioned synchronize_rcu()
    makes sure that final mntput() won't come until we leave RCU mode. We need
    that, since we don't want to end up with some lazy pathwalk racing with
    umount() and stealing the final mntput() from it - caller of umount() may
    expect it to return only once the fs is shut down and we don't want to break
    that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
    full-blown mntput() in case of mount_lock sequence number mismatch happening
    just as we'd grabbed the reference, but in those cases we won't be stealing
    the final mntput() from anything that would care.
    * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally,
    SMP and UP cases are handled the same way - no ifdefs there.
    * normal pathname resolution does *not* do any writes to mount_lock. It does,
    of course, bump the refcounts of vfsmount and dentry in the very end, but that's
    it.

    Signed-off-by: Al Viro

    Al Viro
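
The check, grab, recheck sequence described for legitimize_mnt() can be modeled in userspace with C11 atomics. This is a toy under stated assumptions: a bare counter stands in for the mount_lock seqlock, the toy_ names are hypothetical, and the MNT_SYNC_UMOUNT distinction (plain decrement versus full mntput()) is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mount object: just a reference count. */
struct toy_mount {
    atomic_int refcount;
};

/* Stand-in for the mount_lock sequence number; writers bump it. */
static atomic_uint toy_mount_seq;

/* What a mount/umount-style writer would do around its changes. */
static void toy_mount_change(void)
{
    atomic_fetch_add(&toy_mount_seq, 1);
}

/*
 * Turn a pointer found during a lockless walk into a real reference.
 * seq is the sequence number sampled when the walk started.
 */
static bool toy_legitimize(struct toy_mount *m, unsigned int seq)
{
    assert(m != NULL);
    if (atomic_load(&toy_mount_seq) != seq)
        return false;                     /* something changed: give up */
    atomic_fetch_add(&m->refcount, 1);    /* grab the reference */
    if (atomic_load(&toy_mount_seq) != seq) {
        atomic_fetch_sub(&m->refcount, 1); /* raced with a writer: undo */
        return false;
    }
    return true;
}
```

A caller that gets false back is expected to abandon the lockless walk and retry in the slower, reference-counted mode.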
     

06 Nov, 2013

1 commit

  • Historically, when a syscall that creates a dentry fails, you get an audit
    record that looks something like this (when trying to create a file named
    "new" in "/tmp/tmp.SxiLnCcv63"):

    type=PATH msg=audit(1366128956.279:965): item=0 name="/tmp/tmp.SxiLnCcv63/new" inode=2138308 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023

    This record makes no sense since it's associating the inode information for
    "/tmp/tmp.SxiLnCcv63" with the path "/tmp/tmp.SxiLnCcv63/new". The recent
    patch I posted to fix the audit_inode call in do_last fixes this, by making it
    look more like this:

    type=PATH msg=audit(1366128765.989:13875): item=0 name="/tmp/tmp.DJ1O8V3e4f/" inode=141 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023

    While this is more correct, if the creation of the file fails, then we
    have no record of the filename that the user tried to create.

    This patch adds a call to audit_inode_child to may_create. This creates
    an AUDIT_TYPE_CHILD_CREATE record that will sit in place until the
    create succeeds. When and if the create does succeed, then this record
    will be updated with the correct inode info from the create.

    This fixes what was broken in commit bfcec708.
    Commit 79f6530c should also be backported to stable v3.7+.

    Signed-off-by: Jeff Layton
    Signed-off-by: Eric Paris
    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Eric Paris

    Jeff Layton
     

25 Oct, 2013

1 commit

  • Instead of passing the direction as argument (and checking it on every
    step through the hash chain), just have separate __lookup_mnt() and
    __lookup_mnt_last(). And use the standard iterators...

    Signed-off-by: Al Viro

    Al Viro
     

22 Oct, 2013

1 commit


18 Sep, 2013

1 commit


17 Sep, 2013

1 commit

  • If O_CREAT|O_EXCL are passed to open, then we know that either

    - the file is successfully created, or
    - the operation fails in some way.

    So previously we set FILE_CREATED before calling ->atomic_open() so the
    filesystem doesn't have to. This, however, led to bugs in the
    implementation that went unnoticed when the filesystem didn't check for
    existence, yet returned success. To prevent this kind of bug, require
    filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and
    verify this in the VFS.

    Also added a couple more verifications for the result of atomic_open():

    - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT.
    - Warn if filesystem set FILE_CREATED but gave a negative dentry.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
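
From userspace, the guarantee this commit protects is the familiar exclusive-create contract, sketched below with a hypothetical helper name. It shows only the caller-visible behavior, not the ->atomic_open() internals or the FILE_CREATED bookkeeping.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Create path only if it does not already exist.
 * Returns a descriptor on success, -1 with errno set otherwise
 * (EEXIST when the file was already there).
 */
static int create_exclusive(const char *path)
{
    assert(path != NULL);
    return open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
}
```

A filesystem whose ->atomic_open() reported success without checking for existence would let a second create_exclusive() call on the same path succeed, which is the kind of bug the added VFS verification is meant to catch.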
     

11 Sep, 2013

2 commits