23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

11 Jun, 2014

1 commit

  • The kernel has no concept of capabilities with respect to inodes; inodes
    exist independently of namespaces. For example, inode_capable(inode,
    CAP_LINUX_IMMUTABLE) would be nonsense.

    This patch changes inode_capable to check for uid and gid mappings and
    renames it to capable_wrt_inode_uidgid, which should make it more
    obvious what it does.

    Fixes CVE-2014-4014.

    Cc: Theodore Ts'o
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Chinner
    Cc: stable@vger.kernel.org
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

06 Dec, 2013

1 commit

  • Currently notify_change directly updates i_version for size updates,
    which not only is counter to how all other fields are updated through
    struct iattr, but also breaks XFS, which need inode updates to happen
    under its own lock, and synchronized to the structure that gets written
    to the log.

    Remove the update in the common code, and it to btrfs and ext4,
    XFS already does a proper updaste internally and currently gets a
    double update with the existing code.

    IMHO this is 3.13 and -stable material and should go in through the XFS
    tree.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andreas Dilger
    Acked-by: Jan Kara
    Reviewed-by: Dave Chinner
    Signed-off-by: Chris Mason
    Signed-off-by: Ben Myers

    Christoph Hellwig
     

09 Nov, 2013

1 commit


20 Nov, 2012

1 commit

  • - Allow chown if CAP_CHOWN is present in the current user namespace
    and the uid of the inode maps into the current user namespace, and
    the destination uid or gid maps into the current user namespace.

    - Allow perserving setgid when changing an inode if CAP_FSETID is
    present in the current user namespace and the owner of the file has
    a mapping into the current user namespace.

    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 Sep, 2012

1 commit

  • Changing an inode's metadata may result in our not needing to appraise
    the file. In such cases, we must remove 'security.ima'.

    Changelog v1:
    - use ima_inode_post_setattr() stub function, if IMA_APPRAISE not configured

    Signed-off-by: Mimi Zohar
    Acked-by: Serge Hallyn
    Acked-by: Dmitry Kasatkin

    Mimi Zohar
     

14 Jul, 2012

1 commit


31 May, 2012

1 commit

  • When a file is truncated with truncate()/ftruncate() and then closed,
    iversion is not updated. This patch uses ATTR_SIZE flag as an indication
    to increment iversion.

    Mimi said:

    On fput(), i_version is used to detect and flag files that have changed
    and need to be re-measured in the IMA measurement policy. When a file
    is truncated with truncate()/ftruncate() and then closed, i_version is
    not updated. As a result, although the file has changed, it will not be
    re-measured and added to the IMA measurement list on subsequent access.

    Signed-off-by: Dmitry Kasatkin
    Acked-by: Mimi Zohar
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dmitry Kasatkin
     

03 May, 2012

1 commit


29 Feb, 2012

1 commit


04 Jan, 2012

1 commit


09 Aug, 2011

1 commit


21 Jul, 2011

2 commits

  • Let filesystems handle waiting for direct I/O requests themselves instead
    of doing it beforehand. This means filesystem-specific locks to prevent
    new dio referenes from appearing can be held. This is important to allow
    generalizing i_dio_count to non-DIO_LOCKING filesystems.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and it's write side is always mirrored by
    real exclusion. It's intended use it to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall way
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count reaches non-zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to read zero because the direct I/O count always needs i_mutex
    (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

19 Jul, 2011

1 commit


29 May, 2011

1 commit

  • Some recent benchmarking on btrfs showed that a major scaling bottleneck
    on large systems on btrfs is currently the xattr lookup on every write.

    Why xattr lookup on every write I hear you ask?

    write wants to drop suid and security related xattrs that could set o
    capabilities for executables. To do that it currently looks up
    security.capability on EVERY write (even for non executables) to decide
    whether to drop it or not.

    In btrfs this causes an additional tree walk, hitting some per file system
    locks and quite bad scalability. In a simple read workload on a 8S
    system I saw over 90% CPU time in spinlocks related to that.

    Chris Mason tells me this is also a problem in ext4, where it hits
    the global mbcache lock.

    This patch adds a simple per inode to avoid this problem. We only
    do the lookup once per file and then if there is no xattr cache
    the decision. All xattr changes clear the flag.

    I also used the same flag to avoid the suid check, although
    that one is pretty cheap.

    A file system can also set this flag when it creates the inode,
    if it has a cheap way to do so. This is done for some common file systems
    in followon patches.

    With this patch a major part of the lock contention disappears
    for btrfs. Some testing on smaller systems didn't show significant
    performance changes, but at least it helps the larger systems
    and is generally more efficient.

    v2: Rename is_sgid. add file system helper.
    Cc: chris.mason@oracle.com
    Cc: josef@redhat.com
    Cc: viro@zeniv.linux.org.uk
    Cc: agruen@linbit.com
    Cc: Serge E. Hallyn
    Signed-off-by: Andi Kleen
    Signed-off-by: Al Viro

    Andi Kleen
     

31 Mar, 2011

1 commit


24 Mar, 2011

1 commit


10 Aug, 2010

4 commits

  • Make sure we check the truncate constraints early on in ->setattr by adding
    those checks to inode_change_ok. Also clean up and document inode_change_ok
    to make this obvious.

    As a fallout we don't have to call inode_newsize_ok from simple_setsize and
    simplify it down to a truncate_setsize which doesn't return an error. This
    simplifies a lot of setattr implementations and means we use truncate_setsize
    almost everywhere. Get rid of fat_setsize now that it's trivial and mark
    ext2_setsize static to make the calling convention obvious.

    Keep the inode_newsize_ok in vmtruncate for now as all callers need an
    audit for its removal anyway.

    Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
    needs a deeper audit, but that is left for later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Replace inode_setattr with opencoded variants of it in all callers. This
    moves the remaining call to vmtruncate into the filesystem methods where it
    can be replaced with the proper truncate sequence.

    In a few cases it was obvious that we would never end up calling vmtruncate
    so it was left out in the opencoded variant:

    spufs: explicitly checks for ATTR_SIZE earlier
    btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
    ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

    In addition to that ncpfs called inode_setattr with handcrafted iattrs,
    which allowed to trim down the opencoded variant.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • With the new truncate sequence every filesystem that wants to support file
    size changes on disk needs to implement its own ->setattr. So instead
    of calling inode_setattr which supports size changes call into a simple
    method that doesn't support this. simple_setattr is almost what we
    want except that it does not mark the inode dirty after changes. Given
    that marking the inode dirty is a no-op for the simple in-memory filesystems
    that use simple_setattr currently just add the mark_inode_dirty call.

    Also add a WARN_ON for the presence of a truncate method to simple_setattr
    to catch new instances of it during the transition period.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Despite its name it's now a generic implementation of ->setattr, but
    rather a helper to copy attributes from a struct iattr to the inode.
    Rename it to setattr_copy to reflect this fact.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

28 May, 2010

1 commit

  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

05 Mar, 2010

1 commit

  • Currently notify_change calls vfs_dq_transfer directly. This means
    we tie the quota code into the VFS. Get rid of that and make the
    filesystem responsible for the transfer. Most filesystems already
    do this, only ufs and udf need the code added, and for jfs it needs to
    be enabled unconditionally instead of only when ACLs are enabled.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

24 Sep, 2009

1 commit

  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
    into mm/truncate.c.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

26 Mar, 2009

1 commit


14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Signed-off-by: James Morris

    David Howells
     

23 Oct, 2008

1 commit


27 Jul, 2008

2 commits

  • Move the immutable and append-only checks from chmod, chown and utimes
    into notify_change(). Checks for immutable and append-only files are
    always performed by the VFS and not by the filesystem (see
    permission() and may_...() in namei.c), so these belong in
    notify_change(), and not in inode_change_ok().

    This should be completely equivalent.

    CC: Ulrich Drepper
    CC: Michael Kerrisk
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Add a new ia_valid flag: ATTR_TIMES_SET, to handle the
    UTIMES_OMIT/UTIMES_NOW and UTIMES_NOW/UTIMES_OMIT cases. In these
    cases neither ATTR_MTIME_SET nor ATTR_ATIME_SET is in the flags, yet
    the POSIX draft specifies that permission checking is performed the
    same way as if one or both of the times was explicitly set to a
    timestamp.

    See the path "vfs: utimensat(): fix error checking for
    {UTIME_NOW,UTIME_OMIT} case" by Michael Kerrisk for the patch
    introducing this behavior.

    This is a cleanup, as well as allowing filesystems (NFS/fuse/...) to
    perform their own permission checking instead of the default.

    CC: Ulrich Drepper
    CC: Michael Kerrisk
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi