05 Sep, 2017

2 commits

  • Problem with ioctl() is that it's a file operation, yet often used as an
    inode operation (i.e. modify the inode despite the file being opened for
    read-only).

    mnt_want_write_file() is used by filesystems in such cases to get write
    access on an arbitrary open file.

    Since overlayfs lets filesystems do all file operations, including ioctl,
    this can lead to mnt_want_write_file() returning OK for a lower file and
    modification of that lower file.

    This patch prevents modification by checking if the file is from an
    overlayfs lower layer and returning EPERM in that case.

    Need to introduce a mnt_want_write_file_path() variant that still does the
    old thing for inode operations that can do the copy up + modification
    correctly in such cases (fchown, fsetxattr, fremovexattr).

    This does not address the correctness of such ioctls on overlayfs (the
    correct way would be to copy up and attempt to perform ioctl on upper
    file).

    In theory this could be a regression. We very much hope that nobody is
    relying on such a hack in any sane setup.

    While this patch meddles in VFS code, it has no effect on non-overlayfs
    filesystems.

    Reported-by: "zhangyi (F)"
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Add a separate flags argument (in addition to the open flags) to control
    the behavior of d_real().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may not even be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs, that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

06 Jul, 2017

1 commit

  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.
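
    The first-caller-wins behavior described above can be modeled in a few
    lines of userspace C. This is only a sketch: struct toy_mapping and the
    toy_* functions are made-up stand-ins for the AS_EIO/AS_ENOSPC flags,
    mapping_set_error and filemap_check_errors, not the kernel code itself.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the two sticky error flags in struct address_space. */
struct toy_mapping {
    bool as_eio;
    bool as_enospc;
};

/* Models mapping_set_error(): remember that a writeback failed. */
static void toy_mapping_set_error(struct toy_mapping *m, int err)
{
    if (err == -ENOSPC)
        m->as_enospc = true;
    else if (err)
        m->as_eio = true;
}

/* Models filemap_check_errors(): report a stored error and clear it. */
static int toy_check_errors(struct toy_mapping *m)
{
    int ret = 0;

    if (m->as_enospc) {
        m->as_enospc = false;
        ret = -ENOSPC;
    }
    if (m->as_eio) {
        m->as_eio = false;
        ret = -EIO;
    }
    return ret;
}
```

    After one failed writeback, the first caller of toy_check_errors() gets
    -EIO and every later caller gets 0, whether or not it is the task that
    wrote the data.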

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).
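
    The "since" cursor scheme can be sketched in userspace as follows. It is
    a simplified model that assumes a plain change counter stored next to the
    error code; the real errseq_t encoding is not reproduced here, and all
    wb_* names are hypothetical.

```c
#include <errno.h>

/* Simplified stand-in for errseq_t: latest error plus a change counter. */
struct wb_errseq {
    int err;            /* most recent writeback error, 0 if none */
    unsigned int seq;   /* bumped every time an error is recorded */
};

/* Stand-ins for struct address_space and struct file. */
struct wb_mapping {
    struct wb_errseq wb_err;
};

struct wb_file {
    struct wb_mapping *mapping;
    struct wb_errseq since;     /* per-file cursor, like f_wb_err */
};

/* Sample the mapping's current value when the file is opened. */
static void wb_open(struct wb_file *f, struct wb_mapping *m)
{
    f->mapping = m;
    f->since = m->wb_err;
}

/* Record a writeback error in the mapping. */
static void wb_set_error(struct wb_mapping *m, int err)
{
    m->wb_err.err = err;
    m->wb_err.seq++;
}

/* Report an error only if one was recorded since this file's cursor. */
static int wb_fsync(struct wb_file *f)
{
    struct wb_errseq cur = f->mapping->wb_err;

    if (cur.seq == f->since.seq)
        return 0;
    f->since = cur;     /* advance the cursor past the reported error */
    return cur.err;
}
```

    In this model every file that was open when the error was recorded sees
    it exactly once on its next wb_fsync(), and a file opened afterwards
    starts with a clean cursor.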

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_errors infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     

28 Jun, 2017

1 commit

  • Define a set of write life time hints:

    RWH_WRITE_LIFE_NOT_SET No hint information set
    RWH_WRITE_LIFE_NONE No hints about write life time
    RWH_WRITE_LIFE_SHORT Data written has a short life time
    RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
    RWH_WRITE_LIFE_LONG Data written has a long life time
    RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time

    The intent is for these values to be relative to each other, no
    absolute meaning should be attached to these flag names.

    Add an fcntl interface for querying these flags, and also for
    setting them as well:

    F_GET_RW_HINT Returns the read/write hint set on the
    underlying inode.

    F_SET_RW_HINT Set one of the above write hints on the
    underlying inode.

    F_GET_FILE_RW_HINT Returns the read/write hint set on the
    file descriptor.

    F_SET_FILE_RW_HINT Set one of the above write hints on the
    file descriptor.

    The user passes in a pointer to a 64-bit integer to get/set the hint,
    and the interface returns 0/-1 on success/error.

    Sample program testing/implementing basic setting/getting of write
    hints is below.

    Add support for storing the write life time hint in the inode flags
    and in struct file as well, and pass them to the kiocb flags. If
    both a file and its corresponding inode have a write hint, then we
    use the one in the file. The file hint can be used for sync/direct
    IO; for buffered writeback only the inode hint is available.

    This is in preparation for utilizing these hints in the block layer,
    to guide on-media data placement.

    /*
     * writehint.c: get or set an inode write hint
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <stdbool.h>
    #include <inttypes.h>

    #ifndef F_GET_RW_HINT
    #define F_LINUX_SPECIFIC_BASE 1024
    #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
    #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
    #endif

    /* Names indexed by hint value, in the order listed above. */
    static char *str[] = { "RWH_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
                           "RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
                           "RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

    int main(int argc, char *argv[])
    {
        uint64_t hint;
        int fd, ret;

        if (argc < 2) {
            fprintf(stderr, "%s: file <hint>\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 2;
        }

        /* Optional second argument: numeric hint to set on the inode. */
        if (argc > 2) {
            hint = atoi(argv[2]);
            ret = fcntl(fd, F_SET_RW_HINT, &hint);
            if (ret < 0) {
                perror("fcntl: F_SET_RW_HINT");
                return 4;
            }
        }

        ret = fcntl(fd, F_GET_RW_HINT, &hint);
        if (ret < 0) {
            perror("fcntl: F_GET_RW_HINT");
            return 3;
        }

        /* Guard the table lookup against a value we have no name for. */
        if (hint >= sizeof(str) / sizeof(str[0])) {
            fprintf(stderr, "%s: unknown hint %" PRIu64 "\n", argv[1], hint);
            return 5;
        }

        printf("%s: hint %s\n", argv[1], str[hint]);
        close(fd);
        return 0;
    }

    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Jens Axboe
     

13 May, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Making sure that something like a referral point won't end up as pwd
    or root.

    The main part is the last commit (fixing mntns_install()); that one
    fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
    bit more robust - it should be impossible to get opened files (even
    O_PATH ones) for referral points in the first place, so the existing
    checks are OK, but checking the same thing as in chdir(2) is just as
    cheap.

    The path_init() commit removes a redundant check that shouldn't have
    been there in the first place"

    * 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    make sure that mntns_install() doesn't end up with referral for root
    path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
    make sure that fchdir() won't accept referral points, etc.

    Linus Torvalds
     

11 May, 2017

1 commit

  • Pull overlayfs update from Miklos Szeredi:
    "The biggest part of this is making st_dev/st_ino on the overlay behave
    like a normal filesystem (i.e. st_ino doesn't change on copy up,
    st_dev is the same for all files and directories). Currently this only
    works if all layers are on the same filesystem, but future work will
    move the general case towards more sane behavior.

    There are also miscellaneous fixes, including fixes to handling
    append-only files. There's a small change in the VFS, but that only
    has an effect on overlayfs, since otherwise file->f_path.dentry->inode
    and file_inode(file) are always the same"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: update documentation w.r.t. constant inode numbers
    ovl: persistent inode numbers for upper hardlinks
    ovl: merge getattr for dir and nondir
    ovl: constant st_ino/st_dev across copy up
    ovl: persistent inode number for directories
    ovl: set the ORIGIN type flag
    ovl: lookup non-dir copy-up-origin by file handle
    ovl: use an auxiliary var for overlay root entry
    ovl: store file handle of lower inode on copy up
    ovl: check if all layers are on the same fs
    ovl: do not set overlay.opaque on non-dir create
    ovl: check IS_APPEND() on real upper inode
    vfs: ftruncate check IS_APPEND() on real upper inode
    ovl: Use designated initializers
    ovl: lockdep annotate of nested stacked overlayfs inode lock

    Linus Torvalds
     

10 May, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted bits and pieces from various people. No common topic in this
    pile, sorry"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs/affs: add rename exchange
    fs/affs: add rename2 to prepare multiple methods
    Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
    fs: don't set *REFERENCED on single use objects
    fs: compat: Remove warning from COMPATIBLE_IOCTL
    remove pointless extern of atime_need_update_rcu()
    fs: completely ignore unknown open flags
    fs: add a VALID_OPEN_FLAGS
    fs: remove _submit_bh()
    fs: constify tree_descr arrays passed to simple_fill_super()
    fs: drop duplicate header percpu-rwsem.h
    fs/affs: bugfix: Write files greater than page size on OFS
    fs/affs: bugfix: enable writes on OFS disks
    fs/affs: remove node generation check
    fs/affs: import amigaffs.h
    fs/affs: bugfix: make symbolic links work again

    Linus Torvalds
     

27 Apr, 2017

1 commit


22 Apr, 2017

1 commit


20 Apr, 2017

1 commit


18 Apr, 2017

1 commit


07 Feb, 2017

2 commits

  • Before calling write f_ops, call file_start_write() instead
    of sb_start_write().

    Replace {sb,file}_start_write() for {copy,clone}_file_range() and
    for fallocate().

    Beyond correct semantics, this avoids taking sb freeze protection when
    operating on special inodes, such as fallocate() on a blockdev.

    Reviewed-by: Jan Kara
    Signed-off-by: Amir Goldstein
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • There was an obscure use case of fallocate of directory inode
    in the vfs helper with the comment:
    "Let individual file system decide if it supports preallocation
    for directories or not."

    But there is no in-tree file system that implements fallocate
    for directory operations.

    Deny an attempt to fallocate a directory with EISDIR error.

    This change is needed prior to converting sb_start_write()
    to file_start_write(), so freeze protection is correctly
    handled for cases of fallocate file and blockdev.

    Cc: linux-api@vger.kernel.org
    Cc: Al Viro
    Signed-off-by: Amir Goldstein
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

25 Dec, 2016

1 commit


14 Oct, 2016

1 commit

  • …kernel/git/dgc/linux-xfs

     < XFS has gained super CoW powers! >
     ----------------------------------
            \   ^__^
             \  (oo)\_______
                (__)\       )\/\
                    ||----w |
                    ||     ||

    Pull XFS support for shared data extents from Dave Chinner:
    "This is the second part of the XFS updates for this merge cycle. This
    pullreq contains the new shared data extents feature for XFS.

    Given the complexity and size of this change I am expecting - like the
    addition of reverse mapping last cycle - that there will be some
    follow-up bug fixes and cleanups around the -rc3 stage for issues that
    I'm sure will show up once the code hits a wider userbase.

    What it is:

    At the most basic level we are simply adding shared data extents to
    XFS - i.e. a single extent on disk can now have multiple owners. To do
    this we have to add new on-disk features to both track the shared
    extents and the number of times they've been shared. This is done by
    the new "refcount" btree that sits in every allocation group. When we
    share or unshare an extent, this tree gets updated.

    Along with this new tree, the reverse mapping tree needs to be updated
    to track each owner of a shared extent. This also needs to be updated
    on every share/unshare operation. These interactions at extent allocation
    and freeing time have complex ordering and recovery constraints, so
    there's a significant amount of new intent-based transaction code to
    ensure that operations are performed atomically from both the runtime
    and integrity/crash recovery perspectives.

    We also need to break sharing when writes hit a shared extent - this
    is where the new copy-on-write implementation comes in. We allocate
    new storage and copy the original data along with the overwrite data
    into the new location. We only do this for data as we don't share
    metadata at all - each inode has its own metadata that tracks the
    shared data extents, the extents undergoing CoW and its own private
    extents.

    Of course, being XFS, nothing is simple - we use delayed allocation
    for CoW similar to how we use it for normal writes. ENOSPC is a
    significant issue here - we build on the reservation code added in
    4.8-rc1 with the reverse mapping feature to ensure we don't get
    spurious ENOSPC issues part way through a CoW operation. These
    mechanisms also help minimise fragmentation due to repeated CoW
    operations. To further reduce fragmentation overhead, we've also
    introduced a CoW extent size hint, which indicates how large a region
    we should allocate when we execute a CoW operation.

    With all this functionality in place, we can hook up .copy_file_range,
    .clone_file_range and .dedupe_file_range and we gain all the
    capabilities of reflink and other vfs provided functionality that
    enable manipulation of shared extents. We also added a fallocate mode
    that explicitly unshares a range of a file, which we implemented as an
    explicit CoW of all the shared extents in a file.

    As such, it's a huge chunk of new functionality with new on-disk
    format features and internal infrastructure. It warns at mount time as
    an experimental feature and that it may eat data (as we do with all
    new on-disk features until they stabilise). We have not released
    userspace support for it yet - userspace support currently requires
    download from Darrick's xfsprogs repo and build from source, so the
    access to this feature is really developer/tester only at this point.
    Initial userspace support will be released at the same time the kernel
    with this code in it is released.

    The new code causes 5-6 new failures with xfstests - these aren't
    serious functional failures but things like the output of tests changing
    slightly due to perturbations in layouts, space usage, etc. OTOH,
    we've added 150+ new tests to xfstests that specifically exercise this
    new functionality so it's got far better test coverage than any
    functionality we've previously added to XFS.

    Darrick has done a pretty amazing job getting us to this stage, and
    special mention also needs to go to Christoph (review, testing,
    improvements and bug fixes) and Brian (caught several intricate bugs
    during review) for the effort they've also put in.

    Summary:

    - unshare range (FALLOC_FL_UNSHARE) support for fallocate

    - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr
    interface

    - shared extent support for XFS

    - copy-on-write support for shared extents

    - copy_file_range support

    - clone_file_range support (implements reflink)

    - dedupe_file_range support

    - defrag support for reverse mapping enabled filesystems"

    * tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits)
    xfs: convert COW blocks to real blocks before unwritten extent conversion
    xfs: rework refcount cow recovery error handling
    xfs: clear reflink flag if setting realtime flag
    xfs: fix error initialization
    xfs: fix label inaccuracies
    xfs: remove isize check from unshare operation
    xfs: reduce stack usage of _reflink_clear_inode_flag
    xfs: check inode reflink flag before calling reflink functions
    xfs: implement swapext for rmap filesystems
    xfs: refactor swapext code
    xfs: various swapext cleanups
    xfs: recognize the reflink feature bit
    xfs: simulate per-AG reservations being critically low
    xfs: don't mix reflink and DAX mode for now
    xfs: check for invalid inode reflink flags
    xfs: set a default CoW extent size of 32 blocks
    xfs: convert unwritten status of reverse mappings for shared files
    xfs: use interval query for rmap alloc operations on shared files
    xfs: add shared rmap map/unmap/convert log item types
    xfs: increase log reservations for reflink
    ...

    Linus Torvalds
     

12 Oct, 2016

1 commit

  • After much discussion, it seems that the fallocate feature flag
    FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
    FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
    for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is
    set. A length that goes past the end of the device will be clamped to the
    device size if KEEP_SIZE is set; or will return -EINVAL if not. Both
    start and length must be aligned to the device's logical block size.

    Since the semantics of fallocate are fairly well established already, wire
    up the two pieces. The other fallocate variants (collapse range, insert
    range, and allocate blocks) are not supported.
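
    The argument rules above can be sketched as a standalone check. This is
    a userspace model, not the block layer code: blkdev_falloc_check() is a
    hypothetical name, the logical block size is assumed to be a power of
    two, and the use of -EOPNOTSUPP for unsupported modes and -EINVAL for a
    start offset beyond the device are assumptions.

```c
#include <errno.h>
#include <stdint.h>
#include <linux/falloc.h>

/*
 * Validate a fallocate request against a block device of dev_size bytes
 * with logical block size lbs. May clamp *len. Returns 0 or -errno.
 */
static int blkdev_falloc_check(int mode, uint64_t start, uint64_t *len,
                               uint64_t dev_size, uint32_t lbs)
{
    int op = mode & ~FALLOC_FL_KEEP_SIZE;
    int keep = mode & FALLOC_FL_KEEP_SIZE;

    /* Only zero range and punch hole are wired up. */
    if (op != FALLOC_FL_ZERO_RANGE && op != FALLOC_FL_PUNCH_HOLE)
        return -EOPNOTSUPP;

    /* Punch still requires FALLOC_FL_KEEP_SIZE. */
    if (op == FALLOC_FL_PUNCH_HOLE && !keep)
        return -EOPNOTSUPP;

    /* Assumption: a range starting past the device is invalid. */
    if (start >= dev_size)
        return -EINVAL;

    /* Past the end: clamp with KEEP_SIZE, fail without it. */
    if (*len > dev_size - start) {
        if (!keep)
            return -EINVAL;
        *len = dev_size - start;
    }

    /* Start and length must be aligned to the logical block size. */
    if ((start | *len) & (uint64_t)(lbs - 1))
        return -EINVAL;

    return 0;
}
```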

    Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Cc: Theodore Ts'o
    Cc: Martin K. Petersen
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

04 Oct, 2016

1 commit


16 Sep, 2016

2 commits

  • The problem with writecount is: we want consistent handling of it for
    underlying filesystems as well as overlayfs. Making sure i_writecount is
    correct on all layers is difficult. Instead this patch makes sure that
    when write access is acquired, it's always done on the underlying writable
    layer (called the upper layer). We must also make sure to look at the
    writecount on this layer when checking for conflicting leases.

    Open for write already updates the upper layer's writecount, leaving
    only truncate.

    For truncate, copy up must happen before get_write_access() so that the
    writecount is updated on the upper layer. The problem with this is that
    if something fails after that (e.g. break_lease() is interrupted), the
    copy-up was done needlessly. Probably not a big deal in practice.

    Another interesting case is if there's a denywrite on a lower file that is
    then opened for write or truncated. With this patch these will succeed,
    which is somewhat counterintuitive. But I think it's still acceptable,
    considering that the copy-up does actually create a different file, so the
    old, denywrite mapping won't be touched.

    On non-overlayfs d_real() is an identity function and d_real_inode() is
    equivalent to d_inode() so this patch doesn't change behavior in that case.

    Signed-off-by: Miklos Szeredi
    Acked-by: Jeff Layton
    Cc: "J. Bruce Fields"

    Miklos Szeredi
     
  • This patch allows flock, posix locks, ofd locks and leases to work
    correctly on overlayfs.

    Instead of using the underlying inode for storing lock context use the
    overlay inode. This allows locks to be persistent across copy-up.

    This is done by introducing locks_inode() helper and using it instead of
    file_inode() to get the inode in locking code. For non-overlayfs the two
    are equivalent, except for an extra pointer dereference in locks_inode().

    Since lock operations are in "struct file_operations" we must also make
    sure not to call underlying filesystem's lock operations. Introduce a
    super block flag MS_NOREMOTELOCK to this effect.

    Signed-off-by: Miklos Szeredi
    Acked-by: Jeff Layton
    Cc: "J. Bruce Fields"

    Miklos Szeredi
     

07 Aug, 2016

1 commit

  • Pull binfmt_misc update from James Bottomley:
    "This update is to allow architecture emulation containers to function
    such that the emulation binary can be housed outside the container
    itself. The container and fs parts both have acks from relevant
    experts.

    To use the new feature you have to add an F option to your binfmt_misc
    configuration"

    From the docs:
    "The usual behaviour of binfmt_misc is to spawn the binary lazily when
    the misc format file is invoked. However, this doesn't work very well
    in the face of mount namespaces and changeroots, so the F mode opens
    the binary as soon as the emulation is installed and uses the opened
    image to spawn the emulator, meaning it is always available once
    installed, regardless of how the environment changes"

    * tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc:
    binfmt_misc: add F option description to documentation
    binfmt_misc: add persistent opened binary handler for containers
    fs: add filp_clone_open API

    Linus Torvalds
     

30 Jun, 2016

1 commit

  • The two methods essentially do the same: find the real dentry/inode
    belonging to an overlay dentry. The difference is in the usage:

    vfs_open() uses ->d_select_inode() and expects the function to perform
    copy-up if necessary based on the open flags argument.

    file_dentry() uses ->d_real() passing in the overlay dentry as well as the
    underlying inode.

    vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real()
    with a zero inode would have worked just as well here.

    This patch merges the functionality of ->d_select_inode() into ->d_real()
    by adding an 'open_flags' argument to the latter.

    [Al Viro] Make the signature of d_real() match that of ->d_real() again.
    And constify the inode argument, while we are at it.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

18 May, 2016

1 commit

  • Pull 'struct path' constification update from Al Viro:
    "'struct path' is passed by reference to a bunch of Linux security
    methods; in theory, there's nothing to stop them from modifying the
    damn thing and LSM community being what it is, sooner or later some
    enterprising soul is going to decide that it's a good idea.

    Let's remove the temptation and constify all of those..."

    * 'work.const-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    constify ima_d_path()
    constify security_sb_pivotroot()
    constify security_path_chroot()
    constify security_path_{link,rename}
    apparmor: remove useless checks for NULL ->mnt
    constify security_path_{mkdir,mknod,symlink}
    constify security_path_{unlink,rmdir}
    apparmor: constify common_perm_...()
    apparmor: constify aa_path_link()
    apparmor: new helper - common_path_perm()
    constify chmod_common/security_path_chmod
    constify security_sb_mount()
    constify chown_common/security_path_chown
    tomoyo: constify assorted struct path *
    apparmor_path_truncate(): path->mnt is never NULL
    constify vfs_truncate()
    constify security_path_truncate()
    [apparmor] constify struct path * in a bunch of helpers

    Linus Torvalds
     

17 May, 2016

1 commit


11 May, 2016

1 commit


03 May, 2016

1 commit


31 Mar, 2016

1 commit

  • I need an API that allows me to obtain a clone of the current file
    pointer to pass in to an exec handler. I've labelled this as an
    internal API because I can't see how it would be useful outside of the
    fs subsystem. The use case will be a persistent binfmt_misc handler.

    Signed-off-by: James Bottomley
    Acked-by: Serge Hallyn
    Acked-by: Jan Kara

    James Bottomley
     

28 Mar, 2016

3 commits


23 Mar, 2016

1 commit

  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
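
The three preconditions listed in the commit above can be written as a single predicate. This is a minimal userspace sketch, not kernel code: the helper name `coredump_config_was_exposed` and its parameters are hypothetical, and the caller is assumed to have read the two sysctl values already and to know whether unprivileged user namespaces are allowed.

```c
#include <stdbool.h>
#include <stddef.h>

/* fs.suid_dumpable value under which secure-exec programs dump as root. */
#define SUID_DUMP_ROOT 2

/*
 * Hypothetical helper: true when a configuration meets all three
 * conditions named in the commit message.
 */
bool coredump_config_was_exposed(int suid_dumpable,
                                 const char *core_pattern,
                                 bool unpriv_userns_allowed)
{
    if (suid_dumpable != SUID_DUMP_ROOT)
        return false;
    /* Only patterns starting with "/" qualify; "|/..." pipes to a helper. */
    if (core_pattern == NULL || core_pattern[0] != '/')
        return false;
    return unpriv_userns_allowed;
}
```

A pattern such as "|/usr/bin/handler" (an illustrative path) fails the second check, which matches the note that piped handlers are not affected.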
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
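
The wrapper pattern described above, inode_foo(inode) being mutex_foo(&inode->i_mutex), can be sketched in userspace as follows. The `struct mutex` here is a toy stand-in built on a C11 `atomic_flag` (it spins instead of sleeping), and `struct inode` models only the lock; both are assumptions made so the example compiles outside the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the kernel mutex: a spinning flag, not a sleeping lock. */
struct mutex {
    atomic_flag held;
};

/* Returns true if the lock was acquired. */
bool mutex_trylock(struct mutex *m)
{
    return !atomic_flag_test_and_set(&m->held);
}

void mutex_lock(struct mutex *m)
{
    while (!mutex_trylock(m))
        ; /* busy-wait; the real mutex would sleep */
}

void mutex_unlock(struct mutex *m)
{
    atomic_flag_clear(&m->held);
}

/* Only the field relevant to this sketch is modelled. */
struct inode {
    struct mutex i_mutex;
};

/* Wrappers: callers no longer name ->i_mutex directly. */
void inode_lock(struct inode *inode)
{
    mutex_lock(&inode->i_mutex);
}

void inode_unlock(struct inode *inode)
{
    mutex_unlock(&inode->i_mutex);
}

bool inode_trylock(struct inode *inode)
{
    return mutex_trylock(&inode->i_mutex);
}
```

Because callers go through the wrappers, the lock type behind ->i_mutex can later change (for example to an rwsem, as the commit anticipates) without touching every call site.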
     

04 Jan, 2016

1 commit


10 Jul, 2015

1 commit

  • Today proc and sysfs do not contain any executable files. Several
    applications today mount proc or sysfs without noexec and nosuid and
    then depend on there being no executable files on proc or sysfs.
    Having any executable files show up on proc or sysfs would cause
    a user space visible regression, and most likely security problems.

    Therefore commit to never allowing executables on proc and sysfs by
    adding a new flag to mark them as filesystems without executables and
    enforce that flag.

    Test the flag where MNT_NOEXEC is tested today, so that the only user
    visible effect will be that executables will be treated as if the
    execute bit is cleared.

    The filesystems proc and sysfs do not currently incorporate any
    executable files so this does not result in any user visible effects.

    This makes it unnecessary to vet changes to proc and sysfs tightly for
    adding executable files or changes to chattr that would modify
    existing files, as no matter what the individual files say they will
    not be treated as executable files by the vfs.

    Not having to vet changes too closely is important, as without this we
    are only one proc_create call (or another goof up in the
    implementation of notify_change) from having problematic executables
    on proc. Those mistakes are all too easy to make and would create
    a situation where there are security issues or where the assumptions
    of some program have to be broken (causing userspace regressions).

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
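
The check described above, testing a per-filesystem flag wherever MNT_NOEXEC is tested, can be sketched like this. The commit message does not name the flag or the helper; `SB_I_NOEXEC` and `path_noexec` are the names assumed here, the flag values are illustrative, and the structs are cut-down userspace stand-ins for the kernel types.

```c
#include <stdbool.h>

/* Illustrative flag values for this sketch only. */
#define MNT_NOEXEC  0x04 /* per-mount: mounted with noexec */
#define SB_I_NOEXEC 0x02 /* per-superblock: fs never has executables */

/* Cut-down stand-ins for the kernel structures. */
struct super_block {
    unsigned long s_iflags;
};

struct vfsmount {
    int mnt_flags;
    struct super_block *mnt_sb;
};

struct path {
    struct vfsmount *mnt;
};

/*
 * True when files under this path must never be executed, either
 * because of the mount option or because the filesystem is marked
 * as having no executables at all.
 */
bool path_noexec(const struct path *path)
{
    return (path->mnt->mnt_flags & MNT_NOEXEC) ||
           (path->mnt->mnt_sb->s_iflags & SB_I_NOEXEC);
}
```

Folding both conditions into one helper means a filesystem carrying the superblock flag behaves as if mounted noexec even when the administrator did not pass that option.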
     

24 Jun, 2015

2 commits

  • Comment in include/linux/security.h says that ->inode_killpriv() should
    be called when setuid bit is being removed and that similar security
    labels (in fact this applies only to file capabilities) should be
    removed at this time as well. However we don't call ->inode_killpriv()
    when we remove suid bit on truncate.

    We fix the problem by calling ->inode_need_killpriv() and subsequently
    ->inode_killpriv() on truncate the same way as we do it on file write.

    After this patch there's only one user of should_remove_suid() - ocfs2 -
    and indeed it's buggy because it doesn't call ->inode_killpriv() on
    write. However fixing it is difficult because of special locking
    constraints.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Turn
    d_path(&file->f_path, ...);
    into
    file_path(file, ...);

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro

    Miklos Szeredi
     

19 Jun, 2015

1 commit

  • Make file->f_path always point to the overlay dentry so that the path in
    /proc/pid/fd is correct and to ensure that label-based LSMs have access to the
    overlay as well as the underlay (path-based LSMs probably don't need it).

    Using my union testsuite to set things up, before the patch I see:

    [root@andromeda union-testsuite]# bash 5 /a/foo107
    [root@andromeda union-testsuite]# stat /mnt/a/foo107
    ...
    Device: 23h/35d Inode: 13381 Links: 1
    ...
    [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
    ...
    Device: 23h/35d Inode: 13381 Links: 1
    ...

    After the patch:

    [root@andromeda union-testsuite]# bash 5 /mnt/a/foo107
    [root@andromeda union-testsuite]# stat /mnt/a/foo107
    ...
    Device: 23h/35d Inode: 40346 Links: 1
    ...
    [root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
    ...
    Device: 23h/35d Inode: 40346 Links: 1
    ...

    Note the change in where /proc/$$/fd/5 points to in the ls command. It was
    pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
    (which is correct).

    The inode accessed, however, is the lower layer. The union layer is on device
    25h/37d and the upper layer on 24h/36d.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

11 May, 2015

1 commit


24 Apr, 2015

1 commit

  • Pull xfs update from Dave Chinner:
    "This update contains:

    - RENAME_WHITEOUT support

    - conversion of per-cpu superblock accounting to use generic counters

    - new inode mmap lock so that we can lock page faults out of
      truncate, hole punch and other direct extent manipulation functions
      to avoid racing mmap writes from causing data corruption

    - rework of direct IO submission and completion to solve data
      corruption issue when running concurrent extending DIO writes.
      Also solves problem of running IO completion transactions in
      interrupt context during size extending AIO writes.

    - FALLOC_FL_INSERT_RANGE support for inserting holes into a file via
      direct extent manipulation to avoid needing to copy data within the
      file

    - attribute block header field overflow fix for 64k block size
      filesystems

    - Lots of changes to log messaging to be more informative and concise
      when errors occur. Also prevent a lot of unnecessary log spamming
      due to cascading failures in error conditions.

    - lots of cleanups and bug fixes

    One thing of note is the direct IO fixes that we merged last week
    after the window opened. Even though a little late, they fix a user
    reported data corruption and have been pretty well tested. I figured
    there was not much point waiting another 2 weeks for -rc1 to be
    released just so I could send them to you..."

    * tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
    xfs: using generic_file_direct_write() is unnecessary
    xfs: direct IO EOF zeroing needs to drain AIO
    xfs: DIO write completion size updates race
    xfs: DIO writes within EOF don't need an ioend
    xfs: handle DIO overwrite EOF update completion correctly
    xfs: DIO needs an ioend for writes
    xfs: move DIO mapping size calculation
    xfs: factor DIO write mapping from get_blocks
    xfs: unlock i_mutex in xfs_break_layouts
    xfs: kill unnecessary firstused overflow check on attr3 leaf removal
    xfs: use larger in-core attr firstused field and detect overflow
    xfs: pass attr geometry to attr leaf header conversion functions
    xfs: disallow ro->rw remount on norecovery mount
    xfs: xfs_shift_file_space can be static
    xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate
    fs: Add support FALLOC_FL_INSERT_RANGE for fallocate
    xfs: Fix incorrect positive ENOMEM return
    xfs: xfs_mru_cache_insert() should use GFP_NOFS
    xfs: %pF is only for function pointers
    xfs: fix shadow warning in xfs_da3_root_split()
    ...

    Linus Torvalds
     

12 Apr, 2015

1 commit