21 Nov, 2018

2 commits

  • commit 5e1275808630ea3b2c97c776f40e475017535f72 upstream.

    Kaixuxia repors that it's possible to crash overlayfs by removing the
    whiteout on the upper layer before creating a directory over it. This is a
    reproducer:

    mkdir lower upper work merge
    touch lower/file
    mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge
    rm merge/file
    ls -al merge/file
    rm upper/file
    ls -al merge/
    mkdir merge/file

    Before commencing with a vfs_rename(..., RENAME_EXCHANGE) verify that the
    lookup of "upper" is positive and is a whiteout, and return ESTALE
    otherwise.

    Reported by: kaixuxia
    Signed-off-by: Miklos Szeredi
    Fixes: e9be9d5e76e3 ("overlay filesystem")
    Cc: # v3.18
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 6cd078702f2f33cb6b19a682de3e9184112f1a46 upstream.

    linking a non-copied-up file into a non-copied-up parent results in a
    nested call to mutex_lock_interruptible(&oi->lock). Fix this by copying up
    target parent before ovl_nlink_start(), same as done in ovl_rename().

    ~/unionmount-testsuite$ ./run --ov -s
    ~/unionmount-testsuite$ ln /mnt/a/foo100 /mnt/a/dir100/

    WARNING: possible recursive locking detected
    --------------------------------------------
    ln/1545 is trying to acquire lock:
    00000000bcce7c4c (&ovl_i_lock_key[depth]){+.+.}, at:
    ovl_copy_up_start+0x28/0x7d
    but task is already holding lock:
    0000000026d73d5b (&ovl_i_lock_key[depth]){+.+.}, at:
    ovl_nlink_start+0x3c/0xc1

    [SzM: this seems to be a false positive, but doing the copy-up first is
    harmless and removes the lockdep splat]

    Reported-by: syzbot+3ef5c0d1a5cb0b21e6be@syzkaller.appspotmail.com
    Fixes: 5f8415d6b87e ("ovl: persistent overlay inode nlink for...")
    Cc: # v4.13
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    [amir: backport to v4.18]
    Signed-off-by: Amir Goldstein
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     

05 Oct, 2017

1 commit


14 Sep, 2017

1 commit

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic it becomes much less clear when to use the
    highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE which in itself is tricky because basically none of
    the existing caller provide a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use without any measuring. This suggests that GFP_TEMPORARY just
    motivates for cargo cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with highlevel semantic should be clearly defined to prevent from
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    replace all existing users to simply use GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect shorterm
    allocations but I propose starting from a clear semantic definition and
    only then add users with proper justification.

    This was been brought up before LSF this year by Matthew [1] and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
    seems to be a heuristic without any measured advantage for most (if not
    all) its current users. The follow up discussion has revealed that
    opinions on what might be temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected I propose to simply remove the flag
    and start from scratch if we really need a semantic for short term
    allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

28 Jul, 2017

1 commit

  • Impure directories are ones which contain objects with origins (i.e. those
    that have been copied up). These are relevant to readdir operation only
    because of the d_ino field, no other transformation is necessary. Also a
    directory can become impure between two getdents(2) calls.

    This patch creates a cache for impure directories. Unlike the cache for
    merged directories, this one only contains entries with origin and is not
    refcounted but has a its lifetime tied to that of the dentry.

    Similarly to the merged cache, the impure cache is invalidated based on a
    version number. This version number is incremented when an entry with
    origin is added or removed from the directory.

    If the cache is empty, then the impure xattr is removed from the directory.

    This patch also fixes up handling of d_ino for the ".." entry if the parent
    directory is merged.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

14 Jul, 2017

1 commit

  • When linking a file with copy up origin into a new parent, mark the
    new parent dir "impure".

    Fixes: ee1d6d37b6b8 ("ovl: mark upper dir with type origin entries "impure"")
    Cc: # v4.12
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

05 Jul, 2017

6 commits

  • With inodes index enabled, an overlay inode nlink counts the union of upper
    and non-covered lower hardlinks. During the lifetime of a non-pure upper
    inode, the following nlink modifying operations can happen:

    1. Lower hardlink copy up
    2. Upper hardlink created, unlinked or renamed over
    3. Lower hardlink whiteout or renamed over

    For the first, copy up case, the union nlink does not change, whether the
    operation succeeds or fails, but the upper inode nlink may change.
    Therefore, before copy up, we store the union nlink value relative to the
    lower inode nlink in the index inode xattr trusted.overlay.nlink.

    For the second, upper hardlink case, the union nlink should be incremented
    or decremented IFF the operation succeeds, aligned with nlink change of the
    upper inode. Therefore, before link/unlink/rename, we store the union nlink
    value relative to the upper inode nlink in the index inode.

    For the last, lower cover up case, we simplify things by preceding the
    whiteout or cover up with copy up. This makes sure that there is an index
    upper inode where the nlink xattr can be stored before the copied up upper
    entry is unlink.

    Return the overlay inode nlinks for indexed upper inodes on stat(2).

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • For rename, we need to ensure that an upper alias exists for hard links
    before attempting the operation. Introduce a flag in ovl_entry to track
    the state of the upper alias.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Bad index entries are entries whose name does not match the
    origin file handle stored in trusted.overlay.origin xattr.
    Bad index entries could be a result of a system power off in
    the middle of copy up.

    Stale index entries are entries whose origin file handle is
    stale. Stale index entries could be a result of copying layers
    or removing lower entries while the overlay is not mounted.
    The case of copying layers should be detected earlier by the
    verification of upper root dir origin and index dir origin.

    Both bad and stale index entries are detected and removed
    on mount.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • When checking for consistency in directory operations (unlink, rename,
    etc.) match inodes not dentries.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This patch fixes an overlay inode nlink leak in the case where
    ovl_rename() renames over a non-dir.

    This is not so critical, because overlay inode doesn't rely on
    nlink dropping to zero for inode deletion.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

29 May, 2017

1 commit

  • An upper dir is marked "impure" to let ovl_iterate() know that this
    directory may contain non pure upper entries whose d_ino may need to be
    read from the origin inode.

    We already mark a non-merge dir "impure" when moving a non-pure child
    entry inside it, to let ovl_iterate() know not to iterate the non-merge
    dir directly.

    Mark also a merge dir "impure" when moving a non-pure child entry inside
    it and when copying up a child entry inside it.

    This can be used to optimize ovl_iterate() to perform a "pure merge" of
    upper and lower directories, merging the content of the directories,
    without having to read d_ino from origin inodes.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

19 May, 2017

3 commits

  • When moving a merge dir or non-dir with copy up origin into a non-merge
    upper dir (a.k.a pure upper dir), we are marking the target parent dir
    "impure". ovl_iterate() iterates pure upper dirs directly, because there is
    no need to filter out whiteouts and merge dir content with lower dir. But
    for the case of an "impure" upper dir, ovl_iterate() will not be able to
    iterate the real upper dir directly, because it will need to lookup the
    origin inode and use it to fill d_ino.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • On failure to set opaque/redirect xattr on rename, skip setting xattr and
    return -EXDEV.

    On failure to set opaque xattr when creating a new directory, -EIO is
    returned instead of -EOPNOTSUPP.

    Any failure to set those xattr will be recorded in super block and
    then setting any xattr on upper won't be attempted again.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

05 May, 2017

3 commits

  • An upper type non directory dentry that is a copy up target
    should have a reference to its lower copy up origin.

    There are three ways for an upper type dentry to be instantiated:
    1. A lower type dentry that is being copied up
    2. An entry that is found in upper dir by ovl_lookup()
    3. A negative dentry is hardlinked to an upper type dentry

    In the first case, the lower reference is set before copy up.
    In the second case, the lower reference is found by ovl_lookup().
    In the last case of hardlinked upper dentry, it is not easy to
    update the lower reference of the negative dentry. Instead,
    drop the newly hardlinked negative dentry from dcache and let
    the next access call ovl_lookup() to find its lower reference.

    This makes sure that the inode number reported by stat(2) after
    the hardlink is created is the same inode number that will be
    reported by stat(2) after mount cycle, which is the inode number
    of the lower copy up origin of the hardlink source.

    NOTE that this does not fix breaking of lower hardlinks on copy
    up, but only fixes the case of lower nlink == 1, whose upper copy
    up inode is hardlinked in upper dir.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • stat(2) on overlay directories reports the overlay temp inode
    number, which is constant across copy up, but is not persistent.

    When all layers are on the same fs, report the copy up origin inode
    number for directories.

    This inode number is persistent, unique across the overlay mount and
    constant across copy up.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     

26 Apr, 2017

1 commit


03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were to ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabal might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
    const char *filename,
    unsigned int flags,
    unsigned int mask,
    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
    __s64 tv_sec;
    __s32 tv_nsec;
    __s32 __reserved;
    };

    struct statx {
    __u32 stx_mask;
    __u32 stx_blksize;
    __u64 stx_attributes;
    __u32 stx_nlink;
    __u32 stx_uid;
    __u32 stx_gid;
    __u16 stx_mode;
    __u16 __spare0[1];
    __u64 stx_ino;
    __u64 stx_size;
    __u64 stx_blocks;
    __u64 __spare1[1];
    struct statx_timestamp stx_atime;
    struct statx_timestamp stx_btime;
    struct statx_timestamp stx_ctime;
    struct statx_timestamp stx_mtime;
    __u32 stx_rdev_major;
    __u32 stx_rdev_minor;
    __u32 stx_dev_major;
    __u32 stx_dev_minor;
    __u64 __spare2[14];
    };

    The defined bits in request_mask and stx_mask are:

    STATX_TYPE Want/got stx_mode & S_IFMT
    STATX_MODE Want/got stx_mode & ~S_IFMT
    STATX_NLINK Want/got stx_nlink
    STATX_UID Want/got stx_uid
    STATX_GID Want/got stx_gid
    STATX_ATIME Want/got stx_atime{,_ns}
    STATX_MTIME Want/got stx_mtime{,_ns}
    STATX_CTIME Want/got stx_ctime{,_ns}
    STATX_INO Want/got stx_ino
    STATX_SIZE Want/got stx_size
    STATX_BLOCKS Want/got stx_blocks
    STATX_BASIC_STATS [The stuff in the normal stat struct]
    STATX_BTIME Want/got stx_btime{,_ns}
    STATX_ALL [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided and __spares*[] are where as-yet undefined fields can be
    placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED File is compressed by the fs
    STATX_ATTR_IMMUTABLE File is marked immutable
    STATX_ATTR_APPEND File is append-only
    STATX_ATTR_NODUMP File is not to be dumped
    STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev, otherwise will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

16 Dec, 2016

15 commits

  • FWIW, there's a bit of abuse of struct kstat in overlayfs object
    creation paths - for one thing, it ends up with a very small subset
    of struct kstat (mode + rdev), for another it also needs link in
    case of symlinks and ends up passing it separately.

    IMO it would be better to introduce a separate object for that.

    In principle, we might even lift that thing into general API and switch
    ->mkdir()/->mknod()/->symlink() to identical calling conventions. Hell
    knows, perhaps ->create() as well...

    Signed-off-by: Miklos Szeredi

    Al Viro
     
  • The benefit of making directories opaque on creation is that lookups can
    stop short when they reach the original created directory, instead of
    continue lookup the entire depth of parent directory stack.

    The best case is overlay with N layers, performing lookup for first level
    directory, which exists only in upper. In that case, there will be only
    one lookup instead of N.

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • oe->opaque is set for

    a) whiteouts
    b) directories having the "trusted.overlay.opaque" xattr

    Case b can be simplified, since setting the xattr always implies setting
    oe->opaque. Also once set, the opaque flag is never cleared.

    Don't need to set opaque flag for non-directories.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Add a module option to allow tuning the max size of absolute redirects.
    Default is 256.

    Size of relative redirects is naturally limited by the the underlying
    filesystem's max filename length (usually 255).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Before introducing redirect_dir feature, the condition
    !ovl_lower_positive(dentry) for a directory, implied that it is a pure
    upper directory, which may be removed if empty.

    Now that directory can be redirect, it is possible that upper does not
    cover any lower (i.e. !ovl_lower_positive(dentry)), but the directory is a
    merge (with redirected path) and maybe non empty.

    Check for this case in ovl_remove_upper().

    This change fixes the following test case from rename-pop-dir.py
    of unionmount-testsuite:

    """Remove dir and rename old name"""
    d = ctx.non_empty_dir()
    d2 = ctx.no_dir()

    ctx.rmdir(d, err=ENOTEMPTY)
    ctx.rename(d, d2)
    ctx.rmdir(d, err=ENOENT)
    ctx.rmdir(d2, err=ENOTEMPTY)

    ./run --ov rename-pop-dir
    /mnt/a/no_dir103: Expected error (Directory not empty) was not produced

    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi

    Amir Goldstein
     
  • Current code returns EXDEV when a directory would need to be copied up to
    move. We could copy up the directory tree in this case, but there's
    another, simpler solution: point to old lower directory from moved upper
    directory.

    This is achieved with a "trusted.overlay.redirect" xattr storing the path
    relative to the root of the overlay. After such attribute has been set,
    the directory can be moved without further actions required.

    This is a backward incompatible feature, old kernels won't be able to
    correctly mount an overlay containing redirected directories.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Check if something exists on the lower layer(s) under the target or rename
    to decide if directory needs to be marked "opaque".

    Marking opaque is done before the rename, and on failure the marking was
    undone. Also the opaque xattr was removed if the target didn't cover
    anything.

    This patch changes behavior so that removal of "opaque" is not done in
    either of the above cases. This means that directory may have the opaque
    flag even if it doesn't cover anything. However this shouldn't affect the
    performance or semantics of the overalay, while simplifying the code.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • d_is_dir() is safe to call on a negative dentry. Use this fact to simplify
    handling of the lower or merged directories.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • The remainging uses of __OVL_PATH_PURE can be replaced by
    ovl_dentry_is_opaque().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Currently ovl_lookup() checks existence of lower file even if there's a
    non-directory on upper (which is always opaque). This is done so that
    remove can decide whether a whiteout is needed or not.

    It would be better to defer this check to unlink, since most of the time
    the gathered information about opaqueness will be unused.

    This adds a helper ovl_lower_positive() that checks if there's anything on
    the lower layer(s).

    The following patches also introduce changes to how the "opaque" attribute
    is updated on directories: this attribute is added when the directory is
    creted or moved over a whiteout or object covering something on the lower
    layer. However following changes will allow the attribute to remain on the
    directory after being moved, even if the new location doesn't cover
    anything. Because of this, we need to check lower layers even for opaque
    directories, so that whiteout is only created when necessary.

    This function will later be also used to decide about marking a directory
    opaque, so deal with negative dentries as well. When dealing with
    negative, it's enough to check for being a whiteout

    If the dentry is positive but not upper then it also obviously needs
    whiteout/opaque.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • And use it instead of ovl_dentry_is_opaque() where appropriate.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Since commit 07a2daab49c5 ("ovl: Copy up underlying inode's ->i_mode to
    overlay inode") sticky checking on overlay inode is performed by the vfs,
    so checking against sticky on underlying inode is not needed.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This is redundant, the vfs already performed this check (and was broken,
    see commit 9409e22acdfc ("vfs: rename: check backing inode being equal")).

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • No sense in opening special files on the underlying layers, they work just
    as well if opened on the overlay.

    Side effect is that it's no longer possible to connect one side of a pipe
    opened on overlayfs with the other side opened on the underlying layer.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

15 Oct, 2016

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This update contains fixes to the "use mounter's permission to access
    underlying layers" area, and miscellaneous other fixes and cleanups.

    No new features this time"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: use vfs_get_link()
    vfs: add vfs_get_link() helper
    ovl: use generic_readlink
    ovl: explain error values when removing acl from workdir
    ovl: Fix info leak in ovl_lookup_temp()
    ovl: during copy up, switch to mounter's creds early
    ovl: lookup: do getxattr with mounter's permission
    ovl: copy_up_xattr(): use strnlen

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check

    Linus Torvalds
     

08 Oct, 2016

1 commit