06 May, 2019

1 commit

  • This amends commit 10dce8af3422 ("fs: stream_open - opener for
    stream-like files so that read and write can run simultaneously without
    deadlock") in how position is passed into .read()/.write() handler for
    stream-like files:

    Rasmus noticed that we currently pass 0 as position and ignore any position
    change if that is done by a file implementation. This papers over bugs if ppos
    is used in files that declare themselves as being stream-like as such bugs will
    go unnoticed. Even if a file implementation is correctly converted into using
    stream_open, its read/write later could be changed to use ppos and even though
    that won't be working correctly, that bug might go unnoticed without someone
    doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
    read/write for stream-like files as that don't give any chance for ppos usage
    bugs because it will oops if ppos is ever used inside .read() or .write().

    Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
    because they are called by vfs_read/vfs_write & friends before
    file_operations .read/.write .

    Note 2: if file backend uses new-style .read_iter/.write_iter, position
    is still passed into there as non-pointer kiocb.ki_pos . Currently
    stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
    whose file_operations has *_iter methods.

    Suggested-by: Rasmus Villemoes
    Signed-off-by: Kirill Smelkov

    Kirill Smelkov
     

07 Apr, 2019

1 commit

  • …multaneously without deadlock

    Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
    locking for file.f_pos access and in particular made concurrent read and
    write not possible - now both those functions take f_pos lock for the
    whole run, and so if e.g. a read is blocked waiting for data, write will
    deadlock waiting for that read to complete.

    This caused regression for stream-like files where previously read and
    write could run simultaneously, but after that patch could not do so
    anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
    to /proc/xen/xenbus") which fixes such regression for particular case of
    /proc/xen/xenbus.

    The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
    safety for read/write/lseek and added the locking to file descriptors of
    all regular files. In 2014 that thread-safety problem was not new as it
    was already discussed earlier in 2006.

    However even though 2006'th version of Linus's patch was adding f_pos
    locking "only for files that are marked seekable with FMODE_LSEEK (thus
    avoiding the stream-like objects like pipes and sockets)", the 2014
    version - the one that actually made it into the tree as 9c225f2655e3 -
    is doing so irregardless of whether a file is seekable or not.

    See

    https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
    https://lwn.net/Articles/180387
    https://lwn.net/Articles/180396

    for historic context.

    The reason that it did so is, probably, that there are many files that
    are marked non-seekable, but e.g. their read implementation actually
    depends on knowing current position to correctly handle the read. Some
    examples:

    kernel/power/user.c snapshot_read
    fs/debugfs/file.c u32_array_read
    fs/fuse/control.c fuse_conn_waiting_read + ...
    drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
    arch/s390/hypfs/inode.c hypfs_read_iter
    ...

    Despite that, many nonseekable_open users implement read and write with
    pure stream semantics - they don't depend on passed ppos at all. And for
    those cases where read could wait for something inside, it creates a
    situation similar to xenbus - the write could be never made to go until
    read is done, and read is waiting for some, potentially external, event,
    for potentially unbounded time -> deadlock.

    Besides xenbus, there are 14 such places in the kernel that I've found
    with semantic patch (see below):

    drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
    drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
    drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
    drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
    net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
    drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
    drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
    drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
    net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
    drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
    drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
    drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
    drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
    drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

    In addition to the cases above another regression caused by f_pos
    locking is that now FUSE filesystems that implement open with
    FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
    stream-like files - for the same reason as above e.g. read can deadlock
    write locking on file.f_pos in the kernel.

    FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
    implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
    in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
    write routines not depending on current position at all, and with both
    read and write being potentially blocking operations:

    See

    https://github.com/libfuse/osspd
    https://lwn.net/Articles/308445

    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

    Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
    "somewhat pipe-like files ..." with read handler not using offset.
    However that test implements only read without write and cannot exercise
    the deadlock scenario:

    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

    I've actually hit the read vs write deadlock for real while implementing
    my FUSE filesystem where there is /head/watch file, for which open
    creates separate bidirectional socket-like stream in between filesystem
    and its user with both read and write being later performed
    simultaneously. And there it is semantically not easy to split the
    stream into two separate read-only and write-only channels:

    https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

    Let's fix this regression. The plan is:

    1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
    doing so would break many in-kernel nonseekable_open users which
    actually use ppos in read/write handlers.

    2. Add stream_open() to kernel to open stream-like non-seekable file
    descriptors. Read and write on such file descriptors would never use
    nor change ppos. And with that property on stream-like files read and
    write will be running without taking f_pos lock - i.e. read and write
    could be running simultaneously.

    3. With semantic patch search and convert to stream_open all in-kernel
    nonseekable_open users for which read and write actually do not
    depend on ppos and where there is no other methods in file_operations
    which assume @offset access.

    4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
    steam_open if that bit is present in filesystem open reply.

    It was tempting to change fs/fuse/ open handler to use stream_open
    instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
    grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
    and in particular GVFS which actually uses offset in its read and
    write handlers

    https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
    https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

    so if we would do such a change it will break a real user.

    5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
    from v3.14+ (the kernel where 9c225f2655 first appeared).

    This will allow to patch OSSPD and other FUSE filesystems that
    provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
    in their open handler and this way avoid the deadlock on all kernel
    versions. This should work because fs/fuse/ ignores unknown open
    flags returned from a filesystem and so passing FOPEN_STREAM to a
    kernel that is not aware of this flag cannot hurt. In turn the kernel
    that is not aware of FOPEN_STREAM will be < v3.14 where just
    FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
    write deadlock.

    This patch adds stream_open, converts /proc/xen/xenbus to it and adds
    semantic patch to automatically locate in-kernel places that are either
    required to be converted due to read vs write deadlock, or that are just
    safe to be converted because read and write do not use ppos and there
    are no other funky methods in file_operations.

    Regarding semantic patch I've verified each generated change manually -
    that it is correct to convert - and each other nonseekable_open instance
    left - that it is either not correct to convert there, or that it is not
    converted due to current stream_open.cocci limitations.

    The script also does not convert files that should be valid to convert,
    but that currently have .llseek = noop_llseek or generic_file_llseek for
    unknown reason despite file being opened with nonseekable_open (e.g.
    drivers/input/mousedev.c)

    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Yongzhi Pan <panyongzhi@gmail.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: David Vrabel <david.vrabel@citrix.com>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Julia Lawall <Julia.Lawall@lip6.fr>
    Cc: Nikolaus Rath <Nikolaus@rath.org>
    Cc: Han-Wen Nienhuys <hanwen@google.com>
    Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Kirill Smelkov
     

30 Mar, 2019

1 commit

  • syzbot is hitting lockdep warning [1] due to trying to open a fifo
    during an execve() operation. But we don't need to open non regular
    files during an execve() operation, for all files which we will need are
    the executable file itself and the interpreter programs like /bin/sh and
    ld-linux.so.2 .

    Since the manpage for execve(2) says that execve() returns EACCES when
    the file or a script interpreter is not a regular file, and the manpage
    for uselib(2) says that uselib() can return EACCES, and we use
    FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
    regular file is requested with FMODE_EXEC set.

    Since this deadlock followed by khungtaskd warnings is trivially
    reproducible by a local unprivileged user, and syzbot's frequent crash
    due to this deadlock defers finding other bugs, let's workaround this
    deadlock until we get a chance to find a better solution.

    [1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce

    Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Reported-by: syzbot
    Fixes: 8924feff66f35fe2 ("splice: lift pipe_lock out of splice_to_pipe()")
    Signed-off-by: Tetsuo Handa
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Eric Biggers
    Cc: Dmitry Vyukov
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

22 Aug, 2018

1 commit

  • Pull overlayfs updates from Miklos Szeredi:
    "This contains two new features:

    - Stack file operations: this allows removal of several hacks from
    the VFS, proper interaction of read-only open files with copy-up,
    possibility to implement fs modifying ioctls properly, and others.

    - Metadata only copy-up: when file is on lower layer and only
    metadata is modified (except size) then only copy up the metadata
    and continue to use the data from the lower file"

    * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
    ovl: Enable metadata only feature
    ovl: Do not do metacopy only for ioctl modifying file attr
    ovl: Do not do metadata only copy-up for truncate operation
    ovl: add helper to force data copy-up
    ovl: Check redirect on index as well
    ovl: Set redirect on upper inode when it is linked
    ovl: Set redirect on metacopy files upon rename
    ovl: Do not set dentry type ORIGIN for broken hardlinks
    ovl: Add an inode flag OVL_CONST_INO
    ovl: Treat metacopy dentries as type OVL_PATH_MERGE
    ovl: Check redirects for metacopy files
    ovl: Move some dir related ovl_lookup_single() code in else block
    ovl: Do not expose metacopy only dentry from d_real()
    ovl: Open file with data except for the case of fsync
    ovl: Add helper ovl_inode_realdata()
    ovl: Store lower data inode in ovl_inode
    ovl: Fix ovl_getattr() to get number of blocks from lower
    ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
    ovl: Copy up meta inode data from lowest data inode
    ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
    ...

    Linus Torvalds
     

18 Jul, 2018

5 commits


12 Jul, 2018

11 commits


11 Jul, 2018

2 commits

  • An ->open() instances really, really should not be doing that. There's
    a lot of places e.g. around atomic_open() that could be confused by that,
    so let's catch that early.

    Acked-by: Linus Torvalds
    Signed-off-by: Al Viro

    Al Viro
     
  • it's exactly the same thing as
    dentry_open(&file->f_path, file->f_flags, file->f_cred)

    ... and rename it to file_clone_open(), while we are at it.
    'filp' naming convention is bogus; sure, it's "file pointer",
    but we generally don't do that kind of Hungarian notation.
    Some of the instances have too many callers to touch, but this
    one has only two, so let's sanitize it while we can...

    Acked-by: Linus Torvalds
    Signed-off-by: Al Viro

    Al Viro
     

04 Jun, 2018

1 commit

  • This reverts commit cab64df194667dc5d9d786f0a895f647f5501c0d.

    Having vfs_open() in some cases drop the reference to
    struct file combined with

    error = vfs_open(path, f, cred);
    if (error) {
    put_filp(f);
    return ERR_PTR(error);
    }
    return f;

    is flat-out wrong. It used to be

    error = vfs_open(path, f, cred);
    if (!error) {
    /* from now on we need fput() to dispose of f */
    error = open_check_o_direct(f);
    if (error) {
    fput(f);
    f = ERR_PTR(error);
    }
    } else {
    put_filp(f);
    f = ERR_PTR(error);
    }

    and sure, having that open_check_o_direct() boilerplate gotten rid of is
    nice, but not that way...

    Worse, another call chain (via finish_open()) is FUBAR now wrt
    FILE_OPENED handling - in that case we get error returned, with file
    already hit by fput() *AND* FILE_OPENED not set. Guess what happens in
    path_openat(), when it hits

    if (!(opened & FILE_OPENED)) {
    BUG_ON(!error);
    put_filp(file);
    }

    The root cause of all that crap is that the callers of do_dentry_open()
    have no way to tell which way did it fail; while that could be fixed up
    (by passing something like int *opened to do_dentry_open() and have it
    marked if we'd called ->open()), it's probably much too late in the
    cycle to do so right now.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

07 Apr, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, including Christoph's I_DIRTY patches"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: move I_DIRTY_INODE to fs.h
    ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
    ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
    gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
    fs: fold open_check_o_direct into do_dentry_open
    vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
    vfs: make sure struct filename->iname is word-aligned
    get rid of pointless includes of fs_struct.h
    [poll] annotate SAA6588_CMD_POLL users

    Linus Torvalds
     

03 Apr, 2018

10 commits

  • Using the ksys_fallocate() wrapper allows us to get rid of in-kernel
    calls to the sys_fallocate() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_fallocate().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_truncate() wrapper allows us to get rid of in-kernel
    calls to the sys_truncate() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_truncate().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using this wrapper allows us to avoid the in-kernel calls to the
    sys_open() syscall. The ksys_ prefix denotes that this function is meant
    as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_open().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_close() wrapper allows us to get rid of in-kernel calls
    to the sys_close() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it
    uses the same calling convention as sys_close(), with one subtle
    difference:

    The few places which checked the return value did not care about the return
    value re-writing in sys_close(), so simply use a wrapper around
    __close_fd().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_ftruncate() wrapper allows us to get rid of in-kernel
    calls to the sys_ftruncate() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_ftruncate().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the fs-interal do_fchownat() wrapper allows us to get rid of
    fs-internal calls to the sys_fchownat() syscall.

    Introducing the ksys_fchown() helper and the ksys_{,}chown() wrappers
    allows us to avoid the in-kernel calls to the sys_{,l,f}chown() syscalls.
    The ksys_ prefix denotes that these functions are meant as a drop-in
    replacement for the syscalls. In particular, they use the same calling
    convention as sys_{,l,f}chown().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the fs-internal do_faccessat() helper allows us to get rid of
    fs-internal calls to the sys_faccessat() syscall.

    Introducing the ksys_access() wrapper allows us to avoid the in-kernel
    calls to the sys_access() syscall. The ksys_ prefix denotes that this
    function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_access().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • … in-kernel calls to syscall

    Using the fs-internal do_fchmodat() helper allows us to get rid of
    fs-internal calls to the sys_fchmodat() syscall.

    Introducing the ksys_fchmod() helper and the ksys_chmod() wrapper allows
    us to avoid the in-kernel calls to the sys_fchmod() and sys_chmod()
    syscalls. The ksys_ prefix denotes that these functions are meant as a
    drop-in replacement for the syscalls. In particular, they use the same
    calling convention as sys_fchmod() and sys_chmod().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>

    Dominik Brodowski
     
  • Using this helper allows us to avoid the in-kernel calls to the sys_chdir()
    syscall. The ksys_ prefix denotes that this function is meant as a drop-in
    replacement for the syscall. In particular, it uses the same calling
    convention as sys_chdir().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using this helper allows us to avoid the in-kernel calls to the
    sys_chroot() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_chroot().

    In the near future, the fs-external callers of ksys_chroot() should be
    converted to use kern_path()/set_fs_root() directly. Then ksys_chroot()
    can be moved within sys_chroot() again.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Alexander Viro
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

28 Mar, 2018

1 commit


05 Sep, 2017

2 commits

  • Problem with ioctl() is that it's a file operation, yet often used as an
    inode operation (i.e. modify the inode despite the file being opened for
    read-only).

    mnt_want_write_file() is used by filesystems in such cases to get write
    access on an arbitrary open file.

    Since overlayfs lets filesystems do all file operations, including ioctl,
    this can lead to mnt_want_write_file() returning OK for a lower file and
    modification of that lower file.

    This patch prevents modification by checking if the file is from an
    overlayfs lower layer and returning EPERM in that case.

    Need to introduce a mnt_want_write_file_path() variant that still does the
    old thing for inode operations that can do the copy up + modification
    correctly in such cases (fchown, fsetxattr, fremovexattr).

    This does not address the correctness of such ioctls on overlayfs (the
    correct way would be to copy up and attempt to perform ioctl on upper
    file).

    In theory this could be a regression. We very much hope that nobody is
    relying on such a hack in any sane setup.

    While this patch meddles in VFS code, it has no effect on non-overlayfs
    filesystems.

    Reported-by: "zhangyi (F)"
    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • Add a separate flags argument (in addition to the open flags) to control
    the behavior of d_real().

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

06 Jul, 2017

1 commit

  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_error infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     

28 Jun, 2017

1 commit

  • Define a set of write life time hints:

    RWH_WRITE_LIFE_NOT_SET No hint information set
    RWH_WRITE_LIFE_NONE No hints about write life time
    RWH_WRITE_LIFE_SHORT Data written has a short life time
    RWH_WRITE_LIFE_MEDIUM Data written has a medium life time
    RWH_WRITE_LIFE_LONG Data written has a long life time
    RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time

    The intent is for these values to be relative to each other, no
    absolute meaning should be attached to these flag names.

    Add an fcntl interface for querying these flags, and also for
    setting them as well:

    F_GET_RW_HINT Returns the read/write hint set on the
    underlying inode.

    F_SET_RW_HINT Set one of the above write hints on the
    underlying inode.

    F_GET_FILE_RW_HINT Returns the read/write hint set on the
    file descriptor.

    F_SET_FILE_RW_HINT Set one of the above write hints on the
    file descriptor.

    The user passes in a 64-bit pointer to get/set these values, and
    the interface returns 0/-1 on success/error.

    Sample program testing/implementing basic setting/getting of write
    hints is below.

    Add support for storing the write life time hint in the inode flags
    and in struct file as well, and pass them to the kiocb flags. If
    both a file and its corresponding inode has a write hint, then we
    use the one in the file, if available. The file hint can be used
    for sync/direct IO, for buffered writeback only the inode hint
    is available.

    This is in preparation for utilizing these hints in the block layer,
    to guide on-media data placement.

    /*
    * writehint.c: get or set an inode write hint
    */
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef F_GET_RW_HINT
    #define F_LINUX_SPECIFIC_BASE 1024
    #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
    #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
    #endif

    static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
    "RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
    "RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

    int main(int argc, char *argv[])
    {
    uint64_t hint;
    int fd, ret;

    if (argc < 2) {
    fprintf(stderr, "%s: file \n", argv[0]);
    return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
    perror("open");
    return 2;
    }

    if (argc > 2) {
    hint = atoi(argv[2]);
    ret = fcntl(fd, F_SET_RW_HINT, &hint);
    if (ret < 0) {
    perror("fcntl: F_SET_RW_HINT");
    return 4;
    }
    }

    ret = fcntl(fd, F_GET_RW_HINT, &hint);
    if (ret < 0) {
    perror("fcntl: F_GET_RW_HINT");
    return 3;
    }

    printf("%s: hint %s\n", argv[1], str[hint]);
    close(fd);
    return 0;
    }

    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Jens Axboe