25 Oct, 2020

1 commit

  • Pull io_uring fixes from Jens Axboe:

    - fsize was missed in previous unification of work flags

    - Few fixes cleaning up the flags unification creds cases (Pavel)

    - Fix NUMA affinities for completely unplugged/replugged node for io-wq

    - Two fallout fixes from the set_fs changes. One local to io_uring, one
    for the splice entry point that io_uring uses.

    - Linked timeout fixes (Pavel)

    - Removal of ->flush() ->files work-around that we don't need anymore
    with referenced files (Pavel)

    - Various cleanups (Pavel)

    * tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
    splice: change exported internal do_splice() helper to take kernel offset
    io_uring: make loop_rw_iter() use original user supplied pointers
    io_uring: remove req cancel in ->flush()
    io-wq: re-set NUMA node affinities if CPUs come online
    io_uring: don't reuse linked_timeout
    io_uring: unify fsize with def->work_flags
    io_uring: fix racy REQ_F_LINK_TIMEOUT clearing
    io_uring: do poll's hash_node init in common code
    io_uring: inline io_poll_task_handler()
    io_uring: remove extra ->file check in poll prep
    io_uring: make cached_cq_overflow non atomic_t
    io_uring: inline io_fail_links()
    io_uring: kill ref get/drop in personality init
    io_uring: flags-based creds init in queue

    Linus Torvalds
     

23 Oct, 2020

2 commits

  • With the set_fs change, we can no longer rely on copy_{to,from}_user()
    accepting a kernel pointer, and it was bad form to do so anyway. Clean
    this up and change the internal helper that io_uring uses to deal with
    kernel pointers instead. This puts the offset copy in/out in __do_splice()
    instead, which just calls the same helper.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull initial set_fs() removal from Al Viro:
    "Christoph's set_fs base series + fixups"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Allow a NULL pos pointer to __kernel_read
    fs: Allow a NULL pos pointer to __kernel_write
    powerpc: remove address space overrides using set_fs()
    powerpc: use non-set_fs based maccess routines
    x86: remove address space overrides using set_fs()
    x86: make TASK_SIZE_MAX usable from assembly code
    x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
    lkdtm: remove set_fs-based tests
    test_bitmap: remove user bitmap tests
    uaccess: add infrastructure for kernel builds with set_fs()
    fs: don't allow splice read/write without explicit ops
    fs: don't allow kernel reads and writes without iter ops
    sysctl: Convert to iter interfaces
    proc: add a read_iter method to proc proc_ops
    proc: cleanup the compat vs no compat file ops
    proc: remove a level of indentation in proc_get_inode

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull compat iovec cleanups from Al Viro:
    "Christoph's series around import_iovec() and compat variant thereof"

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    security/keys: remove compat_keyctl_instantiate_key_iov
    mm: remove compat_process_vm_{readv,writev}
    fs: remove compat_sys_vmsplice
    fs: remove the compat readv/writev syscalls
    fs: remove various compat readv/writev helpers
    iov_iter: transparently handle compat iovecs in import_iovec
    iov_iter: refactor rw_copy_check_uvector and import_iovec
    iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
    compat.h: fix a spelling error in <linux/compat.h>

    Linus Torvalds
     

07 Oct, 2020

1 commit

  • Tetsuo Handa reports that splice() can return 0 before the real EOF, if
    the data in the splice source pipe is an empty pipe buffer. That empty
    pipe buffer case doesn't happen in any normal situation, but you can
    trigger it by doing a write to a pipe that fails due to a page fault.

    Tetsuo has a test-case to show the behavior:

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
            int pipe_fd[2] = { -1, -1 };
            pipe(pipe_fd);
            write(pipe_fd[1], NULL, 4096);
            /* This splice() should wait unless interrupted. */
            return !splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
    }

    which results in

    write(5, NULL, 4096) = -1 EFAULT (Bad address)
    splice(4, NULL, 3, NULL, 65536, 0) = 0

    and this can confuse splice() users into believing they have hit EOF
    prematurely.

    The issue was introduced when the pipe write code started pre-allocating
    the pipe buffers before copying data from user space.

    This is a modified version of Tetsuo's original patch.

    Fixes: a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
    Link: https://lore.kernel.org/linux-fsdevel/20201005121339.4063-1-penguin-kernel@I-love.SAKURA.ne.jp/
    Reported-by: Tetsuo Handa
    Acked-by: Tetsuo Handa
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Oct, 2020

2 commits


02 Oct, 2020

1 commit

  • The pipe splice code still used the old model of waiting for pipe IO by
    using a non-specific "pipe_wait()" that waited for any pipe event to
    happen, which depended on all pipe IO being entirely serialized by the
    pipe lock. So by checking the state you were waiting for, and then
    adding yourself to the wait queue before dropping the lock, you were
    guaranteed to see all the wakeups.

    Strictly speaking, the actual wakeups were not done under the lock, but
    the pipe_wait() model still worked, because since the waiter held the
    lock when checking whether it should sleep, it would always see the
    current state, and the wakeup was always done after updating the state.

    However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
    writing") split the single wait-queue into two, and in the process also
    made the "wait for event" code wait for _two_ wait queues, and that then
    showed a race with the wakers that were not serialized by the pipe lock.

    It's only splice that used that "pipe_wait()" model, so the problem
    wasn't obvious, but Josef Bacik reports:

    "I hit a hang with fstest btrfs/187, which does a btrfs send into
    /dev/null. This works by creating a pipe, the write side is given to
    the kernel to write into, and the read side is handed to a thread that
    splices into a file, in this case /dev/null.

    The box that was hung had the write side stuck here [pipe_write] and
    the read side stuck here [splice_from_pipe_next -> pipe_wait].

    [ more details about pipe_wait() scenario ]

    The problem is we're doing the prepare_to_wait, which sets our state
    each time, however we can be woken up either with reads or writes. In
    the case above we race with the WRITER waking us up, and re-set our
    state to INTERRUPTIBLE, and thus never break out of schedule"

    Josef had a patch that avoided the issue in pipe_wait() by just making
    it set the state only once, but the deeper problem is that pipe_wait()
    depends on a level of synchronization by the pipe mutex that it really
    shouldn't. And the whole "wait for any pipe state change" model really
    isn't very good to begin with.

    So rather than trying to work around things in pipe_wait(), remove that
    legacy model of "wait for arbitrary pipe event" entirely, and actually
    create functions that wait for the pipe actually being readable or
    writable, and can do so without depending on the pipe lock serializing
    everything.

    Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
    Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
    Reported-by: Josef Bacik
    Reviewed-and-tested-by: Josef Bacik
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Sep, 2020

1 commit

  • default_file_splice_write is the last piece of generic code that uses
    set_fs to make the uaccess routines operate on kernel pointers. It
    implements a "fallback loop" for splicing from files that do not actually
    provide a proper splice_read method. The usual file systems and other
    high bandwidth instances all provide a ->splice_read, so this just removes
    support for various device drivers and procfs/debugfs files. If splice
    support for any of those turns out to be important it can be added back
    by switching them to the iter ops and using generic_file_splice_read.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Kees Cook
    Signed-off-by: Al Viro

    Christoph Hellwig
     

14 Jun, 2020

1 commit

  • …git/dhowells/linux-fs

    Pull notification queue from David Howells:
    "This adds a general notification queue concept and adds an event
    source for keys/keyrings, such as linking and unlinking keys and
    changing their attributes.

    Thanks to Debarshi Ray, we do have a pull request to use this to fix a
    problem with gnome-online-accounts - as mentioned last time:

    https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

    Without this, g-o-a has to constantly poll a keyring-based kerberos
    cache to find out if kinit has changed anything.

    [ There are other notifications pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

    LSM hooks are included:

    - A set of hooks are provided that allow an LSM to rule on whether or
    not a watch may be set. Each of these hooks takes a different
    "watched object" parameter, so they're not really shareable. The
    LSM should use current's credentials. [Wanted by SELinux & Smack]

    - A hook is provided to allow an LSM to rule on whether or not a
    particular message may be posted to a particular queue. This is
    given the credentials from the event generator (which may be the
    system) and the watch setter. [Wanted by Smack]

    I've provided SELinux and Smack with implementations of some of these
    hooks.

    WHY
    ===

    Key/keyring notifications are desirable because if you have your
    kerberos tickets in a file/directory, your Gnome desktop will monitor
    that using something like fanotify and tell you if your credentials
    cache changes.

    However, we also have the ability to cache your kerberos tickets in
    the session, user or persistent keyring so that it isn't left around
    on disk across a reboot or logout. Keyrings, however, cannot currently
    be monitored asynchronously, so the desktop has to poll for it - not
    so good on a laptop. This facility will allow the desktop to avoid the
    need to poll.

    DESIGN DECISIONS
    ================

    - The notification queue is built on top of a standard pipe. Messages
    are effectively spliced in. The pipe is opened with a special flag:

    pipe2(fds, O_NOTIFICATION_PIPE);

    The special flag has the same value as O_EXCL (which doesn't seem
    like it will ever be applicable in this context)[?]. It is given up
    front to make it a lot easier to prohibit splice&co from accessing
    the pipe.

    [?] Should this be done some other way? I'd rather not use up a new
    O_* flag if I can avoid it - should I add a pipe3() system call
    instead?

    The pipe is then configured::

    ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
    ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

    Messages are then read out of the pipe using read().

    - It should be possible to allow write() to insert data into the
    notification pipes too, but this is currently disabled as the
    kernel has to be able to insert messages into the pipe *without*
    holding pipe->mutex and the code to make this work needs careful
    auditing.

    - sendfile(), splice() and vmsplice() are disabled on notification
    pipes because of the pipe->mutex issue and also because they
    sometimes want to revert what they just did - but one or more
    notification messages might've been interleaved in the ring.

    - The kernel inserts messages with the wait queue spinlock held. This
    means that pipe_read() and pipe_write() have to take the spinlock
    to update the queue pointers.

    - Records in the buffer are binary, typed and have a length so that
    they can be of varying size.

    This allows multiple heterogeneous sources to share a common
    buffer; there are 16 million types available, of which I've used
    just a few, so there is scope for others to be used. Tags may be
    specified when a watchpoint is created to help distinguish the
    sources.

    - Records are filterable as types have up to 256 subtypes that can be
    individually filtered. Other filtration is also available.

    - Notification pipes don't interfere with each other; each may be
    bound to a different set of watches. Any particular notification
    will be copied to all the queues that are currently watching for it
    - and only those that are watching for it.

    - When recording a notification, the kernel will not sleep, but will
    rather mark a queue as having lost a message if there's
    insufficient space. read() will fabricate a loss notification
    message at an appropriate point later.

    - The notification pipe is created and then watchpoints are attached
    to it, using one of:

    keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
    watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
    watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

    where in all cases, fd indicates the queue and the number after is
    a tag between 0 and 255.

    - Watches are removed if either the notification pipe is destroyed or
    the watched object is destroyed. In the latter case, a message will
    be generated indicating the enforced watch removal.

    Things I want to avoid:

    - Introducing features that make the core VFS dependent on the
    network stack or networking namespaces (ie. usage of netlink).

    - Dumping all this stuff into dmesg and having a daemon that sits
    there parsing the output and distributing it as this then puts the
    responsibility for security into userspace and makes handling
    namespaces tricky. Further, dmesg might not exist or might be
    inaccessible inside a container.

    - Letting users see events they shouldn't be able to see.

    TESTING AND MANPAGES
    ====================

    - The keyutils tree has a pipe-watch branch that has keyctl commands
    for making use of notifications. Proposed manual pages can also be
    found on this branch, though a couple of them really need to go to
    the main manpages repository instead.

    If the kernel supports the watching of keys, then running "make
    test" on that branch will cause the testing infrastructure to spawn
    a monitoring process on the side that monitors a notifications pipe
    for all the key/keyring changes induced by the tests and they'll
    all be checked off to make sure they happened.

    https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

    - A test program is provided (samples/watch_queue/watch_test) that
    can be used to monitor for keyrings, mount and superblock events.
    Information on the notifications is simply logged to stdout"

    * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    smack: Implement the watch_key and post_notification hooks
    selinux: Implement the watch_key security hook
    keys: Make the KEY_NEED_* perms an enum rather than a mask
    pipe: Add notification lossage handling
    pipe: Allow buffers to be marked read-whole-or-error for notifications
    Add sample notification program
    watch_queue: Add a key/keyring notification facility
    security: Add hooks to rule on setting a watch
    pipe: Add general notification queue support
    pipe: Add O_NOTIFICATION_PIPE
    security: Add a hook for the point of notification insertion
    uapi: General notification queue definitions

    Linus Torvalds
     

04 Jun, 2020

1 commit

  • Pull splice updates from Al Viro:
    "Christoph's assorted splice cleanups"

    * 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: rename pipe_buf ->steal to ->try_steal
    fs: make the pipe_buf_operations ->confirm operation optional
    fs: make the pipe_buf_operations ->steal operation optional
    trace: remove tracing_pipe_buf_ops
    pipe: merge anon_pipe_buf*_ops
    fs: simplify do_splice_from
    fs: simplify do_splice_to

    Linus Torvalds
     

03 Jun, 2020

1 commit

  • Pull io_uring updates from Jens Axboe:
    "A relatively quiet round, mostly just fixes and code improvements. In
    particular:

    - Make statx just use the generic statx handler, instead of open
    coding it. We don't need that anymore, as we always call it async
    safe (Bijan)

    - Enable closing of the ring itself. Also fixes O_PATH closure (me)

    - Properly name completion members (me)

    - Batch reap of dead file registrations (me)

    - Allow IORING_OP_POLL with double waitqueues (me)

    - Add tee(2) support (Pavel)

    - Remove double off read (Pavel)

    - Fix overflow cancellations (Pavel)

    - Improve CQ timeouts (Pavel)

    - Async defer drain fixes (Pavel)

    - Add support for enabling/disabling notifications on a registered
    eventfd (Stefano)

    - Remove dead state parameter (Xiaoguang)

    - Disable SQPOLL submit on dying ctx (Xiaoguang)

    - Various code cleanups"

    * tag 'for-5.8/io_uring-2020-06-01' of git://git.kernel.dk/linux-block: (29 commits)
    io_uring: fix overflowed reqs cancellation
    io_uring: off timeouts based only on completions
    io_uring: move timeouts flushing to a helper
    statx: hide interfaces no longer used by io_uring
    io_uring: call statx directly
    statx: allow system call to be invoked from io_uring
    io_uring: add io_statx structure
    io_uring: get rid of manual punting in io_close
    io_uring: separate DRAIN flushing into a cold path
    io_uring: don't re-read sqe->off in timeout_prep()
    io_uring: simplify io_timeout locking
    io_uring: fix flush req->refs underflow
    io_uring: don't submit sqes when ctx->refs is dying
    io_uring: async task poll trigger cleanup
    io_uring: add tee(2) support
    splice: export do_tee()
    io_uring: don't repeat valid flag list
    io_uring: rename io_file_put()
    io_uring: remove req->needs_fixed_files
    io_uring: cleanup io_poll_remove_one() logic
    ...

    Linus Torvalds
     

21 May, 2020

7 commits

  • syzbot reports that splice()ing from a non-empty read side to an
    already-full write side leaves the task unkillable, because
    opipe_prep() mistakenly does not invert the pipe_full() test.

    CPU: 0 PID: 9460 Comm: syz-executor.5 Not tainted 5.6.0-rc3-next-20200228-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:rol32 include/linux/bitops.h:105 [inline]
    RIP: 0010:iterate_chain_key kernel/locking/lockdep.c:369 [inline]
    RIP: 0010:__lock_acquire+0x6a3/0x5270 kernel/locking/lockdep.c:4178
    Call Trace:
    lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4720
    __mutex_lock_common kernel/locking/mutex.c:956 [inline]
    __mutex_lock+0x156/0x13c0 kernel/locking/mutex.c:1103
    pipe_lock_nested fs/pipe.c:66 [inline]
    pipe_double_lock+0x1a0/0x1e0 fs/pipe.c:104
    splice_pipe_to_pipe fs/splice.c:1562 [inline]
    do_splice+0x35f/0x1520 fs/splice.c:1141
    __do_sys_splice fs/splice.c:1447 [inline]
    __se_sys_splice fs/splice.c:1427 [inline]
    __x64_sys_splice+0x2b5/0x320 fs/splice.c:1427
    do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reported-by: syzbot+b48daca8639150bc5e73@syzkaller.appspotmail.com
    Link: https://syzkaller.appspot.com/bug?id=9386d051e11e09973d5a4cf79af5e8cedf79386d
    Fixes: 8cefc107ca54c8b0 ("pipe: Use head and tail pointers for the ring, not cursor and length")
    Cc: stable@vger.kernel.org # 5.5+
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • And replace the arcane return value convention with a simple bool
    where true means success and false means failure.

    [AV: braino fix folded in]

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Just return 0 for success if it is not present.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Just return 1 for failure if it is not present.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • All the op vectors are exactly the same, they are just used to encode
    packet or nomerge behavior. There already is a flag for the packet
    behavior, so just add a new one to allow for merging. Inverting it vs
    the previous nomerge special casing actually allows for much nicer code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • No need for a local function pointer when we can trivially branch on the
    ->splice_write presence.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • No need for a local function pointer when we can trivially branch on the
    ->splice_read presence.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

19 May, 2020

1 commit

  • Make it possible to have a general notification queue built on top of a
    standard pipe. Notifications are 'spliced' into the pipe and then read
    out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
    notifications as post_one_notification() cannot take pipe->mutex. This
    means that notifications could be posted in between individual pipe
    buffers, making iov_iter_revert() difficult to effect.

    The way the notification queue is used is:

    (1) An application opens a pipe with a special flag and indicates the
    number of messages it wishes to be able to queue at once (this can
    only be set once):

    pipe2(fds, O_NOTIFICATION_PIPE);
    ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);

    (2) The application then uses poll() and read() as normal to extract data
    from the pipe. read() will return multiple notifications if the
    buffer is big enough, but it will not split a notification across
    buffers - rather it will return a short read or EMSGSIZE.

    Notification messages include a length in the header so that the
    caller can split them up.

    Each message has a header that describes it:

    struct watch_notification {
    __u32 type:24;
    __u32 subtype:8;
    __u32 info;
    };

    The type indicates the source (eg. mount tree changes, superblock events,
    keyring changes, block layer events) and the subtype indicates the event
    type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
    indicates a number of things, including the entry length, an ID assigned to
    a watchpoint contributing to this buffer and type-specific flags.

    Supplementary data, such as the key ID that generated an event, can be
    attached in additional slots. The maximum message size is 127 bytes.
    Messages may not be padded or aligned, so there is no guarantee, for
    example, that the notification type will be on a 4-byte boundary.
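
    As an illustration of how a reader might split a read() buffer into
    individual records, here is a minimal userspace sketch. It assumes the
    record length sits in the low 7 bits of the info field (consistent with
    the 127-byte maximum above); the mask name, the struct name and the
    helper sketch_count_records() are hypothetical and not taken from the
    kernel headers. Because records may be unaligned, the header is copied
    out with memcpy() before it is inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the record length lives in the low 7 bits of `info`. */
#define SKETCH_INFO_LENGTH_MASK 0x7fu

/* Same layout as the header quoted above, using fixed-width types. */
struct sketch_notification {
	uint32_t type:24;
	uint32_t subtype:8;
	uint32_t info;
};

/*
 * Walk `len` bytes as returned by read() and count complete records.
 * Returns the number of records, or -1 if a record is malformed
 * (shorter than a header, or running past the end of the buffer).
 */
static int sketch_count_records(const unsigned char *buf, size_t len)
{
	size_t pos = 0;
	int count = 0;

	assert(buf != NULL);
	while (pos < len) {
		struct sketch_notification hdr;
		size_t rec_len;

		if (len - pos < sizeof(hdr))
			return -1;
		/* Records may be unaligned, so copy the header out first. */
		memcpy(&hdr, buf + pos, sizeof(hdr));
		rec_len = hdr.info & SKETCH_INFO_LENGTH_MASK;
		if (rec_len < sizeof(hdr) || rec_len > len - pos)
			return -1;
		count++;
		pos += rec_len;
	}
	return count;
}
```

    A real consumer would dispatch on hdr.type and hdr.subtype inside the
    loop instead of only counting.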

    Signed-off-by: David Howells

    David Howells
     

18 May, 2020

1 commit


07 May, 2020

1 commit

  • do_splice() is used by io_uring, as will be do_tee(). Move f_mode
    checks from sys_{splice,tee}() to do_{splice,tee}(), so they're
    enforced for io_uring as well.

    Fixes: 7d67af2c0134 ("io_uring: add splice(2) support")
    Reported-by: Jann Horn
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Pavel Begunkov
     

03 Mar, 2020

1 commit


09 Feb, 2020

1 commit

  • This makes the pipe code use separate wait-queues and exclusive waiting
    for readers and writers, avoiding a nasty thundering herd problem when
    there are lots of readers waiting for data on a pipe (or, less commonly,
    lots of writers waiting for a pipe to have space).

    While this isn't a common occurrence in the traditional "use a pipe as a
    data transport" case, where you typically only have a single reader and
    a single writer process, there is one common special case: using a pipe
    as a source of "locking tokens" rather than for data communication.

    In particular, the GNU make jobserver code ends up using a pipe as a way
    to limit parallelism, where each job consumes a token by reading a byte
    from the jobserver pipe, and releases the token by writing a byte back
    to the pipe.
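
    The token pattern described above can be sketched in userspace roughly
    as follows. This is not GNU make's code: struct token_pool and the
    token_*() helpers are hypothetical names, and the read end is made
    non-blocking only so that the sketch reports "no token available"
    instead of sleeping, whereas a real jobserver client would normally
    block. It also assumes the token count is far below the pipe capacity,
    so that preloading never blocks.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical token pool: fds[0] is the read end, fds[1] the write end. */
struct token_pool {
	int fds[2];
};

/* Create the pipe and preload it with `ntokens` one-byte tokens. */
static int token_pool_init(struct token_pool *p, int ntokens)
{
	assert(ntokens >= 0);
	if (pipe(p->fds) < 0)
		return -1;
	/* Only the read end becomes non-blocking. */
	if (fcntl(p->fds[0], F_SETFL, O_NONBLOCK) < 0)
		return -1;
	for (int i = 0; i < ntokens; i++)
		if (write(p->fds[1], "+", 1) != 1)
			return -1;
	return 0;
}

/* Take one token: 0 on success, -1 (errno EAGAIN) if the pipe is empty. */
static int token_acquire(struct token_pool *p)
{
	char tok;

	return read(p->fds[0], &tok, 1) == 1 ? 0 : -1;
}

/* Give a token back by writing one byte to the pipe. */
static int token_release(struct token_pool *p)
{
	return write(p->fds[1], "+", 1) == 1 ? 0 : -1;
}

/* Close both ends of the pipe. */
static void token_pool_destroy(struct token_pool *p)
{
	close(p->fds[0]);
	close(p->fds[1]);
}
```

    With many blocked readers on such a pipe, each token_release() only
    needs to wake one of them, which is the case this commit optimizes.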

    This pattern is fairly traditional on Unix, and works very well, but
    will waste a lot of time waking up a lot of processes when only a single
    reader needs to be woken up when a writer releases a new token.

    A simplified test-case of just this pipe interaction is to create 64
    processes, and then pass a single token around between them (this
    test-case also intentionally passes another token that gets ignored to
    test the "wake up next" logic too, in case anybody wonders about it):

    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int fd[2], counters[2];

            pipe(fd);
            counters[0] = 0;
            counters[1] = -1;
            write(fd[1], counters, sizeof(counters));

            /* 64 processes */
            fork(); fork(); fork(); fork(); fork(); fork();

            do {
                    int i;
                    read(fd[0], &i, sizeof(i));
                    if (i < 0)
                            continue;
                    counters[0] = i+1;
                    write(fd[1], counters, (1+(i & 1)) *sizeof(int));
            } while (counters[0] < 1000000);
            return 0;
    }

    and in a perfect world, passing that token around should only cause one
    context switch per transfer, when the writer of a token causes a
    directed wakeup of just a single reader.

    But with the "writer wakes all readers" model we traditionally had, on
    my test box the above case causes more than an order of magnitude more
    scheduling: instead of the expected ~1M context switches, "perf stat"
    shows

    231,852.37 msec task-clock # 15.857 CPUs utilized
    11,250,961 context-switches # 0.049 M/sec
    616,304 cpu-migrations # 0.003 M/sec
    1,648 page-faults # 0.007 K/sec
    1,097,903,998,514 cycles # 4.735 GHz
    120,781,778,352 instructions # 0.11 insn per cycle
    27,997,056,043 branches # 120.754 M/sec
    283,581,233 branch-misses # 1.01% of all branches

    14.621273891 seconds time elapsed

    0.018243000 seconds user
    3.611468000 seconds sys

    before this commit.

    After this commit, I get

    5,229.55 msec task-clock # 3.072 CPUs utilized
    1,212,233 context-switches # 0.232 M/sec
    103,951 cpu-migrations # 0.020 M/sec
    1,328 page-faults # 0.254 K/sec
    21,307,456,166 cycles # 4.074 GHz
    12,947,819,999 instructions # 0.61 insn per cycle
    2,881,985,678 branches # 551.096 M/sec
    64,267,015 branch-misses # 2.23% of all branches

    1.702148350 seconds time elapsed

    0.004868000 seconds user
    0.110786000 seconds sys

    instead. Much better.

    [ Note! This kernel improvement seems to be very good at triggering a
    race condition in the make jobserver (in GNU make 4.2.1) for me. It's
    a long known bug that was fixed back in June 2017 by GNU make commit
    b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
    avoid hangs.").

    But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
    so a number of distributions may still have the buggy version. Some
    have backported the fix to their 4.2.1 release, though, and even
    without the fix it's quite timing-dependent whether the bug actually
    is hit. ]

    Josh Triplett says:
    "I've been hammering on your pipe fix patch (switching to exclusive
    wait queues) for a month or so, on several different systems, and I've
    run into no issues with it. The patch *substantially* improves
    parallel build times on large (~100 CPU) systems, both with parallel
    make and with other things that use make's pipe-based jobserver.

    All current distributions (including stable and long-term stable
    distributions) have versions of GNU make that no longer have the
    jobserver bug"

    Tested-by: Josh Triplett
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Dec, 2019

1 commit

  • This code is ancient, and goes back to when we only had a single page
    for the pipe buffers. The exact history is hidden in the mists of time
    (ie "before git", and in fact predates the BK repository too).

    At that long-ago point in time, it actually helped to try to merge big
    back-and-forth pipe reads and writes, and not limit pipe reads to the
    single pipe buffer in length just because that was all we had at a time.

    However, since then we've expanded the pipe buffers to multiple pages,
    and this logic really doesn't seem to make sense. And a lot of it is
    somewhat questionable (ie "hmm, the user asked for a non-blocking read,
    but we see that there's a writer pending, so let's wait anyway to get
    the extra data that the writer will have").

    But more importantly, it makes the "go to sleep" logic much less
    obvious, and considering the wakeup issues we've had, I want to make for
    less of those kinds of things.

    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Dec, 2019

1 commit

  • Similarly to commit 8f868d68d335 ("pipe: Fix missing mask update after
    pipe_wait()") this fixes a case where the pipe rewrite ended up caching
    the pipe state incorrectly over a pipe lock drop event.

    It wasn't quite as obvious, because you needed to splice data from a
    pipe to a file, which is a fairly unusual operation, but it's completely
    wrong.

    Make sure we load the pipe head/tail/size information only after we've
    waited for there to be data in the pipe.

    While in that file, also make one of the splice helper functions use the
    canonical argument order for pipe_empty(). That's syntactic - pipe
    emptiness is just that head and tail are equal, and thus mixing up head
    and tail doesn't really matter. It's still wrong, though.

    Reported-by: David Sterba
    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Dec, 2019

1 commit

  • …ux/kernel/git/dhowells/linux-fs

    Pull pipe rework from David Howells:
    "This is my set of preparatory patches for building a general
    notification queue on top of pipes. It makes a number of significant
    changes:

    - It removes the nr_exclusive argument from __wake_up_sync_key() as
    this is always 1. This prepares for the next step:

    - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
    woken up from a function that's holding the poll waitqueue
    spinlock.

    - Change the pipe buffer ring to be managed in terms of unbounded
    head and tail indices rather than bounded index and length. This
    means that reading the pipe only needs to modify one index, not
    two.

    - A selection of helper functions are provided to query the state of
    the pipe buffer, plus a couple to apply updates to the pipe
    indices.

    - The pipe ring is allowed to have kernel-reserved slots. This allows
    many notification messages to be spliced in by the kernel without
    allowing userspace to pin too many pages if it writes to the same
    pipe.

    - Advance the head and tail indices inside the pipe waitqueue lock
    and use wake_up_interruptible_sync_poll_locked() to poke poll
    without having to take the lock twice.

    - Rearrange pipe_write() to preallocate the buffer it is going to
    write into and then drop the spinlock. This allows kernel
    notifications to then be added to the ring whilst it is filling the
    buffer it allocated. The read side is stalled because the pipe
    mutex is still held.

    - Don't wake up readers on a pipe if there was already data in it
    when we added more.

    - Don't wake up writers on a pipe if the ring wasn't full before we
    removed a buffer"

    * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    pipe: Remove sync on wake_ups
    pipe: Increase the writer-wakeup threshold to reduce context-switch count
    pipe: Check for ring full inside of the spinlock in pipe_write()
    pipe: Remove redundant wakeup from pipe_write()
    pipe: Rearrange sequence in pipe_write() to preallocate slot
    pipe: Conditionalise wakeup in pipe_read()
    pipe: Advance tail pointer inside of wait spinlock in pipe_read()
    pipe: Allow pipes to have kernel-reserved slots
    pipe: Use head and tail pointers for the ring, not cursor and length
    Add wake_up_interruptible_sync_poll_locked()
    Remove the nr_exclusive argument from __wake_up_sync_key()
    pipe: Reduce #inclusion of pipe_fs_i.h

    Linus Torvalds
     

16 Nov, 2019

1 commit

  • Split pipe->ring_size into two numbers:

    (1) pipe->ring_size - indicates the hard size of the pipe ring.

    (2) pipe->max_usage - indicates the maximum number of pipe ring slots that
    userspace orchestrated events can fill.

    This allows for a pipe that is both writable by the general kernel
    notification facility and by userspace, allowing plenty of ring space for
    notifications to be added whilst preventing userspace from being able to
    pin too much unswappable kernel space.
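
    As a rough illustration of the two limits, the following is a minimal
    userspace sketch, not the kernel code. The struct pipe_limits type and
    the space_for_user()/space_for_kernel() helpers are hypothetical names;
    only the ring_size and max_usage field names come from the text above.
    It assumes free-running unsigned head and tail indices.

```c
#include <assert.h>

/* Hypothetical model; only the two field names follow the commit text. */
struct pipe_limits {
    unsigned int ring_size; /* hard size of the ring, in slots */
    unsigned int max_usage; /* slots that userspace-driven writes may fill */
};

/* Slots a userspace-driven write could still fill. */
static unsigned int space_for_user(const struct pipe_limits *p,
                                   unsigned int head, unsigned int tail)
{
    unsigned int occupancy = head - tail;

    assert(p->max_usage <= p->ring_size);
    if (occupancy >= p->max_usage)
        return 0;
    return p->max_usage - occupancy;
}

/* Slots a kernel-side producer (e.g. a notification) could still fill. */
static unsigned int space_for_kernel(const struct pipe_limits *p,
                                     unsigned int head, unsigned int tail)
{
    unsigned int occupancy = head - tail;

    return occupancy >= p->ring_size ? 0 : p->ring_size - occupancy;
}
```

    With ring_size = 32 and max_usage = 16, a ring holding 16 buffers has no
    room left for userspace in this model but still has 16 slots for the
    kernel.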

    Signed-off-by: David Howells

    David Howells
     

31 Oct, 2019

1 commit

  • Convert pipes to use head and tail pointers for the buffer ring rather than
    pointer and length as the latter requires two atomic ops to update (or a
    combined op) whereas the former only requires one.

    (1) The head pointer is the point at which production occurs and points to
    the slot in which the next buffer will be placed. This is equivalent
    to pipe->curbuf + pipe->nrbufs.

    The head pointer belongs to the write-side.

    (2) The tail pointer is the point at which consumption occurs. It points
    to the next slot to be consumed. This is equivalent to pipe->curbuf.

    The tail pointer belongs to the read-side.

    (3) head and tail are allowed to run to UINT_MAX and wrap naturally. They
    are only masked off when the array is being accessed, e.g.:

    pipe->bufs[head & mask]

    This means that it is not necessary to have a dead slot in the ring as
    head == tail isn't ambiguous.

    (4) The ring is empty if "head == tail".

    A helper, pipe_empty(), is provided for this.

    (5) The occupancy of the ring is "head - tail".

    A helper, pipe_occupancy(), is provided for this.

    (6) The number of free slots in the ring is "pipe->ring_size - occupancy".

    A helper, pipe_space_for_user() is provided to indicate how many slots
    userspace may use.

    (7) The ring is full if "head - tail >= pipe->ring_size".

    A helper, pipe_full(), is provided for this.
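
    The index arithmetic in points (3) to (7) can be tried out in userspace
    with a small model. This is only a sketch under stated assumptions: the
    struct ring_model type and the ring_*() helper names are hypothetical
    stand-ins for the helpers named above, and ring_size is assumed to be a
    power of two so that masking works.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the ring indices, not the kernel struct. */
struct ring_model {
    unsigned int head;      /* producer index, free-running */
    unsigned int tail;      /* consumer index, free-running */
    unsigned int ring_size; /* number of slots, assumed a power of two */
};

static bool ring_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}

static unsigned int ring_occupancy(unsigned int head, unsigned int tail)
{
    /* Unsigned subtraction stays correct after head wraps past UINT_MAX. */
    return head - tail;
}

static bool ring_full(unsigned int head, unsigned int tail,
                      unsigned int ring_size)
{
    return ring_occupancy(head, tail) >= ring_size;
}

/* Indices are masked only when a slot in the array is actually accessed. */
static unsigned int ring_slot(const struct ring_model *r, unsigned int index)
{
    assert(r->ring_size != 0 && (r->ring_size & (r->ring_size - 1)) == 0);
    return index & (r->ring_size - 1);
}
```

    Because the indices are never reduced modulo the ring size, a full ring
    (head - tail == ring_size) and an empty ring (head == tail) remain
    distinguishable without a dead slot.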

    Signed-off-by: David Howells

    David Howells
     

15 Oct, 2019

1 commit

  • Andreas Grünbacher reports that on the two filesystems that support
    iomap directio, it's possible for splice() to return -EAGAIN (instead of
    a short splice) if the pipe being written to has less space available in
    its pipe buffers than the length supplied by the calling process.

    Months ago we fixed splice_direct_to_actor to clamp the length of the
    read request to the size of the splice pipe. Do the same to do_splice.
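
    The clamping idea can be sketched as follows. This is a simplified
    userspace model, not the patch itself: clamp_splice_len() and
    MODEL_PAGE_SIZE are hypothetical names, one page per pipe slot is an
    assumption, and a completely full pipe (where a real caller may have to
    wait for space first) is outside the scope of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch only. */
#define MODEL_PAGE_SIZE 4096u

/* Limit a requested length to what the free pipe slots can hold. */
static size_t clamp_splice_len(size_t len, unsigned int total_slots,
                               unsigned int used_slots)
{
    size_t room;

    assert(used_slots <= total_slots);
    room = (size_t)(total_slots - used_slots) * MODEL_PAGE_SIZE;
    return len < room ? len : room;
}
```

    With 16 slots of which 14 are in use, a 1 MiB request is cut down to
    8192 bytes in this model, so the caller sees a short splice rather than
    an error.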

    Fixes: 17614445576b6 ("splice: don't read more than available pipe space")
    Reported-by: syzbot+3c01db6025f26530cf8d@syzkaller.appspotmail.com
    Reported-by: Andreas Grünbacher
    Reviewed-by: Andreas Grünbacher
    Signed-off-by: Darrick J. Wong

    Darrick J. Wong
     

01 Jun, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

27 Apr, 2019

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Three tracing fixes:

    - Use "nosteal" for ring buffer splice pages

    - Memory leak fix in error path of trace_pid_write()

    - Fix preempt_enable_no_resched() (use preempt_enable()) in ring
    buffer code"

    * tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    trace: Fix preempt_enable_no_resched() abuse
    tracing: Fix a memory leak by early error exit in trace_pid_write()
    tracing: Fix buffer_ref pipe ops

    Linus Torvalds
     

26 Apr, 2019

1 commit

  • This fixes multiple issues in buffer_pipe_buf_ops:

    - The ->steal() handler must not return zero unless the pipe buffer has
    the only reference to the page. But generic_pipe_buf_steal() assumes
    that every reference to the pipe is tracked by the page's refcount,
    which isn't true for these buffers - buffer_pipe_buf_get(), which
    duplicates a buffer, doesn't touch the page's refcount.
    Fix it by using generic_pipe_buf_nosteal(), which refuses every
    attempted theft. It should be easy to actually support ->steal, but the
    only current users of pipe_buf_steal() are the virtio console and FUSE,
    and they also only use it as an optimization. So it's probably not worth
    the effort.
    - The ->get() and ->release() handlers can be invoked concurrently on pipe
    buffers backed by the same struct buffer_ref. Make them safe against
    concurrency by using refcount_t.
    - The pointers stored in ->private were only zeroed out when the last
    reference to the buffer_ref was dropped. As far as I know, this
    shouldn't be necessary anyway, but if we do it, let's always do it.

    Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com

    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Al Viro
    Cc: stable@vger.kernel.org
    Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
    Signed-off-by: Jann Horn
    Signed-off-by: Steven Rostedt (VMware)

    Jann Horn
     

15 Apr, 2019

2 commits

  • Merge page ref overflow branch.

    Jann Horn reported that he can overflow the page ref count with
    sufficient memory (and a filesystem that is intentionally extremely
    slow).

    Admittedly it's not exactly easy. To have more than four billion
    references to a page requires a minimum of 32GB of kernel memory just
    for the pointers to the pages, much less any metadata to keep track of
    those pointers. Jann needed a total of 140GB of memory and a specially
    crafted filesystem that leaves all reads pending (in order to not ever
    free the page references and just keep adding more).

    Still, we have a fairly straightforward way to limit the two obvious
    user-controllable sources of page references: direct-IO like page
    references gotten through get_user_pages(), and the splice pipe page
    duplication. So let's just do that.

    * branch page-refs:
    fs: prevent page refcount overflow in pipe_buf_get
    mm: prevent get_user_pages() from overflowing page refcount
    mm: add 'try_get_page()' helper function
    mm: make page ref count overflow check tighter and more explicit

    Linus Torvalds
     
  • Change pipe_buf_get() to return a bool indicating whether it succeeded
    in raising the refcount of the page (if the thing in the pipe is a page).
    This removes another mechanism for overflowing the page refcount. All
    callers converted to handle a failure.
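
    The shape of that change can be sketched in a userspace model. All
    names here (struct page_model, struct buf_model, try_get_ref(),
    buf_get(), dup_buffers()) are hypothetical, and a plain int stands in
    for the page refcount; the model is not thread-safe and is not the
    kernel implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a page and a pipe buffer. */
struct page_model { int refcount; };
struct buf_model { struct page_model *page; };

/* Take a reference unless the count is invalid or about to overflow. */
static bool try_get_ref(struct page_model *page)
{
    if (page->refcount <= 0 || page->refcount == INT_MAX)
        return false;
    page->refcount++;
    return true;
}

/* Returns whether the reference was taken; callers must check it. */
static bool buf_get(struct buf_model *buf)
{
    assert(buf->page != NULL);
    return try_get_ref(buf->page);
}

/* Duplicate up to n buffers, stopping at the first failure. */
static int dup_buffers(struct buf_model *bufs, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (!buf_get(&bufs[i]))
            break;
    }
    return i; /* number of references actually taken */
}
```

    The point of the bool return is visible in dup_buffers(): the caller
    stops and reports how far it got instead of silently wrapping the
    counter.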

    Reported-by: Jann Horn
    Signed-off-by: Matthew Wilcox
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

13 Mar, 2019

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted fixes (really no common topic here)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Make __vfs_write() static
    vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
    pipe: stop using ->can_merge
    splice: don't merge into linked buffers
    fs: move generic stat response attr handling to vfs_getattr_nosec
    orangefs: don't reinitialize result_mask in ->getattr
    fs/devpts: always delete dcache dentry-s in dput()

    Linus Torvalds
     

05 Mar, 2019

2 commits

  • The current implementation of splice() and tee() ignores O_NONBLOCK set
    on pipe file descriptors and checks only the SPLICE_F_NONBLOCK flag for
    blocking on pipe arguments. This is inconsistent since splice()-ing
    from/to non-pipe file descriptors does take O_NONBLOCK into
    consideration.

    Fix this by promoting O_NONBLOCK, when set on a pipe, to
    SPLICE_F_NONBLOCK.
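
    The promotion itself amounts to one flag test. A minimal sketch,
    assuming a hypothetical helper name promote_nonblock() and a caller that
    already knows the descriptor is a pipe; the fallback definition of
    SPLICE_F_NONBLOCK mirrors the Linux UAPI value for builds without
    _GNU_SOURCE:

```c
#include <assert.h>
#include <fcntl.h>

#ifndef SPLICE_F_NONBLOCK
#define SPLICE_F_NONBLOCK 0x02 /* Linux UAPI value; needs _GNU_SOURCE otherwise */
#endif

/*
 * Hypothetical helper: given the file status flags of a pipe descriptor
 * and the flags passed to splice(), return the flags that take effect.
 */
static unsigned int promote_nonblock(int pipe_file_flags,
                                     unsigned int splice_flags)
{
    if (pipe_file_flags & O_NONBLOCK)
        splice_flags |= SPLICE_F_NONBLOCK;
    return splice_flags;
}

/* Small usage demonstration of the three cases. */
static void promote_nonblock_demo(void)
{
    /* O_NONBLOCK on the pipe is promoted. */
    assert(promote_nonblock(O_NONBLOCK, 0) & SPLICE_F_NONBLOCK);
    /* A blocking pipe leaves the caller's flags alone. */
    assert(promote_nonblock(0, 0) == 0);
    /* An explicit SPLICE_F_NONBLOCK is preserved either way. */
    assert(promote_nonblock(0, SPLICE_F_NONBLOCK) == SPLICE_F_NONBLOCK);
}
```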

    Some context for how the current implementation of splice() leads to
    inconsistent behavior. In the ongoing work[1] to add VM tracing
    capability to trace-cmd we stream tracing data over named FIFOs or
    vsockets from guests back to the host.

    When we receive SIGINT from user to stop tracing, we set O_NONBLOCK on
    the input file descriptor and set SPLICE_F_NONBLOCK for the next call to
    splice(). If splice() was blocked waiting on data from the input FIFO,
    after SIGINT splice() restarts with the same arguments (no
    SPLICE_F_NONBLOCK) and blocks again instead of returning -EAGAIN when no
    data is available.

    This differs from the splice() behavior when reading from a vsocket or
    when we're doing a traditional read()/write() loop (trace-cmd's
    --nosplice argument).

    With this patch applied we get the same behavior in all situations after
    setting O_NONBLOCK which also matches the behavior of doing a
    read()/write() loop instead of splice().

    This change does have potential of breaking users who don't expect
    EAGAIN from splice() when SPLICE_F_NONBLOCK is not set. OTOH programs
    that set O_NONBLOCK and don't anticipate EAGAIN are arguably buggy[2].

    [1] https://github.com/skaslev/trace-cmd/tree/vsock
    [2] https://github.com/torvalds/linux/blob/d47e3da1759230e394096fd742aad423c291ba48/fs/read_write.c#L1425

    Signed-off-by: Slavomir Kaslev
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Linus Torvalds

    Slavomir Kaslev
     
  • Every in-kernel use of this function defined it to KERNEL_DS (either as
    an actual define, or as an inline function). It's an entirely
    historical artifact, and long long long ago used to actually read the
    segment selector value of '%ds' on x86.

    Which in the kernel is always KERNEL_DS.

    Inspired by a patch from Jann Horn that just did this for a very small
    subset of users (the ones in fs/), along with Al who suggested a script.
    I then just took it to the logical extreme and removed all the remaining
    gunk.

    Roughly scripted with

    git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
    git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'

    plus manual fixups to remove a few unusual usage patterns, the couple of
    inline function cases and to fix up a comment that had become stale.

    The 'get_ds()' function remains in an x86 kvm selftest, since in user
    space it actually does something relevant.

    Inspired-by: Jann Horn
    Inspired-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2019

1 commit

  • Al Viro pointed out that since there is only one pipe buffer type to which
    new data can be appended, it isn't necessary to have a ->can_merge field in
    struct pipe_buf_operations, we can just check for a magic type.
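
    Checking for a "magic type" here means comparing the buffer's
    operations pointer against the one known mergeable type. A minimal
    sketch with hypothetical names (struct ops_model, struct buffer_model,
    mergeable_ops, other_ops, buffer_can_merge()), not the kernel's actual
    definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an operations table; no per-type flag needed. */
struct ops_model { const char *name; };

static const struct ops_model mergeable_ops = { "mergeable" };
static const struct ops_model other_ops = { "other" };

struct buffer_model { const struct ops_model *ops; };

/* New data may be appended only to the single mergeable buffer type. */
static bool buffer_can_merge(const struct buffer_model *buf)
{
    assert(buf != NULL);
    return buf->ops == &mergeable_ops;
}
```

    Identity comparison on the pointer replaces a field that every
    operations table would otherwise have to carry.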

    Suggested-by: Al Viro
    Signed-off-by: Jann Horn
    Signed-off-by: Al Viro

    Jann Horn