02 Oct, 2020

1 commit

  • The pipe splice code still used the old model of waiting for pipe IO by
    using a non-specific "pipe_wait()" that waited for any pipe event to
    happen, which depended on all pipe IO being entirely serialized by the
    pipe lock. So by checking the state you were waiting for, and then
    adding yourself to the wait queue before dropping the lock, you were
    guaranteed to see all the wakeups.

    Strictly speaking, the actual wakeups were not done under the lock, but
    the pipe_wait() model still worked, because since the waiter held the
    lock when checking whether it should sleep, it would always see the
    current state, and the wakeup was always done after updating the state.

    However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
    writing") split the single wait-queue into two, and in the process also
    made the "wait for event" code wait for _two_ wait queues, and that then
    showed a race with the wakers that were not serialized by the pipe lock.

    It's only splice that used that "pipe_wait()" model, so the problem
    wasn't obvious, but Josef Bacik reports:

    "I hit a hang with fstest btrfs/187, which does a btrfs send into
    /dev/null. This works by creating a pipe, the write side is given to
    the kernel to write into, and the read side is handed to a thread that
    splices into a file, in this case /dev/null.

    The box that was hung had the write side stuck here [pipe_write] and
    the read side stuck here [splice_from_pipe_next -> pipe_wait].

    [ more details about pipe_wait() scenario ]

    The problem is we're doing the prepare_to_wait, which sets our state
    each time, however we can be woken up either with reads or writes. In
    the case above we race with the WRITER waking us up, and re-set our
    state to INTERRUPTIBLE, and thus never break out of schedule"

    Josef had a patch that avoided the issue in pipe_wait() by just making
    it set the state only once, but the deeper problem is that pipe_wait()
    depends on a level of synchronization by the pipe mutex that it really
    shouldn't. And the whole "wait for any pipe state change" model really
    isn't very good to begin with.

    So rather than trying to work around things in pipe_wait(), remove that
    legacy model of "wait for arbitrary pipe event" entirely, and actually
    create functions that wait for the pipe actually being readable or
    writable, and can do so without depending on the pipe lock serializing
    everything.
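
    As a rough sketch of the model this moves to (the helper and wait-queue
    names follow the split-queue layout described above, but treat them as
    illustrative rather than the exact merged code), the waiter tests one
    specific condition against the ring and sleeps on the matching queue,
    without relying on the pipe lock for correctness:

    /* Sketch only: a condition-specific wait for the pipe becoming
     * readable, replacing the old "wait for any pipe event" model. */
    static bool pipe_readable(const struct pipe_inode_info *pipe)
    {
        /* Data in the ring, or no writers left to produce any. */
        return !pipe_empty(READ_ONCE(pipe->head), READ_ONCE(pipe->tail)) ||
               !READ_ONCE(pipe->writers);
    }

    static int wait_for_pipe_readable(struct pipe_inode_info *pipe)
    {
        /* Drop the pipe lock before sleeping; wait_event_interruptible()
         * re-checks the condition after queueing itself, so no wakeup is
         * lost even though the wakers don't hold the pipe lock. */
        pipe_unlock(pipe);
        return wait_event_interruptible(pipe->rd_wait, pipe_readable(pipe));
    }

    The splice paths then retake the pipe lock once the wait returns, and an
    equivalent "wait for writable" helper covers the other direction.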

    Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
    Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
    Reported-by: Josef Bacik
    Reviewed-and-tested-by: Josef Bacik
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Oct, 2020

1 commit

  • Calling pipe2() with O_NOTIFICATION_PIPE can result in memory leaks
    if watch_queue_init() fails.

    In case of watch_queue_init() failure in pipe2() we are left
    with inode and pipe_inode_info instances that need to be freed. That
    failure exit has been introduced in commit c73be61cede5 ("pipe: Add
    general notification queue support") and its handling should've been
    identical to nearby treatment of alloc_file_pseudo() failures - it
    is dealing with the same situation. As it is, the mainline kernel
    leaks in that case.

    Another problem is that CONFIG_WATCH_QUEUE and !CONFIG_WATCH_QUEUE
    cases are treated differently (and the former leaks just pipe_inode_info,
    the latter - both pipe_inode_info and inode).

    Fixed by providing a dummy watch_queue_init() in the !CONFIG_WATCH_QUEUE
    case and by having failures of watch_queue_init() handled the same way
    we handle alloc_file_pseudo() ones.
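
    A minimal sketch of the shape of that fix (assuming watch_queue_init()
    returns an int error code, as the description implies; the surrounding
    pipe setup and cleanup is elided):

    /* Sketch: a no-op stub so both configurations share one failure path. */
    #ifdef CONFIG_WATCH_QUEUE
    extern int watch_queue_init(struct pipe_inode_info *pipe);
    #else
    static inline int watch_queue_init(struct pipe_inode_info *pipe)
    {
        return 0;   /* nothing to set up, and nothing that can fail */
    }
    #endif

    With the stub in place, a watch_queue_init() failure can take exactly the
    same error exit as a failed alloc_file_pseudo(), freeing both the
    pipe_inode_info and the inode in either configuration.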

    Fixes: c73be61cede5 ("pipe: Add general notification queue support")
    Signed-off-by: Qian Cai
    Signed-off-by: Al Viro

    Qian Cai
     

14 Jun, 2020

1 commit

  • …git/dhowells/linux-fs

    Pull notification queue from David Howells:
    "This adds a general notification queue concept and adds an event
    source for keys/keyrings, such as linking and unlinking keys and
    changing their attributes.

    Thanks to Debarshi Ray, we do have a pull request to use this to fix a
    problem with gnome-online-accounts - as mentioned last time:

    https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

    Without this, g-o-a has to constantly poll a keyring-based kerberos
    cache to find out if kinit has changed anything.

    [ There are other notifications pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

    LSM hooks are included:

    - A set of hooks are provided that allow an LSM to rule on whether or
    not a watch may be set. Each of these hooks takes a different
    "watched object" parameter, so they're not really shareable. The
    LSM should use current's credentials. [Wanted by SELinux & Smack]

    - A hook is provided to allow an LSM to rule on whether or not a
    particular message may be posted to a particular queue. This is
    given the credentials from the event generator (which may be the
    system) and the watch setter. [Wanted by Smack]

    I've provided SELinux and Smack with implementations of some of these
    hooks.

    WHY
    ===

    Key/keyring notifications are desirable because if you have your
    kerberos tickets in a file/directory, your Gnome desktop will monitor
    that using something like fanotify and tell you if your credentials
    cache changes.

    However, we also have the ability to cache your kerberos tickets in
    the session, user or persistent keyring so that it isn't left around
    on disk across a reboot or logout. Keyrings, however, cannot currently
    be monitored asynchronously, so the desktop has to poll for it - not
    so good on a laptop. This facility will allow the desktop to avoid the
    need to poll.

    DESIGN DECISIONS
    ================

    - The notification queue is built on top of a standard pipe. Messages
    are effectively spliced in. The pipe is opened with a special flag:

    pipe2(fds, O_NOTIFICATION_PIPE);

    The special flag has the same value as O_EXCL (which doesn't seem
    like it will ever be applicable in this context)[?]. It is given up
    front to make it a lot easier to prohibit splice&co from accessing
    the pipe.

    [?] Should this be done some other way? I'd rather not use up a new
    O_* flag if I can avoid it - should I add a pipe3() system call
    instead?

    The pipe is then configured:

    ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
    ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

    Messages are then read out of the pipe using read().

    - It should be possible to allow write() to insert data into the
    notification pipes too, but this is currently disabled as the
    kernel has to be able to insert messages into the pipe *without*
    holding pipe->mutex and the code to make this work needs careful
    auditing.

    - sendfile(), splice() and vmsplice() are disabled on notification
    pipes because of the pipe->mutex issue and also because they
    sometimes want to revert what they just did - but one or more
    notification messages might've been interleaved in the ring.

    - The kernel inserts messages with the wait queue spinlock held. This
    means that pipe_read() and pipe_write() have to take the spinlock
    to update the queue pointers.

    - Records in the buffer are binary, typed and have a length so that
    they can be of varying size.

    This allows multiple heterogeneous sources to share a common
    buffer; there are 16 million types available, of which I've used
    just a few, so there is scope for others to be used. Tags may be
    specified when a watchpoint is created to help distinguish the
    sources.

    - Records are filterable as types have up to 256 subtypes that can be
    individually filtered. Other filtration is also available.

    - Notification pipes don't interfere with each other; each may be
    bound to a different set of watches. Any particular notification
    will be copied to all the queues that are currently watching for it
    - and only those that are watching for it.

    - When recording a notification, the kernel will not sleep, but will
    rather mark a queue as having lost a message if there's
    insufficient space. read() will fabricate a loss notification
    message at an appropriate point later.

    - The notification pipe is created and then watchpoints are attached
    to it, using one of:

    keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
    watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
    watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

    where in each case, fd indicates the queue and the number after is
    a tag between 0 and 255.

    - Watches are removed if either the notification pipe is destroyed or
    the watched object is destroyed. In the latter case, a message will
    be generated indicating the enforced watch removal.

    Things I want to avoid:

    - Introducing features that make the core VFS dependent on the
    network stack or networking namespaces (ie. usage of netlink).

    - Dumping all this stuff into dmesg and having a daemon that sits
    there parsing the output and distributing it as this then puts the
    responsibility for security into userspace and makes handling
    namespaces tricky. Further, dmesg might not exist or might be
    inaccessible inside a container.

    - Letting users see events they shouldn't be able to see.

    TESTING AND MANPAGES
    ====================

    - The keyutils tree has a pipe-watch branch that has keyctl commands
    for making use of notifications. Proposed manual pages can also be
    found on this branch, though a couple of them really need to go to
    the main manpages repository instead.

    If the kernel supports the watching of keys, then running "make
    test" on that branch will cause the testing infrastructure to spawn
    a monitoring process on the side that monitors a notifications pipe
    for all the key/keyring changes induced by the tests and they'll
    all be checked off to make sure they happened.

    https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

    - A test program is provided (samples/watch_queue/watch_test) that
    can be used to monitor for keyrings, mount and superblock events.
    Information on the notifications is simply logged to stdout"

    * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    smack: Implement the watch_key and post_notification hooks
    selinux: Implement the watch_key security hook
    keys: Make the KEY_NEED_* perms an enum rather than a mask
    pipe: Add notification lossage handling
    pipe: Allow buffers to be marked read-whole-or-error for notifications
    Add sample notification program
    watch_queue: Add a key/keyring notification facility
    security: Add hooks to rule on setting a watch
    pipe: Add general notification queue support
    pipe: Add O_NOTIFICATION_PIPE
    security: Add a hook for the point of notification insertion
    uapi: General notification queue definitions

    Linus Torvalds
     

19 May, 2020

3 commits

  • Add handling for loss of notifications by having read() insert a
    loss-notification message after it has read the pipe buffer that was last
    in the ring when the loss occurred.

    Lossage can come about either by running out of notification descriptors or
    by running out of space in the pipe ring.

    Signed-off-by: David Howells

    David Howells
     
  • Allow a buffer to be marked such that read() must return the entire buffer
    in one go or return ENOBUFS. Multiple buffers can be amalgamated into a
    single read, but a short read will occur if the next "whole" buffer won't
    fit.

    This is useful for watch queue notifications to make sure we don't split a
    notification across multiple reads, especially given that we need to
    fabricate an overrun record under some circumstances - and that isn't in
    the buffers.

    Signed-off-by: David Howells

    David Howells
     
  • Make it possible to have a general notification queue built on top of a
    standard pipe. Notifications are 'spliced' into the pipe and then read
    out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
    notifications as post_one_notification() cannot take pipe->mutex. This
    means that notifications could be posted in between individual pipe
    buffers, making iov_iter_revert() difficult to effect.

    The way the notification queue is used is:

    (1) An application opens a pipe with a special flag and indicates the
    number of messages it wishes to be able to queue at once (this can
    only be set once):

    pipe2(fds, O_NOTIFICATION_PIPE);
    ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);

    (2) The application then uses poll() and read() as normal to extract data
    from the pipe. read() will return multiple notifications if the
    buffer is big enough, but it will not split a notification across
    buffers - rather it will return a short read or EMSGSIZE.

    Notification messages include a length in the header so that the
    caller can split them up.

    Each message has a header that describes it:

    struct watch_notification {
    __u32 type:24;
    __u32 subtype:8;
    __u32 info;
    };

    The type indicates the source (eg. mount tree changes, superblock events,
    keyring changes, block layer events) and the subtype indicates the event
    type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
    indicates a number of things, including the entry length, an ID assigned to
    a watchpoint contributing to this buffer and type-specific flags.

    Supplementary data, such as the key ID that generated an event, can be
    attached in additional slots. The maximum message size is 127 bytes.
    Messages may not be padded or aligned, so there is no guarantee, for
    example, that the notification type will be on a 4-byte boundary.
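
    Given that layout, a reader has to walk the buffer returned by read()
    using the length carried in each header. A rough user-space sketch (the
    length mask and the handle_event() consumer are illustrative assumptions,
    not definitions taken from this patch):

    #define WN_INFO_LEN_MASK 0x7f   /* assumed: length in the low bits of
                                     * info, max message size 127 bytes */

    unsigned char buf[4096];
    ssize_t n = read(fds[0], buf, sizeof(buf));
    size_t p = 0;

    while (n > 0 && p + sizeof(struct watch_notification) <= (size_t)n) {
        struct watch_notification *wn = (void *)(buf + p);
        size_t len = wn->info & WN_INFO_LEN_MASK;

        if (len < sizeof(*wn) || p + len > (size_t)n)
            break;                  /* malformed or truncated record */
        handle_event(wn->type, wn->subtype, wn, len);   /* hypothetical */
        p += len;                   /* records are packed, not padded */
    }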

    Signed-off-by: David Howells

    David Howells
     

03 Apr, 2020

1 commit

  • Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
    to better reflect what they are actually doing:

    1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
    the current memcg

    2) set or clear the PageKmemcg flag

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

19 Feb, 2020

1 commit

  • Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
    waits when reading or writing") broke one of the CRIU tests. He even
    has a trivial reproducer:

    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main()
    {
        int p[2];
        pid_t p1, p2;
        int status;

        if (pipe(p) == -1)
            return 1;

        p1 = fork();
        if (p1 == 0) {
            close(p[1]);
            read(p[0], &status, sizeof(status));
            return 0;
        }
        p2 = fork();
        if (p2 == 0) {
            close(p[1]);
            read(p[0], &status, sizeof(status));
            return 0;
        }
        sleep(1);
        close(p[1]);
        wait(&status);
        wait(&status);

        return 0;
    }

    and the problem - once he points it out - is obvious. We use these nice
    exclusive waits, but when the last writer goes away, it then needs to
    wake up _every_ reader (and conversely, the last reader disappearing
    needs to wake every writer, of course).

    In fact, when going through this, we had several small oddities around
    how to wake things. We did in fact wake every reader when we changed
    the size of the pipe buffers. But that's entirely pointless, since that
    just acts as a possible source of new space - no new data to read.

    And when we change the size of the buffer, we don't need to wake all
    writers even when we add space - that case acts just as if somebody made
    space by reading, and any writer that finds itself not filling it up
    entirely will wake the next one.

    On the other hand, on the exit path, we tried to limit the wakeups with
    the proper poll keys etc, which is entirely pointless, because at that
    point we obviously need to wake up everybody. So don't do that: just
    wake up everybody - but only do that if the counts changed to zero.

    So fix those non-IO wakeups to be more proper: space change doesn't add
    any new data, but it might make room for writers, so it wakes up a
    writer. And the actual changes to reader/writer counts should wake up
    everybody, since everybody is affected (ie readers will all see EOF if
    the writers have gone away, and writers will all get EPIPE if all
    readers have gone away).
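
    On that exit path the rule boils down to something like the following
    sketch (using the two wait queues this series introduced; the real code
    also handles fasync and performs the actual count updates):

    /* Sketch: the last reader or writer just went away - wake everyone on
     * both queues, since every waiter is affected (EOF or EPIPE). */
    if (!pipe->readers || !pipe->writers) {
        wake_up_interruptible_all(&pipe->rd_wait);
        wake_up_interruptible_all(&pipe->wr_wait);
    }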

    Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
    Reported-and-tested-by: Andrei Vagin
    Cc: Josh Triplett
    Cc: Matthew Wilcox
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Feb, 2020

1 commit

  • This makes the pipe code use separate wait-queues and exclusive waiting
    for readers and writers, avoiding a nasty thundering herd problem when
    there are lots of readers waiting for data on a pipe (or, less commonly,
    lots of writers waiting for a pipe to have space).

    While this isn't a common occurrence in the traditional "use a pipe as a
    data transport" case, where you typically only have a single reader and
    a single writer process, there is one common special case: using a pipe
    as a source of "locking tokens" rather than for data communication.

    In particular, the GNU make jobserver code ends up using a pipe as a way
    to limit parallelism, where each job consumes a token by reading a byte
    from the jobserver pipe, and releases the token by writing a byte back
    to the pipe.

    This pattern is fairly traditional on Unix, and works very well, but
    will waste a lot of time waking up a lot of processes when only a single
    reader needs to be woken up when a writer releases a new token.

    A simplified test-case of just this pipe interaction is to create 64
    processes, and then pass a single token around between them (this
    test-case also intentionally passes another token that gets ignored to
    test the "wake up next" logic too, in case anybody wonders about it):

    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd[2], counters[2];

        pipe(fd);
        counters[0] = 0;
        counters[1] = -1;
        write(fd[1], counters, sizeof(counters));

        /* 64 processes */
        fork(); fork(); fork(); fork(); fork(); fork();

        do {
            int i;
            read(fd[0], &i, sizeof(i));
            if (i < 0)
                continue;
            counters[0] = i + 1;
            write(fd[1], counters, (1 + (i & 1)) * sizeof(int));
        } while (counters[0] < 1000000);
        return 0;
    }

    and in a perfect world, passing that token around should only cause one
    context switch per transfer, when the writer of a token causes a
    directed wakeup of just a single reader.

    But with the "writer wakes all readers" model we traditionally had, on
    my test box the above case causes more than an order of magnitude more
    scheduling: instead of the expected ~1M context switches, "perf stat"
    shows

    231,852.37 msec task-clock # 15.857 CPUs utilized
    11,250,961 context-switches # 0.049 M/sec
    616,304 cpu-migrations # 0.003 M/sec
    1,648 page-faults # 0.007 K/sec
    1,097,903,998,514 cycles # 4.735 GHz
    120,781,778,352 instructions # 0.11 insn per cycle
    27,997,056,043 branches # 120.754 M/sec
    283,581,233 branch-misses # 1.01% of all branches

    14.621273891 seconds time elapsed

    0.018243000 seconds user
    3.611468000 seconds sys

    before this commit.

    After this commit, I get

    5,229.55 msec task-clock # 3.072 CPUs utilized
    1,212,233 context-switches # 0.232 M/sec
    103,951 cpu-migrations # 0.020 M/sec
    1,328 page-faults # 0.254 K/sec
    21,307,456,166 cycles # 4.074 GHz
    12,947,819,999 instructions # 0.61 insn per cycle
    2,881,985,678 branches # 551.096 M/sec
    64,267,015 branch-misses # 2.23% of all branches

    1.702148350 seconds time elapsed

    0.004868000 seconds user
    0.110786000 seconds sys

    instead. Much better.

    [ Note! This kernel improvement seems to be very good at triggering a
    race condition in the make jobserver (in GNU make 4.2.1) for me. It's
    a long known bug that was fixed back in June 2017 by GNU make commit
    b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
    avoid hangs.").

    But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
    so a number of distributions may still have the buggy version. Some
    have backported the fix to their 4.2.1 release, though, and even
    without the fix it's quite timing-dependent whether the bug actually
    is hit. ]

    Josh Triplett says:
    "I've been hammering on your pipe fix patch (switching to exclusive
    wait queues) for a month or so, on several different systems, and I've
    run into no issues with it. The patch *substantially* improves
    parallel build times on large (~100 CPU) systems, both with parallel
    make and with other things that use make's pipe-based jobserver.

    All current distributions (including stable and long-term stable
    distributions) have versions of GNU make that no longer have the
    jobserver bug"

    Tested-by: Josh Triplett
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Dec, 2019

1 commit

  • The LTP pipeio_1 test hangs with v5.5-rc2-385-gb8e382a185eb: the read
    side observes an empty pipe and sleeps, while the write side runs out
    of space and then sleeps as well. In this scenario there are 5 writers
    and 1 reader.

    The problem is that after pipe_write() reacquires the pipe lock, it
    re-checks for an empty pipe with a potentially stale 'head' and so no
    longer wakes up the read side. pipe->tail can advance beyond the stale
    'head' because there are multiple writers.

    Use pipe->head for the empty-pipe check after reacquiring the lock so
    that the current state is observed.
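
    Schematically, the change is to re-read the head once the lock has been
    reacquired instead of trusting the copy cached before sleeping (a sketch;
    the surrounding pipe_write() loop is elided):

    __pipe_lock(pipe);  /* reacquired after waiting for space */

    /* Buggy: pipe_empty(head, pipe->tail) - 'head' was cached before
     * sleeping and may be stale, since other writers can have advanced
     * the ring in the meantime. */

    /* Fixed: use the current head so the reader is woken whenever the
     * pipe really has just become non-empty. */
    was_empty = pipe_empty(pipe->head, pipe->tail);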

    Testing: With the patch, LTP pipeio_1 ran successfully in a loop for
    1 hour. Without the patch it hung within a minute.

    Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")
    Reported-by: Rachel Sibley
    Signed-off-by: Jan Stancek
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

12 Dec, 2019

1 commit

  • There's no need to separately check for signals while inside the locked
    region, since we're going to do "wait_event_interruptible()" right
    afterwards anyway, and the error handling is much simpler there.

    The check for whether we had already read anything was also redundant,
    since we no longer do the odd merging of reads when there are pending
    writers.

    But perhaps more importantly, this adds commentary about why we still
    need to wake up possible writers even though we didn't read any data,
    and why we can skip all the finishing touches now if we get a signal (or
    had a signal pending) while waiting for more data.

    [ This is a split-out cleanup from my "make pipe IO use exclusive wait
    queues" thing, which I can't apply because it triggers a nasty bug in
    the GNU make jobserver - Linus ]

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Dec, 2019

5 commits

  • pipe_wait() may be simple, but since it relies on the pipe lock, it
    means that we have to do the wakeup while holding the lock. That's
    unfortunate, because the very first thing the woken task will want to
    do is to get the pipe lock for itself.

    So get rid of the pipe_wait() usage by simply releasing the pipe lock,
    doing the wakeup (if required) and then using wait_event_interruptible()
    to wait on the right condition instead.

    wait_event_interruptible() handles races on its own by comparing the
    wakeup condition before and after adding itself to the wait queue, so
    you can use an optimistic unlocked condition for it.
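
    The resulting shape is roughly the following (a sketch; the pipe still
    has a single wait queue at this point, and pipe_writable() stands in for
    whatever condition the specific caller needs):

    /* Instead of waking and then pipe_wait()ing under the pipe lock: */
    __pipe_unlock(pipe);
    if (do_wakeup)      /* we produced data before running out of space */
        wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
    /* Optimistic, unlocked condition; re-checked after queueing. */
    wait_event_interruptible(pipe->wait, pipe_writable(pipe));
    __pipe_lock(pipe);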

    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This code is ancient, and goes back to when we only had a single page
    for the pipe buffers. The exact history is hidden in the mists of time
    (ie "before git", and in fact predates the BK repository too).

    At that long-ago point in time, it actually helped to try to merge big
    back-and-forth pipe reads and writes, and not limit pipe reads to the
    single pipe buffer in length just because that was all we had at a time.

    However, since then we've expanded the pipe buffers to multiple pages,
    and this logic really doesn't seem to make sense. And a lot of it is
    somewhat questionable (ie "hmm, the user asked for a non-blocking read,
    but we see that there's a writer pending, so let's wait anyway to get
    the extra data that the writer will have").

    But more importantly, it makes the "go to sleep" logic much less
    obvious, and considering the wakeup issues we've had, I want to have
    fewer of those kinds of things.

    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This is the read side version of the previous commit: it simplifies the
    logic to only wake up waiting writers when necessary, and makes sure to
    use a synchronous wakeup. This time not so much for GNU make jobserver
    reasons (that pipe never fills up), but simply to get the writer going
    quickly again.

    A bit less verbose commentary this time, if only because I assume that
    the write side commentary isn't going to be ignored if you touch this
    code.

    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The pipe rework ends up having been extra painful, partly because of
    actual bugs with ordering and caching of the pipe state, but also
    because of subtle performance issues.

    In particular, the pipe rework caused the kernel build to inexplicably
    slow down.

    The reason turns out to be that the GNU make jobserver (which limits the
    parallelism of the build) uses a pipe to implement a "token" system: a
    parallel submake will read a character from the pipe to get the job
    token before starting a new job, and will write a character back to the
    pipe when it is done. The overall job limit is thus easily controlled
    by just writing the appropriate number of initial token characters into
    the pipe.

    But to work well, that really means that the old behavior of write
    wakeups being synchronous (WF_SYNC) is very important - when the pipe
    writer wakes up a reader, we want the reader to actually get scheduled
    immediately. Otherwise you lose the parallelism of the build.

    The pipe rework lost that synchronous wakeup on write, and we had
    clearly all forgotten the reasons and rules for it.

    This rewrites the pipe write wakeup logic to do the required WF_SYNC
    wakeups, but also clarifies the logic and avoids extraneous wakeups.

    It also ends up adding a number of comments about what it does and why,
    so that we hopefully don't end up forgetting about this next time we
    change this code.

    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The kernel wait queues have a basic rule to them: you add yourself to
    the wait-queue first, and then you check the things that you're going to
    wait on. That avoids the races with the event you're waiting for.

    The same goes for poll/select logic: the "poll_wait()" goes first, and
    then you check the things you're polling for.

    Of course, if you use locking, the ordering doesn't matter since the
    lock will serialize with anything that changes the state you're looking
    at. That's not the case here, though.

    So move the poll_wait() first in pipe_poll(), before you start looking
    at the pipe state.

    Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
    Cc: David Howells
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Dec, 2019

3 commits

  • Merge two fixes for the pipe rework from David Howells:
    "Here are a couple of patches to fix bugs syzbot found in the pipe
    changes:

    - An assertion check will sometimes trip when polling a pipe because
    the ring size and indices used are approximate and may be being
    changed simultaneously.

    An equivalent approximate calculation was done previously, but
    without the assertion check, so I've just dropped the check. To
    make it accurate, the pipe mutex would need to be taken or the spin
    lock could be used - but usage of the spinlock would need to be
    rolled out into splice, iov_iter and other places for that.

    - The index mask and the max_usage values cannot be cached across
    pipe_wait() as F_SETPIPE_SZ could have been called during the wait.
    This can cause pipe_write() to break"

    * pipe-rework:
    pipe: Fix missing mask update after pipe_wait()
    pipe: Remove assertion from pipe_poll()

    Linus Torvalds
     
  • Fix pipe_write() to not cache the ring index mask and max_usage, as
    their values are invalidated by calling pipe_wait() because the latter
    function drops the pipe lock, thereby allowing F_SETPIPE_SZ to change them.
    Without this, pipe_write() may subsequently miscalculate the array
    indices and pipe fullness, leading to an oops like the following:

    BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
    Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
    ...
    CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
    ...
    Call Trace:
    pipe_write+0xc25/0xe10 fs/pipe.c:481
    call_write_iter include/linux/fs.h:1895 [inline]
    new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
    __vfs_write+0x94/0x110 fs/read_write.c:496
    vfs_write+0x18a/0x520 fs/read_write.c:558
    ksys_write+0x105/0x220 fs/read_write.c:611
    __do_sys_write fs/read_write.c:623 [inline]
    __se_sys_write fs/read_write.c:620 [inline]
    __x64_sys_write+0x6e/0xb0 fs/read_write.c:620
    do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    This is not a problem for pipe_read() as the mask is recalculated on
    each pass of the loop, after pipe_wait() has been called.
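
    Schematically, the fix is to recompute those values on every pass of the
    write loop (a sketch, not the literal diff):

    for (;;) {
        /* pipe_wait() dropped pipe->mutex, so F_SETPIPE_SZ may have
         * resized the ring - recompute instead of reusing old values. */
        unsigned int max_usage = pipe->max_usage;
        unsigned int mask = pipe->ring_size - 1;

        /* ... fill pipe->bufs[pipe->head & mask], checking fullness
         *     against max_usage ... */

        pipe_wait(pipe);    /* drops and retakes pipe->mutex */
    }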

    Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
    Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    Cc: Eric Biggers
    [ Changed it to use a temporary variable 'mask' to avoid long lines -Linus ]
    Signed-off-by: Linus Torvalds

    David Howells
     
  • An assertion check was added to pipe_poll() to make sure that the ring
    occupancy isn't seen to overflow the ring size. However, since no locks
    are held when the three values are read, it is possible for F_SETPIPE_SZ
    to intervene and muck up the calculation, thereby causing the oops.

    Fix this by simply removing the assertion and accepting that the
    calculation might be approximate.

    Note that the previous code also had a similar issue, though there was
    no assertion check, since the occupancy counter and the ring size were
    not read with a lock held, so it's possible that the poll check might
    have malfunctioned then too.

    Also wake up all the waiters so that they can reissue their checks if
    there was a competing read or write.

    Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
    Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    cc: Eric Biggers
    Signed-off-by: Linus Torvalds

    David Howells
     

01 Dec, 2019

1 commit

  • …ux/kernel/git/dhowells/linux-fs

    Pull pipe rework from David Howells:
    "This is my set of preparatory patches for building a general
    notification queue on top of pipes. It makes a number of significant
    changes:

    - It removes the nr_exclusive argument from __wake_up_sync_key() as
    this is always 1. This prepares for the next step:

    - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
    woken up from a function that's holding the poll waitqueue
    spinlock.

    - Change the pipe buffer ring to be managed in terms of unbounded
    head and tail indices rather than bounded index and length. This
    means that reading the pipe only needs to modify one index, not
    two.

    - A selection of helper functions are provided to query the state of
    the pipe buffer, plus a couple to apply updates to the pipe
    indices.

    - The pipe ring is allowed to have kernel-reserved slots. This allows
    many notification messages to be spliced in by the kernel without
    allowing userspace to pin too many pages if it writes to the same
    pipe.

    - Advance the head and tail indices inside the pipe waitqueue lock
    and use wake_up_interruptible_sync_poll_locked() to poke poll
    without having to take the lock twice.

    - Rearrange pipe_write() to preallocate the buffer it is going to
    write into and then drop the spinlock. This allows kernel
    notifications to then be added to the ring whilst it is filling the
    buffer it allocated. The read side is stalled because the pipe
    mutex is still held.

    - Don't wake up readers on a pipe if there was already data in it
    when we added more.

    - Don't wake up writers on a pipe if the ring wasn't full before we
    removed a buffer"

    * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    pipe: Remove sync on wake_ups
    pipe: Increase the writer-wakeup threshold to reduce context-switch count
    pipe: Check for ring full inside of the spinlock in pipe_write()
    pipe: Remove redundant wakeup from pipe_write()
    pipe: Rearrange sequence in pipe_write() to preallocate slot
    pipe: Conditionalise wakeup in pipe_read()
    pipe: Advance tail pointer inside of wait spinlock in pipe_read()
    pipe: Allow pipes to have kernel-reserved slots
    pipe: Use head and tail pointers for the ring, not cursor and length
    Add wake_up_interruptible_sync_poll_locked()
    Remove the nr_exclusive argument from __wake_up_sync_key()
    pipe: Reduce #inclusion of pipe_fs_i.h

    Linus Torvalds
     

26 Nov, 2019

1 commit

  • In commit 3975b097e577 ("convert stream-like files -> stream_open, even
    if they use noop_llseek") Kirill used a coccinelle script to change
    "nonseekable_open()" to "stream_open()", which changed the trivial cases
    of stream-like file descriptors to the new model with FMODE_STREAM.

    However, the two big cases - sockets and pipes - don't actually have
    that trivial pattern at all, and were thus never converted to
    FMODE_STREAM even though it makes lots of sense to do so.

    That's particularly true when looking forward to the next change:
    getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
    decide whether f_pos updates are needed or not. And if they are, we'll
    always do them atomically.

    This came up because KCSAN (correctly) noted that the non-locked f_pos
    updates are data races: they are clearly benign for the case where we
    don't care, but it would be good to just not have that issue exist at
    all.

    Note that the reason we used FMODE_ATOMIC_POS originally is that only
    doing it for the minimal required case is "safer" in that it's possible
    that the f_pos locking can cause unnecessary serialization across the
    whole write() call. And in the worst case, that kind of serialization
    can cause deadlock issues: think writers that need readers to empty the
    state using the same file descriptor.

    [ Note that the locking is per-file descriptor - because it protects
    "f_pos", which is obviously per-file descriptor - so it only affects
    cases where you literally use the same file descriptor to both read
    and write.

    So a regular pipe that has separate reading and writing file
    descriptors doesn't really have this situation even though it's the
    obvious case of "reader empties what a bit writer concurrently fills"

    But we want to make pipes as being stream-line anyway, because we
    don't want the unnecessary overhead of locking, and because a named
    pipe can be (ab-)used by reading and writing to the same file
    descriptor. ]

    There are likely a lot of other cases that might want FMODE_STREAM, and
    looking for ".llseek = no_llseek" users and other cases that don't have
    an lseek file operation at all and making them use "stream_open()" might
    be a good idea. But pipes and sockets are likely to be the two main
    cases.
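
    For reference, marking a file as a stream is a one-line affair at open
    time; roughly, stream_open() amounts to the following (a sketch of the
    effect described above, not necessarily the exact implementation):

    int stream_open(struct inode *inode, struct file *filp)
    {
        /* Stream-like: no meaningful f_pos, so no seeking or pread/pwrite,
         * and no need for FMODE_ATOMIC_POS locking on read/write. */
        filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE |
                          FMODE_ATOMIC_POS);
        filp->f_mode |= FMODE_STREAM;
        return 0;
    }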

    Cc: Kirill Smelkov
    Cc: Eric Dumazet
    Cc: Al Viro
    Cc: Alan Stern
    Cc: Marco Elver
    Cc: Andrea Parri
    Cc: Paul McKenney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Nov, 2019

8 commits

  • David Howells
     
  • Increase the threshold at which the reader sends a wake event to the
    writers in the queue such that the queue must be half empty before the wake
    is issued rather than the wake being issued when just a single slot is
    available.

    This reduces the number of context switches in the tests significantly,
    without altering the amount of work achieved. With my pipe-bench program,
    there's a 20% reduction versus an unpatched kernel.
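
    In terms of the ring helpers added earlier in this series, the
    reader-side wake condition becomes roughly (a sketch; the exact
    threshold test in the patch may be expressed differently):

    /* Wake waiting writers only once at least half the ring is free,
     * rather than as soon as a single slot opens up. */
    if (pipe_occupancy(pipe->head, pipe->tail) <= pipe->max_usage / 2)
        wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);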

    Suggested-by: Konstantin Khlebnikov
    Signed-off-by: David Howells

    David Howells
     
  • Make pipe_write() check to see if the ring has become full between it
    taking the pipe mutex, checking the ring status and then taking the
    spinlock.

    This can happen if a notification is written into the pipe as that happens
    without the pipe mutex.

    Signed-off-by: David Howells

    David Howells
     
  • Remove a redundant wakeup from pipe_write().

    Signed-off-by: David Howells

    David Howells
     
  • Rearrange the sequence in pipe_write() so that the allocation of the new
    buffer, the allocation of a ring slot and the attachment to the ring is
    done under the pipe wait spinlock and then the lock is dropped and the
    buffer can be filled.

    The data copy needs to be done with the spinlock unheld and irqs enabled,
    so the lock needs to be dropped first. However, the reader can't progress
    as we're holding pipe->mutex.

    We also need to drop the lock because holding it would impact others
    looking at the pipe waitqueue, such as poll(), the consumer and a
    future kernel message writer.

    We just abandon the preallocated slot if we get a copy error. Future
    writes may continue it and a future read will eventually recycle it.
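
    A compressed sketch of the reordered sequence (buffer field handling and
    error paths heavily abbreviated):

    struct pipe_buffer *buf;
    struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
    unsigned int head;

    spin_lock_irq(&pipe->wait.lock);
    head = pipe->head;
    buf = &pipe->bufs[head & (pipe->ring_size - 1)];
    pipe->head = head + 1;      /* slot reserved and visible to readers */
    /* ... initialise buf->page, buf->ops, buf->offset, buf->len ... */
    spin_unlock_irq(&pipe->wait.lock);

    /* Copy with the spinlock dropped and irqs enabled; the reader cannot
     * overtake us because we still hold pipe->mutex.  If the copy fails,
     * the slot is simply left for later writes or a later read. */
    copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);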

    Signed-off-by: David Howells

    David Howells
     
  • Only do a wakeup in pipe_read() if we made space in a completely full
    buffer. The producer shouldn't be waiting on pipe->wait otherwise.

    Signed-off-by: David Howells

    David Howells
     
  • Advance the pipe ring tail pointer inside of wait spinlock in pipe_read()
    so that the pipe can be written into with kernel notifications from
    contexts where pipe->mutex cannot be taken.

    Signed-off-by: David Howells

    David Howells
     
  • Split pipe->ring_size into two numbers:

    (1) pipe->ring_size - indicates the hard size of the pipe ring.

    (2) pipe->max_usage - indicates the maximum number of pipe ring slots that
    userspace orchestrated events can fill.

    This allows for a pipe that is both writable by the general kernel
    notification facility and by userspace, allowing plenty of ring space for
    notifications to be added whilst preventing userspace from being able to
    pin too much unswappable kernel space.

    Signed-off-by: David Howells

    David Howells
     

31 Oct, 2019

1 commit

  • Convert pipes to use head and tail pointers for the buffer ring rather than
    pointer and length as the latter requires two atomic ops to update (or a
    combined op) whereas the former only requires one.

    (1) The head pointer is the point at which production occurs and points to
    the slot in which the next buffer will be placed. This is equivalent
    to pipe->curbuf + pipe->nrbufs.

    The head pointer belongs to the write-side.

    (2) The tail pointer is the point at which consumption occurs. It points
    to the next slot to be consumed. This is equivalent to pipe->curbuf.

    The tail pointer belongs to the read-side.

    (3) head and tail are allowed to run to UINT_MAX and wrap naturally. They
    are only masked off when the array is being accessed, e.g.:

    pipe->bufs[head & mask]

    This means that it is not necessary to have a dead slot in the ring as
    head == tail isn't ambiguous.

    (4) The ring is empty if "head == tail".

    A helper, pipe_empty(), is provided for this.

    (5) The occupancy of the ring is "head - tail".

    A helper, pipe_occupancy(), is provided for this.

    (6) The number of free slots in the ring is "pipe->ring_size - occupancy".

    A helper, pipe_space_for_user() is provided to indicate how many slots
    userspace may use.

    (7) The ring is full if "head - tail >= pipe->ring_size".

    A helper, pipe_full(), is provided for this.
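
    Those rules translate directly into small inline helpers along these
    lines (a sketch matching the semantics above; the wrap-around behaviour
    falls out of unsigned arithmetic):

    static inline bool pipe_empty(unsigned int head, unsigned int tail)
    {
        return head == tail;
    }

    static inline unsigned int pipe_occupancy(unsigned int head,
                                              unsigned int tail)
    {
        return head - tail;     /* indices wrap naturally at UINT_MAX */
    }

    static inline bool pipe_full(unsigned int head, unsigned int tail,
                                 unsigned int limit)
    {
        return pipe_occupancy(head, tail) >= limit;
    }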

    Signed-off-by: David Howells

    David Howells
     

26 May, 2019

2 commits

  • Convert the pipe filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     
  • Once upon a time we used to set ->d_name of e.g. pipefs root
    so that d_path() on pipes would work. These days it's
    completely pointless - dentries of pipes are not even connected
    to pipefs root. However, mount_pseudo() had set the root
    dentry name (passed as the second argument) and callers
    kept inventing names to pass to it. Including those that
    didn't *have* any non-root dentries to start with...

    All of that had been pointless for about 8 years now; it's
    time to get rid of that cargo-culting...

    Signed-off-by: Al Viro

    Al Viro
     

15 Apr, 2019

2 commits

  • Merge page ref overflow branch.

    Jann Horn reported that he can overflow the page ref count with
    sufficient memory (and a filesystem that is intentionally extremely
    slow).

    Admittedly it's not exactly easy. To have more than four billion
    references to a page requires a minimum of 32GB of kernel memory just
    for the pointers to the pages, much less any metadata to keep track of
    those pointers. Jann needed a total of 140GB of memory and a specially
    crafted filesystem that leaves all reads pending (in order to not ever
    free the page references and just keep adding more).

    Still, we have a fairly straightforward way to limit the two obvious
    user-controllable sources of page references: direct-IO like page
    references gotten through get_user_pages(), and the splice pipe page
    duplication. So let's just do that.

    * branch page-refs:
    fs: prevent page refcount overflow in pipe_buf_get
    mm: prevent get_user_pages() from overflowing page refcount
    mm: add 'try_get_page()' helper function
    mm: make page ref count overflow check tighter and more explicit

    Linus Torvalds
     
  • Change pipe_buf_get() to return a bool indicating whether it succeeded
    in raising the refcount of the page (if the thing in the pipe is a page).
    This removes another mechanism for overflowing the page refcount. All
    callers converted to handle a failure.

    Reported-by: Jann Horn
    Signed-off-by: Matthew Wilcox
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

13 Mar, 2019

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted fixes (really no common topic here)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Make __vfs_write() static
    vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
    pipe: stop using ->can_merge
    splice: don't merge into linked buffers
    fs: move generic stat response attr handling to vfs_getattr_nosec
    orangefs: don't reinitialize result_mask in ->getattr
    fs/devpts: always delete dcache dentry-s in dput()

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
    functions, so, the users don't have to explicitly check that condition.

    This is purely a code cleanup patch without any functional change. Only
    the order of checks in memcg_charge_slab() can potentially be changed,
    but functionally it will be the same. This should not matter as
    memcg_charge_slab() is not in the hot path.

    Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt