30 Dec, 2020

1 commit

  • [ Upstream commit 289caf5d8f6c61c6d2b7fd752a7f483cd153f182 ]

    Patch series "simplify ep_poll".

    This patch series is a followup based on the suggestions and feedback by
    Linus:
    https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

    The first patch in the series is a fix for the epoll race in presence of
    timeouts, so that it can be cleanly backported to all affected stable
    kernels.

    The rest of the patch series simplifies the ep_poll() implementation.
    Some of these simplifications also yield minor performance enhancements.
    We have kept these changes under selftests and internal benchmarks for a
    few days, and they show minor (1-2%) performance gains.

    This patch (of 8):

    After abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2)
    timeout"), we break out of the ep_poll loop upon timeout without checking
    whether any new events are available. Prior to that patch series we
    always called ep_events_available() after exiting the loop.

    This can cause races and missed wakeups. For example, consider the
    following scenario reported by Guantao Liu:

    Suppose we have an eventfd added using EPOLLET to an epollfd.

    Thread 1: Sleeps for just below 5ms and then writes to an eventfd.
    Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an
    event of the eventfd, it will write back on that fd.
    Thread 3: Calls epoll_wait with a negative timeout.

    Prior to abc610e01c66, it is guaranteed that Thread 3 will wake up either
    by Thread 1 or Thread 2. After abc610e01c66, Thread 3 can be blocked
    indefinitely if Thread 2 sees a timeout right before the write to the
    eventfd by Thread 1. Thread 2 will be woken up from
    schedule_hrtimeout_range and, with eavail == 0, it will not call
    ep_send_events().

    To fix this issue:
    1) Simplify the timed_out case as suggested by Linus.
    2) While holding the lock, recheck whether the thread was woken up
    after its timeout expired.

    Note that (2) is different from Linus' original suggestion: it does not
    set "eavail = ep_events_available(ep)", to avoid unnecessary contention
    (when there are too many timed-out threads and a small number of events),
    as well as the races mentioned in the discussion thread.

    This is the first patch in the series so that the backport to stable
    releases is straightforward.
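
    A rough sketch of the resulting timed-out path (identifiers follow the
    description above; not the exact upstream hunk):

        /* Sketch only: recheck, under ep->lock, whether a wakeup raced
         * with the timeout before we took ourselves off the queue. */
        if (!list_empty_careful(&wait.entry)) {
                write_lock_irq(&ep->lock);
                /*
                 * If we timed out but a wakeup already removed us from the
                 * wait queue, that wakeup implies queued events, so report
                 * them instead of returning 0.
                 */
                if (timed_out)
                        eavail = list_empty(&wait.entry);
                __remove_wait_queue(&ep->wq, &wait);
                write_unlock_irq(&ep->lock);
        }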

    Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com
    Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
    Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com
    Fixes: abc610e01c66 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
    Signed-off-by: Soheil Hassas Yeganeh
    Tested-by: Guantao Liu
    Suggested-by: Linus Torvalds
    Reported-by: Guantao Liu
    Reviewed-by: Eric Dumazet
    Reviewed-by: Willem de Bruijn
    Reviewed-by: Khazhismel Kumykov
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Soheil Hassas Yeganeh
     

25 Sep, 2020

1 commit


11 Sep, 2020

1 commit

  • Checking for the lack of epitems referring to the epoll we want to insert
    into is not enough; we might have an insertion of that epoll into another
    one that has already collected the set of files to recheck for excessive
    reverse paths, but hasn't gotten to creating/inserting the epitem for it.

    However, any such insertion in progress can be detected - it will update the
    generation count in our epoll when it's done looking through it for files
    to check. That gets done under ->mtx of our epoll and that allows us to
    detect that safely.

    We are *not* holding epmutex here, so the generation count is not stable.
    However, since both the update of ep->gen by loop check and (later)
    insertion into ->f_ep_link are done with ep->mtx held, we are fine -
    the sequence is
        grab epmutex
        bump loop_check_gen
        ...
        grab tep->mtx           // 1
        tep->gen = loop_check_gen
        ...
        drop tep->mtx           // 2
        ...
        grab tep->mtx           // 3
        ...
        insert into ->f_ep_link
        ...
        drop tep->mtx           // 4
        bump loop_check_gen
        drop epmutex
    and if the fastpath check in another thread happens for that eventpoll,
    it can come
        * before (1) - in that case fastpath is just fine
        * after (4) - we'll see non-empty ->f_ep_link, slow path taken
        * between (2) and (3) - loop_check_gen is stable, with ->mtx
          providing barriers, and we end up taking the slow path.

    Note that ->f_ep_link emptiness check is slightly racy - we are protected
    against insertions into that list, but removals can happen right under us.
    Not a problem - in the worst case we'll end up taking a slow path for
    no good reason.
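
    An approximate sketch of the resulting fast-path decision in
    EPOLL_CTL_ADD (identifiers taken from the description above; treat this
    as pseudocode rather than the exact diff):

        /* Sketch; under ep->mtx of the epoll being inserted into. */
        if (!list_empty(&f.file->f_ep_links) ||  /* already watched by someone */
            ep->gen == loop_check_gen ||         /* a loop check is walking us  */
            is_file_epoll(tf.file))              /* inserting another epoll     */
                full_check = 1;                  /* take the slow path          */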

    Signed-off-by: Al Viro

    Al Viro
     

10 Sep, 2020

2 commits


02 Sep, 2020

1 commit


23 Aug, 2020

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • When adding a new fd to an epoll, and this new fd is an
    epoll fd itself, we recursively scan the fds attached to it
    to detect cycles, and add non-epoll files to a "check list"
    that gets subsequently parsed.

    However, this check list isn't completely safe when deletions
    can happen concurrently. To sidestep the issue, make sure that
    a struct file placed on the check list sees its f_count increased,
    ensuring that a concurrent deletion won't result in the file
    disappearing from under our feet.
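
    A minimal sketch of the approach (list and field names approximate):

        /* Sketch: pin the file before queueing it for the reverse-path check. */
        get_file(file);                 /* hold off a concurrent final fput()   */
        list_add(&file->f_tfile_llink, &tfile_check_list);

        /* ... and when the check list is drained: */
        list_del_init(&file->f_tfile_llink);
        fput(file);                     /* drop the extra reference             */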

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier
    Signed-off-by: Al Viro

    Marc Zyngier
     

15 May, 2020

1 commit

  • There is a possible race when ep_scan_ready_list() leaves ->rdllist and
    ->ovflist empty for a short period of time although some events are
    pending. It is quite likely that ep_events_available() observes the empty
    lists and goes to sleep.

    Since commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of
    nested epoll") we are conservative in wakeups (there is only one place
    for wakeup and this is ep_poll_callback()), thus ep_events_available()
    must always observe correct state of two lists.

    The easiest and correct way is to do the final check under the lock.
    This does not impact performance, since the lock is taken anyway for
    adding a wait entry to the wait queue.

    The discussion of the problem can be found here:

    https://lore.kernel.org/linux-fsdevel/a2f22c3c-c25a-4bda-8339-a7bdaf17849e@akamai.com/

    In this patch barrierless __set_current_state() is used. This is safe
    since waitqueue_active() is called under the same lock on wakeup side.

    The short-circuit for fatal signals (i.e. the fatal_signal_pending()
    check) is moved to the line just before the actual events-harvesting
    routine. This is fully consistent with what is said in the comment of
    the patch where the actual fatal_signal_pending() check was added:
    c257a340ede0 ("fs, epoll: short circuit fetching events if thread has
    been killed").
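
    A condensed sketch of the resulting wait path (not the exact upstream
    code):

        /* Sketch only. */
        write_lock_irq(&ep->lock);
        /* Barrierless: waitqueue_active() on the wakeup side runs under
         * the same lock, so no explicit smp_mb() is needed here. */
        __set_current_state(TASK_INTERRUPTIBLE);

        /* The final check is done under the lock, so a concurrent
         * ep_scan_ready_list() cannot make both lists look empty to us
         * while events are actually pending. */
        eavail = ep_events_available(ep);
        if (!eavail) {
                if (signal_pending(current))
                        res = -EINTR;
                else
                        __add_wait_queue_exclusive(&ep->wq, &wait);
        }
        write_unlock_irq(&ep->lock);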

    Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
    Reported-by: Jason Baron
    Reported-by: Randy Dunlap
    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Baron
    Cc: Khazhismel Kumykov
    Cc: Alexander Viro
    Cc:
    Link: http://lkml.kernel.org/r/20200505145609.1865152-1-rpenyaev@suse.de
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     

08 May, 2020

2 commits

  • This patch does two things:

    - fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll:
    remove unnecessary wakeups of nested epoll")

    - improves performance for events delivery.

    The description of the problem is the following: if N (>1) threads are
    waiting on ep->wq for new events and M (>1) events come, it is quite
    likely that >1 wakeups hit the same wait queue entry, because there is
    quite a big window between the __add_wait_queue_exclusive() and the
    following __remove_wait_queue() calls in the ep_poll() function.

    This can lead to lost wakeups, because the thread which was woken up may
    not handle all the events in ->rdllist. (The problem is described in more
    detail here: https://lkml.org/lkml/2019/10/7/905)

    The idea of the current patch is to use init_wait() instead of
    init_waitqueue_entry().

    Internally init_wait() sets autoremove_wake_function as a callback,
    which removes the wait entry atomically (under the wq locks) from the
    list, thus the next coming wakeup hits the next wait entry in the wait
    queue, thus preventing lost wakeups.
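
    In other words, the change is essentially (a sketch, not the literal
    diff):

        /* before: the entry stays on ep->wq until ep_poll() removes it,
         * so several wakeups can hit the same, already woken, waiter */
        init_waitqueue_entry(&wait, current);

        /* after: autoremove_wake_function() takes the entry off the queue
         * under the waitqueue lock as part of the wakeup itself, so the
         * next wakeup proceeds to the next exclusive waiter */
        init_wait(&wait);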

    Problem is very well reproduced by the epoll60 test case [1].

    Wait entry removal on wakeup has also performance benefits, because
    there is no need to take a ep->lock and remove wait entry from the queue
    after the successful wakeup. Here is the timing output of the epoll60
    test case:

    With explicit wakeup from ep_scan_ready_list() (the state of the
    code prior 339ddb53d373):

    real 0m6.970s
    user 0m49.786s
    sys 0m0.113s

    After this patch:

    real 0m5.220s
    user 0m36.879s
    sys 0m0.019s

    The other testcase is the stress-epoll [2], where one thread consumes
    all the events and other threads produce many events:

    With explicit wakeup from ep_scan_ready_list() (the state of the
    code prior 339ddb53d373):

    threads  events/ms  run-time ms
          8       5427         1474
         16       6163         2596
         32       6824         4689
         64       7060         9064
        128       6991        18309

    After this patch:

    threads  events/ms  run-time ms
          8       5598         1429
         16       7073         2262
         32       7502         4265
         64       7640         8376
        128       7634        16767

    (number of "events/ms" represents event bandwidth, thus higher is
    better; number of "run-time ms" represents overall time spent
    doing the benchmark, thus lower is better)

    [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
    [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Baron
    Cc: Khazhismel Kumykov
    Cc: Alexander Viro
    Cc: Heiher
    Cc:
    Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • In the event that we add to ovflist, before commit 339ddb53d373
    ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be
    woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.

    With that wakeup removed, if we add to ovflist here, we may never wake
    up. Rather than adding back the ep_scan_ready_list wakeup - which was
    resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.

    We noticed that one of our workloads was missing wakeups starting with
    339ddb53d373, and upon manual inspection this wakeup seemed to be
    missing. With this patch added, we no longer see missing wakeups. I
    haven't yet tried to make a small reproducer, but the existing kselftests
    in filesystems/epoll pass for me with this patch.
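
    Roughly, the callback now treats an insertion into ->ovflist like an
    insertion into the ready list and falls through to the wakeup below (a
    sketch; helper names approximate):

        /* Sketch of the restructured ep_poll_callback() insertion. */
        if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
                if (chain_epi_lockless(epi))
                        ep_pm_stay_awake_rcu(epi);
        } else if (!ep_is_linked(epi)) {
                if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
                        ep_pm_stay_awake_rcu(epi);
        }
        /* In both cases we now reach the wake_up() of ep->wq below. */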

    [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
    Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
    Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
    Signed-off-by: Khazhismel Kumykov
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Penyaev
    Cc: Alexander Viro
    Cc: Roman Penyaev
    Cc: Heiher
    Cc: Jason Baron
    Cc:
    Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
    Signed-off-by: Linus Torvalds

    Khazhismel Kumykov
     

08 Apr, 2020

1 commit

  • Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set
    ep_poll_safewake() can take several non-raw spinlocks after disabling
    interrupts. Since a spinlock can block in the -rt kernel, we can't take a
    spinlock after disabling interrupts. So let's re-work how we determine
    the nesting level such that it plays nicely with the -rt kernel.

    Let's introduce a 'nests' field in struct eventpoll that records the
    current nesting level during ep_poll_callback(). Then, if we nest again
    we can find the previous struct eventpoll that we were called from and
    increase our count by 1. The 'nests' field is protected by
    ep->poll_wait.lock.

    I've also moved the visited field to reduce the size of struct eventpoll
    from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which
    is typical for a production config.
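
    A trimmed sketch of the idea (not the exact upstream code): the nesting
    depth is taken from the eventpoll we were called from and published in
    ep->nests while ep->poll_wait.lock is held, which gives
    spin_lock_irqsave_nested() its subclass:

        static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi)
        {
                unsigned long flags;
                u8 nests = 0;

                /* Sketch: nesting level comes from the epoll that called us. */
                if (epi) {
                        if (is_file_epoll(epi->ffd.file)) {
                                struct eventpoll *ep_src = epi->ffd.file->private_data;
                                nests = ep_src->nests;
                        } else {
                                nests = 1;
                        }
                }

                spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests);
                ep->nests = nests + 1;
                wake_up_locked_poll(&ep->poll_wait, EPOLLIN);
                ep->nests = 0;
                spin_unlock_irqrestore(&ep->poll_wait.lock, flags);
        }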

    Reported-by: Davidlohr Bueso
    Signed-off-by: Jason Baron
    Signed-off-by: Andrew Morton
    Reviewed-by: Davidlohr Bueso
    Cc: Roman Penyaev
    Cc: Eric Wong
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.com
    Signed-off-by: Linus Torvalds

    Jason Baron
     

22 Mar, 2020

1 commit

  • This fixes possible lost wakeup introduced by commit a218cc491420.
    Originally modifications to ep->wq were serialized by ep->wq.lock, but
    in commit a218cc491420 ("epoll: use rwlock in order to reduce
    ep_poll_callback() contention") a new rw lock was introduced in order to
    relax fd event path, i.e. callers of ep_poll_callback() function.

    After the change ep_modify and ep_insert (both are called on epoll_ctl()
    path) were switched to ep->lock, but ep_poll (epoll_wait) was using
    ep->wq.lock on wqueue list modification.

    The bug doesn't lead to any wqueue list corruptions, because the wake-up
    path and list modifications were serialized by ep->wq.lock internally,
    but the waitqueue_active() check prior to the wake_up() call can be
    reordered with modifications of the ep ready list, thus a wake up can be
    lost.

    And yes, it can be healed by an explicit smp_mb():

        list_add_tail(&epi->rdllink, &ep->rdllist);
        smp_mb();
        if (waitqueue_active(&ep->wq))
                wake_up(&ep->wq);

    But let's make it simple: the current patch replaces ep->wq.lock with
    ep->lock for wqueue modifications, thus the wake-up path always observes
    the activeness of the wqueue correctly.

    Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
    Reported-by: Max Neunhoeffer
    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Tested-by: Max Neunhoeffer
    Cc: Jakub Kicinski
    Cc: Christopher Kohlhoff
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Jes Sorensen
    Cc: [5.1+]
    Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
    References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
    Bisected-by: Max Neunhoeffer
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     

30 Jan, 2020

2 commits


05 Dec, 2019

2 commits

  • Take the case where we have:

        t0
         | (ew)
        e0
         | (et)
        e1
         | (lt)
        s0

    t0: thread 0
    e0: epoll fd 0
    e1: epoll fd 1
    s0: socket fd 0
    ew: epoll_wait
    et: edge-trigger
    lt: level-trigger

    We remove unnecessary wakeups to prevent a nested epoll working in
    edge-triggered mode from waking up continuously.

    Test code:

        #include <unistd.h>
        #include <sys/epoll.h>
        #include <sys/socket.h>

        int main(int argc, char *argv[])
        {
                int sfd[2];
                int efd[2];
                struct epoll_event e;

                if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
                        goto out;

                efd[0] = epoll_create(1);
                if (efd[0] < 0)
                        goto out;

                efd[1] = epoll_create(1);
                if (efd[1] < 0)
                        goto out;

                e.events = EPOLLIN;
                if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
                        goto out;

                e.events = EPOLLIN | EPOLLET;
                if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
                        goto out;

                if (write(sfd[1], "w", 1) != 1)
                        goto out;

                if (epoll_wait(efd[0], &e, 1, 0) != 1)
                        goto out;

                if (epoll_wait(efd[0], &e, 1, 0) != 0)
                        goto out;

                close(efd[0]);
                close(efd[1]);
                close(sfd[0]);
                close(sfd[1]);

                return 0;

        out:
                return -1;
        }

    More tests:
    https://github.com/heiher/epoll-wakeup

    Link: http://lkml.kernel.org/r/20191009060516.3577-1-r@hev.cc
    Signed-off-by: hev
    Reviewed-by: Roman Penyaev
    Cc: Al Viro
    Cc: Davide Libenzi
    Cc: Davidlohr Bueso
    Cc: Dominik Brodowski
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Sridhar Samudrala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiher
     
  • Currently, ep_poll_safewake() in the CONFIG_DEBUG_LOCK_ALLOC case uses
    ep_call_nested() in order to pass the correct subclass argument to
    spin_lock_irqsave_nested(). However, ep_call_nested() adds unnecessary
    checks for epoll depth and loops that are already verified when doing
    EPOLL_CTL_ADD. This mirrors a conversion that was done for
    !CONFIG_DEBUG_LOCK_ALLOC in: commit 37b5e5212a44 ("epoll: remove
    ep_call_nested() from ep_eventpoll_poll()")

    Link: http://lkml.kernel.org/r/1567628549-11501-1-git-send-email-jbaron@akamai.com
    Signed-off-by: Jason Baron
    Reviewed-by: Roman Penyaev
    Cc: Davidlohr Bueso
    Cc: Al Viro
    Cc: Eric Wong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

21 Aug, 2019

1 commit

  • Add an ID and a device pointer to 'struct wakeup_source'. Use them to
    expose wakeup source statistics in sysfs under
    /sys/class/wakeup/wakeup<ID>/*.

    Co-developed-by: Greg Kroah-Hartman
    Signed-off-by: Greg Kroah-Hartman
    Co-developed-by: Stephen Boyd
    Signed-off-by: Stephen Boyd
    Signed-off-by: Tri Vo
    Tested-by: Kalesh Singh
    Signed-off-by: Rafael J. Wysocki

    Tri Vo
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate the user supplied value between an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    minimum and maximum allowed value.

    On sysctl handler declaration, in every source file there are some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                     old     new   delta
    sysctl_vals                -      12     +12
    __kstrtab_sysctl_vals      -      12     +12
    max                       14      10      -4
    int_max                   16       -     -16
    one                       68       -     -68
    zero                     128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%
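
    As an illustration of the pattern (hypothetical entry name, not a hunk
    from this patch), a range-checked sysctl entry can point at the shared
    constants instead of per-file statics:

        /* Illustrative only: "some_knob" is a made-up entry. */
        {
                .procname     = "some_knob",
                .data         = &some_knob,
                .maxlen       = sizeof(int),
                .mode         = 0644,
                .proc_handler = proc_dointvec_minmax,
                .extra1       = SYSCTL_ZERO,     /* was: &zero (local static)    */
                .extra2       = SYSCTL_INT_MAX,  /* was: &int_max (local static) */
        },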

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the
    ret-from-syscall paths. This means that set_user_sigmask() can save
    ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate
    that ->blocked was modified.

    This way the callers do not need 2 sigset_t's passed to set/restore and
    restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
    into the trivial helper which just calls restore_saved_sigmask().
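
    The resulting helper is then roughly (a sketch of the described
    semantics, not necessarily the exact body):

        static inline void restore_saved_sigmask_unless(bool interrupted)
        {
                /* Sketch: when interrupted, signal delivery will restore
                 * ->blocked from ->saved_sigmask; otherwise restore it here. */
                if (!interrupted)
                        restore_saved_sigmask();
        }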

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced a
    visible change which breaks user space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or a timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 Mar, 2019

3 commits

  • The goal of this patch is to reduce contention of ep_poll_callback()
    which can be called concurrently from different CPUs in case of high
    events rates and many fds per epoll. Problem can be very well
    reproduced by generating events (write to pipe or eventfd) from many
    threads, while consumer thread does polling. In other words this patch
    increases the bandwidth of events which can be delivered from sources to
    the poller by adding poll items in a lockless way to the list.

    The main change is in replacement of the spinlock with a rwlock, which
    is taken on read in ep_poll_callback(), and then by adding poll items to
    the tail of the list using xchg atomic instruction. Write lock is taken
    everywhere else in order to stop list modifications and guarantee that
    list updates are fully completed (I assume that write side of a rwlock
    does not starve, it seems qrwlock implementation has these guarantees).
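
    A condensed sketch of the lockless tail insertion used on that path
    (simplified; comments ours, not the exact upstream function):

        /* Adds @new to the tail of @head; returns false if @new is already
         * being added from another CPU (that CPU won the cmpxchg). */
        static inline bool list_add_tail_lockless(struct list_head *new,
                                                  struct list_head *head)
        {
                struct list_head *prev;

                /* A free entry has new->next == new; only one CPU can
                 * flip it to @head. */
                if (cmpxchg(&new->next, new, head) != new)
                        return false;

                /* xchg() atomically publishes the new tail ... */
                prev = xchg(&head->prev, new);

                /* ... and only the winner links the old tail to @new. */
                prev->next = new;
                new->prev = prev;

                return true;
        }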

    The following are some microbenchmark results based on the test [1]
    which starts threads which generate N events each. The test ends when
    all events are successfully fetched by the poller thread:

    spinlock
    ========

    threads  events/ms  run-time ms
          8       6402        12495
         16       7045        22709
         32       7395        43268

    rwlock + xchg
    =============

    threads  events/ms  run-time ms
          8      10038         7969
         16      12178        13138
         32      13223        24199

    According to the results bandwidth of delivered events is significantly
    increased, thus execution time is reduced.

    This patch was tested with different sort of microbenchmarks and
    artificial delays (e.g. "udelay(get_random_int() & 0xff)") introduced
    in kernel on paths where items are added to lists.

    [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

    Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • Original comment "Activate ep->ws since epi->ws may get deactivated at
    any time" indeed sounds loud, but it is incorrect, because the path
    where we check epi->ws is a path where insert to ovflist happens, i.e.
    ep_scan_ready_list() has taken ep->mtx and waits for this callback to
    finish, thus ep_modify() (which unregisters wakeup source) waits for
    ep_scan_ready_list().

    Here in this patch I simply call ep_pm_stay_awake_rcu(), which is a bit
    extra for this path (indirectly protected by main ep->mtx, so even rcu
    is not needed), but I do not want to create another naked
    __ep_pm_stay_awake() variant only for this particular case, so rcu variant
    is just better for all the cases.

    Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • Patch series "use rwlock in order to reduce ep_poll_callback()
    contention", v3.

    The last patch targets the contention problem in ep_poll_callback(),
    which can be very well reproduced by generating events (write to pipe or
    eventfd) from many threads, while consumer thread does polling.

    The following are some microbenchmark results based on the test [1]
    which starts threads which generate N events each. The test ends when
    all events are successfully fetched by the poller thread:

    spinlock
    ========

    threads  events/ms  run-time ms
          8       6402        12495
         16       7045        22709
         32       7395        43268

    rwlock + xchg
    =============

    threads  events/ms  run-time ms
          8      10038         7969
         16      12178        13138
         32      13223        24199

    According to the results bandwidth of delivered events is significantly
    increased, thus execution time is reduced.

    This patch (of 4):

    All incoming events are stored in FIFO order, and this should also apply
    to ->ovflist, which is originally a stack, i.e. LIFO.

    Thus, to keep correct FIFO order, ->ovflist should be reversed by adding
    elements to the head of the ready list rather than to the tail.
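
    Concretely, when ->ovflist is spliced back in ep_scan_ready_list(),
    something like the following (a simplified sketch) restores FIFO order:

        /* Sketch: ->ovflist was filled LIFO; inserting each element at the
         * head of ->rdllist reverses it again, so the ready list stays FIFO. */
        for (nepi = ep->ovflist; (epi = nepi) != NULL;
             nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
                if (!ep_is_linked(epi)) {
                        list_add(&epi->rdllink, &ep->rdllist);
                        ep_pm_stay_awake(epi);
                }
        }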

    Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Reviewed-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     

06 Jan, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - procfs updates

    - various misc bits

    - lib/ updates

    - epoll updates

    - autofs

    - fatfs

    - a few more MM bits

    * emailed patches from Andrew Morton : (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...

    Linus Torvalds
     

05 Jan, 2019

8 commits

  • There is no reason why we rearm the waitqueue upon every fetch_events
    retry (for when events are found yet send_events() fails). If nothing
    else, this saves four lock operations per retry, and furthermore reduces
    the scope of the lock even further.

    [akpm@linux-foundation.org: restore code to original position, fix and reflow comment]
    Link: http://lkml.kernel.org/r/20181114182532.27981-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • It is currently called check_events because it, well, did exactly that.
    However, since the lockless ep_events_available() call, the label no
    longer checks, but just sends the events. Rename as such.

    Link: http://lkml.kernel.org/r/20181114182532.27981-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Upon timeout, we can just exit out of the loop, without the cost of
    changing the task's state with an smp_store_mb() call. Just exit out of
    the loop and be done - setting the task state afterwards will be, of
    course, redundant.

    [dave@stgolabs.net: forgotten fixlets]
    Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
    Link: http://lkml.kernel.org/r/20181108051006.18751-7-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This patch aims at reducing ep->wq.lock hold times in epoll_wait(2). For
    the blocking case, there is no need to constantly take and drop the
    spinlock, which is only needed to manipulate the waitqueue.

    The call to ep_events_available() is now lockless, and only exposed to
    benign races. Here, if false positive (returns available events and
    does not see another thread deleting an epi from the list) we call into
    send_events and then the list's state is correctly seen. Otoh, if a
    false negative and we don't see a list_add_tail(), for example, from irq
    callback, then it is rechecked again before blocking, which will see the
    correct state.

    For better accuracy in seeing a concurrent list_del_init(), use the
    list_empty_careful() variant -- of course, this won't be safe against
    insertions from wakeup.

    For the overflow list we obviously need to prevent load/store tearing as
    we don't want to see partial values while the ready list is disabled.
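
    Putting it together, the check itself ends up roughly as (a sketch;
    READ_ONCE() guards the ->ovflist pointer against tearing):

        /* Sketch: lockless, callers tolerate the benign races above. */
        static inline int ep_events_available(struct eventpoll *ep)
        {
                return !list_empty_careful(&ep->rdllist) ||
                        READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
        }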

    [dave@stgolabs.net: forgotten fixlets]
    Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
    Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Suggested-by: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Instead of just commenting on how important it is, let's make it more
    robust and add a lockdep_assert_held() call.
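
    For example (placement illustrative), at the top of a helper that must
    run under ep->mtx:

        lockdep_assert_held(&ep->mtx);  /* splat if a caller forgot the mutex */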

    Link: http://lkml.kernel.org/r/20181108051006.18751-5-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ep->ovflist is a secondary ready-list to temporarily store events
    that might occur when doing sproc without holding the ep->wq.lock. This
    accounts for every time we check for ready events and also send events
    back to userspace; both callbacks, particularly the latter because of
    copy_to_user, can account for a non-trivial time.

    As such, the unlikely() check to see if the pointer is being used seems
    both misleading and sub-optimal. In fact, we go to an awful lot of
    trouble to sync both lists, and populating the ovflist is far from an
    uncommon scenario.

    For example, profiling a concurrent epoll_wait(2) benchmark with
    CONFIG_PROFILE_ANNOTATED_BRANCHES shows that for two threads a 33%
    misprediction rate was seen; and when incrementally increasing the number
    of epoll instances (which is used, for example, for multiple queuing load
    balancing models), up to a 90% misprediction rate was seen.

    Similarly, by deleting the prediction, a 3% throughput boost was seen
    across incremental threads.

    Link: http://lkml.kernel.org/r/20181108051006.18751-4-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The current logic is a bit convoluted. Let's simplify this with a
    standard list_for_each_entry_safe() loop instead and just break out
    after maxevents is reached.

    While at it, remove an unnecessary indentation level in the loop when
    there are in fact ready events.

    Link: http://lkml.kernel.org/r/20181108051006.18751-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "epoll: some miscellaneous optimizations".

    The following are some incremental optimizations on some of the epoll
    core. Each patch has the details, but together, the series is seen to
    shave off measurable cycles on a number of systems and workloads.

    For example, on a 40-core IB, a pipetest as well as parallel
    epoll_wait() benchmark show around a 20-30% increase in raw operations
    per second when the box is fully occupied (incremental thread counts),
    and up to 15% performance improvement with lower counts.

    Passes ltp epoll related testcases.

    This patch(of 6):

    All callers pass the EP_MAX_NESTS constant already, so let's simplify
    this a tad and get rid of the redundant parameter for nested eventpolls.

    Link: http://lkml.kernel.org/r/20181108051006.18751-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Dec, 2018

2 commits

  • Refactor the logic to restore the sigmask before the syscall
    returns into an api.
    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during
    the execution and restored after the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     
  • Refactor reading sigset from userspace and updating sigmask
    into an api.

    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during,
    and restored after, the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll,
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Note that the calls to sigprocmask() ignored the return value
    from the api as the function only returns an error on an invalid
    first argument that is hardcoded at these call sites.
    The updated logic uses set_current_blocked() instead.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     

23 Aug, 2018

3 commits

  • Instead of having each caller pass the rdllink explicitly, just have
    ep_is_linked() derive it, so the callers only need the epi pointer. This
    helper is all about the rdllink, and this change, furthermore, improves
    the function's self-documentation.
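
    The helper then reads roughly as follows (a sketch):

        /* Sketch of the post-change helper. */
        static inline int ep_is_linked(struct epitem *epi)
        {
                return !list_empty(&epi->rdllink);
        }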

    Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Similar to other calls, ep_poll() is not called with interrupts disabled,
    and we can therefore avoid the irq save/restore dance and just disable
    local irqs. In fact, the call should never be called in irq context at
    all, considering that the only path is

    epoll_wait(2) -> do_epoll_wait() -> ep_poll().

    When running a common pipe-based epoll_wait(2) microbenchmark on a
    2-socket, 40-core (ht) IvyBridge, the following performance improvements
    are seen:

    # threads    vanilla      dirty
            1    1805587    2106412
            2    1854064    2090762
            4    1805484    2017436
            8    1751222    1974475
           16    1725299    1962104
           32    1378463    1571233
           64     787368     900784

    Which is a fairly consistent improvement of nearly 15%.

    Also add a lockdep check such that we detect any mischief before
    deadlocking.
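
    Concretely, the idea is roughly (a sketch, not the exact hunk):

        /* Sketch: ep_poll() never runs in irq context, so the non-saving
         * lock variants are enough; the lockdep assert catches misuse. */
        lockdep_assert_irqs_enabled();
        spin_lock_irq(&ep->wq.lock);
        /* ... manipulate the waitqueue / ready list ... */
        spin_unlock_irq(&ep->wq.lock);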

    Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Peter Zijlstra
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... 'tis easier on the eye.

    [akpm@linux-foundation.org: use inlines rather than macros]
    Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso