07 Oct, 2020

4 commits

  • commit 3701cb59d892b88d569427586f01491552f377b1 upstream.

    or get freed, for that matter, if it's a long (separately stored)
    name.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit fe0a916c1eae8e17e86c3753d13919177d63ed7e upstream.

    Checking for the lack of epitems referring to the epoll we want to insert into
    is not enough; we might have an insertion of that epoll into another one that
    has already collected the set of files to recheck for excessive reverse paths,
    but hasn't gotten to creating/inserting the epitem for it.

    However, any such insertion in progress can be detected - it will update the
    generation count in our epoll when it's done looking through it for files
    to check. That gets done under ->mtx of our epoll and that allows us to
    detect that safely.

    We are *not* holding epmutex here, so the generation count is not stable.
    However, since both the update of ep->gen by the loop check and the
    (later) insertion into ->f_ep_links are done with ep->mtx held, we are
    fine - the sequence is:

        grab epmutex
        bump loop_check_gen
        ...
        grab tep->mtx              // 1
        tep->gen = loop_check_gen
        ...
        drop tep->mtx              // 2
        ...
        grab tep->mtx              // 3
        ...
        insert into ->f_ep_links
        ...
        drop tep->mtx              // 4
        bump loop_check_gen
        drop epmutex

    If the fastpath check in another thread happens for that eventpoll, it
    can come:
        * before (1) - in that case the fastpath is just fine
        * after (4) - we'll see a non-empty ->f_ep_links and take the slow path
        * between (2) and (3) - loop_check_gen is stable, with ->mtx providing
          the barriers, and we end up taking the slow path.

    Note that the ->f_ep_links emptiness check is slightly racy - we are
    protected against insertions into that list, but removals can happen right
    under us. Not a problem - in the worst case we'll end up taking the slow
    path for no good reason.
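
    The ordering argument above can be modelled in userspace. The following is
    a rough sketch with pthreads (not kernel code): a plain mutex stands in for
    tep->mtx, an atomic counter for loop_check_gen, and a flag for a non-empty
    ->f_ep_links; the fastpath thread decides under the mutex only, exactly as
    in the three cases listed.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static pthread_mutex_t epmutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t mtx     = PTHREAD_MUTEX_INITIALIZER;  /* tep->mtx */
        static _Atomic unsigned long loop_check_gen;  /* global pass counter        */
        static unsigned long gen;                     /* tep->gen, protected by mtx */
        static int linked;                            /* non-empty ->f_ep_links     */

        static void *loop_check(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&epmutex);
                loop_check_gen++;

                pthread_mutex_lock(&mtx);             /* (1) */
                gen = loop_check_gen;                 /* mark: visited in this pass */
                pthread_mutex_unlock(&mtx);           /* (2) */

                /* ... reverse-path check of the collected files runs here ... */

                pthread_mutex_lock(&mtx);             /* (3) */
                linked = 1;                           /* insert into ->f_ep_links   */
                pthread_mutex_unlock(&mtx);           /* (4) */

                loop_check_gen++;
                pthread_mutex_unlock(&epmutex);
                return NULL;
        }

        static void *fastpath_check(void *arg)
        {
                (void)arg;
                /* Decide under mtx only, without epmutex; in the kernel the
                 * ordering comes from ->mtx, here an atomic counter keeps the
                 * model free of data races. */
                pthread_mutex_lock(&mtx);
                if (!linked && gen != loop_check_gen)
                        puts("fastpath: no loop check has visited this epoll");
                else
                        puts("slow path: full recheck needed");
                pthread_mutex_unlock(&mtx);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, loop_check, NULL);
                pthread_create(&b, NULL, fastpath_check, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                return 0;
        }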

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 18306c404abe18a0972587a6266830583c60c928 upstream.

    removes the need to clear it, along with the races.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit f8d4f44df056c5b504b0d49683fb7279218fd207 upstream.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

10 Sep, 2020

1 commit

  • [ Upstream commit 77f4689de17c0887775bb77896f4cc11a39bf848 ]

    epoll_loop_check_proc() can run into a file already committed to destruction;
    we can't grab a reference on those and don't need to add them to the set for
    reverse path check anyway.

    Tested-by: Marc Zyngier
    Fixes: a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list")
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Al Viro
     

26 Aug, 2020

2 commits

  • commit 52c479697c9b73f628140dcdfcd39ea302d05482 upstream.

    Signed-off-by: Al Viro
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit a9ed4a6560b8562b7e2e2bed9527e88001f7b682 upstream.

    When adding a new fd to an epoll, and that new fd is an epoll fd
    itself, we recursively scan the fds attached to it to detect cycles,
    and add non-epoll files to a "check list" that gets subsequently
    parsed.

    However, this check list isn't completely safe when deletions can
    happen concurrently. To sidestep the issue, make sure that a struct
    file placed on the check list sees its f_count increased, ensuring
    that a concurrent deletion won't result in the file disappearing
    from under our feet.
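
    For reference, the userspace pattern that exercises this path is simply
    adding one epoll fd to another. A minimal example (an illustration, not
    part of the patch):

        #include <sys/epoll.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
                int outer = epoll_create1(0);
                int inner = epoll_create1(0);
                struct epoll_event ev = { .events = EPOLLIN };

                if (outer < 0 || inner < 0) {
                        perror("epoll_create1");
                        return EXIT_FAILURE;
                }
                ev.data.fd = inner;

                /* Adding an epoll fd to another epoll fd makes the kernel walk
                 * the files reachable from "inner" to reject cycles and overly
                 * deep nesting; those files end up on the "check list" whose
                 * entries this patch pins by bumping f_count. */
                if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0) {
                        perror("EPOLL_CTL_ADD");
                        return EXIT_FAILURE;
                }

                puts("nested epoll created");
                close(inner);
                close(outer);
                return EXIT_SUCCESS;
        }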

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier
    Signed-off-by: Al Viro
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

14 May, 2020

2 commits

  • commit 0c54a6a44bf3d41e76ce3f583a6ece267618df2e upstream.

    In the event that we add to ovflist, before commit 339ddb53d373
    ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be
    woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.

    With that wakeup removed, if we add to ovflist here, we may never wake
    up. Rather than adding back the ep_scan_ready_list wakeup - which was
    resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.

    We noticed that one of our workloads was missing wakeups starting with
    339ddb53d373 and upon manual inspection, this wakeup seemed missing to me.
    With this patch added, we no longer see missing wakeups. I haven't yet
    tried to make a small reproducer, but the existing kselftests in
    filesystem/epoll passed for me with this patch.

    [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
    Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
    Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
    Signed-off-by: Khazhismel Kumykov
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Penyaev
    Cc: Alexander Viro
    Cc: Roman Penyaev
    Cc: Heiher
    Cc: Jason Baron
    Cc:
    Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Khazhismel Kumykov
     
  • commit 412895f03cbf9633298111cb4dfde13b7720e2c5 upstream.

    This patch does two things:

    - fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll:
    remove unnecessary wakeups of nested epoll")

    - improves performance for events delivery.

    The description of the problem is the following: if N (>1) threads are
    waiting on ep->wq for new events and M (>1) events come, it is quite
    likely that >1 wakeups hit the same wait queue entry, because there is
    quite a big window between __add_wait_queue_exclusive() and the
    following __remove_wait_queue() calls in ep_poll() function.

    This can lead to lost wakeups, because the thread that was woken up may
    not handle all the events in ->rdllist. (The problem is described in
    better words here: https://lkml.org/lkml/2019/10/7/905)

    The idea of the current patch is to use init_wait() instead of
    init_waitqueue_entry().

    Internally init_wait() sets autoremove_wake_function as the callback,
    which removes the wait entry atomically (under the wq locks) from the
    list, so the next incoming wakeup hits the next wait entry in the wait
    queue, thus preventing lost wakeups.
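
    For contrast, a short sketch of the two initializers (kernel API shown for
    illustration only; this is a fragment, not a standalone program):

        wait_queue_entry_t wait;

        init_waitqueue_entry(&wait, current);
        /* wait.func == default_wake_function: the entry stays on the queue
         * after a wakeup, so a later wakeup can hit this same entry again. */

        init_wait(&wait);
        /* wait.func == autoremove_wake_function: the wakeup unlinks the entry
         * under the waitqueue lock, so the next wakeup proceeds to the next
         * exclusive waiter. */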

    Problem is very well reproduced by the epoll60 test case [1].

    Wait entry removal on wakeup also has performance benefits, because
    there is no need to take ep->lock and remove the wait entry from the
    queue after a successful wakeup. Here is the timing output of the epoll60
    test case:

    With explicit wakeup from ep_scan_ready_list() (the state of the
    code prior 339ddb53d373):

    real 0m6.970s
    user 0m49.786s
    sys 0m0.113s

    After this patch:

    real 0m5.220s
    user 0m36.879s
    sys 0m0.019s

    The other testcase is the stress-epoll [2], where one thread consumes
    all the events and other threads produce many events:

    With explicit wakeup from ep_scan_ready_list() (the state of the
    code prior 339ddb53d373):

    threads   events/ms   run-time ms
          8        5427          1474
         16        6163          2596
         32        6824          4689
         64        7060          9064
        128        6991         18309

    After this patch:

    threads   events/ms   run-time ms
          8        5598          1429
         16        7073          2262
         32        7502          4265
         64        7640          8376
        128        7634         16767

    (number of "events/ms" represents event bandwidth, thus higher is
    better; number of "run-time ms" represents overall time spent
    doing the benchmark, thus lower is better)

    [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
    [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Baron
    Cc: Khazhismel Kumykov
    Cc: Alexander Viro
    Cc: Heiher
    Cc:
    Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Penyaev
     

25 Mar, 2020

1 commit

  • commit 1b53734bd0b2feed8e7761771b2e76fc9126ea0c upstream.

    This fixes a possible lost wakeup introduced by commit a218cc491420.
    Originally modifications to ep->wq were serialized by ep->wq.lock, but
    in commit a218cc491420 ("epoll: use rwlock in order to reduce
    ep_poll_callback() contention") a new rwlock was introduced in order to
    relax the fd event path, i.e. the callers of ep_poll_callback().

    After the change ep_modify and ep_insert (both called on the epoll_ctl()
    path) were switched to ep->lock, but ep_poll (epoll_wait) was still using
    ep->wq.lock for wait queue list modification.

    The bug doesn't lead to any wait queue list corruption, because the wake
    up path and the list modifications are serialized by ep->wq.lock
    internally, but the waitqueue_active() check prior to the wake_up() call
    can be reordered with modifications of the ep ready list, so a wake up
    can be lost.

    And yes, it can be healed by an explicit smp_mb():

        list_add_tail(&epi->rdllink, &ep->rdllist);
        smp_mb();
        if (waitqueue_active(&ep->wq))
                wake_up(&ep->wq);

    But let's keep it simple: the current patch replaces ep->wq.lock with
    ep->lock for wait queue modifications, so the wake up path always
    observes the activeness of the wait queue correctly.

    Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
    Reported-by: Max Neunhoeffer
    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Tested-by: Max Neunhoeffer
    Cc: Jakub Kicinski
    Cc: Christopher Kohlhoff
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Jes Sorensen
    Cc: [5.1+]
    Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
    References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
    Bisected-by: Max Neunhoeffer
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Penyaev
     

21 Aug, 2019

1 commit

    Add an ID and a device pointer to 'struct wakeup_source'. Use them to
    expose wakeup source statistics in sysfs under
    /sys/class/wakeup/wakeup<ID>/*.

    Co-developed-by: Greg Kroah-Hartman
    Signed-off-by: Greg Kroah-Hartman
    Co-developed-by: Stephen Boyd
    Signed-off-by: Stephen Boyd
    Signed-off-by: Tri Vo
    Tested-by: Kalesh Singh
    Signed-off-by: Rafael J. Wysocki

    Tri Vo
     

19 Jul, 2019

1 commit

    In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that a user-supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as the
    minimum and maximum allowed values.

    Where sysctl handlers are declared, every source file carries some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so that the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
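
    As an illustration of the resulting pattern, here is a before/after
    fragment of a hypothetical ctl_table entry (not buildable on its own):

        /* Before: each source file carried its own boundary variables. */
        static int zero;
        static int int_max = INT_MAX;
        ...
                .extra1         = &zero,
                .extra2         = &int_max,

        /* After: shared read-only entries from the new sysctl_vals[] array,
         * referenced via the SYSCTL_* macros. */
                .extra1         = SYSCTL_ZERO,
                .extra2         = SYSCTL_INT_MAX,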

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                     old     new   delta
    sysctl_vals                -      12     +12
    __kstrtab_sysctl_vals      -      12     +12
    max                       14      10      -4
    int_max                   16       -     -16
    one                       68       -     -68
    zero                     128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need two sigset_t's passed to set/restore,
    and restore_user_sigmask(), renamed to restore_saved_sigmask_unless(),
    turns into a trivial helper which just calls restore_saved_sigmask().

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    a visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or a timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?
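
    The guarantee under discussion is easy to demonstrate from userspace with
    epoll_pwait(2); a minimal example (an illustration, not from the patch):

        #include <sys/epoll.h>
        #include <signal.h>
        #include <stdio.h>
        #include <unistd.h>

        static volatile sig_atomic_t got_sigint;

        static void on_sigint(int sig) { (void)sig; got_sigint = 1; }

        int main(void)
        {
                int epfd = epoll_create1(0);
                struct epoll_event ev;
                sigset_t blocked, during_wait;

                signal(SIGINT, on_sigint);

                /* Keep SIGINT blocked normally ... */
                sigemptyset(&blocked);
                sigaddset(&blocked, SIGINT);
                sigprocmask(SIG_BLOCK, &blocked, NULL);

                /* ... and unblock it only for the duration of the wait. */
                sigemptyset(&during_wait);

                int n = epoll_pwait(epfd, &ev, 1, 1000, &during_wait);

                /* The guarantee discussed above: if epoll_pwait() returns
                 * success (here, 0 on timeout), the temporarily unblocked
                 * SIGINT has not been delivered; the handler may only run when
                 * the call itself is interrupted (-1 with EINTR). */
                printf("epoll_pwait returned %d, got_sigint=%d\n", n, (int)got_sigint);

                close(epfd);
                return 0;
        }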

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 Mar, 2019

3 commits

    The goal of this patch is to reduce contention on ep_poll_callback(),
    which can be called concurrently from different CPUs in the case of high
    event rates and many fds per epoll. The problem can be very well
    reproduced by generating events (writes to a pipe or eventfd) from many
    threads while a consumer thread does the polling. In other words, this
    patch increases the bandwidth of events which can be delivered from
    sources to the poller by adding poll items to the list in a lockless way.

    The main change is the replacement of the spinlock with an rwlock, which
    is taken for reading in ep_poll_callback(), with poll items then added to
    the tail of the list using the xchg atomic instruction. The write lock is
    taken everywhere else in order to stop list modifications and guarantee
    that list updates are fully completed (I assume that the write side of an
    rwlock does not starve; it seems the qrwlock implementation has these
    guarantees).
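
    The locking pattern can be modelled in userspace. Below is a rough sketch
    with pthreads and C11 atomics (not kernel code, and simplified to a LIFO
    push via atomic exchange rather than the kernel's lockless tail
    insertion): producers publish items under the shared (read) side of an
    rwlock, while the consumer takes the exclusive (write) side to drain the
    list with no insertion in flight.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct item {
                int val;
                struct item *next;
        };

        static pthread_rwlock_t lk = PTHREAD_RWLOCK_INITIALIZER;
        static struct item *_Atomic head;          /* the "ready list" */

        /* Producer side, standing in for ep_poll_callback(): the shared (read)
         * lock plus an atomic exchange publish the item without a per-list
         * spinlock. */
        static void *produce(void *arg)
        {
                struct item *it = malloc(sizeof(*it));

                it->val = (int)(long)arg;
                pthread_rwlock_rdlock(&lk);
                it->next = atomic_exchange(&head, it);
                pthread_rwlock_unlock(&lk);
                return NULL;
        }

        int main(void)
        {
                pthread_t t[4];

                for (long i = 0; i < 4; i++)
                        pthread_create(&t[i], NULL, produce, (void *)i);
                for (int i = 0; i < 4; i++)
                        pthread_join(t[i], NULL);

                /* Consumer side: the write lock excludes all producers, so the
                 * list can be detached and walked with no insertion in flight. */
                pthread_rwlock_wrlock(&lk);
                for (struct item *it = atomic_exchange(&head, NULL); it; ) {
                        struct item *next = it->next;

                        printf("event %d\n", it->val);
                        free(it);
                        it = next;
                }
                pthread_rwlock_unlock(&lk);
                return 0;
        }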

    The following are some microbenchmark results based on the test [1]
    which starts threads which generate N events each. The test ends when
    all events are successfully fetched by the poller thread:

    spinlock
    ========

    threads   events/ms   run-time ms
          8        6402         12495
         16        7045         22709
         32        7395         43268

    rwlock + xchg
    =============

    threads   events/ms   run-time ms
          8       10038          7969
         16       12178         13138
         32       13223         24199

    According to the results, the bandwidth of delivered events is
    significantly increased, and thus execution time is reduced.

    This patch was tested with different sort of microbenchmarks and
    artificial delays (e.g. "udelay(get_random_int() & 0xff)") introduced
    in kernel on paths where items are added to lists.

    [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

    Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
    The original comment "Activate ep->ws since epi->ws may get deactivated
    at any time" indeed sounds alarming, but it is incorrect, because the
    path where we check epi->ws is the path where insertion into ovflist
    happens, i.e. ep_scan_ready_list() has taken ep->mtx and waits for this
    callback to finish, thus ep_modify() (which unregisters the wakeup
    source) waits for ep_scan_ready_list().

    In this patch I simply call ep_pm_stay_awake_rcu(), which is a bit of
    extra work for this path (indirectly protected by the main ep->mtx, so
    even rcu is not needed), but I do not want to create another naked
    __ep_pm_stay_awake() variant only for this particular case, so the rcu
    variant is just better for all the cases.

    Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • Patch series "use rwlock in order to reduce ep_poll_callback()
    contention", v3.

    The last patch targets the contention problem in ep_poll_callback(),
    which can be very well reproduced by generating events (write to pipe or
    eventfd) from many threads, while consumer thread does polling.

    The following are some microbenchmark results based on the test [1]
    which starts threads which generate N events each. The test ends when
    all events are successfully fetched by the poller thread:

    spinlock
    ========

    threads   events/ms   run-time ms
          8        6402         12495
         16        7045         22709
         32        7395         43268

    rwlock + xchg
    =============

    threads   events/ms   run-time ms
          8       10038          7969
         16       12178         13138
         32       13223         24199

    According to the results, the bandwidth of delivered events is
    significantly increased, and thus execution time is reduced.

    This patch (of 4):

    All incoming events are stored in FIFO order, and this should also apply
    to ->ovflist, which is originally a stack, i.e. LIFO.

    Thus, to keep the correct FIFO order, ->ovflist should be reversed by
    adding elements to the head of the ready list rather than to the tail.

    Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Reviewed-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     

06 Jan, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - procfs updates

    - various misc bits

    - lib/ updates

    - epoll updates

    - autofs

    - fatfs

    - a few more MM bits

    * emailed patches from Andrew Morton : (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...

    Linus Torvalds
     

05 Jan, 2019

8 commits

    There is no reason why we rearm the waitqueue upon every fetch_events
    retry (for when events are found yet send_events() fails). If nothing
    else, this saves four lock operations per retry, and it also reduces
    the scope of the lock even further.

    [akpm@linux-foundation.org: restore code to original position, fix and reflow comment]
    Link: http://lkml.kernel.org/r/20181114182532.27981-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • It is currently called check_events because it, well, did exactly that.
    However, since the lockless ep_events_available() call, the label no
    longer checks, but just sends the events. Rename as such.

    Link: http://lkml.kernel.org/r/20181114182532.27981-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Upon timeout, we can just exit out of the loop, without the cost of
    changing the task's state with an smp_store_mb() call. Just exit out of
    the loop and be done - setting the task state afterwards will be, of
    course, redundant.

    [dave@stgolabs.net: forgotten fixlets]
    Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
    Link: http://lkml.kernel.org/r/20181108051006.18751-7-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    This patch aims at reducing ep->wq.lock hold times in epoll_wait(2). For
    the blocking case, there is no need to constantly take and drop the
    spinlock, which is only needed to manipulate the waitqueue.

    The call to ep_events_available() is now lockless, and only exposed to
    benign races. Here, on a false positive (it returns available events but
    does not see another thread deleting an epi from the list) we call into
    send_events and the list's state is then seen correctly. On the other
    hand, on a false negative, where we don't see a list_add_tail() from, for
    example, the irq callback, the list is rechecked again before blocking,
    which will see the correct state.

    For more accuracy in seeing a concurrent list_del_init(), use the
    list_empty_careful() variant -- of course, this won't be safe against
    insertions from wakeup.

    For the overflow list we obviously need to prevent load/store tearing, as
    we don't want to see partial values while the ready list is disabled.

    [dave@stgolabs.net: forgotten fixlets]
    Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
    Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Suggested-by: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Instead of just commenting on how important it is, let's make it more
    robust and add a lockdep_assert_held() call.

    Link: http://lkml.kernel.org/r/20181108051006.18751-5-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ep->ovflist is a secondary ready-list to temporarily store events
    that might occur when doing sproc without holding the ep->wq.lock. This
    accounts for every time we check for ready events and also send events
    back to userspace; both callbacks, particularly the latter because of
    copy_to_user, can account for a non-trivial time.

    As such, the unlikely() check to see if the pointer is being used seems
    both misleading and sub-optimal. In fact, we go to an awful lot of
    trouble to sync both lists, and populating the ovflist is far from an
    uncommon scenario.

    For example, profiling a concurrent epoll_wait(2) benchmark with
    CONFIG_PROFILE_ANNOTATED_BRANCHES shows that for two threads a 33%
    incorrect rate was seen; and when incrementally increasing the number of
    epoll instances (which is used, for example, for multiple queuing load
    balancing models), up to a 90% incorrect rate was seen.

    Similarly, by deleting the prediction, a 3% throughput boost was seen
    across incremental threads.

    Link: http://lkml.kernel.org/r/20181108051006.18751-4-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    The current logic is a bit convoluted. Let's simplify this with a
    standard list_for_each_entry_safe() loop instead and just break out
    after maxevents is reached.

    While at it, remove an unnecessary indentation level in the loop when
    there are in fact ready events.

    Link: http://lkml.kernel.org/r/20181108051006.18751-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "epoll: some miscellaneous optimizations".

    The following are some incremental optimizations on some of the epoll
    core. Each patch has the details, but together, the series is seen to
    shave off measurable cycles on a number of systems and workloads.

    For example, on a 40-core IB, a pipetest as well as parallel
    epoll_wait() benchmark show around a 20-30% increase in raw operations
    per second when the box is fully occupied (incremental thread counts),
    and up to 15% performance improvement with lower counts.

    Passes ltp epoll related testcases.

    This patch (of 6):

    All callers pass the EP_MAX_NESTS constant already, so let's simplify
    this a tad and get rid of the redundant parameter for nested eventpolls.

    Link: http://lkml.kernel.org/r/20181108051006.18751-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.
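
    Concretely, the conversion is of the following shape (an illustrative
    fragment with a hypothetical buffer and length, not buildable on its own):

        /* Before: the type argument was a historical artifact. */
        if (!access_ok(VERIFY_WRITE, buf, len))
                return -EFAULT;

        /* After: only the user address range is checked. */
        if (!access_ok(buf, len))
                return -EFAULT;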

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Dec, 2018

2 commits

  • Refactor the logic to restore the sigmask before the syscall
    returns into an api.
    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during
    the execution and restored after the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     
  • Refactor reading sigset from userspace and updating sigmask
    into an api.

    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during,
    and restored after, the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll,
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Note that the calls to sigprocmask() ignored the return value
    from the api as the function only returns an error on an invalid
    first argument that is hardcoded at these call sites.
    The updated logic uses set_current_blocked() instead.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     

23 Aug, 2018

7 commits

    Instead of having each caller pass the rdllink explicitly, just have
    ep_is_linked() look it up itself, while the callers only need to pass the
    epi pointer. This helper is all about the rdllink, and the change also
    improves the function's self-documentation.

    Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Similar to other calls, ep_poll() is not called with interrupts disabled,
    and we can therefore avoid the irq save/restore dance and just disable
    local irqs. In fact, the function should never be called in irq context
    at all, considering that the only path is

    epoll_wait(2) -> do_epoll_wait() -> ep_poll().

    When running on a 2 socket 40-core (ht) IvyBridge a common pipe based
    epoll_wait(2) microbenchmark, the following performance improvements are
    seen:

    # threads   vanilla     dirty
            1   1805587   2106412
            2   1854064   2090762
            4   1805484   2017436
            8   1751222   1974475
           16   1725299   1962104
           32   1378463   1571233
           64    787368    900784

    Which is pretty consistently near 15%.

    Also add a lockdep check such that we detect any mischief before
    deadlocking.

    Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Peter Zijlstra
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... 'tis easier on the eye.

    [akpm@linux-foundation.org: use inlines rather than macros]
    Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not
    save and restore interrupts when dealing with the ep->wq.lock. These are
    ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify
    and ep_remove.

    [akpm@linux-foundation.org: remove too-obvious comments]
    Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5
    Signed-off-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Both functions are similar to the context of ep_modify(), called via
    epoll_ctl(2). Just like ep_modify(), saving and restoring interrupts in
    these calls is overkill, as they will never be called with irqs disabled.
    While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also
    be called when releasing the file, but this also complies with the above.

    Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "fs/epoll: loosen irq safety when possible".

    Both patches replace saving+restoring interrupts when taking the ep->lock
    (now the waitqueue lock), with just disabling local irqs. This shows
    immediate performance benefits in patch 1 for an epoll workload running on
    Xen. The main concern we need to have with this sort of change in epoll
    is ep_poll_callback(), which is passed to the wait queue wakeup and very
    often runs in irq context; this patch does not touch that call.

    Patches have been tested pretty heavily with the customer workload,
    microbenchmarks, ltp testcases and two high level workloads that use epoll
    under the hood: nginx and libevent benchmarks.

    This patch (of 2):

    Saving and restoring interrupts in ep_scan_ready_list() is overkill, as
    it is never called with interrupts disabled. Loosen this to simply
    disabling local irqs, which benefits archs where managing irqs is
    expensive as well as virtual environments. This patch yields some
    throughput improvements on an epoll-intensive workload running on a
    single Xen DomU.

    1 Job    7500 -->  8800 enq/s  (+17%)
    2 Jobs  14000 --> 15200 enq/s   (+8%)
    3 Jobs  20500 --> 22300 enq/s   (+8%)
    4 Jobs  25000 --> 28000 enq/s  (+8-12%)

    On bare metal:

    For a 2-socket 40-core (ht) IvyBridge on a few workloads: unfortunately I
    don't have a Xen environment, and for the Xen results I do have (the
    numbers in patch 1) I don't have the actual workload, so I cannot
    compare them directly.

    1) Different configurations were used for an epoll_wait (pipe io)
    microbenchmark (http://linux-scalability.org/epoll/epoll-test.c), which
    shows around a 7-10% improvement in the overall total number of
    epoll_wait() loop iterations when using both regular and nested epolls;
    very raw numbers, but measurable nonetheless.

    # threads   vanilla     dirty
            1   1677717   1805587
            2   1660510   1854064
            4   1610184   1805484
            8   1577696   1751222
           16   1568837   1725299
           32   1291532   1378463
           64    752584    787368

    Note that stddev is pretty small.

    2) Another pipe test, which shows no real measurable improvement.
    (http://www.xmailserver.org/linux-patches/pipetest.c)

    Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "waitqueue lockdep annotation", v3.

    This series adds a strategic lockdep_assert_held to __wake_up_common to
    ensure callers really do hold the wait_queue_head lock when calling the
    unlocked wake_up variants. It turns out epoll did not do this for a
    fairly common path (hit all the time by systemd during bootup), so the
    second patch fixed this instance as well.

    This patch (of 3):

    The epoll code currently uses the unlocked waitqueue helpers for managing
    ep->wq, but instead of holding the waitqueue lock around these calls, it
    uses its own ep->lock spinlock. Given that the waitqueue is not exposed
    to the rest of the kernel this actually works ok at the moment, but
    prevents the epoll locking rules from being enforced using lockdep.
    Remove ep->lock and use the waitqueue lock to not only reduce the size of
    struct eventpoll but also to make sure we can assert locking invariants in
    the waitqueue code.

    Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Baron
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Jason Baron
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jun, 2018

1 commit


26 May, 2018

1 commit