07 Oct, 2020

4 commits

  • commit 3701cb59d892b88d569427586f01491552f377b1 upstream.

    or get freed, for that matter, if it's a long (separately stored)
    name.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit fe0a916c1eae8e17e86c3753d13919177d63ed7e upstream.

    Checking for the lack of epitems referring to the epoll we want to insert into
    is not enough; we might have an insertion of that epoll into another one that
    has already collected the set of files to recheck for excessive reverse paths,
    but hasn't gotten to creating/inserting the epitem for it.

    However, any such insertion in progress can be detected - it will update the
    generation count in our epoll when it's done looking through it for files
    to check. That gets done under ->mtx of our epoll and that allows us to
    detect that safely.

    We are *not* holding epmutex here, so the generation count is not stable.
    However, since both the update of ep->gen by the loop check and the
    (later) insertion into ->f_ep_links are done with ep->mtx held, we are
    fine - the sequence is:

        grab epmutex
        bump loop_check_gen
        ...
        grab tep->mtx              // 1
        tep->gen = loop_check_gen
        ...
        drop tep->mtx              // 2
        ...
        grab tep->mtx              // 3
        ...
        insert into ->f_ep_links
        ...
        drop tep->mtx              // 4
        bump loop_check_gen
        drop epmutex

    If the fastpath check in another thread happens for that eventpoll, it
    can come:
        * before (1) - in that case the fastpath is just fine
        * after (4) - we'll see a non-empty ->f_ep_links and take the slow path
        * between (2) and (3) - loop_check_gen is stable, with ->mtx providing
          the barriers, and we end up taking the slow path.

    Note that the ->f_ep_links emptiness check is slightly racy - we are
    protected against insertions into that list, but removals can happen right
    under us. Not a problem - in the worst case we'll end up taking the slow
    path for no good reason.
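
    The ordering argument above can be modelled in userspace. The following is
    a rough sketch with pthreads (not kernel code): a plain mutex stands in for
    tep->mtx, an atomic counter for loop_check_gen, and a flag for a non-empty
    ->f_ep_links; the fastpath thread decides under the mutex only, exactly as
    in the three cases listed.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static pthread_mutex_t epmutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t mtx     = PTHREAD_MUTEX_INITIALIZER;  /* tep->mtx */
        static _Atomic unsigned long loop_check_gen;  /* global pass counter        */
        static unsigned long gen;                     /* tep->gen, protected by mtx */
        static int linked;                            /* non-empty ->f_ep_links     */

        static void *loop_check(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&epmutex);
                loop_check_gen++;

                pthread_mutex_lock(&mtx);             /* (1) */
                gen = loop_check_gen;                 /* mark: visited in this pass */
                pthread_mutex_unlock(&mtx);           /* (2) */

                /* ... reverse-path check of the collected files runs here ... */

                pthread_mutex_lock(&mtx);             /* (3) */
                linked = 1;                           /* insert into ->f_ep_links   */
                pthread_mutex_unlock(&mtx);           /* (4) */

                loop_check_gen++;
                pthread_mutex_unlock(&epmutex);
                return NULL;
        }

        static void *fastpath_check(void *arg)
        {
                (void)arg;
                /* Decide under mtx only, without epmutex; in the kernel the
                 * ordering comes from ->mtx, here an atomic counter keeps the
                 * model free of data races. */
                pthread_mutex_lock(&mtx);
                if (!linked && gen != loop_check_gen)
                        puts("fastpath: no loop check has visited this epoll");
                else
                        puts("slow path: full recheck needed");
                pthread_mutex_unlock(&mtx);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, loop_check, NULL);
                pthread_create(&b, NULL, fastpath_check, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                return 0;
        }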

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 18306c404abe18a0972587a6266830583c60c928 upstream.

    removes the need to clear it, along with the races.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit f8d4f44df056c5b504b0d49683fb7279218fd207 upstream.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

10 Sep, 2020

1 commit

  • [ Upstream commit 77f4689de17c0887775bb77896f4cc11a39bf848 ]

    epoll_loop_check_proc() can run into a file already committed to destruction;
    we can't grab a reference on those and don't need to add them to the set for
    reverse path check anyway.

    Tested-by: Marc Zyngier
    Fixes: a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list")
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Al Viro
     

26 Aug, 2020

2 commits

  • commit 52c479697c9b73f628140dcdfcd39ea302d05482 upstream.

    Signed-off-by: Al Viro
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit a9ed4a6560b8562b7e2e2bed9527e88001f7b682 upstream.

    When adding a new fd to an epoll, and that new fd is an epoll fd
    itself, we recursively scan the fds attached to it to detect cycles,
    and add non-epoll files to a "check list" that gets subsequently
    parsed.

    However, this check list isn't completely safe when deletions can
    happen concurrently. To sidestep the issue, make sure that a struct
    file placed on the check list sees its f_count increased, ensuring
    that a concurrent deletion won't result in the file disappearing
    from under our feet.
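
    For reference, the userspace pattern that exercises this path is simply
    adding one epoll fd to another. A minimal example (an illustration, not
    part of the patch):

        #include <sys/epoll.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
                int outer = epoll_create1(0);
                int inner = epoll_create1(0);
                struct epoll_event ev = { .events = EPOLLIN };

                if (outer < 0 || inner < 0) {
                        perror("epoll_create1");
                        return EXIT_FAILURE;
                }
                ev.data.fd = inner;

                /* Adding an epoll fd to another epoll fd makes the kernel walk
                 * the files reachable from "inner" to reject cycles and overly
                 * deep nesting; those files end up on the "check list" whose
                 * entries this patch pins by bumping f_count. */
                if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0) {
                        perror("EPOLL_CTL_ADD");
                        return EXIT_FAILURE;
                }

                puts("nested epoll created");
                close(inner);
                close(outer);
                return EXIT_SUCCESS;
        }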

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier
    Signed-off-by: Al Viro
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

14 May, 2020

2 commits

  • commit 0c54a6a44bf3d41e76ce3f583a6ece267618df2e upstream.

    In the event that we add to ovflist, before commit 339ddb53d373
    ("fs/epoll: remove unnecessary wakeups of nested epoll") we would be
    woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.

    With that wakeup removed, if we add to ovflist here, we may never wake
    up. Rather than adding back the ep_scan_ready_list wakeup - which was
    resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.

    We noticed that one of our workloads was missing wakeups starting with
    339ddb53d373 and upon manual inspection, this wakeup seemed missing to me.
    With this patch added, we no longer see missing wakeups. I haven't yet
    tried to make a small reproducer, but the existing kselftests in
    filesystem/epoll passed for me with this patch.

    [khazhy@google.com: use if/elif instead of goto + cleanup suggested by Roman]
    Link: http://lkml.kernel.org/r/20200424190039.192373-1-khazhy@google.com
    Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
    Signed-off-by: Khazhismel Kumykov
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Penyaev
    Cc: Alexander Viro
    Cc: Roman Penyaev
    Cc: Heiher
    Cc: Jason Baron
    Cc:
    Link: http://lkml.kernel.org/r/20200424025057.118641-1-khazhy@google.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Khazhismel Kumykov
     
  • commit 412895f03cbf9633298111cb4dfde13b7720e2c5 upstream.

    This patch does two things:

    - fixes a lost wakeup introduced by commit 339ddb53d373 ("fs/epoll:
    remove unnecessary wakeups of nested epoll")

    - improves performance for events delivery.

    The description of the problem is the following: if N (>1) threads are
    waiting on ep->wq for new events and M (>1) events come, it is quite
    likely that >1 wakeups hit the same wait queue entry, because there is
    quite a big window between __add_wait_queue_exclusive() and the
    following __remove_wait_queue() calls in ep_poll() function.

    This can lead to lost wakeups, because the thread that was woken up may
    not handle all the events in ->rdllist. (The problem is described in
    better words here: https://lkml.org/lkml/2019/10/7/905)

    The idea of the current patch is to use init_wait() instead of
    init_waitqueue_entry().

    Internally init_wait() sets autoremove_wake_function as the callback,
    which removes the wait entry atomically (under the wq locks) from the
    list, so the next incoming wakeup hits the next wait entry in the wait
    queue, thus preventing lost wakeups.
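
    For contrast, a short sketch of the two initializers (kernel API shown for
    illustration only; this is a fragment, not a standalone program):

        wait_queue_entry_t wait;

        init_waitqueue_entry(&wait, current);
        /* wait.func == default_wake_function: the entry stays on the queue
         * after a wakeup, so a later wakeup can hit this same entry again. */

        init_wait(&wait);
        /* wait.func == autoremove_wake_function: the wakeup unlinks the entry
         * under the waitqueue lock, so the next wakeup proceeds to the next
         * exclusive waiter. */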

    Problem is very well reproduced by the epoll60 test case [1].

    Wait entry removal on wakeup also has performance benefits, because
    there is no need to take ep->lock and remove the wait entry from the
    queue after a successful wakeup. Here is the timing output of the epoll60
    test case:

    With explicit wakeup from ep_scan_ready_list() (the state of the
    code prior 339ddb53d373):

    real 0m6.970s
    user 0m49.786s
    sys 0m0.113s

    After this patch:

    real 0m5.220s
    user 0m36.879s
    sys 0m0.019s

    The other testcase is the stress-epoll [2], where one thread consumes
    all the events and other threads produce many events:

    With explicit wakeup from ep_scan_ready_list() (the state of the
    code prior 339ddb53d373):

    threads   events/ms   run-time ms
          8        5427          1474
         16        6163          2596
         32        6824          4689
         64        7060          9064
        128        6991         18309

    After this patch:

    threads   events/ms   run-time ms
          8        5598          1429
         16        7073          2262
         32        7502          4265
         64        7640          8376
        128        7634         16767

    (number of "events/ms" represents event bandwidth, thus higher is
    better; number of "run-time ms" represents overall time spent
    doing the benchmark, thus lower is better)

    [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
    [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Baron
    Cc: Khazhismel Kumykov
    Cc: Alexander Viro
    Cc: Heiher
    Cc:
    Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Penyaev
     

25 Mar, 2020

1 commit

  • commit 1b53734bd0b2feed8e7761771b2e76fc9126ea0c upstream.

    This fixes a possible lost wakeup introduced by commit a218cc491420.
    Originally modifications to ep->wq were serialized by ep->wq.lock, but
    in commit a218cc491420 ("epoll: use rwlock in order to reduce
    ep_poll_callback() contention") a new rwlock was introduced in order to
    relax the fd event path, i.e. the callers of ep_poll_callback().

    After the change ep_modify and ep_insert (both called on the epoll_ctl()
    path) were switched to ep->lock, but ep_poll (epoll_wait) was still using
    ep->wq.lock for wait queue list modification.

    The bug doesn't lead to any wait queue list corruption, because the wake
    up path and the list modifications are serialized by ep->wq.lock
    internally, but the waitqueue_active() check prior to the wake_up() call
    can be reordered with modifications of the ep ready list, so a wake up
    can be lost.

    And yes, it can be healed by an explicit smp_mb():

        list_add_tail(&epi->rdllink, &ep->rdllist);
        smp_mb();
        if (waitqueue_active(&ep->wq))
                wake_up(&ep->wq);

    But let's keep it simple: the current patch replaces ep->wq.lock with
    ep->lock for wait queue modifications, so the wake up path always
    observes the activeness of the wait queue correctly.

    Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
    Reported-by: Max Neunhoeffer
    Signed-off-by: Roman Penyaev
    Signed-off-by: Andrew Morton
    Tested-by: Max Neunhoeffer
    Cc: Jakub Kicinski
    Cc: Christopher Kohlhoff
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Jes Sorensen
    Cc: [5.1+]
    Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
    References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
    Bisected-by: Max Neunhoeffer
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Penyaev
     

21 Aug, 2019

1 commit

    Add an ID and a device pointer to 'struct wakeup_source'. Use them to
    expose wakeup source statistics in sysfs under
    /sys/class/wakeup/wakeup<ID>/*.

    Co-developed-by: Greg Kroah-Hartman
    Signed-off-by: Greg Kroah-Hartman
    Co-developed-by: Stephen Boyd
    Signed-off-by: Stephen Boyd
    Signed-off-by: Tri Vo
    Tested-by: Kalesh Singh
    Signed-off-by: Rafael J. Wysocki

    Tri Vo
     

19 Jul, 2019

1 commit

    In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that a user-supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as the
    minimum and maximum allowed values.

    Where sysctl handlers are declared, every source file carries some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so that the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
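
    As an illustration of the resulting pattern, here is a before/after
    fragment of a hypothetical ctl_table entry (not buildable on its own):

        /* Before: each source file carried its own boundary variables. */
        static int zero;
        static int int_max = INT_MAX;
        ...
                .extra1         = &zero,
                .extra2         = &int_max,

        /* After: shared read-only entries from the new sysctl_vals[] array,
         * referenced via the SYSCTL_* macros. */
                .extra1         = SYSCTL_ZERO,
                .extra2         = SYSCTL_INT_MAX,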

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                     old     new   delta
    sysctl_vals                -      12     +12
    __kstrtab_sysctl_vals      -      12     +12
    max                       14      10      -4
    int_max                   16       -     -16
    one                       68       -     -68
    zero                     128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

17 Jul, 2019

1 commit

  • task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
    syscall paths. This means that set_user_sigmask() can save ->blocked in
    ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
    was modified.

    This way the callers do not need two sigset_t's passed to set/restore,
    and restore_user_sigmask(), renamed to restore_saved_sigmask_unless(),
    turns into a trivial helper which just calls restore_saved_sigmask().

    Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
    Signed-off-by: Oleg Nesterov
    Cc: Deepa Dinamani
    Cc: Arnd Bergmann
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Eric Wong
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: David Laight
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Jun, 2019

1 commit

  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    a visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or a timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?
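
    The guarantee under discussion is easy to demonstrate from userspace with
    epoll_pwait(2); a minimal example (an illustration, not from the patch):

        #include <sys/epoll.h>
        #include <signal.h>
        #include <stdio.h>
        #include <unistd.h>

        static volatile sig_atomic_t got_sigint;

        static void on_sigint(int sig) { (void)sig; got_sigint = 1; }

        int main(void)
        {
                int epfd = epoll_create1(0);
                struct epoll_event ev;
                sigset_t blocked, during_wait;

                signal(SIGINT, on_sigint);

                /* Keep SIGINT blocked normally ... */
                sigemptyset(&blocked);
                sigaddset(&blocked, SIGINT);
                sigprocmask(SIG_BLOCK, &blocked, NULL);

                /* ... and unblock it only for the duration of the wait. */
                sigemptyset(&during_wait);

                int n = epoll_pwait(epfd, &ev, 1, 1000, &during_wait);

                /* The guarantee discussed above: if epoll_pwait() returns
                 * success (here, 0 on timeout), the temporarily unblocked
                 * SIGINT has not been delivered; the handler may only run when
                 * the call itself is interrupted (-1 with EINTR). */
                printf("epoll_pwait returned %d, got_sigint=%d\n", n, (int)got_sigint);

                close(epfd);
                return 0;
        }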

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 Mar, 2019

3 commits

    The goal of this patch is to reduce contention on ep_poll_callback(),
    which can be called concurrently from different CPUs in the case of high
    event rates and many fds per epoll. The problem can be very well
    reproduced by generating events (writes to a pipe or eventfd) from many
    threads while a consumer thread does the polling. In other words, this
    patch increases the bandwidth of events which can be delivered from
    sources to the poller by adding poll items to the list in a lockless way.

    The main change is the replacement of the spinlock with an rwlock, which
    is taken for reading in ep_poll_callback(), with poll items then added to
    the tail of the list using the xchg atomic instruction. The write lock is
    taken everywhere else in order to stop list modifications and guarantee
    that list updates are fully completed (I assume that the write side of an
    rwlock does not starve; it seems the qrwlock implementation has these
    guarantees).
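
    The locking pattern can be modelled in userspace. Below is a rough sketch
    with pthreads and C11 atomics (not kernel code, and simplified to a LIFO
    push via atomic exchange rather than the kernel's lockless tail
    insertion): producers publish items under the shared (read) side of an
    rwlock, while the consumer takes the exclusive (write) side to drain the
    list with no insertion in flight.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct item {
                int val;
                struct item *next;
        };

        static pthread_rwlock_t lk = PTHREAD_RWLOCK_INITIALIZER;
        static struct item *_Atomic head;          /* the "ready list" */

        /* Producer side, standing in for ep_poll_callback(): the shared (read)
         * lock plus an atomic exchange publish the item without a per-list
         * spinlock. */
        static void *produce(void *arg)
        {
                struct item *it = malloc(sizeof(*it));

                it->val = (int)(long)arg;
                pthread_rwlock_rdlock(&lk);
                it->next = atomic_exchange(&head, it);
                pthread_rwlock_unlock(&lk);
                return NULL;
        }

        int main(void)
        {
                pthread_t t[4];

                for (long i = 0; i < 4; i++)
                        pthread_create(&t[i], NULL, produce, (void *)i);
                for (int i = 0; i < 4; i++)
                        pthread_join(t[i], NULL);

                /* Consumer side: the write lock excludes all producers, so the
                 * list can be detached and walked with no insertion in flight. */
                pthread_rwlock_wrlock(&lk);
                for (struct item *it = atomic_exchange(&head, NULL); it; ) {
                        struct item *next = it->next;

                        printf("event %d\n", it->val);
                        free(it);
                        it = next;
                }
                pthread_rwlock_unlock(&lk);
                return 0;
        }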

    The following are some microbenchmark results based on the test [1]
    which starts threads which generate N events each. The test ends when
    all events are successfully fetched by the poller thread:

    spinlock
    ========

    threads   events/ms   run-time ms
          8        6402         12495
         16        7045         22709
         32        7395         43268

    rwlock + xchg
    =============

    threads   events/ms   run-time ms
          8       10038          7969
         16       12178         13138
         32       13223         24199

    According to the results, the bandwidth of delivered events is
    significantly increased, and thus execution time is reduced.

    This patch was tested with different sort of microbenchmarks and
    artificial delays (e.g. "udelay(get_random_int() & 0xff)") introduced
    in kernel on paths where items are added to lists.

    [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

    Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
    The original comment "Activate ep->ws since epi->ws may get deactivated
    at any time" indeed sounds alarming, but it is incorrect, because the
    path where we check epi->ws is the path where insertion into ovflist
    happens, i.e. ep_scan_ready_list() has taken ep->mtx and waits for this
    callback to finish, thus ep_modify() (which unregisters the wakeup
    source) waits for ep_scan_ready_list().

    In this patch I simply call ep_pm_stay_awake_rcu(), which is a bit of
    extra work for this path (indirectly protected by the main ep->mtx, so
    even rcu is not needed), but I do not want to create another naked
    __ep_pm_stay_awake() variant only for this particular case, so the rcu
    variant is just better for all the cases.

    Link: http://lkml.kernel.org/r/20190103150104.17128-4-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     
  • Patch series "use rwlock in order to reduce ep_poll_callback()
    contention", v3.

    The last patch targets the contention problem in ep_poll_callback(),
    which can be very well reproduced by generating events (write to pipe or
    eventfd) from many threads, while consumer thread does polling.

    The following are some microbenchmark results based on the test [1]
    which starts threads which generate N events each. The test ends when
    all events are successfully fetched by the poller thread:

    spinlock
    ========

    threads   events/ms   run-time ms
          8        6402         12495
         16        7045         22709
         32        7395         43268

    rwlock + xchg
    =============

    threads   events/ms   run-time ms
          8       10038          7969
         16       12178         13138
         32       13223         24199

    According to the results, the bandwidth of delivered events is
    significantly increased, and thus execution time is reduced.

    This patch (of 4):

    All incoming events are stored in FIFO order, and this should also apply
    to ->ovflist, which is originally a stack, i.e. LIFO.

    Thus, to keep the correct FIFO order, ->ovflist should be reversed by
    adding elements to the head of the ready list rather than to the tail.

    Link: http://lkml.kernel.org/r/20190103150104.17128-2-rpenyaev@suse.de
    Signed-off-by: Roman Penyaev
    Reviewed-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Penyaev
     

06 Jan, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - procfs updates

    - various misc bits

    - lib/ updates

    - epoll updates

    - autofs

    - fatfs

    - a few more MM bits

    * emailed patches from Andrew Morton : (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...

    Linus Torvalds
     

05 Jan, 2019

8 commits

    There is no reason why we rearm the waitqueue upon every fetch_events
    retry (for when events are found yet send_events() fails). If nothing
    else, this saves four lock operations per retry, and it also reduces
    the scope of the lock even further.

    [akpm@linux-foundation.org: restore code to original position, fix and reflow comment]
    Link: http://lkml.kernel.org/r/20181114182532.27981-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • It is currently called check_events because it, well, did exactly that.
    However, since the lockless ep_events_available() call, the label no
    longer checks, but just sends the events. Rename as such.

    Link: http://lkml.kernel.org/r/20181114182532.27981-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Upon timeout, we can just exit out of the loop, without the cost of
    changing the task's state with an smp_store_mb() call. Just exit out of
    the loop and be done - setting the task state afterwards will be, of
    course, redundant.

    [dave@stgolabs.net: forgotten fixlets]
    Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
    Link: http://lkml.kernel.org/r/20181108051006.18751-7-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    This patch aims at reducing ep->wq.lock hold times in epoll_wait(2). For
    the blocking case, there is no need to constantly take and drop the
    spinlock, which is only needed to manipulate the waitqueue.

    The call to ep_events_available() is now lockless, and only exposed to
    benign races. Here, on a false positive (it returns available events but
    does not see another thread deleting an epi from the list) we call into
    send_events and the list's state is then seen correctly. On the other
    hand, on a false negative, where we don't see a list_add_tail() from, for
    example, the irq callback, the list is rechecked again before blocking,
    which will see the correct state.

    For more accuracy in seeing a concurrent list_del_init(), use the
    list_empty_careful() variant -- of course, this won't be safe against
    insertions from wakeup.

    For the overflow list we obviously need to prevent load/store tearing, as
    we don't want to see partial values while the ready list is disabled.

    [dave@stgolabs.net: forgotten fixlets]
    Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
    Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Suggested-by: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Instead of just commenting on how important it is, let's make it more
    robust and add a lockdep_assert_held() call.

    Link: http://lkml.kernel.org/r/20181108051006.18751-5-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ep->ovflist is a secondary ready-list to temporarily store events
    that might occur when doing sproc without holding the ep->wq.lock. This
    accounts for every time we check for ready events and also send events
    back to userspace; both callbacks, particularly the latter because of
    copy_to_user, can account for a non-trivial time.

    As such, the unlikely() check to see if the pointer is being used seems
    both misleading and sub-optimal. In fact, we go to an awful lot of
    trouble to sync both lists, and populating the ovflist is far from an
    uncommon scenario.

    For example, profiling a concurrent epoll_wait(2) benchmark with
    CONFIG_PROFILE_ANNOTATED_BRANCHES shows that for two threads a 33%
    incorrect rate was seen; and when incrementally increasing the number of
    epoll instances (which is used, for example, for multiple queuing load
    balancing models), up to a 90% incorrect rate was seen.

    Similarly, by deleting the prediction, a 3% throughput boost was seen
    across incremental threads.

    Link: http://lkml.kernel.org/r/20181108051006.18751-4-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    The current logic is a bit convoluted. Let's simplify this with a
    standard list_for_each_entry_safe() loop instead and just break out
    after maxevents is reached.

    While at it, remove an unnecessary indentation level in the loop when
    there are in fact ready events.

    Link: http://lkml.kernel.org/r/20181108051006.18751-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "epoll: some miscellaneous optimizations".

    The following are some incremental optimizations on some of the epoll
    core. Each patch has the details, but together, the series is seen to
    shave off measurable cycles on a number of systems and workloads.

    For example, on a 40-core IB, a pipetest as well as parallel
    epoll_wait() benchmark show around a 20-30% increase in raw operations
    per second when the box is fully occupied (incremental thread counts),
    and up to 15% performance improvement with lower counts.

    Passes ltp epoll related testcases.

    This patch (of 6):

    All callers pass the EP_MAX_NESTS constant already, so let's simplify
    this a tad and get rid of the redundant parameter for nested eventpolls.

    Link: http://lkml.kernel.org/r/20181108051006.18751-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.
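
    Concretely, the conversion is of the following shape (an illustrative
    fragment with a hypothetical buffer and length, not buildable on its own):

        /* Before: the type argument was a historical artifact. */
        if (!access_ok(VERIFY_WRITE, buf, len))
                return -EFAULT;

        /* After: only the user address range is checked. */
        if (!access_ok(buf, len))
                return -EFAULT;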

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Dec, 2018

2 commits

  • Refactor the logic to restore the sigmask before the syscall
    returns into an api.
    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during
    the execution and restored after the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     
  • Refactor reading sigset from userspace and updating sigmask
    into an api.

    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during,
    and restored after, the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll,
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Note that the calls to sigprocmask() ignored the return value
    from the api as the function only returns an error on an invalid
    first argument that is hardcoded at these call sites.
    The updated logic uses set_current_blocked() instead.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     

23 Aug, 2018

7 commits

    Instead of having each caller pass the rdllink explicitly, just have
    ep_is_linked() look it up itself, while the callers only need to pass the
    epi pointer. This helper is all about the rdllink, and the change also
    improves the function's self-documentation.

    Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Jason Baron
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Similar to other calls, ep_poll() is not called with interrupts disabled,
    and we can therefore avoid the irq save/restore dance and just disable
    local irqs. In fact, the function should never be called in irq context
    at all, considering that the only path is

    epoll_wait(2) -> do_epoll_wait() -> ep_poll().

    When running on a 2 socket 40-core (ht) IvyBridge a common pipe based
    epoll_wait(2) microbenchmark, the following performance improvements are
    seen:

    # threads   vanilla     dirty
            1   1805587   2106412
            2   1854064   2090762
            4   1805484   2017436
            8   1751222   1974475
           16   1725299   1962104
           32   1378463   1571233
           64    787368    900784

    Which is pretty consistently near 15%.

    Also add a lockdep check such that we detect any mischief before
    deadlocking.

    Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Peter Zijlstra
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... 'tis easier on the eye.

    [akpm@linux-foundation.org: use inlines rather than macros]
    Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not
    save and restore interrupts when dealing with the ep->wq.lock. These are
    ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify
    and ep_remove.

    [akpm@linux-foundation.org: remove too-obvious comments]
    Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5
    Signed-off-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
    Both functions are similar to the context of ep_modify(), called via
    epoll_ctl(2). Just like ep_modify(), saving and restoring interrupts in
    these calls is overkill, as they will never be called with irqs disabled.
    While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also
    be called when releasing the file, but this also complies with the above.

    Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "fs/epoll: loosen irq safety when possible".

    Both patches replace saving+restoring interrupts when taking the ep->lock
    (now the waitqueue lock), with just disabling local irqs. This shows
    immediate performance benefits in patch 1 for an epoll workload running on
    Xen. The main concern we need to have with this sort of change in epoll
    is ep_poll_callback(), which is passed to the wait queue wakeup and very
    often runs in irq context; this patch does not touch that call.

    Patches have been tested pretty heavily with the customer workload,
    microbenchmarks, ltp testcases and two high level workloads that use epoll
    under the hood: nginx and libevent benchmarks.

    This patch (of 2):

    Saving and restoring interrupts in ep_scan_ready_list() is overkill, as
    it is never called with interrupts disabled. Loosen this to simply
    disabling local irqs, which benefits archs where managing irqs is
    expensive as well as virtual environments. This patch yields some
    throughput improvements on an epoll-intensive workload running on a
    single Xen DomU.

    1 Job    7500 -->  8800 enq/s  (+17%)
    2 Jobs  14000 --> 15200 enq/s   (+8%)
    3 Jobs  20500 --> 22300 enq/s   (+8%)
    4 Jobs  25000 --> 28000 enq/s  (+8-12%)

    On bare metal:

    For a 2-socket 40-core (ht) IvyBridge on a few workloads: unfortunately I
    don't have a Xen environment, and for the Xen results I do have (the
    numbers in patch 1) I don't have the actual workload, so I cannot
    compare them directly.

    1) Different configurations were used for an epoll_wait (pipe io)
    microbenchmark (http://linux-scalability.org/epoll/epoll-test.c), which
    shows around a 7-10% improvement in the overall total number of
    epoll_wait() loop iterations when using both regular and nested epolls;
    very raw numbers, but measurable nonetheless.

    # threads   vanilla     dirty
            1   1677717   1805587
            2   1660510   1854064
            4   1610184   1805484
            8   1577696   1751222
           16   1568837   1725299
           32   1291532   1378463
           64    752584    787368

    Note that stddev is pretty small.

    2) Another pipe test, which shows no real measurable improvement.
    (http://www.xmailserver.org/linux-patches/pipetest.c)

    Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Al Viro
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Patch series "waitqueue lockdep annotation", v3.

    This series adds a strategic lockdep_assert_held to __wake_up_common to
    ensure callers really do hold the wait_queue_head lock when calling the
    unlocked wake_up variants. It turns out epoll did not do this for a
    fairly common path (hit all the time by systemd during bootup), so the
    second patch fixed this instance as well.

    This patch (of 3):

    The epoll code currently uses the unlocked waitqueue helpers for managing
    ep->wq, but instead of holding the waitqueue lock around these calls, it
    uses its own ep->lock spinlock. Given that the waitqueue is not exposed
    to the rest of the kernel this actually works ok at the moment, but
    prevents the epoll locking rules from being enforced using lockdep.
    Remove ep->lock and use the waitqueue lock to not only reduce the size of
    struct eventpoll but also to make sure we can assert locking invariants in
    the waitqueue code.

    Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Baron
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Andrea Arcangeli
    Cc: Mike Rapoport
    Cc: Jason Baron
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jun, 2018

1 commit


26 May, 2018

1 commit