Eric Lee / smarc-fsl-linux-kernel

09 Sep, 2017

1 commit

b2ac2ea62 fs/epoll: use faster rb_first_cached() ... Browse Code »

... such that we can avoid the tree walks to get the node with the
smallest key. Semantically the same, as the previously used rb_first(),
but O(1). The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.

Link: http://lkml.kernel.org/r/20170719014603.19029-15-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Acked-by: Peter Zijlstra (Intel)
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2017-09-09 09:26:49 +0800

02 Sep, 2017

1 commit

138e4ad67 epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove() ... Browse Code »

The race was introduced by me in commit 971316f0503a ("epoll:
ep_unregister_pollwait() can use the freed pwq->whead"). I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
->whead = NULL, only whead->lock can save us from the race with
ep_free() or ep_remove().

Move ->whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

BUG: KASAN: use-after-free in debug_spin_lock_before
...
_raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll->lock),

...
Freed by task 17774:
...
kfree+0xe8/0x2c0 mm/slub.c:3883
ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2017-09-02 04:07:35 +0800

13 Jul, 2017

3 commits

92ef6da3d kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE ... Browse Code »

kcmp syscall is build iif CONFIG_CHECKPOINT_RESTORE is selected, so wrap
appropriate helpers in epoll code with the config to build it
conditionally.

Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
Signed-off-by: Cyrill Gorcunov
Reported-by: Andrew Morton
Cc: Andrey Vagin
Cc: Al Viro
Cc: Pavel Emelyanov
Cc: Michael Kerrisk
Cc: Jason Baron
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2017-07-13 07:26:01 +0800
0791e3644 kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files ... Browse Code »

With current epoll architecture target files are addressed with
file_struct and file descriptor number, where the last is not unique.
Moreover files can be transferred from another process via unix socket,
added into queue and closed then so we won't find this descriptor in the
task fdinfo list.

Thus to checkpoint and restore such processes CRIU needs to find out
where exactly the target file is present to add it into epoll queue.
For this sake one can use kcmp call where some particular target file
from the queue is compared with arbitrary file passed as an argument.

Because epoll target files can have same file descriptor number but
different file_struct a caller should explicitly specify the offset
within.

To test if some particular file is matching entry inside epoll one have
to

- fill kcmp_epoll_slot structure with epoll file descriptor,
target file number and target file offset (in case if only
one target is present then it should be 0)

- call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
- the kernel fetch file pointer matching file descriptor @fd of pid1
- lookups for file struct in epoll queue of pid2 and returns traditional
0,1,2 result for sorting purpose

Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
Signed-off-by: Cyrill Gorcunov
Acked-by: Andrey Vagin
Cc: Al Viro
Cc: Pavel Emelyanov
Cc: Michael Kerrisk
Cc: Jason Baron
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2017-07-13 07:26:01 +0800
77493f04b procfs: fdinfo: extend information about epoll target files ... Browse Code »

Since it is possbile to have same number in tfd field (say file added,
closed, then nother file dup'ed to same number and added back) it is
imposible to distinguish such target files solely by their numbers.

Strictly speaking regular applications don't need to recognize these
targets at all but for checkpoint/restore sake we need to collect
targets to be able to push them back on restore stage in a proper order.

Thus lets add file position, inode and device number where this target
lays. This three fields can be used as a primary key for sorting, and
together with kcmp help CRIU can find out an exact file target (from the
whole set of processes being checkpointed).

Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
Signed-off-by: Cyrill Gorcunov
Acked-by: Andrei Vagin
Cc: Al Viro
Cc: Pavel Emelyanov
Cc: Michael Kerrisk
Cc: Jason Baron
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2017-07-13 07:26:01 +0800

11 Jul, 2017

1 commit

c257a340e fs, epoll: short circuit fetching events if thread has been killed ... Browse Code »

We've encountered zombies that are waiting for a thread to exit that are
looping in ep_poll() almost endlessly although there is a pending
SIGKILL as a result of a group exit.

This happens because we always find ep_events_available() and fetch more
events and never are able to check for signal_pending() that would break
from the loop and return -EINTR.

Special case fatal signals and break immediately to guarantee that we
loop to fetch more events and delay making a timely exit.

It would also be possible to simply move the check for signal_pending()
higher than checking for ep_events_available(), but there have been no
reports of delayed signal handling other than SIGKILL preventing zombies
from exiting that would be fixed by this.

It fixes an issue for us where we have witnessed zombies sticking around
for at least O(minutes), but considering the code has been like this
forever and nobody else has complained that I have found, I would simply
queue it up for 4.12.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Cc: Alexander Viro
Cc: Jan Kara
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2017-07-11 07:32:36 +0800

20 Jun, 2017

2 commits

2055da973 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming ... Browse Code »

So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry

For example, this code:

rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-06-20 18:19:14 +0800
ac6424b98 sched/wait: Rename wait_queue_t => wait_queue_entry_t ... Browse Code »

Rename:

wait_queue_t => wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-06-20 18:18:27 +0800

25 Mar, 2017

1 commit

bf3b9f637 epoll: Add busy poll support to epoll with socket fds. ... Browse Code »

This patch adds busy poll support to epoll. The implementation is meant to
be opportunistic in that it will take the NAPI ID from the last socket
that is added to the ready list that contains a valid NAPI ID and it will
use that for busy polling until the ready list goes empty. Once the ready
list goes empty the NAPI ID is reset and busy polling is disabled until a
new socket is added to the ready list.

In addition when we insert a new socket into the epoll we record the NAPI
ID and assume we are going to receive events on it. If that doesn't occur
it will be evicted as the active NAPI ID and we will resume normal
behavior.

An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
options to spread the incoming connections to specific worker threads
based on the incoming queue. This enables epoll for each worker thread
to have only sockets that receive packets from a single queue. So when an
application calls epoll_wait() and there are no events available to report,
busy polling is done on the associated queue to pull the packets.

Signed-off-by: Sridhar Samudrala
Signed-off-by: Alexander Duyck
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Sridhar Samudrala
2017-03-25 11:49:31 +0800

02 Mar, 2017

1 commit

174cd4b1e sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sc… ... Browse Code »

…hed.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ingo Molnar
2017-03-02 15:42:32 +0800

28 Feb, 2017

1 commit

c857ab640 fs,eventpoll: don't test for bitfield with stack value ... Browse Code »

In case if epoll_ctl is called with operation EPOLL_CTL_DEL then
@epds.events variable allocated on stack may contain random bits which
we test then for EPOLLEXCLUSIVE. Since currently the test look like

if (epds.events & EPOLLEXCLUSIVE) {
if (op == EPOLL_CTL_MOD)
goto error_tgt_fput;
if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
goto error_tgt_fput;
}

Nothing serious will happen even if epds.events has this bit set, still
better to be on safe side and make sure that we're to test this bit at
all.

Link: http://lkml.kernel.org/r/20170214154935.GG1850@uranus.lan
Signed-off-by: Cyrill Gorcunov
Cc: Al Viro
Cc: Andrey Vagin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2017-02-28 10:43:45 +0800

25 Dec, 2016

1 commit

7c0f6ba68 Replace <asm/uaccess.h> with <linux/uaccess.h> globally ... Browse Code »

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*'
sed -i -e "s!$PATT!#include !" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-12-25 03:46:01 +0800

20 May, 2016

1 commit

766b9f928 fs: poll/select/recvmmsg: use timespec64 for timeout events ... Browse Code »

struct timespec is not y2038 safe. Even though timespec might be
sufficient to represent timeouts, use struct timespec64 here as the plan
is to get rid of all timespec reference in the kernel.

The patch transitions the common functions: poll_select_set_timeout()
and select_estimate_accuracy() to use timespec64. And, all the syscalls
that use these functions are transitioned in the same patch.

The restart block parameters for poll uses monotonic time. Use
timespec64 here as well to assign timeout value. This parameter in the
restart block need not change because this only holds the monotonic
timestamp at which timeout should occur. And, unsigned long data type
should be big enough for this timestamp.

The system call interfaces will be handled in a separate series.

Compat interfaces need not change as timespec64 is an alias to struct
timespec on a 64 bit system.

Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani
Acked-by: John Stultz
Acked-by: David S. Miller
Cc: Alexander Viro
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Deepa Dinamani
2016-05-20 10:12:14 +0800

18 Mar, 2016

1 commit

da8b44d5a timer: convert timer_slack_ns from unsigned long to u64 ... Browse Code »

This patchset introduces a /proc//timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).

The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems. It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.

The second patch introduces the /proc//timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.

With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:

$ time sleep 1

real 0m10.747s
user 0m0.001s
sys 0m0.005s

The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s. Let me
know if it makes sense to break that up more or not.

Other than that things are fairly straightforward.

This patch (of 2):

The timer_slack_ns value in the task struct is currently a unsigned
long. This means that on 32bit applications, the maximum slack is just
over 4 seconds. However, on 64bit machines, its much much larger (~500
years).

This disparity could make application development a little (as well as
the default_slack) to a u64. This means both 32bit and 64bit systems
have the same effective internal slack range.

Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.

This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.

Signed-off-by: John Stultz
Cc: Arjan van de Ven
Cc: Thomas Gleixner
Cc: Oren Laadan
Cc: Ruchi Kandoi
Cc: Rom Lemarchand
Cc: Kees Cook
Cc: Android Kernel Team
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

John Stultz
2016-03-18 06:09:34 +0800

06 Feb, 2016

1 commit

b6a515c8a epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT ... Browse Code »

In the current implementation of the EPOLLEXCLUSIVE flag (added for
4.5-rc1), if epoll waiters create different POLL* sets and register them
as exclusive against the same target fd, the current implementation will
stop waking any further waiters once it finds the first idle waiter.
This means that waiters could miss wakeups in certain cases.

For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
one epoll set or epfd is added to pipe p with POLLIN and a second set
epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
wakeup since the current implementation will stop after it finds any
intersection of events with a waiter that is blocked in epoll_wait().

We could potentially address this by requiring all epoll waiters that
are added to p be required to pass the same set of POLL* events. IE the
first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
However, I think it might be somewhat confusing interface as we would
have to reference count the number of users for that set, and so
userspace would have to keep track of that count, or we would need a
more involved interface. It also adds some shared state that we'd have
store somewhere. I don't think anybody will want to bloat
__wait_queue_head for this.

I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
such that it can only be specified with EPOLLIN and/or EPOLLOUT. So
that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
once we hit the first idle waiter that specifies the EPOLLIN bit, since
any remaining waiters that only have 'POLLOUT' set wouldn't need to be
woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the
wake bit set (there is at least one example of this I saw in fs/pipe.c),
then we just wake the entire exclusive list. Having both 'POLLOUT' and
'POLLIN' both set should not be on any performance critical path, so I
think that's ok (in fs/pipe.c its in pipe_release()). We also continue
to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus,
the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
so.

Since epoll waiters may be interested in other events as well besides
EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
doing a 'dup' call on the target fd and adding that as one normally
would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT
events are what we are interest in balancing, I think that the 'dup'
thing could perhaps be added to only one of the waiter threads.
However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
sufficient for the majority of use-cases.

Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
among multiple epfds, where between 1 and n of the epfds may receive an
event, it does not satisfy the semantics of EPOLLONESHOT where only 1
epfd would get an event. Thus, it is not allowed to be specified in
conjunction with EPOLLEXCLUSIVE.

EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
interesting, but this could be relaxed at some further point.

Signed-off-by: Jason Baron
Tested-by: Madars Vitolins
Cc: Michael Kerrisk
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Al Viro
Cc: Eric Wong
Cc: Jonathan Corbet
Cc: Andy Lutomirski
Cc: Hagen Paul Pfeifer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2016-02-06 10:10:40 +0800

21 Jan, 2016

1 commit

df0108c5d epoll: add EPOLLEXCLUSIVE flag ... Browse Code »

Currently, epoll file descriptors or epfds (the fd returned from
epoll_create[1]()) that are added to a shared wakeup source are always
added in a non-exclusive manner. This means that when we have multiple
epfds attached to a shared fd source they are all woken up. This creates
thundering herd type behavior.

Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation. This new
flag allows for exclusive wakeups when there are multiple epfds attached
to a shared fd event source.

The implementation walks the list of exclusive waiters, and queues an
event to each epfd, until it finds the first waiter that has threads
blocked on it via epoll_wait(). The idea is to search for threads which
are idle and ready to process the wakeup events. Thus, we queue an event
to at least 1 epfd, but may still potentially queue an event to all epfds
that are attached to the shared fd source.

Performance testing was done by Madars Vitolins using a modified version
of Enduro/X. The use of the 'EPOLLEXCLUSIVE' flag reduce the length of
this particular workload from 860s down to 24s.

Sample epoll_clt text:

EPOLLEXCLUSIVE

Sets an exclusive wakeup mode for the epfd file descriptor that is
being attached to the target file descriptor, fd. Thus, when an event
occurs and multiple epfd file descriptors are attached to the same
target file using EPOLLEXCLUSIVE, one or more epfds will receive an
event with epoll_wait(2). The default in this scenario (when
EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.

Signed-off-by: Jason Baron
Tested-by: Madars Vitolins
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Al Viro
Cc: Michael Kerrisk
Cc: Eric Wong
Cc: Jonathan Corbet
Cc: Andy Lutomirski
Cc: Hagen Paul Pfeifer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2016-01-21 09:09:18 +0800

14 Feb, 2015

1 commit

4d5755b14 epoll: optimize setting task running after blocking ... Browse Code »

After waking up a task waiting for an event, we explicitly mark it as
TASK_RUNNING (which is necessary as we do the checks for wakeups as
TASK_INTERRUPTIBLE). Once running and dealing with actually delivering
the events, we're obviously not planning on calling schedule, thus we can
relax the implied barrier and simply update the state with
__set_current_state().

Signed-off-by: Davidlohr Bueso
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Davidlohr Bueso
2015-02-14 13:21:40 +0800

06 Nov, 2014

1 commit

a3816ab0e fs: Convert show_fdinfo functions to void ... Browse Code »

seq_printf functions shouldn't really check the return value.
Checking seq_has_overflowed() occasionally is used instead.

Update vfs documentation.

Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com

Cc: David S. Miller
Cc: Al Viro
Signed-off-by: Joe Perches
[ did a few clean ups ]
Signed-off-by: Steven Rostedt

Joe Perches
2014-11-06 03:13:23 +0800

11 Sep, 2014

1 commit

c680e41b3 eventpoll: fix uninitialized variable in epoll_ctl ... Browse Code »

When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
not initialized but ep_take_care_of_epollwakeup reads its event field.
When this unintialized field has EPOLLWAKEUP bit set, a capability check
is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup. This
produces unexpected messages in the audit log, such as (on a system
running SELinux):

type=AVC msg=audit(1408212798.866:410): avc: denied
{ block_suspend } for pid=7754 comm="dbus-daemon" capability=36
scontext=unconfined_u:unconfined_r:unconfined_t
tcontext=unconfined_u:unconfined_r:unconfined_t
tclass=capability2 permissive=1

type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
exe="/usr/bin/dbus-daemon"
subj=unconfined_u:unconfined_r:unconfined_t key=(null)

("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")

Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.

Fixes: 4d7e30d98939 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
Signed-off-by: Nicolas Iooss
Cc: Alexander Viro
Cc: Arve Hjønnevåg
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nicolas Iooss
2014-09-11 06:42:12 +0800

17 Jun, 2014

1 commit

ebe06187b epoll: fix use-after-free in eventpoll_release_file ... Browse Code »

This fixes use-after-free of epi->fllink.next inside list loop macro.
This loop actually releases elements in the body. The list is
rcu-protected but here we cannot hold rcu_read_lock because we need to
lock mutex inside.

The obvious solution is to use list_for_each_entry_safe(). RCU-ness
isn't essential because nobody can change this list under us, it's final
fput for this file.

The bug was introduced by ae10b2b4eb01 ("epoll: optimize EPOLL_CTL_DEL
using rcu")

Signed-off-by: Konstantin Khlebnikov
Reported-by: Cyrill Gorcunov
Cc: Stable # 3.13+
Cc: Sasha Levin
Cc: Jason Baron
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2014-06-17 11:21:59 +0800

07 Jun, 2014

1 commit

1f7e0616c fs: convert use of typedef ctl_table to struct ctl_table ... Browse Code »

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2014-06-07 07:08:16 +0800

03 Jan, 2014

1 commit

4ff36ee94 epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL ... Browse Code »

The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with
commmit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple
topologies").

The acquistion of the ep->mtx for the destination 'ep' was added such
that a concurrent EPOLL_CTL_ADD operation would see the correct state of
the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')

However, by simply not acquiring the lock, we do not serialize behind
the ep->mtx from the add path, and thus may perform a full path check
when if we had waited a little longer it may not have been necessary.
However, this is a transient state, and performing the full loop
checking in this case is not harmful.

The important point is that we wouldn't miss doing the full loop
checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
its operating upon. The reason we don't need to do lock ordering in the
add path, is that we are already are holding the global 'epmutex'
whenever we do the double lock. Further, the original posting of this
patch, which was tested for the intended performance gains, did not
perform this additional locking.

Signed-off-by: Jason Baron
Cc: Nathan Zimmer
Cc: Eric Wong
Cc: Nelson Elhage
Cc: Al Viro
Cc: Davide Libenzi
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2014-01-03 06:40:30 +0800

03 Dec, 2013

1 commit

95f19f658 epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled ... Browse Code »

Drop EPOLLWAKEUP from epoll events mask if CONFIG_PM_SLEEP is disabled.

Signed-off-by: Amit Pundir
Cc: John Stultz
Cc: Alexander Viro
Signed-off-by: Rafael J. Wysocki

Amit Pundir
2013-12-03 22:35:52 +0800

13 Nov, 2013

4 commits

5cbb3d216 Merge branch 'akpm' (patches from Andrew Morton) ... Browse Code »

Merge first patch-bomb from Andrew Morton:
"Quite a lot of other stuff is banked up awaiting further
next->mainline merging, but this batch contains:

- Lots of random misc patches
- OCFS2
- Most of MM
- backlight updates
- lib/ updates
- printk updates
- checkpatch updates
- epoll tweaking
- rtc updates
- hfs
- hfsplus
- documentation
- procfs
- update gcov to gcc-4.7 format
- IPC"

* emailed patches from Andrew Morton : (269 commits)
ipc, msg: fix message length check for negative values
ipc/util.c: remove unnecessary work pending test
devpts: plug the memory leak in kill_sb
./Makefile: export initial ramdisk compression config option
init/Kconfig: add option to disable kernel compression
drivers: w1: make w1_slave::flags long to avoid memory corruption
drivers/w1/masters/ds1wm.cuse dev_get_platdata()
drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
drivers/memstick/core/mspro_block.c: fix attributes array allocation
drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
kernel/panic.c: reduce 1 byte usage for print tainted buffer
gcov: reuse kbasename helper
kernel/gcov/fs.c: use pr_warn()
kernel/module.c: use pr_foo()
gcov: compile specific gcov implementation based on gcc version
gcov: add support for gcc 4.7 gcov format
gcov: move gcov structs definitions to a gcc version specific file
kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
...

Linus Torvalds
2013-11-13 14:45:43 +0800
9bc9ccd7d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs updates from Al Viro:
"All kinds of stuff this time around; some more notable parts:

- RCU'd vfsmounts handling
- new primitives for coredump handling
- files_lock is gone
- Bruce's delegations handling series
- exportfs fixes

plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
ecryptfs: ->f_op is never NULL
locks: break delegations on any attribute modification
locks: break delegations on link
locks: break delegations on rename
locks: helper functions for delegation breaking
locks: break delegations on unlink
namei: minor vfs_unlink cleanup
locks: implement delegations
locks: introduce new FL_DELEG lock flag
vfs: take i_mutex on renamed file
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: pull ext4's double-i_mutex-locking into common code
exportfs: fix quadratic behavior in filehandle lookup
exportfs: better variable name
exportfs: move most of reconnect_path to helper function
exportfs: eliminate unused "noprogress" counter
exportfs: stop retrying once we race with rename/remove
exportfs: clear DISCONNECTED on all parents sooner
exportfs: more detailed comment for path_reconnect
...

Linus Torvalds
2013-11-13 14:34:18 +0800
67347fe4e epoll: do not take global 'epmutex' for simple topologies ... Browse Code »

When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
directly to a wakeup source, we do not need to take the global 'epmutex',
unless the epoll file descriptor is nested. The purpose of taking the
'epmutex' on add is to prevent complex topologies such as loops and deep
wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
operations. However, for the simple case of an epoll file descriptor
attached directly to a wakeup source (with no nesting), we do not need to
hold the 'epmutex'.

This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
performance:

"On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In
addition the benchmark when from scaling well on 10 sockets to scaling
well on just over 40 sockets.

...

Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck."

[akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
Signed-off-by: Jason Baron
Tested-by: Nathan Zimmer
Cc: Eric Wong
Cc: Nelson Elhage
Cc: Al Viro
Cc: Davide Libenzi
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2013-11-13 11:09:25 +0800
ae10b2b4e epoll: optimize EPOLL_CTL_DEL using rcu ... Browse Code »

Nathan Zimmer found that once we get over 10+ cpus, the scalability of
SPECjbb falls over due to the contention on the global 'epmutex', which is
taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.

Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
by using rcu to guard against any concurrent traversals.

Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for
simple topologies. IE when adding a link from an epoll file descriptor to
a wakeup source, where the epoll file descriptor is not nested.

This patch (of 2):

Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
converting the file->f_ep_links list into an rcu one. In this way, we can
traverse the epoll network on the add path in parallel with deletes.
Since deletes can't create loops or worse wakeup paths, this is safe.

This patch in combination with the patch "epoll: Do not take global 'epmutex'
for simple topologies", shows a dramatic performance improvement in
scalability for SPECjbb.

Signed-off-by: Jason Baron
Tested-by: Nathan Zimmer
Cc: Eric Wong
Cc: Nelson Elhage
Cc: Al Viro
Cc: Davide Libenzi
Cc: "Paul E. McKenney"
CC: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2013-11-13 11:09:25 +0800

30 Oct, 2013

1 commit

c511851de Revert "epoll: use freezable blocking call" ... Browse Code »

This reverts commit 1c441e921201 (epoll: use freezable blocking call)
which is reported to cause user space memory corruption to happen
after suspend to RAM.

Since it appears to be extremely difficult to root cause this
problem, it is best to revert the offending commit and try to address
the original issue in a better way later.

References: https://bugzilla.kernel.org/show_bug.cgi?id=61781
Reported-by: Natrio
Reported-by: Jeff Pohlmeyer
Bisected-by: Leo Wolf
Fixes: 1c441e921201 (epoll: use freezable blocking call)
Signed-off-by: Rafael J. Wysocki
Cc: 3.11+ # 3.11+

Rafael J. Wysocki
2013-10-30 22:27:53 +0800

25 Oct, 2013

1 commit

72c2d5319 file->f_op is never NULL... ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-10-25 11:34:54 +0800

12 Sep, 2013

1 commit

91cf5ab60 epoll: add a reschedule point in ep_free() ... Browse Code »

ep_free() might iterate on a huge set of epitems and hold cpu too long.
Add two cond_resched() in order to yield cpu to other tasks. This is safe
as we only hold mutexes in this function.

Signed-off-by: Eric Dumazet
Cc: Al Viro
Cc: Theodore Ts'o
Acked-by: Eric Wong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2013-09-12 06:58:50 +0800

04 Sep, 2013

1 commit

7e3fb5842 switch epoll_ctl() to fdget ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-09-04 11:04:44 +0800

04 Jul, 2013

2 commits

7f0ef0267 Merge branch 'akpm' (updates from Andrew Morton) ... Browse Code »

Merge first patch-bomb from Andrew Morton:
- various misc bits
- I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
distracted. There has been quite a bit of activity.
- About half the MM queue
- Some backlight bits
- Various lib/ updates
- checkpatch updates
- zillions more little rtc patches
- ptrace
- signals
- exec
- procfs
- rapidio
- nbd
- aoe
- pps
- memstick
- tools/testing/selftests updates

* emailed patches from Andrew Morton : (445 commits)
tools/testing/selftests: don't assume the x bit is set on scripts
selftests: add .gitignore for kcmp
selftests: fix clean target in kcmp Makefile
selftests: add .gitignore for vm
selftests: add hugetlbfstest
self-test: fix make clean
selftests: exit 1 on failure
kernel/resource.c: remove the unneeded assignment in function __find_resource
aio: fix wrong comment in aio_complete()
drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
drivers/memstick/host/r592.c: convert to module_pci_driver
drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
pps-gpio: add device-tree binding and support
drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
drivers/parport/share.c: use kzalloc
Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
aoe: update internal version number to v83
aoe: update copyright date
aoe: perform I/O completions in parallel
...

Linus Torvalds
2013-07-04 08:12:13 +0800
77d559180 signals: eventpoll: do not use sigprocmask() ... Browse Code »

sigprocmask() should die. None of the current callers actually
need this strange interface.

Change fs/eventpoll.c to use set_current_blocked(). This also
means we should not worry about SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov
Cc: Al Viro
Cc: Denys Vlasenko
Cc: Eric Wong
Cc: Jason Baron
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-07-04 07:08:01 +0800

12 May, 2013

1 commit

1c441e921 epoll: use freezable blocking call ... Browse Code »

Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo
Signed-off-by: Colin Cross
Signed-off-by: Rafael J. Wysocki

Colin Cross
2013-05-12 20:16:22 +0800

01 May, 2013

6 commits

08d767608 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions

Linus Torvalds
2013-05-01 22:21:43 +0800
d6d67e723 epoll: cleanup: use RCU_INIT_POINTER when nulling ... Browse Code »

It is always safe to use RCU_INIT_POINTER to NULL a pointer. This results
in slightly smaller/faster code.

Signed-off-by: Eric Wong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
450d89ec0 epoll: cleanup: hoist out f_op->poll calls ... Browse Code »

This reduces the amount of code inside the ready list iteration loops for
better readability IMHO.

Signed-off-by: Eric Wong
Cc: Davide Libenzi
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
ddf676c38 epoll: lock ep->mtx in ep_free to silence lockdep ... Browse Code »

Technically we do not need to hold ep->mtx during ep_free since we are
certain there are no other users of ep at that point. However, lockdep
complains with a "suspicious rcu_dereference_check() usage!" message; so
lock the mutex before ep_remove to silence the warning.

Signed-off-by: Eric Wong
Cc: Al Viro
Cc: Arve Hjønnevåg
Cc: Davide Libenzi
Cc: Eric Dumazet
Cc: NeilBrown ,
Cc: Rafael J. Wysocki
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
eea1d5859 epoll: use RCU to protect wakeup_source in epitem ... Browse Code »

This prevents wakeup_source destruction when a user hits the item with
EPOLL_CTL_MOD while ep_poll_callback is running.

Tested with CONFIG_SPARSE_RCU_POINTER=y and "make fs/eventpoll.o C=2"

Signed-off-by: Eric Wong
Cc: Alexander Viro
Cc: Arve Hjønnevåg
Cc: Davide Libenzi
Cc: Eric Dumazet
Cc: NeilBrown
Cc: "Rafael J. Wysocki"
Cc: "Paul E. McKenney"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
39732ca5a epoll: trim epitem by one cache line ... Browse Code »

It is common for epoll users to have thousands of epitems, so saving a
cache line on every allocation leads to large memory savings.

Since epitem allocations are cache-aligned, reducing sizeof(struct
epitem) from 136 bytes to 128 bytes will allow it to squeeze under a
cache line boundary on x86_64.

Via /sys/kernel/slab/eventpoll_epi, I see the following changes on my
x86_64 Core2 Duo (which has 64-byte cache alignment):

object_size : 192 => 128
objs_per_slab: 21 => 32

Also, add a BUILD_BUG_ON() to check for future accidental breakage.

[akpm@linux-foundation.org: use __packed, for all architectures]
Signed-off-by: Eric Wong
Cc: Davide Libenzi
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800