Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

03 Jan, 2014

1 commit

4ff36ee94 epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL ... Browse Code »

The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with
commmit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple
topologies").

The acquistion of the ep->mtx for the destination 'ep' was added such
that a concurrent EPOLL_CTL_ADD operation would see the correct state of
the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')

However, by simply not acquiring the lock, we do not serialize behind
the ep->mtx from the add path, and thus may perform a full path check
when if we had waited a little longer it may not have been necessary.
However, this is a transient state, and performing the full loop
checking in this case is not harmful.

The important point is that we wouldn't miss doing the full loop
checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
its operating upon. The reason we don't need to do lock ordering in the
add path, is that we are already are holding the global 'epmutex'
whenever we do the double lock. Further, the original posting of this
patch, which was tested for the intended performance gains, did not
perform this additional locking.

Signed-off-by: Jason Baron
Cc: Nathan Zimmer
Cc: Eric Wong
Cc: Nelson Elhage
Cc: Al Viro
Cc: Davide Libenzi
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2014-01-03 06:40:30 +0800

03 Dec, 2013

1 commit

95f19f658 epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled ... Browse Code »

Drop EPOLLWAKEUP from epoll events mask if CONFIG_PM_SLEEP is disabled.

Signed-off-by: Amit Pundir
Cc: John Stultz
Cc: Alexander Viro
Signed-off-by: Rafael J. Wysocki

Amit Pundir
2013-12-03 22:35:52 +0800

13 Nov, 2013

4 commits

5cbb3d216 Merge branch 'akpm' (patches from Andrew Morton) ... Browse Code »

Merge first patch-bomb from Andrew Morton:
"Quite a lot of other stuff is banked up awaiting further
next->mainline merging, but this batch contains:

- Lots of random misc patches
- OCFS2
- Most of MM
- backlight updates
- lib/ updates
- printk updates
- checkpatch updates
- epoll tweaking
- rtc updates
- hfs
- hfsplus
- documentation
- procfs
- update gcov to gcc-4.7 format
- IPC"

* emailed patches from Andrew Morton : (269 commits)
ipc, msg: fix message length check for negative values
ipc/util.c: remove unnecessary work pending test
devpts: plug the memory leak in kill_sb
./Makefile: export initial ramdisk compression config option
init/Kconfig: add option to disable kernel compression
drivers: w1: make w1_slave::flags long to avoid memory corruption
drivers/w1/masters/ds1wm.cuse dev_get_platdata()
drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
drivers/memstick/core/mspro_block.c: fix attributes array allocation
drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
kernel/panic.c: reduce 1 byte usage for print tainted buffer
gcov: reuse kbasename helper
kernel/gcov/fs.c: use pr_warn()
kernel/module.c: use pr_foo()
gcov: compile specific gcov implementation based on gcc version
gcov: add support for gcc 4.7 gcov format
gcov: move gcov structs definitions to a gcc version specific file
kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
...

Linus Torvalds
2013-11-13 14:45:43 +0800
9bc9ccd7d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull vfs updates from Al Viro:
"All kinds of stuff this time around; some more notable parts:

- RCU'd vfsmounts handling
- new primitives for coredump handling
- files_lock is gone
- Bruce's delegations handling series
- exportfs fixes

plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
ecryptfs: ->f_op is never NULL
locks: break delegations on any attribute modification
locks: break delegations on link
locks: break delegations on rename
locks: helper functions for delegation breaking
locks: break delegations on unlink
namei: minor vfs_unlink cleanup
locks: implement delegations
locks: introduce new FL_DELEG lock flag
vfs: take i_mutex on renamed file
vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
vfs: don't use PARENT/CHILD lock classes for non-directories
vfs: pull ext4's double-i_mutex-locking into common code
exportfs: fix quadratic behavior in filehandle lookup
exportfs: better variable name
exportfs: move most of reconnect_path to helper function
exportfs: eliminate unused "noprogress" counter
exportfs: stop retrying once we race with rename/remove
exportfs: clear DISCONNECTED on all parents sooner
exportfs: more detailed comment for path_reconnect
...

Linus Torvalds
2013-11-13 14:34:18 +0800
67347fe4e epoll: do not take global 'epmutex' for simple topologies ... Browse Code »

When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
directly to a wakeup source, we do not need to take the global 'epmutex',
unless the epoll file descriptor is nested. The purpose of taking the
'epmutex' on add is to prevent complex topologies such as loops and deep
wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
operations. However, for the simple case of an epoll file descriptor
attached directly to a wakeup source (with no nesting), we do not need to
hold the 'epmutex'.

This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
performance:

"On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In
addition the benchmark when from scaling well on 10 sockets to scaling
well on just over 40 sockets.

...

Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck."

[akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
Signed-off-by: Jason Baron
Tested-by: Nathan Zimmer
Cc: Eric Wong
Cc: Nelson Elhage
Cc: Al Viro
Cc: Davide Libenzi
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2013-11-13 11:09:25 +0800
ae10b2b4e epoll: optimize EPOLL_CTL_DEL using rcu ... Browse Code »
18

Nathan Zimmer found that once we get over 10+ cpus, the scalability of
SPECjbb falls over due to the contention on the global 'epmutex', which is
taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.

Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
by using rcu to guard against any concurrent traversals.

Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for
simple topologies. IE when adding a link from an epoll file descriptor to
a wakeup source, where the epoll file descriptor is not nested.

This patch (of 2):

Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
converting the file->f_ep_links list into an rcu one. In this way, we can
traverse the epoll network on the add path in parallel with deletes.
Since deletes can't create loops or worse wakeup paths, this is safe.

This patch in combination with the patch "epoll: Do not take global 'epmutex'
for simple topologies", shows a dramatic performance improvement in
scalability for SPECjbb.

Signed-off-by: Jason Baron
Tested-by: Nathan Zimmer
Cc: Eric Wong
Cc: Nelson Elhage
Cc: Al Viro
Cc: Davide Libenzi
Cc: "Paul E. McKenney"
CC: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2013-11-13 11:09:25 +0800

30 Oct, 2013

1 commit

c511851de Revert "epoll: use freezable blocking call" ... Browse Code »

This reverts commit 1c441e921201 (epoll: use freezable blocking call)
which is reported to cause user space memory corruption to happen
after suspend to RAM.

Since it appears to be extremely difficult to root cause this
problem, it is best to revert the offending commit and try to address
the original issue in a better way later.

References: https://bugzilla.kernel.org/show_bug.cgi?id=61781
Reported-by: Natrio
Reported-by: Jeff Pohlmeyer
Bisected-by: Leo Wolf
Fixes: 1c441e921201 (epoll: use freezable blocking call)
Signed-off-by: Rafael J. Wysocki
Cc: 3.11+ # 3.11+

Rafael J. Wysocki
2013-10-30 22:27:53 +0800

25 Oct, 2013

1 commit

72c2d5319 file->f_op is never NULL... ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-10-25 11:34:54 +0800

12 Sep, 2013

1 commit

91cf5ab60 epoll: add a reschedule point in ep_free() ... Browse Code »

ep_free() might iterate on a huge set of epitems and hold cpu too long.
Add two cond_resched() in order to yield cpu to other tasks. This is safe
as we only hold mutexes in this function.

Signed-off-by: Eric Dumazet
Cc: Al Viro
Cc: Theodore Ts'o
Acked-by: Eric Wong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Dumazet
2013-09-12 06:58:50 +0800

04 Sep, 2013

1 commit

7e3fb5842 switch epoll_ctl() to fdget ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-09-04 11:04:44 +0800

04 Jul, 2013

2 commits

7f0ef0267 Merge branch 'akpm' (updates from Andrew Morton) ... Browse Code »

Merge first patch-bomb from Andrew Morton:
- various misc bits
- I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
distracted. There has been quite a bit of activity.
- About half the MM queue
- Some backlight bits
- Various lib/ updates
- checkpatch updates
- zillions more little rtc patches
- ptrace
- signals
- exec
- procfs
- rapidio
- nbd
- aoe
- pps
- memstick
- tools/testing/selftests updates

* emailed patches from Andrew Morton : (445 commits)
tools/testing/selftests: don't assume the x bit is set on scripts
selftests: add .gitignore for kcmp
selftests: fix clean target in kcmp Makefile
selftests: add .gitignore for vm
selftests: add hugetlbfstest
self-test: fix make clean
selftests: exit 1 on failure
kernel/resource.c: remove the unneeded assignment in function __find_resource
aio: fix wrong comment in aio_complete()
drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
drivers/memstick/host/r592.c: convert to module_pci_driver
drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
pps-gpio: add device-tree binding and support
drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
drivers/parport/share.c: use kzalloc
Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
aoe: update internal version number to v83
aoe: update copyright date
aoe: perform I/O completions in parallel
...

Linus Torvalds
2013-07-04 08:12:13 +0800
77d559180 signals: eventpoll: do not use sigprocmask() ... Browse Code »

sigprocmask() should die. None of the current callers actually
need this strange interface.

Change fs/eventpoll.c to use set_current_blocked(). This also
means we should not worry about SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov
Cc: Al Viro
Cc: Denys Vlasenko
Cc: Eric Wong
Cc: Jason Baron
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2013-07-04 07:08:01 +0800

12 May, 2013

1 commit

1c441e921 epoll: use freezable blocking call ... Browse Code »

Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo
Signed-off-by: Colin Cross
Signed-off-by: Rafael J. Wysocki

Colin Cross
2013-05-12 20:16:22 +0800

01 May, 2013

6 commits

08d767608 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal ... Browse Code »

Pull compat cleanup from Al Viro:
"Mostly about syscall wrappers this time; there will be another pile
with patches in the same general area from various people, but I'd
rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
syscalls.h: slightly reduce the jungles of macros
get rid of union semop in sys_semctl(2) arguments
make do_mremap() static
sparc: no need to sign-extend in sync_file_range() wrapper
ppc compat wrappers for add_key(2) and request_key(2) are pointless
x86: trim sys_ia32.h
x86: sys32_kill and sys32_mprotect are pointless
get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
merge compat sys_ipc instances
consolidate compat lookup_dcookie()
convert vmsplice to COMPAT_SYSCALL_DEFINE
switch getrusage() to COMPAT_SYSCALL_DEFINE
switch epoll_pwait to COMPAT_SYSCALL_DEFINE
convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
make HAVE_SYSCALL_WRAPPERS unconditional
consolidate cond_syscall and SYSCALL_ALIAS declarations
teach SYSCALL_DEFINE how to deal with long long/unsigned long long
get rid of duplicate logics in __SC_....[1-6] definitions

Linus Torvalds
2013-05-01 22:21:43 +0800
d6d67e723 epoll: cleanup: use RCU_INIT_POINTER when nulling ... Browse Code »

It is always safe to use RCU_INIT_POINTER to NULL a pointer. This results
in slightly smaller/faster code.

Signed-off-by: Eric Wong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
450d89ec0 epoll: cleanup: hoist out f_op->poll calls ... Browse Code »

This reduces the amount of code inside the ready list iteration loops for
better readability IMHO.

Signed-off-by: Eric Wong
Cc: Davide Libenzi
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
ddf676c38 epoll: lock ep->mtx in ep_free to silence lockdep ... Browse Code »

Technically we do not need to hold ep->mtx during ep_free since we are
certain there are no other users of ep at that point. However, lockdep
complains with a "suspicious rcu_dereference_check() usage!" message; so
lock the mutex before ep_remove to silence the warning.

Signed-off-by: Eric Wong
Cc: Al Viro
Cc: Arve Hjønnevåg
Cc: Davide Libenzi
Cc: Eric Dumazet
Cc: NeilBrown ,
Cc: Rafael J. Wysocki
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
eea1d5859 epoll: use RCU to protect wakeup_source in epitem ... Browse Code »

This prevents wakeup_source destruction when a user hits the item with
EPOLL_CTL_MOD while ep_poll_callback is running.

Tested with CONFIG_SPARSE_RCU_POINTER=y and "make fs/eventpoll.o C=2"

Signed-off-by: Eric Wong
Cc: Alexander Viro
Cc: Arve Hjønnevåg
Cc: Davide Libenzi
Cc: Eric Dumazet
Cc: NeilBrown
Cc: "Rafael J. Wysocki"
Cc: "Paul E. McKenney"
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800
39732ca5a epoll: trim epitem by one cache line ... Browse Code »

It is common for epoll users to have thousands of epitems, so saving a
cache line on every allocation leads to large memory savings.

Since epitem allocations are cache-aligned, reducing sizeof(struct
epitem) from 136 bytes to 128 bytes will allow it to squeeze under a
cache line boundary on x86_64.

Via /sys/kernel/slab/eventpoll_epi, I see the following changes on my
x86_64 Core2 Duo (which has 64-byte cache alignment):

object_size : 192 => 128
objs_per_slab: 21 => 32

Also, add a BUILD_BUG_ON() to check for future accidental breakage.

[akpm@linux-foundation.org: use __packed, for all architectures]
Signed-off-by: Eric Wong
Cc: Davide Libenzi
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Wong
2013-05-01 08:04:04 +0800

04 Mar, 2013

1 commit

35280bd4a switch epoll_pwait to COMPAT_SYSCALL_DEFINE ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2013-03-04 11:58:49 +0800

03 Jan, 2013

1 commit

128dd1759 epoll: prevent missed events on EPOLL_CTL_MOD ... Browse Code »

EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
ensure events are not missed. Since the modifications to the interest
mask are not protected by the same lock as ep_poll_callback, we need to
ensure the change is visible to other CPUs calling ep_poll_callback.

We also need to ensure f_op->poll() has an up-to-date view of past
events which occured before we modified the interest mask. So this
barrier also pairs with the barrier in wq_has_sleeper().

This should guarantee either ep_poll_callback or f_op->poll() (or both)
will notice the readiness of a recently-ready/modified item.

This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
http://thread.gmane.org/gmane.linux.kernel/1408782/

Signed-off-by: Eric Wong
Cc: Hans Verkuil
Cc: Jiri Olsa
Cc: Jonathan Corbet
Cc: Al Viro
Cc: Davide Libenzi
Cc: Hans de Goede
Cc: Mauro Carvalho Chehab
Cc: David Miller
Cc: Eric Dumazet
Cc: Andrew Morton
Cc: Andreas Voellmy
Tested-by: "Junchang(Jason) Wang"
Cc: netdev@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds

Eric Wong
2013-01-03 01:16:43 +0800

18 Dec, 2012

1 commit

138d22b58 fs, epoll: add procfs fdinfo helper ... Browse Code »

This allows us to print out eventpoll target file descriptor, events and
data, the /proc/pid/fdinfo/fd consists of

| pos: 0
| flags: 02
| tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

[avagin@: fix for unitialized ret variable]

Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Oleg Nesterov
Cc: Andrey Vagin
Cc: Al Viro
Cc: Alexey Dobriyan
Cc: James Bottomley
Cc: "Aneesh Kumar K.V"
Cc: Alexey Dobriyan
Cc: Matthew Helsley
Cc: "J. Bruce Fields"
Cc: "Aneesh Kumar K.V"
Cc: Tvrtko Ursulin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-12-18 09:15:27 +0800

09 Nov, 2012

1 commit

a80a6b85b revert "epoll: support for disabling items, and a self-test app" ... Browse Code »

Revert commit 03a7beb55b9f ("epoll: support for disabling items, and a
self-test app") pending resolution of the issues identified by Michael
Kerrisk, copied below.

We'll revisit this for 3.8.

: I've taken a look at this patch as it currently stands in 3.7-rc1, and
: done a bit of testing. (By the way, the test program
: tools/testing/selftests/epoll/test_epoll.c does not compile...)
:
: There are one or two places where the behavior seems a little strange,
: so I have a question or two at the end of this mail. But other than
: that, I want to check my understanding so that the interface can be
: correctly documented.
:
: Just to go though my understanding, the problem is the following
: scenario in a multithreaded application:
:
: 1. Multiple threads are performing epoll_wait() operations,
: and maintaining a user-space cache that contains information
: corresponding to each file descriptor being monitored by
: epoll_wait().
:
: 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
: a file descriptor from the epoll interest list, and
: delete the corresponding record from the user-space cache.
:
: 3. The problem with (2) is that some other thread may have
: previously done an epoll_wait() that retrieved information
: about the fd in question, and may be in the middle of using
: information in the cache that relates to that fd. Thus,
: there is a potential race.
:
: 4. The race can't solved purely in user space, because doing
: so would require applying a mutex across the epoll_wait()
: call, which would of course blow thread concurrency.
:
: Right?
:
: Your solution is the EPOLL_CTL_DISABLE operation. I want to
: confirm my understanding about how to use this flag, since
: the description that has accompanied the patches so far
: has been a bit sparse
:
: 0. In the scenario you're concerned about, deleting a file
: descriptor means (safely) doing the following:
: (a) Deleting the file descriptor from the epoll interest list
: using EPOLL_CTL_DEL
: (b) Deleting the corresponding record in the user-space cache
:
: 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
: conjunction with EPOLLONESHOT.
:
: 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
: conjunction is a logical error.
:
: 3. The correct way to code multithreaded applications using
: EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
:
: a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
: should EPOLLONESHOT.
:
: b. When a thread wants to delete a file descriptor, it
: should do the following:
:
: [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
: [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
: was zero, then the file descriptor can be safely
: deleted by the thread that made this call.
: [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
: then the descriptor is in use. In this case, the calling
: thread should set a flag in the user-space cache to
: indicate that the thread that is using the descriptor
: should perform the deletion operation.
:
: Is all of the above correct?
:
: The implementation depends on checking on whether
: (events & ~EP_PRIVATE_BITS) == 0
: This replies on the fact that EPOLL_CTL_AD and EPOLL_CTL_MOD always
: set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
: causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
: cleared.
:
: A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
: is only useful in conjunction with EPOLLONESHOT. However, as things
: stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
: not have EPOLLONESHOT set in 'events' This results in the following
: (slightly surprising) behavior:
:
: (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
: (the indicator that the file descriptor can be safely deleted).
: (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
:
: This doesn't seem particularly useful, and in fact is probably an
: indication that the user made a logic error: they should only be using
: epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
: EPOLLONESHOT was set in 'events'. If that is correct, then would it
: not make sense to return an error to user space for this case?

Cc: Michael Kerrisk
Cc: "Paton J. Lewis"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2012-11-09 13:41:46 +0800

06 Oct, 2012

1 commit

03a7beb55 epoll: support for disabling items, and a self-test app ... Browse Code »

Enhanced epoll_ctl to support EPOLL_CTL_DISABLE, which disables an epoll
item. If epoll_ctl doesn't return -EBUSY in this case, it is then safe to
delete the epoll item in a multi-threaded environment. Also added a new
test_epoll self- test app to both demonstrate the need for this feature
and test it.

Signed-off-by: Paton J. Lewis
Cc: Alexander Viro
Cc: Jason Baron
Cc: Paul Holland
Cc: Davide Libenzi
Cc: Michael Kerrisk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paton J. Lewis
2012-10-06 02:05:00 +0800

27 Sep, 2012

2 commits

2903ff019 switch simple cases of fget_light to fdget ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-09-27 10:20:08 +0800
5e196a9cf switch epoll_wait(2) to fget_light() ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2012-09-27 09:10:07 +0800

22 Aug, 2012

1 commit

98022748f eventpoll: use-after-possible-free in epoll_create1() ... Browse Code »

As soon as we'd installed the file into descriptor table, it can
get closed by another thread. Freeing ep in process...

Signed-off-by: Al Viro

Al Viro
2012-08-22 22:26:55 +0800

18 Jul, 2012

1 commit

d9914cf66 PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND ... Browse Code »

As discussed in
http://thread.gmane.org/gmane.linux.kernel/1249726/focus=1288990,
the capability introduced in 4d7e30d98939a0340022ccd49325a3d70f7e0238
to govern EPOLLWAKEUP seems misnamed: this capability is about governing
the ability to suspend the system, not using a particular API flag
(EPOLLWAKEUP). We should make the name of the capability more general
to encourage reuse in related cases. (Whether or not this capability
should also be used to govern the use of /sys/power/wake_lock is a
question that needs to be separately resolved.)

This patch renames the capability to CAP_BLOCK_SUSPEND. In order to ensure
that the old capability name doesn't make it out into the wild, could you
please apply and push up the tree to ensure that it is incorporated
for the 3.5 release.

Signed-off-by: Michael Kerrisk
Acked-by: Serge Hallyn
Signed-off-by: Rafael J. Wysocki

Michael Kerrisk
2012-07-18 03:37:27 +0800

02 Jun, 2012

1 commit

754421c8c HAVE_RESTORE_SIGMASK is defined on all architectures now ... Browse Code »

Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK
and picks default set_restore_sigmask() in linux/thread_info.h. Kill the
ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones
get merged.

Signed-off-by: Al Viro

Al Viro
2012-06-02 00:58:46 +0800

23 May, 2012

1 commit

a8159414d epoll: Fix user space breakage related to EPOLLWAKEUP ... Browse Code »

Commit 4d7e30d (epoll: Add a flag, EPOLLWAKEUP, to prevent
suspend while epoll events are ready) caused some applications to
malfunction, because they set the bit corresponding to the new
EPOLLWAKEUP flag in their eventpoll flags and they don't have the
new CAP_EPOLLWAKEUP capability.

To prevent that from happening, change epoll_ctl() to clear
EPOLLWAKEUP in epds.events if the caller doesn't have the
CAP_EPOLLWAKEUP capability instead of failing and returning an
error code, which allows the affected applications to function
normally.

Reported-and-tested-by: Jiri Slaby
Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2012-05-23 02:57:06 +0800

06 May, 2012

1 commit

4d7e30d98 epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready ... Browse Code »
18

When an epoll_event, that has the EPOLLWAKEUP flag set, is ready, a
wakeup_source will be active to prevent suspend. This can be used to
handle wakeup events from a driver that support poll, e.g. input, if
that driver wakes up the waitqueue passed to epoll before allowing
suspend.

Signed-off-by: Arve Hjønnevåg
Reviewed-by: NeilBrown
Signed-off-by: Rafael J. Wysocki

Arve Hjønnevåg
2012-05-06 03:50:41 +0800

26 Apr, 2012

1 commit

13d518074 epoll: clear the tfile_check_list on -ELOOP ... Browse Code »

An epoll_ctl(,EPOLL_CTL_ADD,,) operation can return '-ELOOP' to prevent
circular epoll dependencies from being created. However, in that case we
do not properly clear the 'tfile_check_list'. Thus, add a call to
clear_tfile_check_list() for the -ELOOP case.

Signed-off-by: Jason Baron
Reported-by: Yurij M. Plotnikov
Cc: Nelson Elhage
Cc: Davide Libenzi
Tested-by: Alexandra N. Kossovsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2012-04-26 12:26:33 +0800

29 Mar, 2012

2 commits

0195c0024 Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/sc… ... Browse Code »

…m/linux/kernel/git/dhowells/linux-asm_system

Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.

I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.

The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().

This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuituous route involving asm/system.h.

The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).

These patches break asm/system.h up into the following core pieces:

(1) asm/barrier.h

Move memory barriers here. This already done for MIPS and Alpha.

(2) asm/switch_to.h

Move switch_to() and related stuff here.

(3) asm/exec.h

Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.

(4) asm/cmpxchg.h

Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().

(5) asm/bug.h

Move die() and related bits.

(6) asm/auxvec.h

Move AT_VECTOR_SIZE_ARCH here.

Other arch headers are created as needed on a per-arch basis."

Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it..

* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...

Linus Torvalds
2012-03-29 06:58:21 +0800
9ffc93f20 Remove all #inclusions of asm/system.h ... Browse Code »

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*.*\n!!' `grep -Irl '^#\s*include\s*' *`

Signed-off-by: David Howells

David Howells
2012-03-29 01:30:03 +0800

24 Mar, 2012

3 commits

da0503aae epoll: remove unneeded variable in reverse_path_check() ... Browse Code »

We never use the length variable.

Signed-off-by: Dan Carpenter
Acked-by: Jason Baron
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Carpenter
2012-03-24 07:58:38 +0800
02edc6fc4 epoll: comment the funky #ifdef ... Browse Code »

Looking for a bug in -rt, I stumbled across this code here from: commit
2dfa4eeab0fc ("epoll keyed wakeups: teach epoll about hints coming with
the wakeup key"), specifically:

#ifdef CONFIG_DEBUG_LOCK_ALLOC
static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
unsigned long events, int subclass)
{
unsigned long flags;

spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
wake_up_locked_poll(wqueue, events);
spin_unlock_irqrestore(&wqueue->lock, flags);
}
#else
static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
unsigned long events, int subclass)
{
wake_up_poll(wqueue, events);
}
#endif

You change the function of ep_wake_up_nested() depending on whether
CONFIG_DEBUG_LOCK_ALLOC is set or not. This looks awfully suspicious,
and there's no comment to explain why. I initially thought that this
was trying to fool lockdep, and hiding a real bug.

Investigating it, I found the creation of wake_up_nested() (which no
longer exists) but was created for the sole purpose of epoll and its
strange wake ups, as explained in commit 0ccf831cbee9 ("lockdep:
annotate epoll")

Although the commit message says "annotate epoll" the change log is much
better at explaining what is happening than what is in the actual code.
Thus a comment is really necessary here. And to save the time of other
developers from having to go trudging through the git logs trying to
figure out why this code exists.

I took parts of the change log and placed it into a comment above the
affected code. This will make the description of what is happening more
visible to new developers that have to look at this code for the first
time.

Signed-off-by: Steven Rostedt
Cc: Davide Libenzi
Cc: Peter Zijlstra
Cc: Alan Cox
Cc: Ingo Molnar
Cc: David Miller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Steven Rostedt
2012-03-24 07:58:38 +0800
626cf2366 poll: add poll_requested_events() and poll_does_not_wait() functions ... Browse Code »

In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for. An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested. This is something that can happen
in the video4linux subsystem among others.

Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably. The poll_table_struct does have it: it
has a key field with the event mask. But once a poll() call matches one
or more bits of that mask any following poll() calls are passed a NULL
poll_table pointer.

Also, the eventpoll implementation always left the key field at ~0 instead
of using the requested events mask.

This was changed in eventpoll.c so the key field now contains the actual
events that should be polled for as set by the caller.

The solution to the NULL poll_table pointer is to set the qproc field to
NULL in poll_table once poll() matches the events, not the poll_table
pointer itself. That way drivers can obtain the mask through a new
poll_requested_events inline.

The poll_table_struct can still be NULL since some kernel code calls it
internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
that case poll_requested_events() returns ~0 (i.e. all events).

Very rarely drivers might want to know whether poll_wait will actually
wait. If another earlier file descriptor in the set already matched the
events the caller wanted to wait for, then the kernel will return from the
select() call without waiting. This might be useful information in order
to avoid doing expensive work.

A new helper function poll_does_not_wait() is added that drivers can use
to detect this situation. This is now used in sock_poll_wait() in
include/net/sock.h. This was the only place in the kernel that needed
this information.

Drivers should no longer access any of the poll_table internals, but use
the poll_requested_events() and poll_does_not_wait() access functions
instead. In order to enforce that the poll_table fields are now prepended
with an underscore and a comment was added warning against using them
directly.

This required a change in unix_dgram_poll() in unix/af_unix.c which used
the key field to get the requested events. It's been replaced by a call
to poll_requested_events().

For qproc it was especially important to change its name since the
behavior of that field changes with this patch since this function pointer
can now be NULL when that wasn't possible in the past.

Any driver accessing the qproc or key fields directly will now fail to compile.

Some notes regarding the correctness of this patch: the driver's poll()
function is called with a 'struct poll_table_struct *wait' argument. This
pointer may or may not be NULL, drivers can never rely on it being one or
the other as that depends on whether or not an earlier file descriptor in
the select()'s fdset matched the requested events.

There are only three things a driver can do with the wait argument:

1) obtain the key field:

events = wait ? wait->key : ~0;

This will still work although it should be replaced with the new
poll_requested_events() function (which does exactly the same).
This will now even work better, since wait is no longer set to NULL
unnecessarily.

2) use the qproc callback. This could be deadly since qproc can now be
NULL. Renaming qproc should prevent this from happening. There are no
kernel drivers that actually access this callback directly, BTW.

3) test whether wait == NULL to determine whether poll would return without
waiting. This is no longer sufficient as the correct test is now
wait == NULL || wait->_qproc == NULL.

However, the worst that can happen here is a slight performance hit in
the case where wait != NULL and wait->_qproc == NULL. In that case the
driver will assume that poll_wait() will actually add the fd to the set
of waiting file descriptors. Of course, poll_wait() will not do that
since it tests for wait->_qproc. This will not break anything, though.

There is only one place in the whole kernel where this happens
(sock_poll_wait() in include/net/sock.h) and that code will be replaced
by a call to poll_does_not_wait() in the next patch.

Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
actually waiting. The next file descriptor from the set might match the
event mask and thus any possible waits will never happen.

Signed-off-by: Hans Verkuil
Reviewed-by: Jonathan Corbet
Reviewed-by: Al Viro
Cc: Davide Libenzi
Signed-off-by: Hans de Goede
Cc: Mauro Carvalho Chehab
Cc: David Miller
Cc: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hans Verkuil
2012-03-24 07:58:38 +0800

19 Mar, 2012

1 commit

93dc6107a Don't limit non-nested epoll paths ... Browse Code »

Commit 28d82dc1c4ed ("epoll: limit paths") that I did to limit the
number of possible wakeup paths in epoll is causing a few applications
to longer work (dovecot for one).

The original patch is really about limiting the amount of epoll nesting
(since epoll fds can be attached to other fds). Thus, we probably can
allow an unlimited number of paths of depth 1. My current patch limits
it at 1000. And enforce the limits on paths that have a greater depth.

This is captured in: https://bugzilla.redhat.com/show_bug.cgi?id=681578

Signed-off-by: Jason Baron
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2012-03-19 03:25:04 +0800

25 Feb, 2012

2 commits

971316f05 epoll: ep_unregister_pollwait() can use the freed pwq->whead ... Browse Code »

signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
this is not enough. eppoll_entry->whead still points to the memory
we are going to free, ep_unregister_pollwait()->remove_wait_queue()
is obviously unsafe.

Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
change ep_unregister_pollwait() to check pwq->whead != NULL under
rcu_read_lock() before remove_wait_queue(). We add the new helper,
ep_remove_wait_queue(), for this.

This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
ep_unregister_pollwait()->remove_wait_queue() can play with already
freed and potentially reused ->sighand, but this is fine. This memory
must have the valid ->signalfd_wqh until rcu_read_unlock().

Reported-by: Maxime Bizon
Cc:
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-02-25 03:42:50 +0800
d80e731ec epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() ... Browse Code »

This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note:

- we do not have wake_up_all_poll() but wake_up_poll()
is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

- signalfd_cleanup() uses POLLHUP along with POLLFREE,
we need a couple of simple changes in eventpoll.c to
make sure it can't be "lost".

Reported-by: Maxime Bizon
Cc:
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2012-02-25 03:42:50 +0800