01 May, 2013

6 commits

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
  • It is always safe to use RCU_INIT_POINTER to NULL a pointer. This results
    in slightly smaller/faster code.

    Signed-off-by: Eric Wong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Wong
     
  • This reduces the amount of code inside the ready list iteration loops for
    better readability IMHO.

    Signed-off-by: Eric Wong
    Cc: Davide Libenzi
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Wong
     
  • Technically we do not need to hold ep->mtx during ep_free since we are
    certain there are no other users of ep at that point. However, lockdep
    complains with a "suspicious rcu_dereference_check() usage!" message; so
    lock the mutex before ep_remove to silence the warning.

    Signed-off-by: Eric Wong
    Cc: Al Viro
    Cc: Arve Hjønnevåg
    Cc: Davide Libenzi
    Cc: Eric Dumazet
    Cc: NeilBrown
    Cc: Rafael J. Wysocki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Wong
     
  • This prevents wakeup_source destruction when a user hits the item with
    EPOLL_CTL_MOD while ep_poll_callback is running.

    Tested with CONFIG_SPARSE_RCU_POINTER=y and "make fs/eventpoll.o C=2"

    Signed-off-by: Eric Wong
    Cc: Alexander Viro
    Cc: Arve Hjønnevåg
    Cc: Davide Libenzi
    Cc: Eric Dumazet
    Cc: NeilBrown
    Cc: "Rafael J. Wysocki"
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Wong
     
  • It is common for epoll users to have thousands of epitems, so saving a
    cache line on every allocation leads to large memory savings.

    Since epitem allocations are cache-aligned, reducing sizeof(struct
    epitem) from 136 bytes to 128 bytes will allow it to squeeze under a
    cache line boundary on x86_64.

    Via /sys/kernel/slab/eventpoll_epi, I see the following changes on my
    x86_64 Core2 Duo (which has 64-byte cache alignment):

    object_size : 192 => 128
    objs_per_slab: 21 => 32

    Also, add a BUILD_BUG_ON() to check for future accidental breakage.

    [akpm@linux-foundation.org: use __packed, for all architectures]
    Signed-off-by: Eric Wong
    Cc: Davide Libenzi
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Wong
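The size arithmetic in the entry above can be checked with a small user-space sketch. It assumes a 64-byte cache line and a 4096-byte slab (the slab size is an assumption, chosen because it reproduces the quoted objs_per_slab figures); `struct demo_item`, `round_up_to()` and `objs_per_slab()` are hypothetical names, and C11 `_Static_assert` stands in for the kernel's BUILD_BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64    /* assumed cache-line size, as on the Core2 Duo above */
#define SLAB_BYTES 4096  /* assumed slab size; not stated in the commit message */

/* Hypothetical 128-byte stand-in for struct epitem; not the kernel layout. */
struct demo_item {
    unsigned char payload[128];
};

/* Compile-time guard in the spirit of BUILD_BUG_ON(). */
_Static_assert(sizeof(struct demo_item) <= 2 * CACHE_LINE,
               "demo_item grew past two cache lines");

/* Round size up to the next multiple of align (align must be a power of two). */
size_t round_up_to(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* Number of cache-aligned objects of the given size that fit in one slab. */
size_t objs_per_slab(size_t object_size)
{
    return SLAB_BYTES / round_up_to(object_size, CACHE_LINE);
}
```

Under these assumptions a 136-byte object rounds up to 192 bytes (21 per slab), while a 128-byte object stays at 128 bytes (32 per slab), which matches the figures reported above.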
     

04 Mar, 2013

1 commit


03 Jan, 2013

1 commit

  • EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
    ensure events are not missed. Since the modifications to the interest
    mask are not protected by the same lock as ep_poll_callback, we need to
    ensure the change is visible to other CPUs calling ep_poll_callback.

    We also need to ensure f_op->poll() has an up-to-date view of past
    events which occurred before we modified the interest mask. So this
    barrier also pairs with the barrier in wq_has_sleeper().

    This should guarantee either ep_poll_callback or f_op->poll() (or both)
    will notice the readiness of a recently-ready/modified item.

    This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
    http://thread.gmane.org/gmane.linux.kernel/1408782/

    Signed-off-by: Eric Wong
    Cc: Hans Verkuil
    Cc: Jiri Olsa
    Cc: Jonathan Corbet
    Cc: Al Viro
    Cc: Davide Libenzi
    Cc: Hans de Goede
    Cc: Mauro Carvalho Chehab
    Cc: David Miller
    Cc: Eric Dumazet
    Cc: Andrew Morton
    Cc: Andreas Voellmy
    Tested-by: "Junchang(Jason) Wang"
    Cc: netdev@vger.kernel.org
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Eric Wong
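From user space, the guarantee described above is what makes re-arming an item with EPOLL_CTL_MOD dependable. The sketch below is a single-threaded illustration only: it does not reproduce the cross-CPU race, and the function name `rearm_sees_pending_data()` is hypothetical. It uses EPOLLONESHOT so that the item needs re-arming, then checks that EPOLL_CTL_MOD reports a descriptor that is already readable.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns the number of ready events seen after re-arming a one-shot item
 * whose descriptor was already readable, or -1 on a setup failure.
 */
int rearm_sees_pending_data(void)
{
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT };
    struct epoll_event out;
    int epfd, n = -1;

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out_ep;
    if (write(pipefd[1], "x", 1) != 1)
        goto out_ep;

    /* The first wait consumes the one-shot arming; the byte stays in the pipe. */
    if (epoll_wait(epfd, &out, 1, 1000) != 1)
        goto out_ep;

    /* Re-arm. The descriptor is still readable, so the item becomes ready again. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) < 0)
        goto out_ep;
    n = epoll_wait(epfd, &out, 1, 1000);

out_ep:
    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```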
     

18 Dec, 2012

1 commit

  • This allows us to print out the eventpoll target file descriptor, events
    and data. The /proc/pid/fdinfo/fd output consists of

    | pos: 0
    | flags: 02
    | tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

    [avagin@: fix for uninitialized ret variable]

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
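A tool that consumes this output might parse each "tfd:" line roughly as follows. This is a sketch that assumes only the three leading fields shown in the sample above; `struct tfd_entry` and `parse_tfd_line()` are hypothetical names, and any trailing fields are ignored.

```c
#include <assert.h>
#include <stdio.h>

struct tfd_entry {              /* hypothetical holder for one parsed line */
    int tfd;                    /* target file descriptor */
    unsigned int events;        /* interest mask, printed in hex */
    unsigned long long data;    /* epoll_data value, printed in hex */
};

/* Parse one "tfd: ..." line; returns 0 on success, -1 if it does not match. */
int parse_tfd_line(const char *line, struct tfd_entry *e)
{
    int n = sscanf(line, " tfd: %d events: %x data: %llx",
                   &e->tfd, &e->events, &e->data);

    return n == 3 ? 0 : -1;
}
```

Lines such as "pos: 0" or "flags: 02" do not match the pattern and yield -1, so a caller can feed every line of the file to the parser and keep only the successful results.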
     

09 Nov, 2012

1 commit

  • Revert commit 03a7beb55b9f ("epoll: support for disabling items, and a
    self-test app") pending resolution of the issues identified by Michael
    Kerrisk, copied below.

    We'll revisit this for 3.8.

    : I've taken a look at this patch as it currently stands in 3.7-rc1, and
    : done a bit of testing. (By the way, the test program
    : tools/testing/selftests/epoll/test_epoll.c does not compile...)
    :
    : There are one or two places where the behavior seems a little strange,
    : so I have a question or two at the end of this mail. But other than
    : that, I want to check my understanding so that the interface can be
    : correctly documented.
    :
    : Just to go through my understanding, the problem is the following
    : scenario in a multithreaded application:
    :
    : 1. Multiple threads are performing epoll_wait() operations,
    : and maintaining a user-space cache that contains information
    : corresponding to each file descriptor being monitored by
    : epoll_wait().
    :
    : 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
    : a file descriptor from the epoll interest list, and
    : delete the corresponding record from the user-space cache.
    :
    : 3. The problem with (2) is that some other thread may have
    : previously done an epoll_wait() that retrieved information
    : about the fd in question, and may be in the middle of using
    : information in the cache that relates to that fd. Thus,
    : there is a potential race.
    :
    : 4. The race can't be solved purely in user space, because doing
    : so would require applying a mutex across the epoll_wait()
    : call, which would of course blow thread concurrency.
    :
    : Right?
    :
    : Your solution is the EPOLL_CTL_DISABLE operation. I want to
    : confirm my understanding about how to use this flag, since
    : the description that has accompanied the patches so far
    : has been a bit sparse.
    :
    : 0. In the scenario you're concerned about, deleting a file
    : descriptor means (safely) doing the following:
    : (a) Deleting the file descriptor from the epoll interest list
    : using EPOLL_CTL_DEL
    : (b) Deleting the corresponding record in the user-space cache
    :
    : 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
    : conjunction with EPOLLONESHOT.
    :
    : 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
    : conjunction is a logical error.
    :
    : 3. The correct way to code multithreaded applications using
    : EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
    :
    : a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
    : specify EPOLLONESHOT.
    :
    : b. When a thread wants to delete a file descriptor, it
    : should do the following:
    :
    : [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
    : [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
    : was zero, then the file descriptor can be safely
    : deleted by the thread that made this call.
    : [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
    : then the descriptor is in use. In this case, the calling
    : thread should set a flag in the user-space cache to
    : indicate that the thread that is using the descriptor
    : should perform the deletion operation.
    :
    : Is all of the above correct?
    :
    : The implementation depends on checking on whether
    : (events & ~EP_PRIVATE_BITS) == 0
    : This relies on the fact that EPOLL_CTL_ADD and EPOLL_CTL_MOD always
    : set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
    : causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
    : cleared.
    :
    : A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
    : is only useful in conjunction with EPOLLONESHOT. However, as things
    : stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
    : not have EPOLLONESHOT set in 'events'. This results in the following
    : (slightly surprising) behavior:
    :
    : (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
    : (the indicator that the file descriptor can be safely deleted).
    : (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
    :
    : This doesn't seem particularly useful, and in fact is probably an
    : indication that the user made a logic error: they should only be using
    : epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
    : EPOLLONESHOT was set in 'events'. If that is correct, then would it
    : not make sense to return an error to user space for this case?

    Cc: Michael Kerrisk
    Cc: "Paton J. Lewis"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

06 Oct, 2012

1 commit

  • Enhanced epoll_ctl to support EPOLL_CTL_DISABLE, which disables an epoll
    item. If epoll_ctl doesn't return -EBUSY in this case, it is then safe to
    delete the epoll item in a multi-threaded environment. Also added a new
    test_epoll self-test app to both demonstrate the need for this feature
    and test it.

    Signed-off-by: Paton J. Lewis
    Cc: Alexander Viro
    Cc: Jason Baron
    Cc: Paul Holland
    Cc: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paton J. Lewis
     

27 Sep, 2012

2 commits


22 Aug, 2012

1 commit


18 Jul, 2012

1 commit

  • As discussed in
    http://thread.gmane.org/gmane.linux.kernel/1249726/focus=1288990,
    the capability introduced in 4d7e30d98939a0340022ccd49325a3d70f7e0238
    to govern EPOLLWAKEUP seems misnamed: this capability is about governing
    the ability to suspend the system, not using a particular API flag
    (EPOLLWAKEUP). We should make the name of the capability more general
    to encourage reuse in related cases. (Whether or not this capability
    should also be used to govern the use of /sys/power/wake_lock is a
    question that needs to be separately resolved.)

    This patch renames the capability to CAP_BLOCK_SUSPEND. In order to ensure
    that the old capability name doesn't make it out into the wild, could you
    please apply and push up the tree to ensure that it is incorporated
    for the 3.5 release.

    Signed-off-by: Michael Kerrisk
    Acked-by: Serge Hallyn
    Signed-off-by: Rafael J. Wysocki

    Michael Kerrisk
     

02 Jun, 2012

1 commit


23 May, 2012

1 commit

  • Commit 4d7e30d (epoll: Add a flag, EPOLLWAKEUP, to prevent
    suspend while epoll events are ready) caused some applications to
    malfunction, because they set the bit corresponding to the new
    EPOLLWAKEUP flag in their eventpoll flags and they don't have the
    new CAP_EPOLLWAKEUP capability.

    To prevent that from happening, change epoll_ctl() to clear
    EPOLLWAKEUP in epds.events if the caller doesn't have the
    CAP_EPOLLWAKEUP capability instead of failing and returning an
    error code, which allows the affected applications to function
    normally.

    Reported-and-tested-by: Jiri Slaby
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
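The user-visible effect of this change can be sketched as below: adding an item with EPOLLWAKEUP set is expected to succeed whether or not the caller holds the capability, because the kernel drops the flag instead of failing. The function name `add_with_epollwakeup()` is hypothetical, and the sketch assumes a libc whose <sys/epoll.h> defines EPOLLWAKEUP.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 0 if EPOLL_CTL_ADD with EPOLLWAKEUP succeeds, else the errno value. */
int add_with_epollwakeup(void)
{
    int pipefd[2];
    struct epoll_event ev = { .events = EPOLLIN | EPOLLWAKEUP };
    int epfd, err = 0;

    if (pipe(pipefd) < 0)
        return errno;
    epfd = epoll_create1(0);
    if (epfd < 0) {
        err = errno;
        goto out_pipe;
    }

    ev.data.fd = pipefd[0];
    /* Without the capability the flag is silently cleared, not rejected. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        err = errno;

    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return err;
}
```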
     

06 May, 2012

1 commit


26 Apr, 2012

1 commit

  • An epoll_ctl(,EPOLL_CTL_ADD,,) operation can return '-ELOOP' to prevent
    circular epoll dependencies from being created. However, in that case we
    do not properly clear the 'tfile_check_list'. Thus, add a call to
    clear_tfile_check_list() for the -ELOOP case.

    Signed-off-by: Jason Baron
    Reported-by: Yurij M. Plotnikov
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Tested-by: Alexandra N. Kossovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
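The -ELOOP case mentioned above is easy to trigger from user space with two epoll instances that try to monitor each other. A minimal sketch follows; the function name `circular_add_errno()` is hypothetical, and it only shows the error return, not the tfile_check_list cleanup itself.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns the errno from the add that would close the cycle (0 if it succeeded). */
int circular_add_errno(void)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int a, b, err = 0;

    a = epoll_create1(0);
    if (a < 0)
        return errno;
    b = epoll_create1(0);
    if (b < 0) {
        err = errno;
        close(a);
        return err;
    }

    ev.data.fd = b;
    if (epoll_ctl(a, EPOLL_CTL_ADD, b, &ev) < 0) {  /* a monitors b: allowed */
        err = errno;
        goto out;
    }

    ev.data.fd = a;
    if (epoll_ctl(b, EPOLL_CTL_ADD, a, &ev) < 0)    /* b monitors a: circular */
        err = errno;

out:
    close(a);
    close(b);
    return err;
}
```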
     

29 Mar, 2012

2 commits

  • …m/linux/kernel/git/dhowells/linux-asm_system

    Pull "Disintegrate and delete asm/system.h" from David Howells:
    "Here are a bunch of patches to disintegrate asm/system.h into a set of
    separate bits to relieve the problem of circular inclusion
    dependencies.

    I've built all the working defconfigs from all the arches that I can
    and made sure that they don't break.

    The reason for these patches is that I recently encountered a circular
    dependency problem that came about when I produced some patches to
    optimise get_order() by rewriting it to use ilog2().

    This uses bitops - and on the SH arch asm/bitops.h drags in
    asm-generic/get_order.h by a circuitous route involving asm/system.h.

    The main difficulty seems to be asm/system.h. It holds a number of
    low level bits with no/few dependencies that are commonly used (eg.
    memory barriers) and a number of bits with more dependencies that
    aren't used in many places (eg. switch_to()).

    These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

    Move memory barriers here. This is already done for MIPS and Alpha.

    (2) asm/switch_to.h

    Move switch_to() and related stuff here.

    (3) asm/exec.h

    Move arch_align_stack() here. Other process execution related bits
    could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

    Move xchg() and cmpxchg() here as they're full word atomic ops and
    frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

    Move die() and related bits.

    (6) asm/auxvec.h

    Move AT_VECTOR_SIZE_ARCH here.

    Other arch headers are created as needed on a per-arch basis."

    Fixed up some conflicts from other header file cleanups and moving code
    around that has happened in the meantime, so David's testing is somewhat
    weakened by that. We'll find out anything that got broken and fix it.

    * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
    Delete all instances of asm/system.h
    Remove all #inclusions of asm/system.h
    Add #includes needed to permit the removal of asm/system.h
    Move all declarations of free_initmem() to linux/mm.h
    Disintegrate asm/system.h for OpenRISC
    Split arch_align_stack() out from asm-generic/system.h
    Split the switch_to() wrapper out of asm-generic/system.h
    Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
    Create asm-generic/barrier.h
    Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
    Disintegrate asm/system.h for Xtensa
    Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
    Disintegrate asm/system.h for Tile
    Disintegrate asm/system.h for Sparc
    Disintegrate asm/system.h for SH
    Disintegrate asm/system.h for Score
    Disintegrate asm/system.h for S390
    Disintegrate asm/system.h for PowerPC
    Disintegrate asm/system.h for PA-RISC
    Disintegrate asm/system.h for MN10300
    ...

    Linus Torvalds
     
  • Remove all #inclusions of asm/system.h preparatory to splitting and killing
    it. Performed with the following command:

    perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

    Signed-off-by: David Howells

    David Howells
     

24 Mar, 2012

3 commits

  • We never use the length variable.

    Signed-off-by: Dan Carpenter
    Acked-by: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • Looking for a bug in -rt, I stumbled across this code here from: commit
    2dfa4eeab0fc ("epoll keyed wakeups: teach epoll about hints coming with
    the wakeup key"), specifically:

    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
                                         unsigned long events, int subclass)
    {
            unsigned long flags;

            spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
            wake_up_locked_poll(wqueue, events);
            spin_unlock_irqrestore(&wqueue->lock, flags);
    }
    #else
    static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
                                         unsigned long events, int subclass)
    {
            wake_up_poll(wqueue, events);
    }
    #endif

    You change the function of ep_wake_up_nested() depending on whether
    CONFIG_DEBUG_LOCK_ALLOC is set or not. This looks awfully suspicious,
    and there's no comment to explain why. I initially thought that this
    was trying to fool lockdep, and hiding a real bug.

    Investigating it, I found that wake_up_nested() (which no longer
    exists) was created for the sole purpose of epoll and its strange wake
    ups, as explained in commit 0ccf831cbee9 ("lockdep: annotate epoll").

    Although the commit message says "annotate epoll" the change log is much
    better at explaining what is happening than what is in the actual code.
    Thus a comment is really necessary here, to save other developers from
    having to go trudging through the git logs trying to figure out why
    this code exists.

    I took parts of the change log and placed it into a comment above the
    affected code. This will make the description of what is happening more
    visible to new developers that have to look at this code for the first
    time.

    Signed-off-by: Steven Rostedt
    Cc: Davide Libenzi
    Cc: Peter Zijlstra
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • In some cases the poll() implementation in a driver has to do different
    things depending on the events the caller wants to poll for. An example
    is when a driver needs to start a DMA engine if the caller polls for
    POLLIN, but doesn't want to do that if POLLIN is not requested but instead
    only POLLOUT or POLLPRI is requested. This is something that can happen
    in the video4linux subsystem among others.

    Unfortunately, the current epoll/poll/select implementation doesn't
    provide that information reliably. The poll_table_struct does have it: it
    has a key field with the event mask. But once a poll() call matches one
    or more bits of that mask any following poll() calls are passed a NULL
    poll_table pointer.

    Also, the eventpoll implementation always left the key field at ~0 instead
    of using the requested events mask.

    This was changed in eventpoll.c so the key field now contains the actual
    events that should be polled for as set by the caller.

    The solution to the NULL poll_table pointer is to set the qproc field to
    NULL in poll_table once poll() matches the events, not the poll_table
    pointer itself. That way drivers can obtain the mask through a new
    poll_requested_events inline.

    The poll_table_struct can still be NULL since some kernel code calls it
    internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
    that case poll_requested_events() returns ~0 (i.e. all events).

    Very rarely drivers might want to know whether poll_wait will actually
    wait. If another earlier file descriptor in the set already matched the
    events the caller wanted to wait for, then the kernel will return from the
    select() call without waiting. This might be useful information in order
    to avoid doing expensive work.

    A new helper function poll_does_not_wait() is added that drivers can use
    to detect this situation. This is now used in sock_poll_wait() in
    include/net/sock.h. This was the only place in the kernel that needed
    this information.

    Drivers should no longer access any of the poll_table internals, but use
    the poll_requested_events() and poll_does_not_wait() access functions
    instead. In order to enforce that the poll_table fields are now prepended
    with an underscore and a comment was added warning against using them
    directly.

    This required a change in unix_dgram_poll() in unix/af_unix.c which used
    the key field to get the requested events. It's been replaced by a call
    to poll_requested_events().

    For qproc it was especially important to change its name, because the
    behavior of that field changes with this patch: this function pointer
    can now be NULL, which wasn't possible in the past.

    Any driver accessing the qproc or key fields directly will now fail to compile.

    Some notes regarding the correctness of this patch: the driver's poll()
    function is called with a 'struct poll_table_struct *wait' argument. This
    pointer may or may not be NULL, drivers can never rely on it being one or
    the other as that depends on whether or not an earlier file descriptor in
    the select()'s fdset matched the requested events.

    There are only three things a driver can do with the wait argument:

    1) obtain the key field:

    events = wait ? wait->key : ~0;

    This will still work although it should be replaced with the new
    poll_requested_events() function (which does exactly the same).
    This will now even work better, since wait is no longer set to NULL
    unnecessarily.

    2) use the qproc callback. This could be deadly since qproc can now be
    NULL. Renaming qproc should prevent this from happening. There are no
    kernel drivers that actually access this callback directly, BTW.

    3) test whether wait == NULL to determine whether poll would return without
    waiting. This is no longer sufficient as the correct test is now
    wait == NULL || wait->_qproc == NULL.

    However, the worst that can happen here is a slight performance hit in
    the case where wait != NULL and wait->_qproc == NULL. In that case the
    driver will assume that poll_wait() will actually add the fd to the set
    of waiting file descriptors. Of course, poll_wait() will not do that
    since it tests for wait->_qproc. This will not break anything, though.

    There is only one place in the whole kernel where this happens
    (sock_poll_wait() in include/net/sock.h) and that code will be replaced
    by a call to poll_does_not_wait() in the next patch.

    Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
    actually waiting. The next file descriptor from the set might match the
    event mask and thus any possible waits will never happen.

    Signed-off-by: Hans Verkuil
    Reviewed-by: Jonathan Corbet
    Reviewed-by: Al Viro
    Cc: Davide Libenzi
    Signed-off-by: Hans de Goede
    Cc: Mauro Carvalho Chehab
    Cc: David Miller
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans Verkuil
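The two accessors described above can be modelled in user space to make their semantics concrete. This is only a model: `struct model_poll_table` and the `model_*` functions are hypothetical stand-ins, with field names that follow the underscore renaming described in the message, and the real kernel types are not used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; not the kernel's poll_table definition. */
struct model_poll_table {
    void (*_qproc)(void);   /* queueing callback, NULL once events matched */
    unsigned long _key;     /* events the caller asked for */
};

/* Requested events, or "all events" when no table was passed at all. */
unsigned long model_requested_events(const struct model_poll_table *p)
{
    return p ? p->_key : ~0UL;
}

/* True when poll_wait() would not queue the caller on any wait queue. */
bool model_does_not_wait(const struct model_poll_table *p)
{
    return p == NULL || p->_qproc == NULL;
}

/* Placeholder callback so that a table can be marked as "still queueing". */
void model_noop_qproc(void)
{
}
```

The point of the design is visible in the second function: a table with a NULL callback still carries a valid event mask, so a driver can learn what was requested even when no waiting will take place.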
     

19 Mar, 2012

1 commit

  • Commit 28d82dc1c4ed ("epoll: limit paths") that I did to limit the
    number of possible wakeup paths in epoll is causing a few applications
    to no longer work (dovecot for one).

    The original patch is really about limiting the amount of epoll nesting
    (since epoll fds can be attached to other fds). Thus, we probably can
    allow an unlimited number of paths of depth 1. My current patch limits
    it at 1000. And enforce the limits on paths that have a greater depth.

    This is captured in: https://bugzilla.redhat.com/show_bug.cgi?id=681578

    Signed-off-by: Jason Baron
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
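One reading of the rule described above is sketched below as a user-space model: paths of depth 1 are always accepted, while deeper paths are counted against the per-depth limits quoted in the "epoll: limit paths" message (500, 100, 50 and 10 for depths 2 to 5). `struct path_counter` and `path_count_ok()` are hypothetical names, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 5

/* Limits for depths 1..5; the depth-1 entry is no longer enforced. */
static const int path_limits[MAX_DEPTH] = { 1000, 500, 100, 50, 10 };

struct path_counter {       /* hypothetical bookkeeping for one check pass */
    int count[MAX_DEPTH];
};

/* Record one wakeup path of the given depth (1-based); false means reject. */
bool path_count_ok(struct path_counter *pc, int depth)
{
    if (depth < 1 || depth > MAX_DEPTH)
        return false;       /* deeper than any permitted path */
    if (depth == 1)
        return true;        /* depth-1 paths are not limited */
    return ++pc->count[depth - 1] <= path_limits[depth - 1];
}
```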
     

25 Feb, 2012

2 commits

  • signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
    this is not enough. eppoll_entry->whead still points to the memory
    we are going to free, ep_unregister_pollwait()->remove_wait_queue()
    is obviously unsafe.

    Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
    change ep_unregister_pollwait() to check pwq->whead != NULL under
    rcu_read_lock() before remove_wait_queue(). We add the new helper,
    ep_remove_wait_queue(), for this.

    This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
    ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
    ep_unregister_pollwait()->remove_wait_queue() can play with already
    freed and potentially reused ->sighand, but this is fine. This memory
    must have the valid ->signalfd_wqh until rcu_read_unlock().

    Reported-by: Maxime Bizon
    Cc:
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch is intentionally incomplete to simplify the review.
    It ignores ep_unregister_pollwait() which plays with the same wqh.
    See the next change.

    epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
    f_op->poll() needs. In particular it assumes that the wait queue
    can't go away until eventpoll_release(). This is not true in case
    of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
    which is not connected to the file.

    This patch adds the special event, POLLFREE, currently only for
    epoll. It expects that init_poll_funcptr()'ed hook should do the
    necessary cleanup. Perhaps it should be defined as EPOLLFREE in
    eventpoll.

    __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
    ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
    helper.

    ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
    This makes this poll entry inconsistent, but we don't care. If you
    share epoll fd which contains our sigfd with another process you
    should blame yourself. signalfd is "really special". I simply do
    not know how we can define the "right" semantics if it is used with
    epoll.

    The main problem is, epoll calls signalfd_poll() once to establish
    the connection with the wait queue, after that signalfd_poll(NULL)
    returns the different/inconsistent results depending on who does
    EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
    has nothing to do with the file, it works with the current thread.

    In short: this patch is the hack which tries to fix the symptoms.
    It also assumes that nobody can take tasklist_lock under epoll
    locks, this seems to be true.

    Note:

    - we do not have wake_up_all_poll() but wake_up_poll()
    is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

    - signalfd_cleanup() uses POLLHUP along with POLLFREE,
    we need a couple of simple changes in eventpoll.c to
    make sure it can't be "lost".

    Reported-by: Maxime Bizon
    Cc:
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

13 Jan, 2012

1 commit

  • The current epoll code can be tickled to run basically indefinitely in
    both loop detection path check (on ep_insert()), and in the wakeup paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have been already visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried it reduced the run-time from 15 minutes (all
    in kernel time) to .3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    cpus...Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished, by noting that the end file descriptor points that
    are found during the loop detection pass (from the newly added link), are
    actually the sources for wakeup events. I keep a list of these file
    descriptors and limit the number and length of these paths that emanate
    from these 'source file descriptors'. In the current implementation I
    allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
    length 4 and 10 of length 5. Note that it is sufficient to check the
    'source file descriptors' reachable from the newly added link, since no
    other 'source file descriptors' will have newly added links. This allows
    us to check only the wakeup paths that may have gotten too long, and not
    re-check all possible wakeup paths on the system.

    In terms of the path limit selection, I think it's first worth noting that
    the most common case for epoll, is probably the model where you have 1
    epoll file descriptor that is monitoring n number of 'source file
    descriptors'. In this case, each 'source file descriptor' has 1 path of
    length 1. Thus, I believe that the limits I'm proposing are quite
    reasonable and in fact may be too generous. Thus, I'm hoping that the
    proposed limits will not cause any workloads that currently work to
    fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex, so that we can correctly
    traverse a coherent graph, to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths, and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also, worth noting is that the epmutex was
    recently added to the ep_ctl add operations in the initial path loop
    detection code using the argument that it was not on a critical path.

    Another thing to note here, is the length of epoll chains that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths, strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.
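
    The depth and path-count check described above can be sketched in user
    space as follows. This is only an illustration, not the kernel code: the
    graph representation, the names wakeup_graph, walk and check_wakeup_paths,
    the MAX_NODES bound and the per-length limits in path_limit[] are all
    assumptions made for the sketch. Only the maximum length of 5
    (EP_MAX_NESTS + 1) comes from the text.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 5    /* EP_MAX_NESTS + 1, the path length limit from the text */
#define MAX_NODES 16   /* hypothetical bound, only for this sketch */

/* Hypothetical per-length limits, for illustration only:
 * path_limit[i] caps the number of wakeup paths of length i + 1. */
static const int path_limit[MAX_DEPTH] = { 1000, 500, 100, 50, 10 };

/* edge[a][b] != 0 means a wakeup on node a propagates to node b */
struct wakeup_graph {
    int n;
    unsigned char edge[MAX_NODES][MAX_NODES];
};

/* Depth-first walk; 'depth' is the number of edges already followed. */
static int walk(const struct wakeup_graph *g, int node, int depth, int *count)
{
    for (int next = 0; next < g->n; next++) {
        if (!g->edge[node][next])
            continue;
        if (depth >= MAX_DEPTH)
            return -1;                      /* path would exceed MAX_DEPTH */
        if (++count[depth] > path_limit[depth])
            return -1;                      /* too many paths of this length */
        if (walk(g, next, depth + 1, count) < 0)
            return -1;
    }
    return 0;
}

/* Returns 0 if all wakeup paths starting at 'source' are within limits, -1 otherwise. */
int check_wakeup_paths(const struct wakeup_graph *g, int source)
{
    int count[MAX_DEPTH];

    assert(g->n <= MAX_NODES && source >= 0 && source < g->n);
    memset(count, 0, sizeof count);
    return walk(g, source, 0, count);
}
```

    Because the walk stops as soon as a path grows past MAX_DEPTH, it also
    terminates on a graph that contains a loop, independent of the order in
    which the edges were added.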

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

01 Nov, 2011

1 commit

  • epoll can recursively acquire ep->mtx on multiple "struct
    eventpoll"s at once in the case where one epoll fd is monitoring another
    epoll fd. This is perfectly OK, since we're careful about the lock
    ordering, but it causes spurious lockdep warnings. Annotate the recursion
    using mutex_lock_nested, and add a comment explaining the nesting rules
    for good measure.

    Recent versions of systemd are triggering this, and it can also be
    demonstrated with a trivial test program (not reproduced in this
    listing).
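
    As an illustration of the nesting described above, here is a minimal
    sketch in which one epoll fd monitors another. It is not the program from
    the original message; the function name poll_nested_epoll and the choice
    of a pipe as the event source are assumptions. Waiting on the outer fd
    makes the kernel poll the inner one, which is the situation where the
    locks of two "struct eventpoll"s are involved at once.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns the number of events reported by the outer epoll fd,
 * or -1 if any setup step failed. */
int poll_nested_epoll(void)
{
    struct epoll_event evt = { .events = EPOLLIN };
    struct epoll_event out;
    int p[2] = { -1, -1 };
    int inner = epoll_create1(0);
    int outer = epoll_create1(0);
    int ready = -1;

    if (inner >= 0 && outer >= 0 && pipe(p) == 0 &&
        /* outer watches inner, inner watches the pipe's read end */
        epoll_ctl(outer, EPOLL_CTL_ADD, inner, &evt) == 0 &&
        epoll_ctl(inner, EPOLL_CTL_ADD, p[0], &evt) == 0 &&
        /* make the pipe readable so that inner has a ready event */
        write(p[1], "x", 1) == 1) {
        /* timeout 0: report what is ready now, do not block */
        ready = epoll_wait(outer, &out, 1, 0);
        assert(ready <= 1);     /* never more than maxevents */
    }

    if (p[0] >= 0) close(p[0]);
    if (p[1] >= 0) close(p[1]);
    if (inner >= 0) close(inner);
    if (outer >= 0) close(outer);
    return ready;
}
```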
    Tested-by: Paul Bolle
    Signed-off-by: Nelson Elhage
    Acked-by: Jason Baron
    Cc: Dave Jones
    Cc: Davide Libenzi
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nelson Elhage
     

15 Sep, 2011

1 commit


27 Jul, 2011

1 commit

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

26 Jul, 2011

1 commit


31 Mar, 2011

1 commit


23 Mar, 2011

2 commits

  • Add a comment to ep_poll(), rename labels a bit more clearly, fix a gcc
    warning about an unused variable, and optimize the non-blocking path a
    little.

    Hinted-by: Andrew Morton
    Signed-off-by: Davide Libenzi

    hannes@cmpxchg.org:

    : The non-blocking ep_poll path optimization introduced skipping over the
    : return value setup.
    :
    : Initialize it properly, my userspace gets upset by epoll_wait() returning
    : random things.
    :
    : In addition, remove the reinitialization at the fetch_events label, the
    : return value is guaranteed to be zero when execution reaches there.

    [hannes@cmpxchg.org: fix initialization]
    Signed-off-by: Johannes Weiner
    Cc: Shawn Bohrer
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Bohrer
     
  • Move the event readiness check into a proper inline, and use it uniformly
    inside ep_poll() code. Events in the ->ovflist are no less ready than the
    ones in ->rdllist.
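
    A user-space model of such a readiness check might look like the sketch
    below. The struct layout, the sentinel OVF_INACTIVE and the helper name
    events_available are assumptions made for illustration; only the idea
    that both ->rdllist and ->ovflist count as sources of ready events is
    taken from the text.

```c
#include <assert.h>
#include <stddef.h>

struct item {
    struct item *next;
};

/* Hypothetical sentinel: the overflow list is not in use right now */
#define OVF_INACTIVE ((struct item *)-1L)

struct ep_model {
    struct item *rdllist;   /* ready list head, NULL when empty */
    struct item *ovflist;   /* OVF_INACTIVE unless events are being diverted here */
};

/* One place to answer "is anything ready?", so every caller agrees. */
static inline int events_available(const struct ep_model *ep)
{
    assert(ep != NULL);
    return ep->rdllist != NULL || ep->ovflist != OVF_INACTIVE;
}
```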

    Signed-off-by: Davide Libenzi
    Cc: Shawn Bohrer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

19 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
    doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
    Update cpuset info & webiste for cgroups
    dcdbas: force SMI to happen when expected
    arch/arm/Kconfig: remove one to many l's in the word.
    asm-generic/user.h: Fix spelling in comment
    drm: fix printk typo 'sracth'
    Remove one to many n's in a word
    Documentation/filesystems/romfs.txt: fixing link to genromfs
    drivers:scsi Change printk typo initate -> initiate
    serial, pch uart: Remove duplicate inclusion of linux/pci.h header
    fs/eventpoll.c: fix spelling
    mm: Fix out-of-date comments which refers non-existent functions
    drm: Fix printk typo 'failled'
    coh901318.c: Change initate to initiate.
    mbox-db5500.c Change initate to initiate.
    edac: correct i82975x error-info reported
    edac: correct i82975x mci initialisation
    edac: correct commented info
    fs: update comments to point correct document
    target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
    ...

    Trivial conflict in fs/eventpoll.c (spelling vs addition)

    Linus Torvalds
     

26 Feb, 2011

1 commit

  • In several places, an epoll fd can call another file's ->f_op->poll()
    method with ep->mtx held. This is in general unsafe, because that other
    file could itself be an epoll fd that contains the original epoll fd.

    The code defends against this possibility in its own ->poll() method using
    ep_call_nested, but there are several other unsafe calls to ->poll
    elsewhere that can be made to deadlock. For example, the following simple
    program causes the call in ep_insert to recursively call the original
    fd's ->poll, leading to deadlock:

    #include <sys/epoll.h>
    #include <unistd.h>

    int main(void) {
        int e1, e2, p[2];
        struct epoll_event evt = {
            .events = EPOLLIN
        };

        e1 = epoll_create(1);
        e2 = epoll_create(2);
        pipe(p);

        /* e2 watches e1, and e1 watches the read end of the pipe */
        epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
        epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
        write(p[1], p, sizeof p);
        /* close the loop: e1 now watches e2 as well */
        epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);

        return 0;
    }

    On insertion, check whether the inserted file is itself a struct epoll,
    and if so, do a recursive walk to detect whether inserting this file would
    create a loop of epoll structures, which could lead to deadlock.
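
    The recursive walk can be illustrated with a small user-space model. This
    is a sketch under stated assumptions, not the kernel implementation: the
    adjacency-matrix graph, the MAX_FDS bound and the names epoll_graph,
    reaches and add_watch are all made up for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_FDS 8   /* hypothetical bound, only for this sketch */

/* watches[a][b] != 0 means epoll set a monitors epoll set b */
struct epoll_graph {
    unsigned char watches[MAX_FDS][MAX_FDS];
};

/* Depth-first search: can 'target' be reached by following watches from 'from'? */
static int reaches(const struct epoll_graph *g, int from, int target,
                   unsigned char *visited)
{
    if (from == target)
        return 1;
    if (visited[from])
        return 0;
    visited[from] = 1;
    for (int next = 0; next < MAX_FDS; next++)
        if (g->watches[from][next] && reaches(g, next, target, visited))
            return 1;
    return 0;
}

/* Records "watcher monitors watched" and returns 0, or returns -1
 * without changing the graph if the new edge would close a loop. */
int add_watch(struct epoll_graph *g, int watcher, int watched)
{
    unsigned char visited[MAX_FDS];

    assert(watcher >= 0 && watcher < MAX_FDS);
    assert(watched >= 0 && watched < MAX_FDS);
    memset(visited, 0, sizeof visited);
    if (reaches(g, watched, watcher, visited))
        return -1;
    g->watches[watcher][watched] = 1;
    return 0;
}
```

    With this model, the sequence from the test program above (e2 watches e1,
    then e1 tries to watch e2) is rejected at the second step, because e1 is
    already reachable from e2.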

    [nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Nelson Elhage
    Reported-by: Nelson Elhage
    Tested-by: Nelson Elhage
    Cc: [2.6.34+, possibly earlier]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

18 Feb, 2011

1 commit

  • eventpoll.c has wonderful comments but some annoying typos
    sneaked in:
    * toepoll_ctl -> to epoll_ctl
    * rapresent -> represents
    * sructure -> structure
    * machanism -> mechanism
    * trasfering -> transferring

    Signed-off-by: Daniel Baluta
    Signed-off-by: Jiri Kosina

    Daniel Baluta
     

03 Feb, 2011

1 commit

  • commit 95aac7b1cd224f ("epoll: make epoll_wait() use the hrtimer range
    feature") added a performance regression because it uses timespec_add_ns()
    with potentially very large 'ns' values.
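
    One way to avoid handing a very large nanosecond count to a helper is to
    split a millisecond timeout into seconds and nanoseconds with a division
    up front. The sketch below shows that conversion in user space; the
    function name ms_to_timespec is hypothetical, and the assumption that the
    cost comes from normalizing a large 'ns' value one second at a time is
    mine, not stated in the message.

```c
#include <assert.h>
#include <time.h>

#define MSEC_PER_SEC  1000L
#define NSEC_PER_MSEC 1000000L

/* Convert a non-negative millisecond timeout to a normalized timespec
 * in constant time, so tv_nsec always stays below one second. */
struct timespec ms_to_timespec(long ms)
{
    struct timespec ts;

    assert(ms >= 0);
    ts.tv_sec = ms / MSEC_PER_SEC;
    ts.tv_nsec = (ms % MSEC_PER_SEC) * NSEC_PER_MSEC;
    return ts;
}
```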

    [akpm@linux-foundation.org: s/epoll_set_mstimeout/ep_set_mstimeout/, per Davide]
    Reported-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Cc: Shawn Bohrer
    Acked-by: Davide Libenzi
    Cc: [2.6.37.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

14 Jan, 2011

1 commit

  • On a 16TB machine, max_user_watches has an integer overflow. Convert it
    to use a long and handle the associated fallout.
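
    The overflow can be illustrated with a small calculation. The sizing rule
    below (watches may use 1/25 of memory) and the per-watch cost passed in
    by the caller are assumptions made for the sketch, as are the names
    watch_limit and fits_in_int; the point is only that a limit derived from
    16TB of memory no longer fits in an int.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical sizing rule: let watches consume 1/25 of memory. */
int64_t watch_limit(int64_t mem_bytes, int64_t bytes_per_watch)
{
    assert(mem_bytes >= 0 && bytes_per_watch > 0);
    return (mem_bytes / 25) / bytes_per_watch;
}

/* Nonzero if 'value' can be stored in an int without overflow. */
int fits_in_int(int64_t value)
{
    return value >= INT_MIN && value <= INT_MAX;
}
```

    With an assumed cost of 200 bytes per watch, a 16GB machine yields a
    limit that fits in an int, while a 16TB machine yields one that needs a
    wider type.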

    Signed-off-by: Robin Holt
    Cc: "Eric W. Biederman"
    Acked-by: Davide Libenzi
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt