26 May, 2018

1 commit


03 Apr, 2018

1 commit

  • Using the helper functions do_epoll_create() and do_epoll_wait() allows us
    to remove in-kernel calls to the related syscall functions.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Alexander Viro
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But they keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

02 Feb, 2018

2 commits


29 Nov, 2017

2 commits


28 Nov, 2017

2 commits


18 Nov, 2017

4 commits

  • Merge more updates from Andrew Morton:

    - a bit more MM

    - procfs updates

    - dynamic-debug fixes

    - lib/ updates

    - checkpatch

    - epoll

    - nilfs2

    - signals

    - rapidio

    - PID management cleanup and optimization

    - kcov updates

    - sysvipc updates

    - quite a few misc things all over the place

    * emailed patches from Andrew Morton : (94 commits)
    EXPERT Kconfig menu: fix broken EXPERT menu
    include/asm-generic/topology.h: remove unused parent_node() macro
    arch/tile/include/asm/topology.h: remove unused parent_node() macro
    arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro
    arch/sh/include/asm/topology.h: remove unused parent_node() macro
    arch/ia64/include/asm/topology.h: remove unused parent_node() macro
    drivers/pcmcia/sa1111_badge4.c: avoid unused function warning
    mm: add infrastructure for get_user_pages_fast() benchmarking
    sysvipc: make get_maxid O(1) again
    sysvipc: properly name ipc_addid() limit parameter
    sysvipc: duplicate lock comments wrt ipc_addid()
    sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE
    initramfs: use time64_t timestamps
    drivers/watchdog: make use of devm_register_reboot_notifier()
    kernel/reboot.c: add devm_register_reboot_notifier()
    kcov: update documentation
    Makefile: support flag -fsanitizer-coverage=trace-cmp
    kcov: support comparison operands collection
    kcov: remove pointless current != NULL check
    kernel/panic.c: add TAINT_AUX
    ...

    Linus Torvalds
     
  • The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll
    routine for an epoll fd, is used to prevent excessively deep epoll
    nesting, and to prevent circular paths.

    However, we are already preventing these conditions during
    EPOLL_CTL_ADD. In terms of too deep epoll chains, we do in fact allow
    deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS),
    however we don't allow more than EP_MAX_NESTS when an epoll file
    descriptor is actually connected to a wakeup source. Thus, we do not
    require the use of ep_call_nested(), since ep_eventpoll_poll(), which is
    called via ep_scan_ready_list() only continues nesting if there are
    events available.

    Since ep_call_nested() is implemented using a global lock, applications
    that make use of nested epoll can see large performance improvements
    with this change.

    Davidlohr said:

    : Improvements are quite obscene actually, such as for the following
    : epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
    :
    : ncpus vanilla dirty delta
    : 1 2447092 3028315 +23.75%
    : 4 231265 2986954 +1191.57%
    : 8 121631 2898796 +2283.27%
    : 16 59749 2902056 +4757.07%
    : 32 26837 2326314 +8568.30%
    : 64 12926 1341281 +10276.61%
    :
    : (http://linux-scalability.org/epoll/epoll-test.c)

    Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
    Signed-off-by: Jason Baron
    Cc: Davidlohr Bueso
    Cc: Alexander Viro
    Cc: Salman Qazi
    Cc: Hou Tao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • ep_poll_safewake() is used to wakeup potentially nested epoll file
    descriptors. The function uses ep_call_nested() to prevent entering the
    same wake up queue more than once, and to prevent excessively deep
    wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
    since we are already preventing these conditions during EPOLL_CTL_ADD.
    This saves extra function calls, and avoids taking a global lock during
    the ep_call_nested() calls.

    I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
    case, since ep_call_nested() keeps track of the nesting level, and this
    is required by the call to spin_lock_irqsave_nested(). It would be nice
    to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
    case as well, however its not clear how to simply pass the nesting level
    through multiple wake_up() levels without more surgery. In any case, I
    don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for production.
    This patch, also apparently fixes a workload at Google that Salman Qazi
    reported by completely removing the poll_safewake_ncalls->lock from
    wakeup paths.

    Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
    Signed-off-by: Jason Baron
    Acked-by: Davidlohr Bueso
    Cc: Alexander Viro
    Cc: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • A userspace application can directly trigger the allocations from
    eventpoll_epi and eventpoll_pwq slabs. A buggy or malicious application
    can consume a significant amount of system memory by triggering such
    allocations. Indeed we have seen in production where a buggy
    application was leaking the epoll references and causing a burst of
    eventpoll_epi and eventpoll_pwq slab allocations. This patch opt-in the
    charging of eventpoll_epi and eventpoll_pwq slabs.

    There is a per-user limit (~4% of total memory if no highmem) on these
    caches. I think it is too generous particularly in the scenario where
    jobs of multiple users are running on the system and the administrator
    is reducing cost by overcomitting the memory. This is unaccounted
    kernel memory and will not be considered by the oom-killer. I think by
    accounting it to kmemcg, for systems with kmem accounting enabled, we
    can provide better isolation between jobs of different users.

    Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Alexander Viro
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

20 Sep, 2017

1 commit


09 Sep, 2017

1 commit

  • ... such that we can avoid the tree walks to get the node with the
    smallest key. Semantically the same, as the previously used rb_first(),
    but O(1). The main overhead is the extra footprint for the cached rb_node
    pointer, which should not matter for epoll.

    Link: http://lkml.kernel.org/r/20170719014603.19029-15-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

02 Sep, 2017

1 commit

  • The race was introduced by me in commit 971316f0503a ("epoll:
    ep_unregister_pollwait() can use the freed pwq->whead"). I did not
    realize that nothing can protect eventpoll after ep_poll_callback() sets
    ->whead = NULL, only whead->lock can save us from the race with
    ep_free() or ep_remove().

    Move ->whead = NULL to the end of ep_poll_callback() and add the
    necessary barriers.

    TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
    before this patch.

    Hopefully this explains use-after-free reported by syzcaller:

    BUG: KASAN: use-after-free in debug_spin_lock_before
    ...
    _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
    ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

    this is spin_lock(eventpoll->lock),

    ...
    Freed by task 17774:
    ...
    kfree+0xe8/0x2c0 mm/slub.c:3883
    ep_free+0x22c/0x2a0 fs/eventpoll.c:865

    Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
    Reported-by: 范龙飞
    Cc: stable@vger.kernel.org
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

13 Jul, 2017

3 commits

  • kcmp syscall is build iif CONFIG_CHECKPOINT_RESTORE is selected, so wrap
    appropriate helpers in epoll code with the config to build it
    conditionally.

    Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
    Signed-off-by: Cyrill Gorcunov
    Reported-by: Andrew Morton
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Jason Baron
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • With current epoll architecture target files are addressed with
    file_struct and file descriptor number, where the last is not unique.
    Moreover files can be transferred from another process via unix socket,
    added into queue and closed then so we won't find this descriptor in the
    task fdinfo list.

    Thus to checkpoint and restore such processes CRIU needs to find out
    where exactly the target file is present to add it into epoll queue.
    For this sake one can use kcmp call where some particular target file
    from the queue is compared with arbitrary file passed as an argument.

    Because epoll target files can have same file descriptor number but
    different file_struct a caller should explicitly specify the offset
    within.

    To test if some particular file is matching entry inside epoll one have
    to

    - fill kcmp_epoll_slot structure with epoll file descriptor,
    target file number and target file offset (in case if only
    one target is present then it should be 0)

    - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
    - the kernel fetch file pointer matching file descriptor @fd of pid1
    - lookups for file struct in epoll queue of pid2 and returns traditional
    0,1,2 result for sorting purpose

    Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Andrey Vagin
    Cc: Al Viro
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Jason Baron
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Since it is possbile to have same number in tfd field (say file added,
    closed, then nother file dup'ed to same number and added back) it is
    imposible to distinguish such target files solely by their numbers.

    Strictly speaking regular applications don't need to recognize these
    targets at all but for checkpoint/restore sake we need to collect
    targets to be able to push them back on restore stage in a proper order.

    Thus lets add file position, inode and device number where this target
    lays. This three fields can be used as a primary key for sorting, and
    together with kcmp help CRIU can find out an exact file target (from the
    whole set of processes being checkpointed).

    Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Andrei Vagin
    Cc: Al Viro
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Jason Baron
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

11 Jul, 2017

1 commit

  • We've encountered zombies that are waiting for a thread to exit that are
    looping in ep_poll() almost endlessly although there is a pending
    SIGKILL as a result of a group exit.

    This happens because we always find ep_events_available() and fetch more
    events and never are able to check for signal_pending() that would break
    from the loop and return -EINTR.

    Special case fatal signals and break immediately to guarantee that we
    loop to fetch more events and delay making a timely exit.

    It would also be possible to simply move the check for signal_pending()
    higher than checking for ep_events_available(), but there have been no
    reports of delayed signal handling other than SIGKILL preventing zombies
    from exiting that would be fixed by this.

    It fixes an issue for us where we have witnessed zombies sticking around
    for at least O(minutes), but considering the code has been like this
    forever and nobody else has complained that I have found, I would simply
    queue it up for 4.12.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Alexander Viro
    Cc: Jan Kara
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... is was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Mar, 2017

1 commit

  • This patch adds busy poll support to epoll. The implementation is meant to
    be opportunistic in that it will take the NAPI ID from the last socket
    that is added to the ready list that contains a valid NAPI ID and it will
    use that for busy polling until the ready list goes empty. Once the ready
    list goes empty the NAPI ID is reset and busy polling is disabled until a
    new socket is added to the ready list.

    In addition when we insert a new socket into the epoll we record the NAPI
    ID and assume we are going to receive events on it. If that doesn't occur
    it will be evicted as the active NAPI ID and we will resume normal
    behavior.

    An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
    options to spread the incoming connections to specific worker threads
    based on the incoming queue. This enables epoll for each worker thread
    to have only sockets that receive packets from a single queue. So when an
    application calls epoll_wait() and there are no events available to report,
    busy polling is done on the associated queue to pull the packets.

    Signed-off-by: Sridhar Samudrala
    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Sridhar Samudrala
     

02 Mar, 2017

1 commit


28 Feb, 2017

1 commit

  • In case if epoll_ctl is called with operation EPOLL_CTL_DEL then
    @epds.events variable allocated on stack may contain random bits which
    we test then for EPOLLEXCLUSIVE. Since currently the test look like

    if (epds.events & EPOLLEXCLUSIVE) {
    if (op == EPOLL_CTL_MOD)
    goto error_tgt_fput;
    if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
    (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
    goto error_tgt_fput;
    }

    Nothing serious will happen even if epds.events has this bit set, still
    better to be on safe side and make sure that we're to test this bit at
    all.

    Link: http://lkml.kernel.org/r/20170214154935.GG1850@uranus.lan
    Signed-off-by: Cyrill Gorcunov
    Cc: Al Viro
    Cc: Andrey Vagin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

25 Dec, 2016

1 commit


20 May, 2016

1 commit

  • struct timespec is not y2038 safe. Even though timespec might be
    sufficient to represent timeouts, use struct timespec64 here as the plan
    is to get rid of all timespec reference in the kernel.

    The patch transitions the common functions: poll_select_set_timeout()
    and select_estimate_accuracy() to use timespec64. And, all the syscalls
    that use these functions are transitioned in the same patch.

    The restart block parameters for poll uses monotonic time. Use
    timespec64 here as well to assign timeout value. This parameter in the
    restart block need not change because this only holds the monotonic
    timestamp at which timeout should occur. And, unsigned long data type
    should be big enough for this timestamp.

    The system call interfaces will be handled in a separate series.

    Compat interfaces need not change as timespec64 is an alias to struct
    timespec on a 64 bit system.

    Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani
    Acked-by: John Stultz
    Acked-by: David S. Miller
    Cc: Alexander Viro
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Deepa Dinamani
     

18 Mar, 2016

1 commit

  • This patchset introduces a /proc//timerslack_ns interface which
    would allow controlling processes to be able to set the timerslack value
    on other processes in order to save power by avoiding wakeups (Something
    Android currently does via out-of-tree patches).

    The first patch tries to fix the internal timer_slack_ns usage which was
    defined as a long, which limits the slack range to ~4 seconds on 32bit
    systems. It converts it to a u64, which provides the same basically
    unlimited slack (500 years) on both 32bit and 64bit machines.

    The second patch introduces the /proc//timerslack_ns interface
    which allows the full 64bit slack range for a task to be read or set on
    both 32bit and 64bit machines.

    With these two patches, on a 32bit machine, after setting the slack on
    bash to 10 seconds:

    $ time sleep 1

    real 0m10.747s
    user 0m0.001s
    sys 0m0.005s

    The first patch is a little ugly, since I had to chase the slack delta
    arguments through a number of functions converting them to u64s. Let me
    know if it makes sense to break that up more or not.

    Other than that things are fairly straightforward.

    This patch (of 2):

    The timer_slack_ns value in the task struct is currently a unsigned
    long. This means that on 32bit applications, the maximum slack is just
    over 4 seconds. However, on 64bit machines, its much much larger (~500
    years).

    This disparity could make application development a little (as well as
    the default_slack) to a u64. This means both 32bit and 64bit systems
    have the same effective internal slack range.

    Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
    the interface as a unsigned long, so we preserve that limitation on
    32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
    long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
    actually larger then what can be stored by an unsigned long.

    This patch also modifies hrtimer functions which specified the slack
    delta as a unsigned long.

    Signed-off-by: John Stultz
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Kees Cook
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     

06 Feb, 2016

1 commit

  • In the current implementation of the EPOLLEXCLUSIVE flag (added for
    4.5-rc1), if epoll waiters create different POLL* sets and register them
    as exclusive against the same target fd, the current implementation will
    stop waking any further waiters once it finds the first idle waiter.
    This means that waiters could miss wakeups in certain cases.

    For example, when we wake up a pipe for reading we do:
    wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
    one epoll set or epfd is added to pipe p with POLLIN and a second set
    epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
    wakeup since the current implementation will stop after it finds any
    intersection of events with a waiter that is blocked in epoll_wait().

    We could potentially address this by requiring all epoll waiters that
    are added to p be required to pass the same set of POLL* events. IE the
    first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
    flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
    However, I think it might be somewhat confusing interface as we would
    have to reference count the number of users for that set, and so
    userspace would have to keep track of that count, or we would need a
    more involved interface. It also adds some shared state that we'd have
    store somewhere. I don't think anybody will want to bloat
    __wait_queue_head for this.

    I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
    such that it can only be specified with EPOLLIN and/or EPOLLOUT. So
    that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
    once we hit the first idle waiter that specifies the EPOLLIN bit, since
    any remaining waiters that only have 'POLLOUT' set wouldn't need to be
    woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
    bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the
    wake bit set (there is at least one example of this I saw in fs/pipe.c),
    then we just wake the entire exclusive list. Having both 'POLLOUT' and
    'POLLIN' both set should not be on any performance critical path, so I
    think that's ok (in fs/pipe.c its in pipe_release()). We also continue
    to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus,
    the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
    so.

    Since epoll waiters may be interested in other events as well besides
    EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
    doing a 'dup' call on the target fd and adding that as one normally
    would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT
    events are what we are interest in balancing, I think that the 'dup'
    thing could perhaps be added to only one of the waiter threads.
    However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
    sufficient for the majority of use-cases.

    Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
    among multiple epfds, where between 1 and n of the epfds may receive an
    event, it does not satisfy the semantics of EPOLLONESHOT where only 1
    epfd would get an event. Thus, it is not allowed to be specified in
    conjunction with EPOLLEXCLUSIVE.

    EPOLL_CTL_MOD is also not allowed if the fd was previously added as
    EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
    interesting, but this could be relaxed at some further point.

    Signed-off-by: Jason Baron
    Tested-by: Madars Vitolins
    Cc: Michael Kerrisk
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Eric Wong
    Cc: Jonathan Corbet
    Cc: Andy Lutomirski
    Cc: Hagen Paul Pfeifer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

21 Jan, 2016

1 commit

  • Currently, epoll file descriptors or epfds (the fd returned from
    epoll_create[1]()) that are added to a shared wakeup source are always
    added in a non-exclusive manner. This means that when we have multiple
    epfds attached to a shared fd source they are all woken up. This creates
    thundering herd type behavior.

    Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
    'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation. This new
    flag allows for exclusive wakeups when there are multiple epfds attached
    to a shared fd event source.

    The implementation walks the list of exclusive waiters, and queues an
    event to each epfd, until it finds the first waiter that has threads
    blocked on it via epoll_wait(). The idea is to search for threads which
    are idle and ready to process the wakeup events. Thus, we queue an event
    to at least 1 epfd, but may still potentially queue an event to all epfds
    that are attached to the shared fd source.

    Performance testing was done by Madars Vitolins using a modified version
    of Enduro/X. The use of the 'EPOLLEXCLUSIVE' flag reduce the length of
    this particular workload from 860s down to 24s.

    Sample epoll_clt text:

    EPOLLEXCLUSIVE

    Sets an exclusive wakeup mode for the epfd file descriptor that is
    being attached to the target file descriptor, fd. Thus, when an event
    occurs and multiple epfd file descriptors are attached to the same
    target file using EPOLLEXCLUSIVE, one or more epfds will receive an
    event with epoll_wait(2). The default in this scenario (when
    EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
    EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.

    Signed-off-by: Jason Baron
    Tested-by: Madars Vitolins
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Michael Kerrisk
    Cc: Eric Wong
    Cc: Jonathan Corbet
    Cc: Andy Lutomirski
    Cc: Hagen Paul Pfeifer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

14 Feb, 2015

1 commit

  • After waking up a task waiting for an event, we explicitly mark it as
    TASK_RUNNING (which is necessary as we do the checks for wakeups as
    TASK_INTERRUPTIBLE). Once running and dealing with actually delivering
    the events, we're obviously not planning on calling schedule, thus we can
    relax the implied barrier and simply update the state with
    __set_current_state().

    Signed-off-by: Davidlohr Bueso
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

06 Nov, 2014

1 commit

  • seq_printf functions shouldn't really check the return value.
    Checking seq_has_overflowed() occasionally is used instead.

    Update vfs documentation.

    Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com

    Cc: David S. Miller
    Cc: Al Viro
    Signed-off-by: Joe Perches
    [ did a few clean ups ]
    Signed-off-by: Steven Rostedt

    Joe Perches
     

11 Sep, 2014

1 commit

  • When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
    not initialized but ep_take_care_of_epollwakeup reads its event field.
    When this unintialized field has EPOLLWAKEUP bit set, a capability check
    is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup. This
    produces unexpected messages in the audit log, such as (on a system
    running SELinux):

    type=AVC msg=audit(1408212798.866:410): avc: denied
    { block_suspend } for pid=7754 comm="dbus-daemon" capability=36
    scontext=unconfined_u:unconfined_r:unconfined_t
    tcontext=unconfined_u:unconfined_r:unconfined_t
    tclass=capability2 permissive=1

    type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
    success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
    pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
    fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
    exe="/usr/bin/dbus-daemon"
    subj=unconfined_u:unconfined_r:unconfined_t key=(null)

    ("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")

    Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.

    Fixes: 4d7e30d98939 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
    Signed-off-by: Nicolas Iooss
    Cc: Alexander Viro
    Cc: Arve Hjønnevåg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     

17 Jun, 2014

1 commit

  • This fixes use-after-free of epi->fllink.next inside list loop macro.
    This loop actually releases elements in the body. The list is
    rcu-protected but here we cannot hold rcu_read_lock because we need to
    lock mutex inside.

    The obvious solution is to use list_for_each_entry_safe(). RCU-ness
    isn't essential because nobody can change this list under us, it's final
    fput for this file.

    The bug was introduced by ae10b2b4eb01 ("epoll: optimize EPOLL_CTL_DEL
    using rcu")

    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Cyrill Gorcunov
    Cc: Stable # 3.13+
    Cc: Sasha Levin
    Cc: Jason Baron
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Jun, 2014

1 commit


03 Jan, 2014

1 commit

  • The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
    That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
    epoll_ctl(b, EPOLL_CTL_DEL, a, x). The deadlock was introduced with
    commmit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple
    topologies").

    The acquistion of the ep->mtx for the destination 'ep' was added such
    that a concurrent EPOLL_CTL_ADD operation would see the correct state of
    the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')

    However, by simply not acquiring the lock, we do not serialize behind
    the ep->mtx from the add path, and thus may perform a full path check
    when if we had waited a little longer it may not have been necessary.
    However, this is a transient state, and performing the full loop
    checking in this case is not harmful.

    The important point is that we wouldn't miss doing the full loop
    checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
    its operating upon. The reason we don't need to do lock ordering in the
    add path, is that we are already are holding the global 'epmutex'
    whenever we do the double lock. Further, the original posting of this
    patch, which was tested for the intended performance gains, did not
    perform this additional locking.

    Signed-off-by: Jason Baron
    Cc: Nathan Zimmer
    Cc: Eric Wong
    Cc: Nelson Elhage
    Cc: Al Viro
    Cc: Davide Libenzi
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

03 Dec, 2013

1 commit


13 Nov, 2013

3 commits

  • Merge first patch-bomb from Andrew Morton:
    "Quite a lot of other stuff is banked up awaiting further
    next->mainline merging, but this batch contains:

    - Lots of random misc patches
    - OCFS2
    - Most of MM
    - backlight updates
    - lib/ updates
    - printk updates
    - checkpatch updates
    - epoll tweaking
    - rtc updates
    - hfs
    - hfsplus
    - documentation
    - procfs
    - update gcov to gcc-4.7 format
    - IPC"

    * emailed patches from Andrew Morton : (269 commits)
    ipc, msg: fix message length check for negative values
    ipc/util.c: remove unnecessary work pending test
    devpts: plug the memory leak in kill_sb
    ./Makefile: export initial ramdisk compression config option
    init/Kconfig: add option to disable kernel compression
    drivers: w1: make w1_slave::flags long to avoid memory corruption
    drivers/w1/masters/ds1wm.cuse dev_get_platdata()
    drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
    drivers/memstick/core/mspro_block.c: fix attributes array allocation
    drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
    kernel/panic.c: reduce 1 byte usage for print tainted buffer
    gcov: reuse kbasename helper
    kernel/gcov/fs.c: use pr_warn()
    kernel/module.c: use pr_foo()
    gcov: compile specific gcov implementation based on gcc version
    gcov: add support for gcc 4.7 gcov format
    gcov: move gcov structs definitions to a gcc version specific file
    kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
    kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
    kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
    ...

    Linus Torvalds
     
  • Pull vfs updates from Al Viro:
    "All kinds of stuff this time around; some more notable parts:

    - RCU'd vfsmounts handling
    - new primitives for coredump handling
    - files_lock is gone
    - Bruce's delegations handling series
    - exportfs fixes

    plus misc stuff all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
    ecryptfs: ->f_op is never NULL
    locks: break delegations on any attribute modification
    locks: break delegations on link
    locks: break delegations on rename
    locks: helper functions for delegation breaking
    locks: break delegations on unlink
    namei: minor vfs_unlink cleanup
    locks: implement delegations
    locks: introduce new FL_DELEG lock flag
    vfs: take i_mutex on renamed file
    vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
    vfs: don't use PARENT/CHILD lock classes for non-directories
    vfs: pull ext4's double-i_mutex-locking into common code
    exportfs: fix quadratic behavior in filehandle lookup
    exportfs: better variable name
    exportfs: move most of reconnect_path to helper function
    exportfs: eliminate unused "noprogress" counter
    exportfs: stop retrying once we race with rename/remove
    exportfs: clear DISCONNECTED on all parents sooner
    exportfs: more detailed comment for path_reconnect
    ...

    Linus Torvalds
     
  • When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
    directly to a wakeup source, we do not need to take the global 'epmutex',
    unless the epoll file descriptor is nested. The purpose of taking the
    'epmutex' on add is to prevent complex topologies such as loops and deep
    wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
    operations. However, for the simple case of an epoll file descriptor
    attached directly to a wakeup source (with no nesting), we do not need to
    hold the 'epmutex'.

    This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
    scalability on larger systems. Quoting Nathan Zimmer's mail on SPECjbb
    performance:

    "On the 16 socket run the performance went from 35k jOPS to 125k jOPS. In
    addition the benchmark when from scaling well on 10 sockets to scaling
    well on just over 40 sockets.

    ...

    Currently the benchmark stops scaling at around 40-44 sockets but it seems like
    I found a second unrelated bottleneck."

    [akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
    Signed-off-by: Jason Baron
    Tested-by: Nathan Zimmer
    Cc: Eric Wong
    Cc: Nelson Elhage
    Cc: Al Viro
    Cc: Davide Libenzi
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron