13 Jan, 2012

1 commit

  • The current epoll code can be tickled to run essentially indefinitely both
    in the loop-detection path check (on ep_insert()) and in the wakeup paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop-detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to 0.3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner,
    by keeping track of nodes that have already been visited, but the
    complexity is greater there, since there can be multiple wakeups on
    different cpus... Thus, I've opted to limit the number of possible wakeup
    paths when the paths are created.

    This is accomplished by noting that the endpoint file descriptors found
    during the loop detection pass (from the newly added link) are actually
    the sources for wakeup events. I keep a list of these file descriptors
    and limit the number and length of the paths that emanate from these
    'source file descriptors'. In the current implementation I allow 1000
    paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and
    10 of length 5. Note that it is sufficient to check only the 'source
    file descriptors' reachable from the newly added link, since no other
    'source file descriptors' will have newly added links. This allows us to
    check only the wakeup paths that may have gotten too long, rather than
    re-checking all possible wakeup paths on the system.
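
    To make the scheme concrete, the per-depth limits can be encoded in a
    small table with an increment-and-check helper. The sketch below is
    illustrative only (names and indexing are mine, not a quote of the
    patch):

    #define PATH_ARR_SIZE 5
    static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
    static int path_count[PATH_ARR_SIZE];

    /* Called once per path of length 'len' (1..5) found from a 'source
     * file descriptor'; returns -1 when that source already has too many
     * paths of the given length. */
    static int path_count_inc(int len)
    {
        if (len < 1 || len > PATH_ARR_SIZE)
            return -1;      /* deeper than the allowed nesting */
        if (++path_count[len - 1] > path_limits[len - 1])
            return -1;
        return 0;
    }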

    In terms of the path limit selection, it is first worth noting that the
    most common case for epoll is probably the model where you have one
    epoll file descriptor that is monitoring n 'source file descriptors'.
    In this case, each 'source file descriptor' has a single path of length
    1. Thus, I believe that the limits I'm proposing are quite reasonable,
    and in fact may be too generous. I'm hoping that the proposed limits
    will not cause any workloads that currently work to fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a
    subset of the add paths. I need to hold the epmutex so that we can
    traverse a coherent graph when checking the number of paths. I believe
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the epoll_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here is the allowed length of epoll chains.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that the graph depth is limited to 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced by the loop
    detection check, and only when the epoll file descriptors are added in
    a certain order; thus, it is currently easily bypassed. The newly added
    check for wakeup paths strictly limits the wakeup paths to a length of
    5, regardless of the order in which ep's are linked together. Thus, a
    side-effect of the new code is a more consistent enforcement of the
    graph depth.
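
    For illustration, a hypothetical userspace sketch (not one of the
    posted test programs) that keeps deepening an epoll chain now stops at
    the depth limit, with EPOLL_CTL_ADD failing instead of silently
    building a deeper graph:

    #include <stdio.h>
    #include <sys/epoll.h>

    int main(void)
    {
        struct epoll_event evt = { .events = EPOLLIN };
        int i, outer = epoll_create(1);

        /* Keep nesting one epoll fd inside another; with the new checks
         * the kernel refuses to build chains deeper than 5. */
        for (i = 0; i < 10; i++) {
            int inner = epoll_create(1);
            if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &evt) < 0) {
                printf("nesting stopped at depth %d\n", i + 1);
                return 0;
            }
            outer = inner;
        }
        return 0;
    }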

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     

01 Nov, 2011

1 commit

  • epoll can recursively acquire ep->mtx on multiple "struct eventpoll"s at
    once in the case where one epoll fd is monitoring another epoll fd. This
    is perfectly OK, since we're careful about the lock
    ordering, but it causes spurious lockdep warnings. Annotate the recursion
    using mutex_lock_nested, and add a comment explaining the nesting rules
    for good measure.

    Recent versions of systemd are triggering this, and it can also be
    demonstrated with the following trivial test program:

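    The program itself did not survive this excerpt; a minimal sketch along
    the lines described (one epoll fd watching another, illustrative rather
    than a verbatim quote of the posted program) would be:

    #include <unistd.h>
    #include <sys/epoll.h>

    int main(void)
    {
        int e1, e2;
        struct epoll_event evt = { .events = EPOLLIN };

        e1 = epoll_create(1);
        e2 = epoll_create(2);
        /* Watching one epoll fd from another takes ep->mtx on both
         * eventpoll structures, which used to trip lockdep. */
        epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
        close(e2);
        close(e1);
        return 0;
    }
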
    Tested-by: Paul Bolle
    Signed-off-by: Nelson Elhage
    Acked-by: Jason Baron
    Cc: Dave Jones
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nelson Elhage
     

15 Sep, 2011

1 commit


27 Jul, 2011

1 commit

  • This allows us to move code duplicated in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.
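
    For context, atomic_inc_not_zero() is the usual
    take-a-reference-unless-already-dead primitive; a hypothetical usage
    sketch:

    #include <linux/atomic.h>

    struct obj {
        atomic_t refcnt;
    };

    /* Take a reference only if the object is still live (refcnt != 0);
     * returns 0 when the object is already on its way to being freed. */
    static int obj_get(struct obj *o)
    {
        return atomic_inc_not_zero(&o->refcnt);
    }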

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

26 Jul, 2011

1 commit


31 Mar, 2011

1 commit


23 Mar, 2011

2 commits

  • Add a comment to ep_poll(), rename the labels a bit more clearly, fix an
    unused-variable warning from gcc, and optimize the non-blocking path a
    little.

    Hinted-by: Andrew Morton
    Signed-off-by: Davide Libenzi

    hannes@cmpxchg.org:

    : The non-blocking ep_poll path optimization introduced skipping over the
    : return value setup.
    :
    : Initialize it properly, my userspace gets upset by epoll_wait() returning
    : random things.
    :
    : In addition, remove the reinitialization at the fetch_events label; the
    : return value is guaranteed to be zero when execution reaches there.

    [hannes@cmpxchg.org: fix initialization]
    Signed-off-by: Johannes Weiner
    Cc: Shawn Bohrer
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Bohrer
     
  • Move the event readiness check into a proper inline, and use it uniformly
    inside ep_poll() code. Events in the ->ovflist are no less ready than the
    ones in ->rdllist.
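
    The resulting helper looks roughly like this (a sketch; the in-tree
    version may differ in detail):

    /* An epoll set has work for epoll_wait() if either its ready list is
     * non-empty, or events spilled into ->ovflist while the ready list
     * was being scanned. */
    static inline int ep_events_available(struct eventpoll *ep)
    {
        return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
    }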

    Signed-off-by: Davide Libenzi
    Cc: Shawn Bohrer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

19 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
    doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
    Update cpuset info & webiste for cgroups
    dcdbas: force SMI to happen when expected
    arch/arm/Kconfig: remove one to many l's in the word.
    asm-generic/user.h: Fix spelling in comment
    drm: fix printk typo 'sracth'
    Remove one to many n's in a word
    Documentation/filesystems/romfs.txt: fixing link to genromfs
    drivers:scsi Change printk typo initate -> initiate
    serial, pch uart: Remove duplicate inclusion of linux/pci.h header
    fs/eventpoll.c: fix spelling
    mm: Fix out-of-date comments which refers non-existent functions
    drm: Fix printk typo 'failled'
    coh901318.c: Change initate to initiate.
    mbox-db5500.c Change initate to initiate.
    edac: correct i82975x error-info reported
    edac: correct i82975x mci initialisation
    edac: correct commented info
    fs: update comments to point correct document
    target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
    ...

    Trivial conflict in fs/eventpoll.c (spelling vs addition)

    Linus Torvalds
     

26 Feb, 2011

1 commit

  • In several places, an epoll fd can call another file's ->f_op->poll()
    method with ep->mtx held. This is in general unsafe, because that other
    file could itself be an epoll fd that contains the original epoll fd.

    The code defends against this possibility in its own ->poll() method
    using ep_call_nested, but there are several other unsafe calls to ->poll
    elsewhere that can be made to deadlock. For example, the following
    simple program causes the ->poll() call in ep_insert() to recursively
    call the original fd's ->poll(), leading to deadlock:

    #include <unistd.h>
    #include <sys/epoll.h>

    int main(void) {
        int e1, e2, p[2];
        struct epoll_event evt = {
            .events = EPOLLIN
        };

        e1 = epoll_create(1);
        e2 = epoll_create(2);
        pipe(p);

        epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
        epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
        write(p[1], p, sizeof p);
        epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);

        return 0;
    }

    On insertion, check whether the inserted file is itself a struct epoll,
    and if so, do a recursive walk to detect whether inserting this file would
    create a loop of epoll structures, which could lead to deadlock.
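
    A sketch of the shape of that walk (illustrative only, not the kernel
    code; it reuses eventpoll's rbtree of epitems):

    /* Return nonzero if 'origin' is reachable from 'ep', i.e. if linking
     * 'ep' under 'origin' would create a cycle of epoll structures. */
    static int ep_would_loop(struct eventpoll *origin, struct eventpoll *ep)
    {
        struct rb_node *n;

        if (ep == origin)
            return 1;
        for (n = rb_first(&ep->rbr); n; n = rb_next(n)) {
            struct epitem *epi = rb_entry(n, struct epitem, rbn);

            if (is_file_epoll(epi->ffd.file) &&
                ep_would_loop(origin, epi->ffd.file->private_data))
                return 1;
        }
        return 0;
    }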

    [nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Nelson Elhage
    Reported-by: Nelson Elhage
    Tested-by: Nelson Elhage
    Cc: [2.6.34+, possibly earlier]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

18 Feb, 2011

1 commit

  • eventpoll.c has wonderful comments but some annoying typos
    sneaked in:
    * toepoll_ctl -> to epoll_ctl
    * rapresent -> represents
    * sructure -> structure
    * machanism -> mechanism
    * trasfering -> transferring

    Signed-off-by: Daniel Baluta
    Signed-off-by: Jiri Kosina

    Daniel Baluta
     

03 Feb, 2011

1 commit

  • commit 95aac7b1cd224f ("epoll: make epoll_wait() use the hrtimer range
    feature") added a performance regression because it uses
    timespec_add_ns() with potentially very large 'ns' values.
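
    The fix converts the millisecond timeout with one division up front,
    instead of letting timespec_add_ns() normalize a huge nanosecond count
    one second per loop iteration. Roughly (a sketch of the fix, not a
    verbatim quote):

    static inline struct timespec ep_set_mstimeout(long ms)
    {
        struct timespec now, ts = {
            .tv_sec  = ms / MSEC_PER_SEC,
            .tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
        };

        ktime_get_ts(&now);
        return timespec_add_safe(now, ts);
    }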

    [akpm@linux-foundation.org: s/epoll_set_mstimeout/ep_set_mstimeout/, per Davide]
    Reported-by: Simon Kirby
    Signed-off-by: Eric Dumazet
    Cc: Shawn Bohrer
    Acked-by: Davide Libenzi
    Cc: [2.6.37.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

14 Jan, 2011

1 commit

  • On a 16TB machine, max_user_watches has an integer overflow. Convert it
    to use a long and handle the associated fallout.
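
    Rough arithmetic shows why: the default budget is about 4% of low
    memory, at roughly 100 bytes accounted per watch, so a 16TB box allows
    on the order of 16TB * 0.04 / 100B ~= 6 * 10^9 watches, which does not
    fit in a signed 32-bit int (max ~2.1 * 10^9).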

    Signed-off-by: Robin Holt
    Cc: "Eric W. Biederman"
    Acked-by: Davide Libenzi
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

28 Oct, 2010

1 commit

  • This makes epoll use hrtimers for the timeout value, which prevents
    epoll_wait() from timing out up to a millisecond early.

    This mirrors the behavior of select() and poll().

    Signed-off-by: Shawn Bohrer
    Cc: Al Viro
    Acked-by: Davide Libenzi
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Bohrer
     

15 Oct, 2010

1 commit

  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek, seq_lseek
    and default_llseek. For cases where we can automatically prove that
    the file offset is always ignored, we use noop_llseek, which maintains
    the current behavior of not returning an error from a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {

    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {

    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {

    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {

    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
     

11 May, 2010

1 commit

  • epoll should not touch flags in wait_queue_t. This patch introduces a new
    function, __add_wait_queue_exclusive(), for users who use a wait queue as
    a LIFO queue.

    __add_wait_queue_tail_exclusive() is introduced too, replacing
    add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed,
    as it duplicates __remove_wait_queue(), is disliked by users, and has
    fewer users.
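
    The new helper is essentially the following (a sketch of its shape;
    adding at the head of the list is what gives LIFO wakeup order):

    static inline void __add_wait_queue_exclusive(wait_queue_head_t *q,
                                                  wait_queue_t *wait)
    {
        wait->flags |= WQ_FLAG_EXCLUSIVE;
        __add_wait_queue(q, wait);
    }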

    Signed-off-by: Changli Gao
    Signed-off-by: Peter Zijlstra
    Cc: Alexander Viro
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Davide Libenzi
    Signed-off-by: Ingo Molnar

    Changli Gao
     

23 Dec, 2009

1 commit

  • It seems a couple places such as arch/ia64/kernel/perfmon.c and
    drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile()
    instead of a private pseudo-fs + alloc_file(), if only there were a way
    to get a read-only file. So provide this by having anon_inode_getfile()
    create a read-only file if we pass O_RDONLY in flags.
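
    Callers can then do something like the following (illustrative;
    'my_fops' and 'priv' stand in for the caller's file_operations and
    private data):

    struct file *filp;

    /* A read-only anon-inode file, with no private pseudo-fs needed. */
    filp = anon_inode_getfile("[my-subsys]", &my_fops, priv, O_RDONLY);
    if (IS_ERR(filp))
        return PTR_ERR(filp);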

    Signed-off-by: Roland Dreier
    Signed-off-by: Al Viro

    Roland Dreier
     

19 Nov, 2009

1 commit


12 Nov, 2009

1 commit


19 Jun, 2009

1 commit

  • This fixes a regression in 2.6.30.

    I unfortunately accepted a patch a while ago, dropping the "current"
    usage from possible IRQ context, without proper thought over it. The
    patch switched to using the CPU id, by bounding the nested call callback
    with a get_cpu()/put_cpu().

    Unfortunately the ep_call_nested() function can be called with a callback
    that grabs sleepy locks (from own f_op->poll()), that results in epic
    fails. The following patch uses the proper "context" depending on the
    path where it is called, and on the kind of callback.

    This has been reported by Stefan Richter, who has also verified that the
    patch fixes his previously failing environment.

    Signed-off-by: Davide Libenzi
    Reported-by: Stefan Richter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

13 May, 2009

1 commit

  • Fix a size check WRT the manual pages. This was inadvertently broken by
    commit 9fe5ad9c8cef9ad5873d8ee55d1cf00d9b607df0 ("flag parameters
    add-on: remove epoll_create size param").
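
    The restored check is the obvious one (a sketch of the shape of the
    fix):

    SYSCALL_DEFINE1(epoll_create, int, size)
    {
        /* The manual pages promise -EINVAL for a non-positive size, even
         * though the value is otherwise ignored these days. */
        if (size <= 0)
            return -EINVAL;

        return sys_epoll_create1(0);
    }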

    Signed-off-by: Davide Libenzi
    Cc: rohit verma
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

01 Apr, 2009

9 commits

  • Use the events hint now sent by some devices, to avoid unnecessary
    wakeups for events that are of no interest to the caller. This code
    handles both devices that send keyed events and the ones that do not
    (and even the ones that sometimes send events and sometimes don't).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davide Libenzi
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • eventpoll.c uses void * in one place for no obvious reason; change it to
    use the real type instead.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • ep_modify() doesn't need to set event.data from within the ep->lock
    spinlock as the comment suggests. The only place event.data is used is
    ep_send_events_proc(), and this is protected by ep->mtx instead of
    ep->lock. Also update the comment for mutex_lock() at the top of
    ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
    epoll_ctl(EPOLL_CTL_MOD).

    ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • xchg in ep_unregister_pollwait() is unnecessary because it is protected by
    either epmutex or ep->mtx (the same protection as ep_remove()).

    If xchg were necessary, it would still be insufficient to protect
    against problems: if multiple concurrent calls to
    ep_unregister_pollwait() were possible, then a second caller that
    returns without doing anything (because nwait == 0) could return before
    the waitqueues are removed by the first caller, which looks like it
    could lead to problematic races with ep_poll_callback().

    So remove xchg and add comments about the locking.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • If epoll_wait returns -EFAULT, the event that was being returned when the
    fault was encountered will be forgotten. This is not a big deal since
    EFAULT will happen only if a buggy userspace program passes in a bad
    address, in which case what happens later usually doesn't matter.
    However, it is easy to remember the event for later, and this patch makes
    a simple change to do that.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
    dereferencing it) to detect callback recursion, but it may be called from
    irq context where the use of current is generally discouraged. It would
    be better to use get_cpu() and put_cpu() to detect the callback recursion.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • Remove debugging code from epoll. There's no need for it to be included
    in mainline code.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Signed-off-by: Davide Libenzi
    Cc: Pavel Pisa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fix a bug inside the epoll's f_op->poll() code, that returns POLLIN even
    though there are no actual ready monitored fds. The bug shows up if you
    add an epoll fd inside another fd container (poll, select, epoll).

    The problem is that the callback-based wake ups used by epoll do not
    carry (patches will follow, to fix this) any information about the
    events that actually happened. So the callback code, since it can't
    call the file*'s ->poll() inside the callback, chains the file* into a
    ready-list.

    So, suppose you added an fd with EPOLLOUT only, and some data shows up
    on the fd: the file* mapped by the fd will be added to the ready-list
    (via the wakeup callback). During normal epoll_wait() use, this
    condition is sorted out at the time we're actually able to call the
    file*'s f_op->poll().

    Inside the old epoll's f_op->poll() though, only a quick check
    !list_empty(ready-list) was performed, and this could have led to
    reporting POLLIN even though no ready fds would show up at a following
    epoll_wait(). In order to correctly report the ready status for an epoll
    fd, the ready-list must be checked to see if any really available fd+event
    would be ready in a following epoll_wait().

    This operation (calling f_op->poll() from inside f_op->poll()), like
    wake ups, must be handled with care, because epoll fds can be added to
    other epoll fds.

    Test code:

    /*
    * epoll_test by Davide Libenzi (Simple code to test epoll internals)
    * Copyright (C) 2008 Davide Libenzi
    *
    * This program is free software; you can redistribute it and/or modify
    * it under the terms of the GNU General Public License as published by
    * the Free Software Foundation; either version 2 of the License, or
    * (at your option) any later version.
    *
    * This program is distributed in the hope that it will be useful,
    * but WITHOUT ANY WARRANTY; without even the implied warranty of
    * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    * GNU General Public License for more details.
    *
    * You should have received a copy of the GNU General Public License
    * along with this program; if not, write to the Free Software
    * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
    *
    * Davide Libenzi
    *
    */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <poll.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/epoll.h>

    #define EPWAIT_TIMEO (1 * 1000)
    #ifndef POLLRDHUP
    #define POLLRDHUP 0x2000
    #endif

    #define EPOLL_MAX_CHAIN 100L

    #define EPOLL_TF_LOOP (1 << 0)

    struct epoll_test_cfg {
        long size;
        long flags;
    };

    static int xepoll_create(int n) {
        int epfd;

        if ((epfd = epoll_create(n)) == -1) {
            perror("epoll_create");
            exit(2);
        }

        return epfd;
    }

    static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
        if (epoll_ctl(epfd, cmd, fd, evt) < 0) {
            perror("epoll_ctl");
            exit(3);
        }
    }

    static void xpipe(int *fds) {
        if (pipe(fds)) {
            perror("pipe");
            exit(4);
        }
    }

    static pid_t xfork(void) {
        pid_t pid;

        if ((pid = fork()) == (pid_t) -1) {
            perror("fork");
            exit(5);
        }

        return pid;
    }

    static int run_forked_proc(int (*proc)(void *), void *data) {
        int status;
        pid_t pid;

        if ((pid = xfork()) == 0)
            exit((*proc)(data));
        if (waitpid(pid, &status, 0) != pid) {
            perror("waitpid");
            return -1;
        }

        return WIFEXITED(status) ? WEXITSTATUS(status): -2;
    }

    static int check_events(int fd, int timeo) {
        struct pollfd pfd;

        fprintf(stdout, "Checking events for fd %d\n", fd);
        memset(&pfd, 0, sizeof(pfd));
        pfd.fd = fd;
        pfd.events = POLLIN | POLLOUT;
        if (poll(&pfd, 1, timeo) < 0) {
            perror("poll()");
            return 0;
        }
        if (pfd.revents & POLLIN)
            fprintf(stdout, "\tPOLLIN\n");
        if (pfd.revents & POLLOUT)
            fprintf(stdout, "\tPOLLOUT\n");
        if (pfd.revents & POLLERR)
            fprintf(stdout, "\tPOLLERR\n");
        if (pfd.revents & POLLHUP)
            fprintf(stdout, "\tPOLLHUP\n");
        if (pfd.revents & POLLRDHUP)
            fprintf(stdout, "\tPOLLRDHUP\n");

        return pfd.revents;
    }

    static int epoll_test_tty(void *data) {
        int epfd, ifd = fileno(stdin), res;
        struct epoll_event evt;

        if (check_events(ifd, 0) != POLLOUT) {
            fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
            return 1;
        }
        epfd = xepoll_create(1);
        fprintf(stdout, "Created epoll fd (%d)\n", epfd);
        memset(&evt, 0, sizeof(evt));
        evt.events = EPOLLIN;
        xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &evt);
        if (check_events(epfd, 0) & POLLIN) {
            res = epoll_wait(epfd, &evt, 1, 0);
            if (res == 0) {
                fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
                        epfd);
                return 2;
            }
        }

        return 0;
    }

    static int epoll_wakeup_chain(void *data) {
        struct epoll_test_cfg *tcfg = data;
        int i, res, epfd, bfd, nfd, pfds[2];
        pid_t pid;
        struct epoll_event evt;

        memset(&evt, 0, sizeof(evt));
        evt.events = EPOLLIN;

        epfd = bfd = xepoll_create(1);

        for (i = 0; i < tcfg->size; i++) {
            nfd = xepoll_create(1);
            xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
            bfd = nfd;
        }
        xpipe(pfds);
        if (tcfg->flags & EPOLL_TF_LOOP) {
            xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
            /*
             * If we're testing for loops, we want the wakeup triggered
             * by the pipe write done in the child process to trigger a
             * fake event. So we add the pipe read side with EPOLLOUT
             * events. This will trigger an addition to the ready-list,
             * but no real events will be there. Then the epoll kernel
             * code will proceed in calling f_op->poll() of the epfd,
             * triggering the loop we want to test.
             */
            evt.events = EPOLLOUT;
        }
        xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);

        /*
         * The pipe write must come after the poll(2) call inside
         * check_events(). This tests the nested wakeup code in
         * fs/eventpoll.c:ep_poll_safewake().
         * By having the check_events() (hence poll(2)) happen first,
         * we have the poll wait queue filled up, and the write(2) in
         * the child will trigger the wakeup chain.
         */
        if ((pid = xfork()) == 0) {
            sleep(1);
            write(pfds[1], "w", 1);
            exit(0);
        }

        res = check_events(epfd, 2000) & POLLIN;

        if (waitpid(pid, NULL, 0) != pid) {
            perror("waitpid");
            return -1;
        }

        return res;
    }

    static int epoll_poll_chain(void *data) {
        struct epoll_test_cfg *tcfg = data;
        int i, res, epfd, bfd, nfd, pfds[2];
        pid_t pid;
        struct epoll_event evt;

        memset(&evt, 0, sizeof(evt));
        evt.events = EPOLLIN;

        epfd = bfd = xepoll_create(1);

        for (i = 0; i < tcfg->size; i++) {
            nfd = xepoll_create(1);
            xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
            bfd = nfd;
        }
        xpipe(pfds);
        if (tcfg->flags & EPOLL_TF_LOOP) {
            xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
            /*
             * If we're testing for loops, we want the wakeup triggered
             * by the pipe write done in the child process to trigger a
             * fake event. So we add the pipe read side with EPOLLOUT
             * events. This will trigger an addition to the ready-list,
             * but no real events will be there. Then the epoll kernel
             * code will proceed in calling f_op->poll() of the epfd,
             * triggering the loop we want to test.
             */
            evt.events = EPOLLOUT;
        }
        xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);

        /*
         * The pipe write must come before the poll(2) call inside
         * check_events(). This tests the nested f_op->poll calls code in
         * fs/eventpoll.c:ep_eventpoll_poll().
         * By having the pipe write(2) happen first, we make the kernel
         * epoll code load the ready lists, and the following poll(2)
         * done inside check_events() will test the nested poll code in
         * ep_eventpoll_poll().
         */
        if ((pid = xfork()) == 0) {
            write(pfds[1], "w", 1);
            exit(0);
        }
        sleep(1);
        res = check_events(epfd, 1000) & POLLIN;

        if (waitpid(pid, NULL, 0) != pid) {
            perror("waitpid");
            return -1;
        }

        return res;
    }

    int main(int ac, char **av) {
        int error;
        struct epoll_test_cfg tcfg;

        fprintf(stdout, "\n********** Testing TTY events\n");
        error = run_forked_proc(epoll_test_tty, NULL);
        fprintf(stdout, error == 0 ?
                "********** OK\n": "********** FAIL (%d)\n", error);

        tcfg.size = 3;
        tcfg.flags = 0;
        fprintf(stdout, "\n********** Testing short wakeup chain\n");
        error = run_forked_proc(epoll_wakeup_chain, &tcfg);
        fprintf(stdout, error == POLLIN ?
                "********** OK\n": "********** FAIL (%d)\n", error);

        tcfg.size = EPOLL_MAX_CHAIN;
        tcfg.flags = 0;
        fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
        error = run_forked_proc(epoll_wakeup_chain, &tcfg);
        fprintf(stdout, error == 0 ?
                "********** OK\n": "********** FAIL (%d)\n", error);

        tcfg.size = 3;
        tcfg.flags = 0;
        fprintf(stdout, "\n********** Testing short poll chain\n");
        error = run_forked_proc(epoll_poll_chain, &tcfg);
        fprintf(stdout, error == POLLIN ?
                "********** OK\n": "********** FAIL (%d)\n", error);

        tcfg.size = EPOLL_MAX_CHAIN;
        tcfg.flags = 0;
        fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
        error = run_forked_proc(epoll_poll_chain, &tcfg);
        fprintf(stdout, error == 0 ?
                "********** OK\n": "********** FAIL (%d)\n", error);

        tcfg.size = 3;
        tcfg.flags = EPOLL_TF_LOOP;
        fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
        error = run_forked_proc(epoll_wakeup_chain, &tcfg);
        fprintf(stdout, error == 0 ?
                "********** OK\n": "********** FAIL (%d)\n", error);

        tcfg.size = 3;
        tcfg.flags = EPOLL_TF_LOOP;
        fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
        error = run_forked_proc(epoll_poll_chain, &tcfg);
        fprintf(stdout, error == 0 ?
                "********** OK\n": "********** FAIL (%d)\n", error);

        return 0;
    }

    Signed-off-by: Davide Libenzi
    Cc: Pavel Pisa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

16 Mar, 2009

1 commit

  • This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now,
    epoll remains the only user, but a future patch will use it to protect
    f_flags as well.

    Cc: Davide Libenzi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     

30 Jan, 2009

1 commit

  • Linus suggested to put limits where the money is, and max_user_watches
    already does that without the need for max_user_instances. That has the
    advantage of mitigating the potential DoS while still allowing a pretty
    generous default behavior.

    Allowing top 4% of low memory (per user) to be allocated in epoll watches,
    we have:

    LOMEM    MAX_WATCHES (per user)
    512MB    ~178000
    1GB      ~356000
    2GB      ~712000
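
    (The numbers line up: 4% of 512MB is roughly 21MB, and at the ~120
    bytes accounted per watch that gives about 178000 watches.)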

    A box with 512MB of lomem will meet some challenge in hitting 180K
    watches, as socket buffer math teaches us. No more max_user_instances
    limits, then.

    Signed-off-by: Davide Libenzi
    Cc: Willy Tarreau
    Cc: Michael Kerrisk
    Cc: Bron Gondwana
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

14 Jan, 2009

1 commit


02 Dec, 2008

1 commit

  • It had been thought that the per-user file descriptor limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that lets a normal user request a pretty large amount
    of kernel memory, well within its maximum number of fds. To solve this
    problem, default limits are now imposed, and /proc based configuration
    has been introduced. A new directory has been created, named
    /proc/sys/fs/epoll/, and inside it there are two configuration points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by
    epoll to store "watches" to 1/32 of the amount of low RAM. As an
    example, a 256MB 32-bit machine will have "max_user_watches" set to
    roughly 90000. That should be enough to not break existing heavy epoll
    users. The default value for "max_user_instances" is set to 128; that
    should be enough too.

    This also changes the userspace-visible behavior, because a new error
    code can now come out of EPOLL_CTL_ADD (-ENOSPC). The EMFILE from
    epoll_create() was already listed, so that should be ok.
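
    For example, a process that exhausts its watch budget now sees the add
    fail with ENOSPC (a hypothetical sketch; raise RLIMIT_NOFILE high
    enough to actually reach the watch limit before running out of fds):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>

    int main(void) {
        int epfd = epoll_create(1);
        struct epoll_event evt = { .events = EPOLLIN };
        long n;

        /* Keep registering fresh fds; once the per-user watch budget is
         * exhausted, EPOLL_CTL_ADD fails instead of letting the user pin
         * ever more kernel memory. */
        for (n = 0; ; n++) {
            int fd = eventfd(0, 0);
            if (fd < 0)
                break;      /* ran out of file descriptors first */
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &evt) < 0) {
                printf("add #%ld failed: %s\n", n, strerror(errno));
                break;
            }
        }
        return 0;
    }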

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

27 Oct, 2008

1 commit

  • In commit f337b9c58332bdecde965b436e47ea4c94d30da0 ("epoll: drop
    unnecessary test") Thomas found that there is an unnecessary (always
    true) test in ep_send_events(). The callback never inserts into
    ->rdllink while the send loop is performed, and also does the
    ~EP_PRIVATE_BITS test. Given we're holding the mutex during this time,
    the conditions tested inside the loop are always true.

    HOWEVER.

    The test "!ep_is_linked(&epi->rdllink)" wasn't there because we insert
    into ->rdllink, but because the send-events loop might terminate before
    the whole list is scanned (-EFAULT).

    In such cases, when the loop terminates early, and when a (leftover)
    file received an event while we're performing the lockless loop, we need
    such test to avoid to double insert the epoll items. The list_splice()
    done a few steps below, will correctly re-insert the ones that were left
    on "txlist".

    This should fix the kernel.org bugzilla entry 11831.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

17 Oct, 2008

1 commit

  • Thomas found that there is an unnecessary (always true) test in
    ep_send_events(). The callback never inserts into ->rdllink while the
    send loop is performed, and also does the ~EP_PRIVATE_BITS test. Given
    we're holding the mutex during this time, the conditions tested inside the
    loop are always true. This patch drops the test done inside the
    re-insertion loop.

    Signed-off-by: Davide Libenzi
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

13 Aug, 2008

1 commit


25 Jul, 2008

3 commits

  • Remove the size parameter from the new epoll_create syscall and rename
    the syscall itself. The updated test program follows.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_epoll_create2
    # ifdef __x86_64__
    # define __NR_epoll_create2 291
    # elif defined __i386__
    # define __NR_epoll_create2 329
    # else
    # error "need __NR_epoll_create2"
    # endif
    #endif

    #define EPOLL_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
      int fd = syscall (__NR_epoll_create2, 0);
      if (fd == -1)
        {
          puts ("epoll_create2(0) failed");
          return 1;
        }
      int coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          puts ("epoll_create2(0) set close-on-exec flag");
          return 1;
        }
      close (fd);

      fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC);
      if (fd == -1)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) failed");
          return 1;
        }
      coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) does not set close-on-exec flag");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds tests that ensure the boundary conditions for the various
    constants introduced in the previous patches are met. No code is
    generated.

    [akpm@linux-foundation.org: fix alpha]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds the new epoll_create2 syscall. It extends the old
    epoll_create syscall by one parameter, which is meant to hold a flag
    value. In this patch the only flag supported is EPOLL_CLOEXEC, which
    causes the close-on-exec flag for the returned file descriptor to be
    set.

    A new name EPOLL_CLOEXEC is introduced which in this implementation must
    have the same value as O_CLOEXEC.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_epoll_create2
    # ifdef __x86_64__
    # define __NR_epoll_create2 291
    # elif defined __i386__
    # define __NR_epoll_create2 329
    # else
    # error "need __NR_epoll_create2"
    # endif
    #endif

    #define EPOLL_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
      int fd = syscall (__NR_epoll_create2, 1, 0);
      if (fd == -1)
        {
          puts ("epoll_create2(0) failed");
          return 1;
        }
      int coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          puts ("epoll_create2(0) set close-on-exec flag");
          return 1;
        }
      close (fd);

      fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC);
      if (fd == -1)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) failed");
          return 1;
        }
      coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("epoll_create2(EPOLL_CLOEXEC) does not set close-on-exec flag");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper