11 May, 2010

1 commit

  • epoll should not touch flags in wait_queue_t. This patch introduces a new
    function, __add_wait_queue_exclusive(), for users that use a wait queue as
    a LIFO queue.

    __add_wait_queue_tail_exclusive() is also introduced, replacing
    add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed,
    as it duplicates __remove_wait_queue(), is disliked by users, and has
    fewer users.

    Signed-off-by: Changli Gao
    Signed-off-by: Peter Zijlstra
    Cc: Alexander Viro
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Davide Libenzi
    Signed-off-by: Ingo Molnar

    Changli Gao
     

23 Dec, 2009

1 commit

  • It seems a couple places such as arch/ia64/kernel/perfmon.c and
    drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile()
    instead of a private pseudo-fs + alloc_file(), if only there were a way
    to get a read-only file. So provide this by having anon_inode_getfile()
    create a read-only file if we pass O_RDONLY in flags.

    Signed-off-by: Roland Dreier
    Signed-off-by: Al Viro

    Roland Dreier
     

19 Nov, 2009

1 commit


12 Nov, 2009

1 commit


19 Jun, 2009

1 commit

  • This fixes a regression in 2.6.30.

    Some time ago I unfortunately accepted a patch that dropped the "current"
    usage from possible IRQ context, without giving it proper thought. The
    patch switched to using the CPU id by bounding the nested call callback
    with a get_cpu()/put_cpu() pair.

    Unfortunately the ep_call_nested() function can be called with a callback
    that grabs sleepy locks (from its own f_op->poll()), which results in epic
    fails. The following patch uses the proper "context" depending on the
    path where it is called, and on the kind of callback.

    This was reported by Stefan Richter, who has also verified the patch
    in his previously failing environment.

    Signed-off-by: Davide Libenzi
    Reported-by: Stefan Richter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

13 May, 2009

1 commit

  • Fix a size check WRT the manual pages. This was inadvertently broken by
    commit 9fe5ad9c8cef9ad5873d8ee55d1cf00d9b607df0 ("flag parameters
    add-on: remove epoll_create size param").

    Signed-off-by: Davide Libenzi
    Cc: rohit verma
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

01 Apr, 2009

9 commits

  • Use the events hint now sent by some devices, to avoid unnecessary wakeups
    for events that are of no interest to the caller. This code handles both
    devices that send keyed events and those that do not (and even the ones
    that sometimes send events, and sometimes don't).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davide Libenzi
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • eventpoll.c uses void * in one place for no obvious reason; change it to
    use the real type instead.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • ep_modify() doesn't need to set event.data from within the ep->lock
    spinlock as the comment suggests. The only place event.data is used is
    ep_send_events_proc(), and this is protected by ep->mtx instead of
    ep->lock. Also update the comment for mutex_lock() at the top of
    ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
    epoll_ctl(EPOLL_CTL_MOD).

    ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • xchg in ep_unregister_pollwait() is unnecessary because it is protected by
    either epmutex or ep->mtx (the same protection as ep_remove()).

    If xchg was necessary, it would be insufficient to protect against
    problems: if multiple concurrent calls to ep_unregister_pollwait() were
    possible then a second caller that returns without doing anything because
    nwait == 0 could return before the waitqueues are removed by the first
    caller, which looks like it could lead to problematic races with
    ep_poll_callback().

    So remove xchg and add comments about the locking.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • If epoll_wait returns -EFAULT, the event that was being returned when the
    fault was encountered will be forgotten. This is not a big deal since
    EFAULT will happen only if a buggy userspace program passes in a bad
    address, in which case what happens later usually doesn't matter.
    However, it is easy to remember the event for later, and this patch makes
    a simple change to do that.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
    dereferencing it) to detect callback recursion, but it may be called from
    irq context where the use of current is generally discouraged. It would
    be better to use get_cpu() and put_cpu() to detect the callback recursion.

    Signed-off-by: Tony Battersby
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Battersby
     
  • Remove debugging code from epoll. There's no need for it to be included
    into mainline code.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Signed-off-by: Davide Libenzi
    Cc: Pavel Pisa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fix a bug inside the epoll's f_op->poll() code, that returns POLLIN even
    though there are no actual ready monitored fds. The bug shows up if you
    add an epoll fd inside another fd container (poll, select, epoll).

    The problem is that the callback-based wakeups used by epoll do not carry
    (patches will follow to fix this) any information about the events that
    actually happened. So the callback code, since it can't call the file*'s
    ->poll() inside the callback, chains the file* into a ready-list.

    So, suppose you added an fd with EPOLLOUT only, and some data shows up on
    the fd, the file* mapped by the fd will be added into the ready-list (via
    wakeup callback). During normal epoll_wait() use, this condition is
    sorted out at the time we're actually able to call the file*'s
    f_op->poll().

    Inside the old epoll's f_op->poll() though, only a quick check
    !list_empty(ready-list) was performed, and this could have led to
    reporting POLLIN even though no ready fds would show up at a following
    epoll_wait(). In order to correctly report the ready status for an epoll
    fd, the ready-list must be checked to see if any really available fd+event
    would be ready in a following epoll_wait().

    This operation (calling f_op->poll() from inside f_op->poll()), like
    wakeups, must be handled with care because epoll fds can be added to
    other epoll fds.

    Test code:

    /*
    * epoll_test by Davide Libenzi (Simple code to test epoll internals)
    * Copyright (C) 2008 Davide Libenzi
    *
    * This program is free software; you can redistribute it and/or modify
    * it under the terms of the GNU General Public License as published by
    * the Free Software Foundation; either version 2 of the License, or
    * (at your option) any later version.
    *
    * This program is distributed in the hope that it will be useful,
    * but WITHOUT ANY WARRANTY; without even the implied warranty of
    * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    * GNU General Public License for more details.
    *
    * You should have received a copy of the GNU General Public License
    * along with this program; if not, write to the Free Software
    * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
    *
    * Davide Libenzi
    *
    */

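    #include <sys/types.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <signal.h>
    #include <limits.h>
    #include <poll.h>
    #include <sys/epoll.h>
    #include <sys/wait.h>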

    #define EPWAIT_TIMEO (1 * 1000)
    #ifndef POLLRDHUP
    #define POLLRDHUP 0x2000
    #endif

    #define EPOLL_MAX_CHAIN 100L

    #define EPOLL_TF_LOOP (1 << 0)

    struct epoll_test_cfg {
    long size;
    long flags;
    };

    static int xepoll_create(int n) {
    int epfd;

    if ((epfd = epoll_create(n)) == -1) {
    perror("epoll_create");
    exit(2);
    }

    return epfd;
    }

    static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
    if (epoll_ctl(epfd, cmd, fd, evt) < 0) {
    perror("epoll_ctl");
    exit(3);
    }
    }

    static void xpipe(int *fds) {
    if (pipe(fds)) {
    perror("pipe");
    exit(4);
    }
    }

    static pid_t xfork(void) {
    pid_t pid;

    if ((pid = fork()) == (pid_t) -1) {
    perror("fork");
    exit(5);
    }

    return pid;
    }

    static int run_forked_proc(int (*proc)(void *), void *data) {
    int status;
    pid_t pid;

    if ((pid = xfork()) == 0)
    exit((*proc)(data));
    if (waitpid(pid, &status, 0) != pid) {
    perror("waitpid");
    return -1;
    }

    return WIFEXITED(status) ? WEXITSTATUS(status): -2;
    }

    static int check_events(int fd, int timeo) {
    struct pollfd pfd;

    fprintf(stdout, "Checking events for fd %d\n", fd);
    memset(&pfd, 0, sizeof(pfd));
    pfd.fd = fd;
    pfd.events = POLLIN | POLLOUT;
    if (poll(&pfd, 1, timeo) < 0) {
    perror("poll()");
    return 0;
    }
    if (pfd.revents & POLLIN)
    fprintf(stdout, "\tPOLLIN\n");
    if (pfd.revents & POLLOUT)
    fprintf(stdout, "\tPOLLOUT\n");
    if (pfd.revents & POLLERR)
    fprintf(stdout, "\tPOLLERR\n");
    if (pfd.revents & POLLHUP)
    fprintf(stdout, "\tPOLLHUP\n");
    if (pfd.revents & POLLRDHUP)
    fprintf(stdout, "\tPOLLRDHUP\n");

    return pfd.revents;
    }

    static int epoll_test_tty(void *data) {
    int epfd, ifd = fileno(stdin), res;
    struct epoll_event evt;

    if (check_events(ifd, 0) != POLLOUT) {
    fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
    return 1;
    }
    epfd = xepoll_create(1);
    fprintf(stdout, "Created epoll fd (%d)\n", epfd);
    memset(&evt, 0, sizeof(evt));
    evt.events = EPOLLIN;
    xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &evt);
    if (check_events(epfd, 0) & POLLIN) {
    res = epoll_wait(epfd, &evt, 1, 0);
    if (res == 0) {
    fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
    epfd);
    return 2;
    }
    }

    return 0;
    }

    static int epoll_wakeup_chain(void *data) {
    struct epoll_test_cfg *tcfg = data;
    int i, res, epfd, bfd, nfd, pfds[2];
    pid_t pid;
    struct epoll_event evt;

    memset(&evt, 0, sizeof(evt));
    evt.events = EPOLLIN;

    epfd = bfd = xepoll_create(1);

    for (i = 0; i < tcfg->size; i++) {
    nfd = xepoll_create(1);
    xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
    bfd = nfd;
    }
    xpipe(pfds);
    if (tcfg->flags & EPOLL_TF_LOOP)
    {
    xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
    /*
    * If we're testing for a loop, we want the wakeup
    * triggered by the write to the pipe done in the child
    * process to trigger a fake event. So we add the pipe
    * read side with EPOLLOUT events. This will trigger
    * an addition to the ready-list, but no real events
    * will be there. Then the epoll kernel code will proceed
    * in calling f_op->poll() of the epfd, triggering the
    * loop we want to test.
    */
    evt.events = EPOLLOUT;
    }
    xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);

    /*
    * The pipe write must come after the poll(2) call inside
    * check_events(). This tests the nested wakeup code in
    * fs/eventpoll.c:ep_poll_safewake()
    * By having the check_events() (hence poll(2)) happen first,
    * we have the poll wait queue filled up, and the write(2) in the
    * child will trigger the wakeup chain.
    */
    if ((pid = xfork()) == 0) {
    sleep(1);
    write(pfds[1], "w", 1);
    exit(0);
    }

    res = check_events(epfd, 2000) & POLLIN;

    if (waitpid(pid, NULL, 0) != pid) {
    perror("waitpid");
    return -1;
    }

    return res;
    }

    static int epoll_poll_chain(void *data) {
    struct epoll_test_cfg *tcfg = data;
    int i, res, epfd, bfd, nfd, pfds[2];
    pid_t pid;
    struct epoll_event evt;

    memset(&evt, 0, sizeof(evt));
    evt.events = EPOLLIN;

    epfd = bfd = xepoll_create(1);

    for (i = 0; i < tcfg->size; i++) {
    nfd = xepoll_create(1);
    xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
    bfd = nfd;
    }
    xpipe(pfds);
    if (tcfg->flags & EPOLL_TF_LOOP)
    {
    xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
    /*
    * If we're testing for a loop, we want the wakeup
    * triggered by the write to the pipe done in the child
    * process to trigger a fake event. So we add the pipe
    * read side with EPOLLOUT events. This will trigger
    * an addition to the ready-list, but no real events
    * will be there. Then the epoll kernel code will proceed
    * in calling f_op->poll() of the epfd, triggering the
    * loop we want to test.
    */
    evt.events = EPOLLOUT;
    }
    xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);

    /*
    * The pipe write must come before the poll(2) call inside
    * check_events(). This tests the nested f_op->poll calls code in
    * fs/eventpoll.c:ep_eventpoll_poll()
    * By having the pipe write(2) happen first, we make the kernel
    * epoll code load the ready lists, and the following poll(2)
    * done inside check_events() will test the nested poll code in
    * ep_eventpoll_poll().
    */
    if ((pid = xfork()) == 0) {
    write(pfds[1], "w", 1);
    exit(0);
    }
    sleep(1);
    res = check_events(epfd, 1000) & POLLIN;

    if (waitpid(pid, NULL, 0) != pid) {
    perror("waitpid");
    return -1;
    }

    return res;
    }

    int main(int ac, char **av) {
    int error;
    struct epoll_test_cfg tcfg;

    fprintf(stdout, "\n********** Testing TTY events\n");
    error = run_forked_proc(epoll_test_tty, NULL);
    fprintf(stdout, error == 0 ?
    "********** OK\n": "********** FAIL (%d)\n", error);

    tcfg.size = 3;
    tcfg.flags = 0;
    fprintf(stdout, "\n********** Testing short wakeup chain\n");
    error = run_forked_proc(epoll_wakeup_chain, &tcfg);
    fprintf(stdout, error == POLLIN ?
    "********** OK\n": "********** FAIL (%d)\n", error);

    tcfg.size = EPOLL_MAX_CHAIN;
    tcfg.flags = 0;
    fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
    error = run_forked_proc(epoll_wakeup_chain, &tcfg);
    fprintf(stdout, error == 0 ?
    "********** OK\n": "********** FAIL (%d)\n", error);

    tcfg.size = 3;
    tcfg.flags = 0;
    fprintf(stdout, "\n********** Testing short poll chain\n");
    error = run_forked_proc(epoll_poll_chain, &tcfg);
    fprintf(stdout, error == POLLIN ?
    "********** OK\n": "********** FAIL (%d)\n", error);

    tcfg.size = EPOLL_MAX_CHAIN;
    tcfg.flags = 0;
    fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
    error = run_forked_proc(epoll_poll_chain, &tcfg);
    fprintf(stdout, error == 0 ?
    "********** OK\n": "********** FAIL (%d)\n", error);

    tcfg.size = 3;
    tcfg.flags = EPOLL_TF_LOOP;
    fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
    error = run_forked_proc(epoll_wakeup_chain, &tcfg);
    fprintf(stdout, error == 0 ?
    "********** OK\n": "********** FAIL (%d)\n", error);

    tcfg.size = 3;
    tcfg.flags = EPOLL_TF_LOOP;
    fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
    error = run_forked_proc(epoll_poll_chain, &tcfg);
    fprintf(stdout, error == 0 ?
    "********** OK\n": "********** FAIL (%d)\n", error);

    return 0;
    }

    Signed-off-by: Davide Libenzi
    Cc: Pavel Pisa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

16 Mar, 2009

1 commit

  • This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now,
    epoll remains the only user, but a future patch will use it to protect
    f_flags as well.

    Cc: Davide Libenzi
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     

30 Jan, 2009

1 commit

  • Linus suggested putting limits where the money is, and max_user_watches
    already does that without the need for max_user_instances. This has the
    advantage of mitigating the potential DoS while allowing a pretty generous
    default behavior.

    Allowing the top 4% of low memory (per user) to be allocated in epoll
    watches, we have:

    LOMEM    MAX_WATCHES (per user)
    512MB    ~178000
    1GB      ~356000
    2GB      ~712000

    As socket buffer math teaches us, a box with 512MB of lomem will have a
    hard time hitting 180K watches. So no more max_user_instances limits.

    Signed-off-by: Davide Libenzi
    Cc: Willy Tarreau
    Cc: Michael Kerrisk
    Cc: Bron Gondwana
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

14 Jan, 2009

1 commit


02 Dec, 2008

1 commit

  • It had been thought that the per-user file descriptor limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that lets a normal user request a pretty large
    amount of kernel memory, well within that user's maximum number of fds.
    To solve this problem, default limits are now imposed, and /proc based
    configuration has been introduced. A new directory has been created,
    named /proc/sys/fs/epoll/, and inside it there are two configuration
    points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by epoll
    to store "watches" to 1/32 of the amount of low RAM. As an example, a
    256MB 32-bit machine will have "max_user_watches" set to roughly 90000.
    That should be enough to not break existing heavy epoll users. The
    default value for "max_user_instances" is set to 128, which should be
    enough too.

    This also changes the userspace, because a new error code can now come out
    from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
    listed, so that should be ok.

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

27 Oct, 2008

1 commit

  • In commit f337b9c58332bdecde965b436e47ea4c94d30da0 ("epoll: drop
    unnecessary test") Thomas found that there is an unnecessary (always
    true) test in ep_send_events(). The callback never inserts into
    ->rdllink while the send loop is performed, and also does the
    ~EP_PRIVATE_BITS test. Given we're holding the mutex during this time,
    the conditions tested inside the loop are always true.

    HOWEVER.

    The test "!ep_is_linked(&epi->rdllink)" wasn't there because we insert
    into ->rdllink, but because the send-events loop might terminate before
    the whole list is scanned (-EFAULT).

    In such cases, when the loop terminates early, and when a (leftover)
    file received an event while we're performing the lockless loop, we need
    that test to avoid double-inserting the epoll items. The list_splice()
    done a few steps below will correctly re-insert the ones that were left
    on "txlist".

    This should fix the kernel.org bugzilla entry 11831.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

17 Oct, 2008

1 commit

  • Thomas found that there is an unnecessary (always true) test in
    ep_send_events(). The callback never inserts into ->rdllink while the
    send loop is performed, and also does the ~EP_PRIVATE_BITS test. Given
    we're holding the mutex during this time, the conditions tested inside the
    loop are always true. This patch drops the test done inside the
    re-insertion loop.

    Signed-off-by: Davide Libenzi
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

13 Aug, 2008

1 commit


25 Jul, 2008

4 commits

  • Remove the size parameter from the new epoll_create syscall and rename the
    syscall itself. The updated test program follows.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_epoll_create2
    # ifdef __x86_64__
    # define __NR_epoll_create2 291
    # elif defined __i386__
    # define __NR_epoll_create2 329
    # else
    # error "need __NR_epoll_create2"
    # endif
    #endif

    #define EPOLL_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
    int fd = syscall (__NR_epoll_create2, 0);
    if (fd == -1)
    {
    puts ("epoll_create2(0) failed");
    return 1;
    }
    int coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    puts ("epoll_create2(0) set close-on-exec flag");
    return 1;
    }
    close (fd);

    fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC);
    if (fd == -1)
    {
    puts ("epoll_create2(EPOLL_CLOEXEC) failed");
    return 1;
    }
    coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("epoll_create2(EPOLL_CLOEXEC) did not set close-on-exec flag");
    return 1;
    }
    close (fd);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds tests that ensure the boundary conditions for the various
    constants introduced in the previous patches are met. No code is generated.

    [akpm@linux-foundation.org: fix alpha]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds the new epoll_create2 syscall. It extends the old epoll_create
    syscall by one parameter which is meant to hold a flag value. In this
    patch the only flag supported is EPOLL_CLOEXEC, which causes the
    close-on-exec flag for the returned file descriptor to be set.

    A new name EPOLL_CLOEXEC is introduced which in this implementation must
    have the same value as O_CLOEXEC.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_epoll_create2
    # ifdef __x86_64__
    # define __NR_epoll_create2 291
    # elif defined __i386__
    # define __NR_epoll_create2 329
    # else
    # error "need __NR_epoll_create2"
    # endif
    #endif

    #define EPOLL_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
    int fd = syscall (__NR_epoll_create2, 1, 0);
    if (fd == -1)
    {
    puts ("epoll_create2(0) failed");
    return 1;
    }
    int coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    puts ("epoll_create2(0) set close-on-exec flag");
    return 1;
    }
    close (fd);

    fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC);
    if (fd == -1)
    {
    puts ("epoll_create2(EPOLL_CLOEXEC) failed");
    return 1;
    }
    coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("epoll_create2(EPOLL_CLOEXEC) did not set close-on-exec flag");
    return 1;
    }
    close (fd);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch just extends the anon_inode_getfd interface to take an additional
    parameter with a flag value. The flag value is passed on to
    get_unused_fd_flags in anticipation for a use with the O_CLOEXEC flag.

    No actual semantic changes here, the changed callers all pass 0 for now.

    [akpm@linux-foundation.org: KVM fix]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

02 May, 2008

1 commit

  • a) none of the callers even looks at inode or file returned by anon_inode_getfd()
    b) any caller that would try to look at those would be racy, since by the time
    it returns we might have raced with close() from another thread and that
    file would be pining for fjords.

    Signed-off-by: Al Viro

    Al Viro
     

30 Apr, 2008

2 commits

  • Change all the #ifdef TIF_RESTORE_SIGMASK conditionals in non-arch code to
    #ifdef HAVE_SET_RESTORE_SIGMASK. If arch code defines it first, the generic
    set_restore_sigmask() using TIF_RESTORE_SIGMASK is not defined.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This adds the set_restore_sigmask() inline and replaces every
    set_thread_flag(TIF_RESTORE_SIGMASK) with a call to it. No functional
    change, but it abstracts the details of the flag protocol from all the
    callers.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

29 Apr, 2008

1 commit

  • Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but
    rb_set_parent() accesses node's pointer in its code. This creates a
    warning in kmemcheck (reported by Vegard Nossum) about an uninitialized
    memory access. The warning is harmless since the following rb-tree node
    insert is going to overwrite the node data. In any case I think it's
    better to not have that happening at all, and fix it by simplifying the
    code to get rid of a few lines that became superfluous after the previous
    epoll changes.

    Signed-off-by: Davide Libenzi
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

06 Feb, 2008

1 commit

  • On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:

    > I remember I talked with Arjan about this time ago. Basically, since 1)
    > you can drop an epoll fd inside another epoll fd 2) callback-based wakeups
    > are used, you can see a wake_up() from inside another wake_up(), but they
    > will never refer to the same lock instance.
    > Think about:
    >
    > dfd = socket(...);
    > efd1 = epoll_create();
    > efd2 = epoll_create();
    > epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
    > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
    >
    > When a packet arrives to the device underneath "dfd", the net code will
    > issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
    > callback wakeup entry on that queue, and the wake_up() performed by the
    > "dfd" net code will end up in ep_poll_callback(). At this point epoll
    > (efd1) notices that it may have some event ready, so it needs to wake up
    > the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
    > that ends up in another wake_up(), after having checked about the
    > recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to
    > avoid stack blasting. Never hit the same queue, to avoid loops like:
    >
    > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
    > epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
    > epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
    > epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
    >
    > The code "if (tncur->wq == wq || ..." prevents re-entering the same
    > queue/lock.

    Since the epoll code is very careful not to nest same-instance locks,
    allow the recursion.

    Signed-off-by: Peter Zijlstra
    Tested-by: Stefan Richter
    Acked-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

07 Dec, 2007

1 commit


20 Oct, 2007

1 commit


19 Oct, 2007

1 commit

  • Get rid of sparse related warnings from places that use integer as NULL
    pointer.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Hemminger
    Cc: Andi Kleen
    Cc: Jeff Garzik
    Cc: Matt Mackall
    Cc: Ian Kent
    Cc: Arnd Bergmann
    Cc: Davide Libenzi
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

15 May, 2007

4 commits

  • Move the kfree() call inside the ep_free() function.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fixes some epoll code comments.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Changes the rwlock to a spinlock, and drops the use-count variable.
    Operations are always bound by the mutex now, so the use-count is no
    longer needed. For the same reason, the rwlock can become a simple
    spinlock.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Fixes the epoll single pass code. During the unlocked event delivery (to
    userspace) code, the poll callback can re-issue new events, and we must
    receive them correctly. Since we loop in a lockless fashion, want to be
    O(nready), and don't want to flash the spinlock on/off for every event, we
    have the poll callback use a secondary list to queue events while we're
    inside the event delivery loop. The rw_semaphore has been turned into a
    mutex. This patch also adds the wait-exclusive flag, as suggested by Davi
    Arnaut.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

11 May, 2007

1 commit