25 Jan, 2010

1 commit

  • KVM needs a way to atomically remove itself from the eventfd ->poll()
    wait queue head, in order to correctly handle its IRQfd deassign
    operation.

    This patch introduces such an API, plus a way to read an eventfd from its
    context.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Avi Kivity

    Davide Libenzi
     

23 Sep, 2009

1 commit

  • Split the anonfd interface into a bare file pointer creation one, and a
    file pointer creation plus install one.

    There are cases, like the usage of eventfds inside other kernel
    interfaces, where the file pointer created by anonfd needs to be used
    inside the initialization of other structures.

    As it is right now, as soon as anon_inode_getfd() returns, the kernel can
    race with userspace closing the newly installed file descriptor.

    This patch, while keeping the old anon_inode_getfd(), introduces a new
    anon_inode_getfile() (whose services are reused in anon_inode_getfd())
    that allows splitting the file creation phase from the fd install one.

    Once all the kernel structures are initialized, the code can call the
    proper fd_install().

    Gregory expressed the need for something like this inside KVM.

    Signed-off-by: Davide Libenzi
    Cc: Alexander Viro
    Cc: James Morris
    Cc: Peter Zijlstra
    Cc: Gregory Haskins
    Acked-by: Serge Hallyn
    Acked-by: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

01 Jul, 2009

1 commit

  • Change the eventfd interface to decouple the eventfd memory context from
    the file pointer instance.

    Without such a change, there is no clean, race-free way to handle the
    POLLHUP event sent when the last instance of the file* goes away. Also,
    the internal eventfd APIs now use the eventfd context instead of the
    file*.

    This patch is required by KVM's IRQfd code, which is still under
    development.

    Signed-off-by: Davide Libenzi
    Cc: Gregory Haskins
    Cc: Rusty Russell
    Cc: Benjamin LaHaise
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

01 Apr, 2009

1 commit

  • People started using eventfd in a semaphore-like way where before they
    were using pipes.

    That is, counter-based resource access, where a "wait()" returns
    immediately by decrementing the counter by one if the counter is greater
    than zero, and otherwise waits; and where a "post(count)" adds count to
    the counter, releasing the appropriate number of waiters. With eventfd the
    "post" (write) part is fine, while the "wait" (read) does not dequeue 1,
    but the whole counter value.

    The problem with eventfd is that a read() on the fd returns and wipes the
    whole counter, making the use of it as semaphore a little bit more
    cumbersome. You can do a read() followed by a write() of COUNTER-1, but
    IMO it's pretty easy and cheap to make this work w/out extra steps. This
    patch introduces a new eventfd flag that tells eventfd to only dequeue 1
    from the counter, allowing simple read/write to make it behave like a
    semaphore. Simple test here:

    http://www.xmailserver.org/eventfd-sem.c

    To be back-compatible with earlier kernels, userspace applications should
    probe for the availability of this feature via

    #ifdef EFD_SEMAPHORE
        fd = eventfd2 (CNT, EFD_SEMAPHORE);
        if (fd == -1 && errno == EINVAL)
            <fallback>
    #else
            <fallback>
    #endif
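
    The difference between the two read() behaviours can be sketched from
    userspace as follows. This is a minimal sketch, assuming the glibc
    eventfd(initval, flags) wrapper from <sys/eventfd.h> rather than the
    eventfd2 spelling used above; the helper names first_read_after_post(),
    waits_after_post() and semaphore_demo() are hypothetical and not part of
    the patch. EFD_NONBLOCK is added only so that a read on an empty counter
    fails instead of blocking.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Read one 8-byte counter value; returns 0 if read() fails (e.g. EAGAIN). */
static uint64_t efd_read(int fd)
{
    uint64_t v = 0;
    if (read(fd, &v, sizeof v) != (ssize_t)sizeof v)
        return 0;
    return v;
}

/*
 * Hypothetical helper: create an eventfd with `flags`, post `count` once,
 * and return what the first read() hands back (0 on any failure).
 */
uint64_t first_read_after_post(int flags, uint64_t count)
{
    uint64_t got = 0;
    int fd = eventfd(0, flags | EFD_NONBLOCK);
    if (fd == -1)
        return 0;
    if (write(fd, &count, sizeof count) == (ssize_t)sizeof count)
        got = efd_read(fd);
    close(fd);
    return got;
}

/*
 * Hypothetical helper: post `count` to a semaphore-mode eventfd and count
 * how many non-blocking "wait" (read) calls succeed before it runs dry.
 */
unsigned waits_after_post(uint64_t count)
{
    unsigned n = 0;
    int fd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    if (fd == -1)
        return 0;
    if (write(fd, &count, sizeof count) == (ssize_t)sizeof count)
        while (efd_read(fd) == 1)
            n++;
    close(fd);
    return n;
}

/* Checks the behaviour described above; aborts via assert() on mismatch. */
void semaphore_demo(void)
{
    assert(first_read_after_post(0, 3) == 3);             /* plain: whole counter */
    assert(first_read_after_post(EFD_SEMAPHORE, 3) == 1); /* semaphore: one unit */
    assert(waits_after_post(3) == 3);                     /* three waits succeed */
}
```

    Calling semaphore_demo() from a main() on a kernel that supports
    EFD_SEMAPHORE should return without tripping an assertion; on older
    kernels eventfd() fails with EINVAL and the helpers return 0.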

    Signed-off-by: Davide Libenzi
    Cc:
    Tested-by: Michael Kerrisk
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

25 Jul, 2008

2 commits

  • This patch adds support for the EFD_NONBLOCK flag to eventfd2. The
    additional changes needed are minimal.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_eventfd2
    # ifdef __x86_64__
    # define __NR_eventfd2 290
    # elif defined __i386__
    # define __NR_eventfd2 328
    # else
    # error "need __NR_eventfd2"
    # endif
    #endif

    #define EFD_NONBLOCK O_NONBLOCK

    int
    main (void)
    {
      int fd = syscall (__NR_eventfd2, 1, 0);
      if (fd == -1)
        {
          puts ("eventfd2(0) failed");
          return 1;
        }
      int fl = fcntl (fd, F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (fl & O_NONBLOCK)
        {
          puts ("eventfd2(0) sets non-blocking mode");
          return 1;
        }
      close (fd);

      fd = syscall (__NR_eventfd2, 1, EFD_NONBLOCK);
      if (fd == -1)
        {
          puts ("eventfd2(EFD_NONBLOCK) failed");
          return 1;
        }
      fl = fcntl (fd, F_GETFL);
      if (fl == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((fl & O_NONBLOCK) == 0)
        {
          puts ("eventfd2(EFD_NONBLOCK) does not set non-blocking mode");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
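
    The test above only checks that the flag is set on the descriptor. The
    observable effect of EFD_NONBLOCK can be sketched as follows: a read() on
    an eventfd whose counter is zero fails with EAGAIN instead of blocking.
    This is a sketch only, assuming the glibc eventfd(initval, flags) wrapper
    from <sys/eventfd.h> instead of the raw syscall; errno_of_empty_read() and
    nonblock_demo() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: read from a freshly created, empty, non-blocking
 * eventfd. Returns the errno set by read(), 0 if the read unexpectedly
 * succeeded, or -1 if the eventfd could not be created.
 */
int errno_of_empty_read(void)
{
    uint64_t v;
    int fd = eventfd(0, EFD_NONBLOCK);
    if (fd == -1)
        return -1;
    errno = 0;
    ssize_t n = read(fd, &v, sizeof v);
    int err = (n == -1) ? errno : 0;
    close(fd);
    return err;
}

/* Aborts via assert() if the empty read does not fail with EAGAIN. */
void nonblock_demo(void)
{
    assert(errno_of_empty_read() == EAGAIN);
}
```

    Without EFD_NONBLOCK the same read() would sleep until another thread or
    process writes to the eventfd, which is why the helper never creates a
    blocking descriptor.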

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds the new eventfd2 syscall. It extends the old eventfd
    syscall by one parameter which is meant to hold a flag value. In this
    patch the only flag supported is EFD_CLOEXEC, which causes the
    close-on-exec flag for the returned file descriptor to be set.

    A new name EFD_CLOEXEC is introduced which in this implementation must
    have the same value as O_CLOEXEC.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_eventfd2
    # ifdef __x86_64__
    # define __NR_eventfd2 290
    # elif defined __i386__
    # define __NR_eventfd2 328
    # else
    # error "need __NR_eventfd2"
    # endif
    #endif

    #define EFD_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
      int fd = syscall (__NR_eventfd2, 1, 0);
      if (fd == -1)
        {
          puts ("eventfd2(0) failed");
          return 1;
        }
      int coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if (coe & FD_CLOEXEC)
        {
          puts ("eventfd2(0) sets close-on-exec flag");
          return 1;
        }
      close (fd);

      fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC);
      if (fd == -1)
        {
          puts ("eventfd2(EFD_CLOEXEC) failed");
          return 1;
        }
      coe = fcntl (fd, F_GETFD);
      if (coe == -1)
        {
          puts ("fcntl failed");
          return 1;
        }
      if ((coe & FD_CLOEXEC) == 0)
        {
          puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag");
          return 1;
        }
      close (fd);

      puts ("OK");

      return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

30 Apr, 2008

1 commit


29 Jun, 2007

1 commit


11 May, 2007

1 commit

  • This is a very simple and light file descriptor that can be used for event
    wait/dispatch by userspace (both wait and dispatch) and by the kernel
    (dispatch only). It can be used instead of pipe(2) in all cases where pipes
    would simply be used to signal events. Its kernel overhead is much lower
    than that of pipes, and it does not consume two fds. When used in the
    kernel, it can offer an fd-bridge to enable, for example, functionalities
    like KAIO or syslets/threadlets to signal to an fd the completion of certain
    operations. More generally, an eventfd can be used by the kernel to signal
    readiness, in a POSIX poll/select way, of interfaces that would otherwise be
    incompatible with it. The API is:

    int eventfd(unsigned int count);

    The eventfd API accepts an initial "count" parameter, and returns an eventfd
    fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).

    The POLLIN flag is raised when the internal counter is greater than zero.

    The POLLOUT flag is raised when at least a value of "1" can be written to the
    internal counter.

    The POLLERR flag is raised when an overflow in the counter value is detected.
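
    The POLLIN and POLLOUT rules above can be observed from userspace with a
    zero-timeout poll(2). The following is a minimal sketch, assuming the
    later glibc eventfd(initval, flags) wrapper from <sys/eventfd.h> (the
    syscall introduced here takes only the count); poll_events_for_count()
    and poll_demo() are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an eventfd with the given initial counter and
 * return the revents reported by a zero-timeout poll(2), or -1 on error.
 */
int poll_events_for_count(unsigned int initval)
{
    int fd = eventfd(initval, 0);
    if (fd == -1)
        return -1;
    struct pollfd p = { .fd = fd, .events = POLLIN | POLLOUT };
    int rc = poll(&p, 1, 0);
    close(fd);
    return rc < 0 ? -1 : p.revents;
}

/* Aborts via assert() if the readiness flags differ from the description. */
void poll_demo(void)
{
    int empty = poll_events_for_count(0);
    int posted = poll_events_for_count(5);

    assert(empty != -1 && posted != -1);
    assert((empty & POLLIN) == 0);   /* counter is zero: nothing to read */
    assert(empty & POLLOUT);         /* a value of 1 can still be written */
    assert(posted & POLLIN);         /* counter is greater than zero */
}
```

    POLLERR is not exercised here, since per the description it is only
    raised by a counter overflow, which write(2) cannot cause.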

    The write(2) operation can never overflow the counter, since it blocks (unless
    O_NONBLOCK is set, in which case -EAGAIN is returned).

    But the eventfd_signal() function can do it, since it's supposed to not sleep
    during its operation.
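
    The non-overflowing behaviour of write(2) can be sketched as follows.
    The sketch assumes the glibc eventfd(initval, flags) wrapper and relies
    on the limit documented in the eventfd(2) man page, where the largest
    value the counter can hold is 0xfffffffffffffffe; that constant and the
    helper names errno_of_overflowing_write() and overflow_demo() are
    assumptions of this example, not text from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill a non-blocking eventfd up to its documented
 * maximum, then try to add 1 more. Returns the errno of the second write(),
 * 0 if it unexpectedly succeeded, or -1 if setup failed.
 */
int errno_of_overflowing_write(void)
{
    uint64_t max = UINT64_MAX - 1;  /* 0xfffffffffffffffe per eventfd(2) */
    uint64_t one = 1;
    int err = -1;
    int fd = eventfd(0, EFD_NONBLOCK);
    if (fd == -1)
        return -1;
    if (write(fd, &max, sizeof max) == (ssize_t)sizeof max) {
        errno = 0;
        err = (write(fd, &one, sizeof one) == -1) ? errno : 0;
    }
    close(fd);
    return err;
}

/* Aborts via assert() unless the extra write is refused with EAGAIN. */
void overflow_demo(void)
{
    assert(errno_of_overflowing_write() == EAGAIN);
}
```

    On a blocking descriptor the second write() would instead sleep until a
    read() makes room in the counter.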

    The read(2) function reads the __u64 counter value, and resets the internal
    value to zero. If the value read is equal to (__u64) -1, an overflow happened
    on the internal counter (due to 2^64 eventfd_signal() posts that have never
    been retired; unlikely, but possible).

    The write(2) call writes a __u64 count value, and adds it to the current
    counter. The eventfd fd also supports O_NONBLOCK.
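
    The accumulate-on-write, reset-on-read behaviour can be sketched like
    this. It is a minimal sketch, assuming the glibc eventfd(initval, flags)
    wrapper from <sys/eventfd.h>; post_twice_and_read() and
    accumulate_demo() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `a` and then `b` to a fresh eventfd and return
 * the value delivered by a single read(). Returns 0 on any failure.
 */
uint64_t post_twice_and_read(uint64_t a, uint64_t b)
{
    uint64_t got = 0;
    int fd = eventfd(0, 0);
    if (fd == -1)
        return 0;
    if (write(fd, &a, sizeof a) == (ssize_t)sizeof a &&
        write(fd, &b, sizeof b) == (ssize_t)sizeof b) {
        /* The counter is non-zero here, so this read() does not block. */
        if (read(fd, &got, sizeof got) != (ssize_t)sizeof got)
            got = 0;
    }
    close(fd);
    return got;
}

/* Aborts via assert() if the two posts are not summed into one read. */
void accumulate_demo(void)
{
    assert(post_twice_and_read(3, 4) == 7);
}
```

    Both values must be non-zero for the helper to be safe: writing two zeros
    leaves the counter at zero, and the read() on this blocking descriptor
    would then wait indefinitely.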

    On the kernel side, we have:

    struct file *eventfd_fget(int fd);
    int eventfd_signal(struct file *file, unsigned int n);

    The eventfd_fget() should be called to get a struct file* from an eventfd fd
    (this is an fget() + check of f_op being an eventfd fops pointer).

    The kernel can then call eventfd_signal() every time it wants to post an event
    to userspace. The eventfd_signal() function can be called from any context.
    An eventfd() simple test and bench is available here:

    http://www.xmailserver.org/eventfd-bench.c

    This is the eventfd-based version of pipetest-4 (pipe(2) based):

    http://www.xmailserver.org/pipetest-4.c

    Not that performance matters much in the eventfd case, but eventfd-bench
    shows almost double the performance of pipetest-4.

    [akpm@linux-foundation.org: fix i386 build]
    [akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi