07 May, 2020

1 commit

  • eventfd is using ->read() as its file_operations read handler, but
    this prevents passing in information about whether a given IO operation
    is blocking or not. We can only use the file flags for that. To support
    async (-EAGAIN/poll based) retries for io_uring, we need ->read_iter()
    support. Convert eventfd to using ->read_iter().

    With ->read_iter(), we can support IOCB_NOWAIT. Ensure the fd setup
    is done such that we set file->f_mode with FMODE_NOWAIT.

    [missing include added]

    Signed-off-by: Jens Axboe
    Signed-off-by: Al Viro

    Jens Axboe
     

04 Feb, 2020

1 commit

  • eventfd use cases from aio and io_uring can deadlock due to circular
    or recursive calling, when eventfd_signal() tries to grab the waitqueue
    lock. On top of that, it's also possible to construct notification
    chains that are deep enough that we could blow the stack.

    Add a percpu counter that tracks the percpu recursion depth, warn if we
    exceed it. The counter is also exposed so that users of eventfd_signal()
    can do the right thing if it's non-zero in the context where it is
    called.

    Cc: stable@vger.kernel.org # 4.19+
    Signed-off-by: Jens Axboe

    Jens Axboe
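
    The recursion guard described above can be sketched in userspace. This
    is a minimal model, not the kernel code: a thread-local counter stands
    in for the percpu counter, and the names signal_depth,
    signal_depth_now(), struct counter_ctx and ctx_signal() are all
    hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's percpu depth counter. */
static _Thread_local int signal_depth;

struct counter_ctx {
    uint64_t count;
    /* Optional chained context, signalled from inside the wakeup path. */
    struct counter_ctx *next;
};

/* Exposed so callers can check whether they are already in a signal path. */
static int signal_depth_now(void)
{
    return signal_depth;
}

/*
 * Bump the counter and signal the chained context. A nested call is
 * rejected instead of recursing (the kernel version warns at this point).
 */
static bool ctx_signal(struct counter_ctx *ctx, uint64_t n)
{
    if (signal_depth > 0)
        return false;

    signal_depth++;
    ctx->count += n;
    if (ctx->next)
        ctx_signal(ctx->next, n);   /* nested call: refused by the guard */
    signal_depth--;
    return true;
}
```

    With two contexts chained into a cycle, a call to ctx_signal() updates
    only the first one and returns instead of looping; a caller that sees a
    non-zero signal_depth_now() could defer its notification instead.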
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • Fix sparse warning:

    fs/eventfd.c:26:1: warning:
    symbol 'eventfd_ida' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20190413142348.34716-1-yuehaibing@huawei.com
    Signed-off-by: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YueHaibing
     
  • Finding endpoints of an IPC channel is one of the essential tasks to
    understand how a user program works. Procfs and netlink socket provide
    enough hints to find endpoints for IPC channels like pipes, unix
    sockets, and pseudo terminals. However, there is no simple way to find
    endpoints for an eventfd file from userland. An inode number doesn't
    hint. Unlike pipe, all eventfd files share the same inode object.

    To provide the way to find endpoints of an eventfd file, this patch adds
    "eventfd-id" field to /proc/PID/fdinfo of eventfd as identifier.
    Integers managed by an IDA are used as ids.

    A tool like lsof can utilize the information to print endpoints.

    Link: http://lkml.kernel.org/r/20190327181823.20222-1-yamato@redhat.com
    Signed-off-by: Masatake YAMATO
    Cc: Al Viro
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masatake YAMATO
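
    A tool could pick the new field out of fdinfo roughly as follows. This
    is a hedged sketch: fdinfo_field() is a hypothetical helper, and it
    assumes the "key: value" line layout that /proc/PID/fdinfo uses.

```c
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Look up "key:" in /proc/self/fdinfo/<fd> and copy its value into out,
 * with leading blanks and the trailing newline stripped.
 * Returns 0 on success, -1 if the file or the key is missing.
 */
static int fdinfo_field(int fd, const char *key, char *out, size_t outlen)
{
    char path[64], line[256];
    size_t klen = strlen(key);
    int found = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
    fp = fopen(path, "r");
    if (!fp)
        return -1;

    while (fgets(line, sizeof(line), fp)) {
        char *val;

        if (strncmp(line, key, klen) != 0 || line[klen] != ':')
            continue;
        val = line + klen + 1;
        val += strspn(val, " \t");          /* skip padding after the colon */
        val[strcspn(val, "\n")] = '\0';     /* drop the newline */
        snprintf(out, outlen, "%s", val);
        found = 0;
        break;
    }
    fclose(fp);
    return found;
}
```

    Calling fdinfo_field(fd, "eventfd-id", buf, sizeof(buf)) on an eventfd
    descriptor returns -1 on kernels that predate this patch, so callers
    should treat the field as optional.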
     

29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • The ->poll_mask() operation has a mask of events that the caller
    is interested in, but we're returning all events regardless.

    Change to return only the events the caller is interested in. This
    fixes aio IO_CMD_POLL returning immediately when called with POLLIN
    on an eventfd, since an eventfd is almost always ready for a write.

    Signed-off-by: Avi Kivity
    Signed-off-by: Al Viro

    Avi Kivity
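
    The same principle is visible from userspace with plain poll(2): only
    the requested events should come back. The sketch below is an
    illustration using poll(), not the aio IO_CMD_POLL path the commit
    fixes; ready_events() is a hypothetical helper.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Poll fd for the requested events without blocking and return the
 * events reported back (0 if none are ready, -1 on error).
 */
static int ready_events(int fd, short want)
{
    struct pollfd pfd = { .fd = fd, .events = want };
    int rc = poll(&pfd, 1, 0);

    if (rc < 0)
        return -1;
    return rc == 0 ? 0 : pfd.revents;
}
```

    On a freshly created eventfd with a zero counter, asking for POLLIN
    reports nothing while asking for POLLOUT reports the fd as writable,
    which is why an unfiltered mask makes a POLLIN waiter return at once.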
     

26 May, 2018

1 commit


03 Apr, 2018

1 commit


12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Feb, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of misc stuff, without any unifying topic, from various
    people.

    Neil's d_anon patch, several bugfixes, introduction of kvmalloc
    analogue of kmemdup_user(), extending bitfield.h to deal with
    fixed-endians, assorted cleanups all over the place..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
    alpha: osf_sys.c: use timespec64 where appropriate
    alpha: osf_sys.c: fix put_tv32 regression
    jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
    dcache: delete unused d_hash_mask
    dcache: subtract d_hash_shift from 32 in advance
    fs/buffer.c: fold init_buffer() into init_page_buffers()
    fs: fold __inode_permission() into inode_permission()
    fs: add RWF_APPEND
    sctp: use vmemdup_user() rather than badly open-coding memdup_user()
    snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
    replace_user_tlv(): switch to vmemdup_user()
    new primitive: vmemdup_user()
    memdup_user(): switch to GFP_USER
    eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
    eventfd: fold eventfd_ctx_read() into eventfd_read()
    eventfd: convert to use anon_inode_getfd()
    nfs4file: get rid of pointless include of btrfs.h
    uvc_v4l2: clean copyin/copyout up
    vme_user: don't use __copy_..._user()
    usx2y: don't bother with memdup_user() for 16-byte structure
    ...

    Linus Torvalds
     

07 Jan, 2018

3 commits

  • eventfd_ctx_get() is not used outside of eventfd.c, so unexport it and
    fold it into eventfd_ctx_fileget().

    (eventfd_ctx_get() was apparently added years ago for KVM irqfd's, but
    was never used.)

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     
  • eventfd_ctx_read() is not used outside of eventfd.c, so unexport it and
    fold it into eventfd_read(). This slightly simplifies the code and
    makes it more analogous to eventfd_write().

    (eventfd_ctx_read() was apparently added years ago for KVM irqfd's, but
    was never used.)

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     
  • Nothing actually calls eventfd_file_create() besides the eventfd2()
    system call itself. So simplify things by folding it into the system
    call and using anon_inode_getfd() instead of anon_inode_getfile(). This
    removes over 40 lines with no change in functionality.

    (eventfd_file_create() was apparently added years ago for KVM irqfd's,
    but was never used.)

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

28 Nov, 2017

1 commit


04 Jul, 2017

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "There has been a fair amount of activity in the docs tree this time
    around. Highlights include:

    - Conversion of a bunch of security documentation into RST

    - The conversion of the remaining DocBook templates by The Amazing
    Mauro Machine. We can now drop the entire DocBook build chain.

    - The usual collection of fixes and minor updates"

    * tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
    scripts/kernel-doc: handle DECLARE_HASHTABLE
    Documentation: atomic_ops.txt is core-api/atomic_ops.rst
    Docs: clean up some DocBook loose ends
    Make the main documentation title less Geocities
    Docs: Use kernel-figure in vidioc-g-selection.rst
    Docs: fix table problems in ras.rst
    Docs: Fix breakage with Sphinx 1.5 and upper
    Docs: Include the Latex "ifthen" package
    doc/kokr/howto: Only send regression fixes after -rc1
    docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
    doc: Document suitability of IBM Verse for kernel development
    Doc: fix a markup error in coding-style.rst
    docs: driver-api: i2c: remove some outdated information
    Documentation: DMA API: fix a typo in a function name
    Docs: Insert missing space to separate link from text
    doc/ko_KR/memory-barriers: Update control-dependencies example
    Documentation, kbuild: fix typo "minimun" -> "minimum"
    docs: Fix some formatting issues in request-key.rst
    doc: ReSTify keys-trusted-encrypted.txt
    doc: ReSTify keys-request-key.txt
    ...

    Linus Torvalds
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 May, 2017

1 commit


02 Mar, 2017

1 commit


23 Mar, 2016

1 commit

  • Since commit e22553e2a25e ("eventfd: don't take the spinlock in
    eventfd_poll", 2015-02-17), eventfd is reading ctx->count outside
    ctx->wqh.lock.

    However, things aren't as simple as the read barrier in eventfd_poll
    would suggest. In fact, the read barrier, besides lacking a comment, is
    not paired in any obvious manner with another read barrier, and it is
    pointless because it is sitting between a write (deep in poll_wait) and
    the read of ctx->count. The read barrier is acting just as a compiler
    barrier, for which we can use READ_ONCE instead. This is what the code
    change in this patch does.

    The documentation change is just as important, however. The question,
    posed by Andrea Arcangeli, is then why the thing is safe on
    architectures where spin_unlock does not imply a store-load memory
    barrier. The answer is that it's safe because writes of ctx->count use
    the same lock as poll_wait, and hence an acquire barrier implicit in
    poll_wait provides the necessary synchronization between eventfd_poll
    and callers of wake_up_locked_poll. This is sort of mentioned in the
    commit message with respect to eventfd_ctx_read ("eventfd_read is
    similar, it will do a single decrement with the lock held") but it
    applies to all other callers too. It's tricky enough that it should be
    documented in the code.

    Signed-off-by: Paolo Bonzini
    Reviewed-by: Andrea Arcangeli
    Cc: Chris Mason
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paolo Bonzini
     

08 Dec, 2015

1 commit


18 Feb, 2015

1 commit

  • The spinlock in eventfd_poll is trying to protect the count of events so
    it can decide if it should return POLLIN, POLLERR, or POLLOUT. But,
    because of the way we drop the lock after calling poll_wait, and drop it
    again before returning, we have the same pile of races with the lock as
    we do with a single read of ctx->count().

    This replaces the lock with a read barrier and single read.

    eventfd_write does a single bump of ctx->count, so this should not add
    new races with adding events. eventfd_read is similar, it will do a
    single decrement with the lock held, and so we're making the race with
    concurrent readers slightly larger.

    This spinlock is the top CPU user in kernel code during one of our
    workloads. Removing it gives us a ~2% boost.

    [arnd@arndb.de: avoid unused variable warning]
    [dan.carpenter@oracle.com: type bug in eventfd_poll()]
    Signed-off-by: Chris Mason
    Cc: Davide Libenzi
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Mason
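
    The idea of deriving the whole event mask from a single read of the
    counter can be modelled in userspace as below. This is a sketch under
    assumptions: struct efd_model and efd_model_poll() are hypothetical
    names, and the thresholds are an assumption about how the counter maps
    to events, not something this commit message spells out.

```c
#include <poll.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's eventfd context. */
struct efd_model {
    uint64_t count;
};

/* Compute the event mask from one snapshot of the counter. */
static short efd_model_poll(const struct efd_model *ctx)
{
    /* volatile access: the shared counter is read exactly once */
    uint64_t count = *(const volatile uint64_t *)&ctx->count;
    short events = 0;

    if (count > 0)
        events |= POLLIN;       /* there is something to read */
    if (count == UINT64_MAX)
        events |= POLLERR;      /* counter has overflowed */
    if (count < UINT64_MAX - 1)
        events |= POLLOUT;      /* room left for a write of at least 1 */
    return events;
}
```

    Because every test uses the same local snapshot, a concurrent writer
    can make the result stale but never internally inconsistent.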
     

06 Nov, 2014

1 commit

  • Callers of the seq_printf functions shouldn't really check the return
    value. An occasional seq_has_overflowed() check is used instead.

    Update vfs documentation.

    Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com

    Cc: David S. Miller
    Cc: Al Viro
    Signed-off-by: Joe Perches
    [ did a few clean ups ]
    Signed-off-by: Steven Rostedt

    Joe Perches
     

25 Jan, 2014

1 commit


18 Dec, 2012

1 commit

  • This allows us to print out the raw counter value. The /proc/pid/fdinfo/fd
    output is

    | pos: 0
    | flags: 04002
    | eventfd-count: 5a

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

01 Jun, 2012

1 commit

  • eventfd_ctx->count is an __u64 counter which is allowed to reach
    ULLONG_MAX. eventfd_write() adds a __u64 value to "count", but the kernel
    side eventfd_signal() only adds an int value to it. Make them consistent.

    [akpm@linux-foundation.org: update interface documentation]
    Signed-off-by: Sha Zhengju
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
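
    With a 64-bit argument on both paths, the add can be written once for
    the full range. The sketch below assumes the in-kernel helper clamps at
    the maximum rather than wrapping, which is an assumption and not
    something the commit message states; counter_add_clamped() is a
    hypothetical name.

```c
#include <stdint.h>

/*
 * Add n to *count without wrapping past UINT64_MAX.
 * Returns the amount that was actually added.
 */
static uint64_t counter_add_clamped(uint64_t *count, uint64_t n)
{
    uint64_t room = UINT64_MAX - *count;

    if (n > room)
        n = room;       /* clamp instead of overflowing */
    *count += n;
    return n;
}
```

    A value such as 1ULL << 40, which does not fit in an int, can be passed
    through unchanged once the parameter is 64 bits wide.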
     

29 Feb, 2012

1 commit


22 Feb, 2011

1 commit


15 Oct, 2010

1 commit

  • All file_operations should get a .llseek operation so we can make
    nonseekable_open the default for future file operations without a
    .llseek pointer.

    The three cases that we can automatically detect are no_llseek, seq_lseek
    and default_llseek. For cases where we can automatically prove that
    the file offset is always ignored, we use noop_llseek, which maintains
    the current behavior of not returning an error from a seek.

    New drivers should normally not use noop_llseek but instead use no_llseek
    and call nonseekable_open at open time. Existing drivers can be converted
    to do the same when the maintainer knows for certain that no user code
    relies on calling seek on the device file.

    The generated code is often incorrectly indented and right now contains
    comments that clarify for each added line why a specific variant was
    chosen. In the version that gets submitted upstream, the comments will
    be gone and I will manually fix the indentation, because there does not
    seem to be a way to do that using coccinelle.

    Some amount of new code is currently sitting in linux-next that should get
    the same modifications, which I will do at the end of the merge window.

    Many thanks to Julia Lawall for helping me learn to write a semantic
    patch that does all this.

    ===== begin semantic patch =====
    // This adds an llseek= method to all file operations,
    // as a preparation for making no_llseek the default.
    //
    // The rules are
    // - use no_llseek explicitly if we do nonseekable_open
    // - use seq_lseek for sequential files
    // - use default_llseek if we know we access f_pos
    // - use noop_llseek if we know we don't access f_pos,
    // but we still want to allow users to call lseek
    //
    @ open1 exists @
    identifier nested_open;
    @@
    nested_open(...)
    {

    }

    @ open exists@
    identifier open_f;
    identifier i, f;
    identifier open1.nested_open;
    @@
    int open_f(struct inode *i, struct file *f)
    {

    }

    @ read disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {

    }

    @ read_no_fpos disable optional_qualifier exists @
    identifier read_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ write @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    expression E;
    identifier func;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {

    }

    @ write_no_fpos @
    identifier write_f;
    identifier f, p, s, off;
    type ssize_t, size_t, loff_t;
    @@
    ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
    {
    ... when != off
    }

    @ fops0 @
    identifier fops;
    @@
    struct file_operations fops = {
    ...
    };

    @ has_llseek depends on fops0 @
    identifier fops0.fops;
    identifier llseek_f;
    @@
    struct file_operations fops = {
    ...
    .llseek = llseek_f,
    ...
    };

    @ has_read depends on fops0 @
    identifier fops0.fops;
    identifier read_f;
    @@
    struct file_operations fops = {
    ...
    .read = read_f,
    ...
    };

    @ has_write depends on fops0 @
    identifier fops0.fops;
    identifier write_f;
    @@
    struct file_operations fops = {
    ...
    .write = write_f,
    ...
    };

    @ has_open depends on fops0 @
    identifier fops0.fops;
    identifier open_f;
    @@
    struct file_operations fops = {
    ...
    .open = open_f,
    ...
    };

    // use no_llseek if we call nonseekable_open
    ////////////////////////////////////////////
    @ nonseekable1 depends on !has_llseek && has_open @
    identifier fops0.fops;
    identifier nso ~= "nonseekable_open";
    @@
    struct file_operations fops = {
    ... .open = nso, ...
    +.llseek = no_llseek, /* nonseekable */
    };

    @ nonseekable2 depends on !has_llseek @
    identifier fops0.fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ... .open = open_f, ...
    +.llseek = no_llseek, /* open uses nonseekable */
    };

    // use seq_lseek for sequential files
    /////////////////////////////////////
    @ seq depends on !has_llseek @
    identifier fops0.fops;
    identifier sr ~= "seq_read";
    @@
    struct file_operations fops = {
    ... .read = sr, ...
    +.llseek = seq_lseek, /* we have seq_read */
    };

    // use default_llseek if there is a readdir
    ///////////////////////////////////////////
    @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier readdir_e;
    @@
    // any other fop is used that changes pos
    struct file_operations fops = {
    ... .readdir = readdir_e, ...
    +.llseek = default_llseek, /* readdir is present */
    };

    // use default_llseek if at least one of read/write touches f_pos
    /////////////////////////////////////////////////////////////////
    @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read.read_f;
    @@
    // read fops use offset
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = default_llseek, /* read accesses f_pos */
    };

    @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ... .write = write_f, ...
    + .llseek = default_llseek, /* write accesses f_pos */
    };

    // Use noop_llseek if neither read nor write accesses f_pos
    ///////////////////////////////////////////////////////////

    @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    identifier write_no_fpos.write_f;
    @@
    // write fops use offset
    struct file_operations fops = {
    ...
    .write = write_f,
    .read = read_f,
    ...
    +.llseek = noop_llseek, /* read and write both use no f_pos */
    };

    @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier write_no_fpos.write_f;
    @@
    struct file_operations fops = {
    ... .write = write_f, ...
    +.llseek = noop_llseek, /* write uses no f_pos */
    };

    @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    identifier read_no_fpos.read_f;
    @@
    struct file_operations fops = {
    ... .read = read_f, ...
    +.llseek = noop_llseek, /* read uses no f_pos */
    };

    @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
    identifier fops0.fops;
    @@
    struct file_operations fops = {
    ...
    +.llseek = noop_llseek, /* no read or write fn */
    };
    ===== End semantic patch =====

    Signed-off-by: Arnd Bergmann
    Cc: Julia Lawall
    Cc: Christoph Hellwig

    Arnd Bergmann
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to place the new include so that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Jan, 2010

1 commit

  • KVM needs a way to atomically remove itself from the eventfd ->poll()
    wait queue head, in order to correctly handle its IRQfd deassign
    operation.

    This patch introduces such an API, plus a way to read an eventfd from its
    context.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Avi Kivity

    Davide Libenzi
     

23 Dec, 2009

1 commit

  • It seems a couple places such as arch/ia64/kernel/perfmon.c and
    drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile()
    instead of a private pseudo-fs + alloc_file(), if only there were a way
    to get a read-only file. So provide this by having anon_inode_getfile()
    create a read-only file if we pass O_RDONLY in flags.

    Signed-off-by: Roland Dreier
    Signed-off-by: Al Viro

    Roland Dreier
     

23 Sep, 2009

1 commit

  • Split the anonfd interface into a bare file pointer creation one, and a
    file pointer creation plus install one.

    There are cases, like the usage of eventfds inside other kernel
    interfaces, where the file pointer created by anonfd needs to be used
    inside the initialization of other structures.

    As it is right now, as soon as anon_inode_getfd() returns, the kernel can
    race with userspace closing the newly installed file descriptor.

    This patch, while keeping the old anon_inode_getfd(), introduces a new
    anon_inode_getfile() (whose services are reused in anon_inode_getfd())
    that allows splitting the file creation phase from the fd install one.

    Once all the kernel structures are initialized, the code can call the
    proper fd_install().

    Gregory manifested the need for something like this inside KVM.

    Signed-off-by: Davide Libenzi
    Cc: Alexander Viro
    Cc: James Morris
    Cc: Peter Zijlstra
    Cc: Gregory Haskins
    Acked-by: Serge Hallyn
    Acked-by: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

01 Jul, 2009

1 commit

  • Change the eventfd interface to de-couple the eventfd memory context, from
    the file pointer instance.

    Without such a change, there is no clean, race-free way to handle the
    POLLHUP event sent when the last instance of the file* goes away. Also,
    now the internal eventfd APIs are using the eventfd context instead of the
    file*.

    This patch is required by KVM's IRQfd code, which is still under
    development.

    Signed-off-by: Davide Libenzi
    Cc: Gregory Haskins
    Cc: Rusty Russell
    Cc: Benjamin LaHaise
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

12 Jun, 2009

1 commit


01 Apr, 2009

2 commits

  • Introduce keyed event wakeups inside the eventfd code.

    Signed-off-by: Davide Libenzi
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • People started using eventfd in a semaphore-like way where before they
    were using pipes.

    That is, counter-based resource access, where a "wait()" returns
    immediately by decrementing the counter by one if the counter is greater
    than zero, and otherwise waits, and where a "post(count)" adds count to
    the counter, releasing the appropriate number of waiters. With eventfd,
    the "post" (write) part is fine, while the "wait" (read) does not
    dequeue 1, but the whole counter value.

    The problem with eventfd is that a read() on the fd returns and wipes the
    whole counter, making the use of it as semaphore a little bit more
    cumbersome. You can do a read() followed by a write() of COUNTER-1, but
    IMO it's pretty easy and cheap to make this work w/out extra steps. This
    patch introduces a new eventfd flag that tells eventfd to only dequeue 1
    from the counter, allowing simple read/write to make it behave like a
    semaphore. Simple test here:

    http://www.xmailserver.org/eventfd-sem.c

    To be back-compatible with earlier kernels, userspace applications should
    probe for the availability of this feature via

    #ifdef EFD_SEMAPHORE
    fd = eventfd2 (CNT, EFD_SEMAPHORE);
    if (fd == -1 && errno == EINVAL)

    #else

    #endif

    Signed-off-by: Davide Libenzi
    Cc:
    Tested-by: Michael Kerrisk
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
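
    The difference in read() behaviour can be sketched with two small
    wrappers. This is an illustrative sketch that assumes the glibc
    eventfd() wrapper from <sys/eventfd.h> rather than the raw eventfd2
    call shown above; efd_take() and efd_post() are hypothetical names.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* "wait": read the eventfd once; returns the value handed back, 0 on error */
static uint64_t efd_take(int fd)
{
    uint64_t val = 0;

    if (read(fd, &val, sizeof(val)) != (ssize_t)sizeof(val))
        return 0;
    return val;
}

/* "post": add n to the counter; returns 0 on success, -1 on error */
static int efd_post(int fd, uint64_t n)
{
    return write(fd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}
```

    On a plain eventfd holding 3, one efd_take() returns 3 and empties the
    counter. On one created with EFD_SEMAPHORE, three efd_take() calls each
    return 1.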
     

14 Jan, 2009

1 commit


25 Jul, 2008

2 commits

  • This patch adds tests that ensure the boundary conditions for the various
    constants introduced in the previous patches are met. No code is generated.

    [akpm@linux-foundation.org: fix alpha]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds support for the EFD_NONBLOCK flag to eventfd2. The
    additional changes needed are minimal.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_eventfd2
    # ifdef __x86_64__
    # define __NR_eventfd2 290
    # elif defined __i386__
    # define __NR_eventfd2 328
    # else
    # error "need __NR_eventfd2"
    # endif
    #endif

    #define EFD_NONBLOCK O_NONBLOCK

    int
    main (void)
    {
    int fd = syscall (__NR_eventfd2, 1, 0);
    if (fd == -1)
    {
    puts ("eventfd2(0) failed");
    return 1;
    }
    int fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (fl & O_NONBLOCK)
    {
    puts ("eventfd2(0) sets non-blocking mode");
    return 1;
    }
    close (fd);

    fd = syscall (__NR_eventfd2, 1, EFD_NONBLOCK);
    if (fd == -1)
    {
    puts ("eventfd2(EFD_NONBLOCK) failed");
    return 1;
    }
    fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((fl & O_NONBLOCK) == 0)
    {
    puts ("eventfd2(EFD_NONBLOCK) does not set non-blocking mode");
    return 1;
    }
    close (fd);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper