01 Nov, 2011

1 commit

  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.

    - Use of /proc/pid/mem has been considered, but there are issues with
    using it:
    - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
    written to would need to be contiguous.
    - Currently mem_read allows only processes who are currently
    ptrace'ing the target and are still able to ptrace the target to read
    from the target. This check could possibly be moved to the open call,
    but its not clear exactly what race this restriction is stopping
    (reason appears to have been lost)
    - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
    domain socket is a bit ugly from a userspace point of view,
    especially when you may have hundreds if not (eventually) thousands
    of processes that all need to do this with each other
    - Doesn't allow for some future use of the interface we would like to
    consider adding in the future (see below)
    - Interestingly reading from /proc/pid/mem currently actually
    involves two copies! (But this could be fixed pretty easily)

    As mentioned previously use of vmsplice instead was considered, but has
    problems. Since you need the reader and writer working co-operatively if
    the pipe is not drained then you block. Which requires some wrapping to
    do non blocking on the send side or polling on the receive. In all to all
    communication it requires ordering otherwise you can deadlock. And in the
    example of many MPI tasks writing to one MPI task vmsplice serialises the
    copying.

    There are some cases of MPI collectives where even a single copy interface
    does not get us the performance gain we could. For example in an
    MPI_Reduce rather than copy the data from the source we would like to
    instead use it directly in a mathops (say the reduce is doing a sum) as
    this would save us doing a copy. We don't need to keep a copy of the data
    from the source. I haven't implemented this, but I think this interface
    could in the future do all this through the use of the flags - eg could
    specify the math operation and type and the kernel rather than just
    copying the data would apply the specified operation between the source
    and destination and store it in the destination.

    Although we don't have a "second user" of the interface (though I've had
    some nibbles from people who may be interested in using it for intra
    process messaging which is not MPI). This interface is something which
    hardware vendors are already doing for their custom drivers to implement
    fast local communication. And so in addition to this being useful for
    OpenMPI it would mean the driver maintainers don't have to fix things up
    when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

    This has been implemented for x86 and powerpc, other architecture should
    mainly (I think) just need to add syscall numbers for the process_vm_readv
    and process_vm_writev. There are 32 bit compatibility versions for
    64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

27 Aug, 2011

1 commit


21 May, 2011

1 commit

  • When building with:

    CONFIG_64BIT=y
    CONFIG_MIPS32_COMPAT=y
    CONFIG_COMPAT=y
    CONFIG_MIPS32_O32=y
    CONFIG_MIPS32_N32=y
    CONFIG_SYSVIPC is not set
    (and implicitly: CONFIG_SYSVIPC_COMPAT is not set)

    the final link fails with unresolved symbols for:

    compat_sys_semctl, compat_sys_msgsnd, compat_sys_msgrcv,
    compat_sys_shmctl, compat_sys_msgctl, compat_sys_semtimedop

    The fix is to add cond_syscall declarations for all syscalls in
    ipc/compat.c

    Signed-off-by: Kevin Cernekee
    Acked-by: Ralf Baechle
    Acked-by: Arnd Bergmann
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Kevin Cernekee
     

06 May, 2011

1 commit

  • This patch adds a multiple message send syscall and is the send
    version of the existing recvmmsg syscall. This is heavily
    based on the patch by Arnaldo that added recvmmsg.

    I wrote a microbenchmark to test the performance gains of using
    this new syscall:

    http://ozlabs.org/~anton/junkcode/sendmmsg_test.c

    The test was run on a ppc64 box with a 10 Gbit network card. The
    benchmark can send both UDP and RAW ethernet packets.

    64B UDP

    batch pkts/sec
    1 804570
    2 872800 (+ 8 %)
    4 916556 (+14 %)
    8 939712 (+17 %)
    16 952688 (+18 %)
    32 956448 (+19 %)
    64 964800 (+20 %)

    64B raw socket

    batch pkts/sec
    1 1201449
    2 1350028 (+12 %)
    4 1461416 (+22 %)
    8 1513080 (+26 %)
    16 1541216 (+28 %)
    32 1553440 (+29 %)
    64 1557888 (+30 %)

    We see a 20% improvement in throughput on UDP send and 30%
    on raw socket send.

    [ Add sparc syscall entries. -DaveM ]

    Signed-off-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Anton Blanchard
     

15 Mar, 2011

2 commits


23 Sep, 2010

1 commit


28 Jul, 2010

2 commits


13 Mar, 2010

1 commit

  • Add a generic implementation of the ipc demultiplexer syscall. Except for
    s390 and sparc64 all implementations of the sys_ipc are nearly identical.

    There are slight differences in the types of the parameters, where mips
    and powerpc as the only 64-bit architectures with sys_ipc use unsigned
    long for the "third" argument as it gets casted to a pointer later, while
    it traditionally is an "int" like most other paramters. frv goes even
    further and uses unsigned long for all parameters execept for "ptr" which
    is a pointer type everywhere. The change from int to unsigned long for
    "third" and back to "int" for the others on frv should be fine due to the
    in-register calling conventions for syscalls (we already had a similar
    issue with the generic sys_ptrace), but I'd prefer to have the arch
    maintainers looks over this in details.

    Except for that h8300, m68k and m68knommu lack an impplementation of the
    semtimedop sub call which this patch adds, and various architectures have
    gets used - at least on i386 it seems superflous as the compat code on
    x86-64 and ia64 doesn't even bother to implement it.

    [akpm@linux-foundation.org: add sys_ipc to sys_ni.c]
    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Acked-by: Jesper Nilsson
    Acked-by: Russell King
    Acked-by: David Howells
    Acked-by: Kyle McMartin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

08 Dec, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
    mac80211: fix reorder buffer release
    iwmc3200wifi: Enable wimax core through module parameter
    iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
    iwmc3200wifi: Coex table command does not expect a response
    iwmc3200wifi: Update wiwi priority table
    iwlwifi: driver version track kernel version
    iwlwifi: indicate uCode type when fail dump error/event log
    iwl3945: remove duplicated event logging code
    b43: fix two warnings
    ipw2100: fix rebooting hang with driver loaded
    cfg80211: indent regulatory messages with spaces
    iwmc3200wifi: fix NULL pointer dereference in pmkid update
    mac80211: Fix TX status reporting for injected data frames
    ath9k: enable 2GHz band only if the device supports it
    airo: Fix integer overflow warning
    rt2x00: Fix padding bug on L2PAD devices.
    WE: Fix set events not propagated
    b43legacy: avoid PPC fault during resume
    b43: avoid PPC fault during resume
    tcp: fix a timewait refcnt race
    ...

    Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
    CTL_UNNUMBERED removed) in
    kernel/sysctl_check.c
    net/ipv4/sysctl_net_ipv4.c
    net/ipv6/addrconf.c
    net/sctp/sysctl.c

    Linus Torvalds
     

06 Nov, 2009

1 commit


13 Oct, 2009

1 commit

  • Meaning receive multiple messages, reducing the number of syscalls and
    net stack entry/exit operations.

    Next patches will introduce mechanisms where protocols that want to
    optimize this operation will provide an unlocked_recvmsg operation.

    This takes into account comments made by:

    . Paul Moore: sock_recvmsg is called only for the first datagram,
    sock_recvmsg_nosec is used for the rest.

    . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
    works in the same fashion as the ppoll one.

    If the underlying protocol returns a datagram with MSG_OOB set, this
    will make recvmmsg return right away with as many datagrams (+ the OOB
    one) it has received so far.

    . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
    datagrams and then recvmsg returns an error, recvmmsg will return
    the successfully received datagrams, store the error and return it
    in the next call.

    This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
    where we will be able to acquire the lock only at batch start and end, not at
    every underlying recvmsg call.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

25 Sep, 2009

1 commit


22 Sep, 2009

1 commit

  • sparc64 allnoconfig:

    arch/sparc/kernel/built-in.o(.text+0x134e0): In function `sys32_recvfrom':
    : undefined reference to `compat_sys_recvfrom'
    arch/sparc/kernel/built-in.o(.text+0x134e4): In function `sys32_recvfrom':
    : undefined reference to `compat_sys_recvfrom'

    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Andrew Morton
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Jan, 2009

1 commit


14 Jan, 2009

1 commit


08 Dec, 2008

1 commit

  • Implement the core kernel bits of Performance Counters subsystem.

    The Linux Performance Counter subsystem provides an abstraction of
    performance counter hardware capabilities. It provides per task and per
    CPU counters, and it provides event capabilities on top of those.

    Performance counters are accessed via special file descriptors.
    There's one file descriptor per virtual counter used.

    The special file descriptor is opened via the perf_counter_open()
    system call:

    int
    perf_counter_open(u32 hw_event_type,
    u32 hw_event_period,
    u32 record_type,
    pid_t pid,
    int cpu);

    The syscall returns the new fd. The fd can be used via the normal
    VFS system calls: read() can be used to read the counter, fcntl()
    can be used to set the blocking mode, etc.

    Multiple counters can be kept open at a time, and the counters
    can be poll()ed.

    See more details in Documentation/perf-counters.txt.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

20 Nov, 2008

1 commit

  • Introduce a new accept4() system call. The addition of this system call
    matches analogous changes in 2.6.27 (dup3(), evenfd2(), signalfd4(),
    inotify_init1(), epoll_create1(), pipe2()) which added new system calls
    that differed from analogous traditional system calls in adding a flags
    argument that can be used to access additional functionality.

    The accept4() system call is exactly the same as accept(), except that
    it adds a flags bit-mask argument. Two flags are initially implemented.
    (Most of the new system calls in 2.6.27 also had both of these flags.)

    SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
    for the new file descriptor returned by accept4(). This is a useful
    security feature to avoid leaking information in a multithreaded
    program where one thread is doing an accept() at the same time as
    another thread is doing a fork() plus exec(). More details here:
    http://udrepper.livejournal.com/20407.html "Secure File Descriptor Handling",
    Ulrich Drepper).

    The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
    to be enabled on the new open file description created by accept4().
    (This flag is merely a convenience, saving the use of additional calls
    fcntl(F_GETFL) and fcntl (F_SETFL) to achieve the same result.

    Here's a test program. Works on x86-32. Should work on x86-64, but
    I (mtk) don't have a system to hand to test with.

    It tests accept4() with each of the four possible combinations of
    SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
    that the appropriate flags are set on the file descriptor/open file
    description returned by accept4().

    I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
    and it passes according to my test program.

    /* test_accept4.c

    Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk

    Licensed under the GNU GPLv2 or later.
    */
    #define _GNU_SOURCE
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define PORT_NUM 33333

    #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

    /**********************************************************************/

    /* The following is what we need until glibc gets a wrapper for
    accept4() */

    /* Flags for socket(), socketpair(), accept4() */
    #ifndef SOCK_CLOEXEC
    #define SOCK_CLOEXEC O_CLOEXEC
    #endif
    #ifndef SOCK_NONBLOCK
    #define SOCK_NONBLOCK O_NONBLOCK
    #endif

    #ifdef __x86_64__
    #define SYS_accept4 288
    #elif __i386__
    #define USE_SOCKETCALL 1
    #define SYS_ACCEPT4 18
    #else
    #error "Sorry -- don't know the syscall # on this architecture"
    #endif

    static int
    accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
    {
    printf("Calling accept4(): flags = %x", flags);
    if (flags != 0) {
    printf(" (");
    if (flags & SOCK_CLOEXEC)
    printf("SOCK_CLOEXEC");
    if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
    printf(" ");
    if (flags & SOCK_NONBLOCK)
    printf("SOCK_NONBLOCK");
    printf(")");
    }
    printf("\n");

    #if USE_SOCKETCALL
    long args[6];

    args[0] = fd;
    args[1] = (long) sockaddr;
    args[2] = (long) addrlen;
    args[3] = flags;

    return syscall(SYS_socketcall, SYS_ACCEPT4, args);
    #else
    return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
    #endif
    }

    /**********************************************************************/

    static int
    do_test(int lfd, struct sockaddr_in *conn_addr,
    int closeonexec_flag, int nonblock_flag)
    {
    int connfd, acceptfd;
    int fdf, flf, fdf_pass, flf_pass;
    struct sockaddr_in claddr;
    socklen_t addrlen;

    printf("=======================================\n");

    connfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connfd == -1)
    die("socket");
    if (connect(connfd, (struct sockaddr *) conn_addr,
    sizeof(struct sockaddr_in)) == -1)
    die("connect");

    addrlen = sizeof(struct sockaddr_in);
    acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
    closeonexec_flag | nonblock_flag);
    if (acceptfd == -1) {
    perror("accept4()");
    close(connfd);
    return 0;
    }

    fdf = fcntl(acceptfd, F_GETFD);
    if (fdf == -1)
    die("fcntl:F_GETFD");
    fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
    ((closeonexec_flag & SOCK_CLOEXEC) != 0);
    printf("Close-on-exec flag is %sset (%s); ",
    (fdf & FD_CLOEXEC) ? "" : "not ",
    fdf_pass ? "OK" : "failed");

    flf = fcntl(acceptfd, F_GETFL);
    if (flf == -1)
    die("fcntl:F_GETFD");
    flf_pass = ((flf & O_NONBLOCK) != 0) ==
    ((nonblock_flag & SOCK_NONBLOCK) !=0);
    printf("nonblock flag is %sset (%s)\n",
    (flf & O_NONBLOCK) ? "" : "not ",
    flf_pass ? "OK" : "failed");

    close(acceptfd);
    close(connfd);

    printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
    return fdf_pass && flf_pass;
    }

    static int
    create_listening_socket(int port_num)
    {
    struct sockaddr_in svaddr;
    int lfd;
    int optval;

    memset(&svaddr, 0, sizeof(struct sockaddr_in));
    svaddr.sin_family = AF_INET;
    svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    svaddr.sin_port = htons(port_num);

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd == -1)
    die("socket");

    optval = 1;
    if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
    sizeof(optval)) == -1)
    die("setsockopt");

    if (bind(lfd, (struct sockaddr *) &svaddr,
    sizeof(struct sockaddr_in)) == -1)
    die("bind");

    if (listen(lfd, 5) == -1)
    die("listen");

    return lfd;
    }

    int
    main(int argc, char *argv[])
    {
    struct sockaddr_in conn_addr;
    int lfd;
    int port_num;
    int passed;

    passed = 1;

    port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;

    memset(&conn_addr, 0, sizeof(struct sockaddr_in));
    conn_addr.sin_family = AF_INET;
    conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    conn_addr.sin_port = htons(port_num);

    lfd = create_listening_socket(port_num);

    if (!do_test(lfd, &conn_addr, 0, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
    passed = 0;

    close(lfd);

    exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    [mtk.manpages@gmail.com: rewrote changelog, updated test program]
    Signed-off-by: Ulrich Drepper
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

17 Oct, 2008

1 commit

  • This patchs adds the CONFIG_AIO option which allows to remove support
    for asynchronous I/O operations, that are not necessarly used by
    applications, particularly on embedded devices. As this is a
    size-reduction option, it depends on CONFIG_EMBEDDED. It allows to
    save ~7 kilobytes of kernel code/data:

    text data bss dec hex filename
    1115067 119180 217088 1451335 162547 vmlinux
    1108025 119048 217088 1444161 160941 vmlinux.new
    -7042 -132 0 -7174 -1C06 +/-

    This patch has been originally written by Matt Mackall
    , and is part of the Linux Tiny project.

    [randy.dunlap@oracle.com: build fix]
    Signed-off-by: Thomas Petazzoni
    Cc: Benjamin LaHaise
    Cc: Zach Brown
    Signed-off-by: Matt Mackall
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Petazzoni
     

30 Sep, 2008

1 commit

  • This patch adds the CONFIG_FILE_LOCKING option which allows to remove
    support for advisory locks. With this patch enabled, the flock()
    system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl()
    and NFS support are disabled. These features are not necessarly needed
    on embedded systems. It allows to save ~11 Kb of kernel code and data:

    text data bss dec hex filename
    1125436 118764 212992 1457192 163c28 vmlinux.old
    1114299 118564 212992 1445855 160fdf vmlinux
    -11137 -200 0 -11337 -2C49 +/-

    This patch has originally been written by Matt Mackall
    , and is part of the Linux Tiny project.

    Signed-off-by: Thomas Petazzoni
    Signed-off-by: Matt Mackall
    Cc: matthew@wil.cx
    Cc: linux-fsdevel@vger.kernel.org
    Cc: mpm@selenic.com
    Cc: akpm@linux-foundation.org
    Signed-off-by: J. Bruce Fields

    Thomas Petazzoni
     

26 Jul, 2008

2 commits


25 Jul, 2008

4 commits

  • This patch introduces the new syscall inotify_init1 (note: the 1 stands for
    the one parameter the syscall takes, as opposed to no parameter before). The
    values accepted for this parameter are function-specific and defined in the
    inotify.h header. Here the values must match the O_* flags, though. In this
    patch CLOEXEC support is introduced.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include

    #ifndef __NR_inotify_init1
    # ifdef __x86_64__
    # define __NR_inotify_init1 294
    # elif defined __i386__
    # define __NR_inotify_init1 332
    # else
    # error "need __NR_inotify_init1"
    # endif
    #endif

    #define IN_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
    int fd;
    fd = syscall (__NR_inotify_init1, 0);
    if (fd == -1)
    {
    puts ("inotify_init1(0) failed");
    return 1;
    }
    int coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    puts ("inotify_init1(0) set close-on-exit");
    return 1;
    }
    close (fd);

    fd = syscall (__NR_inotify_init1, IN_CLOEXEC);
    if (fd == -1)
    {
    puts ("inotify_init1(IN_CLOEXEC) failed");
    return 1;
    }
    coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit");
    return 1;
    }
    close (fd);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds the new eventfd2 syscall. It extends the old eventfd
    syscall by one parameter which is meant to hold a flag value. In this
    patch the only flag support is EFD_CLOEXEC which causes the close-on-exec
    flag for the returned file descriptor to be set.

    A new name EFD_CLOEXEC is introduced which in this implementation must
    have the same value as O_CLOEXEC.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include

    #ifndef __NR_eventfd2
    # ifdef __x86_64__
    # define __NR_eventfd2 290
    # elif defined __i386__
    # define __NR_eventfd2 328
    # else
    # error "need __NR_eventfd2"
    # endif
    #endif

    #define EFD_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
    int fd = syscall (__NR_eventfd2, 1, 0);
    if (fd == -1)
    {
    puts ("eventfd2(0) failed");
    return 1;
    }
    int coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    puts ("eventfd2(0) sets close-on-exec flag");
    return 1;
    }
    close (fd);

    fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC);
    if (fd == -1)
    {
    puts ("eventfd2(EFD_CLOEXEC) failed");
    return 1;
    }
    coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag");
    return 1;
    }
    close (fd);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds the new signalfd4 syscall. It extends the old signalfd
    syscall by one parameter which is meant to hold a flag value. In this
    patch the only flag support is SFD_CLOEXEC which causes the close-on-exec
    flag for the returned file descriptor to be set.

    A new name SFD_CLOEXEC is introduced which in this implementation must
    have the same value as O_CLOEXEC.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include
    #include

    #ifndef __NR_signalfd4
    # ifdef __x86_64__
    # define __NR_signalfd4 289
    # elif defined __i386__
    # define __NR_signalfd4 327
    # else
    # error "need __NR_signalfd4"
    # endif
    #endif

    #define SFD_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
    sigset_t ss;
    sigemptyset (&ss);
    sigaddset (&ss, SIGUSR1);
    int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0);
    if (fd == -1)
    {
    puts ("signalfd4(0) failed");
    return 1;
    }
    int coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    puts ("signalfd4(0) set close-on-exec flag");
    return 1;
    }
    close (fd);

    fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC);
    if (fd == -1)
    {
    puts ("signalfd4(SFD_CLOEXEC) failed");
    return 1;
    }
    coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag");
    return 1;
    }
    close (fd);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch is by far the most complex in the series. It adds a new syscall
    paccept. This syscall differs from accept in that it adds (at the userlevel)
    two additional parameters:

    - a signal mask
    - a flags value

    The flags parameter can be used to set flag like SOCK_CLOEXEC. This is
    imlpemented here as well. Some people argued that this is a property which
    should be inherited from the file desriptor for the server but this is against
    POSIX. Additionally, we really want the signal mask parameter as well
    (similar to pselect, ppoll, etc). So an interface change in inevitable.

    The flag value is the same as for socket and socketpair. I think diverging
    here will only create confusion. Similar to the filesystem interfaces where
    the use of the O_* constants differs, it is acceptable here.

    The signal mask is handled as for pselect etc. The mask is temporarily
    installed for the thread and removed before the call returns. I modeled the
    code after pselect. If there is a problem it's likely also in pselect.

    For architectures which use socketcall I maintained this interface instead of
    adding a system call. The symmetry shouldn't be broken.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef __NR_paccept
    # ifdef __x86_64__
    # define __NR_paccept 288
    # elif defined __i386__
    # define SYS_PACCEPT 18
    # define USE_SOCKETCALL 1
    # else
    # error "need __NR_paccept"
    # endif
    #endif

    #ifdef USE_SOCKETCALL
    # define paccept(fd, addr, addrlen, mask, flags) \
    ({ long args[6] = { \
    (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
    syscall (__NR_socketcall, SYS_PACCEPT, args); })
    #else
    # define paccept(fd, addr, addrlen, mask, flags) \
    syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
    #endif

    #define PORT 57392

    #define SOCK_CLOEXEC O_CLOEXEC

    static pthread_barrier_t b;

    static void *
    tf (void *arg)
    {
    pthread_barrier_wait (&b);
    int s = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);

    pthread_barrier_wait (&b);
    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    pthread_barrier_wait (&b);
    sleep (2);
    pthread_kill ((pthread_t) arg, SIGUSR1);

    return NULL;
    }

    static void
    handler (int s)
    {
    }

    int
    main (void)
    {
    pthread_barrier_init (&b, NULL, 2);

    struct sockaddr_in sin;
    pthread_t th;
    if (pthread_create (&th, NULL, tf, (void *) pthread_self ()) != 0)
    {
    puts ("pthread_create failed");
    return 1;
    }

    int s = socket (AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    int s2 = paccept (s, NULL, 0, NULL, 0);
    if (s2 < 0)
    {
    puts ("paccept(0) failed");
    return 1;
    }

    int coe = fcntl (s2, F_GETFD);
    if (coe & FD_CLOEXEC)
    {
    puts ("paccept(0) set close-on-exec-flag");
    return 1;
    }
    close (s2);

    pthread_barrier_wait (&b);

    s2 = paccept (s, NULL, 0, NULL, SOCK_CLOEXEC);
    if (s2 < 0)
    {
    puts ("paccept(SOCK_CLOEXEC) failed");
    return 1;
    }

    coe = fcntl (s2, F_GETFD);
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("paccept(SOCK_CLOEXEC) does not set close-on-exec flag");
    return 1;
    }
    close (s2);

    pthread_barrier_wait (&b);

    struct sigaction sa;
    sa.sa_handler = handler;
    sa.sa_flags = 0;
    sigemptyset (&sa.sa_mask);
    sigaction (SIGUSR1, &sa, NULL);

    sigset_t ss;
    pthread_sigmask (SIG_SETMASK, NULL, &ss);
    sigaddset (&ss, SIGUSR1);
    pthread_sigmask (SIG_SETMASK, &ss, NULL);

    sigdelset (&ss, SIGUSR1);
    alarm (4);
    pthread_barrier_wait (&b);

    errno = 0 ;
    s2 = paccept (s, NULL, 0, &ss, 0);
    if (s2 != -1 || errno != EINTR)
    {
    puts ("paccept did not fail with EINTR");
    return 1;
    }

    close (s);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: make it compile]
    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Cc: "David S. Miller"
    Cc: Roland McGrath
    Cc: Kyle McMartin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

23 Jul, 2008

1 commit


06 Feb, 2008

1 commit

  • This is the new timerfd API as it is implemented by the following patch:

    int timerfd_create(int clockid, int flags);
    int timerfd_settime(int ufd, int flags,
    const struct itimerspec *utmr,
    struct itimerspec *otmr);
    int timerfd_gettime(int ufd, struct itimerspec *otmr);

    The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
    parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.

    The timerfd_settime() API give new settings by the timerfd fd, by optionally
    retrieving the previous expiration time (in case the "otmr" parameter is not
    NULL).

    The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
    is set in the "flags" parameter. Otherwise it's a relative time.

    The timerfd_gettime() API returns the next expiration time of the timer, or
    {0, 0} if the timerfd has not been set yet.

    Like the previous timerfd API implementation, read(2) and poll(2) are
    supported (with the same interface). Here's a simple test program I used to
    exercise the new timerfd APIs:

    http://www.xmailserver.org/timerfd-test2.c

    [akpm@linux-foundation.org: coding-style cleanups]
    [akpm@linux-foundation.org: fix ia64 build]
    [akpm@linux-foundation.org: fix m68k build]
    [akpm@linux-foundation.org: fix mips build]
    [akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
    [heiko.carstens@de.ibm.com: fix s390]
    [akpm@linux-foundation.org: fix powerpc build]
    [akpm@linux-foundation.org: fix sparc64 more]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Thomas Gleixner
    Cc: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

24 Jan, 2008

1 commit

  • Using 64k pages on 64-bit PowerPC systems makes life difficult for
    emulators that are trying to emulate an ISA, such as x86, which use a
    smaller page size, since the emulator can no longer use the MMU and
    the normal system calls for controlling page protections. Of course,
    the emulator can emulate the MMU by checking and possibly remapping
    the address for each memory access in software, but that is pretty
    slow.

    This provides a facility for such programs to control the access
    permissions on individual 4k sub-pages of 64k pages. The idea is
    that the emulator supplies an array of protection masks to apply to a
    specified range of virtual addresses. These masks are applied at the
    level where hardware PTEs are inserted into the hardware page table
    based on the Linux PTEs, so the Linux PTEs are not affected. Note
    that this new mechanism does not allow any access that would otherwise
    be prohibited; it can only prohibit accesses that would otherwise be
    allowed. This new facility is only available on 64-bit PowerPC and
    only when the kernel is configured for 64k pages.

    The masks are supplied using a new subpage_prot system call, which
    takes a starting virtual address and length, and a pointer to an array
    of protection masks in memory. The array has a 32-bit word per 64k
    page to be protected; each 32-bit word consists of 16 2-bit fields,
    for which 0 allows any access (that is otherwise allowed), 1 prevents
    write accesses, and 2 or 3 prevent any access.

    Implicit in this is that the regions of the address space that are
    protected are switched to use 4k hardware pages rather than 64k
    hardware pages (on machines with hardware 64k page support). In fact
    the whole process is switched to use 4k hardware pages when the
    subpage_prot system call is used, but this could be improved in future
    to switch only the affected segments.

    The subpage protection bits are stored in a 3 level tree akin to the
    page table tree. The top level of this tree is stored in a structure
    that is appended to the top level of the page table tree, i.e., the
    pgd array. Since it will often only be 32-bit addresses (below 4GB)
    that are protected, the pointers to the first four bottom level pages
    are also stored in this structure (each bottom level page contains the
    protection bits for 1GB of address space), so the protection bits for
    addresses below 4GB can be accessed with one fewer loads than those
    for higher addresses.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     

31 Oct, 2007

1 commit


17 Oct, 2007

1 commit


17 Jul, 2007

1 commit

  • OpenVZ Linux kernel team has discovered the problem with 32bit quota tools
    working on 64bit architectures. In 2.6.10 kernel sys32_quotactl() function
    was replaced by sys_quotactl() with the comment "sys_quotactl seems to be
    32/64bit clean, enable it for 32bit" However this isn't right. Look at
    if_dqblk structure:

    struct if_dqblk {
    __u64 dqb_bhardlimit;
    __u64 dqb_bsoftlimit;
    __u64 dqb_curspace;
    __u64 dqb_ihardlimit;
    __u64 dqb_isoftlimit;
    __u64 dqb_curinodes;
    __u64 dqb_btime;
    __u64 dqb_itime;
    __u32 dqb_valid;
    };

    For 32 bit quota tools sizeof(if_dqblk) == 0x44.
    But for 64 bit kernel its size is 0x48, 'cause of alignment!
    Thus we got a problem. Attached patch reintroduce sys32_quotactl() function,
    that handles this and related situations.

    [michal.k.k.piotrowski@gmail.com: build fix]
    [akpm@linux-foundation.org: Make it link with CONFIG_QUOTA=n]
    Signed-off-by: Vasily Tarasov
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Jan Kara
    Cc:
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Tarasov
     

13 May, 2007

1 commit


11 May, 2007

3 commits

  • This is a very simple and light file descriptor, that can be used as event
    wait/dispatch by userspace (both wait and dispatch) and by the kernel
    (dispatch only). It can be used instead of pipe(2) in all cases where those
    would simply be used to signal events. Their kernel overhead is much lower
    than pipes, and they do not consume two fds. When used in the kernel, it can
    offer an fd-bridge to enable, for example, functionalities like KAIO or
    syslets/threadlets to signal to an fd the completion of certain operations.
    But more in general, an eventfd can be used by the kernel to signal readiness,
    in a POSIX poll/select way, of interfaces that would otherwise be incompatible
    with it. The API is:

    int eventfd(unsigned int count);

    The eventfd API accepts an initial "count" parameter, and returns an eventfd
    fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).

    The POLLIN flag is raised when the internal counter is greater than zero.

    The POLLOUT flag is raised when at least a value of "1" can be written to the
    internal counter.

    The POLLERR flag is raised when an overflow in the counter value is detected.

    The write(2) operation can never overflow the counter, since it blocks (unless
    O_NONBLOCK is set, in which case -EAGAIN is returned).

    But the eventfd_signal() function can do it, since it's supposed to not sleep
    during its operation.

    The read(2) function reads the __u64 counter value, and reset the internal
    value to zero. If the value read is equal to (__u64) -1, an overflow happened
    on the internal counter (due to 2^64 eventfd_signal() posts that has never
    been retired - unlickely, but possible).

    The write(2) call writes an __u64 count value, and adds it to the current
    counter. The eventfd fd supports O_NONBLOCK also.

    On the kernel side, we have:

    struct file *eventfd_fget(int fd);
    int eventfd_signal(struct file *file, unsigned int n);

    The eventfd_fget() should be called to get a struct file* from an eventfd fd
    (this is an fget() + check of f_op being an eventfd fops pointer).

    The kernel can then call eventfd_signal() every time it wants to post an event
    to userspace. The eventfd_signal() function can be called from any context.
    An eventfd() simple test and bench is available here:

    http://www.xmailserver.org/eventfd-bench.c

    This is the eventfd-based version of pipetest-4 (pipe(2) based):

    http://www.xmailserver.org/pipetest-4.c

    Not that performance matters much in the eventfd case, but eventfd-bench
    shows almost as double as performance than pipetest-4.

    [akpm@linux-foundation.org: fix i386 build]
    [akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • This patch introduces a new system call for timers events delivered though
    file descriptors. This allows timer event to be used with standard POSIX
    poll(2), select(2) and read(2). As a consequence of supporting the Linux
    f_op->poll subsystem, they can be used with epoll(2) too.

    The system call is defined as:

    int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);

    The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
    w/out going through the close/open cycle (same as signalfd). If "ufd" is -1,
    s new file descriptor will be created, otherwise the existing "ufd" will be
    re-programmed.

    The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME. The time
    specified in the "utmr->it_value" parameter is the expiry time for the timer.

    If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
    otherwise it's a relative time.

    If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
    tv_nsec == 0), this is the period at which the following ticks should be
    generated.

    The "utmr->it_interval" should be set to zero if only one tick is requested.
    Setting the "utmr->it_value" to zero will disable the timer, or will create a
    timerfd without the timer enabled.

    The function returns the new (or same, in case "ufd" is a valid timerfd
    descriptor) file, or -1 in case of error.

    As stated before, the timerfd file descriptor supports poll(2), select(2) and
    epoll(2). When a timer event happened on the timerfd, a POLLIN mask will be
    returned.

    The read(2) call can be used, and it will return a u32 variable holding the
    number of "ticks" that happened on the interface since the last call to
    read(2). The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
    be returned if no ticks happened.

    A quick test program, shows timerfd working correctly on my amd64 box:

    http://www.xmailserver.org/timerfd-test.c

    [akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor if created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
    to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
    signalfd file.

    The "mask" allows to specify the signal mask of signals that we are interested
    in. The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can use
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    in the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0, in case the sighand structure to which the signalfd was attached,
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is lower than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of the
    struct signalfd_siginfo is, and the valid fields depends of the (->code &
    __SI_MASK) value, in the same way a struct siginfo would:

    struct signalfd_siginfo {
    __u32 signo; /* si_signo */
    __s32 err; /* si_errno */
    __s32 code; /* si_code */
    __u32 pid; /* si_pid */
    __u32 uid; /* si_uid */
    __s32 fd; /* si_fd */
    __u32 tid; /* si_fd */
    __u32 band; /* si_band */
    __u32 overrun; /* si_overrun */
    __u32 trapno; /* si_trapno */
    __s32 status; /* si_status */
    __s32 svint; /* si_int */
    __u64 svptr; /* si_ptr */
    __u64 utime; /* si_utime */
    __u64 stime; /* si_stime */
    __u64 addr; /* si_addr */
    };

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

04 Nov, 2006

1 commit


17 Oct, 2006

1 commit