13 Oct, 2009

2 commits

  • Meaning receive multiple messages, reducing the number of syscalls and
    net stack entry/exit operations.

    Next patches will introduce mechanisms where protocols that want to
    optimize this operation will provide an unlocked_recvmsg operation.

    This takes into account comments made by:

    . Paul Moore: sock_recvmsg is called only for the first datagram,
    sock_recvmsg_nosec is used for the rest.

    . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
    works in the same fashion as the ppoll one.

    If the underlying protocol returns a datagram with MSG_OOB set, this
    will make recvmmsg return right away with as many datagrams (+ the OOB
    one) it has received so far.

    . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
    datagrams and then recvmsg returns an error, recvmmsg will return
    the successfully received datagrams, store the error and return it
    in the next call.

    This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
    where we will be able to acquire the lock only at batch start and end, not at
    every underlying recvmsg call.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Create a new socket level option to report number of queue overflows

    Recently I augmented the AF_PACKET protocol to report the number of frames lost
    on the socket receive queue between any two enqueued frames. This value was
    exported via a SOL_PACKET level cmsg. AFter I completed that work it was
    requested that this feature be generalized so that any datagram oriented socket
    could make use of this option. As such I've created this patch, It creates a
    new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
    SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
    overflowed between any two given frames. It also augments the AF_PACKET
    protocol to take advantage of this new feature (as it previously did not touch
    sk->sk_drops, which this patch uses to record the overflow count). Tested
    successfully by me.

    Notes:

    1) Unlike my previous patch, this patch simply records the sk_drops value, which
    is not a number of drops between packets, but rather a total number of drops.
    Deltas must be computed in user space.

    2) While this patch currently works with datagram oriented protocols, it will
    also be accepted by non-datagram oriented protocols. I'm not sure if thats
    agreeable to everyone, but my argument in favor of doing so is that, for those
    protocols which aren't applicable to this option, sk_drops will always be zero,
    and reporting no drops on a receive queue that isn't used for those
    non-participating protocols seems reasonable to me. This also saves us having
    to code in a per-protocol opt in mechanism.

    3) This applies cleanly to net-next assuming that commit
    977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted

    Signed-off-by: Neil Horman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

10 Oct, 2009

1 commit


08 Oct, 2009

1 commit

  • Refactor wext to
    * split out iwpriv handling
    * split out iwspy handling
    * split out procfs support
    * allow cfg80211 to have wireless extensions compat code
    w/o CONFIG_WIRELESS_EXT

    After this, drivers need to
    - select WIRELESS_EXT - for wext support
    - select WEXT_PRIV - for iwpriv support
    - select WEXT_SPY - for iwspy support

    except cfg80211 -- which gets new hooks in wext-core.c
    and can then get wext handlers without CONFIG_WIRELESS_EXT.

    Wireless extensions procfs support is auto-selected
    based on PROC_FS and anything that requires the wext core
    (i.e. WIRELESS_EXT or CFG80211_WEXT).

    Signed-off-by: Johannes Berg
    Signed-off-by: John W. Linville

    Johannes Berg
     

07 Oct, 2009

1 commit

  • An incoming datagram must bring into cpu cache *lot* of cache lines,
    in particular : (other parts omitted (hash chains, ip route cache...))

    On 32bit arches :

    offsetof(struct sock, sk_rcvbuf) =0x30 (read)
    offsetof(struct sock, sk_lock) =0x34 (rw)

    offsetof(struct sock, sk_sleep) =0x50 (read)
    offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
    offsetof(struct sock, sk_receive_queue)=0x74 (rw)

    offsetof(struct sock, sk_forward_alloc)=0x98 (rw)

    offsetof(struct sock, sk_callback_lock)=0xcc (rw)
    offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
    offsetof(struct sock, sk_filter) =0xf8 (read)

    offsetof(struct sock, sk_socket) =0x138 (read)

    offsetof(struct sock, sk_data_ready) =0x15c (read)

    We can avoid sk->sk_socket and socket->fasync_list referencing on sockets
    with no fasync() structures. (socket->fasync_list ptr is probably already in cache
    because it shares a cache line with socket->wait, ie location pointed by sk->sk_sleep)

    This avoids one cache line load per incoming packet for common cases (no fasync())

    We can leave (or even move in a future patch) sk->sk_socket in a cold location

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Sep, 2009

1 commit

  • The sys_socketcall() function has a very clever system for the copy
    size of its arguments. Unfortunately, gcc cannot deal with this in
    terms of proving that the copy_from_user() is then always in bounds.
    This is the last (well 9th of this series, but last in the kernel) such
    case around.

    With this patch, we can turn on code to make having the boundary provably
    right for the whole kernel, and detect introduction of new security
    accidents of this type early on.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
     

23 Sep, 2009

1 commit

  • Move various magic-number definitions into magic.h.

    Signed-off-by: Nick Black
    Acked-by: Pekka Enberg
    Cc: Al Viro
    Cc: "David S. Miller"
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Black
     

22 Sep, 2009

1 commit


15 Sep, 2009

1 commit


14 Aug, 2009

1 commit

  • kernel_sendpage() does the proper default case handling for when the
    socket doesn't have a native sendpage implementation.

    Now, arguably this might be something that we could instead solve by
    just specifying that all protocols should do it themselves at the
    protocol level, but we really only care about the common protocols.
    Does anybody really care about sendpage on something like Appletalk? Not
    likely.

    Acked-by: David S. Miller
    Acked-by: Julien TINNES
    Acked-by: Tavis Ormandy
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Apr, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    b44: Use kernel DMA addresses for the kernel DMA API
    forcedeth: Fix resume from hibernation regression.
    xfrm: fix fragmentation on inter family tunnels
    ibm_newemac: Fix dangerous struct assumption
    gigaset: documentation update
    gigaset: in file ops, check for device disconnect before anything else
    bas_gigaset: use tasklet_hi_schedule for timing critical tasklets
    net/802/fddi.c: add MODULE_LICENSE
    smsc911x: remove unused #include
    axnet_cs: fix phy_id detection for bogus Asix chip.
    bnx2: Use request_firmware()
    b44: Fix sizes passed to b44_sync_dma_desc_for_{device,cpu}()
    socket: use percpu_add() while updating sockets_in_use
    virtio_net: Set the mac config only when VIRITO_NET_F_MAC
    myri_sbus: use request_firmware
    e1000: fix loss of multicast packets
    vxge: should include tcp.h

    Conflict in firmware/WHENCE (SCSI vs net firmware)

    Linus Torvalds
     

05 Apr, 2009

1 commit

  • sock_alloc() currently uses following code to update sockets_in_use

    get_cpu_var(sockets_in_use)++;
    put_cpu_var(sockets_in_use);

    This translates to :

    c0436274: b8 01 00 00 00 mov $0x1,%eax
    c0436279: e8 42 40 df ff call c022a2c0
    c043627e: bb 20 4f 6a c0 mov $0xc06a4f20,%ebx
    c0436283: e8 18 ca f0 ff call c0342ca0
    c0436288: 03 1c 85 60 4a 65 c0 add -0x3f9ab5a0(,%eax,4),%ebx
    c043628f: ff 03 incl (%ebx)
    c0436291: b8 01 00 00 00 mov $0x1,%eax
    c0436296: e8 75 3f df ff call c022a210
    c043629b: 89 e0 mov %esp,%eax
    c043629d: 25 00 e0 ff ff and $0xffffe000,%eax
    c04362a2: f6 40 08 08 testb $0x8,0x8(%eax)
    c04362a6: 75 07 jne c04362af
    c04362a8: 8d 46 d8 lea -0x28(%esi),%eax
    c04362ab: 5b pop %ebx
    c04362ac: 5e pop %esi
    c04362ad: c9 leave
    c04362ae: c3 ret
    c04362af: e8 cc 5d 09 00 call c04cc080
    c04362b4: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
    c04362b8: eb ee jmp c04362a8

    While percpu_add(sockets_in_use, 1) translates to a single instruction :

    c0436275: 64 83 05 20 5f 6a c0 addl $0x1,%fs:0xc06a5f20

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Mar, 2009

3 commits

  • The socket_post_accept() hook is not currently used by any in-tree modules
    and its existence continues to cause problems by confusing people about
    what can be safely accomplished using this hook. If a legitimate need for
    this hook arises in the future it can always be reintroduced.

    Signed-off-by: Paul Moore
    Signed-off-by: James Morris

    Paul Moore
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits)
    fs: avoid I_NEW inodes
    Merge code for single and multiple-instance mounts
    Remove get_init_pts_sb()
    Move common mknod_ptmx() calls into caller
    Parse mount options just once and copy them to super block
    Unroll essentials of do_remount_sb() into devpts
    vfs: simple_set_mnt() should return void
    fs: move bdev code out of buffer.c
    constify dentry_operations: rest
    constify dentry_operations: configfs
    constify dentry_operations: sysfs
    constify dentry_operations: JFS
    constify dentry_operations: OCFS2
    constify dentry_operations: GFS2
    constify dentry_operations: FAT
    constify dentry_operations: FUSE
    constify dentry_operations: procfs
    constify dentry_operations: ecryptfs
    constify dentry_operations: CIFS
    constify dentry_operations: AFS
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

27 Mar, 2009

1 commit


16 Mar, 2009

1 commit

  • Removing the BKL from FASYNC handling ran into the challenge of keeping the
    setting of the FASYNC bit in filp->f_flags atomic with regard to calls to
    the underlying fasync() function. Andi Kleen suggested moving the handling
    of that bit into fasync(); this patch does exactly that. As a result, we
    have a couple of internal API changes: fasync() must now manage the FASYNC
    bit, and it will be called without the BKL held.

    As it happens, every fasync() implementation in the kernel with one
    exception calls fasync_helper(). So, if we make fasync_helper() set the
    FASYNC bit, we can avoid making any changes to the other fasync()
    functions - as long as those functions, themselves, have proper locking.
    Most fasync() implementations do nothing but call fasync_helper() - which
    has its own lock - so they are easily verified as correct. The BKL had
    already been pushed down into the rest.

    The networking code has its own version of fasync_helper(), so that code
    has been augmented with explicit FASYNC bit handling.

    Cc: Al Viro
    Cc: David Miller
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jonathan Corbet

    Jonathan Corbet
     

16 Feb, 2009

1 commit

  • The overlap with the old SO_TIMESTAMP[NS] options is handled so
    that time stamping in software (net_enable_timestamp()) is
    enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE
    is set. It's disabled if all of these are off.

    Signed-off-by: Patrick Ohly
    Signed-off-by: David S. Miller

    Patrick Ohly
     

14 Jan, 2009

3 commits


05 Jan, 2009

2 commits


29 Dec, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
    net: Allow dependancies of FDDI & Tokenring to be modular.
    igb: Fix build warning when DCA is disabled.
    net: Fix warning fallout from recent NAPI interface changes.
    gro: Fix potential use after free
    sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
    sfc: When disabling the NIC, close the device rather than unregistering it
    sfc: SFT9001: Add cable diagnostics
    sfc: Add support for multiple PHY self-tests
    sfc: Merge top-level functions for self-tests
    sfc: Clean up PHY mode management in loopback self-test
    sfc: Fix unreliable link detection in some loopback modes
    sfc: Generate unique names for per-NIC workqueues
    802.3ad: use standard ethhdr instead of ad_header
    802.3ad: generalize out mac address initializer
    802.3ad: initialize ports LACPDU from const initializer
    802.3ad: remove typedef around ad_system
    802.3ad: turn ports is_individual into a bool
    802.3ad: turn ports is_enabled into a bool
    802.3ad: make ntt bool
    ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
    ...

    Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
    to the conversion to %pI (in this networking merge) and the addition of
    doing IPv6 addresses (from the earlier merge of CIFS).

    Linus Torvalds
     

25 Dec, 2008

1 commit


24 Dec, 2008

1 commit


19 Dec, 2008

1 commit

  • The kernel_accept() does not hold the module refcount of newsock->ops->owner,
    so we need __module_get(newsock->ops->owner) code after call kernel_accept()
    by hand.
    In sunrpc, the module refcount is missing to hold. So this cause kernel panic.

    Used following script to reproduct:

    while [ 1 ];
    do
    mount -t nfs4 192.168.0.19:/ /mnt
    touch /mnt/file
    umount /mnt
    lsmod | grep ipv6
    done

    This patch fixed the problem by add __module_get(newsock->ops->owner) to
    kernel_accept(). So we do not need to used __module_get(newsock->ops->owner)
    in every place when used kernel_accept().

    Signed-off-by: Wei Yongjun
    Signed-off-by: David S. Miller

    Wei Yongjun
     

04 Dec, 2008

1 commit


21 Nov, 2008

1 commit


20 Nov, 2008

1 commit

  • Introduce a new accept4() system call. The addition of this system call
    matches analogous changes in 2.6.27 (dup3(), evenfd2(), signalfd4(),
    inotify_init1(), epoll_create1(), pipe2()) which added new system calls
    that differed from analogous traditional system calls in adding a flags
    argument that can be used to access additional functionality.

    The accept4() system call is exactly the same as accept(), except that
    it adds a flags bit-mask argument. Two flags are initially implemented.
    (Most of the new system calls in 2.6.27 also had both of these flags.)

    SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
    for the new file descriptor returned by accept4(). This is a useful
    security feature to avoid leaking information in a multithreaded
    program where one thread is doing an accept() at the same time as
    another thread is doing a fork() plus exec(). More details here:
    http://udrepper.livejournal.com/20407.html "Secure File Descriptor Handling",
    Ulrich Drepper).

    The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
    to be enabled on the new open file description created by accept4().
    (This flag is merely a convenience, saving the use of additional calls
    fcntl(F_GETFL) and fcntl (F_SETFL) to achieve the same result.

    Here's a test program. Works on x86-32. Should work on x86-64, but
    I (mtk) don't have a system to hand to test with.

    It tests accept4() with each of the four possible combinations of
    SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
    that the appropriate flags are set on the file descriptor/open file
    description returned by accept4().

    I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
    and it passes according to my test program.

    /* test_accept4.c

    Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk

    Licensed under the GNU GPLv2 or later.
    */
    #define _GNU_SOURCE
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define PORT_NUM 33333

    #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

    /**********************************************************************/

    /* The following is what we need until glibc gets a wrapper for
    accept4() */

    /* Flags for socket(), socketpair(), accept4() */
    #ifndef SOCK_CLOEXEC
    #define SOCK_CLOEXEC O_CLOEXEC
    #endif
    #ifndef SOCK_NONBLOCK
    #define SOCK_NONBLOCK O_NONBLOCK
    #endif

    #ifdef __x86_64__
    #define SYS_accept4 288
    #elif __i386__
    #define USE_SOCKETCALL 1
    #define SYS_ACCEPT4 18
    #else
    #error "Sorry -- don't know the syscall # on this architecture"
    #endif

    static int
    accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
    {
    printf("Calling accept4(): flags = %x", flags);
    if (flags != 0) {
    printf(" (");
    if (flags & SOCK_CLOEXEC)
    printf("SOCK_CLOEXEC");
    if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
    printf(" ");
    if (flags & SOCK_NONBLOCK)
    printf("SOCK_NONBLOCK");
    printf(")");
    }
    printf("\n");

    #if USE_SOCKETCALL
    long args[6];

    args[0] = fd;
    args[1] = (long) sockaddr;
    args[2] = (long) addrlen;
    args[3] = flags;

    return syscall(SYS_socketcall, SYS_ACCEPT4, args);
    #else
    return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
    #endif
    }

    /**********************************************************************/

    static int
    do_test(int lfd, struct sockaddr_in *conn_addr,
    int closeonexec_flag, int nonblock_flag)
    {
    int connfd, acceptfd;
    int fdf, flf, fdf_pass, flf_pass;
    struct sockaddr_in claddr;
    socklen_t addrlen;

    printf("=======================================\n");

    connfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connfd == -1)
    die("socket");
    if (connect(connfd, (struct sockaddr *) conn_addr,
    sizeof(struct sockaddr_in)) == -1)
    die("connect");

    addrlen = sizeof(struct sockaddr_in);
    acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
    closeonexec_flag | nonblock_flag);
    if (acceptfd == -1) {
    perror("accept4()");
    close(connfd);
    return 0;
    }

    fdf = fcntl(acceptfd, F_GETFD);
    if (fdf == -1)
    die("fcntl:F_GETFD");
    fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
    ((closeonexec_flag & SOCK_CLOEXEC) != 0);
    printf("Close-on-exec flag is %sset (%s); ",
    (fdf & FD_CLOEXEC) ? "" : "not ",
    fdf_pass ? "OK" : "failed");

    flf = fcntl(acceptfd, F_GETFL);
    if (flf == -1)
    die("fcntl:F_GETFD");
    flf_pass = ((flf & O_NONBLOCK) != 0) ==
    ((nonblock_flag & SOCK_NONBLOCK) !=0);
    printf("nonblock flag is %sset (%s)\n",
    (flf & O_NONBLOCK) ? "" : "not ",
    flf_pass ? "OK" : "failed");

    close(acceptfd);
    close(connfd);

    printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
    return fdf_pass && flf_pass;
    }

    static int
    create_listening_socket(int port_num)
    {
    struct sockaddr_in svaddr;
    int lfd;
    int optval;

    memset(&svaddr, 0, sizeof(struct sockaddr_in));
    svaddr.sin_family = AF_INET;
    svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    svaddr.sin_port = htons(port_num);

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd == -1)
    die("socket");

    optval = 1;
    if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
    sizeof(optval)) == -1)
    die("setsockopt");

    if (bind(lfd, (struct sockaddr *) &svaddr,
    sizeof(struct sockaddr_in)) == -1)
    die("bind");

    if (listen(lfd, 5) == -1)
    die("listen");

    return lfd;
    }

    int
    main(int argc, char *argv[])
    {
    struct sockaddr_in conn_addr;
    int lfd;
    int port_num;
    int passed;

    passed = 1;

    port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;

    memset(&conn_addr, 0, sizeof(struct sockaddr_in));
    conn_addr.sin_family = AF_INET;
    conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    conn_addr.sin_port = htons(port_num);

    lfd = create_listening_socket(port_num);

    if (!do_test(lfd, &conn_addr, 0, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
    passed = 0;

    close(lfd);

    exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    [mtk.manpages@gmail.com: rewrote changelog, updated test program]
    Signed-off-by: Ulrich Drepper
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

14 Nov, 2008

1 commit

  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: netdev@vger.kernel.org
    Signed-off-by: James Morris

    David Howells
     

07 Nov, 2008

1 commit


04 Nov, 2008

1 commit


02 Nov, 2008

1 commit

  • As it is, all instances of ->release() for files that have ->fasync()
    need to remember to evict file from fasync lists; forgetting that
    creates a hole and we actually have a bunch that *does* forget.

    So let's keep our lives simple - let __fput() check FASYNC in
    file->f_flags and call ->fasync() there if it's been set. And lose that
    crap in ->release() instances - leaving it there is still valid, but we
    don't have to bother anymore.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

17 Oct, 2008

1 commit


23 Sep, 2008

1 commit

  • The reasons for disabling paccept() are as follows:

    * The API is more complex than needed. There is AFAICS no demonstrated
    use case that the sigset argument of this syscall serves that couldn't
    equally be served by the use of pselect/ppoll/epoll_pwait + traditional
    accept(). Roland seems to concur with this opinion
    (http://thread.gmane.org/gmane.linux.kernel/723953/focus=732255). I
    have (more than once) asked Ulrich to explain otherwise
    (http://thread.gmane.org/gmane.linux.kernel/723952/focus=731018), but he
    does not respond, so one is left to assume that he doesn't know of such
    a case.

    * The use of a sigset argument is not consistent with other I/O APIs
    that can block on a single file descriptor (e.g., read(), recv(),
    connect()).

    * The behavior of paccept() when interrupted by a signal is IMO strange:
    the kernel restarts the system call if SA_RESTART was set for the
    handler. I think that it should not do this -- that it should behave
    consistently with paccept()/ppoll()/epoll_pwait(), which never restart,
    regardless of SA_RESTART. The reasoning here is that the very purpose
    of paccept() is to wait for a connection or a signal, and that
    restarting in the latter case is probably never useful. (Note: Roland
    disagrees on this point, believing that rather paccept() should be
    consistent with accept() in its behavior wrt EINTR
    (http://thread.gmane.org/gmane.linux.kernel/723953/focus=732255).)

    I believe that instead, a simpler API, consistent with Ulrich's other
    recent additions, is preferable:

    accept4(int fd, struct sockaddr *sa, socklen_t *salen, ind flags);

    (This simpler API was originally proposed by Ulrich:
    http://thread.gmane.org/gmane.linux.network/92072)

    If this simpler API is added, then if we later decide that the sigset
    argument really is required, then a suitable bit in 'flags' could be added
    to indicate the presence of the sigset argument.

    At this point, I am hoping we either will get a counter-argument from
    Ulrich about why we really do need paccept()'s sigset argument, or that he
    will resubmit the original accept4() patch.

    Signed-off-by: Michael Kerrisk
    Cc: David Miller
    Cc: Davide Libenzi
    Cc: Alan Cox
    Cc: Ulrich Drepper
    Cc: Jakub Jelinek
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk
     

27 Jul, 2008

1 commit

  • Kmem cache passed to constructor is only needed for constructors that are
    themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
    passed kmem cache in non-trivial way, so pass only pointer to object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

25 Jul, 2008

2 commits

  • This patch adds test that ensure the boundary conditions for the various
    constants introduced in the previous patches is met. No code is generated.

    [akpm@linux-foundation.org: fix alpha]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch introduces support for the SOCK_NONBLOCK flag in socket,
    socketpair, and paccept. To do this the internal function sock_attach_fd
    gets an additional parameter which it uses to set the appropriate flag for
    the file descriptor.

    Given that in modern, scalable programs almost all socket connections are
    non-blocking and the minimal additional cost for the new functionality
    I see no reason not to add this code.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef __NR_paccept
    # ifdef __x86_64__
    # define __NR_paccept 288
    # elif defined __i386__
    # define SYS_PACCEPT 18
    # define USE_SOCKETCALL 1
    # else
    # error "need __NR_paccept"
    # endif
    #endif

    #ifdef USE_SOCKETCALL
    # define paccept(fd, addr, addrlen, mask, flags) \
    ({ long args[6] = { \
    (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
    syscall (__NR_socketcall, SYS_PACCEPT, args); })
    #else
    # define paccept(fd, addr, addrlen, mask, flags) \
    syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
    #endif

    #define PORT 57392

    #define SOCK_NONBLOCK O_NONBLOCK

    static pthread_barrier_t b;

    static void *
    tf (void *arg)
    {
    pthread_barrier_wait (&b);
    int s = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    pthread_barrier_wait (&b);
    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    return NULL;
    }

    int
    main (void)
    {
    int fd;
    fd = socket (PF_INET, SOCK_STREAM, 0);
    if (fd == -1)
    {
    puts ("socket(0) failed");
    return 1;
    }
    int fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (fl & O_NONBLOCK)
    {
    puts ("socket(0) set non-blocking mode");
    return 1;
    }
    close (fd);

    fd = socket (PF_INET, SOCK_STREAM|SOCK_NONBLOCK, 0);
    if (fd == -1)
    {
    puts ("socket(SOCK_NONBLOCK) failed");
    return 1;
    }
    fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((fl & O_NONBLOCK) == 0)
    {
    puts ("socket(SOCK_NONBLOCK) does not set non-blocking mode");
    return 1;
    }
    close (fd);

    int fds[2];
    if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1)
    {
    puts ("socketpair(0) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    fl = fcntl (fds[i], F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (fl & O_NONBLOCK)
    {
    printf ("socketpair(0) set non-blocking mode for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    if (socketpair (PF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, fds) == -1)
    {
    puts ("socketpair(SOCK_NONBLOCK) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    fl = fcntl (fds[i], F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((fl & O_NONBLOCK) == 0)
    {
    printf ("socketpair(SOCK_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    pthread_barrier_init (&b, NULL, 2);

    struct sockaddr_in sin;
    pthread_t th;
    if (pthread_create (&th, NULL, tf, NULL) != 0)
    {
    puts ("pthread_create failed");
    return 1;
    }

    int s = socket (AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    int s2 = paccept (s, NULL, 0, NULL, 0);
    if (s2 < 0)
    {
    puts ("paccept(0) failed");
    return 1;
    }

    fl = fcntl (s2, F_GETFL);
    if (fl & O_NONBLOCK)
    {
    puts ("paccept(0) set non-blocking mode");
    return 1;
    }
    close (s2);
    close (s);

    pthread_barrier_wait (&b);

    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    s2 = paccept (s, NULL, 0, NULL, SOCK_NONBLOCK);
    if (s2 < 0)
    {
    puts ("paccept(SOCK_NONBLOCK) failed");
    return 1;
    }

    fl = fcntl (s2, F_GETFL);
    if ((fl & O_NONBLOCK) == 0)
    {
    puts ("paccept(SOCK_NONBLOCK) does not set non-blocking mode");
    return 1;
    }
    close (s2);
    close (s);

    pthread_barrier_wait (&b);
    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper