21 Nov, 2013

1 commit


26 Oct, 2013

1 commit

  • I initial build non irq safe version of net_get_random_once because I
    would liked to have the freedom to defer even the extraction process of
    get_random_bytes until the nonblocking pool is fully seeded.

    I don't think this is a good idea anymore and thus this patch makes
    net_get_random_once irq safe. Now someone using net_get_random_once does
    not need to care from where it is called.

    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

22 Oct, 2013

1 commit

  • This patch fixes the following warning:

    In file included from include/linux/skbuff.h:27:0,
    from include/linux/netfilter.h:5,
    from include/net/netns/netfilter.h:5,
    from include/net/net_namespace.h:20,
    from include/linux/init_task.h:14,
    from init/init_task.c:1:
    include/linux/net.h:243:14: warning: 'struct static_key' declared inside parameter list [enabled by default]
    struct static_key *done_key);

    on x86_64 allnoconfig, um defconfig and ia64 allmodconfig and maybe others as well.

    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

20 Oct, 2013

1 commit

  • net_get_random_once is a new macro which handles the initialization
    of secret keys. It is possible to call it in the fast path. Only the
    initialization depends on the spinlock and is rather slow. Otherwise
    it should get used just before the key is used to delay the entropy
    extration as late as possible to get better randomness. It returns true
    if the key got initialized.

    The usage of static_keys for net_get_random_once is a bit uncommon so
    it needs some further explanation why this actually works:

    === In the simple non-HAVE_JUMP_LABEL case we actually have ===
    no constrains to use static_key_(true|false) on keys initialized with
    STATIC_KEY_INIT_(FALSE|TRUE). So this path just expands in favor of
    the likely case that the initialization is already done. The key is
    initialized like this:

    ___done_key = { .enabled = ATOMIC_INIT(0) }

    The check

    if (!static_key_true(&___done_key)) \

    expands into (pseudo code)

    if (!likely(___done_key > 0))

    , so we take the fast path as soon as ___done_key is increased from the
    helper function.

    === If HAVE_JUMP_LABELs are available this depends ===
    on patching of jumps into the prepared NOPs, which is done in
    jump_label_init at boot-up time (from start_kernel). It is forbidden
    and dangerous to use net_get_random_once in functions which are called
    before that!

    At compilation time NOPs are generated at the call sites of
    net_get_random_once. E.g. net/ipv6/inet6_hashtable.c:inet6_ehashfn (we
    need to call net_get_random_once two times in inet6_ehashfn, so two NOPs):

    71: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
    76: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

    Both will be patched to the actual jumps to the end of the function to
    call __net_get_random_once at boot time as explained above.

    arch_static_branch is optimized and inlined for false as return value and
    actually also returns false in case the NOP is placed in the instruction
    stream. So in the fast case we get a "return false". But because we
    initialize ___done_key with (enabled != (entries & 1)) this call-site
    will get patched up at boot thus returning true. The final check looks
    like this:

    if (!static_key_true(&___done_key)) \
    ___ret = __net_get_random_once(buf, \

    expands to

    if (!!static_key_false(&___done_key)) \
    ___ret = __net_get_random_once(buf, \

    So we get true at boot time and as soon as static_key_slow_inc is called
    on the key it will invert the logic and return false for the fast path.
    static_key_slow_inc will change the branch because it got initialized
    with .enabled == 0. After static_key_slow_inc is called on the key the
    branch is replaced with a nop again.

    === Misc: ===
    The helper defers the increment into a workqueue so we don't
    have problems calling this code from atomic sections. A seperate boolean
    (___done) guards the case where we enter net_get_random_once again before
    the increment happend.

    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Jason Baron
    Cc: Peter Zijlstra
    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

27 Sep, 2013

1 commit

  • There are a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.

    Signed-off-by: Joe Perches

    Joe Perches
     

05 Jun, 2013

1 commit


30 Apr, 2013

1 commit

  • Commit 496f2f93b1cc ("random32: rename random32 to prandom") renamed
    random32() and srandom32() to prandom_u32() and prandom_seed()
    respectively.

    net_random() and net_srandom() need to be redefined with prandom_* in
    order to finish the naming transition.

    While I'm at it, enclose macro argument of net_srandom() with parenthesis.

    Signed-off-by: Akinobu Mita
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

13 Oct, 2012

1 commit


03 Oct, 2012

1 commit

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     

27 Sep, 2012

1 commit

  • Both modular callers of sock_map_fd() had been buggy; sctp one leaks
    descriptor and file if copy_to_user() fails, 9p one shouldn't be
    exposing file in the descriptor table at all.

    Switch both to sock_alloc_file(), export it, unexport sock_map_fd() and
    make it static.

    Signed-off-by: Al Viro

    Al Viro
     

23 Jul, 2012

1 commit

  • Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set but as soon as data
    is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
    is updated with the kernel worker threads value which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     

21 Jul, 2012

1 commit

  • This patch fixes a crash
    tun_chr_close -> netdev_run_todo -> tun_free_netdev -> sk_release_kernel ->
    sock_release -> iput(SOCK_INODE(sock))
    introduced by commit 1ab5ecb90cb6a3df1476e052f76a6e8f6511cb3d

    The problem is that this socket is embedded in struct tun_struct, it has
    no inode, iput is called on invalid inode, which modifies invalid memory
    and optionally causes a crash.

    sock_release also decrements sockets_in_use, this causes a bug that
    "sockets: used" field in /proc/*/net/sockstat keeps on decreasing when
    creating and closing tun devices.

    This patch introduces a flag SOCK_EXTERNALLY_ALLOCATED that instructs
    sock_release to not free the inode and not decrement sockets_in_use,
    fixing both memory corruption and sockets_in_use underflow.

    It should be backported to 3.3 an 3.4 stabke.

    Signed-off-by: Mikulas Patocka
    Cc: stable@kernel.org
    Signed-off-by: David S. Miller

    Mikulas Patocka
     

30 May, 2012

1 commit

  • The MODULE_ALAIS_NET_PF macro set is missing a variant that allows for the
    appending of an arbitrary string to the net-pf--proto- base. while
    MODULE_ALIAS_NET_PF_PROTO_NAME_TYPE allows an appending of a numerical type, we
    need to be able to append a generic string to support generic netlink families
    that have neither a fix numberical protocol nor type number

    Signed-off-by: Neil Horman
    CC: Eric Dumazet
    CC: David Miller
    Signed-off-by: David S. Miller

    Neil Horman
     

16 May, 2012

1 commit

  • __ratelimit() can be considered an inverted bool test because
    it returns true when not ratelimited. Several tests in the
    kernel tree use this __ratelimit() function incorrectly.

    No net_ratelimit uses are incorrect currently though.

    Most uses of net_ratelimit are to log something via printk or
    pr_.

    In order to minimize the uses of net_ratelimit, and to start
    standardizing the code style used for __ratelimit() and net_ratelimit(),
    add a net_ratelimited_function() macro and net__ratelimited()
    logging macros similar to pr__ratelimited that use the global
    net_ratelimit instead of a static per call site "struct ratelimit_state".

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

22 Feb, 2012

1 commit

  • This one specifies where to start MSG_PEEK-ing queue data from. When
    set to negative value means that MSG_PEEK works as ususally -- peeks
    from the head of the queue always.

    When some bytes are peeked from queue and the peeking offset is non
    negative it is moved forward so that the next peek will return next
    portion of data.

    When non-peeking recvmsg occurs and the peeking offset is non negative
    is is moved backward so that the next peek will still peek the proper
    data (i.e. the one that would have been picked if there were no non
    peeking recv in between).

    The offset is set using per-proto opteration to let the protocol handle
    the locking issues and to check whether the peeking offset feature is
    supported by the protocol the socket belongs to.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

28 May, 2011

1 commit

  • Ingo Molnar noticed that we have this unnecessary ratelimit.h
    dependency in linux/net.h, which hid compilation problems from
    people doing builds only with CONFIG_NET enabled.

    Move this stuff out to a seperate net/net_ratelimit.h file and
    include that in the only two places where this thing is needed.

    Signed-off-by: David S. Miller
    Acked-by: Ingo Molnar

    David S. Miller
     

06 May, 2011

1 commit

  • This patch adds a multiple message send syscall and is the send
    version of the existing recvmmsg syscall. This is heavily
    based on the patch by Arnaldo that added recvmmsg.

    I wrote a microbenchmark to test the performance gains of using
    this new syscall:

    http://ozlabs.org/~anton/junkcode/sendmmsg_test.c

    The test was run on a ppc64 box with a 10 Gbit network card. The
    benchmark can send both UDP and RAW ethernet packets.

    64B UDP

    batch pkts/sec
    1 804570
    2 872800 (+ 8 %)
    4 916556 (+14 %)
    8 939712 (+17 %)
    16 952688 (+18 %)
    32 956448 (+19 %)
    64 964800 (+20 %)

    64B raw socket

    batch pkts/sec
    1 1201449
    2 1350028 (+12 %)
    4 1461416 (+22 %)
    8 1513080 (+26 %)
    16 1541216 (+28 %)
    32 1553440 (+29 %)
    64 1557888 (+30 %)

    We see a 20% improvement in throughput on UDP send and 30%
    on raw socket send.

    [ Add sparc syscall entries. -DaveM ]

    Signed-off-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Anton Blanchard
     

23 Feb, 2011

1 commit


02 Oct, 2010

1 commit


03 Jul, 2010

1 commit

  • Fix kernel-doc warnings in linux/net.h:

    Warning(include/linux/net.h:151): No description found for parameter 'wq'
    Warning(include/linux/net.h:151): Excess struct/union/enum/typedef member 'fasync_list' description in 'socket'
    Warning(include/linux/net.h:151): Excess struct/union/enum/typedef member 'wait' description in 'socket'

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
     

02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
    socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
    a rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to eventually wakeup tasks.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
    instead of dynamically allocated ones. They dont need rcu freeing.

    Some cleanups or followups are probably needed, (possible
    sk_callback_lock conversion to a spinlock for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Feb, 2010

1 commit

  • Ifdef out
    struct proto_ops::compat_ioctl
    struct proto_ops::compat_setsockopt
    struct proto_ops::compat_getsockopt
    to make structures smaller on COMPAT=n kernels.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

06 Dec, 2009

2 commits


07 Nov, 2009

1 commit

  • All users of wrapped proto_ops are now gone, so we can safely remove
    the wrappers as well.

    Cc: David S. Miller
    Cc: netdev@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

06 Nov, 2009

1 commit

  • The generic __sock_create function has a kern argument which allows the
    security system to make decisions based on if a socket is being created by
    the kernel or by userspace. This patch passes that flag to the
    net_proto_family specific create function, so it can do the same thing.

    Signed-off-by: Eric Paris
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Paris
     

29 Oct, 2009

1 commit

  • proto_ops->getname implies copying protocol specific data
    into storage unit (particulary to __kernel_sockaddr_storage).
    So when we implement new protocol support we should keep such
    a detail in mind (which is easy to forget about).

    Lets introduce DECLARE_SOCKADDR helper which check if
    storage unit is not overfowed at build time.

    Eventually inet_getname is switched to use DECLARE_SOCKADDR
    (to show example of usage).

    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: David S. Miller

    Cyrill Gorcunov
     

13 Oct, 2009

1 commit

  • Meaning receive multiple messages, reducing the number of syscalls and
    net stack entry/exit operations.

    Next patches will introduce mechanisms where protocols that want to
    optimize this operation will provide an unlocked_recvmsg operation.

    This takes into account comments made by:

    . Paul Moore: sock_recvmsg is called only for the first datagram,
    sock_recvmsg_nosec is used for the rest.

    . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
    works in the same fashion as the ppoll one.

    If the underlying protocol returns a datagram with MSG_OOB set, this
    will make recvmmsg return right away with as many datagrams (+ the OOB
    one) it has received so far.

    . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
    datagrams and then recvmsg returns an error, recvmmsg will return
    the successfully received datagrams, store the error and return it
    in the next call.

    This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
    where we will be able to acquire the lock only at batch start and end, not at
    every underlying recvmsg call.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Sep, 2009

1 commit

  • Decouple kernel.h from ratelimit.h: the global declaration of
    printk's ratelimit_state is not needed, and it leads to messy
    circular dependencies due to ratelimit.h's (new) adding of a
    spinlock_types.h include.

    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: David S. Miller
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

15 Sep, 2009

1 commit


16 Mar, 2009

1 commit

  • On x86_64, its rather unfortunate that "wait_queue_head_t wait"
    field of "struct socket" spans two cache lines (assuming a 64
    bytes cache line in current cpus)

    offsetof(struct socket, wait)=0x30
    sizeof(wait_queue_head_t)=0x18

    This might explain why Kenny Chang noticed that his multicast workload
    was performing bad with 64 bit kernels, since more cache lines ping pongs
    were involved.

    This litle patch moves "wait" field next "fasync_list" so that both
    fields share a single cache line, to speedup sock_def_readable()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Nov, 2008

1 commit

  • Introduce a new accept4() system call. The addition of this system call
    matches analogous changes in 2.6.27 (dup3(), evenfd2(), signalfd4(),
    inotify_init1(), epoll_create1(), pipe2()) which added new system calls
    that differed from analogous traditional system calls in adding a flags
    argument that can be used to access additional functionality.

    The accept4() system call is exactly the same as accept(), except that
    it adds a flags bit-mask argument. Two flags are initially implemented.
    (Most of the new system calls in 2.6.27 also had both of these flags.)

    SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
    for the new file descriptor returned by accept4(). This is a useful
    security feature to avoid leaking information in a multithreaded
    program where one thread is doing an accept() at the same time as
    another thread is doing a fork() plus exec(). More details here:
    http://udrepper.livejournal.com/20407.html "Secure File Descriptor Handling",
    Ulrich Drepper).

    The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
    to be enabled on the new open file description created by accept4().
    (This flag is merely a convenience, saving the use of additional calls
    fcntl(F_GETFL) and fcntl (F_SETFL) to achieve the same result.

    Here's a test program. Works on x86-32. Should work on x86-64, but
    I (mtk) don't have a system to hand to test with.

    It tests accept4() with each of the four possible combinations of
    SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
    that the appropriate flags are set on the file descriptor/open file
    description returned by accept4().

    I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
    and it passes according to my test program.

    /* test_accept4.c

    Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk

    Licensed under the GNU GPLv2 or later.
    */
    #define _GNU_SOURCE
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define PORT_NUM 33333

    #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

    /**********************************************************************/

    /* The following is what we need until glibc gets a wrapper for
    accept4() */

    /* Flags for socket(), socketpair(), accept4() */
    #ifndef SOCK_CLOEXEC
    #define SOCK_CLOEXEC O_CLOEXEC
    #endif
    #ifndef SOCK_NONBLOCK
    #define SOCK_NONBLOCK O_NONBLOCK
    #endif

    #ifdef __x86_64__
    #define SYS_accept4 288
    #elif __i386__
    #define USE_SOCKETCALL 1
    #define SYS_ACCEPT4 18
    #else
    #error "Sorry -- don't know the syscall # on this architecture"
    #endif

    static int
    accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
    {
    printf("Calling accept4(): flags = %x", flags);
    if (flags != 0) {
    printf(" (");
    if (flags & SOCK_CLOEXEC)
    printf("SOCK_CLOEXEC");
    if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
    printf(" ");
    if (flags & SOCK_NONBLOCK)
    printf("SOCK_NONBLOCK");
    printf(")");
    }
    printf("\n");

    #if USE_SOCKETCALL
    long args[6];

    args[0] = fd;
    args[1] = (long) sockaddr;
    args[2] = (long) addrlen;
    args[3] = flags;

    return syscall(SYS_socketcall, SYS_ACCEPT4, args);
    #else
    return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
    #endif
    }

    /**********************************************************************/

    static int
    do_test(int lfd, struct sockaddr_in *conn_addr,
    int closeonexec_flag, int nonblock_flag)
    {
    int connfd, acceptfd;
    int fdf, flf, fdf_pass, flf_pass;
    struct sockaddr_in claddr;
    socklen_t addrlen;

    printf("=======================================\n");

    connfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connfd == -1)
    die("socket");
    if (connect(connfd, (struct sockaddr *) conn_addr,
    sizeof(struct sockaddr_in)) == -1)
    die("connect");

    addrlen = sizeof(struct sockaddr_in);
    acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
    closeonexec_flag | nonblock_flag);
    if (acceptfd == -1) {
    perror("accept4()");
    close(connfd);
    return 0;
    }

    fdf = fcntl(acceptfd, F_GETFD);
    if (fdf == -1)
    die("fcntl:F_GETFD");
    fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
    ((closeonexec_flag & SOCK_CLOEXEC) != 0);
    printf("Close-on-exec flag is %sset (%s); ",
    (fdf & FD_CLOEXEC) ? "" : "not ",
    fdf_pass ? "OK" : "failed");

    flf = fcntl(acceptfd, F_GETFL);
    if (flf == -1)
    die("fcntl:F_GETFD");
    flf_pass = ((flf & O_NONBLOCK) != 0) ==
    ((nonblock_flag & SOCK_NONBLOCK) !=0);
    printf("nonblock flag is %sset (%s)\n",
    (flf & O_NONBLOCK) ? "" : "not ",
    flf_pass ? "OK" : "failed");

    close(acceptfd);
    close(connfd);

    printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
    return fdf_pass && flf_pass;
    }

    static int
    create_listening_socket(int port_num)
    {
    struct sockaddr_in svaddr;
    int lfd;
    int optval;

    memset(&svaddr, 0, sizeof(struct sockaddr_in));
    svaddr.sin_family = AF_INET;
    svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    svaddr.sin_port = htons(port_num);

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd == -1)
    die("socket");

    optval = 1;
    if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
    sizeof(optval)) == -1)
    die("setsockopt");

    if (bind(lfd, (struct sockaddr *) &svaddr,
    sizeof(struct sockaddr_in)) == -1)
    die("bind");

    if (listen(lfd, 5) == -1)
    die("listen");

    return lfd;
    }

    int
    main(int argc, char *argv[])
    {
    struct sockaddr_in conn_addr;
    int lfd;
    int port_num;
    int passed;

    passed = 1;

    port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;

    memset(&conn_addr, 0, sizeof(struct sockaddr_in));
    conn_addr.sin_family = AF_INET;
    conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    conn_addr.sin_port = htons(port_num);

    lfd = create_listening_socket(port_num);

    if (!do_test(lfd, &conn_addr, 0, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
    passed = 0;
    if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
    passed = 0;
    if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
    passed = 0;

    close(lfd);

    exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    [mtk.manpages@gmail.com: rewrote changelog, updated test program]
    Signed-off-by: Ulrich Drepper
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

27 Aug, 2008

1 commit

  • Including in the user-visible part of this header has
    caused build regressions with headers from 2.6.27-rc. Move it down to
    the #ifdef __KERNEL__ part, which is the only place it's needed. Move
    some other kernel-only things down there too, while we're at it.

    Signed-off-by: David Woodhouse
    Signed-off-by: Linus Torvalds

    David Woodhouse
     

26 Jul, 2008

1 commit

  • All ratelimit user use same jiffies and burst params, so some messages
    (callbacks) will be lost.

    For example:
    a call printk_ratelimit(5 * HZ, 1)
    b call printk_ratelimit(5 * HZ, 1) before the 5*HZ timeout of a, then b will
    will be supressed.

    - rewrite __ratelimit, and use a ratelimit_state as parameter. Thanks for
    hints from andrew.

    - Add WARN_ON_RATELIMIT, update rcupreempt.h

    - remove __printk_ratelimit

    - use __ratelimit in net_ratelimit

    Signed-off-by: Dave Young
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

25 Jul, 2008

4 commits

  • This patch introduces support for the SOCK_NONBLOCK flag in socket,
    socketpair, and paccept. To do this the internal function sock_attach_fd
    gets an additional parameter which it uses to set the appropriate flag for
    the file descriptor.

    Given that in modern, scalable programs almost all socket connections are
    non-blocking and the minimal additional cost for the new functionality
    I see no reason not to add this code.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef __NR_paccept
    # ifdef __x86_64__
    # define __NR_paccept 288
    # elif defined __i386__
    # define SYS_PACCEPT 18
    # define USE_SOCKETCALL 1
    # else
    # error "need __NR_paccept"
    # endif
    #endif

    #ifdef USE_SOCKETCALL
    # define paccept(fd, addr, addrlen, mask, flags) \
    ({ long args[6] = { \
    (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
    syscall (__NR_socketcall, SYS_PACCEPT, args); })
    #else
    # define paccept(fd, addr, addrlen, mask, flags) \
    syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
    #endif

    #define PORT 57392

    #define SOCK_NONBLOCK O_NONBLOCK

    static pthread_barrier_t b;

    static void *
    tf (void *arg)
    {
    pthread_barrier_wait (&b);
    int s = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    pthread_barrier_wait (&b);
    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    return NULL;
    }

    int
    main (void)
    {
    int fd;
    fd = socket (PF_INET, SOCK_STREAM, 0);
    if (fd == -1)
    {
    puts ("socket(0) failed");
    return 1;
    }
    int fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (fl & O_NONBLOCK)
    {
    puts ("socket(0) set non-blocking mode");
    return 1;
    }
    close (fd);

    fd = socket (PF_INET, SOCK_STREAM|SOCK_NONBLOCK, 0);
    if (fd == -1)
    {
    puts ("socket(SOCK_NONBLOCK) failed");
    return 1;
    }
    fl = fcntl (fd, F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((fl & O_NONBLOCK) == 0)
    {
    puts ("socket(SOCK_NONBLOCK) does not set non-blocking mode");
    return 1;
    }
    close (fd);

    int fds[2];
    if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1)
    {
    puts ("socketpair(0) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    fl = fcntl (fds[i], F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (fl & O_NONBLOCK)
    {
    printf ("socketpair(0) set non-blocking mode for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    if (socketpair (PF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, fds) == -1)
    {
    puts ("socketpair(SOCK_NONBLOCK) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    fl = fcntl (fds[i], F_GETFL);
    if (fl == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((fl & O_NONBLOCK) == 0)
    {
    printf ("socketpair(SOCK_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    pthread_barrier_init (&b, NULL, 2);

    struct sockaddr_in sin;
    pthread_t th;
    if (pthread_create (&th, NULL, tf, NULL) != 0)
    {
    puts ("pthread_create failed");
    return 1;
    }

    int s = socket (AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    int s2 = paccept (s, NULL, 0, NULL, 0);
    if (s2 < 0)
    {
    puts ("paccept(0) failed");
    return 1;
    }

    fl = fcntl (s2, F_GETFL);
    if (fl & O_NONBLOCK)
    {
    puts ("paccept(0) set non-blocking mode");
    return 1;
    }
    close (s2);
    close (s);

    pthread_barrier_wait (&b);

    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    s2 = paccept (s, NULL, 0, NULL, SOCK_NONBLOCK);
    if (s2 < 0)
    {
    puts ("paccept(SOCK_NONBLOCK) failed");
    return 1;
    }

    fl = fcntl (s2, F_GETFL);
    if ((fl & O_NONBLOCK) == 0)
    {
    puts ("paccept(SOCK_NONBLOCK) does not set non-blocking mode");
    return 1;
    }
    close (s2);
    close (s);

    pthread_barrier_wait (&b);
    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • Some platforms do not have support to restore the signal mask in the
    return path from a syscall. For those platforms syscalls like pselect are
    not defined at all. This is, I think, not a good choice for paccept()
    since paccept() adds more value on top of accept() than just the signal
    mask handling.

    Therefore this patch defines a scaled down version of the sys_paccept
    function for those platforms. It returns -EINVAL in case the signal mask
    is non-NULL but behaves the same otherwise.

    Note that I explicitly included . I saw that it is
    currently included but indirectly two levels down. There is too much risk
    in relying on this. The header might change and then suddenly the
    function definition would change without anyone immediately noticing.

    Signed-off-by: Ulrich Drepper
    Cc: Davide Libenzi
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch is by far the most complex in the series. It adds a new syscall
    paccept. This syscall differs from accept in that it adds (at the userlevel)
    two additional parameters:

    - a signal mask
    - a flags value

    The flags parameter can be used to set flag like SOCK_CLOEXEC. This is
    imlpemented here as well. Some people argued that this is a property which
    should be inherited from the file desriptor for the server but this is against
    POSIX. Additionally, we really want the signal mask parameter as well
    (similar to pselect, ppoll, etc). So an interface change in inevitable.

    The flag value is the same as for socket and socketpair. I think diverging
    here will only create confusion. Similar to the filesystem interfaces where
    the use of the O_* constants differs, it is acceptable here.

    The signal mask is handled as for pselect etc. The mask is temporarily
    installed for the thread and removed before the call returns. I modeled the
    code after pselect. If there is a problem it's likely also in pselect.

    For architectures which use socketcall I maintained this interface instead of
    adding a system call. The symmetry shouldn't be broken.

    The following test must be adjusted for architectures other than x86 and
    x86-64 and in case the syscall numbers changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #ifndef __NR_paccept
    # ifdef __x86_64__
    # define __NR_paccept 288
    # elif defined __i386__
    # define SYS_PACCEPT 18
    # define USE_SOCKETCALL 1
    # else
    # error "need __NR_paccept"
    # endif
    #endif

    #ifdef USE_SOCKETCALL
    # define paccept(fd, addr, addrlen, mask, flags) \
    ({ long args[6] = { \
    (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \
    syscall (__NR_socketcall, SYS_PACCEPT, args); })
    #else
    # define paccept(fd, addr, addrlen, mask, flags) \
    syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags)
    #endif

    #define PORT 57392

    #define SOCK_CLOEXEC O_CLOEXEC

    static pthread_barrier_t b;

    static void *
    tf (void *arg)
    {
    pthread_barrier_wait (&b);
    int s = socket (AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);

    pthread_barrier_wait (&b);
    s = socket (AF_INET, SOCK_STREAM, 0);
    sin.sin_port = htons (PORT);
    connect (s, (const struct sockaddr *) &sin, sizeof (sin));
    close (s);
    pthread_barrier_wait (&b);

    pthread_barrier_wait (&b);
    sleep (2);
    pthread_kill ((pthread_t) arg, SIGUSR1);

    return NULL;
    }

    static void
    handler (int s)
    {
    }

    int
    main (void)
    {
    pthread_barrier_init (&b, NULL, 2);

    struct sockaddr_in sin;
    pthread_t th;
    if (pthread_create (&th, NULL, tf, (void *) pthread_self ()) != 0)
    {
    puts ("pthread_create failed");
    return 1;
    }

    int s = socket (AF_INET, SOCK_STREAM, 0);
    int reuse = 1;
    setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
    sin.sin_port = htons (PORT);
    bind (s, (struct sockaddr *) &sin, sizeof (sin));
    listen (s, SOMAXCONN);

    pthread_barrier_wait (&b);

    int s2 = paccept (s, NULL, 0, NULL, 0);
    if (s2 < 0)
    {
    puts ("paccept(0) failed");
    return 1;
    }

    int coe = fcntl (s2, F_GETFD);
    if (coe & FD_CLOEXEC)
    {
    puts ("paccept(0) set close-on-exec-flag");
    return 1;
    }
    close (s2);

    pthread_barrier_wait (&b);

    s2 = paccept (s, NULL, 0, NULL, SOCK_CLOEXEC);
    if (s2 < 0)
    {
    puts ("paccept(SOCK_CLOEXEC) failed");
    return 1;
    }

    coe = fcntl (s2, F_GETFD);
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("paccept(SOCK_CLOEXEC) does not set close-on-exec flag");
    return 1;
    }
    close (s2);

    pthread_barrier_wait (&b);

    struct sigaction sa;
    sa.sa_handler = handler;
    sa.sa_flags = 0;
    sigemptyset (&sa.sa_mask);
    sigaction (SIGUSR1, &sa, NULL);

    sigset_t ss;
    pthread_sigmask (SIG_SETMASK, NULL, &ss);
    sigaddset (&ss, SIGUSR1);
    pthread_sigmask (SIG_SETMASK, &ss, NULL);

    sigdelset (&ss, SIGUSR1);
    alarm (4);
    pthread_barrier_wait (&b);

    errno = 0 ;
    s2 = paccept (s, NULL, 0, &ss, 0);
    if (s2 != -1 || errno != EINTR)
    {
    puts ("paccept did not fail with EINTR");
    return 1;
    }

    close (s);

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    [akpm@linux-foundation.org: make it compile]
    [akpm@linux-foundation.org: add sys_ni stub]
    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Cc: "David S. Miller"
    Cc: Roland McGrath
    Cc: Kyle McMartin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     
  • This patch adds support for flag values which are ORed to the type passwd
    to socket and socketpair. The additional code is minimal. The flag
    values in this implementation can and must match the O_* flags. This
    avoids overhead in the conversion.

    The internal functions sock_alloc_fd and sock_map_fd get a new parameters
    and all callers are changed.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #include
    #include
    #include
    #include
    #include

    #define PORT 57392

    /* For Linux these must be the same. */
    #define SOCK_CLOEXEC O_CLOEXEC

    int
    main (void)
    {
    int fd;
    fd = socket (PF_INET, SOCK_STREAM, 0);
    if (fd == -1)
    {
    puts ("socket(0) failed");
    return 1;
    }
    int coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    puts ("socket(0) set close-on-exec flag");
    return 1;
    }
    close (fd);

    fd = socket (PF_INET, SOCK_STREAM|SOCK_CLOEXEC, 0);
    if (fd == -1)
    {
    puts ("socket(SOCK_CLOEXEC) failed");
    return 1;
    }
    coe = fcntl (fd, F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    puts ("socket(SOCK_CLOEXEC) does not set close-on-exec flag");
    return 1;
    }
    close (fd);

    int fds[2];
    if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1)
    {
    puts ("socketpair(0) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    coe = fcntl (fds[i], F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if (coe & FD_CLOEXEC)
    {
    printf ("socketpair(0) set close-on-exec flag for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    if (socketpair (PF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, fds) == -1)
    {
    puts ("socketpair(SOCK_CLOEXEC) failed");
    return 1;
    }
    for (int i = 0; i < 2; ++i)
    {
    coe = fcntl (fds[i], F_GETFD);
    if (coe == -1)
    {
    puts ("fcntl failed");
    return 1;
    }
    if ((coe & FD_CLOEXEC) == 0)
    {
    printf ("socketpair(SOCK_CLOEXEC) does not set close-on-exec flag for fds[%d]\n", i);
    return 1;
    }
    close (fds[i]);
    }

    puts ("OK");

    return 0;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Signed-off-by: Ulrich Drepper
    Acked-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
     

08 Jul, 2008

1 commit