16 Apr, 2012

1 commit


04 Apr, 2012

1 commit

  • unix_dgram_sendmsg() currently builds linear skbs, which can stress the
    page allocator with high-order page allocations. When memory gets
    fragmented, these allocations can eventually fail.

    We can try to use order-2 allocations for skb head (SKB_MAX_ALLOC) plus
    up to 16 page fragments to lower pressure on buddy allocator.

    This patch has no effect on messages of less than 16064 bytes.
    (on 64bit arches with PAGE_SIZE=4096)

    For bigger messages (from 16065 to 81600 bytes), this patch improves
    reliability at the cost of a performance penalty from the extra page
    allocations.

    netperf -t DG_STREAM -T 0,2 -- -m 16064 -s 200000
    ->4086040 Messages / 10s

    netperf -t DG_STREAM -T 0,2 -- -m 16068 -s 200000
    ->3901747 Messages / 10s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Mar, 2012

1 commit

  • In some cases the poll() implementation in a driver has to do different
    things depending on the events the caller wants to poll for. An example
    is when a driver needs to start a DMA engine if the caller polls for
    POLLIN, but doesn't want to do that if POLLIN is not requested but instead
    only POLLOUT or POLLPRI is requested. This is something that can happen
    in the video4linux subsystem among others.

    Unfortunately, the current epoll/poll/select implementation doesn't
    provide that information reliably. The poll_table_struct does have it: it
    has a key field with the event mask. But once a poll() call matches one
    or more bits of that mask any following poll() calls are passed a NULL
    poll_table pointer.

    Also, the eventpoll implementation always left the key field at ~0 instead
    of using the requested events mask.

    This was changed in eventpoll.c so the key field now contains the actual
    events that should be polled for as set by the caller.

    The solution to the NULL poll_table pointer is to set the qproc field to
    NULL in poll_table once poll() matches the events, not the poll_table
    pointer itself. That way drivers can obtain the mask through a new
    poll_requested_events inline.

    The poll_table_struct can still be NULL since some kernel code calls it
    internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
    that case poll_requested_events() returns ~0 (i.e. all events).

    Very rarely drivers might want to know whether poll_wait will actually
    wait. If another earlier file descriptor in the set already matched the
    events the caller wanted to wait for, then the kernel will return from the
    select() call without waiting. This might be useful information in order
    to avoid doing expensive work.

    A new helper function poll_does_not_wait() is added that drivers can use
    to detect this situation. This is now used in sock_poll_wait() in
    include/net/sock.h. This was the only place in the kernel that needed
    this information.

    Drivers should no longer access any of the poll_table internals, but use
    the poll_requested_events() and poll_does_not_wait() access functions
    instead. In order to enforce that the poll_table fields are now prepended
    with an underscore and a comment was added warning against using them
    directly.

    This required a change in unix_dgram_poll() in unix/af_unix.c which used
    the key field to get the requested events. It's been replaced by a call
    to poll_requested_events().

    For qproc it was especially important to change its name, because the
    behavior of that field changes with this patch: this function pointer
    can now be NULL, which wasn't possible in the past.

    Any driver accessing the qproc or key fields directly will now fail to compile.

    Some notes regarding the correctness of this patch: the driver's poll()
    function is called with a 'struct poll_table_struct *wait' argument. This
    pointer may or may not be NULL; drivers can never rely on it being one or
    the other, as that depends on whether or not an earlier file descriptor in
    the select()'s fdset matched the requested events.

    There are only three things a driver can do with the wait argument:

    1) obtain the key field:

    events = wait ? wait->key : ~0;

    This will still work although it should be replaced with the new
    poll_requested_events() function (which does exactly the same).
    This will now even work better, since wait is no longer set to NULL
    unnecessarily.

    2) use the qproc callback. This could be deadly since qproc can now be
    NULL. Renaming qproc should prevent this from happening. There are no
    kernel drivers that actually access this callback directly, BTW.

    3) test whether wait == NULL to determine whether poll would return without
    waiting. This is no longer sufficient as the correct test is now
    wait == NULL || wait->_qproc == NULL.

    However, the worst that can happen here is a slight performance hit in
    the case where wait != NULL and wait->_qproc == NULL. In that case the
    driver will assume that poll_wait() will actually add the fd to the set
    of waiting file descriptors. Of course, poll_wait() will not do that
    since it tests for wait->_qproc. This will not break anything, though.

    There is only one place in the whole kernel where this happens
    (sock_poll_wait() in include/net/sock.h) and that code will be replaced
    by a call to poll_does_not_wait() in the next patch.

    Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
    actually waiting. The next file descriptor from the set might match the
    event mask and thus any possible waits will never happen.

    Signed-off-by: Hans Verkuil
    Reviewed-by: Jonathan Corbet
    Reviewed-by: Al Viro
    Cc: Davide Libenzi
    Signed-off-by: Hans de Goede
    Cc: Mauro Carvalho Chehab
    Cc: David Miller
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans Verkuil
     

22 Mar, 2012

1 commit

  • Pull vfs pile 1 from Al Viro:
    "This is _not_ all; in particular, Miklos' and Jan's stuff is not there
    yet."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
    ext4: initialization of ext4_li_mtx needs to be done earlier
    debugfs-related mode_t whack-a-mole
    hfsplus: add an ioctl to bless files
    hfsplus: change finder_info to u32
    hfsplus: initialise userflags
    qnx4: new helper - try_extent()
    qnx4: get rid of qnx4_bread/qnx4_getblk
    take removal of PF_FORKNOEXEC to flush_old_exec()
    trim includes in inode.c
    um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
    um: embed ->stub_pages[] into mmu_context
    gadgetfs: list_for_each_safe() misuse
    ocfs2: fix leaks on failure exits in module_init
    ecryptfs: make register_filesystem() the last potential failure exit
    ntfs: forgets to unregister sysctls on register_filesystem() failure
    logfs: missing cleanup on register_filesystem() failure
    jfs: mising cleanup on register_filesystem() failure
    make configfs_pin_fs() return root dentry on success
    configfs: configfs_create_dir() has parent dentry in dentry->d_parent
    configfs: sanitize configfs_create()
    ...

    Linus Torvalds
     

21 Mar, 2012

2 commits


23 Feb, 2012

1 commit

  • Piergiorgio Beruto expressed the need to fetch the size of the first
    datagram in the queue for AF_UNIX sockets and suggested a patch against
    the SIOCINQ ioctl.

    I suggested instead to implement MSG_TRUNC support as a recv() input
    flag, as already done for RAW, UDP & NETLINK sockets.

    len = recv(fd, &byte, 1, MSG_PEEK | MSG_TRUNC);

    MSG_TRUNC asks recv() to return the real length of the packet, even when
    it was longer than the passed buffer.

    There is a risk that a userland application used MSG_TRUNC by accident
    (since it had no effect on af_unix sockets) and this might break after
    this patch.

    Signed-off-by: Eric Dumazet
    Tested-by: Piergiorgio Beruto
    CC: Michael Kerrisk
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Feb, 2012

2 commits

  • The same here -- we can protect the sk_peek_off manipulations with
    the unix_sk->readlock mutex.

    The peeking of data from a stream socket is done in the datagram style,
    i.e. even if there's enough room for more data in the user buffer, only
    the head skb's data is copied in there. This feature is preserved when
    peeking data from a given offset -- the data is read till the nearest
    skb's boundary.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The sk_peek_off manipulations are protected with the unix_sk->readlock mutex.
    This mutex is enough since all we need is to synchronize setting the offset
    vs reading the queue head. The latter is fully covered with the mentioned lock.

    The recently added __skb_recv_datagram's offset is used to pick the skb to
    read the data from.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

31 Jan, 2012

1 commit

  • Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from
    a stream socket) added a regression for epoll() in Edge Triggered mode
    (EPOLLET).

    The appropriate fix is to use skb_peek()/skb_unlink() instead of
    skb_dequeue(), and only call skb_unlink() when the skb is fully consumed.

    This removes the need to requeue a partial skb at the head of
    sk_receive_queue, along with the extra sk->sk_data_ready() calls that
    caused the regression.

    This is safe because once skb is given to sk_receive_queue, it is not
    modified by a writer, and readers are serialized by u->readlock mutex.

    This also reduces the number of spinlock acquisitions for small reads or
    MSG_PEEK users, so it should improve overall performance.

    Reported-by: Nick Mathewson
    Signed-off-by: Eric Dumazet
    Cc: Alexey Moiseytsev
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jan, 2012

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
    reiserfs: Properly display mount options in /proc/mounts
    vfs: prevent remount read-only if pending removes
    vfs: count unlinked inodes
    vfs: protect remounting superblock read-only
    vfs: keep list of mounts for each superblock
    vfs: switch ->show_options() to struct dentry *
    vfs: switch ->show_path() to struct dentry *
    vfs: switch ->show_devname() to struct dentry *
    vfs: switch ->show_stats to struct dentry *
    switch security_path_chmod() to struct path *
    vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
    vfs: trim includes a bit
    switch mnt_namespace ->root to struct mount
    vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
    vfs: opencode mntget() mnt_set_mountpoint()
    vfs: spread struct mount - remaining argument of next_mnt()
    vfs: move fsnotify junk to struct mount
    vfs: move mnt_devname
    vfs: move mnt_list to struct mount
    vfs: switch pnode.h macros to struct mount *
    ...

    Linus Torvalds
     

04 Jan, 2012

1 commit


31 Dec, 2011

1 commit


17 Dec, 2011

1 commit


27 Nov, 2011

1 commit


29 Sep, 2011

1 commit

  • Since commit 7361c36c5224 (af_unix: Allow credentials to work across
    user and pid namespaces) af_unix performance dropped a lot.

    This is because we now take a reference on pid and cred in each write(),
    and release them in read(), usually done from another process,
    possibly on another cpu. This triggers false sharing.

    # Events: 154K cycles
    #
    # Overhead  Command    Shared Object      Symbol
    # ........  .........  .................  ...................................
    #
      10.40%    hackbench  [kernel.kallsyms]  [k] put_pid
       8.60%    hackbench  [kernel.kallsyms]  [k] unix_stream_recvmsg
       7.87%    hackbench  [kernel.kallsyms]  [k] unix_stream_sendmsg
       6.11%    hackbench  [kernel.kallsyms]  [k] do_raw_spin_lock
       4.95%    hackbench  [kernel.kallsyms]  [k] unix_scm_to_skb
       4.87%    hackbench  [kernel.kallsyms]  [k] pid_nr_ns
       4.34%    hackbench  [kernel.kallsyms]  [k] cred_to_ucred
       2.39%    hackbench  [kernel.kallsyms]  [k] unix_destruct_scm
       2.24%    hackbench  [kernel.kallsyms]  [k] sub_preempt_count
       1.75%    hackbench  [kernel.kallsyms]  [k] fget_light
       1.51%    hackbench  [kernel.kallsyms]  [k] __mutex_lock_interruptible_slowpath
       1.42%    hackbench  [kernel.kallsyms]  [k] sock_alloc_send_pskb

    This patch includes SCM_CREDENTIALS information in an af_unix message/skb
    only if requested by the sender (see man 7 unix for details on how to
    include ancillary data using the sendmsg() system call).

    Note: This might break buggy applications that expected SCM_CREDENTIALS
    from an unaware write() system call, with the receiver not using the
    SO_PASSCRED socket option.

    If SOCK_PASSCRED is set on source or destination socket, we still
    include credentials for mere write() syscalls.

    Performance boost in hackbench : more than 50% gain on a 16 thread
    machine (2 quad-core cpus, 2 threads per core)

    hackbench 20 thread 2000

    4.228 sec instead of 9.102 sec

    Signed-off-by: Eric Dumazet
    Acked-by: Tim Chen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Sep, 2011

1 commit


25 Aug, 2011

1 commit

  • Patch series 109f6e39..7361c36c back in 2.6.36 added functionality to
    allow credentials to work across pid namespaces for packets sent via
    UNIX sockets. However, the atomic reference counts on pid and
    credentials caused plenty of cache bouncing when there are numerous
    threads of the same pid sharing a UNIX socket. This patch mitigates the
    problem by eliminating extraneous reference counts on pid and
    credentials on both send and receive path of UNIX sockets. I found a 2x
    improvement in hackbench's threaded case.

    On the receive path in unix_dgram_recvmsg, currently there is an
    increment of reference count on pid and credentials in scm_set_cred.
    Then there are two decrements of the reference counts: once in scm_recv
    and once when skb_free_datagram calls the skb->destructor function
    unix_destruct_scm. One pair of increment and decrement of ref count on
    pid and credentials can be eliminated from the receive path. Until we
    destroy the skb, we already hold the reference taken when we created the
    skb on the send side.

    On the send path, there are two increments of ref count on pid and
    credentials, once in scm_send and once in unix_scm_to_skb. Then there
    is a decrement of the reference counts in scm_destroy's call to
    scm_destroy_cred at the end of unix_dgram_sendmsg functions. One pair
    of increment and decrement of the reference counts can be removed so we
    only need to increment the ref counts once.

    By incorporating these changes, for hackbench running on a 4 socket
    NHM-EX machine with 40 cores, the execution of hackbench on
    50 groups of 20 threads sped up by factor of 2.

    Hackbench command used for testing:
    ./hackbench 50 thread 2000

    Signed-off-by: Tim Chen
    Signed-off-by: David S. Miller

    Tim Chen
     

20 Jul, 2011

1 commit


24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK are currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

02 May, 2011

1 commit

  • This fixes the following oops discovered by Dan Aloni:
    > Anyway, the following is the output of the Oops that I got on the
    > Ubuntu kernel on which I first detected the problem
    > (2.6.37-12-generic). The Oops that followed will be more useful, I
    > guess.

    > [ 5594.669852] BUG: unable to handle kernel NULL pointer dereference
    > at           (null)
    > [ 5594.681606] IP: [] unix_dgram_recvmsg+0x1fb/0x420
    > [ 5594.687576] PGD 2a05d067 PUD 2b951067 PMD 0
    > [ 5594.693720] Oops: 0002 [#1] SMP
    > [ 5594.699888] last sysfs file:

    The bug was that unix domain sockets use a pseudo packet for
    connecting and accept uses that pseudo packet to get the socket.
    In the buggy seqpacket case we were allowing unconnected
    sockets to call recvmsg and try to receive the pseudo packet.

    That is always wrong and as of commit 7361c36c5 the pseudo
    packet had become different enough from a normal packet
    that the kernel started oopsing.

    Do for seqpacket_recv what was done for seqpacket_send in 2.5
    and only allow it on connected seqpacket sockets.

    Cc: stable@kernel.org
    Tested-by: Dan Aloni
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

31 Mar, 2011

1 commit


17 Mar, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits)
    bonding: enable netpoll without checking link status
    xfrm: Refcount destination entry on xfrm_lookup
    net: introduce rx_handler results and logic around that
    bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag
    bonding: wrap slave state work
    net: get rid of multiple bond-related netdevice->priv_flags
    bonding: register slave pointer for rx_handler
    be2net: Bump up the version number
    be2net: Copyright notice change. Update to Emulex instead of ServerEngines
    e1000e: fix kconfig for crc32 dependency
    netfilter ebtables: fix xt_AUDIT to work with ebtables
    xen network backend driver
    bonding: Improve syslog message at device creation time
    bonding: Call netif_carrier_off after register_netdevice
    bonding: Incorrect TX queue offset
    net_sched: fix ip_tos2prio
    xfrm: fix __xfrm_route_forward()
    be2net: Fix UDP packet detected status in RX compl
    Phonet: fix aligned-mode pipe socket buffer header reserve
    netxen: support for GbE port settings
    ...

    Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c
    with the staging updates.

    Linus Torvalds
     

16 Mar, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits)
    tidy the trailing symlinks traversal up
    Turn resolution of trailing symlinks iterative everywhere
    simplify link_path_walk() tail
    Make trailing symlink resolution in path_lookupat() iterative
    update nd->inode in __do_follow_link() instead of after do_follow_link()
    pull handling of one pathname component into a helper
    fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH
    Allow passing O_PATH descriptors via SCM_RIGHTS datagrams
    readlinkat(), fchownat() and fstatat() with empty relative pathnames
    Allow O_PATH for symlinks
    New kind of open files - "location only".
    ext4: Copy fs UUID to superblock
    ext3: Copy fs UUID to superblock.
    vfs: Export file system uuid via /proc//mountinfo
    unistd.h: Add new syscalls numbers to asm-generic
    x86: Add new syscalls for x86_64
    x86: Add new syscalls for x86_32
    fs: Remove i_nlink check from file system link callback
    fs: Don't allow to create hardlink for deleted file
    vfs: Add open by file handle support
    ...

    Linus Torvalds
     
  • David S. Miller
     

15 Mar, 2011

1 commit


14 Mar, 2011

1 commit


11 Mar, 2011

1 commit


08 Mar, 2011

2 commits

  • Signed-off-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
     
  • The unix_dgram_recvmsg and unix_stream_recvmsg routines in
    net/af_unix.c utilize mutex_lock(&u->readlock) calls in order to
    serialize read operations of multiple threads on a single socket. This
    implies that, if all n threads of a process block in an AF_UNIX recv
    call trying to read data from the same socket, one of these threads
    will be sleeping in state TASK_INTERRUPTIBLE and all others in state
    TASK_UNINTERRUPTIBLE. Provided that a particular signal is supposed to
    be handled by a signal handler defined by the process and that none of
    these threads is blocking the signal, the complete_signal routine in
    kernel/signal.c will select the 'first' such thread it happens to
    encounter when deciding which thread to notify that a signal is
    supposed to be handled and if this is one of the TASK_UNINTERRUPTIBLE
    threads, the signal won't be handled until the one thread not blocking
    on the u->readlock mutex is woken up because some data to process has
    arrived (if this ever happens). The included patch fixes this by
    changing mutex_lock to mutex_lock_interruptible and handling possible
    error returns in the same way interruptions are handled by the actual
    receive-code.

    Signed-off-by: Rainer Weikusat
    Signed-off-by: David S. Miller

    Rainer Weikusat
     

23 Feb, 2011

1 commit


20 Jan, 2011

1 commit


19 Jan, 2011

1 commit

  • Linux Socket Filters can already be successfully attached and detached on unix
    sockets with setsockopt(sockfd, SOL_SOCKET, SO_{ATTACH,DETACH}_FILTER, ...).
    See: Documentation/networking/filter.txt

    But the filter was never used in the unix socket code so it did not work. This
    patch uses sk_filter() to filter buffers before delivery.

    This short program demonstrates the problem on SOCK_DGRAM.

    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/filter.h>

    int main(void) {
        int i, j, ret;
        int sv[2];
        struct pollfd fds[2];
        char *message = "Hello world!";
        char buffer[64];
        struct sock_filter ins[32] = {{0,},};
        struct sock_fprog filter;

        socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);

        for (i = 0 ; i < 2 ; i++) {
            fds[i].fd = sv[i];
            fds[i].events = POLLIN;
            fds[i].revents = 0;
        }

        for (j = 1 ; j < 13 ; j++) {

            /* Set a socket filter to truncate the message to j bytes */
            memset(ins, 0, sizeof(ins));
            ins[0].code = BPF_RET|BPF_K;
            ins[0].k = j;
            filter.len = 1;
            filter.filter = ins;
            setsockopt(sv[1], SOL_SOCKET, SO_ATTACH_FILTER, &filter, sizeof(filter));

            /* send a message */
            send(sv[0], message, strlen(message) + 1, 0);

            /* The filter should let the message pass but truncated. */
            poll(fds, 2, 0);

            /* Receive the truncated message */
            ret = recv(sv[1], buffer, 64, 0);
            printf("received %d bytes, expected %d\n", ret, j);
        }

        for (i = 0 ; i < 2 ; i++)
            close(sv[i]);

        return 0;
    }

    Signed-off-by: Alban Crequy
    Reviewed-by: Ian Molton
    Signed-off-by: David S. Miller

    Alban Crequy
     

06 Jan, 2011

1 commit

  • unix_release() can asynchronously set socket->sk to NULL, and
    it does so without holding the unix_state_lock() on "other"
    during stream connects.

    However, the reverse mapping, sk->sk_socket, is only transitioned
    to NULL under the unix_state_lock().

    Therefore make the security hooks follow the reverse mapping instead
    of the forward mapping.

    Reported-by: Jeremy Fitzhardinge
    Reported-by: Linus Torvalds
    Signed-off-by: David S. Miller

    David S. Miller
     

09 Dec, 2010

1 commit


30 Nov, 2010

1 commit

  • It's easy to eat all kernel memory and trigger the NMI watchdog, using an
    exploit program that queues unix sockets on top of others.

    lkml ref : http://lkml.org/lkml/2010/11/25/8

    Since this mechanism is used in applications, one choice we have is to
    impose a recursion limit.

    Other limits might be needed as well (if we queue other types of files),
    since the passfd mechanism is currently limited by socket receive queue
    sizes only.

    Add a recursion_level to unix socket, allowing up to 4 levels.

    Each time we send a unix socket through the sendfd mechanism, we copy its
    recursion level (plus one) to the receiver. This recursion level is
    cleared when the socket receive queue is emptied.

    Reported-by: Марк Коренберг
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Nov, 2010

3 commits

  • unix_dgram_poll() is pretty expensive to check POLLOUT status, because
    it has to lock the socket to get its peer, take a reference on the peer
    to check its receive queue status, and queue another poll_wait on
    peer_wait. This all can be avoided if the process calling
    unix_dgram_poll() is not interested in POLLOUT status. It makes
    unix_dgram_recvmsg() faster by not queueing irrelevant pollers in
    peer_wait.

    On a test program provided by Alban Crequy:

    Before:

    real 0m0.211s
    user 0m0.000s
    sys 0m0.208s

    After:

    real 0m0.044s
    user 0m0.000s
    sys 0m0.040s

    Suggested-by: Davide Libenzi
    Reported-by: Alban Crequy
    Acked-by: Davide Libenzi
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Alban Crequy reported a problem with connected dgram af_unix sockets and
    provided a test program. epoll() would fail to deliver an EPOLLOUT event
    when a thread dequeues a packet from the other peer, making its receive
    queue not full.

    This is because unix_dgram_poll() fails to call sock_poll_wait(file,
    &unix_sk(other)->peer_wait, wait);
    if the socket is not writeable at the time epoll_ctl(ADD) is called.

    We must call sock_poll_wait(), regardless of 'writable' status, so that
    epoll can be notified later of state changes.

    Misc: avoids testing twice (sk->sk_shutdown & RCV_SHUTDOWN)

    Reported-by: Alban Crequy
    Cc: Davide Libenzi
    Signed-off-by: Eric Dumazet
    Acked-by: Davide Libenzi
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of waking up all sleepers, use wake_up_interruptible_sync_poll() to
    wake up only the ones interested in writing to the socket.

    This patch is a specialization of commit 37e5540b3c9d (epoll keyed
    wakeups: make sockets use keyed wakeups).

    On a test program provided by Alban Crequy:

    Before:
    real 0m3.101s
    user 0m0.000s
    sys 0m6.104s

    After:

    real 0m0.211s
    user 0m0.000s
    sys 0m0.208s

    Reported-by: Alban Crequy
    Signed-off-by: Eric Dumazet
    Cc: Davide Libenzi
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Oct, 2010

1 commit

  • Robin Holt tried to boot a 16TB system and found af_unix was overflowing
    a 32bit value :

    We were seeing a failure which prevented boot. The kernel was incapable
    of creating either a named pipe or unix domain socket. This comes down
    to a common kernel function called unix_create1() which does:

    atomic_inc(&unix_nr_socks);
    if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
    goto out;

    The function get_max_files() is a simple return of files_stat.max_files.
    files_stat.max_files is a signed integer and is computed in
    fs/file_table.c's files_init().

    n = (mempages * (PAGE_SIZE / 1024)) / 10;
    files_stat.max_files = n;

    In our case, mempages (total_ram_pages) is approx 3,758,096,384
    (0xe0000000). That leaves max_files at approximately 1,503,238,553.
    This causes 2 * get_max_files() to overflow a 32-bit signed integer.

    Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
    integers, and change af_unix to use an atomic_long_t instead of atomic_t.

    get_max_files() is changed to return an unsigned long. get_nr_files() is
    changed to return a long.

    unix_nr_socks is changed from atomic_t to atomic_long_t, although this is
    not strictly needed to address Robin's problem.

    Before patch (on a 64bit kernel) :
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    -18446744071562067968

    After patch:
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    2147483648
    # cat /proc/sys/fs/file-nr
    704 0 2147483648

    Reported-by: Robin Holt
    Signed-off-by: Eric Dumazet
    Acked-by: David Miller
    Reviewed-by: Robin Holt
    Tested-by: Robin Holt
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet