08 Mar, 2011

1 commit

  • The unix_dgram_recvmsg and unix_stream_recvmsg routines in
    net/af_unix.c utilize mutex_lock(&u->readlock) calls in order to
    serialize read operations of multiple threads on a single socket. This
    implies that, if all n threads of a process block in an AF_UNIX recv
    call trying to read data from the same socket, one of these threads
    will be sleeping in state TASK_INTERRUPTIBLE and all others in state
    TASK_UNINTERRUPTIBLE. Provided that a particular signal is supposed to
    be handled by a signal handler defined by the process and that none of
    this threads is blocking the signal, the complete_signal routine in
    kernel/signal.c will select the 'first' such thread it happens to
    encounter when deciding which thread to notify that a signal is
    supposed to be handled and if this is one of the TASK_UNINTERRUPTIBLE
    threads, the signal won't be handled until the one thread not blocking
    on the u->readlock mutex is woken up because some data to process has
    arrived (if this ever happens). The included patch fixes this by
    changing mutex_lock to mutex_lock_interruptible and handling possible
    error returns in the same way interruptions are handled by the actual
    receive-code.

    Signed-off-by: Rainer Weikusat
    Signed-off-by: David S. Miller

    Rainer Weikusat
     

06 Jan, 2011

1 commit

  • unix_release() can asynchornously set socket->sk to NULL, and
    it does so without holding the unix_state_lock() on "other"
    during stream connects.

    However, the reverse mapping, sk->sk_socket, is only transitioned
    to NULL under the unix_state_lock().

    Therefore make the security hooks follow the reverse mapping instead
    of the forward mapping.

    Reported-by: Jeremy Fitzhardinge
    Reported-by: Linus Torvalds
    Signed-off-by: David S. Miller

    David S. Miller
     

09 Dec, 2010

1 commit


30 Nov, 2010

1 commit

  • Its easy to eat all kernel memory and trigger NMI watchdog, using an
    exploit program that queues unix sockets on top of others.

    lkml ref : http://lkml.org/lkml/2010/11/25/8

    This mechanism is used in applications, one choice we have is to have a
    recursion limit.

    Other limits might be needed as well (if we queue other types of files),
    since the passfd mechanism is currently limited by socket receive queue
    sizes only.

    Add a recursion_level to unix socket, allowing up to 4 levels.

    Each time we send an unix socket through sendfd mechanism, we copy its
    recursion level (plus one) to receiver. This recursion level is cleared
    when socket receive queue is emptied.

    Reported-by: Марк Коренберг
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Nov, 2010

1 commit

  • Vegard Nossum found a unix socket OOM was possible, posting an exploit
    program.

    My analysis is we can eat all LOWMEM memory before unix_gc() being
    called from unix_release_sock(). Moreover, the thread blocked in
    unix_gc() can consume huge amount of time to perform cleanup because of
    huge working set.

    One way to handle this is to have a sensible limit on unix_tot_inflight,
    tested from wait_for_unix_gc() and to force a call to unix_gc() if this
    limit is hit.

    This solves the OOM and also reduce overall latencies, and should not
    slowdown normal workloads.

    Reported-by: Vegard Nossum
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Nov, 2010

3 commits

  • unix_dgram_poll() is pretty expensive to check POLLOUT status, because
    it has to lock the socket to get its peer, take a reference on the peer
    to check its receive queue status, and queue another poll_wait on
    peer_wait. This all can be avoided if the process calling
    unix_dgram_poll() is not interested in POLLOUT status. It makes
    unix_dgram_recvmsg() faster by not queueing irrelevant pollers in
    peer_wait.

    On a test program provided by Alan Crequy :

    Before:

    real 0m0.211s
    user 0m0.000s
    sys 0m0.208s

    After:

    real 0m0.044s
    user 0m0.000s
    sys 0m0.040s

    Suggested-by: Davide Libenzi
    Reported-by: Alban Crequy
    Acked-by: Davide Libenzi
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Alban Crequy reported a problem with connected dgram af_unix sockets and
    provided a test program. epoll() would miss to send an EPOLLOUT event
    when a thread unqueues a packet from the other peer, making its receive
    queue not full.

    This is because unix_dgram_poll() fails to call sock_poll_wait(file,
    &unix_sk(other)->peer_wait, wait);
    if the socket is not writeable at the time epoll_ctl(ADD) is called.

    We must call sock_poll_wait(), regardless of 'writable' status, so that
    epoll can be notified later of states changes.

    Misc: avoids testing twice (sk->sk_shutdown & RCV_SHUTDOWN)

    Reported-by: Alban Crequy
    Cc: Davide Libenzi
    Signed-off-by: Eric Dumazet
    Acked-by: Davide Libenzi
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of wakeup all sleepers, use wake_up_interruptible_sync_poll() to
    wakeup only ones interested into writing the socket.

    This patch is a specialization of commit 37e5540b3c9d (epoll keyed
    wakeups: make sockets use keyed wakeups).

    On a test program provided by Alan Crequy :

    Before:
    real 0m3.101s
    user 0m0.000s
    sys 0m6.104s

    After:

    real 0m0.211s
    user 0m0.000s
    sys 0m0.208s

    Reported-by: Alban Crequy
    Signed-off-by: Eric Dumazet
    Cc: Davide Libenzi
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Oct, 2010

1 commit

  • Robin Holt tried to boot a 16TB system and found af_unix was overflowing
    a 32bit value :

    We were seeing a failure which prevented boot. The kernel was incapable
    of creating either a named pipe or unix domain socket. This comes down
    to a common kernel function called unix_create1() which does:

    atomic_inc(&unix_nr_socks);
    if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
    goto out;

    The function get_max_files() is a simple return of files_stat.max_files.
    files_stat.max_files is a signed integer and is computed in
    fs/file_table.c's files_init().

    n = (mempages * (PAGE_SIZE / 1024)) / 10;
    files_stat.max_files = n;

    In our case, mempages (total_ram_pages) is approx 3,758,096,384
    (0xe0000000). That leaves max_files at approximately 1,503,238,553.
    This causes 2 * get_max_files() to integer overflow.

    Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
    integers, and change af_unix to use an atomic_long_t instead of atomic_t.

    get_max_files() is changed to return an unsigned long. get_nr_files() is
    changed to return a long.

    unix_nr_socks is changed from atomic_t to atomic_long_t, while not
    strictly needed to address Robin problem.

    Before patch (on a 64bit kernel) :
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    -18446744071562067968

    After patch:
    # echo 2147483648 >/proc/sys/fs/file-max
    # cat /proc/sys/fs/file-max
    2147483648
    # cat /proc/sys/fs/file-nr
    704 0 2147483648

    Reported-by: Robin Holt
    Signed-off-by: Eric Dumazet
    Acked-by: David Miller
    Reviewed-by: Robin Holt
    Tested-by: Robin Holt
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

06 Oct, 2010

1 commit

  • Userspace applications can already request to receive timestamps with:
    setsockopt(sockfd, SOL_SOCKET, SO_TIMESTAMP, ...)

    Although setsockopt() returns zero (success), timestamps are not added to the
    ancillary data. This patch fixes that on SOCK_DGRAM and SOCK_SEQPACKET Unix
    sockets.

    Signed-off-by: Alban Crequy
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alban Crequy
     

10 Sep, 2010

1 commit


08 Sep, 2010

1 commit

  • We assumed that unix_autobind() never fails if kzalloc() succeeded.
    But unix_autobind() allows only 1048576 names. If /proc/sys/fs/file-max is
    larger than 1048576 (e.g. systems with more than 10GB of RAM), a local user can
    consume all names using fork()/socket()/bind().

    If all names are in use, those who call bind() with addr_len == sizeof(short)
    or connect()/sendmsg() with setsockopt(SO_PASSCRED) will continue

    while (1)
    yield();

    loop at unix_autobind() till a name becomes available.
    This patch adds a loop counter in order to give up after 1048576 attempts.

    Calling yield() for once per 256 attempts may not be sufficient when many names
    are already in use, for __unix_find_socket_byname() can take long time under
    such circumstance. Therefore, this patch also adds cond_resched() call.

    Note that currently a local user can consume 2GB of kernel memory if the user
    is allowed to create and autobind 1048576 UNIX domain sockets. We should
    consider adding some restriction for autobind operation.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: David S. Miller

    Tetsuo Handa
     

07 Sep, 2010

1 commit


21 Jul, 2010

1 commit

  • Convert a few calls from kfree_skb to consume_skb

    Noticed while I was working on dropwatch that I was detecting lots of internal
    skb drops in several places. While some are legitimate, several were not,
    freeing skbs that were at the end of their life, rather than being discarded due
    to an error. This patch converts those calls sites from using kfree_skb to
    consume_skb, which quiets the in-kernel drop_monitor code from detecting them as
    drops. Tested successfully by myself

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller

    Neil Horman
     

17 Jun, 2010

3 commits

  • Remove the restriction that only allows connecting to a unix domain
    socket identified by unix path that is in the same network namespace.

    Crossing network namespaces is always tricky and we did not support
    this at first, because of a strict policy of don't mix the namespaces.
    Later after Pavel proposed this we did not support this because no one
    had performed the audit to make certain using unix domain sockets
    across namespaces is safe.

    What fundamentally makes connecting to af_unix sockets in other
    namespaces is safe is that you have to have the proper permissions on
    the unix domain socket inode that lives in the filesystem. If you
    want strict isolation you just don't create inodes where unfriendlys
    can get at them, or with permissions that allow unfriendlys to open
    them. All nicely handled for us by the mount namespace and other
    standard file system facilities.

    I looked through unix domain sockets and they are a very controlled
    environment so none of the work that goes on in dev_forward_skb to
    make crossing namespaces safe appears needed, we are not loosing
    controll of the skb and so do not need to set up the skb to look like
    it is comming in fresh from the outside world. Further the fields in
    struct unix_skb_parms should not have any problems crossing network
    namespaces.

    Now that we handle SCM_CREDENTIALS in a way that gives useable values
    across namespaces. There does not appear to be any operational
    problems with encouraging the use of unix domain sockets across
    containers either.

    Signed-off-by: Eric W. Biederman
    Acked-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • In unix_skb_parms store pointers to struct pid and struct cred instead
    of raw uid, gid, and pid values, then translate the credentials on
    reception into values that are meaningful in the receiving processes
    namespaces.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Use struct pid and struct cred to store the peer credentials on struct
    sock. This gives enough information to convert the peer credential
    information to a value relative to whatever namespace the socket is in
    at the time.

    This removes nasty surprises when using SO_PEERCRED on socket
    connetions where the processes on either side are in different pid and
    user namespaces.

    Signed-off-by: Eric W. Biederman
    Acked-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

04 May, 2010

1 commit


02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
    socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
    a rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to eventually wakeup tasks.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
    instead of dynamically allocated ones. They dont need rcu freeing.

    Some cleanups or followups are probably needed, (possible
    sk_callback_lock conversion to a spinlock for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Apr, 2010

1 commit

  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
    return sk->sk_sleep;
    }

    Change all read occurrences of sk_sleep by a call to this function.

    Needed for a future RCU conversion. sk_sleep wont be a field directly
    available.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

19 Feb, 2010

1 commit


18 Jan, 2010

1 commit


08 Dec, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
    mac80211: fix reorder buffer release
    iwmc3200wifi: Enable wimax core through module parameter
    iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
    iwmc3200wifi: Coex table command does not expect a response
    iwmc3200wifi: Update wiwi priority table
    iwlwifi: driver version track kernel version
    iwlwifi: indicate uCode type when fail dump error/event log
    iwl3945: remove duplicated event logging code
    b43: fix two warnings
    ipw2100: fix rebooting hang with driver loaded
    cfg80211: indent regulatory messages with spaces
    iwmc3200wifi: fix NULL pointer dereference in pmkid update
    mac80211: Fix TX status reporting for injected data frames
    ath9k: enable 2GHz band only if the device supports it
    airo: Fix integer overflow warning
    rt2x00: Fix padding bug on L2PAD devices.
    WE: Fix set events not propagated
    b43legacy: avoid PPC fault during resume
    b43: avoid PPC fault during resume
    tcp: fix a timewait refcnt race
    ...

    Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
    CTL_UNNUMBERED removed) in
    kernel/sysctl_check.c
    net/ipv4/sysctl_net_ipv4.c
    net/ipv6/addrconf.c
    net/sctp/sysctl.c

    Linus Torvalds
     

30 Nov, 2009

1 commit


12 Nov, 2009

1 commit

  • Now that sys_sysctl is a compatiblity wrapper around /proc/sys
    all sysctl strategy routines, and all ctl_name and strategy
    entries in the sysctl tables are unused, and can be
    revmoed.

    In addition neigh_sysctl_register has been modified to no longer
    take a strategy argument and it's callers have been modified not
    to pass one.

    Cc: "David Miller"
    Cc: Hideaki YOSHIFUJI
    Cc: netdev@vger.kernel.org
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Nov, 2009

1 commit


06 Nov, 2009

1 commit

  • The generic __sock_create function has a kern argument which allows the
    security system to make decisions based on if a socket is being created by
    the kernel or by userspace. This patch passes that flag to the
    net_proto_family specific create function, so it can do the same thing.

    Signed-off-by: Eric Paris
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Paris
     

27 Oct, 2009

1 commit


19 Oct, 2009

1 commit

  • I found a deadlock bug in UNIX domain socket, which makes able to DoS
    attack against the local machine by non-root users.

    How to reproduce:
    1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
    namespace(*), and shutdown(2) it.
    2. Repeat connect(2)ing to the listening socket from the other sockets
    until the connection backlog is full-filled.
    3. connect(2) takes the CPU forever. If every core is taken, the
    system hangs.

    PoC code: (Run as many times as cores on SMP machines.)

    int main(void)
    {
    int ret;
    int csd;
    int lsd;
    struct sockaddr_un sun;

    /* make an abstruct name address (*) */
    memset(&sun, 0, sizeof(sun));
    sun.sun_family = PF_UNIX;
    sprintf(&sun.sun_path[1], "%d", getpid());

    /* create the listening socket and shutdown */
    lsd = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
    listen(lsd, 1);
    shutdown(lsd, SHUT_RDWR);

    /* connect loop */
    alarm(15); /* forcely exit the loop after 15 sec */
    for (;;) {
    csd = socket(AF_UNIX, SOCK_STREAM, 0);
    ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
    if (-1 == ret) {
    perror("connect()");
    break;
    }
    puts("Connection OK");
    }
    return 0;
    }

    (*) Make sun_path[0] = 0 to use the abstruct namespace.
    If a file-based socket is used, the system doesn't deadlock because
    of context switches in the file system layer.

    Why this happens:
    Error checks between unix_socket_connect() and unix_wait_for_peer() are
    inconsistent. The former calls the latter to wait until the backlog is
    processed. Despite the latter returns without doing anything when the
    socket is shutdown, the former doesn't check the shutdown state and
    just retries calling the latter forever.

    Patch:
    The patch below adds shutdown check into unix_socket_connect(), so
    connect(2) to the shutdown socket will return -ECONREFUSED.

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Masanori Yoshida
    Signed-off-by: David S. Miller

    Tomoki Sekiyama
     

07 Oct, 2009

1 commit


12 Sep, 2009

1 commit

  • Kalle Olavi Niemitalo reported that:

    "..., when one process calls sendmsg once to send 43804 bytes of
    data and one file descriptor, and another process then calls recvmsg
    three times to receive the 16032+16032+11740 bytes, each of those
    recvmsg calls returns the file descriptor in the ancillary data. I
    confirmed this with strace. The behaviour differs from Linux
    2.6.26, where reportedly only one of those recvmsg calls (I think
    the first one) returned the file descriptor."

    This bug was introduced by a patch from me titled "net: unix: fix inflight
    counting bug in garbage collector", commit 6209344f5.

    And the reason is, quoting Kalle:

    "Before your patch, unix_attach_fds() would set scm->fp = NULL, so
    that if the loop in unix_stream_sendmsg() ran multiple iterations,
    it could not call unix_attach_fds() again. But now,
    unix_attach_fds() leaves scm->fp unchanged, and I think this causes
    it to be called multiple times and duplicate the same file
    descriptors to each struct sk_buff."

    Fix this by introducing a flag that is cleared at the start and set
    when the fds attached to the first buffer. The resulting code should
    work equivalently to the one on 2.6.26.

    Reported-by: Kalle Olavi Niemitalo
    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
     

10 Jul, 2009

1 commit

  • Adding memory barrier after the poll_wait function, paired with
    receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper
    to wrap the memory barrier.

    Without the memory barrier, following race can happen.
    The race fires, when following code paths meet, and the tp->rcv_nxt
    and __add_wait_queue updates stay in CPU caches.

    CPU1 CPU2

    sys_select receive packet
    ... ...
    __add_wait_queue update tp->rcv_nxt
    ... ...
    tp->rcv_nxt check sock_def_readable
    ... {
    schedule ...
    if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
    wake_up_interruptible(sk->sk_sleep)
    ...
    }

    If there was no cache the code would work ok, since the wait_queue and
    rcv_nxt are opposit to each other.

    Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
    passed the tp->rcv_nxt check and sleeps, or will get the new value for
    tp->rcv_nxt and will return with new data mask.
    In both cases the process (CPU1) is being added to the wait queue, so the
    waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

    The bad case is when the __add_wait_queue changes done by CPU1 stay in its
    cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then
    endup calling schedule and sleep forever if there are no more data on the
    socket.

    Calls to poll_wait in following modules were ommited:
    net/bluetooth/af_bluetooth.c
    net/irda/af_irda.c
    net/irda/irnet/irnet_ppp.c
    net/mac80211/rc80211_pid_debugfs.c
    net/phonet/socket.c
    net/rds/af_rds.c
    net/rfkill/core.c
    net/sunrpc/cache.c
    net/sunrpc/rpc_pipe.c
    net/tipc/socket.c

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
     

18 Jun, 2009

1 commit

  • commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
    (net: No more expensive sock_hold()/sock_put() on each tx)
    changed initial sk_wmem_alloc value.

    We need to take into account this offset when reporting
    sk_wmem_alloc to user, in PROC_FS files or various
    ioctls (SIOCOUTQ/TIOCOUTQ)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Apr, 2009

1 commit


27 Feb, 2009

1 commit


01 Jan, 2009

1 commit


29 Dec, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
    net: Allow dependancies of FDDI & Tokenring to be modular.
    igb: Fix build warning when DCA is disabled.
    net: Fix warning fallout from recent NAPI interface changes.
    gro: Fix potential use after free
    sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
    sfc: When disabling the NIC, close the device rather than unregistering it
    sfc: SFT9001: Add cable diagnostics
    sfc: Add support for multiple PHY self-tests
    sfc: Merge top-level functions for self-tests
    sfc: Clean up PHY mode management in loopback self-test
    sfc: Fix unreliable link detection in some loopback modes
    sfc: Generate unique names for per-NIC workqueues
    802.3ad: use standard ethhdr instead of ad_header
    802.3ad: generalize out mac address initializer
    802.3ad: initialize ports LACPDU from const initializer
    802.3ad: remove typedef around ad_system
    802.3ad: turn ports is_individual into a bool
    802.3ad: turn ports is_enabled into a bool
    802.3ad: make ntt bool
    ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
    ...

    Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
    to the conversion to %pI (in this networking merge) and the addition of
    doing IPv6 addresses (from the earlier merge of CIFS).

    Linus Torvalds
     

04 Dec, 2008

1 commit


03 Dec, 2008

1 commit