11 Nov, 2009

1 commit


06 Nov, 2009

1 commit

  • The generic __sock_create function has a kern argument which allows the
    security system to make decisions based on if a socket is being created by
    the kernel or by userspace. This patch passes that flag to the
    net_proto_family specific create function, so it can do the same thing.

    Signed-off-by: Eric Paris
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Paris
     

27 Oct, 2009

1 commit


19 Oct, 2009

1 commit

  • I found a deadlock bug in UNIX domain sockets that lets non-root users
    mount a DoS attack against the local machine.

    How to reproduce:
    1. Make a listening AF_UNIX/SOCK_STREAM socket in the abstract
    namespace (*), and shutdown(2) it.
    2. Repeatedly connect(2) to the listening socket from other sockets
    until the connection backlog is full.
    3. connect(2) takes the CPU forever. If every core is taken, the
    system hangs.

    PoC code: (Run as many times as cores on SMP machines.)

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int main(void)
    {
        int ret;
        int csd;
        int lsd;
        struct sockaddr_un sun;

        /* make an abstract name address (*) */
        memset(&sun, 0, sizeof(sun));
        sun.sun_family = PF_UNIX;
        sprintf(&sun.sun_path[1], "%d", getpid());

        /* create the listening socket and shut it down */
        lsd = socket(AF_UNIX, SOCK_STREAM, 0);
        bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
        listen(lsd, 1);
        shutdown(lsd, SHUT_RDWR);

        /* connect loop */
        alarm(15); /* forcibly exit the loop after 15 sec */
        for (;;) {
            /* client sockets are left open on purpose so the backlog fills */
            csd = socket(AF_UNIX, SOCK_STREAM, 0);
            ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
            if (-1 == ret) {
                perror("connect()");
                break;
            }
            puts("Connection OK");
        }
        return 0;
    }

    (*) Set sun_path[0] = 0 to use the abstract namespace.
    If a file-based socket is used, the system doesn't deadlock because
    of context switches in the file system layer.

    Why this happens:
    Error checks between unix_stream_connect() and unix_wait_for_peer() are
    inconsistent. The former calls the latter to wait until the backlog is
    processed. Although the latter returns without doing anything when the
    socket is shut down, the former doesn't check the shutdown state and
    just retries calling the latter forever.

    Patch:
    The patch below adds a shutdown check to unix_stream_connect(), so
    connect(2) to a shut-down socket will return -ECONNREFUSED.
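
    The retry loop described above can be sketched in userspace. This is
    only a simplified model, not the kernel code: struct model_listener,
    model_wait_for_peer(), model_connect() and the max_spins bound are all
    hypothetical names introduced for illustration, and -ETIMEDOUT merely
    stands in for "would spin forever".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of a listening socket (not kernel code). */
struct model_listener {
    bool shutdown;  /* set once shutdown(2) has been called on it */
    int  queued;    /* connections currently waiting in the backlog */
    int  backlog;   /* maximum number of queued connections */
};

/* Models the wait helper: it returns at once if the listener is shut
 * down, otherwise it pretends the listener accepted one connection. */
static void model_wait_for_peer(struct model_listener *l)
{
    if (l->shutdown)
        return;
    if (l->queued > 0)
        l->queued--;
}

/* Models the connecting side. With check_shutdown == false it mirrors
 * the buggy behaviour; max_spins bounds the loop so that the model
 * terminates and reports -ETIMEDOUT instead of spinning forever. */
static int model_connect(struct model_listener *l, bool check_shutdown,
                         int max_spins)
{
    for (int spins = 0; spins < max_spins; spins++) {
        if (check_shutdown && l->shutdown)
            return -ECONNREFUSED;   /* the added check */
        if (l->queued < l->backlog) {
            l->queued++;
            return 0;
        }
        model_wait_for_peer(l);     /* backlog full: wait and retry */
    }
    return -ETIMEDOUT;
}
```

    With the check enabled, a connection attempt against a shut-down
    listener with a full backlog fails immediately instead of retrying.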

    Signed-off-by: Tomoki Sekiyama
    Signed-off-by: Masanori Yoshida
    Signed-off-by: David S. Miller

    Tomoki Sekiyama
     

07 Oct, 2009

1 commit


12 Sep, 2009

1 commit

  • Kalle Olavi Niemitalo reported that:

    "..., when one process calls sendmsg once to send 43804 bytes of
    data and one file descriptor, and another process then calls recvmsg
    three times to receive the 16032+16032+11740 bytes, each of those
    recvmsg calls returns the file descriptor in the ancillary data. I
    confirmed this with strace. The behaviour differs from Linux
    2.6.26, where reportedly only one of those recvmsg calls (I think
    the first one) returned the file descriptor."

    This bug was introduced by a patch from me titled "net: unix: fix inflight
    counting bug in garbage collector", commit 6209344f5.

    And the reason is, quoting Kalle:

    "Before your patch, unix_attach_fds() would set scm->fp = NULL, so
    that if the loop in unix_stream_sendmsg() ran multiple iterations,
    it could not call unix_attach_fds() again. But now,
    unix_attach_fds() leaves scm->fp unchanged, and I think this causes
    it to be called multiple times and duplicate the same file
    descriptors to each struct sk_buff."

    Fix this by introducing a flag that is cleared at the start and set
    when the fds are attached to the first buffer. The resulting code should
    behave the same as the code in 2.6.26.
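
    A minimal sketch of the flag idea, assuming a sender that splits one
    message into fixed-size chunks. struct model_chunk, model_stream_send()
    and the fds_sent variable are hypothetical names for this illustration;
    the real code operates on struct sk_buff and scm data.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one queued buffer. */
struct model_chunk {
    size_t len;      /* payload bytes carried by this buffer */
    bool   has_fds;  /* whether the file descriptors ride on it */
};

/* Splits `total` bytes into buffers of at most `chunk_max` bytes and
 * attaches the descriptors to the first buffer only.
 * Returns the number of buffers written to `out`. */
static size_t model_stream_send(size_t total, size_t chunk_max, bool with_fds,
                                struct model_chunk *out, size_t max_chunks)
{
    bool fds_sent = false;   /* cleared at the start */
    size_t n = 0, sent = 0;

    if (chunk_max == 0)
        return 0;

    while (sent < total && n < max_chunks) {
        size_t size = total - sent;
        if (size > chunk_max)
            size = chunk_max;

        out[n].len = size;
        out[n].has_fds = with_fds && !fds_sent;
        if (out[n].has_fds)
            fds_sent = true; /* set once the fds are attached */

        sent += size;
        n++;
    }
    return n;
}
```

    For the sizes in the report (43804 bytes in buffers of 16032), the
    model yields three buffers and only the first carries the descriptors.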

    Reported-by: Kalle Olavi Niemitalo
    Signed-off-by: Miklos Szeredi
    Signed-off-by: David S. Miller

    Miklos Szeredi
     

10 Jul, 2009

1 commit

  • Add a memory barrier after the poll_wait function, paired with the
    receive callbacks. Add the functions sock_poll_wait and sk_has_sleeper
    to wrap the memory barrier.

    Without the memory barrier, the following race can happen.
    The race fires when the following code paths meet and the tp->rcv_nxt
    and __add_wait_queue updates stay in the CPU caches.

    CPU1                         CPU2

    sys_select                   receive packet
      ...                        ...
      __add_wait_queue           update tp->rcv_nxt
      ...                        ...
      tp->rcv_nxt check          sock_def_readable
      ...                        {
      schedule                      ...
                                    if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                        wake_up_interruptible(sk->sk_sleep)
                                    ...
                                 }

    If there were no caches the code would work correctly, since the
    wait_queue and rcv_nxt accesses happen in opposite order on the two CPUs.

    Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either already
    passed the tp->rcv_nxt check and sleeps, or will get the new value of
    tp->rcv_nxt and will return with the new data mask.
    In both cases the process (CPU1) has been added to the wait queue, so the
    waitqueue_active (CPU2) call cannot miss it and will wake up CPU1.

    The bad case is when the __add_wait_queue changes done by CPU1 stay in its
    cache, and so does the tp->rcv_nxt update on the CPU2 side. CPU1 will then
    end up calling schedule and sleeping forever if no more data arrives on
    the socket.
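
    The pairing can be illustrated with C11 atomics instead of kernel
    primitives. This is a userspace sketch under stated assumptions:
    on_waitqueue and data_ready are hypothetical stand-ins for the wait
    queue entry and tp->rcv_nxt, and atomic_thread_fence() plays the role
    of the memory barrier. Each side stores its own flag, issues a full
    fence, then loads the other side's flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int on_waitqueue;  /* models __add_wait_queue having run */
static atomic_int data_ready;    /* models the tp->rcv_nxt update */

/* Poll side: register as a sleeper, fence, then check for data.
 * Returns true if the caller would go to sleep. */
static bool model_poll_should_sleep(void)
{
    atomic_store_explicit(&on_waitqueue, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&data_ready, memory_order_relaxed) == 0;
}

/* Receive side: publish the data, fence, then check for sleepers.
 * Returns true if a wakeup has to be issued. */
static bool model_receive_should_wake(void)
{
    atomic_store_explicit(&data_ready, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&on_waitqueue, memory_order_relaxed) != 0;
}

/* Resets both flags so the model can be exercised again. */
static void model_reset(void)
{
    atomic_store(&on_waitqueue, 0);
    atomic_store(&data_ready, 0);
}
```

    With a full fence on both sides, the two loads cannot both observe the
    old values, so the outcome "poller sleeps and receiver skips the
    wakeup" is ruled out. Dropping either fence allows it again.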

    Calls to poll_wait in the following modules were omitted:
    net/bluetooth/af_bluetooth.c
    net/irda/af_irda.c
    net/irda/irnet/irnet_ppp.c
    net/mac80211/rc80211_pid_debugfs.c
    net/phonet/socket.c
    net/rds/af_rds.c
    net/rfkill/core.c
    net/sunrpc/cache.c
    net/sunrpc/rpc_pipe.c
    net/tipc/socket.c

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
     

18 Jun, 2009

1 commit

  • commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
    (net: No more expensive sock_hold()/sock_put() on each tx)
    changed initial sk_wmem_alloc value.

    We need to take into account this offset when reporting
    sk_wmem_alloc to user, in PROC_FS files or various
    ioctls (SIOCOUTQ/TIOCOUTQ)
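
    A rough illustration of the offset handling, assuming the initial
    value is one. struct model_sock and the model_* helpers are
    hypothetical names for this sketch and do not reproduce the kernel
    structures.

```c
/* Hypothetical model of a socket's write-memory accounting. */
struct model_sock {
    int wmem_alloc;  /* raw counter, biased by the initial value */
};

/* Assumed initial bias: the counter starts at one, not zero. */
#define MODEL_WMEM_BIAS 1

static void model_sock_init(struct model_sock *sk)
{
    sk->wmem_alloc = MODEL_WMEM_BIAS;
}

/* Accounts `bytes` of queued transmit data to the socket. */
static void model_sock_charge(struct model_sock *sk, int bytes)
{
    sk->wmem_alloc += bytes;
}

/* Value to report to user space: the raw counter minus the bias. */
static int model_wmem_alloc_get(const struct model_sock *sk)
{
    return sk->wmem_alloc - MODEL_WMEM_BIAS;
}
```

    Reporting paths would then call the accessor rather than reading the
    raw counter, so an idle socket shows zero queued bytes.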

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Apr, 2009

1 commit


27 Feb, 2009

1 commit


01 Jan, 2009

1 commit


29 Dec, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
    net: Allow dependancies of FDDI & Tokenring to be modular.
    igb: Fix build warning when DCA is disabled.
    net: Fix warning fallout from recent NAPI interface changes.
    gro: Fix potential use after free
    sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
    sfc: When disabling the NIC, close the device rather than unregistering it
    sfc: SFT9001: Add cable diagnostics
    sfc: Add support for multiple PHY self-tests
    sfc: Merge top-level functions for self-tests
    sfc: Clean up PHY mode management in loopback self-test
    sfc: Fix unreliable link detection in some loopback modes
    sfc: Generate unique names for per-NIC workqueues
    802.3ad: use standard ethhdr instead of ad_header
    802.3ad: generalize out mac address initializer
    802.3ad: initialize ports LACPDU from const initializer
    802.3ad: remove typedef around ad_system
    802.3ad: turn ports is_individual into a bool
    802.3ad: turn ports is_enabled into a bool
    802.3ad: make ntt bool
    ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
    ...

    Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
    to the conversion to %pI (in this networking merge) and the addition of
    doing IPv6 addresses (from the earlier merge of CIFS).

    Linus Torvalds
     

04 Dec, 2008

1 commit


03 Dec, 2008

1 commit


27 Nov, 2008

1 commit

  • This is an implementation of David Miller's suggested fix in:
    https://bugzilla.redhat.com/show_bug.cgi?id=470201

    It has been updated to use wait_event() instead of
    wait_event_interruptible().

    Paraphrasing the description from the above report, it makes sendmsg()
    block while UNIX garbage collection is in progress. This avoids a
    situation where child processes continue to queue new FDs over an
    AF_UNIX socket to a parent which is in the exit path and running
    garbage collection on these FDs. This contention can result in soft
    lockups and oom-killing of unrelated processes.

    Signed-off-by: dann frazier
    Signed-off-by: David S. Miller

    dann frazier
     

26 Nov, 2008

1 commit

  • Instead of using one atomic_t per protocol, use a percpu_counter
    for "sockets_allocated", to reduce cache line contention on
    heavy duty network servers.
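
    The idea can be sketched as one slot per CPU that is summed on read.
    This is a deliberately simplified model: MODEL_NR_CPUS and the
    model_counter_* helpers are hypothetical, and the kernel's
    percpu_counter additionally keeps a shared total with batched updates.

```c
#define MODEL_NR_CPUS 4  /* assumed CPU count for the illustration */

/* One slot per CPU, so an update touches only the local slot and
 * different CPUs do not contend for the same cache line. */
struct model_percpu_counter {
    long counts[MODEL_NR_CPUS];
};

/* Adds `delta` to the slot owned by `cpu`. */
static void model_counter_add(struct model_percpu_counter *c, int cpu,
                              long delta)
{
    c->counts[cpu] += delta;
}

/* Reading is the slow path: it walks every slot and sums them. */
static long model_counter_sum(const struct model_percpu_counter *c)
{
    long sum = 0;

    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        sum += c->counts[cpu];
    return sum;
}
```

    Updates become cheap and local, while the rarely needed total costs a
    walk over all slots.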

    Note: We revert commit 248969ae31e1b3276fc4399d67ce29a5d81e6fd9
    (net: af_unix can make unix_nr_socks visbile in /proc),
    since it is no longer used after the addition of sock_prot_inuse_add().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Nov, 2008

2 commits

  • The rule for calling sock_prot_inuse_add() is that BHs must
    be disabled. Some new calls were added where this was not
    true and this triggers warnings as reported by Ilpo.

    Fix this by adding explicit BH disabling around those call sites,
    or moving sock_prot_inuse_add() call inside an existing BH disabled
    section.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The rule for calling sock_prot_inuse_add() is that BHs must
    be disabled. Some new calls were added where this was not
    true and this triggers warnings as reported by Ilpo.

    Fix this by adding explicit BH disabling around those call sites.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Nov, 2008

2 commits


17 Nov, 2008

3 commits

  • This patch prepares for the namespace conversion of /proc/net/protocols.

    In order to have relevant information for UNIX protocol, we should use
    sock_prot_inuse_add() to update a (percpu and pernamespace) counter of
    inuse sockets.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Currently, /proc/net/protocols displays socket counts only for TCP/TCPv6
    protocols

    We can provide unix_nr_socks for free here, this counter being
    already maintained in af_unix

    Before patch :

    # grep UNIX /proc/net/protocols
    UNIX 428 -1 -1 NI 0 yes kernel

    After patch :

    # grep UNIX /proc/net/protocols
    UNIX 428 98 -1 NI 0 yes kernel

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is a pure cleanup of net/unix/af_unix.c to meet current code
    style standards

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Nov, 2008

2 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed conflicts above by using the non 'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: netdev@vger.kernel.org
    Signed-off-by: James Morris

    David Howells
     

12 Nov, 2008

1 commit


10 Nov, 2008

1 commit

  • Previously I assumed that the receive queues of candidates don't
    change during the GC. This is only half true, nothing can be received
    from the queues (see comment in unix_gc()), but buffers could be added
    through the other half of the socket pair, which may still have file
    descriptors referring to it.

    This can result in inc_inflight_move_tail() erroneously increasing the
    "inflight" counter for a unix socket for which dec_inflight() wasn't
    previously called. This in turn can trigger the "BUG_ON(total_refs <
    inflight_refs)" in a later garbage collection run.

    Fix this by only manipulating the "inflight" counter for sockets which
    are candidates themselves. Duplicating the file references in
    unix_attach_fds() is also needed to prevent a socket becoming a
    candidate for GC while the skb that contains it is not yet queued.

    Reported-by: Andrea Bittau
    Signed-off-by: Miklos Szeredi
    CC: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

07 Nov, 2008

1 commit


04 Nov, 2008

1 commit

  • I want to compile out proc_* and sysctl_* handlers totally and
    stub them to NULL depending on config options. However, the use of &
    prevents this, since taking the address of NULL breaks
    compilation.

    So, drop the & in front of every ->proc_handler and every ->strategy
    handler; it was never needed in the first place.
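
    A small sketch of why the & gets in the way, assuming a handler that
    is replaced by a NULL macro when its feature is configured out.
    MODEL_CONFIG_HANDLERS, model_dointvec and struct model_ctl_table are
    hypothetical names used only for this illustration.

```c
#include <stddef.h>

typedef int (*model_handler_t)(int);

#ifdef MODEL_CONFIG_HANDLERS
/* Feature enabled: a real handler exists. */
static int model_dointvec(int value)
{
    return value;
}
#else
/* Feature disabled: the handler name is stubbed out to NULL. */
#define model_dointvec NULL
#endif

struct model_ctl_table {
    const char      *name;
    model_handler_t  handler;
};

static const struct model_ctl_table model_table[] = {
    /* A bare function name decays to a pointer, and NULL converts to a
     * null function pointer, so this line builds in both configurations. */
    { "example", model_dointvec },
    /* { "example", &model_dointvec } would not compile with the stub,
     * because it expands to &NULL. */
};
```

    Built without MODEL_CONFIG_HANDLERS, the table entry simply holds a
    null handler.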

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

02 Nov, 2008

2 commits


23 Oct, 2008

1 commit


14 Oct, 2008

1 commit

  • Clean up the various different email addresses of mine listed in the code
    to a single current and valid address. As Dave says his network merges
    for 2.6.28 are now done this seems a good point to send them in where
    they won't risk disrupting real changes.

    Signed-off-by: Alan Cox
    Signed-off-by: David S. Miller

    Alan Cox
     

27 Jul, 2008

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
    [PATCH] fix RLIM_NOFILE handling
    [PATCH] get rid of corner case in dup3() entirely
    [PATCH] remove remaining namei_{32,64}.h crap
    [PATCH] get rid of indirect users of namei.h
    [PATCH] get rid of __user_path_lookup_open
    [PATCH] f_count may wrap around
    [PATCH] dup3 fix
    [PATCH] don't pass nameidata to __ncp_lookup_validate()
    [PATCH] don't pass nameidata to gfs2_lookupi()
    [PATCH] new (local) helper: user_path_parent()
    [PATCH] sanitize __user_walk_fd() et.al.
    [PATCH] preparation to __user_walk_fd cleanup
    [PATCH] kill nameidata passing to permission(), rename to inode_permission()
    [PATCH] take noexec checks to very few callers that care
    Re: [PATCH 3/6] vfs: open_exec cleanup
    [patch 4/4] vfs: immutable inode checking cleanup
    [patch 3/4] fat: dont call notify_change
    [patch 2/4] vfs: utimes cleanup
    [patch 1/4] vfs: utimes: move owner check into inode_change_ok()
    [PATCH] vfs: use kstrdup() and check failing allocation
    ...

    Linus Torvalds
     
  • make it atomic_long_t; while we are at it, get rid of useless checks in affs,
    hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.

    Signed-off-by: Al Viro

    Al Viro
     

26 Jul, 2008

1 commit

  • Removes a legacy reinvent-the-wheel construct. The generic
    machinery integrates much better with automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuitively, BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON(). Some instances might be
    promoted to BUG_ON(), but I left that for the future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

28 Jun, 2008

2 commits

  • Conflicts:

    drivers/net/wireless/iwlwifi/iwl4965-base.c

    David S. Miller
     
  • For n:1 'datagram connections' (eg /dev/log), the unix_dgram_sendmsg
    routine implements a form of receiver-imposed flow control by
    comparing the length of the receive queue of the 'peer socket' with
    the max_ack_backlog value stored in the corresponding sock structure,
    either blocking the thread which caused the send-routine to be called
    or returning EAGAIN. This routine is used by both SOCK_DGRAM and
    SOCK_SEQPACKET sockets. The poll-implementation for these socket types
    is datagram_poll from core/datagram.c. A socket is deemed to be
    writeable by this routine when the memory presently consumed by
    datagrams owned by it is less than the configured socket send buffer
    size. This is always wrong for PF_UNIX non-stream sockets connected to
    server sockets dealing with (potentially) multiple clients if the
    abovementioned receive queue is currently considered to be full.
    'poll' will then return, indicating that the socket is writeable, but
    a subsequent write results in EAGAIN, effectively causing a (usual)
    application to 'poll for writeability by repeated send request with
    O_NONBLOCK set' until it has consumed its time quantum.

    The change below uses a suitably modified variant of the datagram_poll
    routines for both types of PF_UNIX sockets, which tests if the
    recv-queue of the peer a socket is connected to is presently
    considered to be 'full' as part of the 'is this socket
    writeable'-checking code. The socket being polled is additionally
    put onto the peer_wait wait queue associated with its peer, because the
    unix_dgram_recvmsg routine does a wake up on this queue after a
    datagram was received and the 'other wakeup call' is done implicitly
    as part of skb destruction, meaning, a process blocked in poll
    because of a full peer receive queue could otherwise sleep forever
    if no datagram owned by its socket was already sitting on this queue.
    Among this change is a small (inline) helper routine named
    'unix_recvq_full', which consolidates the actual testing code (in three
    different places) into a single location.
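
    The writeability test described above can be sketched as follows.
    This is a userspace model, not the kernel code: struct
    model_dgram_sock and both helpers are hypothetical, and treating the
    queue as full once its length exceeds max_ack_backlog is an
    assumption about the exact comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a datagram socket. */
struct model_dgram_sock {
    unsigned int qlen;             /* datagrams in the receive queue */
    unsigned int max_ack_backlog;  /* receiver-imposed queue limit */
    unsigned int wmem_alloc;       /* memory used by datagrams it owns */
    unsigned int sndbuf;           /* configured send buffer size */
    struct model_dgram_sock *peer; /* connected peer, or NULL */
};

/* Single place for the "is the receive queue full" test. */
static bool model_recvq_full(const struct model_dgram_sock *sk)
{
    return sk->qlen > sk->max_ack_backlog;
}

/* A socket counts as writeable only if it has send buffer space left
 * and its connected peer, if any, can still queue another datagram. */
static bool model_dgram_writable(const struct model_dgram_sock *sk)
{
    if (sk->wmem_alloc >= sk->sndbuf)
        return false;
    if (sk->peer != NULL && model_recvq_full(sk->peer))
        return false;
    return true;
}
```

    With the peer check in place, poll no longer reports a socket as
    writeable when the next send would fail with EAGAIN.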

    Signed-off-by: Rainer Weikusat
    Signed-off-by: David S. Miller

    Rainer Weikusat
     

20 Jun, 2008

1 commit


18 Jun, 2008

1 commit

  • The unix_dgram_sendmsg routine implements a (somewhat crude)
    form of receiver-imposed flow control by comparing the length of the
    receive queue of the 'peer socket' with the max_ack_backlog value
    stored in the corresponding sock structure, either blocking
    the thread which caused the send-routine to be called or returning
    EAGAIN. This routine is used by both SOCK_DGRAM and SOCK_SEQPACKET
    sockets. The poll-implementation for these socket types is
    datagram_poll from core/datagram.c. A socket is deemed to be writeable
    by this routine when the memory presently consumed by datagrams
    owned by it is less than the configured socket send buffer size. This
    is always wrong for connected PF_UNIX non-stream sockets when the
    abovementioned receive queue is currently considered to be full.
    'poll' will then return, indicating that the socket is writeable, but
    a subsequent write results in EAGAIN, effectively causing a
    (usual) application to 'poll for writeability by repeated send request
    with O_NONBLOCK set' until it has consumed its time quantum.

    The change below uses a suitably modified variant of the datagram_poll
    routines for both types of PF_UNIX sockets, which tests if the
    recv-queue of the peer a socket is connected to is presently
    considered to be 'full' as part of the 'is this socket
    writeable'-checking code. The socket being polled is additionally
    put onto the peer_wait wait queue associated with its peer, because the
    unix_dgram_recvmsg routine does a wake up on this queue after a
    datagram was received and the 'other wakeup call' is done implicitly
    as part of skb destruction, meaning, a process blocked in poll
    because of a full peer receive queue could otherwise sleep forever
    if no datagram owned by its socket was already sitting on this queue.
    Among this change is a small (inline) helper routine named
    'unix_recvq_full', which consolidates the actual testing code (in three
    different places) into a single location.

    Signed-off-by: Rainer Weikusat
    Signed-off-by: David S. Miller

    Rainer Weikusat