02 Dec, 2011

1 commit

  • This reverts commit 81d54ec8479a2c695760da81f05b5a9fb2dbe40a.

    If we take the "try_again" goto, due to a checksum error,
    the 'len' has already been truncated. So we won't compute
    the same values as the original code did.

    Reported-by: paul bilke
    Signed-off-by: David S. Miller

    David S. Miller
     

02 Nov, 2011

1 commit

  • the tcp and udp code creates a set of struct file_operations at runtime
    while it can also be done at compile time, with the added benefit of then
    having these file operations be const.

    the trickiest part was to get the "THIS_MODULE" reference right; the naive
    method of declaring a struct in the place of registration would not work
    for this reason.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: David S. Miller

    Arjan van de Ven
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

31 Aug, 2011

1 commit


18 Aug, 2011

1 commit

  • The l4_rxhash flag was added to the skb structure to indicate
    that the rxhash value was computed over the 4 tuple for the
    packet which includes the port information in the encapsulated
    transport packet. This is used by the stack to preserve the
    rxhash value in __skb_rx_tunnel.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

12 Aug, 2011

1 commit


22 Jul, 2011

1 commit

  • IPv6 fragment identification generation is way beyond what we use for
    IPv4 : It uses a single generator. Its not scalable and allows DOS
    attacks.

    Now inetpeer is IPv6 aware, we can use it to provide a more secure and
    scalable frag ident generator (per destination, instead of system wide)

    This patch :
    1) defines a new secure_ipv6_id() helper
    2) extends inet_getid() to provide 32bit results
    3) extends ipv6_select_ident() with a new dest parameter

    Reported-by: Fernando Gont
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Jun, 2011

2 commits

  • Consider this scenario: When the size of the first received udp packet
    is bigger than the receive buffer, MSG_TRUNC bit is set in msg->msg_flags.
    However, if checksum error happens and this is a blocking socket, it will
    goto try_again loop to receive the next packet. But if the size of the
    next udp packet is smaller than receive buffer, MSG_TRUNC flag should not
    be set, but because MSG_TRUNC bit is not cleared in msg->msg_flags before
    receive the next packet, MSG_TRUNC is still set, which is wrong.

    Fix this problem by clearing MSG_TRUNC flag when starting over for a
    new packet.

    Signed-off-by: Xufeng Zhang
    Signed-off-by: Paul Gortmaker
    Signed-off-by: David S. Miller

    Xufeng Zhang
     
  • udpv6_recvmsg() function is not using the correct variable to determine
    whether or not the socket is in non-blocking operation, this will lead
    to unexpected behavior when a UDP checksum error occurs.

    Consider a non-blocking udp receive scenario: when udpv6_recvmsg() is
    called by sock_common_recvmsg(), MSG_DONTWAIT bit of flags variable in
    udpv6_recvmsg() is cleared by "flags & ~MSG_DONTWAIT" in this call:

    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
    flags & ~MSG_DONTWAIT, &addr_len);

    i.e. with udpv6_recvmsg() getting these values:

    int noblock = flags & MSG_DONTWAIT
    int flags = flags & ~MSG_DONTWAIT

    So, when udp checksum error occurs, the execution will go to
    csum_copy_err, and then the problem happens:

    csum_copy_err:
    ...............
    if (flags & MSG_DONTWAIT)
    return -EAGAIN;
    goto try_again;
    ...............

    But it will always go to try_again as MSG_DONTWAIT has been cleared
    from flags at call time -- only noblock contains the original value
    of MSG_DONTWAIT, so the test should be:

    if (noblock)
    return -EAGAIN;

    This is also consistent with what the ipv4/udp code does.

    Signed-off-by: Xufeng Zhang
    Signed-off-by: Paul Gortmaker
    Signed-off-by: David S. Miller

    Xufeng Zhang
     

24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK are currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

29 Apr, 2011

1 commit


27 Apr, 2011

1 commit


23 Apr, 2011

1 commit


22 Apr, 2011

1 commit

  • At this point, skb->data points to skb_transport_header.
    So, headroom check is wrong.

    For some case:bridge(UFO is on) + eth device(UFO is off),
    there is no enough headroom for IPv6 frag head.
    But headroom check is always false.

    This will bring about data be moved to there prior to skb->head,
    when adding IPv6 frag header to skb.

    Signed-off-by: Shan Wei
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Shan Wei
     

07 Apr, 2011

1 commit

  • properly record sk_rxhash in ipv6 sockets (v2)

    Noticed while working on another project that flows to sockets which I had open
    on a test systems weren't getting steered properly when I had RFS enabled.
    Looking more closely I found that:

    1) The affected sockets were all ipv6
    2) They weren't getting steered because sk->sk_rxhash was never set from the
    incomming skbs on that socket.

    This was occuring because there are several points in the IPv4 tcp and udp code
    which save the rxhash value when a new connection is established. Those calls
    to sock_rps_save_rxhash were never added to the corresponding ipv6 code paths.
    This patch adds those calls. Tested by myself to properly enable RFS
    functionalty on ipv6.

    Change notes:
    v2:
    Filtered UDP to only arm RFS on bound sockets (Eric Dumazet)

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller

    Neil Horman
     

13 Mar, 2011

4 commits


02 Mar, 2011

1 commit

  • Route lookups follow a general pattern in the ipv6 code wherein
    we first find the non-IPSEC route, potentially override the
    flow destination address due to ipv6 options settings, and then
    finally make an IPSEC search using either xfrm_lookup() or
    __xfrm_lookup().

    __xfrm_lookup() is used when we want to generate a blackhole route
    if the key manager needs to resolve the IPSEC rules (in this case
    -EREMOTE is returned and the original 'dst' is left unchanged).

    Otherwise plain xfrm_lookup() is used and when asynchronous IPSEC
    resolution is necessary, we simply fail the lookup completely.

    All of these cases are encapsulated into two routines,
    ip6_dst_lookup_flow and ip6_sk_dst_lookup_flow. The latter of which
    handles unconnected UDP datagram sockets.

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Jan, 2011

1 commit

  • Quoting Ben Hutchings: we presumably won't be defining features that
    can only be enabled on 64-bit architectures.

    Occurences found by `grep -r` on net/, drivers/net, include/

    [ Move features and vlan_features next to each other in
    struct netdev, as per Eric Dumazet's suggestion -DaveM ]

    Signed-off-by: Michał Mirosław
    Signed-off-by: David S. Miller

    Michał Mirosław
     

18 Dec, 2010

1 commit


17 Dec, 2010

1 commit

  • Special care is taken inside sk_port_alloc to avoid overwriting
    skc_node/skc_nulls_node. We should also avoid overwriting
    skc_bind_node/skc_portaddr_node.

    The patch fixes the following crash:

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] udp4_lib_lookup2+0xad/0x370
    [] __udp4_lib_lookup+0x282/0x360
    [] __udp4_lib_rcv+0x31e/0x700
    [] ? ip_local_deliver_finish+0x65/0x190
    [] ? ip_local_deliver+0x88/0xa0
    [] udp_rcv+0x15/0x20
    [] ip_local_deliver_finish+0x65/0x190
    [] ip_local_deliver+0x88/0xa0
    [] ip_rcv_finish+0x32d/0x6f0
    [] ? netif_receive_skb+0x99c/0x11c0
    [] ip_rcv+0x2bb/0x350
    [] netif_receive_skb+0x99c/0x11c0

    Signed-off-by: Leonard Crestez
    Signed-off-by: Octavian Purdila
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Octavian Purdila
     

11 Dec, 2010

1 commit


10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

    Optimize INET input path a bit further, by :

    1) moving sk_refcnt close to sk_lock.

    This reduces number of dirtied cache lines by one on 64bit arches (and
    64 bytes cache line size).

    2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

    (same cache line than hash / family / bound_dev_if / nulls_node)

    This reduces number of accessed cache lines in lookups by one, and dont
    increase size of inet and timewait socks.
    inet and tw sockets now share same place-holder for these fields.

    Before patch :

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After patch :

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now use a single cache line per ignored
    item, instead of two.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Nov, 2010

1 commit

  • UDP sockets refcount is usually 2, unless an incoming frame is going to
    be queued in receive or backlog queue.

    Using atomic_inc_not_zero_hint() permits to reduce latency, because
    processor issues less memory transactions.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Oct, 2010

1 commit


21 Oct, 2010

2 commits


09 Sep, 2010

1 commit

  • commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
    added a secondary hash on UDP, hashed on (local addr, local port).

    Problem is that following sequence :

    fd = socket(...)
    connect(fd, &remote, ...)

    not only selects remote end point (address and port), but also sets
    local address, while UDP stack stored in secondary hash table the socket
    while its local address was INADDR_ANY (or ipv6 equivalent)

    Sequence is :
    - autobind() : choose a random local port, insert socket in hash tables
    [while local address is INADDR_ANY]
    - connect() : set remote address and port, change local address to IP
    given by a route lookup.

    When an incoming UDP frame comes, if more than 10 sockets are found in
    primary hash table, we switch to secondary table, and fail to find
    socket because its local address changed.

    One solution to this problem is to rehash datagram socket if needed.

    We add a new rehash(struct socket *) method in "struct proto", and
    implement this method for UDP v4 & v6, using a common helper.

    This rehashing only takes care of secondary hash table, since primary
    hash (based on local port only) is not changed.

    Reported-by: Krzysztof Piotr Oledzki
    Signed-off-by: Eric Dumazet
    Tested-by: Krzysztof Piotr Oledzki
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Jun, 2010

1 commit


02 Jun, 2010

1 commit

  • There are more than a dozen occurrences of following code in the
    IPv6 stack:

    if (opt && opt->srcrt) {
    struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
    ipv6_addr_copy(&final, &fl.fl6_dst);
    ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
    final_p = &final;
    }

    Replace those with a helper. Note that the helper overrides final_p
    in all cases. This is ok as final_p was previously initialized to
    NULL when declared.

    Signed-off-by: Arnaud Ebalard
    Signed-off-by: David S. Miller

    Arnaud Ebalard
     

01 Jun, 2010

1 commit

  • Correct sk_forward_alloc handling for error_queue would need to use a
    backlog of frames that softirq handler could not deliver because socket
    is owned by user thread. Or extend backlog processing to be able to
    process normal and error packets.

    Another possibility is to not use mem charge for error queue, this is
    what I implemented in this patch.

    Note: this reverts commit 29030374
    (net: fix sk_forward_alloc corruptions), since we dont need to lock
    socket anymore.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 May, 2010

1 commit

  • As David found out, sock_queue_err_skb() should be called with socket
    lock hold, or we risk sk_forward_alloc corruption, since we use non
    atomic operations to update this field.

    This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
    (BH already disabled)

    1) skb_tstamp_tx()
    2) Before calling ip_icmp_error(), in __udp4_lib_err()
    3) Before calling ipv6_icmp_error(), in __udp6_lib_err()

    Reported-by: Anton Blanchard
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 May, 2010

1 commit

  • This new sock lock primitive was introduced to speedup some user context
    socket manipulation. But it is unsafe to protect two threads, one using
    regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh

    This patch changes lock_sock_bh to be careful against 'owned' state.
    If owned is found to be set, we must take the slow path.
    lock_sock_bh() now returns a boolean to say if the slow path was taken,
    and this boolean is used at unlock_sock_bh time to call the appropriate
    unlock function.

    After this change, BH are either disabled or enabled during the
    lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
    so we rename these functions to lock_sock_fast()/unlock_sock_fast().

    Reported-by: Anton Blanchard
    Signed-off-by: Eric Dumazet
    Tested-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 May, 2010

1 commit

  • Adding addresses and ports to the short packet log message,
    like ipv4/udp.c does it, makes these messages a lot more useful:

    [ 822.182450] UDPv6: short packet: From [2001:db8:ffb4:3::1]:47839 23715/178 to [2001:db8:ffb4:3:5054:ff:feff:200]:1234

    This requires us to drop logging in case pskb_may_pull() fails,
    which also is consistent with ipv4/udp.c

    Signed-off-by: Bjørn Mork
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Bjørn Mork
     

29 Apr, 2010

2 commits

  • When queueing a skb to socket, we can immediately release its dst if
    target socket do not use IP_CMSG_PKTINFO.

    tcp_data_queue() can drop dst too.

    This to benefit from a hot cache line and avoid the receiver, possibly
    on another cpu, to dirty this cache line himself.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since commit 95766fff ([UDP]: Add memory accounting.),
    each received packet needs one extra sock_lock()/sock_release() pair.

    This added latency because of possible backlog handling. Then later,
    ticket spinlocks added yet another latency source in case of DDOS.

    This patch introduces lock_sock_bh() and unlock_sock_bh()
    synchronization primitives, avoiding one atomic operation and backlog
    processing.

    skb_free_datagram_locked() uses them instead of full blown
    lock_sock()/release_sock(). skb is orphaned inside locked section for
    proper socket memory reclaim, and finally freed outside of it.

    UDP receive path now take the socket spinlock only once.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2010

1 commit

  • Current socket backlog limit is not enough to really stop DDOS attacks,
    because user thread spend many time to process a full backlog each
    round, and user might crazy spin on socket lock.

    We should add backlog size and receive_queue size (aka rmem_alloc) to
    pace writers, and let user run without being slow down too much.

    Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
    stress situations.

    Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
    receiver can now process ~200.000 pps (instead of ~100 pps before the
    patch) on a 8 core machine.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Apr, 2010

1 commit

  • Finally add support to detect a local IPV6_DONTFRAG event
    and return the relevant data to the user if they've enabled
    IPV6_RECVPATHMTU on the socket. The next recvmsg() will
    return no data, but have an IPV6_PATHMTU as ancillary data.

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley