12 Dec, 2011

1 commit


24 Nov, 2011

1 commit

  • We cannot update iph->daddr in ip_options_rcv_srr(); it is too early.
    When an exception occurs later (e.g. in ip_forward() when we goto
    sr_failed) we need the IP header to be identical to the original
    one, because ICMP needs it.

    Add a field 'nexthop' to struct ip_options to save the next hop of
    the LSRR or SSRR option.
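
    A minimal sketch of the resulting layout (member list abbreviated and
    not copied verbatim from the kernel headers):

        struct ip_options {
                __be32        faddr;    /* saved first-hop address */
                __be32        nexthop;  /* saved LSRR/SSRR next hop; applied
                                         * to iph->daddr only once forwarding
                                         * can no longer fail */
                unsigned char optlen;
                unsigned char srr;
                /* ... remaining option bookkeeping ... */
        };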

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller

    Li Wei
     

08 Aug, 2011

1 commit

  • Raw sockets can provide a source address for routing, but their
    privileges are not considered. Since we can provide a non-local
    source address, make sure the FLOWI_FLAG_ANYSRC flag is set if the
    socket has privileges for this, i.e. based on the hdrincl
    (IP_HDRINCL) and transparent flags.
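
    A rough sketch of the check described above (the helper name is
    illustrative, not the exact kernel function):

        static inline __u8 raw_flowi_flags(const struct inet_sock *inet)
        {
                __u8 flags = 0;

                /* hdrincl or transparent sockets are privileged enough to
                 * supply a non-local source address for the route lookup */
                if (inet->hdrincl || inet->transparent)
                        flags |= FLOWI_FLAG_ANYSRC;

                return flags;
        }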

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

07 May, 2011

1 commit

  • When we fast-path datagram sends to avoid locking by putting
    the inet_cork on the stack, we use up a lot of space that isn't
    necessary.

    This is because inet_cork contains a "struct flowi" which isn't
    used in these code paths.

    Split inet_cork into two parts, "inet_cork" and "inet_cork_full".
    Only the latter has the "struct flowi", and it is what is stored
    in inet_sock.
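
    A sketch of the resulting split (member lists abbreviated; only the
    nesting is the point here):

        struct inet_cork {
                unsigned int       flags;
                __be32             addr;
                struct ip_options *opt;
                /* ... fragment/length state, but no struct flowi ... */
        };

        struct inet_cork_full {
                struct inet_cork base;  /* the small part, cheap to put on
                                         * the stack in the fast path */
                struct flowi     fl;    /* only needed for corked sockets,
                                         * kept inside inet_sock */
        };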

    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
     

29 Apr, 2011

1 commit

  • We lack proper synchronization when manipulating inet->opt ip_options.

    The problem is that ip_make_skb() calls ip_setup_cork(), and
    ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options)
    without any protection against another thread manipulating inet->opt.

    Another thread can change the inet->opt pointer and free the old one
    under us.

    Use RCU to protect inet->opt (changed to inet->inet_opt).

    Instead of handling atomic refcounts, just copy the ip_options when
    necessary, to avoid cache line dirtying.

    We can't insert an rcu_head in struct ip_options since it is included
    in skb->cb[], so this patch is large because I had to introduce a new
    ip_options_rcu structure.
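
    A simplified sketch of the new wrapper and the reader-side pattern it
    enables (buffer sizing and accessors are abbreviated):

        struct ip_options_rcu {
                struct rcu_head   rcu;  /* lives outside any skb->cb[] copy */
                struct ip_options opt;
        };

        /* reader: snapshot the options instead of refcounting them */
        rcu_read_lock();
        inet_opt = rcu_dereference(inet->inet_opt);
        if (inet_opt)           /* opt_copy is sized for maximal options */
                memcpy(&opt_copy, inet_opt,
                       sizeof(*inet_opt) + inet_opt->opt.optlen);
        rcu_read_unlock();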

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Mar, 2011

1 commit

  • In order to allow simultaneous calls to ip_append_data on the same
    socket, it must not modify any shared state in sk or inet (other
    than state that is designed to allow this, such as atomic counters).

    This patch abstracts out write references to sk and inet_sk in
    ip_append_data and its friends so that we may use the underlying
    code in parallel.
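
    Illustratively, the refactoring threads the per-call state through
    explicit parameters instead of writing it into the socket; the
    signature below is a sketch, not the exact kernel prototype:

        /*
         * before: __ip_append_data(struct sock *sk, ...) read and wrote
         * the cork state through sk/inet directly
         *
         * after: callers pass the queue and cork state explicitly, so
         * several appends can proceed in parallel on one socket
         */
        static int __ip_append_data(struct sock *sk,
                                    struct sk_buff_head *queue,
                                    struct inet_cork *cork, ...);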

    Signed-off-by: Herbert Xu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Herbert Xu
     

28 Jan, 2011

1 commit

  • TCP is going to record metrics for the connection,
    so pre-COW the route metrics at route cache entry
    creation time.

    This avoids several atomic operations that have to
    occur if we COW the metrics after the entry reaches
    global visibility.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

    Optimize the INET input path a bit further by:

    1) moving sk_refcnt close to sk_lock.

    This reduces the number of dirtied cache lines by one on 64-bit
    arches (with a 64-byte cache line size).

    2) moving inet_daddr & inet_rcv_saddr to the beginning of sk

    (the same cache line as hash / family / bound_dev_if / nulls_node)

    This reduces the number of cache lines accessed in lookups by one and
    does not increase the size of inet and timewait socks.
    inet and tw sockets now share the same placeholder for these fields.

    Before the patch:

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After the patch:

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now uses a single cache line per ignored
    item instead of two.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Nov, 2010

1 commit

  • in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).

    This can easily be converted to RCU protection.

    Writers hold RTNL, so mc_list_lock is removed rather than replaced
    by a spinlock.
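
    A sketch of the reader side after the conversion (list field names
    are illustrative):

        struct ip_mc_list *pmc;

        rcu_read_lock();
        for (pmc = rcu_dereference(in_dev->mc_list); pmc;
             pmc = rcu_dereference(pmc->next_rcu))
                handle_group(pmc);      /* hypothetical per-group work */
        rcu_read_unlock();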

    Signed-off-by: Eric Dumazet
    Cc: Cypher Wu
    Cc: Américo Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Jun, 2010

1 commit


28 Apr, 2010

2 commits


17 Apr, 2010

1 commit

  • This patch implements receive flow steering (RFS). RFS steers
    received packets for layer 3 and 4 processing to the CPU where
    the application for the corresponding flow is running. RFS is an
    extension of Receive Packet Steering (RPS).

    The basic idea of RFS is that when an application calls recvmsg
    (or sendmsg) the application's running CPU is stored in a hash
    table that is indexed by the connection's rxhash which is stored in
    the socket structure. The rxhash is passed in skbs received on
    the connection from netif_receive_skb. For each received packet,
    the associated rxhash is used to look up the CPU in the hash table,
    if a valid CPU is set then the packet is steered to that CPU using
    the RPS mechanisms.

    The complication with this simple approach is that it would
    potentially allow OOO (out-of-order) packets. If threads are
    thrashing around CPUs or multiple threads are trying to read from
    the same sockets, a quickly changing CPU value in the hash table
    could cause rampant OOO packets; we consider this a non-starter.

    To avoid OOO packets, this solution implements two types of hash
    tables: rps_sock_flow_table and rps_dev_flow_table.

    rps_sock_flow_table is a global hash table. Each entry is just a CPU
    number and it is populated in recvmsg and sendmsg as described above.
    This table contains the "desired" CPUs for flows.

    rps_dev_flow_table is specific to each device queue. Each entry
    contains a CPU and a tail queue counter. The CPU is the "current"
    CPU for a matching flow. The tail queue counter holds the value
    of a tail queue counter for the associated CPU's backlog queue at
    the time of last enqueue for a flow matching the entry.

    Each backlog queue has a queue head counter which is incremented
    on dequeue, and so a queue tail counter is computed as queue head
    count + queue length. When a packet is enqueued on a backlog queue,
    the current value of the queue tail counter is saved in the hash
    entry of the rps_dev_flow_table.

    And now the trick: when selecting the CPU for RPS (get_rps_cpu)
    the rps_sock_flow table and the rps_dev_flow table for the RX queue
    are consulted. When the desired CPU for the flow (found in the
    rps_sock_flow table) does not match the current CPU (found in the
    rps_dev_flow table), the current CPU is changed to the desired CPU
    if one of the following is true:

    - The current CPU is unset (equal to RPS_NO_CPU)
    - The current CPU is offline
    - The current CPU's queue head counter >= the queue tail counter in
      the rps_dev_flow table. This checks whether the queue tail has
      advanced beyond the last packet that was enqueued using this table
      entry, guaranteeing that all packets queued using this entry have
      been dequeued and thus preserving in-order delivery (see the
      sketch below).
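
    In rough C-like form the selection step reads as follows (simplified
    from the description above, not the literal get_rps_cpu() code):

        desired = sock_flow_table[hash].cpu;    /* set by recvmsg/sendmsg */
        rflow   = &dev_flow_table[hash];
        cur     = rflow->cpu;                   /* "current" CPU for flow */

        if (desired != cur &&
            (cur == RPS_NO_CPU ||
             !cpu_online(cur) ||
             (int)(head_counter(cur) - rflow->last_qtail) >= 0)) {
                /* every packet enqueued via this entry has already been
                 * dequeued, so switching CPUs cannot reorder the flow */
                rflow->cpu = cur = desired;
        }
        steer_to(cur);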

    Making each queue have its own rps_dev_flow table has two advantages:
    1) the tail queue counters will be written on each receive, so
    keeping the table local to the interrupting CPU is good for locality.
    2) this allows lockless access to the table: the CPU number and queue
    tail counter need to be accessed together under mutual exclusion
    from netif_receive_skb, and we assume that this is only called from
    the device's napi_poll, which is non-reentrant.

    This patch implements RFS for TCP and connected UDP sockets.
    It should be usable for other flow oriented protocols.

    There are two configuration parameters for RFS. The
    "rps_flow_entries" kernel init parameter sets the number of
    entries in the rps_sock_flow_table, the per rxqueue sysfs entry
    "rps_flow_cnt" contains the number of entries in the rps_dev_flow
    table for the rxqueue. Both are rounded to power of two.

    The obvious benefit of RFS (over just RPS) is that it achieves
    CPU locality between the receive processing for a flow and the
    applications processing; this can result in increased performance
    (higher pps, lower latency).

    The benefits of RFS are dependent on cache hierarchy, application
    load, and other factors. On simple benchmarks, we don't necessarily
    see improvement and sometimes see degradation. However, for more
    complex benchmarks and for applications where cache pressure is
    much higher this technique seems to perform very well.

    Below are some benchmark results which show the potential benefit of
    this patch. The netperf test has 500 instances of the netperf TCP_RR
    test with 1 byte requests and responses. The RPC test is a
    request/response test similar in structure to the netperf RR test,
    with 100 threads on each host, but it does more work in userspace
    than netperf.

    e1000e on 8 core Intel
    No RFS or RPS 104K tps at 30% CPU
    No RFS (best RPS config): 290K tps at 63% CPU
    RFS 303K tps at 61% CPU

    RPC test      tps   CPU%  50/90/99% usec latency  Latency StdDev
    No RFS/RPS    103K  48%   757/900/3185            4472.35
    RPS only      174K  73%   415/993/2468            491.66
    RFS           223K  73%   379/651/1382            315.61

    Signed-off-by: Tom Herbert
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Tom Herbert
     

12 Jan, 2010

1 commit

  • This patch adds the kernel portions needed to implement
    RFC 5082 Generalized TTL Security Mechanism (GTSM).
    It is a lightweight security measure against forged
    packets causing DoS attacks (for BGP).

    This is already implemented the same way in BSD kernels. For the
    necessary Quagga patch, see
    http://www.gossamer-threads.com/lists/quagga/dev/17389

    Description from Cisco:
    http://www.cisco.com/en/US/docs/ios/12_3t/12_3t7/feature/guide/gt_btsh.html

    It does add one byte to each socket structure. I did a little
    rearrangement to reuse a hole (on 64-bit), but it does grow the
    structure on 32-bit.

    This should be documented in the ip(4) man page, and the Glibc in.h
    file also needs an update. IPV6_MINHOPLIMIT should also be added
    (although BSD doesn't support that).

    Only TCP is supported, but support could also be added to UDP, DCCP
    and SCTP if desired.
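
    For reference, a userspace sketch of how a BGP daemon would use the
    new option (assuming it is exposed as IP_MINTTL, as it is in current
    kernels); packets arriving with a smaller TTL are dropped, so only
    directly connected peers sending with TTL 255 get through:

        #include <netinet/in.h>
        #include <sys/socket.h>

        static int enable_gtsm(int fd)
        {
                int minttl = 255;       /* RFC 5082: reject anything lower */

                return setsockopt(fd, IPPROTO_IP, IP_MINTTL,
                                  &minttl, sizeof(minttl));
        }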

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer the fields used at lookup time into the first
    read-mostly cache line (inside struct sock_common) and to move
    sk_refcnt to a separate cache line (only written by the rx path).

    This patch adds an inet_ prefix to the daddr, rcv_saddr, dport, num,
    saddr, sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.
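
    The future patch alluded to above can then turn these names into
    aliases along these lines (a sketch; the exact sock_common member
    names may differ):

        /* the fields physically live in the shared struct sock_common ... */
        struct sock_common {
                __be32 skc_daddr;
                __be32 skc_rcv_saddr;
                /* ... */
        };

        /* ... while the inet_* names become plain aliases */
        #define inet_daddr     sk.__sk_common.skc_daddr
        #define inet_rcv_saddr sk.__sk_common.skc_rcv_saddr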

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jun, 2009

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits)
    signal: fix __send_signal() false positive kmemcheck warning
    fs: fix do_mount_root() false positive kmemcheck warning
    fs: introduce __getname_gfp()
    trace: annotate bitfields in struct ring_buffer_event
    net: annotate struct sock bitfield
    c2port: annotate bitfield for kmemcheck
    net: annotate inet_timewait_sock bitfields
    ieee1394/csr1212: fix false positive kmemcheck report
    ieee1394: annotate bitfield
    net: annotate bitfields in struct inet_sock
    net: use kmemcheck bitfields API for skbuff
    kmemcheck: introduce bitfield API
    kmemcheck: add opcode self-testing at boot
    x86: unify pte_hidden
    x86: make _PAGE_HIDDEN conditional
    kmemcheck: make kconfig accessible for other architectures
    kmemcheck: enable in the x86 Kconfig
    kmemcheck: add hooks for the page allocator
    kmemcheck: add hooks for page- and sg-dma-mappings
    kmemcheck: don't track page tables
    ...

    Linus Torvalds
     

15 Jun, 2009

1 commit


02 Jun, 2009

1 commit

  • After some discussion offline with Christoph Lameter and David Stevens
    regarding multicast behaviour in Linux, I'm submitting a slightly
    modified patch from the one Christoph submitted earlier.

    This patch provides a new socket option IP_MULTICAST_ALL.

    In this case, the default behaviour is _unchanged_ from the current
    Linux standard. The socket option is set by default to preserve the
    original behaviour. Sockets wishing to receive data only from
    multicast groups they join explicitly will need to clear this
    socket option.
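
    A small usage sketch for a socket that only wants traffic from groups
    it has joined explicitly (error handling omitted):

        #include <netinet/in.h>
        #include <sys/socket.h>

        static int join_only_my_groups(int fd)
        {
                int off = 0;    /* clear the option: the default is on */

                return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_ALL,
                                  &off, sizeof(off));
        }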

    Signed-off-by: Nivedita Singhvi
    Signed-off-by: Christoph Lameter
    Acked-by: David Stevens
    Signed-off-by: David S. Miller

    Nivedita Singhvi
     

01 Oct, 2008

4 commits

  • Current TCP code relies on the local port of the listening socket
    being the same as the destination port of the incoming
    connection. Port redirection used by many transparent proxying
    techniques obviously breaks this, so we have to store the original
    destination port.

    This patch extends struct inet_request_sock and stores the incoming
    destination port value there. It also modifies the handshake code to
    use that value as the source port when sending reply packets.
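
    Schematically (the field name below is illustrative, not necessarily
    the one used in the kernel):

        struct inet_request_sock {
                /* ... existing request state ... */
                __be16 loc_port;        /* original destination port of the
                                         * SYN, reused as the source port of
                                         * the SYN+ACK/ACK/RST replies */
        };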

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: David S. Miller

    KOVACS Krisztian
     
  • The TCP stack sends out SYN+ACK/ACK/RST reply packets in response to
    incoming packets. The non-local source address check on output bites
    us again, as replies for transparently redirected traffic won't have a
    chance to leave the node.

    This patch selectively sets the FLOWI_FLAG_ANYSRC flag when doing the
    route lookup for those replies. Transparent replies are enabled if the
    listening socket has the transparent socket flag set.

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: David S. Miller

    KOVACS Krisztian
     
  • inet_iif() in inet_sock.h requires route.h. Since users of inet_iif()
    usually require other route.h functionality anyway this patch moves
    inet_iif() to route.h.

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: David S. Miller

    KOVACS Krisztian
     
  • This patch introduces the IP_TRANSPARENT socket option: enabling it
    makes IPv4 routing omit the non-local source address check on
    output. Setting IP_TRANSPARENT requires the NET_ADMIN capability.
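
    A usage sketch for a transparent proxy (the caller needs
    CAP_NET_ADMIN; error handling omitted):

        #include <netinet/in.h>
        #include <sys/socket.h>

        static int make_transparent(int fd)
        {
                int on = 1;

                /* lets bind()/connect() use non-local source addresses */
                return setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT,
                                  &on, sizeof(on));
        }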

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: David S. Miller

    KOVACS Krisztian
     

17 Jun, 2008

2 commits

  • There are many possible ways to add this "salt", so I made this
    patch the last in the series so that it can be changed if required.

    Currently I propose to use the struct net pointer itself as this
    salt, but since this pointer is most often cache-line aligned, shift
    it right to eliminate the bits that are most often zeroed.

    After this, simply add this mix to the prepared hashfns.

    For the CONFIG_NET_NS=n case this salt is 0 and the hashfns are
    unchanged.
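
    A sketch of the resulting mix helper (modelled on net_hash_mix(); the
    exact shift amount is an assumption here):

        static inline u32 net_hash_mix(const struct net *net)
        {
        #ifdef CONFIG_NET_NS
                /* the pointer is cache-line aligned, so the low bits carry
                 * no entropy; shift them away before mixing */
                return (u32)(((unsigned long)net) >> L1_CACHE_SHIFT);
        #else
                return 0;       /* single namespace: hashes stay unchanged */
        #endif
        }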

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Although this hash takes addresses into account, the ehash chains
    can also be too long when, for instance, communications via lo occur.
    So, prepare the inet_hashfn to take struct net into account.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

11 Jun, 2008

1 commit


25 Mar, 2008

1 commit


23 Mar, 2008

1 commit


06 Mar, 2008

1 commit


05 Mar, 2008

1 commit

  • If all of the entropy is in the local and foreign addresses, but
    xor'ing them together cancels out that entropy, the current hash
    performs poorly.

    Suggested by Cosmin Ratiu:

    Basically, the situation is as follows: There is a client
    machine and a server machine. Both create 15000 virtual
    interfaces, open up a socket for each pair of interfaces and
    do SIP traffic. By profiling I noticed that there is a lot of
    time spent walking the established hash chains with this
    particular setup.

    The addresses were distributed like this: client interfaces
    were 198.18.0.1/16 with increments of 1 and server interfaces
    were 198.18.128.1/16 with increments of 1. As I said, there
    were 15000 interfaces. Source and destination ports were 5060
    for each connection. So in this case, ports don't matter for
    hashing purposes, and the bits from the address pairs used
    cancel each other, meaning there are no differences in the
    whole lot of pairs, so they all end up in the same hash chain.
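
    A self-contained demonstration of the degenerate case: because the
    client and server addresses increment in lockstep, the xor of every
    pair is the same constant, so a hash built only on saddr ^ daddr
    collapses to a single chain (addresses are treated as host-order
    32-bit values for simplicity):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                /* client k: 198.18.0.1 + k, server k: 198.18.128.1 + k */
                for (uint32_t k = 0; k < 15000; k += 3000) {
                        uint32_t laddr = 0xC6120001u + k;
                        uint32_t faddr = 0xC6128001u + k;

                        printf("k=%5u  laddr^faddr=0x%08x\n",
                               k, laddr ^ faddr);
                }
                return 0;       /* prints 0x00008000 every time */
        }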

    Signed-off-by: David S. Miller

    David S. Miller
     

26 Oct, 2007

1 commit

  • UDP currently uses skb->dev->ifindex, which may provide the wrong
    information when the socket is bound to a specific interface.
    This patch makes inet_iif() accessible to UDP and makes UDP use it.

    The scenario we are trying to fix is when a client is running on
    the same system as the server and both client and server bind to
    a non-loopback device.

    Signed-off-by: Vlad Yasevich
    Acked-by: David L Stevens
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

26 Apr, 2007

1 commit

  • The days are gone when this was not an issue; there are folks out
    there with huge bot networks that can be used to attack the
    established hash tables on remote systems.

    So, just like the routing cache and connection tracking
    hash, use the Jenkins hash with random secret input.
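
    In sketch form the hardened hash mixes a boot-time random secret into
    a Jenkins hash (modelled on inet_ehashfn(); the exact word packing is
    an assumption):

        static u32 ehashfn(__be32 laddr, __u16 lport,
                           __be32 faddr, __be16 fport)
        {
                /* inet_ehash_secret is initialized once from
                 * get_random_bytes(), so remote attackers cannot predict
                 * which chain a connection lands on */
                return jhash_3words((__force __u32)laddr,
                                    (__force __u32)faddr,
                                    ((__u32)lport) << 16 |
                                    (__force __u32)fport,
                                    inet_ehash_secret);
        }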

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Sep, 2006

6 commits


23 Sep, 2006

2 commits

  • The is_setbyuser field of struct ip_options is never used and is set
    only once (http://linux-net.osdl.org/index.php/TODO#IPV4).
    This little patch removes it from the kernel source.

    Signed-off-by: Louis Nyffenegger
    Signed-off-by: David S. Miller

    Louis Nyffenegger
     
  • Changes to the core network stack to support the NetLabel subsystem. This
    includes changes to the IPv4 option handling to support CIPSO labels.

    Signed-off-by: Paul Moore
    Signed-off-by: David S. Miller

    Paul Moore
     

26 Apr, 2006

1 commit