10 Dec, 2013

2 commits

  • tclass information is now already stored in rcv_flowinfo
    We do not need to store the same information twice.

    Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • The current implementation of IPV6_FLOWINFO only gives a
    result if pktoptions is available (thanks to the
    ip6_datagram_recv_ctl function).
    It gives inconsistent results to user space, sometimes
    there is a result for getsockopt(IPV6_FLOWINFO), sometimes
    not.

    This patch adds rcv_flowinfo to store it, and returns it to
    userspace in the same way as other pkt_options.

    Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
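
The two commits above rely on the flow information word carrying both the traffic class and the flow label. A minimal sketch of that decomposition, assuming the word is the first 32 bits of the IPv6 header converted to host byte order (4-bit version, 8-bit traffic class, 20-bit flow label). The helper names are hypothetical and are not kernel API.

```c
#include <stdint.h>

/* Hypothetical helpers; the argument is the first 32-bit word of the
 * IPv6 header in host byte order (for example after ntohl()). */
static inline uint8_t flowinfo_tclass(uint32_t flowinfo)
{
    /* Traffic class sits between the version nibble and the flow label. */
    return (uint8_t)((flowinfo >> 20) & 0xff);
}

static inline uint32_t flowinfo_flowlabel(uint32_t flowinfo)
{
    /* The flow label is the low 20 bits. */
    return flowinfo & 0x000fffffu;
}
```

Because the traffic class can be recovered from the stored word this way, keeping a second copy of it is unnecessary.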
     

06 Dec, 2013

1 commit

  • The code to detect fragments in checksum_setup() was missing for IPv4 and
    too eager for IPv6. (It transpires that Windows seems to send IPv6 packets
    with a fragment header even if they are not a fragment - i.e. offset is zero,
    and M bit is not set).

    This patch also incorporates a fix to callers of maybe_pull_tail() where
    skb->network_header was being erroneously added to the length argument.

    Signed-off-by: Paul Durrant
    Signed-off-by: Zoltan Kiss
    Cc: Wei Liu
    Cc: Ian Campbell
    Cc: David Vrabel
    Cc: David Miller
    Acked-by: Wei Liu
    Signed-off-by: David S. Miller

    Paul Durrant
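
The "too eager" IPv6 check described above can be illustrated with a small predicate. This is a sketch, not the driver's code: it assumes the 16-bit offset/flags field of the fragment header has already been converted to host byte order, and the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout of the 16-bit offset/flags field in the IPv6 fragment header:
 * 13-bit fragment offset, 2 reserved bits, 1 "more fragments" (M) bit. */
#define FRAG_OFFSET_MASK 0xfff8u
#define FRAG_MORE_FLAG   0x0001u

/* A fragment header alone does not make a packet a fragment: it only
 * counts as one if the offset is non-zero or the M bit is set. */
static bool is_real_fragment(uint16_t frag_off)
{
    return (frag_off & (FRAG_OFFSET_MASK | FRAG_MORE_FLAG)) != 0;
}
```

A header with offset zero and M clear describes a complete packet, so checksum setup can proceed as for an unfragmented packet.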
     

29 Oct, 2013

1 commit


10 Oct, 2013

1 commit

  • TCP listener refactoring, part 5 :

    We want to be able to insert request sockets (SYN_RECV) into main
    ehash table instead of the per listener hash table to allow RCU
    lookups and remove listener lock contention.

    This patch includes the needed struct sock_common in front
    of struct request_sock

    This means there is no more inet6_request_sock IPv6 specific
    structure.

    The following inet_request_sock fields were renamed as they became
    macros to reference fields from struct sock_common.
    Prefix ir_ was chosen to avoid name collisions.

    loc_port -> ir_loc_port
    loc_addr -> ir_loc_addr
    rmt_addr -> ir_rmt_addr
    rmt_port -> ir_rmt_port
    iif -> ir_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Oct, 2013

1 commit

  • TCP listener refactoring, part 4 :

    To speed up inet lookups, we moved IPv4 addresses from inet to struct
    sock_common

    Now it is time to do the same for IPv6, because it permits us to have fast
    lookups for all kinds of sockets, including upcoming SYN_RECV.

    Getting IPv6 addresses in TCP lookups currently requires two extra cache
    lines, plus a dereference (and memory stall).

    inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

    This patch is way bigger than its IPv4 counterpart, because for IPv4,
    we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
    it's not doable easily.

    inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
    inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

    And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
    at the same offset.

    We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
    macro.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2013

1 commit

  • TCP listener refactoring, part 2 :

    We can use a generic lookup, sockets being in whatever state, if
    we are sure all relevant fields are at the same place in all socket
    types (ESTABLISHED, TIME_WAIT, SYN_RECV)

    This patch removes these macros :

    inet_addrpair, inet_portpair, tw_addrpair, tw_portpair

    And adds :

    sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr

    Then, INET_TW_MATCH() is really the same as INET_MATCH()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
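
The generic-lookup idea above depends on every socket type keeping the lookup keys at the same offset. A minimal sketch under that assumption follows. All struct and function names here are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared part: the lookup keys. */
struct common_keys {
    uint32_t daddr;
    uint32_t rcv_saddr;
    uint16_t dport;
    uint16_t num;              /* local port */
};

/* Two different "socket" types that both start with the common keys. */
struct full_sock_sketch {
    struct common_keys keys;   /* must stay the first member */
    int send_buffer_bytes;
};

struct timewait_sock_sketch {
    struct common_keys keys;   /* must stay the first member */
    int timeout_seconds;
};

/* One match routine serves every type, because a pointer to a struct
 * can be converted to a pointer to its first member. */
static bool keys_match(const void *sock, uint32_t daddr, uint32_t rcv_saddr,
                       uint16_t dport, uint16_t num)
{
    const struct common_keys *k = sock;

    return k->dport == dport && k->num == num &&
           k->daddr == daddr && k->rcv_saddr == rcv_saddr;
}
```

With the layout shared, a separate time-wait variant of the match routine has nothing left to do differently.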
     

30 Aug, 2013

1 commit


27 Aug, 2013

1 commit


20 Aug, 2013

1 commit

  • It is not allowed for an ipv6 packet to contain multiple fragmentation
    headers. So discard packets which were already reassembled by
    fragmentation logic and send back a parameter problem icmp.

    The updates for RFC 6980 will come in later, I have to do a bit more
    research here.

    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

14 Aug, 2013

1 commit

  • Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp:
    Reduce Unsolicited report interval to 1s when using IGMPv3") and
    2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space
    configuration of igmp unsolicited report interval") by William Manley made
    igmp unsolicited report intervals configurable per interface and corrected
    the resend interval of unsolicited igmpv3 report messages to 1s.

    The same needs to be done for IPv6:

    MLDv1 (RFC2710 7.10.): 10 seconds
    MLDv2 (RFC3810 9.11.): 1 second

    Both intervals are configurable via new procfs knobs
    mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval.

    (also added .force_mld_version to ipv6_devconf_dflt to bring structs in
    line without semantic changes)

    v2:
    a) Joined documentation update for IPv4 and IPv6 MLD/IGMP
    unsolicited_report_interval procfs knobs.
    b) incorporate stylistic feedback from William Manley

    v3:
    a) add new DEVCONF_* values to the end of the enum (thanks to David
    Miller)

    Cc: Cong Wang
    Cc: William Manley
    Cc: Benjamin LaHaise
    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
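
The default intervals quoted in the commit above can be summarised in a small selector. This is only a sketch: the macro and function names are hypothetical, and the real values are configurable per interface through the new knobs.

```c
/* Default unsolicited report intervals quoted in the commit message. */
#define MLDV1_UNSOLICITED_REPORT_INTERVAL_MS 10000u  /* RFC 2710, 7.10 */
#define MLDV2_UNSOLICITED_REPORT_INTERVAL_MS  1000u  /* RFC 3810, 9.11 */

/* Hypothetical helper: pick the default interval for the MLD version
 * in use. Anything other than version 1 is treated as MLDv2. */
static unsigned int unsolicited_report_interval_ms(int mld_version)
{
    return mld_version == 1 ? MLDV1_UNSOLICITED_REPORT_INTERVAL_MS
                            : MLDV2_UNSOLICITED_REPORT_INTERVAL_MS;
}
```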
     

31 Jan, 2013

1 commit


14 Jan, 2013

2 commits


09 Dec, 2012

1 commit

  • This patch adds support in the kernel for offloading in the NIC Tx and Rx
    checksumming for encapsulated packets (such as VXLAN and IP GRE).

    For Tx encapsulation offload, the driver will need to set the right bits
    in netdev->hw_enc_features. The protocol driver will have to set the
    skb->encapsulation bit and populate the inner headers, so the NIC driver will
    use those inner headers to calculate the csum in hardware.

    For Rx encapsulation offload, the driver will again need to set the
    skb->encapsulation flag and skb->ip_summed to CHECKSUM_UNNECESSARY.
    In that case the protocol driver should push the decapsulated packet up
    to the stack, again with CHECKSUM_UNNECESSARY. In either case, the protocol
    driver should set the skb->encapsulation flag back to zero. Finally the
    protocol driver should have NETIF_F_RXCSUM flag set in its features.

    Signed-off-by: Joseph Gasparakis
    Signed-off-by: Peter P Waskiewicz Jr
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Joseph Gasparakis
     

01 Dec, 2012

1 commit

  • commit 68835aba4d9b (net: optimize INET input path further)
    moved some fields used for tcp/udp sockets lookup in the first cache
    line of struct sock_common.

    This patch moves inet_dport/inet_num as well, filling a 32bit hole
    on 64 bit arches and reducing number of cache line misses in lookups.

    Also change INET_MATCH()/INET_TW_MATCH() to perform the ports match
    before addresses match, as this check is more discriminant.

    Remove the hash check from MATCH() macros because we don't need to
    re validate the hash value after taking a refcount on socket, and
    use likely/unlikely compiler hints, as the sk_hash/hash check
    makes the following conditional tests 100% predicted by cpu.

    Introduce skc_addrpair/skc_portpair pair values to better
    document the alignment requirements of the port/addr pairs
    used in the various MATCH() macros, and remove some casts.

    The namespace check can also be done last.

    This slightly improves TCP/UDP lookup times.

    IP/TCP early demux needs inet->rx_dst_ifindex and
    TCP needs inet->min_ttl, let's group them together in the same cache line.

    With help from Ben Hutchings & Joe Perches.

    The idea for this patch came after Ling Ma's proposal to move skc_hash
    to the beginning of struct sock_common, and should allow him
    to submit a final version of his patch. My tests show an improvement
    doing so.

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Joe Perches
    Cc: Ling Ma
    Signed-off-by: David S. Miller

    Eric Dumazet
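
The ordering change described above (compare ports before addresses, each pair as one integer) can be sketched as follows. The layout is a simplification chosen for illustration: the kernel's real pair layout depends on endianness and word size, and every name below is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key layout: each pair is compared as a single integer. */
struct lookup_key {
    uint64_t addrpair;  /* remote address in the high 32 bits, local in the low */
    uint32_t portpair;  /* remote port in the high 16 bits, local in the low */
};

static uint64_t make_addrpair(uint32_t daddr, uint32_t rcv_saddr)
{
    return ((uint64_t)daddr << 32) | rcv_saddr;
}

static uint32_t make_portpair(uint16_t dport, uint16_t num)
{
    return ((uint32_t)dport << 16) | num;
}

/* Ports are checked first: they tell candidate sockets apart more often
 * than addresses do, so most mismatches are rejected after one compare. */
static bool key_match(const struct lookup_key *k,
                      uint64_t addrpair, uint32_t portpair)
{
    return k->portpair == portpair && k->addrpair == addrpair;
}
```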
     

14 Nov, 2012

1 commit


13 Oct, 2012

1 commit


30 Aug, 2012

1 commit

  • The IPv6 conntrack fragmentation currently has a couple of shortcomings.
    Fragments are collected in PREROUTING/OUTPUT, are defragmented, the
    defragmented packet is then passed to conntrack, the resulting conntrack
    information is attached to each original fragment and the fragments then
    continue their way through the stack.

    Helper invocation occurs in the POSTROUTING hook, at which point only
    the original fragments are available. The result of this is that
    fragmented packets are never passed to helpers.

    This patch improves the situation in the following way:

    - If a reassembled packet belongs to a connection that has a helper
    assigned, the reassembled packet is passed through the stack instead
    of the original fragments.

    - During defragmentation, the largest received fragment size is stored.
    On output, the packet is refragmented if required. If the largest
    received fragment size exceeds the outgoing MTU, a "packet too big"
    message is generated, thus behaving as if the original fragments
    were passed through the stack from an outside point of view.

    - The ipv6_helper() hook function can't receive fragments anymore for
    connections using a helper, so it is switched to use ipv6_skip_exthdr()
    instead of the netfilter specific nf_ct_ipv6_skip_exthdr() and the
    reassembled packets are passed to connection tracking helpers.

    The result of this is that we can properly track fragmented packets, but
    still generate ICMPv6 Packet too big messages if we would have before.

    This patch is also required as a precondition for IPv6 NAT, where NAT
    helpers might enlarge packets up to a point that they require
    fragmentation. In that case we can't generate Packet too big messages
    since the proper MTU can't be calculated in all cases (e.g. when
    changing textual representation of a variable amount of addresses),
    so the packet is transparently fragmented iff the original packet or
    fragments would have fit the outgoing MTU.

    IPVS parts by Jesper Dangaard Brouer.

    Signed-off-by: Patrick McHardy

    Patrick McHardy
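
The output-side behaviour described above for reassembled packets can be condensed into a small decision function. This is a sketch of the rule as stated in the commit message, not the netfilter code. It applies only to packets that were reassembled from fragments, and all names are hypothetical.

```c
#include <stddef.h>

enum output_action {
    OUTPUT_SEND_AS_IS,
    OUTPUT_REFRAGMENT,
    OUTPUT_PACKET_TOO_BIG,
};

/* Hypothetical decision helper for a reassembled packet.
 * max_frag_size is the largest fragment seen during defragmentation. */
static enum output_action choose_output_action(size_t packet_len,
                                               size_t max_frag_size,
                                               size_t mtu)
{
    /* The original fragments would not have fit: report it, as if the
     * fragments themselves had travelled through the stack. */
    if (max_frag_size > mtu)
        return OUTPUT_PACKET_TOO_BIG;

    /* The fragments would have fit, but the reassembled packet does not. */
    if (packet_len > mtu)
        return OUTPUT_REFRAGMENT;

    return OUTPUT_SEND_AS_IS;
}
```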
     

07 Aug, 2012

1 commit

  • IPv6 needs a cookie in dst_check() call.

    We need to add rx_dst_cookie and provide a family independent
    sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Jul, 2012

1 commit

  • We should provide to inet6_csk_route_socket a struct flowi6 pointer,
    so that inet6_csk_xmit() works correctly instead of sending garbage.

    Also add some consts

    Signed-off-by: Eric Dumazet
    Reported-by: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jul, 2012

1 commit


13 Feb, 2012

2 commits

  • Currently, it is not easily possible to get TOS/DSCP value of packets from
    an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
    with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
    not actually implemented for TOS, though.

    This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
    (IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
    the TOS field is stored from the other party's ACK.

    This is needed for proxies which require DSCP transparency. One such example
    is at http://zph.bratcheda.org/.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • Implement helper inline function to get traffic class from IPv6 header.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

09 Feb, 2012

1 commit


12 Dec, 2011

1 commit


25 Nov, 2010

1 commit

  • ipv6_sk_mc_lock rwlock becomes a spinlock.

    Readers (inet6_mc_check()) now take rcu_read_lock() instead of the read
    lock. Writers don't need to disable BH anymore.

    struct ipv6_mc_socklist objects are reclaimed after one RCU grace
    period.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2010

1 commit


23 Aug, 2010

1 commit

  • __packed is only defined in kernel space, so we should use
    __attribute__((packed)) for the code shared between kernel and user space.

    Two __attribute() annotations are replaced with __attribute__() too.

    Signed-off-by: Changli Gao
    Signed-off-by: David S. Miller

    Changli Gao
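
A short illustration of the spelling the commit above switches to. It assumes a GCC- or Clang-compatible compiler, and the struct names are hypothetical.

```c
#include <stdint.h>

/* Spelled out with __attribute__((packed)), which GCC and Clang accept
 * in user space; the __packed shorthand is a kernel-internal macro. */
struct wire_record {
    uint8_t  type;
    uint32_t value;
} __attribute__((packed));

/* Same members without the attribute: the compiler may insert padding
 * after "type" so that "value" is naturally aligned. */
struct padded_record {
    uint8_t  type;
    uint32_t value;
};
```

The packed struct occupies exactly five bytes, while the unpacked one is usually larger because of alignment padding.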
     

20 Jul, 2010

1 commit

    Even with jumbograms I cannot see any way in which we would need
    to record a next-header offset larger than 65535.

    The maximum extension header length is (256 << 3) == 2048.
    There are only a handful of extension headers specified which
    we'd even accept (say 5 or 6), therefore the largest next-header
    offset we'd ever have to contend with is something less than
    say 16k.

    Therefore make it a u16 instead of a u32.

    Signed-off-by: David S. Miller

    David S. Miller
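
The bound argued above can be checked with a few lines of arithmetic. This sketch assumes six extension headers (the upper end of the commit's "say 5 or 6") and counts the 40-byte fixed IPv6 header as part of the offset. Both are illustrative assumptions, and the names are hypothetical.

```c
#include <stdint.h>

/* Hdr Ext Len is an 8-bit count of 8-octet units that excludes the
 * first 8 octets, so one extension header is at most 2048 bytes. */
#define MAX_EXT_HDR_LEN   (256u << 3)
#define IPV6_FIXED_HDR    40u   /* fixed IPv6 header length */
#define ASSUMED_EXT_HDRS  6u    /* assumption taken from "say 5 or 6" */

/* Worst-case next-header offset under the assumptions above. */
static uint32_t worst_case_nhoff(void)
{
    return IPV6_FIXED_HDR + ASSUMED_EXT_HDRS * MAX_EXT_HDR_LEN;
}
```

The result, 12328, is below 16k and well within the range of a 16-bit field.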
     

03 Jun, 2010

1 commit


11 May, 2010

2 commits

  • This patch adds support for multiple independent multicast routing instances,
    named "tables".

    Userspace multicast routing daemons can bind to a specific table instance by
    issuing a setsockopt call using a new option MRT6_TABLE. The table number is
    stored in the raw socket data and affects all following ip6mr setsockopt(),
    getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT)
    is created with a default routing rule pointing to it. Newly created pim6reg
    devices have the table number appended ("pim6regX"), with the exception of
    devices created in the default table, which are named just "pim6reg" for
    compatibility reasons.

    Packets are directed to a specific table instance using routing rules,
    similar to how regular routing rules work. Currently iif, oif and mark
    are supported as keys, source and destination addresses could be supported
    additionally.

    Example usage:

    - bind pimd/xorp/... to a specific table:

    uint32_t table = 123;
    setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table));

    - create routing rules directing packets to the new table:

    # ip -6 mrule add iif eth0 lookup 123
    # ip -6 mrule add oif eth0 lookup 123

    Signed-off-by: Patrick McHardy

    Patrick McHardy
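
The device naming rule described in the commit above can be sketched as a formatting helper. The function is hypothetical, and the default table id is passed in as a parameter instead of assuming the value of RT6_TABLE_DFLT.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper mirroring the naming rule in the text:
 * "pim6reg" for the default table, "pim6reg<id>" for any other.
 * Returns the snprintf() result. */
static int pim6reg_name(char *buf, size_t len,
                        uint32_t table, uint32_t default_table)
{
    if (table == default_table)
        return snprintf(buf, len, "pim6reg");
    return snprintf(buf, len, "pim6reg%u", (unsigned int)table);
}
```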
     
  • Conflicts:
    net/bridge/br_device.c
    net/bridge/br_forward.c

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

24 Apr, 2010

2 commits

  • Finally add support to detect a local IPV6_DONTFRAG event
    and return the relevant data to the user if they've enabled
    IPV6_RECVPATHMTU on the socket. The next recvmsg() will
    return no data, but have an IPV6_PATHMTU as ancillary data.

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
     
  • Add underlying data structure changes and basic setsockopt()
    and getsockopt() support for IPV6_RECVPATHMTU, IPV6_PATHMTU,
    and IPV6_DONTFRAG. IPV6_PATHMTU is actually fully functional
    at this point.

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
     

23 Apr, 2010

1 commit

  • This patch adds IPv6 support for RFC5082 Generalized TTL Security Mechanism.

    Note to users of mapped addresses: the IPv6 and IPv4 socket options are separate.
    The server does have to deal with both IPv4 and IPv6 socket options
    and the client has to handle the different options for each family.

    On client:
        int ttl = 255;
        getaddrinfo(argv[1], argv[2], &hint, &result);

        for (rp = result; rp != NULL; rp = rp->ai_next) {
            s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
            if (s < 0) continue;

            if (rp->ai_family == AF_INET) {
                setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
            } else if (rp->ai_family == AF_INET6) {
                setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
                           &ttl, sizeof(ttl));
            }

            if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
            ...

    On server:
        int minttl = 255 - maxhops;

        getaddrinfo(NULL, port, &hints, &result);
        for (rp = result; rp != NULL; rp = rp->ai_next) {
            s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
            if (s < 0) continue;

            if (rp->ai_family == AF_INET6)
                setsockopt(s, IPPROTO_IPV6, IPV6_MINHOPCOUNT,
                           &minttl, sizeof(minttl));
            setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));

            if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
                break;
            ...

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
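
The acceptance rule behind the minimum hop count option in the commit above can be sketched as a predicate. It is an illustration of the GTSM idea, not the kernel's receive path, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: with senders using hop limit 255, a packet that
 * crossed at most max_hops routers arrives with at least 255 - max_hops. */
static bool gtsm_accept(uint8_t received_hop_limit, uint8_t max_hops)
{
    uint8_t min_hop_limit = (uint8_t)(255 - max_hops);

    return received_hop_limit >= min_hop_limit;
}
```

With max_hops set to 0, only packets from directly connected peers (hop limit still 255) pass the check.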
     

13 Apr, 2010

1 commit


19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    The goal is to transfer fields used at lookup time in the first
    read-mostly cache line (inside struct sock_common) and move sk_refcnt
    to a separate cache line (only written by rx path)

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
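
The reason for the prefix in the commit above becomes visible once a field is turned into a macro. A minimal sketch with hypothetical struct and macro names follows. The real definitions live in the kernel headers and differ in detail.

```c
#include <stdint.h>

struct common_part {
    uint32_t c_daddr;
    uint16_t c_dport;
};

struct inet_sock_sketch {
    struct common_part common;
    uint32_t saddr;
};

/* Prefixed alias macros: a plain "daddr" or "dport" macro would rewrite
 * every unrelated variable or member that happens to use that name. */
#define inet_daddr common.c_daddr
#define inet_dport common.c_dport

static uint32_t get_daddr(const struct inet_sock_sketch *inet)
{
    uint32_t daddr = inet->inet_daddr;  /* a local named "daddr" stays legal */

    return daddr;
}
```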
     

09 Oct, 2009

1 commit

  • (This patch fixes a bug in commit f7734fdf61ec6bb848e0bafc1fb8bad2c124bb50,
    titled "make TLLAO option for NA packets configurable")

    When the IPV6 conf is used, the function sysctl_set_parent is called and the
    array addrconf_sysctl is used as a parameter of the function.

    The above patch added the new conf "force_tllao" to the array addrconf_sysctl,
    but the size of the array was not modified: the statically allocated size is
    DEVCONF_MAX + 1 but the real size is DEVCONF_MAX + 2, so the function
    sysctl_set_parent accessed a wrong address.

    I got the following information.
    Call Trace:
    [] sysctl_set_parent+0x29/0x3e
    [] sysctl_set_parent+0x29/0x3e
    [] sysctl_set_parent+0x29/0x3e
    [] sysctl_set_parent+0x29/0x3e
    [] sysctl_set_parent+0x29/0x3e
    [] __register_sysctl_paths+0xde/0x272
    [] ? __kmalloc_track_caller+0x16e/0x180
    [] ? __addrconf_sysctl_register+0xc5/0x144 [ipv6]
    [] register_net_sysctl_table+0x48/0x4b
    [] __addrconf_sysctl_register+0xf7/0x144 [ipv6]
    [] addrconf_init_net+0xd4/0x104 [ipv6]
    [] setup_net+0x35/0x82
    [] copy_net_ns+0x76/0xe0
    [] create_new_namespaces+0xf0/0x16e
    [] copy_namespaces+0x65/0x9f
    [] copy_process+0xb2c/0x12c3
    [] do_fork+0x14b/0x2d2
    [] ? up_read+0xe/0x10
    [] ? do_page_fault+0x27a/0x2aa
    [] sys_clone+0x28/0x2a
    [] stub_clone+0x13/0x20
    [] ? system_call_fastpath+0x16/0x1b

    The IPv6-related settings in .config are as follows:
    CONFIG_IPV6=m
    CONFIG_IPV6_PRIVACY=y
    CONFIG_IPV6_ROUTER_PREF=y
    CONFIG_IPV6_ROUTE_INFO=y
    CONFIG_IPV6_OPTIMISTIC_DAD=y
    CONFIG_IPV6_MIP6=m
    CONFIG_IPV6_SIT=m
    # CONFIG_IPV6_SIT_6RD is not set
    CONFIG_IPV6_NDISC_NODETYPE=y
    CONFIG_IPV6_TUNNEL=m
    CONFIG_IPV6_MULTIPLE_TABLES=y
    CONFIG_IPV6_SUBTREES=y
    CONFIG_IPV6_MROUTE=y
    CONFIG_IPV6_PIMSM_V2=y
    # CONFIG_IP_VS_IPV6 is not set
    CONFIG_NF_CONNTRACK_IPV6=m
    CONFIG_IP6_NF_MATCH_IPV6HEADER=m

    I confirmed this patch fixes this problem.

    Signed-off-by: Jin Dongming
    Signed-off-by: David S. Miller

    Jin Dongming
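
The class of bug fixed above comes from a table that is walked until a terminating entry. A minimal sketch of the sizing rule, with hypothetical names that do not reproduce the kernel's sysctl structures:

```c
#include <stddef.h>

enum knob { KNOB_FORWARDING, KNOB_MTU, KNOB_FORCE_TLLAO, KNOB_MAX };

struct knob_entry {
    const char *name;   /* NULL marks the end of the table */
};

/* One slot per knob plus one for the terminating sentinel. */
static const struct knob_entry knob_table[KNOB_MAX + 1] = {
    [KNOB_FORWARDING]  = { "forwarding" },
    [KNOB_MTU]         = { "mtu" },
    [KNOB_FORCE_TLLAO] = { "force_tllao" },
    /* knob_table[KNOB_MAX] is zero-initialized and acts as the sentinel */
};

/* Walks until the sentinel, as a consumer of the table would. */
static size_t count_knobs(const struct knob_entry *table)
{
    size_t n = 0;

    while (table[n].name != NULL)
        n++;
    return n;
}
```

If an entry is added without the enum-derived size growing with it, the sentinel no longer fits in the array and a walker like count_knobs() reads past the end.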
     

07 Oct, 2009

1 commit

  • On Friday 02 October 2009 20:53:51 you wrote:

    > This is good although I would have shortened the name.

    Ah, I knew I forgot something :) Here is v4.

    tavi

    From 24d96d825b9fa832b22878cc6c990d5711968734 Mon Sep 17 00:00:00 2001
    From: Octavian Purdila
    Date: Fri, 2 Oct 2009 00:51:15 +0300
    Subject: [PATCH] ipv6: new sysctl for sending TLLAO with unicast NAs

    Neighbor advertisements responding to unicast neighbor solicitations
    did not include the target link-layer address option. This patch adds
    a new sysctl option (disabled by default) which controls whether this
    option should be sent even with unicast NAs.

    The need for this arose because certain routers expect the TLLAO in
    some situations even as a response to unicast NS packets.

    Moreover, RFC 2461 recommends sending this to avoid a race condition
    (section 4.4, Target link-layer address)

    Signed-off-by: Cosmin Ratiu
    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila