12 Dec, 2011

1 commit


10 Nov, 2011

1 commit

  • Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :

    > At least, in recent kernels we dont change dst->refcnt in forwarding
    > patch (usinf NOREF skb->dst)
    >
    > One particular point is the atomic_inc(dst->refcnt) we have to perform
    > when queuing an UDP packet if socket asked PKTINFO stuff (for example a
    > typical DNS server has to setup this option)
    >
    > I have one patch somewhere that stores the information in skb->cb[] and
    > avoid the atomic_{inc|dec}(dst->refcnt).
    >

    OK I found it, I did some extra tests and believe its ready.

    [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

    When a socket uses IP_PKTINFO notifications, we currently force a dst
    reference for each received skb. Reader has to access dst to get needed
    information (rt_iif & rt_spec_dst) and must release dst reference.

    We also forced a dst reference if skb was put in socket backlog, even
    without IP_PKTINFO handling. This happens under stress/load.

    We can instead store the needed information in skb->cb[], so that only
    softirq handler really access dst, improving cache hit ratios.

    This removes two atomic operations per packet, and false sharing as
    well.

    On a benchmark using a mono threaded receiver (doing only recvmsg()
    calls), I can reach 720.000 pps instead of 570.000 pps.

    IP_PKTINFO is typically used by DNS servers, and any multihomed aware
    UDP application.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Oct, 2011

1 commit

  • There is a long standing bug in linux tcp stack, about ACK messages sent
    on behalf of TIME_WAIT sockets.

    In the IP header of the ACK message, we choose to reflect TOS field of
    incoming message, and this might break some setups.

    Example of things that were broken :
    - Routing using TOS as a selector
    - Firewalls
    - Trafic classification / shaping

    We now remember in timewait structure the inet tos field and use it in
    ACK generation, and route lookup.

    Notes :
    - We still reflect incoming TOS in RST messages.
    - We could extend MuraliRaja Muniraju patch to report TOS value in
    netlink messages for TIME_WAIT sockets.
    - A patch is needed for IPv6

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2011

1 commit

  • Fragmented multicast frames are delivered to a single macvlan port,
    because ip defrag logic considers other samples are redundant.

    Implement a defrag step before trying to send the multicast frame.

    Reported-by: Ben Greear
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Jul, 2011

1 commit


24 Jun, 2011

1 commit


22 Jun, 2011

1 commit


09 Jun, 2011

1 commit

  • Andi Kleen and Tim Chen reported huge contention on inetpeer
    unused_peers.lock, on memcached workload on a 40 core machine, with
    disabled route cache.

    It appears we constantly flip peers refcnt between 0 and 1 values, and
    we must insert/remove peers from unused_peers.list, holding a contended
    spinlock.

    Remove this list completely and perform a garbage collection on-the-fly,
    at lookup time, using the expired nodes we met during the tree
    traversal.

    This removes a lot of code, makes locking more standard, and obsoletes
    two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also
    removes two pointers in inet_peer structure.

    There is still a false sharing effect because refcnt is in first cache
    line of object [were the links and keys used by lookups are located], we
    might move it at the end of inet_peer structure to let this first cache
    line mostly read by cpus.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Tim Chen
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2011

1 commit


09 May, 2011

3 commits


07 May, 2011

1 commit

  • When we fast path datagram sends to avoid locking by putting
    the inet_cork on the stack we use up lots of space that isn't
    necessary.

    This is because inet_cork contains a "struct flowi" which isn't
    used in these code paths.

    Split inet_cork to two parts, "inet_cork" and "inet_cork_full".
    Only the latter of which has the "struct flowi" and is what is
    stored in inet_sock.

    Signed-off-by: David S. Miller
    Acked-by: Eric Dumazet

    David S. Miller
     

29 Apr, 2011

1 commit

  • We lack proper synchronization to manipulate inet->opt ip_options

    Problem is ip_make_skb() calls ip_setup_cork() and
    ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
    without any protection against another thread manipulating inet->opt.

    Another thread can change inet->opt pointer and free old one under us.

    Use RCU to protect inet->opt (changed to inet->inet_opt).

    Instead of handling atomic refcounts, just copy ip_options when
    necessary, to avoid cache line dirtying.

    We cant insert an rcu_head in struct ip_options since its included in
    skb->cb[], so this patch is large because I had to introduce a new
    ip_options_rcu structure.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Mar, 2011

1 commit

  • My commit 6d55cb91a0020ac0 (gre: fix hard header destination
    address checking) broke multicast.

    The reason is that ip_gre used to get ipgre_header() calls with
    zero destination if we have NOARP or multicast destination. Instead
    the actual target was decided at ipgre_tunnel_xmit() time based on
    per-protocol dissection.

    Instead of allowing the "abuse" of ->header() calls with invalid
    destination, this creates multicast mappings for ip_gre. This also
    fixes "ip neigh show nud noarp" to display the proper multicast
    mappings used by the gre device.

    Reported-by: Doug Kehn
    Signed-off-by: Timo Teräs
    Acked-by: Doug Kehn
    Signed-off-by: David S. Miller

    Timo Teräs
     

02 Mar, 2011

1 commit

  • This patch adds the helper ip_make_skb which is like ip_append_data
    and ip_push_pending_frames all rolled into one, except that it does
    not send the skb produced. The sending part is carried out by
    ip_send_skb, which the transport protocol can call after it has
    tweaked the skb.

    It is meant to be called in cases where corking is not used should
    have a one-to-one correspondence to sendmsg.

    This patch also adds the helper ip_finish_skb which is meant to
    be replace ip_push_pending_frames when corking is required.
    Previously the protocol stack would peek at the socket write
    queue and add its header to the first packet. With ip_finish_skb,
    the protocol stack can directly operate on the final skb instead,
    just like the non-corking case with ip_make_skb.

    Signed-off-by: Herbert Xu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Herbert Xu
     

13 Dec, 2010

1 commit

  • Always go through a new ip4_dst_hoplimit() helper, just like ipv6.

    This allowed several simplifications:

    1) The interim dst_metric_hoplimit() can go as it's no longer
    userd.

    2) The sysctl_ip_default_ttl entry no longer needs to use
    ipv4_doint_and_flush, since the sysctl is not cached in
    routing cache metrics any longer.

    3) ipv4_doint_and_flush no longer needs to be exported and
    therefore can be marked static.

    When ipv4_doint_and_flush_strategy was removed some time ago,
    the external declaration in ip.h was mistakenly left around
    so kill that off too.

    We have to move the sysctl_ip_default_ttl declaration into
    ipv4's route cache definition header net/route.h, because
    currently net/ip.h (where the declaration lives now) has
    a back dependency on net/route.h

    Signed-off-by: David S. Miller

    David S. Miller
     

26 Oct, 2010

1 commit


24 Sep, 2010

1 commit


19 Aug, 2010

1 commit

  • This patch removes the abstraction introduced by the union skb_shared_tx in
    the shared skb data.

    The access of the different union elements at several places led to some
    confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

    http://marc.info/?l=linux-netdev&m=128084897415886&w=2

    Signed-off-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Oliver Hartkopp
     

01 Jul, 2010

1 commit

  • /proc/net/snmp and /proc/net/netstat expose SNMP counters.

    Width of these counters is either 32 or 64 bits, depending on the size
    of "unsigned long" in kernel.

    This means user program parsing these files must already be prepared to
    deal with 64bit values, regardless of user program being 32 or 64 bit.

    This patch introduces 64bit snmp values for IPSTAT mib, where some
    counters can wrap pretty fast if they are 32bit wide.

    # netstat -s|egrep "InOctets|OutOctets"
    InOctets: 244068329096
    OutOctets: 244069348848

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jun, 2010

1 commit

  • In preparation for 64bit snmp counters for some mibs,
    add an 'align' parameter to snmp_mib_init(), instead
    of assuming mibs only contain 'unsigned long' fields.

    Callers can use __alignof__(type) to provide correct
    alignment.

    Signed-off-by: Eric Dumazet
    CC: Herbert Xu
    CC: Arnaldo Carvalho de Melo
    CC: Hideaki YOSHIFUJI
    CC: Vlad Yasevich
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jun, 2010

1 commit

  • commit 66018506e15b (ip: Router Alert RCU conversion) introduced RCU
    lookups to ip_call_ra_chain(). It missed proper deinit phase :
    When ip_ra_control() deletes an ip_ra_chain, it should make sure
    ip_call_ra_chain() users can not start to use socket during the rcu
    grace period. It should also delay the sock_put() after the grace
    period, or we risk a premature socket freeing and corruptions, as
    raw sockets are not rcu protected yet.

    This delay avoids using expensive atomic_inc_not_zero() in
    ip_call_ra_chain().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Jun, 2010

1 commit


25 May, 2010

1 commit


16 May, 2010

1 commit

  • (Dropped the infiniband part, because Tetsuo modified the related code,
    I will send a separate patch for it once this is accepted.)

    This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
    allows users to reserve ports for third-party applications.

    The reserved ports will not be used by automatic port assignments
    (e.g. when calling connect() or bind() with port number 0). Explicit
    port allocation behavior is unchanged.

    Signed-off-by: Octavian Purdila
    Signed-off-by: WANG Cong
    Cc: Neil Horman
    Cc: Eric Dumazet
    Cc: Eric W. Biederman
    Signed-off-by: David S. Miller

    Amerigo Wang
     

29 Apr, 2010

1 commit

  • When queueing a skb to socket, we can immediately release its dst if
    target socket do not use IP_CMSG_PKTINFO.

    tcp_data_queue() can drop dst too.

    This to benefit from a hot cache line and avoid the receiver, possibly
    on another cpu, to dirty this cache line himself.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Apr, 2010

1 commit

  • As Herbert Xu said: we should be able to simply replace ipfragok
    with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
    has droped ipfragok and set local_df value properly.

    The patch kills the ipfragok parameter of .queue_xmit().

    Signed-off-by: Shan Wei
    Signed-off-by: David S. Miller

    Shan Wei
     

17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to net.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    The macro and type tricks around snmp stats make things a bit
    interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
    as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
    snmp_mib_*() users which used to cast the argument to (void **) are
    updated to cast it to (void __percpu **).

    Signed-off-by: Tejun Heo
    Acked-by: David S. Miller
    Cc: Patrick McHardy
    Cc: Arnaldo Carvalho de Melo
    Cc: Vlad Yasevich
    Cc: netdev@vger.kernel.org
    Signed-off-by: David S. Miller

    Tejun Heo
     

16 Feb, 2010

1 commit


14 Jan, 2010

1 commit


07 Jan, 2010

1 commit

  • When we have L3 tunnels with different inner/outer families
    (i.e. IPV4/IPV6) which use a multicast address as the outer tunnel
    destination address, multicast packets will be loopbacked back to the
    sending socket even if IP*_MULTICAST_LOOP is set to disabled.

    The mc_loop flag is present in the family specific part of the socket
    (e.g. the IPv4 or IPv4 specific part). setsockopt sets the inner
    family mc_loop flag. When the packet is pushed through the L3 tunnel
    it will eventually be processed by the outer family which if different
    will check the flag in a different part of the socket then it was set.

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

15 Dec, 2009

1 commit

  • When fragments from bridge netfilter are passed to IPv4 or IPv6 conntrack
    and a reassembly queue with the same fragment key already exists from
    reassembling a similar packet received on a different device (f.i. with
    multicasted fragments), the reassembled packet might continue on a different
    codepath than where the head fragment originated. This can cause crashes
    in bridge netfilter when a fragment received on a non-bridge device (and
    thus with skb->nf_bridge == NULL) continues through the bridge netfilter
    code.

    Add a new reassembly identifier for packets originating from bridge
    netfilter and use it to put those packets in insolated queues.

    Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14805

    Reported-and-Tested-by: Chong Qiao
    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

04 Nov, 2009

1 commit

  • This cleanup patch puts struct/union/enum opening braces,
    in first line to ease grep games.

    struct something
    {

    becomes :

    struct something {

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    Goal is to transfert fields used at lookup time in the first
    read-mostly cache line (inside struct sock_common) and move sk_refcnt
    to a separate cache line (only written by rx path)

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Sep, 2009

1 commit

  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

27 Apr, 2009

1 commit

  • The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
    OutMcastOctets:
    http://tools.ietf.org/html/rfc4293
    But it seems we don't track those in any way that easy to separate from other
    protocols. This patch adds those missing counters to the stats file. Tested
    successfully by me

    With help from Eric Dumazet.

    Signed-off-by: Neil Horman
    Signed-off-by: David S. Miller

    Neil Horman
     

16 Feb, 2009

1 commit


26 Nov, 2008

1 commit

  • Make
    net.core.xfrm_aevent_etime
    net.core.xfrm_acq_expires
    net.core.xfrm_aevent_rseqth
    net.core.xfrm_larval_drop

    sysctls per-netns.

    For that make net_core_path[] global, register it to prevent two
    /proc/net/core antries and change initcall position -- xfrm_init() is called
    from fs_initcall, so this one should be fs_initcall at least.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan