14 Feb, 2013

1 commit

  • Patch cef401de7be8c4e (net: fix possible wrong checksum
    generation) fixed the wrong checksum calculation, but it broke TSO
    by defining a new GSO type without a corresponding netdev feature;
    net_gso_ok() will not allow hardware checksum/segmentation offload
    of such packets without that feature.

    The following patch fixes both TSO and the wrong checksum. It uses
    the same logic that Eric Dumazet used: a new flag,
    SKBTX_SHARED_FRAG, is set if at least one frag can be modified by
    the user, but the flag is kept in the skb shared info tx_flags
    rather than in gso_type.

    tx_flags is a better fit than gso_type, since an skb can have a
    shared frag without being a GSO packet. This does not tie
    SHARED_FRAG to GSO, so there is no need to define a netdev feature
    for it.
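
    A minimal sketch of the resulting shape (the helper is the one this
    patch introduces; the exact bit position is illustrative):

        /* include/linux/skbuff.h */
        SKBTX_SHARED_FRAG = 1 << 5,     /* at least one frag may be
                                         * modified by the user */

        /* true if a frag may change underneath us, e.g. while a device
         * computes checksums over it */
        static inline bool skb_has_shared_frag(const struct sk_buff *skb)
        {
                return skb_is_nonlinear(skb) &&
                       skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
        }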

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller


09 Oct, 2012

1 commit

  • Add a new flag to remember when the route is via a gateway.
    We will use it to allow rt_gateway to contain the address of a
    directly connected host for the cases when DST_NOCACHE is used or
    when the NH exception caches a per-destination route without the
    DST_NOCACHE flag, i.e. when routes are not used for other
    destinations. This way we force neighbour resolution to work with
    the routed destination, while a different address can be used in
    the packet, a feature needed for IPVS-DR, where the original packet
    for a virtual IP is routed via the route to the real IP.
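
    A sketch of the resulting struct rtable fields (layout abridged to
    the two fields discussed here):

        struct rtable {
                /* ... */
                __u8    rt_uses_gateway; /* route resolves via a gateway */
                /* ... */
                __be32  rt_gateway;     /* neighbour address: the gateway,
                                         * or a directly connected host
                                         * for cached per-destination
                                         * routes */
                /* ... */
        };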

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller


25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    This is done to increase the probability of coalescing small
    write()s into single segments in skbs still in the write queue (not
    yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in
    sk->sk_sndmsg_page.

    It is also quite inefficient to build TSO 64KB packets, because we
    need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we
    hit the page allocator more often than we would like.

    This patch adds a per-task frag allocator and uses bigger pages if
    available. An automatic fallback is done in case of memory pressure
    (up to 32768 bytes per frag, that's order-3 pages on x86).
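
    A condensed sketch of the refill logic (function name hypothetical;
    the real helper lives in net/core/sock.c):

        /* try order-3 (32KB) first, then fall back towards order-0 */
        static bool frag_refill_sketch(struct page_frag *pfrag, gfp_t gfp)
        {
                int order = 3;

                do {
                        gfp_t mask = gfp;

                        if (order)
                                mask |= __GFP_COMP | __GFP_NOWARN;
                        pfrag->page = alloc_pages(mask, order);
                        if (pfrag->page) {
                                pfrag->offset = 0;
                                pfrag->size = PAGE_SIZE << order;
                                return true;
                        }
                } while (--order >= 0);

                return false;   /* caller enters memory pressure handling */
        }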

    This increases TCP stream performance by 20% on the loopback
    device, but it also benefits other network devices, since 8x fewer
    frags are mapped on transmit and unmapped on tx completion.
    Alexander Duyck mentioned a probable performance win on systems
    with IOMMU enabled.

    It's possible some SG-enabled hardware can't cope with bigger
    fragments, but their ndo_start_xmit() should already handle this,
    splitting a fragment into sub-fragments, since some arches have
    PAGE_SIZE=65536.

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller


27 Aug, 2012

1 commit

  • IPv4 conntrack defragments incoming packets at the PRE_ROUTING
    hook and (in the case of forwarded packets) refragments them at
    POST_ROUTING, independent of the IP_DF flag. Refragmentation uses
    the dst_mtu() of the local route without caring about the original
    fragment sizes, thereby breaking PMTUD.

    This patch fixes this by keeping track of the largest received
    fragment with IP_DF set and generating an ICMP fragmentation-
    required error during refragmentation if that size exceeds the MTU.
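
    An illustrative sketch of the idea (field and variable names
    hypothetical, not the literal patch):

        /* while collecting fragments: remember the largest DF fragment */
        if (ip_hdr(skb)->frag_off & htons(IP_DF))
                qp->max_df_size = max(qp->max_df_size, skb->len);

        /* at refragmentation: honor PMTUD instead of silently slicing
         * to the local route's mtu */
        if (max_df_size > dst_mtu(skb_dst(skb))) {
                icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                          htonl(dst_mtu(skb_dst(skb))));
                return -EMSGSIZE;
        }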

    Signed-off-by: Patrick McHardy
    Acked-by: Eric Dumazet
    Acked-by: David S. Miller


22 Aug, 2012

1 commit

  • Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
    memory in __ip_select_ident().

    It turns out that __ip_make_skb() called ip_select_ident() before
    properly initializing iph->daddr.

    This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of
    cached route inetpeer.)
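
    The fix is an ordering change in __ip_make_skb(); roughly:

        /* the addresses must be filled in before ident selection,
         * which hashes on iph->daddr */
        ip_copy_addrs(iph, fl4);
        ip_select_ident(iph, &rt->dst, sk);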

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131

    Reported-by: Christian Casteyde
    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller


11 Aug, 2012

1 commit

  • ip_send_skb() can send orphaned skbs, so we must pass the net
    pointer explicitly to avoid a possible NULL dereference in the
    error path.
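
    The resulting prototype, with the net supplied by the caller:

        int ip_send_skb(struct net *net, struct sk_buff *skb);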

    Bug added by commit 3a7c384ffd57 (ipv4: tcp: unicast_sock should not
    land outside of TCP stack)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


10 Aug, 2012

1 commit

  • commit be9f4a44e7d41cee (ipv4: tcp: remove per net tcp_sock) added
    a selinux regression, reported and bisected by John Stultz.

    selinux_ip_postroute_compat() expects to find a valid
    sk->sk_security pointer, but this field is NULL for unicast_sock.

    It turns out that unicast_sock is really a temporary construct for
    reusing part of the IP stack
    (ip_append_data()/ip_push_pending_frames()).

    The fact is that frames sent by ip_send_unicast_reply() should be
    orphaned so as not to fool LSM.
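
    The heart of the fix, in sketch form:

        skb_orphan(skb);        /* detach the fake unicast_sock so LSM
                                 * hooks never see its NULL sk_security */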

    Note IPv6 never had this problem, as tcp_v6_send_response()
    doesn't use a fake socket at all. I'll probably implement
    tcp_v4_send_response() to remove these unicast_sock in linux-3.7.

    Reported-by: John Stultz
    Bisected-by: John Stultz
    Signed-off-by: Eric Dumazet
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: "Serge E. Hallyn"
    Signed-off-by: David S. Miller


07 Aug, 2012

1 commit

  • __neigh_create() returns either a pointer to struct neighbour or
    an ERR_PTR() value. But the caller expects it to return either a
    valid pointer or NULL. Replace the NULL check with an IS_ERR()
    check.
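
    Sketch of the corrected caller in the output path (error label
    hypothetical):

        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
        if (IS_ERR(neigh))      /* was: if (neigh == NULL) */
                goto failure;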

    The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184
    ("ipv4: Make neigh lookups directly in output packet path.").

    Signed-off-by: Vasily Kulikov
    Signed-off-by: David S. Miller


23 Jul, 2012

2 commits

  • The ipv4 routing cache is non-deterministic, performance wise, and is
    subject to reasonably easy to launch denial of service attacks.

    The routing cache works great for well behaved traffic, and the world
    was a much friendlier place when the tradeoffs that led to the routing
    cache's design were considered.

    What it boils down to is that the performance of the routing cache
    is a product of the traffic patterns seen by a system rather than
    of the contents of the routing tables, and the former is
    controllable by external entities.

    Even for "well behaved" legitimate traffic, high volume sites can see
    hit rates in the routing cache of only ~%10.

    The general flow of this patch series is that first the routing cache
    is removed. We build a completely new rtable entry every lookup
    request.

    Next we make some simplifications due to the fact that removing the
    routing cache causes several members of struct rtable to become no
    longer necessary.

    Then we need to make some adjustments such that we can legally
    cache pre-constructed routes in the FIB nexthops. Firstly, we need
    to invalidate routes which are hit with nexthop exceptions.
    Secondly, we have to change the semantics of rt->rt_gateway such
    that zero means the destination is on-link and non-zero otherwise.

    Now that the preparations are ready, we start caching precomputed
    routes in the FIB nexthops. Output and input routes need different
    kinds of care when determining if we can legally do such caching or
    not. The details are in the commit log messages for those changes.

    The patch series then winds down with some more struct rtable
    simplifications and other tidy ups that remove unnecessary overhead.

    On a SPARC-T3 output route lookups are ~876 cycles. Input route
    lookups are ~1169 cycles with rpfilter disabled, and about ~1468
    cycles with rpfilter enabled.

    These measurements were taken with the kbench_mod test module in the
    net_test_tools GIT tree:

    git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git

    That GIT tree also includes a udpflood tester tool and stresses
    route lookups on packet output.

    For example, on the same SPARC-T3 system we can run:

    time ./udpflood -l 10000000 10.2.2.11

    with routing cache:
    real 1m21.955s user 0m6.530s sys 1m15.390s

    without routing cache:
    real 1m31.678s user 0m6.520s sys 1m25.140s

    Performance undoubtedly can easily be improved further.

    For example fib_table_lookup() performs a lot of excessive
    computations with all the masking and shifting, some of it
    conditionalized to deal with edge cases.

    Also, Eric's no-ref optimization for input route lookups can be
    re-instated for the FIB nexthop caching code path. I would be really
    pleased if someone would work on that.

    In fact, anyone suitably motivated can just fire up perf on the
    loading of the net_test_tools benchmark kernel module. I spent much
    of my time going:

    bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
    bash# perf report

    Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
    Hutchings, and others.

    Signed-off-by: David S. Miller

  • Set unicast_sock uc_ttl to -1 so that we select the right ttl,
    instead of sending packets with a 0 ttl.
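
    For context, the helper that consumes uc_ttl (include/net/route.h);
    a negative value selects the per-route default:

        static inline int ip_select_ttl(struct inet_sock *inet,
                                        struct dst_entry *dst)
        {
                int ttl = inet->uc_ttl;

                if (ttl < 0)
                        ttl = ip4_dst_hoplimit(dst);
                return ttl;
        }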

    Bug added in commit be9f4a44e7d4 (ipv4: tcp: remove per net tcp_sock)

    Signed-off-by: Hiroaki SHIMODA
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


21 Jul, 2012

1 commit

  • In order to allow prefixed routes, we have to adjust how rt_gateway
    is set and interpreted.

    The new interpretation is:

    1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

    2) rt_gateway != 0, destination requires a nexthop gateway

    Abstract the fetching of the proper nexthop value using a new
    inline helper, rt_nexthop(), as suggested by Joe Perches.
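
    The helper encodes exactly these semantics:

        /* include/net/route.h */
        static inline __be32 rt_nexthop(const struct rtable *rt, __be32 daddr)
        {
                if (rt->rt_gateway)
                        return rt->rt_gateway;
                return daddr;
        }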

    Signed-off-by: David S. Miller
    Tested-by: Vijay Subramanian


20 Jul, 2012

1 commit

  • tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
    per network namespace.

    This leads to bad behavior on multiqueue NICs, because many cpus
    contend for the socket lock and, once the socket lock is acquired,
    extra false sharing on various socket fields slows down the
    operations.

    To better resist attacks, we use a percpu socket (sketched after
    the notes below). Each cpu can run without contention, using
    appropriate memory (local node).

    Additional features :

    1) We also mirror the queue_mapping of the incoming skb, so that
    answers use the same queue if possible.

    2) Setting the SOCK_USE_WRITE_QUEUE socket flag speeds up sock_wfree()

    3) We now limit the number of in-flight RST/ACK [1] packets per
    cpu, instead of per namespace, and we honor the sysctl_wmem_default
    limit dynamically. (Prior to this patch, the sysctl_wmem_default
    value was copied at boot time, so any further change would not
    affect the tcp_sock limit.)

    [1] These packets are only generated when no socket was matched for
    the incoming packet.
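
    A sketch of the per-cpu replacement for the shared socket
    (initializer abridged; uc_ttl = -1 per the ttl fix noted earlier in
    this log):

        static DEFINE_PER_CPU(struct inet_sock, unicast_sock) = {
                .sk = {
                        .sk_refcnt      = ATOMIC_INIT(1),
                        .sk_allocation  = GFP_ATOMIC,
                        .sk_flags       = (1UL << SOCK_USE_WRITE_QUEUE),
                },
                .pmtudisc       = IP_PMTUDISC_WANT,
                .uc_ttl         = -1,
        };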

    Reported-by: Bill Sommerfeld
    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller


13 Jun, 2012

1 commit

  • Add dev_loopback_xmit() in order to deduplicate functions
    ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
    ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).

    I was about to reinvent the wheel when I noticed that
    ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what
    I need and are not IP-only functions, but they were not available
    for reuse elsewhere.

    ip6_dev_loopback_xmit() does not have the line
    "skb_dst_force(skb);", but I understand that this is harmless, and
    it should be in dev_loopback_xmit().
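
    The consolidated helper looks roughly like this (net/core/dev.c):

        int dev_loopback_xmit(struct sk_buff *skb)
        {
                skb_reset_mac_header(skb);
                __skb_pull(skb, skb_network_offset(skb));
                skb->pkt_type = PACKET_LOOPBACK;
                skb->ip_summed = CHECKSUM_UNNECESSARY;
                WARN_ON(!skb_dst(skb));
                skb_dst_force(skb);
                netif_rx_ni(skb);
                return 0;
        }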

    Signed-off-by: Michel Machado
    CC: "David S. Miller"
    CC: Alexey Kuznetsov
    CC: James Morris
    CC: Hideaki YOSHIFUJI
    CC: Patrick McHardy
    CC: Eric Dumazet
    CC: Jiri Pirko
    CC: "Michał Mirosław"
    CC: Ben Hutchings
    Signed-off-by: David S. Miller


02 Dec, 2011

1 commit

  • The gcc compiler is smart enough to use a single load/store if we
    memcpy(dptr, sptr, 8) on x86_64, regardless of
    CONFIG_CC_OPTIMIZE_FOR_SIZE.

    In the IP header, daddr immediately follows saddr, and this won't
    change in the future. We only need to make sure our flowi4
    (saddr, daddr) fields won't break the rule.
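
    The resulting helper, roughly (net/ipv4/ip_output.c); the
    BUILD_BUG_ON enforces the adjacency rule at compile time:

        static void ip_copy_addrs(struct iphdr *iph, const struct flowi4 *fl4)
        {
                BUILD_BUG_ON(offsetof(typeof(*fl4), daddr) !=
                             offsetof(typeof(*fl4), saddr) + sizeof(fl4->saddr));
                memcpy(&iph->saddr, &fl4->saddr,
                       sizeof(fl4->saddr) + sizeof(fl4->daddr));
        }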

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


24 Oct, 2011

1 commit

  • There is a long-standing bug in the Linux TCP stack, concerning
    ACK messages sent on behalf of TIME_WAIT sockets.

    In the IP header of the ACK message, we choose to reflect the TOS
    field of the incoming message, and this might break some setups.

    Examples of things that were broken :
    - Routing using TOS as a selector
    - Firewalls
    - Traffic classification / shaping

    We now remember the inet tos field in the timewait structure and
    use it in ACK generation and route lookup.
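
    In sketch form (call sites approximate):

        tw->tw_tos = inet_sk(sk)->tos;  /* saved when entering TIME_WAIT */

        /* later, ACKs sent on behalf of the timewait socket pass
         * tw->tw_tos down for header construction and route lookup,
         * instead of reflecting the incoming packet's TOS */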

    Notes :
    - We still reflect the incoming TOS in RST messages.
    - We could extend MuraliRaja Muniraju's patch to report the TOS
    value in netlink messages for TIME_WAIT sockets.
    - A patch is needed for IPv6.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


19 Oct, 2011

1 commit

  • To ease skb->truesize sanitization, it's better to be able to
    localize all references to skb frag sizes.

    Define accessors : skb_frag_size() to fetch a frag's size, and
    skb_frag_size_{set|add|sub}() to manipulate it.
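
    The accessors are trivial wrappers (include/linux/skbuff.h):

        static inline unsigned int skb_frag_size(const skb_frag_t *frag)
        {
                return frag->size;
        }

        static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
        {
                frag->size = size;
        }

        static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
        {
                frag->size += delta;
        }

        static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
        {
                frag->size -= delta;
        }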

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


03 Aug, 2011

1 commit

  • Gergely Kalman reported crashes in check_peer_redir().

    It appears commit f39925dbde778 (ipv4: Cache learned redirect
    information in inetpeer.) added a race, leading to a possible NULL
    ptr dereference.

    Since we can now change dst neighbour, we should make sure a reader can
    safely use a neighbour.

    Add RCU protection to dst neighbour, and make sure check_peer_redir()
    can be called safely by different cpus in parallel.
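
    The reader side then follows the usual RCU pattern; a sketch:

        rcu_read_lock();
        n = dst_get_neighbour(dst);     /* rcu_dereference() underneath */
        if (n) {
                /* n cannot be freed before rcu_read_unlock(), since
                 * neighbours are already freed via RCU */
        }
        rcu_read_unlock();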

    As neighbours are already freed after one RCU grace period, this
    patch should not add the typical RCU penalty (cache-cold effects).

    Many thanks to Gergely for providing a pretty report pointing to the
    bug.

    Reported-by: Gergely Kalman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


22 Jul, 2011

1 commit

  • Because the ip fragment offset field counts 8-byte chunks, ip
    fragments other than the last must contain a multiple of 8 bytes of
    payload. ip_ufo_append_data wasn't respecting this constraint and,
    depending on the MTU and ip option sizes, could create malformed
    non-final fragments.
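
    The 8-byte rule as expressed in the non-UFO path of
    ip_append_data():

        maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;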

    Google-Bug-Id: 5009328
    Signed-off-by: Bill Sommerfeld
    Signed-off-by: David S. Miller


14 Jul, 2011

1 commit

  • Now that there is a one-to-one correspondence between neighbour
    and hh_cache entries, we no longer need:

    1) dynamic allocation
    2) attachment to dst->hh
    3) refcounting

    Initialization of the hh_cache entry is indicated by hh_len
    being non-zero, and such initialization is always done with
    the neighbour's lock held as a writer.
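
    In sketch form (layout abridged):

        struct neighbour {
                /* ... */
                struct hh_cache hh;     /* embedded; valid once
                                         * hh.hh_len != 0, written under
                                         * the neighbour's write lock */
                /* ... */
        };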

    Signed-off-by: David S. Miller


02 Jul, 2011

1 commit

  • We might call ip_ufo_append_data() for packets that will be
    IPsec-transformed later, but this function should be used only for
    real udp packets. So we check rt->dst.header_len, which is nonzero
    only for IPsec handling, and call ip_ufo_append_data() only if
    rt->dst.header_len is zero.
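
    The guard, in sketch form (exthdrlen is taken from
    rt->dst.header_len in ip_append_data()):

        bool use_ufo = transhdrlen &&
                       length + fragheaderlen <= mtu &&
                       (rt->dst.dev->features & NETIF_F_UFO) &&
                       !exthdrlen;      /* zero unless IPsec is involved */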

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller


28 Jun, 2011

2 commits

  • ip_append_data() builds packets based on the mtu from
    dst_mtu(rt->dst.path). With IPsec the effective mtu is lower,
    because we need to add the protocol headers and trailers later,
    when we do the IPsec transformations. So after the IPsec
    transformations the packet might be too big, which then leads to
    slow-path fragmentation. This patch fixes this by building the
    packets based on the lower IPsec mtu from dst_mtu(&rt->dst) and
    adapts the exthdr handling accordingly.
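
    The mtu selection, before and after (sketch):

        mtu = dst_mtu(rt->dst.path);    /* before: ignores IPsec overhead */
        mtu = dst_mtu(&rt->dst);        /* after: IPsec-adjusted dst */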

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

  • Git commit 59104f06 (ip: take care of last fragment in
    ip_append_data) added a check to see if we exceed the mtu when we
    add the trailer_len. However, the mtu is already reduced by the
    trailer length when the xfrm transformation bundles are set up. So
    IPsec packets of mtu size get fragmented, or, if the DF bit is set,
    the packets will not be sent even though they match the mtu
    perfectly well. This patch effectively reverts commit 59104f06.

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller


10 Jun, 2011

1 commit

  • We assume that transhdrlen is positive on the first fragment,
    which is wrong for raw packets, so we don't add exthdrlen to the
    packet size for raw packets. This leads to a reallocation on IPsec,
    because we don't have enough headroom on the skb to place the IPsec
    headers. This patch fixes this by adding exthdrlen to the packet
    size whenever the send queue of the socket is empty. This issue was
    introduced with git commit 1470ddf7 (inet: Remove explicit write
    references to sk/inet in ip_append_data)
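
    Sketch of the fix in __ip_append_data(): key the header room off
    the queue state rather than off transhdrlen (which is 0 for raw
    sockets):

        skb = skb_peek_tail(queue);
        exthdrlen = !skb ? rt->dst.header_len : 0;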

    Signed-off-by: Steffen Klassert
    Signed-off-by: David S. Miller

    Steffen Klassert