23 Apr, 2013

1 commit

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    drivers/net/ethernet/intel/igb/igb_main.c
    drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
    include/net/scm.h
    net/batman-adv/routing.c
    net/ipv4/tcp_input.c

    The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
    cleanup in net-next to now pass cred structs around.

    The be2net driver had a bug fix in 'net' that overlapped with the VLAN
    interface changes by Patrick McHardy in net-next.

    An IGB conflict existed because in 'net' the build_skb() support was
    reverted, and in 'net-next' there was a comment style fix within that
    code.

    Several batman-adv conflicts were resolved by making sure that all
    calls to batadv_is_my_mac() are changed to have a new bat_priv first
    argument.

    Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
    rewrite in 'net-next', mostly overlapping changes.

    Thanks to Stephen Rothwell and Antonio Quartulli for help with several
    of these merge resolutions.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Apr, 2013

1 commit

  • Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path)
    added a bug in IP defragmentation handling, as non refcounted
    dst could escape an RCU protected section.

    Commit 64f3b9e203bd068 (net: ip_expire() must revalidate route) fixed
    the case of timeouts, but not the general problem.

    Tom Parkin noticed crashes in UDP stack and provided a patch,
    but further analysis permitted us to pinpoint the root cause.

    Before queueing a packet into a frag list, we must drop its dst,
    as this dst has a limited lifetime (RCU protected).

    When/if a packet is finally reassembled, we use the dst of the very
    last skb, still protected by RCU and valid, as the dst of the
    reassembled packet.

    Use the same logic in IPv6, as there is no need to hold dst references.

    Reported-by: Tom Parkin
    Tested-by: Tom Parkin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Mar, 2013

1 commit


19 Mar, 2013

1 commit

  • This patch introduces a constant limit of the fragment queue hash
    table bucket list lengths. Currently the limit 128 is chosen somewhat
    arbitrarily and just ensures that we can fill up the fragment cache with
    empty packets up to the default ip_frag_high_thresh limits. It should
    just protect from list iteration eating considerable amounts of cpu.

    If we reach the maximum length in one hash bucket a warning is printed.
    This is implemented on the caller side of inet_frag_find to distinguish
    between the different users of inet_fragment.c.

    I dropped the out of memory warning in the ipv4 fragment lookup path,
    because we already get a warning by the slab allocator.
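
    The bounded lookup described above can be sketched as follows. This is
    a minimal illustration, not the kernel code: the names frag_find,
    lookup_or_warn, BucketFull, BUCKETS and MAX_DEPTH are hypothetical,
    and integer keys stand in for the real fragment queue keys. Only the
    limit of 128 and the caller-side warning come from the message.

```python
from typing import Dict, List, Optional

BUCKETS = 64      # hypothetical hash size for this sketch
MAX_DEPTH = 128   # bucket list length limit named in the commit message


class BucketFull(Exception):
    """Raised when a lookup walks a bucket that is already at the limit."""


def frag_find(table: Dict[int, List[int]], key: int) -> int:
    """Return the queue for key, creating it unless its bucket is full."""
    bucket = table.setdefault(key % BUCKETS, [])
    depth = 0
    for queue in bucket:
        if queue == key:
            return queue
        depth += 1
    if depth >= MAX_DEPTH:
        # Refuse to grow the list; the caller decides how to report it.
        raise BucketFull(key)
    bucket.append(key)
    return key


def lookup_or_warn(table: Dict[int, List[int]], key: int, user: str) -> Optional[int]:
    """Caller-side wrapper: each user prints its own warning."""
    try:
        return frag_find(table, key)
    except BucketFull:
        print(f"{user}: frag hash bucket list length grew over limit, dropping")
        return None
```

    Keeping the warning in the wrapper mirrors the idea that each user of
    the shared lookup reports the condition under its own name.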

    Cc: Eric Dumazet
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

16 Feb, 2013

1 commit


30 Jan, 2013

2 commits

  • Updating the fragmentation queues' LRU (Least-Recently-Used) list
    required taking the hash writer lock. However, the LRU list isn't
    tied to the hash at all, so we can use a separate lock for it.
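
    A rough sketch of the idea, assuming plain mutexes in place of the
    kernel's locks: the class FragCache and its methods are hypothetical
    names used only for illustration. The point is that touching the LRU
    order no longer needs the lock that guards the hash table.

```python
import threading
from collections import OrderedDict
from typing import Dict, List, Optional


class FragCache:
    """Toy fragment cache with separate locks for the hash and the LRU."""

    def __init__(self) -> None:
        self.hash_lock = threading.Lock()  # guards self.table only
        self.lru_lock = threading.Lock()   # guards self.lru only
        self.table: Dict[str, List[bytes]] = {}
        self.lru: "OrderedDict[str, None]" = OrderedDict()

    def add(self, key: str) -> None:
        """Create a queue: the hash and the LRU are updated under their own locks."""
        with self.hash_lock:
            self.table[key] = []
        with self.lru_lock:
            self.lru[key] = None

    def touch(self, key: str) -> None:
        """Mark a queue as recently used without taking the hash lock."""
        with self.lru_lock:
            self.lru.move_to_end(key)

    def oldest(self) -> Optional[str]:
        """Return the least recently used key, or None if the cache is empty."""
        with self.lru_lock:
            return next(iter(self.lru), None)
```

    For example, after adding "a" and "b" and then touching "a", the
    oldest entry is "b", and the hash lock was never taken for the touch.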

    Original-idea-by: Florian Westphal
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • This change is primarily a preparation to ease the extension of memory
    limit tracking.

    The change does reduce the number of atomic operations during freeing
    of a frag queue. This does introduce some performance improvement, as
    these atomic operations are at the core of the performance problems
    seen on NUMA systems.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

18 Jan, 2013

1 commit

  • Increase the amount of memory usage limits for incomplete
    IP fragments.

    Arguing for new thresh high/low values:

    High threshold = 4 MBytes
    Low threshold = 3 MBytes

    The fragmentation memory accounting code tries to account for the
    real memory usage by measuring both the size of the frag queue struct
    (inet_frag_queue (ipv4:ipq/ipv6:frag_queue)) and the SKB's truesize.

    We want to be able to handle/hold on to enough fragments to ensure
    good performance, without causing incomplete fragments to hurt
    scalability by causing the number of inet_frag_queue to grow too much
    (resulting in longer searches for frag queues).

    For IPv4, how much memory does the largest frag consume?

    Maximum size fragment is 64K, which is approx 44 fragments with
    MTU(1500) sized packets. Sizeof(struct ipq) is 200. A 1500 byte
    packet results in a truesize of 2944 (not 2048 as I first assumed).

    (44*2944)+200 = 129736 bytes

    The current default high thresh of 262144 bytes is obviously
    problematic, as only two 64K fragments can fit in the queue at the
    same time.

    How many 64K fragments can we fit into 4 MBytes:

    4*2^20/((44*2944)+200) = 32.33 fragments in queues

    An attacker could send a separate/distinct fake fragment packet per
    queue, causing us to allocate one inet_frag_queue per packet, and thus
    attacking the hash table and its lists.

    How many frag queues do we need to store, and given a current hash size
    of 64, what is the average list length?

    Using one MTU sized fragment per inet_frag_queue, each consuming
    (2944+200) 3144 bytes.

    4*2^20/(2944+200) = 1334 frag queues -> 21 avg list length

    An attacker could send small fragments; the smallest packet I could send
    resulted in a truesize of 896 bytes (I'm a little surprised by this).

    4*2^20/(896+200) = 3827 frag queues -> 59 avg list length

    When increasing these numbers, we also need to follow up with
    improvements that are going to help scalability. Simply increasing
    the hash size is not enough, as the current implementation does not
    have per hash bucket locking.
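
    The arithmetic above can be reproduced with a short script. The
    constants are taken from the message (4 MByte high threshold, struct
    ipq of 200 bytes, hash size 64, truesize of 2944 or 896 bytes); the
    function names are made up for this sketch.

```python
HIGH_THRESH = 4 * 2**20   # proposed high threshold, in bytes
IPQ_SIZE = 200            # sizeof(struct ipq) quoted in the message
HASH_SIZE = 64            # current number of hash buckets


def queue_bytes(nfrags: int, truesize: int) -> int:
    """Memory charged to one frag queue holding nfrags skbs of a given truesize."""
    return nfrags * truesize + IPQ_SIZE


def queues_that_fit(nfrags: int, truesize: int) -> float:
    """How many such queues fit under the high threshold."""
    return HIGH_THRESH / queue_bytes(nfrags, truesize)


def avg_list_length(nfrags: int, truesize: int) -> float:
    """Average hash bucket list length if the cache is filled with such queues."""
    return queues_that_fit(nfrags, truesize) / HASH_SIZE


if __name__ == "__main__":
    print(queue_bytes(44, 2944))         # one full 64K datagram
    print(queues_that_fit(44, 2944))     # full 64K datagrams under 4 MBytes
    print(avg_list_length(1, 2944))      # one MTU-sized fragment per queue
    print(avg_list_length(1, 896))       # one minimum-size fragment per queue
```

    Run as a script, this gives 129736 bytes per full queue, a little over
    32 such queues, and average list lengths of about 20.8 and 59.8 for
    the two single-fragment cases.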

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

13 Dec, 2012

1 commit

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heuristics were chosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     

11 Dec, 2012

1 commit

  • ip_check_defrag() might be called from af_packet within the
    RX path where shared SKBs are used, so it must not modify
    the input SKB before it has unshared it for defragmentation.
    Use skb_copy_bits() to get the IP header and only pull in
    everything later.

    The same is true for the other caller in macvlan as it is
    called from dev->rx_handler which can also get a shared SKB.

    Reported-by: Eric Leblond
    Cc: stable@vger.kernel.org
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

19 Nov, 2012

1 commit

  • In preparation for supporting the creation of network namespaces
    by unprivileged users, modify all of the per net sysctl exports
    and refuse to allow them to unprivileged users.

    This makes it safe for unprivileged users in general to access
    per net sysctls, and allows sysctls to be exported to unprivileged
    users on an individual basis as they are deemed safe.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

20 Sep, 2012

1 commit


27 Aug, 2012

1 commit

  • IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and
    (in case of forwarded packets) refragments them at POST_ROUTING
    independent of the IP_DF flag. Refragmentation uses the dst_mtu() of
    the local route without caring about the original fragment sizes,
    thereby breaking PMTUD.

    This patch fixes this by keeping track of the largest received fragment
    with IP_DF set and generates an ICMP fragmentation required error during
    refragmentation if that size exceeds the MTU.
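
    A minimal sketch of that decision, with hypothetical names (Fragment,
    max_df_size, refragment_decision) and string results standing in for
    the real actions; it only models the size tracking described above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Fragment:
    length: int   # size of the received fragment, in bytes
    df: bool      # True if the fragment arrived with IP_DF set


def max_df_size(frags: Iterable[Fragment]) -> int:
    """Largest received fragment that had IP_DF set (0 if none did)."""
    return max((f.length for f in frags if f.df), default=0)


def refragment_decision(frags: Iterable[Fragment], mtu: int) -> str:
    """Decide what to do with a reassembled packet at refragmentation time."""
    if max_df_size(frags) > mtu:
        # The sender asked not to fragment below this size: report it.
        return "icmp_frag_needed"
    return "refragment"
```

    With fragments of 1500 bytes carrying IP_DF and an outgoing MTU of
    1400, the sketch reports that an ICMP error is needed; fragments
    without IP_DF never trigger it.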

    Signed-off-by: Patrick McHardy
    Acked-by: Eric Dumazet
    Acked-by: David S. Miller

    Patrick McHardy
     

27 Jul, 2012

1 commit

  • With the routing cache removal we lost the "noref" code paths on
    input, and this can kill some routing workloads.

    Reinstate the noref path when we hit a cached route in the FIB
    nexthops.

    With help from Eric Dumazet.

    Reported-by: Alexander Duyck
    Signed-off-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David S. Miller
     

21 Jul, 2012

1 commit


28 Jun, 2012

2 commits

  • This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.

    This change has several unwanted side effects:

    1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
    thus never create a real cached route.

    2) All TCP traffic will use DST_NOCACHE and never use the routing
    cache at all.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • DDoS synflood attacks hit the IP route cache badly.

    On typical machines, this cache is allowed to hold up to 8 million dst
    entries, 256 bytes for each, for a total of 2GB of memory.

    rt_garbage_collect() triggers and tries to clean things up.

    Eventually the route cache is disabled, but the machine is under fire
    and might OOM and crash.

    This patch exploits the new TCP early demux to set a nocache
    boolean in case the incoming TCP frame is for a not yet ESTABLISHED or
    TIMEWAIT socket.

    This 'nocache' boolean is then used in case the dst entry is not found
    in the route cache, to create an unhashed dst entry (DST_NOCACHE).

    SYN-cookie-ACKs are sent using a similar mechanism (ipv4: tcp: dont cache
    output dst for syncookies), so after this patch, a machine is able to
    absorb a DDoS synflood attack without polluting its IP route cache.

    Signed-off-by: Eric Dumazet
    Cc: Hans Schillstrom
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jun, 2012

1 commit


09 Jun, 2012

1 commit


20 May, 2012

1 commit

  • ip_frag_reasm() can use skb_try_coalesce() to build optimized skb,
    reducing memory used by them (truesize), and reducing number of cache
    line misses and overhead for the consumer.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 May, 2012

1 commit


16 May, 2012

1 commit


21 Apr, 2012

2 commits

  • This results in code with less boiler plate that is a bit easier
    to read.

    Additionally stops us from using compatibility code in the sysctl
    core, hastening the day when the compatibility code can be removed.

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • register_sysctl_rotable never caught on as an interesting way to
    register sysctls. My take on the situation is that what we want are
    sysctls that we can only see in the initial network namespace. What we
    have implemented with register_sysctl_rotable are sysctls that we can
    see in all of the network namespaces and can only change in the initial
    network namespace.

    That is a very silly way to go. Just register the network sysctls
    in the initial network namespace and we don't have any weird special
    cases to deal with.

    The sysctls affected are:
    /proc/sys/net/ipv4/ipfrag_secret_interval
    /proc/sys/net/ipv4/ipfrag_max_dist
    /proc/sys/net/ipv6/ip6frag_secret_interval
    /proc/sys/net/ipv6/mld_max_msf

    I really don't expect anyone will miss them if they can't read them in a
    child user namespace.

    CC: Pavel Emelyanov
    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

20 Apr, 2012

1 commit


13 Mar, 2012

1 commit

  • Add #define pr_fmt(fmt) as appropriate.

    Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
    Standardize on "UDPLite: " for appropriate uses.
    Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

    Add KBUILD_MODNAME ": " to icmp and gre.
    Remove embedded prefixes as appropriate.

    Add missing "\n" to pr_info in gre.c.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

12 Mar, 2012

1 commit

  • Use a more current kernel messaging style.

    Convert a printk block to print_hex_dump.
    Coalesce formats, align arguments.
    Use %s, __func__ instead of embedding function names.

    Some messages that were prefixed with _close are
    now prefixed with _fini. Some ah4 and esp messages
    are now not prefixed with "ip ".

    The intent of this patch is to later add something like
    #define pr_fmt(fmt) "IPv4: " fmt.
    to standardize the output messages.

    Text size is trivially reduced. (x86-32 allyesconfig)

    $ size net/ipv4/built-in.o*
       text    data     bss     dec     hex filename
     887888   31558  249696 1169142  11d6f6 net/ipv4/built-in.o.new
     887934   31558  249800 1169292  11d78c net/ipv4/built-in.o.old

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

02 Dec, 2011

1 commit


19 Oct, 2011

2 commits

  • To ease skb->truesize sanitization, it's better to be able to localize
    all references to skb frags size.

    Define accessors: skb_frag_size() to fetch frag size, and
    skb_frag_size_{set|add|sub}() to manipulate it.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Fragmented multicast frames are delivered to a single macvlan port,
    because ip defrag logic considers the other samples redundant.

    Implement a defrag step before trying to send the multicast frame.

    Reported-by: Ben Greear
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Jul, 2011

1 commit


18 May, 2011

1 commit


17 May, 2011

1 commit

  • Commit 6623e3b24a5e (ipv4: IP defragmentation must be ECN aware) was an
    attempt to not lose "Congestion Experienced" (CE) indications when
    performing datagram defragmentation.

    Stefanos Harhalakis raised the point that RFC 3168 requirements were not
    completely met by this commit.

    In particular, we MUST detect invalid combinations and eventually drop
    illegal frames.

    Reported-by: Stefanos Harhalakis
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 May, 2011

1 commit

  • Commit 4a94445c9a5c (net: Use ip_route_input_noref() in input path)
    added a bug in IP defragmentation handling, in case timeout is fired.

    When a frame is defragmented, we use last skb dst field when building
    final skb. Its dst is valid, since we are in rcu read section.

    But if a timeout occurs, we take first queued fragment to build one ICMP
    TIME EXCEEDED message. Problem is all queued skb have weak dst pointers,
    since we escaped RCU critical section after their queueing. icmp_send()
    might dereference a now freed (and possibly reused) part of memory.

    Calling skb_dst_drop() and ip_route_input_noref() to revalidate route is
    the only possible choice.

    Reported-by: Denys Fedoryshchenko
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Jan, 2011

1 commit

  • RFC3168 (The Addition of Explicit Congestion Notification to IP)
    states :

    5.3. Fragmentation

    ECN-capable packets MAY have the DF (Don't Fragment) bit set.
    Reassembly of a fragmented packet MUST NOT lose indications of
    congestion. In other words, if any fragment of an IP packet to be
    reassembled has the CE codepoint set, then one of two actions MUST be
    taken:

    * Set the CE codepoint on the reassembled packet. However, this
    MUST NOT occur if any of the other fragments contributing to
    this reassembly carries the Not-ECT codepoint.

    * The packet is dropped, instead of being reassembled, for any
    other reason.

    This patch implements this requirement for IPv4, choosing the first
    action:

    If one fragment had NO-ECT codepoint
        reassembled frame has NO-ECT
    ElIf one fragment had CE codepoint
        reassembled frame has CE
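
    The rule can be written out as a small function. The codepoint values
    follow RFC 3168; the function name and the fallback to the first
    fragment's codepoint when neither branch applies are assumptions of
    this sketch, since the message does not spell out that case.

```python
from typing import Sequence

# ECN codepoints (the two low-order bits of the TOS byte), per RFC 3168.
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def reassembled_ecn(codepoints: Sequence[int]) -> int:
    """Pick the ECN codepoint of a reassembled datagram from its fragments."""
    if not codepoints:
        raise ValueError("a datagram needs at least one fragment")
    if NOT_ECT in codepoints:
        # Any Not-ECT fragment forbids setting CE on the result.
        return NOT_ECT
    if CE in codepoints:
        # Do not lose a congestion indication.
        return CE
    # Assumption: otherwise keep the first fragment's codepoint.
    return codepoints[0]
```

    For example, fragments marked ECT(0), CE, ECT(0) reassemble to CE,
    while ECT(0), CE, Not-ECT reassemble to Not-ECT.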

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Dec, 2010

1 commit


24 Sep, 2010

1 commit


23 Aug, 2010

1 commit

  • SKBs can be "fragmented" in two ways, via a page array (called
    skb_shinfo(skb)->frags[]) and via a list of SKBs (called
    skb_shinfo(skb)->frag_list).

    Since skb_has_frags() tests the latter, its name is confusing
    since it sounds more like it's testing the former.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Jul, 2010

1 commit


01 Jul, 2010

1 commit

  • add fast path for in-order fragments

    As the fragments are sent in order in most OSes, such as Windows, Darwin and
    FreeBSD, it is likely that new fragments belong at the end of the inet_frag_queue.
    In the fast path, we check if the skb at the end of the inet_frag_queue is the
    prev we expect.
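
    A sketch of the fast path, assuming the queue is kept as a list of
    (offset, length) pairs sorted by offset. The function name find_prev
    and the list representation are illustrative only; the kernel keeps a
    linked list of skbs with a tail pointer.

```python
from typing import List, Tuple


def find_prev(queue: List[Tuple[int, int]], offset: int) -> int:
    """Return the index of the fragment that should precede offset, or -1."""
    # Fast path: an in-order fragment goes right after the current tail.
    if queue and queue[-1][0] < offset:
        return len(queue) - 1
    # Slow path: walk the list from the head to find the insertion point.
    prev = -1
    for i, (frag_offset, _length) in enumerate(queue):
        if frag_offset >= offset:
            break
        prev = i
    return prev
```

    With fragments arriving in order, every lookup takes the first branch
    and costs one comparison instead of a walk over the whole queue.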

    Signed-off-by: Changli Gao
    ----
    include/net/inet_frag.h | 1 +
    net/ipv4/ip_fragment.c | 12 ++++++++++++
    net/ipv6/reassembly.c | 11 +++++++++++
    3 files changed, 24 insertions(+)
    Signed-off-by: David S. Miller

    Changli Gao