20 Sep, 2013

3 commits

  • Pull networking fixes from David Miller:

    1) If the local_df boolean is set on an SKB we have to allocate a
    unique ID even if IP_DF is set in the ipv4 headers, from Ansis
    Atteka.

    2) Some fixups for the new chipset support that went into the sfc
    driver, from Ben Hutchings.

    3) Because SCTP bypasses a good chunk of, and actually duplicates, the
    logic of the ipv6 output path, some IPSEC things don't get done
    properly. Integrate SCTP better into the ipv6 output path so that
    these problems are fixed and such issues don't get missed in the
    future either. From Daniel Borkmann.

    4) Fix skge regressions added by the DMA mapping error return checking
    added in v3.10, from Mikulas Patocka.

    5) Kill some more IRQF_DISABLED references, from Michael Opdenacker.

    6) Fix races and deadlocks in the bridging code, from Hong Zhiguo.

    7) Fix error handling in tun_set_iff(), in particular don't leak
    resources. From Jason Wang.

    8) Prevent format-string injection into xen-netback driver, from Kees
    Cook.

    9) Fix regression added to netpoll ARP packet handling, in particular
    check for the right ETH_P_ARP protocol code. From Sonic Zhang.

    10) Try to deal with AMD IOMMU errors when using r8169 chips, from
    Francois Romieu.

    11) Cure freezes due to recent changes in the rt2x00 wireless driver,
    from Stanislaw Gruszka.

    12) Don't do SPI transfers (which can sleep) in interrupt context in
    cw1200 driver, from Solomon Peachy.

    13) Fix LEDs handling bug in 5720 tg3 chips already handled for 5719.
    From Nithin Sujir.

    14) Make xen_netbk_count_skb_slots() count the actual number of slots
    that will be used, taking into consideration packing and other
    issues that the transmit path will run into. From David Vrabel.

    15) Use the correct maximum age when calculating the bridge
    message_age_timer, from Chris Healy.

    16) Get rid of memory leaks in mcs7780 IRDA driver, from Alexey
    Khoroshilov.

    17) Netfilter conntrack extensions were converted to RCU but are not
    always freed properly using kfree_rcu(). Fix from Michal Kubecek.

    18) VF reset recovery not being done correctly in qlcnic driver, from
    Manish Chopra.

    19) Fix inverted test in ATM nicstar driver, from Andy Shevchenko.

    20) Missing workqueue destroy in cxgb4 error handling, from Wei Yang.

    21) Internal switch not initialized properly in bgmac driver, from Rafał
    Miłecki.

    22) Netlink messages report wrong local and remote addresses in IPv6
    tunneling, from Ding Zhi.

    23) ICMP redirects should not generate socket errors in DCCP and SCTP.
    We're still working out how this should be handled for RAW and UDP
    sockets. From Daniel Borkmann and Duan Jiong.

    24) We've had several bugs wherein the network namespace's loopback
    device gets accessed after it is freed. NULL it out so that we can
    catch these problems more readily. From Eric W Biederman.

    25) Fix regression in TCP RTO calculations, from Neal Cardwell.

    26) Fix too early free of xen-netback network device when VIFs still
    exist. From Paul Durrant.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
    netconsole: fix a deadlock with rtnl and netconsole's mutex
    netpoll: fix NULL pointer dereference in netpoll_cleanup
    skge: fix broken driver
    ip: generate unique IP identificator if local fragmentation is allowed
    ip: use ip_hdr() in __ip_make_skb() to retrieve IP header
    xen-netback: Don't destroy the netdev until the vif is shut down
    net:dccp: do not report ICMP redirects to user space
    cnic: Fix crash in cnic_bnx2x_service_kcq()
    bnx2x, cnic, bnx2i, bnx2fc: Fix bnx2i and bnx2fc regressions.
    vxlan: Avoid creating fdb entry with NULL destination
    tcp: fix RTO calculated from cached RTT
    drivers: net: phy: cicada.c: clears warning Use #include instead of
    net loopback: Set loopback_dev to NULL when freed
    batman-adv: set the TAG flag for the vid passed to BLA
    netfilter: nfnetlink_queue: use network skb for sequence adjustment
    net: sctp: rfc4443: do not report ICMP redirects to user space
    net: usb: cdc_ether: use usb.h macros whenever possible
    net: usb: cdc_ether: fix checkpatch errors and warnings
    net: usb: cdc_ether: Use wwan interface for Telit modules
    ip6_tunnels: raddr and laddr are inverted in nl msg
    ...

    Linus Torvalds
     
  • If local fragmentation is allowed, then ip_select_ident() and
    ip_select_ident_more() need to generate unique IDs to ensure
    correct defragmentation on the peer.

    For example, if IPsec (tunnel mode) has to encrypt large skbs
    that have the local_df bit set, then all IP fragments that belong
    to different ESP datagrams would use the same identifier.
    If one of these IP fragments got lost or reordered, the peer
    could stitch together IP fragments that do not belong to the
    same datagram. This would lead to packet loss or data corruption.

    Signed-off-by: Ansis Atteka
    Signed-off-by: David S. Miller

    Ansis Atteka
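
The rule described in this commit can be sketched as a small predicate. This is a user-space illustration only; the function name and parameters are hypothetical and are not the kernel's ip_select_ident() interface.

```c
#include <stdbool.h>

/* Hypothetical helper, not kernel API: decide whether an outgoing IPv4
 * packet needs a unique identification value.
 *
 * df_bit_set: IP_DF is set in the IPv4 header.
 * local_df:   the stack may still fragment this packet locally. */
bool needs_unique_ip_id(bool df_bit_set, bool local_df)
{
    /* A packet can end up fragmented if DF is clear, or if local
     * fragmentation is allowed despite DF. In both cases the peer
     * relies on the ID to reassemble the right fragments. */
    return !df_bit_set || local_df;
}
```

With DF set and local_df clear the packet is never fragmented, so a unique ID is not required; every other combination asks for one.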
     
  • skb->data already points to IP header, but for the sake of
    consistency we can also use ip_hdr() to retrieve it.

    Signed-off-by: Ansis Atteka
    Signed-off-by: David S. Miller

    Ansis Atteka
     

18 Sep, 2013

1 commit

  • Commit 1b7fdd2ab5852 ("tcp: do not use cached RTT for RTT estimation")
    did not correctly account for the fact that crtt is the RTT shifted
    left 3 bits. Fix the calculation to consistently reflect this fact.

    Signed-off-by: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-By: Yuchung Cheng
    Signed-off-by: David S. Miller

    Neal Cardwell
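
A minimal sketch of the fixed-point convention mentioned above, assuming only what the commit message states: the cached RTT is stored shifted left by 3 bits. The helper names are illustrative and do not exist in the kernel.

```c
#include <stdint.h>

/* Assumption taken from the commit message: the cached RTT (crtt) is
 * the RTT shifted left by 3 bits, i.e. scaled by 8. */
#define RTT_SCALE_SHIFT 3

/* Store a plain RTT sample in the scaled representation. */
uint32_t cached_rtt_scaled(uint32_t rtt)
{
    return rtt << RTT_SCALE_SHIFT;
}

/* Convert a scaled cached RTT back to plain time units. Using the
 * scaled value directly would overstate the RTT by a factor of 8. */
uint32_t cached_rtt_unscaled(uint32_t crtt_scaled)
{
    return crtt_scaled >> RTT_SCALE_SHIFT;
}
```

Any formula that mixes crtt with unscaled quantities has to apply one of these conversions first, which is the kind of inconsistency the fix addresses.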
     

13 Sep, 2013

1 commit


07 Sep, 2013

2 commits

  • TCP receive window handling is multi staged.

    A socket has a memory budget, static or dynamic, in sk_rcvbuf.

    Because we do not really know how this memory budget translates to
    a TCP window (payload), TCP announces a small initial window
    (about 20 MSS).

    When a packet is received, we increase TCP rcv_win depending
    on the payload/truesize ratio of this packet. Good citizen
    packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2

    This heuristic takes place in tcp_grow_window()

    The problem is that we currently call tcp_grow_window() only for
    in-order packets.

    This means that reordering or packet loss stops rcv_win from growing
    properly, and senders are unable to benefit from fast recovery
    or proper reordering level detection.

    Really, a packet being stored in the OFO queue is not a bad citizen.
    It should be part of the game just like in-order packets.

    In our traces, we very often see that the sender is limited by small
    Linux receive windows, even if the Linux hosts use autotuning (DRS)
    and should allow rcv_win to grow to ~3MB.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
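
The idea can be illustrated with a simplified user-space model. The struct, the "payload is at least half of truesize" threshold and the growth step of two MSS are assumptions made for this sketch; the real tcp_grow_window() heuristic is more involved.

```c
#include <stdint.h>

/* Hypothetical receive-side state; field names are illustrative. */
struct rx_state {
    uint32_t rcvbuf;  /* memory budget in bytes (sk_rcvbuf in the text) */
    uint32_t rcv_win; /* currently announced window in bytes */
    uint32_t mss;     /* maximum segment size in bytes */
};

/* Grow the window when a packet uses its memory efficiently.
 * Assumed rule for this sketch: payload is at least half of truesize. */
void grow_window(struct rx_state *st, uint32_t payload, uint32_t truesize)
{
    uint32_t limit = st->rcvbuf / 2; /* target named in the commit text */

    if (2 * payload < truesize)
        return; /* poor payload/truesize ratio: no hint to grow */

    st->rcv_win += 2 * st->mss;
    if (st->rcv_win > limit)
        st->rcv_win = limit;
}

/* Called for every queued packet, in order or out of order. */
void on_packet_queued(struct rx_state *st, uint32_t payload,
                      uint32_t truesize, int in_order)
{
    (void)in_order; /* ordering no longer gates window growth */
    grow_window(st, payload, truesize);
}
```

Before the change, the equivalent of on_packet_queued() would have returned early for out-of-order packets, so a lossy or reordered flow could stay stuck near the small initial window.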
     
    Commit 0f7cc9a3 "tcp: increase throughput when reordering is high"
    only allows cwnd to increase in Open state. This mistakenly disables
    slow start after timeout (CA_Loss). Moreover cwnd won't grow if the
    state moves from Disorder to Open later in tcp_fastretrans_alert().

    Therefore the correct logic should be to allow cwnd to grow as long
    as the data is received in order in Open, Loss, or even Disorder state.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

06 Sep, 2013

3 commits

  • Pull networking changes from David Miller:
    "Noteworthy changes this time around:

    1) Multicast rejoin support for team driver, from Jiri Pirko.

    2) Centralize and simplify TCP RTT measurement handling in order to
    reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
    both timestamps and local RTT measurements are available prefer
    the latter because there are broken middleware devices which
    scramble the timestamp.

    From Yuchung Cheng.

    3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
    memory consumed to queue up unsent user data. From Eric Dumazet.

    4) Add a "physical port ID" abstraction for network devices, from
    Jiri Pirko.

    5) Add a "suppress" operation to influence fib_rules lookups, from
    Stefan Tomanek.

    6) Add a networking development FAQ, from Paul Gortmaker.

    7) Extend the information provided by tcp_probe and add ipv6 support,
    from Daniel Borkmann.

    8) Use RCU locking more extensively in openvswitch data paths, from
    Pravin B Shelar.

    9) Add SCTP support to openvswitch, from Joe Stringer.

    10) Add EF10 chip support to SFC driver, from Ben Hutchings.

    11) Add new SYNPROXY netfilter target, from Patrick McHardy.

    12) Compute a rate approximation for sending in TCP sockets, and use
    this to more intelligently coalesce TSO frames. Furthermore, add
    a new packet scheduler which takes advantage of this estimate when
    available. From Eric Dumazet.

    13) Allow AF_PACKET fanouts with random selection, from Daniel
    Borkmann.

    14) Add ipv6 support to vxlan driver, from Cong Wang"

    Resolved conflicts as per discussion.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
    openvswitch: Fix alignment of struct sw_flow_key.
    netfilter: Fix build errors with xt_socket.c
    tcp: Add missing braces to do_tcp_setsockopt
    caif: Add missing braces to multiline if in cfctrl_linkup_request
    bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
    vxlan: Fix kernel panic on device delete.
    net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
    net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
    icplus: Use netif_running to determine device state
    ethernet/arc/arc_emac: Fix huge delays in large file copies
    tuntap: orphan frags before trying to set tx timestamp
    tuntap: purge socket error queue on detach
    qlcnic: use standard NAPI weights
    ipv6:introduce function to find route for redirect
    bnx2x: VF RSS support - VF side
    bnx2x: VF RSS support - PF side
    vxlan: Notify drivers for listening UDP port changes
    net: usbnet: update addr_assign_type if appropriate
    driver/net: enic: update enic maintainers and driver
    driver/net: enic: Exposing symbols for Cisco's low latency driver
    ...

    Linus Torvalds
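
Item 3 of this pull request adds the TCP_NOTSENT_LOWAT socket option. A minimal user-space sketch of setting it follows; the wrapper name set_notsent_lowat() is hypothetical, and the fallback definition is only for C libraries whose headers do not yet carry the constant.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25 /* value from the Linux uapi header */
#endif

/* Hypothetical wrapper: limit the amount of unsent data the kernel
 * queues for this TCP socket to roughly `bytes`.
 * Returns 0 on success, -1 on failure with errno set. */
int set_notsent_lowat(int fd, int bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof(bytes));
}
```

An application would typically call this once after creating the socket, for example set_notsent_lowat(fd, 16384), and then rely on poll() or epoll to learn when more data can be written.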
     
  • Conflicts:
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    net/bridge/br_multicast.c
    net/ipv6/sit.c

    The conflicts were minor:

    1) sit.c changes overlap with change to ip_tunnel_xmit() signature.

    2) br_multicast.c had an overlap between computing max_delay using
    msecs_to_jiffies and turning MLDV2_MRC() into an inline function
    with a name using lowercase instead of uppercase letters.

    3) stmmac had two overlapping changes, one which conditionally allocated
    and hooked up a dma_cfg based upon the presence of the pbl OF property,
    and another one handling store-and-forward DMA mode. The latter
    should not go into the new of_find_property() basic block.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: Dave Jones
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Dave Jones
     

05 Sep, 2013

3 commits

  • Pull PTR_RET() removal patches from Rusty Russell:
    "PTR_RET() is a weird name, and led to some confusing usage. We ended
    up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.

    This has been sitting in linux-next for a whole cycle"

    [ There are still some PTR_RET users scattered about, with some of them
    possibly being new, but most of them existing in Rusty's tree too. We
    have that

    #define PTR_RET(p) PTR_ERR_OR_ZERO(p)

    thing in , so they continue to work for now - Linus ]

    * tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
    Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
    drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
    sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
    dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
    drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
    mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
    staging/zcache: don't use PTR_RET().
    remoteproc: don't use PTR_RET().
    pinctrl: don't use PTR_RET().
    acpi: Replace weird use of PTR_RET.
    s390: Replace weird use of PTR_RET.
    PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
    PTR_RET is now PTR_ERR_OR_ZERO

    Linus Torvalds
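
For readers unfamiliar with the idiom, here is a user-space re-creation of the error-pointer helpers, written in lower case to make clear that these are illustrative functions and not the kernel's own macros.

```c
#include <errno.h>
#include <stdint.h>

/* Largest errno value that can be encoded in a pointer; this mirrors
 * the kernel convention, re-created here for user space. */
#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True when the pointer falls in the top MAX_ERRNO addresses. */
int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Decode the errno value stored in an error pointer. */
long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Illustrative counterpart of PTR_ERR_OR_ZERO(): the error code if
 * ptr encodes one, otherwise 0. */
int ptr_err_or_zero(const void *ptr)
{
    return is_err(ptr) ? (int)ptr_err(ptr) : 0;
}
```

The new name says what the helper returns, which is why a function can end with "return ptr_err_or_zero(p);" instead of an explicit if/else on is_err().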
     
    Commit 1b7fdd2ab585 ("tcp: do not use cached RTT for RTT estimation")
    removed important comments on how RTO is initialized and updated.
    Hopefully this patch puts that information back.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Pablo Neira Ayuso says:

    ====================
    The following batch contains:

    * Three fixes for the new synproxy target available in your
    net-next tree, from Jesper D. Brouer and Patrick McHardy.

    * One fix for TCPMSS to correctly handle the fragmentation
    case, from Phil Oester. I'll pass this one to -stable.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Sep, 2013

10 commits

    Packets reaching SYNPROXY were dropped by default, as they were most
    likely invalid (given the recommended state matching). This patch
    changes the SYNPROXY target to let packets that it does not consume
    continue being processed by the stack.

    This is more in line with other target modules, and it allows more
    flexible configurations for handling, logging or matching on
    packets in INVALID states.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
    It seems Patrick missed incorporating some of my requested changes
    during review v2 of the SYNPROXY netfilter module.

    These were to prevent SYN+ACK packets from entering the path meant
    for the ACK packet from the client (from the 3WHS).

    Further, there was a bug in ip6t_SYNPROXY.c: the matching of SYN
    packets didn't exclude the ACK flag.

    Go a step further with SYN packet/flag matching by excluding flags
    ACK+FIN+RST, in both IPv4 and IPv6 modules.

    The intended usage of SYNPROXY is as follows:
    (gracefully describing usage in commit)

    iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 --syn -j NOTRACK
    iptables -A INPUT -i eth0 -p tcp --dport 80 -m state --state UNTRACKED,INVALID \
    -j SYNPROXY --sack-perm --timestamp --mss 1480 --wscale 7 --ecn

    echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

    This does filter SYN flags early, for packets in the UNTRACKED state,
    but packets in the INVALID state with other TCP flags could still
    reach the module, thus this stricter flag matching is still needed.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
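
The stricter flag matching can be sketched as a single mask comparison. The constant and function names below are illustrative; only the flag bit values come from the TCP header layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (names are local to this sketch). */
#define FLAG_FIN 0x01
#define FLAG_SYN 0x02
#define FLAG_RST 0x04
#define FLAG_ACK 0x10

/* True only for a connection-opening SYN: SYN must be set and
 * ACK, FIN and RST must all be clear. Other bits are ignored. */
bool is_initial_syn(uint8_t flags)
{
    const uint8_t mask = FLAG_SYN | FLAG_ACK | FLAG_FIN | FLAG_RST;

    return (flags & mask) == FLAG_SYN;
}
```

A SYN+ACK (0x12) or SYN+FIN (0x03) is rejected, while a SYN that also carries the ECN bits (for example 0xC2) still matches.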
     
    tcp_rcv_established() only ever returns one value, namely 0. We change the
    return type to void (as suggested by David Miller).

    After commit 0c24604b (tcp: implement RFC 5961 4.2), we no longer send RSTs in
    response to SYNs. We can remove the check and processing on the return value of
    tcp_rcv_established().

    We also fix jtcp_rcv_established() in tcp_probe.c to match that of
    tcp_rcv_established().

    Signed-off-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Vijay Subramanian
     
    With recent changes in the tcp_probe module (e.g. f925d0a62d ("net: tcp_probe:
    add IPv6 support")) we also need to take into account that tbuf needs to
    be enlarged, as the format string will be expanded further. tbuf sits on the
    stack in the tcpprobe_read() function, which is invoked when user space reads
    the procfs file /proc/net/tcpprobe, hence it is not in the fast path as
    jtcp_rcv_established() is. A size of 256 bytes, similar to that in the
    sctp_probe module, is fully sufficient for that: we need a theoretical
    maximum of 252 bytes, otherwise the output could get truncated.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
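
The truncation risk can be shown with plain snprintf(). The line layout below is made up for the example and is not the tcp_probe output format; format_line() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdio.h>

/* Format one log line into buf and report whether it fit.
 * The field layout is illustrative, not the tcp_probe format. */
bool format_line(char *buf, size_t size, const char *src,
                 const char *dst, unsigned int len)
{
    /* snprintf returns the length the complete string would have
     * had, so a result >= size means the output was cut short. */
    int needed = snprintf(buf, size, "%s %s %u\n", src, dst, len);

    return needed >= 0 && (size_t)needed < size;
}
```

Longer IPv6 address strings push the required length up, which is why a buffer sized for the IPv4-only format may no longer be enough.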
     
  • The goal of this patch is to harmonize cleanup done on a skbuff on rx path.
    Before this patch, behaviors were different depending on the tunnel type.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • The goal of this patch is to harmonize cleanup done on a skbuff on xmit path.
    Before this patch, behaviors were different depending on the tunnel type.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • This function was only used when a packet was sent to another netns. Now, it can
    also be used after tunnel encapsulation or decapsulation.

    The one exception is skb_orphan(), which should not be done when a packet
    is not crossing netns.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • This argument is not used, let's remove it.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
    This config option is superfluous in that it only guards a call
    to neigh_app_ns(). Enabling CONFIG_ARPD by default causes no
    change in behavior. There will now be a call to __neigh_notify()
    for each ARP resolution, which has no impact unless there is a
    user space daemon waiting to receive the notification, i.e.,
    the case for which CONFIG_ARPD was designed anyway.

    Suggested-by: Eric W. Biederman
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Cc: Joe Perches
    Cc: Veaceslav Falico
    Signed-off-by: Tim Gardner
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Tim Gardner
     
  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which makes up the bulk of the patches, and the css destruction path is
    completely decoupled from the cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains preparatory patches for such change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

03 Sep, 2013

1 commit

  • Fengguang reported:

    net/built-in.o: In function `in6_dev_finish_destroy':
    (.text+0x4ca7d): undefined reference to `snmp_mib_free'

    This is because snmp_mib_free() is only defined when CONFIG_INET is
    enabled, but in6_dev_finish_destroy() has now moved to the core kernel.

    I think snmp_mib_free() is small enough to be inlined, so just make it
    static inline.

    Reported-by: kbuild test robot
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

01 Sep, 2013

1 commit


31 Aug, 2013

3 commits

  • Since commit 3d7b46cd20e3 (ip_tunnel: push generic protocol handling to
    ip_tunnel module.), an Oops is triggered when an xfrm policy is configured on
    an IPv4 over IPv4 tunnel.

    xfrm4_policy_check() calls __xfrm_policy_check2(), which uses skb_dst(skb). But
    this field is NULL because iptunnel_pull_header() calls skb_dst_drop(skb).

    Signed-off-by: Li Hongjun
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Li Hongjun
     
  • In commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb()), Eric changed
    the call to sock_wmalloc in tcp_make_synack to alloc_skb. In doing so,
    the netfilter owner match lost its ability to block the SYNACK packet on
    outbound listening sockets. Revert the change, restoring the owner match
    functionality.

    This closes netfilter bugzilla #847.

    Signed-off-by: Phil Oester
    Signed-off-by: David S. Miller

    Phil Oester
     
    RTT cached in the TCP metrics is valuable for the initial timeout
    because SYN RTT usually does not account for serialization delays
    on low BW paths.

    However, using it to seed the RTT estimator may be disruptive because
    other components (e.g., pacing) require the smoothed RTT to be obtained
    from the actual connection.

    The solution is to use the higher cached RTT to set the first RTO
    conservatively like tcp_rtt_estimator(), but avoid seeding the other
    RTT estimator variables such as srtt. It is also a good idea to
    keep RTO conservative to obtain the first RTT sample, and the
    performance is ensured by TCP loss probe if SYN RTT is available.

    To keep the seeding formula consistent across SYN RTT and cached RTT,
    the rttvar is twice the cached RTT instead of cached RTTVAR value. The
    reason is because cached variation may be too small (near min RTO)
    which defeats the purpose of being conservative on first RTO. However
    the metrics still keep the RTT variations as they might be useful for
    user applications (through ip).

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
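
A rough sketch of the seeding described above, under two assumptions that go beyond the commit text: the cached RTT has already been converted to plain time units, and the variance term is floored at the minimum RTO. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: first RTO derived from a cached RTT.
 * Assumptions for this sketch: cached_rtt is unscaled, and the
 * variance term is never allowed below rto_min. */
uint32_t first_rto_from_cached_rtt(uint32_t cached_rtt, uint32_t rto_min)
{
    /* The commit text seeds rttvar as twice the cached RTT rather
     * than using the cached RTTVAR, which may be too small. */
    uint32_t rttvar = 2 * cached_rtt;

    if (rttvar < rto_min)
        rttvar = rto_min;

    return cached_rtt + rttvar;
}
```

Only the RTO is derived this way; srtt and the other estimator variables are left to be filled in by samples from the live connection.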
     

30 Aug, 2013

5 commits

  • Steffen Klassert says:

    ====================
    This pull request fixes some issues that arise when 6in4 or 4in6 tunnels
    are used in combination with IPsec, all from Hannes Frederic Sowa and a
    null pointer dereference when queueing packets to the policy hold queue.

    1) We might access the local error handler of the wrong address family if
    a 6in4 or 4in6 tunnel is protected by ipsec. Fix this by adding a pointer
    to the correct local_error to xfrm_state_afinet.

    2) Add a helper function to always refer to the correct interpretation
    of skb->sk.

    3) Call skb_reset_inner_headers to record the position of the inner headers
    when adding a new one in various ipv6 tunnels. This is needed to identify
    the addresses where to send back errors in the xfrm layer.

    4) Dereference inner ipv6 header if encapsulated to always call the
    right error handler.

    5) Choose protocol family by skb protocol to not call the wrong
    xfrm{4,6}_local_error handler in case an ipv6 socket is used
    in ipv4 mode.

    6) Partly revert "xfrm: introduce helper for safe determination of mtu"
    because this introduced pmtu discovery problems.

    7) Set skb->protocol on tcp, raw and ip6_append_data generated skbs.
    We need this to get the correct mtu information in xfrm.

    8) Fix null pointer dereference in xdst_queue_output.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • ipv4: raw_sendmsg: don't use header's destination address

    A sendto() regression was bisected and found to start with commit
    f8126f1d5136be1 (ipv4: Adjust semantics of rt->rt_gateway.)

    The problem is that it tries to ARP-lookup the constructed packet's
    destination address rather than the explicitly provided address.

    Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used.

    cf. commit 2ad5b9e4bd314fc685086b99e90e5de3bc59e26b

    Reported-by: Chris Clark
    Bisected-by: Chris Clark
    Tested-by: Chris Clark
    Suggested-by: Julian Anastasov
    Signed-off-by: Chris Clark
    Signed-off-by: David S. Miller

    Chris Clark
     
    After hearing many people over past years complaining about TSO being
    bursty or even buggy, we are proud to present automatic sizing of TSO
    packets.

    One part of the problem is that tcp_tso_should_defer() uses a heuristic
    relying on upcoming ACKs instead of a timer, but more generally, having
    big TSO packets makes little sense for low rates, as it tends to create
    micro bursts on the network, and general consensus is to reduce the
    buffering amount.

    This patch introduces a per socket sk_pacing_rate, that approximates
    the current sending rate, and allows us to size the TSO packets so
    that we try to send one packet every ms.

    This field could be set by other transports.

    Patch has no impact for high speed flows, where having large TSO packets
    makes sense to reach line rate.

    For other flows, this helps better packet scheduling and ACK clocking.

    This patch increases performance of TCP flows in lossy environments.

    A new sysctl (tcp_min_tso_segs) is added, to specify the
    minimal size of a TSO packet (default being 2).

    A follow-up patch will provide a new packet scheduler (FQ), using
    sk_pacing_rate as an input to perform optional per flow pacing.

    This explains why we chose to set sk_pacing_rate to twice the current
    rate, allowing 'slow start' ramp up.

    sk_pacing_rate = 2 * cwnd * mss / srtt

    v2: Neal Cardwell reported a suspect deferring of the last two segments on
    an initial write of 10 MSS, so I had to change tcp_tso_should_defer() to
    take into account tp->xmit_size_goal_segs

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Tom Herbert
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
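
The sizing rule can be worked through in user space. The formula and the default minimum of 2 segments come from the commit message; the units (bytes per second, RTT in microseconds) and the function names are choices made for this sketch, and the cap at the device's maximum TSO size is left out.

```c
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, expressed in bytes per
 * second. srtt_us is the smoothed RTT in microseconds (a unit picked
 * for this sketch) and must be non-zero. */
uint64_t pacing_rate(uint32_t cwnd, uint32_t mss, uint32_t srtt_us)
{
    return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/* Segments per TSO packet so that about one packet leaves per
 * millisecond, never fewer than min_tso_segs. mss must be non-zero. */
uint32_t tso_segs(uint64_t rate, uint32_t mss, uint32_t min_tso_segs)
{
    uint64_t bytes_per_ms = rate / 1000;
    uint64_t segs = bytes_per_ms / mss;

    return segs < min_tso_segs ? min_tso_segs : (uint32_t)segs;
}
```

For cwnd = 10, mss = 1460 and a 100 ms RTT the rate is 292000 bytes per second, which is less than one segment per millisecond, so the minimum of 2 segments applies. A fast flow with cwnd = 1000 and a 1 ms RTT would get 2000 segments from this arithmetic alone, which is where the omitted device limit would take over.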
     
  • The zero value means that tsecr is not valid, so it's a special case.

    tsoffset is used to customize tcp_time_stamp for one socket.
    tsoffset is usually zero, it's used when a socket was moved from one
    host to another host.

    Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to
    incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
    TCP_RTO_MAX.

    Cc: Pavel Emelyanov
    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Reported-by: Cyrill Gorcunov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrew Vagin
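
A minimal sketch of the special case, with a hypothetical helper name: the per-socket offset is only removed from the echoed timestamp when that timestamp is valid.

```c
#include <stdint.h>

/* Hypothetical helper: translate the echoed timestamp back by the
 * per-socket offset. A zero rcv_tsecr means "not valid" and must be
 * passed through untouched, otherwise it would turn into a large
 * bogus value after the subtraction. */
uint32_t adjust_rcv_tsecr(uint32_t rcv_tsecr, uint32_t tsoffset)
{
    if (rcv_tsecr == 0)
        return 0;

    return rcv_tsecr - tsoffset;
}
```

With tsoffset = 500, an echoed value of 1500 becomes 1000, while 0 stays 0 instead of wrapping around to a value close to the top of the 32-bit range.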
     
  • u32 rcv_tstamp; /* timestamp of last received ACK */

    Its value is used in tcp_retransmit_timer, which closes the socket
    if the last ACK was received more than TCP_RTO_MAX ago.

    Currently rcv_tstamp is initialized to zero, and if tcp_retransmit_timer
    is called before receiving a first ACK, the connection is closed.

    This patch initializes rcv_tstamp to the timestamp at which the socket
    was restored.

    Cc: Pavel Emelyanov
    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Reported-by: Cyrill Gorcunov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrew Vagin
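
Why a zero rcv_tstamp is fatal can be seen from the age check alone. The constants below are assumptions for this sketch (a 1000 Hz tick and a 120 second limit), and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for this sketch: 1000 ticks per second and a
 * maximum RTO of 120 seconds. */
#define SKETCH_HZ      1000u
#define SKETCH_RTO_MAX (120u * SKETCH_HZ)

/* True when the last ACK is older than the maximum RTO. */
bool ack_silence_exceeded(uint32_t now, uint32_t rcv_tstamp)
{
    /* Unsigned subtraction stays correct across counter wraparound. */
    return (uint32_t)(now - rcv_tstamp) > SKETCH_RTO_MAX;
}
```

On a host whose tick counter is already at 500000, a restored socket with rcv_tstamp = 0 looks as if it has been silent for 500 seconds; initializing rcv_tstamp to the restore time makes the age start from zero.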
     

28 Aug, 2013

5 commits

  • Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
    core with common functions and an address family specific target.

    The SYNPROXY receives the connection request from the client, responds with
    a SYN/ACK containing a SYN cookie and announcing a zero window and checks
    whether the final ACK from the client contains a valid cookie.

    It then establishes a connection to the original destination and, if
    successful, sends a window update to the client with the window size
    announced by the server.

    Support for timestamps, SACK, window scaling and MSS options can be
    statically configured as target parameters if the features of the server
    are known. If timestamps are used, the timestamp value sent back to
    the client in the SYN/ACK will be different from the real timestamp of
    the server. In order to not break PAWS, the timestamps are translated in
    the direction server->client.

    Signed-off-by: Patrick McHardy
    Tested-by: Martin Topholm
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
    Extract the local TCP stack independent parts of tcp_v4_init_sequence()
    and cookie_v4_check() and export them for use by the upcoming SYNPROXY
    target.

    Signed-off-by: Patrick McHardy
    Acked-by: David S. Miller
    Tested-by: Martin Topholm
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • Split out sequence number adjustments from NAT and move them to the conntrack
    core to make them usable for SYN proxying. The sequence number adjustment
    information is moved to a separate extend. The extend is added to new
    conntracks when a NAT mapping is set up for a connection using a helper.

    As a side effect, this saves 24 bytes per connection with NAT in the common
    case that a connection does not have a helper assigned.

    Signed-off-by: Patrick McHardy
    Tested-by: Martin Topholm
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
    with the tcp-reset option sends out reset packets with the src MAC address
    of the local bridge interface, instead of the MAC address of the intended
    destination. This causes some routers/firewalls to drop the reset packet
    as it appears to be spoofed. Fix this by bypassing ip[6]_local_out and
    setting the MAC of the sender in the tcp reset packet.

    This closes netfilter bugzilla #531.

    Signed-off-by: Phil Oester
    Signed-off-by: Pablo Neira Ayuso

    Phil Oester
     
  • Currently, the tcp_probe snooper can either filter packets by a given
    port (handed to the module via module parameter e.g. port=80) or lets
    all TCP traffic pass (port=0, default). When a port is specified, the
    port number is tested against the sk's source/destination port. Thus,
    if one of them matches, the information will be further processed for
    the log.

    As this is quite limited, allow for more advanced filtering possibilities
    which can facilitate debugging/analysis with the help of the tcp_probe
    snooper. Therefore, similarly to what was added to the BPF machine in
    commit 7e75f93e ("pkt_sched: ingress socket filter by mark"), add the
    possibility to use skb->mark as a filter.

    If the mark is not being used otherwise, this allows ingress filtering
    by flow (e.g. in order to track updates from only a single flow, or a
    subset of all flows for a given port) and other things such as dynamic
    logging and reconfiguration without removing/re-inserting the tcp_probe
    module, etc. Simple example:

    insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
    ...
    iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
    --sport 60952 -j MARK --set-mark 8888
    [... sampling interval ...]
    iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
    --sport 60952 -j MARK --set-mark 8888

    The current option to filter by a given port is still being preserved. A
    similar approach could be done for the sctp_probe module as a follow-up.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
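
The combined port/mark filter can be sketched as a predicate. This is an illustration of the behaviour described in the commit message, not the module's code; the function name and parameter layout are hypothetical, and ports are assumed to be in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: should this packet be logged?
 * port and fwmark are the module parameters (0 means "not set");
 * sport, dport and skb_mark describe the packet and its socket. */
bool probe_matches(uint16_t port, uint32_t fwmark,
                   uint16_t sport, uint16_t dport, uint32_t skb_mark)
{
    if (port == 0 && fwmark == 0)
        return true; /* no filter configured: log all TCP traffic */

    if (port != 0 && (sport == port || dport == port))
        return true; /* existing port filter is preserved */

    return fwmark != 0 && skb_mark == fwmark;
}
```

With fwmark=8888 and no port set, only packets that the iptables MARK rule has tagged with 8888 are logged, so the set of sampled flows can be changed at run time by adding or deleting rules.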
     

27 Aug, 2013

1 commit


26 Aug, 2013

1 commit

  • In commit 0ea9d5e3e0e03a63b11392f5613378977dae7eca ("xfrm: introduce
    helper for safe determination of mtu") I switched the determination of
    ipv4 mtus from dst_mtu to ip_skb_dst_mtu. This was an error because in
    case of IP_PMTUDISC_PROBE we fall back to the interface mtu, which is
    never correct for ipv4 ipsec.

    This patch partly reverts 0ea9d5e3e0e03a63b11392f5613378977dae7eca
    ("xfrm: introduce helper for safe determination of mtu").

    Cc: Steffen Klassert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Steffen Klassert

    Hannes Frederic Sowa