04 Nov, 2015

1 commit


03 Nov, 2015

4 commits

  • This patch fixes the following problems:

    1) percpu_counter_init() can return an error, therefore
    init_frag_mem_limit() must propagate this error so that
    inet_frags_init_net() can do the same up to its callers.

    2) If ip[46]_frags_ns_ctl_register() fails, we must unwind
    properly and free the percpu_counter.

    Without this fix, we leave a freed object in the percpu_counters
    global list (if CONFIG_HOTPLUG_CPU), leading to crashes.

    This bug was detected by KASAN and syzkaller tool
    (http://github.com/google/syzkaller)

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Hannes Frederic Sowa
    Cc: Jesper Dangaard Brouer
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
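
The two failure paths described in the fragmentation accounting fix above can be sketched in plain C. This is a minimal userspace model, not the kernel code: `counter_init()`, `counter_destroy()`, `ctl_register()` and `frag_ns_init()` are hypothetical stand-ins for percpu_counter_init(), the counter teardown, ip[46]_frags_ns_ctl_register() and the per-namespace init path, and the error value is illustrative.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel objects named in the message. */
struct counter { bool on_global_list; };
struct frag_ns { struct counter mem; bool sysctl_registered; };

/* Models percpu_counter_init(): it may fail (illustrative error -12). */
static int counter_init(struct counter *c, bool fail)
{
	if (fail)
		return -12;
	c->on_global_list = true;   /* the counter is now linked globally */
	return 0;
}

/* Models freeing the counter: it must leave the global list. */
static void counter_destroy(struct counter *c)
{
	c->on_global_list = false;
}

/* Models ip[46]_frags_ns_ctl_register(): it may fail as well. */
static int ctl_register(struct frag_ns *ns, bool fail)
{
	if (fail)
		return -12;
	ns->sysctl_registered = true;
	return 0;
}

static int frag_ns_init(struct frag_ns *ns, bool fail_counter, bool fail_ctl)
{
	int err;

	err = counter_init(&ns->mem, fail_counter);
	if (err)
		return err;                  /* point 1: propagate the error */

	err = ctl_register(ns, fail_ctl);
	if (err)
		counter_destroy(&ns->mem);   /* point 2: unwind before returning */
	return err;
}
```

In this model, a failed registration leaves `on_global_list` false, which corresponds to not leaving a freed object on the global list.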
     
  • skb_set_owner_w() is called from various places that assume
    skb->sk always points to a full-blown socket (as it changes
    sk->sk_wmem_alloc).

    We'd like to attach skb to request sockets, and in the future
    to timewait sockets as well. For these kinds of pseudo sockets,
    we need to take a traditional refcount and use sock_edemux()
    as the destructor.

    It is now time to un-inline skb_set_owner_w(), as it has become too big.

    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Signed-off-by: Eric Dumazet
    Bisected-by: Haiyang Zhang
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Fixes the following kernel BUG:

    BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758
    caller is __this_cpu_preempt_check+0x13/0x15
    CPU: 0 PID: 2758 Comm: bash Tainted: P O 3.18.19 #2
    ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000
    0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800
    ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8
    Call Trace:
    [] dump_stack+0x52/0x80
    [] check_preemption_disabled+0xce/0xe1
    [] __this_cpu_preempt_check+0x13/0x15
    [] ipmr_queue_xmit+0x647/0x70c
    [] ip_mr_forward+0x32f/0x34e
    [] ip_mroute_setsockopt+0xe03/0x108c
    [] ? get_parent_ip+0x11/0x42
    [] ? pollwake+0x4d/0x51
    [] ? default_wake_function+0x0/0xf
    [] ? get_parent_ip+0x11/0x42
    [] ? __wake_up_common+0x45/0x77
    [] ? _raw_spin_unlock_irqrestore+0x1d/0x32
    [] ? __wake_up_sync_key+0x4a/0x53
    [] ? sock_def_readable+0x71/0x75
    [] do_ip_setsockopt+0x9d/0xb55
    [] ? unix_seqpacket_sendmsg+0x3f/0x41
    [] ? sock_sendmsg+0x6d/0x86
    [] ? sockfd_lookup_light+0x12/0x5d
    [] ? SyS_sendto+0xf3/0x11b
    [] ? new_sync_read+0x82/0xaa
    [] compat_ip_setsockopt+0x3b/0x99
    [] compat_raw_setsockopt+0x11/0x32
    [] compat_sock_common_setsockopt+0x18/0x1f
    [] compat_SyS_setsockopt+0x1a9/0x1cf
    [] compat_SyS_socketcall+0x180/0x1e3
    [] cstar_dispatch+0x7/0x1e

    Signed-off-by: Ani Sinha
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Ani Sinha
     
  • This patch changes how the multipath hash is computed for locally
    generated flows: now the hash comprises l4 information.

    This allows better utilization of the available paths when the existing
    flows have the same source IP and the same destination IP: with l3 hash,
    even when multiple connections are in place simultaneously, a single path
    will be used, while with l4 hash we can use all the available paths.

    v2 changes:
    - use get_hash_from_flowi4() instead of implementing just another l4 hash
    function

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
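
The difference between the l3 and l4 hash in the multipath change above can be illustrated with a small model. This is a sketch under stated assumptions: the FNV-1a mixer, the modulo path selection and the names `hash_l3()`, `hash_l4()` and `paths_used()` are all hypothetical and chosen for brevity; they are not what get_hash_from_flowi4() does internally.

```c
#include <stdint.h>

/* FNV-1a over the four bytes of one word; a stand-in mixer only. */
static uint32_t mix32(uint32_t h, uint32_t word)
{
	for (int i = 0; i < 4; i++) {
		h ^= (word >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return h;
}

/* l3 hash: source and destination address only. */
static uint32_t hash_l3(uint32_t saddr, uint32_t daddr)
{
	return mix32(mix32(2166136261u, saddr), daddr);
}

/* l4 hash: addresses plus source and destination port. */
static uint32_t hash_l4(uint32_t saddr, uint32_t daddr,
			uint16_t sport, uint16_t dport)
{
	uint32_t ports = ((uint32_t)sport << 16) | dport;

	return mix32(hash_l3(saddr, daddr), ports);
}

/* Count how many of npaths (at most 32) next hops get selected by nflows
 * connections that differ only in their source port. */
static int paths_used(int npaths, int nflows, int use_l4)
{
	const uint32_t saddr = 0xc0a80001u;  /* 192.168.0.1 */
	const uint32_t daddr = 0x01020304u;  /* 1.2.3.4 */
	uint32_t seen = 0;
	int used = 0;

	for (int i = 0; i < nflows; i++) {
		uint16_t sport = (uint16_t)(40000 + i);
		uint32_t h = use_l4 ? hash_l4(saddr, daddr, sport, 80)
				    : hash_l3(saddr, daddr);

		seen |= 1u << (h % (uint32_t)npaths);
	}
	for (int p = 0; p < npaths; p++)
		used += (seen >> p) & 1;
	return used;
}
```

With the l3 variant every flow between the same two hosts lands on one path, whereas the l4 variant spreads the flows because the ports enter the hash.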
     

02 Nov, 2015

4 commits

  • When a nexthop is part of a multipath route we should clear the
    LINKDOWN flag when the link goes UP or when the first address is added.
    This is needed because we always set the LINKDOWN flag when the DEAD flag
    was set, but now on UP the nexthop is not dead anymore. Examples of how
    the LINKDOWN bit can be forgotten when no NETDEV_CHANGE is delivered:

    - link goes down (LINKDOWN is set), then link goes UP and device
    shows carrier OK but LINKDOWN remains set

    - last address is deleted (LINKDOWN is set), then address is
    added and device shows carrier OK but LINKDOWN remains set

    Steps to reproduce:
    modprobe dummy
    ifconfig dummy0 192.168.168.1 up

    here add a multipath route where one nexthop is for dummy0:

    ip route add 1.2.3.4 nexthop dummy0 nexthop SOME_OTHER_DEVICE
    ifconfig dummy0 down
    ifconfig dummy0 up

    now ip route shows nexthop that is not dead. Now set the sysctl var:

    echo 1 > /proc/sys/net/ipv4/conf/dummy0/ignore_routes_with_linkdown

    now ip route will show a dead nexthop because the forgotten
    RTNH_F_LINKDOWN is propagated as RTNH_F_DEAD.

    Fixes: 8a3d03166f19 ("net: track link-status of ipv4 nexthops")
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
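
The stale LINKDOWN bit described above can be modelled with two flag bits. This is a sketch, assuming simplified semantics: `NH_F_DEAD` and `NH_F_LINKDOWN` are local illustrative names and values standing in for RTNH_F_DEAD and RTNH_F_LINKDOWN, and the three helper functions are hypothetical.

```c
#include <stdbool.h>

/* Illustrative flag bits, named after the flags in the message. */
#define NH_F_DEAD     0x01u
#define NH_F_LINKDOWN 0x10u

/* Link goes down or last address is deleted: per the message, LINKDOWN
 * is always set whenever DEAD is set. */
static unsigned int nh_down(unsigned int flags)
{
	return flags | NH_F_DEAD | NH_F_LINKDOWN;
}

/* Link goes up or first address is added, carrier OK.
 * clear_linkdown == false models the behaviour before the fix. */
static unsigned int nh_up(unsigned int flags, bool clear_linkdown)
{
	flags &= ~NH_F_DEAD;
	if (clear_linkdown)
		flags &= ~NH_F_LINKDOWN;
	return flags;
}

/* With ignore_routes_with_linkdown set, LINKDOWN is reported as dead. */
static bool nh_shown_dead(unsigned int flags, bool ignore_linkdown)
{
	if (flags & NH_F_DEAD)
		return true;
	return ignore_linkdown && (flags & NH_F_LINKDOWN) != 0;
}
```

Running a down/up cycle through `nh_up(..., false)` leaves a nexthop that looks alive until the sysctl is enabled, which mirrors the reproduction steps in the message.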
     
  • When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event
    we should not delete the local routes if the local address
    is still present. The confusion comes from the fact that both
    fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN
    constant. Fix it by bringing back the variable 'force'.

    Steps to reproduce:
    modprobe dummy
    ifconfig dummy0 192.168.168.1 up
    ifconfig dummy0 down
    ip route list table local | grep dummy | grep host
    local 192.168.168.1 dev dummy0 proto kernel scope host src 192.168.168.1

    Fixes: 8a3d03166f19 ("net: track link-status of ipv4 nexthops")
    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one
    of those, warn about it once and handle it gracefully by recalculating
    the checksum.

    Cc: Eric Dumazet
    Cc: Vlad Yasevich
    Cc: Benjamin Coddington
    Cc: Tom Herbert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • We cannot reliably calculate packet size on MSG_MORE corked sockets
    and thus cannot decide if they are going to be fragmented later on,
    so better not use CHECKSUM_PARTIAL in the first place.

    Cc: Eric Dumazet
    Cc: Vlad Yasevich
    Cc: Benjamin Coddington
    Cc: Tom Herbert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

01 Nov, 2015

1 commit


30 Oct, 2015

1 commit

  • Steffen Klassert says:

    ====================
    pull request (net-next): ipsec-next 2015-10-30

    1) The flow cache is limited by the flow cache limit which
    depends on the number of cpus and the xfrm garbage collector
    threshold which is independent of the number of cpus. This
    leads to the fact that on systems with more than 16 cpus
    we hit the xfrm garbage collector limit and refuse new
    allocations, so new flows are dropped. On systems with 16
    or less cpus, we hit the flowcache limit. In this case, we
    shrink the flow cache instead of refusing new flows.

    We increase the xfrm garbage collector threshold to INT_MAX
    to get the same behaviour, independent of the number of cpus.

    2) Fix some unaligned accesses on sparc systems.
    From Sowmini Varadhan.

    3) Fix some header checks in _decode_session4. We may call
    pskb_may_pull with a negative value converted to unsigned
    int by pskb_may_pull. This can lead to incorrect policy
    lookups. We fix this by a check of the data pointer position
    before we call pskb_may_pull.

    4) Reload skb header pointers after calling pskb_may_pull
    in _decode_session4 as this may change the pointers into
    the packet.

    5) Add a missing statistic counter on inner mode errors.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Oct, 2015

1 commit

  • We were computing the child index in cases where the key value we were
    looking for was actually less than the base key of the tnode. As a result
    we were getting incorrect index values that would cause us to skip over
    some children.

    To fix this I have added a test that will force us to use child index 0 if
    the key we are looking for is less than the key of the current tnode.

    Fixes: 8be33e955cb9 ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf")
    Reported-by: Brian Rak
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
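
The child index problem in the fib_trie fix above can be shown with a small model. This is a sketch under assumptions: `struct tnode` here is a hypothetical, simplified node, and the XOR-and-shift index computation is an assumed form of the index calculation rather than a copy of the kernel helper.

```c
#include <stdint.h>

/* Hypothetical, simplified trie node: a child is selected by the `bits`
 * key bits that start at bit position `pos`. */
struct tnode {
	uint32_t key;        /* base key of this node */
	unsigned int pos;
	unsigned int bits;
};

/* Index derived from the XOR of the lookup key and the base key. */
static uint32_t child_index_raw(const struct tnode *tn, uint32_t key)
{
	return (key ^ tn->key) >> tn->pos;
}

/* With the added test: a key that is not above the base key starts the
 * walk at child 0. */
static uint32_t child_index_checked(const struct tnode *tn, uint32_t key)
{
	return key > tn->key ? child_index_raw(tn, key) : 0;
}
```

For a node with base key 0x0a000000, pos 16 and 8 bits, a lookup key of 0x09ff0000 gives a raw index of 0x3ff, far beyond the 256 children, while the checked variant starts at child 0.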
     

27 Oct, 2015

1 commit


24 Oct, 2015

1 commit

  • Conflicts:
    net/ipv6/xfrm6_output.c
    net/openvswitch/flow_netlink.c
    net/openvswitch/vport-gre.c
    net/openvswitch/vport-vxlan.c
    net/openvswitch/vport.c
    net/openvswitch/vport.h

    The openvswitch conflicts were overlapping changes. One was
    the egress tunnel info fix in 'net' and the other was the
    vport ->send() op simplification in 'net-next'.

    The xfrm6_output.c conflict was also a simplification
    overlapping a bug fix.

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Oct, 2015

6 commits

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Currently adding a new ipv4 address always causes the creation of the
    related network route, with default metric. When a host has multiple
    interfaces on the same network, multiple routes with the same metric
    are created.

    If userspace wants to set a specific metric on each route, e.g.
    giving a better metric to ethernet links than to Wi-Fi ones,
    the network routes must be deleted and recreated, which is error-prone.

    This patch implements support for IFA_F_NOPREFIXROUTE for ipv4
    addresses. When an address is added with this flag set, no associated
    network route is created, no network route is deleted when
    said IP is gone, and it's up to user space to manage such routes.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • If alpha is strictly reduced by alpha >> dctcp_shift_g and if alpha is less
    than 1 << dctcp_shift_g, then alpha may never reach zero. For example,
    given shift_g=4 and alpha=15, alpha >> dctcp_shift_g yields 0 and alpha
    remains 15. The effect isn't noticeable in this case below cwnd=137, but
    could gradually drive uncongested flows with leftover alpha down to
    cwnd=137. A larger dctcp_shift_g would have a greater effect.

    This change causes alpha=15 to drop to 0 instead of being decremented by 1
    as it would when alpha=16. However, it requires one less conditional to
    implement since it doesn't have to guard against subtracting 1 from 0U. A
    decay of 15 is not unreasonable since an equal or greater amount occurs at
    alpha >= 240.

    Signed-off-by: Andrew G. Shewmaker
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Andrew Shewmaker
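
The arithmetic in the DCTCP message above can be checked with a short sketch. The function names are hypothetical and this is only one way to express the described behaviour: when the shifted value is zero but alpha is not, the whole of alpha is subtracted.

```c
/* Decay as described for the old behaviour: subtract alpha >> shift_g.
 * For alpha below 1 << shift_g the step is 0 and alpha never changes. */
static unsigned int alpha_decay_old(unsigned int alpha, unsigned int shift_g)
{
	return alpha - (alpha >> shift_g);
}

/* Decay after the change: if the shifted step is 0, subtract alpha itself
 * so that alpha reaches zero. No guard against 0 - 1 is needed because
 * the step is never larger than alpha. */
static unsigned int alpha_decay_new(unsigned int alpha, unsigned int shift_g)
{
	unsigned int step = alpha >> shift_g;

	if (step == 0)
		step = alpha;
	return alpha - step;
}
```

With shift_g=4, `alpha_decay_old(15, 4)` stays at 15, `alpha_decay_new(15, 4)` gives 0, `alpha_decay_new(16, 4)` gives 15, and `alpha_decay_new(240, 4)` gives 225, a step of 15 as the message notes.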
     
  • A call to pskb_may_pull may change the pointers into the packet,
    so reload the pointers after the call.

    Signed-off-by: Steffen Klassert

    Steffen Klassert
     
  • We skip the header information if the data pointer points
    already behind the header in question for some protocols.
    This is because we call pskb_may_pull with a negative value
    converted to unsigned int by pskb_may_pull in this case.
    Skipping the header information can lead to incorrect policy
    lookups, so fix it by a check of the data pointer position
    before we call pskb_may_pull.

    Signed-off-by: Steffen Klassert

    Steffen Klassert
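
The signed-to-unsigned pitfall described above can be reproduced in userspace. This is a minimal sketch, assuming a much simplified packet buffer: `struct pkt`, `may_pull()` and the two `header_usable_*()` helpers are hypothetical and only model a pull check whose length parameter is unsigned.

```c
#include <stdbool.h>

/* Hypothetical, much simplified packet buffer. */
struct pkt {
	const unsigned char *data;  /* current data pointer */
	unsigned int len;           /* bytes available starting at data */
};

/* Models a pull check that takes an unsigned length. */
static bool may_pull(const struct pkt *p, unsigned int len)
{
	return len <= p->len;
}

/* Before: the needed length is computed unconditionally. If the header
 * ends before the data pointer, the difference is negative and turns into
 * a huge unsigned value, so the check fails and the header is skipped. */
static bool header_usable_old(const struct pkt *p, const unsigned char *hdr_end)
{
	return may_pull(p, (unsigned int)(hdr_end - p->data));
}

/* After: test the data pointer position first. */
static bool header_usable_new(const struct pkt *p, const unsigned char *hdr_end)
{
	if (hdr_end < p->data)
		return true;    /* data pointer is already behind the header */
	return may_pull(p, (unsigned int)(hdr_end - p->data));
}
```

Both variants agree when the header ends after the data pointer. They differ only in the case the message describes.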
     
  • While transitioning to netdev based vport we broke the OVS
    feature which allows the user to retrieve tunnel packet egress
    information for lwtunnel devices. The following patch fixes it
    by introducing an ndo operation to get the tunnel egress info.
    The same ndo operation can be used for lwtunnel devices and compat
    ovs-tnl-vport devices. So after adding such a device operation
    we can remove the similar operation from ovs-vport.

    Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device").
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

22 Oct, 2015

4 commits

  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2015-10-22

    1) Fix IPsec pre-encap fragmentation for GSO packets.
    From Herbert Xu.

    2) Fix some header checks in _decode_session6.
    We skip the header information if the data pointer points
    already behind the header in question for some protocols.
    This is because we call pskb_may_pull with a negative value
    converted to unsigned int by pskb_may_pull in this case.
    Skipping the header information can lead to incorrect policy
    lookups. From Mathias Krause.

    3) Allow changing the replay threshold and expiry timer of a
    state without having to set other attributes like replay
    counter and byte lifetime. Changing these other attributes
    may break the SA. From Michael Rossberg.

    4) Fix pmtu discovery for locally generated packets.
    We may fail to dispatch to the inner address family.
    As a result, the local error handler is not called
    and the mtu value is not reported back to userspace.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit e520af48c7e5a introduced the following bug when setting the
    TCP_REPAIR sockoption:

    [ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
    [ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
    [ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
    [ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
    [ 2860.657054] ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
    [ 2860.657058] 0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
    [ 2860.657062] ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
    [ 2860.657065] Call Trace:
    [ 2860.657072] [] dump_stack+0x4f/0x7b
    [ 2860.657075] [] check_preemption_disabled+0xe1/0xf0
    [ 2860.657078] [] __this_cpu_preempt_check+0x13/0x20
    [ 2860.657082] [] tcp_xmit_probe_skb+0xc7/0x100
    [ 2860.657085] [] tcp_send_window_probe+0x2d/0x30
    [ 2860.657089] [] do_tcp_setsockopt.isra.29+0x74c/0x830
    [ 2860.657093] [] tcp_setsockopt+0x2c/0x30
    [ 2860.657097] [] sock_common_setsockopt+0x14/0x20
    [ 2860.657100] [] SyS_setsockopt+0x71/0xc0
    [ 2860.657104] [] entry_SYSCALL_64_fastpath+0x16/0x75

    Since tcp_xmit_probe_skb() can be called from process context, use
    NET_INC_STATS() instead of NET_INC_STATS_BH().

    Fixes: e520af48c7e5 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
    Signed-off-by: Renato Westphal
    Signed-off-by: David S. Miller

    Renato Westphal
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains four Netfilter fixes for net, they are:

    1) Fix Kconfig dependencies of new nf_dup_ipv4 and nf_dup_ipv6.

    2) Remove bogus test nh_scope in IPv4 rpfilter match that is breaking
    --accept-local, from Xin Long.

    3) Wait for RCU grace period after dropping the pending packets in the
    nfqueue, from Florian Westphal.

    4) Fix sleeping allocation while holding spin_lock_bh, from Nikolay Borisov.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • if_nlmsg_size() overestimates the minimum allocation size of netlink
    dump request (when called from rtnl_calcit()) or the size of the
    message (when called from rtnl_getlink()). This is because
    ext_filter_mask is not supported by rtnl_link_get_af_size() and
    rtnl_link_get_size().

    The over-estimation is significant when at least one netdev has many
    VLANs configured (8 bytes for each configured VLAN).

    This patch-set "rightsizes" the protocol specific attribute size
    calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
    and adding it as an argument to the get_link_af_size op in rtnl_af_ops.

    Bridge module already used filtering aware sizing for notifications.
    br_get_link_af_size_filtered() is consistent with the modified
    get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
    br_get_link_af_size() becomes unused and thus removed.

    Signed-off-by: Ronen Arad
    Acked-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Arad, Ronen
     

21 Oct, 2015

6 commits

  • This patch implements the second half of RACK that uses the most
    recent transmit time among all delivered packets to detect losses.

    tcp_rack_mark_lost() is called upon receiving a dubious ACK.
    It then checks if a not-yet-sacked packet was sent at least
    "reo_wnd" prior to the sent time of the most recently delivered packet.
    If so, the packet is deemed lost.

    The "reo_wnd" reordering window starts with 1msec for fast loss
    detection and changes to min-RTT/4 when reordering is observed.
    We found 1msec accommodates well on tiny degree of reordering
    (
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
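
The loss test described in the RACK message above can be sketched as follows. This is an illustrative model, not the kernel implementation: `struct rack_state` and both functions are hypothetical, times are in microseconds, and "at least reo_wnd prior" is taken literally as a greater-or-equal comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; all times in microseconds. */
struct rack_state {
	uint64_t mstamp;        /* send time of the most recently delivered packet */
	uint32_t min_rtt_us;
	bool reordering_seen;
};

/* Reordering window: 1 msec by default, min-RTT/4 once reordering is seen. */
static uint32_t rack_reo_wnd_us(const struct rack_state *r)
{
	return r->reordering_seen ? r->min_rtt_us / 4 : 1000;
}

/* A not-yet-sacked packet is deemed lost if it was sent at least reo_wnd
 * before the most recently delivered packet was sent. */
static bool rack_is_lost(const struct rack_state *r, uint64_t sent_us,
			 bool sacked)
{
	if (sacked || sent_us >= r->mstamp)
		return false;
	return r->mstamp - sent_us >= rack_reo_wnd_us(r);
}
```

With a 40 msec min-RTT the window grows from 1 msec to 10 msec once reordering has been observed, so a packet sent 2 msec before the latest delivered one is marked lost only in the first case.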
     
  • This patch is the first half of the RACK loss recovery.

    RACK loss recovery uses the notion of time instead
    of packet sequence (FACK) or counts (dupthresh). It's inspired by the
    previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
    transmit (new data packet) is sacked, then the current retransmitted
    sequence below the newly sacked sequence must have been lost,
    since at least one round trip time has elapsed.

    But it has several limitations:
    1) can't detect tail drops since it depends on limited transmit
    2) is disabled upon reordering (assumes no reordering)
    3) only enabled in fast recovery but not in timeout recovery

    RACK (Recently ACK) addresses these limitations with the notion
    of time instead: a packet P1 is lost if a later packet P2 is s/acked,
    as at least one round trip has passed.

    Since RACK cares about the time sequence instead of the data sequence
    of packets, it can detect tail drops when later retransmission is
    s/acked while FACK or dupthresh can't. For reordering RACK uses a
    dynamically adjusted reordering window ("reo_wnd") to reduce false
    positives on every (small) degree of reordering.

    This patch implements tcp_advanced_rack() which tracks the
    most recent transmission time among the packets that have been
    delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
    is the key to determine which packet has been lost.

    Consider an example that the sender sends six packets:
    T1: P1 (lost)
    T2: P2
    T3: P3
    T4: P4
    T100: sack of P2. rack.mstamp = T2
    T101: retransmit P1
    T102: sack of P2,P3,P4. rack.mstamp = T4
    T205: ACK of P4 since the hole is repaired. rack.mstamp = T101

    We need to be careful about spurious retransmission because it may
    falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
    to falsely mark all packets lost, just like a spurious timeout.

    We identify spurious retransmission by the ACK's TS echo value.
    If TS option is not applicable but the retransmission is acknowledged
    less than min-RTT ago, it is likely to be spurious. We refrain from
    using the transmission time of these spurious retransmissions.

    The second half is implemented in the next patch that marks packet
    lost using RACK timestamp.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
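
The timestamp tracking from the message above, including its timeline, can be replayed with a small model. This is a sketch under assumptions: `struct rack_clock` and `rack_on_delivered()` are hypothetical names, the time units are the abstract T values from the example, and only the min-RTT heuristic for spurious retransmissions is modelled (the TS echo check is left out).

```c
#include <stdbool.h>
#include <stdint.h>

struct rack_clock {
	uint64_t mstamp;   /* latest send time among delivered packets */
};

/* Record one delivered (ACKed or SACKed) packet. */
static void rack_on_delivered(struct rack_clock *r, uint64_t xmit_time,
			      bool retransmitted, uint64_t ack_time,
			      uint64_t min_rtt)
{
	/* A retransmission acknowledged sooner than min-RTT after it was
	 * sent is likely spurious, so its send time is not used. */
	if (retransmitted && ack_time - xmit_time < min_rtt)
		return;
	if (xmit_time > r->mstamp)
		r->mstamp = xmit_time;
}
```

Replaying the example with an assumed min-RTT of 90: the sack of P2 at T100 sets mstamp to 2, the sack of P3 and P4 at T102 raises it to 4, and the ACK at T205 covering the retransmission sent at T101 raises it to 101. Had that retransmission been acknowledged at T110 instead, it would be ignored.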
     
  • A helper to prepare the main RACK patch.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Remove the existing lost retransmit detection because RACK subsumes
    it completely. This also stops overloading the ack_seq field of
    the skb control block.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Kathleen Nichols' algorithm for tracking the minimum RTT of a
    data stream over some measurement window. It uses constant space
    and constant time per update. Yet it almost always delivers
    the same minimum as an implementation that has to keep all
    the data in the window. The measurement window is tunable via
    the sysctl net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.

    The algorithm keeps track of the best, 2nd best & 3rd best min
    values, maintaining an invariant that the measurement time of
    the n'th best >= n-1'th best. It also makes sure that the three
    values are widely separated in the time window since that bounds
    the worst-case error when the data is monotonically increasing
    over the window.

    Upon getting a new min, we can forget everything earlier because
    it has no value - the new min is less than everything else in the
    window by definition and it's the most recent. So we restart fresh
    on every new min and overwrite the 2nd & 3rd choices. The same
    property holds for the 2nd & 3rd best.

    Therefore we have to maintain two invariants to maximize the
    information in the samples, one on values (1st.v
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
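
A constant-space windowed minimum of the kind described above can be sketched as follows. This is a simplified model, not the patch itself: `struct rtt_meas` and `rtt_min_update()` are illustrative names, time is an abstract tick counter, and the quarter-window and half-window spacing thresholds are assumptions of this sketch for keeping the three values separated in time.

```c
#include <stdint.h>

struct rtt_meas {
	uint32_t rtt;   /* measured value */
	uint32_t ts;    /* time the value was measured */
};

/* m[0], m[1], m[2] hold the best, 2nd best and 3rd best minimum. */
static void rtt_min_update(struct rtt_meas m[3], uint32_t rtt, uint32_t now,
			   uint32_t wlen)
{
	struct rtt_meas s = { rtt, now };
	uint32_t elapsed;

	/* A new min makes everything earlier worthless, so restart there. */
	if (s.rtt <= m[0].rtt)
		m[0] = m[1] = m[2] = s;
	else if (s.rtt <= m[1].rtt)
		m[1] = m[2] = s;
	else if (s.rtt <= m[2].rtt)
		m[2] = s;

	elapsed = now - m[0].ts;
	if (elapsed > wlen) {
		/* The best value aged out: promote the 2nd and 3rd choices. */
		m[0] = m[1];
		m[1] = m[2];
		m[2] = s;
		if (now - m[0].ts > wlen) {
			m[0] = m[1];
			m[1] = s;
			if (now - m[0].ts > wlen)
				m[0] = s;
		}
	} else if (m[1].ts == m[0].ts && elapsed > wlen / 4) {
		/* No distinct 2nd choice yet: take it from later in the window. */
		m[1] = m[2] = s;
	} else if (m[2].ts == m[1].ts && elapsed > wlen / 2) {
		/* No distinct 3rd choice yet: take it from the last half. */
		m[2] = s;
	}
}
```

Starting from three entries with the largest possible value, feeding samples (50 at t=10), (30 at t=20), (40 at t=50), (60 at t=80) and (70 at t=130) with a window of 100 leaves 30 as the minimum until t=130, when it ages out and 40 takes over.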
     
  • Currently ca_seq_rtt_us does not use Karn's check. Fix that by
    checking if any packet acked is a retransmit, for both RTT used
    for RTT estimation and congestion control.

    Fixes: 5b08e47ca ("tcp: prefer packet timing to TS-ECR for RTT")
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
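
The check described above, where an ACK that covers any retransmitted packet yields no RTT sample, can be sketched like this. The types and the function name are hypothetical, and returning -1 for "no sample" is a convention of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one packet covered by an ACK. */
struct acked_pkt {
	uint64_t sent_us;
	bool retransmitted;
};

/* Returns an RTT sample measured from the first acked packet, or -1 when
 * any acked packet was a retransmit, since the ACK is then ambiguous. */
static int64_t rtt_sample_us(const struct acked_pkt *pkts, int n,
			     uint64_t now_us)
{
	if (n <= 0)
		return -1;
	for (int i = 0; i < n; i++) {
		if (pkts[i].retransmitted)
			return -1;
	}
	return (int64_t)(now_us - pkts[0].sent_us);
}
```

The same result would feed both the RTT estimator and congestion control, so neither sees a sample taken from a retransmitted packet.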
     

20 Oct, 2015

1 commit


19 Oct, 2015

4 commits

  • Commit 044a832a777 ("xfrm: Fix local error reporting crash
    with interfamily tunnels") moved the setting of skb->protocol
    behind the last access of the inner mode family to fix an
    interfamily crash. Unfortunately now skb->protocol might not
    be set at all, so we fail to dispatch to the inner address family.
    As a result, the local error handler is not called and the
    mtu value is not reported back to userspace.

    We fix this by setting skb->protocol on message size errors
    before we call xfrm_local_error.

    Fixes: 044a832a7779c ("xfrm: Fix local error reporting crash with interfamily tunnels")
    Signed-off-by: Steffen Klassert

    Steffen Klassert
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for your net-next
    tree. Most relevantly, updates for the nfnetlink_log to integrate with
    conntrack, fixes for cttimeout and improvements for nf_queue core, they are:

    1) Remove useless ifdef around static inline function in IPVS, from
    Eric W. Biederman.

    2) Simplify the conntrack support for nfnetlink_queue: Merge
    nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back
    to nfnetlink_queue.c

    3) Use y2038 safe timestamp from nfnetlink_queue.

    4) Get rid of dead function definition in nf_conntrack, from Flavio
    Leitner.

    5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA.
    This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that
    controls enabling both nfqueue and nflog integration with conntrack.
    The userspace application can request this via NFULNL_CFG_F_CONNTRACK
    configuration flag.

    6) Remove unused netns variables in IPVS, from Eric W. Biederman and
    Simon Horman.

    7) Don't put back the refcount on the cttimeout object from xt_CT on success.

    8) Fix crash on cttimeout policy object removal. We have to flush out
    the cttimeout extension area of the conntrack so as not to refer to a
    nonexistent object that was just removed.

    9) Make sure RCU callbacks have completed before nfnetlink_cttimeout
    module removal.

    10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and
    nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann.

    11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is
    requested. Again from Ken-ichirou MATSUZAWA.

    12) Don't use pointer to previous hook when reinjecting traffic via
    nf_queue with NF_REPEAT verdict since it may be already gone. This
    also avoids a deadloop if the userspace application keeps returning
    NF_REPEAT.

    13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris.

    14) Consolidate logger instance existence check in nfulnl_recv_config().

    15) Fix broken atomicity when applying configuration updates to logger
    instances in nfnetlink_log.

    16) Get rid of the .owner attribute in our hook object. We don't need
    this anymore since we're dropping pending packets that have escaped
    from the kernel when removing the hook. Patch from Florian Westphal.

    17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always
    assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian.

    18) Use static inline functions instead of macros to define NF_HOOK() and
    NF_HOOK_COND() when no netfilter support is on, from Arnd Bergmann.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • At the time of commit fff326990789 ("tcp: reflect SYN queue_mapping into
    SYNACK packets") we had few ways to cope with SYN floods.

    We no longer need to reflect incoming skb queue mappings, and instead
    can pick a TX queue based on the cpu cooking the SYNACK, with normal XPS
    affinities.

    Note that all SYNACK retransmits were picking TX queue 0; this is no
    longer a win given that SYNACK rtx are now distributed on all cpus.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • A dhcp server may provide parameters to a client from a pool of IP
    addresses and using a shared rootfs, or provide a specific set of
    parameters for a specific client, usually using the MAC address to
    identify each client individually. The dhcp protocol also specifies
    a client-id field which can be used to determine the correct
    parameters to supply when no MAC address is available. There is
    currently no way to tell the kernel to supply a specific client-id.
    Only the userspace dhcp clients support this feature, but they cannot
    be used when the network is needed before userspace is available,
    such as when the root filesystem is on NFS.

    This patch makes it possible to pass something like
    "ip=dhcp,client_id_type,client_id_value" as a kernel parameter so that
    the kernel can identify itself to the server.

    Signed-off-by: Li RongQing
    Signed-off-by: David S. Miller

    Li RongQing
     

17 Oct, 2015

5 commits