09 Mar, 2016

2 commits

  • One of our customers observed issues with FIB6 garbage collectors
    running in different network namespaces blocking each other, resulting
    in soft lockups (fib6_run_gc() initiated from timer runs always in
    forced mode).

    Now that FIB6 walkers are separated per namespace, there is no more need
    for instances of fib6_run_gc() in different namespaces blocking each
    other. There is still a call to icmp6_dst_gc() which operates on shared
    data but this function is protected by its own shared lock.

    Signed-off-by: Michal Kubecek
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • The IPv6 FIB data structures are separated per network namespace but
    there is still only one global walkers list and one global walker list
    lock. This means changes in one namespace unnecessarily interfere with
    walkers in other namespaces.

    Replace the global list with per-netns lists (and give each its own
    lock).

    Signed-off-by: Michal Kubecek
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    Michal Kubeček
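
The two commits above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct netns`, `struct walker`, `walker_link()`, `walker_unlink()` and `run_gc()` are hypothetical stand-ins, not the kernel's definitions, and a C11 `atomic_flag` stands in for a kernel spinlock. The try-lock behavior for non-forced runs is also an assumption made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a FIB6 tree walker. */
struct walker {
    struct walker *next;
    int id;
};

/* Hypothetical stand-in for a network namespace: it owns its own
 * walker list, its own list lock and its own GC lock. */
struct netns {
    atomic_flag walkers_lock;   /* protects only this namespace's list */
    struct walker *walkers;
    atomic_flag gc_lock;        /* serializes GC runs in this namespace */
    unsigned long gc_runs;
};

static void lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                       /* spin until the holder releases */
}

static int trylock(atomic_flag *l)
{
    return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

static void unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

void netns_init(struct netns *ns)
{
    atomic_flag_clear(&ns->walkers_lock);
    atomic_flag_clear(&ns->gc_lock);
    ns->walkers = NULL;
    ns->gc_runs = 0;
}

/* Linking a walker only takes the lock of the namespace it belongs to. */
void walker_link(struct netns *ns, struct walker *w)
{
    lock(&ns->walkers_lock);
    w->next = ns->walkers;
    ns->walkers = w;
    unlock(&ns->walkers_lock);
}

void walker_unlink(struct netns *ns, struct walker *w)
{
    struct walker **pp;

    lock(&ns->walkers_lock);
    for (pp = &ns->walkers; *pp; pp = &(*pp)->next) {
        if (*pp == w) {
            *pp = w->next;
            w->next = NULL;
            break;
        }
    }
    unlock(&ns->walkers_lock);
}

/* Returns 1 if a GC pass ran, 0 if a non-forced run gave up because
 * the lock of this namespace was busy. A forced run always waits. */
int run_gc(struct netns *ns, int force)
{
    if (force)
        lock(&ns->gc_lock);
    else if (!trylock(&ns->gc_lock))
        return 0;

    ns->gc_runs++;              /* placeholder for the actual GC work */
    unlock(&ns->gc_lock);
    return 1;
}
```

With per-namespace locks, a forced `run_gc()` in one namespace never waits on a GC pass or a walker update in another namespace, which is the interference both commits set out to remove.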
     

17 Feb, 2016

3 commits


11 Feb, 2016

4 commits


08 Feb, 2016

9 commits


11 Jan, 2016

3 commits


19 Dec, 2015

1 commit

  • Allow accepted sockets to derive their sk_bound_dev_if setting from the
    l3mdev domain in which the packets originated. A sysctl setting is added
    to control the behavior which is similar to sk_mark and
    sysctl_tcp_fwmark_accept.

    This effectively allows a process to have a "VRF-global" listen socket,
    with child sockets bound to the VRF device in which the packet originated.
    A similar behavior can be achieved using sk_mark, but a solution using marks
    is incomplete as it does not handle duplicate addresses in different L3
    domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
    domain provides a complete solution.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
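
The inheritance rule described in this commit can be sketched as a small decision function. This is a simplified model, not the kernel code: `struct listener`, `struct pkt`, the `l3mdev_master` field and `child_bound_dev_if()` are hypothetical names, and the `l3mdev_accept` parameter stands in for the sysctl the commit adds.

```c
#include <assert.h>

/* Hypothetical view of a listening socket: 0 means "not bound to a device". */
struct listener {
    int bound_dev_if;
};

/* Hypothetical view of an incoming packet. l3mdev_master is the ifindex of
 * the L3 master (VRF) device of the ingress interface, or 0 if none. */
struct pkt {
    int iif;
    int l3mdev_master;
};

/* Pick the device binding for the child socket created by accept(). */
int child_bound_dev_if(const struct listener *lsk, const struct pkt *p,
                       int l3mdev_accept)
{
    /* An unbound ("VRF-global") listener may hand the child the L3 domain
     * the packet arrived in, but only when the sysctl allows it. */
    if (lsk->bound_dev_if == 0 && l3mdev_accept && p->l3mdev_master != 0)
        return p->l3mdev_master;

    /* Otherwise the child simply inherits the listener's binding. */
    return lsk->bound_dev_if;
}
```

Because the child is bound to the VRF device itself rather than tagged with a mark, two VRFs that use the same address still end up with distinct child sockets.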
     

16 Dec, 2015

1 commit

  • Setting pf_retrans >= max_retrans_path disables pf state. Both
    variables can be changed by userspace applications.

    Sometimes the user expects pf state to stay disabled, but the two
    variables are changed in a way that enables it. So it is necessary
    to introduce a new variable that disables pf state explicitly.

    According to the suggestions from Vlad Yasevich, extra1 and extra2
    are removed. The initialization of pf_enable is added.

    Acked-by: Vlad Yasevich
    Signed-off-by: Zhu Yanjun
    Acked-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Zhu Yanjun
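
The relationship between the three settings can be written as a single predicate. This is a minimal sketch: `struct pf_params` and `pf_state_possible()` are hypothetical names, and only the condition stated in the commit message is modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the tunables mentioned in the commit message. */
struct pf_params {
    bool pf_enable;         /* the new explicit switch */
    int pf_retrans;
    int max_retrans_path;
};

/* pf state can only be entered when the explicit switch is on AND the
 * older implicit condition (pf_retrans >= max_retrans_path) does not
 * already disable it. */
bool pf_state_possible(const struct pf_params *p)
{
    return p->pf_enable && p->pf_retrans < p->max_retrans_path;
}
```

With the explicit switch, a later change to `pf_retrans` or `max_retrans_path` can no longer re-enable pf state by accident.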
     

05 Aug, 2015

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next, they are:

    1) A couple of cleanups for the netfilter core hook from Eric Biederman.

    2) Net namespace hook registration, also from Eric. This adds a dependency with
    the rtnl_lock. This should be fine by now but we have to keep an eye on this
    because if we ever get the per-subsys nfnl_lock before rtnl we may have
    problems in the future. But we have room to remove this in the future by
    propagating the complexity to the clients, by registering hooks for the init
    netns functions.

    3) Update nf_tables to use the new net namespace hook infrastructure, also from
    Eric.

    4) Three patches to refine and to address problems from the new net namespace
    hook infrastructure.

    5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
    only applies to a very special case, the TEE target, but Eric Dumazet
    reports that this is slowing down things for everyone else. So let's only
    switch to the alternate jumpstack if the TEE target is in use through a
    static key. This batch also comes with offline precalculation of the
    jumpstack based on the callchain depth. From Florian Westphal.

    6) Minimal SCTP multihoming support for our conntrack helper, from Michal
    Kubecek.

    7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
    Westphal.

    8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.

    9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

01 Aug, 2015

1 commit


20 Jul, 2015

1 commit

  • Quoting Daniel Borkmann:

    "When adding connection tracking template rules to a netns, f.e. to
    configure netfilter zones, the kernel will endlessly busy-loop as soon
    as we try to delete the given netns in case there's at least one
    template present, which is problematic i.e. if there is such bravery that
    the privileged user inside the netns is assumed untrusted.

    Minimal example:

    ip netns add foo
    ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
    ip netns del foo

    What happens is that when nf_ct_iterate_cleanup() is being called from
    nf_conntrack_cleanup_net_list() for a provided netns, we always end up
    with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
    don't get a soft-lockup as we still have a schedule() point, but the
    serving CPU spins on 100% from that point onwards.

    Since templates are normally allocated with nf_conntrack_alloc(), we
    also bump net->ct.count. The issue why they are not yet nf_ct_put() is
    because the per netns .exit() handler from x_tables (which would eventually
    invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
    called in the dependency chain at a *later* point in time than the per
    netns .exit() handler for the connection tracker.

    This is clearly a chicken'n'egg problem: after the connection tracker
    .exit() handler, we've torn down all the connection tracking
    infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
    invoked at a later point in time during the netns cleanup, as that would
    lead to a use-after-free. At the same time, we cannot make x_tables depend
    on the connection tracker module, so that the xt_ct_tg_destroy() would
    be invoked earlier in the cleanup chain."

    Daniel confirms this has to do with the order in which modules are loaded, or
    with nf_conntrack being compiled as a module while x_tables is built in. So we
    have no guarantees regarding the order in which netns callbacks are executed.

    Fix this by allocating the templates through kmalloc() from the respective
    SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
    Then, release them via nf_ct_tmpl_free() from destroy_conntrack(). This branch
    is marked as unlikely since conntrack templates are rarely allocated and only
    from the configuration plane path.

    Note that templates are not kept in any list to avoid further dependencies with
    nf_conntrack anymore, thus, the tmpl larval list is removed.

    Reported-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso
    Tested-by: Daniel Borkmann

    Pablo Neira Ayuso
     

16 Jul, 2015

1 commit

  • - Add a new set of functions for registering and unregistering per
    network namespace hooks.

    - Modify the old global namespace hook functions to use the per
    network namespace hooks in their implementation, so there remains a
    single list that needs to be walked for any hook (this is important
    for keeping the hook priority working and for keeping the code
    walking the hooks simple).

    - Only allow registering the per netdevice hooks in the network
    namespace where the network device lives.

    - Dynamically allocate the structures in the per network namespace
    hook list in nf_register_net_hook, and unregister them in
    nf_unregister_net_hook.

    Dynamic allocation is required somewhere as the number of network
    namespaces is not fixed, so we might as well allocate them in the
    registration function.

    The chain of registered hooks on any list is expected to be small so
    the cost of walking that list to find the entry we are unregistering
    should also be small.

    Performing the management of the dynamically allocated list entries
    in the registration and unregistration functions keeps the complexity
    from spreading.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
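
The registration scheme described above can be sketched in userspace C. This is an illustration only: `struct hook_ops`, `struct hook_entry`, `struct netns_hooks` and the functions below are simplified, hypothetical stand-ins for the kernel structures, with no locking and no per-hook-number lists.

```c
#include <assert.h>
#include <stdlib.h>

typedef int (*hook_fn)(void *pkt);

/* Caller-owned description of a hook; one instance may be registered
 * in many namespaces. */
struct hook_ops {
    hook_fn fn;
    int priority;               /* lower values run first */
};

/* List node allocated per namespace at registration time. */
struct hook_entry {
    struct hook_entry *next;
    const struct hook_ops *ops;
};

/* Hypothetical per-namespace hook list. */
struct netns_hooks {
    struct hook_entry *head;
};

/* Allocate an entry and insert it in priority order. */
int register_net_hook(struct netns_hooks *ns, const struct hook_ops *ops)
{
    struct hook_entry **pp;
    struct hook_entry *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->ops = ops;
    for (pp = &ns->head; *pp; pp = &(*pp)->next)
        if (ops->priority < (*pp)->ops->priority)
            break;
    e->next = *pp;
    *pp = e;
    return 0;
}

/* Walk the (short) list to find the entry for ops, then free it. */
void unregister_net_hook(struct netns_hooks *ns, const struct hook_ops *ops)
{
    struct hook_entry **pp;

    for (pp = &ns->head; *pp; pp = &(*pp)->next) {
        if ((*pp)->ops == ops) {
            struct hook_entry *e = *pp;

            *pp = e->next;
            free(e);
            return;
        }
    }
}

/* Single list walk per namespace; returns how many hooks were invoked. */
int run_hooks(const struct netns_hooks *ns, void *pkt)
{
    const struct hook_entry *e;
    int called = 0;

    for (e = ns->head; e; e = e->next) {
        e->ops->fn(pkt);
        called++;
    }
    return called;
}

/* Trivial example hook. */
int accept_all(void *pkt)
{
    (void)pkt;
    return 1;
}
```

Keeping allocation and freeing inside the register and unregister functions means callers only ever handle their own `struct hook_ops`, which matches the commit's point about not letting the complexity spread.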
     

10 Jul, 2015

1 commit

  • Add support to allow non-local binds similar to how this was done for IPv4.
    Non-local binds are very useful in emulating the Internet in a box, etc.

    This adds the ip_nonlocal_bind sysctl under ipv6.

    Testing:

    Set up nonlocal binding and receive routing on a host, e.g.:

    ip -6 rule add from ::/0 iif eth0 lookup 200
    ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
    sysctl -w net.ipv6.ip_nonlocal_bind=1

    Set up routing to 2001:0:0:1::/64 on peer to go to first host

    ping6 -I 2001:0:0:1::1 peer-address -- to verify

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

24 Jun, 2015

1 commit


19 Jun, 2015

2 commits

  • This pulls the full hook netfilter definitions from all those that include
    net_namespace.h.

    Instead let's just include the bare minimum required in the new
    linux/netfilter_defs.h file, and use it from the netfilter netns header files.

    I also needed to include in.h and in6.h from linux/netfilter.h otherwise we hit
    this compilation error:

    In file included from include/linux/netfilter_defs.h:4:0,
    from include/net/netns/netfilter.h:4,
    from include/net/net_namespace.h:22,
    from include/linux/netdevice.h:43,
    from net/netfilter/nfnetlink_queue_core.c:23:
    include/uapi/linux/netfilter.h:76:17: error: field ‘in’ has incomplete type struct in_addr in;

    Also explicitly include linux/netfilter.h in several spots.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Eric W. Biederman

    Pablo Neira Ayuso
     
  • We don't need to pull the full definitions in that file, a simple forward
    declaration is enough.

    Moreover, include linux/procfs.h from nf_synproxy_core, otherwise this hits a
    compilation error due to missing declarations, ie.

    net/netfilter/nf_synproxy_core.c: In function ‘synproxy_proc_init’:
    net/netfilter/nf_synproxy_core.c:326:2: error: implicit declaration of function ‘proc_create’ [-Werror=implicit-function-declaration]
    if (!proc_create("synproxy", S_IRUGO, net->proc_net_stat,
    ^

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Eric W. Biederman

    Pablo Neira Ayuso
     

15 Jun, 2015

1 commit

  • ->auto_asconf_splist is per namespace and mangled by functions like
    sctp_setsockopt_auto_asconf() which don't guarantee any serialization.

    Also, the call to inet_sk_copy_descendant() was backing up
    ->auto_asconf_list through the copy but was not honoring
    ->do_auto_asconf, which could lead to list corruption if it was
    different between both sockets.

    This commit thus fixes the list handling by using ->addr_wq_lock
    spinlock to protect the list. Special handling is done upon socket
    creation and destruction for that. Error handling on sctp_init_sock()
    will never return an error after having initialized asconf, so
    sctp_destroy_sock() can be called without addr_wq_lock. The lock now
    will be taken on sctp_close_sock(), before locking the socket, so we
    don't do it in inverse order compared to sctp_addr_wq_timeout_handler().

    Instead of taking the lock on sctp_sock_migrate() for copying and
    restoring the list values, it's preferred to avoid rewriting it by
    implementing sctp_copy_descendant().

    Issue was found with a test application that kept flipping sysctl
    default_auto_asconf on and off, but one could trigger it by issuing
    simultaneous setsockopt() calls on multiple sockets or by
    creating/destroying sockets fast enough. This is only triggerable
    locally.

    Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).")
    Reported-by: Ji Jianwen
    Suggested-by: Neil Horman
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

31 May, 2015

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next, they are:

    1) default CONFIG_NETFILTER_INGRESS to y for easier compile-testing of all
    options.

    2) Allow to bind a table to net_device. This introduces the internal
    NFT_AF_NEEDS_DEV flag to perform a mandatory check for this binding.
    This is required by the next patch.

    3) Add the 'netdev' table family, this new table allows you to create ingress
    filter basechains. This provides access to the existing nf_tables features
    from ingress.

    4) Kill unused argument from compat_find_calc_{match,target} in ip_tables
    and ip6_tables, from Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 May, 2015

1 commit

  • After commit 07f4c90062f8f ("tcp/dccp: try to not exhaust
    ip_local_port_range in connect()") it is advised to have an even number
    of ports described in /proc/sys/net/ipv4/ip_local_port_range

    This means start/end values should have a different parity.

    Let's warn sysadmins of this, so that they can update their settings
    if they want to.

    Suggested-by: David S. Miller
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
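
The parity rule follows from simple arithmetic: an inclusive range [low, high] holds high - low + 1 ports, so the count is even exactly when low and high have different parity. A minimal sketch of that check, where `port_range_is_even_sized()` is a hypothetical helper and not a kernel function:

```c
#include <assert.h>
#include <stdbool.h>

/* True when the inclusive range [low, high] contains an even number of
 * ports, i.e. when low and high differ in parity. */
bool port_range_is_even_sized(int low, int high)
{
    return ((high - low) & 1) == 1;
}
```

For example, 32768-60999 holds 28232 ports and passes the check, while 32768-61000 holds 28233 ports and is the kind of setting the new warning is aimed at.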
     

27 May, 2015

1 commit


20 May, 2015

1 commit

  • This work is a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
    via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
    ECN connections. In other words, this work adds a retry with a non-ECN
    setup SYN packet, as suggested by the RFC on the first timeout:

    [...] A host that receives no reply to an ECN-setup SYN within the
    normal SYN retransmission timeout interval MAY resend the SYN and
    any subsequent SYN retransmissions with CWR and ECE cleared. [...]

    Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
    that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
    ECN sysctl to allow server-side only ECN"):

    1) Normal ECN-capable path:

    SYN ECE CWR ----->

    2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)
    SYN ----->

    In case we would not have the fallback implemented, the middlebox drop
    point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
    (timeout, rtx)

    In any case, it's rather a smaller percentage of sites where there would
    occur such additional setup latency: it was found at the end of 2014 that ~56%
    of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
    ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
    when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
    fallback would mitigate with a slight latency trade-off. Recent related
    paper on this topic:

    Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
    Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
    http://ecn.ethz.ch/ecn-pam15.pdf

    Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
    section 6.1.1.1. fallback on timeout. For users who explicitly do not want
    this, which can be the case in DC deployments, we add a
    net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback.

    tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
    rather we let tcp_ecn_rcv_synack() take that over on input path in case a
    SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
    ECN being negotiated eventually in that case.

    Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
    Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: Mirja Kühlewind
    Signed-off-by: Brian Trammell
    Cc: Eric Dumazet
    Cc: Dave Taht
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
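
The client-side behavior in the schematic above can be modeled as a function that picks the TCP flags for each SYN attempt. This is a simplified sketch, not the kernel implementation: `syn_flags()` and its parameters are hypothetical, per-route ECN settings are ignored, and only the flag bits themselves follow the TCP header layout.

```c
#include <assert.h>
#include <stdint.h>

/* TCP header flag bits. */
#define TCP_FLAG_SYN 0x02
#define TCP_FLAG_ECE 0x40
#define TCP_FLAG_CWR 0x80

/* Flags for the SYN sent on attempt number `retransmits`
 * (0 = initial SYN, 1 = first retransmission after a timeout, ...).
 * tcp_ecn and tcp_ecn_fallback mirror the sysctls named in the text. */
uint8_t syn_flags(int tcp_ecn, int tcp_ecn_fallback, int retransmits)
{
    uint8_t flags = TCP_FLAG_SYN;

    /* Only tcp_ecn=1 requests ECN on outgoing connections;
     * tcp_ecn=2 is the server-side-only mode. */
    if (tcp_ecn == 1)
        flags |= TCP_FLAG_ECE | TCP_FLAG_CWR;

    /* RFC3168 6.1.1.1 fallback: after a timeout, retry without ECN. */
    if (retransmits > 0 && tcp_ecn_fallback)
        flags &= (uint8_t)~(TCP_FLAG_ECE | TCP_FLAG_CWR);

    return flags;
}
```

With the fallback disabled, every retransmitted SYN keeps ECE and CWR set, which reproduces the repeated-drop case shown in the schematic.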
     

04 May, 2015

1 commit

  • This patch divides the IPv6 flow label space into two ranges:
    0-7ffff is reserved for flow label manager, 80000-fffff will be
    used for creating auto flow labels (per RFC6438). This only affects how
    labels are set on transmit, it does not affect receive. This range split
    can be disabled by sysctl.

    Background:

    IPv6 flow labels have been an unmitigated disappointment thus far
    in the lifetime of IPv6. Support in HW devices to use them for ECMP
    is lacking, and OSes don't turn them on by default. If we had these
    we could get much better hashing in IPv6 networks without resorting
    to DPI, possibly eliminating some of the motivations to define new
    encaps in UDP just for getting ECMP.

    Unfortunately, the initial specifications of IPv6 did not clarify
    how they are to be used. There has always been a vague concept that
    these can be used for ECMP, flow hashing, etc. and we do now have a
    good standard for how to do this in RFC6438. The problem is that flow labels
    can be either stateful or stateless (as in RFC6438), and we are
    presented with the possibility that a stateless label may collide
    with a stateful one. Attempts to split the flow label space were
    rejected in IETF. When we added support in Linux for RFC6438, we
    could not turn on flow labels by default due to this conflict.

    This patch splits the flow label space and should give us
    a path to enabling auto flow labels by default for all IPv6 packets.
    This is an API change so we need to consider compatibility with
    existing deployment. The stateful range is chosen to be the lower
    values in hopes that most uses would have chosen small numbers.

    Once we resolve the stateless/stateful issue, we can proceed to
    look at enabling RFC6438 flow labels by default (starting with
    scaled testing).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
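
The range split amounts to forcing the top bit of the 20-bit label for automatically generated labels. A minimal sketch under stated assumptions: `auto_flowlabel()`, `label_in_manager_range()` and the macro names are hypothetical, values are in host byte order, and the `state_ranges` parameter stands in for the sysctl that can disable the split.

```c
#include <assert.h>
#include <stdint.h>

#define FLOWLABEL_MASK           0x000FFFFFu  /* flow labels are 20 bits */
#define FLOWLABEL_STATELESS_FLAG 0x00080000u  /* top bit of the label */

/* Derive an automatic (stateless) flow label from a flow hash. With the
 * range split enabled the result always lands in 80000-fffff. */
uint32_t auto_flowlabel(uint32_t hash, int state_ranges)
{
    uint32_t label = hash & FLOWLABEL_MASK;

    if (state_ranges)
        label |= FLOWLABEL_STATELESS_FLAG;
    return label;
}

/* True when a label falls in 0-7ffff, the range reserved for the
 * flow label manager. */
int label_in_manager_range(uint32_t label)
{
    return (label & FLOWLABEL_MASK) < FLOWLABEL_STATELESS_FLAG;
}
```

Forcing one bit costs the stateless side one bit of hash entropy, which is the price of guaranteeing that an automatic label never collides with a stateful one.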
     

18 Apr, 2015

1 commit

  • This inline has ~500 callsites.

    On 04/14/2015 08:37 PM, David Miller wrote:
    > That BUG_ON() was added 7 years ago, and I don't remember it ever
    > triggering or helping us diagnose something, so just remove it and
    > keep the function inlined.

    On x86 allyesconfig build:

    text data bss dec hex filename
    82447071 22255384 20627456 125329911 77861f7 vmlinux4
    82441375 22255384 20627456 125324215 7784bb7 vmlinux5prime

    Signed-off-by: Denys Vlasenko
    CC: Eric W. Biederman
    CC: David S. Miller
    CC: Jan Engelhardt
    CC: Jiri Pirko
    CC: linux-kernel@vger.kernel.org
    CC: netdev@vger.kernel.org
    Signed-off-by: David S. Miller

    Denys Vlasenko
     

24 Mar, 2015

1 commit


19 Mar, 2015

1 commit