26 Mar, 2018

1 commit

  • Currently, the SMC experimental TCP option in a SYN packet is lost on
    the server side when SYN Cookies are active. However, the corresponding
    SYNACK sent back to the client contains the SMC option. This causes an
    inconsistent view of the SMC capabilities on the client and server.

    This patch disables the SMC option in the SYNACK when SYN Cookies are
    active to avoid this issue.

    Fixes: 60e2a7780793b ("tcp: TCP experimental option for SMC")
    Signed-off-by: Hans Wippel
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Hans Wippel
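
The rule this patch describes can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct synack_opts` and `build_synack_opts()` are hypothetical names, not the kernel's actual structures or functions.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the options placed in a SYNACK. */
struct synack_opts {
    bool smc_ok; /* advertise the SMC experimental option */
};

/*
 * Decide whether the SYNACK may carry the SMC option.
 * With SYN cookies the server keeps no per-connection state, so it
 * cannot remember that the SYN carried SMC. Leaving the option out of
 * the SYNACK keeps client and server in agreement.
 */
static struct synack_opts build_synack_opts(bool syn_had_smc, bool want_cookie)
{
    struct synack_opts opts = { .smc_ok = syn_had_smc && !want_cookie };
    return opts;
}
```

With this shape, a client that offered SMC simply falls back to plain TCP whenever the server answers with a cookie.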
     

28 Oct, 2017

1 commit


19 Jul, 2017

1 commit

  • KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(),
    which originated from the TCP request socket created in
    cookie_v6_check():

    ==================================================================
    BUG: KMSAN: use of uninitialized memory in tcp_transmit_skb+0xf77/0x3ec0
    CPU: 1 PID: 2949 Comm: syz-execprog Not tainted 4.11.0-rc5+ #2931
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    TCP: request_sock_TCPv6: Possible SYN flooding on port 20028. Sending cookies. Check SNMP counters.
    Call Trace:

    __dump_stack lib/dump_stack.c:16
    dump_stack+0x172/0x1c0 lib/dump_stack.c:52
    kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927
    __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469
    skb_set_hash_from_sk ./include/net/sock.h:2011
    tcp_transmit_skb+0xf77/0x3ec0 net/ipv4/tcp_output.c:983
    tcp_send_ack+0x75b/0x830 net/ipv4/tcp_output.c:3493
    tcp_delack_timer_handler+0x9a6/0xb90 net/ipv4/tcp_timer.c:284
    tcp_delack_timer+0x1b0/0x310 net/ipv4/tcp_timer.c:309
    call_timer_fn+0x240/0x520 kernel/time/timer.c:1268
    expire_timers kernel/time/timer.c:1307
    __run_timers+0xc13/0xf10 kernel/time/timer.c:1601
    run_timer_softirq+0x36/0xa0 kernel/time/timer.c:1614
    __do_softirq+0x485/0x942 kernel/softirq.c:284
    invoke_softirq kernel/softirq.c:364
    irq_exit+0x1fa/0x230 kernel/softirq.c:405
    exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:657
    smp_apic_timer_interrupt+0x5a/0x80 arch/x86/kernel/apic/apic.c:966
    apic_timer_interrupt+0x86/0x90 arch/x86/entry/entry_64.S:489
    RIP: 0010:native_restore_fl ./arch/x86/include/asm/irqflags.h:36
    RIP: 0010:arch_local_irq_restore ./arch/x86/include/asm/irqflags.h:77
    RIP: 0010:__msan_poison_alloca+0xed/0x120 mm/kmsan/kmsan_instr.c:440
    RSP: 0018:ffff880024917cd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
    RAX: 0000000000000246 RBX: ffff8800224c0000 RCX: 0000000000000005
    RDX: 0000000000000004 RSI: ffff880000000000 RDI: ffffea0000b6d770
    RBP: ffff880024917d58 R08: 0000000000000dd8 R09: 0000000000000004
    R10: 0000160000000000 R11: 0000000000000000 R12: ffffffff85abf810
    R13: ffff880024917dd8 R14: 0000000000000010 R15: ffffffff81cabde4

    poll_select_copy_remaining+0xac/0x6b0 fs/select.c:293
    SYSC_select+0x4b4/0x4e0 fs/select.c:653
    SyS_select+0x76/0xa0 fs/select.c:634
    entry_SYSCALL_64_fastpath+0x13/0x94 arch/x86/entry/entry_64.S:204
    RIP: 0033:0x4597e7
    RSP: 002b:000000c420037ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000017
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004597e7
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 000000c420037ef0 R08: 000000c420037ee0 R09: 0000000000000059
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000042dc20
    R13: 00000000000000f3 R14: 0000000000000030 R15: 0000000000000003
    chained origin:
    save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302
    kmsan_save_stack mm/kmsan/kmsan.c:317
    kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547
    __msan_store_shadow_origin_4+0xac/0x110 mm/kmsan/kmsan_instr.c:259
    tcp_create_openreq_child+0x709/0x1ae0 net/ipv4/tcp_minisocks.c:472
    tcp_v6_syn_recv_sock+0x7eb/0x2a30 net/ipv6/tcp_ipv6.c:1103
    tcp_get_cookie_sock+0x136/0x5f0 net/ipv4/syncookies.c:212
    cookie_v6_check+0x17a9/0x1b50 net/ipv6/syncookies.c:245
    tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989
    tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298
    tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487
    ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279
    NF_HOOK ./include/linux/netfilter.h:257
    ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322
    dst_input ./include/net/dst.h:492
    ip6_rcv_finish net/ipv6/ip6_input.c:69
    NF_HOOK ./include/linux/netfilter.h:257
    ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203
    __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208
    __netif_receive_skb net/core/dev.c:4246
    process_backlog+0x667/0xba0 net/core/dev.c:4866
    napi_poll net/core/dev.c:5268
    net_rx_action+0xc95/0x1590 net/core/dev.c:5333
    __do_softirq+0x485/0x942 kernel/softirq.c:284
    origin:
    save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302
    kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198
    kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337
    kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766
    reqsk_alloc ./include/net/request_sock.h:87
    inet_reqsk_alloc+0xa4/0x5b0 net/ipv4/tcp_input.c:6200
    cookie_v6_check+0x4f4/0x1b50 net/ipv6/syncookies.c:169
    tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989
    tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298
    tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487
    ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279
    NF_HOOK ./include/linux/netfilter.h:257
    ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322
    dst_input ./include/net/dst.h:492
    ip6_rcv_finish net/ipv6/ip6_input.c:69
    NF_HOOK ./include/linux/netfilter.h:257
    ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203
    __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208
    __netif_receive_skb net/core/dev.c:4246
    process_backlog+0x667/0xba0 net/core/dev.c:4866
    napi_poll net/core/dev.c:5268
    net_rx_action+0xc95/0x1590 net/core/dev.c:5333
    __do_softirq+0x485/0x942 kernel/softirq.c:284
    ==================================================================

    A similar error is reported for cookie_v4_check().

    Fixes: 58d607d3e52f ("tcp: provide skb->hash to synack packets")
    Signed-off-by: Alexander Potapenko
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Potapenko
     

01 Jul, 2017

1 commit

  • The refcount_t type and its corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
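
To illustrate why a dedicated reference-count type helps, here is a minimal userspace sketch of a saturating counter using C11 atomics. It is not the kernel's refcount_t implementation; the `sat_refcount_*` names are hypothetical and only the saturation idea is shown.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a saturating reference counter. */
struct sat_refcount {
    atomic_uint refs;
};

static void sat_refcount_init(struct sat_refcount *r, unsigned int n)
{
    atomic_init(&r->refs, n);
}

/* Take a reference; a counter that reached UINT_MAX stays pinned there
 * instead of wrapping around to 0. */
static void sat_refcount_inc(struct sat_refcount *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == UINT_MAX)
            return; /* saturated: leak the object rather than wrap */
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}

/* Drop a reference; returns true only when the last one was dropped.
 * A saturated (or already zero) counter is never released. */
static bool sat_refcount_dec_and_test(struct sat_refcount *r)
{
    unsigned int old = atomic_load(&r->refs);

    do {
        if (old == UINT_MAX || old == 0)
            return false;
    } while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));

    return old == 1;
}
```

A plain atomic counter would wrap from UINT_MAX to 0 and let a later decrement free an object that is still in use; pinning the value trades that use-after-free for a memory leak.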
     

08 Jun, 2017

3 commits


18 May, 2017

1 commit

  • TCP Timestamps option is defined in RFC 7323

    Traditionally on Linux, it has been tied to the internal
    'jiffies' variable, because it had been a cheap and good enough
    generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than 4ms or 10ms (HZ=250 or HZ=100 respectively)

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1]

    Receive size autotuning (DRS) is indeed more precise and converges
    faster to optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.

    This choice will allow us to upstream the 1 usec TS option as
    discussed in IETF 97.

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
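
The resolution figures quoted above follow from simple arithmetic, sketched below. The helper names are hypothetical; the second function only illustrates how a 1 ms timestamp value could be derived from a 64-bit microsecond clock, and is not taken from the patch itself.

```c
#include <stdint.h>

#define USEC_PER_MSEC 1000u

/* Granularity, in milliseconds, of a jiffies-based clock for a given HZ. */
static uint32_t jiffies_tick_ms(uint32_t hz)
{
    return 1000u / hz;
}

/* Derive a 32-bit timestamp value with 1 ms resolution from a 64-bit
 * microsecond clock. The result wraps modulo 2^32, as TSval does. */
static uint32_t tsval_ms_from_usec(uint64_t clock_us)
{
    return (uint32_t)(clock_us / USEC_PER_MSEC);
}
```

For example, `jiffies_tick_ms(250)` gives 4 and `jiffies_tick_ms(100)` gives 10, matching the 4 ms and 10 ms figures in the message.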
     

06 May, 2017

1 commit

  • The whole point of randomization was to hide server uptime, but an attacker
    can simply start a SYN flood, at which point TCP generates 'old style'
    timestamps, directly revealing the server's jiffies value.

    Also, the TSval sent by the server to a particular remote address varies
    depending on whether syncookies are being sent, potentially triggering
    PAWS drops for innocent clients.

    Let's implement proper randomization, including for SYN cookies.

    Also we do not need to export sysctl_tcp_timestamps, since it is not
    used from a module.

    In v2, I added Florian's feedback and contribution, adding tsoff to
    tcp_get_cookie_sock().

    v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

    Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
    Signed-off-by: Eric Dumazet
    Reviewed-by: Florian Westphal
    Tested-by: Florian Westphal
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jan, 2017

1 commit

  • SHA1 is slower and less secure than SipHash, and so replacing syncookie
    generation with SipHash makes natural sense. Some BSDs have been doing
    this for several years in fact.

    The speedup should be similar to, and even more impressive than, the
    speedup from the sequence number fix in this series.

    Signed-off-by: Jason A. Donenfeld
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

03 Dec, 2016

1 commit

  • Jiffies-based timestamps allow for easy inference of the number of devices
    behind NAT translators and also make tracking of hosts simpler.

    commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
    added the main infrastructure that is needed for per-connection ts
    randomization, in particular writing/reading the on-wire tcp header
    format takes the offset into account so rest of stack can use normal
    tcp_time_stamp (jiffies).

    So only two items are left:
    - add a tsoffset for request sockets
    - extend the tcp isn generator to also return another 32bit number
    in addition to the ISN.

    Re-use of ISN generator also means timestamps are still monotonically
    increasing for same connection quadruple, i.e. PAWS will still work.

    Includes fixes from Eric Dumazet.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Florian Westphal
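
The claim that PAWS keeps working can be checked with a small sketch: a fixed per-connection offset added to a monotonic clock preserves ordering under wrap-safe comparison. Everything here is an assumption for illustration; in particular `toy_ts_offset()` is a toy mixing function standing in for the keyed ISN generator, not the kernel's hash.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a keyed hash over the connection quadruple.
 * NOT cryptographically sound and NOT the kernel's generator. */
static uint32_t toy_ts_offset(uint32_t saddr, uint32_t daddr,
                              uint16_t sport, uint16_t dport,
                              uint32_t secret)
{
    uint32_t h = secret;

    h = (h ^ saddr) * 0x9e3779b1u;
    h = (h ^ daddr) * 0x9e3779b1u;
    h = (h ^ (((uint32_t)sport << 16) | dport)) * 0x9e3779b1u;
    return h ^ (h >> 15);
}

/* Timestamp as it appears on the wire: local clock plus the offset. */
static uint32_t wire_tsval(uint32_t clock, uint32_t tsoff)
{
    return clock + tsoff; /* unsigned arithmetic wraps modulo 2^32 */
}

/* Wrap-safe "a is not older than b", the comparison PAWS relies on. */
static bool ts_not_older(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}
```

Because the same quadruple and secret always yield the same offset, two timestamps from one connection differ by exactly the clock difference, even when the sum wraps past 2^32.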
     

05 Nov, 2016

1 commit

  • - Use the UID in routing lookups made by protocol connect() and
    sendmsg() functions.
    - Make sure that routing lookups triggered by incoming packets
    (e.g., Path MTU discovery) take the UID of the socket into
    account.
    - For packets not associated with a userspace socket, (e.g., ping
    replies) use UID 0 inside the user namespace corresponding to
    the network namespace the socket belongs to. This allows
    all namespaces to apply routing and iptables rules to
    kernel-originated traffic in that namespace by matching UID 0.
    This is better than using the UID of the kernel socket that is
    sending the traffic, because the UID of kernel sockets created
    at namespace creation time (e.g., the per-processor ICMP and
    TCP sockets) is the UID of the user that created the socket,
    which might not be mapped in the namespace.

    Tested: compiles allnoconfig, allyesconfig, allmodconfig
    Tested: https://android-review.googlesource.com/253302
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

28 Apr, 2016

1 commit


20 Mar, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and array maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a message-based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address list for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
     

16 Mar, 2016

1 commit

  • $ make tags
    GEN tags
    ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
    ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
    ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
    ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
    ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
    ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
    ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

    Which are all the result of the DEFINE_PER_CPU pattern:

    scripts/tags.sh:200: '/\
    Acked-by: David S. Miller
    Acked-by: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

08 Feb, 2016

1 commit


19 Dec, 2015

1 commit

  • Allow accepted sockets to derive their sk_bound_dev_if setting from the
    l3mdev domain in which the packets originated. A sysctl setting is added
    to control the behavior which is similar to sk_mark and
    sysctl_tcp_fwmark_accept.

    This effectively allows a process to have a "VRF-global" listen socket,
    with child sockets bound to the VRF device in which the packet originated.
    A similar behavior can be achieved using sk_mark, but a solution using marks
    is incomplete as it does not handle duplicate addresses in different L3
    domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
    domain provides a complete solution.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
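
The selection logic described above might look roughly like the following sketch. The struct and function names are hypothetical, and the rule that only an unbound listener inherits the L3 domain is an assumption of this example rather than something the message states.

```c
/* Hypothetical view of an incoming SYN: the ifindex of the L3 master
 * (VRF) device it arrived through, or 0 if it is not in an L3 domain. */
struct incoming_pkt {
    int l3mdev_master_ifindex;
};

/*
 * Pick the bound device for a child (accepted) socket.
 * Assumption: inheritance applies only when the listener itself is not
 * bound to a device and the sysctl is enabled.
 */
static int child_bound_dev_if(int listener_bound_dev_if,
                              int l3mdev_accept_enabled,
                              const struct incoming_pkt *pkt)
{
    if (listener_bound_dev_if == 0 && l3mdev_accept_enabled &&
        pkt->l3mdev_master_ifindex != 0)
        return pkt->l3mdev_master_ifindex;

    return listener_bound_dev_if;
}
```

Unlike a mark-based scheme, binding the child to the VRF device keeps connections with identical addresses in different L3 domains apart.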
     

03 Dec, 2015

1 commit

  • This patch addresses multiple problems :

    UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
    while socket is not locked : Other threads can change np->opt
    concurrently. Dmitry posted a syzkaller
    (http://github.com/google/syzkaller) program demonstrating
    use-after-free.

    Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
    and dccp_v6_request_recv_sock() also need to use RCU protection
    to dereference np->opt once (before calling ipv6_dup_options())

    This patch adds full RCU protection to np->opt

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2015

1 commit


05 Oct, 2015

1 commit

  • inet_reqsk_alloc() is used to allocate a temporary request
    in order to generate a SYNACK with a cookie. Then later,
    syncookie validation also uses a temporary request.

    These paths already took a reference on the listener refcount,
    so we can avoid a couple of atomic operations.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2015

1 commit


22 Sep, 2015

1 commit

  • Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
    RTT is often measured as 0ms or sometimes 1ms, which would affect
    RTT estimation and min RTT sampling used by some congestion control.

    This patch improves SYN/ACK RTT to be usec resolution if platform
    supports it. While the timestamping of SYN/ACK is done in request
    sock, the RTT measurement is carefully arranged to avoid storing
    another u64 timestamp in tcp_sock.

    For regular handshake w/o SYNACK retransmission, the RTT is sampled
    right after the child socket is created and right before the request
    sock is released (tcp_check_req() in tcp_minisocks.c)

    For Fast Open the child socket is already created when SYN/ACK was
    sent, the RTT is sampled in tcp_rcv_state_process() after processing
    the final ACK and right before the request socket is released.

    If the SYN/ACK was retransmitted or SYN-cookie was used, we rely
    on TCP timestamps to measure the RTT. The sample is taken at the
    same place in tcp_rcv_state_process() after the timestamp values
    are validated in tcp_validate_incoming(). Note that we do not store
    TS echo value in request_sock for SYN-cookies, because the value
    is already stored in tp->rx_opt used by tcp_ack_update_rtt().

    One side benefit is that the RTT measurement now happens before
    initializing congestion control (of the passive side). Therefore
    the congestion control can use the SYN/ACK RTT.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
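
The "0ms or sometimes 1ms" effect is a quantization artifact, which the following sketch reproduces. The function names are hypothetical, and the clock is modeled as a plain microsecond counter sampled either at tick granularity or directly.

```c
#include <stdint.h>

/* RTT in ms as seen by a tick-based clock: both instants are first
 * rounded down to whole ticks, then subtracted. */
static uint32_t rtt_ms_from_jiffies(uint64_t sent_us, uint64_t acked_us,
                                    uint32_t hz)
{
    uint64_t tick_us = 1000000u / hz;
    uint64_t ticks = acked_us / tick_us - sent_us / tick_us;

    return (uint32_t)(ticks * 1000u / hz);
}

/* RTT measured directly at microsecond resolution. */
static uint32_t rtt_us(uint64_t sent_us, uint64_t acked_us)
{
    return (uint32_t)(acked_us - sent_us);
}
```

With HZ=1000, a 300 usec LAN round trip reads as 0 ms when both instants fall in the same tick and as 1 ms when they straddle a tick boundary, while the microsecond measurement reports 300 in both cases.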
     

08 Jun, 2015

1 commit


25 Mar, 2015

1 commit

  • ss should display ipv4 mapped request sockets like this :

    tcp SYN-RECV 0 0 ::ffff:192.168.0.1:8080 ::ffff:192.0.2.1:35261

    and not like this :

    tcp SYN-RECV 0 0 192.168.0.1:8080 192.0.2.1:35261

    We should init ireq->ireq_family based on listener sk_family,
    not the actual protocol carried by SYN packet.

    This means we can set ireq_family in inet_reqsk_alloc()

    Fixes: 3f66b083a5b7 ("inet: introduce ireq_family")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
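
A minimal sketch of the fix's idea, with hypothetical `toy_` types standing in for the kernel structures: the request inherits its family from the listener, so a v4-mapped connection on an AF_INET6 listener is reported as IPv6.

```c
#include <sys/socket.h> /* AF_INET, AF_INET6 */

/* Hypothetical, stripped-down stand-ins for the listener and request. */
struct toy_listener {
    int sk_family;
};

struct toy_request {
    int ireq_family;
};

/* Initialize the request from the listener, not from the protocol
 * carried by the SYN packet. */
static void toy_reqsk_init(struct toy_request *req,
                           const struct toy_listener *lsk)
{
    req->ireq_family = lsk->sk_family;
}
```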
     

21 Mar, 2015

1 commit

  • One of the major issues for TCP is the SYNACK rtx handling,
    done by inet_csk_reqsk_queue_prune(), fired by the keepalive
    timer of a TCP_LISTEN socket.

    This function runs for awfully long times, with the socket lock held,
    meaning that other cpus needing this lock have to spin for hundreds of ms.

    SYNACK are sent in huge bursts, likely to cause severe drops anyway.

    This model was OK 15 years ago when memory was very tight.

    We now can afford to have a timer per request sock.

    Timer invocations no longer need to lock the listener,
    and can be run from all cpus in parallel.

    With following patch increasing somaxconn width to 32 bits,
    I tested a listener with more than 4 million active request sockets,
    and a steady SYNFLOOD of ~200,000 SYN per second.
    Host was sending ~830,000 SYNACK per second.

    This is ~100 times more than what we could achieve before this patch.

    Later, we will get rid of the listener hash and use ehash instead.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2015

3 commits

  • While testing last patch series, I found req sock refcounting was wrong.

    We must set skc_refcnt to 1 for all request socks added in hashes,
    but also on request sockets created by FastOpen or syncookies.

    It is tricky because we need to defer this initialization so that
    future RCU lookups do not try to take a refcount on a not yet
    fully initialized request socket.

    Also get rid of ireq_refcnt alias.

    Signed-off-by: Eric Dumazet
    Fixes: 13854e5a6046 ("inet: add proper refcounting to request sock")
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The listener field in struct tcp_request_sock is a pointer
    back to the listener. We now have req->rsk_listener, so TCP
    only needs one boolean and not a full pointer.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • listener socket can be used to set net pointer, and will
    be later used to hold a reference on listener.

    Add a const qualifier to first argument (struct request_sock_ops *),
    and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Mar, 2015

2 commits


05 Nov, 2014

2 commits

  • This patch allows setting ECN on a per-route basis in case the sysctl
    tcp_ecn is not set to 1. In other words, when ECN is set for specific
    routes, it provides a tcp_ecn=1 behaviour for that route while the rest
    of the stack acts according to the global settings.

    One can use 'ip route change dev $dev $net features ecn' to toggle this.

    Having a more fine-grained per-route setting can be beneficial for various
    reasons, for example, 1) within data centers, or 2) local ISPs may deploy
    ECN support for their own video/streaming services [1], etc.

    There was a recent measurement study/paper [2] which scanned the Alexa's
    publicly available top million websites list from a vantage point in US,
    Europe and Asia:

    Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
    blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
    only ECN") ;)); the break in connectivity on-path was found in about
    1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
    more common in the negotiation phase (and mostly seen in the Alexa
    middle band, ranks around 50k-150k): from 12-thousand hosts on which
    there _may_ be ECN-linked connection failures, only 79 failed with RST
    when _not_ failing with RST when ECN is not requested.

    It's unclear though, how much equipment in the wild actually marks CE
    when buffers start to fill up.

    We thought about a fallback to non-ECN for retransmitted SYNs as another
    global option (which could perhaps one day be made default), but as Eric
    points out, there's much more work needed to detect broken middleboxes.

    Two examples Eric mentioned are buggy firewalls that accept only a single
    SYN per flow, and middleboxes that successfully let an ECN flow establish,
    but later mark CE for all packets (so cwnd converges to 1).

    [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
    [2] http://ecn.ethz.ch/

    Joint work with Daniel Borkmann.

    Reference: http://thread.gmane.org/gmane.linux.network/335797
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • The function cookie_check_timestamp(), both called from IPv4/6 context,
    is being used to decode the echoed timestamp from the SYN/ACK into TCP
    options used for follow-up communication with the peer.

    We can remove ECN handling from that function, split it into a separate
    one, and simply rename the original function into cookie_decode_options().
    cookie_decode_options() just fills in tcp_option struct based on the
    echoed timestamp received from the peer. Anything that fails in this
    function will actually discard the request socket.

    While this is the natural place for decoding options such as ECN which
    commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
    that in particular for ECN handling, it can be checked at a later point
    in time as the request sock would actually not need to be dropped from
    this, but just ECN support turned off.

    Therefore, we split this functionality into cookie_ecn_ok(), which tells
    us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.

    This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
    won't be enough anymore at that point; if the timestamp indicates ECN
    and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.

    This would mean adding a route lookup to cookie_check_timestamp(), which
    we definitely want to avoid. As we already do a route lookup at a later
    point in cookie_{v4,v6}_check(), we can simply make use of that as well
    for the new cookie_ecn_ok() function w/o any additional cost.

    Joint work with Daniel Borkmann.

    Acked-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
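
The decision that cookie_ecn_ok() is described as making, extended with the per-route check the message anticipates, can be sketched as below. The `toy_` prefix marks this as a hypothetical stand-in; the parameters are simplifications of state the kernel would read from the decoded options, the sysctl, and the route.

```c
#include <stdbool.h>

/*
 * Decide whether ECN may be used for a connection rebuilt from a cookie.
 *  - ts_says_ecn: the echoed timestamp encoded ECN support
 *  - sysctl_tcp_ecn: global tcp_ecn setting (0 means disabled)
 *  - route_has_ecn: the route carries the per-route ECN feature
 */
static bool toy_cookie_ecn_ok(bool ts_says_ecn, int sysctl_tcp_ecn,
                              bool route_has_ecn)
{
    if (!ts_says_ecn)
        return false; /* peer never negotiated ECN */

    if (sysctl_tcp_ecn != 0)
        return true; /* globally enabled */

    return route_has_ecn; /* fall back to the per-route setting */
}
```

A negative answer only turns ECN off for the connection; it does not drop the request socket, which is why the check can run after the route lookup.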
     

31 Oct, 2014

1 commit


24 Oct, 2014

1 commit


19 Oct, 2014

1 commit

  • Pull networking fixes from David Miller:

    1) Include fixes for netrom and dsa (Fabian Frederick and Florian
    Fainelli)

    2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO.

    3) Several SKB use after free fixes (vxlan, openvswitch, vxlan,
    ip_tunnel, fou), from Li ROngQing.

    4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy.

    5) Use after free in virtio_net, from Michael S Tsirkin.

    6) Fix flow mask handling for megaflows in openvswitch, from Pravin B
    Shelar.

    7) ISDN gigaset and capi bug fixes from Tilman Schmidt.

    8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin.

    9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov.

    10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong
    Wang and Eric Dumazet.

    11) Don't overwrite end of SKB when parsing malformed sctp ASCONF
    chunks, from Daniel Borkmann.

    12) Don't call sock_kfree_s() with NULL pointers, this function also has
    the side effect of adjusting the socket memory usage. From Cong Wang.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
    bna: fix skb->truesize underestimation
    net: dsa: add includes for ethtool and phy_fixed definitions
    openvswitch: Set flow-key members.
    netrom: use linux/uaccess.h
    dsa: Fix conversion from host device to mii bus
    tipc: fix bug in bundled buffer reception
    ipv6: introduce tcp_v6_iif()
    sfc: add support for skb->xmit_more
    r8152: return -EBUSY for runtime suspend
    ipv4: fix a potential use after free in fou.c
    ipv4: fix a potential use after free in ip_tunnel_core.c
    hyperv: Add handling of IP header with option field in netvsc_set_hash()
    openvswitch: Create right mask with disabled megaflows
    vxlan: fix a free after use
    openvswitch: fix a use after free
    ipv4: dst_entry leak in ip_send_unicast_reply()
    ipv4: clean up cookie_v4_check()
    ipv4: share tcp_v4_save_options() with cookie_v4_check()
    ipv4: call __ip_options_echo() in cookie_v4_check()
    atm: simplify lanai.c by using module_pci_driver
    ...

    Linus Torvalds
     

18 Oct, 2014

1 commit

  • Commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line
    misses") added a regression for SO_BINDTODEVICE on IPv6.

    This is because we still use inet6_iif() which expects that IP6 control
    block is still at the beginning of skb->cb[]

    This patch adds tcp_v6_iif() helper and uses it where necessary.

    Because __inet6_lookup_skb() is used by TCP and DCCP, we add an iif
    parameter to it.

    Signed-off-by: Eric Dumazet
    Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced
    with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

29 Sep, 2014

1 commit


28 Aug, 2014

1 commit


27 Aug, 2014

1 commit