21 Oct, 2017

1 commit

  • syzkaller found another bug in DCCP/TCP stacks [1]

    For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
    ireq->pktopts race"), we need to make sure we do not access
    ireq->opt unless we own the request sock.

    Note the opt field is renamed to ireq_opt to ease grep games.

    [1]
    BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

    CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    print_address_description+0x73/0x250 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x25b/0x340 mm/kasan/report.c:409
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
    ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
    tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
    tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
    __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
    tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
    tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
    tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
    tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x40c341
    RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
    RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
    RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
    R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

    Allocated by task 3295:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
    __do_kmalloc mm/slab.c:3725 [inline]
    __kmalloc+0x162/0x760 mm/slab.c:3734
    kmalloc include/linux/slab.h:498 [inline]
    tcp_v4_save_options include/net/tcp.h:1962 [inline]
    tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
    tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
    tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
    tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
    tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
    tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3306:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
    __cache_free mm/slab.c:3503 [inline]
    kfree+0xca/0x250 mm/slab.c:3820
    inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
    __sk_destruct+0xfd/0x910 net/core/sock.c:1560
    sk_destruct+0x47/0x80 net/core/sock.c:1595
    __sk_free+0x57/0x230 net/core/sock.c:1603
    sk_free+0x2a/0x40 net/core/sock.c:1614
    sock_put include/net/sock.h:1652 [inline]
    inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
    tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
    tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Aug, 2017

1 commit

  • __ip_options_echo() uses the current network namespace, and
    currently retrieves it via skb->dst->dev.

    This commit adds an explicit 'net' argument to __ip_options_echo()
    and updates all the call sites to provide it, usually via a simpler
    sock_net().

    After this change, __ip_options_echo() no longer needs to access
    skb->dst and we can drop a couple of hacks to preserve such
    info in the rx path.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

19 Jul, 2017

1 commit

  • KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(),
    which originated from the TCP request socket created in
    cookie_v6_check():

    ==================================================================
    BUG: KMSAN: use of uninitialized memory in tcp_transmit_skb+0xf77/0x3ec0
    CPU: 1 PID: 2949 Comm: syz-execprog Not tainted 4.11.0-rc5+ #2931
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    TCP: request_sock_TCPv6: Possible SYN flooding on port 20028. Sending cookies. Check SNMP counters.
    Call Trace:

    __dump_stack lib/dump_stack.c:16
    dump_stack+0x172/0x1c0 lib/dump_stack.c:52
    kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927
    __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469
    skb_set_hash_from_sk ./include/net/sock.h:2011
    tcp_transmit_skb+0xf77/0x3ec0 net/ipv4/tcp_output.c:983
    tcp_send_ack+0x75b/0x830 net/ipv4/tcp_output.c:3493
    tcp_delack_timer_handler+0x9a6/0xb90 net/ipv4/tcp_timer.c:284
    tcp_delack_timer+0x1b0/0x310 net/ipv4/tcp_timer.c:309
    call_timer_fn+0x240/0x520 kernel/time/timer.c:1268
    expire_timers kernel/time/timer.c:1307
    __run_timers+0xc13/0xf10 kernel/time/timer.c:1601
    run_timer_softirq+0x36/0xa0 kernel/time/timer.c:1614
    __do_softirq+0x485/0x942 kernel/softirq.c:284
    invoke_softirq kernel/softirq.c:364
    irq_exit+0x1fa/0x230 kernel/softirq.c:405
    exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:657
    smp_apic_timer_interrupt+0x5a/0x80 arch/x86/kernel/apic/apic.c:966
    apic_timer_interrupt+0x86/0x90 arch/x86/entry/entry_64.S:489
    RIP: 0010:native_restore_fl ./arch/x86/include/asm/irqflags.h:36
    RIP: 0010:arch_local_irq_restore ./arch/x86/include/asm/irqflags.h:77
    RIP: 0010:__msan_poison_alloca+0xed/0x120 mm/kmsan/kmsan_instr.c:440
    RSP: 0018:ffff880024917cd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
    RAX: 0000000000000246 RBX: ffff8800224c0000 RCX: 0000000000000005
    RDX: 0000000000000004 RSI: ffff880000000000 RDI: ffffea0000b6d770
    RBP: ffff880024917d58 R08: 0000000000000dd8 R09: 0000000000000004
    R10: 0000160000000000 R11: 0000000000000000 R12: ffffffff85abf810
    R13: ffff880024917dd8 R14: 0000000000000010 R15: ffffffff81cabde4

    poll_select_copy_remaining+0xac/0x6b0 fs/select.c:293
    SYSC_select+0x4b4/0x4e0 fs/select.c:653
    SyS_select+0x76/0xa0 fs/select.c:634
    entry_SYSCALL_64_fastpath+0x13/0x94 arch/x86/entry/entry_64.S:204
    RIP: 0033:0x4597e7
    RSP: 002b:000000c420037ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000017
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004597e7
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: 000000c420037ef0 R08: 000000c420037ee0 R09: 0000000000000059
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000042dc20
    R13: 00000000000000f3 R14: 0000000000000030 R15: 0000000000000003
    chained origin:
    save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302
    kmsan_save_stack mm/kmsan/kmsan.c:317
    kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547
    __msan_store_shadow_origin_4+0xac/0x110 mm/kmsan/kmsan_instr.c:259
    tcp_create_openreq_child+0x709/0x1ae0 net/ipv4/tcp_minisocks.c:472
    tcp_v6_syn_recv_sock+0x7eb/0x2a30 net/ipv6/tcp_ipv6.c:1103
    tcp_get_cookie_sock+0x136/0x5f0 net/ipv4/syncookies.c:212
    cookie_v6_check+0x17a9/0x1b50 net/ipv6/syncookies.c:245
    tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989
    tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298
    tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487
    ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279
    NF_HOOK ./include/linux/netfilter.h:257
    ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322
    dst_input ./include/net/dst.h:492
    ip6_rcv_finish net/ipv6/ip6_input.c:69
    NF_HOOK ./include/linux/netfilter.h:257
    ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203
    __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208
    __netif_receive_skb net/core/dev.c:4246
    process_backlog+0x667/0xba0 net/core/dev.c:4866
    napi_poll net/core/dev.c:5268
    net_rx_action+0xc95/0x1590 net/core/dev.c:5333
    __do_softirq+0x485/0x942 kernel/softirq.c:284
    origin:
    save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302
    kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198
    kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337
    kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766
    reqsk_alloc ./include/net/request_sock.h:87
    inet_reqsk_alloc+0xa4/0x5b0 net/ipv4/tcp_input.c:6200
    cookie_v6_check+0x4f4/0x1b50 net/ipv6/syncookies.c:169
    tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989
    tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298
    tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487
    ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279
    NF_HOOK ./include/linux/netfilter.h:257
    ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322
    dst_input ./include/net/dst.h:492
    ip6_rcv_finish net/ipv6/ip6_input.c:69
    NF_HOOK ./include/linux/netfilter.h:257
    ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203
    __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208
    __netif_receive_skb net/core/dev.c:4246
    process_backlog+0x667/0xba0 net/core/dev.c:4866
    napi_poll net/core/dev.c:5268
    net_rx_action+0xc95/0x1590 net/core/dev.c:5333
    __do_softirq+0x485/0x942 kernel/softirq.c:284
    ==================================================================

    Similar error is reported for cookie_v4_check().

    Fixes: 58d607d3e52f ("tcp: provide skb->hash to synack packets")
    Signed-off-by: Alexander Potapenko
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Potapenko
     

01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.
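
    The "increment unless zero" semantics mentioned above can be sketched
    in userspace with C11 atomics. This is only an illustration: struct ref,
    ref_inc_not_zero() and ref_dec_and_test() are hypothetical names, and the
    overflow saturation that the kernel refcount API provides is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a reference counter. */
struct ref {
    atomic_uint count;
};

/* Take a reference only if the object is still live (count != 0). */
static bool ref_inc_not_zero(struct ref *r)
{
    unsigned int old = atomic_load(&r->count);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false; /* object is already on its way to being freed */
}

/* Drop a reference; returns true when the caller released the last one. */
static bool ref_dec_and_test(struct ref *r)
{
    unsigned int old = atomic_fetch_sub(&r->count, 1);

    assert(old != 0); /* an underflow would be a refcounting bug */
    return old == 1;
}
```

    A lookup path would call ref_inc_not_zero() before using an object it
    found, and give up if it returns false.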

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

08 Jun, 2017

4 commits


18 May, 2017

1 commit

  • TCP Timestamps option is defined in RFC 7323

    Traditionally on Linux, it has been tied to the internal
    'jiffies' variable, because it had been a cheap and good enough
    generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than 4 ms or 10 ms (HZ=250 or HZ=100 respectively).

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1].

    Receive size autotuning (DRS) is indeed more precise and converges
    faster to optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.

    This choice will allow us to upstream the 1 usec TS option as
    discussed in IETF 97.
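
    The resolution arithmetic above can be sketched as follows. The helper
    names are hypothetical and are not the kernel's own functions; the sketch
    only assumes a 64-bit microsecond clock from which a 32-bit millisecond
    timestamp is derived.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000u
#define USEC_PER_MSEC 1000u

/* Resolution of a jiffies-based clock in ms: 4 for HZ=250, 10 for HZ=100. */
static unsigned int jiffy_resolution_ms(unsigned int hz)
{
    assert(hz > 0 && hz <= 1000);
    return 1000u / hz;
}

/* Convert a nanosecond clock reading to the 1 usec clock. */
static uint64_t clock_ns_to_us(uint64_t ns)
{
    return ns / NSEC_PER_USEC;
}

/* Derive a 32-bit millisecond timestamp from the 64-bit usec clock. */
static uint32_t clock_us_to_ts_ms(uint64_t us)
{
    return (uint32_t)(us / USEC_PER_MSEC);
}

/* Elapsed time between two usec stamps, e.g. for an RTT sample. */
static uint64_t stamp_delta_us(uint64_t later, uint64_t earlier)
{
    assert(later >= earlier);
    return later - earlier;
}
```

    A 300 usec interval is visible in stamp_delta_us() but rounds down to
    0 ms, which is why a finer clock helps on low-latency paths.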

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 May, 2017

1 commit

  • The whole point of randomization was to hide server uptime, but an attacker
    can simply start a SYN flood and TCP generates 'old style' timestamps,
    directly revealing the server jiffies value.

    Also, the TSval sent by the server to a particular remote address varies
    depending on whether syncookies are being sent, potentially triggering PAWS
    drops for innocent clients.

    Let's implement proper randomization, including for syncookies.

    Also we do not need to export sysctl_tcp_timestamps, since it is not
    used from a module.

    In v2, I added Florian's feedback and contribution, adding tsoff to
    tcp_get_cookie_sock().

    v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

    Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
    Signed-off-by: Eric Dumazet
    Reviewed-by: Florian Westphal
    Tested-by: Florian Westphal
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jan, 2017

1 commit

  • SHA1 is slower and less secure than SipHash, and so replacing syncookie
    generation with SipHash makes natural sense. Some BSDs have been doing
    this for several years in fact.

    The speedup should be similar to, and even more impressive than, the
    speedup from the sequence number fix in this series.

    Signed-off-by: Jason A. Donenfeld
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

03 Dec, 2016

1 commit

  • Jiffies-based timestamps allow for easy inference of the number of devices
    behind NAT translators and also make tracking of hosts simpler.

    commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
    added the main infrastructure that is needed for per-connection ts
    randomization, in particular writing/reading the on-wire tcp header
    format takes the offset into account so rest of stack can use normal
    tcp_time_stamp (jiffies).

    So only two items are left:
    - add a tsoffset for request sockets
    - extend the tcp isn generator to also return another 32bit number
    in addition to the ISN.

    Re-use of ISN generator also means timestamps are still monotonically
    increasing for same connection quadruple, i.e. PAWS will still work.
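
    A minimal sketch of the per-connection offset idea, under stated
    assumptions: toy_ts_offset() is a hypothetical toy mixer standing in for
    the keyed ISN generator (it is NOT cryptographically secure and is not
    the kernel's function), and the other helper names are illustrative too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Connection quadruple; all names here are illustrative. */
struct conn_id {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Toy keyed mixer: same quadruple and secret always give the same offset. */
static uint32_t toy_ts_offset(const struct conn_id *id, uint32_t secret)
{
    uint32_t h = secret ^ 0x9e3779b9u;

    assert(id != 0);
    h = (h ^ id->saddr) * 0x85ebca6bu;
    h = (h ^ id->daddr) * 0xc2b2ae35u;
    h = (h ^ (((uint32_t)id->sport << 16) | id->dport)) * 0x27d4eb2fu;
    return h ^ (h >> 15);
}

/* Value written into the TSval field: shared clock plus per-flow offset. */
static uint32_t tsval_on_wire(uint32_t clock, uint32_t tsoff)
{
    return clock + tsoff; /* wraps modulo 2^32 */
}

/* Undo the offset on an echoed timestamp to get back the local clock. */
static uint32_t tsecr_to_local(uint32_t tsecr, uint32_t tsoff)
{
    return tsecr - tsoff;
}

/* Wrap-safe "a is not older than b" comparison, as a PAWS-style check needs. */
static bool ts_after_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}
```

    Because the offset is constant for a given quadruple, on-wire timestamps
    keep advancing with the clock for that flow, while two different flows
    no longer expose the same raw clock value.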

    Includes fixes from Eric Dumazet.

    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Florian Westphal
     

05 Nov, 2016

1 commit

  • - Use the UID in routing lookups made by protocol connect() and
    sendmsg() functions.
    - Make sure that routing lookups triggered by incoming packets
    (e.g., Path MTU discovery) take the UID of the socket into
    account.
    - For packets not associated with a userspace socket, (e.g., ping
    replies) use UID 0 inside the user namespace corresponding to
    the network namespace the socket belongs to. This allows
    all namespaces to apply routing and iptables rules to
    kernel-originated traffic in that namespace by matching UID 0.
    This is better than using the UID of the kernel socket that is
    sending the traffic, because the UID of kernel sockets created
    at namespace creation time (e.g., the per-processor ICMP and
    TCP sockets) is the UID of the user that created the socket,
    which might not be mapped in the namespace.

    Tested: compiles allnoconfig, allyesconfig, allmodconfig
    Tested: https://android-review.googlesource.com/253302
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

28 Apr, 2016

1 commit


20 Mar, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and array maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a message-based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address list for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
     

16 Mar, 2016

1 commit

  • $ make tags
    GEN tags
    ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
    ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
    ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
    ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
    ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
    ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
    ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

    Which are all the result of the DEFINE_PER_CPU pattern:

    scripts/tags.sh:200: '/\
    Acked-by: David S. Miller
    Acked-by: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

08 Feb, 2016

1 commit


19 Dec, 2015

1 commit

  • Allow accepted sockets to derive their sk_bound_dev_if setting from the
    l3mdev domain in which the packets originated. A sysctl setting is added
    to control the behavior which is similar to sk_mark and
    sysctl_tcp_fwmark_accept.

    This effectively allows a process to have a "VRF-global" listen socket,
    with child sockets bound to the VRF device in which the packet originated.
    A similar behavior can be achieved using sk_mark, but a solution using marks
    is incomplete as it does not handle duplicate addresses in different L3
    domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
    domain provides a complete solution.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.
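
    The "only one must win" rule can be sketched with a single hash bucket.
    This is a simplified, single-threaded illustration with hypothetical
    names: in the real code the bucket lock is held around the unlink and
    insert, which is assumed here rather than modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative singly linked hash bucket. */
struct node {
    struct node *next;
};

struct bucket {
    struct node *head;
};

static void bucket_add(struct bucket *b, struct node *n)
{
    n->next = b->head;
    b->head = n;
}

/* Unlink n from the bucket; returns false if it was not there. */
static bool bucket_del(struct bucket *b, struct node *n)
{
    struct node **pp;

    for (pp = &b->head; *pp; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            return true;
        }
    }
    return false;
}

/*
 * Replace the request with the new child. The caller is assumed to hold
 * the bucket lock. A false return means another cpu already consumed the
 * request, so the caller has to undo the child creation.
 */
static bool swap_request_for_child(struct bucket *b, struct node *req,
                                   struct node *child)
{
    assert(req != child);
    if (!bucket_del(b, req))
        return false;
    bucket_add(b, child);
    return true;
}
```

    The first caller finds the request and installs its child; a second
    caller racing on a duplicate ACK gets false and discards its own child.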

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2015

1 commit


11 Oct, 2015

1 commit

  • Before recent TCP listener patches, we were updating listener
    sk->sk_rxhash before the cloning of master socket.

    Children's sk_rxhash was therefore correct after the normal 3WHS.

    But with the lockless listener, we no longer dirty/change listener sk_rxhash
    as it would be racy.

    We need to correctly update the child sk_rxhash, otherwise the first data
    packet won't hit the correct cpu if RFS is used.

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Cc: Tom Herbert
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Oct, 2015

1 commit

  • inet_reqsk_alloc() is used to allocate a temporary request
    in order to generate a SYNACK with a cookie. Then later,
    syncookie validation also uses a temporary request.

    These paths already took a reference on listener refcount,
    so we can avoid a couple of atomic operations.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Oct, 2015

1 commit

  • In this patch, we insert request sockets into TCP/DCCP
    regular ehash table (where ESTABLISHED and TIMEWAIT sockets
    are) instead of using the per listener hash table.

    ACK packets find SYN_RECV pseudo sockets without having
    to find and lock the listener.

    In nominal conditions, this halves pressure on listener lock.

    Note that this will allow for SO_REUSEPORT refinements,
    so that we can select a listener using cpu/numa affinities instead
    of the prior 'consistent hash', since only SYN packets will
    apply this selection logic.

    We will shrink listen_sock in the following patch to ease
    code review.

    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2015

1 commit


22 Sep, 2015

1 commit

  • Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
    RTT is often measured as 0ms or sometimes 1ms, which would affect
    RTT estimation and min RTT sampling used by some congestion control.

    This patch improves SYN/ACK RTT to be usec resolution if platform
    supports it. While the timestamping of SYN/ACK is done in request
    sock, the RTT measurement is carefully arranged to avoid storing
    another u64 timestamp in tcp_sock.

    For regular handshake w/o SYNACK retransmission, the RTT is sampled
    right after the child socket is created and right before the request
    sock is released (tcp_check_req() in tcp_minisocks.c)

    For Fast Open the child socket is already created when SYN/ACK was
    sent, the RTT is sampled in tcp_rcv_state_process() after processing
    the final ACK and right before the request socket is released.

    If the SYN/ACK was retransmitted or SYN-cookie was used, we rely
    on TCP timestamps to measure the RTT. The sample is taken at the
    same place in tcp_rcv_state_process() after the timestamp values
    are validated in tcp_validate_incoming(). Note that we do not store
    TS echo value in request_sock for SYN-cookies, because the value
    is already stored in tp->rx_opt used by tcp_ack_update_rtt().

    One side benefit is that the RTT measurement now happens before
    initializing congestion control (of the passive side). Therefore
    the congestion control can use the SYN/ACK RTT.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

08 Jun, 2015

1 commit


25 Mar, 2015

1 commit

  • ss should display ipv4 mapped request sockets like this :

    tcp SYN-RECV 0 0 ::ffff:192.168.0.1:8080 ::ffff:192.0.2.1:35261

    and not like this :

    tcp SYN-RECV 0 0 192.168.0.1:8080 192.0.2.1:35261

    We should init ireq->ireq_family based on listener sk_family,
    not the actual protocol carried by SYN packet.

    This means we can set ireq_family in inet_reqsk_alloc()
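
    The difference between the two displays can be reproduced in userspace.
    The sketch below uses a hypothetical helper, format_peer(), to render the
    same IPv4 peer address as a socket of the listener's family would show
    it. It relies only on the standard inet_pton()/inet_ntop() calls and is
    not how ss itself is implemented.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

/* Format an IPv4 peer address according to the listener's address family. */
static bool format_peer(int listener_family, const char *peer_v4,
                        char *buf, socklen_t len)
{
    struct in_addr a4;
    struct in6_addr a6;

    assert(listener_family == AF_INET || listener_family == AF_INET6);
    if (inet_pton(AF_INET, peer_v4, &a4) != 1)
        return false;

    if (listener_family == AF_INET)
        return inet_ntop(AF_INET, &a4, buf, len) != NULL;

    /* AF_INET6 listener: build the IPv4-mapped address ::ffff:a.b.c.d */
    memset(&a6, 0, sizeof(a6));
    a6.s6_addr[10] = 0xff;
    a6.s6_addr[11] = 0xff;
    memcpy(&a6.s6_addr[12], &a4, sizeof(a4));
    return inet_ntop(AF_INET6, &a6, buf, len) != NULL;
}
```

    Keying the choice on the listener family, rather than on the protocol of
    the incoming SYN, is what keeps a request socket consistent with the
    socket it will eventually become.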

    Fixes: 3f66b083a5b7 ("inet: introduce ireq_family")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Mar, 2015

1 commit

  • One of the major issues for TCP is the SYNACK rtx handling,
    done by inet_csk_reqsk_queue_prune(), fired by the keepalive
    timer of a TCP_LISTEN socket.

    This function runs for awfully long times, with socket lock held,
    meaning that other cpus needing this lock have to spin for hundreds of ms.

    SYNACK are sent in huge bursts, likely to cause severe drops anyway.

    This model was OK 15 years ago when memory was very tight.

    We now can afford to have a timer per request sock.

    Timer invocations no longer need to lock the listener,
    and can be run from all cpus in parallel.

    With following patch increasing somaxconn width to 32 bits,
    I tested a listener with more than 4 million active request sockets,
    and a steady SYNFLOOD of ~200,000 SYN per second.
    Host was sending ~830,000 SYNACK per second.

    This is ~100 times more than what we could achieve before this patch.

    Later, we will get rid of the listener hash and use ehash instead.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Mar, 2015

1 commit


18 Mar, 2015

3 commits

  • While testing last patch series, I found req sock refcounting was wrong.

    We must set skc_refcnt to 1 for all request socks added in hashes,
    but also on request sockets created by FastOpen or syncookies.

    It is tricky because we need to defer this initialization so that
    future RCU lookups do not try to take a refcount on a not yet
    fully initialized request socket.

    Also get rid of ireq_refcnt alias.

    Signed-off-by: Eric Dumazet
    Fixes: 13854e5a6046 ("inet: add proper refcounting to request sock")
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The listener field in struct tcp_request_sock is a pointer
    back to the listener. We now have req->rsk_listener, so TCP
    only needs one boolean and not a full pointer.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • listener socket can be used to set net pointer, and will
    be later used to hold a reference on listener.

    Add a const qualifier to first argument (struct request_sock_ops *),
    and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Mar, 2015

1 commit

  • reqsk_put() is the generic function that should be used
    to release a refcount (and automatically call reqsk_free())

    reqsk_free() might be called if refcount is known to be 0
    or undefined.

    refcnt is set to one in inet_csk_reqsk_queue_add()

    As request socks are not yet in global ehash table,
    I added temporary debugging checks in reqsk_put() and reqsk_free()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Mar, 2015

1 commit

  • Once request socks are in the ehash table, they will need to have
    a valid ir_iif field.

    This is currently true only for IPv6. This patch extends support
    for IPv4 as well.

    This means inet_diag_fill_req() can now properly use ir_iif,
    which is better for IPv6 link locals anyway, as request sockets
    and established sockets will propagate consistent netlink idiag_if.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Mar, 2015

1 commit


12 Mar, 2015

2 commits

  • I forgot to use write_pnet() in three locations.

    Signed-off-by: Eric Dumazet
    Fixes: 33cf7c90fe2f9 ("net: add real socket cookies")
    Reported-by: kbuild test robot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • A long standing problem in netlink socket dumps is the use
    of kernel socket addresses as cookies.

    1) It is a security concern.

    2) Sockets can be reused quite quickly, so there is
    no guarantee that a cookie is used only once and identifies
    a single flow.

    3) The request sock, established sock, and timewait sock
    for a given flow have different cookies.

    Part of our effort to bring better TCP statistics requires
    switching to a different allocator.

    In this patch, I chose to use a per network namespace 64bit generator,
    and to use it only in the case a socket needs to be dumped to netlink.
    (This might be refined later if needed)

    Note that I tried to carry cookies from the request sock to the
    established sock, and then to the timewait socket.

    Signed-off-by: Eric Dumazet
    Cc: Eric Salo
    Signed-off-by: David S. Miller

    Eric Dumazet
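
The lazy, per-namespace cookie allocation described above might look roughly like this in userspace. It is a sketch under assumptions: `toy_netns`, `toy_sock`, `toy_sock_gen_cookie()` and `toy_sock_inherit_cookie()` are hypothetical names, and a cookie value of 0 is assumed to mean "not assigned yet".

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical network namespace holding the 64-bit generator. */
struct toy_netns {
    atomic_uint_fast64_t cookie_gen;
};

/* Hypothetical socket; cookie == 0 means "no cookie assigned yet". */
struct toy_sock {
    atomic_uint_fast64_t cookie;
    struct toy_netns    *net;
};

/* Assign a cookie on first use only, e.g. when the socket is dumped. */
static uint64_t toy_sock_gen_cookie(struct toy_sock *sk)
{
    for (;;) {
        uint_fast64_t res = atomic_load(&sk->cookie);
        uint_fast64_t expected = 0;

        if (res)
            return res;
        /* Draw the next value from the per-namespace generator. */
        res = atomic_fetch_add(&sk->net->cookie_gen, 1) + 1;
        /* If another thread won the race, its value is kept. */
        atomic_compare_exchange_strong(&sk->cookie, &expected, res);
    }
}

/* Carry the cookie along when a flow moves to its next socket type. */
static void toy_sock_inherit_cookie(struct toy_sock *child,
                                    struct toy_sock *parent)
{
    atomic_store(&child->cookie, atomic_load(&parent->cookie));
}
```

Because the cookie no longer derives from a kernel address, it leaks nothing about memory layout, and a flow keeps the same identifier as it moves from one socket type to the next.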
     

05 Nov, 2014

3 commits

  • This patch allows setting ECN on a per-route basis in case the sysctl
    tcp_ecn is not set to 1. In other words, when ECN is set for specific
    routes, it provides a tcp_ecn=1 behaviour for that route while the rest
    of the stack acts according to the global settings.

    One can use 'ip route change dev $dev $net features ecn' to toggle this.

    Having a more fine-grained per-route setting can be beneficial for various
    reasons, for example, 1) within data centers, or 2) local ISPs may deploy
    ECN support for their own video/streaming services [1], etc.

    A recent measurement study/paper [2] scanned Alexa's publicly
    available top million websites list from vantage points in the US,
    Europe and Asia:

    Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
    attributable to commit 255cac91c3 ("tcp: extend ECN sysctl to allow
    server-side only ECN") ;)); a break in on-path connectivity was found
    in about 1 in 10,000 cases. Timeouts rather than receiving back RSTs
    were much more common in the negotiation phase (and mostly seen in the
    Alexa middle band, ranks around 50k-150k): of the 12 thousand hosts on
    which there _may_ be ECN-linked connection failures, only 79 failed with
    RST while _not_ failing with RST when ECN is not requested.

    It's unclear though, how much equipment in the wild actually marks CE
    when buffers start to fill up.

    We thought about a fallback to non-ECN for retransmitted SYNs as another
    global option (which could perhaps one day be made default), but as Eric
    points out, there's much more work needed to detect broken middleboxes.

    Two examples Eric mentioned are buggy firewalls that accept only a single
    SYN per flow, and middleboxes that successfully let an ECN flow establish,
    but later mark CE for all packets (so cwnd converges to 1).

    [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
    [2] http://ecn.ethz.ch/

    Joint work with Daniel Borkmann.

    Reference: http://thread.gmane.org/gmane.linux.network/335797
    Suggested-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
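
The per-route decision for an outgoing SYN can be summarized as a small predicate. This is a sketch only: `toy_syn_wants_ecn()` is a hypothetical helper, and the route's ECN feature is reduced to a plain boolean rather than a real dst metric lookup.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: should an outgoing SYN request ECN?
 * sysctl_tcp_ecn models the global tcp_ecn setting;
 * route_has_ecn_feature models "features ecn" on the route.
 */
static bool toy_syn_wants_ecn(int sysctl_tcp_ecn, bool route_has_ecn_feature)
{
    /* The global setting 1 always requests ECN; otherwise the route decides. */
    return sysctl_tcp_ecn == 1 || route_has_ecn_feature;
}
```

With the global sysctl at 0 or 2, only routes that carry the ECN feature request ECN, which matches the "tcp_ecn=1 behaviour for that route" wording above.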
     
  • The function cookie_check_timestamp(), called from both IPv4 and IPv6 context,
    is being used to decode the echoed timestamp from the SYN/ACK into TCP
    options used for follow-up communication with the peer.

    We can remove ECN handling from that function, split it into a separate
    one, and simply rename the original function into cookie_decode_options().
    cookie_decode_options() just fills in tcp_option struct based on the
    echoed timestamp received from the peer. Anything that fails in this
    function will actually discard the request socket.

    While this is the natural place for decoding options such as ECN, which
    commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
    that ECN in particular can be checked at a later point in time: a failed
    ECN check does not require dropping the request sock, only turning
    ECN support off.

    Therefore, we split this functionality into cookie_ecn_ok(), which tells
    us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.

    This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
    won't be enough anymore at that point; if the timestamp indicates ECN
    and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.

    This would mean adding a route lookup to cookie_check_timestamp(), which
    we definitely want to avoid. As we already do a route lookup at a later
    point in cookie_{v4,v6}_check(), we can simply make use of that as well
    for the new cookie_ecn_ok() function w/o any additional cost.

    Joint work with Daniel Borkmann.

    Acked-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
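
The check order described above can be sketched as a standalone function. The names are hypothetical (`toy_cookie_ecn_ok()`, `TOY_TS_OPT_ECN`), the bit position of the ECN flag in the echoed timestamp is an assumption made for illustration, and the route metric is reduced to a boolean that the caller would obtain from the route lookup it already performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the ECN flag in the echoed timestamp (illustrative). */
#define TOY_TS_OPT_ECN (1u << 5)

/*
 * Hypothetical model of the ECN decision for a syncookie-based request.
 * A false result only disables ECN; the request itself is not dropped.
 */
static bool toy_cookie_ecn_ok(uint32_t tsecr, int sysctl_tcp_ecn,
                              bool route_has_ecn_feature)
{
    /* The peer never negotiated ECN: nothing to enable. */
    if (!(tsecr & TOY_TS_OPT_ECN))
        return false;

    /* Global sysctl enabled: accept. */
    if (sysctl_tcp_ecn)
        return true;

    /* Sysctl is 0: fall back to the per-route ECN feature. */
    return route_has_ecn_feature;
}
```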
     
  • Was a bit more difficult to read than needed due to magic shifts;
    add defines and document the used encoding scheme.

    Joint work with Daniel Borkmann.

    Acked-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
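
To illustrate what replacing magic shifts with named defines can look like, here is a minimal sketch of packing TCP options into the low bits of a timestamp. The layout (4 bits of window scale, one SACK bit, one ECN bit, with the all-ones window scale value meaning "no window scaling") and all `TOY_` names are assumptions for this example; the adjustment that keeps the encoded timestamp from lying in the future is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the low timestamp bits (illustrative only). */
#define TOY_OPT_WSCALE_MASK 0xfu        /* bits 0-3: window scale */
#define TOY_OPT_SACK        (1u << 4)   /* bit 4: SACK permitted */
#define TOY_OPT_ECN         (1u << 5)   /* bit 5: ECN negotiated */
#define TOY_TSBITS          6
#define TOY_TSMASK          ((1u << TOY_TSBITS) - 1)

struct toy_opts {
    uint8_t wscale;
    bool    wscale_ok;
    bool    sack_ok;
    bool    ecn_ok;
};

/* Replace the low TOY_TSBITS bits of 'now' with the encoded options. */
static uint32_t toy_encode_ts(uint32_t now, const struct toy_opts *o)
{
    uint32_t bits = o->wscale_ok ? (o->wscale & TOY_OPT_WSCALE_MASK)
                                 : TOY_OPT_WSCALE_MASK;

    if (o->sack_ok)
        bits |= TOY_OPT_SACK;
    if (o->ecn_ok)
        bits |= TOY_OPT_ECN;
    return (now & ~TOY_TSMASK) | bits;
}

/* Recover the options from an echoed timestamp. */
static void toy_decode_ts(uint32_t tsecr, struct toy_opts *o)
{
    uint32_t bits = tsecr & TOY_TSMASK;

    o->sack_ok   = (bits & TOY_OPT_SACK) != 0;
    o->ecn_ok    = (bits & TOY_OPT_ECN) != 0;
    o->wscale_ok = (bits & TOY_OPT_WSCALE_MASK) != TOY_OPT_WSCALE_MASK;
    o->wscale    = o->wscale_ok ? (uint8_t)(bits & TOY_OPT_WSCALE_MASK) : 0;
}
```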