18 Oct, 2018

1 commit

  • [ Upstream commit 2ab2ddd301a22ca3c5f0b743593e4ad2953dfa53 ]

    Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of the ireq_opt_deref() helper: it hides the logic
    without real benefit, now that it is a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

15 Sep, 2018

1 commit

  • [ Upstream commit 431280eebed9f5079553daf003011097763e71fd ]

    tcp uses per-cpu (and per namespace) sockets (net->ipv4.tcp_sk) internally
    to send some control packets.

    1) RST packets, through tcp_v4_send_reset()
    2) ACK packets in SYN-RECV and TIME-WAIT state, through tcp_v4_send_ack()

    These packets assert IP_DF, and also use the hashed IP ident generator
    to provide an IPv4 ID number.

    Geoff Alexander reported this could be used to build off-path attacks.

    These packets should not be fragmented, since their size is smaller than
    IPV4_MIN_MTU. Only some tunneled paths could eventually have to fragment,
    regardless of inner IPID.

    We really can use zero IPID, to address the flaw, and as a bonus,
    avoid a couple of atomic operations in ip_idents_reserve()

    Signed-off-by: Eric Dumazet
    Reported-by: Geoff Alexander
    Tested-by: Geoff Alexander
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
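
A minimal userspace sketch of the idea in the commit above, not the kernel code. The `pick_ip_id()` helper and the plain counter standing in for the hashed ident generator are assumptions made for illustration: control packets that assert DF get an ID of zero and never touch the shared counter.

```c
#include <stdbool.h>
#include <stdint.h>

#define IPV4_DF 0x4000u /* Don't Fragment bit in the flags/fragment-offset field */

struct ipv4_ident {
    uint16_t id;       /* value for the IPv4 Identification field */
    uint16_t frag_off; /* flags and fragment offset, host byte order */
};

/*
 * Hypothetical helper, not a kernel function. "shared_counter" stands in
 * for the shared ident generator that the commit avoids touching.
 */
struct ipv4_ident pick_ip_id(bool control_packet, uint32_t *shared_counter)
{
    struct ipv4_ident out = { .id = 0, .frag_off = IPV4_DF };

    if (!control_packet) {
        /* Ordinary traffic in this sketch still consumes an ID. */
        *shared_counter += 1;
        out.id = (uint16_t)*shared_counter;
    }
    /* Control packets (RST, SYN-RECV/TIME-WAIT ACK) keep id == 0. */
    return out;
}
```

Because the control-packet path returns before reading the counter, an off-path observer learns nothing about the generator state from these packets.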
     

26 Jun, 2018

1 commit

  • [ Upstream commit 4fd44a98ffe0d048246efef67ed640fdf2098a62 ]

    commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
    table") introduced an optimization for the handling of child sockets
    created for a new TCP connection.

    But this optimization passes any data associated with the last ACK of the
    connection handshake up the stack without verifying its checksum, because it
    calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
    directly. These lower-level processing functions do not do any checksum
    verification.

    Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
    fix this.

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Frank van der Linden
    Signed-off-by: Eric Dumazet
    Tested-by: Balbir Singh
    Reviewed-by: Balbir Singh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Frank van der Linden
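
To make the missing step concrete, here is a small userspace sketch of TCP-over-IPv4 checksum verification (the standard Internet checksum over pseudo-header plus segment). The function names `tcp4_checksum()` and `tcp4_checksum_ok()` are hypothetical and do not mirror the kernel's `tcp_checksum_complete()` implementation; the sketch only shows the kind of check that has to happen before data is passed up the stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Add big-endian 16-bit words of a buffer to a running sum. */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t acc)
{
    while (len > 1) {
        acc += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        acc += (uint32_t)p[0] << 8; /* odd trailing byte, zero padded */
    return acc;
}

/*
 * Checksum over the IPv4 pseudo-header and a TCP segment.
 * "len" is the TCP header plus payload length and must fit in 16 bits.
 */
uint16_t tcp4_checksum(const uint8_t saddr[4], const uint8_t daddr[4],
                       const uint8_t *seg, size_t len)
{
    uint32_t acc = 0;

    acc = sum16(saddr, 4, acc);
    acc = sum16(daddr, 4, acc);
    acc += 6;             /* protocol number for TCP */
    acc += (uint32_t)len; /* TCP length field of the pseudo-header */
    acc = sum16(seg, len, acc);

    while (acc >> 16)
        acc = (acc & 0xffffu) + (acc >> 16);
    return (uint16_t)~acc;
}

/* A segment whose checksum field is already filled in verifies to zero. */
bool tcp4_checksum_ok(const uint8_t saddr[4], const uint8_t daddr[4],
                      const uint8_t *seg, size_t len)
{
    return tcp4_checksum(saddr, daddr, seg, len) == 0;
}
```

A receiver following this sketch would drop the segment when `tcp4_checksum_ok()` returns false instead of handing its payload to the child socket.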
     

03 Jan, 2018

1 commit

  • [ Upstream commit 30791ac41927ebd3e75486f9504b6d2280463bf0 ]

    The MD5-key that belongs to a connection is identified by the peer's
    IP-address. When we are in tcp_v4(6)_reqsk_send_ack(), we are replying
    to an incoming segment from tcp_check_req() that failed the seq-number
    checks.

    Thus, to find the correct key, we need to use the skb's saddr and not
    the daddr.

    This bug seems to have been there for quite a while, but probably went
    unnoticed because the consequences are not catastrophic. We will call
    tcp_v4_reqsk_send_ack only to send a challenge-ACK back to the peer,
    thus the connection doesn't really fail.

    Fixes: 9501f9722922 ("tcp md5sig: Let the caller pass appropriate key for tcp_v{4,6}_do_calc_md5_hash().")
    Signed-off-by: Christoph Paasch
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Christoph Paasch
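
The addressing point can be illustrated with a short sketch. The key table, `md5_lookup()` and `md5_key_for_reply()` are hypothetical stand-ins rather than kernel functions: when replying to an incoming segment, the peer is that segment's source address, so the lookup must use `saddr`.

```c
#include <stddef.h>
#include <stdint.h>

struct md5_key {
    uint32_t peer_addr; /* address of the remote peer the key belongs to */
    const char *secret;
};

struct segment_addrs {
    uint32_t saddr; /* source address of the incoming segment */
    uint32_t daddr; /* destination address, i.e. the local end */
};

/* Linear search of a small key table by peer address. */
const struct md5_key *md5_lookup(const struct md5_key *keys, size_t n,
                                 uint32_t peer_addr)
{
    for (size_t i = 0; i < n; i++)
        if (keys[i].peer_addr == peer_addr)
            return &keys[i];
    return NULL;
}

/* Key for a reply to "in": the peer is the sender of "in". */
const struct md5_key *md5_key_for_reply(const struct md5_key *keys, size_t n,
                                        const struct segment_addrs *in)
{
    return md5_lookup(keys, n, in->saddr);
}
```

Looking up by `in->daddr` instead would search for a key belonging to the local address, which normally does not exist in the table.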
     

17 Dec, 2017

1 commit

  • [ Upstream commit eeea10b83a139451130df1594f26710c8fa390c8 ]

    James Morris reported kernel stack corruption bug [1] while
    running the SELinux testsuite, and bisected to a recent
    commit bffa72cf7f9d ("net: sk_buff rbnode reorg")

    We believe this commit is fine, but exposes an older bug.

    SELinux code runs from tcp_filter() and might send an ICMP,
    expecting IP options to be found in skb->cb[] using regular IPCB placement.

    We need to defer TCP mangling of skb->cb[] until after tcp_filter() calls.

    This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very
    similar way we added them for IPv6.

    [1]
    [ 339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet
    [ 339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81745af5
    [ 339.822505]
    [ 339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15
    [ 339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A 01/19/2017
    [ 339.885060] Call Trace:
    [ 339.896875]
    [ 339.908103] dump_stack+0x63/0x87
    [ 339.920645] panic+0xe8/0x248
    [ 339.932668] ? ip_push_pending_frames+0x33/0x40
    [ 339.946328] ? icmp_send+0x525/0x530
    [ 339.958861] ? kfree_skbmem+0x60/0x70
    [ 339.971431] __stack_chk_fail+0x1b/0x20
    [ 339.984049] icmp_send+0x525/0x530
    [ 339.996205] ? netlbl_skbuff_err+0x36/0x40
    [ 340.008997] ? selinux_netlbl_err+0x11/0x20
    [ 340.021816] ? selinux_socket_sock_rcv_skb+0x211/0x230
    [ 340.035529] ? security_sock_rcv_skb+0x3b/0x50
    [ 340.048471] ? sk_filter_trim_cap+0x44/0x1c0
    [ 340.061246] ? tcp_v4_inbound_md5_hash+0x69/0x1b0
    [ 340.074562] ? tcp_filter+0x2c/0x40
    [ 340.086400] ? tcp_v4_rcv+0x820/0xa20
    [ 340.098329] ? ip_local_deliver_finish+0x71/0x1a0
    [ 340.111279] ? ip_local_deliver+0x6f/0xe0
    [ 340.123535] ? ip_rcv_finish+0x3a0/0x3a0
    [ 340.135523] ? ip_rcv_finish+0xdb/0x3a0
    [ 340.147442] ? ip_rcv+0x27c/0x3c0
    [ 340.158668] ? inet_del_offload+0x40/0x40
    [ 340.170580] ? __netif_receive_skb_core+0x4ac/0x900
    [ 340.183285] ? rcu_accelerate_cbs+0x5b/0x80
    [ 340.195282] ? __netif_receive_skb+0x18/0x60
    [ 340.207288] ? process_backlog+0x95/0x140
    [ 340.218948] ? net_rx_action+0x26c/0x3b0
    [ 340.230416] ? __do_softirq+0xc9/0x26a
    [ 340.241625] ? do_softirq_own_stack+0x2a/0x40
    [ 340.253368]
    [ 340.262673] ? do_softirq+0x50/0x60
    [ 340.273450] ? __local_bh_enable_ip+0x57/0x60
    [ 340.285045] ? ip_finish_output2+0x175/0x350
    [ 340.296403] ? ip_finish_output+0x127/0x1d0
    [ 340.307665] ? nf_hook_slow+0x3c/0xb0
    [ 340.318230] ? ip_output+0x72/0xe0
    [ 340.328524] ? ip_fragment.constprop.54+0x80/0x80
    [ 340.340070] ? ip_local_out+0x35/0x40
    [ 340.350497] ? ip_queue_xmit+0x15c/0x3f0
    [ 340.361060] ? __kmalloc_reserve.isra.40+0x31/0x90
    [ 340.372484] ? __skb_clone+0x2e/0x130
    [ 340.382633] ? tcp_transmit_skb+0x558/0xa10
    [ 340.393262] ? tcp_connect+0x938/0xad0
    [ 340.403370] ? ktime_get_with_offset+0x4c/0xb0
    [ 340.414206] ? tcp_v4_connect+0x457/0x4e0
    [ 340.424471] ? __inet_stream_connect+0xb3/0x300
    [ 340.435195] ? inet_stream_connect+0x3b/0x60
    [ 340.445607] ? SYSC_connect+0xd9/0x110
    [ 340.455455] ? __audit_syscall_entry+0xaf/0x100
    [ 340.466112] ? syscall_trace_enter+0x1d0/0x2b0
    [ 340.476636] ? __audit_syscall_exit+0x209/0x290
    [ 340.487151] ? SyS_connect+0xe/0x10
    [ 340.496453] ? do_syscall_64+0x67/0x1b0
    [ 340.506078] ? entry_SYSCALL64_slow_path+0x25/0x25

    Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
    Signed-off-by: Eric Dumazet
    Reported-by: James Morris
    Tested-by: James Morris
    Tested-by: Casey Schaufler
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
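
The fill/restore pattern can be sketched in userspace as follows. The structures below are simplified assumptions and do not match the real `IPCB`/`TCP_SKB_CB` layouts; the point is only that the IP view and the TCP view share the same storage, so the IP data must be moved aside before TCP fields are written and moved back before any IP-layer code reads it again.

```c
#include <stdint.h>

/* Simplified stand-in for the IP-layer control block. */
struct ip_cb {
    uint8_t opt[8];
};

/* Simplified stand-in for the TCP control block, with room to park the IP view. */
struct tcp_cb {
    uint32_t seq;
    uint32_t end_seq;
    struct ip_cb saved_ip;
};

struct packet {
    union {
        struct ip_cb ip;   /* layout expected by IP-level code */
        struct tcp_cb tcp; /* layout expected by TCP-level code */
    } cb;
    uint32_t seq;
    uint32_t len;
};

/* Switch the control block to the TCP layout, preserving the IP data. */
void fill_cb(struct packet *p)
{
    struct ip_cb tmp = p->cb.ip; /* copy first: the two views overlap */

    p->cb.tcp.saved_ip = tmp;
    p->cb.tcp.seq = p->seq;
    p->cb.tcp.end_seq = p->seq + p->len;
}

/* Switch back so IP-level code finds its data where it expects it. */
void restore_cb(struct packet *p)
{
    struct ip_cb tmp = p->cb.tcp.saved_ip;

    p->cb.ip = tmp;
}
```

In this sketch, any filter that may hand the packet back to IP-level code has to run either before `fill_cb()` or after `restore_cb()`.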
     

26 Oct, 2017

1 commit

  • In my first attempt to fix the lockdep splat, I forgot we could
    enter inet_csk_route_req() with a freshly allocated request socket,
    for which refcount has not yet been elevated, due to complex
    SLAB_TYPESAFE_BY_RCU rules.

    We either are in rcu_read_lock() section _or_ we own a refcount on the
    request.

    Correct RCU verb to use here is rcu_dereference_check(), although it is
    not possible to prove we actually own a reference on a shared
    refcount :/

    In v2, I added the ireq_opt_deref() helper and used it in three places, to fix
    other possible splats.

    [ 49.844590] lockdep_rcu_suspicious+0xea/0xf3
    [ 49.846487] inet_csk_route_req+0x53/0x14d
    [ 49.848334] tcp_v4_route_req+0xe/0x10
    [ 49.850174] tcp_conn_request+0x31c/0x6a0
    [ 49.851992] ? __lock_acquire+0x614/0x822
    [ 49.854015] tcp_v4_conn_request+0x5a/0x79
    [ 49.855957] ? tcp_v4_conn_request+0x5a/0x79
    [ 49.858052] tcp_rcv_state_process+0x98/0xdcc
    [ 49.859990] ? sk_filter_trim_cap+0x2f6/0x307
    [ 49.862085] tcp_v4_do_rcv+0xfc/0x145
    [ 49.864055] ? tcp_v4_do_rcv+0xfc/0x145
    [ 49.866173] tcp_v4_rcv+0x5ab/0xaf9
    [ 49.868029] ip_local_deliver_finish+0x1af/0x2e7
    [ 49.870064] ip_local_deliver+0x1b2/0x1c5
    [ 49.871775] ? inet_del_offload+0x45/0x45
    [ 49.873916] ip_rcv_finish+0x3f7/0x471
    [ 49.875476] ip_rcv+0x3f1/0x42f
    [ 49.876991] ? ip_local_deliver_finish+0x2e7/0x2e7
    [ 49.878791] __netif_receive_skb_core+0x6d3/0x950
    [ 49.880701] ? process_backlog+0x7e/0x216
    [ 49.882589] __netif_receive_skb+0x1d/0x5e
    [ 49.884122] process_backlog+0x10c/0x216
    [ 49.885812] net_rx_action+0x147/0x3df

    Fixes: a6ca7abe53633 ("tcp/dccp: fix lockdep splat in inet_csk_route_req()")
    Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
    Signed-off-by: Eric Dumazet
    Reported-by: kernel test robot
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2017

1 commit

  • syzkaller found another bug in DCCP/TCP stacks [1]

    For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
    ireq->pktopts race"), we need to make sure we do not access
    ireq->opt unless we own the request sock.

    Note the opt field is renamed to ireq_opt to ease grep games.

    [1]
    BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

    CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    print_address_description+0x73/0x250 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x25b/0x340 mm/kasan/report.c:409
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
    ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
    tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
    tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
    __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
    tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
    tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
    tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
    tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x40c341
    RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
    RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
    RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
    R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

    Allocated by task 3295:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
    __do_kmalloc mm/slab.c:3725 [inline]
    __kmalloc+0x162/0x760 mm/slab.c:3734
    kmalloc include/linux/slab.h:498 [inline]
    tcp_v4_save_options include/net/tcp.h:1962 [inline]
    tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
    tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
    tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
    tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
    tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
    tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3306:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
    __cache_free mm/slab.c:3503 [inline]
    kfree+0xca/0x250 mm/slab.c:3820
    inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
    __sk_destruct+0xfd/0x910 net/core/sock.c:1560
    sk_destruct+0x47/0x80 net/core/sock.c:1595
    __sk_free+0x57/0x230 net/core/sock.c:1603
    sk_free+0x2a/0x40 net/core/sock.c:1614
    sock_put include/net/sock.h:1652 [inline]
    inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
    tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
    tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Oct, 2017

1 commit

  • Currently no error is emitted, but this infrastructure will be
    used by the next patch to allow source address validation
    for mcast sockets.
    Since early demux can do a route lookup and an ipv4 route
    lookup can return an error code this is consistent with the
    current ipv4 route infrastructure.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

09 Sep, 2017

1 commit

  • While the cited commit fixed a possible deadlock, it added a leak
    of the request socket, since reqsk_put() must be called if the BPF
    filter decided the ACK packet must be dropped.

    Fixes: d624d276d1dd ("tcp: fix possible deadlock in TCP stack vs BPF filter")
    Signed-off-by: Eric Dumazet
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
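
The leak described above is a missing "put" on one exit path. A minimal sketch with a plain integer reference count (the `struct request` type and `handle_final_ack()` are hypothetical, not the kernel's request socket API):

```c
#include <stdbool.h>

struct request {
    int refcnt;
};

static void req_get(struct request *r) { r->refcnt++; }
static void req_put(struct request *r) { r->refcnt--; }

/*
 * Returns true if the ACK was processed, false if the filter dropped it.
 * Every exit path releases the reference taken at the top.
 */
bool handle_final_ack(struct request *req, bool filter_drops)
{
    req_get(req); /* reference taken by the lookup */

    if (filter_drops) {
        req_put(req); /* the kind of release the fix adds */
        return false;
    }

    /* ... normal processing of the ACK would go here ... */
    req_put(req);
    return true;
}
```

Without the `req_put()` on the drop path, each filtered ACK would leave the count one higher than it should be, and the request would never be freed.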
     

24 Aug, 2017

1 commit

  • When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
    timestamp corresponding to the highest sequence number data returned.

    Previously the skb->tstamp was overwritten when a TCP packet was placed
    in the out of order queue. While the packet is in the ooo queue, save the
    timestamp in the TCP_SKB_CB. This space is shared with the gso_*
    options which are only used on the tx path, and a previously unused 4
    byte hole.

    When skbs are coalesced either in the sk_receive_queue or the
    out_of_order_queue always choose the timestamp of the appended skb to
    maintain the invariant of returning the timestamp of the last byte in
    the recvmsg buffer.

    Signed-off-by: Mike Maloney
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Mike Maloney
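
The coalescing rule can be shown with a small sketch. The `struct rx_seg` type and `try_coalesce()` function are illustrative assumptions, not the kernel's skb handling: when a segment is appended to the one before it, the merged segment takes the timestamp of the appended data, so the timestamp always describes the last byte.

```c
#include <stdbool.h>
#include <stdint.h>

struct rx_seg {
    uint32_t seq;       /* first sequence number covered */
    uint32_t end_seq;   /* one past the last sequence number covered */
    uint64_t tstamp_ns; /* receive timestamp of the newest data inside */
};

/* Append "from" to "to" if it is contiguous; returns true when merged. */
bool try_coalesce(struct rx_seg *to, const struct rx_seg *from)
{
    if (from->seq != to->end_seq)
        return false; /* not adjacent, leave both untouched */

    to->end_seq = from->end_seq;
    to->tstamp_ns = from->tstamp_ns; /* keep the appended segment's time */
    return true;
}
```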
     

16 Aug, 2017

1 commit


15 Aug, 2017

1 commit

  • Filtering the ACK packet was not put at the right place.

    At this place, we already allocated a child and put it
    into accept queue.

    We absolutely need to call tcp_child_process() to release
    its spinlock, or we will deadlock at accept() or close() time.

    Found by syzkaller team (Thanks a lot !)

    Fixes: 8fac365f63c8 ("tcp: Add a tcp_filter hook before handle ack packet")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Chenbo Feng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Aug, 2017

1 commit

  • Add a second device index, sdif, to inet socket lookups. sdif is the
    index for ingress devices enslaved to an l3mdev. It allows the lookups
    to consider the enslaved device as well as the L3 domain when searching
    for a socket.

    TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
    ingress index is obtained from IPCB using inet_sdif and after the cb move
    in tcp_v4_rcv the tcp_v4_sdif helper is used.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
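
A sketch of how a second device index can widen a bound-device check during socket lookup. The function name `bound_dev_match()` is hypothetical and the real lookup compares more fields than this; the sketch covers only the device part.

```c
#include <stdbool.h>

/*
 * bound_dev_if: device index the socket is bound to, 0 if unbound.
 * dif:          device index the packet is considered to have arrived on.
 * sdif:         second device index, 0 if there is none.
 */
bool bound_dev_match(int bound_dev_if, int dif, int sdif)
{
    return bound_dev_if == 0 ||
           bound_dev_if == dif ||
           bound_dev_if == sdif;
}
```

With only `dif` available, a socket bound to the enslaved device would not match a packet whose `dif` names the L3 domain; passing `sdif` as well lets either binding match.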
     

07 Aug, 2017

1 commit

  • __ip_options_echo() uses the current network namespace, and
    currently retrieves it via skb->dst->dev.

    This commit adds an explicit 'net' argument to __ip_options_echo()
    and updates all the call sites to provide it, usually via a simpler
    sock_net().

    After this change, __ip_options_echo() no longer needs to access
    skb->dst and we can drop a couple of hacks to preserve such
    info in the rx path.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

01 Aug, 2017

2 commits

  • Was only checked by the removed prequeue code.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • prequeue is a tcp receive optimization that moves part of rx processing
    from bh to process context.

    This only works if the socket being processed belongs to a process that
    is blocked in recv on that socket.

    In practice, this no longer happens that often because nowadays
    servers tend to use an event driven (epoll) model.

    Even normal client applications (web browsers) commonly use many tcp
    connections in parallel.

    This has measurable impact only in netperf (which uses plain recv and
    thus allows prequeue use) from host to locally running vm (~4%), however,
    there were no changes when using netperf between two physical hosts with
    ixgbe interfaces.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

25 Jul, 2017

1 commit


06 Jul, 2017

1 commit


01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
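
For readers unfamiliar with the semantics, here is a userspace approximation of an "increment unless zero" reference counter that saturates instead of wrapping, written with C11 atomics. It is modeled loosely on the behaviour described above; `struct ref` and `ref_inc_not_zero()` are hypothetical names, and the kernel's `refcount_t` implementation differs in detail.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ref {
    atomic_uint count;
};

/*
 * Take a reference only if the object is still alive (count != 0).
 * A counter that has reached UINT_MAX stays there instead of wrapping
 * to zero, which is what would otherwise open a use-after-free window.
 */
bool ref_inc_not_zero(struct ref *r)
{
    unsigned int old = atomic_load_explicit(&r->count, memory_order_relaxed);

    do {
        if (old == 0)
            return false; /* object is already being freed */
        if (old == UINT_MAX)
            return true;  /* saturated: leave the counter pinned */
    } while (!atomic_compare_exchange_weak_explicit(&r->count, &old, old + 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed));
    return true;
}
```

A plain atomic increment has neither property: it would revive a counter that has already dropped to zero and would wrap silently on overflow.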
     

22 Jun, 2017

1 commit

  • Currently in both the ipv4 and ipv6 code paths, the ACK packet received while
    the sk is in the TCP_NEW_SYN_RECV state is not filtered by the socket filter or
    cgroup filter, since it is handled from tcp_child_process and never reaches the
    tcp_filter inside tcp_v4_rcv or tcp_v6_rcv. Adding a tcp_filter hook
    here makes sure all ingress tcp packets are correctly filtered.

    Signed-off-by: Chenbo Feng
    Signed-off-by: David S. Miller

    Chenbo Feng
     

21 Jun, 2017

1 commit

  • Changing from a memcpy to per-member comparison left the
    size variable unused:

    net/ipv4/tcp_ipv4.c: In function 'tcp_md5_do_lookup':
    net/ipv4/tcp_ipv4.c:910:15: error: unused variable 'size' [-Werror=unused-variable]

    This does not show up when CONFIG_IPV6 is enabled, but the
    variable can be removed either way, along with the now unused
    assignment.

    Fixes: 6797318e623d ("tcp: md5: add an address prefix for key lookup")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

20 Jun, 2017

2 commits

  • Replace first padding in the tcp_md5sig structure with a new flag field
    and address prefix length so it can be specified when configuring a new
    key for TCP MD5 signature. The tcpm_flags field will only be used if the
    socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
    tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.

    Signed-off-by: Bob Gilligan
    Signed-off-by: Eric Mowat
    Signed-off-by: Ivan Delalande
    Signed-off-by: David S. Miller

    Ivan Delalande
     
  • This allows the keys used for TCP MD5 signature to be used for whole
    range of addresses, specified with a prefix length, instead of only one
    address as it currently is.

    Signed-off-by: Bob Gilligan
    Signed-off-by: Eric Mowat
    Signed-off-by: Ivan Delalande
    Signed-off-by: David S. Miller

    Ivan Delalande
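
Matching a peer against a key configured with a prefix length comes down to a masked comparison. The sketch below uses IPv4 addresses in host byte order; `ipv4_prefix_match()` is a hypothetical helper, not the function used by the kernel's key lookup.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when "peer" falls inside key_addr/prefixlen.
 * Addresses are in host byte order; prefixlen is 0..32.
 */
bool ipv4_prefix_match(uint32_t key_addr, uint32_t peer, unsigned int prefixlen)
{
    uint32_t mask;

    if (prefixlen > 32)
        return false; /* invalid prefix length */

    /* A shift by 32 would be undefined, so handle /0 separately. */
    mask = prefixlen ? UINT32_C(0xffffffff) << (32 - prefixlen) : 0;

    return ((key_addr ^ peer) & mask) == 0;
}
```

With a prefix length of 32 this reduces to the previous exact-address comparison, and a prefix length of 0 matches every peer.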
     

16 Jun, 2017

1 commit

  • Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
    sockets. Based on a similar infrastructure in tcp_cong. The idea is that any
    ULP can add its own logic by changing the TCP proto_ops structure to its own
    methods.

    Example usage:

    setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

    modules will call:
    tcp_register_ulp(&tcp_tls_ulp_ops);

    to register/unregister their ulp, with an init function and name.

    A list of registered ulps will be returned by tcp_get_available_ulp, which is
    hooked up to /proc. Example:

    $ cat /proc/sys/net/ipv4/tcp_available_ulp
    tls

    There is currently no functionality to remove or chain ULPs, but
    it should be possible to add these in the future if needed.

    Signed-off-by: Boris Pismenny
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

08 Jun, 2017

4 commits

  • DRAM supply shortage and poor memory pressure tracking in the TCP
    stack make any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
    limits) and tcp_mem[] quite hazardous.

    TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
    limits being hit, but only tracking number of transitions.

    If TCP stack behavior under stress was perfect :
    1) It would maintain memory usage close to the limit.
    2) Memory pressure state would be entered for short times.

    We certainly prefer 100 events lasting 10ms compared to one event
    lasting 200 seconds.

    This patch adds a new SNMP counter tracking cumulative duration of
    memory pressure events, given in ms units.

    $ cat /proc/sys/net/ipv4/tcp_mem
    3088 4117 6176
    $ grep TCP /proc/net/sockstat
    TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
    $ nstat -n ; sleep 10 ; nstat |grep Pressure
    TcpExtTCPMemoryPressures 1700
    TcpExtTCPMemoryPressuresChrono 5209

    v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
    instructed.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
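
The difference between counting transitions and accumulating duration can be sketched as follows. The `struct pressure_stats` type and the enter/leave helpers are illustrative assumptions, and the caller supplies the clock in milliseconds; this is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

struct pressure_stats {
    bool active;       /* currently under memory pressure? */
    uint64_t since_ms; /* time the current episode started */
    uint64_t events;   /* number of transitions into pressure */
    uint64_t total_ms; /* cumulative time spent under pressure */
};

void enter_pressure(struct pressure_stats *s, uint64_t now_ms)
{
    if (s->active)
        return; /* already under pressure, not a new event */
    s->active = true;
    s->since_ms = now_ms;
    s->events++;
}

void leave_pressure(struct pressure_stats *s, uint64_t now_ms)
{
    if (!s->active)
        return;
    s->active = false;
    s->total_ms += now_ms - s->since_ms;
}
```

In this sketch, 100 episodes of 10 ms give `events == 100` and `total_ms == 1000`, while a single 200 second episode gives `events == 1` and `total_ms == 200000`; only the second counter separates the two cases in a meaningful way.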
     
  • Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 May, 2017

2 commits

  • TCP Timestamps option is defined in RFC 7323

    Traditionally on linux, it has been tied to the internal
    'jiffies' variable, because it had been a cheap and good enough
    generator.

    For TCP flows on the Internet, 1 ms resolution would be much better
    than 4ms or 10ms (HZ=250 or HZ=100 respectively)

    For TCP flows in the DC, Google has used usec resolution for more
    than two years with great success [1]

    Receive size autotuning (DRS) is indeed more precise and converges
    faster to optimal window size.

    This patch converts tp->tcp_mstamp to a plain u64 value storing
    a 1 usec TCP clock.

    This choice will allow us to upstream the 1 usec TS option as
    discussed in IETF 97.

    [1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
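
The resolution figures quoted above follow from simple arithmetic, shown here as a sketch. The helper names are hypothetical; the second function shows one way a 1 ms timestamp value could be derived from a microsecond clock, which is an assumption about usage rather than a copy of the kernel code.

```c
#include <stdint.h>

#define USEC_PER_MSEC 1000u

/* Duration of one jiffy in milliseconds for a given HZ (HZ must be non-zero). */
unsigned int tick_ms(unsigned int hz)
{
    return 1000u / hz;
}

/* Millisecond timestamp derived from a microsecond clock. */
uint32_t ts_ms_from_usec(uint64_t clock_usec)
{
    return (uint32_t)(clock_usec / USEC_PER_MSEC);
}
```

With `HZ=250` a jiffy lasts 4 ms and with `HZ=100` it lasts 10 ms, so two events 1.5 ms apart usually share a jiffies-based timestamp, whereas a value derived from the microsecond clock can tell them apart.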
     
  • Idea is to later convert tp->tcp_mstamp to a full u64 counter
    using usec resolution, so that we can later have fine
    grained TCP TS clock (RFC 7323), regardless of HZ value.

    We try to refresh tp->tcp_mstamp only when necessary.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2017

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

06 May, 2017

1 commit

  • Whole point of randomization was to hide server uptime, but an attacker
    can simply start a syn flood and TCP generates 'old style' timestamps,
    directly revealing server jiffies value.

    Also, TSval sent by the server to a particular remote address varies
    depending on syncookies being sent or not, potentially triggering PAWS
    drops for innocent clients.

    Let's implement proper randomization, including for SYNcookies.

    Also we do not need to export sysctl_tcp_timestamps, since it is not
    used from a module.

    In v2, I added Florian's feedback and contribution, adding tsoff to
    tcp_get_cookie_sock().

    v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

    Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
    Signed-off-by: Eric Dumazet
    Reviewed-by: Florian Westphal
    Tested-by: Florian Westphal
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Apr, 2017

1 commit

  • Middlebox firewall issues can potentially cause the server's data to be
    blackholed after a successful 3WHS using TFO. Following are the related
    reports from Apple:
    https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
    Slide 31 identifies an issue where the client ACK to the server's data
    sent during a TFO'd handshake is dropped.
    C ---> syn-data ---> S
    C X S
    [retry and timeout]

    https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
    Slide 5 shows a similar situation that the server's data gets dropped
    after 3WHS.
    C ---- syn-data ---> S
    C S
    S (accept & write)
    C? X
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Wei Wang
     

23 Apr, 2017

1 commit


19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
     

25 Mar, 2017

1 commit

  • While working on some recent busy poll changes we found that child sockets
    were being instantiated without NAPI ID being set. In our first attempt to
    fix it, it was suggested that we should just pull programming the NAPI ID
    into the function itself since all callers will need to have it set.

    In addition to the NAPI ID change I have dropped the code that was
    populating the Rx hash since it was actually being populated in
    tcp_get_cookie_sock.

    Reported-by: Sridhar Samudrala
    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     

17 Mar, 2017

2 commits

  • The tcp_tw_recycle option was already broken for connections
    behind NAT, since the per-destination timestamp is not
    monotonically increasing for multiple machines behind
    a single destination address.

    After the randomization of TCP timestamp offsets
    in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
    for each connection), tcp_tw_recycle is broken for all
    types of connections for the same reason: the timestamps
    received from a single machine are no longer monotonically
    increasing.

    Remove tcp_tw_recycle, since it is not functional. Also, remove
    the PAWSPassive SNMP counter since it is only used for
    tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
    since the strict argument is only set when tcp_tw_recycle is
    enabled.

    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Cc: Lutz Vieweg
    Cc: Florian Westphal
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
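
The reasoning in the commit above can be illustrated with a small user-space model: once every connection adds its own random offset to the timestamp clock, a check that compares a new timestamp against the last value cached for the same destination rejects perfectly valid connections. This is a sketch under stated assumptions, not the removed kernel code. The names toy_ts_cache, toy_wire_ts and toy_accept_syn are hypothetical, and the check is reduced to a bare wraparound-safe "not older than the cached value" comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy per-destination cache: last timestamp seen from one peer address. */
struct toy_ts_cache {
	bool valid;
	uint32_t last_ts;
};

/* Timestamp a peer puts on the wire: its clock plus a per-connection offset. */
static uint32_t toy_wire_ts(uint32_t peer_clock, uint32_t conn_offset)
{
	return peer_clock + conn_offset;
}

/*
 * Simplified monotonicity check (an assumption of this sketch): refuse
 * a new connection whose timestamp is older than the cached one, using
 * a signed difference so that 32-bit wraparound is handled.
 */
static bool toy_accept_syn(struct toy_ts_cache *cache, uint32_t ts)
{
	if (cache->valid && (int32_t)(ts - cache->last_ts) < 0)
		return false;
	cache->valid = true;
	cache->last_ts = ts;
	return true;
}
```

With a shared offset, two connections opened at peer clock 1000 and 1001 both pass. With offsets 500000 and 20, the second connection carries timestamp 1021, which looks older than the cached 501000 and is refused even though the peer's clock moved forward.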
     
  • Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
    randomizes TCP timestamps per connection. After this commit,
    there is no guarantee that the timestamps received from the
    same destination are monotonically increasing. As a result,
    the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
    in struct tcp_metrics_block) is broken and cannot be relied upon.

    Remove the per-destination timestamp cache and all related code
    paths.

    Note that this cache was already broken for caching timestamps of
    multiple machines behind a NAT sharing the same address.

    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Cc: Lutz Vieweg
    Cc: Florian Westphal
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

16 Mar, 2017

1 commit


14 Mar, 2017

1 commit

  • As Eric Dumazet pointed out, this also needs to be fixed in IPv6.
    v2: Contains the IPv6 tcp/IPv6 dccp patches as well.

    We have seen a few incidents lately where a dst_entry has been freed
    with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
    dst_entry. If the conditions/timings are right, a crash then ensues when the
    freed dst_entry is referenced later on. A common crashing backtrace is:

    #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
    .
    .
    #9 [] tcp_rcv_established at ffffffff81580b64
    #10 [] tcp_v4_do_rcv at ffffffff8158b54a
    #11 [] tcp_v4_rcv at ffffffff8158cd02
    #12 [] ip_local_deliver_finish at ffffffff815668f4
    #13 [] ip_local_deliver at ffffffff81566bd9
    #14 [] ip_rcv_finish at ffffffff8156656d
    #15 [] ip_rcv at ffffffff81566f06
    #16 [] __netif_receive_skb_core at ffffffff8152b3a2
    #17 [] __netif_receive_skb at ffffffff8152b608
    #18 [] netif_receive_skb at ffffffff8152b690
    #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
    #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
    #21 [] net_rx_action at ffffffff8152bac2
    #22 [] __do_softirq at ffffffff81084b4f
    #23 [] call_softirq at ffffffff8164845c
    #24 [] do_softirq at ffffffff81016fc5
    #25 [] irq_exit at ffffffff81084ee5
    #26 [] do_IRQ at ffffffff81648ff8

    Of course it may happen with other NIC drivers as well.

    The freed dst_entry was found here:

    static bool tcp_in_quickack_mode(struct sock *sk)
    {
            const struct inet_connection_sock *icsk = inet_csk(sk);
            const struct dst_entry *dst = __sk_dst_get(sk);

            return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
                    (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
    }

    But there are other backtraces attributed to the same freed dst_entry in
    netfilter code as well.

    All the vmcores showed 2 significant clues:

    - Remote hosts behind the default gateway had always been redirected to a
    different gateway. An rtable/dst_entry is added for each such host, which
    creates more dst_entries with lower reference counts and makes the race
    more probable.

    - All vmcores showed a positive LockDroppedIcmps value, e.g.:

    LockDroppedIcmps 267

    A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
    regardless of whether user space has the socket locked. This can result in a
    race condition where the reference count of the same dst_entry cached in
    sk->sk_dst_cache is decremented twice for the same socket via:

    do_redirect() -> __sk_dst_check() -> dst_release()

    This leads to the dst_entry being prematurely freed while another socket
    still points to it via sk->sk_dst_cache, and to a subsequent crash.

    To fix this, skip do_redirect() if user space has the socket locked. Instead, let
    the redirect take place later when user space does not have the socket
    locked.

    The dccp/IPv6 code is very similar in this respect, so fix it there too.

    As Eric Garver pointed out, the following commit now invalidates routes, which
    can set the dst->obsolete flag so that ipv4_dst_check() returns NULL and
    triggers the dst_release().

    Fixes: ceb3320610d6 ("ipv4: Kill routes during PMTU/redirect updates.")
    Cc: Eric Garver
    Cc: Hannes Sowa
    Signed-off-by: Jon Maxwell
    Signed-off-by: David S. Miller

    Jon Maxwell
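
The guard described in the commit above (do not process a redirect while user space owns the socket) can be sketched in user-space C as follows. This is a toy model for illustration only: toy_sock, toy_dst, toy_do_redirect and toy_icmp_redirect are hypothetical names, the owned_by_user flag merely stands in for the kernel's socket-lock ownership test, and the single-threaded code shows the shape of the guard rather than the race itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy route entry with a plain (non-atomic) reference count. */
struct toy_dst {
	int refcnt;
};

struct toy_sock {
	bool owned_by_user;		/* stand-in for the socket lock owner test */
	struct toy_dst *dst_cache;	/* stand-in for sk->sk_dst_cache */
	int redirects_skipped;		/* bookkeeping for this sketch only */
};

static void toy_dst_release(struct toy_dst *dst)
{
	dst->refcnt--;
}

/* Stand-in for the redirect path dropping the socket's cached route. */
static void toy_do_redirect(struct toy_sock *sk)
{
	if (sk->dst_cache) {
		toy_dst_release(sk->dst_cache);
		sk->dst_cache = NULL;
	}
}

/*
 * Redirect handler with the guard: if user space holds the socket, it
 * may be touching dst_cache itself, so the redirect is skipped and
 * left to a later ICMP message instead of releasing the route a
 * second time.
 */
static void toy_icmp_redirect(struct toy_sock *sk)
{
	if (sk->owned_by_user) {
		sk->redirects_skipped++;
		return;
	}
	toy_do_redirect(sk);
}
```

While owned_by_user is set, the cached route and its reference count are left untouched; once the flag is cleared, the next redirect releases the route exactly once.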