24 Nov, 2020

1 commit

  • When the TCP stack is in SYN flood mode, the server child socket is
    created from the SYN cookie received in a TCP packet with the ACK flag
    set.

    The child socket is created when the server receives the first TCP
    packet with a valid SYN cookie from the client. Usually, this packet
    corresponds to the final step of the TCP 3-way handshake, the ACK
    packet. But it is also possible to receive a valid SYN cookie from the
    first TCP data packet sent by the client, and thus create a child socket
    from that SYN cookie.

    Since a client socket is ready to send data as soon as it receives the
    SYN+ACK packet from the server, the client can send the ACK packet (sent
    by the TCP stack code), and the first data packet (sent by the userspace
    program) almost at the same time, and thus the server will equally
    receive the two TCP packets with valid SYN cookies almost at the same
    instant.

    When such an event happens, the TCP stack code has a race condition that
    occurs between the moment a lookup is done in the established
    connections hashtable to check for the existence of a connection for the
    same client, and the moment that the child socket is added to the
    established connections hashtable. As a consequence, this race condition
    can lead to a situation where we add two child sockets to the
    established connections hashtable and deliver two sockets for the same
    client to the userspace program.

    This patch fixes the race condition by checking whether a child
    socket already exists for the same client when we are adding the second
    child socket to the established connections hashtable. If one
    exists, we drop the packet and discard the second child socket
    for the same client.

    Signed-off-by: Ricardo Dias
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
    Signed-off-by: Jakub Kicinski

    Ricardo Dias
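
A minimal single-threaded sketch of the check-on-insert idea described in the commit above. The names `conn_key`, `child_sock`, `ehash` and `ehash_insert_unique` are hypothetical and are not the kernel's API. In a concurrent setting the duplicate check and the insertion would have to be atomic with respect to other inserters (for example by holding a lock across both), which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection key; a real lookup key carries more fields. */
struct conn_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct child_sock {
    struct conn_key key;
    struct child_sock *next;
};

#define EHASH_BUCKETS 16

/* Toy "established connections" table: chained hash buckets. */
struct ehash {
    struct child_sock *bucket[EHASH_BUCKETS];
};

static int key_equal(const struct conn_key *a, const struct conn_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

static unsigned int key_hash(const struct conn_key *k)
{
    return (k->saddr ^ k->daddr ^ k->sport ^ k->dport) % EHASH_BUCKETS;
}

/*
 * Insert sk unless a socket with the same key is already in the table.
 * Returns 1 if sk was inserted, 0 if a duplicate was found. On 0 the
 * caller is expected to drop the packet and discard sk.
 */
int ehash_insert_unique(struct ehash *h, struct child_sock *sk)
{
    struct child_sock **head = &h->bucket[key_hash(&sk->key)];
    struct child_sock *cur;

    for (cur = *head; cur != NULL; cur = cur->next)
        if (key_equal(&cur->key, &sk->key))
            return 0;

    sk->next = *head;
    *head = sk;
    return 1;
}
```

The point of the design is that the existence check happens at insertion time rather than only at an earlier lookup, so a second child socket for the same client is rejected even if both packets passed the earlier lookup.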
     

11 Sep, 2020

1 commit

  • This commit adds tos as a new passed in parameter to
    ip_build_and_send_pkt() which will be used in the later commit.
    This is a pure restructure and does not have any functional change.

    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Wei Wang
     

25 Aug, 2020

1 commit


20 Jul, 2020

3 commits


14 Jul, 2020

1 commit


05 Nov, 2019

1 commit


02 Nov, 2019

1 commit

  • Historically Linux tried to stick to RFC 791, 1122, 2003
    for IPv4 ID field generation.

    RFC 6864 made clear that no matter how hard we try,
    we cannot ensure uniqueness of the IP ID within the maximum
    lifetime for all datagrams with a given source
    address/destination address/protocol tuple.

    Linux uses a per socket inet generator (inet_id), initialized
    at connection startup with a XOR of 'jiffies' and other
    fields that appear clear on the wire.

    Thiemo Nagel pointed out that this strategy is a privacy
    concern as this provides 16 bits of entropy to fingerprint
    devices.

    Let's switch to a random starting point; this is just as
    good as far as RFC 6864 is concerned and does not leak
    anything critical.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Reported-by: Thiemo Nagel
    Signed-off-by: David S. Miller

    Eric Dumazet
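
A small sketch of why the starting point matters, under stated assumptions. `id_gen`, `seed_from_clock`, `recover_clock_bits` and `id_gen_init` are hypothetical names, and the single `wire_field` argument stands in for the "other fields that appear clear on the wire" mentioned above. The random value is passed in by the caller here; a kernel would draw it from its own random number source.

```c
#include <assert.h>
#include <stdint.h>

/* Per-socket 16-bit IP ID generator (illustrative only). */
struct id_gen {
    uint16_t inet_id;
};

/* Old-style seed: XOR of a clock value with a field visible on the wire. */
uint16_t seed_from_clock(uint32_t clock, uint32_t wire_field)
{
    return (uint16_t)(clock ^ wire_field);
}

/*
 * What an observer can do with the old-style seed: knowing the first
 * IP ID and the wire-visible field, XOR recovers the low 16 clock bits.
 */
uint16_t recover_clock_bits(uint16_t first_id, uint32_t wire_field)
{
    return (uint16_t)(first_id ^ (uint16_t)wire_field);
}

/* New-style seed: start from a caller-supplied random value. */
void id_gen_init(struct id_gen *g, uint32_t random_value)
{
    g->inet_id = (uint16_t)random_value;
}

/* Return the current ID and advance; wraps around at 16 bits. */
uint16_t id_gen_next(struct id_gen *g)
{
    return g->inet_id++;
}
```

With a random start, the first ID observed on the wire no longer encodes clock bits, while the per-socket sequence still increments as before.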
     

02 Oct, 2019

1 commit

  • commit 174e23810cd31
    ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
    recycle always drop skb extensions. The additional skb_ext_del() that is
    performed via nf_reset on napi skb recycle is not needed anymore.

    Most nf_reset() calls in the stack are there so queued skb won't block
    'rmmod nf_conntrack' indefinitely.

    This removes the skb_ext_del from nf_reset, and renames it to a more
    fitting nf_reset_ct().

    In a few selected places, add a call to skb_ext_reset to make sure that
    no active extensions remain.

    I am submitting this for "net", because we're still early in the release
    cycle. The patch applies to net-next too, but I think the rename causes
    needless divergence between those trees.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

20 Apr, 2019

1 commit

  • The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
    socket protocol handlers, and all of those end up calling the same
    sock_get_timestamp()/sock_get_timestampns() helper functions, which
    results in a lot of duplicate code.

    With the introduction of 64-bit time_t on 32-bit architectures, this
    gets worse, as we then need four different ioctl commands in each
    socket protocol implementation.

    To simplify that, let's add a new .gettstamp() operation in
    struct proto_ops, and move ioctl implementation into the common
    sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
    through.

    We can reuse the sock_get_timestamp() implementation, but generalize
    it so it can deal with both native and compat mode, as well as
    timeval and timespec structures.

    Acked-by: Stefan Schmidt
    Acked-by: Neil Horman
    Acked-by: Marc Kleine-Budde
    Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
    Signed-off-by: Arnd Bergmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Arnd Bergmann
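
A rough userspace sketch of the refactoring idea in the commit above: one optional per-protocol timestamp operation, with the command handling done once in a common ioctl path. All names here (`sock_sketch`, `proto_ops_sketch`, `CMD_GSTAMP`, `CMD_GSTAMPNS`, `ts_out`) and the error code used for a missing handler are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers, not the real ioctl values. */
enum { CMD_GSTAMP = 1, CMD_GSTAMPNS = 2 };

/* Result: seconds plus a fraction in microseconds or nanoseconds. */
struct ts_out {
    long long sec;
    long long frac;
    int is_timeval;   /* 1: frac is microseconds, 0: nanoseconds */
};

struct sock_sketch;

struct proto_ops_sketch {
    /* Optional per-protocol timestamp operation. */
    int (*gettstamp)(struct sock_sketch *sk, struct ts_out *out, int timeval);
};

struct sock_sketch {
    const struct proto_ops_sketch *ops;
    long long stamp_ns;   /* last packet timestamp in nanoseconds */
};

/* One shared helper that serves both the timeval and timespec forms. */
int generic_gettstamp(struct sock_sketch *sk, struct ts_out *out, int timeval)
{
    out->sec = sk->stamp_ns / 1000000000LL;
    out->frac = sk->stamp_ns % 1000000000LL;
    if (timeval)
        out->frac /= 1000;   /* nanoseconds to microseconds */
    out->is_timeval = timeval;
    return 0;
}

/* Common ioctl path: timestamp commands are handled here, once. */
int sock_ioctl_sketch(struct sock_sketch *sk, int cmd, struct ts_out *out)
{
    switch (cmd) {
    case CMD_GSTAMP:
    case CMD_GSTAMPNS:
        if (sk->ops == NULL || sk->ops->gettstamp == NULL)
            return -ENOTTY;   /* protocol offers no timestamp op */
        return sk->ops->gettstamp(sk, out, cmd == CMD_GSTAMP);
    default:
        return -ENOTTY;
    }
}
```

Each protocol then only has to point its operation at the shared helper instead of repeating the command switch in its own ioctl handler.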
     

09 Nov, 2018

1 commit

  • We'll need this to handle ICMP errors for tunnels without a sending socket
    (i.e. FoU and GUE). There, we might have to look up different types of IP
    tunnels, registered as network protocols, before we get a match, so we
    want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
    inet_protos and inet6_protos. These error codes will be used in the next
    patch.

    For consistency, return sensible error codes in protocol error handlers
    whenever handlers can't handle errors because, even if valid, they don't
    match a protocol or any of its states.

    This has no effect on existing error handling paths.

    Signed-off-by: Stefano Brivio
    Reviewed-by: Sabrina Dubroca
    Signed-off-by: David S. Miller

    Stefano Brivio
     

03 Oct, 2018

1 commit

  • Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of the ireq_opt_deref() helper: it hides the logic
    without real benefit, as it is now a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Jun, 2018

1 commit

  • The poll() changes were not well thought out, and completely
    unexplained. They also caused a huge performance regression, because
    "->poll()" was no longer a trivial file operation that just called down
    to the underlying file operations, but instead did at least two indirect
    calls.

    Indirect calls are sadly slow now with the Spectre mitigation, but the
    performance problem could at least be largely mitigated by changing the
    "->get_poll_head()" operation to just have a per-file-descriptor pointer
    to the poll head instead. That gets rid of one of the new indirections.

    But that doesn't fix the new complexity that is completely unwarranted
    for the regular case. The (undocumented) reason for the poll() changes
    was some alleged AIO poll race fixing, but we don't make the common case
    slower and more complex for some uncommon special case, so this all
    really needs way more explanations and most likely a fundamental
    redesign.

    [ This revert is a revert of about 30 different commits, not reverted
    individually because that would just be unnecessarily messy - Linus ]

    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 May, 2018

1 commit


08 Apr, 2018

1 commit

  • syzbot reported an uninit-value read of skb->mark in iptable_mangle_hook()

    Thanks to the nice report, I tracked the problem to dccp not taking
    care of ireq->ir_mark for passive sessions.

    BUG: KMSAN: uninit-value in ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
    BUG: KMSAN: uninit-value in iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
    CPU: 0 PID: 5300 Comm: syz-executor3 Not tainted 4.16.0+ #81
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:53
    kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
    __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
    ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
    iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
    nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
    nf_hook_slow+0x158/0x3d0 net/netfilter/core.c:483
    nf_hook include/linux/netfilter.h:243 [inline]
    __ip_local_out net/ipv4/ip_output.c:113 [inline]
    ip_local_out net/ipv4/ip_output.c:122 [inline]
    ip_queue_xmit+0x1d21/0x21c0 net/ipv4/ip_output.c:504
    dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
    dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
    dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
    dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
    inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
    __sys_sendmsg net/socket.c:2080 [inline]
    SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
    SyS_sendmsg+0x54/0x80 net/socket.c:2087
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x455259
    RSP: 002b:00007f1a4473dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00007f1a4473e6d4 RCX: 0000000000455259
    RDX: 0000000000000000 RSI: 0000000020b76fc8 RDI: 0000000000000015
    RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
    R13: 00000000000004f0 R14: 00000000006fa720 R15: 0000000000000000

    Uninit was stored to memory at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
    kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
    __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
    ip_queue_xmit+0x1e35/0x21c0 net/ipv4/ip_output.c:502
    dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
    dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
    dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
    dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
    inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
    __sys_sendmsg net/socket.c:2080 [inline]
    SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
    SyS_sendmsg+0x54/0x80 net/socket.c:2087
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    Uninit was stored to memory at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
    kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
    __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
    inet_csk_clone_lock+0x503/0x580 net/ipv4/inet_connection_sock.c:797
    dccp_create_openreq_child+0x7f/0x890 net/dccp/minisocks.c:92
    dccp_v4_request_recv_sock+0x22c/0xe90 net/dccp/ipv4.c:408
    dccp_v6_request_recv_sock+0x290/0x2000 net/dccp/ipv6.c:414
    dccp_check_req+0x7b9/0x8f0 net/dccp/minisocks.c:197
    dccp_v4_rcv+0x12e4/0x2630 net/dccp/ipv4.c:840
    ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:449 [inline]
    ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
    __netif_receive_skb net/core/dev.c:4627 [inline]
    process_backlog+0x62d/0xe20 net/core/dev.c:5307
    napi_poll net/core/dev.c:5705 [inline]
    net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
    __do_softirq+0x56d/0x93d kernel/softirq.c:285
    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
    kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
    kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
    reqsk_alloc include/net/request_sock.h:88 [inline]
    inet_reqsk_alloc+0xc4/0x7f0 net/ipv4/tcp_input.c:6145
    dccp_v4_conn_request+0x5cc/0x1770 net/dccp/ipv4.c:600
    dccp_v6_conn_request+0x299/0x1880 net/dccp/ipv6.c:317
    dccp_rcv_state_process+0x2ea/0x2410 net/dccp/input.c:612
    dccp_v4_do_rcv+0x229/0x340 net/dccp/ipv4.c:682
    dccp_v6_do_rcv+0x16d/0x1220 net/dccp/ipv6.c:578
    sk_backlog_rcv include/net/sock.h:908 [inline]
    __sk_receive_skb+0x60e/0xf20 net/core/sock.c:513
    dccp_v4_rcv+0x24d4/0x2630 net/dccp/ipv4.c:874
    ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:449 [inline]
    ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
    __netif_receive_skb net/core/dev.c:4627 [inline]
    process_backlog+0x62d/0xe20 net/core/dev.c:5307
    napi_poll net/core/dev.c:5705 [inline]
    net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
    __do_softirq+0x56d/0x93d kernel/softirq.c:285

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Oct, 2017

1 commit

  • In my first attempt to fix the lockdep splat, I forgot we could
    enter inet_csk_route_req() with a freshly allocated request socket,
    for which refcount has not yet been elevated, due to complex
    SLAB_TYPESAFE_BY_RCU rules.

    We either are in rcu_read_lock() section _or_ we own a refcount on the
    request.

    The correct RCU verb to use here is rcu_dereference_check(), although it
    is not possible to prove we actually own a reference on a shared
    refcount :/

    In v2, I added the ireq_opt_deref() helper and used it in three places, to
    fix other possible splats.

    [ 49.844590] lockdep_rcu_suspicious+0xea/0xf3
    [ 49.846487] inet_csk_route_req+0x53/0x14d
    [ 49.848334] tcp_v4_route_req+0xe/0x10
    [ 49.850174] tcp_conn_request+0x31c/0x6a0
    [ 49.851992] ? __lock_acquire+0x614/0x822
    [ 49.854015] tcp_v4_conn_request+0x5a/0x79
    [ 49.855957] ? tcp_v4_conn_request+0x5a/0x79
    [ 49.858052] tcp_rcv_state_process+0x98/0xdcc
    [ 49.859990] ? sk_filter_trim_cap+0x2f6/0x307
    [ 49.862085] tcp_v4_do_rcv+0xfc/0x145
    [ 49.864055] ? tcp_v4_do_rcv+0xfc/0x145
    [ 49.866173] tcp_v4_rcv+0x5ab/0xaf9
    [ 49.868029] ip_local_deliver_finish+0x1af/0x2e7
    [ 49.870064] ip_local_deliver+0x1b2/0x1c5
    [ 49.871775] ? inet_del_offload+0x45/0x45
    [ 49.873916] ip_rcv_finish+0x3f7/0x471
    [ 49.875476] ip_rcv+0x3f1/0x42f
    [ 49.876991] ? ip_local_deliver_finish+0x2e7/0x2e7
    [ 49.878791] __netif_receive_skb_core+0x6d3/0x950
    [ 49.880701] ? process_backlog+0x7e/0x216
    [ 49.882589] __netif_receive_skb+0x1d/0x5e
    [ 49.884122] process_backlog+0x10c/0x216
    [ 49.885812] net_rx_action+0x147/0x3df

    Fixes: a6ca7abe53633 ("tcp/dccp: fix lockdep splat in inet_csk_route_req()")
    Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
    Signed-off-by: Eric Dumazet
    Reported-by: kernel test robot
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2017

1 commit

  • syzkaller found another bug in DCCP/TCP stacks [1]

    For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
    ireq->pktopts race"), we need to make sure we do not access
    ireq->opt unless we own the request sock.

    Note the opt field is renamed to ireq_opt to ease grep games.

    [1]
    BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

    CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    print_address_description+0x73/0x250 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x25b/0x340 mm/kasan/report.c:409
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
    ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
    tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
    tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
    __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
    tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
    tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
    tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
    tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x40c341
    RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
    RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
    RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
    R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

    Allocated by task 3295:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
    __do_kmalloc mm/slab.c:3725 [inline]
    __kmalloc+0x162/0x760 mm/slab.c:3734
    kmalloc include/linux/slab.h:498 [inline]
    tcp_v4_save_options include/net/tcp.h:1962 [inline]
    tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
    tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
    tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
    tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
    tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
    tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3306:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
    __cache_free mm/slab.c:3503 [inline]
    kfree+0xca/0x250 mm/slab.c:3820
    inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
    __sk_destruct+0xfd/0x910 net/core/sock.c:1560
    sk_destruct+0x47/0x80 net/core/sock.c:1595
    __sk_free+0x57/0x230 net/core/sock.c:1603
    sk_free+0x2a/0x40 net/core/sock.c:1614
    sock_put include/net/sock.h:1652 [inline]
    inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
    tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
    tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Aug, 2017

1 commit

  • Add a second device index, sdif, to inet socket lookups. sdif is the
    index for ingress devices enslaved to an l3mdev. It allows the lookups
    to consider the enslaved device as well as the L3 domain when searching
    for a socket.

    TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
    ingress index is obtained from IPCB using inet_sdif and after the cb move
    in tcp_v4_rcv the tcp_v4_sdif helper is used.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

27 Jul, 2017

1 commit


21 Jun, 2017

1 commit

  • Now dccp_ipv4 works as a kernel module. While this module is being
    loaded, if a dccp packet is received after inet_add_protocol but before
    register_pernet_subsys, in which v4_ctl_sk is initialized, a null pointer
    dereference may be triggered because init_net.dccp.v4_ctl_sk is 0x0.

    Jianlin found this issue when the following call trace occurred:

    [ 171.950177] BUG: unable to handle kernel NULL pointer dereference at 0000000000000110
    [ 171.951007] IP: [] dccp_v4_ctl_send_reset+0xc4/0x220 [dccp_ipv4]
    [...]
    [ 171.984629] Call Trace:
    [ 171.984859]
    [ 171.985061]
    [ 171.985213] [] dccp_v4_rcv+0x383/0x3f9 [dccp_ipv4]
    [ 171.985711] [] ip_local_deliver_finish+0xb4/0x1f0
    [ 171.986309] [] ip_local_deliver+0x59/0xd0
    [ 171.986852] [] ? update_curr+0x104/0x190
    [ 171.986956] [] ip_rcv_finish+0x8a/0x350
    [ 171.986956] [] ip_rcv+0x2b6/0x410
    [ 171.986956] [] ? task_cputime+0x44/0x80
    [ 171.986956] [] __netif_receive_skb_core+0x572/0x7c0
    [ 171.986956] [] ? trigger_load_balance+0x61/0x1e0
    [ 171.986956] [] __netif_receive_skb+0x18/0x60
    [ 171.986956] [] process_backlog+0xae/0x180
    [ 171.986956] [] net_rx_action+0x16d/0x380
    [ 171.986956] [] __do_softirq+0xef/0x280
    [ 171.986956] [] call_softirq+0x1c/0x30

    This patch moves inet_add_protocol after register_pernet_subsys in
    dccp_v4_init, so that v4_ctl_sk is initialized before any incoming dccp
    packets are processed.

    Reported-by: Jianlin Shi
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
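
The ordering fix above can be illustrated with a toy model. Everything here is hypothetical (`stack_state`, `register_pernet`, `add_protocol`, `deliver_packet`, `stack_init`); it only mirrors the rule that the state a receive handler depends on must be set up before the handler can be reached, and torn down again if the later step fails.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the control socket used to send resets. */
struct ctl_sock {
    int resets_sent;
};

struct stack_state {
    struct ctl_sock *ctl_sk;   /* set up by the per-namespace init step */
    int proto_registered;      /* packets are delivered only when set */
};

static struct ctl_sock the_ctl_sock;

int register_pernet(struct stack_state *s)
{
    s->ctl_sk = &the_ctl_sock;
    return 0;
}

void unregister_pernet(struct stack_state *s)
{
    s->ctl_sk = NULL;
}

/* 'fail' lets a caller simulate a registration error. */
int add_protocol(struct stack_state *s, int fail)
{
    if (fail)
        return -1;
    s->proto_registered = 1;
    return 0;
}

/*
 * Returns 1 if a reset was sent, 0 if the protocol is not registered
 * yet, and -1 where real code would have dereferenced a NULL pointer.
 */
int deliver_packet(struct stack_state *s)
{
    if (!s->proto_registered)
        return 0;
    if (s->ctl_sk == NULL)
        return -1;
    s->ctl_sk->resets_sent++;
    return 1;
}

/* Safe order: per-namespace state first, protocol handler second. */
int stack_init(struct stack_state *s, int fail_add)
{
    int err = register_pernet(s);

    if (err)
        return err;
    err = add_protocol(s, fail_add);
    if (err)
        unregister_pernet(s);   /* roll back on failure */
    return err;
}
```

In this model a packet can only reach `deliver_packet` with a NULL `ctl_sk` if the two steps are run in the opposite order.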
     

23 Apr, 2017

1 commit


19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
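
A single-threaded sketch of what "type safety without an existence guarantee" implies for a lookup: a candidate found without locks may have been freed and reused for a different object of the same type, so the reader takes a reference and then rechecks the key. The names `obj`, `get_ref_unless_zero` and `validate_candidate` are hypothetical, and real code would use atomic reference counting inside an RCU read-side critical section.

```c
#include <assert.h>
#include <stddef.h>

/* An object from a type-safe cache: memory stays an 'obj', but the
 * slot may be recycled for a different key at any time. */
struct obj {
    int key;
    int refcnt;   /* 0 means the object is free or being freed */
};

/* Take a reference unless the object is already on its way out. */
static int get_ref_unless_zero(struct obj *o)
{
    if (o->refcnt == 0)
        return 0;
    o->refcnt++;
    return 1;
}

/*
 * Validate a candidate found by a lockless lookup for 'key'.
 * Returns the object with a reference held, or NULL if the slot was
 * freed or reused for another key in the meantime.
 */
struct obj *validate_candidate(struct obj *cand, int key)
{
    if (!get_ref_unless_zero(cand))
        return NULL;          /* freed: caller should retry lookup */
    if (cand->key != key) {
        cand->refcnt--;       /* reused for a different identity */
        return NULL;
    }
    return cand;
}
```

The recheck after taking the reference is the step that is easy to forget when the flag name suggests that objects cannot be reallocated during the read-side section.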
     

14 Mar, 2017

1 commit

  • As Eric Dumazet pointed out, this also needs to be fixed in IPv6.
    v2: Contains the IPv6 tcp/IPv6 dccp patches as well.

    We have seen a few incidents lately where a dst_entry has been freed
    with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
    dst_entry. If the conditions/timings are right a crash then ensues when the
    freed dst_entry is referenced later on. A common crash backtrace is:

    #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
    .
    .
    #9 [] tcp_rcv_established at ffffffff81580b64
    #10 [] tcp_v4_do_rcv at ffffffff8158b54a
    #11 [] tcp_v4_rcv at ffffffff8158cd02
    #12 [] ip_local_deliver_finish at ffffffff815668f4
    #13 [] ip_local_deliver at ffffffff81566bd9
    #14 [] ip_rcv_finish at ffffffff8156656d
    #15 [] ip_rcv at ffffffff81566f06
    #16 [] __netif_receive_skb_core at ffffffff8152b3a2
    #17 [] __netif_receive_skb at ffffffff8152b608
    #18 [] netif_receive_skb at ffffffff8152b690
    #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
    #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
    #21 [] net_rx_action at ffffffff8152bac2
    #22 [] __do_softirq at ffffffff81084b4f
    #23 [] call_softirq at ffffffff8164845c
    #24 [] do_softirq at ffffffff81016fc5
    #25 [] irq_exit at ffffffff81084ee5
    #26 [] do_IRQ at ffffffff81648ff8

    Of course it may happen with other NIC drivers as well.

    The freed dst_entry was found here:

    static bool tcp_in_quickack_mode(struct sock *sk)
    {
            const struct inet_connection_sock *icsk = inet_csk(sk);
            const struct dst_entry *dst = __sk_dst_get(sk);

            return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
                   (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
    }

    But there are other backtraces attributed to the same freed dst_entry in
    netfilter code as well.

    All the vmcores showed 2 significant clues:

    - Remote hosts behind the default gateway had always been redirected to a
    different gateway. A rtable/dst_entry will be added for that host, creating
    more dst_entrys with lower reference counts and making this more probable.

    - All vmcores showed a positive LockDroppedIcmps value, e.g.:

    LockDroppedIcmps 267

    A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
    regardless of whether user space has the socket locked. This can result in a
    race condition where the same dst_entry cached in sk->sk_dst_cache can be
    decremented twice for the same socket via:

    do_redirect()->__sk_dst_check()-> dst_release().

    Which leads to the dst_entry being prematurely freed with another socket
    pointing to it via sk->sk_dst_cache and a subsequent crash.

    To fix this, skip do_redirect() if userspace has the socket locked. Instead
    let the redirect take place later when user space does not have the socket
    locked.

    The dccp/IPv6 code is very similar in this respect, so fix it there too.

    As Eric Garver pointed out, the following commit now invalidates routes,
    which can set the dst->obsolete flag so that ipv4_dst_check() returns null
    and triggers the dst_release().

    Fixes: ceb3320610d6 ("ipv4: Kill routes during PMTU/redirect updates.")
    Cc: Eric Garver
    Cc: Hannes Sowa
    Signed-off-by: Jon Maxwell
    Signed-off-by: David S. Miller

    Jon Maxwell
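
A toy model of the fix described above, with hypothetical names (`dst_sketch`, `sock_sketch`, `handle_redirect`). It only shows the guard: when user space owns the socket, the redirect handler leaves the cached route alone, so the reference cannot be dropped by both contexts.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a reference-counted route entry. */
struct dst_sketch {
    int refcnt;
};

struct sock_sketch {
    int owned_by_user;              /* 1 while user space holds the lock */
    struct dst_sketch *dst_cache;   /* cached route, may be NULL */
};

/*
 * ICMP redirect handling for one socket.
 * Returns 1 if the redirect was processed, 0 if it was skipped because
 * user space currently owns the socket.
 */
int handle_redirect(struct sock_sketch *sk)
{
    if (sk->owned_by_user)
        return 0;   /* skip; a later redirect can be handled instead */

    if (sk->dst_cache != NULL) {
        sk->dst_cache->refcnt--;    /* release the cached route once */
        sk->dst_cache = NULL;
    }
    return 1;
}
```

Skipping rather than queuing keeps the handler simple; the commit message states that the redirect is left to take place later, when user space no longer has the socket locked.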
     

23 Feb, 2017

1 commit

  • DCCP doesn't purge timewait sockets on network namespace shutdown.
    So, after the net namespace is destroyed we could still have an active
    timer which will trigger a use-after-free in tw_timer_handler():

    BUG: KASAN: use-after-free in tw_timer_handler+0x4a/0xa0 at addr ffff88010e0d1e10
    Read of size 8 by task swapper/1/0
    Call Trace:
    __asan_load8+0x54/0x90
    tw_timer_handler+0x4a/0xa0
    call_timer_fn+0x127/0x480
    expire_timers+0x1db/0x2e0
    run_timer_softirq+0x12f/0x2a0
    __do_softirq+0x105/0x5b4
    irq_exit+0xdd/0xf0
    smp_apic_timer_interrupt+0x57/0x70
    apic_timer_interrupt+0x90/0xa0

    Object at ffff88010e0d1bc0, in cache net_namespace size: 6848
    Allocated:
    save_stack_trace+0x1b/0x20
    kasan_kmalloc+0xee/0x180
    kasan_slab_alloc+0x12/0x20
    kmem_cache_alloc+0x134/0x310
    copy_net_ns+0x8d/0x280
    create_new_namespaces+0x23f/0x340
    unshare_nsproxy_namespaces+0x75/0xf0
    SyS_unshare+0x299/0x4f0
    entry_SYSCALL_64_fastpath+0x18/0xad
    Freed:
    save_stack_trace+0x1b/0x20
    kasan_slab_free+0xae/0x180
    kmem_cache_free+0xb4/0x350
    net_drop_ns+0x3f/0x50
    cleanup_net+0x3df/0x450
    process_one_work+0x419/0xbb0
    worker_thread+0x92/0x850
    kthread+0x192/0x1e0
    ret_from_fork+0x2e/0x40

    Add an .exit_batch hook to dccp_v4_ops()/dccp_v6_ops() which will purge
    timewait sockets on net namespace destruction and prevent the above issue.

    Fixes: f2bf415cfed7 ("mib: add net to NET_ADD_STATS_BH")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Andrey Ryabinin
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Andrey Ryabinin
     

19 Jan, 2017

1 commit

  • The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
    is how they check the rcv_saddr, so delete this call back and simply
    change inet_csk_bind_conflict to call inet_rcv_saddr_equal.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     

04 Dec, 2016

1 commit

  • A couple of conflicts resolved here:

    1) In the MACB driver, a bug fix to properly initialize the
    RX tail pointer overlapped with some changes
    to support variable sized rings.

    2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
    overlapping with a reorganization of the driver to support
    ACPI, OF, as well as PCI variants of the chip.

    3) In 'net' we had several probe error path bug fixes to the
    stmmac driver, meanwhile a lot of this code was cleaned up
    and reorganized in 'net-next'.

    4) The cls_flower classifier obtained a helper function in
    'net-next' called __fl_delete() and this overlapped with
    Daniel Borkmann's bug fix to use RCU for object destruction
    in 'net'. It also overlapped with Jiri's change to guard
    the rhashtable_remove_fast() call with a check against
    tc_skip_sw().

    5) In mlx4, a revert bug fix in 'net' overlapped with some
    unrelated changes in 'net-next'.

    6) In geneve, a stale header pointer after pskb_expand_head()
    bug fix in 'net' overlapped with a large reorganization of
    the same code in 'net-next'. Since the 'net-next' code no
    longer had the bug in question, there was nothing to do
    other than to simply take the 'net-next' hunks.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Nov, 2016

1 commit


15 Nov, 2016

1 commit


04 Nov, 2016

2 commits

  • dccp_v4_err() does not use pskb_may_pull() and might access garbage.

    We only need 4 bytes at the beginning of the DCCP header, like TCP,
    so the 8 bytes pulled in icmp_socket_deliver() are more than enough.

    This patch might allow processing more ICMP messages, as some routers
    still limit the size of reflected bytes to 28 (RFC 792), instead
    of using extended lengths (RFC 1812 4.3.2.3).
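
    The byte counts above can be checked with a short worked example. This is
    a sketch only: the function and constant names are hypothetical, and it
    assumes an IPv4 header without options (20 bytes), which is where the
    figure of 28 = 20 + 8 comes from.

```c
#include <stdbool.h>
#include <stddef.h>

enum {
    /* RFC 792: an ICMP error echoes the IP header plus 64 bits (8 bytes). */
    MODEL_ICMP_ECHOED_PAYLOAD = 8,
    /* 16-bit source port plus 16-bit destination port. */
    MODEL_PORTS_BYTES = 4
};

/*
 * quoted_len: bytes of the offending datagram echoed in the ICMP error.
 * ip_hdr_len: length of the echoed IPv4 header (20 when there are no options).
 * Returns true when the first 4 bytes of the transport header are present.
 */
static bool model_ports_readable(size_t quoted_len, size_t ip_hdr_len)
{
    return quoted_len >= ip_hdr_len &&
           quoted_len - ip_hdr_len >= MODEL_PORTS_BYTES;
}
```

    With quoted_len = 28 and ip_hdr_len = 20, 8 transport bytes are available,
    so the 4 bytes that are needed are always covered in that case.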

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Andrey Konovalov reported the following error while fuzzing with syzkaller:

    IPv4: Attempt to release alive inet socket ffff880068e98940
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    Modules linked in:
    CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88006b9e0000 task.stack: ffff880068770000
    RIP: 0010:[] []
    selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
    RSP: 0018:ffff8800687771c8 EFLAGS: 00010202
    RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
    RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
    RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
    R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
    R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
    FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
    Stack:
    ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
    ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
    ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
    Call Trace:
    [] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
    [] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
    [] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
    [] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
    [] ip_local_deliver_finish+0x332/0xad0
    net/ipv4/ip_input.c:216
    [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
    [< inline >] NF_HOOK ./include/linux/netfilter.h:255
    [] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
    [< inline >] dst_input ./include/net/dst.h:507
    [] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
    [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
    [< inline >] NF_HOOK ./include/linux/netfilter.h:255
    [] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
    [] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
    [] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
    [] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
    [] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
    [] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
    [] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
    [< inline >] new_sync_write fs/read_write.c:499
    [] __vfs_write+0x334/0x570 fs/read_write.c:512
    [] vfs_write+0x17b/0x500 fs/read_write.c:560
    [< inline >] SYSC_write fs/read_write.c:607
    [] SyS_write+0xd4/0x1a0 fs/read_write.c:599
    [] entry_SYSCALL_64_fastpath+0x1f/0xc2

    It turns out DCCP calls __sk_receive_skb(), and this broke when
    lookups no longer took a reference on listeners.

    Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
    so that sock_put() is used only when needed.
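
    The idea of a @refcounted parameter can be illustrated with a minimal
    userspace model. The names below (model_sock, model_sock_put,
    model_receive) are hypothetical and only mimic the shape of the change;
    the real __sk_receive_skb() does much more than this.

```c
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct sock. */
struct model_sock {
    int refcnt;
};

/* Stand-in for sock_put(): release one reference. */
static void model_sock_put(struct model_sock *sk)
{
    sk->refcnt--;
}

/*
 * Receive path: release a reference only if the lookup that found the
 * socket actually took one. A listener found without taking a reference
 * must be passed with refcounted == false.
 */
static void model_receive(struct model_sock *sk, bool refcounted)
{
    /* ... packet processing would happen here ... */
    if (refcounted)
        model_sock_put(sk);
}
```

    Calling model_receive() with refcounted == false on a listener leaves its
    count untouched, so the count cannot drop to zero while the socket is
    still alive.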

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Oct, 2016

1 commit

  • Per listen(fd, backlog) rules, there is really no point accepting a SYN,
    sending a SYNACK, and dropping the following ACK packet if accept queue
    is full, because application is not draining accept queue fast enough.

    This behavior is fooling TCP clients that believe they established a
    flow, while there is nothing at server side. They might then send about
    10 MSS (if using IW10) that will be dropped anyway while server is under
    stress.
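
    A minimal sketch of the resulting policy, assuming the change is to refuse
    the SYN up front when the accept queue is already full. All names are
    hypothetical, and the strict "greater than" comparison is an assumption
    of this model rather than something stated in this message.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a listening socket's accept queue. */
struct model_listener {
    unsigned int ack_backlog;     /* connections waiting for accept() */
    unsigned int max_ack_backlog; /* limit derived from listen(fd, backlog) */
};

/* Assumption: the queue counts as full once it exceeds the limit. */
static bool model_acceptq_is_full(const struct model_listener *l)
{
    return l->ack_backlog > l->max_ack_backlog;
}

/*
 * Returns true if a SYNACK should be sent in reply to a SYN, false if the
 * SYN should be dropped because the final ACK could not be honoured anyway.
 */
static bool model_should_answer_syn(const struct model_listener *l)
{
    return !model_acceptq_is_full(l);
}
```

    Dropping the SYN leaves the client in its normal SYN retransmission path
    instead of letting it believe the flow is established.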

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Jul, 2016

1 commit

  • DCCP verifies packet integrity, including length, at initial receive in
    dccp_invalid_packet, and later pulls headers in dccp_enqueue_skb.

    A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
    skb_copy_datagram_msg interprets this as a negative value, so
    (correctly) fails with EFAULT. The negative length is reported in
    ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

    Introduce an sk_receive_skb variant that caps how small a filter
    program can trim packets, and call this in dccp with the header
    length. Excessively trimmed packets are now processed normally and
    queued for reception as 0B payloads.
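
    The wrap and the cap can be reproduced with plain unsigned arithmetic.
    This is a sketch under assumptions: model_pull() and model_trim() are
    hypothetical helpers that only model the length bookkeeping, and the
    "filter returns 0 means drop" case is deliberately left out.

```c
#include <stdint.h>

/*
 * Models an unchecked header pull: the length is reduced by hdr_len with
 * no bounds check, so it wraps modulo 2^32 when hdr_len > len.
 */
static uint32_t model_pull(uint32_t len, uint32_t hdr_len)
{
    return len - hdr_len;
}

/*
 * Models a capped trim: the filter asks to keep filter_len bytes, but the
 * packet is never shortened below cap, and never lengthened.
 */
static uint32_t model_trim(uint32_t len, uint32_t filter_len, uint32_t cap)
{
    uint32_t keep = filter_len > cap ? filter_len : cap;

    return keep < len ? keep : len;
}
```

    For a 100-byte packet with a 16-byte header, a filter that keeps 4 bytes
    and no cap gives model_pull(4, 16), which wraps to 4294967284 (read as -12
    when treated as signed). With the cap set to the header length, the trim
    keeps 16 bytes and the pull yields a 0-byte payload.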

    Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
    Signed-off-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

10 Jul, 2016

1 commit

  • In the prep work I did before enabling BH while handling socket backlog,
    I missed two points in DCCP:

    1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BH was
    blocked. This is no longer always true.

    2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of
    IP_INC_STATS()

    A similar fix was done for TCP, in commit 47dcc20a39d0
    ("ipv4: tcp: ip_send_unicast_reply() is not BH safe")

    Fixes: 7309f8821fd6 ("dccp: do not assume DCCP code is non preemptible")
    Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 May, 2016

1 commit


28 Apr, 2016

4 commits