18 Oct, 2018

1 commit

  • [ Upstream commit 2ab2ddd301a22ca3c5f0b743593e4ad2953dfa53 ]

    Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when SYNACK is retransmit.

    Lets add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of ireq_opt_deref() helper since it hides the logic
    without real benefit, since it is now a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

13 Feb, 2018

1 commit

  • [ Upstream commit edbe69ef2c90fc86998a74b08319a01c508bd497 ]

    This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol:
    defer call to mem_cgroup_sk_alloc()").

    Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
    memcg socket memory accounting, as packets received before memcg
    pointer initialization are not accounted and are causing refcounting
    underflow on socket release.

    Actually the free-after-use problem was fixed by
    commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in
    sk_clone_lock()") for the cgroup pointer.

    So, let's revert it and call mem_cgroup_sk_alloc() just before
    cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
    we're cloning, and it holds a reference to the memcg.

    Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
    mem_cgroup_sk_alloc(). I see no reasons why bumping the root
    memcg counter is a good reason to panic, and there are no realistic
    ways to hit it.

    Signed-off-by: Roman Gushchin
    Cc: Eric Dumazet
    Cc: David S. Miller
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

26 Oct, 2017

1 commit

  • In my first attempt to fix the lockdep splat, I forgot we could
    enter inet_csk_route_req() with a freshly allocated request socket,
    for which refcount has not yet been elevated, due to complex
    SLAB_TYPESAFE_BY_RCU rules.

    We either are in rcu_read_lock() section _or_ we own a refcount on the
    request.

    Correct RCU verb to use here is rcu_dereference_check(), although it is
    not possible to prove we actually own a reference on a shared
    refcount :/

    In v2, I added ireq_opt_deref() helper and use in three places, to fix other
    possible splats.

    [ 49.844590] lockdep_rcu_suspicious+0xea/0xf3
    [ 49.846487] inet_csk_route_req+0x53/0x14d
    [ 49.848334] tcp_v4_route_req+0xe/0x10
    [ 49.850174] tcp_conn_request+0x31c/0x6a0
    [ 49.851992] ? __lock_acquire+0x614/0x822
    [ 49.854015] tcp_v4_conn_request+0x5a/0x79
    [ 49.855957] ? tcp_v4_conn_request+0x5a/0x79
    [ 49.858052] tcp_rcv_state_process+0x98/0xdcc
    [ 49.859990] ? sk_filter_trim_cap+0x2f6/0x307
    [ 49.862085] tcp_v4_do_rcv+0xfc/0x145
    [ 49.864055] ? tcp_v4_do_rcv+0xfc/0x145
    [ 49.866173] tcp_v4_rcv+0x5ab/0xaf9
    [ 49.868029] ip_local_deliver_finish+0x1af/0x2e7
    [ 49.870064] ip_local_deliver+0x1b2/0x1c5
    [ 49.871775] ? inet_del_offload+0x45/0x45
    [ 49.873916] ip_rcv_finish+0x3f7/0x471
    [ 49.875476] ip_rcv+0x3f1/0x42f
    [ 49.876991] ? ip_local_deliver_finish+0x2e7/0x2e7
    [ 49.878791] __netif_receive_skb_core+0x6d3/0x950
    [ 49.880701] ? process_backlog+0x7e/0x216
    [ 49.882589] __netif_receive_skb+0x1d/0x5e
    [ 49.884122] process_backlog+0x10c/0x216
    [ 49.885812] net_rx_action+0x147/0x3df

    Fixes: a6ca7abe53633 ("tcp/dccp: fix lockdep splat in inet_csk_route_req()")
    Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
    Signed-off-by: Eric Dumazet
    Reported-by: kernel test robot
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Oct, 2017

1 commit

  • This patch fixes the following lockdep splat in inet_csk_route_req()

    lockdep_rcu_suspicious
    inet_csk_route_req
    tcp_v4_send_synack
    tcp_rtx_synack
    inet_rtx_syn_ack
    tcp_fastopen_synack_time
    tcp_retransmit_timer
    tcp_write_timer_handler
    tcp_write_timer
    call_timer_fn

    Thread running inet_csk_route_req() owns a reference on the request
    socket, so we have the guarantee ireq->ireq_opt wont be changed or
    freed.

    lockdep can enforce this invariant for us.

    Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2017

1 commit

  • syzkaller found another bug in DCCP/TCP stacks [1]

    For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
    ireq->pktopts race"), we need to make sure we do not access
    ireq->opt unless we own the request sock.

    Note the opt field is renamed to ireq_opt to ease grep games.

    [1]
    BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

    CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    print_address_description+0x73/0x250 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x25b/0x340 mm/kasan/report.c:409
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
    ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
    tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
    tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
    __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
    tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
    tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
    tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
    tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x40c341
    RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
    RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
    RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
    R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

    Allocated by task 3295:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
    __do_kmalloc mm/slab.c:3725 [inline]
    __kmalloc+0x162/0x760 mm/slab.c:3734
    kmalloc include/linux/slab.h:498 [inline]
    tcp_v4_save_options include/net/tcp.h:1962 [inline]
    tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
    tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
    tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
    tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
    tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
    tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3306:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
    __cache_free mm/slab.c:3503 [inline]
    kfree+0xca/0x250 mm/slab.c:3820
    inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
    __sk_destruct+0xfd/0x910 net/core/sock.c:1560
    sk_destruct+0x47/0x80 net/core/sock.c:1595
    __sk_free+0x57/0x230 net/core/sock.c:1603
    sk_free+0x2a/0x40 net/core/sock.c:1614
    sock_put include/net/sock.h:1652 [inline]
    inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
    tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
    tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Oct, 2017

1 commit

  • This reverts commit fbb1fb4ad415cb31ce944f65a5ca700aaf73a227.

    This was not the proper fix, lets cleanly revert it, so that
    following patch can be carried to stable versions.

    sock_cgroup_ptr() callers do not expect a NULL return value.

    Signed-off-by: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Oct, 2017

2 commits

  • sk_clone_lock() might run while TCP/DCCP listener already vanished.

    In order to prevent use after free, it is better to defer cgroup_sk_alloc()
    to the point we know both parent and child exist, and from process context.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Signed-off-by: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of calling mem_cgroup_sk_alloc() from BH context,
    it is better to call it from inet_csk_accept() in process context.

    Not only this removes code in mem_cgroup_sk_alloc(), but it also
    fixes a bug since listener might have been dismantled and css_get()
    might cause a use-after-free.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Signed-off-by: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Sep, 2017

3 commits

  • When doing my reuseport rework I screwed up and changed a

    if (hlist_empty(&tb->owners))

    to

    if (!hlist_empty(&tb->owners))

    This is obviously bad as all of the reuseport/reuse logic was reversed,
    which caused weird problems like allowing an ipv4 bind conflict if we
    opened an ipv4 only socket on a port followed by an ipv6 only socket on
    the same port.

    Fixes: b9470c27607b ("inet: kill smallest_size and smallest_port")
    Reported-by: Cole Robinson
    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • In ipv6_rcv_saddr_equal() we need to use inet6_rcv_saddr(sk) for the
    ipv6 compare with the fast socket information to make sure we're doing
    the proper comparisons.

    Fixes: 637bc8bbe6c0 ("inet: reset tb->fastreuseport when adding a reuseport sk")
    Reported-and-tested-by: Cole Robinson
    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • We need to set the tb->fast_sk_family properly so we can use the proper
    comparison function for all subsequent reuseport bind requests.

    Fixes: 637bc8bbe6c0 ("inet: reset tb->fastreuseport when adding a reuseport sk")
    Reported-and-tested-by: Cole Robinson
    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     

13 Sep, 2017

1 commit

  • Back in linux-4.4, I inadvertently put a call to reqsk_put() in
    inet_child_forget(), forgetting it could be called from two different
    points.

    In the case it is called from inet_csk_reqsk_queue_add(), we want to
    keep the reference on the request socket, since it is released later by
    the caller (tcp_v{4|6}_rcv())

    This bug never showed up because atomic_dec_and_test() was not signaling
    the underflow, and SLAB_DESTROY_BY RCU semantic for request sockets
    prevented the request to be put in quarantine.

    Recent conversion of socket refcount from atomic_t to refcount_t finally
    exposed the bug.

    So move the reqsk_put() to inet_csk_listen_stop() to fix this.

    Thanks to Shankara Pailoor for using syzkaller and providing
    a nice set of .config and C repro.

    WARNING: CPU: 2 PID: 4277 at lib/refcount.c:186
    refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 2 PID: 4277 Comm: syz-executor0 Not tainted 4.13.0-rc7 #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:

    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0xf7/0x1aa lib/dump_stack.c:52
    panic+0x1ae/0x3a7 kernel/panic.c:180
    __warn+0x1c4/0x1d9 kernel/panic.c:541
    report_bug+0x211/0x2d0 lib/bug.c:183
    fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
    do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
    do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
    do_error_trap+0x118/0x340 arch/x86/kernel/traps.c:310
    do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
    invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:846
    RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
    RSP: 0018:ffff88006e006b60 EFLAGS: 00010286
    RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000026 RSI: 1ffff1000dc00d2c RDI: ffffed000dc00d60
    RBP: ffff88006e006bf0 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000dc00d6d
    R13: 00000000ffffffff R14: 0000000000000001 R15: ffff88006ce9d340
    refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
    reqsk_put+0x71/0x2b0 include/net/request_sock.h:123
    tcp_v4_rcv+0x259e/0x2e20 net/ipv4/tcp_ipv4.c:1729
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:248 [inline]
    ip_local_deliver+0x1ce/0x6d0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:477 [inline]
    ip_rcv_finish+0x8db/0x19c0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:248 [inline]
    ip_rcv+0xc3f/0x17d0 net/ipv4/ip_input.c:488
    __netif_receive_skb_core+0x1fb7/0x31f0 net/core/dev.c:4298
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4336
    process_backlog+0x1c5/0x6d0 net/core/dev.c:5102
    napi_poll net/core/dev.c:5499 [inline]
    net_rx_action+0x6d3/0x14a0 net/core/dev.c:5565
    __do_softirq+0x2cb/0xb2d kernel/softirq.c:284
    do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:898

    do_softirq.part.16+0x63/0x80 kernel/softirq.c:328
    do_softirq kernel/softirq.c:176 [inline]
    __local_bh_enable_ip+0x84/0x90 kernel/softirq.c:181
    local_bh_enable include/linux/bottom_half.h:31 [inline]
    rcu_read_unlock_bh include/linux/rcupdate.h:705 [inline]
    ip_finish_output2+0x8ad/0x1360 net/ipv4/ip_output.c:231
    ip_finish_output+0x74e/0xb80 net/ipv4/ip_output.c:317
    NF_HOOK_COND include/linux/netfilter.h:237 [inline]
    ip_output+0x1cc/0x850 net/ipv4/ip_output.c:405
    dst_output include/net/dst.h:471 [inline]
    ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
    ip_queue_xmit+0x8c6/0x1810 net/ipv4/ip_output.c:504
    tcp_transmit_skb+0x1963/0x3320 net/ipv4/tcp_output.c:1123
    tcp_send_ack.part.35+0x38c/0x620 net/ipv4/tcp_output.c:3575
    tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3545
    tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:5795 [inline]
    tcp_rcv_state_process+0x4876/0x4b60 net/ipv4/tcp_input.c:5930
    tcp_v4_do_rcv+0x58a/0x820 net/ipv4/tcp_ipv4.c:1483
    sk_backlog_rcv include/net/sock.h:907 [inline]
    __release_sock+0x124/0x360 net/core/sock.c:2223
    release_sock+0xa4/0x2a0 net/core/sock.c:2715
    inet_wait_for_connect net/ipv4/af_inet.c:557 [inline]
    __inet_stream_connect+0x671/0xf00 net/ipv4/af_inet.c:643
    inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:682
    SYSC_connect+0x204/0x470 net/socket.c:1628
    SyS_connect+0x24/0x30 net/socket.c:1609
    entry_SYSCALL_64_fastpath+0x18/0xad
    RIP: 0033:0x451e59
    RSP: 002b:00007f474843fc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002a
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 0000000000451e59
    RDX: 0000000000000010 RSI: 0000000020002000 RDI: 0000000000000007
    RBP: 0000000000000046 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
    R13: 00007ffc040a0f8f R14: 00007f47484409c0 R15: 0000000000000000

    Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
    Signed-off-by: Eric Dumazet
    Reported-by: Shankara Pailoor
    Tested-by: Shankara Pailoor
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jul, 2017

1 commit

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to absense of a _hint()
    version of refcount API. If the hint() version must
    be used, we might need to revisit API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

05 Jun, 2017

1 commit

  • DCCP uses dccp_write_space() for sk->sk_write_space method.

    Unfortunately a passive connection (as provided by accept())
    is using the generic sk_stream_write_space() function.

    Lets simply inherit sk->sk_write_space from the parent
    instead of forcing the generic one.

    Signed-off-by: Eric Dumazet
    Cc: Gerrit Renker
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 May, 2017

1 commit


10 May, 2017

1 commit

  • syzkaller found a way to trigger double frees from ip_mc_drop_socket()

    It turns out that leave a copy of parent mc_list at accept() time,
    which is very bad.

    Very similar to commit 8b485ce69876 ("tcp: do not inherit
    fastopen_req from parent")

    Initial report from Pray3r, completed by Andrey one.
    Thanks a lot to them !

    Signed-off-by: Eric Dumazet
    Reported-by: Pray3r
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_rcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

21 Jan, 2017

2 commits

  • When comparing two sockets we need to use inet6_rcv_saddr so we get a NULL
    sk_v6_rcv_saddr if the socket isn't AF_INET6, otherwise our comparison function
    can be wrong.

    Fixes: 637bc8b ("inet: reset tb->fastreuseport when adding a reuseport sk")
    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • Shaohua Li made percpu_counter irq safe in commit 098faf5805c8
    ("percpu_counter: make APIs irq safe")

    We can safely remove BH disable/enable sections around various
    percpu_counter manipulations.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Jan, 2017

6 commits

  • If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
    never set it again. Which means that in the future if we end up adding a bunch
    of reuseport sk's to that tb we'll have to do the expensive scan every time.
    Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
    so we know what comparison to make, and the ipv6 only setting so we can make
    sure to compare with new sockets appropriately. Once one sk has made it onto
    the list we know that there are no potential bind conflicts on the owners list
    that match that sk's rcv_addr. So copy the sk's information into our bind
    bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do
    an extra check for subsequent reuseport sockets and skip the expensive bind
    conflict check.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • inet_csk_get_port does two different things, it either scans for an open port,
    or it tries to see if the specified port is available for use. Since these two
    operations have different rules and are basically independent lets split them
    into two different functions to make them both more readable.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • This is just wasted time, we've already found a tb that doesn't have a bind
    conflict, and we don't drop the head lock so scanning again isn't going to give
    us a different answer. Instead move the tb->reuse setting logic outside of the
    found_tb path and put it in the success: path. Then make it so that we don't
    goto again if we find a bind conflict in the found_tb path as we won't reach
    this anymore when we are scanning for an ephemeral port.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • In inet_csk_get_port we seem to be using smallest_port to figure out where the
    best place to look for a SO_REUSEPORT sk that matches with an existing set of
    SO_REUSEPORT's. However if we get to the logic

    if (smallest_size != -1) {
    port = smallest_port;
    goto have_port;
    }

    we will do a useless search, because we would have already done the
    inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
    would have gone to found_tb and succeeded. Since this logic makes us do yet
    another trip through inet_csk_bind_conflict for a port we know won't work just
    delete this code and save us the time.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
    is how they check the rcv_saddr, so delete this call back and simply
    change inet_csk_bind_conflict to call inet_rcv_saddr_equal.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • We pass these per-protocol equal functions around in various places, but
    we can just have one function that checks the sk->sk_family and then do
    the right comparison function. I've also changed the ipv4 version to
    not cast to inet_sock since it is unneeded.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     

18 Dec, 2016

2 commits

  • A user may call listen with binding an explicit port with the intent
    that the kernel will assign an available port to the socket. In this
    case inet_csk_get_port does a port scan. For such sockets, the user may
    also set soreuseport with the intent a creating more sockets for the
    port that is selected. The problem is that the initial socket being
    opened could inadvertently choose an existing and unreleated port
    number that was already created with soreuseport.

    This patch adds a boolean parameter to inet_bind_conflict that indicates
    rather soreuseport is allowed for the check (in addition to
    sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
    the argument is set to true if an explicit port is being looked up (snum
    argument is nonzero), and is false if port scan is done.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • inet_csk_get_port is called with port number (snum argument) that may be
    zero or nonzero. If it is zero, then the intent is to find an available
    ephemeral port number to bind to. If snum is non-zero then the caller
    is asking to allocate a specific port number. In the latter case we
    never want to perform the scan in ephemeral port range. It is
    conceivable that this can happen if the "goto again" in "tb_found:"
    is done. This patch adds a check that snum is zero before doing
    the "goto again".

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

05 Nov, 2016

1 commit

  • - Use the UID in routing lookups made by protocol connect() and
    sendmsg() functions.
    - Make sure that routing lookups triggered by incoming packets
    (e.g., Path MTU discovery) take the UID of the socket into
    account.
    - For packets not associated with a userspace socket, (e.g., ping
    replies) use UID 0 inside the user namespace corresponding to
    the network namespace the socket belongs to. This allows
    all namespaces to apply routing and iptables rules to
    kernel-originated traffic in that namespaces by matching UID 0.
    This is better than using the UID of the kernel socket that is
    sending the traffic, because the UID of kernel sockets created
    at namespace creation time (e.g., the per-processor ICMP and
    TCP sockets) is the UID of the user that created the socket,
    which might not be mapped in the namespace.

    Tested: compiles allnoconfig, allyesconfig, allmodconfig
    Tested: https://android-review.googlesource.com/253302
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

07 Jul, 2016

1 commit

  • Pinned timers must carry the pinned attribute in the timer structure
    itself, so convert the code to the new API.

    No functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094341.617891430@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

05 May, 2016

1 commit

  • percpu_counter only have protection against preemption.

    TCP stack uses them possibly from BH, so we need BH protection
    in contexts that could be run in process context

    Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2016

1 commit


08 Apr, 2016

1 commit

  • David Ahern reported panics in __inet_hash() caused by my recent commit.

    The reason is inet_reuseport_add_sock() was still using
    sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
    SO_REUSEPORT enabled listeners were causing an instant crash.

    While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
    flag, as it is inherited from the parent at clone time.

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Feb, 2016

1 commit

  • One of the validation checks for the new array-based TCP SO_REUSEPORT
    validation was unintentionally dropped in ea8add2b1903. This adds it back.

    Lack of this check allows the user to allocate multiple sock_reuseport
    structures (leaking all but the first).

    Fixes: ea8add2b1903 ("tcp/dccp: better use of ephemeral ports in bind()")
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

23 Feb, 2016

1 commit


19 Feb, 2016

1 commit

  • Ilya reported following lockdep splat:

    kernel: =========================
    kernel: [ BUG: held lock freed! ]
    kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
    kernel: -------------------------
    kernel: swapper/5/0 is freeing memory
    ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
    kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0
    kernel: 4 locks held by swapper/5/0:
    kernel: #0: (rcu_read_lock){......}, at: []
    netif_receive_skb_internal+0x4b/0x1f0
    kernel: #1: (rcu_read_lock){......}, at: []
    ip_local_deliver_finish+0x3f/0x380
    kernel: #2: (slock-AF_INET){+.-...}, at: []
    sk_clone_lock+0x19b/0x440
    kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0

    To properly fix this issue, inet_csk_reqsk_queue_add() needs
    to return to its callers if the child as been queued
    into accept queue.

    We also need to make sure listener is still there before
    calling sk->sk_data_ready(), by holding a reference on it,
    since the reference carried by the child can disappear as
    soon as the child is put on accept queue.

    Reported-by: Ilya Dryomov
    Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Feb, 2016

1 commit

  • Implement strategy used in __inet_hash_connect() in opposite way :

    Try to find a candidate using odd ports, then fallback to even ports.

    We no longer disable BH for whole traversal, but one bucket at a time.
    We also use cond_resched() to yield cpu to other tasks if needed.

    I removed one indentation level and tried to mirror the loop we have
    in __inet_hash_connect() and variable names to ease code maintenance.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Feb, 2016

2 commits

  • This change extends the fast SO_REUSEPORT socket lookup implemented
    for UDP to TCP. Listener sockets with SO_REUSEPORT and the same
    receive address are additionally added to an array for faster
    random access. This means that only a single socket from the group
    must be found in the listener list before any socket in the group can
    be used to receive a packet. Previously, every socket in the group
    needed to be considered before handing off the incoming packet.

    This feature also exposes the ability to use a BPF program when
    selecting a socket from a reuseport group.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     
  • In order to support fast reuseport lookups in TCP, the hash function
    defined in struct proto must be capable of returning an error code.
    This patch changes the function signature of all related hash functions
    to return an integer and handles or propagates this return value at
    all call sites.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

08 Feb, 2016

1 commit


16 Nov, 2015

1 commit

  • Some functions access TCP sockets without holding a lock and
    might output non consistent data, depending on compiler and or
    architecture.

    tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...

    Introduce sk_state_load() and sk_state_store() to fix the issues,
    and more clearly document where this lack of locking is happening.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet