18 Oct, 2018

2 commits

  • [ Upstream commit 2ab2ddd301a22ca3c5f0b743593e4ad2953dfa53 ]

    Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when a SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of ireq_opt_deref() helper since it hides the logic
    without real benefit, since it is now a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 1ad98e9d1bdf4724c0a8532fabd84bf3c457c2bc ]

    In normal SYN processing, packets are handled without listener
    lock and in RCU protected ingress path.

    But syzkaller is known to be able to trick us and SYN
    packets might be processed in process context, after being
    queued into socket backlog.

    In commit 06f877d613be ("tcp/dccp: fix other lockdep splats
    accessing ireq_opt") I made a very stupid fix, that happened
    to work mostly because of the regular path being RCU protected.

    Really the thing protecting ireq->ireq_opt is RCU read lock,
    and the pseudo request refcnt is not relevant.

    This patch extends what I did in commit 449809a66c1d ("tcp/dccp:
    block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
    pair in the paths that might be taken when processing a SYN from the
    socket backlog (thus possibly in process context).

    Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitations of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

26 Oct, 2017

1 commit

  • In my first attempt to fix the lockdep splat, I forgot we could
    enter inet_csk_route_req() with a freshly allocated request socket,
    for which refcount has not yet been elevated, due to complex
    SLAB_TYPESAFE_BY_RCU rules.

    We either are in rcu_read_lock() section _or_ we own a refcount on the
    request.

    Correct RCU verb to use here is rcu_dereference_check(), although it is
    not possible to prove we actually own a reference on a shared
    refcount :/

    In v2, I added the ireq_opt_deref() helper and used it in three places, to fix
    other possible splats.

    [ 49.844590] lockdep_rcu_suspicious+0xea/0xf3
    [ 49.846487] inet_csk_route_req+0x53/0x14d
    [ 49.848334] tcp_v4_route_req+0xe/0x10
    [ 49.850174] tcp_conn_request+0x31c/0x6a0
    [ 49.851992] ? __lock_acquire+0x614/0x822
    [ 49.854015] tcp_v4_conn_request+0x5a/0x79
    [ 49.855957] ? tcp_v4_conn_request+0x5a/0x79
    [ 49.858052] tcp_rcv_state_process+0x98/0xdcc
    [ 49.859990] ? sk_filter_trim_cap+0x2f6/0x307
    [ 49.862085] tcp_v4_do_rcv+0xfc/0x145
    [ 49.864055] ? tcp_v4_do_rcv+0xfc/0x145
    [ 49.866173] tcp_v4_rcv+0x5ab/0xaf9
    [ 49.868029] ip_local_deliver_finish+0x1af/0x2e7
    [ 49.870064] ip_local_deliver+0x1b2/0x1c5
    [ 49.871775] ? inet_del_offload+0x45/0x45
    [ 49.873916] ip_rcv_finish+0x3f7/0x471
    [ 49.875476] ip_rcv+0x3f1/0x42f
    [ 49.876991] ? ip_local_deliver_finish+0x2e7/0x2e7
    [ 49.878791] __netif_receive_skb_core+0x6d3/0x950
    [ 49.880701] ? process_backlog+0x7e/0x216
    [ 49.882589] __netif_receive_skb+0x1d/0x5e
    [ 49.884122] process_backlog+0x10c/0x216
    [ 49.885812] net_rx_action+0x147/0x3df

    Fixes: a6ca7abe53633 ("tcp/dccp: fix lockdep splat in inet_csk_route_req()")
    Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
    Signed-off-by: Eric Dumazet
    Reported-by: kernel test robot
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2017

1 commit

  • syzkaller found another bug in DCCP/TCP stacks [1]

    For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix
    ireq->pktopts race"), we need to make sure we do not access
    ireq->opt unless we own the request sock.

    Note the opt field is renamed to ireq_opt to ease grep games.

    [1]
    BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295

    CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    print_address_description+0x73/0x250 mm/kasan/report.c:252
    kasan_report_error mm/kasan/report.c:351 [inline]
    kasan_report+0x25b/0x340 mm/kasan/report.c:409
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
    ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
    tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
    tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
    tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
    __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
    tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
    tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
    tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
    tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x40c341
    RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
    RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
    RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
    R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000

    Allocated by task 3295:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
    __do_kmalloc mm/slab.c:3725 [inline]
    __kmalloc+0x162/0x760 mm/slab.c:3734
    kmalloc include/linux/slab.h:498 [inline]
    tcp_v4_save_options include/net/tcp.h:1962 [inline]
    tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
    tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
    tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
    tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
    tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
    tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3306:
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
    save_stack+0x43/0xd0 mm/kasan/kasan.c:447
    set_track mm/kasan/kasan.c:459 [inline]
    kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
    __cache_free mm/slab.c:3503 [inline]
    kfree+0xca/0x250 mm/slab.c:3820
    inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
    __sk_destruct+0xfd/0x910 net/core/sock.c:1560
    sk_destruct+0x47/0x80 net/core/sock.c:1595
    __sk_free+0x57/0x230 net/core/sock.c:1603
    sk_free+0x2a/0x40 net/core/sock.c:1614
    sock_put include/net/sock.h:1652 [inline]
    inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
    tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
    tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
    ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:464 [inline]
    ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:249 [inline]
    ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
    __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
    netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
    netif_receive_skb+0xae/0x390 net/core/dev.c:4611
    tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
    tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
    tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
    call_write_iter include/linux/fs.h:1770 [inline]
    new_sync_write fs/read_write.c:468 [inline]
    __vfs_write+0x68a/0x970 fs/read_write.c:481
    vfs_write+0x18f/0x510 fs/read_write.c:543
    SYSC_write fs/read_write.c:588 [inline]
    SyS_write+0xef/0x220 fs/read_write.c:580
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jan, 2017

1 commit

  • This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
    alternative way to perform Fast Open on the active side (client). Prior
    to this patch, a client needs to replace the connect() call with
    sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
    to use Fast Open: these socket operations are often done in lower layer
    libraries used by many other applications. Changing these libraries
    and/or the socket call sequences is not trivial. A more convenient
    approach is to perform Fast Open by simply enabling a socket option when
    the socket is created, without changing the rest of the socket call sequence:
    s = socket()
      create a new socket
    setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
      newly introduced sockopt
      If set, new functionality described below will be used.
      Return ENOTSUPP if TFO is not supported or not enabled in the
      kernel.

    connect()
      With cookie present, return 0 immediately.
      With no cookie, initiate 3WHS with TFO cookie-request option and
      return -1 with errno = EINPROGRESS.

    write()/sendmsg()
      With cookie present, send out SYN with data and return the number of
      bytes buffered.
      With no cookie, and 3WHS not yet completed, return -1 with errno =
      EINPROGRESS.
      No MSG_FASTOPEN flag is needed.

    read()
      Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
      write() is not called yet.
      Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
      established but no msg is received yet.
      Return number of bytes read if socket is established and there is
      msg received.

    The new API simplifies life for applications that always perform a write()
    immediately after a successful connect(). Such applications can now take
    advantage of Fast Open by merely making one new setsockopt() call at the time
    of creating the socket. Nothing else about the application's socket call
    sequence needs to change.
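The "one new setsockopt() call" described above can be sketched as follows. This is a minimal illustration, not code from the patch: the helper name `tfo_socket` is hypothetical, and the fallback value 30 for `TCP_FASTOPEN_CONNECT` is taken from the Linux UAPI header and only matters when the libc headers are older than the kernel.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>

/* Older libc headers may lack the option; 30 is its value in <linux/tcp.h>. */
#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30
#endif

/*
 * Hypothetical helper: create a TCP socket and request Fast Open on connect().
 * Returns the socket, or -1 if socket() itself fails.
 * *tfo_enabled is set to 1 if the kernel accepted the option, 0 otherwise
 * (for example on kernels without TFO support).
 */
int tfo_socket(int family, int *tfo_enabled)
{
    int one = 1;
    int fd;

    assert(tfo_enabled != NULL);
    *tfo_enabled = 0;

    fd = socket(family, SOCK_STREAM, IPPROTO_TCP);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
                   &one, sizeof(one)) == 0)
        *tfo_enabled = 1;

    return fd;
}
```

After this call the application would keep its usual connect() followed by write() sequence; a refused option simply leaves the socket in regular (non Fast Open) mode.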

    Signed-off-by: Wei Wang
    Acked-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Wei Wang
     

04 Nov, 2016

1 commit

  • The IP stack records the largest fragment of a reassembled packet
    in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
    that arrived fragmented, expose the value to allow applications to
    estimate receive path MTU.

    Tested:
    Sent data over a veth pair of which the source has a small mtu.
    Sent data using netcat, received using a dedicated process.

    Verified that the cmsg IP_RECVFRAGSIZE is returned only when
    data arrives fragmented, and in that case matches the veth mtu.

    ip link add veth0 type veth peer name veth1

    ip netns add from
    ip netns add to

    ip link set dev veth1 netns to
    ip netns exec to ip addr add dev veth1 192.168.10.1/24
    ip netns exec to ip link set dev veth1 up

    ip link set dev veth0 netns from
    ip netns exec from ip addr add dev veth0 192.168.10.2/24
    ip netns exec from ip link set dev veth0 up
    ip netns exec from ip link set dev veth0 mtu 1300
    ip netns exec from ethtool -K veth0 ufo off

    dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload

    ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
    ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload

    using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c
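For readers without the test program at hand, the receive side can be sketched roughly as below. This is not the referenced recvfragsize.c; the function names are hypothetical, the fallback value 25 for `IP_RECVFRAGSIZE` comes from the Linux UAPI header, and the sketch assumes the value is delivered as an `int` control message at level `IPPROTO_IP`.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Older libc headers may lack the option; 25 is its value in <linux/in.h>. */
#ifndef IP_RECVFRAGSIZE
#define IP_RECVFRAGSIZE 25
#endif

/* Ask the kernel to attach the largest fragment size to received datagrams. */
int enable_recvfragsize(int fd)
{
    int one = 1;

    return setsockopt(fd, IPPROTO_IP, IP_RECVFRAGSIZE, &one, sizeof(one));
}

/*
 * Scan the control messages that recvmsg() stored in msg.
 * Returns 1 and writes the size to *fragsize if the cmsg is present,
 * 0 if the datagram carried no such cmsg (it did not arrive fragmented).
 */
int find_fragsize(struct msghdr *msg, int *fragsize)
{
    struct cmsghdr *cm;

    assert(msg != NULL && fragsize != NULL);

    for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IP &&
            cm->cmsg_type == IP_RECVFRAGSIZE &&
            cm->cmsg_len >= CMSG_LEN(sizeof(int))) {
            memcpy(fragsize, CMSG_DATA(cm), sizeof(int));
            return 1;
        }
    }
    return 0;
}
```

A receiver would call enable_recvfragsize() once, pass a control buffer to recvmsg(), and then call find_fragsize() on the filled msghdr.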

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

28 Jun, 2016

1 commit

  • If set, these will take precedence over the parent's options during
    both sending and child creation. If they're not set, the parent's
    options (if any) will be used.

    This is to allow the security_inet_conn_request() hook to modify the
    IPv6 options in just the same way that it already may do for IPv4.

    Signed-off-by: Huw Davies
    Signed-off-by: Paul Moore

    Huw Davies
     

19 Dec, 2015

1 commit

  • Allow accepted sockets to derive their sk_bound_dev_if setting from the
    l3mdev domain in which the packets originated. A sysctl setting is added
    to control the behavior which is similar to sk_mark and
    sysctl_tcp_fwmark_accept.

    This effectively allows a process to have a "VRF-global" listen socket,
    with child sockets bound to the VRF device in which the packet originated.
    A similar behavior can be achieved using sk_mark, but a solution using marks
    is incomplete as it does not handle duplicate addresses in different L3
    domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev
    domain provides a complete solution.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

08 Dec, 2015

1 commit

  • TCP SYNACK messages might now be attached to request sockets.

    XFRM needs to get back to a listener socket.

    Adds new helpers that might be used elsewhere :
    sk_to_full_sk() and sk_const_to_full_sk()

    Note: We also need to add RCU protection for xfrm lookups,
    now TCP/DCCP have lockless listener processing. This will
    be addressed in separate patches.

    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Reported-by: Dave Jones
    Signed-off-by: Eric Dumazet
    Cc: Steffen Klassert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Nov, 2015

1 commit


05 Oct, 2015

1 commit

  • inet_reqsk_alloc() is used to allocate a temporary request
    in order to generate a SYNACK with a cookie. Then later,
    syncookie validation also uses a temporary request.

    These paths already took a reference on listener refcount,
    we can avoid a couple of atomic operations.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Jun, 2015

1 commit

  • When an application needs to force a source IP on an active TCP socket
    it has to use bind(IP, port=x).

    As most applications do not want to deal with already used ports, x is
    often set to 0, meaning the kernel is in charge of finding an available
    port.
    But the kernel does not know yet if this socket is going to be a listener
    or be connected.
    It has very limited choices (no full knowledge of the final 4-tuple for a
    connect()).

    With limited ephemeral port range (about 32K ports), it is very easy to
    fill the space.

    This patch adds a new SOL_IP socket option, asking kernel to ignore
    the 0 port provided by application in bind(IP, port=0) and only
    remember the given IP address.

    The port will be automatically chosen at connect() time, in a way
    that allows sharing a source port as long as the 4-tuples are unique.

    This new feature is available for both IPv4 and IPv6 (Thanks Neal)

    Tested:

    Wrote a test program and checked its behavior on IPv4 and IPv6.

    strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
    connect().
    Also getsockname() shows that the port is still 0 right after bind()
    but properly allocated after connect().

    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
    setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
    connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
    getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0

    IPv6 test :

    socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
    setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
    bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
    connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
    getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
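The IPv4 sequence in the strace output above corresponds roughly to the following sketch. It is not the test program mentioned in the message; the helper names are hypothetical, and the fallback value 24 for `IP_BIND_ADDRESS_NO_PORT` comes from the Linux UAPI header.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Older libc headers may lack the option; 24 is its value in <linux/in.h>. */
#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24
#endif

/*
 * Hypothetical helper: bind a TCP socket to the IPv4 address `ip` with
 * port 0, asking the kernel to postpone port selection until connect().
 * Returns the socket, or -1 on any failure.
 */
int bind_addr_no_port(const char *ip)
{
    struct sockaddr_in sa;
    int one = 1;
    int fd;

    assert(ip != NULL);

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(0);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0 ||
        bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Local port of fd in host byte order, or -1 on error. */
int local_port(int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -1;
    return ntohs(sa.sin_port);
}
```

Calling local_port() right after bind_addr_no_port() is expected to report 0, as in the first getsockname() line of the trace; a real port only appears after connect().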

    I was able to bind()/connect() a million concurrent IPv4 sockets,
    instead of ~32000 before patch.

    lpaa23:~# ulimit -n 1000010
    lpaa23:~# ./bind --connect --num-flows=1000000 &
    1000000 sockets

    lpaa23:~# grep TCP /proc/net/sockstat
    TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66

    Check that a given source port is indeed used by many different
    connections :

    lpaa23:~# ss -t src :40000 | head -10
    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 0 127.0.0.2:40000 127.0.202.33:44983
    ESTAB 0 0 127.0.0.2:40000 127.2.27.240:44983
    ESTAB 0 0 127.0.0.2:40000 127.2.98.5:44983
    ESTAB 0 0 127.0.0.2:40000 127.0.124.196:44983
    ESTAB 0 0 127.0.0.2:40000 127.2.139.38:44983
    ESTAB 0 0 127.0.0.2:40000 127.1.59.80:44983
    ESTAB 0 0 127.0.0.2:40000 127.3.6.228:44983
    ESTAB 0 0 127.0.0.2:40000 127.0.38.53:44983
    ESTAB 0 0 127.0.0.2:40000 127.1.197.10:44983

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2015

4 commits

  • While testing last patch series, I found req sock refcounting was wrong.

    We must set skc_refcnt to 1 for all request socks added in hashes,
    but also on request sockets created by FastOpen or syncookies.

    It is tricky because we need to defer this initialization so that
    future RCU lookups do not try to take a refcount on a not yet
    fully initialized request socket.

    Also get rid of ireq_refcnt alias.

    Signed-off-by: Eric Dumazet
    Fixes: 13854e5a6046 ("inet: add proper refcounting to request sock")
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • inet_reqsk_alloc() is becoming fat and should not be inlined.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • listener socket can be used to set net pointer, and will
    be later used to hold a reference on listener.

    Add a const qualifier to first argument (struct request_sock_ops *),
    and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • On 64bit arches, we can save 8 bytes in inet_request_sock
    by moving ir_mark to fill a hole.

    While we are at it, inet_request_mark() can get a const qualifier
    for listener socket.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Mar, 2015

1 commit

  • reqsk_put() is the generic function that should be used
    to release a refcount (and automatically call reqsk_free())

    reqsk_free() might be called if refcount is known to be 0
    or undefined.

    refcnt is set to one in inet_csk_reqsk_queue_add()

    As request socks are not yet in global ehash table,
    I added temporary debugging checks in reqsk_put() and reqsk_free()
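The put/free split described above can be modelled in userspace as follows. This is only a sketch with hypothetical names (`req_model`, `req_publish`, and so on) built on C11 atomics; it is not the kernel's reqsk_put()/reqsk_free() code, and the assert merely stands in for the temporary debugging checks mentioned in the message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Simplified stand-in for a request sock: only the refcount is modelled. */
struct req_model {
    atomic_int refcnt;
};

/* A fresh request starts with refcount 0: nobody else can see it yet. */
struct req_model *req_alloc(void)
{
    struct req_model *r = malloc(sizeof(*r));

    if (r != NULL)
        atomic_init(&r->refcnt, 0);
    return r;
}

/* Models the point where the request is queued and refcnt becomes 1. */
void req_publish(struct req_model *r)
{
    atomic_store(&r->refcnt, 1);
}

/* Take an extra reference on a published request. */
void req_hold(struct req_model *r)
{
    atomic_fetch_add(&r->refcnt, 1);
}

/* Direct free: only legal when no reference is outstanding. */
void req_free(struct req_model *r)
{
    assert(atomic_load(&r->refcnt) == 0); /* debugging check */
    free(r);
}

/* Generic release: drops one reference, frees on the last one.
 * Returns 1 if the object was freed, 0 otherwise. */
int req_put(struct req_model *r)
{
    if (atomic_fetch_sub(&r->refcnt, 1) == 1) {
        req_free(r);
        return 1;
    }
    return 0;
}
```

Callers holding a reference use req_put(); req_free() is reserved for objects that were never published.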

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Mar, 2015

4 commits


12 Mar, 2015

1 commit

  • A long standing problem in netlink socket dumps is the use
    of kernel socket addresses as cookies.

    1) It is a security concern.

    2) Sockets can be reused quite quickly, so there is
    no guarantee that a cookie is used only once and identifies
    a single flow.

    3) request sock, establish sock, and timewait socks
    for a given flow have different cookies.

    Part of our effort to bring better TCP statistics requires
    switching to a different allocator.

    In this patch, I chose to use a per network namespace 64bit generator,
    and to use it only in the case a socket needs to be dumped to netlink.
    (This might be refined later if needed)

    Note that I tried to carry cookies from request sock, to establish sock,
    then timewait sockets.

    Signed-off-by: Eric Dumazet
    Cc: Eric Salo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Jan, 2015

3 commits

    Add ip_cmsg_recv_offset function which takes an offset argument
    that indicates the starting offset in skb where data is being received
    from. This will be useful in the case of UDP and providing the checksum
    to user space.

    ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
    zero.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
    they can be referenced in other source files.

    Restructure ip_cmsg_recv to not go through flags using shift, check
    for flags by 'and'. This eliminates both the shift and a conditional
    per flag check.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Move convert_csum from udp_sock to inet_sock. This allows the
    possibility that we can use convert checksum for different types
    of sockets and also allows convert checksum to be enabled from
    inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

28 Jun, 2014

1 commit

    Since pktopts is used for IPv6 only and opt is used for IPv4
    only, we can move these fields into a union and this allows us to drop
    the inet6_reqsk_alloc function as after this change it becomes
    equivalent to inet_reqsk_alloc.

    This patch also fixes a kmemcheck issue in the IPv6 stack: the flags
    field was not annotated after a request_sock was allocated.

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

14 May, 2014

1 commit

  • When using mark-based routing, sockets returned from accept()
    may need to be marked differently depending on the incoming
    connection request.

    This is the case, for example, if different socket marks identify
    different networks: a listening socket may want to accept
    connections from all networks, but each connection should be
    marked with the network that the request came in on, so that
    subsequent packets are sent on the correct network.

    This patch adds a sysctl to mark TCP sockets based on the fwmark
    of the incoming SYN packet. If enabled, and an unmarked socket
    receives a SYN, then the SYN packet's fwmark is written to the
    connection's inet_request_sock, and later written back to the
    accepted socket when the connection is established. If the
    socket already has a nonzero mark, then the behaviour is the same
    as it is today, i.e., the listening socket's fwmark is used.

    Black-box tested using user-mode linux:

    - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
    mark of the incoming SYN packet.
    - The socket returned by accept() is marked with the mark of the
    incoming SYN packet.
    - Tested with syncookies=1 and syncookies=2.

    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

20 Oct, 2013

2 commits


11 Oct, 2013

1 commit

  • In commit 634fb979e8f ("inet: includes a sock_common in request_sock")
    I forgot that the two ports in sock_common do not have same byte order :

    skc_dport is __be16 (network order), but skc_num is __u16 (host order)

    So sparse complains because ir_loc_port (mapped into skc_num) is
    considered as __u16 while it should be __be16

    Let's rename ir_loc_port to ireq->ir_num (by analogy with inet->inet_num),
    and perform appropriate htons/ntohs conversions.
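The byte-order asymmetry can be illustrated with a small userspace model. The struct and helper names are hypothetical; only the convention (remote port kept in network order, local port kept in host order) is taken from the message.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/* Two ports stored side by side with different byte orders. */
struct ports_model {
    uint16_t dport; /* network byte order, like skc_dport */
    uint16_t num;   /* host byte order, like skc_num */
};

/* Fill the pair from two ports as they appear on the wire. */
void ports_set(struct ports_model *p, uint16_t remote_be, uint16_t local_be)
{
    assert(p != NULL);
    p->dport = remote_be;       /* stored unchanged */
    p->num = ntohs(local_be);   /* converted to host order */
}

/* Local port converted back to wire format, e.g. for a TCP header. */
uint16_t ports_local_wire(const struct ports_model *p)
{
    assert(p != NULL);
    return htons(p->num);
}
```

Forgetting either conversion goes unnoticed on big-endian machines and silently swaps the bytes on little-endian ones, which is the class of mistake sparse annotations such as __be16 are meant to catch.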

    Signed-off-by: Eric Dumazet
    Reported-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Oct, 2013

1 commit

  • TCP listener refactoring, part 5 :

    We want to be able to insert request sockets (SYN_RECV) into main
    ehash table instead of the per listener hash table to allow RCU
    lookups and remove listener lock contention.

    This patch includes the needed struct sock_common in front
    of struct request_sock

    This means there is no more inet6_request_sock IPv6 specific
    structure.

    Following inet_request_sock fields were renamed as they became
    macros to reference fields from struct sock_common.
    Prefix ir_ was chosen to avoid name collisions.

    loc_port -> ir_loc_port
    loc_addr -> ir_loc_addr
    rmt_addr -> ir_rmt_addr
    rmt_port -> ir_rmt_port
    iif -> ir_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2013

1 commit

  • TCP listener refactoring, part 2 :

    We can use a generic lookup, sockets being in whatever state, if
    we are sure all relevant fields are at the same place in all socket
    types (ESTABLISH, TIME_WAIT, SYN_RECV)

    This patch removes these macros :

    inet_addrpair, inet_portpair, tw_addrpair, tw_portpair

    And adds :

    sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr

    Then, INET_TW_MATCH() is really the same as INET_MATCH()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Sep, 2013

1 commit

  • If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
    packets with the specified TTL or TOS overriding the socket values specified
    with the traditional setsockopt().

    The struct inet_cork stores the values of TOS, TTL and priority that are
    passed through the struct ipcm_cookie. If there are user-specified TOS
    (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
    used to override the per-socket values. In case of TOS also the priority
    is changed accordingly.

    Two helper functions get_rttos and get_rtconn_flags are defined to take
    into account the presence of a user specified TOS value when computing
    RT_TOS and RT_CONN_FLAGS.
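From user space, the per-message override described above means attaching IP_TOS and IP_TTL control messages to a sendmsg() call. The sketch below only builds the control buffer; the names `attach_tos_ttl` and `TOS_TTL_CMSG_SPACE` are hypothetical, and both values are assumed to be passed as `int`.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Room for two int-sized control messages. */
#define TOS_TTL_CMSG_SPACE (2 * CMSG_SPACE(sizeof(int)))

/* Fill one int-valued IPPROTO_IP cmsg and return the next header slot. */
static struct cmsghdr *put_int_cmsg(struct msghdr *msg, struct cmsghdr *cm,
                                    int type, int value)
{
    cm->cmsg_level = IPPROTO_IP;
    cm->cmsg_type = type;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &value, sizeof(int));
    return CMSG_NXTHDR(msg, cm);
}

/*
 * Attach IP_TOS and IP_TTL ancillary data to msg, using buf as storage.
 * buf must be suitably aligned for struct cmsghdr and at least
 * TOS_TTL_CMSG_SPACE bytes long. Returns 0 on success, -1 otherwise.
 */
int attach_tos_ttl(struct msghdr *msg, void *buf, size_t buflen,
                   int tos, int ttl)
{
    struct cmsghdr *cm;

    assert(msg != NULL && buf != NULL);
    if (buflen < TOS_TTL_CMSG_SPACE)
        return -1;

    /* Zero the buffer so CMSG_NXTHDR never reads stale lengths. */
    memset(buf, 0, TOS_TTL_CMSG_SPACE);
    msg->msg_control = buf;
    msg->msg_controllen = TOS_TTL_CMSG_SPACE;

    cm = CMSG_FIRSTHDR(msg);
    cm = put_int_cmsg(msg, cm, IP_TOS, tos);
    if (cm == NULL)
        return -1;
    put_int_cmsg(msg, cm, IP_TTL, ttl);
    return 0;
}
```

The caller would then fill msg_name and msg_iov as usual and pass the msghdr to sendmsg(); the socket-level values set with setsockopt() stay untouched for later messages.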

    Signed-off-by: Francesco Fusco
    Signed-off-by: David S. Miller

    Francesco Fusco
     

22 Sep, 2013

1 commit

    There is a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

12 Jun, 2013

1 commit


22 Feb, 2013

1 commit

    It looks like it's possible to open thousands of TCP IPv6
    sessions on a server, all landing in a single slot of the TCP hash
    table. Incoming packets have to look up sockets in a very
    long list.

    We should hash all bits from foreign IPv6 addresses, using
    a salt and hash mix, not a simple XOR.

    inet6_ehashfn() can also separately use the ports, instead
    of xoring them.

    Reported-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Dec, 2012

1 commit

  • commit 68835aba4d9b (net: optimize INET input path further)
    moved some fields used for tcp/udp sockets lookup in the first cache
    line of struct sock_common.

    This patch moves inet_dport/inet_num as well, filling a 32bit hole
    on 64 bit arches and reducing number of cache line misses in lookups.

    Also change INET_MATCH()/INET_TW_MATCH() to perform the ports match
    before addresses match, as this check is more discriminant.

    Remove the hash check from MATCH() macros because we don't need to
    revalidate the hash value after taking a refcount on socket, and
    use likely/unlikely compiler hints, as the sk_hash/hash check
    makes the following conditional tests 100% predicted by cpu.

    Introduce skc_addrpair/skc_portpair pair values to better
    document the alignment requirements of the port/addr pairs
    used in the various MATCH() macros, and remove some casts.

    The namespace check can also be done last.

    This slightly improves TCP/UDP lookup times.

    IP/TCP early demux needs inet->rx_dst_ifindex and
    TCP needs inet->min_ttl, let's group them together in same cache line.

    With help from Ben Hutchings & Joe Perches.

    Idea of this patch came after Ling Ma proposal to move skc_hash
    to the beginning of struct sock_common, and should allow him
    to submit a final version of his patch. My tests show an improvement
    doing so.

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Joe Perches
    Cc: Ling Ma
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write() calls
    into single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page

    It's also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
    page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on loopback device,
    but also benefits other network devices, since 8x fewer frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.
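The page counts quoted above follow from simple ceiling division. The helper below is a hypothetical illustration of that arithmetic, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fragments of frag_size bytes needed to hold len bytes. */
size_t frags_needed(size_t len, size_t frag_size)
{
    assert(frag_size > 0);
    return (len + frag_size - 1) / frag_size; /* round up */
}
```

With a 64KB TSO packet, frags_needed(65536, 4096) gives 16 order-0 pages, while frags_needed(65536, 32768) gives 2 fragments of 32768 bytes, which is where the factor of 8 comes from.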

    It's possible some SG enabled hardware can't cope with bigger fragments,
    but their ndo_start_xmit() should already handle this, splitting a
    fragment into sub fragments, since some arches have PAGE_SIZE=65536

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Aug, 2012

1 commit

  • IPv6 needs a cookie in dst_check() call.

    We need to add rx_dst_cookie and provide a family independent
    sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet