30 Nov, 2016

1 commit


04 Nov, 2016

2 commits

  • dccp_v4_err() does not use pskb_may_pull() and might access garbage.

    We only need 4 bytes at the beginning of the DCCP header, like TCP,
    so the 8 bytes pulled in icmp_socket_deliver() are more than enough.

    This patch might allow more ICMP messages to be processed, as some routers
    still limit the size of reflected bytes to 28 (RFC 792), instead
    of extended lengths (RFC 1812 4.3.2.3).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
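
The length argument in the commit above can be sketched in userspace C. This is not the kernel code: `quoted_ports()` and `struct ports` are hypothetical names used only for illustration. The sketch reads the source and destination ports (the first 4 bytes of the transport header, as in DCCP and TCP) only when the quoted payload is long enough. With a 20-byte IP header and 28 reflected bytes, 8 bytes of transport header remain, which is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, for illustration only. */
struct ports {
    uint16_t sport;
    uint16_t dport;
};

/* Bytes of the transport header needed to identify the socket. */
#define PORTS_LEN 4u

/*
 * Extract the ports from the transport header quoted in an ICMP error.
 * Returns 0 on success, -1 if fewer than PORTS_LEN bytes are available,
 * in which case the buffer is not read at all.
 */
static int quoted_ports(const uint8_t *hdr, size_t avail, struct ports *out)
{
    assert(out != NULL);
    if (hdr == NULL || avail < PORTS_LEN)
        return -1;
    /* Ports are carried in network byte order (big endian). */
    out->sport = (uint16_t)((hdr[0] << 8) | hdr[1]);
    out->dport = (uint16_t)((hdr[2] << 8) | hdr[3]);
    return 0;
}
```

A caller would pass the number of bytes actually present after the inner IP header and drop the ICMP message when the function returns -1.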
     
  • Andrey Konovalov reported the following error while fuzzing with syzkaller:

    IPv4: Attempt to release alive inet socket ffff880068e98940
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    Modules linked in:
    CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88006b9e0000 task.stack: ffff880068770000
    RIP: 0010:[] []
    selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
    RSP: 0018:ffff8800687771c8 EFLAGS: 00010202
    RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
    RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
    RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
    R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
    R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
    FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
    Stack:
    ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
    ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
    ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
    Call Trace:
    [] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
    [] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
    [] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
    [] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
    [] ip_local_deliver_finish+0x332/0xad0
    net/ipv4/ip_input.c:216
    [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
    [< inline >] NF_HOOK ./include/linux/netfilter.h:255
    [] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
    [< inline >] dst_input ./include/net/dst.h:507
    [] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
    [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
    [< inline >] NF_HOOK ./include/linux/netfilter.h:255
    [] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
    [] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
    [] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
    [] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
    [] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
    [] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
    [] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
    [< inline >] new_sync_write fs/read_write.c:499
    [] __vfs_write+0x334/0x570 fs/read_write.c:512
    [] vfs_write+0x17b/0x500 fs/read_write.c:560
    [< inline >] SYSC_write fs/read_write.c:607
    [] SyS_write+0xd4/0x1a0 fs/read_write.c:599
    [] entry_SYSCALL_64_fastpath+0x1f/0xc2

    It turns out DCCP calls __sk_receive_skb(), and this broke when
    lookups no longer took a reference on listeners.

    Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
    so that sock_put() is used only when needed.

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    Eric Dumazet
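
The idea behind the @refcounted parameter can be illustrated with a toy model. The names `toy_sock`, `toy_sock_put()` and `toy_receive()` are hypothetical and only mimic the shape of the fix; they are not the kernel's `struct sock` or `__sk_receive_skb()`. The point is that the receive path releases a reference only when the lookup actually took one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy socket with a reference count; not the kernel's struct sock. */
struct toy_sock {
    int refcnt;
};

/* Release one reference; the caller must own it. */
static void toy_sock_put(struct toy_sock *sk)
{
    assert(sk->refcnt > 0);
    sk->refcnt--;
}

/*
 * Hypothetical receive path: drop the caller's reference only if the
 * lookup took one (refcounted == true). Listeners found without a
 * reference must not be released here.
 */
static int toy_receive(struct toy_sock *sk, bool refcounted)
{
    /* ... packet processing would happen here ... */
    if (refcounted)
        toy_sock_put(sk);
    return 0;
}
```

Calling `toy_receive(&listener, false)` leaves the listener's count untouched, while `toy_receive(&established, true)` drops the reference taken by the lookup.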
     

14 Jul, 2016

1 commit

  • DCCP verifies packet integrity, including length, at initial rcv in
    dccp_invalid_packet, and later pulls headers in dccp_enqueue_skb.

    A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
    skb_copy_datagram_msg interprets this as a negative value, so
    (correctly) fails with EFAULT. The negative length is reported in
    ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

    Introduce an sk_receive_skb variant that caps how small a filter
    program can trim packets, and call this in dccp with the header
    length. Excessively trimmed packets are now processed normally and
    queued for reception as 0B payloads.

    Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
    Signed-off-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
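
The wrap-around and the effect of the cap can be shown with unsigned arithmetic. This is a minimal sketch under assumptions: `trim_with_cap()` and `payload_after_pull()` are hypothetical helpers that model lengths only, not the kernel's skb handling.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of a filter verdict: the filter asks to keep `want` bytes.
 * `cap` is the minimum length the caller needs (here, the header length).
 * Returns the length the packet is actually trimmed to.
 */
static uint32_t trim_with_cap(uint32_t pkt_len, uint32_t want, uint32_t cap)
{
    uint32_t keep = want > cap ? want : cap;   /* never below the cap */
    return keep < pkt_len ? keep : pkt_len;    /* never grow the packet */
}

/* Payload length left after pulling `hdr_len` bytes of header. */
static uint32_t payload_after_pull(uint32_t len, uint32_t hdr_len)
{
    return len - hdr_len;   /* wraps around if len < hdr_len */
}
```

Without the cap, a packet trimmed to 4 bytes and then pulled by a 16-byte header yields a huge unsigned length, which reads as negative when interpreted as signed. With the cap set to the header length, the same packet ends up with a 0-byte payload.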
     

10 Jul, 2016

1 commit

  • In the prep work I did before enabling BH while handling socket backlog,
    I missed two points in DCCP :

    1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BH was
    blocked. This is no longer always true.

    2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of
    IP_INC_STATS()

    A similar fix was done for TCP, in commit 47dcc20a39d0
    ("ipv4: tcp: ip_send_unicast_reply() is not BH safe")

    Fixes: 7309f8821fd6 ("dccp: do not assume DCCP code is non preemptible")
    Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 May, 2016

1 commit


28 Apr, 2016

4 commits


08 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
    cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

    By letting listeners use SOCK_RCU_FREE infrastructure,
    we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

    Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
    only listeners are impacted by this change.

    Peak performance under SYNFLOOD is increased by ~33% :

    On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

    Most consuming functions are now skb_set_owner_w() and sock_wfree()
    contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2016

1 commit

  • Now that SYN_RECV request sockets are installed in ehash table, an ICMP
    handler can find a request socket while another cpu handles an incoming
    packet transforming this SYN_RECV request socket into an ESTABLISHED
    socket.

    We need to remove the now obsolete WARN_ON(req->sk), since req->sk
    is set when a new child is created and added into listener accept queue.

    If this race happens, the ICMP will do nothing special.

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Reported-by: Ben Lazarus
    Reported-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Feb, 2016

1 commit


19 Feb, 2016

1 commit

  • Ilya reported the following lockdep splat:

    kernel: =========================
    kernel: [ BUG: held lock freed! ]
    kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
    kernel: -------------------------
    kernel: swapper/5/0 is freeing memory
    ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
    kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0
    kernel: 4 locks held by swapper/5/0:
    kernel: #0: (rcu_read_lock){......}, at: []
    netif_receive_skb_internal+0x4b/0x1f0
    kernel: #1: (rcu_read_lock){......}, at: []
    ip_local_deliver_finish+0x3f/0x380
    kernel: #2: (slock-AF_INET){+.-...}, at: []
    sk_clone_lock+0x19b/0x440
    kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0

    To properly fix this issue, inet_csk_reqsk_queue_add() needs
    to return to its callers if the child has been queued
    into accept queue.

    We also need to make sure listener is still there before
    calling sk->sk_data_ready(), by holding a reference on it,
    since the reference carried by the child can disappear as
    soon as the child is put on accept queue.

    Reported-by: Ilya Dryomov
    Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Feb, 2016

1 commit

  • This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
    groups. Doing so with a BPF filter will require access to the
    skb in question. This change plumbs the skb (and offset to payload
    data) through the call stack to the listening socket lookup
    implementations where it will be used in a following patch.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
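
The "only one must win" rule can be sketched as a replace operation on a hash bucket. This is a single-threaded toy model with hypothetical names (`toy_bucket`, `toy_replace_req()`); in the kernel the equivalent check runs under the ehash bucket lock, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BUCKET_SLOTS 8

/* Toy hash bucket holding opaque entries; the real ehash uses lists
 * protected by a per-bucket spinlock, which is omitted here. */
struct toy_bucket {
    const void *slot[BUCKET_SLOTS];
};

/*
 * Replace `req` with `child` in the bucket. Returns true if `req` was
 * still present (this caller won), false if someone already removed it
 * (this caller lost and must undo the child creation).
 */
static bool toy_replace_req(struct toy_bucket *b, const void *req,
                            const void *child)
{
    assert(b != NULL && req != NULL && child != NULL);
    for (size_t i = 0; i < BUCKET_SLOTS; i++) {
        if (b->slot[i] == req) {
            b->slot[i] = child;
            return true;
        }
    }
    return false;
}
```

Two callers processing duplicate ACKs for the same request would both call `toy_replace_req()`; the second one gets `false` and has to free the child it just created.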
     

16 Oct, 2015

2 commits

  • Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
    In many cases we also need to release reference on request socket,
    so add a helper to do this, reducing code size and complexity.

    Fixes: 4bdc3d66147b ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This reverts commit c69736696cf3742b37d850289dc0d7ead177bb14.

    At the time of above commit, tcp_req_err() and dccp_req_err()
    were dead code, as SYN_RECV request sockets were not yet in ehash table.

    Real bug was fixed later in a different commit.

    We need to revert it so that we do not leak a refcount on request socket.

    inet_csk_reqsk_queue_drop_and_put() will be added
    in following commit to make clear that inet_csk_reqsk_queue_drop()
    does not release the reference owned by caller.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Oct, 2015

1 commit

  • When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
    become stale, meaning 3WHS can not complete.

    But current behavior is wrong :
    incoming packets finding such stale sockets are dropped.

    We need instead to clean up the request socket and perform another
    lookup :
    - Incoming ACK will give a RST answer,
    - SYN rtx might find another listener if available.
    - We expedite cleanup of request sockets and old listener socket.

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Oct, 2015

1 commit

  • inet_reqsk_alloc() is used to allocate a temporary request
    in order to generate a SYNACK with a cookie. Then later,
    syncookie validation also uses a temporary request.

    These paths already took a reference on listener refcount,
    we can avoid a couple of atomic operations.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Oct, 2015

1 commit

  • In this patch, we insert request sockets into TCP/DCCP
    regular ehash table (where ESTABLISHED and TIMEWAIT sockets
    are) instead of using the per listener hash table.

    ACK packets find SYN_RECV pseudo sockets without having
    to find and lock the listener.

    In nominal conditions, this halves pressure on listener lock.

    Note that this will allow for SO_REUSEPORT refinements,
    so that we can select a listener using cpu/numa affinities instead
    of the prior 'consistent hash', since only SYN packets will
    apply this selection logic.

    We will shrink listen_sock in the following patch to ease
    code review.

    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2015

2 commits


26 Sep, 2015

1 commit


24 Apr, 2015

1 commit

  • [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
    0000000000000080
    [ 3897.931025] IP: [] reqsk_timer_handler+0x1a6/0x243

    There is a race when reqsk_timer_handler() and tcp_check_req() call
    inet_csk_reqsk_queue_unlink() on the same req at the same time.

    Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
    timer"), listener spinlock was held and race could not happen.

    To solve this bug, we change reqsk_queue_unlink() to not assume req
    must be found, and we return a status, to conditionally release a
    refcount on the request sock.

    This also means tcp_check_req() in non fastopen case might or might not
    consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
    to properly handle this.

    (Same remark for dccp_check_req() and its callers)

    inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
    called 4 times in tcp and 3 times in dccp.

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Reported-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
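
The "return a status, then conditionally release a refcount" pattern from the commit above can be sketched as follows. `toy_req`, `toy_req_unlink()` and `toy_req_drop()` are hypothetical names; the sketch is single-threaded and ignores the locking that the real code needs.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy request sock: not the kernel's struct request_sock. */
struct toy_req {
    int refcnt;    /* references held on this request */
    bool queued;   /* true while the request sits in the listener queue */
};

/* Unlink the request if still queued; report whether we did. */
static bool toy_req_unlink(struct toy_req *req)
{
    if (!req->queued)
        return false;   /* somebody else already unlinked it */
    req->queued = false;
    return true;
}

/* Drop the queue's reference only if this caller did the unlink. */
static void toy_req_drop(struct toy_req *req)
{
    if (toy_req_unlink(req)) {
        assert(req->refcnt > 0);
        req->refcnt--;
    }
}
```

If both the timer handler and the packet path call `toy_req_drop()` on the same request, only the first call releases the queue's reference, so the count is not decremented twice.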
     

24 Mar, 2015

3 commits

  • Erik Hugne reported the following error :

    I'm hitting this warning on latest net-next when i try to SSH into a machine
    with eth0 added to a bridge (but i think the problem is older than that)

    Steps to reproduce:
    node2 ~ # brctl addif br0 eth0
    [ 223.758785] device eth0 entered promiscuous mode
    node2 ~ # ip link set br0 up
    [ 244.503614] br0: port 1(eth0) entered forwarding state
    [ 244.505108] br0: port 1(eth0) entered forwarding state
    node2 ~ # [ 251.160159] ------------[ cut here ]------------
    [ 251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
    [ 251.162077] Modules linked in:
    [ 251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
    [ 251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 251.164078] ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
    [ 251.165084] 0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
    [ 251.166195] ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
    [ 251.167203] Call Trace:
    [ 251.167533] [] dump_stack+0x45/0x57
    [ 251.168206] [] warn_slowpath_common+0x85/0xc0
    [ 251.169239] [] warn_slowpath_null+0x15/0x20
    [ 251.170271] [] tcp_v4_err+0x6b1/0x720
    [ 251.171408] [] ? _raw_read_lock_irq+0x3/0x10
    [ 251.172589] [] ? inet_del_offload+0x40/0x40
    [ 251.173366] [] icmp_socket_deliver+0x65/0xb0
    [ 251.174134] [] icmp_unreach+0xc2/0x280
    [ 251.174820] [] icmp_rcv+0x2bd/0x3a0
    [ 251.175473] [] ip_local_deliver_finish+0x82/0x1e0
    [ 251.176282] [] ip_local_deliver+0x88/0x90
    [ 251.177004] [] ip_rcv_finish+0xf0/0x310
    [ 251.177693] [] ip_rcv+0x2dc/0x390
    [ 251.178336] [] __netif_receive_skb_core+0x713/0xa20
    [ 251.179170] [] __netif_receive_skb+0x1a/0x80
    [ 251.179922] [] process_backlog+0x94/0x120
    [ 251.180639] [] net_rx_action+0x1e2/0x310
    [ 251.181356] [] __do_softirq+0xa7/0x290
    [ 251.182046] [] run_ksoftirqd+0x19/0x30
    [ 251.182726] [] smpboot_thread_fn+0x153/0x1d0
    [ 251.183485] [] ? SyS_setgroups+0x130/0x130
    [ 251.184228] [] kthread+0xee/0x110
    [ 251.184871] [] ? kthread_create_on_node+0x1b0/0x1b0
    [ 251.185690] [] ret_from_fork+0x58/0x90
    [ 251.186385] [] ? kthread_create_on_node+0x1b0/0x1b0
    [ 251.187216] ---[ end trace c947fc7b24e42ea1 ]---
    [ 259.542268] br0: port 1(eth0) entered forwarding state

    Remove the double calls to reqsk_put()

    [edumazet] :

    I got confused because reqsk_timer_handler() _has_ to call
    reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
    the timer handler holds a reference on req.

    Signed-off-by: Fan Du
    Signed-off-by: Eric Dumazet
    Reported-by: Erik Hugne
    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: David S. Miller

    Fan Du
     
  • dccp_v4_err() can restrict lookups to ehash table, and not to listeners.

    Note this patch creates the infrastructure, but this means that ICMP
    messages for request sockets are ignored until complete conversion.

    New dccp_req_err() helper is exported so that we can use it in IPv6
    in following patch.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • It is not needed, and req->sk_listener points to the listener anyway.
    request_sock argument can be const.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Mar, 2015

2 commits

  • One of the major issues for TCP is the SYNACK rtx handling,
    done by inet_csk_reqsk_queue_prune(), fired by the keepalive
    timer of a TCP_LISTEN socket.

    This function runs for awfully long times, with socket lock held,
    meaning that other cpus needing this lock have to spin for hundreds of ms.

    SYNACK are sent in huge bursts, likely to cause severe drops anyway.

    This model was OK 15 years ago when memory was very tight.

    We now can afford to have a timer per request sock.

    Timer invocations no longer need to lock the listener,
    and can be run from all cpus in parallel.

    With following patch increasing somaxconn width to 32 bits,
    I tested a listener with more than 4 million active request sockets,
    and a steady SYNFLOOD of ~200,000 SYN per second.
    Host was sending ~830,000 SYNACK per second.

    This is ~100 times more than what we could achieve before this patch.

    Later, we will get rid of the listener hash and use ehash instead.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When request socks are put in ehash table, the whole notion
    of having a previous request to update dl_next is pointless.

    Also, following patch will get rid of big purge timer,
    so we want to delete a request sock without holding listener lock.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Mar, 2015

2 commits

  • In order to be able to use sk_ehashfn() for request socks,
    we need to initialize their IPv6/IPv4 addresses.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Intent is to converge IPv4 & IPv6 inet_hash functions to
    factorize code.

    IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr
    in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set()
    helpers.

    __inet6_hash can now use sk_ehashfn() instead of a private
    inet6_sk_ehashfn() and will simply use __inet_hash() in a
    following patch.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2015

1 commit

  • listener socket can be used to set net pointer, and will
    be later used to hold a reference on listener.

    Add a const qualifier to first argument (struct request_sock_ops *),
    and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Mar, 2015

1 commit

  • Once request socks are in ehash table, they will need to have
    a valid ir_iif field.

    This is currently true only for IPv6. This patch extends support
    for IPv4 as well.

    This means inet_diag_fill_req() can now properly use ir_iif,
    which is better for IPv6 link locals anyway, as request sockets
    and established sockets will propagate consistent netlink idiag_if.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Mar, 2015

2 commits


12 Mar, 2015

2 commits

  • I forgot to use write_pnet() in three locations.

    Signed-off-by: Eric Dumazet
    Fixes: 33cf7c90fe2f9 ("net: add real socket cookies")
    Reported-by: kbuild test robot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • A long standing problem in netlink socket dumps is the use
    of kernel socket addresses as cookies.

    1) It is a security concern.

    2) Sockets can be reused quite quickly, so there is
    no guarantee a cookie is used once and identifies
    a flow.

    3) request sock, established sock, and timewait socks
    for a given flow have different cookies.

    Part of our effort to bring better TCP statistics requires
    switching to a different allocator.

    In this patch, I chose to use a per network namespace 64bit generator,
    and to use it only in the case a socket needs to be dumped to netlink.
    (This might be refined later if needed)

    Note that I tried to carry cookies from request sock, to established sock,
    then timewait sockets.

    Signed-off-by: Eric Dumazet
    Cc: Eric Salo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Nov, 2014

1 commit