17 Dec, 2017

1 commit

  • [ Upstream commit cfac7f836a715b91f08c851df915d401a4d52783 ]

    Maciej Żenczykowski reported some panics in tcp_twsk_destructor()
    that might be caused by the following bug.

    The timewait timer is pinned to the cpu, because we want to transition
    timewait refcount from 0 to 4 in one go, once everything has been
    initialized.

    At the time commit ed2e92394589 ("tcp/dccp: fix timewait races in timer
    handling") was merged, TCP was always running from BH handler.

    After commit 5413d1babe8f ("net: do not block BH while processing
    socket backlog") we definitely can run tcp_time_wait() from process
    context.

    We need to block BH in the critical section so that the pinned timer
    still serves its purpose.

    This bug is more likely to happen under stress and when very small RTOs
    are used in datacenter flows.

    Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Acked-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

08 Mar, 2017

1 commit

  • Dmitry reported crashes in DCCP stack [1]

    Problem here is that when I got rid of listener spinlock, I missed the
    fact that DCCP stores a complex state in struct dccp_request_sock,
    while TCP does not.

    Since multiple cpus could access it at the same time, we need to add
    protection.

    [1]
    BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
    net/dccp/feat.c:1541 at addr ffff88003713be68
    Read of size 8 by task syz-executor2/8457
    CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:15 [inline]
    dump_stack+0x292/0x398 lib/dump_stack.c:51
    kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
    print_address_description mm/kasan/report.c:200 [inline]
    kasan_report_error mm/kasan/report.c:289 [inline]
    kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
    kasan_report mm/kasan/report.c:332 [inline]
    __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
    dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
    dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
    dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
    dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
    dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
    ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
    dst_input include/net/dst.h:507 [inline]
    ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
    __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
    __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
    process_backlog+0xe5/0x6c0 net/core/dev.c:4839
    napi_poll net/core/dev.c:5202 [inline]
    net_rx_action+0xe70/0x1900 net/core/dev.c:5267
    __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
    do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902

    do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
    do_softirq kernel/softirq.c:176 [inline]
    __local_bh_enable_ip+0x1f2/0x200 kernel/softirq.c:181
    local_bh_enable include/linux/bottom_half.h:31 [inline]
    rcu_read_unlock_bh include/linux/rcupdate.h:971 [inline]
    ip6_finish_output2+0xbb0/0x23d0 net/ipv6/ip6_output.c:123
    ip6_finish_output+0x302/0x960 net/ipv6/ip6_output.c:148
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip6_output+0x1cb/0x8d0 net/ipv6/ip6_output.c:162
    ip6_xmit+0xcdf/0x20d0 include/net/dst.h:501
    inet6_csk_xmit+0x320/0x5f0 net/ipv6/inet6_connection_sock.c:179
    dccp_transmit_skb+0xb09/0x1120 net/dccp/output.c:141
    dccp_xmit_packet+0x215/0x760 net/dccp/output.c:280
    dccp_write_xmit+0x168/0x1d0 net/dccp/output.c:362
    dccp_sendmsg+0x79c/0xb10 net/dccp/proto.c:796
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
    sock_sendmsg_nosec net/socket.c:635 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:645
    SYSC_sendto+0x660/0x810 net/socket.c:1687
    SyS_sendto+0x40/0x50 net/socket.c:1655
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    RIP: 0033:0x4458b9
    RSP: 002b:00007f8ceb77bb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 00000000004458b9
    RDX: 0000000000000023 RSI: 0000000020e60000 RDI: 0000000000000017
    RBP: 00000000006e1b90 R08: 00000000200f9fe1 R09: 0000000000000020
    R10: 0000000000008010 R11: 0000000000000282 R12: 00000000007080a8
    R13: 0000000000000000 R14: 00007f8ceb77c9c0 R15: 00007f8ceb77c700
    Object at ffff88003713be50, in cache kmalloc-64 size: 64
    Allocated:
    PID = 8446
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
    save_stack+0x43/0xd0 mm/kasan/kasan.c:502
    set_track mm/kasan/kasan.c:514 [inline]
    kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
    kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2738
    kmalloc include/linux/slab.h:490 [inline]
    dccp_feat_entry_new+0x214/0x410 net/dccp/feat.c:467
    dccp_feat_push_change+0x38/0x220 net/dccp/feat.c:487
    __feat_register_sp+0x223/0x2f0 net/dccp/feat.c:741
    dccp_feat_propagate_ccid+0x22b/0x2b0 net/dccp/feat.c:949
    dccp_feat_server_ccid_dependencies+0x1b3/0x250 net/dccp/feat.c:1012
    dccp_make_response+0x1f1/0xc90 net/dccp/output.c:423
    dccp_v6_send_response+0x4ec/0xc20 net/dccp/ipv6.c:217
    dccp_v6_conn_request+0xaba/0x11b0 net/dccp/ipv6.c:377
    dccp_rcv_state_process+0x51e/0x1650 net/dccp/input.c:606
    dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
    sk_backlog_rcv include/net/sock.h:893 [inline]
    __sk_receive_skb+0x36f/0xcc0 net/core/sock.c:479
    dccp_v6_rcv+0xba5/0x1d00 net/dccp/ipv6.c:742
    ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
    dst_input include/net/dst.h:507 [inline]
    ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
    __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
    __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
    process_backlog+0xe5/0x6c0 net/core/dev.c:4839
    napi_poll net/core/dev.c:5202 [inline]
    net_rx_action+0xe70/0x1900 net/core/dev.c:5267
    __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
    Freed:
    PID = 15
    save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
    save_stack+0x43/0xd0 mm/kasan/kasan.c:502
    set_track mm/kasan/kasan.c:514 [inline]
    kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
    slab_free_hook mm/slub.c:1355 [inline]
    slab_free_freelist_hook mm/slub.c:1377 [inline]
    slab_free mm/slub.c:2954 [inline]
    kfree+0xe8/0x2b0 mm/slub.c:3874
    dccp_feat_entry_destructor.part.4+0x48/0x60 net/dccp/feat.c:418
    dccp_feat_entry_destructor net/dccp/feat.c:416 [inline]
    dccp_feat_list_pop net/dccp/feat.c:541 [inline]
    dccp_feat_activate_values+0x57f/0xab0 net/dccp/feat.c:1543
    dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
    dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
    dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
    dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
    ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
    dst_input include/net/dst.h:507 [inline]
    ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
    NF_HOOK include/linux/netfilter.h:257 [inline]
    ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
    __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
    __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
    process_backlog+0xe5/0x6c0 net/core/dev.c:4839
    napi_poll net/core/dev.c:5202 [inline]
    net_rx_action+0xe70/0x1900 net/core/dev.c:5267
    __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
    Memory state around the buggy address:
    ffff88003713bd00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff88003713bd80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff88003713be00: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
    ^

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Mar, 2017

2 commits

  • When handling problems in cloning a socket with the sk_clone_locked()
    function we need to perform several steps that were open coded in it and
    its callers, so introduce a routine to avoid this duplication:
    sk_free_unlock_clone().

    Cc: Cong Wang
    Cc: Dmitry Vyukov
    Cc: Eric Dumazet
    Cc: Gerrit Renker
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/n/net-ui6laqkotycunhtmqryl9bfx@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
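
The shared error path described above can be sketched in userspace C. This is only an illustration under stated assumptions: `sock_sketch`, `sock_clone_locked_sketch()`, `sock_free_sketch()` and `sock_free_unlock_clone_sketch()` are hypothetical stand-ins, the lock is modelled as a plain flag, and resetting the destructor before freeing is an assumption about what the real helper does rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct sock. */
struct sock_sketch {
    bool locked;                               /* models the socket spinlock */
    void (*destruct)(struct sock_sketch *sk);  /* models a destructor hook */
};

static int freed_count; /* bookkeeping for this sketch only */

static void sock_free_sketch(struct sock_sketch *sk)
{
    /* Freeing memory that still holds a lock is the bug being avoided. */
    assert(!sk->locked);
    if (sk->destruct)
        sk->destruct(sk);
    free(sk);
    freed_count++;
}

/* Returns a locked copy of parent, or NULL on allocation failure. */
static struct sock_sketch *
sock_clone_locked_sketch(const struct sock_sketch *parent)
{
    struct sock_sketch *newsk = malloc(sizeof(*newsk));

    if (newsk == NULL)
        return NULL;
    *newsk = *parent;
    newsk->locked = true;
    return newsk;
}

/* One routine for every caller's error path: unlock first, then free. */
static void sock_free_unlock_clone_sketch(struct sock_sketch *sk)
{
    sk->destruct = NULL; /* assumption: the clone is still a raw copy */
    sk->locked = false;
    sock_free_sketch(sk);
}
```

A caller that fails after cloning would call `sock_free_unlock_clone_sketch()` instead of repeating the unlock-and-free steps by hand.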
     
  • The code where sk_clone() came from created a new socket and locked it,
    but then, on the error path didn't unlock it.

    This problem stayed there for a long while, till b0691c8ee7c2 ("net:
    Unlock sock before calling sk_free()") fixed it, but unfortunately the
    callers of sk_clone() (now sk_clone_locked()) were not audited and the
    one in dccp_create_openreq_child() remained.

    Now in the age of the syzkaller fuzzer, this was finally uncovered, as
    reported by Dmitry:

    ---- 8< ----

    I've got the following report while running syzkaller fuzzer on
    86292b33d4b7 ("Merge branch 'akpm' (patches from Andrew)")

    [ BUG: held lock freed! ]
    4.10.0+ #234 Not tainted
    -------------------------
    syz-executor6/6898 is freeing memory
    ffff88006286cac0-ffff88006286d3b7, with a lock still held there!
    (slock-AF_INET6){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    (slock-AF_INET6){+.-...}, at: []
    sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504
    5 locks held by syz-executor6/6898:
    #0: (sk_lock-AF_INET6){+.+.+.}, at: [] lock_sock
    include/net/sock.h:1460 [inline]
    #0: (sk_lock-AF_INET6){+.+.+.}, at: []
    inet_stream_connect+0x44/0xa0 net/ipv4/af_inet.c:681
    #1: (rcu_read_lock){......}, at: []
    inet6_csk_xmit+0x12a/0x5d0 net/ipv6/inet6_connection_sock.c:126
    #2: (rcu_read_lock){......}, at: [] __skb_unlink
    include/linux/skbuff.h:1767 [inline]
    #2: (rcu_read_lock){......}, at: [] __skb_dequeue
    include/linux/skbuff.h:1783 [inline]
    #2: (rcu_read_lock){......}, at: []
    process_backlog+0x264/0x730 net/core/dev.c:4835
    #3: (rcu_read_lock){......}, at: []
    ip6_input_finish+0x0/0x1700 net/ipv6/ip6_input.c:59
    #4: (slock-AF_INET6){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    #4: (slock-AF_INET6){+.-...}, at: []
    sk_clone_lock+0x3d9/0x12c0 net/core/sock.c:1504

    Fix it just as was done in b0691c8ee7c2 ("net: Unlock sock before calling
    sk_free()").

    Reported-by: Dmitry Vyukov
    Cc: Cong Wang
    Cc: Eric Dumazet
    Cc: Gerrit Renker
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170301153510.GE15145@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

28 Apr, 2016

1 commit


23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
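
The "only one must win" rule above can be illustrated with a small single-threaded sketch. Everything here is hypothetical (`node_sketch`, `bucket_sketch`, `bucket_replace_req()`); it does not reproduce inet_ehash_nolisten(), and the bucket lock that would make the remove-and-insert step atomic in the kernel is only indicated by a comment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified hash bucket: a singly linked list. */
struct node_sketch {
    struct node_sketch *next;
};

struct bucket_sketch {
    struct node_sketch *head;
};

/* Remove n from the bucket; report whether it was actually present. */
static bool bucket_del(struct bucket_sketch *b, struct node_sketch *n)
{
    struct node_sketch **pp;

    for (pp = &b->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            return true;
        }
    }
    return false;
}

/*
 * Swap the request for the new child. In the kernel this whole function
 * would run under the bucket lock, so only one cpu can find the request.
 * A false return tells the caller to undo its child creation.
 */
static bool bucket_replace_req(struct bucket_sketch *b,
                               struct node_sketch *req,
                               struct node_sketch *child)
{
    if (!bucket_del(b, req))
        return false;
    child->next = b->head;
    b->head = child;
    return true;
}
```

With two children competing for the same request, the first call succeeds and the second returns false, which is the signal to discard the losing child.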
     

30 Sep, 2015

2 commits


22 Sep, 2015

1 commit

  • When creating a timewait socket, we need to arm the timer before
    allowing other cpus to find it. The signal allowing cpus to find
    the socket is setting tw_refcnt to non zero value.

    As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
    call inet_twsk_schedule() first.

    This also means we need to remove tw_refcnt changes from
    inet_twsk_schedule() and let the caller handle it.

    Note that because we use mod_timer_pinned(), we have the guarantee
    the timer won't expire before we set tw_refcnt as we run in BH context.

    To make things more readable I introduced inet_twsk_reschedule() helper.

    When rearming the timer, we can use mod_timer_pending() to make sure
    we do not rearm a canceled timer.

    Note: This bug can possibly trigger if packets of a flow can hit
    multiple cpus. This does not normally happen, unless flow steering
    is broken somehow. This explains why this bug was spotted ~5 months after
    its introduction.

    A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
    but will be provided in a separate patch for proper tracking.

    Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
    Signed-off-by: Eric Dumazet
    Reported-by: Ying Cai
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Apr, 2015

1 commit

  • [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
    0000000000000080
    [ 3897.931025] IP: [] reqsk_timer_handler+0x1a6/0x243

    There is a race when reqsk_timer_handler() and tcp_check_req() call
    inet_csk_reqsk_queue_unlink() on the same req at the same time.

    Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
    timer"), listener spinlock was held and race could not happen.

    To solve this bug, we change reqsk_queue_unlink() to not assume req
    must be found, and we return a status, to conditionally release a
    refcount on the request sock.

    This also means tcp_check_req() in the non-fastopen case might or might
    not consume the req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
    to properly handle this.

    (Same remark for dccp_check_req() and its callers)

    inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
    called 4 times in tcp and 3 times in dccp.

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Reported-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
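
A minimal sketch of the "return a status, release the refcount conditionally" idea, under stated assumptions: `req_sketch`, `queue_sketch`, `queue_unlink()` and `queue_drop()` are hypothetical names, the refcount is a plain integer, and the locking that the real code needs is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request and queue. */
struct req_sketch {
    struct req_sketch *next;
    int refcnt;
};

struct queue_sketch {
    struct req_sketch *head;
};

/* Unlink req if it is still queued; do not assume it must be found. */
static bool queue_unlink(struct queue_sketch *q, struct req_sketch *req)
{
    struct req_sketch **pp;

    for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == req) {
            *pp = req->next;
            req->next = NULL;
            return true;
        }
    }
    return false;
}

/* Release the queue's reference only if this caller did the unlink. */
static void queue_drop(struct queue_sketch *q, struct req_sketch *req)
{
    if (queue_unlink(q, req))
        req->refcnt--;
}
```

If the timer path and the ACK path both call `queue_drop()` on the same request, the second call finds nothing to unlink and leaves the refcount alone.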
     

14 Apr, 2015

1 commit

  • Using a timer wheel for timewait sockets was nice ~15 years ago when
    memory was expensive and machines had a single processor.

    This does not scale, the code is ugly, and it is a source of huge latencies
    (typically 30 ms have been seen, with cpus spinning on the death_lock spinlock).

    We can afford to use an extra 64 bytes per timewait sock and spread
    timewait load to all cpus to have better behavior.

    Tested:

    In the following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
    on the target (lpaa24)

    Before patch :

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    419594

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    437171

    While test is running, we can observe 25 or even 33 ms latencies.

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
    rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
    rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

    After patch :

    About 90% increase of throughput :

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    810442

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    800992

    And latencies are kept to minimal values during this load, even
    if network utilization is 90% higher :

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
    rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Mar, 2015

1 commit

    When request socks are put in the ehash table, the whole notion
    of having a previous request to update dl_next is pointless.

    Also, a following patch will get rid of the big purge timer,
    so we want to delete a request sock without holding the listener lock.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Jul, 2014

1 commit

  • When a UDP application switches from AF_INET to AF_INET6 sockets, we
    have a small performance degradation for IPv4 communications because of
    extra cache line misses to access ipv6only information.

    This can also be noticed for TCP listeners, as ipv6_only_sock() is also
    used from __inet_lookup_listener()->compute_score()

    This is magnified when SO_REUSEPORT is used.

    Move ipv6only into struct sock_common so that it is available at
    no extra cost in lookups.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access potentially
    touches freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no implementation of this callback actually uses
    the length argument. And since nobody cared about its
    value, lots of call sites pass arbitrary values such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
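
The callback shape after this change can be sketched as follows. The types and helpers (`rx_sock_sketch`, `buf_sketch`, `queue_tail()`, `deliver()`, `count_wakeup()`) are hypothetical userspace stand-ins, not the kernel's definitions; the point is only that the notifier takes the socket alone, so the caller never reads the buffer after queuing it.

```c
#include <stddef.h>

/* Hypothetical, simplified buffer and socket. */
struct buf_sketch {
    struct buf_sketch *next;
    size_t len;
};

struct rx_sock_sketch {
    struct buf_sketch *head;
    struct buf_sketch *tail;
    void (*data_ready)(struct rx_sock_sketch *sk); /* no length argument */
    int wakeups;
};

/* Append b to the socket's receive queue. */
static void queue_tail(struct rx_sock_sketch *sk, struct buf_sketch *b)
{
    b->next = NULL;
    if (sk->tail != NULL)
        sk->tail->next = b;
    else
        sk->head = b;
    sk->tail = b;
}

/* Example callback: it only needs the socket, never the buffer. */
static void count_wakeup(struct rx_sock_sketch *sk)
{
    sk->wakeups++;
}

static void deliver(struct rx_sock_sketch *sk, struct buf_sketch *b)
{
    queue_tail(sk, b);
    /* From here on a consumer may own b, so b is not touched again. */
    sk->data_ready(sk);
}
```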
     

11 Oct, 2013

1 commit

  • In commit 634fb979e8f ("inet: includes a sock_common in request_sock")
    I forgot that the two ports in sock_common do not have the same byte order:

    skc_dport is __be16 (network order), but skc_num is __u16 (host order)

    So sparse complains because ir_loc_port (mapped into skc_num) is
    considered as __u16 while it should be __be16

    Let's rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
    and perform appropriate htons/ntohs conversions.

    Signed-off-by: Eric Dumazet
    Reported-by: Wu Fengguang
    Signed-off-by: David S. Miller

    Eric Dumazet
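
The mixed byte orders described above can be shown with a short sketch. `ports_sketch` and its helpers are hypothetical; only `htons()` and `ntohs()` are real library calls. The local port is kept in host order (like skc_num) and the remote port in network order (like skc_dport), so each needs a conversion in the opposite direction.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical pair of ports with different byte orders. */
struct ports_sketch {
    uint16_t rmt_port; /* network byte order, like skc_dport */
    uint16_t num;      /* host byte order, like skc_num */
};

/* Both arguments are given in host order. */
static void set_ports(struct ports_sketch *p, uint16_t local_host,
                      uint16_t remote_host)
{
    p->num = local_host;              /* stored as-is */
    p->rmt_port = htons(remote_host); /* converted for the wire */
}

/* Local port as it must appear in a packet header. */
static uint16_t local_port_wire(const struct ports_sketch *p)
{
    return htons(p->num);
}

/* Remote port as a host-order number, e.g. for printing. */
static uint16_t remote_port_host(const struct ports_sketch *p)
{
    return ntohs(p->rmt_port);
}
```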
     

10 Oct, 2013

1 commit

  • TCP listener refactoring, part 5 :

    We want to be able to insert request sockets (SYN_RECV) into main
    ehash table instead of the per listener hash table to allow RCU
    lookups and remove listener lock contention.

    This patch includes the needed struct sock_common in front
    of struct request_sock

    This means there is no more inet6_request_sock IPv6 specific
    structure.

    Following inet_request_sock fields were renamed as they became
    macros to reference fields from struct sock_common.
    Prefix ir_ was chosen to avoid name collisions.

    loc_port -> ir_loc_port
    loc_addr -> ir_loc_addr
    rmt_addr -> ir_rmt_addr
    rmt_port -> ir_rmt_port
    iif -> ir_iif

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Oct, 2013

1 commit

  • TCP listener refactoring, part 4 :

    To speed up inet lookups, we moved IPv4 addresses from inet to struct
    sock_common

    Now it is time to do the same for IPv6, because it permits us to have fast
    lookups for all kinds of sockets, including upcoming SYN_RECV.

    Getting IPv6 addresses in TCP lookups currently requires two extra cache
    lines, plus a dereference (and memory stall).

    inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

    This patch is way bigger than its IPv4 counterpart, because for IPv4,
    we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
    it's not doable easily.

    inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
    inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

    And timewait sockets also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
    at the same offset.

    We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
    macro.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Nov, 2012

1 commit

  • For passive TCP connections using TCP_DEFER_ACCEPT facility,
    we incorrectly increment req->retrans each time timeout triggers
    while no SYNACK is sent.

    SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
    which we received the ACK from client). Only the last SYNACK is sent
    so that we can receive again an ACK from client, to move the req into
    accept queue. We plan to change this later to avoid the useless
    retransmit (and potential problem as this SYNACK could be lost)

    TCP_INFO later gives wrong information to user, claiming imaginary
    retransmits.

    Decouple req->retrans field into two independent fields :

    num_retrans : number of retransmit
    num_timeout : number of timeouts

    num_timeout is the counter that is incremented at each timeout,
    regardless of actual SYNACK being sent or not, and used to
    compute the exponential timeout.

    Introduce inet_rtx_syn_ack() helper to increment num_retrans
    only if ->rtx_syn_ack() succeeded.

    Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
    when we re-send a SYNACK in answer to a (retransmitted) SYN.
    Prior to this patch, we were not counting these retransmits.

    Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
    only if a synack packet was successfully queued.

    Reported-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Cc: Vijay Subramanian
    Cc: Elliott Hughes
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
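
The split into two counters can be sketched like this. The struct, the function-pointer type and every helper below are hypothetical; `rtx_syn_ack_sketch()` only mirrors the described behaviour of counting a retransmit when the send succeeded, and `next_timeout()` is one assumed way to derive an exponential timeout from the timeout counter.

```c
/* Hypothetical request with the two decoupled counters. */
struct req_sketch {
    unsigned int num_retrans; /* SYNACKs actually retransmitted */
    unsigned int num_timeout; /* timer expirations, sent or not */
};

typedef int (*rtx_fn)(struct req_sketch *req);

/* Count a retransmit only if the send callback reports success (0). */
static int rtx_syn_ack_sketch(struct req_sketch *req, rtx_fn send)
{
    int err = send(req);

    if (err == 0)
        req->num_retrans++;
    return err;
}

/* Every expiration counts as a timeout, whether or not we send. */
static void on_timeout(struct req_sketch *req, rtx_fn send, int should_send)
{
    if (should_send)
        rtx_syn_ack_sketch(req, send);
    req->num_timeout++;
}

/* Assumed backoff: double the base per timeout, capped at max. */
static unsigned long next_timeout(unsigned long base, unsigned long max,
                                  unsigned int num_timeout)
{
    unsigned long t = base;

    while (num_timeout-- > 0 && t < max)
        t *= 2;
    return t < max ? t : max;
}

/* Stub senders used to exercise the sketch. */
static int send_ok(struct req_sketch *req) { (void)req; return 0; }
static int send_fail(struct req_sketch *req) { (void)req; return -1; }
```

A deferred-accept request that times out without sending only advances `num_timeout`, so a retransmit count derived from `num_retrans` no longer reports imaginary retransmits.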
     

04 Mar, 2012

1 commit

  • This fixes a bug in the sequence number validation during the initial handshake.

    The code did not treat the initial sequence numbers ISS and ISR as read-only and
    did not keep state for GSR and GSS as required by the specification. This causes
    problems with retransmissions during the initial handshake, causing the
    budding connection to be reset.

    This patch now treats ISS/ISR as read-only and tracks GSS/GSR as required.

    Signed-off-by: Samuel Jero
    Signed-off-by: Gerrit Renker

    Samuel Jero
     

12 Dec, 2011

1 commit


23 Nov, 2011

1 commit


09 Nov, 2011

1 commit


12 Oct, 2010

1 commit

  • This fixes a problem and a potential loophole with regard to seqno/ackno
    validity: currently the initial adjustments to AWL/SWL are only performed
    once at the beginning of the connection, during the handshake.

    Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
    it is however necessary to perform these adjustments at least for the first
    W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

    This requirement is complicated by the fact that W/W' can change at any time
    during the lifetime of a connection.

    Therefore it is better to perform that safety check each time SWL/AWL are
    updated, as implemented by the patch.

    A second problem solved by this patch is that the remote/local Sequence Window
    feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
    feature negotiation has completed.

    During the initial handshake we have more stringent sequence number protection;
    the changes added by this patch ensure that {A,S}W{L,H} are within the correct
    bounds at the instant that feature negotiation completes (since the SeqWin
    feature activation handlers call dccp_update_gsr/gss()).

    Signed-off-by: Gerrit Renker

    Gerrit Renker
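
Recomputing the window bounds on every update, with the lower bounds clamped to the initial sequence numbers, might look like the sketch below. It follows the window formulas of RFC 4340, 7.5.1 as I understand them (SWL = max(GSR + 1 - floor(W/4), ISR), SWH = GSR + ceil(3W/4), AWL = max(GSS - W' + 1, ISS), AWH = GSS). The struct and function names are hypothetical, and 48-bit wrap-around arithmetic is deliberately ignored, so this is not how the kernel's dccp_update_gsr/gss() are written.

```c
#include <stdint.h>

/* Hypothetical per-connection sequence state (no 48-bit wrap-around). */
struct seq_state_sketch {
    uint64_t isr, iss; /* initial seqnos received/sent, never modified */
    uint64_t gsr, gss; /* greatest seqno received/sent */
    uint64_t swl, swh; /* sequence-number validity window */
    uint64_t awl, awh; /* acknowledgment-number validity window */
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/* Called for each received seqno; w is the window W. */
static void update_gsr(struct seq_state_sketch *s, uint64_t seq, uint64_t w)
{
    uint64_t low;

    if (seq > s->gsr)
        s->gsr = seq;
    low = (s->gsr + 1 > w / 4) ? s->gsr + 1 - w / 4 : 0;
    s->swl = max_u64(low, s->isr);     /* never below ISR */
    s->swh = s->gsr + (3 * w + 3) / 4; /* ceil(3W/4) */
}

/* Called for each sent seqno; w_prime is the window W'. */
static void update_gss(struct seq_state_sketch *s, uint64_t seq,
                       uint64_t w_prime)
{
    uint64_t low;

    s->gss = seq;
    low = (s->gss + 1 > w_prime) ? s->gss + 1 - w_prime : 0;
    s->awl = max_u64(low, s->iss);     /* never below ISS */
    s->awh = s->gss;
}
```

Because the clamp runs on every update, it stays correct early in the connection even if W or W' changes later.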
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

06 Mar, 2010

1 commit


03 Dec, 2009

1 commit

  • Add optional function parameters associated with sending SYNACK.
    These parameters are not needed after sending SYNACK, and are not
    used for retransmission. Avoids extending struct tcp_request_sock,
    and avoids allocating kernel memory.

    Also affects DCCP as it uses common struct request_sock_ops,
    but this parameter is currently reserved for future use.

    Signed-off-by: William.Allen.Simpson@gmail.com
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    William Allen Simpson
     

22 Jan, 2009

1 commit

  • This adds full support for local/remote Sequence Window feature, from which the
    * sequence-number-validity (W) and
    * acknowledgment-number-validity (W') windows
    derive as specified in RFC 4340, 7.5.3.

    Specifically, the following is contained in this patch:
    * integrated new socket fields into dccp_sk;
    * updated the update_gsr/gss routines with regard to these fields;
    * updated handler code: the Sequence Window feature is located at the TX side,
    so the local feature is meant if the handler-rx flag is false;
    * the initialisation of `rcv_wnd' in reqsk is removed, since
    - rcv_wnd is not used by the code anywhere;
    - sequence number checks are not done in the LISTEN state (cf. 7.5.3);
    - dccp_check_req checks the Ack number validity more rigorously;
    * the `struct dccp_minisock' became empty and is now removed.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

08 Dec, 2008

4 commits

  • This removes the use of the sysctl and the minisock variable for the Send Ack
    Vector feature, as it now is handled fully dynamically via feature negotiation
    (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
    RFC 4341, 4.).

    Using a sysctl in parallel to this implementation would open the door to
    crashes, since much of the code relies on tests of the boolean minisock /
    sysctl variable. Thus, this patch replaces all tests of type

    if (dccp_msk(sk)->dccpms_send_ack_vector)
            /* ... */
    with
    if (dp->dccps_hc_rx_ackvec != NULL)
            /* ... */

    The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
    negotiation concluded that Ack Vectors are to be used on the half-connection.
    Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
    so that the test is a valid one.

    The activation handler for Ack Vectors is called as soon as the feature
    negotiation has concluded at the
    * server when the Ack marking the transition RESPOND => OPEN arrives;
    * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

    Adding the sequence number of the Response packet to the Ack Vector has been
    removed, since
    (a) connection establishment implies that the Response has been received;
    (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
    this entry will always be ignored;
    (c) it cannot be used for anything useful: to detect loss, for instance, only
    packets received after the loss can serve as pseudo-dupacks.

    There was a FIXME to change the error code when dccp_ackvec_add() fails.
    I removed this after finding out that:
    * the check whether ackno < ISN is already made earlier,
    * this Response is likely the 1st packet with an Ackno that the client gets,
    * so when dccp_ackvec_add() fails, the reason is likely not a packet error.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • Updating the NDP count feature is handled automatically now:
    * for CCID-2 it is disabled, since the code does not use NDP counts;
    * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.

    Allowing the user to change NDP values leads to unpredictable and failing
    behaviour, since it is then possible to disable NDP counts even when they
    are needed (e.g. in CCID-3).

    This means that only those user settings are sensible that agree with the
    values for Send NDP Count implied by the choice of CCID. But those settings
    are already activated by the feature negotiation (CCID dependency tracking),
    hence this form of support is redundant.

    At startup the initialisation of the NDP count feature uses the default
    value of 0, which is done implicitly by the zeroing-out of the socket when
    it is allocated. If the choice of CCID or feature negotiation enables NDP
    count, this will then be updated via the NDP activation handler.
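
    A minimal sketch of the behaviour described above, using hypothetical
    names (ndp_sock_model, ndp_activate_ccid) rather than the real kernel
    structures: the zeroed socket starts with NDP count off, and the value
    is then derived from the chosen CCID instead of being set by the user.

```c
#include <stdbool.h>
#include <string.h>

/* CCID numbers as used in this message; the enum names are made up. */
enum { MODEL_CCID2 = 2, MODEL_CCID3 = 3 };

/* Hypothetical stand-in for the relevant socket fields. */
struct ndp_sock_model {
    int  tx_ccid;
    bool send_ndp_count;
};

/* Zeroing the socket yields the default Send NDP Count of 0 implicitly. */
void ndp_sock_init(struct ndp_sock_model *sk)
{
    memset(sk, 0, sizeof(*sk));
}

/* Dependency rule from the text: CCID-3 needs NDP counts, CCID-2 does not. */
bool ccid_needs_ndp_count(int ccid)
{
    return ccid == MODEL_CCID3;
}

/* Models the activation step: the CCID choice drives the NDP setting. */
void ndp_activate_ccid(struct ndp_sock_model *sk, int ccid)
{
    sk->tx_ccid = ccid;
    sk->send_ndp_count = ccid_needs_ndp_count(ccid);
}
```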

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
    case, their value initially equals that of the sysctl, but at the end of
    feature negotiation it may be something different.

    The old interface removed by this patch thus has been replaced by the newer
    interface to dynamically query the currently loaded CCIDs.

    Also removed are the constructors for the TX CCID and the RX CCID, since the
    switch "rx <-> non-rx" is done by the handler in minisocks.c (and the handler
    is the only place in the code where CCIDs are loaded).

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This patch integrates the activation of features at the end of negotiation
    into the server-side code.

    Note regarding the removal of 'const':
    --------------------------------------
    The 'const' attribute has been removed from 'dreq' since dccp_activate_values()
    needs to operate on dreq's feature list. Part of the activation is to remove
    those options from the list that have already been confirmed, hence it is not
    purely read-only.
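
    To illustrate why such an activation step cannot take a const pointer,
    here is a simplified userspace sketch. The list layout and all names are
    assumptions made for the example; only the "confirmed entries are removed
    during activation" behaviour comes from this message.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical feature-list entry. */
struct feat_entry_model {
    int  feat_num;
    bool confirmed;
    struct feat_entry_model *next;
};

/* Hypothetical request-socket stand-in owning the feature list. */
struct dreq_model {
    struct feat_entry_model *featneg;
};

/* Prepend an entry; returns 0 on success, -1 on allocation failure. */
int feat_push(struct dreq_model *dreq, int feat_num, bool confirmed)
{
    struct feat_entry_model *e = malloc(sizeof(*e));

    if (e == NULL)
        return -1;
    e->feat_num = feat_num;
    e->confirmed = confirmed;
    e->next = dreq->featneg;
    dreq->featneg = e;
    return 0;
}

/*
 * Takes a non-const pointer on purpose: confirmed entries are unlinked
 * and freed, so the list is modified. Returns the number removed.
 */
int activate_values_model(struct dreq_model *dreq)
{
    struct feat_entry_model **pp = &dreq->featneg;
    int removed = 0;

    while (*pp != NULL) {
        struct feat_entry_model *e = *pp;

        if (e->confirmed) {
            *pp = e->next; /* unlink */
            free(e);
            removed++;
        } else {
            pp = &e->next;
        }
    }
    return removed;
}

/* A purely read-only helper, by contrast, can keep 'const'. */
int feat_count(const struct dreq_model *dreq)
{
    int n = 0;

    for (const struct feat_entry_model *e = dreq->featneg; e != NULL; e = e->next)
        n++;
    return n;
}
```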

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

17 Nov, 2008

1 commit

  • This patch deprecates the Ack Ratio sysctl, since
    * Ack Ratio is entirely ignored by CCID-3 and CCID-4;
    * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
    * even if it did work in CCID-2, there is no point for a user to change it:
      - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
      - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
        (since it waits for Acks which will never arrive in this window),
      - cwnd is not a user-configurable value.

    The only reasonable place for Ack Ratio is to print it for debugging. It is
    planned to do this later on, as part of e.g. dccp_probe.

    With this patch Ack Ratio is now under full control of feature negotiation:
    * Ack Ratio is resolved as a dependency of the selected CCID;
    * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
    the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
    Ratio 2 for both endpoints";
    * what happens then is part of another patch set, since it concerns the
    dynamic update of Ack Ratio while the connection is in full flight.
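
    The dependency rule in the list above can be sketched as a small lookup.
    The function name is hypothetical, and returning 0 for CCIDs that do not
    use Ack Ratio is an assumption of this sketch, not something stated in
    this message.

```c
#include <stdint.h>

/* CCID numbers as used in this message; the enum names are made up. */
enum { MODEL_CCID2 = 2, MODEL_CCID3 = 3, MODEL_CCID4 = 4 };

/*
 * Initial Ack Ratio derived from the negotiated CCID.
 * CCID-2 starts at 2 (RFC 4340, 11.3); for other CCIDs this sketch
 * returns 0 to mean "Ack Ratio not used" (an assumption).
 */
uint16_t initial_ack_ratio(int ccid)
{
    return ccid == MODEL_CCID2 ? 2 : 0;
}
```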

    Thanks to Tomasz Grobelny for discussion leading up to this patch.

    Signed-off-by: Gerrit Renker
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     

05 Nov, 2008

1 commit

  • This provides feature-negotiation initialisation for both DCCP sockets
    and DCCP request_sockets, to support feature negotiation during
    connection setup.

    It also resolves a FIXME regarding the congestion control
    initialisation.

    Thanks to Wei Yongjun for help with the IPv6 side of this patch.

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: David S. Miller

    Gerrit Renker
     

20 Oct, 2008

1 commit

  • Commit a3116ac5c216fc3c145906a46df9ce542ff7dcf2 from 1st October ("tcp: Port
    redirection support for TCP") broke DCCP skb lookup by changing inet_csk_clone,
    which is used by DCCP to generate the child socket after the handshake.

    This patch updates DCCP to use 'loc_port' instead of 'sport', which fixes the
    problem and thereby inherits port redirection support via the new interface.

    Signed-off-by: Gerrit Renker
    Signed-off-by: KOVACS Krisztian
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     

07 Aug, 2008

1 commit

  • If the following packet flow happens, the kernel will panic.

    MachineA                        MachineB
                    SYN
            ---------------------->
                    SYN+ACK
            <----------------------
                    ACK(bad seq)
            ---------------------->

    When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr)
    is finally called by tcp_v4_reqsk_send_ack(), but the first parameter (skb->sk) is
    NULL at that moment, so the kernel panics.
    This patch fixes this bug.
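
    The failure mode can be modelled in userspace as follows. This is only a
    sketch of one way to avoid the NULL dereference (handing the lookup a
    socket the caller already holds instead of reading skb->sk); it is not
    necessarily what the patch itself does, and all names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the socket and the packet buffer. */
struct sk_model {
    int md5_key_id;
};

struct skb_model {
    struct sk_model *sk; /* may still be NULL while handling a request sock */
};

/* Lookup that tolerates a missing socket: returns -1 instead of crashing. */
int md5_lookup_model(const struct sk_model *sk)
{
    if (sk == NULL)
        return -1;
    return sk->md5_key_id;
}

/*
 * The caller passes the listening socket it already holds, so the
 * lookup no longer depends on skb->sk being set.
 */
int reqsk_send_ack_model(const struct sk_model *listener,
                         const struct skb_model *skb)
{
    (void)skb; /* skb->sk is deliberately not consulted */
    return md5_lookup_model(listener);
}
```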

    The oops output is as follows:
    [ 302.812793] IP: [] tcp_v4_md5_do_lookup+0x12/0x42
    [ 302.817075] Oops: 0000 [#1] SMP
    [ 302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
    [ 302.849946]
    [ 302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5)
    [ 302.855184] EIP: 0060:[] EFLAGS: 00010296 CPU: 0
    [ 302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42
    [ 302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046
    [ 302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54
    [ 302.868333] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    [ 302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000)
    [ 302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400
    [ 302.883275] c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0
    [ 302.890971] ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634
    [ 302.900140] Call Trace:
    [ 302.902392] [] tcp_v4_reqsk_send_ack+0x17/0x35
    [ 302.907060] [] tcp_check_req+0x156/0x372
    [ 302.910082] [] printk+0x14/0x18
    [ 302.912868] [] tcp_v4_do_rcv+0x1d3/0x2bf
    [ 302.917423] [] tcp_v4_rcv+0x563/0x5b9
    [ 302.920453] [] ip_local_deliver_finish+0xe8/0x183
    [ 302.923865] [] ip_rcv_finish+0x286/0x2a3
    [ 302.928569] [] dev_alloc_skb+0x11/0x25
    [ 302.931563] [] netif_receive_skb+0x2d6/0x33a
    [ 302.934914] [] pcnet32_poll+0x333/0x680 [pcnet32]
    [ 302.938735] [] net_rx_action+0x5c/0xfe
    [ 302.941792] [] __do_softirq+0x5d/0xc1
    [ 302.944788] [] __do_softirq+0x0/0xc1
    [ 302.948999] [] do_softirq+0x55/0x88
    [ 302.951870] [] handle_fasteoi_irq+0x0/0xa4
    [ 302.954986] [] irq_exit+0x35/0x69
    [ 302.959081] [] do_IRQ+0x99/0xae
    [ 302.961896] [] common_interrupt+0x23/0x28
    [ 302.966279] [] default_idle+0x2a/0x3d
    [ 302.969212] [] cpu_idle+0xb2/0xd2
    [ 302.972169] =======================
    [ 302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6
    [ 303.011610] EIP: [] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54
    [ 303.018360] Kernel panic - not syncing: Fatal exception in interrupt

    Signed-off-by: Gui Jianfeng
    Signed-off-by: David S. Miller

    Gui Jianfeng
     

11 Jun, 2008

1 commit


01 Mar, 2008

1 commit


29 Jan, 2008

2 commits

  • In DCCP, timestamps can occur on packets at any time; CCID3 uses a timestamp(/echo) on the
    Request/Response exchange. This patch addresses the following situation:
    * timestamps are recorded on the listening socket;
    * Responses are sent from dccp_request_sockets;
    * suppose two connections reach the listening socket with very little time in between:
      the first timestamp value gets overwritten by the second connection request.

    This is not really good, so this patch separates timestamps into
    * those which are received by the server during the initial handshake (on dccp_request_sock);
    * those which are received by the client or the server after connection establishment.

    As before, a timestamp of 0 is regarded as indicating that no (meaningful) timestamp has been
    received (in addition, a warning message is printed if hosts send 0-valued timestamps).

    The timestamp-echoing now works as follows:
    * when a timestamp is present on the initial Request, it is placed into dreq, due to the
    call to dccp_parse_options in dccp_v{4,6}_conn_request;
    * when a timestamp is present on the Ack leading from RESPOND => OPEN, it is copied over
    from the request_sock into the child socket in dccp_create_openreq_child;
    * timestamps received on an (established) dccp_sock are treated as before.
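
    The steps above can be sketched in userspace as follows. The struct and
    function names are hypothetical stand-ins; the point is only that each
    pending connection keeps its own timestamp, so a second Request cannot
    overwrite the first.

```c
#include <stdint.h>

/* Hypothetical per-connection request state (one per pending handshake). */
struct tsreq_model {
    uint32_t timestamp_echo; /* 0 means "no meaningful timestamp seen" */
};

/* Hypothetical child socket created once the handshake completes. */
struct tschild_model {
    uint32_t timestamp_echo;
};

/* On the initial Request: store the timestamp in the request state. */
void conn_request_model(struct tsreq_model *req, uint32_t tstamp)
{
    req->timestamp_echo = tstamp;
}

/* On RESPOND => OPEN: copy the value from the request into the child. */
void create_child_model(struct tschild_model *child,
                        const struct tsreq_model *req)
{
    child->timestamp_echo = req->timestamp_echo;
}
```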

    Since Elapsed Time is measured in hundredths of milliseconds (13.2), the new dccp_timestamp()
    function is used, as it is expected that the time between receiving the timestamp and
    sending the timestamp echo will be very small compared to the wrap-around time. As a byproduct,
    this allows smaller timestamping-time fields.
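
    A small sketch of why a short interval survives counter wrap-around,
    assuming a 32-bit counter in units of 10 microseconds (hundredths of
    milliseconds), which would wrap after roughly 11.9 hours. The function
    name is hypothetical.

```c
#include <stdint.h>

/*
 * Elapsed time between two readings of a free-running 32-bit counter,
 * in units of 10 microseconds. Unsigned subtraction is modulo 2^32,
 * so the result is correct across a single wrap as long as the real
 * interval is shorter than the wrap-around time.
 */
uint32_t elapsed_10us(uint32_t now, uint32_t then)
{
    return now - then;
}
```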

    Furthermore, inserting the Timestamp Echo option has been taken out of the block starting with
    '!dccp_packet_without_ack()', since Timestamp Echo can be carried on any packet (5.8 and 13.3).

    Signed-off-by: Gerrit Renker
    Acked-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     
  • This adds option-parsing code to processing of Acks in the listening state
    on request_socks on the server, serving two purposes:
    (i) resolves a FIXME (removed);
    (ii) paves the way for feature-negotiation during connection-setup.

    There is an intended subtlety here with regard to dccp_check_req:

    Parsing options happens only after testing whether the received packet is
    a retransmitted Request. Otherwise, if the Request contained (a possibly
    large number of) feature-negotiation options, recomputing state would have to
    happen each time a retransmitted Request arrives, which opens the door to an
    easy DoS attack. Since in a genuine retransmission the options should not be
    different from the original, reusing the already computed state seems better.

    The other point is - if there are timestamp options on the Request, they will
    not be answered; which means that in the presence of retransmission (likely
    due to loss and/or other problems), the use of Request/Response RTT sampling
    is suspended, so that startup problems here do not propagate.
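
    The ordering described above (test for a retransmitted Request first,
    parse options only afterwards) can be sketched as follows. This is a
    simplified model with hypothetical names; it assumes that any Request
    reaching an already existing request sock is a retransmission.

```c
#include <stdbool.h>

/* Hypothetical packet types relevant to the check. */
enum pkt_type_model { PKT_REQUEST_MODEL, PKT_ACK_MODEL };

struct pkt_model {
    enum pkt_type_model type;
};

/* Hypothetical request-sock state; parse_calls counts the costly work. */
struct chkreq_model {
    unsigned int parse_calls;
};

/* Stand-in for the (potentially expensive) option parsing. */
void parse_options_model(struct chkreq_model *req)
{
    req->parse_calls++;
}

/*
 * Returns true if options were parsed. A retransmitted Request is
 * answered from the already computed state and never reaches the parser.
 */
bool check_req_model(struct chkreq_model *req, const struct pkt_model *pkt)
{
    if (pkt->type == PKT_REQUEST_MODEL)
        return false; /* retransmission: reuse state, skip parsing */

    parse_options_model(req);
    return true;
}
```

    With this ordering, a flood of retransmitted Requests costs one type
    check each rather than a full option parse.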

    Signed-off-by: Gerrit Renker
    Signed-off-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker
     

11 Oct, 2007

1 commit

  • This

    * removes a declaration of a non-existent function
    __dccp_minisock_init;

    * shifts the initialisation function dccp_minisock_init() from
    options.c to minisocks.c, where it is more naturally expected to
    be.

    Signed-off-by: Gerrit Renker
    Signed-off-by: Ian McDonald
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Gerrit Renker