16 May, 2018

1 commit

  • commit b855ff827476adbdc2259e9895681d82b7b26065 upstream.

    syzbot reported an uninit-value read of skb->mark in iptable_mangle_hook()

    Thanks to the nice report, I tracked the problem to dccp not initializing
    ireq->ir_mark for passive sessions.

    BUG: KMSAN: uninit-value in ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
    BUG: KMSAN: uninit-value in iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
    CPU: 0 PID: 5300 Comm: syz-executor3 Not tainted 4.16.0+ #81
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:53
    kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
    __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
    ipt_mangle_out net/ipv4/netfilter/iptable_mangle.c:66 [inline]
    iptable_mangle_hook+0x5e5/0x720 net/ipv4/netfilter/iptable_mangle.c:84
    nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
    nf_hook_slow+0x158/0x3d0 net/netfilter/core.c:483
    nf_hook include/linux/netfilter.h:243 [inline]
    __ip_local_out net/ipv4/ip_output.c:113 [inline]
    ip_local_out net/ipv4/ip_output.c:122 [inline]
    ip_queue_xmit+0x1d21/0x21c0 net/ipv4/ip_output.c:504
    dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
    dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
    dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
    dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
    inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
    __sys_sendmsg net/socket.c:2080 [inline]
    SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
    SyS_sendmsg+0x54/0x80 net/socket.c:2087
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x455259
    RSP: 002b:00007f1a4473dc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00007f1a4473e6d4 RCX: 0000000000455259
    RDX: 0000000000000000 RSI: 0000000020b76fc8 RDI: 0000000000000015
    RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
    R13: 00000000000004f0 R14: 00000000006fa720 R15: 0000000000000000

    Uninit was stored to memory at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
    kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
    __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
    ip_queue_xmit+0x1e35/0x21c0 net/ipv4/ip_output.c:502
    dccp_transmit_skb+0x15eb/0x1900 net/dccp/output.c:142
    dccp_xmit_packet+0x814/0x9e0 net/dccp/output.c:281
    dccp_write_xmit+0x20f/0x480 net/dccp/output.c:363
    dccp_sendmsg+0x12ca/0x12d0 net/dccp/proto.c:818
    inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
    sock_sendmsg_nosec net/socket.c:630 [inline]
    sock_sendmsg net/socket.c:640 [inline]
    ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
    __sys_sendmsg net/socket.c:2080 [inline]
    SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
    SyS_sendmsg+0x54/0x80 net/socket.c:2087
    do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    Uninit was stored to memory at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
    kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
    __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
    inet_csk_clone_lock+0x503/0x580 net/ipv4/inet_connection_sock.c:797
    dccp_create_openreq_child+0x7f/0x890 net/dccp/minisocks.c:92
    dccp_v4_request_recv_sock+0x22c/0xe90 net/dccp/ipv4.c:408
    dccp_v6_request_recv_sock+0x290/0x2000 net/dccp/ipv6.c:414
    dccp_check_req+0x7b9/0x8f0 net/dccp/minisocks.c:197
    dccp_v4_rcv+0x12e4/0x2630 net/dccp/ipv4.c:840
    ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:449 [inline]
    ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
    __netif_receive_skb net/core/dev.c:4627 [inline]
    process_backlog+0x62d/0xe20 net/core/dev.c:5307
    napi_poll net/core/dev.c:5705 [inline]
    net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
    __do_softirq+0x56d/0x93d kernel/softirq.c:285
    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
    kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
    kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
    reqsk_alloc include/net/request_sock.h:88 [inline]
    inet_reqsk_alloc+0xc4/0x7f0 net/ipv4/tcp_input.c:6145
    dccp_v4_conn_request+0x5cc/0x1770 net/dccp/ipv4.c:600
    dccp_v6_conn_request+0x299/0x1880 net/dccp/ipv6.c:317
    dccp_rcv_state_process+0x2ea/0x2410 net/dccp/input.c:612
    dccp_v4_do_rcv+0x229/0x340 net/dccp/ipv4.c:682
    dccp_v6_do_rcv+0x16d/0x1220 net/dccp/ipv6.c:578
    sk_backlog_rcv include/net/sock.h:908 [inline]
    __sk_receive_skb+0x60e/0xf20 net/core/sock.c:513
    dccp_v4_rcv+0x24d4/0x2630 net/dccp/ipv4.c:874
    ip_local_deliver_finish+0x6ed/0xd40 net/ipv4/ip_input.c:216
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_local_deliver+0x43c/0x4e0 net/ipv4/ip_input.c:257
    dst_input include/net/dst.h:449 [inline]
    ip_rcv_finish+0x1253/0x16d0 net/ipv4/ip_input.c:397
    NF_HOOK include/linux/netfilter.h:288 [inline]
    ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
    __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
    __netif_receive_skb net/core/dev.c:4627 [inline]
    process_backlog+0x62d/0xe20 net/core/dev.c:5307
    napi_poll net/core/dev.c:5705 [inline]
    net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
    __do_softirq+0x56d/0x93d kernel/softirq.c:285

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
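
The missing initialization can be illustrated with a small userspace model. This is only a sketch: struct pkt, struct listener, struct req_sock, struct child_sock and all function names below are hypothetical stand-ins, and request_mark() is loosely modelled on the kernel's inet_request_mark() helper. The request sock is deliberately filled with junk after allocation to mimic stale slab memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for sk_buff, listener and request socks. */
struct pkt { uint32_t mark; };
struct listener { uint32_t sk_mark; int fwmark_accept; };
struct req_sock { uint32_t ir_mark; };
struct child_sock { uint32_t sk_mark; };

/* The listener's mark wins; otherwise reflect the packet mark if allowed. */
static uint32_t request_mark(const struct listener *l, const struct pkt *p)
{
	if (!l->sk_mark && l->fwmark_accept)
		return p->mark;
	return l->sk_mark;
}

/* Passive open: allocate the request sock and set every field read later. */
static struct req_sock *conn_request(const struct listener *l,
				     const struct pkt *syn)
{
	struct req_sock *req;

	assert(l && syn);
	req = malloc(sizeof(*req));
	if (!req)
		return NULL;
	memset(req, 0xAA, sizeof(*req));      /* simulate stale slab contents */
	req->ir_mark = request_mark(l, syn);  /* the step that was missing */
	return req;
}

/* The child inherits the mark, which later ends up in outgoing packets. */
static void create_child(struct child_sock *c, const struct req_sock *req)
{
	c->sk_mark = req->ir_mark;
}
```

Without the assignment in conn_request(), the child would inherit the 0xAA junk, which is the userspace analogue of the uninit-value read reported above.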
     

01 Sep, 2017

1 commit


08 Aug, 2017

1 commit

  • Add a second device index, sdif, to inet6 socket lookups. sdif is the
    index for ingress devices enslaved to an l3mdev. It allows the lookups
    to consider the enslaved device as well as the L3 domain when searching
    for a socket.

    TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
    ingress index is obtained from IPCB using inet_sdif and after tcp_v4_rcv
    tcp_v4_sdif is used.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

27 Jul, 2017

1 commit

  • In dccp_v6_conn_request, after reqsk gets allocated and hashed into the
    ehash table, reqsk's refcnt is set to 3: one is for req->rsk_timer,
    one is for hlist, and the other one is for the current user.

    The problem is when dccp_v6_conn_request returns and finishes using
    reqsk, it doesn't put reqsk. This will cause reqsk refcnt leaks and
    reqsk obj never gets freed.

    Jianlin found this issue when running dccp_memleak.c in a loop: the
    system would eventually run out of memory.

    dccp_memleak.c:
    int s1 = socket(PF_INET6, 6, IPPROTO_IP);
    bind(s1, &sa1, 0x20);
    listen(s1, 0x9);
    int s2 = socket(PF_INET6, 6, IPPROTO_IP);
    connect(s2, &sa1, 0x20);
    close(s1);
    close(s2);

    This patch puts the reqsk before dccp_v6_conn_request returns,
    just as tcp_conn_request does.

    Reported-by: Jianlin Shi
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
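
The reference accounting described in this commit can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: struct reqsk, reqsk_put(), conn_request() and teardown() are simplified, hypothetical versions of the kernel objects, and the put_before_return flag exists only to compare the behaviour before and after the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a request sock: a counter and a "freed" marker. */
struct reqsk {
	int refcnt;
	bool freed;
};

static void reqsk_put(struct reqsk *r)
{
	assert(r->refcnt > 0);
	if (--r->refcnt == 0)
		r->freed = true;
}

/* After hashing, three references exist: timer, hash list, and the caller. */
static void conn_request(struct reqsk *r, bool put_before_return)
{
	r->refcnt = 3;
	r->freed = false;
	/* ... build and send the response ... */
	if (put_before_return)
		reqsk_put(r);   /* drop the caller's own reference */
}

/* Later teardown: the timer and the hash list each drop their reference. */
static void teardown(struct reqsk *r)
{
	reqsk_put(r);
	reqsk_put(r);
}
```

With put_before_return set to false, teardown() leaves refcnt at 1 and the object is never freed, which matches the leak described above.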
     

01 Jul, 2017

1 commit

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
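
To show why a dedicated reference-counter type helps, here is a userspace sketch of a saturating counter built on C11 atomics. It is not the kernel's refcount_t implementation: ref_t, ref_set(), ref_inc(), ref_dec_and_test() and the choice of UINT_MAX as the saturation value are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_SATURATED UINT_MAX

/* Hypothetical userspace analogue of a reference counter type. */
typedef struct { atomic_uint refs; } ref_t;

static void ref_set(ref_t *r, unsigned int n)
{
	atomic_store(&r->refs, n);
}

/* Increment, but stay pinned at REF_SATURATED instead of wrapping to 0. */
static void ref_inc(ref_t *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == REF_SATURATED)
			return;
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}

/* Returns true only when the last reference was dropped. */
static bool ref_dec_and_test(ref_t *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		assert(old != 0);           /* underflow is a caller bug */
		if (old == REF_SATURATED)
			return false;       /* leak the object rather than free it */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
	return old == 1;
}
```

A plain atomic counter would wrap to zero on overflow and let a later put free an object that is still in use; a saturated counter trades that for a bounded memory leak.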
     

21 Jun, 2017

1 commit


16 May, 2017

1 commit

  • Pull networking fixes from David Miller:

    1) Track alignment in BPF verifier so that legitimate programs won't be
    rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.

    2) Make tail calls work properly in arm64 BPF JIT, from Daniel
    Borkmann.

    3) Make the configuration and semantics of Generic XDP make more sense and
    don't allow both generic XDP and a driver specific instance to be
    active at the same time. Also from Daniel.

    4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov.

    5) Fix use-after-free in VRF driver, from Gao Feng.

    6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in
    qca_spi driver, from Stefan Wahren.

    7) Always run cleanup routines in BPF samples when we get SIGTERM, from
    Andy Gospodarek.

    8) The mdio phy code should bring PHYs out of reset using the shared
    GPIO lines before invoking bus->reset(). From Florian Fainelli.

    9) Some USB descriptor access endian fixes in various drivers from
    Johan Hovold.

    10) Handle PAUSE advertisements properly in mlx5 driver, from Gal
    Pressman.

    11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed.

    12) Cure netdev leak in AF_PACKET when using timestamping via control
    messages. From Douglas Caetano dos Santos.

    13) netcp doesn't support HWTSTAMP_FILTER_ALL, reject it. From Miroslav
    Lichvar.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
    ldmvsw: stop the clean timer at beginning of remove
    ldmvsw: unregistering netdev before disable hardware
    net: netcp: fix check of requested timestamping filter
    ipv6: avoid dad-failures for addresses with NODAD
    qed: Fix uninitialized data in aRFS infrastructure
    mdio: mux: fix device_node_continue.cocci warnings
    net/packet: fix missing net_device reference release
    net/mlx4_core: Use min3 to select number of MSI-X vectors
    macvlan: Fix performance issues with vlan tagged packets
    net: stmmac: use correct pointer when printing normal descriptor ring
    net/mlx5: Use underlay QPN from the root name space
    net/mlx5e: IPoIB, Only support regular RQ for now
    net/mlx5e: Fix setup TC ndo
    net/mlx5e: Fix ethtool pause support and advertise reporting
    net/mlx5e: Use the correct pause values for ethtool advertising
    vmxnet3: ensure that adapter is in proper state during force_close
    sfc: revert changes to NIC revision numbers
    net: ch9200: add missing USB-descriptor endianness conversions
    net: irda: irda-usb: fix firmware name on big-endian hosts
    net: dsa: mv88e6xxx: add default case to switch
    ...

    Linus Torvalds
     

12 May, 2017

1 commit


23 Apr, 2017

1 commit


19 Apr, 2017

1 commit

  • A group of Linux kernel hackers reported chasing a bug that resulted
    from their assumption that SLAB_DESTROY_BY_RCU provided an existence
    guarantee, that is, that no block from such a slab would be reallocated
    during an RCU read-side critical section. Of course, that is not the
    case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
    slab of blocks.

    However, there is a phrase for this, namely "type safety". This commit
    therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
    to avoid future instances of this sort of confusion.

    Signed-off-by: Paul E. McKenney
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    [ paulmck: Add comments mentioning the old name, as requested by Eric
    Dumazet, in order to help people familiar with the old name find
    the new one. ]
    Acked-by: David Rientjes

    Paul E. McKenney
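
The practical consequence of "type safety without an existence guarantee" is that a lookup must re-validate an object after taking a reference on it. The following single-threaded sketch shows the pattern; struct obj, lookup(), recycle() and the race callback (a test hook standing in for a concurrent free and reallocation) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct obj { int key; int refcnt; };   /* hypothetical slab object */

/*
 * Type-safe lookup: the memory at @slot is always a struct obj, but it may
 * be freed and reused for a different key at any time. @race is a test hook
 * that simulates such a concurrent reuse between the match and the ref.
 */
static struct obj *lookup(struct obj *slot, int key,
			  void (*race)(struct obj *))
{
	assert(slot);
	if (slot->key != key)
		return NULL;
	if (race)
		race(slot);
	if (slot->refcnt == 0)          /* kernel code would use an inc-not-zero */
		return NULL;
	slot->refcnt++;
	if (slot->key != key) {         /* identity recheck after taking the ref */
		slot->refcnt--;
		return NULL;
	}
	return slot;
}

/* Simulates the slot being recycled for another object. */
static void recycle(struct obj *slot)
{
	slot->key = 99;
	slot->refcnt = 1;
}
```

The second key comparison looks redundant, but it is the step that code relying on an existence guarantee would omit.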
     

14 Mar, 2017

1 commit

  • As Eric Dumazet pointed out this also needs to be fixed in IPv6.
    v2: Contains the IPv6 tcp/IPv6 dccp patches as well.

    We have seen a few incidents lately where a dst_entry has been freed
    with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
    dst_entry. If the conditions/timings are right a crash then ensues when the
    freed dst_entry is referenced later on. A common crashing back trace is:

    #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
    .
    .
    #9 [] tcp_rcv_established at ffffffff81580b64
    #10 [] tcp_v4_do_rcv at ffffffff8158b54a
    #11 [] tcp_v4_rcv at ffffffff8158cd02
    #12 [] ip_local_deliver_finish at ffffffff815668f4
    #13 [] ip_local_deliver at ffffffff81566bd9
    #14 [] ip_rcv_finish at ffffffff8156656d
    #15 [] ip_rcv at ffffffff81566f06
    #16 [] __netif_receive_skb_core at ffffffff8152b3a2
    #17 [] __netif_receive_skb at ffffffff8152b608
    #18 [] netif_receive_skb at ffffffff8152b690
    #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
    #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
    #21 [] net_rx_action at ffffffff8152bac2
    #22 [] __do_softirq at ffffffff81084b4f
    #23 [] call_softirq at ffffffff8164845c
    #24 [] do_softirq at ffffffff81016fc5
    #25 [] irq_exit at ffffffff81084ee5
    #26 [] do_IRQ at ffffffff81648ff8

    Of course it may happen with other NIC drivers as well.

    The freed dst_entry is dereferenced here:

    static bool tcp_in_quickack_mode(struct sock *sk)
    {
            const struct inet_connection_sock *icsk = inet_csk(sk);
            const struct dst_entry *dst = __sk_dst_get(sk);

            return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
                    (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
    }

    But there are other backtraces attributed to the same freed dst_entry in
    netfilter code as well.

    All the vmcores showed 2 significant clues:

    - Remote hosts behind the default gateway had always been redirected to a
    different gateway. A rtable/dst_entry is added for each such host, which
    creates more dst_entries with lower reference counts and makes the race
    more probable.

    - All vmcores showed a positive LockDroppedIcmps value, e.g.:

    LockDroppedIcmps 267

    A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
    regardless of whether user space has the socket locked. This can result in a
    race condition where the same dst_entry cached in sk->sk_dst_cache can be
    decremented twice for the same socket via:

    do_redirect()->__sk_dst_check()-> dst_release().

    Which leads to the dst_entry being prematurely freed with another socket
    pointing to it via sk->sk_dst_cache and a subsequent crash.

    To fix this, skip do_redirect() if user space has the socket locked. Instead,
    let the redirect take place later when user space does not have the socket
    locked.

    The dccp/IPv6 code is very similar in this respect, so fixing it there too.

    As Eric Garver pointed out, the following commit now invalidates routes, which
    can set the dst->obsolete flag so that ipv4_dst_check() returns null and
    triggers the dst_release().

    Fixes: ceb3320610d6 ("ipv4: Kill routes during PMTU/redirect updates.")
    Cc: Eric Garver
    Cc: Hannes Sowa
    Signed-off-by: Jon Maxwell
    Signed-off-by: David S. Miller

    Jon Maxwell
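
The shape of the fix can be sketched as a guard in the ICMP error path. This is a minimal model, not the kernel code: struct sock_model, its counters and icmp_redirect() are hypothetical, and owned_by_user stands in for the kernel's sock_owned_by_user() check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical socket model; the counters exist only for observation. */
struct sock_model {
	bool owned_by_user;      /* true while user space holds the socket lock */
	int redirects_done;
	int redirects_skipped;
};

/* ICMP redirect handler: only touch the cached route when it is safe. */
static void icmp_redirect(struct sock_model *sk)
{
	assert(sk);
	if (sk->owned_by_user) {
		sk->redirects_skipped++;   /* a later redirect will be handled */
		return;
	}
	sk->redirects_done++;              /* kernel: do_redirect(skb, sk) */
}
```

Skipping the redirect while the socket is locked means the softirq path and the process-context path never both release the same cached dst_entry.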
     

23 Feb, 2017

1 commit

  • DCCP doesn't purge timewait sockets on network namespace shutdown.
    So, after a net namespace is destroyed, we could still have an active timer
    which will trigger a use-after-free in tw_timer_handler():

    BUG: KASAN: use-after-free in tw_timer_handler+0x4a/0xa0 at addr ffff88010e0d1e10
    Read of size 8 by task swapper/1/0
    Call Trace:
    __asan_load8+0x54/0x90
    tw_timer_handler+0x4a/0xa0
    call_timer_fn+0x127/0x480
    expire_timers+0x1db/0x2e0
    run_timer_softirq+0x12f/0x2a0
    __do_softirq+0x105/0x5b4
    irq_exit+0xdd/0xf0
    smp_apic_timer_interrupt+0x57/0x70
    apic_timer_interrupt+0x90/0xa0

    Object at ffff88010e0d1bc0, in cache net_namespace size: 6848
    Allocated:
    save_stack_trace+0x1b/0x20
    kasan_kmalloc+0xee/0x180
    kasan_slab_alloc+0x12/0x20
    kmem_cache_alloc+0x134/0x310
    copy_net_ns+0x8d/0x280
    create_new_namespaces+0x23f/0x340
    unshare_nsproxy_namespaces+0x75/0xf0
    SyS_unshare+0x299/0x4f0
    entry_SYSCALL_64_fastpath+0x18/0xad
    Freed:
    save_stack_trace+0x1b/0x20
    kasan_slab_free+0xae/0x180
    kmem_cache_free+0xb4/0x350
    net_drop_ns+0x3f/0x50
    cleanup_net+0x3df/0x450
    process_one_work+0x419/0xbb0
    worker_thread+0x92/0x850
    kthread+0x192/0x1e0
    ret_from_fork+0x2e/0x40

    Add an .exit_batch hook to dccp_v4_ops()/dccp_v6_ops() which will purge
    timewait sockets on net namespace destruction and prevent the above issue.

    Fixes: f2bf415cfed7 ("mib: add net to NET_ADD_STATS_BH")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Andrey Ryabinin
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Andrey Ryabinin
     

28 Jan, 2017

1 commit


27 Jan, 2017

1 commit

  • Unlike ipv4, this control socket is shared by all cpus so we cannot use
    it as scratchpad area to annotate the mark that we pass to ip6_xmit().

    Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket
    family caches the flowi6 structure in the sctp_transport structure, so
    we cannot use it to carry the mark unless we reset it afterwards, which
    I discarded since it looks ugly to me.

    Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled")
    Suggested-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
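
The design choice here, passing the mark as an argument instead of writing it into a shared socket, can be shown with a short sketch. The types and xmit_with_mark() below are hypothetical; only the idea of an explicit per-call mark parameter comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

struct ctl_sock { uint32_t sk_mark; };   /* hypothetical socket shared by all CPUs */
struct pkt { uint32_t mark; };           /* hypothetical outgoing packet */

/*
 * The mark travels as a per-call argument. The socket is const, so it
 * cannot be used as a scratchpad that another CPU might overwrite.
 */
static void xmit_with_mark(const struct ctl_sock *sk, struct pkt *p,
			   uint32_t mark)
{
	assert(sk && p);
	p->mark = mark;
}
```

Two concurrent callers can then send replies with different marks without either of them observing the other's value.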
     

19 Jan, 2017

1 commit

  • The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
    is how they check the rcv_saddr, so delete this call back and simply
    change inet_csk_bind_conflict to call inet_rcv_saddr_equal.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     

15 Nov, 2016

1 commit


04 Nov, 2016

3 commits

  • While fuzzing kernel with syzkaller, Andrey reported a nasty crash
    in inet6_bind() caused by DCCP lacking a required method.

    Fixes: ab1e0a13d7029 ("[SOCK] proto: Add hashinfo member to struct proto")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Cc: Arnaldo Carvalho de Melo
    Acked-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • dccp_v6_err() does not use pskb_may_pull() and might access garbage.

    We only need 4 bytes at the beginning of the DCCP header, like TCP,
    so the 8 bytes pulled in icmpv6_notify() are more than enough.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Andrey Konovalov reported the following error while fuzzing with syzkaller:

    IPv4: Attempt to release alive inet socket ffff880068e98940
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN
    Modules linked in:
    CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88006b9e0000 task.stack: ffff880068770000
    RIP: 0010:[] []
    selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639
    RSP: 0018:ffff8800687771c8 EFLAGS: 00010202
    RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a
    RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010
    RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002
    R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940
    R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000
    FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0
    Stack:
    ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007
    ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480
    ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0
    Call Trace:
    [] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317
    [] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81
    [] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460
    [] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873
    [] ip_local_deliver_finish+0x332/0xad0
    net/ipv4/ip_input.c:216
    [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
    [< inline >] NF_HOOK ./include/linux/netfilter.h:255
    [] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
    [< inline >] dst_input ./include/net/dst.h:507
    [] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
    [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232
    [< inline >] NF_HOOK ./include/linux/netfilter.h:255
    [] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
    [] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213
    [] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251
    [] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279
    [] netif_receive_skb+0x48/0x250 net/core/dev.c:4303
    [] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
    [] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
    [< inline >] new_sync_write fs/read_write.c:499
    [] __vfs_write+0x334/0x570 fs/read_write.c:512
    [] vfs_write+0x17b/0x500 fs/read_write.c:560
    [< inline >] SYSC_write fs/read_write.c:607
    [] SyS_write+0xd4/0x1a0 fs/read_write.c:599
    [] entry_SYSCALL_64_fastpath+0x1f/0xc2

    It turns out DCCP calls __sk_receive_skb(), and this broke when
    lookups no longer took a reference on listeners.

    Fix this issue by adding a @refcounted parameter to __sk_receive_skb(),
    so that sock_put() is used only when needed.

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: Andrey Konovalov
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    Eric Dumazet
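
The effect of the @refcounted parameter can be sketched with a simple counter. This is an illustrative model only: struct sock_model, sock_put() and receive_skb() are hypothetical simplifications of the kernel functions named above.

```c
#include <assert.h>
#include <stdbool.h>

struct sock_model { int refcnt; };   /* hypothetical socket */

static void sock_put(struct sock_model *sk)
{
	assert(sk->refcnt > 0);
	sk->refcnt--;
}

/*
 * @refcounted tells the receive path whether the lookup took a reference.
 * Listeners found by a lockless lookup carry no extra reference, so the
 * final put must be skipped for them.
 */
static int receive_skb(struct sock_model *sk, bool refcounted)
{
	/* ... filter and queue the packet ... */
	if (refcounted)
		sock_put(sk);
	return 0;
}
```

Calling the put unconditionally on a listener would drop a reference that was never taken, which is consistent with the "Attempt to release alive inet socket" message in the report.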
     

30 Oct, 2016

1 commit

  • Per listen(fd, backlog) rules, there is really no point accepting a SYN,
    sending a SYNACK, and dropping the following ACK packet if accept queue
    is full, because application is not draining accept queue fast enough.

    This behavior is fooling TCP clients that believe they established a
    flow, while there is nothing at server side. They might then send about
    10 MSS (if using IW10) that will be dropped anyway while server is under
    stress.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
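
The policy described above amounts to checking the accept queue before answering a SYN. A minimal sketch follows, assuming a "length greater than backlog" fullness test; struct listener, acceptq_is_full() and conn_request() are hypothetical names for this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical listener: current accept queue length and its limit. */
struct listener {
	unsigned int ack_backlog;
	unsigned int max_ack_backlog;
};

static bool acceptq_is_full(const struct listener *l)
{
	return l->ack_backlog > l->max_ack_backlog;
}

/* Returns true if the SYN is accepted, i.e. a SYNACK would be sent. */
static bool conn_request(const struct listener *l)
{
	assert(l);
	if (acceptq_is_full(l))
		return false;   /* drop the SYN; the client will retransmit it */
	return true;
}
```

Dropping at SYN time keeps the client in its connecting state, so it does not send data toward a server that has no room for the connection.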
     

30 Jul, 2016

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - TPM core and driver updates/fixes
    - IPv6 security labeling (CALIPSO)
    - Lots of Apparmor fixes
    - Seccomp: remove 2-phase API, close hole where ptrace can change
    syscall #"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (156 commits)
    apparmor: fix SECURITY_APPARMOR_HASH_DEFAULT parameter handling
    tpm: Add TPM 2.0 support to the Nuvoton i2c driver (NPCT6xx family)
    tpm: Factor out common startup code
    tpm: use devm_add_action_or_reset
    tpm2_i2c_nuvoton: add irq validity check
    tpm: read burstcount from TPM_STS in one 32-bit transaction
    tpm: fix byte-order for the value read by tpm2_get_tpm_pt
    tpm_tis_core: convert max timeouts from msec to jiffies
    apparmor: fix arg_size computation for when setprocattr is null terminated
    apparmor: fix oops, validate buffer size in apparmor_setprocattr()
    apparmor: do not expose kernel stack
    apparmor: fix module parameters can be changed after policy is locked
    apparmor: fix oops in profile_unpack() when policy_db is not present
    apparmor: don't check for vmalloc_addr if kvzalloc() failed
    apparmor: add missing id bounds check on dfa verification
    apparmor: allow SYS_CAP_RESOURCE to be sufficient to prlimit another task
    apparmor: use list_next_entry instead of list_entry_next
    apparmor: fix refcount race when finding a child profile
    apparmor: fix ref count leak when profile sha1 hash is read
    apparmor: check that xindex is in trans_table bounds
    ...

    Linus Torvalds
     

14 Jul, 2016

1 commit

  • DCCP verifies packet integrity, including length, at initial rcv in
    dccp_invalid_packet, and later pulls headers in dccp_enqueue_skb.

    A call to sk_filter in-between can cause __skb_pull to wrap skb->len.
    skb_copy_datagram_msg interprets this as a negative value, so
    (correctly) fails with EFAULT. The negative length is reported in
    ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close.

    Introduce an sk_receive_skb variant that caps how small a filter
    program can trim packets, and call this in dccp with the header
    length. Excessively trimmed packets are now processed normally and
    queued for reception as 0B payloads.

    Fixes: 7c657876b63c ("[DCCP]: Initial implementation")
    Signed-off-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
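
The arithmetic behind the wrap, and how a trim cap avoids it, can be worked through in plain C. The helper names and the byte counts in the usage note are assumptions for illustration; only the idea of capping how far a filter may trim comes from the commit message.

```c
#include <assert.h>

/*
 * @pkt_len:    current packet length
 * @filter_len: length the filter asked to keep (nonzero: packet accepted)
 * @cap:        minimum length the protocol still needs, e.g. its header
 */
static unsigned int trim_with_cap(unsigned int pkt_len,
				  unsigned int filter_len,
				  unsigned int cap)
{
	unsigned int keep = filter_len > cap ? filter_len : cap;

	assert(filter_len > 0);
	return keep < pkt_len ? keep : pkt_len;   /* trimming never grows a packet */
}

/* Payload length left after pulling a header of @hdr_len bytes. */
static unsigned int payload_len(unsigned int len, unsigned int hdr_len)
{
	return len - hdr_len;   /* wraps around if len < hdr_len */
}
```

For a 40-byte packet with a 16-byte header and a filter that keeps 4 bytes, an uncapped trim leaves 4 bytes and the header pull wraps; with the cap set to the header length the packet is trimmed to 16 bytes and queued with an empty payload.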
     

07 Jul, 2016

1 commit


28 Jun, 2016

1 commit

  • If set, these will take precedence over the parent's options during
    both sending and child creation. If they're not set, the parent's
    options (if any) will be used.

    This is to allow the security_inet_conn_request() hook to modify the
    IPv6 options in just the same way that it already may do for IPv4.

    Signed-off-by: Huw Davies
    Signed-off-by: Paul Moore

    Huw Davies
     

03 May, 2016

1 commit


28 Apr, 2016

3 commits


08 Apr, 2016

1 commit


05 Apr, 2016

1 commit

  • When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
    cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

    By letting listeners use SOCK_RCU_FREE infrastructure,
    we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

    Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
    only listeners are impacted by this change.

    Peak performance under SYNFLOOD is increased by ~33% :

    On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

    Most consuming functions are now skb_set_owner_w() and sock_wfree()
    contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Feb, 2016

1 commit


19 Feb, 2016

1 commit

  • Ilya reported the following lockdep splat:

    kernel: =========================
    kernel: [ BUG: held lock freed! ]
    kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
    kernel: -------------------------
    kernel: swapper/5/0 is freeing memory
    ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
    kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0
    kernel: 4 locks held by swapper/5/0:
    kernel: #0: (rcu_read_lock){......}, at: []
    netif_receive_skb_internal+0x4b/0x1f0
    kernel: #1: (rcu_read_lock){......}, at: []
    ip_local_deliver_finish+0x3f/0x380
    kernel: #2: (slock-AF_INET){+.-...}, at: []
    sk_clone_lock+0x19b/0x440
    kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0

    To properly fix this issue, inet_csk_reqsk_queue_add() needs
    to tell its callers whether the child has been queued
    into the accept queue.

    We also need to make sure listener is still there before
    calling sk->sk_data_ready(), by holding a reference on it,
    since the reference carried by the child can disappear as
    soon as the child is put on accept queue.

    Reported-by: Ilya Dryomov
    Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Feb, 2016

2 commits

  • This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
    groups. Doing so with a BPF filter will require access to the
    skb in question. This change plumbs the skb (and offset to payload
    data) through the call stack to the listening socket lookup
    implementations where it will be used in a following patch.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     
  • In order to support fast lookups for TCP sockets with SO_REUSEPORT,
    the function that adds sockets to the listening hash set needs
    to be able to check receive address equality. Since this equality
    check is different for IPv4 and IPv6, we will need two different
    socket hashing functions.

    This patch adds inet6_hash identical to the existing inet_hash function
    and updates the appropriate references. A following patch will
    differentiate the two by passing different comparison functions to
    __inet_hash.

    Additionally, in order to use the IPv6 address equality function from
    inet6_hashtables (which is compiled as a built-in object when IPv6 is
    enabled) it also needs to be in a built-in object file as well. This
    moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

04 Dec, 2015

1 commit

  • While testing the np->opt RCU conversion, I found that UDP/IPv6 was
    using a mixture of xchg() and sk_dst_lock to protect concurrent changes
    to sk->sk_dst_cache, leading to possible corruptions and crashes.

    ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
    way to fix the mess is to remove sk_dst_lock completely, as we did for
    IPv4.

    __ip6_dst_store() and ip6_dst_store() share the same implementation.

    Since sk_setup_caps() can be called with or without the socket lock held,
    we have to use sk_dst_set() instead of __sk_dst_set().

    Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
    in ip6_dst_store() before the sk_setup_caps(sk, dst) call.

    This is because ip6_dst_store() can be called from process context,
    without any lock held.

    As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
    from another cpu doing a concurrent ip6_dst_store()

    Doing the dst dereference before doing the install is needed to make
    sure no use after free would trigger.

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Dec, 2015

1 commit

  • This patch addresses multiple problems :

    UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
    while socket is not locked : Other threads can change np->opt
    concurrently. Dmitry posted a syzkaller
    (http://github.com/google/syzkaller) program demonstrating
    use-after-free.

    Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
    and dccp_v6_request_recv_sock() also need to use RCU protection
    to dereference np->opt once (before calling ipv6_dup_options())

    This patch adds full RCU protection to np->opt

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Nov, 2015

1 commit

  • IPv6 request sockets store a pointer to skb containing the SYN packet
    to be able to transfer it to full blown socket when 3WHS is done
    (ireq->pktopts -> np->pktoptions)

    As explained in commit 5e0724d027f0 ("tcp/dccp: fix hashdance race for
    passive sessions"), we must transfer the skb only if we won the
    hashdance race, if multiple cpus receive the 'ack' packet completing
    3WHS at the same time.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
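
The "only one must win" rule can be sketched as a remove-then-insert on a hash bucket: whoever finds the request still present replaces it with the child, and everyone else backs out. The fixed-size struct bucket and the function names are hypothetical; in the kernel the equivalent steps run under the bucket's lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BUCKET_SLOTS 4

struct bucket { const void *slot[BUCKET_SLOTS]; };   /* hypothetical ehash bucket */

static bool bucket_del(struct bucket *b, const void *sk)
{
	assert(sk);
	for (size_t i = 0; i < BUCKET_SLOTS; i++) {
		if (b->slot[i] == sk) {
			b->slot[i] = NULL;
			return true;
		}
	}
	return false;
}

static void bucket_add(struct bucket *b, const void *sk)
{
	assert(sk);
	for (size_t i = 0; i < BUCKET_SLOTS; i++) {
		if (!b->slot[i]) {
			b->slot[i] = sk;
			return;
		}
	}
	assert(!"bucket full");
}

/* Returns true for the single caller that found the request still hashed. */
static bool replace_req_with_child(struct bucket *b, const void *req,
				   const void *child)
{
	assert(req && child);
	if (!bucket_del(b, req))
		return false;   /* another CPU already won; undo the child */
	bucket_add(b, child);
	return true;
}
```

A caller that gets false must discard the child it created, which corresponds to the "undo the child creation" step in the commit message.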
     

16 Oct, 2015

1 commit

  • Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
    In many cases we also need to release reference on request socket,
    so add a helper to do this, reducing code size and complexity.

    Fixes: 4bdc3d66147b ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Oct, 2015

1 commit

  • When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
    become stale, meaning 3WHS can not complete.

    But current behavior is wrong :
    incoming packets finding such stale sockets are dropped.

    We need instead to cleanup the request socket and perform another
    lookup :
    - Incoming ACK will give a RST answer,
    - SYN rtx might find another listener if available.
    - We expedite cleanup of request sockets and old listener socket.

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet