28 Feb, 2020

3 commits

  • Some transports (hyperv, virtio) acquire the sock lock during the
    .release() callback.

    In the vsock_stream_connect() we call vsock_assign_transport(); if
    the socket was previously assigned to another transport, the
    vsk->transport->release() is called, but the sock lock is already
    held in the vsock_stream_connect(), causing a deadlock reported by
    syzbot:

    INFO: task syz-executor280:9768 blocked for more than 143 seconds.
    Not tainted 5.6.0-rc1-syzkaller #0
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    syz-executor280 D27912 9768 9766 0x00000000
    Call Trace:
    context_switch kernel/sched/core.c:3386 [inline]
    __schedule+0x934/0x1f90 kernel/sched/core.c:4082
    schedule+0xdc/0x2b0 kernel/sched/core.c:4156
    __lock_sock+0x165/0x290 net/core/sock.c:2413
    lock_sock_nested+0xfe/0x120 net/core/sock.c:2938
    virtio_transport_release+0xc4/0xd60 net/vmw_vsock/virtio_transport_common.c:832
    vsock_assign_transport+0xf3/0x3b0 net/vmw_vsock/af_vsock.c:454
    vsock_stream_connect+0x2b3/0xc70 net/vmw_vsock/af_vsock.c:1288
    __sys_connect_file+0x161/0x1c0 net/socket.c:1857
    __sys_connect+0x174/0x1b0 net/socket.c:1874
    __do_sys_connect net/socket.c:1885 [inline]
    __se_sys_connect net/socket.c:1882 [inline]
    __x64_sys_connect+0x73/0xb0 net/socket.c:1882
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    To avoid this issue, this patch remove the lock acquiring in the
    .release() callback of hyperv and virtio transports, and it holds
    the lock when we call vsk->transport->release() in the vsock core.

    Reported-by: syzbot+731710996d79d0d58fbc@syzkaller.appspotmail.com
    Fixes: 408624af4c89 ("vsock: use local transport when it is loaded")
    Signed-off-by: Stefano Garzarella
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: David S. Miller

    Stefano Garzarella
     
  • Fixes: 3a12500ed5dd ("unix: define and set show_fdinfo only if procfs is enabled")
    Signed-off-by: David S. Miller

    David S. Miller
     
  • Follow the pattern used with other *_show_fdinfo functions and only
    define unix_show_fdinfo and set it in proto_ops if CONFIG_PROCFS
    is set.

    Fixes: 3c32da19a858 ("unix: Show number of pending scm files of receive queue in fdinfo")
    Signed-off-by: Tobias Klauser
    Reviewed-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Tobias Klauser
     

27 Feb, 2020

10 commits

  • In smc_ib_remove_dev() check if the provided ib device was actually
    initialized for SMC before.

    Reported-by: syzbot+84484ccebdd4e5451d91@syzkaller.appspotmail.com
    Fixes: a4cf0443c414 ("smc: introduce SMC as an IB-client")
    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller

    Karsten Graul
     
  • syzbot noted that the master MPTCP socket lacks the icsk_sync_mss
    callback, and was able to trigger a null pointer dereference:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    PGD 8e171067 P4D 8e171067 PUD 93fa2067 PMD 0
    Oops: 0010 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 8984 Comm: syz-executor066 Not tainted 5.6.0-rc2-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffc900020b7b80 EFLAGS: 00010246
    RAX: 1ffff110124ba600 RBX: 0000000000000000 RCX: ffff88809fefa600
    RDX: ffff8880994cdb18 RSI: 0000000000000000 RDI: ffff8880925d3140
    RBP: ffffc900020b7bd8 R08: ffffffff870225be R09: fffffbfff140652a
    R10: fffffbfff140652a R11: 0000000000000000 R12: ffff8880925d35d0
    R13: ffff8880925d3140 R14: dffffc0000000000 R15: 1ffff110124ba6ba
    FS: 0000000001a0b880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000000a6d6f000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    cipso_v4_sock_setattr+0x34b/0x470 net/ipv4/cipso_ipv4.c:1888
    netlbl_sock_setattr+0x2a7/0x310 net/netlabel/netlabel_kapi.c:989
    smack_netlabel security/smack/smack_lsm.c:2425 [inline]
    smack_inode_setsecurity+0x3da/0x4a0 security/smack/smack_lsm.c:2716
    security_inode_setsecurity+0xb2/0x140 security/security.c:1364
    __vfs_setxattr_noperm+0x16f/0x3e0 fs/xattr.c:197
    vfs_setxattr fs/xattr.c:224 [inline]
    setxattr+0x335/0x430 fs/xattr.c:451
    __do_sys_fsetxattr fs/xattr.c:506 [inline]
    __se_sys_fsetxattr+0x130/0x1b0 fs/xattr.c:495
    __x64_sys_fsetxattr+0xbf/0xd0 fs/xattr.c:495
    do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x440199
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffcadc19e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000be
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440199
    RDX: 0000000020000200 RSI: 00000000200001c0 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000003 R09: 00000000004002c8
    R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000401a20
    R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    CR2: 0000000000000000

    Address the issue adding a dummy icsk_sync_mss callback.
    To properly sync the subflows mss and options list we need some
    additional infrastructure, which will land to net-next.

    Reported-by: syzbot+f4dfece964792d80b139@syzkaller.appspotmail.com
    Fixes: 2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • IPV6_ADDRFORM is able to transform IPv6 socket to IPv4 one.
    While this operation sounds illogical, we have to support it.

    One of the things it does for TCP socket is to switch sk->sk_prot
    to tcp_prot.

    We now have other layers playing with sk->sk_prot, so we should make
    sure to not interfere with them.

    This patch makes sure sk_prot is the default pointer for TCP IPv6 socket.

    syzbot reported :
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    PGD a0113067 P4D a0113067 PUD a8771067 PMD 0
    Oops: 0010 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 10686 Comm: syz-executor.0 Not tainted 5.6.0-rc2-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
    RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
    RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
    R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
    R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
    FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    inet_release+0x165/0x1c0 net/ipv4/af_inet.c:427
    __sock_release net/socket.c:605 [inline]
    sock_close+0xe1/0x260 net/socket.c:1283
    __fput+0x2e4/0x740 fs/file_table.c:280
    ____fput+0x15/0x20 fs/file_table.c:313
    task_work_run+0x176/0x1b0 kernel/task_work.c:113
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    exit_to_usermode_loop arch/x86/entry/common.c:164 [inline]
    prepare_exit_to_usermode+0x480/0x5b0 arch/x86/entry/common.c:195
    syscall_return_slowpath+0x113/0x4a0 arch/x86/entry/common.c:278
    do_syscall_64+0x11f/0x1c0 arch/x86/entry/common.c:304
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x45c429
    Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f2ae75dac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
    RAX: 0000000000000000 RBX: 00007f2ae75db6d4 RCX: 000000000045c429
    RDX: 0000000000000001 RSI: 000000000000011a RDI: 0000000000000004
    RBP: 000000000076bf20 R08: 0000000000000038 R09: 0000000000000000
    R10: 0000000020000180 R11: 0000000000000246 R12: 00000000ffffffff
    R13: 0000000000000a9d R14: 00000000004ccfb4 R15: 000000000076bf2c
    Modules linked in:
    CR2: 0000000000000000
    ---[ end trace 82567b5207e87bae ]---
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
    RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
    RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
    R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
    R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
    FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot+1938db17e275e85dc328@syzkaller.appspotmail.com
    Cc: Daniel Borkmann
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If an SMC connection to a certain peer is setup the first time,
    a new linkgroup is created. In case of setup failures, such a
    linkgroup is unusable and should disappear. As a first step the
    linkgroup is removed from the linkgroup list in smc_lgr_forget().

    There are 2 problems:
    smc_listen_decline() might be called before linkgroup creation
    resulting in a crash due to calling smc_lgr_forget() with
    parameter NULL.
    If a setup failure occurs after linkgroup creation, the connection
    is never unregistered from the linkgroup, preventing linkgroup
    freeing.

    This patch introduces an enhanced smc_lgr_cleanup_early() function
    which
    * contains a linkgroup check for early smc_listen_decline()
    invocations
    * invokes smc_conn_free() to guarantee unregistering of the
    connection.
    * schedules fast linkgroup removal of the unusable linkgroup

    And the unused function smcd_conn_free() is removed from smc_core.h.

    Fixes: 3b2dec2603d5b ("net/smc: restructure client and server code in af_smc")
    Fixes: 2a0674fffb6bc ("net/smc: improve abnormal termination of link groups")
    Signed-off-by: Ursula Braun
    Signed-off-by: Karsten Graul
    Signed-off-by: David S. Miller

    Ursula Braun
     
  • The put of the flags was added by the commit referenced in fixes tag,
    however the size of the message was not extended accordingly.

    Fix this by adding size of the flags bitfield to the message size.

    Fixes: e38226786022 ("net: sched: update action implementations to support flags")
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • list_for_each_entry_rcu() has built-in RCU and lock checking.

    Pass cond argument to list_for_each_entry_rcu() to silence
    false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled.

    The devlink->lock is held when devlink_dpipe_table_find()
    is called in non RCU read side section. Therefore, pass struct devlink
    to devlink_dpipe_table_find() for lockdep checking.

    Signed-off-by: Madhuparna Bhowmik
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Madhuparna Bhowmik
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes:

    1) Perform garbage collection from workqueue to fix rcu detected
    stall in ipset hash set types, from Jozsef Kadlecsik.

    2) Fix the forceadd evaluation path, also from Jozsef.

    3) Fix nft_set_pipapo selftest, from Stefano Brivio.

    4) Crash when add-flush-add element in pipapo set, also from Stefano.
    Add test to cover this crash.

    5) Remove sysctl entry under mutex in hashlimit, from Cong Wang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Before releasing the global mutex, we only unlink the hashtable
    from the hash list, its proc file is still not unregistered at
    this point. So syzbot could trigger a race condition where a
    parallel htable_create() could register the same file immediately
    after the mutex is released.

    Move htable_remove_proc_entry() back to mutex protection to
    fix this. And, fold htable_destroy() into htable_put() to make
    the code slightly easier to understand.

    Reported-and-tested-by: syzbot+d195fd3b9a364ddd6731@syzkaller.appspotmail.com
    Fixes: c4a3922d2d20 ("netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()")
    Signed-off-by: Cong Wang
    Signed-off-by: Pablo Neira Ayuso

    Cong Wang
     
  • Syzbot reported that ethnl_compact_sanity_checks() can be tricked into
    reading past the end of ETHTOOL_A_BITSET_VALUE and ETHTOOL_A_BITSET_MASK
    attributes and even the message by passing a value between (u32)(-31)
    and (u32)(-1) as ETHTOOL_A_BITSET_SIZE.

    The problem is that DIV_ROUND_UP(attr_nbits, 32) is 0 for such values so
    that zero length ETHTOOL_A_BITSET_VALUE will pass the length check but
    ethnl_bitmap32_not_zero() check would try to access up to 512 MB of
    attribute "payload".

    Prevent this overflow byt limiting the bitset size. Technically, compact
    bitset format would allow bitset sizes up to almost 2^18 (so that the
    nest size does not exceed U16_MAX) but bitsets used by ethtool are much
    shorter. S16_MAX, the largest value which can be directly used as an
    upper limit in policy, should be a reasonable compromise.

    Fixes: 10b518d4e6dd ("ethtool: netlink bitset handling")
    Reported-by: syzbot+7fd4ed5b4234ab1fdccd@syzkaller.appspotmail.com
    Reported-by: syzbot+709b7a64d57978247e44@syzkaller.appspotmail.com
    Reported-by: syzbot+983cb8fb2d17a7af549d@syzkaller.appspotmail.com
    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Michal Kubecek
     
  • Fixes the lower and upper bounds when there are multiple TCs and
    traffic is on the the same TC on the same device.

    The lower bound is represented by 'qoffset' and the upper limit for
    hash value is 'qcount + qoffset'. This gives a clean Rx to Tx queue
    mapping when there are multiple TCs, as the queue indices for upper TCs
    will be offset by 'qoffset'.

    v2: Fixed commit description based on comments.

    Fixes: 1b837d489e06 ("net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash")
    Fixes: eadec877ce9c ("net: Add support for subordinate traffic classes to netdev_pick_tx")
    Signed-off-by: Amritha Nambiar
    Reviewed-by: Alexander Duyck
    Reviewed-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Amritha Nambiar
     

26 Feb, 2020

2 commits

  • Phil reports that adding elements, flushing and re-adding them
    right away:

    nft add table t '{ set s { type ipv4_addr . inet_service; flags interval; }; }'
    nft add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }'
    nft flush set t s
    nft add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }'

    triggers, almost reliably, a crash like this one:

    [ 71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI
    [ 71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192
    [ 71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
    [ 71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
    [ 71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
    [ 71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
    [ 71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
    [ 71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
    [ 71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
    [ 71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
    [ 71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
    [ 71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
    [ 71.334596] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
    [ 71.335780] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
    [ 71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 71.339718] Call Trace:
    [ 71.340093] nft_pipapo_destroy+0x7a/0x170 [nf_tables_set]
    [ 71.340973] nft_set_destroy+0x20/0x50 [nf_tables]
    [ 71.341879] nf_tables_trans_destroy_work+0x246/0x260 [nf_tables]
    [ 71.342916] process_one_work+0x1d5/0x3c0
    [ 71.343601] worker_thread+0x4a/0x3c0
    [ 71.344229] kthread+0xfb/0x130
    [ 71.344780] ? process_one_work+0x3c0/0x3c0
    [ 71.345477] ? kthread_park+0x90/0x90
    [ 71.346129] ret_from_fork+0x35/0x40
    [ 71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink]
    [ 71.348153] ---[ end trace 2eaa8149ca759bcc ]---
    [ 71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
    [ 71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
    [ 71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
    [ 71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
    [ 71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
    [ 71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
    [ 71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
    [ 71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
    [ 71.350025] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
    [ 71.350026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
    [ 71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 71.350030] Kernel panic - not syncing: Fatal exception
    [ 71.350412] Kernel Offset: disabled
    [ 71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]---

    which is caused by dangling elements that have been deactivated, but
    never removed.

    On a flush operation, nft_pipapo_walk() walks through all the elements
    in the mapping table, which are then deactivated by nft_flush_set(),
    one by one, and added to the commit list for removal. Element data is
    then freed.

    On transaction commit, nft_pipapo_remove() is called, and failed to
    remove these elements, leading to the stale references in the mapping.
    The first symptom of this, revealed by KASan, is a one-byte
    use-after-free in subsequent calls to nft_pipapo_walk(), which is
    usually not enough to trigger a panic. When stale elements are used
    more heavily, though, such as double-free via nft_pipapo_destroy()
    as in Phil's case, the problem becomes more noticeable.

    The issue comes from that fact that, on a flush operation,
    nft_pipapo_remove() won't get the actual key data via elem->key,
    elements to be deleted upon commit won't be found by the lookup via
    pipapo_get(), and removal will be skipped. Key data should be fetched
    via nft_set_ext_key(), instead.

    Reported-by: Phil Sutter
    Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges")
    Signed-off-by: Stefano Brivio
    Signed-off-by: Pablo Neira Ayuso

    Stefano Brivio
     
  • Jozsef Kadlecsik says:

    ====================
    ipset patches for nf

    The first one is larger than usual, but the issue could not be solved simpler.
    Also, it's a resend of the patch I submitted a few days ago, with a one line
    fix on top of that: the size of the comment extensions was not taken into
    account at reporting the full size of the set.

    - Fix "INFO: rcu detected stall in hash_xxx" reports of syzbot
    by introducing region locking and using workqueue instead of timer based
    gc of timed out entries in hash types of sets in ipset.
    - Fix the forceadd evaluation path - the bug was also uncovered by the syzbot.
    ====================

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

25 Feb, 2020

2 commits

  • …rnel/git/jberg/mac80211

    Johannes Berg

    ====================
    A few fixes:
    * remove a double mutex-unlock
    * fix a leak in an error path
    * NULL pointer check
    * include if_vlan.h where needed
    * avoid RCU list traversal when not under RCU
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • In br_dev_xmit() we perform vlan filtering in br_allowed_ingress() but
    if the packet has the vlan header inside (e.g. bridge with disabled
    tx-vlan-offload) then the vlan filtering code will use skb_vlan_untag()
    to extract the vid before filtering which in turn calls pskb_may_pull()
    and we may end up with a stale eth pointer. Moreover the cached eth header
    pointer will generally be wrong after that operation. Remove the eth header
    caching and just use eth_hdr() directly, the compiler does the right thing
    and calculates it only once so we don't lose anything.

    Fixes: 057658cb33fb ("bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

24 Feb, 2020

4 commits

  • local->sta_mtx is held in __ieee80211_check_fast_rx_iface().
    No need to use list_for_each_entry_rcu() as it also requires
    a cond argument to avoid false lockdep warnings when not used in
    RCU read-side section (with CONFIG_PROVE_RCU_LIST).
    Therefore use list_for_each_entry();

    Signed-off-by: Madhuparna Bhowmik
    Link: https://lore.kernel.org/r/20200223143302.15390-1-madhuparnabhowmik10@gmail.com
    Signed-off-by: Johannes Berg

    Madhuparna Bhowmik
     
  • We use that here, and do seem to get it through some recursive
    include, but better include it explicitly.

    Signed-off-by: Johannes Berg
    Link: https://lore.kernel.org/r/20200224093814.1b9c258fec67.I45ac150d4e11c72eb263abec9f1f0c7add9bef2b@changeid
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • devlink_dpipe_table_find() should be called under either
    rcu_read_lock() or devlink->lock. devlink_dpipe_table_register()
    calls devlink_dpipe_table_find() without holding the lock
    and acquires it later. Therefore hold the devlink->lock
    from the beginning of devlink_dpipe_table_register().

    Suggested-by: Jiri Pirko
    Signed-off-by: Madhuparna Bhowmik
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Madhuparna Bhowmik
     
  • In a rare corner case the new logic for undo of SYNACK RTO could
    result in triggering the warning in tcp_fastretrans_alert() that says:
    WARN_ON(tp->retrans_out != 0);

    The warning looked like:

    WARNING: CPU: 1 PID: 1 at net/ipv4/tcp_input.c:2818 tcp_ack+0x13e0/0x3270

    The sequence that tickles this bug is:
    - Fast Open server receives TFO SYN with data, sends SYNACK
    - (client receives SYNACK and sends ACK, but ACK is lost)
    - server app sends some data packets
    - (N of the first data packets are lost)
    - server receives client ACK that has a TS ECR matching first SYNACK,
    and also SACKs suggesting the first N data packets were lost
    - server performs TS undo of SYNACK RTO, then immediately
    enters recovery
    - buggy behavior then performed a *second* undo that caused
    the connection to be in CA_Open with retrans_out != 0

    Basically, the incoming ACK packet with SACK blocks causes us to first
    undo the cwnd reduction from the SYNACK RTO, but then immediately
    enters fast recovery, which then makes us eligible for undo again. And
    then tcp_rcv_synrecv_state_fastopen() accidentally performs an undo
    using a "mash-up" of state from two different loss recovery phases: it
    uses the timestamp info from the ACK of the original SYNACK, and the
    undo_marker from the fast recovery.

    This fix refines the logic to only invoke the tcp_try_undo_loss()
    inside tcp_rcv_synrecv_state_fastopen() if the connection is still in
    CA_Loss. If peer SACKs triggered fast recovery, then
    tcp_rcv_synrecv_state_fastopen() can't safely undo.

    Fixes: 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

23 Feb, 2020

2 commits

  • Currently if attribute parsing fails and the genl family
    does not support parallel operation, the error code returned
    by __nlmsg_parse() is discarded by genl_family_rcv_msg_attrs_parse().

    Be sure to report the error for all genl families.

    Fixes: c10e6cf85e7d ("net: genetlink: push attrbuf allocation and parsing to a separate function")
    Fixes: ab5b526da048 ("net: genetlink: always allocate separate attrs for dumpit ops")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Similarly to commit c543cb4a5f07 ("ipv4: ensure rcu_read_lock() in
    ipv4_link_failure()"), __ip_options_compile() must be called under rcu
    protection.

    Fixes: 3da1ed7ac398 ("net: avoid use IPCB in cipso_v4_error")
    Suggested-by: Guillaume Nault
    Signed-off-by: Matteo Croce
    Acked-by: Paul Moore
    Signed-off-by: David S. Miller

    Matteo Croce
     

22 Feb, 2020

3 commits

  • When the forceadd option is enabled, the hash:* types should find and replace
    the first entry in the bucket with the new one if there are no reuseable
    (deleted or timed out) entries. However, the position index was just not set
    to zero and remained the invalid -1 if there were no reuseable entries.

    Reported-by: syzbot+6a86565c74ebe30aea18@syzkaller.appspotmail.com
    Fixes: 23c42a403a9c ("netfilter: ipset: Introduction of new commands and protocol version 7")
    Signed-off-by: Jozsef Kadlecsik

    Jozsef Kadlecsik
     
  • In the case of huge hash:* types of sets, due to the single spinlock of
    a set the processing of the whole set under spinlock protection could take
    too long.

    There were four places where the whole hash table of the set was processed
    from bucket to bucket under holding the spinlock:

    - During resizing a set, the original set was locked to exclude kernel side
    add/del element operations (userspace add/del is excluded by the
    nfnetlink mutex). The original set is actually just read during the
    resize, so the spinlocking is replaced with rcu locking of regions.
    However, thus there can be parallel kernel side add/del of entries.
    In order not to loose those operations a backlog is added and replayed
    after the successful resize.
    - Garbage collection of timed out entries was also protected by the spinlock.
    In order not to lock too long, region locking is introduced and a single
    region is processed in one gc go. Also, the simple timer based gc running
    is replaced with a workqueue based solution. The internal book-keeping
    (number of elements, size of extensions) is moved to region level due to
    the region locking.
    - Adding elements: when the max number of the elements is reached, the gc
    was called to evict the timed out entries. The new approach is that the gc
    is called just for the matching region, assuming that if the region
    (proportionally) seems to be full, then the whole set does. We could scan
    the other regions to check every entry under rcu locking, but for huge
    sets it'd mean a slowdown at adding elements.
    - Listing the set header data: when the set was defined with timeout
    support, the garbage collector was called to clean up timed out entries
    to get the correct element numbers and set size values. Now the set is
    scanned to check non-timed out entries, without actually calling the gc
    for the whole set.

    Thanks to Florian Westphal for helping me to solve the SOFTIRQ-safe ->
    SOFTIRQ-unsafe lock order issues during working on the patch.

    Reported-by: syzbot+4b0e9d4ff3cf117837e5@syzkaller.appspotmail.com
    Reported-by: syzbot+c27b8d5010f45c666ed1@syzkaller.appspotmail.com
    Reported-by: syzbot+68a806795ac89df3aa1c@syzkaller.appspotmail.com
    Fixes: 23c42a403a9c ("netfilter: ipset: Introduction of new commands and protocol version 7")
    Signed-off-by: Jozsef Kadlecsik

    Jozsef Kadlecsik
     
  • Pull networking fixes from David Miller:

    1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
    Cong Wang.

    2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
    is finished, from Magnus Karlsson.

    3) Set table field properly to RT_TABLE_COMPAT when necessary, from
    Jethro Beekman.

    4) NLA_STRING attributes are not necessary NULL terminated, deal wiht
    that in IFLA_ALT_IFNAME. From Eric Dumazet.

    5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

    6) Handle mtu==0 devices properly in wireguard, from Jason A.
    Donenfeld.

    7) Fix several lockdep warnings in bonding, from Taehee Yoo.

    8) Fix cls_flower port blocking, from Jason Baron.

    9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

    10) Fix RDMA race in qede driver, from Michal Kalderon.

    11) Fix several false lockdep warnings by adding conditions to
    list_for_each_entry_rcu(), from Madhuparna Bhowmik.

    12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

    13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

    14) Hey, variables declared in switch statement before any case
    statements are not initialized. I learn something every day. Get
    rids of this stuff in several parts of the networking, from Kees
    Cook.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
    bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
    bnxt_en: Improve device shutdown method.
    net: netlink: cap max groups which will be considered in netlink_bind()
    net: thunderx: workaround BGX TX Underflow issue
    ionic: fix fw_status read
    net: disable BRIDGE_NETFILTER by default
    net: macb: Properly handle phylink on at91rm9200
    s390/qeth: fix off-by-one in RX copybreak check
    s390/qeth: don't warn for napi with 0 budget
    s390/qeth: vnicc Fix EOPNOTSUPP precedence
    openvswitch: Distribute switch variables for initialization
    net: ip6_gre: Distribute switch variables for initialization
    net: core: Distribute switch variables for initialization
    udp: rehash on disconnect
    net/tls: Fix to avoid gettig invalid tls record
    bpf: Fix a potential deadlock with bpf_map_do_batch
    bpf: Do not grab the bucket spinlock by default on htab batch ops
    ice: Wait for VF to be reset/ready before configuration
    ice: Don't tell the OS that link is going down
    ice: Don't reject odd values of usecs set by user
    ...

    Linus Torvalds
     

21 Feb, 2020

8 commits

  • The below-mentioned commit changed the code to unlock *inside*
    the function, but previously the unlock was *outside*. It failed
    to remove the outer unlock, however, leading to double unlock.

    Fix this.

    Fixes: 33483a6b88e4 ("mac80211: fix missing unlock on error in ieee80211_mark_sta_auth()")
    Signed-off-by: Andrei Otcheretianski
    Link: https://lore.kernel.org/r/20200221104719.cce4741cf6eb.I671567b185c8a4c2409377e483fd149ce590f56d@changeid
    [rewrite commit message to better explain what happened]
    Signed-off-by: Johannes Berg

    Andrei Otcheretianski
     
  • We may end up with a NULL reg_rule after the loop in
    handle_channel_custom() if the bandwidth didn't fit,
    check if this is the case and bail out if so.

    Signed-off-by: Johannes Berg
    Link: https://lore.kernel.org/r/20200221104449.3b558a50201c.I4ad3725c4dacaefd2d18d3cc65ba6d18acd5dbfe@changeid
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • If nl80211_parse_he_obss_pd() fails, we leak the previously
    allocated ACL memory. Free it in this case.

    Fixes: 796e90f42b7e ("cfg80211: add support for parsing OBBS_PD attributes")
    Signed-off-by: Johannes Berg
    Link: https://lore.kernel.org/r/20200221104142.835aba4cdd14.I1923b55ba9989c57e13978f91f40bfdc45e60cbd@changeid
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • Since nl_groups is a u32 we can't bind more groups via ->bind
    (netlink_bind) call, but netlink has supported more groups via
    setsockopt() for a long time and thus nlk->ngroups could be over 32.
    Recently I added support for per-vlan notifications and increased the
    groups to 33 for NETLINK_ROUTE which exposed an old bug in the
    netlink_bind() code causing out-of-bounds access on archs where unsigned
    long is 32 bits via test_bit() on a local variable. Fix this by capping the
    maximum groups in netlink_bind() to BITS_PER_TYPE(u32), effectively
    capping them at 32 which is the minimum of allocated groups and the
    maximum groups which can be bound via netlink_bind().

    CC: Christophe Leroy
    CC: Richard Guy Briggs
    Fixes: 4f520900522f ("netlink: have netlink per-protocol bind function return an error code.")
    Reported-by: Erhard F.
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • The description says 'If unsure, say N.' but
    the module is built as M by default (once
    the dependencies are satisfied).

    When the module is selected (Y or M), it enables
    NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS
    which alter kernel internal structures.

    We (Android Studio Emulator) currently do not
    use this module and think this it is more consistent
    to have it disabled by default as opposite to
    disabling it explicitly to prevent enabling
    NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS.

    Signed-off-by: Roman Kiryanov
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Roman Kiryanov
     
  • Variables declared in a switch statement before any case statements
    cannot be automatically initialized with compiler instrumentation (as
    they are not part of any execution flow). With GCC's proposed automatic
    stack variable initialization feature, this triggers a warning (and they
    don't get initialized). Clang's automatic stack variable initialization
    (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
    doesn't initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase,
    so even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where
    they're used or lift them up into the main function body.

    net/openvswitch/flow_netlink.c: In function ‘validate_set’:
    net/openvswitch/flow_netlink.c:2711:29: warning: statement will never be executed [-Wswitch-unreachable]
    2711 | const struct ovs_key_ipv4 *ipv4_key;
    | ^~~~~~~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     
  • Variables declared in a switch statement before any case statements
    cannot be automatically initialized with compiler instrumentation (as
    they are not part of any execution flow). With GCC's proposed automatic
    stack variable initialization feature, this triggers a warning (and they
    don't get initialized). Clang's automatic stack variable initialization
    (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
    doesn't initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase,
    so even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where
    they're used or lift them up into the main function body.

    net/ipv6/ip6_gre.c: In function ‘ip6gre_err’:
    net/ipv6/ip6_gre.c:440:32: warning: statement will never be executed [-Wswitch-unreachable]
    440 | struct ipv6_tlv_tnl_enc_lim *tel;
    | ^~~

    net/ipv6/ip6_tunnel.c: In function ‘ip6_tnl_err’:
    net/ipv6/ip6_tunnel.c:520:32: warning: statement will never be executed [-Wswitch-unreachable]
    520 | struct ipv6_tlv_tnl_enc_lim *tel;
    | ^~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     
  • Variables declared in a switch statement before any case statements
    cannot be automatically initialized with compiler instrumentation (as
    they are not part of any execution flow). With GCC's proposed automatic
    stack variable initialization feature, this triggers a warning (and they
    don't get initialized). Clang's automatic stack variable initialization
    (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
    doesn't initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase,
    so even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where
    they're used or lift them up into the main function body.

    net/core/skbuff.c: In function ‘skb_checksum_setup_ip’:
    net/core/skbuff.c:4809:7: warning: statement will never be executed [-Wswitch-unreachable]
    4809 | int err;
    | ^~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     

20 Feb, 2020

5 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2020-02-19

    The following pull-request contains BPF updates for your *net* tree.

    We've added 10 non-merge commits during the last 10 day(s) which contain
    a total of 10 files changed, 93 insertions(+), 31 deletions(-).

    The main changes are:

    1) batched bpf hashtab fixes from Brian and Yonghong.

    2) various selftests and libbpf fixes.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • As of the below commit, udp sockets bound to a specific address can
    coexist with one bound to the any addr for the same port.

    The commit also phased out the use of socket hashing based only on
    port (hslot), in favor of always hashing on {addr, port} (hslot2).

    The change broke the following behavior with disconnect (AF_UNSPEC):

    server binds to 0.0.0.0:1337
    server connects to 127.0.0.1:80
    server disconnects
    client connects to 127.0.0.1:1337
    client sends "hello"
    server reads "hello" // times out, packet did not find sk

    On connect the server acquires a specific source addr suitable for
    routing to its destination. On disconnect it reverts to the any addr.

    The connect call triggers a rehash to a different hslot2. On
    disconnect, add the same to return to the original hslot2.

    Skip this step if the socket is going to be unhashed completely.

    Fixes: 4cdeeee9252a ("net: udp: prefer listeners bound to an address")
    Reported-by: Pavel Roskin
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • Current code doesn't check if tcp sequence number is starting from (/after)
    1st record's start sequnce number. It only checks if seq number is before
    1st record's end sequnce number. This problem will always be a possibility
    in re-transmit case. If a record which belongs to a requested seq number is
    already deleted, tls_get_record will start looking into list and as per the
    check it will look if seq number is before the end seq of 1st record, which
    will always be true and will return 1st record always, it should in fact
    return NULL.
    As part of the fix, start looking each record only if the sequence number
    lies in the list else return NULL.
    There is one more check added, driver look for the start marker record to
    handle tcp packets which are before the tls offload start sequence number,
    hence return 1st record if the record is tls start marker and seq number is
    before the 1st record's starting sequence number.

    Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
    Signed-off-by: Rohit Maheshwari
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Rohit Maheshwari
     
  • list_for_each_entry_rcu() has built-in RCU and lock checking.

    Pass cond argument to list_for_each_entry_rcu() to silence
    false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
    by default.

    Signed-off-by: Madhuparna Bhowmik
    Signed-off-by: David S. Miller

    Madhuparna Bhowmik
     
  • node_db is traversed using list_for_each_entry_rcu
    outside an RCU read-side critical section but under the protection
    of hsr->list_lock.

    Hence, add corresponding lockdep expression to silence false-positive
    warnings, and harden RCU lists.

    Signed-off-by: Amol Grover
    Signed-off-by: David S. Miller

    Amol Grover
     

19 Feb, 2020

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    This batch contains Netfilter fixes for net:

    1) Restrict hashlimit size to 1048576, from Cong Wang.

    2) Check for offload flags from nf_flow_table_offload_setup(),
    this fixes a crash in case the hardware offload is disabled.
    From Florian Westphal.

    3) Three preparation patches to extend the conntrack clash resolution,
    from Florian.

    4) Extend clash resolution to deal with DNS packets from the same flow
    racing to set up the NAT configuration.

    5) Small documentation fix in pipapo, from Stefano Brivio.

    6) Remove misleading unlikely() from pipapo_refill(), also from Stefano.

    7) Reduce hashlimit mutex scope, from Cong Wang. This patch is actually
    triggering another problem, still under discussion, another patch to
    fix this will follow up.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller