05 Apr, 2020

1 commit


29 Mar, 2020

1 commit


27 Mar, 2020

12 commits

  • Change the rpcrdma_xprt_disconnect() function so that it no longer
    waits for the DISCONNECTED event. This prevents blocking if the
    remote is unresponsive.

    In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is
    detached. Upon return from rpcrdma_xprt_disconnect(), the transport
    (r_xprt) is ready immediately for a new connection.

    The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now
    handled almost identically.

    However, because the lifetimes of rpcrdma_xprt structures and
    rpcrdma_ep structures are now independent, creating an rpcrdma_ep
    needs to take a module ref count. The ep now owns most of the
    hardware resources for a transport.

    Also, a kref is needed to ensure that rpcrdma_ep sticks around
    long enough for the cm_event_handler to finish.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
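
The lifetime rule described above (the ep must outlive the transport's own reference until the event handler is done) can be illustrated with a small user-space sketch. All names here (sketch_ep, sketch_ep_get, sketch_ep_put) are hypothetical, and the counter is a plain int rather than the kernel's atomic kref, so this shows only the get/put pattern and is not the xprtrdma code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an endpoint that carries a reference count. */
struct sketch_ep {
	int refcount;		/* plays the role of a struct kref (not atomic here) */
	int *destroyed;		/* set to 1 when the last reference is dropped */
};

/* Create an endpoint holding one reference, owned by the transport. */
static struct sketch_ep *sketch_ep_create(int *destroyed)
{
	struct sketch_ep *ep = malloc(sizeof(*ep));

	if (!ep)
		return NULL;
	ep->refcount = 1;
	ep->destroyed = destroyed;
	*destroyed = 0;
	return ep;
}

/* Take an extra reference, e.g. while an event handler is running. */
static void sketch_ep_get(struct sketch_ep *ep)
{
	ep->refcount++;
}

/* Drop a reference. Returns 1 if this call released the final one. */
static int sketch_ep_put(struct sketch_ep *ep)
{
	if (--ep->refcount > 0)
		return 0;
	*ep->destroyed = 1;
	free(ep);
	return 1;
}
```

In this sketch, a handler that calls sketch_ep_get() before the transport detaches keeps the ep alive until its own sketch_ep_put().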
     
  • rpcrdma_cm_event_handler() is always passed an @id pointer that is
    valid. However, in a subsequent patch, we won't be able to extract
    an r_xprt in every case. So instead of using the r_xprt's
    presentation address strings, extract them from struct rdma_cm_id.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • I eventually want to allocate rpcrdma_ep separately from struct
    rpcrdma_xprt so that on occasion there can be more than one ep per
    xprt.

    The new struct rpcrdma_ep will contain all the fields currently in
    rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings
    for the connection, in addition to per-connection settings
    negotiated with the remote.

    Take this opportunity to rename the existing ep fields from rep_* to
    re_* to disambiguate these from struct rpcrdma_rep.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Completion errors after a disconnect often occur much sooner than a
    CM_DISCONNECT event. Use this to try to detect connection loss more
    quickly.

    Note that other kernel ULPs do take care to disconnect explicitly
    when a WR is flushed.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up:
    The upper layer serializes calls to xprt_rdma_close, so there is no
    need for an atomic bit operation, saving 8 bytes in rpcrdma_ia.

    This enables merging rpcrdma_ia_remove directly into the disconnect
    logic.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now
    responsible for allocating all per-connection hardware resources.

    With this clean-up, all three arms of the switch statement in
    rpcrdma_ep_connect are exactly the same now, thus the switch can be
    removed.

    Because device removal behaves a little differently than
    disconnection, there is a little more work to be done before
    rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So
    it is not quite symmetrical with rpcrdma_ep_create() yet.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Make a Protection Domain (PD) a per-connection resource rather than
    a per-transport resource. In other words, when the connection
    terminates, the PD is destroyed.

    Thus there is one less HW resource that remains allocated to a
    transport after a connection is closed.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up: Simplify the synopses of functions in the connect and
    disconnect paths in preparation for combining the rpcrdma_ia and
    rpcrdma_ep structures.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up: Simplify the synopses of functions in the post_send path
    by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep
    structures. Take the opportunity to rename the function to be
    consistent with the "subsystem _ object _ verb" naming scheme.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and
    rpcrdma_ep_destroy().

    rpcrdma_ep_create will be invoked at connect time instead of at
    transport set-up time. It will be responsible for allocating per-
    connection resources. In this patch it allocates the CQs and
    creates a QP. More to come.

    rpcrdma_ep_destroy() is the inverse functionality that is
    invoked at disconnect time. It will be responsible for releasing
    the CQs and QP.

    These changes should be safe to do because both connect and
    disconnect are guaranteed to be serialized by the transport send
    lock.

    This takes us another step closer to resolving the address and route
    only at connect time so that connection failover to another device
    will work correctly.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     
  • Two changes:
    - Show the number of SG entries that were mapped. This helps debug
    DMA-related problems.
    - Record the MR's resource ID instead of its memory address. This
    groups each MR with its associated rdma-tool output, and reduces
    needless exposure of memory addresses.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker

    Chuck Lever
     

26 Mar, 2020

1 commit

  • Commit 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply
    buffer size") changed how "req->rq_rcvsize" is calculated. It used
    to use the au_cslack value, which was nice and large; it now uses
    the au_rslack value, which turns out to be too small.

    Since 5.1, a v3 mount with sec=krb5p fails against an Ontap server
    because the client's receive buffer is too small.

    For gss krb5p, we need to account for the mic token in the verifier,
    and the wrap token in the wrap token.

    RFC 4121 defines:
    mic token
    Octet no   Name       Description
    --------------------------------------------------------------
    0..1       TOK_ID     Identification field. Tokens emitted by
                          GSS_GetMIC() contain the hex value 04 04
                          expressed in big-endian order in this
                          field.
    2          Flags      Attributes field, as described in section
                          4.2.2.
    3..7       Filler     Contains five octets of hex value FF.
    8..15      SND_SEQ    Sequence number field in clear text,
                          expressed in big-endian order.
    16..last   SGN_CKSUM  Checksum of the "to-be-signed" data and
                          octet 0..15, as described in section 4.2.4.

    That is 16 bytes (GSS_KRB5_TOK_HDR_LEN) plus the checksum.

    wrap token
    Octet no   Name       Description
    --------------------------------------------------------------
    0..1       TOK_ID     Identification field. Tokens emitted by
                          GSS_Wrap() contain the hex value 05 04
                          expressed in big-endian order in this
                          field.
    2          Flags      Attributes field, as described in section
                          4.2.2.
    3          Filler     Contains the hex value FF.
    4..5       EC         Contains the "extra count" field, in big-
                          endian order as described in section 4.2.3.
    6..7       RRC        Contains the "right rotation count" in big-
                          endian order, as described in section
                          4.2.5.
    8..15      SND_SEQ    Sequence number field in clear text,
                          expressed in big-endian order.
    16..last   Data       Encrypted data for Wrap tokens with
                          confidentiality, or plaintext data followed
                          by the checksum for Wrap tokens without
                          confidentiality, as described in section
                          4.2.4.

    That is also 16 bytes of header (GSS_KRB5_TOK_HDR_LEN), plus the
    encrypted data and the checksum (and other things like padding).

    RFC 3961 defines known cksum sizes:
    Checksum type          sumtype   checksum   section or
                           value     size       reference
    ---------------------------------------------------------------------
    CRC32                  1         4          6.1.3
    rsa-md4                2         16         6.1.2
    rsa-md4-des            3         24         6.2.5
    des-mac                4         16         6.2.7
    des-mac-k              5         8          6.2.8
    rsa-md4-des-k          6         16         6.2.6
    rsa-md5                7         16         6.1.1
    rsa-md5-des            8         24         6.2.4
    rsa-md5-des3           9         24         ??
    sha1 (unkeyed)         10        20         ??
    hmac-sha1-des3-kd      12        20         6.3
    hmac-sha1-des3         13        20         ??
    sha1 (unkeyed)         14        20         ??
    hmac-sha1-96-aes128    15        20         [KRB5-AES]
    hmac-sha1-96-aes256    16        20         [KRB5-AES]
    [reserved]             0x8003    ?          [GSS-KRB5]

    The Linux kernel now mainly supports types 15 and 16, so the maximum
    cksum size is 20 bytes (GSS_KRB5_MAX_CKSUM_LEN).

    Re-use already existing define of GSS_KRB5_MAX_SLACK_NEEDED that's used
    for encoding the gss_wrap tokens (same tokens are used in reply).

    Fixes: 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply buffer size")
    Signed-off-by: Olga Kornievskaia
    Reviewed-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Olga Kornievskaia
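
As a rough illustration of the MIC token layout quoted from RFC 4121 above, the following sketch checks the fixed 16-byte header and reports how many bytes remain for the checksum. The names (sketch_mic_hdr, sketch_parse_mic_hdr, SKETCH_TOK_HDR_LEN) are hypothetical and are not the kernel's GSS code. It only walks the octet table and does no cryptographic verification.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TOK_HDR_LEN 16	/* octets 0..15 in the table above */

struct sketch_mic_hdr {
	uint8_t flags;		/* octet 2 */
	uint64_t snd_seq;	/* octets 8..15, big-endian on the wire */
	size_t cksum_len;	/* bytes following the 16-byte header */
};

/* Returns 0 on success, -1 if buf does not start with a MIC token header. */
static int sketch_parse_mic_hdr(const uint8_t *buf, size_t len,
				struct sketch_mic_hdr *out)
{
	size_t i;

	if (len < SKETCH_TOK_HDR_LEN)
		return -1;
	if (buf[0] != 0x04 || buf[1] != 0x04)	/* TOK_ID for GSS_GetMIC() */
		return -1;
	for (i = 3; i <= 7; i++)		/* Filler: five octets of FF */
		if (buf[i] != 0xff)
			return -1;

	out->flags = buf[2];
	out->snd_seq = 0;
	for (i = 8; i <= 15; i++)		/* assemble big-endian SND_SEQ */
		out->snd_seq = (out->snd_seq << 8) | buf[i];
	out->cksum_len = len - SKETCH_TOK_HDR_LEN;
	return 0;
}
```

With a 20-byte checksum, the largest size discussed above, such a token occupies 16 + 20 = 36 bytes. That overhead is the kind a reply buffer has to leave room for.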
     

16 Mar, 2020

6 commits

  • By preventing compiler inlining of the integrity and privacy
    helpers, stack utilization for the common case (authentication only)
    goes way down.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
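
The effect described above can be sketched in user-space C. This assumes a GCC or Clang compiler that honors `__attribute__((noinline))`; the function names and the 512-byte scratch size are hypothetical and are not taken from the SUNRPC code. The idea is that a large local buffer in a rarely used helper stays out of the caller's stack frame when the helper is not inlined.

```c
#include <string.h>

#define SKETCH_SCRATCH 512	/* illustrative size only */

/*
 * Rare path: needs a large scratch buffer. Marking it noinline keeps
 * that buffer in this function's own frame instead of the caller's.
 */
static __attribute__((noinline)) int sketch_unwrap_priv(const unsigned char *data,
							size_t len)
{
	unsigned char scratch[SKETCH_SCRATCH];
	size_t n = len < sizeof(scratch) ? len : sizeof(scratch);
	size_t i;
	int sum = 0;

	memcpy(scratch, data, n);
	for (i = 0; i < n; i++)
		sum += scratch[i];
	return sum;
}

/* Common path: has no large locals of its own. */
static int sketch_unwrap(int privacy, const unsigned char *data, size_t len)
{
	if (!privacy)
		return (int)len;	/* authentication only */
	return sketch_unwrap_priv(data, len);
}
```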
     
  • Clean up: this function is no longer used.

    Signed-off-by: Chuck Lever
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • xdr_buf_read_mic() tries to find unused contiguous space in a
    received xdr_buf in order to linearize the checksum for the call
    to gss_verify_mic. However, the corner cases in this code are
    numerous and we seem to keep missing them. I've just hit yet
    another buffer overrun related to it.

    This overrun is at the end of xdr_buf_read_mic():

        if (buf->tail[0].iov_len != 0)
                mic->data = buf->tail[0].iov_base + buf->tail[0].iov_len;
        else
                mic->data = buf->head[0].iov_base + buf->head[0].iov_len;
        __read_bytes_from_xdr_buf(&subbuf, mic->data, mic->len);
        return 0;

    This logic assumes the transport has set the length of the tail
    based on the size of the received message. base + len is then
    supposed to be off the end of the message but still within the
    actual buffer.

    In fact, the length of the tail is set by the upper layer when the
    Call is encoded so that the end of the tail is actually the end of
    the allocated buffer itself. This causes the logic above to set
    mic->data to point past the end of the receive buffer.

    The "mic->data = head" arm of this if statement is no less fragile.

    As near as I can tell, this has been a problem forever. I'm not sure
    that minimizing au_rslack recently changed this pathology much.

    So instead, let's use a more straightforward approach: kmalloc a
    separate buffer to linearize the checksum. This is similar to
    how gss_validate() currently works.

    Coming back to this code, I had some trouble understanding what
    was going on. So I've cleaned up the variable naming and added
    a few comments that point back to the XDR definition in RFC 2203
    to help guide future spelunkers, including myself.

    As an added clean up, the functionality that was in
    xdr_buf_read_mic() is folded directly into gss_unwrap_resp_integ(),
    as that is its only caller.

    Signed-off-by: Chuck Lever
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Trond Myklebust

    Chuck Lever
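
The "allocate a separate buffer and linearize into it" approach described above can be sketched in user space as follows. The segment structure and the function name are hypothetical, and malloc stands in for kmalloc. The point of the sketch is that the copy target is a fresh allocation sized for the checksum, so nothing is written past the end of the received buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one piece (head, page, tail) of a segmented buffer. */
struct sketch_seg {
	const unsigned char *base;
	size_t len;
};

/*
 * Copy len bytes starting at offset out of nseg segments into a newly
 * allocated contiguous buffer. Returns NULL if the range does not lie
 * entirely within the segments or if allocation fails. The caller
 * frees the result.
 */
static unsigned char *sketch_linearize(const struct sketch_seg *segs, size_t nseg,
				       size_t offset, size_t len)
{
	unsigned char *out, *p;
	size_t i, total = 0;

	for (i = 0; i < nseg; i++)
		total += segs[i].len;
	if (offset > total || len > total - offset)
		return NULL;		/* would read past the message */

	out = malloc(len ? len : 1);
	if (!out)
		return NULL;
	p = out;
	for (i = 0; i < nseg && len > 0; i++) {
		size_t avail, n;

		if (offset >= segs[i].len) {	/* range starts in a later segment */
			offset -= segs[i].len;
			continue;
		}
		avail = segs[i].len - offset;
		n = len < avail ? len : avail;
		memcpy(p, segs[i].base + offset, n);
		p += n;
		len -= n;
		offset = 0;
	}
	return out;
}
```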
     
  • The variable status is being initialized with a value that is never
    read and it is being updated later with a new value. The initialization
    is redundant and can be removed.

    Addresses-Coverity: ("Unused value")
    Signed-off-by: Colin Ian King
    Signed-off-by: Trond Myklebust

    Colin Ian King
     
  • If the RPC call is synchronous, assume the cred is already pinned
    by the caller.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Add a flag to signal to the RPC layer that the credential is already
    pinned for the duration of the RPC call.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

13 Mar, 2020

1 commit

  • There was a bug that was causing packets to be sent to the driver
    without first calling dequeue() on the "child" qdisc. And the KASAN
    report below shows that sending a packet without calling dequeue()
    leads to bad results.

    The problem is that when checking the last qdisc "child" we do not set
    the returned skb to NULL, which can cause it to be sent to the driver,
    and so after the skb is sent, it may be freed, and in some situations a
    reference to it may still be in the child qdisc, because it was never
    dequeued.

    The crash log looks like this:

    [ 19.937538] ==================================================================
    [ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
    [ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
    [ 19.939612]
    [ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
    [ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
    [ 19.941523] Call Trace:
    [ 19.941774]
    [ 19.941985] dump_stack+0x97/0xe0
    [ 19.942323] print_address_description.constprop.0+0x3b/0x60
    [ 19.942884] ? taprio_dequeue_soft+0x620/0x780
    [ 19.943325] ? taprio_dequeue_soft+0x620/0x780
    [ 19.943767] __kasan_report.cold+0x1a/0x32
    [ 19.944173] ? taprio_dequeue_soft+0x620/0x780
    [ 19.944612] kasan_report+0xe/0x20
    [ 19.944954] taprio_dequeue_soft+0x620/0x780
    [ 19.945380] __qdisc_run+0x164/0x18d0
    [ 19.945749] net_tx_action+0x2c4/0x730
    [ 19.946124] __do_softirq+0x268/0x7bc
    [ 19.946491] irq_exit+0x17d/0x1b0
    [ 19.946824] smp_apic_timer_interrupt+0xeb/0x380
    [ 19.947280] apic_timer_interrupt+0xf/0x20
    [ 19.947687]
    [ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0
    [ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
    [ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
    [ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
    [ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
    [ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
    [ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
    [ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
    [ 19.954408] ? default_idle_call+0x2e/0x70
    [ 19.954816] ? default_idle+0x1f/0x2d0
    [ 19.955192] default_idle_call+0x5e/0x70
    [ 19.955584] do_idle+0x3d4/0x500
    [ 19.955909] ? arch_cpu_idle_exit+0x40/0x40
    [ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30
    [ 19.956829] ? trace_hardirqs_on+0x30/0x160
    [ 19.957242] cpu_startup_entry+0x19/0x20
    [ 19.957633] start_secondary+0x2a6/0x380
    [ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0
    [ 19.958486] secondary_startup_64+0xa4/0xb0
    [ 19.958921]
    [ 19.959078] Allocated by task 33:
    [ 19.959412] save_stack+0x1b/0x80
    [ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0
    [ 19.960222] kmem_cache_alloc+0xe4/0x230
    [ 19.960617] __alloc_skb+0x91/0x510
    [ 19.960967] ndisc_alloc_skb+0x133/0x330
    [ 19.961358] ndisc_send_ns+0x134/0x810
    [ 19.961735] addrconf_dad_work+0xad5/0xf80
    [ 19.962144] process_one_work+0x78e/0x13a0
    [ 19.962551] worker_thread+0x8f/0xfa0
    [ 19.962919] kthread+0x2ba/0x3b0
    [ 19.963242] ret_from_fork+0x3a/0x50
    [ 19.963596]
    [ 19.963753] Freed by task 33:
    [ 19.964055] save_stack+0x1b/0x80
    [ 19.964386] __kasan_slab_free+0x12f/0x180
    [ 19.964830] kmem_cache_free+0x80/0x290
    [ 19.965231] ip6_mc_input+0x38a/0x4d0
    [ 19.965617] ipv6_rcv+0x1a4/0x1d0
    [ 19.965948] __netif_receive_skb_one_core+0xf2/0x180
    [ 19.966437] netif_receive_skb+0x8c/0x3c0
    [ 19.966846] br_handle_frame_finish+0x779/0x1310
    [ 19.967302] br_handle_frame+0x42a/0x830
    [ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90
    [ 19.968167] __netif_receive_skb_one_core+0x96/0x180
    [ 19.968658] process_backlog+0x198/0x650
    [ 19.969047] net_rx_action+0x2fa/0xaa0
    [ 19.969420] __do_softirq+0x268/0x7bc
    [ 19.969785]
    [ 19.969940] The buggy address belongs to the object at ffff888112862840
    [ 19.969940] which belongs to the cache skbuff_head_cache of size 224
    [ 19.971202] The buggy address is located 140 bytes inside of
    [ 19.971202] 224-byte region [ffff888112862840, ffff888112862920)
    [ 19.972344] The buggy address belongs to the page:
    [ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
    [ 19.973930] flags: 0x8000000000010200(slab|head)
    [ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
    [ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
    [ 19.975915] page dumped because: kasan: bad access detected
    [ 19.976461] page_owner tracks the page as allocated
    [ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
    [ 19.978332] prep_new_page+0x24b/0x330
    [ 19.978707] get_page_from_freelist+0x2057/0x2c90
    [ 19.979170] __alloc_pages_nodemask+0x218/0x590
    [ 19.979619] new_slab+0x9d/0x300
    [ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0
    [ 19.980421] __slab_alloc.constprop.0+0x30/0x60
    [ 19.980870] kmem_cache_alloc+0x201/0x230
    [ 19.981269] __alloc_skb+0x91/0x510
    [ 19.981620] alloc_skb_with_frags+0x78/0x4a0
    [ 19.982043] sock_alloc_send_pskb+0x5eb/0x750
    [ 19.982476] unix_stream_sendmsg+0x399/0x7f0
    [ 19.982904] sock_sendmsg+0xe2/0x110
    [ 19.983262] ____sys_sendmsg+0x4de/0x6d0
    [ 19.983660] ___sys_sendmsg+0xe4/0x160
    [ 19.984032] __sys_sendmsg+0xab/0x130
    [ 19.984396] do_syscall_64+0xe7/0xae0
    [ 19.984761] page last free stack trace:
    [ 19.985142] __free_pages_ok+0x432/0xbc0
    [ 19.985533] qlist_free_all+0x56/0xc0
    [ 19.985907] quarantine_reduce+0x149/0x170
    [ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0
    [ 19.986791] kmem_cache_alloc+0xe4/0x230
    [ 19.987182] prepare_creds+0x24/0x440
    [ 19.987548] do_faccessat+0x80/0x590
    [ 19.987906] do_syscall_64+0xe7/0xae0
    [ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 19.988775]
    [ 19.988930] Memory state around the buggy address:
    [ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    [ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 19.991529] ^
    [ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
    [ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Reported-by: Michael Schmidt
    Signed-off-by: Vinicius Costa Gomes
    Acked-by: Andre Guedes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
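
The failure mode described above can be modeled with a simplified dequeue loop. This is a hypothetical sketch (the structure, the gate flag, and the function name are invented for illustration) and not the taprio code. Each child is peeked first, and the candidate must be reset to NULL when it is rejected. Otherwise the packet peeked from the last child could be returned without ever being removed from that child's queue.

```c
#include <stddef.h>

/* Hypothetical child queue holding at most one packet. */
struct sketch_child {
	int gate_open;	/* whether this queue may transmit right now */
	int *head;	/* packet at the head of the queue, or NULL */
};

/* Returns a dequeued packet, or NULL if no child may transmit. */
static int *sketch_dequeue(struct sketch_child *children, size_t n)
{
	int *pkt = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		pkt = children[i].head;		/* peek only, nothing removed yet */
		if (!pkt)
			continue;
		if (!children[i].gate_open) {
			pkt = NULL;		/* forget the rejected candidate */
			continue;
		}
		children[i].head = NULL;	/* the actual dequeue */
		return pkt;
	}
	return pkt;	/* NULL here, thanks to the reset above */
}
```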
     

12 Mar, 2020

5 commits

  • Locking newsk while still holding the listener lock triggered
    a lockdep splat [1]

    We can simply move the memcg code after we release the listener lock,
    as this can also help if multiple threads are sharing a common listener.

    Also fix a typo while reading socket sk_rmem_alloc.

    [1]
    WARNING: possible recursive locking detected
    5.6.0-rc3-syzkaller #0 Not tainted
    --------------------------------------------
    syz-executor598/9524 is trying to acquire lock:
    ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
    ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492

    but task is already holding lock:
    ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
    ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(sk_lock-AF_INET6);
    lock(sk_lock-AF_INET6);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by syz-executor598/9524:
    #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
    #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

    stack backtrace:
    CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x188/0x20d lib/dump_stack.c:118
    print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
    check_deadlock kernel/locking/lockdep.c:2411 [inline]
    validate_chain kernel/locking/lockdep.c:2954 [inline]
    __lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
    lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
    lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
    lock_sock include/net/sock.h:1541 [inline]
    inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
    inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
    __sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
    __sys_accept4+0x53/0x90 net/socket.c:1809
    __do_sys_accept4 net/socket.c:1821 [inline]
    __se_sys_accept4 net/socket.c:1818 [inline]
    __x64_sys_accept4+0x93/0xf0 net/socket.c:1818
    do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4445c9
    Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
    RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000

    Fixes: d752a4986532 ("net: memcg: late association of sock to memcg")
    Signed-off-by: Eric Dumazet
    Cc: Shakeel Butt
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The Internet Assigned Numbers Authority (IANA) has recently assigned
    a protocol number value of 143 for Ethernet [1].

    Before this assignment, encapsulation mechanisms such as Segment Routing
    used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated
    payload is an Ethernet frame.

    In this patch, we add the definition of the Ethernet protocol number to the
    kernel headers and update the SRv6 L2 tunnels to use it.

    [1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml

    Signed-off-by: Paolo Lungaroni
    Reviewed-by: Andrea Mayer
    Acked-by: Ahmed Abdelsalam
    Signed-off-by: David S. Miller

    Paolo Lungaroni
     
  • By default, DSA drivers should configure CPU and DSA ports to their
    maximum speed. In many configurations this is sufficient to make the
    link work.

    In some cases it is necessary to configure the link to run slower,
    e.g. because of limitations of the SoC it is connected to. Or back to
    back PHYs are used and the PHY needs to be driven in order to
    establish link. In this case, phylink is used.

    Only instantiate phylink if it is required. If there is no PHY, or no
    fixed link properties, phylink can upset a link which works in the
    default configuration.

    Fixes: 0e27921816ad ("net: dsa: Use PHYLINK for the CPU/DSA ports")
    Signed-off-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • In one error case, tpacket_rcv drops packets after incrementing the
    ring producer index.

    If this happens, it does not update tp_status to TP_STATUS_USER and
    thus the reader is stalled for an iteration of the ring, causing out
    of order arrival.

    The only such error path is when virtio_net_hdr_from_skb fails due
    to encountering an unknown GSO type.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     
  • caifdevs->list is traversed using list_for_each_entry_rcu()
    outside an RCU read-side critical section but under the
    protection of rtnl_mutex. Hence, add the corresponding lockdep
    expression to silence the following false-positive warning:

    [ 10.868467] =============================
    [ 10.869082] WARNING: suspicious RCU usage
    [ 10.869817] 5.6.0-rc1-00177-g06ec0a154aae4 #1 Not tainted
    [ 10.870804] -----------------------------
    [ 10.871557] net/caif/caif_dev.c:115 RCU-list traversed in non-reader section!!

    Reported-by: kernel test robot
    Signed-off-by: Amol Grover
    Signed-off-by: David S. Miller

    Amol Grover
     

11 Mar, 2020

7 commits

  • When trying to transmit to an unknown destination, the mesh code would
    unconditionally transmit a HWMP PREQ even if HWMP is not the current
    path selection algorithm.

    Signed-off-by: Nicolas Cavallari
    Link: https://lore.kernel.org/r/20200305140409.12204-1-cavallar@lri.fr
    Signed-off-by: Johannes Berg

    Nicolas Cavallari
     
  • Add missing attribute validation for NL80211_ATTR_OPER_CLASS
    to the netlink policy.

    Fixes: 1057d35ede5d ("cfg80211: introduce TDLS channel switch commands")
    Signed-off-by: Jakub Kicinski
    Link: https://lore.kernel.org/r/20200303051058.4089398-4-kuba@kernel.org
    Signed-off-by: Johannes Berg

    Jakub Kicinski
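
A policy table of the kind these validation patches extend can be sketched in user space. Everything below is hypothetical: the attribute and type names are invented, and treating the operating class as a one-byte value is an assumption made for the example. One difference is called out in the code: this sketch rejects attributes that have no policy entry, whereas the gap being closed above is that such attributes were not validated.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute numbers. */
enum sketch_attr {
	SKETCH_ATTR_UNSPEC,
	SKETCH_ATTR_OPER_CLASS,
	SKETCH_ATTR_MAX
};

/* Hypothetical payload types a policy entry can require. */
enum sketch_type {
	SKETCH_T_UNSPEC,
	SKETCH_T_U8,
	SKETCH_T_U32
};

/* The policy: one expected type per attribute (assumed u8 for the example). */
static const enum sketch_type sketch_policy[SKETCH_ATTR_MAX] = {
	[SKETCH_ATTR_OPER_CLASS] = SKETCH_T_U8,
};

/* Returns 0 if the payload length matches the policy entry, -1 otherwise. */
static int sketch_validate(int attr, size_t payload_len)
{
	if (attr <= 0 || attr >= SKETCH_ATTR_MAX)
		return -1;
	switch (sketch_policy[attr]) {
	case SKETCH_T_U8:
		return payload_len == sizeof(uint8_t) ? 0 : -1;
	case SKETCH_T_U32:
		return payload_len == sizeof(uint32_t) ? 0 : -1;
	default:
		return -1;	/* no entry: rejected in this sketch */
	}
}
```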
     
  • Add missing attribute validation for beacon report scanning
    to the netlink policy.

    Fixes: 1d76250bd34a ("nl80211: support beacon report scanning")
    Signed-off-by: Jakub Kicinski
    Link: https://lore.kernel.org/r/20200303051058.4089398-3-kuba@kernel.org
    Signed-off-by: Johannes Berg

    Jakub Kicinski
     
  • Add missing attribute validation for critical protocol fields
    to the netlink policy.

    Fixes: 5de17984898c ("cfg80211: introduce critical protocol indication from user-space")
    Signed-off-by: Jakub Kicinski
    Link: https://lore.kernel.org/r/20200303051058.4089398-2-kuba@kernel.org
    Signed-off-by: Johannes Berg

    Jakub Kicinski
     
  • During IB device removal, cancel the event worker before the device
    structure is freed.

    Fixes: a4cf0443c414 ("smc: introduce SMC as an IB-client")
    Reported-by: syzbot+b297c6825752e7a07272@syzkaller.appspotmail.com
    Signed-off-by: Karsten Graul
    Reviewed-by: Ursula Braun
    Reviewed-by: Leon Romanovsky
    Signed-off-by: David S. Miller

    Karsten Graul
     
  • Rafał found that on a non-Ethernet interface, bringing the link down
    and up frequently slowly consumes memory.

    The reason is that we add the allnodes/allrouters addresses to the
    multicast list in ipv6_add_dev(). When the link goes down, we call
    ipv6_mc_down(), which stores all multicast addresses via
    mld_add_delrec(). But when the link comes up, we don't call
    ipv6_mc_up() for a non-Ethernet interface to remove the addresses.
    This makes idev->mc_tomb grow bigger and bigger. The call stack
    looks like:

    addrconf_notify(NETDEV_REGISTER)
        ipv6_add_dev
            ipv6_dev_mc_inc(ff01::1)
            ipv6_dev_mc_inc(ff02::1)
            ipv6_dev_mc_inc(ff02::2)

    addrconf_notify(NETDEV_UP)
        addrconf_dev_config
            /* Alas, we support only Ethernet autoconfiguration. */
            return;

    addrconf_notify(NETDEV_DOWN)
        addrconf_ifdown
            ipv6_mc_down
                igmp6_group_dropped(ff02::2)
                    mld_add_delrec(ff02::2)
                igmp6_group_dropped(ff02::1)
                igmp6_group_dropped(ff01::1)

    After investigating, I can't find a rule that disables multicast on
    non-Ethernet interfaces. In RFC2460, the link could be Ethernet, PPP,
    ATM, tunnels, etc. IPv4 doesn't check the dev type when calling
    ip_mc_up() in inetdev_event(). Even for IPv6, we don't check the dev
    type, and we call ipv6_add_dev() and ipv6_dev_mc_inc() after the
    device is registered.

    So I think it's OK to fix this memory growth by calling ipv6_mc_up()
    for non-Ethernet interfaces.

    v2: Also check IFF_MULTICAST flag to make sure the interface supports
    multicast

    Reported-by: Rafał Miłecki
    Tested-by: Rafał Miłecki
    Fixes: 74235a25c673 ("[IPV6] addrconf: Fix IPv6 on tuntap tunnels")
    Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when set link down")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
  • If a TCP socket is allocated in IRQ context, or cloned in IRQ context
    from an unassociated socket (i.e. one not associated with a memcg),
    then it will remain unassociated for its whole life. Almost half of
    the TCP sockets created on the system are created in IRQ context, so
    memory used by such sockets will not be accounted by the memcg.

    This issue is more widespread in cgroup v1 where network memory
    accounting is opt-in but it can happen in cgroup v2 if the source socket
    for the cloning was created in root memcg.

    To fix the issue, just do the association of the sockets at the accept()
    time in the process context and then force charge the memory buffer
    already used and reserved by the socket.

    Signed-off-by: Shakeel Butt
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shakeel Butt
     

10 Mar, 2020

2 commits

  • Simon Wunderlich says:

    ====================
    Here is a batman-adv bugfix:

    - Don't schedule OGM for disabled interface, by Sven Eckelmann
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In our production environment we have faced the problem that updating
    the classid in a cgroup with heavy tasks causes a long freeze of the
    file tables in these tasks. By heavy tasks we mean tasks with many
    threads and open sockets (e.g. balancers). This freeze leads to an
    increased number of client timeouts.

    This patch implements the following logic to fix this issue:
    after iterating over 1000 file descriptors, the file table lock is
    released, providing a time gap for socket creation/deletion.

    The update is now non-atomic, and a socket may be skipped using the
    calls:

    dup2(oldfd, newfd);
    close(oldfd);

    But this case is not typical. Moreover, before this patch a skip was
    also possible, by hiding the socket fd in a unix socket buffer.

    New sockets will be allocated with updated classid because cgroup state
    is updated before start of the file descriptors iteration.

    So in common cases this patch has no side effects.

    Signed-off-by: Dmitry Yakunin
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller

    Dmitry Yakunin
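
The batching logic described above (drop the lock after every 1000 entries so other users get a window) can be sketched as follows. The lock here is a hypothetical bookkeeping structure that only counts acquisitions, not a real file table lock, and all names are invented for illustration.

```c
#include <stddef.h>

#define SKETCH_BATCH 1000	/* entries visited per lock hold */

/* Hypothetical lock that only records its state, for illustration. */
struct sketch_lock {
	int held;
	unsigned long acquisitions;
};

static void sketch_acquire(struct sketch_lock *l)
{
	l->held = 1;
	l->acquisitions++;
}

static void sketch_release(struct sketch_lock *l)
{
	l->held = 0;
}

/*
 * Set classid[i] = new_id for all nfds entries, releasing and retaking
 * the lock after every SKETCH_BATCH entries. Returns the number of
 * entries visited.
 */
static size_t sketch_update_all(struct sketch_lock *l, int *classid,
				size_t nfds, int new_id)
{
	size_t i;

	sketch_acquire(l);
	for (i = 0; i < nfds; i++) {
		classid[i] = new_id;
		if ((i + 1) % SKETCH_BATCH == 0 && i + 1 < nfds) {
			sketch_release(l);	/* window for other lock users */
			sketch_acquire(l);
		}
	}
	sketch_release(l);
	return i;
}
```

Because the lock is dropped between batches, entries added or moved during a gap can be missed, which is the non-atomicity the commit message acknowledges.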
     

09 Mar, 2020

2 commits

  • In commit 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and
    fallback to priority") cgroup classid reporting was fixed. But this works
    only for TCP sockets, because for other socket types the icsk parameter
    can be NULL and the classid code path is skipped. This change moves
    classid handling to the inet_diag_msg_attrs_fill() function.

    Also, an inet_diag_msg_attrs_size() helper was added, and the addends in
    nlmsg_new() were reordered to match the order in inet_sk_diag_fill().

    Fixes: 1ec17dbd90f8 ("inet_diag: fix reporting cgroup classid and fallback to priority")
    Signed-off-by: Dmitry Yakunin
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller

    Dmitry Yakunin
     
  • syzbot found an interesting case of the kernel reading
    an uninit-value [1]

    The problem is in the handling of ETH_P_WCCP in gre_parse_header().

    We look at the byte following the GRE options to decide whether
    the options might be four bytes longer.

    Use skb_header_pointer() so that no bytes are pulled when we find
    that no more bytes are needed.

    All callers of gre_parse_header() are properly using pskb_may_pull()
    anyway before proceeding to next header.
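The idea of peeking at a byte without pulling it can be modelled in user space as below. This is a rough sketch under stated assumptions: `struct pkt`, `pkt_peek()`, and `gre_extra_len()` are hypothetical names, not kernel APIs, and the high-nibble test against 0x40 (an IPv4 version field) is assumed here as the rule that decides on the extra four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet model: a linear head plus one non-linear fragment. */
struct pkt {
    const unsigned char *head;
    size_t head_len;
    const unsigned char *frag;
    size_t frag_len;
};

/*
 * Return a pointer to len bytes at offset off without changing the packet.
 * Bytes that lie wholly in the linear head are returned in place;
 * otherwise they are copied into scratch. NULL means out of range.
 */
static const void *pkt_peek(const struct pkt *p, size_t off, size_t len,
                            void *scratch)
{
    size_t total = p->head_len + p->frag_len;
    unsigned char *out = scratch;

    if (off > total || len > total - off)
        return NULL;
    if (off + len <= p->head_len)
        return p->head + off;
    for (size_t i = 0; i < len; i++) {
        size_t pos = off + i;
        out[i] = pos < p->head_len ? p->head[pos]
                                   : p->frag[pos - p->head_len];
    }
    return scratch;
}

/*
 * Decide whether four extra header bytes follow the GRE options.
 * Assumption: a first nibble other than 4 means the extra bytes exist.
 * Returns 0 or 4, or -1 if the byte is not available.
 */
static int gre_extra_len(const struct pkt *p, size_t hdr_end)
{
    unsigned char tmp;
    const unsigned char *val = pkt_peek(p, hdr_end, sizeof(tmp), &tmp);

    if (val == NULL)
        return -1;
    return (*val & 0xF0) != 0x40 ? 4 : 0;
}
```

The caller never grows the linear area here, so a later length check on the next header stays the only place where bytes are actually required to be present.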

    [1]
    BUG: KMSAN: uninit-value in pskb_may_pull include/linux/skbuff.h:2303 [inline]
    BUG: KMSAN: uninit-value in __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94
    CPU: 1 PID: 11784 Comm: syz-executor940 Not tainted 5.6.0-rc2-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x1c9/0x220 lib/dump_stack.c:118
    kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
    __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
    pskb_may_pull include/linux/skbuff.h:2303 [inline]
    __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94
    iptunnel_pull_header include/net/ip_tunnels.h:411 [inline]
    gre_rcv+0x15e/0x19c0 net/ipv6/ip6_gre.c:606
    ip6_protocol_deliver_rcu+0x181b/0x22c0 net/ipv6/ip6_input.c:432
    ip6_input_finish net/ipv6/ip6_input.c:473 [inline]
    NF_HOOK include/linux/netfilter.h:307 [inline]
    ip6_input net/ipv6/ip6_input.c:482 [inline]
    ip6_mc_input+0xdf2/0x1460 net/ipv6/ip6_input.c:576
    dst_input include/net/dst.h:442 [inline]
    ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
    NF_HOOK include/linux/netfilter.h:307 [inline]
    ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:306
    __netif_receive_skb_one_core net/core/dev.c:5198 [inline]
    __netif_receive_skb net/core/dev.c:5312 [inline]
    netif_receive_skb_internal net/core/dev.c:5402 [inline]
    netif_receive_skb+0x66b/0xf20 net/core/dev.c:5461
    tun_rx_batched include/linux/skbuff.h:4321 [inline]
    tun_get_user+0x6aef/0x6f60 drivers/net/tun.c:1997
    tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026
    call_write_iter include/linux/fs.h:1901 [inline]
    new_sync_write fs/read_write.c:483 [inline]
    __vfs_write+0xa5a/0xca0 fs/read_write.c:496
    vfs_write+0x44a/0x8f0 fs/read_write.c:558
    ksys_write+0x267/0x450 fs/read_write.c:611
    __do_sys_write fs/read_write.c:623 [inline]
    __se_sys_write fs/read_write.c:620 [inline]
    __ia32_sys_write+0xdb/0x120 fs/read_write.c:620
    do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
    do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410
    entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139
    RIP: 0023:0xf7f62d99
    Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
    RSP: 002b:00000000fffedb2c EFLAGS: 00000217 ORIG_RAX: 0000000000000004
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000020002580
    RDX: 0000000000000fca RSI: 0000000000000036 RDI: 0000000000000004
    RBP: 0000000000008914 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
    kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
    kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82
    slab_alloc_node mm/slub.c:2793 [inline]
    __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4401
    __kmalloc_reserve net/core/skbuff.c:142 [inline]
    __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:210
    alloc_skb include/linux/skbuff.h:1051 [inline]
    alloc_skb_with_frags+0x18c/0xa70 net/core/skbuff.c:5766
    sock_alloc_send_pskb+0xada/0xc60 net/core/sock.c:2242
    tun_alloc_skb drivers/net/tun.c:1529 [inline]
    tun_get_user+0x10ae/0x6f60 drivers/net/tun.c:1843
    tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026
    call_write_iter include/linux/fs.h:1901 [inline]
    new_sync_write fs/read_write.c:483 [inline]
    __vfs_write+0xa5a/0xca0 fs/read_write.c:496
    vfs_write+0x44a/0x8f0 fs/read_write.c:558
    ksys_write+0x267/0x450 fs/read_write.c:611
    __do_sys_write fs/read_write.c:623 [inline]
    __se_sys_write fs/read_write.c:620 [inline]
    __ia32_sys_write+0xdb/0x120 fs/read_write.c:620
    do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
    do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410
    entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139

    Fixes: 95f5c64c3c13 ("gre: Move utility functions to common headers")
    Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Mar, 2020

2 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Patches to bump position index from sysctl seq_next,
    from Vasily Averin.

    2) Release flowtable hook from error path, from Florian Westphal.

    3) Patches to add missing netlink attribute validation,
    from Jakub Kicinski.

    4) Missing NFTA_CHAIN_FLAGS in nf_tables_fill_chain_info().

    5) Infinite loop in module autoload if extension is not available,
    from Florian Westphal.

    6) Missing module ownership in inet/nat chain type definition.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Set owner to THIS_MODULE, otherwise the nft_chain_nat module might be
    removed while there are still inet/nat chains in place.

    [ 117.942096] BUG: unable to handle page fault for address: ffffffffa0d5e040
    [ 117.942101] #PF: supervisor read access in kernel mode
    [ 117.942103] #PF: error_code(0x0000) - not-present page
    [ 117.942106] PGD 200c067 P4D 200c067 PUD 200d063 PMD 3dc909067 PTE 0
    [ 117.942113] Oops: 0000 [#1] PREEMPT SMP PTI
    [ 117.942118] CPU: 3 PID: 27 Comm: kworker/3:0 Not tainted 5.6.0-rc3+ #348
    [ 117.942133] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
    [ 117.942145] RIP: 0010:nf_tables_chain_destroy.isra.0+0x94/0x15a [nf_tables]
    [ 117.942149] Code: f6 45 54 01 0f 84 d1 00 00 00 80 3b 05 74 44 48 8b 75 e8 48 c7 c7 72 be de a0 e8 56 e6 2d e0 48 8b 45 e8 48 c7 c7 7f be de a0 8b 30 e8 43 e6 2d e0 48 8b 45 e8 48 8b 40 10 48 85 c0 74 5b 8b
    [ 117.942152] RSP: 0018:ffffc9000015be10 EFLAGS: 00010292
    [ 117.942155] RAX: ffffffffa0d5e040 RBX: ffff88840be87fc2 RCX: 0000000000000007
    [ 117.942158] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffffffffa0debe7f
    [ 117.942160] RBP: ffff888403b54b50 R08: 0000000000001482 R09: 0000000000000004
    [ 117.942162] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8883eda7e540
    [ 117.942164] R13: dead000000000122 R14: dead000000000100 R15: ffff888403b3db80
    [ 117.942167] FS: 0000000000000000(0000) GS:ffff88840e4c0000(0000) knlGS:0000000000000000
    [ 117.942169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 117.942172] CR2: ffffffffa0d5e040 CR3: 00000003e4c52002 CR4: 00000000001606e0
    [ 117.942174] Call Trace:
    [ 117.942188] nf_tables_trans_destroy_work.cold+0xd/0x12 [nf_tables]
    [ 117.942196] process_one_work+0x1d6/0x3b0
    [ 117.942200] worker_thread+0x45/0x3c0
    [ 117.942203] ? process_one_work+0x3b0/0x3b0
    [ 117.942210] kthread+0x112/0x130
    [ 117.942214] ? kthread_create_worker_on_cpu+0x40/0x40
    [ 117.942221] ret_from_fork+0x35/0x40

    nf_tables_chain_destroy() crashes on module_put() because the module is
    gone.
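Why the owner field matters can be shown with a small user-space model of module reference counting. This is a sketch, not the nf_tables code: `struct module_model`, `struct chain_type_model`, and the `model_*()` helpers are hypothetical, loosely modelled on the semantics of try_module_get() and module_put(), where a NULL owner pins nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a loadable module. */
struct module_model {
    int loaded;
    int refcnt;
};

/* Hypothetical stand-in for a chain type definition. */
struct chain_type_model {
    const char *name;
    struct module_model *owner;   /* NULL models the missing owner */
};

/* Take a reference on the owner; a NULL owner pins nothing. */
static int model_try_get(struct module_model *m)
{
    if (m == NULL)
        return 1;
    if (!m->loaded)
        return 0;
    m->refcnt++;
    return 1;
}

/* Drop a reference; the module must still be loaded at this point. */
static void model_put(struct module_model *m)
{
    if (m != NULL) {
        assert(m->loaded);
        m->refcnt--;
    }
}

/* Unloading is refused while references are held. */
static int model_unload(struct module_model *m)
{
    if (m->refcnt > 0)
        return -1;
    m->loaded = 0;
    return 0;
}
```

With the owner set, creating a chain takes a reference and the unload is refused until the chain is destroyed. With a NULL owner no reference is taken, so the module can disappear while chains that use its type still exist.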

    Fixes: d164385ec572 ("netfilter: nat: add inet family nat support")
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso