05 Oct, 2020

1 commit

  • When a new incoming call arrives at a userspace rxrpc socket on a new
    connection that has a security class set, the code currently pushes it onto
    the accept queue to hold a ref on it for the socket. This doesn't work,
    however, as recvmsg() pops it off, notices that it's in the SERVER_SECURING
    state and discards the ref. This means that the call runs out of refs too
    early and the kernel oopses.

    By contrast, a kernel rxrpc socket manually pre-charges the incoming call
    pool with calls that already have user call IDs assigned, so they are ref'd
    by the call tree on the socket.

    Change the mode of operation for userspace rxrpc server sockets to work
    like this too. Although this is a UAPI change, server sockets aren't
    currently functional.

    Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
    Signed-off-by: David Howells

    David Howells

09 Sep, 2020

1 commit

  • Impose a maximum on the number of client rxrpc calls that are allowed
    simultaneously. This will be in lieu of a maximum number of client
    connections, as it is easier to administer: unlike connections, calls
    aren't reusable (to be changed in a subsequent patch).

    This doesn't affect the limits on service calls and connections.

    Signed-off-by: David Howells

    David Howells

21 Aug, 2020

1 commit

  • The Rx protocol has a mechanism to help generate RTT samples that works by
    a client transmitting a REQUESTED-type ACK when it receives a DATA packet
    that has the REQUEST_ACK flag set.

    The peer, however, may interpose other ACKs before transmitting the
    REQUESTED-ACK, as can be seen in the following trace excerpt:

    rxrpc_tx_data: c=00000044 DATA d0b5ece8:00000001 00000001 q=00000001 fl=07
    rxrpc_rx_ack: c=00000044 00000001 PNG r=00000000 f=00000002 p=00000000 n=0
    rxrpc_rx_ack: c=00000044 00000002 REQ r=00000001 f=00000002 p=00000001 n=0
    ...

    DATA packet 1 (q=00000001) has REQUEST_ACK set (bit 1 of fl=07). The
    incoming ping (labelled PNG) hard-acks the request DATA packet (f=00000002
    exceeds the sequence number of the DATA packet), causing it to be discarded
    from the Tx ring. The ACK that was requested (labelled REQ, r=00000001
    references the serial of the DATA packet) comes after the ping, but the
    sk_buff holding the timestamp has gone and the RTT sample is lost.

    This is particularly noticeable on RPC calls used to probe the service
    offered by the peer. A lot of peers end up with an unknown RTT because we
    only ever sent a single RPC. This confuses the server rotation algorithm.

    Fix this by caching the information about the outgoing packet in RTT
    calculations in the rxrpc_call struct rather than looking in the Tx ring.

    A four-deep buffer is maintained and both REQUEST_ACK-flagged DATA and
    PING-ACK transmissions are recorded in there. When the appropriate
    response ACK is received, the buffer is checked for a match and, if found,
    an RTT sample is recorded.

    If a received ACK refers to a packet with a later serial number than an
    entry in the cache, that entry is presumed lost and the entry is made
    available to record a new transmission.

    ACK types other than REQUESTED-type and PING-type cause any matching
    sample to be cancelled as they don't necessarily represent a useful
    measurement.

    If there's no space in the buffer on ping/data transmission, the sample
    base is discarded.
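
    As a rough illustration of the scheme described above, a minimal sketch of
    such a four-deep sample buffer might look like this (the struct, field and
    helper names are illustrative, not the actual rxrpc identifiers, and
    rtt_add_sample() is a hypothetical hook into the RTT estimator):

    #include <linux/types.h>
    #include <linux/ktime.h>
    #include <linux/bits.h>

    #define RTT_SAMPLES 4

    struct rtt_tracker {
        u32     serial[RTT_SAMPLES];    /* serials awaiting a response ACK */
        ktime_t sent_at[RTT_SAMPLES];   /* transmission timestamps */
        u8      in_use;                 /* bitmask of occupied slots */
    };

    static void rtt_add_sample(ktime_t rtt);  /* hypothetical: record a sample */

    /* Record a REQUEST_ACK-flagged DATA or PING transmission; if there's no
     * free slot, the sample base is discarded. */
    static bool rtt_record_tx(struct rtt_tracker *t, u32 serial, ktime_t now)
    {
        int i;

        for (i = 0; i < RTT_SAMPLES; i++) {
            if (!(t->in_use & BIT(i))) {
                t->serial[i] = serial;
                t->sent_at[i] = now;
                t->in_use |= BIT(i);
                return true;
            }
        }
        return false;
    }

    /* Match a response ACK against the buffer; entries with serials older
     * than the one ACK'd are presumed lost and their slots freed. */
    static void rtt_process_ack(struct rtt_tracker *t, u32 acked, ktime_t now)
    {
        int i;

        for (i = 0; i < RTT_SAMPLES; i++) {
            if (!(t->in_use & BIT(i)))
                continue;
            if (t->serial[i] == acked)
                rtt_add_sample(ktime_sub(now, t->sent_at[i]));
            if ((s32)(acked - t->serial[i]) >= 0)
                t->in_use &= ~BIT(i);
        }
    }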

    Fixes: 50235c4b5a2f ("rxrpc: Obtain RTT data by requesting ACKs on DATA packets")
    Signed-off-by: David Howells

    David Howells

31 Jul, 2020

1 commit

  • There's a race between rxrpc_sendmsg() setting up a call but then failing
    to send anything on it due to an error, and recvmsg() seeing the call
    completion occur and trying to return the state to the user.

    An assertion fails in rxrpc_recvmsg() because the call has already been
    released from the socket and is about to be released again as recvmsg deals
    with it. (The recvmsg_q queue on the socket holds a ref, so there's no
    problem with use-after-free.)

    We also have to be careful not to end up reporting an error twice, in such
    a way that both returns indicate to userspace that the user ID supplied
    with the call is no longer in use - which could cause the client to
    malfunction if it recycles the user ID fast enough.

    Fix this by the following means:

    (1) When sendmsg() creates a call after the point that the call has been
    successfully added to the socket, don't return any errors through
    sendmsg(), but rather complete the call and let recvmsg() retrieve
    them. Make sendmsg() return 0 at this point. Further calls to
    sendmsg() for that call will fail with ESHUTDOWN.

    Note that at this point, we haven't sent any packets yet, so the
    server doesn't yet know about the call.

    (2) If sendmsg() returns an error when it was expected to create a new
    call, it means that the user ID wasn't used.

    (3) Mark the call disconnected before marking it completed to prevent an
    oops in rxrpc_release_call().

    (4) recvmsg() will then retrieve the error and set MSG_EOR to indicate
    that the user ID is no longer known by the kernel.
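
    For illustration, the resulting userspace-visible flow might look like this
    (a sketch only; the control-message setup that identifies the call is
    elided):

    #include <stdbool.h>
    #include <sys/socket.h>

    /* Per (1), a sendmsg() that fails after the call was added to the socket
     * returns 0; the error is then reaped via recvmsg(), and per (4) MSG_EOR
     * signals that the kernel has forgotten the user call ID. */
    static bool issue_call(int fd, struct msghdr *tx, struct msghdr *rx)
    {
        if (sendmsg(fd, tx, 0) < 0)
            return false;           /* per (2), the user ID was never used */

        while (recvmsg(fd, rx, 0) >= 0) {
            if (rx->msg_flags & MSG_EOR)
                return true;        /* the user ID may now be recycled */
        }
        return false;
    }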

    An oops like the following is produced:

    kernel BUG at net/rxrpc/recvmsg.c:605!
    ...
    RIP: 0010:rxrpc_recvmsg+0x256/0x5ae
    ...
    Call Trace:
    ? __init_waitqueue_head+0x2f/0x2f
    ____sys_recvmsg+0x8a/0x148
    ? import_iovec+0x69/0x9c
    ? copy_msghdr_from_user+0x5c/0x86
    ___sys_recvmsg+0x72/0xaa
    ? __fget_files+0x22/0x57
    ? __fget_light+0x46/0x51
    ? fdget+0x9/0x1b
    do_recvmmsg+0x15e/0x232
    ? _raw_spin_unlock+0xa/0xb
    ? vtime_delta+0xf/0x25
    __x64_sys_recvmmsg+0x2c/0x2f
    do_syscall_64+0x4c/0x78
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 357f5ef64628 ("rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call()")
    Reported-by: syzbot+b54969381df354936d96@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    Reviewed-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells

14 Mar, 2020

1 commit

  • Fix the interruptibility of kernel-initiated client calls so that they're
    either only interruptible when they're waiting for a call slot to come
    available or they're not interruptible at all. Either way, they're not
    interruptible during transmission.

    This should help prevent StoreData calls from being interrupted when
    writeback is in progress. It doesn't, however, handle interruption during
    the receive phase.

    Userspace-initiated calls are still interruptible. After the signal has
    been handled, sendmsg() will return the amount of data copied out of the
    buffer and userspace can perform another sendmsg() call to continue
    transmission.
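
    A sketch of that userspace pattern (illustrative only; msg->msg_iov is
    assumed to point at *iov, and msg is assumed to carry the
    RXRPC_USER_CALL_ID control message identifying the call):

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_request(int fd, struct msghdr *msg, struct iovec *iov)
    {
        while (iov->iov_len > 0) {
            ssize_t n = sendmsg(fd, msg, 0);
            if (n < 0)
                return -errno;      /* nothing was copied this time */
            /* A handled signal may leave a short write: consume the
             * progress and call sendmsg() again to continue. */
            iov->iov_base = (char *)iov->iov_base + n;
            iov->iov_len -= n;
        }
        return 0;
    }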

    Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
    Signed-off-by: David Howells

    David Howells

07 Feb, 2020

1 commit

  • rxrpc_rcu_destroy_call(), which is called as an RCU callback to clean up a
    put call, calls rxrpc_put_connection() which, deep in its bowels, takes a
    number of spinlocks in a non-BH-safe way, including rxrpc_conn_id_lock and
    local->client_conns_lock. RCU callbacks, however, are normally called from
    softirq context, which can cause lockdep to notice the locking
    inconsistency.

    To get lockdep to detect this, it's necessary to have the connection
    cleaned up on the put at the end of the last of its calls, though normally
    the clean up is deferred. This can be induced, however, by starting a call
    on an AF_RXRPC socket and then closing the socket without reading the
    reply.

    Fix this by having rxrpc_rcu_destroy_call() punt the destruction to a
    workqueue if it is running in softirq mode, deferring the destruction to
    process context.
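
    A minimal sketch of that punt-to-workqueue pattern (the names are
    illustrative, not the exact rxrpc ones):

    #include <linux/interrupt.h>
    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/workqueue.h>

    struct my_call {
        struct rcu_head    rcu;
        struct work_struct destroyer;
    };

    static void my_call_destroy(struct my_call *call); /* takes non-BH-safe locks */

    static void my_destroyer_work(struct work_struct *work)
    {
        my_call_destroy(container_of(work, struct my_call, destroyer));
    }

    static void my_call_rcu(struct rcu_head *rcu)
    {
        struct my_call *call = container_of(rcu, struct my_call, rcu);

        if (in_softirq()) {
            /* RCU callback ran in softirq: defer to process context. */
            INIT_WORK(&call->destroyer, my_destroyer_work);
            schedule_work(&call->destroyer);
        } else {
            my_call_destroy(call);
        }
    }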

    Note that another way to fix this could be to add a bunch of bh-disable
    annotations to the spinlocks concerned - and there might be more than just
    those two - but that means spending more time with BHs disabled.

    Note also that some of these places were covered by bh-disable spinlocks
    belonging to the rxrpc_transport object, but these got removed without the
    _bh annotation being retained on the next lock in.

    Fixes: 999b69f89241 ("rxrpc: Kill the client connection bundle concept")
    Reported-by: syzbot+d82f3ac8d87e7ccbb2c9@syzkaller.appspotmail.com
    Reported-by: syzbot+3f1fd6b8cbf8702d134e@syzkaller.appspotmail.com
    Signed-off-by: David Howells
    cc: Hillf Danton
    Signed-off-by: David S. Miller

    David Howells

03 Feb, 2020

1 commit

  • When a call is disconnected, the connection pointer from the call is
    cleared to make sure it isn't used again and to prevent further attempted
    transmission for the call. Unfortunately, there might be a daemon trying
    to use it at the same time to transmit a packet.

    Fix this by keeping call->conn set, but setting a flag on the call to
    indicate disconnection instead.
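
    Sketched, the resulting pattern looks something like this (the flag,
    struct and function names are illustrative):

    #include <linux/bitops.h>
    #include <linux/errno.h>
    #include <linux/skbuff.h>

    enum { MY_CALL_DISCONNECTED };          /* illustrative flag bit */

    struct my_conn;

    struct my_call {
        unsigned long   flags;
        struct my_conn  *conn;              /* left set after disconnection */
    };

    static int my_udp_send(struct my_conn *conn, struct sk_buff *skb);

    static void my_disconnect_call(struct my_call *call)
    {
        /* call->conn stays valid for any concurrent transmitter... */
        set_bit(MY_CALL_DISCONNECTED, &call->flags);
    }

    static int my_transmit(struct my_call *call, struct sk_buff *skb)
    {
        /* ...and the flag, not a NULL pointer, stops transmission. */
        if (test_bit(MY_CALL_DISCONNECTED, &call->flags))
            return -ECONNRESET;
        return my_udp_send(call->conn, skb);
    }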

    Also remove the bits in the transmission functions where the conn pointer
    is checked and a ref taken under spinlock, as this is now redundant.

    Fixes: 8d94aa381dab ("rxrpc: Calls shouldn't hold socket refs")
    Signed-off-by: David Howells

    David Howells

07 Oct, 2019

2 commits

  • Fix the cleanup of the crypto state on a call after the call has been
    disconnected. As the call has been disconnected, its connection ref has
    been discarded and so we can't go through that to get to the security ops
    table.

    Fix this by caching the security ops pointer in the rxrpc_call struct and
    using that when freeing the call security state. Also use this in other
    places we're dealing with call-specific security.

    The symptoms look like:

    BUG: KASAN: use-after-free in rxrpc_release_call+0xb2d/0xb60
    net/rxrpc/call_object.c:481
    Read of size 8 at addr ffff888062ffeb50 by task syz-executor.5/4764

    Fixes: 1db88c534371 ("rxrpc: Fix -Wframe-larger-than= warnings from on-stack crypto")
    Reported-by: syzbot+eed305768ece6682bb7f@syzkaller.appspotmail.com
    Signed-off-by: David Howells

    David Howells
  • rxrpc_put_call() calls trace_rxrpc_call() after it has done the decrement
    of the refcount - which looks at the debug_id in the call record. But
    unless the refcount was reduced to zero, we no longer have the right to
    look in the record and, indeed, it may be deleted by some other thread.

    Fix this by getting the debug_id out before decrementing the refcount and
    then passing that into the tracepoint.
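
    A sketch of the corrected ordering (simplified; my_put_call(),
    trace_my_call_put() and my_call_free() are illustrative stand-ins):

    #include <linux/atomic.h>

    struct my_call {
        unsigned int debug_id;
        atomic_t     usage;
    };

    static void trace_my_call_put(unsigned int debug_id, int usage);
    static void my_call_free(struct my_call *call);

    static void my_put_call(struct my_call *call)
    {
        unsigned int debug_id = call->debug_id; /* copy out while still ref'd */
        int n = atomic_dec_return(&call->usage);

        trace_my_call_put(debug_id, n);         /* must not touch *call */
        if (n == 0)
            my_call_free(call);
    }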

    Fixes: e34d4234b0b7 ("rxrpc: Trace rxrpc_call usage")
    Signed-off-by: David Howells

    David Howells

27 Aug, 2019

2 commits

  • Use the previously-added transmit-phase skbuff private flag to simplify the
    socket buffer tracing a bit. Which phase the skbuff comes from can now be
    divined from the skb rather than having to be guessed from the call state.

    We can also reduce the number of rxrpc_skb_trace values by eliminating the
    difference between Tx and Rx in the symbols.

    Signed-off-by: David Howells

    David Howells
  • Abstract out rxtx ring cleanup into its own function from its two callers.
    This makes it easier to apply the same changes to both.

    Signed-off-by: David Howells

    David Howells

31 Jul, 2019

1 commit

  • rxkad sometimes triggers a warning about oversized stack frames when
    building with clang for a 32-bit architecture:

    net/rxrpc/rxkad.c:243:12: error: stack frame size of 1088 bytes in function 'rxkad_secure_packet' [-Werror,-Wframe-larger-than=]
    net/rxrpc/rxkad.c:501:12: error: stack frame size of 1088 bytes in function 'rxkad_verify_packet' [-Werror,-Wframe-larger-than=]

    The problem is the combination of SYNC_SKCIPHER_REQUEST_ON_STACK() in
    rxkad_verify_packet()/rxkad_secure_packet() with the relatively large
    scatterlist in rxkad_verify_packet_1()/rxkad_secure_packet_encrypt().

    The warning does not show up when using gcc, which does not inline the
    functions as aggressively, but the problem is still the same.

    Allocate the cipher buffers from the slab instead, caching the allocated
    packet crypto request memory used for DATA packet crypto in the rxrpc_call
    struct.
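
    A sketch of the slab-based approach (the caching field and helpers are
    illustrative; skcipher_request_alloc()/skcipher_request_free() are the
    regular crypto API):

    #include <crypto/skcipher.h>
    #include <linux/slab.h>

    struct my_call {
        struct skcipher_request *cipher_req;
    };

    /* Allocate the DATA-packet crypto request once and cache it in the call,
     * instead of materialising it on the stack for every packet. */
    static struct skcipher_request *
    my_get_crypto_req(struct my_call *call, struct crypto_sync_skcipher *ci)
    {
        if (!call->cipher_req)
            call->cipher_req = skcipher_request_alloc(&ci->base, GFP_NOFS);
        return call->cipher_req;
    }

    static void my_put_crypto_req(struct my_call *call)
    {
        skcipher_request_free(call->cipher_req);    /* NULL-safe */
        call->cipher_req = NULL;
    }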

    Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
    Reported-by: Arnd Bergmann
    Signed-off-by: David Howells
    Acked-by: Arnd Bergmann
    cc: Herbert Xu
    Signed-off-by: David S. Miller

    David Howells

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner

16 May, 2019

1 commit

  • Allow kernel services using AF_RXRPC to indicate that a call should be
    non-interruptible. This allows kafs to make things like lock-extension and
    writeback data storage calls non-interruptible.

    If this is set, signals will be ignored for operations on that call where
    possible - such as waiting to get a call channel on an rxrpc connection.

    It doesn't prevent UDP sendmsg from being interrupted, but that will be
    handled by packet retransmission.

    rxrpc_kernel_recv_data() isn't affected by this since that never waits,
    preferring instead to return -EAGAIN and leave the waiting to the caller.

    Userspace initiated calls can't be set to be uninterruptible at this time.

    Signed-off-by: David Howells

    David Howells

30 Apr, 2019

1 commit

  • In rxrpc_destroy_all_calls(), there are two phases: (1) make sure the
    ->calls list is empty, emitting error messages if not, and (2) wait for the
    RCU cleanup to happen on outstanding calls (ie. ->nr_calls becomes 0).

    To avoid taking the call_lock, the function prechecks ->calls and returns
    immediately if the list is empty - this is wrong, however: it still needs
    to go on to the second phase and wait for ->nr_calls to become 0.

    Without this, the rxrpc_net struct may get deallocated before we get to the
    RCU cleanup for the last calls. This can lead to:

    Slab corruption (Not tainted): kmalloc-16k start=ffff88802b178000, len=16384
    050: 6b 6b 6b 6b 6b 6b 6b 6b 61 6b 6b 6b 6b 6b 6b 6b kkkkkkkkakkkkkkk

    Note the "61" at offset 0x58. This corresponds to the ->nr_calls member of
    struct rxrpc_net (which is >9k in size, and thus allocated out of the 16k
    slab).

    Fix this by flipping the condition on the if-statement, putting the locked
    section inside the if-body and dropping the return from there. The
    function will then always go on to wait for the RCU cleanup on outstanding
    calls.
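
    In outline, the corrected flow looks something like this (simplified, with
    illustrative names; the wait helper is a stand-in for however ->nr_calls
    is waited on):

    #include <linux/list.h>
    #include <linux/spinlock.h>

    struct my_rxnet {
        struct list_head calls;
        spinlock_t       call_lock;
    };

    static void my_wait_for_rcu_call_cleanup(struct my_rxnet *rxnet);

    static void my_destroy_all_calls(struct my_rxnet *rxnet)
    {
        /* Phase 1: only take the lock if there's something to report. */
        if (!list_empty(&rxnet->calls)) {
            spin_lock(&rxnet->call_lock);
            /* ... emit error messages for leaked calls ... */
            spin_unlock(&rxnet->call_lock);
        }

        /* Phase 2: always performed now.  The old code returned early when
         * ->calls was empty and never waited for ->nr_calls to reach 0. */
        my_wait_for_rcu_call_cleanup(rxnet);
    }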

    Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing")
    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells

16 Jan, 2019

1 commit

  • The changes introduced to allow rxrpc calls to be retried create an issue
    when it comes to refcounting afs_call structs. The problem is that when
    rxrpc_send_data() queues the last packet for an asynchronous call, the
    following sequence can occur:

    (1) The notify_end_tx callback is invoked which causes the state in the
    afs_call to be changed from AFS_CALL_CL_REQUESTING or
    AFS_CALL_SV_REPLYING.

    (2) afs_deliver_to_call() can then process event notifications from rxrpc
    on the async_work queue.

    (3) Delivery of events, such as an abort from the server, can cause the
    afs_call state to be changed to AFS_CALL_COMPLETE on async_work.

    (4) For an asynchronous call, afs_process_async_call() notes that the call
    is complete and tries to clean up all the refs on async_work.

    (5) rxrpc_send_data() might return the amount of data transferred
    (success) or an error - which could in turn reflect a local error or a
    received error.

    Synchronising the clean up after rxrpc_kernel_send_data() returns an error
    with the asynchronous cleanup is then tricky to get right.

    Mostly revert commit c038a58ccfd6704d4d7d60ed3d6a0fca13cf13a4. The two API
    functions the original commit added aren't currently used. This makes
    rxrpc_kernel_send_data() always return successfully if it queued the data
    it was given.

    Note that this doesn't affect synchronous calls since their Rx notification
    function merely pokes a wait queue and does no refcounting. The
    asynchronous call notification function *has* to do refcounting and pass a
    ref over the work item to avoid the need to sync the workqueue in call
    cleanup.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells

09 Oct, 2018

1 commit

  • The rxrpc_input_packet() function and its call tree were built around the
    assumption that the data_ready() handler called from UDP to inform a kernel
    service that there is data to be had was non-reentrant. This means that
    certain locking could be dispensed with.

    This, however, turns out not to be the case with a multi-queue network card
    that can deliver packets to multiple cpus simultaneously. Each of those
    cpus can be in the rxrpc_input_packet() function at the same time.

    Fix by adding or changing some structure members:

    (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.

    (2) Make conn->service_id into a 32-bit variable so that it can be
    cmpxchg'd on all arches.

    (3) Add call->input_lock to serialise access to the Rx/Tx state. Note
    that although the Rx and Tx states are (almost) entirely separate,
    there's no point completing the separation and having separate locks
    since it's a bi-phasal RPC protocol rather than a bi-directional
    streaming protocol. Data transmission and data reception do not take
    place simultaneously on any particular call.

    and making the following functional changes:

    (1) In rxrpc_input_data(), hold call->input_lock around the core to
    prevent simultaneous producing of packets into the Rx ring and
    updating of tracking state for a particular call.

    (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
    check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
    The bit test and bit clear can then be combined. No further locking
    is needed here.

    (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
    the ACK packet. The superseded ACK check is then done both before and
    after the lock is taken.

    The handling of ackinfo data is split, parsing before the lock is taken
    and processing with it held. This is keyed on rxMTU being non-zero.

    Congestion management is also done within the locked section.

    (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
    rotation. The ACKALL packet carries no information and is only really
    useful after all packets have been transmitted since it's imprecise.

    (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
    prevent calls being simultaneously implicitly ended on two cpus and
    also to prevent any races with incoming call setup.

    (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
    on a connection. It is only permitted to happen once for a
    connection (see the sketch following this list).

    (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
    rx->incoming_lock to see if someone else set up the call, connection
    or peer whilst we were getting there. We can't trust the values from
    the earlier routing check unless we pin refs on them - which we want
    to avoid.

    Further, we need to allow for an incoming call to have its state
    changed on another CPU between us making it live and us adjusting it
    because the conn is now in the RXRPC_CONN_SERVICE state.

    (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
    to the RTT buffer. Don't need to lock around setting peer->rtt.
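
    To illustrate point (6), the one-shot upgrade might look like this in
    outline (illustrative names; conn->service_id being 32 bits wide is what
    makes the cmpxchg() portable, per point (2)):

    #include <linux/atomic.h>

    struct my_conn {
        u32 service_id;     /* 32-bit so cmpxchg() works on all arches */
    };

    static void my_maybe_upgrade(struct my_conn *conn, u32 from, u32 to)
    {
        /* At most one CPU's cmpxchg() succeeds; later attempts see the
         * already-upgraded ID and change nothing. */
        cmpxchg(&conn->service_id, from, to);
    }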

    For reference, the inventory of state-accessing or state-altering functions
    used by the packet input procedure is:

    > rxrpc_input_packet()
    * PACKET CHECKING

    * ROUTING
    > rxrpc_post_packet_to_local()
    > rxrpc_find_connection_rcu() - uses RCU
    > rxrpc_lookup_peer_rcu() - uses RCU
    > rxrpc_find_service_conn_rcu() - uses RCU
    > idr_find() - uses RCU

    * CONNECTION-LEVEL PROCESSING
    - Service upgrade
    - Can only happen once per conn
    ! Changed to use cmpxchg
    > rxrpc_post_packet_to_conn()
    - Setting conn->hi_serial
    - Probably safe not using locks
    - Maybe use cmpxchg

    * CALL-LEVEL PROCESSING
    > Old-call checking
    > rxrpc_input_implicit_end_call()
    > rxrpc_call_completed()
    > rxrpc_queue_call()
    ! Need to take rx->incoming_lock
    > __rxrpc_disconnect_call()
    > rxrpc_notify_socket()
    > rxrpc_new_incoming_call()
    - Uses rx->incoming_lock for the entire process
    - Might be able to drop this earlier in favour of the call lock
    > rxrpc_incoming_call()
    ! Conflicts with rxrpc_input_implicit_end_call()
    > rxrpc_send_ping()
    - Don't need locks to check rtt state
    > rxrpc_propose_ACK

    * PACKET DISTRIBUTION
    > rxrpc_input_call_packet()
    > rxrpc_input_data()
    * QUEUE DATA PACKET ON CALL
    > rxrpc_reduce_call_timer()
    - Uses timer_reduce()
    ! Needs call->input_lock()
    > rxrpc_receiving_reply()
    ! Needs locking around ack state
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()
    > rxrpc_proto_abort()
    > rxrpc_input_dup_data()
    - Fills the Rx buffer
    - rxrpc_propose_ACK()
    - rxrpc_notify_socket()

    > rxrpc_input_ack()
    * APPLY ACK PACKET TO CALL AND DISCARD PACKET
    > rxrpc_input_ping_response()
    - Probably doesn't need any extra locking
    ! Need READ_ONCE() on call->ping_serial
    > rxrpc_input_check_for_lost_ack()
    - Takes call->lock to consult Tx buffer
    > rxrpc_peer_add_rtt()
    ! Needs to take a lock (peer->rtt_input_lock)
    ! Could perhaps manage with cmpxchg() and xadd() instead
    > rxrpc_input_requested_ack
    - Consults Tx buffer
    ! Probably needs a lock
    > rxrpc_peer_add_rtt()
    > rxrpc_propose_ack()
    > rxrpc_input_ackinfo()
    - Changes call->tx_winsize
    ! Use cmpxchg to handle change
    ! Should perhaps track serial number
    - Uses peer->lock to record MTU specification changes
    > rxrpc_proto_abort()
    ! Need to take call->input_lock
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()
    > rxrpc_input_soft_acks()
    - Consults the Tx buffer
    > rxrpc_congestion_management()
    - Modifies the Tx annotations
    ! Needs call->input_lock()
    > rxrpc_queue_call()

    > rxrpc_input_abort()
    * APPLY ABORT PACKET TO CALL AND DISCARD PACKET
    > rxrpc_set_call_completion()
    > rxrpc_notify_socket()

    > rxrpc_input_ackall()
    * APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
    ! Need to take call->input_lock
    > rxrpc_rotate_tx_window()
    > rxrpc_end_tx_phase()

    > rxrpc_reject_packet()

    There are some functions used by the above that queue the packet, after
    which the procedure is terminated:

    - rxrpc_post_packet_to_local()
    - local->event_queue is an sk_buff_head
    - local->processor is a work_struct
    - rxrpc_post_packet_to_conn()
    - conn->rx_queue is an sk_buff_head
    - conn->processor is a work_struct
    - rxrpc_reject_packet()
    - local->reject_queue is an sk_buff_head
    - local->processor is a work_struct

    And some that offload processing to process context:

    - rxrpc_notify_socket()
    - Uses RCU lock
    - Uses call->notify_lock to call call->notify_rx
    - Uses call->recvmsg_lock to queue recvmsg side
    - rxrpc_queue_call()
    - call->processor is a work_struct
    - rxrpc_propose_ACK()
    - Uses call->lock to wrap __rxrpc_propose_ACK()

    And a bunch that complete a call, all of which use call->state_lock to
    protect the call state:

    - rxrpc_call_completed()
    - rxrpc_set_call_completion()
    - rxrpc_abort_call()
    - rxrpc_proto_abort()
    - Also uses rxrpc_queue_call()

    Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
    Signed-off-by: David Howells

    David Howells

28 Sep, 2018

1 commit

  • Fix error distribution by immediately delivering the errors to all the
    affected calls rather than deferring them to a worker thread. The problem
    with the latter is that retries and other activity can happen in the
    meantime, when we would rather stop that sooner.

    To this end:

    (1) Stop the error distributor from removing calls from the error_targets
    list so that peer->lock isn't needed to synchronise against other adds
    and removals.

    (2) Require the peer's error_targets list to be accessed with RCU, thereby
    avoiding the need to take peer->lock over distribution (sketched below).

    (3) Don't attempt to affect a call's state if it is already marked complete.
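
    A sketch of the resulting distribution walk, combining (2) and (3) above
    (struct and helper names are illustrative):

    #include <linux/rculist.h>

    enum { MY_CALL_COMPLETE = 1 };

    struct my_call {
        struct list_head error_link;
        int              state;
    };

    struct my_peer {
        struct list_head error_targets;
    };

    static void my_set_call_completion(struct my_call *call, int error);

    static void my_distribute_error(struct my_peer *peer, int error)
    {
        struct my_call *call;

        rcu_read_lock();
        list_for_each_entry_rcu(call, &peer->error_targets, error_link) {
            if (call->state == MY_CALL_COMPLETE)
                continue;       /* (3): leave completed calls alone */
            my_set_call_completion(call, error);
        }
        rcu_read_unlock();
    }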

    Signed-off-by: David Howells

    David Howells

21 Jun, 2018

1 commit

  • While __atomic_add_unless() was originally intended as a building-block
    for atomic_add_unless(), it's now used in a number of places around the
    kernel. It's the only common atomic operation named __atomic*(), rather
    than atomic_*(), and for consistency it would be better named
    atomic_fetch_add_unless().

    This lack of consistency is slightly confusing, and gets in the way of
    scripting atomics. Given that, let's clean things up and promote it to
    an official part of the atomics API, in the form of
    atomic_fetch_add_unless().

    This patch converts definitions and invocations over to the new name,
    including the instrumented version, using the following script:

    ----
    git grep -w __atomic_add_unless | while read line; do
    sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
    done
    git grep -w __arch_atomic_add_unless | while read line; do
    sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
    done
    ----

    Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
    be introduced by later patches.

    There should be no functional change as a result of this patch.
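
    As a reminder of the semantics being renamed: atomic_fetch_add_unless(v,
    a, u) adds a to *v unless *v == u, and returns the old value either way.
    A sketch of typical usage (the helper name is illustrative):

    #include <linux/atomic.h>

    /* Take a reference only while the object is still alive. */
    static bool my_get_unless_zero(atomic_t *refcount)
    {
        return atomic_fetch_add_unless(refcount, 1, 0) != 0;
    }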

    Signed-off-by: Mark Rutland
    Reviewed-by: Will Deacon
    Acked-by: Geert Uytterhoeven
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Palmer Dabbelt
    Cc: Boqun Feng
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland

04 Apr, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Support offloading wireless authentication to userspace via
    NL80211_CMD_EXTERNAL_AUTH, from Srinivas Dasari.

    2) A lot of work on network namespace setup/teardown from Kirill Tkhai.
    Setup and cleanup of namespaces now all run asynchronously and thus
    performance is significantly increased.

    3) Add rx/tx timestamping support to mv88e6xxx driver, from Brandon
    Streiff.

    4) Support zerocopy on RDS sockets, from Sowmini Varadhan.

    5) Use denser instruction encoding in x86 eBPF JIT, from Daniel
    Borkmann.

    6) Support hw offload of vlan filtering in mvpp2 driver, from Maxime
    Chevallier.

    7) Support grafting of child qdiscs in mlxsw driver, from Nogah
    Frankel.

    8) Add packet forwarding tests to selftests, from Ido Schimmel.

    9) Deal with sub-optimal GSO packets better in BBR congestion control,
    from Eric Dumazet.

    10) Support 5-tuple hashing in ipv6 multipath routing, from David Ahern.

    11) Add path MTU tests to selftests, from Stefano Brivio.

    12) Various bits of IPSEC offloading support for mlx5, from Aviad
    Yehezkel, Yossi Kuperman, and Saeed Mahameed.

    13) Support RSS spreading on ntuple filters in SFC driver, from Edward
    Cree.

    14) Lots of sockmap work from John Fastabend. Applications can use eBPF
    to filter sendmsg and sendpage operations.

    15) In-kernel receive TLS support, from Dave Watson.

    16) Add XDP support to ixgbevf, this is significant because it should
    allow optimized XDP usage in various cloud environments. From Tony
    Nguyen.

    17) Add new Intel E800 series "ice" ethernet driver, from Anirudh
    Venkataramanan et al.

    18) IP fragmentation match offload support in nfp driver, from Pieter
    Jansen van Vuuren.

    19) Support XDP redirect in i40e driver, from Björn Töpel.

    20) Add BPF_RAW_TRACEPOINT program type for accessing the arguments of
    tracepoints in their raw form, from Alexei Starovoitov.

    21) Lots of striding RQ improvements to mlx5 driver with many
    performance improvements, from Tariq Toukan.

    22) Use rhashtable for inet frag reassembly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1678 commits)
    net: mvneta: improve suspend/resume
    net: mvneta: split rxq/txq init and txq deinit into SW and HW parts
    ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
    net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
    net: bgmac: Correctly annotate register space
    route: check sysctl_fib_multipath_use_neigh earlier than hash
    fix typo in command value in drivers/net/phy/mdio-bitbang.
    sky2: Increase D3 delay to sky2 stops working after suspend
    net/mlx5e: Set EQE based as default TX interrupt moderation mode
    ibmvnic: Disable irqs before exiting reset from closed state
    net: sched: do not emit messages while holding spinlock
    vlan: also check phy_driver ts_info for vlan's real device
    Bluetooth: Mark expected switch fall-throughs
    Bluetooth: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for BTUSB_QCA_ROME
    Bluetooth: btrsi: remove unused including
    Bluetooth: hci_bcm: Remove DMI quirk for the MINIX Z83-4
    sh_eth: kill useless check in __sh_eth_get_regs()
    sh_eth: add sh_eth_cpu_data::no_xdfar flag
    ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()
    ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
    ...

    Linus Torvalds

31 Mar, 2018

2 commits

  • rxrpc_call structs don't pin sockets or network namespaces, but may attempt
    to access both after their refcount reaches 0 so that they can detach
    themselves from the network namespace. However, there's no guarantee that
    the socket still exists at this point (so sock_net(&call->socket->sk) may
    be invalid) and the namespace may have gone away if the call isn't pinning
    a peer.

    Fix this by (a) carrying a net pointer in the rxrpc_call struct and (b)
    waiting for all calls to be destroyed when the network namespace goes away.

    This was detected by checker:

    net/rxrpc/call_object.c:634:57: warning: incorrect type in argument 1 (different address spaces)
    net/rxrpc/call_object.c:634:57: expected struct sock const *sk
    net/rxrpc/call_object.c:634:57: got struct sock [noderef] *

    Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing")
    Signed-off-by: David Howells

    David Howells
  • Fix various issues detected by checker.

    Errors:

    (*) rxrpc_discard_prealloc() should be using rcu_assign_pointer to set
    call->socket.

    Warnings:

    (*) rxrpc_service_connection_reaper() should be passing NULL rather than 0 to
    trace_rxrpc_conn() as the where argument.

    (*) rxrpc_disconnect_client_call() should get its net pointer via the
    call->conn rather than call->sock to avoid a warning about accessing
    an RCU pointer without protection.

    (*) Proc seq start/stop functions need annotation as they pass locks
    between the functions.

    False positives:

    (*) Checker doesn't correctly handle the seq-retry lock context balance in
    rxrpc_find_service_conn_rcu().

    (*) Checker thinks execution may proceed past the BUG() in
    rxrpc_publish_service_conn().

    (*) Variable length array warnings from SKCIPHER_REQUEST_ON_STACK() in
    rxkad.c.

    Signed-off-by: David Howells

    David Howells

28 Mar, 2018

1 commit

  • In rxrpc and afs, use the debug_ids that are monotonically allocated to
    various objects as they're allocated rather than pointers as kernel
    pointers are now hashed making them less useful. Further, the debug ids
    aren't reused anywhere nearly as quickly.

    In addition, allow kernel services that use rxrpc, such as afs, to take
    numbers from the rxrpc counter, assign them to their own call struct and
    pass them in to rxrpc for both client and service calls so that the trace
    lines for each will have the same ID tag.

    Signed-off-by: David Howells

    David Howells

30 Nov, 2017

1 commit

  • Pull networking fixes from David Miller:

    1) The forcedeth conversion from pci_*() DMA interfaces to dma_*() ones
    missed one spot. From Zhu Yanjun.

    2) Missing CRYPTO_SHA256 Kconfig dep in cfg80211, from Johannes Berg.

    3) Fix checksum offloading in thunderx driver, from Sunil Goutham.

    4) Add SPDX to vm_sockets_diag.h, from Stephen Hemminger.

    5) Fix use after free of packet headers in TIPC, from Jon Maloy.

    6) "sizeof(ptr)" vs "sizeof(*ptr)" bug in i40e, from Gustavo A R Silva.

    7) Tunneling fixes in mlxsw driver, from Petr Machata.

    8) Fix crash in fanout_demux_rollover() of AF_PACKET, from Mike
    Maloney.

    9) Fix race in AF_PACKET bind() vs. NETDEV_UP notifier, from Eric
    Dumazet.

    10) Fix regression in sch_sfq.c due to one of the timer_setup()
    conversions. From Paolo Abeni.

    11) SCTP does list_for_each_entry() using wrong struct member, fix from
    Xin Long.

    12) Don't use big endian netlink attribute read for
    IFLA_BOND_AD_ACTOR_SYSTEM, it is in cpu endianness. Also from Xin
    Long.

    13) Fix mis-initialization of q->link.clock in CBQ scheduler, preventing
    adding filters there. From Jiri Pirko.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
    ethernet: dwmac-stm32: Fix copyright
    net: via: via-rhine: use %p to format void * address instead of %x
    net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit
    myri10ge: Update MAINTAINERS
    net: sched: cbq: create block for q->link.block
    atm: suni: remove extraneous space to fix indentation
    atm: lanai: use %p to format kernel addresses instead of %x
    VSOCK: Don't set sk_state to TCP_CLOSE before testing it
    atm: fore200e: use %pK to format kernel addresses instead of %x
    ambassador: fix incorrect indentation of assignment statement
    vxlan: use __be32 type for the param vni in __vxlan_fdb_delete
    bonding: use nla_get_u64 to extract the value for IFLA_BOND_AD_ACTOR_SYSTEM
    sctp: use right member as the param of list_for_each_entry
    sch_sfq: fix null pointer dereference at timer expiration
    cls_bpf: don't decrement net's refcount when offload fails
    net/packet: fix a race in packet_bind() and packet_notifier()
    packet: fix crash in fanout_demux_rollover()
    sctp: remove extern from stream sched
    sctp: force the params with right types for sctp csum apis
    sctp: force SCTP_ERROR_INV_STRM with __u32 when calling sctp_chunk_fail
    ...

    Linus Torvalds

24 Nov, 2017

4 commits

  • Add an extra timeout that is set/updated when we send a DATA packet that
    has the request-ack flag set. This allows us to detect if we don't get an
    ACK in response to the latest flagged packet.

    The ACK packet is adjudged to have been lost if it doesn't turn up within
    2*RTT of the transmission.

    If the timeout occurs, we schedule the sending of a PING ACK to find out
    the state of the other side. If a new DATA packet is ready to go sooner,
    we cancel the sending of the ping and set the request-ack flag on that
    instead.

    If we get back a PING-RESPONSE ACK that indicates a lower tx_top than what
    we had at the time of the ping transmission, we adjudge all the DATA
    packets sent between the response tx_top and the ping-time tx_top to have
    been lost and retransmit immediately.

    Rather than sending a PING ACK, we could just pick a DATA packet and
    speculatively retransmit that with request-ack set. It should result in
    either a REQUESTED ACK or a DUPLICATE ACK which we can then use in lieu of
    the PING-RESPONSE ACK mentioned above.
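
    In outline, arming the lost-ACK timer at transmission time might look like
    this (a sketch; the field and helper names are illustrative):

    #include <linux/ktime.h>

    struct my_call {
        ktime_t expect_req_by;
        u64     rtt_ns;         /* current RTT estimate for the peer */
    };

    static void my_reduce_call_timer(struct my_call *call, ktime_t expiry);

    static void my_note_request_ack_sent(struct my_call *call, ktime_t now)
    {
        /* The requested ACK is adjudged lost if it hasn't arrived within
         * 2*RTT; on expiry, a PING ACK is scheduled to probe the peer. */
        call->expect_req_by = ktime_add_ns(now, 2 * call->rtt_ns);
        my_reduce_call_timer(call, call->expect_req_by);
    }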

    Signed-off-by: David Howells

    David Howells
  • Fix the rxrpc call expiration timeouts and make them settable from
    userspace. By analogy with other rx implementations, there should be three
    timeouts:

    (1) "Normal timeout"

    This is set for all calls and is triggered if we haven't received any
    packets from the peer in a while. It is measured from the last time
    we received any packet on that call. This is not reset by any
    connection packets (such as CHALLENGE/RESPONSE packets).

    If a service operation takes a long time, the server should generate
    PING ACKs at a duration that's substantially less than the normal
    timeout so as to keep both sides alive. This is set at 1/6 of normal
    timeout.

    (2) "Idle timeout"

    This is set only for a service call and is triggered if we stop
    receiving the DATA packets that comprise the request data. It is
    measured from the last time we received a DATA packet.

    (3) "Hard timeout"

    This can be set for a call and specifies the maximum lifetime of that
    call. It should not be specified by default. Some operations (such
    as volume transfer) take a long time.

    Allow userspace to set/change the timeouts on a call with sendmsg, using a
    control message:

    RXRPC_SET_CALL_TIMEOUTS

    The data to the message is a number of 32-bit words, not all of which need
    be given:

    u32 hard_timeout; /* sec from first packet */
    u32 idle_timeout; /* msec from packet Rx */
    u32 normal_timeout; /* msec from data Rx */

    This can be set in combination with any other sendmsg() that affects a
    call.
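
    For example, a userspace sketch of attaching that control message to a
    sendmsg() call (SOL_RXRPC and the cmsg type come from linux/rxrpc.h; treat
    the exact constant spelling and word layout as per your kernel's headers):

    #include <string.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <linux/rxrpc.h>

    /* msg->msg_control must already point at a buffer of at least
     * CMSG_SPACE(sizeof(uint32_t[3])) bytes. */
    static void set_call_timeouts(struct msghdr *msg, uint32_t hard_sec,
                                  uint32_t idle_ms, uint32_t normal_ms)
    {
        uint32_t words[3] = { hard_sec, idle_ms, normal_ms };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg);

        cmsg->cmsg_level = SOL_RXRPC;
        cmsg->cmsg_type  = RXRPC_SET_CALL_TIMEOUTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(words));
        memcpy(CMSG_DATA(cmsg), words, sizeof(words));
        msg->msg_controllen = CMSG_SPACE(sizeof(words));
    }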

    Signed-off-by: David Howells

    David Howells
  • When rxrpc_sendmsg() parses the control message buffer, it places the
    parameters extracted into a structure, but lumps together call parameters
    (such as user call ID) with operation parameters (such as whether to send
    data, send an abort or accept a call).

    Split the call parameters out into their own structure, a copy of which is
    then embedded in the operation parameters struct.

    The call parameters struct is then passed down into the places that need it
    instead of passing the individual parameters. This allows for extra call
    parameters to be added.

    Signed-off-by: David Howells

    David Howells
  • Provide a different lockdep key for rxrpc_call::user_mutex when the call is
    made on a kernel socket, such as by the AFS filesystem.

    The problem is that lockdep registers a false positive between userspace
    calling the sendmsg syscall on a user socket where call->user_mutex is held
    whilst userspace memory is accessed whereas the AFS filesystem may perform
    operations with mmap_sem held by the caller.

    In such a case, the following warning is produced.

    ======================================================
    WARNING: possible circular locking dependency detected
    4.14.0-fscache+ #243 Tainted: G E
    ------------------------------------------------------
    modpost/16701 is trying to acquire lock:
    (&vnode->io_lock){+.+.}, at: [] afs_begin_vnode_operation+0x33/0x77 [kafs]

    but task is already holding lock:
    (&mm->mmap_sem){++++}, at: [] __do_page_fault+0x1ef/0x486

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #3 (&mm->mmap_sem){++++}:
    __might_fault+0x61/0x89
    _copy_from_iter_full+0x40/0x1fa
    rxrpc_send_data+0x8dc/0xff3
    rxrpc_do_sendmsg+0x62f/0x6a1
    rxrpc_sendmsg+0x166/0x1b7
    sock_sendmsg+0x2d/0x39
    ___sys_sendmsg+0x1ad/0x22b
    __sys_sendmsg+0x41/0x62
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    -> #2 (&call->user_mutex){+.+.}:
    __mutex_lock+0x86/0x7d2
    rxrpc_new_client_call+0x378/0x80e
    rxrpc_kernel_begin_call+0xf3/0x154
    afs_make_call+0x195/0x454 [kafs]
    afs_vl_get_capabilities+0x193/0x198 [kafs]
    afs_vl_lookup_vldb+0x5f/0x151 [kafs]
    afs_create_volume+0x2e/0x2f4 [kafs]
    afs_mount+0x56a/0x8d7 [kafs]
    mount_fs+0x6a/0x109
    vfs_kern_mount+0x67/0x135
    do_mount+0x90b/0xb57
    SyS_mount+0x72/0x98
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    -> #1 (k-sk_lock-AF_RXRPC){+.+.}:
    lock_sock_nested+0x74/0x8a
    rxrpc_kernel_begin_call+0x8a/0x154
    afs_make_call+0x195/0x454 [kafs]
    afs_fs_get_capabilities+0x17a/0x17f [kafs]
    afs_probe_fileserver+0xf7/0x2f0 [kafs]
    afs_select_fileserver+0x83f/0x903 [kafs]
    afs_fetch_status+0x89/0x11d [kafs]
    afs_iget+0x16f/0x4f8 [kafs]
    afs_mount+0x6c6/0x8d7 [kafs]
    mount_fs+0x6a/0x109
    vfs_kern_mount+0x67/0x135
    do_mount+0x90b/0xb57
    SyS_mount+0x72/0x98
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    -> #0 (&vnode->io_lock){+.+.}:
    lock_acquire+0x174/0x19f
    __mutex_lock+0x86/0x7d2
    afs_begin_vnode_operation+0x33/0x77 [kafs]
    afs_fetch_data+0x80/0x12a [kafs]
    afs_readpages+0x314/0x405 [kafs]
    __do_page_cache_readahead+0x203/0x2ba
    filemap_fault+0x179/0x54d
    __do_fault+0x17/0x60
    __handle_mm_fault+0x6d7/0x95c
    handle_mm_fault+0x24e/0x2a3
    __do_page_fault+0x301/0x486
    do_page_fault+0x236/0x259
    page_fault+0x22/0x30
    __clear_user+0x3d/0x60
    padzero+0x1c/0x2b
    load_elf_binary+0x785/0xdc7
    search_binary_handler+0x81/0x1ff
    do_execveat_common.isra.14+0x600/0x888
    do_execve+0x1f/0x21
    SyS_execve+0x28/0x2f
    do_syscall_64+0x89/0x1be
    return_from_SYSCALL_64+0x0/0x75

    other info that might help us debug this:

    Chain exists of:
    &vnode->io_lock --> &call->user_mutex --> &mm->mmap_sem

    Possible unsafe locking scenario:

    CPU0                            CPU1
    ----                            ----
    lock(&mm->mmap_sem);
                                    lock(&call->user_mutex);
                                    lock(&mm->mmap_sem);
    lock(&vnode->io_lock);

    *** DEADLOCK ***

    1 lock held by modpost/16701:
    #0: (&mm->mmap_sem){++++}, at: [] __do_page_fault+0x1ef/0x486

    stack backtrace:
    CPU: 0 PID: 16701 Comm: modpost Tainted: G E 4.14.0-fscache+ #243
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    Call Trace:
    dump_stack+0x67/0x8e
    print_circular_bug+0x341/0x34f
    check_prev_add+0x11f/0x5d4
    ? add_lock_to_list.isra.12+0x8b/0x8b
    ? add_lock_to_list.isra.12+0x8b/0x8b
    ? __lock_acquire+0xf77/0x10b4
    __lock_acquire+0xf77/0x10b4
    lock_acquire+0x174/0x19f
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    __mutex_lock+0x86/0x7d2
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    ? afs_begin_vnode_operation+0x33/0x77 [kafs]
    afs_begin_vnode_operation+0x33/0x77 [kafs]
    afs_fetch_data+0x80/0x12a [kafs]
    afs_readpages+0x314/0x405 [kafs]
    __do_page_cache_readahead+0x203/0x2ba
    ? filemap_fault+0x179/0x54d
    filemap_fault+0x179/0x54d
    __do_fault+0x17/0x60
    __handle_mm_fault+0x6d7/0x95c
    handle_mm_fault+0x24e/0x2a3
    __do_page_fault+0x301/0x486
    do_page_fault+0x236/0x259
    page_fault+0x22/0x30
    RIP: 0010:__clear_user+0x3d/0x60
    RSP: 0018:ffff880071e93da0 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 000000000000011c RCX: 000000000000011c
    RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000060f720
    RBP: 000000000060f720 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000001 R11: ffff8800b5459b68 R12: ffff8800ce150e00
    R13: 000000000060f720 R14: 00000000006127a8 R15: 0000000000000000
    padzero+0x1c/0x2b
    load_elf_binary+0x785/0xdc7
    search_binary_handler+0x81/0x1ff
    do_execveat_common.isra.14+0x600/0x888
    do_execve+0x1f/0x21
    SyS_execve+0x28/0x2f
    do_syscall_64+0x89/0x1be
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7fdb6009ee07
    RSP: 002b:00007fff566d9728 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
    RAX: ffffffffffffffda RBX: 000055ba57280900 RCX: 00007fdb6009ee07
    RDX: 000055ba5727f270 RSI: 000055ba5727cac0 RDI: 000055ba57280900
    RBP: 000055ba57280900 R08: 00007fff566d9700 R09: 0000000000000000
    R10: 000055ba5727cac0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000055ba5727cac0 R14: 000055ba5727f270 R15: 0000000000000000

    Signed-off-by: David Howells

    David Howells

22 Nov, 2017

1 commit

  • This converts all remaining cases of the old setup_timer() API into using
    timer_setup(), where the callback argument is the structure already
    holding the struct timer_list. These should have no behavioral changes,
    since they just change which pointer is passed into the callback with
    the same available pointers after conversion. It handles the following
    examples, in addition to some other variations.

    Casting from unsigned long:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
        ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

    and forced object casts:

    void my_callback(struct something *ptr)
    {
        ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

    become:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
        ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

    Direct function assignments:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
        ...
    }
    ...
    ptr->my_timer.function = my_callback;

    have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
        ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

    And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
        ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

    have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
        ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

    The conversion is done with the following Coccinelle script:

    spatch --very-quiet --all-includes --include-headers \
    -I ./arch/x86/include -I ./arch/x86/include/generated \
    -I ./include -I ./arch/x86/include/uapi \
    -I ./arch/x86/include/generated/uapi -I ./include/uapi \
    -I ./include/generated/uapi --include ./include/linux/kconfig.h \
    --dir . \
    --cocci-file ~/src/data/timer_setup.cocci

    @fix_address_of@
    expression e;
    @@

    setup_timer(
    -&(e)
    +&e
    , ...)

    // Update any raw setup_timer() usages that have a NULL callback, but
    // would otherwise match change_timer_function_usage, since the latter
    // will update all function assignments done in the face of a NULL
    // function initialization in setup_timer().
    @change_timer_function_usage_NULL@
    expression _E;
    identifier _timer;
    type _cast_data;
    @@

    (
    -setup_timer(&_E->_timer, NULL, _E);
    +timer_setup(&_E->_timer, NULL, 0);
    |
    -setup_timer(&_E->_timer, NULL, (_cast_data)_E);
    +timer_setup(&_E->_timer, NULL, 0);
    |
    -setup_timer(&_E._timer, NULL, &_E);
    +timer_setup(&_E._timer, NULL, 0);
    |
    -setup_timer(&_E._timer, NULL, (_cast_data)&_E);
    +timer_setup(&_E._timer, NULL, 0);
    )

    @change_timer_function_usage@
    expression _E;
    identifier _timer;
    struct timer_list _stl;
    identifier _callback;
    type _cast_func, _cast_data;
    @@

    (
    -setup_timer(&_E->_timer, _callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, &_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, &_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    _E->_timer@_stl.function = _callback;
    |
    _E->_timer@_stl.function = &_callback;
    |
    _E->_timer@_stl.function = (_cast_func)_callback;
    |
    _E->_timer@_stl.function = (_cast_func)&_callback;
    |
    _E._timer@_stl.function = _callback;
    |
    _E._timer@_stl.function = &_callback;
    |
    _E._timer@_stl.function = (_cast_func)_callback;
    |
    _E._timer@_stl.function = (_cast_func)&_callback;
    )

    // callback(unsigned long arg)
    @change_callback_handle_cast
    depends on change_timer_function_usage@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _origtype;
    identifier _origarg;
    type _handletype;
    identifier _handle;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *t
    )
    {
    (
    ... when != _origarg
    _handletype *_handle =
    -(_handletype *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle =
    -(void *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle;
    ... when != _handle
    _handle =
    -(_handletype *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle;
    ... when != _handle
    _handle =
    -(void *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    )
    }

    // callback(unsigned long arg) without existing variable
    @change_callback_handle_cast_no_arg
    depends on change_timer_function_usage &&
    !change_callback_handle_cast@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _origtype;
    identifier _origarg;
    type _handletype;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *t
    )
    {
    + _handletype *_origarg = from_timer(_origarg, t, _timer);
    +
    ... when != _origarg
    - (_handletype *)_origarg
    + _origarg
    ... when != _origarg
    }

    // Avoid already converted callbacks.
    @match_callback_converted
    depends on change_timer_function_usage &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg@
    identifier change_timer_function_usage._callback;
    identifier t;
    @@

    void _callback(struct timer_list *t)
    { ... }

    // callback(struct something *handle)
    @change_callback_handle_arg
    depends on change_timer_function_usage &&
    !match_callback_converted &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _handletype;
    identifier _handle;
    @@

    void _callback(
    -_handletype *_handle
    +struct timer_list *t
    )
    {
    + _handletype *_handle = from_timer(_handle, t, _timer);
    ...
    }

    // If change_callback_handle_arg ran on an empty function, remove
    // the added handler.
    @unchange_callback_handle_arg
    depends on change_timer_function_usage &&
    change_callback_handle_arg@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _handletype;
    identifier _handle;
    identifier t;
    @@

    void _callback(struct timer_list *t)
    {
    - _handletype *_handle = from_timer(_handle, t, _timer);
    }

    // We only want to refactor the setup_timer() data argument if we've found
    // the matching callback. This undoes changes in change_timer_function_usage.
    @unchange_timer_function_usage
    depends on change_timer_function_usage &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg &&
    !change_callback_handle_arg@
    expression change_timer_function_usage._E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type change_timer_function_usage._cast_data;
    @@

    (
    -timer_setup(&_E->_timer, _callback, 0);
    +setup_timer(&_E->_timer, _callback, (_cast_data)_E);
    |
    -timer_setup(&_E._timer, _callback, 0);
    +setup_timer(&_E._timer, _callback, (_cast_data)&_E);
    )

    // If we fixed a callback from a .function assignment, fix the
    // assignment cast now.
    @change_timer_function_assignment
    depends on change_timer_function_usage &&
    (change_callback_handle_cast ||
    change_callback_handle_cast_no_arg ||
    change_callback_handle_arg)@
    expression change_timer_function_usage._E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type _cast_func;
    typedef TIMER_FUNC_TYPE;
    @@

    (
    _E->_timer.function =
    -_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -(_cast_func)_callback;
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -(_cast_func)&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -&_callback;
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -(_cast_func)_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -(_cast_func)&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    )

    // Sometimes timer functions are called directly. Replace matched args.
    @change_timer_function_calls
    depends on change_timer_function_usage &&
    (change_callback_handle_cast ||
    change_callback_handle_cast_no_arg ||
    change_callback_handle_arg)@
    expression _E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type _cast_data;
    @@

    _callback(
    (
    -(_cast_data)_E
    +&_E->_timer
    |
    -(_cast_data)&_E
    +&_E._timer
    |
    -_E
    +&_E->_timer
    )
    )

    // If a timer has been configured without a data argument, it can be
    // converted without regard to the callback argument, since it is unused.
    @match_timer_function_unused_data@
    expression _E;
    identifier _timer;
    identifier _callback;
    @@

    (
    -setup_timer(&_E->_timer, _callback, 0);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, 0L);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, 0UL);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0L);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0UL);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0L);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0UL);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0);
    +timer_setup(_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0L);
    +timer_setup(_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0UL);
    +timer_setup(_timer, _callback, 0);
    )

    @change_callback_unused_data
    depends on match_timer_function_unused_data@
    identifier match_timer_function_unused_data._callback;
    type _origtype;
    identifier _origarg;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *unused
    )
    {
    ... when != _origarg
    }
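
    For illustration, this is the shape of conversion the script performs,
    shown on a hypothetical driver structure (the struct name and fields are
    invented for the example):

    /* Before: the callback receives an opaque unsigned long. */
    struct foo {
            struct timer_list timer;
            int value;
    };

    static void foo_timeout(unsigned long data)
    {
            struct foo *foo = (struct foo *)data;
            /* ... handle the timeout ... */
    }

    setup_timer(&foo->timer, foo_timeout, (unsigned long)foo);

    /* After: the callback receives the timer pointer and recovers its
     * container with from_timer(), a container_of() wrapper. */
    static void foo_timeout(struct timer_list *t)
    {
            struct foo *foo = from_timer(foo, t, timer);
            /* ... handle the timeout ... */
    }

    timer_setup(&foo->timer, foo_timeout, 0);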

    Signed-off-by: Kees Cook

    Kees Cook
     

02 Nov, 2017

1 commit

  • Place a spinlock around the invocation of call->notify_rx() for a kernel
    service call; take the same lock when ending the call and replace the
    notification pointer with a pointer to a dummy function.

    This is required because it's possible for rxrpc_notify_socket() to be
    called after the call has been ended by the kernel service if called from
    the asynchronous work function rxrpc_process_call().

    However, rxrpc_notify_socket() currently only holds the RCU read lock when
    invoking ->notify_rx(), which means that the afs_call struct would need to
    be disposed of by call_rcu() rather than by kfree().

    But we shouldn't see any notifications from a call after calling
    rxrpc_kernel_end_call(), so a lock is required in rxrpc code.
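
    As a rough sketch of the resulting pattern (the lock field and the dummy
    function names here are illustrative assumptions, not a reference):

    /* Notification side: invoke the callback under the per-call lock. */
    spin_lock_bh(&call->notify_lock);
    call->notify_rx(sk, call, call->user_call_ID);
    spin_unlock_bh(&call->notify_lock);

    /* End-of-call side: swap in a no-op under the same lock, so a late
     * notification can never touch the kernel service's freed data. */
    spin_lock_bh(&call->notify_lock);
    call->notify_rx = rxrpc_dummy_notify_rx;
    spin_unlock_bh(&call->notify_lock);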

    Without this, we may see the call wait queue as having a corrupt spinlock:

    BUG: spinlock bad magic on CPU#0, kworker/0:2/1612
    general protection fault: 0000 [#1] SMP
    ...
    Workqueue: krxrpcd rxrpc_process_call
    task: ffff88040b83c400 task.stack: ffff88040adfc000
    RIP: 0010:spin_bug+0x161/0x18f
    RSP: 0018:ffff88040adffcc0 EFLAGS: 00010002
    RAX: 0000000000000032 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffff81ab16cf
    RDX: ffff88041fa14c01 RSI: ffff88041fa0ccb8 RDI: ffff88041fa0ccb8
    RBP: ffff88040adffcd8 R08: 00000000ffffffff R09: 00000000ffffffff
    R10: ffff88040adffc60 R11: 000000000000022c R12: ffff88040aca2208
    R13: ffffffff81a58114 R14: 0000000000000000 R15: 0000000000000000
    ....
    Call Trace:
    do_raw_spin_lock+0x1d/0x89
    _raw_spin_lock_irqsave+0x3d/0x49
    ? __wake_up_common_lock+0x4c/0xa7
    __wake_up_common_lock+0x4c/0xa7
    ? __lock_is_held+0x47/0x7a
    __wake_up+0xe/0x10
    afs_wake_up_call_waiter+0x11b/0x122 [kafs]
    rxrpc_notify_socket+0x12b/0x258
    rxrpc_process_call+0x18e/0x7d0
    process_one_work+0x298/0x4de
    ? rescuer_thread+0x280/0x280
    worker_thread+0x1d1/0x2ae
    ? rescuer_thread+0x280/0x280
    kthread+0x12c/0x134
    ? kthread_create_on_node+0x3a/0x3a
    ret_from_fork+0x27/0x40

    In this case, note the corrupt data in RBX: the 0x6b pattern is the slab
    poison byte, indicating that the structure has already been freed. The
    address of the offending afs_call is in R12, plus the offset to the
    spinlock.

    Signed-off-by: David Howells

    David Howells
     

29 Aug, 2017

1 commit

  • Allow a client call that failed on network error to be retried, provided
    that the Tx queue still holds DATA packet 1. This allows an operation to
    be submitted to another server or another address for the same server
    without having to repackage and re-encrypt the data so far processed.

    Two new functions are provided:

    (1) rxrpc_kernel_check_call() - This is used to find out the completion
    state of a call to guess whether it can be retried and whether it
    should be retried.

    (2) rxrpc_kernel_retry_call() - Disconnect the call from its current
    connection, reset the state and submit it as a new client call to a
    new address. The new address need not match the previous address.

    A call may be retried even if all the data hasn't been loaded into it yet;
    a partially constructed call will be retained at the point it had reached
    when the error condition was detected. msg_data_left() can be used to find
    out how much data was packaged before the error occurred.
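
    A kernel service might use the pair along these lines (a minimal sketch;
    the exact signatures and the sockaddr/key arguments are assumptions for
    illustration):

    enum rxrpc_call_completion compl;
    u32 abort_code;
    int ret;

    /* Find out how the call completed so we can judge whether a retry
     * makes sense (a network error, not a remote abort). */
    ret = rxrpc_kernel_check_call(sock, call, &compl, &abort_code);
    if (ret < 0 && compl == RXRPC_CALL_NETWORK_ERROR)
            /* DATA packet 1 is still queued: detach from the dead
             * connection and resubmit to an alternative address. */
            ret = rxrpc_kernel_retry_call(sock, call, &new_srx, key);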

    Signed-off-by: David Howells

    David Howells
     

15 Jun, 2017

1 commit

  • When a call finishes, cache the congestion window size that was determined
    during its transmission phase so that the next call to the same peer can
    start from it, thereby shortcutting the slow-start algorithm.

    The value is stored in the rxrpc_peer struct and is accessed without
    locking. Each call takes the value that happens to be there when it starts
    and just overwrites the value when it finishes.
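
    Conceptually the exchange is just a pair of unlocked assignments (the
    field name is an assumption for illustration):

    /* On call setup: seed from whatever the peer record holds now. */
    call->cong_cwnd = peer->cong_cwnd;

    /* On call completion: publish the window this call arrived at.
     * A race just means one recent value wins, which is acceptable. */
    peer->cong_cwnd = call->cong_cwnd;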

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

08 Jun, 2017

1 commit

  • Provide a control message that can be specified on the first sendmsg() of a
    client call or the first sendmsg() of a service response to indicate the
    total length of the data to be transmitted for that call.

    Currently, because the length of the payload of an encrypted DATA packet
    is itself encrypted and placed in front of the data, the packet cannot be
    encrypted until we know how much data it will hold.

    By specifying the length at the beginning of the transmit phase, each DATA
    packet length can be set before we start loading data from userspace (where
    several sendmsg() calls may contribute to a particular packet).

    An error will be returned if too little or too much data is presented in
    the Tx phase.
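
    From userspace this might look like the following sketch (the control
    message name RXRPC_TX_LENGTH and its 64-bit payload are assumptions here;
    it would be sent alongside the usual RXRPC_USER_CALL_ID on the first
    sendmsg() of the call):

    struct msghdr msg = { 0 };
    char control[CMSG_SPACE(sizeof(__s64))] = { 0 };
    struct cmsghdr *cmsg;
    __s64 tx_total = 65536;        /* total bytes this call will send */

    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_RXRPC;
    cmsg->cmsg_type = RXRPC_TX_LENGTH;
    cmsg->cmsg_len = CMSG_LEN(sizeof(tx_total));
    memcpy(CMSG_DATA(cmsg), &tx_total, sizeof(tx_total));
    /* Attach the first chunk of data as iovecs, then: */
    sendmsg(fd, &msg, 0);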

    Signed-off-by: David Howells

    David Howells
     

26 May, 2017

1 commit

  • Support network namespacing in AF_RXRPC with the following changes:

    (1) All the local endpoint, peer and call lists, locks, counters, etc. are
    moved into the per-namespace record.

    (2) All the connection tracking is moved into the per-namespace record
    with the exception of the client connection ID tree, which is kept
    global so that connection IDs are kept unique per-machine.

    (3) Each namespace gets its own epoch. This allows each network namespace
    to pretend to be a separate client machine.

    (4) The /proc/net/rxrpc_xxx files are now called /proc/net/rxrpc/xxx and
    the contents reflect the namespace.

    fs/afs/ should be okay with this patch as, for the moment, it explicitly
    requires the current net namespace to be init_net before a mount may
    proceed. It will, however, need updating so that cells, IP addresses and
    DNS records become per-namespace as well.
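
    Structurally this follows the usual pernet pattern (a sketch; the
    identifiers are assumptions for illustration):

    static struct pernet_operations rxrpc_net_ops = {
            .init = rxrpc_init_net,  /* set up per-ns lists, locks, proc files */
            .exit = rxrpc_exit_net,  /* tear them down with the namespace */
            .id   = &rxrpc_net_id,
            .size = sizeof(struct rxrpc_net),
    };

    /* Registered once at module init; each code path then locates its
     * state via net_generic(net, rxrpc_net_id) instead of globals. */
    ret = register_pernet_subsys(&rxrpc_net_ops);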

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

06 Apr, 2017

1 commit


02 Mar, 2017

1 commit

  • All the routines by which rxrpc is accessed from the outside are serialised
    by means of the socket lock (sendmsg, recvmsg, bind,
    rxrpc_kernel_begin_call(), ...) and this presents a problem:

    (1) If a number of calls on the same socket are in the process of
    connecting to the same peer, a maximum of four concurrent live calls
    are permitted before further calls need to wait for a slot.

    (2) If a call is waiting for a slot, it is deep inside sendmsg() or
    rxrpc_kernel_begin_call() and the entry function is holding the socket
    lock.

    (3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented
    from servicing the other calls as they need to take the socket lock to
    do so.

    (4) The socket is stuck until a call is aborted and makes its slot
    available to the waiter.

    Fix this by:

    (1) Provide each call with a mutex ('user_mutex') that arbitrates access
    by the users of rxrpc separately for each specific call.

    (2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as
    they've got a call and taken its mutex.

    Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is
    set but someone else has the lock. Should I instead only return
    EWOULDBLOCK if there's nothing currently to be done on a socket, and
    sleep in this particular instance because there is something to be
    done, but we appear to be blocked by the interrupt handler doing its
    ping?

    (3) Make rxrpc_new_client_call() unlock the socket after allocating a new
    call, locking its user mutex and adding it to the socket's call tree.
    The call is returned locked so that sendmsg() can add data to it
    immediately.

    From the moment the call is in the socket tree, it is subject to
    access by sendmsg() and recvmsg() - even if it isn't connected yet.

    (4) Lock new service calls in the UDP data_ready handler (in
    rxrpc_new_incoming_call()) because they may already be in the socket's
    tree and the data_ready handler makes them live immediately if a user
    ID has already been preassigned.

    Note that the new call is locked before any notifications are sent
    that it is live, so doing mutex_trylock() *ought* to always succeed.
    Userspace is prevented from doing sendmsg() on calls that are in a
    too-early state in rxrpc_do_sendmsg().

    (5) Make rxrpc_new_incoming_call() return the call with the user mutex
    held so that a ping can be scheduled immediately under it.

    Note that it might be worth moving the ping call into
    rxrpc_new_incoming_call() and then we can drop the mutex there.

    (6) Make rxrpc_accept_call() take the lock on the call it is accepting and
    release the socket after adding the call to the socket's tree. This
    is slightly tricky as we've dequeued the call by that point and have
    to requeue it.

    Note that requeuing emits a trace event.

    (7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the
    new mutex immediately and don't bother with the socket mutex at all.

    This patch has the nice bonus that calls on the same socket are now to some
    extent parallelisable.

    Note that we might want to move rxrpc_service_prealloc() calls out from the
    socket lock and give it its own lock, so that we don't hang progress in
    other calls because we're waiting for the allocator.

    We probably also want to avoid calling rxrpc_notify_socket() from within
    the socket lock (rxrpc_accept_call()).
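
    In outline, the entry points now juggle the two locks like this (a
    simplified sketch with error handling elided):

    lock_sock(&rx->sk);
    call = rxrpc_find_call_by_user_ID(rx, user_call_ID);
    if (mutex_lock_interruptible(&call->user_mutex) < 0) {
            release_sock(&rx->sk);
            return -ERESTARTSYS;
    }
    release_sock(&rx->sk);  /* other calls on this socket may proceed */

    /* ... wait for a conn slot, copy data in or out, etc., holding
     * only this call's user_mutex ... */

    mutex_unlock(&call->user_mutex);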

    Signed-off-by: David Howells
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    David Howells
     

05 Jan, 2017

1 commit

  • Fix the way enum values are translated into strings in AF_RXRPC
    tracepoints. The problem with just doing a lookup in a normal flat array
    of strings or chars is that external tracing infrastructure can't find it.
    Rather, TRACE_DEFINE_ENUM must be used.

    Also sort the enums and string tables to make it easier to keep them in
    order so that a future patch to __print_symbolic() can be optimised to try
    a direct lookup into the table first before iterating over it.
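
    The pattern in question, with a hypothetical enum value for illustration:

    TRACE_DEFINE_ENUM(rxrpc_skb_cleaned);  /* export the value to tracing */

    /* In the event's TP_printk(), translate the value to a string: */
    TP_printk("%s",
              __print_symbolic(__entry->why,
                               { rxrpc_skb_cleaned, "CLN" }))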

    A couple of _proto() macro calls are removed because they referred to
    tables that were moved into the tracing infrastructure. The relevant data
    can be found by way of tracing.

    Signed-off-by: David Howells

    David Howells