14 Oct, 2016

1 commit


13 Oct, 2016

1 commit

  • We switched from kmap_atomic() to kmap() so the kunmap() calls need to
    be updated to match.

    Fixes: d001648ec7cf ('rxrpc: Don't expose skbs to in-kernel users [ver #2]')
    Signed-off-by: Dan Carpenter
    Signed-off-by: David Howells

    Dan Carpenter
     

12 Oct, 2016

1 commit

  • The mapping_set_error() helper sets the correct AS_ flag for the mapping
    so there is no reason to open code it. Use the helper directly.
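
    What the helper does can be modelled in userspace C as below. This is a
    sketch only: address_space_model, mapping_set_error_model() and
    flags_after_error() are hypothetical stand-ins, and the bit numbers chosen
    for the AS_ flags are assumptions. It shows why any error other than
    -ENOSPC (for example -ENXIO) ends up reported as an I/O error.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's AS_ flag bits (bit numbers assumed). */
enum { AS_EIO = 0, AS_ENOSPC = 1 };

/* Hypothetical, cut-down model of struct address_space. */
struct address_space_model {
	unsigned long flags;
};

/*
 * Record a writeback error on the mapping.  -ENOSPC keeps its own flag;
 * every other non-zero error is folded into the EIO flag.
 */
static void mapping_set_error_model(struct address_space_model *mapping, int error)
{
	if (!error)
		return;
	if (error == -ENOSPC)
		mapping->flags |= 1UL << AS_ENOSPC;
	else
		mapping->flags |= 1UL << AS_EIO;
}

/* Return the flags that result from recording 'error' on a clean mapping. */
static unsigned long flags_after_error(int error)
{
	struct address_space_model m = { 0 };

	mapping_set_error_model(&m, error);
	return m.flags;
}
```

    A caller that previously set one of the flags by hand would instead pass
    its error code to the helper and let it pick the flag.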

    [akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
    Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Oct, 2016

2 commits

  • Pull networking fixes from David Miller:

    1) Netfilter list handling fix, from Linus.

    2) RXRPC/AFS bug fixes (oops on call to serviceless endpoints, build
    warnings, missing notifications, etc.), from David Howells.

    3) Kernel log message missing newlines, from Colin Ian King.

    4) Don't enter direct reclaim in netlink dumps, the idea is to use a
    high order allocation first and fallback quickly to a 0-order
    allocation if such a high-order one cannot be done cheaply and
    without reclaim. From Eric Dumazet.

    5) Fix firmware download errors in btusb bluetooth driver, from Ethan
    Hsieh.

    6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven.

    7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott.

    8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej
    Żenczykowski.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    netfilter: Fix slab corruption.
    be2net: Enable VF link state setting for BE3
    be2net: Fix TX stats for TSO packets
    be2net: Update Copyright string in be_hw.h
    be2net: NCSI FW section should be properly updated with ethtool for BE3
    be2net: Provide an alternate way to read pf_num for BEx chips
    wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent()
    net: macb: NULL out phydev after removing mdio bus
    xen-netback: make sure that hashes are not send to unaware frontends
    Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion
    MAINTAINERS: add myself as a maintainer of xen-netback
    ipv6 addrconf: disallow rtr_solicits < -1
    Bluetooth: btusb: Fix atheros firmware download error
    drivers: net: phy: Correct duplicate MDIO_XGENE entry
    ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM
    net: ethernet: mediatek: remove hwlro property in the device tree
    net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi
    net: ethernet: mediatek: get the chip id by ETHDMASYS registers
    net: bgmac: Fix errant feature flag check
    netlink: do not enter direct reclaim from netlink_dump()
    ...

    Linus Torvalds
     
  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     

06 Oct, 2016

1 commit

  • When it's in the waiting-for-ACK state, the AFS filesystem needs to check
    the result of rxrpc_kernel_recv_data() any time it is notified to see if it
    is indicating a fatal error. If this is the case, it needs to mark the
    call completed; otherwise the call just sits there and never goes away.

    Signed-off-by: David Howells

    David Howells
     

27 Sep, 2016

2 commits

  • Generated patch:

    sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
    sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     
  • This is trivial to do:

    - add flags argument to foo_rename()
    - check if flags is zero
    - assign foo_rename() to .rename2 instead of .rename
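
    The three steps above might look roughly like this. It is a userspace
    sketch, not code from the patch: inode_operations_model is a hypothetical
    cut-down stand-in for struct inode_operations, foo_rename() is the
    placeholder name used above, and RENAME_NOREPLACE_MODEL is a local stand-in
    for the real flag.

```c
#include <errno.h>
#include <stddef.h>

/* Opaque here; the sketch never looks inside them. */
struct inode;
struct dentry;

/* Local stand-in for the RENAME_NOREPLACE flag. */
#define RENAME_NOREPLACE_MODEL (1u << 0)

/* Hypothetical, cut-down model of the two rename hooks. */
struct inode_operations_model {
	int (*rename)(struct inode *, struct dentry *,
		      struct inode *, struct dentry *);
	int (*rename2)(struct inode *, struct dentry *,
		       struct inode *, struct dentry *, unsigned int);
};

/* Step 1: the rename method gains a flags argument. */
static int foo_rename(struct inode *old_dir, struct dentry *old_dentry,
		      struct inode *new_dir, struct dentry *new_dentry,
		      unsigned int flags)
{
	/* Step 2: this filesystem supports no rename flags at all. */
	if (flags)
		return -EINVAL;

	/* The filesystem's existing rename logic would run here. */
	(void)old_dir;
	(void)old_dentry;
	(void)new_dir;
	(void)new_dentry;
	return 0;
}

/* Step 3: hook the method up as .rename2 and leave .rename unset. */
static const struct inode_operations_model foo_dir_inode_operations = {
	.rename2 = foo_rename,
};
```

    Rejecting every non-zero flag keeps the behaviour of a plain rename while
    letting the old .rename hook go away.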

    This doesn't mean it's impossible to support RENAME_NOREPLACE for these
    filesystems, but it is not trivial, like for local filesystems.
    RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
    for a file to be created on one host while it is overwritten by rename on
    another host).

    Filesystems converted:

    9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.

    After this, we can get rid of the duplicate interfaces for rename.

    Signed-off-by: Miklos Szeredi
    Acked-by: Greg Kroah-Hartman
    Acked-by: David Howells [AFS]
    Acked-by: Mike Marshall
    Cc: Eric Van Hensbergen
    Cc: Ilya Dryomov
    Cc: Jan Harkes
    Cc: Tyler Hicks
    Cc: Oleg Drokin
    Cc: Trond Myklebust
    Cc: Mark Fasheh

    Miklos Szeredi
     

08 Sep, 2016

2 commits

  • Rewrite the data and ack handling code such that:

    (1) Parsing of received ACK and ABORT packets and the distribution and the
    filing of DATA packets happens entirely within the data_ready context
    called from the UDP socket. This allows us to process and discard ACK
    and ABORT packets much more quickly (they're no longer stashed on a
    queue for a background thread to process).

    (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead
    keep track of the offset and length of the content of each packet in
    the sk_buff metadata. This means we don't do any allocation in the
    receive path.

    (3) Jumbo DATA packet parsing is now done in data_ready context. Rather
    than cloning the packet once for each subpacket and pulling/trimming
    it, we file the packet multiple times with an annotation for each
    indicating which subpacket is there. From that we can directly
    calculate the offset and length.

    (4) A call's receive queue can be accessed without taking locks (memory
    barriers do have to be used, though).

    (5) Incoming calls are set up from preallocated resources and immediately
    made live. They can then have packets queued upon them and ACKs
    generated. If insufficient resources exist, DATA packet #1 is given a
    BUSY reply and other DATA packets are discarded.

    (6) sk_buffs no longer take a ref on their parent call.

    To make this work, the following changes are made:

    (1) Each call's receive buffer is now a circular buffer of sk_buff
    pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
    between the call and the socket. This permits each sk_buff to be in
    the buffer multiple times. The receive buffer is reused for the
    transmit buffer.

    (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
    to the data buffer. Transmission phase annotations indicate whether a
    buffered packet has been ACK'd or not and whether it needs
    retransmission.

    Receive phase annotations indicate whether a slot holds a whole packet
    or a jumbo subpacket and, if the latter, which subpacket. They also
    note whether the packet has been decrypted in place.

    (3) DATA packet window tracking is much simplified. Each phase has just
    two numbers representing the window (rx_hard_ack/rx_top and
    tx_hard_ack/tx_top).

    The hard_ack number is the sequence number before base of the window,
    representing the last packet the other side says it has consumed.
    hard_ack starts from 0 and the first packet is sequence number 1.

    The top number is the sequence number of the highest-numbered packet
    residing in the buffer. Packets between hard_ack+1 and top are
    soft-ACK'd to indicate they've been received, but not yet consumed.

    Four macros, before(), before_eq(), after() and after_eq() are added
    to compare sequence numbers within the window. This allows for the
    top of the window to wrap when the hard-ack sequence number gets close
    to the limit.

    Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
    to indicate when rx_top and tx_top point at the packets with the
    LAST_PACKET bit set, indicating the end of the phase.

    (4) Calls are queued on the socket 'receive queue' rather than packets.
    This means that we don't have to invent dummy packets to queue to
    indicate abnormal/terminal states and we don't have to keep metadata
    packets (such as ABORTs) around.

    (5) The offset and length of a (sub)packet's content are now passed to
    the verify_packet security op. This is currently expected to decrypt
    the packet in place and validate it.

    However, there's now nowhere to store the revised offset and length of
    the actual data within the decrypted blob (there may be a header and
    padding to skip) because an sk_buff may represent multiple packets, so
    a locate_data security op is added to retrieve these details from the
    sk_buff content when needed.

    (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
    individually secured and needs to be individually decrypted. The code
    to do this is broken out into rxrpc_recvmsg_data() and shared with the
    kernel API. It now iterates over the call's receive buffer rather
    than walking the socket receive queue.
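
    The window tracking described in (3), together with the circular buffer
    from (1), can be sketched in userspace C as follows. The type seq_t, the
    helpers in_window() and seq_to_slot() and the ring size of 64 are
    assumptions made for illustration; the four comparisons are written as
    inline functions here rather than macros.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t seq_t;

/*
 * Wrap-safe comparisons: the unsigned difference is reinterpreted as a
 * signed value, so ordering is correct as long as the two numbers are
 * less than 2^31 apart.
 */
static inline bool before(seq_t a, seq_t b)    { return (int32_t)(a - b) < 0; }
static inline bool before_eq(seq_t a, seq_t b) { return (int32_t)(a - b) <= 0; }
static inline bool after(seq_t a, seq_t b)     { return (int32_t)(a - b) > 0; }
static inline bool after_eq(seq_t a, seq_t b)  { return (int32_t)(a - b) >= 0; }

/* Assumed ring size; a power of two so a mask can replace a modulo. */
#define RING_SIZE 64u
#define RING_MASK (RING_SIZE - 1u)

/*
 * A packet is in the window if it lies in hard_ack+1 .. top inclusive.
 * hard_ack is the last packet consumed; top is the highest one buffered.
 */
static inline bool in_window(seq_t hard_ack, seq_t top, seq_t seq)
{
	return after(seq, hard_ack) && before_eq(seq, top);
}

/* Map a sequence number onto its slot in the circular buffer. */
static inline unsigned int seq_to_slot(seq_t seq)
{
	return seq & RING_MASK;
}
```

    With hard_ack at 0xfffffffe and top at 2, sequence numbers 0xffffffff, 0,
    1 and 2 all count as inside the window, which is the wrapping case the
    comparison helpers exist to handle.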

    Additional changes:

    (1) The timers are condensed to a single timer that is set for the soonest
    of three timeouts (delayed ACK generation, DATA retransmission and
    call lifespan).

    (2) Transmission of ACK and ABORT packets is effected immediately from
    process-context socket ops/kernel API calls that cause them instead of
    them being punted off to a background work item. The data_ready
    handler still has to defer to the background, though.

    (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
    filesystem can shut down the socket and flush its own work items
    before closing the socket to deal with any in-progress service calls.
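
    The single-timer idea in (1) can be illustrated with a small userspace
    sketch. Everything here is hypothetical: tick_t, the EV_ event bits,
    soonest_timeout() and due_events() are illustrative names, and plain
    64-bit tick counts are assumed so that wraparound can be ignored.

```c
#include <stdint.h>

typedef uint64_t tick_t;

/* Hypothetical event bits, one per timeout that shares the timer. */
enum {
	EV_ACK    = 1u << 0,	/* delayed ACK generation */
	EV_RESEND = 1u << 1,	/* DATA retransmission */
	EV_EXPIRE = 1u << 2,	/* call lifespan */
};

/* The one timer is armed for whichever of the three deadlines comes first. */
static tick_t soonest_timeout(tick_t ack_at, tick_t resend_at, tick_t expire_at)
{
	tick_t t = ack_at;

	if (resend_at < t)
		t = resend_at;
	if (expire_at < t)
		t = expire_at;
	return t;
}

/* When the timer fires, work out which deadlines have actually passed. */
static unsigned int due_events(tick_t ack_at, tick_t resend_at,
			       tick_t expire_at, tick_t now)
{
	unsigned int events = 0;

	if (ack_at <= now)
		events |= EV_ACK;
	if (resend_at <= now)
		events |= EV_RESEND;
	if (expire_at <= now)
		events |= EV_EXPIRE;
	return events;
}
```

    After handling the due events, the caller would recompute the soonest
    remaining deadline and re-arm the timer for it.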

    Future additional changes that will need to be considered:

    (1) Make sure that a call doesn't hog the front of the queue by receiving
    data from the network as fast as userspace is consuming it to the
    exclusion of other calls.

    (2) Transmit delayed ACKs from within recvmsg() when we've consumed
    sufficiently more packets to avoid the background work item needing to
    run.

    Signed-off-by: David Howells

    David Howells
     
  • Make it possible for the data_ready handler called from the UDP transport
    socket to completely instantiate an rxrpc_call structure and make it
    immediately live by preallocating all the memory it might need. The idea
    is to cut out the background thread usage as much as possible.

    [Note that the preallocated structs are not actually used in this patch -
    that will be done in a future patch.]

    If insufficient resources are available in the preallocation buffers, it
    will be possible to discard the DATA packet in the data_ready handler or
    schedule a BUSY packet without the need to schedule an attempt at
    allocation in a background thread.

    To this end:

    (1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs, up to
    a maximum of the listen backlog size for each type. The backlog size is
    limited to a maximum of 32. Only this many of each can be in the
    preallocation buffer.

    (2) For userspace sockets, the preallocation is charged initially by
    listen() and will be recharged by accepting or rejecting pending
    new incoming calls.

    (3) For kernel services {,re,dis}charging of the preallocation buffers is
    handled manually. Two notifier callbacks have to be provided before
    kernel_listen() is invoked:

    (a) An indication that a new call has been instantiated. This can be
    used to trigger background recharging.

    (b) An indication that a call is being discarded. This is used when
    the socket is being released.

    A function, rxrpc_kernel_charge_accept() is called by the kernel
    service to preallocate a single call. It should be passed the user ID
    to be used for that call and a callback to associate the rxrpc call
    with the kernel service's side of the ID.

    (4) Discard the preallocation when the socket is closed.

    (5) Temporarily bump the refcount on the call allocated in
    rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
    preallocation ref on service calls unconditionally. This will no
    longer be necessary once the preallocation is used.
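
    The charge/consume/discard cycle of a preallocation buffer can be
    sketched in userspace C as below. This is not the patch's data structure:
    call_model, backlog_model and the backlog_ helpers are hypothetical, only
    one object type is modelled, and the locking and memory barriers a real
    data_ready context would need are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* The backlog size is limited to a maximum of 32. */
#define BACKLOG_MAX 32u

/* Hypothetical stand-in for a preallocated call. */
struct call_model {
	unsigned long user_call_ID;
};

/* Hypothetical ring of preallocated calls. */
struct backlog_model {
	struct call_model *slots[BACKLOG_MAX];
	unsigned int head;	/* next slot to fill */
	unsigned int tail;	/* next slot to consume */
	unsigned int limit;	/* listen backlog, at most BACKLOG_MAX */
};

/* Preallocate one call; this is the only place that allocates memory. */
static bool backlog_charge(struct backlog_model *b, unsigned long id)
{
	struct call_model *c;

	if (b->head - b->tail >= b->limit)
		return false;		/* buffer already fully charged */
	c = malloc(sizeof(*c));
	if (!c)
		return false;
	c->user_call_ID = id;
	b->slots[b->head++ % BACKLOG_MAX] = c;
	return true;
}

/* Take a preallocated call for a new incoming call; never allocates. */
static struct call_model *backlog_take(struct backlog_model *b)
{
	if (b->head == b->tail)
		return NULL;		/* nothing left: reply BUSY or discard */
	return b->slots[b->tail++ % BACKLOG_MAX];
}

/* Throw away whatever is still preallocated, as on socket close. */
static void backlog_discard(struct backlog_model *b)
{
	struct call_model *c;

	while ((c = backlog_take(b)) != NULL)
		free(c);
}

/* Try 'attempts' charges against 'limit' and report how many succeeded. */
static unsigned int demo_charges_accepted(unsigned int limit, unsigned int attempts)
{
	struct backlog_model b = { .limit = limit > BACKLOG_MAX ? BACKLOG_MAX : limit };
	unsigned int accepted = 0;

	for (unsigned int i = 0; i < attempts; i++)
		if (backlog_charge(&b, i + 1))
			accepted++;
	backlog_discard(&b);
	return accepted;
}
```

    Keeping allocation confined to the charge step is what lets the consuming
    side run in a context that cannot sleep.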

    Note that this does not yet control the number of active service calls on a
    client - that will come in a later patch.

    A future development would be to provide a setsockopt() call that allows a
    userspace server to manually charge the preallocation buffer. This would
    allow user call IDs to be provided in advance and the awkward manual accept
    stage to be bypassed.

    Signed-off-by: David Howells

    David Howells
     

07 Sep, 2016

1 commit

  • Add a tracepoint for working out where local aborts happen. Each
    tracepoint call is labelled with a 3-letter code so that they can be
    distinguished - and the DATA sequence number is added too where available.

    rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
    indicate the circumstances when it aborts a call.

    Signed-off-by: David Howells

    David Howells
     

05 Sep, 2016

4 commits

  • The workqueue "afs_lock_manager" queues work item &vnode->lock_work,
    per vnode. Since there can be multiple vnodes and since their work items
    can be executed concurrently, alloc_workqueue has been used to replace
    the deprecated create_singlethread_workqueue instance.

    The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
    memory pressure because the workqueue is being used on a memory reclaim
    path.

    Since there is a fixed number of work items, an explicit concurrency
    limit is unnecessary here.

    Signed-off-by: Bhaktipriya Shridhar
    Signed-off-by: David Howells

    Bhaktipriya Shridhar
     
  • The workqueue "afs_callback_update_worker" queues multiple work items
    viz &vnode->cb_broken_work, &server->cb_break_work which require strict
    execution ordering. Hence, an ordered dedicated workqueue has been used.

    Since the workqueue is being used on a memory reclaim path, WQ_MEM_RECLAIM
    has been set to ensure forward progress under memory pressure.

    Signed-off-by: Bhaktipriya Shridhar
    Signed-off-by: David Howells

    Bhaktipriya Shridhar
     
  • The workqueue "afs_async_calls" queues work item
    &call->async_work per afs_call. Since there could be multiple calls and since
    these calls can be run concurrently, alloc_workqueue has been used to replace
    the deprecated create_singlethread_workqueue instance.

    The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
    memory pressure because the workqueue is being used on a memory reclaim
    path.

    Since there is a fixed number of work items, an explicit concurrency
    limit is unnecessary here.

    Signed-off-by: Bhaktipriya Shridhar
    Signed-off-by: David Howells

    Bhaktipriya Shridhar
     
  • The workqueue "afs_vlocation_update_worker" queues a single work item
    &afs_vlocation_update and hence it doesn't require execution ordering.
    Hence, alloc_workqueue has been used to replace the deprecated
    create_singlethread_workqueue instance.

    Since the workqueue is being used on a memory reclaim path, WQ_MEM_RECLAIM
    flag has been set to ensure forward progress under memory pressure.

    Since there is a fixed number of work items, an explicit concurrency
    limit is unnecessary here.

    Signed-off-by: Bhaktipriya Shridhar
    Signed-off-by: David Howells

    Bhaktipriya Shridhar
     

02 Sep, 2016

1 commit

  • Don't expose skbs to in-kernel users, such as the AFS filesystem, but
    instead provide a notification hook that indicates that a call needs
    attention and another that indicates that there's a new call to be
    collected.

    This makes the following possibilities more achievable:

    (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

    (2) skbs referring to non-data events will be able to be freed much sooner
    rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
    will be able to consult the call state.

    (3) We can shortcut the receive phase when a call is remotely aborted
    because we don't have to go through all the packets to get to the one
    cancelling the operation.

    (4) It makes it easier to do encryption/decryption directly between AFS's
    buffers and sk_buffs.

    (5) Encryption/decryption can more easily be done in the AFS's thread
    contexts - usually that of the userspace process that issued a syscall
    - rather than in one of rxrpc's background threads on a workqueue.

    (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

    To make this work, the following interface function has been added:

    int rxrpc_kernel_recv_data(
    struct socket *sock, struct rxrpc_call *call,
    void *buffer, size_t bufsize, size_t *_offset,
    bool want_more, u32 *_abort_code);

    This is the recvmsg equivalent. It allows the caller to find out about the
    state of a specific call and to transfer received data into a buffer
    piecemeal.

    afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
    logic between them. They don't wait synchronously yet because the socket
    lock needs to be dealt with.

    Five interface functions have been removed:

    rxrpc_kernel_is_data_last()
    rxrpc_kernel_get_abort_code()
    rxrpc_kernel_get_error_number()
    rxrpc_kernel_free_skb()
    rxrpc_kernel_data_consumed()

    As a temporary hack, sk_buffs going to an in-kernel call are queued on the
    rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
    in-kernel user. To process the queue internally, a temporary function,
    temp_deliver_data() has been added. This will be replaced with common code
    between the rxrpc_recvmsg() path and the rxrpc_kernel_recv_data() path in a
    future patch.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

30 Aug, 2016

4 commits

  • Pass struct socket * to more rxrpc kernel interface functions. They should
    be starting from this rather than the socket pointer in the rxrpc_call
    struct if they need to access the socket.

    I have left:

    rxrpc_kernel_is_data_last()
    rxrpc_kernel_get_abort_code()
    rxrpc_kernel_get_error_number()
    rxrpc_kernel_free_skb()
    rxrpc_kernel_data_consumed()

    unmodified as they're all about to be removed (and, in any case, don't
    touch the socket).

    Signed-off-by: David Howells

    David Howells
     
  • Provide a function so that kernel users, such as AFS, can ask for the peer
    address of a call:

    void rxrpc_kernel_get_peer(struct rxrpc_call *call,
    struct sockaddr_rxrpc *_srx);

    In the future the kernel service won't get sk_buffs to look inside.
    Further, this allows us to hide any canonicalisation inside AF_RXRPC for
    when IPv6 support is added.

    Also propagate this through to afs_find_server() and issue a warning if we
    can't handle the address family yet.

    Signed-off-by: David Howells

    David Howells
     
  • We should #include linux/random.h to use get_random().

    Signed-off-by: David Howells

    David Howells
     
  • Remove one #ifndef'd-out variable and a couple of excessive blank lines.

    Signed-off-by: David Howells

    David Howells
     

06 Aug, 2016

1 commit

  • Inside the kafs filesystem it is possible to occasionally have a call
    processed and terminated before we've had a chance to check whether we need
    to clean up the rx queue for that call because afs_send_simple_reply() ends
    the call when it is done, but this is done in a workqueue item that might
    happen to run to completion before afs_deliver_to_call() completes.

    Further, it is possible for rxrpc_kernel_send_data() to be called to send a
    reply before the last request-phase data skb is released. The rxrpc skb
    destructor is where the ACK processing is done and the call state is
    advanced upon release of the last skb. ACK generation is also deferred to
    a work item because it's possible that the skb destructor is not called in
    a context where kernel_sendmsg() can be invoked.

    To this end, the following changes are made:

    (1) rxrpc_kernel_data_consumed() is added. This should be called whenever
    an skb is emptied so as to crank the ACK and call states. This does
    not release the skb, however. rxrpc_kernel_free_skb() must now be
    called to achieve that. These together replace
    rxrpc_kernel_data_delivered().

    (2) rxrpc_kernel_data_consumed() is wrapped by afs_data_consumed().

    This makes afs_deliver_to_call() easier to work with, as the skb can
    simply be discarded unconditionally here without trying to work out what
    the return value of the ->deliver() function means.

    The ->deliver() functions can, via afs_data_complete(),
    afs_transfer_reply() and afs_extract_data() mark that an skb has been
    consumed (thereby cranking the state) without the need to
    conditionally free the skb to make sure the state is correct on an
    incoming call for when the call processor tries to send the reply.

    (3) rxrpc_recvmsg() now has to call rxrpc_kernel_data_consumed() when it
    has finished with a packet and MSG_PEEK isn't set.

    (4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().

    Because of this, we no longer need to clear the destructor and put the
    call before we free the skb in cases where we don't want the ACK/call
    state to be cranked.

    (5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
    than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
    the delivery function already), and the caller is now responsible for
    producing an abort if that was the last packet.

    (6) There are many bits of unmarshalling code where:

    ret = afs_extract_data(call, skb, last, ...);
    switch (ret) {
    case 0: break;
    case -EAGAIN: return 0;
    default: return ret;
    }

    is to be found. As -EAGAIN can now be passed back to the caller, we
    now just return if ret < 0:

    ret = afs_extract_data(call, skb, last, ...);
    if (ret < 0)
    return ret;

    (7) Checks for trailing data and empty final data packets have been
    consolidated as afs_data_complete(). So:

    if (skb->len > 0)
    return -EBADMSG;
    if (!last)
    return 0;

    becomes:

    ret = afs_data_complete(call, skb, last);
    if (ret < 0)
    return ret;

    (8) afs_transfer_reply() now checks the amount of data it has against the
    amount of data desired and the amount of data in the skb and returns
    an error to induce an abort if we don't get exactly what we want.

    Without these changes, the following oops can occasionally be observed,
    particularly if some printks are inserted into the delivery path:

    general protection fault: 0000 [#1] SMP
    Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
    CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G E 4.7.0-fsdevel+ #1303
    Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
    Workqueue: kafsd afs_async_workfn [kafs]
    task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
    RIP: 0010:[] [] __lock_acquire+0xcf/0x15a1
    RSP: 0018:ffff88040c073bc0 EFLAGS: 00010002
    RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
    RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
    R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
    FS: 0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
    Stack:
    0000000000000006 000000000be04930 0000000000000000 ffff880400000000
    ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
    ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
    Call Trace:
    [] ? mark_held_locks+0x5e/0x74
    [] ? __local_bh_enable_ip+0x9b/0xa1
    [] ? trace_hardirqs_on_caller+0x16d/0x189
    [] lock_acquire+0x122/0x1b6
    [] ? lock_acquire+0x122/0x1b6
    [] ? skb_dequeue+0x18/0x61
    [] _raw_spin_lock_irqsave+0x35/0x49
    [] ? skb_dequeue+0x18/0x61
    [] skb_dequeue+0x18/0x61
    [] afs_deliver_to_call+0x344/0x39d [kafs]
    [] afs_process_async_call+0x4c/0xd5 [kafs]
    [] afs_async_workfn+0xe/0x10 [kafs]
    [] process_one_work+0x29d/0x57c
    [] worker_thread+0x24a/0x385
    [] ? rescuer_thread+0x2d0/0x2d0
    [] kthread+0xf3/0xfb
    [] ret_from_fork+0x1f/0x40
    [] ? kthread_create_on_node+0x1cf/0x1cf

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

11 Jun, 2016

1 commit

  • Limit the socket incoming call backlog queue size so that a remote client
    can't pump in sufficient new calls that the server runs out of memory. Note
    that this is partially theoretical at the moment since whilst the number of
    calls is limited, the number of packets trying to set up new calls is not.
    This will be addressed in a later patch.

    If the caller of listen() specifies a backlog INT_MAX, then they get the
    current maximum; anything else greater than max_backlog or anything
    negative incurs EINVAL.
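
    The backlog rule above can be written down as a small function. This is a
    sketch under assumptions: resolve_backlog() is a hypothetical name, and
    max_backlog is passed in as a parameter rather than read from the sysctl.

```c
#include <errno.h>
#include <limits.h>

/*
 * Turn the backlog passed to listen() into the value actually used,
 * or a negative errno if it is not acceptable.
 */
static int resolve_backlog(int backlog, unsigned int max_backlog)
{
	/* INT_MAX means "give me whatever the current maximum is". */
	if (backlog == INT_MAX)
		return (int)max_backlog;

	/* Anything negative or above the maximum is rejected. */
	if (backlog < 0 || (unsigned int)backlog > max_backlog)
		return -EINVAL;

	return backlog;
}
```

    Treating INT_MAX as a request for the maximum lets a caller ask for the
    largest queue without having to read the sysctl first.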

    The limit on the maximum queue size can be set by:

    echo N >/proc/sys/net/rxrpc/max_backlog

    where 4
    Signed-off-by: David S. Miller

    David Howells
     

28 May, 2016

1 commit

  • Most users of IS_ERR_VALUE() in the kernel are wrong, as they
    pass an 'int' into a function that takes an 'unsigned long'
    argument. This happens to work because the type is sign-extended
    on 64-bit architectures before it gets converted into an
    unsigned type.

    However, anything that passes an 'unsigned short' or 'unsigned int'
    argument into IS_ERR_VALUE() is guaranteed to be broken, as are
    8-bit integers and types that are wider than 'unsigned long'.
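
    The difference between the two cases can be reproduced in userspace. The
    sketch below models the check as a function taking 'unsigned long';
    is_err_value(), check_int() and check_uint() are hypothetical names, and
    the 'unsigned int' result shown assumes a platform where 'unsigned long'
    is wider than 'unsigned int'.

```c
#define MAX_ERRNO 4095

/* Model of the check: true for the top MAX_ERRNO values of unsigned long. */
static int is_err_value(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}

/* An 'int' is sign-extended on conversion, so -5 lands in the error range. */
static int check_int(int v)
{
	return is_err_value(v);
}

/*
 * An 'unsigned int' is zero-extended instead, so the same bit pattern
 * falls far below the error range when unsigned long is wider.
 */
static int check_uint(unsigned int v)
{
	return is_err_value(v);
}
```

    On a 64-bit build check_int(-5) reports an error while
    check_uint((unsigned int)-5) does not, even though both hold the same 32
    bits.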

    Andrzej Hajda has already fixed a lot of the worst abusers that
    were causing actual bugs, but it would be nice to prevent any
    users that are not passing 'unsigned long' arguments.

    This patch changes all users of IS_ERR_VALUE() that I could find
    on 32-bit ARM randconfig builds and x86 allmodconfig. For the
    moment, this doesn't change the definition of IS_ERR_VALUE()
    because there are probably still architecture specific users
    elsewhere.

    Almost all the warnings I got are for files that are better off
    using 'if (err)' or 'if (err < 0)'.
    The only legitimate user I could find that we get a warning for
    is the (32-bit only) freescale fman driver, so I did not remove
    the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
    For 9pfs, I just worked around one user whose calling conventions
    are so obscure that I did not dare change the behavior.

    I was using this definition for testing:

    #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
    unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))

    which ends up making all 16-bit or wider types work correctly with
    the most plausible interpretation of what IS_ERR_VALUE() was supposed
    to return according to its users, but also causes a compile-time
    warning for any users that do not pass an 'unsigned long' argument.

    I suggested this approach earlier this year, but back then we ended
    up deciding to just fix the users that are obviously broken. After
    the initial warning that caused me to get involved in the discussion
    (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
    asked me to send the whole thing again.

    [ Updated the 9p parts as per Al Viro - Linus ]

    Signed-off-by: Arnd Bergmann
    Cc: Andrzej Hajda
    Cc: Andrew Morton
    Link: https://lkml.org/lkml/2016/1/7/363
    Link: https://lkml.org/lkml/2016/5/27/486
    Acked-by: Srinivas Kandagatla # For nvmem part
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

11 May, 2016

1 commit


03 May, 2016

1 commit

  • Right now ext2_get_page() (and its analogues in a bunch of other filesystems)
    relies upon the directory being locked - the way it sets and tests Checked and
    Error bits would be racy without that. Switch to a slightly different scheme,
    _not_ setting Checked in case of failure. That way the logic becomes
    if Checked => OK
    else if Error => fail
    else if !validate => fail
    else => OK
    with validation setting Checked or Error on success and failure respectively
    and returning which one happened. Equivalent to the current logic, but,
    unlike the current logic, not sensitive to the order of set_bit, test_bit
    getting reordered by CPU, etc.

    Signed-off-by: Al Viro

    Al Viro
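
The Checked/Error decision chain described in the commit above can be sketched in plain C. This is only an illustrative userspace model: `struct demo_page`, `demo_check_page()` and `demo_page_ok()` are hypothetical names, the two booleans stand in for the real page flag bits, and the `valid` field simply fakes the outcome of directory-block validation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page and its two flag bits. */
struct demo_page {
    bool checked;  /* models the Checked bit */
    bool error;    /* models the Error bit */
    bool valid;    /* what validation would find (illustrative only) */
};

/*
 * Validate the page once. Sets Checked on success and Error on failure,
 * and returns which one happened. Checked is NOT set on failure.
 */
static bool demo_check_page(struct demo_page *page)
{
    if (page->valid) {
        page->checked = true;
        return true;
    }
    page->error = true;
    return false;
}

/* The decision chain from the commit message; true means "OK". */
static bool demo_page_ok(struct demo_page *page)
{
    if (page->checked)
        return true;               /* if Checked => OK */
    if (page->error)
        return false;              /* else if Error => fail */
    return demo_check_page(page);  /* else validate => OK or fail */
}
```

Because a failed validation never sets Checked, a reader that sees Checked set can trust the page regardless of how the two flag updates were ordered.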
     

12 Apr, 2016

2 commits

  • In the rxrpc_connection and rxrpc_call structs, there's one field to hold
    the abort code, no matter whether that value was generated locally to be
    sent or was received from the peer via an abort packet.

    Split the abort code fields in two for cleanliness' sake and add an error
    field to hold the Linux error number to the rxrpc_call struct too
    (sometimes this is generated in a context where we can't return it to
    userspace directly).

    Furthermore, add a skb mark to indicate a packet that caused a local abort
    to be generated so that recvmsg() can pick up the correct abort code. A
    future addition will need to indicate the difference between the two kinds
    of abort to userspace via a control message.

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     
  • The afs filesystem needs to wait for any outstanding asynchronous calls
    (such as FS.GiveUpCallBacks cleaning up the callbacks lodged with a server)
    to complete before closing the AF_RXRPC socket when unloading the module.

    This may occur if the module is removed too quickly after unmounting all
    filesystems. This will produce an error report that looks like:

    AFS: Assertion failed
    1 == 0 is false
    0x1 == 0x0 is false
    ------------[ cut here ]------------
    kernel BUG at ../fs/afs/rxrpc.c:135!
    ...
    RIP: 0010:[] afs_close_socket+0xec/0x107 [kafs]
    ...
    Call Trace:
    [] afs_exit+0x1f/0x57 [kafs]
    [] SyS_delete_module+0xec/0x17d
    [] entry_SYSCALL_64_fastpath+0x12/0x6b

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
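
To see why the rewrite rules in the commit above preserve behaviour, here is a small before/after sketch in userspace C. The `DEMO_` macros and `demo_` functions are hypothetical names, and the shift value of 12 is only an example. The one assumption carried over from the commit is that the page cache constants were equal to their PAGE_* counterparts.

```c
/* Illustrative value; the real page shift is architecture-specific. */
#define DEMO_PAGE_SHIFT        12
#define DEMO_PAGE_SIZE         (1UL << DEMO_PAGE_SHIFT)

/* Models the old aliases, which were equal to the PAGE_* constants. */
#define DEMO_PAGE_CACHE_SHIFT  DEMO_PAGE_SHIFT
#define DEMO_PAGE_CACHE_SIZE   DEMO_PAGE_SIZE

/* Before: index conversion written as a shift by the difference (zero). */
static unsigned long demo_index_old(unsigned long pgoff)
{
    return pgoff << (DEMO_PAGE_CACHE_SHIFT - DEMO_PAGE_SHIFT);
}

/* After: the no-op shift is dropped entirely. */
static unsigned long demo_index_new(unsigned long pgoff)
{
    return pgoff;
}

/* Before: byte offset within a page using the page cache constant. */
static unsigned long demo_offset_old(unsigned long long pos)
{
    return (unsigned long)(pos & (DEMO_PAGE_CACHE_SIZE - 1));
}

/* After: the same computation using the plain page constant. */
static unsigned long demo_offset_new(unsigned long long pos)
{
    return (unsigned long)(pos & (DEMO_PAGE_SIZE - 1));
}
```

With the two shifts equal, each old/new pair returns the same value for every input, which is what makes a mechanical conversion safe.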
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
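
The wrapper pattern described in the commit above (inode_foo(inode) being mutex_foo(&inode->i_mutex)) can be modelled as follows. This is a sketch under loud assumptions: `struct demo_mutex` is a toy, single-threaded flag rather than a real lock, and every `demo_` name is hypothetical. It only shows how callers stop touching the lock field directly.

```c
#include <stdbool.h>

/* Toy stand-in for a mutex: NOT thread-safe, for illustration only. */
struct demo_mutex {
    bool locked;
};

static void demo_mutex_lock(struct demo_mutex *m)   { m->locked = true; }
static void demo_mutex_unlock(struct demo_mutex *m) { m->locked = false; }
static bool demo_mutex_is_locked(struct demo_mutex *m) { return m->locked; }

static bool demo_mutex_trylock(struct demo_mutex *m)
{
    if (m->locked)
        return false;
    m->locked = true;
    return true;
}

/* Hypothetical inode with an embedded lock, as in the commit text. */
struct demo_inode {
    struct demo_mutex i_mutex;
};

/* Wrappers: the only code that names the i_mutex field. */
static inline void demo_inode_lock(struct demo_inode *inode)
{
    demo_mutex_lock(&inode->i_mutex);
}

static inline void demo_inode_unlock(struct demo_inode *inode)
{
    demo_mutex_unlock(&inode->i_mutex);
}

static inline bool demo_inode_trylock(struct demo_inode *inode)
{
    return demo_mutex_trylock(&inode->i_mutex);
}

static inline bool demo_inode_is_locked(struct demo_inode *inode)
{
    return demo_mutex_is_locked(&inode->i_mutex);
}
```

Once all callers go through the wrappers, the lock type behind them can be changed in one place, which is the stated motivation for the planned switch to an rwsem.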
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
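
The opt-in accounting idea in the commit above (only allocations marked with a flag are charged to a group) can be illustrated with a deliberately simplified userspace model. Nothing here is the kernel's memcg implementation: `DEMO_GFP_ACCOUNT`, `struct demo_group` and `demo_alloc()` are hypothetical, and refusing the allocation at the limit is a simplifying assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag modelling an "account this allocation" marker. */
#define DEMO_GFP_ACCOUNT 0x1u

/* Hypothetical accounting group; invariant: charged <= limit. */
struct demo_group {
    size_t charged;  /* bytes charged so far */
    size_t limit;    /* maximum bytes that may be charged */
};

/*
 * Allocate size bytes. Only allocations carrying DEMO_GFP_ACCOUNT are
 * charged to the group and can be refused for exceeding its limit.
 */
static void *demo_alloc(struct demo_group *g, size_t size, unsigned int flags)
{
    int accounted = (flags & DEMO_GFP_ACCOUNT) != 0;
    void *p;

    if (accounted && size > g->limit - g->charged)
        return NULL;  /* charging would exceed the limit */

    p = malloc(size);
    if (p && accounted)
        g->charged += size;  /* charge only after a successful allocation */
    return p;
}
```

Unmarked allocations bypass the group entirely, which mirrors the commit's remark that the limit can still be breached through objects that are not on the list.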
     

13 Jan, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of stuff. That probably should've been 5 or 6 separate
    branches, but by the time I'd realized how large and mixed that bag
    had become it had been too close to -final to play with rebasing.

    Some fs/namei.c cleanups there, memdup_user_nul() introduction and
    switching open-coded instances, burying long-dead code, whack-a-mole
    of various kinds, several new helpers for ->llseek(), assorted
    cleanups and fixes from various people, etc.

    One piece probably deserves special mention - Neil's
    lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
    called without ->i_mutex and tries to avoid ever taking it. That, of
    course, means that it's not useful for any directory modifications,
    but things like getting inode attributes in nfsd readdirplus are fine
    with that. I really should've asked for a moratorium on lookup-related
    changes this cycle, but since I hadn't done that early enough... I
    *am* asking for that for the coming cycle, though - I'm going to try
    and get conversion of i_mutex to rwsem with ->lookup() done under lock
    taken shared.

    There will be a patch closer to the end of the window, along the lines
    of the one Linus had posted last May - mechanical conversion of
    ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
    inode_is_locked()/inode_lock_nested(). To quote Linus back then:

    -----
    | This is an automated patch using
    |
    | sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    | sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    | sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    | sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    | sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    | with a very few manual fixups
    -----

    I'm going to send that once the ->i_mutex-affecting stuff in -next
    gets mostly merged (or when Linus says he's about to stop taking
    merges)"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    nfsd: don't hold i_mutex over userspace upcalls
    fs:affs:Replace time_t with time64_t
    fs/9p: use fscache mutex rather than spinlock
    proc: add a reschedule point in proc_readfd_common()
    logfs: constify logfs_block_ops structures
    fcntl: allow to set O_DIRECT flag on pipe
    fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
    fs: xattr: Use kvfree()
    [s390] page_to_phys() always returns a multiple of PAGE_SIZE
    nbd: use ->compat_ioctl()
    fs: use block_device name vsprintf helper
    lib/vsprintf: add %*pg format specifier
    fs: use gendisk->disk_name where possible
    poll: plug an unused argument to do_poll
    amdkfd: don't open-code memdup_user()
    cdrom: don't open-code memdup_user()
    rsxx: don't open-code memdup_user()
    mtip32xx: don't open-code memdup_user()
    [um] mconsole: don't open-code memdup_user_nul()
    [um] hostaudio: don't open-code memdup_user()
    ...

    Linus Torvalds
     

04 Jan, 2016

1 commit


09 Dec, 2015

1 commit

  • kmap() in page_follow_link_light() needed to go - being allowed to hold
    an arbitrary number of kmaps for a long time is a great way to deadlock
    the system.

    A new helper (inode_nohighmem(inode)) needs to be used for pagecache
    symlink inodes; done for all in-tree cases. page_follow_link_light()
    is instrumented to yell about anything missed.

    Signed-off-by: Al Viro

    Al Viro
     

11 May, 2015

1 commit


16 Apr, 2015

1 commit


12 Apr, 2015

2 commits


02 Apr, 2015

1 commit


01 Apr, 2015

1 commit