11 Apr, 2018

3 commits

  • Currently vhost *_access_ok() functions return int. This is error-prone
    because there are two popular conventions:

    1. 0 means failure, 1 means success
    2. -errno means failure, 0 means success

    Although vhost mostly uses #1, it does not do so consistently.
    umem_access_ok() uses #2.

    This patch changes the return type from int to bool so that false means
    failure and true means success. This eliminates a potential source of
    errors.

    Suggested-by: Linus Torvalds
    Signed-off-by: Stefan Hajnoczi
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Stefan Hajnoczi
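
The hazard described in this commit can be reproduced in a few lines. The sketch below is illustrative only: `check_style1`, `check_style2`, `check_bool`, and `caller_assuming_style1` are hypothetical names, not vhost functions. It shows how a caller written for convention #1 silently inverts the result of a helper that follows convention #2.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Convention #1: 1 means success, 0 means failure. */
static int check_style1(const void *p)
{
	return p != NULL;
}

/* Convention #2: 0 means success, -errno means failure. */
static int check_style2(const void *p)
{
	return p ? 0 : -EFAULT;
}

/* With bool, the type documents that true means success. */
static bool check_bool(const void *p)
{
	return p != NULL;
}

/*
 * A caller that assumes convention #1 ("non-zero means OK").
 * Handing it a convention #2 helper inverts the answer without
 * any compiler diagnostic, because both helpers return int.
 */
static bool caller_assuming_style1(int (*check)(const void *), const void *p)
{
	return check(p) != 0;
}
```

Passing `check_style2` to `caller_assuming_style1` reports a valid pointer as a failure and a NULL pointer as a success, which is the class of mistake a bool return type makes harder to write.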
     
  • Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
    when IOTLB is enabled") introduced a regression. The logic was
    originally:

    if (vq->iotlb)
    return 1;
    return A && B;

    After the patch the short-circuit logic for A was inverted:

    if (A || vq->iotlb)
    return A;
    return B;

    This patch fixes the regression by rewriting the checks in the obvious
    way, no longer returning A when vq->iotlb is non-NULL (which is hard to
    understand).

    Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
    Cc: Jason Wang
    Signed-off-by: Stefan Hajnoczi
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Stefan Hajnoczi
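
The three variants can be compared exhaustively. In the sketch below, `a` and `b` stand for the checks A and B from the message and `iotlb` stands for "vq->iotlb is non-NULL". The first two functions transcribe the logic quoted above; `access_ok_rewritten` is only an assumption about what "the obvious way" looks like (A must always hold, B is skipped when IOTLB is in use), not a copy of the actual patch.

```c
#include <stdbool.h>

/* Logic before commit d65026c6c62e, as quoted in the message. */
static bool access_ok_original(bool a, bool b, bool iotlb)
{
	if (iotlb)
		return true;
	return a && b;
}

/* Logic after commit d65026c6c62e: B is skipped whenever A passes. */
static bool access_ok_regressed(bool a, bool b, bool iotlb)
{
	if (a || iotlb)
		return a;
	return b;
}

/* Assumed rewrite: A is always required, B only without IOTLB. */
static bool access_ok_rewritten(bool a, bool b, bool iotlb)
{
	if (!a)
		return false;
	if (iotlb)
		return true;
	return b;
}

/* Count the input combinations where the regressed logic disagrees. */
static int count_mismatches(void)
{
	int n = 0;

	for (int i = 0; i < 8; i++) {
		bool a = i & 1, b = i & 2, iotlb = i & 4;

		if (access_ok_regressed(a, b, iotlb) !=
		    access_ok_rewritten(a, b, iotlb))
			n++;
	}
	return n;
}
```

Under these assumptions the regressed version disagrees in two of the eight cases, both without IOTLB: it accepts when only A holds, and it accepts when only B holds.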
     
  • vhost_copy_to_user is used to copy vring used elements to userspace.
    We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.

    Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
    Signed-off-by: Eric Auger
    Acked-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Eric Auger
     

10 Apr, 2018

1 commit

  • Pull networking fixes from David Miller:

    1) The sockmap code has to free socket memory on close if there is
    corked data, from John Fastabend.

    2) Tunnel names coming from userspace need to be length validated. From
    Eric Dumazet.

    3) arp_filter() has to take VRFs properly into account, from Miguel
    Fadon Perlines.

    4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti.

    5) Missing idr_remove() in u32_delete_key(), from Cong Wang.

    6) More syzbot stuff. Several use of uninitialized value fixes all
    over, from Eric Dumazet.

    7) Do not leak kernel memory to userspace in sctp, also from Eric
    Dumazet.

    8) Discard frames from unused ports in DSA, from Andrew Lunn.

    9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas
    Falcon.

    10) Do not access dp83640 PHY registers prematurely after reset, from
    Esben Haabendal.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
    vhost-net: set packet weight of tx polling to 2 * vq size
    net: thunderx: rework mac addresses list to u64 array
    inetpeer: fix uninit-value in inet_getpeer
    dp83640: Ensure against premature access to PHY registers after reset
    devlink: convert occ_get op to separate registration
    ARM: dts: ls1021a: Specify TBIPA register address
    net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
    ibmvnic: Do not reset CRQ for Mobility driver resets
    ibmvnic: Fix failover case for non-redundant configuration
    ibmvnic: Fix reset scheduler error handling
    ibmvnic: Zero used TX descriptor counter on reset
    ibmvnic: Fix DMA mapping mistakes
    tipc: use the right skb in tipc_sk_fill_sock_diag()
    sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
    net: dsa: Discard frames from unused ports
    sctp: do not leak kernel memory to user space
    soreuseport: initialise timewait reuseport field
    ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
    dccp: initialize ireq->ir_mark
    net: fix uninit-value in __hw_addr_add_ex()
    ...

    Linus Torvalds
     

09 Apr, 2018

1 commit

  • handle_tx can delay rx for tens or even hundreds of milliseconds when tx is
    busy polling small UDP packets (e.g. a 1-byte UDP payload), because
    VHOST_NET_WEIGHT accounts only for bytes sent, not for the number of packets.

    Ping-Latencies shown below were tested between two Virtual Machines using
    netperf (UDP_STREAM, len=1), and then another machine pinged the client:

    vq size=256
    Packet-Weight   Ping-Latencies (milliseconds)
                    min      avg      max
    Origin          3.319    18.489   57.303
    64              1.643    2.021    2.552
    128             1.825    2.600    3.224
    256             1.997    2.710    4.295
    512             1.860    3.171    4.631
    1024            2.002    4.173    9.056
    2048            2.257    5.650    9.688
    4096            2.093    8.508    15.943

    vq size=512
    Packet-Weight   Ping-Latencies (milliseconds)
                    min      avg      max
    Origin          6.537    29.177   66.245
    64              2.798    3.614    4.403
    128             2.861    3.820    4.775
    256             3.008    4.018    4.807
    512             3.254    4.523    5.824
    1024            3.079    5.335    7.747
    2048            3.944    8.201    12.762
    4096            4.158    11.057   19.985

    Seems pretty consistent, a small dip at 2 VQ sizes.
    Ring size is a hint from device about a burst size it can tolerate. Based on
    benchmarks, set the weight to 2 * vq size.

    To evaluate this change, further tests were run using netperf (RR, TX) between
    two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, with the vq size
    tweaked through qemu. The results below show no obvious changes.

    vq size=256 TCP_RR                  vq size=512 TCP_RR
    size/sessions/+thu%/+normalize%     size/sessions/+thu%/+normalize%
       1/ 1/  -7%/  -2%                    1/ 1/   0%/  -2%
       1/ 4/  +1%/   0%                    1/ 4/  +1%/   0%
       1/ 8/  +1%/  -2%                    1/ 8/   0%/  +1%
      64/ 1/  -6%/   0%                   64/ 1/  +7%/  +3%
      64/ 4/   0%/  +2%                   64/ 4/  -1%/  +1%
      64/ 8/   0%/   0%                   64/ 8/  -1%/  -2%
     256/ 1/  -3%/  -4%                  256/ 1/  -4%/  -2%
     256/ 4/  +3%/  +4%                  256/ 4/  +1%/  +2%
     256/ 8/  +2%/   0%                  256/ 8/  +1%/  -1%

    vq size=256 UDP_RR                  vq size=512 UDP_RR
    size/sessions/+thu%/+normalize%     size/sessions/+thu%/+normalize%
       1/ 1/  -5%/  +1%                    1/ 1/  -3%/  -2%
       1/ 4/  +4%/  +1%                    1/ 4/  -2%/  +2%
       1/ 8/  -1%/  -1%                    1/ 8/  -1%/   0%
      64/ 1/  -2%/  -3%                   64/ 1/  +1%/  +1%
      64/ 4/  -5%/  -1%                   64/ 4/  +2%/   0%
      64/ 8/   0%/  -1%                   64/ 8/  -2%/  +1%
     256/ 1/  +7%/  +1%                  256/ 1/  -7%/   0%
     256/ 4/  +1%/  +1%                  256/ 4/  -3%/  -4%
     256/ 8/  +2%/  +2%                  256/ 8/  +1%/  +1%

    vq size=256 TCP_STREAM              vq size=512 TCP_STREAM
    size/sessions/+thu%/+normalize%     size/sessions/+thu%/+normalize%
      64/ 1/   0%/  -3%                   64/ 1/   0%/   0%
      64/ 4/  +3%/  -1%                   64/ 4/  -2%/  +4%
      64/ 8/  +9%/  -4%                   64/ 8/  -1%/  +2%
     256/ 1/  +1%/  -4%                  256/ 1/  +1%/  +1%
     256/ 4/  -1%/  -1%                  256/ 4/  -3%/   0%
     256/ 8/  +7%/  +5%                  256/ 8/  -3%/   0%
     512/ 1/  +1%/   0%                  512/ 1/  -1%/  -1%
     512/ 4/  +1%/  -1%                  512/ 4/   0%/   0%
     512/ 8/  +7%/  -5%                  512/ 8/  +6%/  -1%
    1024/ 1/   0%/  -1%                 1024/ 1/   0%/  +1%
    1024/ 4/  +3%/   0%                 1024/ 4/  +1%/   0%
    1024/ 8/  +8%/  +5%                 1024/ 8/  -1%/   0%
    2048/ 1/  +2%/  +2%                 2048/ 1/  -1%/   0%
    2048/ 4/  +1%/   0%                 2048/ 4/   0%/  -1%
    2048/ 8/  -2%/   0%                 2048/ 8/   5%/  -1%
    4096/ 1/  -2%/   0%                 4096/ 1/  -2%/   0%
    4096/ 4/  +2%/   0%                 4096/ 4/   0%/   0%
    4096/ 8/  +9%/  -2%                 4096/ 8/  -5%/  -1%

    Acked-by: Michael S. Tsirkin
    Signed-off-by: Haibin Zhang
    Signed-off-by: Yunfang Tai
    Signed-off-by: Lidong Chen
    Signed-off-by: David S. Miller

    haibinzhang(张海斌)
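
The effect of adding a packet budget on top of a byte budget can be sketched in user space. Everything named here is an assumption for illustration: `BYTE_WEIGHT` is a stand-in for VHOST_NET_WEIGHT with an illustrative value, and `packets_before_yield` is a simplified model of a TX loop, not the handle_tx code. Only the rule "packet weight = 2 * vq size" comes from the message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative byte budget; the real VHOST_NET_WEIGHT value is not quoted above. */
#define BYTE_WEIGHT 0x80000u

/* Packet budget from the commit message: 2 * vq size. */
static unsigned int pkt_weight(unsigned int vq_size)
{
	return 2 * vq_size;
}

/*
 * Number of packets of pkt_len bytes (pkt_len > 0) that a TX loop sends
 * before it yields. With use_pkt_weight == false only bytes are counted.
 */
static unsigned long packets_before_yield(size_t pkt_len, unsigned int vq_size,
					  bool use_pkt_weight)
{
	size_t total_len = 0;
	unsigned long sent = 0;

	for (;;) {
		total_len += pkt_len;
		sent++;
		if (total_len >= BYTE_WEIGHT)
			break;	/* byte budget exhausted */
		if (use_pkt_weight && sent >= pkt_weight(vq_size))
			break;	/* packet budget exhausted */
	}
	return sent;
}
```

With 1-byte packets and a bytes-only budget, the loop in this model runs for as many packets as the byte budget allows before yielding; with the packet budget and a vq size of 256 it yields after 512 packets. Large packets still hit the byte budget first, so their behaviour is unchanged.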
     

07 Apr, 2018

1 commit

  • Pull fw_cfg, vhost updates from Michael Tsirkin:
    "This cleans up the qemu fw cfg device driver.

    On top of this, vmcore is dumped there on crash to help debugging
    with kASLR enabled.

    Also included are some fixes in vhost"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    vhost: add vsock compat ioctl
    vhost: fix vhost ioctl signature to build with clang
    fw_cfg: write vmcoreinfo details
    crash: export paddr_vmcoreinfo_note()
    fw_cfg: add DMA register
    fw_cfg: add a public uapi header
    fw_cfg: handle fw_cfg_read_blob() error
    fw_cfg: remove inline from fw_cfg_read_blob()
    fw_cfg: fix sparse warnings around FW_CFG_FILE_DIR read
    fw_cfg: fix sparse warning reading FW_CFG_ID
    fw_cfg: fix sparse warnings with fw_cfg_file
    fw_cfg: fix sparse warnings in fw_cfg_sel_endianness()
    ptr_ring: fix build

    Linus Torvalds
     

02 Apr, 2018

1 commit


30 Mar, 2018

1 commit

  • The vq log_base is the userspace address of the log bitmap, which has
    nothing to do with IOTLB. It therefore needs to be validated
    unconditionally; otherwise we may try to use 0 as log_base, which may pin
    pages and lead to unexpected results (e.g. triggering BUG_ON() in
    set_bit_to_user()).

    Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
    Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

28 Mar, 2018

1 commit

  • We tried to remove the vq poll from the wait queue without checking whether
    it was on a list in the first place. This can lead to a double free. Fix
    this by switching to vhost_poll_stop(), which zeros poll->wqh after
    removing the poll from the waitqueue so that it won't be freed twice.

    Cc: Darren Kenny
    Reported-by: syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
    Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
    Signed-off-by: Jason Wang
    Reviewed-by: Darren Kenny
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
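
The idea of clearing the pointer so that a second stop is harmless can be shown with a small user-space model. The types and functions below (`wait_queue`, `poll_ctx`, `poll_start`, `poll_stop`) are hypothetical stand-ins, not the kernel structures; only the "zero wqh after removal" behaviour is taken from the message.

```c
#include <stddef.h>

/* Hypothetical stand-in for a wait queue head: just counts its entries. */
struct wait_queue {
	int entries;
};

/* Hypothetical stand-in for the poll object; wqh is NULL when not queued. */
struct poll_ctx {
	struct wait_queue *wqh;
};

/* Queue the poll on q unless it is already queued somewhere. */
static void poll_start(struct poll_ctx *p, struct wait_queue *q)
{
	if (p->wqh)
		return;
	q->entries++;
	p->wqh = q;
}

/*
 * Remove the poll from its queue and forget the queue pointer.
 * Because wqh is cleared, calling this twice (or without a prior
 * start) does nothing the second time.
 */
static void poll_stop(struct poll_ctx *p)
{
	if (p->wqh) {
		p->wqh->entries--;
		p->wqh = NULL;
	}
}
```

A removal that does not check and clear the pointer would decrement the entry count twice in this model, which corresponds to the double removal the commit fixes.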
     

27 Mar, 2018

1 commit

  • We try to hold the TX virtqueue mutex in vhost_net_rx_peek_head_len()
    after the RX virtqueue mutex is held in handle_rx(). This requires an
    appropriate lock nesting annotation to calm down the deadlock detector.

    Fixes: 0308813724606 ("vhost_net: basic polling support")
    Reported-by: syzbot+7f073540b1384a614e09@syzkaller.appspotmail.com
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

23 Mar, 2018

1 commit

  • Fun set of conflict resolutions here...

    For the mac80211 stuff, these were fortunately just parallel
    adds. Trivially resolved.

    In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
    function phy_disable_interrupts() earlier in the file, whilst in
    'net-next' the phy_error() call from this function was removed.

    In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
    'rt_table_id' member of rtable collided with a bug fix in 'net' that
    added a new struct member "rt_mtu_locked" which needs to be copied
    over here.

    The mlxsw driver conflict consisted of net-next separating
    the span code and definitions into separate files, whilst
    a 'net' bug fix made some changes to that moved code.

    The mlx5 infiniband conflict resolution was quite non-trivial,
    the RDMA tree's merge commit was used as a guide here, and
    here are their notes:

    ====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch. This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
    drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f9524
    (IB/mlx5: Fix cleanup order on unload) added to for-rc and
    commit b5ca15ad7e61 (IB/mlx5: Add proper representors support)
    add as part of the devel cycle both needed to modify the
    init/de-init functions used by mlx5. To support the new
    representors, the new functions added by the cleanup patch
    needed to be made non-static, and the init/de-init list
    added by the representors patch needed to be modified to
    match the init/de-init list changes made by the cleanup
    patch.
    Updates:
    drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
    prototypes added by representors patch to reflect new function
    names as changed by cleanup patch
    drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
    stage list to match new order from cleanup patch
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Mar, 2018

2 commits

  • This will allow usage of vsock from 32-bit binaries on a 64-bit
    kernel.

    Signed-off-by: Sonny Rao
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Sonny Rao
     
  • Clang is particularly anal about signed vs unsigned comparisons and
    doesn't like the fact that some ioctl numbers set the MSB, so we get
    this error when trying to build vhost on aarch64:

    drivers/vhost/vhost.c:1400:7: error: overflow converting case value to
    switch condition type (3221794578 to 18446744072636378898)
    [-Werror, -Wswitch]
    case VHOST_GET_VRING_BASE:

    3221794578 is 0xC008AF12 in hex
    18446744072636378898 is 0xFFFFFFFFC008AF12 in hex

    Fix this by using unsigned ints in the function signature for
    vhost_vring_ioctl().

    Signed-off-by: Sonny Rao
    Reviewed-by: Darren Kenny
    Signed-off-by: Michael S. Tsirkin

    Sonny Rao
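
The two numbers in the error message are the same 32-bit pattern widened to 64 bits in two different ways. The sketch below reproduces both; `widen_signed` and `widen_unsigned` are hypothetical helper names, and the signed path relies on the implementation-defined conversion to `int32_t` wrapping modulo 2^32, as GCC and Clang do.

```c
#include <stdint.h>

/*
 * Widen a 32-bit ioctl number through a signed 32-bit type.
 * If the MSB is set, sign extension fills the upper 32 bits with ones.
 */
static uint64_t widen_signed(uint32_t cmd)
{
	return (uint64_t)(int64_t)(int32_t)cmd;
}

/* Widen through an unsigned type: the upper 32 bits stay zero. */
static uint64_t widen_unsigned(uint32_t cmd)
{
	return (uint64_t)cmd;
}
```

For 0xC008AF12 (3221794578), the signed path yields 0xFFFFFFFFC008AF12 (18446744072636378898), while the unsigned path keeps the value unchanged. That is why using unsigned ints in the signature avoids the mismatch.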
     

10 Mar, 2018

4 commits

  • After commit fc72d1d54dd9 ("tuntap: XDP transmission"), we can
    actually queue XDP pointers in the pointer ring, so we should
    examine the pointer type before freeing the pointer.

    Fixes: fc72d1d54dd9 ("tuntap: XDP transmission")
    Reported-by: Michael S. Tsirkin
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • We get the pointer ring from the exported sock, which means we should
    keep rx_ring and vq->private synced during both vq stop and backend set;
    otherwise we may see a stale rx_ring.

    Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array")
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     
  • KMSAN reported a use of uninit memory in vhost_net_buf_unproduce()
    while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring:

    ==================================================================
    BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vho
    et.c:170
    CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x185/0x1d0 lib/dump_stack.c:53
    kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093
    __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
    vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170
    vhost_net_stop_vq drivers/vhost/net.c:974 [inline]
    vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982
    vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015
    __fput+0x49f/0xa00 fs/file_table.c:209
    ____fput+0x37/0x40 fs/file_table.c:243
    task_work_run+0x243/0x2c0 kernel/task_work.c:113
    tracehook_notify_resume include/linux/tracehook.h:191 [inline]
    exit_to_usermode_loop arch/x86/entry/common.c:166 [inline]
    prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196
    syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265
    do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292
    ...
    origin:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline]
    kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213
    kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392
    kmalloc_large_node_hook mm/slub.c:1366 [inline]
    kmalloc_large_node mm/slub.c:3808 [inline]
    __kmalloc_node+0x100e/0x1290 mm/slub.c:3818
    kmalloc_node include/linux/slab.h:554 [inline]
    kvmalloc_node+0x1a5/0x2e0 mm/util.c:419
    kvmalloc include/linux/mm.h:541 [inline]
    vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921
    misc_open+0x7b5/0x8b0 drivers/char/misc.c:154
    chrdev_open+0xc28/0xd90 fs/char_dev.c:417
    do_dentry_open+0xccb/0x1430 fs/open.c:752
    vfs_open+0x272/0x2e0 fs/open.c:866
    do_last fs/namei.c:3378 [inline]
    path_openat+0x49ad/0x6580 fs/namei.c:3519
    do_filp_open+0x267/0x640 fs/namei.c:3553
    do_sys_open+0x6ad/0x9c0 fs/open.c:1059
    SYSC_openat+0xc7/0xe0 fs/open.c:1086
    SyS_openat+0x63/0x90 fs/open.c:1080
    do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287
    ==================================================================

    Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array")
    Signed-off-by: Alexander Potapenko
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Alexander Potapenko
     
  • Fixed a coding style issue.

    Signed-off-by: Vaibhav Murkute
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: David S. Miller

    Vaibhav Murkute
     

13 Feb, 2018

1 commit

  • Changes since v1:
    Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

    Before:
    All these functions either return a negative error indicator,
    or store length of sockaddr into "int *socklen" parameter
    and return zero on success.

    "int *socklen" parameter is awkward. For example, if caller does not
    care, it still needs to provide on-stack storage for the value
    it does not need.

    None of the many FOO_getname() functions of various protocols
    ever used old value of *socklen. They always just overwrite it.

    This change drops this parameter, and makes all these functions, on success,
    return length of sockaddr. It's always >= 0 and can be differentiated
    from an error.

    Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

    rpc_sockname() lost "int buflen" parameter, since its only use was
    to be passed to kernel_getsockname() as &buflen and subsequently
    not used in any way.

    Userspace API is not changed.

    text data bss dec hex filename
    30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
    30108109 2633612 873672 33615393 200ee21 vmlinux.o

    Signed-off-by: Denys Vlasenko
    CC: David S. Miller
    CC: linux-kernel@vger.kernel.org
    CC: netdev@vger.kernel.org
    CC: linux-bluetooth@vger.kernel.org
    CC: linux-decnet-user@lists.sourceforge.net
    CC: linux-wireless@vger.kernel.org
    CC: linux-rdma@vger.kernel.org
    CC: linux-sctp@vger.kernel.org
    CC: linux-nfs@vger.kernel.org
    CC: linux-x25@vger.kernel.org
    Signed-off-by: David S. Miller

    Denys Vlasenko
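
The before/after calling convention can be sketched with a toy getname function. `demo_addr`, `demo_getname_old`, and `demo_getname_new` are hypothetical names used only for illustration; they are not kernel APIs.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a socket address. */
struct demo_addr {
	unsigned short family;
	char data[14];
};

/* Old style: 0 on success, length stored through *socklen, -errno on error. */
static int demo_getname_old(const struct demo_addr *src, struct demo_addr *out,
			    int *socklen)
{
	if (!src)
		return -ENOTCONN;
	memcpy(out, src, sizeof(*out));
	*socklen = (int)sizeof(*out);
	return 0;
}

/* New style: length (>= 0) returned directly, -errno on error. */
static int demo_getname_new(const struct demo_addr *src, struct demo_addr *out)
{
	if (!src)
		return -ENOTCONN;
	memcpy(out, src, sizeof(*out));
	return (int)sizeof(*out);
}
```

A caller of the new style no longer needs a dummy `int` on the stack, but it must test `err < 0` rather than `err`, since a successful call now returns a positive length.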
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Feb, 2018

1 commit

  • Pull virtio/vhost updates from Michael Tsirkin:
    "virtio, vhost: fixes, cleanups, features

    This includes the disk/cache memory stats for the virtio balloon,
    as well as multiple fixes and cleanups"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    vhost: don't hold onto file pointer for VHOST_SET_LOG_FD
    vhost: don't hold onto file pointer for VHOST_SET_VRING_ERR
    vhost: don't hold onto file pointer for VHOST_SET_VRING_CALL
    ringtest: ring.c malloc & memset to calloc
    virtio_vop: don't kfree device on register failure
    virtio_pci: don't kfree device on register failure
    virtio: split device_register into device_initialize and device_add
    vhost: remove unused lock check flag in vhost_dev_cleanup()
    vhost: Remove the unused variable.
    virtio_blk: print capacity at probe time
    virtio: make VIRTIO a menuconfig to ease disabling it all
    virtio/ringtest: virtio_ring: fix up need_event math
    virtio/ringtest: fix up need_event math
    virtio: virtio_mmio: make of_device_ids const.
    firmware: Use PTR_ERR_OR_ZERO()
    virtio-mmio: Use PTR_ERR_OR_ZERO()
    vhost/scsi: Improve a size determination in four functions
    virtio_balloon: include disk/file caches memory statistics

    Linus Torvalds
     

01 Feb, 2018

6 commits

  • We already hold a reference to the eventfd_ctx, which is sufficient;
    there's no need to hold a reference to the struct file as well. So get
    rid of vhost_dev->log_file.

    Signed-off-by: Eric Biggers
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Jason Wang

    Eric Biggers
     
  • We already hold a reference to the eventfd_ctx, which is sufficient;
    there's no need to hold a reference to the struct file as well. So get
    rid of vhost_virtqueue->error.

    Signed-off-by: Eric Biggers
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Jason Wang

    Eric Biggers
     
  • We already hold a reference to the eventfd_ctx, which is sufficient;
    there's no need to hold a reference to the struct file as well. So get
    rid of vhost_virtqueue->call.

    Signed-off-by: Eric Biggers
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Jason Wang

    Eric Biggers
     
  • In commit ea5d404655ba ("vhost: fix release path lockdep checks"),
    Michael added a flag to check whether we should hold a lock in
    vhost_dev_cleanup(). However, since commit 47283bef7ed3 ("vhost: move
    memory pointer to VQs") replaced the RCU operations with a mutex,
    the `locked' parameter is no longer used and can be removed.

    Signed-off-by: Caspar Zhang
    Signed-off-by: Michael S. Tsirkin

    夷则(Caspar)
     
  • The patch (7235acdb1) changed the way work flushing is done, so
    the queued seq, the done seq, and the flushing are no longer
    used. Remove them.

    Fixes: 7235acdb1 ("vhost: simplify work flushing")
    Cc: Jason Wang
    Signed-off-by: Tonghao Zhang
    Acked-by: Jason Wang
    Signed-off-by: Michael S. Tsirkin

    Tonghao Zhang
     
  • Pull networking updates from David Miller:

    1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

    2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

    3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

    4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

    5) Add eBPF based queue selection to tun, from Jason Wang.

    6) Lockless qdisc support, from John Fastabend.

    7) SCTP stream interleave support, from Xin Long.

    8) Smoother TCP receive autotuning, from Eric Dumazet.

    9) Lots of erspan tunneling enhancements, from William Tu.

    10) Add true function call support to BPF, from Alexei Starovoitov.

    11) Add explicit support for GRO HW offloading, from Michael Chan.

    12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

    13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

    14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

    15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

    16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

    17) Add resource abstraction to devlink, from Arkadi Sharshevsky.

    18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

    19) Avoid locking in act_csum, from Davide Caratti.

    20) devinet_ioctl() simplifications from Al Viro.

    21) More TCP bpf improvements from Lawrence Brakmo.

    22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
    tls: Add support for encryption using async offload accelerator
    ip6mr: fix stale iterator
    net/sched: kconfig: Remove blank help texts
    openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
    tcp_nv: fix potential integer overflow in tcpnv_acked
    r8169: fix RTL8168EP take too long to complete driver initialization.
    qmi_wwan: Add support for Quectel EP06
    rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
    ipmr: Fix ptrdiff_t print formatting
    ibmvnic: Wait for device response when changing MAC
    qlcnic: fix deadlock bug
    tcp: release sk_frag.page in tcp_disconnect
    ipv4: Get the address of interface correctly.
    net_sched: gen_estimator: fix lockdep splat
    net: macb: Handle HRESP error
    net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
    ipv6: addrconf: break critical section in addrconf_verify_rtnl()
    ipv6: change route cache aging logic
    i40e/i40evf: Update DESC_NEEDED value to reflect larger value
    bnxt_en: cleanup DIM work on device shutdown
    ...

    Linus Torvalds
     

31 Jan, 2018

3 commits

  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the amount of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
     
  • Replace the specification of four data structures by pointer dereferences
    as the parameter for the operator "sizeof" to make the corresponding size
    determination a bit safer according to the Linux coding style convention.

    Signed-off-by: Markus Elfring
    Reviewed-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin

    Markus Elfring
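
The difference between the two sizeof forms is easiest to see side by side. `demo_cmd`, `alloc_by_type`, and `alloc_by_deref` are hypothetical names; the sketch uses user-space `calloc` rather than a kernel allocator.

```c
#include <stdlib.h>

/* Hypothetical structure standing in for one of the four data structures. */
struct demo_cmd {
	int tag;
	char payload[60];
};

/* Size spelled out by type: must be edited by hand if cmd's type changes. */
static struct demo_cmd *alloc_by_type(void)
{
	struct demo_cmd *cmd = calloc(1, sizeof(struct demo_cmd));

	return cmd;
}

/* Size taken from the pointer itself: follows the declaration automatically. */
static struct demo_cmd *alloc_by_deref(void)
{
	struct demo_cmd *cmd = calloc(1, sizeof(*cmd));

	return cmd;
}
```

Both functions allocate the same number of bytes today. The second form stays correct if the pointer's type is later changed, which is the safety argument made in the commit message.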
     
  • Pull RCU updates from Ingo Molnar:
    "The main RCU changes in this cycle were:

    - Updates to use cond_resched() instead of cond_resched_rcu_qs()
    where feasible (currently everywhere except in kernel/rcu and in
    kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
    offline CPUs.

    - Updates to simplify RCU's dyntick-idle handling.

    - Updates to remove almost all uses of smp_read_barrier_depends() and
    read_barrier_depends().

    - Torture-test updates.

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    torture: Save a line in stutter_wait(): while -> for
    torture: Eliminate torture_runnable and perf_runnable
    torture: Make stutter less vulnerable to compilers and races
    locking/locktorture: Fix num reader/writer corner cases
    locking/locktorture: Fix rwsem reader_delay
    torture: Place all torture-test modules in one MAINTAINERS group
    rcutorture/kvm-build.sh: Skip build directory check
    rcutorture: Simplify functions.sh include path
    rcutorture: Simplify logging
    rcutorture/kvm-recheck-*: Improve result directory readability check
    rcutorture/kvm.sh: Support execution from any directory
    rcutorture/kvm.sh: Use consistent help text for --qemu-args
    rcutorture/kvm.sh: Remove unused variable, `alldone`
    rcutorture: Remove unused script, config2frag.sh
    rcutorture/configinit: Fix build directory error message
    rcutorture: Preempt RCU-preempt readers more vigorously
    torture: Reduce #ifdefs for preempt_schedule()
    rcu: Remove have_rcu_nocb_mask from tree_plugin.h
    rcu: Add comment giving debug strategy for double call_rcu()
    tracing, rcu: Hide trace event rcu_nocb_wake when not used
    ...

    Linus Torvalds
     

30 Jan, 2018

1 commit

  • We don't stop the device before resetting the owner, which means we could
    try to serve a virtqueue kick before dev->worker is reset. This results in
    a warning, since the work is still pending on the llist during owner
    reset. Fix this by stopping the device during owner reset.

    Reported-by: syzbot+eb17c6162478cc50632c@syzkaller.appspotmail.com
    Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

25 Jan, 2018

3 commits


11 Jan, 2018

1 commit

  • This patch batches used ring updates during RX. This fits well when the
    guest is much faster (e.g. a dpdk based backend). In this case, the used
    ring is almost empty:

    - we may get serious cache line misses/contention on both the used ring
    and used idx.
    - at most 1 packet can be dequeued at a time, so batching in the guest
    has little effect.

    Updating the used ring in a batch can help, since the guest won't access
    the used ring until used idx has advanced by several descriptors. And since
    we advance the used ring only every N packets, the guest only needs to
    access used idx every N packets, because it can cache the used idx. To
    interact better with both batch dequeuing and dpdk batching,
    VHOST_RX_BATCH is used as the maximum number of descriptors that
    can be batched.
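
The batching scheme described above can be modelled in user space by counting how often the used index is published. This is a minimal sketch under stated assumptions: `used_batcher` and its helpers are hypothetical names, and the batch size is a plain parameter playing the role that VHOST_RX_BATCH plays in the message.

```c
/* Hypothetical model of a used ring whose index is published in batches. */
struct used_batcher {
	unsigned int batch;      /* max descriptors held back before a flush */
	unsigned int pending;    /* descriptors completed but not yet published */
	unsigned int used_idx;   /* index value visible to the guest */
	unsigned int idx_writes; /* how many times used_idx was published */
};

/* Publish all pending descriptors with a single used_idx update. */
static void batcher_flush(struct used_batcher *b)
{
	if (!b->pending)
		return;
	b->used_idx += b->pending;
	b->pending = 0;
	b->idx_writes++;
}

/* Record one completed descriptor; flush once the batch is full. */
static void batcher_add(struct used_batcher *b)
{
	if (++b->pending >= b->batch)
		batcher_flush(b);
}

/* Number of used_idx updates needed to complete the given packet count. */
static unsigned int idx_writes_for(unsigned int packets, unsigned int batch)
{
	struct used_batcher b = { .batch = batch };

	for (unsigned int i = 0; i < packets; i++)
		batcher_add(&b);
	batcher_flush(&b);	/* publish any remainder */
	return b.idx_writes;
}
```

In this model, 200 packets with a batch of 64 need 4 index updates instead of 200, which is the reduction in shared cache line traffic that the commit is after.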

    Tests were done between two machines with 2.40GHz Intel(R) Xeon(R) CPU
    E5-2630 connected back to back through ixgbe. Traffic was generated
    on one remote ixgbe through MoonGen, and RX pps was measured through
    testpmd in the guest while doing xdp_redirect_map from the local ixgbe to
    tap. RX pps increased from 3.05 Mpps to 4.00 Mpps (about 31%
    improvement).

    One possible concern for this is the implications for TCP (especially
    latency sensitive workload). Result[1] does not show obvious changes
    for most of the netperf test (RR, TX, and RX). And we do get some
    improvements for RX on some specific size.

    Guest RX:

    size/sessions/+thu%/+normalize%
    64/ 1/ +2%/ +2%
    64/ 2/ +2%/ -1%
    64/ 4/ +1%/ +1%
    64/ 8/ 0%/ 0%
    256/ 1/ +6%/ -3%
    256/ 2/ -3%/ +2%
    256/ 4/ +11%/ +11%
    256/ 8/ 0%/ 0%
    512/ 1/ +4%/ 0%
    512/ 2/ +2%/ +2%
    512/ 4/ 0%/ -1%
    512/ 8/ -8%/ -8%
    1024/ 1/ -7%/ -17%
    1024/ 2/ -8%/ -7%
    1024/ 4/ +1%/ 0%
    1024/ 8/ 0%/ 0%
    2048/ 1/ +30%/ +14%
    2048/ 2/ +46%/ +40%
    2048/ 4/ 0%/ 0%
    2048/ 8/ 0%/ 0%
    4096/ 1/ +23%/ +22%
    4096/ 2/ +26%/ +23%
    4096/ 4/ 0%/ +1%
    4096/ 8/ 0%/ 0%
    16384/ 1/ -2%/ -3%
    16384/ 2/ +1%/ -4%
    16384/ 4/ -1%/ -3%
    16384/ 8/ 0%/ -1%
    65535/ 1/ +15%/ +7%
    65535/ 2/ +4%/ +7%
    65535/ 4/ 0%/ +1%
    65535/ 8/ 0%/ 0%

    TCP_RR:

    size/sessions/+thu%/+normalize%
    1/ 1/ 0%/ +1%
    1/ 25/ +2%/ +1%
    1/ 50/ +4%/ +1%
    64/ 1/ 0%/ -4%
    64/ 25/ +2%/ +1%
    64/ 50/ 0%/ -1%
    256/ 1/ 0%/ 0%
    256/ 25/ 0%/ 0%
    256/ 50/ +4%/ +2%

    Guest TX:

    size/sessions/+thu%/+normalize%
    64/ 1/ +4%/ -2%
    64/ 2/ -6%/ -5%
    64/ 4/ +3%/ +6%
    64/ 8/ 0%/ +3%
    256/ 1/ +15%/ +16%
    256/ 2/ +11%/ +12%
    256/ 4/ +1%/ 0%
    256/ 8/ +5%/ +5%
    512/ 1/ -1%/ -6%
    512/ 2/ 0%/ -8%
    512/ 4/ -2%/ +4%
    512/ 8/ +6%/ +9%
    1024/ 1/ +3%/ +1%
    1024/ 2/ +3%/ +9%
    1024/ 4/ 0%/ +7%
    1024/ 8/ 0%/ +7%
    2048/ 1/ +8%/ +2%
    2048/ 2/ +3%/ -1%
    2048/ 4/ -1%/ +11%
    2048/ 8/ +3%/ +9%
    4096/ 1/ +8%/ +8%
    4096/ 2/ 0%/ -7%
    4096/ 4/ +4%/ +4%
    4096/ 8/ +2%/ +5%
    16384/ 1/ -3%/ +1%
    16384/ 2/ -1%/ -12%
    16384/ 4/ -1%/ +5%
    16384/ 8/ 0%/ +1%
    65535/ 1/ 0%/ -3%
    65535/ 2/ +5%/ +16%
    65535/ 4/ +1%/ +2%
    65535/ 8/ +1%/ -1%

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
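
The batching idea in the commit message above can be sketched in plain C. This is a minimal user-space model, not the vhost code: the `demo_*` names, the ring size, and the batch size of 64 are assumptions made for illustration (the commit message names `VHOST_RX_BATCH` but does not state its value). It shows that used entries are collected privately and the guest-visible index is published once per batch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for this sketch; the commit only names VHOST_RX_BATCH. */
#define DEMO_RX_BATCH 64
#define DEMO_RING_SIZE 256

struct demo_used_elem { uint32_t id; uint32_t len; };

struct demo_vq {
    struct demo_used_elem ring[DEMO_RING_SIZE];   /* ring shared with the guest */
    uint16_t used_idx;                            /* index visible to the guest */
    unsigned int idx_writes;                      /* times used_idx was published */
    struct demo_used_elem pending[DEMO_RX_BATCH]; /* host-private batch */
    size_t npending;
};

/* Copy the pending entries into the ring and publish used_idx once. */
static void demo_flush(struct demo_vq *vq)
{
    if (vq->npending == 0)
        return;
    for (size_t i = 0; i < vq->npending; i++)
        vq->ring[(vq->used_idx + i) % DEMO_RING_SIZE] = vq->pending[i];
    /* A real implementation needs a write barrier before the index update. */
    vq->used_idx = (uint16_t)(vq->used_idx + vq->npending);
    vq->idx_writes++;
    vq->npending = 0;
}

/* Record one received packet; flush only when the batch is full. */
static void demo_add_used(struct demo_vq *vq, uint32_t id, uint32_t len)
{
    assert(vq->npending < DEMO_RX_BATCH);
    vq->pending[vq->npending].id = id;
    vq->pending[vq->npending].len = len;
    if (++vq->npending == DEMO_RX_BATCH)
        demo_flush(vq);
}
```

A caller would invoke `demo_add_used()` per packet and `demo_flush()` when the receive loop ends, so that a partial batch is not left unpublished.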
     

09 Jan, 2018

2 commits

  • This patch implements XDP transmission for TAP. Since we can't create
    new queues for TAP during XDP set, the existing ptr_ring is reused for
    queuing XDP buffers. To distinguish an xdp_buff from an sk_buff,
    TUN_XDP_FLAG (0x1UL) is encoded into the lowest bit of the xdp_buff
    pointer during ptr_ring_produce and decoded during consumption. XDP
    metadata is stored in the headroom of the packet, which should work in
    most cases since drivers usually reserve enough headroom. Very minor
    changes were made to vhost_net: it just needs to peek the length
    depending on the type of pointer.

    Tests were done on two Intel E5-2630 2.40GHz machines connected back
    to back through two 82599ES. Traffic was generated/received through
    MoonGen/testpmd(rxonly). It reports ~20% improvement when
    xdp_redirect_map redirects from ixgbe to TAP (from 2.50Mpps to
    3.05Mpps).

    Cc: Jesper Dangaard Brouer
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch switches to ptr_ring instead of skb_array. This will be
    used to enqueue different types of pointers by encoding the type into
    the lower bits.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
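
The low-bit pointer encoding described in the two TAP commits above can be sketched as follows. Only the flag name and value (`TUN_XDP_FLAG`, `0x1UL`) come from the commit message; the `demo_*` structs and helpers are hypothetical stand-ins for the kernel types and are not the tun driver's actual functions. The sketch relies on the queued objects being at least 2-byte aligned, so the lowest pointer bit is always free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag value taken from the commit message; everything else is illustrative. */
#define TUN_XDP_FLAG 0x1UL

/* Hypothetical stand-ins for the kernel's struct xdp_buff and struct sk_buff. */
struct demo_xdp_buff { unsigned int len; };
struct demo_sk_buff  { unsigned int len; };

/* Tag an XDP buffer pointer before it is queued. */
static void *demo_xdp_to_ptr(struct demo_xdp_buff *xdp)
{
    /* The low bit must be free, i.e. the object is at least 2-byte aligned. */
    assert(((uintptr_t)xdp & TUN_XDP_FLAG) == 0);
    return (void *)((uintptr_t)xdp | TUN_XDP_FLAG);
}

/* True if the queued pointer carries the XDP tag. */
static bool demo_is_xdp(const void *ptr)
{
    return ((uintptr_t)ptr & TUN_XDP_FLAG) != 0;
}

/* Strip the tag to recover the original XDP buffer pointer. */
static struct demo_xdp_buff *demo_ptr_to_xdp(void *ptr)
{
    return (struct demo_xdp_buff *)((uintptr_t)ptr & ~TUN_XDP_FLAG);
}

/* Peek the length of a queued entry, whichever type it is. */
static unsigned int demo_peek_len(void *ptr)
{
    if (demo_is_xdp(ptr))
        return demo_ptr_to_xdp(ptr)->len;
    return ((struct demo_sk_buff *)ptr)->len;
}
```

The consumer never dereferences a tagged pointer directly; it checks the bit first and masks it off, which is why a single ring can carry both buffer types.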
     

03 Jan, 2018

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    - Updates to use cond_resched() instead of cond_resched_rcu_qs()
    where feasible (currently everywhere except in kernel/rcu and
    in kernel/torture.c). Also a couple of fixes to avoid sending
    IPIs to offline CPUs.

    - Updates to simplify RCU's dyntick-idle handling.

    - Updates to remove almost all uses of smp_read_barrier_depends()
    and read_barrier_depends().

    - Miscellaneous fixes.

    - Torture-test updates.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

06 Dec, 2017

1 commit


03 Dec, 2017

1 commit

  • Matthew found a roughly 40% tcp throughput regression with commit
    c67df11f(vhost_net: try batch dequing from skb array) as discussed
    in the following thread:
    https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

    Eventually we figured out that it was a skb leak in handle_rx()
    when sending packets to the VM. This usually happens when a guest
    cannot drain the vq as fast as vhost fills it. This sets off a
    traffic jam and leaks skb(s), which occurs when there is no
    headcount to send on the vq from the vhost side.

    This can be avoided by making sure we have enough headcount
    before actually consuming a skb from the batched rx array while
    transmitting, which is done simply by moving the zero-headcount
    check a bit earlier.

    Signed-off-by: Wei Xu
    Reported-by: Matthew Rosato
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Wei Xu
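
The ordering fix described in the commit message above can be illustrated with a small model. This is a sketch under stated assumptions, not the vhost_net code: `demo_rx`, `demo_get_rx_bufs()` and `demo_handle_rx()` are hypothetical names, and each packet is assumed to need exactly one guest buffer. The point is only that the headcount check happens before a packet is taken out of the batch, so a full guest queue leaves the packet in place instead of losing it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a batch of dequeued packets and a guest queue
 * with a limited number of free buffers. */
struct demo_rx {
    int batch[8];      /* packets already pulled from the socket queue */
    size_t head, tail; /* batch[head..tail) is still unconsumed */
    size_t free_bufs;  /* buffers the guest has made available */
    size_t delivered;  /* packets handed to the guest so far */
};

/* Returns the number of buffers reserved for the next packet,
 * or 0 if the guest has none available. */
static size_t demo_get_rx_bufs(const struct demo_rx *rx)
{
    return rx->free_bufs > 0 ? 1 : 0;
}

/* Deliver packets until either the batch or the guest buffers run out. */
static void demo_handle_rx(struct demo_rx *rx)
{
    assert(rx->head <= rx->tail);
    while (rx->head < rx->tail) {
        /* Check headcount first: nothing has been consumed yet, so
         * stopping here leaves the packet in the batch for later. */
        size_t headcount = demo_get_rx_bufs(rx);
        if (headcount == 0)
            break;
        /* Only now take the packet out of the batch. */
        (void)rx->batch[rx->head++];
        rx->free_bufs -= headcount;
        rx->delivered++;
    }
}
```

With the check placed after the dequeue instead, the packet removed in the final iteration would have no buffer to go to and no owner left to free it.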
     

29 Nov, 2017

1 commit