31 Jan, 2019

1 commit

  • [ Upstream commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4 ]

    Vhost dirty page logging API is designed to sync through GPA. But we
    try to log GIOVA when device IOTLB is enabled. This is wrong and may
    lead to missing data after migration.

    To solve this issue, when logging with device IOTLB enabled, we will:

    1) reuse the device IOTLB translation result of the GIOVA->HVA mapping
    to get the HVA: for a writable descriptor, get the HVA through the
    iovec; for a used ring update, translate its GIOVA to HVA
    2) traverse the GPA->HVA mapping to get the possible GPA and log
    through GPA. Note that this reverse mapping is not guaranteed
    to be unique, so we should log each possible GPA in this case.

    This fixes the failure of scp to guest during migration. In -next, we
    will probably support passing GIOVA->GPA instead of GIOVA->HVA.

    Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
    Reported-by: Jintack Lim
    Cc: Jintack Lim
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
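
Step 2 of the commit above can be sketched in userspace as follows. This is a minimal illustration, assuming a hypothetical `struct mem_region` table and a byte-per-page dirty log; the kernel's actual data structures and helper names differ. Every region whose HVA range contains the written address contributes a GPA, and each one is logged.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical GPA->HVA region descriptor (illustration only). */
struct mem_region {
    uint64_t gpa;   /* guest physical start address */
    uint64_t hva;   /* host virtual start address */
    uint64_t size;  /* region length in bytes */
};

/*
 * Mark the page of every GPA that maps to `hva` as dirty.
 * The reverse mapping may not be unique, so all matches are logged.
 * Returns the number of GPAs logged.
 */
static int log_write_hva(const struct mem_region *regions, size_t n,
                         uint64_t hva, uint8_t *dirty, size_t dirty_pages)
{
    int logged = 0;

    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];

        /* Skip regions whose HVA range does not contain the address. */
        if (hva < r->hva || hva - r->hva >= r->size)
            continue;

        uint64_t page = (r->gpa + (hva - r->hva)) >> PAGE_SHIFT;
        if (page < dirty_pages) {
            dirty[page] = 1;
            logged++;
        }
    }
    return logged;
}
```

If two regions alias the same host memory, a single write marks one page in each guest physical range, which is the behaviour the commit message asks for.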
     

13 Jan, 2019

1 commit

  • commit a72b69dc083a931422cc8a5e33841aff7d5312f2 upstream.

    The vhost_vsock->guest_cid field is uninitialized when /dev/vhost-vsock
    is opened until the VHOST_VSOCK_SET_GUEST_CID ioctl is called.

    kvmalloc(..., GFP_KERNEL | __GFP_RETRY_MAYFAIL) does not zero memory.
    All other vhost_vsock fields are initialized explicitly so just
    initialize this field too.

    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin
    Cc: Daniel Verkamp
    Signed-off-by: Greg Kroah-Hartman

    Stefan Hajnoczi
     

10 Jan, 2019

1 commit

  • [ Upstream commit 841df922417eb82c835e93d4b93eb6a68c99d599 ]

    We miss a write barrier that guarantees used idx is updated and seen
    before log. This will let userspace sync and copy used ring before
    used idx is updated. Fix this by adding a barrier before log_write().

    Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support")
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
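
The ordering requirement in the commit above can be illustrated in userspace with C11 atomics. This is only an analogy: the kernel uses its own barrier primitives, and the `struct used_ring` layout and `log_dirty` flag below are hypothetical stand-ins for the used ring and the dirty log.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the used ring and dirty log. */
struct used_ring {
    uint16_t idx;           /* used index read by the consumer */
    atomic_bool log_dirty;  /* set when the ring page must be re-copied */
};

/* Publish a new used index, then log the write. */
static void publish_used_idx(struct used_ring *ring, uint16_t new_idx)
{
    ring->idx = new_idx;

    /*
     * Release fence: the store to idx above is ordered before the log
     * store below, so a reader that observes the log entry (with acquire
     * ordering) also observes the updated index.
     */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&ring->log_dirty, true, memory_order_relaxed);
}
```

Without the fence, a reader could see the log entry first and copy the ring while it still holds the old index.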
     

21 Dec, 2018

1 commit

  • [ Upstream commit c38f57da428b033f2721b611d84b1f40bde674a8 ]

    If a local process has closed a connected socket and hasn't received a
    RST packet yet, then the socket remains in the table until a timeout
    expires.

    When a vhost_vsock instance is released with the timeout still pending,
    the socket is never freed because vhost_vsock has already set the
    SOCK_DONE flag.

    Check if the close timer is pending and let it close the socket. This
    prevents the race which can leak sockets.

    Reported-by: Maximilian Riemensberger
    Cc: Graham Whaley
    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Sasha Levin

    Stefan Hajnoczi
     

13 Dec, 2018

1 commit

  • commit 834e772c8db0c6a275d75315d90aba4ebbb1e249 upstream.

    If the network stack calls .send_pkt()/.cancel_pkt() during .release(),
    a struct vhost_vsock use-after-free is possible. This occurs because
    .release() does not wait for other CPUs to stop using struct
    vhost_vsock.

    Switch to an RCU-enabled hashtable (indexed by guest CID) so that
    .release() can wait for other CPUs by calling synchronize_rcu(). This
    also eliminates vhost_vsock_lock acquisition in the data path so it
    could have a positive effect on performance.

    This is CVE-2018-14625 "kernel: use-after-free Read in vhost_transport_send_pkt".

    Cc: stable@vger.kernel.org
    Reported-and-tested-by: syzbot+bd391451452fb0b93039@syzkaller.appspotmail.com
    Reported-by: syzbot+e3e074963495f92a89ed@syzkaller.appspotmail.com
    Reported-by: syzbot+d5a0a170c5069658b141@syzkaller.appspotmail.com
    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Jason Wang
    Signed-off-by: Greg Kroah-Hartman

    Stefan Hajnoczi
     

21 Nov, 2018

1 commit

  • commit 4542d623c7134bc1738f8a68ccb6dd546f1c264f upstream.

    Commands with protection information included were not truncating the
    protection iov_iter to the number of protection bytes in the command.
    This resulted in vhost_scsi mis-calculating the size of the protection
    SGL in vhost_scsi_calc_sgls(), and including both the protection and
    data SG entries in the protection SGL.

    Fixes: 09b13fa8c1a1 ("vhost/scsi: Add ANY_LAYOUT support in vhost_scsi_handle_vq")
    Signed-off-by: Greg Edwards
    Signed-off-by: Michael S. Tsirkin
    Fixes: 09b13fa8c1a1093e9458549ac8bb203a7c65c62a
    Cc: stable@vger.kernel.org
    Reviewed-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     

04 Nov, 2018

1 commit

  • [ Upstream commit ff002269a4ee9c769dbf9365acef633ebcbd6cbe ]

    The idx in vhost_vring_ioctl() was controlled by userspace, hence a
    potential exploitation of the Spectre variant 1 vulnerability.

    Fix this by sanitizing idx before using it to index d->vqs.

    Cc: Michael S. Tsirkin
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
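
The sanitizing step can be sketched with a branch-free mask, modeled on the generic mask behind the kernel's `array_index_nospec()` helper. The function names below are hypothetical, and in userspace the compiler gives no speculation guarantee, so treat this as an illustration of the arithmetic only.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * Branch-free mask: all ones when index < size, zero otherwise.
 * Assumes index and size are both below LONG_MAX and that right shift of
 * a negative long is arithmetic (true for GCC and Clang).
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG - 1));
}

/* Clamp an index to 0 when it is out of bounds, without a branch. */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}
```

The clamp is applied after the ordinary bounds check, so a mispredicted branch can only ever index element 0 rather than an attacker-chosen offset.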
     

15 Sep, 2018

1 commit

  • [ Upstream commit 2d66f997f0545c8f7fc5cf0b49af1decb35170e7 ]

    We don't wake up the virtqueue if the first byte of the pending iova
    range is the last byte of the range we just got updated. This will
    cause the virtqueue to wait for the IOTLB update forever. Fix this by
    correcting the check so that the virtqueue is woken up in this case.

    Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
    Reported-by: Peter Xu
    Signed-off-by: Jason Wang
    Reviewed-by: Peter Xu
    Tested-by: Peter Xu
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
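
The off-by-one described above comes down to comparing against an inclusive last byte. A minimal sketch, with a hypothetical helper name and parameters chosen for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the pending address `pending` falls inside the updated
 * range [start, start + size - 1]. Both ends are inclusive, so the last
 * byte of the range must compare with >=, not >.
 */
static bool update_covers_pending(uint64_t start, uint64_t size,
                                  uint64_t pending)
{
    if (size == 0)
        return false;
    return start <= pending && start + size - 1 >= pending;
}
```

With a strict `>` on the upper bound, a pending address equal to the last byte of the update would be treated as outside the range and the waiter would never be woken.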
     

22 Aug, 2018

1 commit

  • [ Upstream commit b13f9c6364373a1b9f71e9846dc4fb199296f926 ]

    We need to reset the metadata cache during new IOTLB initialization,
    otherwise stale pointers to the previous IOTLB may still be accessed,
    which will lead to a use after free.

    Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com
    Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     

22 Jul, 2018

1 commit

  • [ Upstream commit b8f1f65882f07913157c44673af7ec0b308d03eb ]

    Sock will be NULL if we pass -1 to vhost_net_set_backend(), but when
    we meet errors during ubuf allocation, the code does not check for
    NULL before calling sockfd_put(), which will lead to a NULL
    dereference. Fix this by checking the sock pointer first.

    Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support")
    Reported-by: Dan Carpenter
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     

26 Jun, 2018

1 commit

  • commit 670ae9caaca467ea1bfd325cb2a5c98ba87f94ad upstream.

    struct vhost_msg within struct vhost_msg_node is copied to userspace.
    Unfortunately it turns out on 64 bit systems vhost_msg has padding after
    type which gcc doesn't initialize, leaking 4 uninitialized bytes to
    userspace.

    This padding also unfortunately means 32 bit users of this interface are
    broken on a 64 bit kernel which will need to be fixed separately.

    Fixes: CVE-2018-1118
    Cc: stable@vger.kernel.org
    Reported-by: Kevin Easton
    Signed-off-by: Michael S. Tsirkin
    Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
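
The padding problem can be reproduced with any structure that places a 32-bit field before a 64-bit one. The sketch below uses a hypothetical `struct demo_msg`, not the real vhost_msg layout, and shows one common remedy: zero the whole object before filling in the fields, which in practice also clears the padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout: a 32-bit type followed by a 64-bit payload. */
struct demo_msg {
    int32_t  type;
    uint64_t iova;   /* typically 8-byte aligned on 64-bit ABIs */
};

/* Zero the whole object first so any padding bytes are initialized too. */
static void demo_msg_init(struct demo_msg *msg, int32_t type, uint64_t iova)
{
    memset(msg, 0, sizeof(*msg));
    msg->type = type;
    msg->iova = iova;
}

/* Number of padding bytes between `type` and `iova` on this ABI. */
static size_t demo_msg_padding(void)
{
    return offsetof(struct demo_msg, iova) - sizeof(int32_t);
}
```

On a typical 64-bit ABI `demo_msg_padding()` reports 4 bytes, matching the 4 uninitialized bytes mentioned in the commit message; on ABIs that align 64-bit integers to 4 bytes it reports 0.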
     

12 Jun, 2018

1 commit

  • [ Upstream commit 1b15ad683ab42a203f98b67045b40720e99d0e9a ]

    DaeRyong Jeong reports a race between vhost_dev_cleanup() and
    vhost_process_iotlb_msg():

    Thread interleaving:
    CPU0 (vhost_process_iotlb_msg)          CPU1 (vhost_dev_cleanup)
    (In the case of both VHOST_IOTLB_UPDATE and
    VHOST_IOTLB_INVALIDATE)
    =====                                   =====
                                            vhost_umem_clean(dev->iotlb);
    if (!dev->iotlb) {
            ret = -EFAULT;
            break;
    }
                                            dev->iotlb = NULL;

    The reason is that we don't synchronize between them. Fix this by
    protecting vhost_process_iotlb_msg() with the dev mutex.

    Reported-by: DaeRyong Jeong
    Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     

19 Apr, 2018

2 commits

  • [ Upstream commit 7ced6c98c7ab7a1f6743931e28671b833af79b1e ]

    vhost_copy_to_user is used to copy vring used elements to userspace.
    We should use VHOST_ADDR_USED instead of VHOST_ADDR_DESC.

    Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache")
    Signed-off-by: Eric Auger
    Acked-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     
  • [ Upstream commit d14d2b78090c7de0557362b26a4ca591aa6a9faa ]

    Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
    when IOTLB is enabled") introduced a regression. The logic was
    originally:

    if (vq->iotlb)
            return 1;
    return A && B;

    After the patch the short-circuit logic for A was inverted:

    if (A || vq->iotlb)
            return A;
    return B;

    This patch fixes the regression by rewriting the checks in the obvious
    way, no longer returning A when vq->iotlb is non-NULL (which is hard to
    understand).

    Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
    Cc: Jason Wang
    Signed-off-by: Stefan Hajnoczi
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefan Hajnoczi
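
The regression is easiest to see as a truth table over A (log check), B (ring check) and the IOTLB flag. In the sketch below, the "fixed" ordering is inferred from the two commit messages (the log is validated unconditionally, the ring check is skipped when the IOTLB is enabled); it is not copied from the patch, and the function names are hypothetical.

```c
#include <stdbool.h>

/* a: log access check, b: ring access check, iotlb: device IOTLB enabled. */

/* Logic after the regression: A short-circuits the ring check. */
static bool access_ok_regressed(bool a, bool b, bool iotlb)
{
    if (a || iotlb)
        return a;
    return b;
}

/* Intended logic: always validate the log; skip the ring check with IOTLB. */
static bool access_ok_fixed(bool a, bool b, bool iotlb)
{
    if (!a)
        return false;
    if (iotlb)
        return true;
    return b;
}
```

The two versions agree whenever the IOTLB is enabled. Without it, the regressed version accepts a valid log with an invalid ring, and an invalid log with a valid ring, both of which should be rejected.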
     

12 Apr, 2018

3 commits

  • [ Upstream commit aaa3149bbee9ba9b4e6f0bd6e3e7d191edeae942 ]

    We try to hold the TX virtqueue mutex in vhost_net_rx_peek_head_len()
    after the RX virtqueue mutex is held in handle_rx(). This requires an
    appropriate lock nesting annotation to calm down the deadlock detector.

    Fixes: 0308813724606 ("vhost_net: basic polling support")
    Reported-by: syzbot+7f073540b1384a614e09@syzkaller.appspotmail.com
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     
  • [ Upstream commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ]

    Vq log_base is the userspace address of the bitmap, which has nothing
    to do with IOTLB. So it needs to be validated unconditionally,
    otherwise we may try to use 0 as log_base, which may lead to pinning
    pages with unexpected results (e.g. triggering BUG_ON() in
    set_bit_to_user()).

    Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
    Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     
  • [ Upstream commit dc6455a71c7fc5117977e197f67f71b49f27baba ]

    We tried to remove vq poll from the wait queue, but did not check
    whether or not it was in a list before. This will lead to a double
    free. Fix this by switching to vhost_poll_stop(), which zeros
    poll->wqh after removing poll from the waitqueue to make sure it
    won't be freed twice.

    Cc: Darren Kenny
    Reported-by: syzbot+c0272972b01b872e604a@syzkaller.appspotmail.com
    Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
    Signed-off-by: Jason Wang
    Reviewed-by: Darren Kenny
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
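
The idea of clearing the pointer after removal, so that a second stop is harmless, can be sketched with simplified types. The structures and function names below are hypothetical stand-ins, not the kernel's wait queue API.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a wait queue head and a poll entry. */
struct wait_head { int entries; };
struct poll_entry { struct wait_head *wqh; };

/* Queue the entry on `head` unless it is already queued somewhere. */
static void poll_start(struct poll_entry *poll, struct wait_head *head)
{
    if (poll->wqh)
        return;
    head->entries++;
    poll->wqh = head;
}

/* Idempotent stop: the pointer is cleared so a second call is a no-op. */
static void poll_stop(struct poll_entry *poll)
{
    if (!poll->wqh)
        return;
    poll->wqh->entries--;
    poll->wqh = NULL;
}
```

Calling `poll_stop()` twice leaves the head's entry count at zero instead of driving it negative, which is the analogue of avoiding the double removal.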
     

25 Feb, 2018

1 commit

  • commit e9cb4239134c860e5f92c75bf5321bd377bb505b upstream.

    We used to call mutex_lock() in vhost_dev_lock_vqs(), which tries to
    hold the mutexes of all virtqueues. This may confuse lockdep into
    reporting a possible deadlock because the locks being held belong to
    the same class. Switch to mutex_lock_nested() to avoid the false
    positive.

    Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
    Reported-by: syzbot+dbb7c1161485e61b0241@syzkaller.appspotmail.com
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     

13 Feb, 2018

1 commit

  • [ Upstream commit 4cd879515d686849eec5f718aeac62a70b067d82 ]

    We don't stop the device before resetting the owner, which means we
    could try to serve a virtqueue kick before dev->worker is reset. This
    results in a warning since the work was pending on the llist during
    owner resetting. Fix this by stopping the device during owner reset.

    Reported-by: syzbot+eb17c6162478cc50632c@syzkaller.appspotmail.com
    Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     

17 Dec, 2017

1 commit

  • [ Upstream commit 6e474083f3daf3a3546737f5d7d502ad12eb257c ]

    Matthew found a roughly 40% tcp throughput regression with commit
    c67df11f(vhost_net: try batch dequing from skb array) as discussed
    in the following thread:
    https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html

    Eventually we figured out that it was a skb leak in handle_rx()
    when sending packets to the VM. This usually happens when a guest
    can not drain out vq as fast as vhost fills in, afterwards it sets
    off the traffic jam and leaks skb(s) which occurs as no headcount
    to send on the vq from vhost side.

    This can be avoided by making sure we have got enough headcount
    before actually consuming a skb from the batched rx array while
    transmitting, which is simply done by moving checking the zero
    headcount a bit ahead.

    Signed-off-by: Wei Xu
    Reported-by: Matthew Rosato
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Wei Xu
     

30 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

1 commit

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

06 Sep, 2017

1 commit

  • We used to check tx avail through vhost_enable_notify(), which is
    wrong since it only checks whether or not the guest has filled more
    available buffers since the last avail idx synchronization, which was
    just done by vhost_vq_avail_empty() before. What we really want is to
    check for pending buffers in the avail ring. Fix this by calling
    vhost_vq_avail_empty() instead.

    This issue could be noticed by doing netperf TCP_RR benchmark as
    client from guest (but not host). With this fix, TCP_RR from guest to
    localhost restores from 1375.91 trans per sec to 55235.28 trans per
    sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz).

    Fixes: 030881372460 ("vhost_net: basic polling support")
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

02 Sep, 2017

1 commit

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    v2: added the change in drivers/vhost/net.c as spotted
    by Willem.

    Signed-off-by: Eric Dumazet
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Aug, 2017

1 commit

  • Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
    skb_zerocopy_clone() wherever needed due to skb split, merge, resize
    or clone.

    Split skb_orphan_frags into two variants. The split, merge, .. paths
    support reference counted zerocopy buffers, so do not do a deep copy.
    Add skb_orphan_frags_rx for paths that may loop packets to receive
    sockets. That is not allowed, as it may cause unbounded latency.
    Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

    The exact locations to modify were chosen by exhaustively searching
    through all code that might modify skb_frag references and/or the
    SKBTX_DEV_ZEROCOPY tx_flags bit.

    The changes err on the safe side, in two ways.

    (1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

    (2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

30 Jul, 2017

1 commit

  • This reverts commit 809ecb9bca6a9424ccd392d67e368160f8b76c92, since it
    was reported to break vhost_net. We wanted to cache the used event and
    use it to check for notification. The assumption was that the guest
    won't move the event idx back, but this can in fact happen when the
    16 bit index wraps around after 64K entries.

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
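
The wraparound mentioned above is why ring indices are compared modulo 2^16 rather than numerically. The sketch below follows the style of virtio's `vring_need_event()` check; the function name `need_event` is used here only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Event-index check in the style of virtio's vring_need_event(): notify
 * when event_idx lies in the window [old_idx, new_idx), computed modulo
 * 2^16 so that it stays correct across index wraparound.
 */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

For example, moving from index 0xFFFE to 0x0001 crosses an event index of 0xFFFF even though the new index is numerically smaller, so a cached event index cannot be assumed to only ever move forward as a plain integer.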
     

14 Jul, 2017

1 commit

  • Pull SCSI target updates from Nicholas Bellinger:
    "It's been usually busy for summer, with most of the efforts centered
    around TCMU developments and various target-core + fabric driver bug
    fixing activities. Not particularly large in terms of LoC, but lots of
    smaller patches from many different folks.

    The highlights include:

    - ibmvscsis logical partition manager support (Michael Cyr + Bryant
    Ly)

    - Convert target/iblock WRITE_SAME to blkdev_issue_zeroout (hch +
    nab)

    - Add support for TMR percpu LUN reference counting (nab)

    - Fix a potential deadlock between EXTENDED_COPY and iscsi shutdown
    (Bart)

    - Fix COMPARE_AND_WRITE caw_sem leak during se_cmd quiesce (Jiang Yi)

    - Fix TCMU module removal (Xiubo Li)

    - Fix iser-target OOPs during login failure (Andrea Righi + Sagi)

    - Breakup target-core free_device backend driver callback (mnc)

    - Perform TCMU add/delete/reconfig synchronously (mnc)

    - Fix TCMU multiple UIO open/close sequences (mnc)

    - Fix TCMU CHECK_CONDITION sense handling (mnc)

    - Fix target-core SAM_STAT_BUSY + TASK_SET_FULL handling (mnc + nab)

    - Introduce TYPE_ZBC support in PSCSI (Damien Le Moal)

    - Fix possible TCMU memory leak + OOPs when recalculating cmd base
    size (Xiubo Li + Bryant Ly + Damien Le Moal + mnc)

    - Add login_keys_workaround attribute for non RFC initiators (Robert
    LeBlanc + Arun Easi + nab)"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (68 commits)
    iscsi-target: Add login_keys_workaround attribute for non RFC initiators
    Revert "qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT"
    tcmu: clean up the code and with one small fix
    tcmu: Fix possbile memory leak / OOPs when recalculating cmd base size
    target: export lio pgr/alua support as device attr
    target: Fix return sense reason in target_scsi3_emulate_pr_out
    target: Fix cmd size for PR-OUT in passthrough_parse_cdb
    tcmu: Fix dev_config_store
    target: pscsi: Introduce TYPE_ZBC support
    target: Use macro for WRITE_VERIFY_32 operation codes
    target: fix SAM_STAT_BUSY/TASK_SET_FULL handling
    target: remove transport_complete
    pscsi: finish cmd processing from pscsi_req_done
    tcmu: fix sense handling during completion
    target: add helper to copy sense to se_cmd buffer
    target: do not require a transport_complete for SCF_TRANSPORT_TASK_SENSE
    target: make device_mutex and device_list static
    tcmu: Fix flushing cmd entry dcache page
    tcmu: fix multiple uio open/close sequences
    tcmu: drop configured check in destroy
    ...

    Linus Torvalds
     

13 Jul, 2017

1 commit

  • __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
    the page allocator. This has been true but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests and they are
    considered too important to fail so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
    that the allocator would try really hard but there is no promise of a
    success. This will work independent of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example)

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most light weight mode which even
    doesn't kick the background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keep retrying in the page allocator.

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP when using the sch_fq packet
    scheduler for this is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brako. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

04 Jul, 2017

1 commit

  • Pull char/misc updates from Greg KH:
    "Here is the "big" char/misc driver patchset for 4.13-rc1.

    Lots of stuff in here, a large thunderbolt update, w1 driver header
    reorg, the new mux driver subsystem, google firmware driver updates,
    and a raft of other smaller things. Full details in the shortlog.

    All of these have been in linux-next for a while with the only
    reported issue being a merge problem with this tree and the jc-docs
    tree in the w1 documentation area"

    * tag 'char-misc-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (147 commits)
    misc: apds990x: Use sysfs_match_string() helper
    mei: drop unreachable code in mei_start
    mei: validate the message header only in first fragment.
    DocBook: w1: Update W1 file locations and names in DocBook
    mux: adg792a: always require I2C support
    nvmem: rockchip-efuse: add support for rk322x-efuse
    nvmem: core: add locking to nvmem_find_cell
    nvmem: core: Call put_device() in nvmem_unregister()
    nvmem: core: fix leaks on registration errors
    nvmem: correct Broadcom OTP controller driver writes
    w1: Add subsystem kernel public interface
    drivers/fsi: Add module license to core driver
    drivers/fsi: Use asynchronous slave mode
    drivers/fsi: Add hub master support
    drivers/fsi: Add SCOM FSI client device driver
    drivers/fsi/gpio: Add tracepoints for GPIO master
    drivers/fsi: Add GPIO based FSI master
    drivers/fsi: Document FSI master sysfs files in ABI
    drivers/fsi: Add error handling for slave
    drivers/fsi: Add tracepoints for low-level operations
    ...

    Linus Torvalds
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.
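
    The distinction behind the rename can be illustrated with a minimal
    user-space sketch: an entry represents one waiter, while the head is
    the queue that links entries together. The field names and the
    wq_add()/wq_count() helpers are hypothetical and do not mirror the
    kernel's actual layout or API.

```c
#include <assert.h>
#include <stddef.h>

/* One waiter. Field names here are illustrative, not the kernel's. */
struct wait_queue_entry {
    void *waiter;                    /* hypothetical: whoever is waiting */
    struct wait_queue_entry *next;   /* singly linked for brevity */
};
typedef struct wait_queue_entry wait_queue_entry_t;

/* The actual queue: it owns the list of entries. */
struct wait_queue_head {
    wait_queue_entry_t *first;
};

/* Link a new entry at the front of the queue. */
static void wq_add(struct wait_queue_head *head, wait_queue_entry_t *entry)
{
    assert(head != NULL && entry != NULL);
    entry->next = head->first;
    head->first = entry;
}

/* Count the entries currently linked on the queue. */
static size_t wq_count(const struct wait_queue_head *head)
{
    size_t n = 0;

    for (const wait_queue_entry_t *e = head->first; e != NULL; e = e->next)
        n++;
    return n;
}
```

    With the new names, code that declares a wait_queue_entry_t reads as
    declaring a single waiter rather than a whole queue.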

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

09 Jun, 2017

1 commit


18 May, 2017

2 commits

  • Vhost-vsock is a software device so there is no probe call that causes
    the driver to register its misc char device node. This creates a
    chicken and egg problem: userspace applications must open
    /dev/vhost-vsock to use the driver but the file doesn't exist until the
    kernel module has been loaded.

    Use the devname modalias mechanism so that /dev/vhost-vsock is created
    at boot. The vhost_vsock kernel module is automatically loaded when the
    first application opens /dev/vhost-vsock.

    Note that the "reserved for local use" range in
    Documentation/admin-guide/devices.txt is incorrect. The userio driver
    already occupies part of that range. I've updated the documentation
    accordingly.

    Cc: device@lanana.org
    Signed-off-by: Stefan Hajnoczi
    Signed-off-by: Greg Kroah-Hartman

    Stefan Hajnoczi
     
  • We used to dequeue one skb at a time from the skb_array during
    recvmsg(). This can be inefficient because of poor cache utilization
    and because the spinlock is taken for each packet. This patch batches
    the work by calling the batch dequeuing helpers explicitly on the
    exported skb array and passing each skb back through msg_control so
    that the underlying socket can finish the copy to userspace. Batch
    dequeuing is also a prerequisite for further batching improvements on
    the receive path.
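
    The batching idea can be sketched in user-space C as follows. This is
    only a model under stated assumptions: the ring type, RING_SIZE, and
    the ring_produce()/ring_consume_batched() helpers are hypothetical,
    and a counter stands in for the spinlock so the saving in lock
    acquisitions is visible.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 64   /* hypothetical capacity, a power of two */

struct ring {
    void *slots[RING_SIZE];
    size_t head;                      /* next slot to consume */
    size_t tail;                      /* next slot to fill */
    unsigned long lock_acquisitions;  /* stands in for spinlock traffic */
};

/* Queue one item; returns 0 on success, -1 if the ring is full. */
static int ring_produce(struct ring *r, void *item)
{
    assert(r != NULL);
    if (r->tail - r->head == RING_SIZE)
        return -1;
    r->slots[r->tail % RING_SIZE] = item;
    r->tail++;
    return 0;
}

/*
 * Consume up to max items under a single simulated lock acquisition.
 * Returns the number of items copied into out[].
 */
static size_t ring_consume_batched(struct ring *r, void **out, size_t max)
{
    size_t n = 0;

    r->lock_acquisitions++;           /* "lock" taken once per batch */
    while (n < max && r->head != r->tail) {
        out[n++] = r->slots[r->head % RING_SIZE];
        r->head++;
    }
    return n;                         /* "unlock" would happen here */
}
```

    Draining ten queued items with a batch size of four takes three
    simulated lock acquisitions instead of ten, which is the kind of
    per-packet overhead the patch aims to reduce.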

    Tests were done by pktgen on tap with XDP1 in guest. Host is Intel(R)
    Xeon(R) CPU E5-2650 0 @ 2.00GHz.

    rx batch | pps
    ---------+-------------------
           0 | 2.25Mpps
           1 | 2.33Mpps (+3.56%)
           4 | 2.33Mpps (+3.56%)
          16 | 2.35Mpps (+4.44%)
          64 | 2.42Mpps (+7.56%)

    Signed-off-by: David S. Miller

    Jason Wang
     

09 May, 2017

1 commit

  • vhost code uses __GFP_REPEAT when allocating vhost_virtqueue and
    vhost_vsock because it would strongly prefer kmalloc to the
    vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
    allocation to vmalloc") for more context. Michael Tsirkin has also
    noted:

    "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
    all accesses are slowed down. Allocation is not on data path, accesses
    are."

    The same applies to other vhost_kvzalloc users.

    Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are
    two things to be careful about. First, we should avoid triggering the
    OOM killer, so we have to use __GFP_NORETRY by default. Second, we
    have to override __GFP_REPEAT for !costly order requests, because
    __GFP_REPEAT is ignored for !costly orders.

    Supporting __GFP_REPEAT-like semantics for !costly requests is
    possible, but it would require changes in the page allocator. This is
    out of scope of this patch.

    This patch shouldn't introduce any functional change.
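
    The two rules described in this message can be modeled in user-space
    C as below. This is a sketch under assumptions, not the kernel
    implementation: the MODEL_* flag values, the 4096-byte page size, and
    the kmalloc_attempt_flags() helper are hypothetical, and the
    no-warning flag reflects my understanding that the first attempt
    should fail quietly because a vmalloc fallback exists.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only; the kernel's real flag bits differ. */
#define MODEL_GFP_REPEAT   0x1u
#define MODEL_GFP_NORETRY  0x2u
#define MODEL_GFP_NOWARN   0x4u

#define MODEL_PAGE_SIZE     4096u  /* assumed page size */
#define MODEL_COSTLY_ORDER  3u     /* orders above this count as costly */

/*
 * Compute the flags for the first, physically contiguous attempt.
 * Requests larger than a page have a fallback, so they should fail
 * quickly instead of invoking the OOM killer, unless the caller asked
 * for MODEL_GFP_REPEAT on a costly-order request.
 */
static unsigned int kmalloc_attempt_flags(unsigned int flags, size_t size)
{
    size_t costly_limit = (size_t)MODEL_PAGE_SIZE << MODEL_COSTLY_ORDER;

    assert(size > 0);
    if (size <= MODEL_PAGE_SIZE)
        return flags;              /* no fallback involved: leave as-is */

    flags |= MODEL_GFP_NOWARN;     /* failure is fine, a fallback exists */

    /* No-retry is the default; it also overrides REPEAT for !costly sizes. */
    if (!(flags & MODEL_GFP_REPEAT) || size <= costly_limit)
        flags |= MODEL_GFP_NORETRY;

    return flags;
}
```

    In this model only a request that is both larger than
    MODEL_PAGE_SIZE << MODEL_COSTLY_ORDER and marked MODEL_GFP_REPEAT
    keeps retrying; every other multi-page request fails fast and falls
    back.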

    Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Michael S. Tsirkin
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Apr, 2017

1 commit

  • The virtio drivers deal with struct virtio_vsock_pkt. Add
    virtio_transport_deliver_tap_pkt(pkt) for handing packets to the
    vsockmon device.

    We call virtio_transport_deliver_tap_pkt(pkt) from
    net/vmw_vsock/virtio_transport.c and drivers/vhost/vsock.c instead of
    common code. This is because the drivers may drop packets before
    handing them to common code - we still want to capture them.
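
    The ordering argument can be sketched in user-space C: the capture
    hook runs before any check that might drop the packet. Everything
    here is hypothetical (the pkt and rx_stats types, tap_deliver(),
    driver_rx(), and the length check used as the drop condition); it
    only illustrates why the call sits in the drivers.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
    unsigned int len;          /* hypothetical payload length */
};

struct rx_stats {
    unsigned int captured;     /* packets seen by the monitor */
    unsigned int delivered;    /* packets handed to common code */
    unsigned int dropped;      /* packets rejected by the driver */
};

/* Stand-in for handing a copy of the packet to a monitoring device. */
static void tap_deliver(struct rx_stats *s, const struct pkt *p)
{
    assert(s != NULL && p != NULL);
    s->captured++;
}

/* Driver receive path: capture first, then decide whether to drop. */
static void driver_rx(struct rx_stats *s, const struct pkt *p,
                      unsigned int max_len)
{
    tap_deliver(s, p);         /* the monitor sees dropped packets too */

    if (p->len > max_len) {    /* hypothetical drop condition */
        s->dropped++;
        return;
    }
    s->delivered++;            /* common code would take over here */
}
```

    Had the capture call lived in common code, the oversized packet in
    this model would never have reached the monitor.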

    Signed-off-by: Gerard Garcia
    Signed-off-by: Stefan Hajnoczi
    Reviewed-by: Jorgen Hansen
    Signed-off-by: David S. Miller

    Gerard Garcia
     

22 Mar, 2017

1 commit


04 Mar, 2017

1 commit

  • Pull sched.h split-up from Ingo Molnar:
    "The point of these changes is to significantly reduce the
    header footprint, to speed up the kernel build and to
    have a cleaner header structure.

    After these changes the new <linux/sched.h>'s typical preprocessed
    size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
    lines), which is around 40% faster to build on typical configs.

    Not much changed from the last version (-v2) posted three weeks ago: I
    eliminated quirks, backmerged fixes plus I rebased it to an upstream
    SHA1 from yesterday that includes most changes queued up in -next plus
    all sched.h changes that were pending from Andrew.

    I've re-tested the series both on x86 and on cross-arch defconfigs,
    and did a bisectability test at a number of random points.

    I tried to test as many build configurations as possible, but some
    build breakage is probably still left - but it should be mostly
    limited to architectures that have no cross-compiler binaries
    available on kernel.org, and non-default configurations"

    * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
    sched/headers: Clean up
    sched/headers: Remove #ifdefs from
    sched/headers: Remove the include from
    sched/headers, hrtimer: Remove the include from
    sched/headers, x86/apic: Remove the header inclusion from
    sched/headers, timers: Remove the include from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/core: Remove unused prefetch_stack()
    sched/headers: Remove from
    sched/headers: Remove the 'init_pid_ns' prototype from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the runqueue_is_locked() prototype
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the include from
    sched/headers: Remove from
    ...

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Pull vhost updates from Michael Tsirkin:
    "virtio, vhost: optimizations, fixes

    Looks like a quiet cycle for vhost/virtio, just a couple of minor
    tweaks. Most notable is automatic interrupt affinity for blk and scsi.
    Hopefully other devices are not far behind"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-console: avoid DMA from stack
    vhost: introduce O(1) vq metadata cache
    virtio_scsi: use virtio IRQ affinity
    virtio_blk: use virtio IRQ affinity
    blk-mq: provide a default queue mapping for virtio device
    virtio: provide a method to get the IRQ affinity mask for a virtqueue
    virtio: allow drivers to request IRQ affinity when creating VQs
    virtio_pci: simplify MSI-X setup
    virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
    virtio_pci: use shared interrupts for virtqueues
    virtio_pci: remove struct virtio_pci_vq_info
    vhost: try avoiding avail index access when getting descriptor
    virtio_mmio: expose header to userspace

    Linus Torvalds