15 Feb, 2017

4 commits

  • commit 433e19cf33d34bb6751c874a9c00980552fe508c upstream.

    Commit a389fcfd2cb5 ("Drivers: hv: vmbus: Fix signaling logic in
    hv_need_to_signal_on_read()")
    added the proper mb(), but removed the test "prev_write_sz < pending_sz"
    when making the signal decision.

    As a result, the guest can signal the host unnecessarily,
    and then the host can throttle the guest because the host
    thinks the guest is buggy or malicious; finally, a user
    running a stress test can perceive intermittent freezes of
    the guest.

    This patch brings back the test, and properly handles the
    in-place consumption APIs used by NetVSC (see get_next_pkt_raw(),
    put_pkt_raw() and commit_rd_index()).

    Fixes: a389fcfd2cb5 ("Drivers: hv: vmbus: Fix signaling logic in
    hv_need_to_signal_on_read()")

    Signed-off-by: Dexuan Cui
    Reported-by: Rolf Neugebauer
    Tested-by: Rolf Neugebauer
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Signed-off-by: K. Y. Srinivasan
    Cc: Rolf Neugebauer
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
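
The restored test can be illustrated with a small user-space model. This is only a sketch, not the kernel code: the function name and the `cur_write_sz` parameter are assumptions made for illustration, while `prev_write_sz` and `pending_sz` follow the names used in the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of the read-side signal decision (hypothetical helper).
 * prev_write_sz: space the host could write into before this read
 * cur_write_sz:  space available after the guest consumed data
 * pending_sz:    bytes the host is waiting to write (0 = host not blocked)
 */
bool need_to_signal_on_read(uint32_t prev_write_sz, uint32_t cur_write_sz,
                            uint32_t pending_sz)
{
    /* Reading can only free space, never consume it. */
    assert(cur_write_sz >= prev_write_sz);

    if (pending_sz == 0)
        return false;   /* host is not blocked on a write */
    if (cur_write_sz < pending_sz)
        return false;   /* still not enough room for the host */

    /* Signal only if this read is what created the room. */
    return prev_write_sz < pending_sz;
}
```

Without the final comparison, every read that leaves enough room would signal the host again, even when the host was never short of space.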
     
  • commit 3372592a140db69fd63837e81f048ab4abf8111e upstream.

    Signal the host when we determine the host is to be signaled on the
    read path. The current code determines the need to signal in the
    ringbuffer code and actually issues the signal elsewhere. This can result
    in the host viewing this interrupt as spurious since the host may also
    poll the channel. Make the necessary adjustments.

    Signed-off-by: K. Y. Srinivasan
    Cc: Rolf Neugebauer
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • commit 1f6ee4e7d83586c8b10bd4f2f4346353d04ce884 upstream.

    Signal the host when we determine the host is to be signaled.
    The current code determines the need to signal in the ringbuffer
    code and actually issues the signal elsewhere. This can result
    in the host viewing this interrupt as spurious since the host may also
    poll the channel. Make the necessary adjustments.

    Signed-off-by: K. Y. Srinivasan
    Cc: Rolf Neugebauer
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • commit 74198eb4a42c4a3c4fbef08fa01a291a282f7c2e upstream.

    One of the factors that can result in the host concluding that a given
    guest is mounting a DoS attack is if the guest generates interrupts
    to the host when the host is not expecting it. If these "spurious"
    interrupts reach a certain rate, the host can throttle the guest to
    minimize the impact. The host computation of the "expected number
    of interrupts" is strictly based on the ring transitions. Until
    the host logic is fixed, base the guest logic to interrupt solely
    on the ring state.

    Signed-off-by: K. Y. Srinivasan
    Cc: Rolf Neugebauer
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
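
One way to read "solely on the ring state" is that the guest signals only when a write moves the ring from empty to non-empty. The sketch below models that reading in user space; the struct, field names and function name are hypothetical and not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ring state. */
struct ring_state {
    uint32_t read_index;    /* next byte the host will consume */
    uint32_t write_index;   /* next byte the guest will produce */
    uint32_t size;          /* ring size in bytes */
};

/*
 * Advance the write index by len bytes and report whether the host
 * should be signaled, i.e. whether the ring was empty before the write.
 */
bool ring_write_needs_signal(struct ring_state *r, uint32_t len)
{
    assert(r != NULL && r->size > 0);
    uint32_t used = (r->write_index + r->size - r->read_index) % r->size;
    /* One byte stays unused so that full and empty remain distinguishable. */
    assert(len > 0 && len < r->size - used);

    uint32_t old_write = r->write_index;
    r->write_index = (old_write + len) % r->size;

    /* The ring was empty exactly when the old write index met the read index. */
    return old_write == r->read_index;
}
```

A second write into a ring that already holds data returns false here, which matches the idea that the host expects one interrupt per transition rather than one per packet.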
     

09 Jan, 2017

1 commit

  • commit abd1026da4a7700a8db370947f75cd17b6ae6f76 upstream.

    "kernel BUG at drivers/hv/channel_mgmt.c:350!" is observed when hv_vmbus
    module is unloaded. BUG_ON() was introduced in commit 85d9aa705184
    ("Drivers: hv: vmbus: add an API vmbus_hvsock_device_unregister()") as
    vmbus_free_channels() codepath was apparently forgotten.

    Fixes: 85d9aa705184 ("Drivers: hv: vmbus: add an API vmbus_hvsock_device_unregister()")

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     

01 Nov, 2016

1 commit

  • In commit 9a56e5d6a0ba ("Drivers: hv: make VMBus bus ids persistent")
    the name of vmbus devices in sysfs changed to be (in 4.9-rc1):
    /sys/bus/vmbus/vmbus-6aebe374-9ba0-11e6-933c-00259086b36b

    The prefix ("vmbus-") is redundant and differs from how PCI is
    represented in sysfs. Therefore simplify to:
    /sys/bus/vmbus/6aebe374-9ba0-11e6-933c-00259086b36b

    Please merge this before 4.9 is released and the old format
    has to live forever.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Stephen Hemminger
     

25 Oct, 2016

1 commit

  • The host keeps sending heartbeat packets independent of the
    guest responding to them. Even though we respond to the heartbeat messages at
    interrupt level, we can have situations where there may be multiple heartbeat
    messages pending that have not been responded to. For instance, this occurs when the
    VM is paused and the host continues to send the heartbeat messages.
    Address this issue by draining and responding to all
    the heartbeat messages that may be pending.

    Signed-off-by: Long Li
    Signed-off-by: K. Y. Srinivasan
    CC: Stable
    Signed-off-by: Greg Kroah-Hartman

    Long Li
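
The draining idea can be sketched as a loop that keeps receiving until the channel is empty. Everything in this example is a hypothetical stand-in (the `fake_channel` type and both functions); it only models the control flow, not the VMBus receive API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a channel holding queued heartbeat requests. */
struct fake_channel {
    size_t pending;   /* heartbeats the host has queued */
    size_t replied;   /* responses the guest has sent */
};

/* Consume one message if any is pending; return 1 on success, 0 if empty. */
int fake_recv(struct fake_channel *ch)
{
    assert(ch != NULL);
    if (ch->pending == 0)
        return 0;
    ch->pending--;
    return 1;
}

/* Answer every queued heartbeat, not only the first one. */
size_t heartbeat_drain(struct fake_channel *ch)
{
    size_t handled = 0;

    while (fake_recv(ch)) {
        ch->replied++;   /* stands in for sending the response */
        handled++;
    }
    return handled;
}
```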
     

27 Sep, 2016

2 commits


09 Sep, 2016

1 commit


08 Sep, 2016

3 commits

  • This enables support for more accurate TimeSync v4 samples when hosted
    under Windows Server 2016 and newer hosts.

    The new time samples include a "vmreferencetime" field that represents
    the guest's TSC value when the host generated its time sample. This value
    lets the guest calculate the latency in receiving the time sample. The
    latency is added to the sample host time prior to updating the clock.

    Signed-off-by: Alex Ng
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Alex Ng
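
The latency correction described above amounts to one subtraction and one addition. The following sketch assumes that all three values have already been converted to the same unit (for example 100 ns ticks); the function and parameter names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * host_time:       host clock value carried in the sample
 * sample_ref_time: guest reference counter when the host took the sample
 * now_ref_time:    guest reference counter when the sample is processed
 * All three are assumed to use the same unit.
 */
uint64_t timesync_adjust(uint64_t host_time, uint64_t sample_ref_time,
                         uint64_t now_ref_time)
{
    assert(now_ref_time >= sample_ref_time);

    /* Time the sample spent in flight before the guest handled it. */
    uint64_t latency = now_ref_time - sample_ref_time;

    return host_time + latency;
}
```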
     
  • Only the first 50 samples after boot were being used to discipline the
    clock. After the first 50 samples, any samples from the host were ignored
    and the guest clock would eventually drift from the host clock.

    This patch allows TimeSync-enabled guests to continuously synchronize the
    clock with the host clock, even after the first 50 samples.

    Signed-off-by: Alex Ng
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Alex Ng
     
  • Different Windows host versions may reuse the same protocol version when
    negotiating the TimeSync, Shutdown, and Heartbeat protocols. We should only
    refer to the protocol version to avoid conflating the two concepts.

    Signed-off-by: Alex Ng
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Alex Ng
     

07 Sep, 2016

2 commits


02 Sep, 2016

6 commits

  • Hyper-V host will send a VSS_OP_HOT_BACKUP request to check if guest is
    ready for a live backup/snapshot. The driver should respond to the check
    only if the daemon is running and listening to requests. This allows the
    host to fall back to standard snapshots in case the VSS daemon is not
    running.

    Signed-off-by: Alex Ng
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Alex Ng
     
  • Multiple VSS_OP_HOT_BACKUP requests may arrive in quick succession, even
    though the host only signals once. The driver was handling the first
    request while ignoring the others in the ring buffer. We should poll the
    VSS channel after handling a request to continue processing other requests.

    Signed-off-by: Alex Ng
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Alex Ng
     
  • Introduce a mechanism to control how channels will be affinitized. We will
    support two policies:

    1. HV_BALANCED: All performance critical channels will be distributed
    evenly amongst all the available NUMA nodes. Once the Node is assigned,
    we will assign the CPU based on a simple round robin scheme.

    2. HV_LOCALIZED: Only the primary channels are distributed across all
    NUMA nodes. Sub-channels will be in the same NUMA node as the primary
    channel. This is the current behaviour.

    The default policy will be the HV_BALANCED as it can minimize the remote
    memory access on NUMA machines with applications that span NUMA nodes.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
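
The difference between the two policies can be modeled with a round-robin cursor over NUMA nodes. This is a sketch under assumptions: the enum name, the `node_picker` struct and `pick_node()` are invented for illustration, and CPU selection within a node is left out.

```c
#include <assert.h>
#include <stddef.h>

enum hv_numa_policy { HV_BALANCED, HV_LOCALIZED };

/* Hypothetical round-robin state over the available NUMA nodes. */
struct node_picker {
    int num_nodes;
    int next_node;   /* round-robin cursor */
};

/*
 * Pick a NUMA node for a channel. primary_node is the node of the
 * primary channel for a sub-channel, or -1 for a primary channel.
 */
int pick_node(struct node_picker *p, enum hv_numa_policy policy,
              int primary_node)
{
    assert(p != NULL && p->num_nodes > 0);

    /* Localized sub-channels stay on the primary channel's node. */
    if (policy == HV_LOCALIZED && primary_node >= 0)
        return primary_node;

    /* Everything else is spread evenly across the nodes. */
    int node = p->next_node;
    p->next_node = (p->next_node + 1) % p->num_nodes;
    return node;
}
```

Under HV_BALANCED the primary node argument is ignored, so sub-channels take part in the rotation just like primary channels.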
     
  • With wrap around mappings for ring buffers we can always use a single
    memcpy() to do the job.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Tested-by: Dexuan Cui
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • Make it possible to always use a single memcpy() or to provide a direct
    link to a packet on the ring buffer by creating virtual mapping for two
    copies of the ring buffer with vmap(). Utilize currently empty
    hv_ringbuffer_cleanup() to do the unmap.

    While on it, replace sizeof(struct hv_ring_buffer) check
    in hv_ringbuffer_init() with BUILD_BUG_ON() as it is a compile time check.

    Signed-off-by: Vitaly Kuznetsov
    Tested-by: Dexuan Cui
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
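
A user-space analogue may help to picture the double mapping. The kernel change uses vmap(); the sketch below instead uses mmap() on a temporary file, which is a different mechanism with the same effect, and assumes a POSIX system. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same ring_size bytes twice, back to back, so that an access
 * running past the end of the first copy lands at the start of the ring.
 * ring_size must be a multiple of the page size. Returns NULL on failure.
 */
unsigned char *map_ring_twice(size_t ring_size)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && ring_size > 0 && ring_size % (size_t)page == 0);

    FILE *backing = tmpfile();   /* unnamed backing store */
    if (backing == NULL)
        return NULL;
    int fd = fileno(backing);
    if (ftruncate(fd, (off_t)ring_size) != 0) {
        fclose(backing);
        return NULL;
    }

    /* Reserve 2 * ring_size of contiguous address space. */
    unsigned char *base = mmap(NULL, 2 * ring_size, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        fclose(backing);
        return NULL;
    }

    /* Place the same file pages into both halves of the reservation. */
    void *lo = mmap(base, ring_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *hi = mmap(base + ring_size, ring_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    fclose(backing);   /* the mappings keep the pages alive */
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(base, 2 * ring_size);
        return NULL;
    }
    return base;
}

/* Copy len bytes into the ring with one memcpy(), even across the wrap. */
void ring_write(unsigned char *ring, size_t ring_size, size_t offset,
                const void *src, size_t len)
{
    assert(ring != NULL && offset < ring_size && len <= ring_size);
    memcpy(ring + offset, src, len);
}
```

Writing four bytes at offset `ring_size - 2` puts the last two bytes at offsets 0 and 1 of the ring, with no explicit wrap handling in the copy.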
     
  • In preparation for doing wrap around mappings for ring buffers, clean up
    the vmbus_open() function:
    - check that ring sizes are PAGE_SIZE aligned (they are for all in-kernel
    drivers now);
    - kfree(open_info) on error only after we kzalloc() it (not an issue as it
    is valid to call kfree(NULL));
    - rename poorly named labels;
    - use alloc_pages() instead of __get_free_pages() as we need struct page
    pointer for future.

    Signed-off-by: Vitaly Kuznetsov
    Tested-by: Dexuan Cui
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     

31 Aug, 2016

14 commits

  • Reports for available memory should use the si_mem_available() value.
    The previous freeram value does not include available page cache memory.

    Signed-off-by: Alex Ng
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Alex Ng
     
  • lockdep reports possible circular locking dependency when udev is used
    for memory onlining:

    systemd-udevd/3996 is trying to acquire lock:
    ((memory_chain).rwsem){++++.+}, at: [] __blocking_notifier_call_chain+0x4e/0xc0

    but task is already holding lock:
    (&dm_device.ha_region_mutex){+.+.+.}, at: [] hv_memory_notifier+0x5e/0xc0 [hv_balloon]
    ...

    which is probably a false positive because we take and release
    ha_region_mutex from memory notifier chain depending on the arg. No real
    deadlocks were reported so far (though I'm not really sure about
    preemptible kernels...) but we don't really need to hold the mutex
    for so long. We use it to protect ha_region_list (and its members) and the
    num_pages_onlined counter. None of these operations require us to sleep
    and nothing is slow, so switch to using a spinlock with interrupts disabled.

    While on it, replace list_for_each -> list_for_each_entry as we actually
    need entries in all these cases, drop meaningless list_empty() checks.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • With the recently introduced in-kernel memory onlining
    (MEMORY_HOTPLUG_DEFAULT_ONLINE) there is no point in waiting for pages
    to come online in the driver and we can get rid of the waiting.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • I'm observing the following hot add requests from the WS2012 host:

    hot_add_req: start_pfn = 0x108200 count = 330752
    hot_add_req: start_pfn = 0x158e00 count = 193536
    hot_add_req: start_pfn = 0x188400 count = 239616

    As the host doesn't specify hot add regions we're trying to create a
    128MB-aligned region covering the first request, so we create the 0x108000 -
    0x160000 region and we add 0x108000 - 0x158e00 memory. The second request
    passes the pfn_covered() check, we enlarge the region to 0x108000 -
    0x190000 and add 0x158e00 - 0x188200 memory. The problem emerges with the
    third request as it starts at 0x188400 so there is a 0x200 gap which is
    not covered. As the end of our region is 0x190000 now it again passes the
    pfn_covered() check where we just adjust the covered_end_pfn and make it
    0x188400 instead of 0x188200 which means that we'll try to online
    0x188200-0x188400 pages but these pages were never assigned to us and we
    crash.

    We can't react to such requests by creating new hot add regions as it may
    happen that the whole suggested range falls into the previously identified
    128MB-aligned area so we'll end up adding nothing or create intersecting
    regions and our current logic doesn't allow that. Instead, create a list of
    such 'gaps' and check for them in the page online callback.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
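
The gap check can be sketched as a range lookup performed for each page before it is onlined. The struct and function names below are hypothetical, and an array stands in for the list mentioned above; the numbers in the usage note come from the requests quoted in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A range of PFNs that was never handed to the guest (end is exclusive). */
struct pfn_gap {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Return true if pfn falls inside any recorded gap. */
bool pfn_in_gap(const struct pfn_gap *gaps, size_t ngaps, unsigned long pfn)
{
    assert(gaps != NULL || ngaps == 0);
    for (size_t i = 0; i < ngaps; i++) {
        if (pfn >= gaps[i].start_pfn && pfn < gaps[i].end_pfn)
            return true;
    }
    return false;
}

/* Count the pages in [start, end) that may be onlined, skipping gaps. */
unsigned long count_onlinable(const struct pfn_gap *gaps, size_t ngaps,
                              unsigned long start, unsigned long end)
{
    unsigned long n = 0;

    for (unsigned long pfn = start; pfn < end; pfn++) {
        if (!pfn_in_gap(gaps, ngaps, pfn))
            n++;
    }
    return n;
}
```

With a single gap of 0x188200 to 0x188400, the 0x200 pages between the second and third request are skipped while the pages on either side remain eligible.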
     
  • Windows 2012 (non-R2) does not specify hot add region in hot add requests
    and the logic in hot_add_req() is trying to find a 128MB-aligned region
    covering the request. It may also happen that host's requests are not 128MB
    aligned and the created ha_region will start before the first specified
    PFN. We can't online these non-present pages but we don't remember the real
    start of the region.

    This is a regression introduced by the commit 5abbbb75d733 ("Drivers: hv:
    hv_balloon: don't lose memory when onlining order is not natural"). While
    the idea of keeping the 'moving window' was wrong (as there is no guarantee
    that hot add requests come ordered) we should still keep track of
    covered_start_pfn. This is not a revert, the logic is different.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
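
The alignment arithmetic can be worked through in a few lines. The sketch assumes 4 KiB pages, so 128 MB corresponds to 0x8000 PFNs; the macro and function names are illustrative and not the driver's own.

```c
#include <assert.h>

/* Number of page frames in 128 MB, assuming 4 KiB pages (0x8000). */
#define PAGES_PER_128MB (128UL * 1024 * 1024 / 4096)

/* Round the first requested PFN down to a 128 MB boundary. */
unsigned long region_start_pfn(unsigned long start_pfn)
{
    return start_pfn / PAGES_PER_128MB * PAGES_PER_128MB;
}

/* Round the requested page count up to a whole number of 128 MB chunks. */
unsigned long region_size_pfns(unsigned long pfn_count)
{
    assert(pfn_count > 0);
    return (pfn_count + PAGES_PER_128MB - 1) / PAGES_PER_128MB
           * PAGES_PER_128MB;
}
```

For a request starting at PFN 0x108200, the region starts at 0x108000, which leaves 0x200 pages at the front that were never part of the request. Those are the non-present pages that tracking the real covered start is meant to protect.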
     
  • On Hyper-V, performance critical channels use the monitor
    mechanism to signal the host when the guest posts messages
    for the host. This mechanism minimizes the hypervisor intercepts
    and also makes the host more efficient in that each time the
    host is woken up, it processes a batch of messages as opposed to
    just one. The goal here is to improve the throughput and this is at
    the expense of increased latency.
    Implement a mechanism to let the client driver decide if latency
    is important.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • The current delay between retries is unnecessarily high and is negatively
    affecting the time it takes to boot the system.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • For synthetic NIC channels, enable explicit signaling policy as netvsc wants to
    explicitly control when the host is to be signaled.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • There is a rare race when we remove an entry from the global list
    hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
    percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
    process_chn_event() -> pcpu_relid2channel() is trying to query the list,
    we can get a kernel fault.

    Similarly, we also have the issue in the code path: vmbus_process_offer() ->
    percpu_channel_enq().

    We can resolve the issue by disabling the tasklet when updating the list.

    The patch also moves vmbus_release_relid() to a later place where
    the channel has been removed from the per-cpu and the global lists.

    Reported-by: Rolf Neugebauer
    Signed-off-by: Dexuan Cui
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • Background: userspace daemons registration protocol for Hyper-V utilities
    drivers has two steps:
    1) daemon writes its own version to kernel
    2) kernel reads it and replies with module version
    at this point we consider the handshake procedure being completed and we
    do hv_poll_channel() transitioning the utility device to HVUTIL_READY
    state. At this point we're ready to handle messages from kernel.

    When hvutil_transport is in HVUTIL_TRANSPORT_CHARDEV mode we have a
    single buffer for the outgoing message. hvutil_transport_send() writes to
    this buffer and, until the buffer is cleared by hvt_op_read(), returns
    -EFAULT to all subsequent calls. The host-guest protocol guarantees there
    is no more than one request at a time and we will not get new requests
    till we reply to the previous one, so this single message buffer is enough.

    Now to the race. When we finish negotiation procedure and send kernel
    module version to userspace with hvutil_transport_send() it goes into the
    above mentioned buffer and if the daemon is slow enough to read it from
    there we can get a collision when a request from the host comes, we won't
    be able to put anything to the buffer so the request will be lost. To
    solve the issue we need to know when the negotiation is really done (when
    the version message is read by the daemon) and transition to HVUTIL_READY
    state after this happens. Implement a callback on read to support this.
    Old style netlink communication is not affected by the change, we don't
    really know when these messages are delivered but we don't have a single
    message buffer there.

    Reported-by: Barry Davis
    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
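
The fix can be pictured as a one-slot mailbox whose owner is told when the slot has been read. The model below is a user-space sketch: all type and function names are hypothetical, and only the -EFAULT behaviour for a busy slot is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum util_state { UTIL_NEGOTIATING, UTIL_READY };

/* Hypothetical one-slot transport with a one-shot "message was read" hook. */
struct transport {
    bool outmsg_pending;          /* single outgoing message slot */
    void (*on_read)(void *ctx);   /* called once when the slot is read */
    void *ctx;
};

/* Queue a message; fails with -EFAULT while the slot is still occupied. */
int transport_send(struct transport *t, void (*on_read)(void *), void *ctx)
{
    assert(t != NULL);
    if (t->outmsg_pending)
        return -EFAULT;
    t->outmsg_pending = true;
    t->on_read = on_read;
    t->ctx = ctx;
    return 0;
}

/* The daemon reads the message: free the slot, then run the callback. */
void transport_read(struct transport *t)
{
    assert(t != NULL);
    if (!t->outmsg_pending)
        return;
    t->outmsg_pending = false;
    if (t->on_read != NULL) {
        void (*cb)(void *) = t->on_read;
        t->on_read = NULL;
        cb(t->ctx);
    }
}

/* Callback used for the version message: negotiation is really done now. */
void mark_ready(void *ctx)
{
    *(enum util_state *)ctx = UTIL_READY;
}
```

The state only becomes ready inside `transport_read()`, so a host request can no longer arrive while the version message still occupies the slot.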
     
  • vmbus_teardown_gpadl() can result in infinite wait when it is called on 5
    second timeout in vmbus_open(). The issue is caused by the fact that gpadl
    teardown operation won't ever succeed for an opened channel and the timeout
    isn't always enough. As a guest, we can always trust the host to respond to
    our request (and there is nothing we can do if it doesn't).

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • In some cases create_gpadl_header() allocates submessages but we never
    free them.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • We use messagecount only once in vmbus_establish_gpadl() to check if
    it is safe to iterate through the submsglist. We can just initialize
    the list header in all cases in create_gpadl_header() instead.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • When we crash from NMI context (e.g. after NMI injection from host when
    'sysctl -w kernel.unknown_nmi_panic=1' is set) we hit

    kernel BUG at mm/vmalloc.c:1530!

    as vfree() is denied. While the issue could be solved with in_nmi() check
    instead I opted for skipping vfree on all sorts of crashes to reduce the
    amount of work which can cause subsequent crashes. We don't really need to
    free anything on crash.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     

13 Jun, 2016

1 commit

  • The Hyper-V Linux Integration Services use the VMBus implementation for
    communication with the Hypervisor. VMBus registers its own interrupt
    handler that completely bypasses the common Linux interrupt handling.
    This implies that the interrupt entropy collector is not triggered.

    This patch adds the interrupt entropy collection callback into the VMBus
    interrupt handler function.

    Cc: stable@kernel.org
    Signed-off-by: Stephan Mueller
    Signed-off-by: Stephan Mueller
    Signed-off-by: Theodore Ts'o

    Stephan Mueller
     

02 May, 2016

4 commits

  • We set host_specified_ha_region = true on certain requests but this is a
    global state which stays 'true' forever. We need to reset it when we
    receive a request where ha_region is not specified. I did not see any
    real issues, the bug was found by code inspection.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • When we iterate through all HA regions in handle_pg_range() we have an
    assumption that all these regions are sorted in the list and the
    'start_pfn >= has->end_pfn' check is enough to find the proper region.
    Unfortunately it's not the case with WS2016 where host can hot-add regions
    in a different order. We end up modifying the wrong HA region and crashing
    later on pages online. Modify the check to make sure we found the region
    we were searching for while iterating. Fix the same check in pfn_covered()
    as well.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
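
The stricter check can be shown with a lookup that tests both bounds of each region. The struct and function names are illustrative, and an array stands in for the driver's list.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified hot-add region (end_pfn is exclusive). */
struct ha_region {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Find the region containing pfn without assuming any ordering. */
const struct ha_region *find_region(const struct ha_region *regions,
                                    size_t n, unsigned long pfn)
{
    assert(regions != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Checking only the upper bound would accept the wrong region. */
        if (pfn < regions[i].start_pfn || pfn >= regions[i].end_pfn)
            continue;
        return &regions[i];
    }
    return NULL;
}
```

If a higher region is listed before a lower one, a check of the upper bound alone would match a PFN from the lower region against the first entry. Testing the lower bound as well skips it.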
     
  • Kdump keeps biting. Turns out CHANNELMSG_UNLOAD_RESPONSE is always
    delivered to the CPU which was used for initial contact or to CPU0
    depending on host version. vmbus_wait_for_unload() doesn't account for
    the fact that in case we're crashing on some other CPU we won't get the
    CHANNELMSG_UNLOAD_RESPONSE message and our wait on the current CPU will
    never end.

    Do the following:
    1) Check for completion_done() in the loop. In case interrupt handler is
    still alive we'll get the confirmation we need.

    2) Read message pages for all CPUs as we're unsure where
    CHANNELMSG_UNLOAD_RESPONSE is going to be delivered to. We can race with
    still-alive interrupt handler doing the same, add cmpxchg() to
    vmbus_signal_eom() to not lose CHANNELMSG_UNLOAD_RESPONSE message.

    3) Cleanup message pages on all CPUs. This is required (at least for the
    current CPU as we're clearing CPU0 messages now but we may want to bring
    up additional CPUs on crash) as new messages won't be delivered till we
    consume what's pending. On boot we'll place message pages somewhere else
    and we won't be able to read stale messages.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • Hyper-V VMs can be replicated to other hosts and there is a feature to
    set a different IP for replicas, it is called 'Failover TCP/IP'. When
    such a guest starts, the Hyper-V host sends it a KVP_OP_SET_IP_INFO message as soon
    as we finish negotiation procedure. The problem is that it can happen (and
    it actually happens) before userspace daemon connects and we reply with
    HV_E_FAIL to the message. As there are no repetitions we fail to set the
    requested IP.

    Solve the issue by postponing our reply to the negotiation message till
    userspace daemon is connected. We can't wait too long as there is a
    host-side timeout (ca. 75 seconds) and if we fail to reply in this time
    frame the whole KVP service will become inactive. The solution is not
    ideal - if it takes userspace daemon more than 60 seconds to connect
    IP Failover will still fail but I don't see a solution with our current
    separation between kernel and userspace parts.

    The other two modules (VSS and FCOPY) don't require such delay, leave them
    untouched.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov