03 Apr, 2015

9 commits

  • balloon_wrk.num_pages is __u32 and it comes from host in struct dm_balloon
    where it is also __u32. We, however, use 'int' in balloon_up() and in case
    we happen to receive num_pages>INT_MAX request we'll end up allocating zero
    pages as 'num_pages < alloc_unit' check in alloc_balloon_pages() will pass.
    Change num_pages type to unsigned int.

    In real life ballooning request come with num_pages in [512, 32768] range so
    this is more a future-proof/cleanup.

    Reported-by: Laszlo Ersek
    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Laszlo Ersek
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • 'Drivers: hv: hv_balloon: refuse to balloon below the floor' fix does not
    correctly handle the case when val.freeram < num_pages as val.freeram is
    __kernel_ulong_t and the 'val.freeram - num_pages' value will be a huge
    positive value instead of being negative.

    Usually host doesn't ask us to balloon more than val.freeram but in case
    he have a memory hog started after we post the last pressure report we
    can get into troubles.

    Suggested-by: Laszlo Ersek
    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Laszlo Ersek
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • Most of the retries can be done within a millisecond successfully, so we
    sleep 1ms before the first retry, then gradually increase the retry
    interval to 2^n with max value of 2048ms. Doing so, we will have shorter
    overall delay time, because most of the cases succeed within 1-2 attempts.

    Signed-off-by: Haiyang Zhang
    Reviewed-by: K. Y. Srinivasan
    Reviewed-by: Dexuan Cui
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Haiyang Zhang
     
  • ... and simplify alloc_balloon_pages() interface by removing redundant
    alloc_error from it.

    If we happen to enter balloon_up() with balloon_wrk.num_pages = 0 we will enter
    infinite 'while (!done)' loop as alloc_balloon_pages() will be always returning
    0 and not setting alloc_error. We will also be sending a meaningless message to
    the host on every iteration.

    The 'alloc_unit == 1 && alloc_error -> num_ballooned == 0' change and
    alloc_error elimination requires a special comment. We do alloc_balloon_pages()
    with 2 different alloc_unit values and there are 4 different
    alloc_balloon_pages() results, let's check them all.

    alloc_unit = 512:
    1) num_ballooned = 0, alloc_error = 0: we do 'alloc_unit=1' and retry pre- and
    post-patch.
    2) num_ballooned > 0, alloc_error = 0: we check 'num_ballooned == num_pages'
    and act accordingly, pre- and post-patch.
    3) num_ballooned > 0, alloc_error > 0: we report this chunk and remain within
    the loop, no changes here.
    4) num_ballooned = 0, alloc_error > 0: we do 'alloc_unit=1' and retry pre- and
    post-patch.

    alloc_unit = 1:
    1) num_ballooned = 0, alloc_error = 0: this can happen in two cases: when we
    passed 'num_pages=0' to alloc_balloon_pages() or when there was no space in
    bl_resp to place a single response. The second option is not possible as
    bl_resp is of PAGE_SIZE size and single response 'union dm_mem_page_range' is
    8 bytes, but the first one is (in theory, I think that Hyper-V host never
    places such requests). Pre-patch code loops forever, post-patch code sends
    a reply with more_pages = 0 and finishes.
    2) num_ballooned > 0, alloc_error = 0: we ran out of space in bl_resp, we
    report partial success and remain within the loop, no changes pre- and
    post-patch.
    3) num_ballooned > 0, alloc_error > 0: pre-patch code finishes, post-patch code
    does one more try and if there is no progress (we finish with
    'num_ballooned = 0') we finish. So we try a bit harder with this patch.
    4) num_ballooned = 0, alloc_error > 0: both pre- and post-patch code enter
    'more_pages = 0' branch and finish.

    So this patch has two real effects:
    1) We reply with an empty response to 'num_pages=0' request.
    2) We try a bit harder on alloc_unit=1 allocations (and reply with an empty
    tail reply in case we fail).

    An empty reply should be supported by host as we were able to send it even with
    pre-patch code when we were not able to allocate a single page.

    Suggested-by: Laszlo Ersek
    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • Commit 79208c57da53 ("Drivers: hv: hv_balloon: Make adjustments in computing
    the floor") was inacurate as it introduced a jump in our piecewiese linear
    'floor' function:

    At 2048MB we have:
    Left limit:
    104 + 2048/8 = 360
    Right limit:
    256 + 2048/16 = 384 (so the right value is 232)

    We now have to make an adjustment at 8192 boundary:
    232 + 8192/16 = 744
    512 + 8192/32 = 768 (so the right value is 488)

    Suggested-by: Laszlo Ersek
    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • Currently we add memory in 128Mb blocks but the request from host can be
    aligned differently. In such case we add a partially backed block and
    when this block goes online we skip onlining pages which are not backed
    (hv_online_page() callback serves this purpose). When we receive next
    request for the same host add region we online pages which were not backed
    before with hv_bring_pgs_online(). However, we don't check if the the block
    in question was onlined and online this tail unconditionally. This is bad as
    we avoid all online_pages() logic: these pages are not accounted, we don't
    send notifications (and hv_balloon is not the only receiver of them),...
    And, first of all, nobody asked as to online these pages. Solve the issue by
    checking if the last previously backed page was onlined and onlining the tail
    only in case it was.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • It's not necessary any longer, since we can safely run the blocking
    message handlers in vmbus_connection.work_queue now.

    Signed-off-by: Dexuan Cui
    Cc: K. Y. Srinivasan
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • Since the 2 fucntions can safely run in vmbus_connection.work_queue without
    hang, we don't need to schedule new work items into the per-channel workqueue.

    Actally we can even remove the per-channel workqueue now -- we'll do it
    in the next patch.

    Signed-off-by: Dexuan Cui
    Cc: K. Y. Srinivasan
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • A work item in vmbus_connection.work_queue can sleep, waiting for a new
    host message (usually it is some kind of "completion" message). Currently
    the new message will be handled in the same workqueue, but since work items
    in the workqueue is serialized, we actually have no chance to handle
    the new message if the current work item is sleeping -- as as result, the
    current work item will hang forever.

    K. Y. has posted the below fix to resolve the issue:
    Drivers: hv: vmbus: Perform device register in the per-channel work element

    Actually we can simplify the fix by directly running non-blocking message
    handlers in the dispatch tasklet (inspired by K. Y.).

    This patch is the fundamental change. The following 2 patches will simplify
    the message offering and rescind-offering handling a lot.

    Signed-off-by: Dexuan Cui
    Cc: K. Y. Srinivasan
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     

27 Mar, 2015

1 commit


25 Mar, 2015

8 commits

  • Handle the case when the write to the ringbuffer fails. In this case,
    unconditionally signal the host. Since we may have deferred signalling
    the host based on the kick_q parameter, signalling the host
    unconditionally in this case deals with the issue.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • Export the vmbus_sendpacket_pagebuffer_ctl() interface. This export will be
    used by the netvsc driver.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • When a channel has been rescinded, the close operation is a noop.
    Restructure the code so we deal with the rescind condition after
    we properly cleanup the channel. I would like to thank
    Dexuan Cui for observing this problem.
    The current code leaks memory when the channel is rescinded.

    The current char-next branch is broken and this patch fixes
    the bug.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • The indenting makes it clear that there were curly braces intended here.

    Fixes: 2dd37cb81580 ('Drivers: hv: vmbus: Handle both rescind and offer messages in the same context')
    Signed-off-by: Dan Carpenter
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • HV_CRASH_CTL_CRASH_NOTIFY is a 64 bit number. Depending on the usage context,
    the value may be truncated. This patch is in response from the following
    email from Wu Fengguang :

    From: Wu Fengguang
    Subject: [char-misc:char-misc-testing 25/45] drivers/hv/vmbus_drv.c:67:9: sparse:
    constant 0x8000000000000000 is so big it is unsigned long

    tree: git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git char-misc-testing
    head: b3de8e3719e582f3182bb504295e4a8e43c8c96f
    commit: 96c1d0581d00f7abe033350edb021a9d947d8d81 [25/45] Drivers: hv: vmbus: Add support for VMBus panic notifier handler
    reproduce:
    # apt-get install sparse
    git checkout 96c1d0581d00f7abe033350edb021a9d947d8d81
    make ARCH=x86_64 allmodconfig
    make C=1 CF=-D__CHECK_ENDIAN__

    sparse warnings: (new ones prefixed by >>)

    drivers/hv/vmbus_drv.c:67:9: sparse: constant 0x8000000000000000 is so big it is unsigned long
    ...

    Signed-off-by: Nick Meier
    Reported-by: Wu Fengguang
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Nick Meier
     
  • Memory blocks can be onlined in random order. When this order is not natural
    some memory pages are not onlined because of the redundant check in
    hv_online_page().

    Here is a real world scenario:
    1) Host tries to hot-add the following (process_hot_add):
    pg_start=rg_start=0x48000, pfn_cnt=111616, rg_size=262144

    2) This results in adding 4 memory blocks:
    [ 109.057866] init_memory_mapping: [mem 0x48000000-0x4fffffff]
    [ 114.102698] init_memory_mapping: [mem 0x50000000-0x57ffffff]
    [ 119.168039] init_memory_mapping: [mem 0x58000000-0x5fffffff]
    [ 124.233053] init_memory_mapping: [mem 0x60000000-0x67ffffff]
    The last one is incomplete but we have special has->covered_end_pfn counter to
    avoid onlining non-backed frames and hv_bring_pgs_online() function to bring
    them online later on.

    3) Now we have 4 offline memory blocks: /sys/devices/system/memory/memory9-12
    $ for f in /sys/devices/system/memory/memory*/state; do echo $f `cat $f`; done | grep -v onlin
    /sys/devices/system/memory/memory10/state offline
    /sys/devices/system/memory/memory11/state offline
    /sys/devices/system/memory/memory12/state offline
    /sys/devices/system/memory/memory9/state offline

    4) We bring them online in non-natural order:
    $grep MemTotal /proc/meminfo
    MemTotal: 966348 kB
    $echo online > /sys/devices/system/memory/memory12/state && grep MemTotal /proc/meminfo
    MemTotal: 1019596 kB
    $echo online > /sys/devices/system/memory/memory11/state && grep MemTotal /proc/meminfo
    MemTotal: 1150668 kB
    $echo online > /sys/devices/system/memory/memory9/state && grep MemTotal /proc/meminfo
    MemTotal: 1150668 kB

    As you can see memory9 block gives us zero additional memory. We can also
    observe a huge discrepancy between host- and guest-reported memory sizes.

    The root cause of the issue is the redundant pg >= covered_start_pfn check (and
    covered_start_pfn advancing) in hv_online_page(). When upper memory block in
    being onlined before the lower one (memory12 and memory11 in the above case) we
    advance the covered_start_pfn pointer and all memory9 pages do not pass the
    check. If the assumption that host always gives us requests in sequential order
    and pg_start always equals rg_start when the first request for the new HA
    region is received (that's the case in my testing) is correct than we can get
    rid of covered_start_pfn and pg >= start_pfn check in hv_online_page() is
    sufficient.

    The current char-next branch is broken and this patch fixes
    the bug.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • When add_memory() fails the following BUG is observed:
    [ 743.646107] hv_balloon: hot_add memory failed error is -17
    [ 743.679973]
    [ 743.680930] =====================================
    [ 743.680930] [ BUG: bad unlock balance detected! ]
    [ 743.680930] 3.19.0-rc5_bug1131426+ #552 Not tainted
    [ 743.680930] -------------------------------------
    [ 743.680930] kworker/0:2/255 is trying to release lock (&dm_device.ha_region_mutex) at:
    [ 743.680930] [] mutex_unlock+0xe/0x10
    [ 743.680930] but there are no more locks to release!

    This happens as we don't acquire ha_region_mutex and hot_add_req() expects us
    to as it does unconditional mutex_unlock(). Acquire the lock on the error path.

    The current char-next branch is broken and this patch fixes
    the bug.

    Signed-off-by: Vitaly Kuznetsov
    Acked-by: Jason Wang
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • This patch is a continuation of the rescind handling cleanup work. We cannot
    block in the global message handling work context especially if we are blocking
    waiting for the host to wake us up. I would like to thank
    Dexuan Cui for observing this problem.

    The current char-next branch is broken and this patch fixes
    the bug.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     

02 Mar, 2015

22 commits

  • drivers/hv/vmbus_drv.c:51:5: sparse: symbol 'hyperv_panic_event' was not declared. Should it be static?
    drivers/hv/vmbus_drv.c:51:5: sparse: symbol 'hyperv_panic_event' was not declared. Should it be static?
    drivers/hv/vmbus_drv.c:51:5: sparse: symbol 'hyperv_panic_event' was not declared. Should it be static?
    drivers/hv/vmbus_drv.c:51:5: sparse: symbol 'hyperv_panic_event' was not declared. Should it be static?

    Signed-off-by: Fengguang Wu
    Signed-off-by: Greg Kroah-Hartman

    kbuild test robot
     
  • Implement an API that gives additional control on the what VMBUS flags will be
    set as well as if the host needs to be signalled. This API will be
    useful for clients that want to batch up requests to the host.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • Implement an API for sending pagebuffers that gives more control to the client
    in terms of setting the vmbus flags as well as deciding when to
    notify the host. This will be useful for enabling batch processing.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • The current algorithm for picking an outgoing channel was not distributing
    the load well. Implement a simple round-robin scheme to ensure good
    distribution of the outgoing traffic.

    Signed-off-by: K. Y. Srinivasan
    Reviewed-by: Long Li
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • Hyper-V allows a guest to notify the Hyper-V host that a panic
    condition occured. This notification can include up to five 64
    bit values. These 64 bit values are written into crash MSRs.
    Once the data has been written into the crash MSRs, the host is
    then notified by writing into a Crash Control MSR. On the Hyper-V
    host, the panic notification data is captured in the Windows Event
    log as a 18590 event.

    Crash MSRs are defined in appendix H of the Hypervisor Top Level
    Functional Specification. At the time of this patch, v4.0 is the
    current functional spec. The URL for the v4.0 document is:

    http://download.microsoft.com/download/A/B/4/AB43A34E-BDD0-4FA6-BDEF-79EEF16E880B/Hypervisor Top Level Functional Specification v4.0.docx

    Signed-off-by: Nick Meier
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Nick Meier
     
  • When host asks us to balloon up we need to be sure we're not committing suicide
    by overballooning. Use already existent 'floor' metric as our lowest possible
    value for free ram.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • When hot-added memory pages are not brought online or when some memory blocks
    are sent offline the subsequent ballooning process kills the guest with OOM
    killer. This happens as we don't report these pages as neither used nor free
    and apparently host algorithm considers them as being unused. Keep track of
    all online/offline operations and report all currently offline pages as being
    used so host won't try to balloon them out.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • When many memory regions are being added and automatically onlined the
    following lockup is sometimes observed:

    INFO: task udevd:1872 blocked for more than 120 seconds.
    ...
    Call Trace:
    [] schedule_timeout+0x22c/0x350
    [] wait_for_common+0x10f/0x160
    [] ? default_wake_function+0x0/0x20
    [] wait_for_completion+0x1d/0x20
    [] hv_memory_notifier+0xdc/0x120
    [] notifier_call_chain+0x4c/0x70
    ...

    When several memory blocks are going online simultaneously we got several
    hv_memory_notifier() trying to acquire the ha_region_mutex. When this mutex is
    being held by hot_add_req() all these competing acquire_region_mutex() do
    mutex_trylock, fail, and queue themselves into wait_for_completion(..). However
    when we do complete() from release_region_mutex() only one of them wakes up.
    This could be solved by changing complete() -> complete_all() memory onlining
    can be delayed as well, in that case we can still get several
    hv_memory_notifier() runners at the same time trying to grab the mutex.
    Only one of them will succeed and the others will hang for forever as
    complete() is not being called. We don't see this issue often because we have
    5sec onlining timeout in hv_mem_hot_add() and usually all udev events arrive
    in this time frame.

    Get rid of the trylock path, waiting on the mutex is supposed to provide the
    required serialization.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Kuznetsov
     
  • Currently we log messages when either we are not able to map an ID to a
    channel or when the channel does not have a callback associated
    (in the channel interrupt handling path). These messages don't add
    any value, get rid of them.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • When the offer is rescinded, vmbus_close() can free up the channel;
    deinitialize the service before closing the channel.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • Properly rollback state in vmbus_pocess_offer() in the failure paths.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • Execute both ressind and offer messages in the same work context. This serializes these
    operations and naturally addresses the various corner cases.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • In response to a rescind message, we need to remove the channel and the
    corresponding device. Cleanup this code path by factoring out the code
    to remove a channel.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • Handle the case when the device may be removed when the device has no driver
    attached to it.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • NetworkDirect is a service that supports guest RDMA.
    Define the GUID for this service.

    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • Correctly rollback state if the failure occurs after we have handed over
    the ownership of the buffer to the host.

    Signed-off-by: K. Y. Srinivasan
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    K. Y. Srinivasan
     
  • return type of wait_for_completion_timeout is unsigned long not int, this
    patch changes the type of t from int to unsigned long.

    Signed-off-by: Nicholas Mc Guire
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Mc Guire
     
  • return type of wait_for_completion_timeout is unsigned long not int, this
    patch changes the type of t from int to unsigned long.

    Signed-off-by: Nicholas Mc Guire
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Mc Guire
     
  • return type of wait_for_completion_timeout is unsigned long not int, this
    patch changes the type of t from int to unsigned long.

    Signed-off-by: Nicholas Mc Guire
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Mc Guire
     
  • Without this patch, the state is put to CHANNEL_OPENING_STATE, and when
    the driver is loaded next time, vmbus_open() will fail immediately due to
    newchannel->state != CHANNEL_OPEN_STATE.

    CC: "K. Y. Srinivasan"
    Signed-off-by: Dexuan Cui
    Reviewed-by: Jason Wang
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • I got HV_STATUS_INVALID_CONNECTION_ID on Hyper-V 2008 R2 when keeping running
    "rmmod hv_netvsc; modprobe hv_netvsc; rmmod hv_utils; modprobe hv_utils"
    in a Linux guest. Looks the host has some kind of throttling mechanism if
    some kinds of hypercalls are sent too frequently.
    Without the patch, the driver can occasionally fail to load.

    Also let's retry HV_STATUS_INSUFFICIENT_MEMORY, though we didn't get it
    before.

    Removed 'case -ENOMEM', since the hypervisor doesn't return this.

    CC: "K. Y. Srinivasan"
    Reviewed-by: Jason Wang
    Signed-off-by: Dexuan Cui
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • Before the line vmbus_open() returns, srv->util_cb can be already running
    and the variables, like util_fw_version, are needed by the srv->util_cb.

    So we have to make sure the variables are initialized before the vmbus_open().

    CC: "K. Y. Srinivasan"
    Reviewed-by: Vitaly Kuznetsov
    Reviewed-by: Jason Wang
    Signed-off-by: Dexuan Cui
    Signed-off-by: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui