17 Nov, 2011

2 commits

  • PVHVM guests running with more than 32 vcpus and pv_irq/pv_time enabled
    need VCPU placement to work, or else they will softlockup.

    CC: stable@kernel.org
    Acked-by: Stefano Stabellini
    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Konrad Rzeszutek Wilk

    Zhenzhong Duan
     
  • When mapping a foreign page with xenbus_map_ring_valloc() with the
    GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
    pass a pointer to the PTE (in init_mm).

    After the page is mapped, the usual fault mechanism can be used to
    update additional MMs. This allows the vmalloc_sync_all() to be
    removed from alloc_vm_area().

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton
    [v1: Squashed fix by Michal for no-mmu case]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Michal Simek

    David Vrabel
     

07 Nov, 2011

2 commits


25 Oct, 2011

2 commits

  • …ci.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

    * 'stable/drivers-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xenbus: don't rely on xen_initial_domain to detect local xenstore
    xenbus: Fix loopback event channel assuming domain 0
    xen/pv-on-hvm:kexec: Fix implicit declaration of function 'xen_hvm_domain'
    xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel
    xen/pv-on-hvm kexec: update xs_wire.h:xsd_sockmsg_type from xen-unstable
    xen/pv-on-hvm kexec+kdump: reset PV devices in kexec or crash kernel
    xen/pv-on-hvm kexec: rebind virqs to existing eventchannel ports
    xen/pv-on-hvm kexec: prevent crash in xenwatch_thread() when stale watch events arrive

    * 'stable/drivers.bugfixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/pciback: Check if the device is found instead of blindly assuming so.
    xen/pciback: Do not dereference psdev during printk when it is NULL.
    xen: remove XEN_PLATFORM_PCI config option
    xen: XEN_PVHVM depends on PCI
    xen/pciback: double lock typo
    xen/pciback: use mutex rather than spinlock in vpci backend
    xen/pciback: Use mutexes when working with Xenbus state transitions.
    xen/pciback: miscellaneous adjustments
    xen/pciback: use mutex rather than spinlock in passthrough backend
    xen/pciback: use resource_size()

    * 'stable/pci.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/pci: support multi-segment systems
    xen-swiotlb: When doing coherent alloc/dealloc check before swizzling the MFNs.
    xen/pci: make bus notifier handler return sane values
    xen-swiotlb: fix printk and panic args
    xen-swiotlb: Fix wrong panic.
    xen-swiotlb: Retry up three times to allocate Xen-SWIOTLB
    xen-pcifront: Update warning comment to use 'e820_host' option.

    Linus Torvalds
     
  • ….org/pub/scm/linux/kernel/git/konrad/xen

    * 'stable/bug.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/p2m/debugfs: Make type_name more obvious.
    xen/p2m/debugfs: Fix potential pointer exception.
    xen/enlighten: Fix compile warnings and set cx to known value.
    xen/xenbus: Remove the unnecessary check.
    xen/irq: If we fail during msi_capability_init return proper error code.
    xen/events: Don't check the info for NULL as it is already done.
    xen/events: BUG() when we can't allocate our event->irq array.

    * 'stable/mmu.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen: Fix selfballooning and ensure it doesn't go too far
    xen/gntdev: Fix sleep-inside-spinlock
    xen: modify kernel mappings corresponding to granted pages
    xen: add an "highmem" parameter to alloc_xenballooned_pages
    xen/p2m: Use SetPagePrivate and its friends for M2P overrides.
    xen/p2m: Make debug/xen/mmu/p2m visible again.
    Revert "xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set."

    Linus Torvalds
     

20 Oct, 2011

3 commits


30 Sep, 2011

1 commit


29 Sep, 2011

6 commits

  • In xen_memory_setup() all reserved regions and gaps are set to an
    identity (1-1) p2m mapping. If an available page has a PFN within one
    of these 1-1 mappings it will become inaccessible (as its MFN is lost)
    so release them before setting up the mapping.

    This can make an additional 256 MiB or more of RAM available
    (depending on the size of the reserved regions in the memory map) if
    the initial pages overlap with reserved regions.

    The 1:1 p2m mappings are also extended to cover partial pages. This
    fixes an issue with (for example) systems with a BIOS that puts the
    DMI tables in a reserved region that begins on a non-page boundary.

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
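The partial-page case above can be illustrated with a small sketch. It assumes 4 KiB pages and uses hypothetical helper names (pfn_down_sketch, pfn_up_sketch); it only shows why rounding a region outward covers a page that the region starts in the middle of, and is not the kernel's actual code.

```c
/* Assumption: 4 KiB pages, as on x86. */
#define PAGE_SHIFT_SKETCH 12
#define PAGE_SIZE_SKETCH  (1ULL << PAGE_SHIFT_SKETCH)

/* Frame number of the page containing addr (round down). */
static unsigned long long pfn_down_sketch(unsigned long long addr)
{
    return addr >> PAGE_SHIFT_SKETCH;
}

/* Frame number of the first page boundary at or above addr (round up). */
static unsigned long long pfn_up_sketch(unsigned long long addr)
{
    return (addr + PAGE_SIZE_SKETCH - 1) >> PAGE_SHIFT_SKETCH;
}
```

For a hypothetical reserved region starting at 0xf0800, rounding the start up gives PFN 0xf1 and leaves page 0xf0 outside the identity range, while rounding down gives PFN 0xf0 and includes the partial page.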
     
  • Allow the extra memory (used by the balloon driver) to be in multiple
    regions (typically two regions, one for low memory and one for high
    memory). This allows the balloon driver to increase the number of
    available low pages (if the initial number of pages is small).

    As a side effect, the algorithm for building the e820 memory map is
    simpler and more obviously correct as the map supplied by the
    hypervisor is (almost) used as is (in particular, all reserved regions
    and gaps are preserved). Only RAM regions are altered and RAM regions
    above max_pfn + extra_pages are marked as unused (the region is split
    in two if necessary).

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     
  • Allow the xen balloon driver to populate its list of extra pages from
    more than one region of memory. This will allow platforms to provide
    (for example) a region of low memory and a region of high memory.

    The maximum possible number of extra regions is 128 (== E820MAX) which
    is quite large so xen_extra_mem is placed in __initdata. This is safe
    as both xen_memory_setup() and balloon_init() are in __init.

    The balloon regions themselves are not altered (i.e., there is still
    only the one region).

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     
  • In xen_memory_setup() pages that occur in gaps in the memory map are
    released back to Xen. This reduces the domain's current page count in
    the hypervisor. The Xen balloon driver does not correctly decrease
    its initial current_pages count to reflect this. If 'delta' pages are
    released and the target is adjusted the resulting reservation is
    always 'delta' less than the requested target.

    This affects dom0 if the initial allocation of pages overlaps the PCI
    memory region but won't affect most domU guests that have been set up
    with pseudo-physical memory maps that don't have gaps.

    Fix this by accounting for the released pages when starting the balloon
    driver.

    If the domain's targets are managed by xapi, the domain may eventually
    run out of memory and die because xapi currently gets its target
    calculations wrong and whenever it is restarted it always reduces the
    target by 'delta'.

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
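The 'delta' shortfall described above can be reproduced with a toy model. The function name and the idea that the driver requests (target - current_pages) pages are simplifying assumptions for illustration, not the balloon driver's real interface.

```c
/*
 * Hypothetical model: the balloon driver believes it holds current_pages
 * and asks the hypervisor for (target - current_pages) more, while the
 * domain really holds actual_pages before the request.
 */
static long reservation_after_balloon(long target, long current_pages,
                                      long actual_pages)
{
    long credit = target - current_pages;

    return actual_pages + credit;
}
```

With 1000 initial pages of which 100 were released, a stale current_pages of 1000 and a target of 1200 leaves the domain at 1100 pages; starting from the corrected count of 900 reaches the full 1200.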
     
  • Xen PV on HVM guests require PCI support because they need the
    xen-platform-pci driver in order to initialize xenbus.

    Signed-off-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     
  • If we want to use granted pages for AIO, changing the mappings of a user
    vma and the corresponding p2m is not enough, we also need to update the
    kernel mappings accordingly.
    Currently this is only needed for pages that are created for user usages
    through /dev/xen/gntdev. As in, pages that have been in use by the
    kernel and use the P2M will not need this special mapping.
    However there are no guarantees that in the future the kernel won't
    start accessing pages through the 1:1 even for internal usage.

    In order to avoid the complexity of dealing with highmem, we allocate
    the pages in lowmem.
    We issue a HYPERVISOR_grant_table_op right away in
    m2p_add_override and we remove the mappings using another
    HYPERVISOR_grant_table_op in m2p_remove_override.
    Considering that m2p_add_override and m2p_remove_override are called
    once per page we use multicalls and hypercall batching.

    Use the kmap_op pointer directly as argument to do the mapping as it is
    guaranteed to be present up until the unmapping is done.
    Before issuing any unmapping multicalls, we need to make sure that the
    mapping has already been done, because we need the kmap->handle to be
    set correctly.

    Signed-off-by: Stefano Stabellini
    [v1: Removed GRANT_FRAME_BIT usage]
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     

27 Sep, 2011

1 commit


24 Sep, 2011

2 commits


17 Sep, 2011

1 commit


15 Sep, 2011

1 commit


13 Sep, 2011

2 commits

  • The patch "xen: use maximum reservation to limit amount of usable RAM"
    (d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that
    do not use 'dom0_mem=' argument with:

    reserve RAM buffer: 000000133f2e2000 - 000000133fffffff
    (XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e
    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
    ...

    The reason being that the last E820 entry is created using the
    'extra_pages' (which is based on how many pages have been freed).
    The mentioned git commit sets the initial value of 'extra_pages'
    using a hypercall which returns the number of pages (if dom0_mem
    has been used) or -1 otherwise. If the latter, we return with
    MAX_DOMAIN_PAGES as basis for calculation:

    return min(max_pages, MAX_DOMAIN_PAGES);

    and use it:

    extra_limit = xen_get_max_pages();
    if (extra_limit >= max_pfn)
        extra_pages = extra_limit - max_pfn;
    else
        extra_pages = 0;

    which means we end up with extra_pages = 128GB in PFNs (33554432)
    - 8GB in PFNs (2097152, on this specific box, can be larger or smaller),
    and then we add that value to the E820 making it:

    Xen: 00000000ff000000 - 0000000100000000 (reserved)
    Xen: 0000000100000000 - 000000133f2e2000 (usable)

    which is clearly wrong. It should look as so:

    Xen: 00000000ff000000 - 0000000100000000 (reserved)
    Xen: 0000000100000000 - 000000027fbda000 (usable)

    Naturally this problem does not present itself if dom0_mem=max:X
    is used.

    CC: stable@kernel.org
    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
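The arithmetic in the commit message above can be checked with a short sketch. The function names are hypothetical stand-ins, the hypercall is modelled as a plain argument, and the constants are the page counts quoted in the message (128GB and 8GB in 4 KiB PFNs).

```c
/* 128GB in PFNs, as quoted in the commit message. */
#define MAX_DOMAIN_PAGES_SKETCH 33554432UL

/*
 * Hypothetical stand-in for xen_get_max_pages(): hypercall_ret models the
 * hypercall result, which is -1 when dom0_mem was not given.
 */
static unsigned long get_max_pages_sketch(long hypercall_ret)
{
    unsigned long max_pages = MAX_DOMAIN_PAGES_SKETCH;

    if (hypercall_ret > 0)
        max_pages = (unsigned long)hypercall_ret;
    return max_pages < MAX_DOMAIN_PAGES_SKETCH ? max_pages
                                               : MAX_DOMAIN_PAGES_SKETCH;
}

/* Same shape as the snippet quoted in the commit message. */
static unsigned long extra_pages_sketch(unsigned long extra_limit,
                                        unsigned long max_pfn)
{
    if (extra_limit >= max_pfn)
        return extra_limit - max_pfn;
    return 0;
}
```

Without dom0_mem the limit falls back to 33554432 PFNs, so on the 8GB box (max_pfn 2097152) extra_pages becomes 33554432 - 2097152 = 31457280 PFNs, far more than the machine has.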
     
  • * 'upstream/bugfix' of git://github.com/jsgf/linux-xen:
    xen: use non-tracing preempt in xen_clocksource_read()

    Linus Torvalds
     

09 Sep, 2011

1 commit

  • PV spinlocks cannot possibly work with the current code because they are
    enabled after pvops patching has already been done, and because PV
    spinlocks use a different data structure than native spinlocks so we
    cannot switch between them dynamically. A spinlock that has been taken
    once by the native code (__ticket_spin_lock) cannot be taken by
    __xen_spin_lock even after it has been released.

    Reported-and-Tested-by: Stefan Bader
    Signed-off-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     

07 Sep, 2011

1 commit


02 Sep, 2011

2 commits

  • We have hit a couple of customer bugs where they would like to
    use those parameters to run an UP kernel - but both of those
    options turn off important sources of interrupt information so
    we end up not being able to boot. The correct way is to
    pass in 'dom0_max_vcpus=1' on the Xen hypervisor line and
    the kernel will patch itself to be a UP kernel.

    Fixes bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637308

    CC: stable@kernel.org
    Acked-by: Ian Campbell
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • If a vmalloc page fault happens inside an interrupt handler with interrupts
    disabled, then on the exit path from the exception handler, when there are no
    pending interrupts, the following code (arch/x86/xen/xen-asm_32.S:112):

    cmpw $0x0001, XEN_vcpu_info_pending(%eax)
    sete XEN_vcpu_info_mask(%eax)

    will enable interrupts even if they have been previously disabled according to
    eflags from the bounce frame (arch/x86/xen/xen-asm_32.S:99)

    testb $X86_EFLAGS_IF>>8, 8+1+ESP_OFFSET(%esp)
    setz XEN_vcpu_info_mask(%eax)

    The solution is to set XEN_vcpu_info_mask only when it should be set
    according to
    cmpw $0x0001, XEN_vcpu_info_pending(%eax)
    but not to clear it if there aren't any pending events.

    Reproducer for bug is attached to RHBZ 707552

    CC: stable@kernel.org
    Signed-off-by: Igor Mammedov
    Acked-by: Jeremy Fitzhardinge
    Signed-off-by: Konrad Rzeszutek Wilk

    Igor Mammedov
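The difference between the old and the fixed exit path can be modelled in C. This is a simplified sketch: the struct and function names are hypothetical, and it assumes the 16-bit compare covers the pending byte (low) and the mask byte (high), so the word equals 0x0001 exactly when pending is 1 and mask is 0.

```c
#include <stdint.h>

/* Simplified model of the two per-vcpu bytes named in the commit message. */
struct vcpu_info_sketch {
    uint8_t pending; /* XEN_vcpu_info_pending */
    uint8_t mask;    /* XEN_vcpu_info_mask: 1 means events are masked */
};

/* True when the 16-bit word {pending, mask} equals 0x0001 (assumed layout). */
static int word_is_0x0001(const struct vcpu_info_sketch *v)
{
    return v->pending == 1 && v->mask == 0;
}

/* Old behaviour: "sete" always overwrites the mask, clearing it if unequal. */
static void exit_path_old(struct vcpu_info_sketch *v)
{
    v->mask = word_is_0x0001(v) ? 1 : 0;
}

/* Fixed behaviour: set the mask when equal, otherwise leave it untouched. */
static void exit_path_fixed(struct vcpu_info_sketch *v)
{
    if (word_is_0x0001(v))
        v->mask = 1;
}
```

Starting from pending = 0 and mask = 1 (interrupts disabled, nothing pending), the old path leaves mask = 0 and so re-enables interrupts, while the fixed path keeps mask = 1.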
     

01 Sep, 2011

1 commit

  • Use the domain's maximum reservation to limit the amount of extra RAM
    for the memory balloon. This reduces the size of the page tables and
    the amount of reserved low memory (which defaults to about 1/32 of the
    total RAM).

    On a system with 8 GiB of RAM with the domain limited to 1 GiB the
    kernel reports:

    Before:

    Memory: 627792k/4472000k available

    After:

    Memory: 549740k/11132224k available

    An increase of about 76 MiB (~1.5% of the unused 7 GiB). The reserved
    low memory is also reduced from 253 MiB to 32 MiB. The total
    additional usable RAM is 329 MiB.

    For dom0, this requires a patch to Xen ('x86: use 'dom0_mem' to limit
    the number of pages for dom0') (c/s 23790)

    CC: stable@kernel.org
    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     

25 Aug, 2011

1 commit

  • The tracing code used sched_clock() to get tracing timestamps, which
    ends up calling xen_clocksource_read(). xen_clocksource_read() must
    disable preemption, but if preemption tracing is enabled, this results
    in infinite recursion.

    I've only noticed this when boot-time tracing tests are enabled, but it
    seems like a generic bug. It looks like it would also affect
    kvm_clocksource_read().

    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Avi Kivity
    Cc: Marcelo Tosatti

    Jeremy Fitzhardinge
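The recursion described above can be sketched as a toy model. All names here are hypothetical, the tracer is reduced to a single hook, and an arbitrary depth limit stands in for the stack overflow that unbounded recursion would cause.

```c
#define DEPTH_LIMIT 8 /* arbitrary cut-off so the model terminates */

static int depth;
static int max_depth;

static void clock_read_sketch(int traced_preempt);

/* Models the preemption tracer, which needs a timestamp for its event. */
static void preempt_trace_hook(void)
{
    clock_read_sketch(1);
}

/* Models a clocksource read that disables preemption before reading. */
static void clock_read_sketch(int traced_preempt)
{
    depth++;
    if (depth > max_depth)
        max_depth = depth;
    if (traced_preempt && depth < DEPTH_LIMIT)
        preempt_trace_hook(); /* the traced preempt_disable() re-enters us */
    depth--;
}

/* Returns the deepest nesting reached by one top-level clock read. */
static int nesting_for(int traced_preempt)
{
    depth = 0;
    max_depth = 0;
    clock_read_sketch(traced_preempt);
    return max_depth;
}
```

With the traced variant the model nests until it hits the cut-off; with a non-tracing preempt disable the read happens exactly once.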
     

23 Aug, 2011

1 commit


22 Aug, 2011

2 commits

  • Steven Rostedt says we should use CONFIG_EVENT_TRACING.

    Cc: Steven Rostedt
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Konrad Rzeszutek Wilk

    Jeremy Fitzhardinge
     
  • Fix regression for HVM case on older (
    Date: Thu Dec 2 17:55:10 2010 +0000

    xen: PV on HVM: support PV spinlocks and IPIs

    This change replaced the SMP operations with event based handlers without
    taking into account that this only works when the hypervisor supports
    callback vectors. This causes unexplainable hangs early on boot for
    HVM guests with more than one CPU.

    BugLink: http://bugs.launchpad.net/bugs/791850

    CC: stable@kernel.org
    Signed-off-by: Stefan Bader
    Signed-off-by: Stefano Stabellini
    Tested-and-Reported-by: Stefan Bader
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     

17 Aug, 2011

1 commit

  • The order-based approach is not only less efficient (requiring a shift
    and a compare, typical generated code looking like this

    mov eax, [machine_to_phys_order]
    mov ecx, eax
    shr ebx, cl
    test ebx, ebx
    jnz ...

    whereas a direct check requires just a compare, like in

    cmp ebx, [machine_to_phys_nr]
    jae ...

    ), but also slightly dangerous in the 32-on-64 case - the element
    address calculation can wrap if the next power of two boundary is
    sufficiently far away from the actual upper limit of the table, and
    hence can result in user space addresses being accessed (with it being
    unknown what may actually be mapped there).

    Additionally, the elimination of the mistaken use of fls() here (should
    have been __fls()) fixes a latent issue on x86-64 that would trigger
    if the code was run on a system with memory extending beyond the 44-bit
    boundary.

    CC: stable@kernel.org
    Signed-off-by: Jan Beulich
    [v1: Based on Jeremy's feedback]
    Signed-off-by: Konrad Rzeszutek Wilk

    Jan Beulich
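The gap between the two range checks can be seen in a small sketch. The helper names are hypothetical and the table size is made up; the point is only that an order-based check accepts every index below the next power of two, while a direct check stops at the real table size.

```c
/* Smallest order such that (1UL << order) >= nr; a hypothetical helper. */
static unsigned int order_for(unsigned long nr)
{
    unsigned int order = 0;

    while ((1UL << order) < nr)
        order++;
    return order;
}

/* Order-based check: accepts every index below the next power of two. */
static int in_range_by_order(unsigned long mfn, unsigned int order)
{
    return (mfn >> order) == 0;
}

/* Direct check: accepts only indices inside the table. */
static int in_range_by_nr(unsigned long mfn, unsigned long nr)
{
    return mfn < nr;
}
```

For a table of 5 entries the order is 3, so index 6 passes the order-based check even though it lies beyond the table; the direct check rejects it.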
     

13 Aug, 2011

1 commit

  • * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-tip:
    x86-64: Rework vsyscall emulation and add vsyscall= parameter
    x86-64: Wire up getcpu syscall
    x86: Remove unnecessary compile flag tweaks for vsyscall code
    x86-64: Add vsyscall:emulate_vsyscall trace event
    x86-64: Add user_64bit_mode paravirt op
    x86-64, xen: Enable the vvar mapping
    x86-64: Work around gold bug 13023
    x86-64: Move the "user" vsyscall segment out of the data segment.
    x86-64: Pad vDSO to a page boundary

    Linus Torvalds
     

10 Aug, 2011

1 commit


07 Aug, 2011

1 commit

  • * 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/trace: Fix compile error when CONFIG_XEN_PRIVILEGED_GUEST is not set
    xen: Fix misleading WARN message at xen_release_chunk
    xen: Fix printk() format in xen/setup.c
    xen/tracing: it looks like we wanted CONFIG_FTRACE
    xen/self-balloon: Add dependency on tmem.
    xen/balloon: Fix compile errors - missing header files.
    xen/grant: Fix compile warning.
    xen/pciback: remove duplicated #include

    Linus Torvalds
     

05 Aug, 2011

4 commits

  • With CONFIG_XEN and CONFIG_FTRACE set we get this:

    arch/x86/xen/trace.c:22: error: ‘__HYPERVISOR_console_io’ undeclared here (not in a function)
    arch/x86/xen/trace.c:22: error: array index in initializer not of integer type
    arch/x86/xen/trace.c:22: error: (near initialization for ‘xen_hypercall_names’)
    arch/x86/xen/trace.c:23: error: ‘__HYPERVISOR_physdev_op_compat’ undeclared here (not in a function)

    Issue was that the definitions of __HYPERVISOR were not pulled
    if CONFIG_XEN_PRIVILEGED_GUEST was not set.

    Reported-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Acked-by: Ingo Molnar
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • Three places in the kernel assume that the only long mode CPL 3
    selector is __USER_CS. This is not true on Xen -- Xen's sysretq
    changes cs to the magic value 0xe033.

    Two of the places are corner cases, but as of "x86-64: Improve
    vsyscall emulation CS and RIP handling"
    (c9712944b2a12373cb6ff8059afcfb7e826a6c54), vsyscalls will segfault
    if called with Xen's extra CS selector. This causes a panic when
    older init builds die.

    It seems impossible to make Xen use __USER_CS reliably without
    taking a performance hit on every system call, so this fixes the
    tests instead with a new paravirt op. It's a little ugly because
    ptrace.h can't include paravirt.h.

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/f4fcb3947340d9e96ce1054a432f183f9da9db83.1312378163.git.luto@mit.edu
    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: H. Peter Anvin

    Andy Lutomirski
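The idea behind the new test can be sketched as follows. The function names are hypothetical, 0x33 is assumed to be the value of __USER_CS on x86-64, and 0xe033 is the Xen selector quoted in the commit message; the real paravirt op may be structured differently.

```c
#include <stdint.h>

#define USER_CS_SKETCH     0x33   /* assumed value of __USER_CS on x86-64 */
#define XEN_USER_CS_SKETCH 0xe033 /* Xen's selector, from the commit text */

/* Native-style test: only __USER_CS counts as 64-bit user mode. */
static int user_64bit_mode_native(uint16_t cs)
{
    return cs == USER_CS_SKETCH;
}

/* Paravirt-aware test: also accept one extra selector set by the platform. */
static int user_64bit_mode_pv(uint16_t cs, uint16_t extra_user_cs)
{
    return cs == USER_CS_SKETCH || cs == extra_user_cs;
}
```

A task returning through Xen's sysretq with cs = 0xe033 fails the native-style test but passes the paravirt-aware one when 0xe033 is registered as the extra selector.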
     
  • Xen needs to handle VVAR_PAGE, introduced in git commit:
    9fd67b4ed0714ab718f1f9bd14c344af336a6df7
    x86-64: Give vvars their own page

    Otherwise we die during bootup with a message like:

    (XEN) mm.c:940:d10 Error getting mfn 1888 (pfn 1e3e48) from L1 entry
    8000000001888465 for l1e_owner=10, pg_owner=10
    (XEN) mm.c:5049:d10 ptwr_emulate: could not get_page_from_l1e()
    [ 0.000000] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 0.000000] IP: [] xen_set_pte+0x20/0xe0

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/4659478ed2f3480938f96491c2ecbe2b2e113a23.1312378163.git.luto@mit.edu
    Reviewed-by: Konrad Rzeszutek Wilk
    Signed-off-by: H. Peter Anvin

    Andy Lutomirski
     
  • The WARN message
    "Failed to release memory %lx-%lx err=%d\n"
    should not complain about a range when it fails to release just one
    page; instead it should say which pfn was not freed.

    In addition, the lines:
    printk(KERN_INFO "xen_release_chunk: looking at area pfn %lx-%lx: "
    ...
    printk(KERN_CONT "%lu pages freed\n", len);
    will be broken up if the WARN between them fires. So fix it
    by using a single printk for this.

    Signed-off-by: Igor Mammedov
    Signed-off-by: Konrad Rzeszutek Wilk

    Igor Mammedov