16 Dec, 2011

2 commits

  • …kernel/git/konrad/xen

    * 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/swiotlb: Use page alignment for early buffer allocation.
    xen: only limit memory map to maximum reservation for domain 0.

    Linus Torvalds
     
  • d312ae878b6a "xen: use maximum reservation to limit amount of usable RAM"
    clamped the total amount of RAM to the current maximum reservation. This is
    correct for dom0 but is not correct for guest domains. To boot a guest
    "pre-ballooned" (e.g. with memory=1G but maxmem=2G) and allow for
    future memory expansion, the guest must derive max_pfn from the e820 provided by
    the toolstack and not from the current maximum reservation (which reflects only
    the current maximum, not the guest lifetime max). The existing algorithm
    already behaves correctly if we do not artificially limit the maximum
    number of pages for the guest case.

    For a guest booted with maxmem=512, memory=128 this results in:
    [ 0.000000] BIOS-provided physical RAM map:
    [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
    [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
    -[ 0.000000] Xen: 0000000000100000 - 0000000008100000 (usable)
    -[ 0.000000] Xen: 0000000008100000 - 0000000020800000 (unusable)
    +[ 0.000000] Xen: 0000000000100000 - 0000000020800000 (usable)
    ...
    [ 0.000000] NX (Execute Disable) protection: active
    [ 0.000000] DMI not present or invalid.
    [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
    [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
    -[ 0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000
    +[ 0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000
    [ 0.000000] initial memory mapped : 0 - 027ff000
    [ 0.000000] Base memory trampoline at [c009f000] 9f000 size 4096
    -[ 0.000000] init_memory_mapping: 0000000000000000-0000000008100000
    -[ 0.000000] 0000000000 - 0008100000 page 4k
    -[ 0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000
    +[ 0.000000] init_memory_mapping: 0000000000000000-0000000020800000
    +[ 0.000000] 0000000000 - 0020800000 page 4k
    +[ 0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000
    [ 0.000000] xen: setting RW the range 27e8000 - 27ff000
    [ 0.000000] 0MB HIGHMEM available.
    -[ 0.000000] 129MB LOWMEM available.
    -[ 0.000000] mapped low ram: 0 - 08100000
    -[ 0.000000] low ram: 0 - 08100000
    +[ 0.000000] 520MB LOWMEM available.
    +[ 0.000000] mapped low ram: 0 - 20800000
    +[ 0.000000] low ram: 0 - 20800000

    With this change "xl mem-set 512M" will successfully increase the
    guest RAM (by reducing the balloon).

    There is no change for dom0.

    Reported-and-Tested-by: George Shuklin
    Signed-off-by: Ian Campbell
    Cc: stable@kernel.org
    Reviewed-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    Ian Campbell
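
A minimal sketch of the rule described in the entry above, not the kernel's actual code. The helper name `model_max_pages()` and the value chosen for `MAX_DOMAIN_PAGES` are assumptions made for illustration. Only dom0 is clamped to its current maximum reservation; a guest keeps the full limit so that max_pfn can follow the e820 supplied by the toolstack.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative upper bound: 128 GiB in 4 KiB pages (an assumption). */
#define MAX_DOMAIN_PAGES (1UL << 25)

/*
 * Hypothetical model. 'reservation' is the domain's current maximum
 * reservation in pages, or <= 0 when no limit is reported.
 */
unsigned long model_max_pages(bool is_dom0, long reservation)
{
    unsigned long max_pages = MAX_DOMAIN_PAGES;

    /* Only dom0 is limited by the current maximum reservation. */
    if (is_dom0 && reservation > 0)
        max_pages = (unsigned long)reservation;

    return max_pages < MAX_DOMAIN_PAGES ? max_pages : MAX_DOMAIN_PAGES;
}

/* Usage: a dom0 limited to 1 GiB versus a guest booted with memory=128M. */
void model_max_pages_example(void)
{
    assert(model_max_pages(true, 262144) == 262144);
    assert(model_max_pages(false, 32768) == MAX_DOMAIN_PAGES);
}
```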
     

04 Dec, 2011

1 commit

  • The idea behind commit d91ee5863b71 ("cpuidle: replace xen access to x86
    pm_idle and default_idle") was to have one call - disable_cpuidle()
    which would make pm_idle not be molested by other code. It prevents
    cpuidle_idle_call from being set as pm_idle (which is excellent).

    But in the select_idle_routine() and idle_setup(), the pm_idle can still
    be set to either: amd_e400_idle, mwait_idle or default_idle. This
    depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.

    In case of mwait_idle we can hit some instances where the hypervisor
    (Amazon EC2 specifically) sets the MWAIT and we get:

    Brought up 2 CPUs
    invalid opcode: 0000 [#1] SMP

    Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
    RIP: e030:[] [] mwait_idle+0x6f/0xb4
    ...
    Call Trace:
    [] cpu_idle+0xae/0xe8
    [] cpu_bringup_and_idle+0xe/0x10
    RIP [] mwait_idle+0x6f/0xb4
    RSP

    In the case of amd_e400_idle we don't get such spectacular crashes, but we
    do end up making an MSR access which is trapped in the hypervisor, and then
    follow it up with a yield hypercall. This means we end up going to the
    hypervisor twice instead of just once.

    The previous behavior before v3.0 was that pm_idle was set to
    default_idle regardless of select_idle_routine/idle_setup.

    We want to do that, but only for one specific case: Xen. This patch
    does that.

    Fixes RH BZ #739499 and Ubuntu #881076
    Reported-by: Stefan Bader
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Linus Torvalds

    Konrad Rzeszutek Wilk
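
A simplified sketch of the selection logic discussed in the entry above. The struct, its fields and `model_select_idle()` are hypothetical, and the idle routines are empty stand-ins; the real select_idle_routine() and idle_setup() are more involved. The point is only that, when running on Xen, the choice falls back to default_idle regardless of the CPU flags.

```c
#include <assert.h>
#include <stdbool.h>

typedef void (*idle_fn)(void);

/* Stand-ins for the idle routines named in the commit message. */
static int default_calls, mwait_calls, e400_calls;
static void default_idle(void)  { default_calls++; }
static void mwait_idle(void)    { mwait_calls++; }
static void amd_e400_idle(void) { e400_calls++; }

/* Hypothetical CPU description; the field names are assumptions. */
struct cpu_model {
    bool has_mwait;
    bool is_amd_e400;
    bool on_xen;
};

/* Pick the idle routine; under Xen always use default_idle. */
idle_fn model_select_idle(const struct cpu_model *cpu)
{
    if (cpu->on_xen)
        return default_idle;
    if (cpu->has_mwait)
        return mwait_idle;
    if (cpu->is_amd_e400)
        return amd_e400_idle;
    return default_idle;
}

/* Usage: a Xen guest that advertises MWAIT must not end up in mwait_idle. */
void model_select_idle_example(void)
{
    struct cpu_model guest = { .has_mwait = true, .on_xen = true };

    assert(model_select_idle(&guest) == default_idle);
}
```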
     

17 Nov, 2011

2 commits

  • PVHVM guests running with more than 32 vcpus and pv_irq/pv_time enabled
    need VCPU placement to work, or else they will soft lockup.

    CC: stable@kernel.org
    Acked-by: Stefano Stabellini
    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Konrad Rzeszutek Wilk

    Zhenzhong Duan
     
  • When mapping a foreign page with xenbus_map_ring_valloc() with the
    GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
    pass a pointer to the PTE (in init_mm).

    After the page is mapped, the usual fault mechanism can be used to
    update additional MMs. This allows the vmalloc_sync_all() to be
    removed from alloc_vm_area().

    Signed-off-by: David Vrabel
    Acked-by: Andrew Morton
    [v1: Squashed fix by Michal for no-mmu case]
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: Michal Simek

    David Vrabel
     

07 Nov, 2011

2 commits


25 Oct, 2011

2 commits

  • …ci.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

    * 'stable/drivers-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xenbus: don't rely on xen_initial_domain to detect local xenstore
    xenbus: Fix loopback event channel assuming domain 0
    xen/pv-on-hvm:kexec: Fix implicit declaration of function 'xen_hvm_domain'
    xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel
    xen/pv-on-hvm kexec: update xs_wire.h:xsd_sockmsg_type from xen-unstable
    xen/pv-on-hvm kexec+kdump: reset PV devices in kexec or crash kernel
    xen/pv-on-hvm kexec: rebind virqs to existing eventchannel ports
    xen/pv-on-hvm kexec: prevent crash in xenwatch_thread() when stale watch events arrive

    * 'stable/drivers.bugfixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/pciback: Check if the device is found instead of blindly assuming so.
    xen/pciback: Do not dereference psdev during printk when it is NULL.
    xen: remove XEN_PLATFORM_PCI config option
    xen: XEN_PVHVM depends on PCI
    xen/pciback: double lock typo
    xen/pciback: use mutex rather than spinlock in vpci backend
    xen/pciback: Use mutexes when working with Xenbus state transitions.
    xen/pciback: miscellaneous adjustments
    xen/pciback: use mutex rather than spinlock in passthrough backend
    xen/pciback: use resource_size()

    * 'stable/pci.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/pci: support multi-segment systems
    xen-swiotlb: When doing coherent alloc/dealloc check before swizzling the MFNs.
    xen/pci: make bus notifier handler return sane values
    xen-swiotlb: fix printk and panic args
    xen-swiotlb: Fix wrong panic.
    xen-swiotlb: Retry up three times to allocate Xen-SWIOTLB
    xen-pcifront: Update warning comment to use 'e820_host' option.

    Linus Torvalds
     
  • ….org/pub/scm/linux/kernel/git/konrad/xen

    * 'stable/bug.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/p2m/debugfs: Make type_name more obvious.
    xen/p2m/debugfs: Fix potential pointer exception.
    xen/enlighten: Fix compile warnings and set cx to known value.
    xen/xenbus: Remove the unnecessary check.
    xen/irq: If we fail during msi_capability_init return proper error code.
    xen/events: Don't check the info for NULL as it is already done.
    xen/events: BUG() when we can't allocate our event->irq array.

    * 'stable/mmu.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen: Fix selfballooning and ensure it doesn't go too far
    xen/gntdev: Fix sleep-inside-spinlock
    xen: modify kernel mappings corresponding to granted pages
    xen: add an "highmem" parameter to alloc_xenballooned_pages
    xen/p2m: Use SetPagePrivate and its friends for M2P overrides.
    xen/p2m: Make debug/xen/mmu/p2m visible again.
    Revert "xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set."

    Linus Torvalds
     

20 Oct, 2011

3 commits


30 Sep, 2011

1 commit


29 Sep, 2011

6 commits

  • In xen_memory_setup() all reserved regions and gaps are set to an
    identity (1-1) p2m mapping. If an available page has a PFN within one
    of these 1-1 mappings it will become inaccessible (as its MFN is lost),
    so release such pages before setting up the mapping.

    This can make an additional 256 MiB or more of RAM available
    (depending on the size of the reserved regions in the memory map) if
    the initial pages overlap with reserved regions.

    The 1:1 p2m mappings are also extended to cover partial pages. This
    fixes an issue with (for example) systems with a BIOS that puts the
    DMI tables in a reserved region that begins on a non-page boundary.

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
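
One way to picture "extended to cover partial pages" from the entry above: round the start of a reserved region down and its end up to page boundaries before marking the range as identity-mapped. The names and the fixed 4 KiB page size below are assumptions for this sketch, not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Half-open PFN range [start_pfn, end_pfn). */
struct pfn_range {
    uint64_t start_pfn;
    uint64_t end_pfn;
};

/* Round an address down to the page that contains it. */
static uint64_t model_pfn_down(uint64_t addr)
{
    return addr >> MODEL_PAGE_SHIFT;
}

/* Round an address up to the next page boundary. */
static uint64_t model_pfn_up(uint64_t addr)
{
    return (addr + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
}

/* PFNs to identity-map for the byte range [start, end). */
struct pfn_range model_identity_range(uint64_t start, uint64_t end)
{
    struct pfn_range r;

    assert(start <= end);
    r.start_pfn = model_pfn_down(start);  /* include a leading partial page */
    r.end_pfn = model_pfn_up(end);        /* include a trailing partial page */
    return r;
}
```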
     
  • Allow the extra memory (used by the balloon driver) to be in multiple
    regions (typically two regions, one for low memory and one for high
    memory). This allows the balloon driver to increase the number of
    available low pages (if the initial number of pages is small).

    As a side effect, the algorithm for building the e820 memory map is
    simpler and more obviously correct as the map supplied by the
    hypervisor is (almost) used as is (in particular, all reserved regions
    and gaps are preserved). Only RAM regions are altered and RAM regions
    above max_pfn + extra_pages are marked as unused (the region is split
    in two if necessary).

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     
  • Allow the xen balloon driver to populate its list of extra pages from
    more than one region of memory. This will allow platforms to provide
    (for example) a region of low memory and a region of high memory.

    The maximum possible number of extra regions is 128 (== E820MAX) which
    is quite large so xen_extra_mem is placed in __initdata. This is safe
    as both xen_memory_setup() and balloon_init() are in __init.

    The balloon regions themselves are not altered (i.e., there is still
    only the one region).

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
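
A hedged sketch of how a fixed table of extra memory regions could be filled, as the entries above describe. The table size mirrors the 128 (== E820MAX) quoted in the message; the struct, the function name and the merge-adjacent policy are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_REGIONS 128  /* the E820MAX value quoted above */

struct extra_region {
    uint64_t start;
    uint64_t size;   /* 0 marks an unused slot */
};

static struct extra_region model_extra_mem[MODEL_MAX_REGIONS];

/*
 * Record a region. It is appended to an existing region when it starts
 * exactly where that region ends, otherwise it takes the first free slot.
 * Returns the slot index used, or -1 when the table is full.
 */
int model_add_extra_mem(uint64_t start, uint64_t size)
{
    size_t i;

    assert(size > 0);

    for (i = 0; i < MODEL_MAX_REGIONS; i++) {
        struct extra_region *r = &model_extra_mem[i];

        if (r->size == 0) {                 /* first free slot */
            r->start = start;
            r->size = size;
            return (int)i;
        }
        if (r->start + r->size == start) {  /* adjacent: extend it */
            r->size += size;
            return (int)i;
        }
    }
    return -1;
}
```

With this layout a low-memory region and a high-memory region simply occupy two slots, while contiguous additions collapse into one.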
     
  • In xen_memory_setup() pages that occur in gaps in the memory map are
    released back to Xen. This reduces the domain's current page count in
    the hypervisor. The Xen balloon driver does not correctly decrease
    its initial current_pages count to reflect this. If 'delta' pages are
    released and the target is adjusted the resulting reservation is
    always 'delta' less than the requested target.

    This affects dom0 if the initial allocation of pages overlaps the PCI
    memory region but won't affect most domU guests that have been set up
    with pseudo-physical memory maps that don't have gaps.

    Fix this by accounting for the released pages when starting the balloon
    driver.

    If the domain's targets are managed by xapi, the domain may eventually
    run out of memory and die because xapi currently gets its target
    calculations wrong and whenever it is restarted it always reduces the
    target by 'delta'.

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
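
The 'delta' error described in the entry above can be reproduced with a few lines of arithmetic. This is a toy model under stated assumptions: the function name and the `account_for_released` switch are hypothetical, and the driver is assumed to add or remove exactly (target - believed count) pages.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns the domain's real reservation after the balloon driver has
 * moved to 'target', given what it believed its starting count was.
 */
long model_reach_target(long initial_pages, long released_pages,
                        long target, bool account_for_released)
{
    long actual, believed;

    assert(released_pages >= 0 && released_pages <= initial_pages);

    /* What the hypervisor really holds for the domain at start-up. */
    actual = initial_pages - released_pages;

    /* What the driver thinks it holds. */
    believed = account_for_released ? actual : initial_pages;

    /* The driver changes the reservation by (target - believed). */
    return actual + (target - believed);
}

/* Usage: 50 pages released at boot, target of 800 pages. */
void model_reach_target_example(void)
{
    assert(model_reach_target(1000, 50, 800, false) == 750); /* short by delta */
    assert(model_reach_target(1000, 50, 800, true) == 800);  /* hits target */
}
```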
     
  • Xen PV on HVM guests require PCI support because they need the
    xen-platform-pci driver in order to initialize xenbus.

    Signed-off-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     
  • If we want to use granted pages for AIO, changing the mappings of a user
    vma and the corresponding p2m is not enough, we also need to update the
    kernel mappings accordingly.
    Currently this is only needed for pages that are created for user-space use
    through /dev/xen/gntdev. That is, pages that have been in use by the
    kernel and use the P2M will not need this special mapping.
    However there are no guarantees that in the future the kernel won't
    start accessing pages through the 1:1 even for internal usage.

    In order to avoid the complexity of dealing with highmem, we allocate
    the pages in lowmem.
    We issue a HYPERVISOR_grant_table_op right away in
    m2p_add_override and we remove the mappings using another
    HYPERVISOR_grant_table_op in m2p_remove_override.
    Considering that m2p_add_override and m2p_remove_override are called
    once per page we use multicalls and hypercall batching.

    Use the kmap_op pointer directly as argument to do the mapping as it is
    guaranteed to be present up until the unmapping is done.
    Before issuing any unmapping multicalls, we need to make sure that the
    mapping has already been done, because we need the kmap->handle to be
    set correctly.

    Signed-off-by: Stefano Stabellini
    [v1: Removed GRANT_FRAME_BIT usage]
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     

27 Sep, 2011

1 commit


24 Sep, 2011

2 commits


17 Sep, 2011

1 commit


15 Sep, 2011

1 commit


13 Sep, 2011

2 commits

  • The patch "xen: use maximum reservation to limit amount of usable RAM"
    (d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that
    do not use 'dom0_mem=' argument with:

    reserve RAM buffer: 000000133f2e2000 - 000000133fffffff
    (XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e
    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
    ...

    The reason being that the last E820 entry is created using the
    'extra_pages' (which is based on how many pages have been freed).
    The mentioned git commit sets the initial value of 'extra_pages'
    using a hypercall which returns the number of pages (if dom0_mem
    has been used) or -1 otherwise. In the latter case we return with
    MAX_DOMAIN_PAGES as the basis for the calculation:

    return min(max_pages, MAX_DOMAIN_PAGES);

    and use it:

    extra_limit = xen_get_max_pages();
    if (extra_limit >= max_pfn)
    extra_pages = extra_limit - max_pfn;
    else
    extra_pages = 0;

    which means we end up with extra_pages = 128GB in PFNs (33554432)
    - 8GB in PFNs (2097152, on this specific box, can be larger or smaller),
    and then we add that value to the E820 making it:

    Xen: 00000000ff000000 - 0000000100000000 (reserved)
    Xen: 0000000100000000 - 000000133f2e2000 (usable)

    which is clearly wrong. It should look like this:

    Xen: 00000000ff000000 - 0000000100000000 (reserved)
    Xen: 0000000100000000 - 000000027fbda000 (usable)

    Naturally this problem does not present itself if dom0_mem=max:X
    is used.

    CC: stable@kernel.org
    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
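
The arithmetic in the entry above can be checked with a short sketch. The helper names and the `MODEL_` macros are made up for illustration, and a 4 KiB page size is assumed; the clamping expression is the one quoted in the message.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12
/* Convert a size in GiB to a count of 4 KiB page frames. */
#define MODEL_GIB_TO_PFNS(g) ((unsigned long long)(g) << (30 - MODEL_PAGE_SHIFT))

/* Same shape as the snippet quoted in the commit message. */
unsigned long long model_extra_pages(unsigned long long extra_limit,
                                     unsigned long long max_pfn)
{
    if (extra_limit >= max_pfn)
        return extra_limit - max_pfn;
    return 0;
}

/* Usage: no dom0_mem= limit, so the 128 GiB ceiling is used on an 8 GiB box. */
void model_extra_pages_example(void)
{
    unsigned long long limit = MODEL_GIB_TO_PFNS(128);  /* 33554432 PFNs */
    unsigned long long max_pfn = MODEL_GIB_TO_PFNS(8);  /* 2097152 PFNs */

    /* 120 GiB worth of "extra" pages that do not really exist. */
    assert(model_extra_pages(limit, max_pfn) == 31457280ULL);
}
```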
     
  • * 'upstream/bugfix' of git://github.com/jsgf/linux-xen:
    xen: use non-tracing preempt in xen_clocksource_read()

    Linus Torvalds
     

09 Sep, 2011

1 commit

  • PV spinlocks cannot possibly work with the current code because they are
    enabled after pvops patching has already been done, and because PV
    spinlocks use a different data structure than native spinlocks so we
    cannot switch between them dynamically. A spinlock that has been taken
    once by the native code (__ticket_spin_lock) cannot be taken by
    __xen_spin_lock even after it has been released.

    Reported-and-Tested-by: Stefan Bader
    Signed-off-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     

07 Sep, 2011

1 commit


02 Sep, 2011

2 commits

  • We have hit a couple of customer bugs where they would like to
    use those parameters to run a UP kernel, but both of those
    options turn off important sources of interrupt information, so
    we end up not being able to boot. The correct way is to
    pass in 'dom0_max_vcpus=1' on the Xen hypervisor line and
    the kernel will patch itself to be a UP kernel.

    Fixes bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637308

    CC: stable@kernel.org
    Acked-by: Ian Campbell
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • If a vmalloc page fault happens inside an interrupt handler with interrupts
    disabled, then on the exit path from the exception handler, when there are no
    pending interrupts, the following code (arch/x86/xen/xen-asm_32.S:112):

    cmpw $0x0001, XEN_vcpu_info_pending(%eax)
    sete XEN_vcpu_info_mask(%eax)

    will enable interrupts even if they have been previously disabled according to
    eflags from the bounce frame (arch/x86/xen/xen-asm_32.S:99):

    testb $X86_EFLAGS_IF>>8, 8+1+ESP_OFFSET(%esp)
    setz XEN_vcpu_info_mask(%eax)

    The solution is to set XEN_vcpu_info_mask only when it should be set
    according to
    cmpw $0x0001, XEN_vcpu_info_pending(%eax)
    but not to clear it if there aren't any pending events.

    Reproducer for bug is attached to RHBZ 707552

    CC: stable@kernel.org
    Signed-off-by: Igor Mammedov
    Acked-by: Jeremy Fitzhardinge
    Signed-off-by: Konrad Rzeszutek Wilk

    Igor Mammedov
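
A C model of the two instructions quoted in the entry above may make the bug easier to see. It assumes, as the 16-bit compare against 0x0001 suggests, that the pending byte is immediately followed by the mask byte; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two adjacent bytes, mirroring the pending/mask pair in the assembly. */
struct vcpu_flags {
    uint8_t pending;  /* 1 when an event is waiting */
    uint8_t mask;     /* 1 when event delivery is disabled */
};

/* 'cmpw $0x0001' matches only pending == 1 together with mask == 0. */
static int model_pending_and_unmasked(const struct vcpu_flags *v)
{
    return v->pending == 1 && v->mask == 0;
}

/* Old exit path: 'sete' always writes the mask, so it can clear it. */
void model_exit_old(struct vcpu_flags *v)
{
    v->mask = (uint8_t)model_pending_and_unmasked(v);
}

/* Fixed exit path: set the mask when required, never clear it. */
void model_exit_fixed(struct vcpu_flags *v)
{
    if (model_pending_and_unmasked(v))
        v->mask = 1;
}

/* Usage: interrupts were disabled (mask = 1) and nothing is pending. */
void model_exit_example(void)
{
    struct vcpu_flags old_path = { 0, 1 };
    struct vcpu_flags new_path = { 0, 1 };

    model_exit_old(&old_path);
    model_exit_fixed(&new_path);
    assert(old_path.mask == 0);  /* interrupts wrongly re-enabled */
    assert(new_path.mask == 1);  /* interrupts stay disabled */
}
```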
     

01 Sep, 2011

1 commit

  • Use the domain's maximum reservation to limit the amount of extra RAM
    for the memory balloon. This reduces the size of the page tables and
    the amount of reserved low memory (which defaults to about 1/32 of the
    total RAM).

    On a system with 8 GiB of RAM with the domain limited to 1 GiB the
    kernel reports:

    Before:

    Memory: 627792k/4472000k available

    After:

    Memory: 549740k/11132224k available

    An increase of about 76 MiB (~1.5% of the unused 7 GiB). The reserved
    low memory is also reduced from 253 MiB to 32 MiB. The total
    additional usable RAM is 329 MiB.

    For dom0, this requires a patch to Xen ('x86: use 'dom0_mem' to limit
    the number of pages for dom0') (c/s 23790)

    CC: stable@kernel.org
    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     

25 Aug, 2011

1 commit

  • The tracing code used sched_clock() to get tracing timestamps, which
    ends up calling xen_clocksource_read(). xen_clocksource_read() must
    disable preemption, but if preemption tracing is enabled, this results
    in infinite recursion.

    I've only noticed this when boot-time tracing tests are enabled, but it
    seems like a generic bug. It looks like it would also affect
    kvm_clocksource_read().

    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Avi Kivity
    Cc: Marcelo Tosatti

    Jeremy Fitzhardinge
     

23 Aug, 2011

1 commit


22 Aug, 2011

2 commits

  • Steven Rostedt says we should use CONFIG_EVENT_TRACING.

    Cc: Steven Rostedt
    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Konrad Rzeszutek Wilk

    Jeremy Fitzhardinge
     
  • Fix regression for HVM case on older (
    Date: Thu Dec 2 17:55:10 2010 +0000

    xen: PV on HVM: support PV spinlocks and IPIs

    This change replaced the SMP operations with event based handlers without
    taking into account that this only works when the hypervisor supports
    callback vectors. This causes unexplainable hangs early on boot for
    HVM guests with more than one CPU.

    BugLink: http://bugs.launchpad.net/bugs/791850

    CC: stable@kernel.org
    Signed-off-by: Stefan Bader
    Signed-off-by: Stefano Stabellini
    Tested-and-Reported-by: Stefan Bader
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     

17 Aug, 2011

1 commit

  • The order-based approach is not only less efficient (requiring a shift
    and a compare, typical generated code looking like this

    mov eax, [machine_to_phys_order]
    mov ecx, eax
    shr ebx, cl
    test ebx, ebx
    jnz ...

    whereas a direct check requires just a compare, like in

    cmp ebx, [machine_to_phys_nr]
    jae ...

    ), but also slightly dangerous in the 32-on-64 case - the element
    address calculation can wrap if the next power of two boundary is
    sufficiently far away from the actual upper limit of the table, and
    hence can result in user space addresses being accessed (with it being
    unknown what may actually be mapped there).

    Additionally, the elimination of the mistaken use of fls() here (should
    have been __fls()) fixes a latent issue on x86-64 that would trigger
    if the code was run on a system with memory extending beyond the 44-bit
    boundary.

    CC: stable@kernel.org
    Signed-off-by: Jan Beulich
    [v1: Based on Jeremy's feedback]
    Signed-off-by: Konrad Rzeszutek Wilk

    Jan Beulich
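
The difference between the two bounds checks in the entry above shows up whenever the table size is not a power of two. The sketch below uses hypothetical helper names and example values; `order` and `nr` stand in for machine_to_phys_order and machine_to_phys_nr.

```c
#include <assert.h>
#include <stdbool.h>

/* Order-based check: accept mfn when no bits at or above 'order' are set. */
bool model_mfn_ok_by_order(unsigned long mfn, unsigned int order)
{
    assert(order < sizeof(unsigned long) * 8);
    return (mfn >> order) == 0;
}

/* Direct check: accept mfn only when it indexes a real table entry. */
bool model_mfn_ok_by_nr(unsigned long mfn, unsigned long nr)
{
    return mfn < nr;
}

/* Usage: a table of 0x180000 entries, rounded up to order 21 (0x200000). */
void model_mfn_check_example(void)
{
    unsigned long nr = 0x180000UL;
    unsigned int order = 21;
    unsigned long mfn = 0x1c0000UL;  /* past the table, below 2^order */

    assert(model_mfn_ok_by_order(mfn, order));  /* wrongly accepted */
    assert(!model_mfn_ok_by_nr(mfn, nr));       /* correctly rejected */
}
```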
     

13 Aug, 2011

1 commit

  • * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-tip:
    x86-64: Rework vsyscall emulation and add vsyscall= parameter
    x86-64: Wire up getcpu syscall
    x86: Remove unnecessary compile flag tweaks for vsyscall code
    x86-64: Add vsyscall:emulate_vsyscall trace event
    x86-64: Add user_64bit_mode paravirt op
    x86-64, xen: Enable the vvar mapping
    x86-64: Work around gold bug 13023
    x86-64: Move the "user" vsyscall segment out of the data segment.
    x86-64: Pad vDSO to a page boundary

    Linus Torvalds
     

10 Aug, 2011

1 commit


07 Aug, 2011

1 commit

  • * 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/trace: Fix compile error when CONFIG_XEN_PRIVILEGED_GUEST is not set
    xen: Fix misleading WARN message at xen_release_chunk
    xen: Fix printk() format in xen/setup.c
    xen/tracing: it looks like we wanted CONFIG_FTRACE
    xen/self-balloon: Add dependency on tmem.
    xen/balloon: Fix compile errors - missing header files.
    xen/grant: Fix compile warning.
    xen/pciback: remove duplicated #include

    Linus Torvalds
     

05 Aug, 2011

1 commit

  • With CONFIG_XEN and CONFIG_FTRACE set we get this:

    arch/x86/xen/trace.c:22: error: ‘__HYPERVISOR_console_io’ undeclared here (not in a function)
    arch/x86/xen/trace.c:22: error: array index in initializer not of integer type
    arch/x86/xen/trace.c:22: error: (near initialization for ‘xen_hypercall_names’)
    arch/x86/xen/trace.c:23: error: ‘__HYPERVISOR_physdev_op_compat’ undeclared here (not in a function)

    The issue was that the definitions of __HYPERVISOR were not pulled in
    if CONFIG_XEN_PRIVILEGED_GUEST was not set.

    Reported-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Acked-by: Ingo Molnar
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk