20 May, 2016

3 commits


10 May, 2016

1 commit

  • The ARMv8.1 architecture extensions introduce support for hardware
    updates of the access and dirty information in page table entries. With
    VTCR_EL2.HA enabled (bit 21), when the CPU accesses an IPA with the
    PTE_AF bit cleared in the stage 2 page table, instead of raising an
    Access Flag fault to EL2 the CPU sets the actual page table entry bit
    (10). To ensure that kernel modifications to the page table do not
    inadvertently revert a bit set by hardware updates, certain Stage 2
    software pte/pmd operations must be performed atomically.

    The main user of the AF bit is the kvm_age_hva() mechanism. The
    kvm_age_hva_handler() function performs a "test and clear young" action
    on the pte/pmd. This needs to be atomic in respect of automatic hardware
    updates of the AF bit. Since the AF bit is in the same position for both
    Stage 1 and Stage 2, the patch reuses the existing
    ptep_test_and_clear_young() functionality if
    __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG is defined. Otherwise, the
    existing pte_young/pte_mkold mechanism is preserved.

    The kvm_set_s2pte_readonly() (and the corresponding pmd equivalent) have
    to perform atomic modifications in order to avoid a race with updates of
    the AF bit. The arm64 implementation has been re-written using
    exclusives.

    Currently, kvm_set_s2pte_writable() (and pmd equivalent) take a pointer
    argument and modify the pte/pmd in place. However, these functions are
    only used on local variables rather than actual page table entries, so
    it makes more sense to follow the pte_mkwrite() approach for stage 1
    attributes. The change to kvm_s2pte_mkwrite() makes it clear that these
    functions do not modify the actual page table entries.

    The (pte|pmd)_mkyoung() uses on Stage 2 entries (setting the AF bit
    explicitly) do not need to be modified since hardware updates of the
    dirty status are not supported by KVM, so there is no possibility of
    losing such information.

    Signed-off-by: Catalin Marinas
    Cc: Paolo Bonzini
    Acked-by: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Catalin Marinas
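
As a rough illustration of the atomicity requirement in the commit above, the sketch below uses C11 atomics in place of the arm64 exclusive load/store instructions the patch mentions. The function names and the `S2_PTE_WRITE` bit are hypothetical stand-ins, not the kernel's definitions; only the AF bit position (10) comes from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_AF       (UINT64_C(1) << 10) /* Access Flag, bit 10 per the message */
#define S2_PTE_WRITE (UINT64_C(1) << 7)  /* assumed stand-in for a write-permission bit */

/* Atomically clear AF and report whether it was set ("test and clear young"). */
static bool s2_test_and_clear_young(_Atomic uint64_t *ptep)
{
    uint64_t old = atomic_load(ptep);

    /* Retry until the value we modified is still the value in memory, so a
     * concurrent update to any other bit is never overwritten. On failure,
     * `old` is refreshed with the current contents. */
    while (!atomic_compare_exchange_weak(ptep, &old, old & ~PTE_AF))
        ;
    return (old & PTE_AF) != 0;
}

/* Atomically drop write permission; an AF bit set concurrently survives
 * because the read-modify-write is a single atomic operation. */
static void s2_set_readonly(_Atomic uint64_t *ptep)
{
    atomic_fetch_and(ptep, ~S2_PTE_WRITE);
}
```

A plain read, modify, write sequence would leave a window in which a hardware AF update could be lost; the atomic forms close that window.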
     

03 May, 2016

10 commits

  • The only call of arch_timer_get_timecounter (in KVM) has been removed.

    Signed-off-by: Julien Grall
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • Currently, the firmware tables are parsed twice: once in the GIC
    drivers and once when initializing the vGIC. This duplicates code
    and makes it more tedious to add support for another firmware table
    (like ACPI).

    Use the recently introduced helper gic_get_kvm_info() to get
    information about the virtual GIC.

    With this change, the virtual GIC becomes agnostic to the firmware
    table and KVM will be able to initialize the vGIC on ACPI.

    Signed-off-by: Julien Grall
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • The firmware table is currently parsed by the virtual timer code in
    order to retrieve the virtual timer interrupt. However, this is already
    done by the arch timer driver.

    To avoid code duplication, use the new function arch_timer_get_kvm_info(),
    which returns all the information required by the virtual timer code.

    Signed-off-by: Julien Grall
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • Fill up the recently introduced gic_kvm_info with the hardware
    information used for virtualization.

    Signed-off-by: Julien Grall
    Cc: Thomas Gleixner
    Cc: Jason Cooper
    Cc: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • The ACPI code needs global variables in order to collect
    information from the tables.

    To make it clear that those variables are ACPI specific, gather all of
    them in a single structure.

    Furthermore, even if some of the variables are not marked with
    __initdata, they are all only used during initialization. Therefore,
    the new variable, which holds the structure, can be marked with
    __initdata.

    Signed-off-by: Julien Grall
    Acked-by: Christoffer Dall
    Reviewed-by: Hanjun Guo
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • Currently, most of the pr_* messages in the GICv3 driver don't have a
    prefix. Add one to make clear where the messages come from.

    Signed-off-by: Julien Grall
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • For now, the firmware tables are parsed twice: once in the GIC
    drivers and once when initializing the vGIC. This duplicates code
    and makes it more tedious to add support for another firmware table
    (like ACPI).

    Introduce a new structure and set of helpers to get/set the virtual GIC
    information. Also fill up the structure for GICv2.

    Signed-off-by: Julien Grall
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • The ACPI code needs global variables in order to collect
    information from the tables.

    For now, a single global variable is used, but more will be added in a
    subsequent patch. To make it clear that they are ACPI specific, gather
    all the information in a single structure.

    Signed-off-by: Julien Grall
    Acked-by: Christofer Dall
    Acked-by: Hanjun Guo
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • Currently, the firmware table is parsed by the virtual timer code in
    order to retrieve the virtual timer interrupt. However, this is already
    done by the arch timer driver.

    To avoid code duplication, extend arch_timer_kvm_info to get the virtual
    IRQ.

    Note that the KVM code will be modified in a subsequent patch.

    Signed-off-by: Julien Grall
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • Introduce a structure which is filled up by the arch timer driver and
    used by the virtual timer in KVM.

    The first member of this structure will be the timecounter. More members
    will be added later.

    A stub for the new helper isn't introduced because KVM requires the arch
    timer for both ARM64 and ARM32.

    The function arch_timer_get_timecounter is kept for the time being and
    will be dropped in a subsequent patch.

    Signed-off-by: Julien Grall
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Julien Grall
     

29 Apr, 2016

1 commit

  • The ARM architecture mandates that when changing a page table entry
    from a valid entry to another valid entry, an invalid entry is written
    first, the TLB is invalidated, and only then is the new entry written.

    The current code doesn't respect this, directly writing the new
    entry and only then invalidating TLBs. Let's fix it up.

    Cc:
    Reported-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
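
The ordering described in the commit above (often called break-before-make) can be sketched as follows. This is a simplified model, not the kernel's code: `tlb_flush_ipa()` is a hypothetical stand-in that only counts invalidations, and the memory barriers a real implementation needs around each step are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_VALID UINT64_C(1) /* assumed valid bit for this model */

/* Hypothetical stand-in for a TLB invalidation; it only counts calls. */
static unsigned int tlb_flush_count;

static void tlb_flush_ipa(uint64_t ipa)
{
    (void)ipa;
    tlb_flush_count++;
}

/* Install new_pte, going through an invalid entry if the old one was valid. */
static void set_pte_bbm(uint64_t *ptep, uint64_t ipa, uint64_t new_pte)
{
    if (*ptep & PTE_VALID) {
        *ptep = 0;          /* 1. break: write an invalid entry */
        tlb_flush_ipa(ipa); /* 2. invalidate any cached translation */
    }
    *ptep = new_pte;        /* 3. make: only now write the new entry */
}
```

Replacing an invalid entry needs no invalidation, which is why the first two steps are conditional in this sketch.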
     

21 Apr, 2016

17 commits

  • Now that we can handle stage-2 page tables independent
    of the host page table levels, wire up the 16K page
    support.

    Cc: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Now that we don't have any fake page table levels for arm64,
    clean up the common code to get rid of the dead code.

    Cc: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • On arm64, the hardware supports concatenation of up to 16 tables at
    the entry level for stage2 translations, and we make use of that
    whenever possible. This can lead to fewer translation levels than in
    the normal (stage1) table. Also, since the IPA (40bit) is smaller than
    some of the supported VA_BITS (e.g. 48bit), there can be a different
    number of levels in stage-1 vs stage-2 tables. To reuse the kernel host
    page table walker for stage2, we have been using a fake software page
    table level, not known to the hardware. But with 16K translations,
    there could be up to 2 fake software levels (with 48bit VA and 40bit
    IPA), which complicates the code. Hence, we want to get rid of the
    hack.

    Now that we have explicit accessors for hyp vs stage2 page tables,
    define the stage2 walker helpers accordingly based on the actual
    table used by the hardware.

    Once we know the number of translation levels used by the hardware,
    it is merely a job of defining the helpers based on whether a
    particular level is folded or not, looking at the number of levels.

    Some facts before we calculate the translation levels:

    1) Smallest page size supported by arm64 is 4K.
    2) The minimum number of bits resolved at any page table level
    is (PAGE_SHIFT - 3) at intermediate levels.
    Together, these imply that the minimum number of bits required for a
    level change is 9.

    Since we can concatenate up to 16 tables at stage2 entry, the total
    number of page table levels used by the hardware for resolving N bits
    is the same as that for (N - 4) bits (with concatenation), as there
    cannot be a level in between (N, N-4) as per the above rules.

    Hence, we have

    STAGE2_PGTABLE_LEVELS = PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)

    With the current IPA limit (40bit), for all supported translations
    and VA_BITS, we have the following condition (even for 36bit VA with
    16K page size):

    CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS.

    So, for example, if PUD is present in stage2, it is present in the hyp(host).
    Hence, we fall back to the host definition if we find that a level is not
    folded. Otherwise we redefine it accordingly. A build time check is added
    to make sure the above condition holds. If this condition breaks in future,
    we can rearrange the host level helpers and fix our code easily.

    Cc: Marc Zyngier
    Cc: Christoffer Dall
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
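
The level calculation in the commit above can be checked with a small sketch. The helper names are hypothetical; the arithmetic follows the message: each level resolves (PAGE_SHIFT - 3) bits, and concatenating up to 16 entry-level tables absorbs 4 bits of the IPA.

```c
#include <assert.h>

/* Levels needed to resolve `bits` of address with pages of 2^page_shift
 * bytes: the page offset takes page_shift bits and every level resolves
 * (page_shift - 3) more, rounded up. */
static int pgtable_levels(int bits, int page_shift)
{
    int per_level = page_shift - 3;

    return (bits - page_shift + per_level - 1) / per_level;
}

/* Stage 2 levels with up to 16 concatenated entry-level tables, i.e. the
 * STAGE2_PGTABLE_LEVELS = PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4) rule. */
static int stage2_pgtable_levels(int ipa_bits, int page_shift)
{
    return pgtable_levels(ipa_bits - 4, page_shift);
}
```

With a 40bit IPA this gives 3 levels for 4K pages and 2 levels for 16K and 64K pages, while a 48bit VA with 16K pages needs 4 host levels, which is where the two fake levels mentioned above would have come from.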
     
  • Now that we have switched to explicit page table routines,
    get rid of the obsolete kvm_* wrappers.

    Also, kvm_tlb_flush_vmid_by_ipa is now called only on stage2
    page tables, hence get rid of the redundant check.

    Cc: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Now that the hyp page table is handled by a different set of
    routines, rename the original shared routines to stage2 handlers.
    Also make explicit use of the stage2 page table helpers.

    unmap_range has been merged into the existing unmap_stage2_range.

    Cc: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • We have common routines to modify hyp and stage2 page tables
    based on the 'kvm' parameter. For a smoother transition to
    using separate routines for each, duplicate the routines
    and modify the copy to work on hyp.

    Marks the forked routines with _hyp_ and gets rid of the
    kvm parameter, which is no longer needed and is NULL for hyp.
    Also gets rid of the calls to kvm_tlb_flush_by_vmid_ipa()
    from the hyp versions. Uses explicit host page table accessors
    instead of the kvm_* page table helpers.

    Suggested-by: Christoffer Dall
    Cc: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • We have stage2 page table helpers for both arm and arm64. Switch to
    the stage2 helpers for routines that only deal with stage2 page table.

    Cc: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Introduce hyp_pxx_table_empty helpers for checking whether
    a given table entry is empty. This will be used explicitly
    once we switch to explicit routines for hyp page table walk.

    Acked-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Introduce stage2 page table helpers for arm64. With the fake
    page table level still in place, the stage2 table has the same
    number of levels as that of the host (and hyp), so they all
    fall back to the host version.

    Acked-by: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Introduce hyp_pxx_table_empty helpers for checking whether
    a given table entry is empty. This will be used explicitly
    once we switch to explicit routines for hyp page table walk.

    Acked-by: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Christoffer Dall

    Suzuki K Poulose
     
  • Define the page table helpers for walking the stage2 pagetable
    for arm. Since both hyp and stage2 have the same number of levels
    as the host, we reuse the host helpers.

    The exceptions are the p.d_addr_end routines, which have to deal
    with IPA > 32bit; hence we use open coded versions of their host
    helpers which support 64bit.

    Acked-by: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Christoffer Dall

    Suzuki K Poulose
     
  • Get rid of kvm_pud_huge() which falls back to pud_huge. Use
    pud_huge instead.

    Acked-by: Christoffer Dall
    Acked-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Both arm and arm64 now provide a helper, pmd_thp_or_huge(),
    to check if the given pmd represents a huge page. Use that
    instead of our own custom check.

    Suggested-by: Mark Rutland
    Cc: Marc Zyngier
    Acked-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Add a helper to determine if a given pmd represents a huge page
    either by hugetlb or thp, as we have for arm. This will be used
    by KVM MMU code.

    Suggested-by: Mark Rutland
    Cc: Catalin Marinas
    Cc: Steve Capper
    Cc: Will Deacon
    Acked-by: Marc Zyngier
    Acked-by: Will Deacon
    Acked-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • Rearrange the code for fake pgd handling, which is applicable
    only for arm64. This will later be removed once we introduce
    the stage2 page table walker macros.

    Reviewed-by: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • We share most of the bits for VTCR_EL2 for different page sizes,
    except for the TG0 value and the entry level value. This patch
    makes the definitions a bit cleaner to reflect this fact.

    Also cleans up the VTTBR_X calculation. No functional changes.

    Cc: Marc Zyngier
    Acked-by: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
     
  • TCR_EL1, TCR_EL2 and VTCR_EL2, all share some field positions
    (TG0, ORGN0, IRGN0 and SH0) and their corresponding value definitions.

    This patch makes the TCR_EL1 definitions reusable and uses them for TCR_EL2
    and VTCR_EL2 fields.

    This also fixes a bug where we assume TG0 in {V}TCR_EL2 is a 1bit field.

    Cc: Catalin Marinas
    Cc: Mark Rutland
    Acked-by: Marc Zyngier
    Acked-by: Will Deacon
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose

    Suzuki K Poulose
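
To make the TG0 point in the commit above concrete, here is a minimal sketch of shared field definitions. The macro and function names are illustrative assumptions rather than the kernel's exact definitions; the field position (bits 15:14) and granule encodings follow the architecture's TCR layout.

```c
#include <assert.h>
#include <stdint.h>

/* TG0 is a 2bit field at bits [15:14]; shared by TCR_EL1, TCR_EL2, VTCR_EL2. */
#define TCR_TG0_SHIFT 14
#define TCR_TG0_WIDTH 2
#define TCR_TG0_MASK  (((UINT64_C(1) << TCR_TG0_WIDTH) - 1) << TCR_TG0_SHIFT)
#define TCR_TG0_4K    (UINT64_C(0) << TCR_TG0_SHIFT)
#define TCR_TG0_64K   (UINT64_C(1) << TCR_TG0_SHIFT)
#define TCR_TG0_16K   (UINT64_C(2) << TCR_TG0_SHIFT)

/* Reuse the same definitions for the EL2 registers instead of redefining them. */
#define VTCR_EL2_TG0_MASK TCR_TG0_MASK
#define VTCR_EL2_TG0_16K  TCR_TG0_16K

/* Replace the granule field, leaving all other bits untouched. */
static uint64_t tcr_set_tg0(uint64_t tcr, uint64_t tg0)
{
    return (tcr & ~TCR_TG0_MASK) | (tg0 & TCR_TG0_MASK);
}

/* Extract the granule field, still in its shifted position. */
static uint64_t tcr_get_tg0(uint64_t tcr)
{
    return tcr & TCR_TG0_MASK;
}
```

A mask covering only bit 14 cannot represent the 16K encoding, whose set bit is bit 15, which is the kind of error a 1bit assumption produces.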
     

06 Apr, 2016

3 commits

  • Commit 1e947bad0b63 ("arm64: KVM: Skip HYP setup when already running
    in HYP") re-organized the hyp init code and ended up leaving the CPU
    hotplug and PM notifier even if hyp mode initialization fails.

    Since KVM is not yet supported with ACPI, the above mentioned commit
    breaks CPU hotplug in ACPI boot.

    This patch fixes teardown_hyp_mode to properly unregister both CPU
    hotplug and PM notifiers in the teardown path.

    Fixes: 1e947bad0b63 ("arm64: KVM: Skip HYP setup when already running in HYP")
    Cc: Christoffer Dall
    Cc: Marc Zyngier
    Signed-off-by: Sudeep Holla
    Signed-off-by: Christoffer Dall

    Sudeep Holla
     
  • We always thought that 40bits of PA range would be the minimum people
    would actually build. Anything less is terrifyingly small.

    Turns out that we were both right and wrong. Nobody has ever built
    such a system, but the ARM Foundation Model has a PARange set to 36bits.
    Just because we can. Oh well. Now, the KVM API explicitly says that
    we offer a 40bit PA space to the VM, so we shouldn't run KVM on
    the Foundation Model at all.

    That being said, this patch offers a less aggressive alternative, and
    loudly warns about the configuration being unsupported. You'll still
    be able to run VMs (at your own risk, though).

    This is just a workaround until we have a proper userspace API where
    we report the PARange to userspace.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • On a host that runs NTP, corrections can have a direct impact on
    the background timer that we program on the behalf of a vcpu.

    In particular, NTP performing a forward correction will result in
    a timer expiring sooner than expected from a guest point of view.
    Not a big deal, we kick the vcpu anyway.

    But on wake-up, the vcpu thread is going to perform a check to
    find out whether or not it should block. And at that point, the
    timer check is going to say "timer has not expired yet, go back
    to sleep". This results in the timer event being lost forever.

    There are multiple ways to handle this. One would be to record that
    the timer has expired and let kvm_cpu_has_pending_timer return
    true in that case, but that would be fairly invasive. Another is
    to check for the "short sleep" condition in the hrtimer callback,
    and restart the timer for the remaining time when the condition
    is detected.

    This patch implements the latter, with a bit of refactoring in
    order to avoid too much code duplication.

    Cc:
    Reported-by: Alexander Graf
    Reviewed-by: Alexander Graf
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
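
The "short sleep" check chosen in the commit above can be modelled as below. This is a simplified sketch with hypothetical names and plain nanosecond values; it does not use the kernel's hrtimer API, it only shows the decision made when the background timer fires.

```c
#include <assert.h>
#include <stdint.h>

enum timer_restart { TIMER_NORESTART, TIMER_RESTART };

/* Time left until the guest's deadline, or 0 if it has already passed. */
static uint64_t ns_until_expiry(uint64_t deadline_ns, uint64_t now_ns)
{
    return now_ns < deadline_ns ? deadline_ns - now_ns : 0;
}

/* Called when the background timer fires. If the host clock was stepped
 * forward and the guest deadline has not really been reached, re-arm for
 * the remaining time instead of waking a vcpu that would go back to sleep. */
static enum timer_restart background_timer_fired(uint64_t deadline_ns,
                                                 uint64_t now_ns,
                                                 uint64_t *rearm_ns)
{
    uint64_t ns = ns_until_expiry(deadline_ns, now_ns);

    *rearm_ns = ns;
    if (ns)
        return TIMER_RESTART;   /* short sleep detected: run again later */
    return TIMER_NORESTART;     /* really expired: the vcpu can be kicked */
}
```

Sharing one "time until expiry" helper between arming the timer and the callback is one way to get the refactoring the message mentions without duplicating the computation.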
     

01 Apr, 2016

1 commit

  • The kernel is written in C, not python, so we need braces around
    multi-line if statements. GCC 6 actually warns about this, thanks to the
    fantastic new "-Wmisleading-indentation" flag:

    | virt/kvm/arm/pmu.c: In function ‘kvm_pmu_overflow_status’:
    | virt/kvm/arm/pmu.c:198:3: warning: statement is indented as if it were guarded by... [-Wmisleading-indentation]
    | reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
    | ^~~
    | arch/arm64/kvm/../../../virt/kvm/arm/pmu.c:196:2: note: ...this ‘if’ clause, but it is not
    | if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
    | ^~

    As it turns out, this particular case is harmless (we just do some &=
    operations with 0), but worth fixing nonetheless.

    Signed-off-by: Will Deacon
    Signed-off-by: Christoffer Dall

    Will Deacon
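
For reference, the braced form that the commit above asks for looks roughly like this. It is a reduced sketch, not the contents of virt/kvm/arm/pmu.c: the register reads are replaced by plain parameters and `PMCR_E` is an assumed stand-in for ARMV8_PMU_PMCR_E.

```c
#include <assert.h>
#include <stdint.h>

#define PMCR_E UINT64_C(1) /* assumed stand-in for the PMU enable bit */

/* Overflow status is only meaningful while the PMU is enabled. */
static uint64_t overflow_status(uint64_t pmcr, uint64_t overflowed,
                                uint64_t enabled)
{
    uint64_t reg = 0;

    /* Braces make every statement below part of the guarded block. Without
     * them only the first assignment is guarded; the later `&=` would then
     * run on reg == 0, which is why the original bug was harmless. */
    if (pmcr & PMCR_E) {
        reg = overflowed;
        reg &= enabled;
    }

    return reg;
}
```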
     

31 Mar, 2016

2 commits

  • When the kernel is running at EL2, it doesn't need init_hyp_mode() to
    configure page tables for HYP. This function also registers the CPU
    hotplug and lower power notifiers that cause HYP to be re-initialised
    after the CPU has been reset.

    To avoid losing the register state that controls stage2 translation, move
    the registering of these notifiers into init_subsystems(), and add an
    is_kernel_in_hyp_mode() path to each callback.

    Acked-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Fixes: 1e947bad0b6 ("arm64: KVM: Skip HYP setup when already running in HYP")
    Signed-off-by: James Morse
    Signed-off-by: Christoffer Dall

    James Morse
     
  • When we detect support for 16bit VMID in ID_AA64MMFR1, we set the
    VTCR_EL2_VS field to 1 to make use of 16bit vmids. But, with
    commit 3a3604bc5eb4 ("arm64: KVM: Switch to C-based stage2 init")
    this is broken and we corrupt VTCR_EL2:T0SZ instead of updating the VS
    field. VTCR_EL2_VS was actually defined to the field shift (19) and
    not the real value for VS. This patch fixes the issue.

    Fixes: commit 3a3604bc5eb4 ("arm64: KVM: Switch to C-based stage2 init")
    Cc: Christoffer Dall
    Cc: Mark Rutland
    Acked-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Christoffer Dall

    Suzuki K Poulose
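
The shift-versus-value mix-up described in the commit above is easy to reproduce in isolation. In this sketch the macro and function names are illustrative; the VS shift (19) comes from the message, and T0SZ is taken to occupy the low six bits of the register.

```c
#include <assert.h>
#include <stdint.h>

#define VTCR_EL2_VS_SHIFT  19
#define VTCR_EL2_VS_16BIT  (UINT64_C(1) << VTCR_EL2_VS_SHIFT)
#define VTCR_EL2_T0SZ_MASK UINT64_C(0x3f)

/* Buggy form: ORs in the shift amount (19 == 0b010011) and so scribbles
 * over the low bits, where T0SZ lives. */
static uint64_t enable_16bit_vmid_buggy(uint64_t vtcr)
{
    return vtcr | VTCR_EL2_VS_SHIFT;
}

/* Fixed form: ORs in the field value, setting only bit 19. */
static uint64_t enable_16bit_vmid(uint64_t vtcr)
{
    return vtcr | VTCR_EL2_VS_16BIT;
}
```

Starting from T0SZ = 24 (a 40bit IPA), the buggy form turns T0SZ into 27 and leaves VS clear, while the fixed form keeps T0SZ at 24 and sets VS.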
     

27 Mar, 2016

2 commits

  • Linus Torvalds
     
  • Pull Ceph updates from Sage Weil:
    "There is quite a bit here, including some overdue refactoring and
    cleanup on the mon_client and osd_client code from Ilya, scattered
    writeback support for CephFS and a pile of bug fixes from Zheng, and a
    few random cleanups and fixes from others"

    [ I already decided not to pull this because of it having been rebased
    recently, but ended up changing my mind after all. Next time I'll
    really hold people to it. Oh well. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
    libceph: use KMEM_CACHE macro
    ceph: use kmem_cache_zalloc
    rbd: use KMEM_CACHE macro
    ceph: use lookup request to revalidate dentry
    ceph: kill ceph_get_dentry_parent_inode()
    ceph: fix security xattr deadlock
    ceph: don't request vxattrs from MDS
    ceph: fix mounting same fs multiple times
    ceph: remove unnecessary NULL check
    ceph: avoid updating directory inode's i_size accidentally
    ceph: fix race during filling readdir cache
    libceph: use sizeof_footer() more
    ceph: kill ceph_empty_snapc
    ceph: fix a wrong comparison
    ceph: replace CURRENT_TIME by current_fs_time()
    ceph: scattered page writeback
    libceph: add helper that duplicates last extent operation
    libceph: enable large, variable-sized OSD requests
    libceph: osdc->req_mempool should be backed by a slab pool
    libceph: make r_request msg_size calculation clearer
    ...

    Linus Torvalds