01 Feb, 2017

1 commit

  • commit 0d6da872d3e4a60f43c295386d7ff9a4cdcd57e9 upstream.

    The last pgtable rework silently disabled the CMMA unused state by
    setting a local pte variable (a function parameter) instead of
    propagating it back to the caller. Fix it (the by-value bug pattern
    is sketched below).

    Fixes: ebde765c0e85 ("s390/mm: uninline ptep_xxx functions from pgtable.h")
    Cc: Martin Schwidefsky
    Cc: Claudio Imbrenda
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
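
    A minimal standalone C sketch of the by-value bug pattern described
    above; the _PAGE_UNUSED value and the helper names are illustrative,
    not the actual s390 pgtable code:

    #include <stdio.h>

    typedef unsigned long pte_t;
    #define _PAGE_UNUSED 0x080UL    /* illustrative bit value */

    /* Buggy: modifies a local copy; the caller never sees the change. */
    static void set_unused_buggy(pte_t pte)
    {
            pte |= _PAGE_UNUSED;
    }

    /* Fixed: takes a pointer so the change propagates to the caller. */
    static void set_unused_fixed(pte_t *ptep)
    {
            *ptep |= _PAGE_UNUSED;
    }

    int main(void)
    {
            pte_t pte = 0;

            set_unused_buggy(pte);
            printf("buggy: %#lx\n", pte);   /* still 0: the bit was lost */
            set_unused_fixed(&pte);
            printf("fixed: %#lx\n", pte);   /* 0x80: the bit stuck */
            return 0;
    }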
     

28 Oct, 2016

1 commit

  • Pull s390 fixes from Martin Schwidefsky:
    "A few more s390 patches for 4.9:
    - a fix for an overflow in the dasd driver reported by UBSAN
    - fix a regression and add hotplug memory to the zone movable again
    - add ignore defines for the pkey system calls
    - fix the output of the merged stack tracer
    - replace printk with pr_cont in arch/s390 where appropriate
    - remove the arch specific return_address function again
    - ignore reserved channel paths at boot time
    - add a missing hugetlb_bad_size call to the arch backend"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/mm: fix zone calculation in arch_add_memory()
    s390/dumpstack: use pr_cont within show_stack and die
    s390/dumpstack: get rid of return_address again
    s390/disassembler: use pr_cont where appropriate
    s390/dumpstack: use pr_cont where appropriate
    s390/dumpstack: restore reliable indicator for call traces
    s390/mm: use hugetlb_bad_size()
    s390/cio: don't register chpids in reserved state
    s390: ignore pkey system calls
    s390/dasd: avoid undefined behaviour

    Linus Torvalds
     

24 Oct, 2016

1 commit

  • Standby (hotplug) memory should be added to ZONE_MOVABLE on s390. After
    commit 199071f1 "s390/mm: make arch_add_memory() NUMA aware",
    arch_add_memory() used memblock_end_of_DRAM() to find out the end of
    ZONE_NORMAL and the beginning of ZONE_MOVABLE. However, commit 7f36e3e5
    "memory-hotplug: add hot-added memory ranges to memblock before allocate
    node_data for a node." moved the call of memblock_add_node() before
    the call of arch_add_memory() in add_memory_resource(), and thus changed
    the return value of memblock_end_of_DRAM() when called in
    arch_add_memory(). As a result, arch_add_memory() will think that all
    memory blocks should be added to ZONE_NORMAL.

    Fix this by changing the logic in arch_add_memory() so that it
    manually iterates over all zones of a given node to find out which
    zone a memory block should be added to (sketched after this entry).

    Reviewed-by: Heiko Carstens
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Martin Schwidefsky

    Gerald Schaefer
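
    A hedged kernel-style sketch of the zone-walk idea described above;
    find_zone_for_block() is a hypothetical helper, and the real patch
    additionally splits the added range and calls the arch hotplug code
    per zone:

    /* Walk the zones of the target node and pick the one whose current
     * span covers the new memory block; standby memory that lies beyond
     * every populated zone goes to ZONE_MOVABLE.
     */
    static struct zone *find_zone_for_block(int nid, unsigned long start_pfn,
                                            unsigned long nr_pages)
    {
            struct pglist_data *pgdat = NODE_DATA(nid);
            struct zone *zone;
            int i;

            for (i = 0; i < MAX_NR_ZONES; i++) {
                    zone = pgdat->node_zones + i;
                    if (start_pfn >= zone->zone_start_pfn &&
                        start_pfn + nr_pages <= zone_end_pfn(zone))
                            return zone;
            }
            return pgdat->node_zones + ZONE_MOVABLE;
    }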
     

05 Oct, 2016

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "The new features and main improvements in this merge for v4.9

    - Support for the UBSAN sanitizer

    - Set HAVE_EFFICIENT_UNALIGNED_ACCESS; it improves the generated
    code in some places

    - Improvements for the in-kernel fpu code; in particular, the
    overhead for multiple consecutive in-kernel fpu users is reduced

    - Add a SIMD implementation for the RAID6 gen and xor operations

    - Add RAID6 recovery based on the XC instruction

    - The PCI DMA flush logic has been improved to increase the speed of
    the map / unmap operations

    - The time synchronization code has seen some updates

    And bug fixes all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits)
    s390/con3270: fix insufficient space padding
    s390/con3270: fix use of uninitialised data
    MAINTAINERS: update DASD maintainer
    s390/cio: fix accidental interrupt enabling during resume
    s390/dasd: add missing \n to end of dev_err messages
    s390/config: Enable config options for Docker
    s390/dasd: make query host access interruptible
    s390/dasd: fix panic during offline processing
    s390/dasd: fix hanging offline processing
    s390/pci_dma: improve lazy flush for unmap
    s390/pci_dma: split dma_update_trans
    s390/pci_dma: improve map_sg
    s390/pci_dma: simplify dma address calculation
    s390/pci_dma: remove dma address range check
    iommu/s390: simplify registration of I/O address translation parameters
    s390: migrate exception table users off module.h and onto extable.h
    s390: export header for CLP ioctl
    s390/vmur: fix irq pointer dereference in int handler
    s390/dasd: add missing KOBJ_CHANGE event for unformatted devices
    s390: enable UBSAN
    ...

    Linus Torvalds
     

20 Sep, 2016

2 commits

    These files were only including module.h for exception table
    related functions. We've now separated that content out into its
    own file, "extable.h", so move over to that and avoid all the
    extra header content in module.h that we don't really need to
    compile these files (the include change is sketched after this
    entry).

    The additions of uaccess.h are to deal with implicit includes like:

    arch/s390/kernel/traps.c: In function 'do_report_trap':
    arch/s390/kernel/traps.c:56:4: error: implicit declaration of function 'extable_fixup' [-Werror=implicit-function-declaration]
    arch/s390/kernel/traps.c: In function 'illegal_op':
    arch/s390/kernel/traps.c:173:3: error: implicit declaration of function 'get_user' [-Werror=implicit-function-declaration]

    Cc: Heiko Carstens
    Cc: linux-s390@vger.kernel.org
    Signed-off-by: Paul Gortmaker
    Signed-off-by: Martin Schwidefsky

    Paul Gortmaker
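
    A minimal sketch of the include change described above; the comments
    on what each header provides follow the commit message and are not
    an exhaustive list:

    /* Before: module.h pulled in far more than these files need. */
    #include <linux/module.h>

    /* After: only the headers actually used. */
    #include <linux/extable.h>   /* exception table handling */
    #include <linux/uaccess.h>   /* get_user(), extable_fixup() */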
     
  • Install the callbacks via the state machine (a hedged sketch of the
    registration follows below).

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: linux-s390@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: Heiko Carstens
    Cc: rt@linutronix.de
    Cc: Martin Schwidefsky
    Link: http://lkml.kernel.org/r/20160906170457.32393-18-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
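
    A hedged sketch of what such a conversion to the cpuhp state machine
    typically looks like; the callback names and the state name string
    are hypothetical, not the actual s390 patch:

    #include <linux/cpuhotplug.h>

    static int s390_foo_online(unsigned int cpu)
    {
            /* set up per-cpu state for @cpu */
            return 0;
    }

    static int s390_foo_offline(unsigned int cpu)
    {
            /* tear down per-cpu state for @cpu */
            return 0;
    }

    static int __init s390_foo_init(void)
    {
            /* One registration replaces the old CPU notifier: the state
             * machine runs the online callback for all present CPUs and
             * for every future hotplug event, in a well-defined order.
             */
            return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "s390/foo:online",
                                     s390_foo_online, s390_foo_offline);
    }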
     

10 Aug, 2016

1 commit

  • Both set_memory_ro() and set_memory_rw() will modify the page
    attributes of at least one page, even if the numpages parameter is
    zero.

    The author expected that calling these functions with numpages == zero
    would never happen. However, with the new 444d13ff10fb ("modules: add
    ro_after_init support") feature this happens frequently.

    Therefore do the right thing and make these two functions return
    gracefully if there is nothing to do (see the sketch after this entry).

    Fixes crashes on module load like this one:

    Unable to handle kernel pointer dereference in virtual kernel address space
    Failing address: 000003ff80008000 TEID: 000003ff80008407
    Fault in home space mode while using kernel ASCE.
    AS:0000000000d18007 R3:00000001e6aa4007 S:00000001e6a10800 P:00000001e34ee21d
    Oops: 0004 ilc:3 [#1] SMP
    Modules linked in: x_tables
    CPU: 10 PID: 1 Comm: systemd Not tainted 4.7.0-11895-g3fa9045 #4
    Hardware name: IBM 2964 N96 703 (LPAR)
    task: 00000001e9118000 task.stack: 00000001e9120000
    Krnl PSW : 0704e00180000000 00000000005677f8 (rb_erase+0xf0/0x4d0)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 000003ff80008b20 000003ff80008b20 000003ff80008b70 0000000000b9d608
    000003ff80008b20 0000000000000000 00000001e9123e88 000003ff80008950
    00000001e485ab40 000003ff00000000 000003ff80008b00 00000001e4858480
    0000000100000000 000003ff80008b68 00000000001d5998 00000001e9123c28
    Krnl Code: 00000000005677e8: ec1801c3007c cgij %r1,0,8,567b6e
    00000000005677ee: e32010100020 cg %r2,16(%r1)
    #00000000005677f4: a78401c2 brc 8,567b78
    >00000000005677f8: e35010080024 stg %r5,8(%r1)
    00000000005677fe: ec5801af007c cgij %r5,0,8,567b5c
    0000000000567804: e30050000024 stg %r0,0(%r5)
    000000000056780a: ebacf0680004 lmg %r10,%r12,104(%r15)
    0000000000567810: 07fe bcr 15,%r14
    Call Trace:
    ([] __this_module+0x0/0xffffffffffffd700 [x_tables])
    ([] do_init_module+0x12c/0x220)
    ([] load_module+0x24e2/0x2b10)
    ([] SyS_finit_module+0xbe/0xd8)
    ([] system_call+0xd6/0x264)
    Last Breaking-Event-Address:
    [] rb_erase+0x12/0x4d0
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    Reported-by: Christian Borntraeger
    Reported-and-tested-by: Sebastian Ott
    Fixes: e8a97e42dc98 ("s390/pageattr: allow kernel page table splitting")
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
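
    A sketch of the fix, not the verbatim patch; change_page_attr() and
    the SET_RO flag stand in for the arch-internal helper:

    int set_memory_ro(unsigned long addr, int numpages)
    {
            /* Bail out before touching any page table if there is
             * nothing to do; without this guard at least one page
             * would always be processed.
             */
            if (!numpages)
                    return 0;

            return change_page_attr(addr, numpages, SET_RO);
    }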
     

03 Aug, 2016

1 commit

  • Pull KVM updates from Paolo Bonzini:

    - ARM: GICv3 ITS emulation and various fixes. Removal of the
    old VGIC implementation.

    - s390: support for trapping software breakpoints, nested
    virtualization (vSIE), the STHYI opcode, initial extensions
    for CPU model support.

    - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
    of cleanups, preliminary to this and the upcoming support for
    hardware virtualization extensions.

    - x86: support for execute-only mappings in nested EPT; reduced
    vmexit latency for TSC deadline timer (by about 30%) on Intel
    hosts; support for more than 255 vCPUs.

    - PPC: bugfixes.

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
    KVM: PPC: Introduce KVM_CAP_PPC_HTM
    MIPS: Select HAVE_KVM for MIPS64_R{2,6}
    MIPS: KVM: Reset CP0_PageMask during host TLB flush
    MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
    MIPS: KVM: Sign extend MFC0/RDHWR results
    MIPS: KVM: Fix 64-bit big endian dynamic translation
    MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
    MIPS: KVM: Use 64-bit CP0_EBase when appropriate
    MIPS: KVM: Set CP0_Status.KX on MIPS64
    MIPS: KVM: Make entry code MIPS64 friendly
    MIPS: KVM: Use kmap instead of CKSEG0ADDR()
    MIPS: KVM: Use virt_to_phys() to get commpage PFN
    MIPS: Fix definition of KSEGX() for 64-bit
    KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
    kvm: x86: nVMX: maintain internal copy of current VMCS
    KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
    KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
    KVM: arm64: vgic-its: Simplify MAPI error handling
    KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
    KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
    ...

    Linus Torvalds
     

31 Jul, 2016

1 commit

  • The hugetlbfs pte/pmd conversion functions currently assume that the
    pmd bit layout is consistent with the pte layout, which is not really true.

    The SW read and write bits are encoded as the sequence "wr" in a pte, but
    in a pmd it is "rw". The hugetlbfs conversion assumes that the sequence
    is identical in both cases, which results in swapped read and write bits
    in the pmd. In practice this is not a problem, because those pmd bits are
    only relevant for THP pmds and not for hugetlbfs pmds. The hugetlbfs code
    works on (fake) ptes, and the converted pte bits are correct.

    There is another variation in pte/pmd encoding which affects dirty
    prot-none ptes/pmds. In this case, a pmd has both its HW read-only and
    invalid bit set, while it is only the invalid bit for a pte. This also has
    no effect in practice, but it should better be consistent.

    This patch fixes both inconsistencies by changing the SW read/write
    bit layout for pmds as well as the PAGE_NONE encoding for ptes. It
    also makes the hugetlbfs conversion functions more robust by
    introducing a move_set_bit() macro that uses the pte/pmd bit
    #defines instead of constant shifts (a hedged reconstruction follows
    below).

    Signed-off-by: Gerald Schaefer
    Signed-off-by: Martin Schwidefsky

    Gerald Schaefer
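
    A hedged reconstruction of the move_set_bit() idea; the exact macro
    in the patch may differ, but the point is to derive the shift counts
    from the existing single-bit #defines instead of hard-coding them:

    #include <linux/log2.h>

    /* Move the bit selected by single-bit mask @from to the position of
     * single-bit mask @to.
     */
    #define move_set_bit(x, from, to) \
            (((x) & (from)) >> ilog2(from) << ilog2(to))

    /* e.g. copy the SW write bit from its pte position to its pmd
     * position (mask names as in arch/s390 pgtable.h):
     *
     *   pmd |= move_set_bit(pte, _PAGE_WRITE, _SEGMENT_ENTRY_WRITE);
     */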
     

27 Jul, 2016

3 commits

  • Merge updates from Andrew Morton:

    - a few misc bits

    - ocfs2

    - most(?) of MM

    * emailed patches from Andrew Morton: (125 commits)
    thp: fix comments of __pmd_trans_huge_lock()
    cgroup: remove unnecessary 0 check from css_from_id()
    cgroup: fix idr leak for the first cgroup root
    mm: memcontrol: fix documentation for compound parameter
    mm: memcontrol: remove BUG_ON in uncharge_list
    mm: fix build warnings in <linux/compaction.h>
    mm, thp: convert from optimistic swapin collapsing to conservative
    mm, thp: fix comment inconsistency for swapin readahead functions
    thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
    shmem: split huge pages beyond i_size under memory pressure
    thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
    khugepaged: add support of collapse for tmpfs/shmem pages
    shmem: make shmem_inode_info::lock irq-safe
    khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
    thp: extract khugepaged from mm/huge_memory.c
    shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
    shmem: add huge pages support
    shmem: get_unmapped_area align huge page
    shmem: prepare huge= mount option and sysfs knob
    mm, rmap: account shmem thp pages
    ...

    Linus Torvalds
     
  • We always have vma->vm_mm around.

    Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pull s390 updates from Martin Schwidefsky:
    "There are a couple of new things for s390 with this merge request:

    - a new scheduling domain "drawer" is added to reflect the unusual
    topology found on z13 machines. Performance tests showed up to 8
    percent gain with the additional domain.

    - the new crc-32 checksum crypto module uses the vector-galois-field
    multiply and sum SIMD instruction to speed up crc-32 and crc-32c.

    - proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in
    the generic vmlinux.lds linker script definitions.

    - kcov instrumentation support. A prerequisite for that is the
    inline assembly basic block cleanup, which is the reason for the
    net/iucv/iucv.c change.

    - support for 2GB pages is added to the hugetlbfs backend.

    Then there are two removals:

    - the oprofile hardware sampling support is dead code and is removed.
    The oprofile user space uses the perf interface nowadays.

    - the ETR clock synchronization is removed, as it has been superseded
    by the STP clock synchronization. And it always has been
    "interesting" code...

    And the usual bug fixes and cleanups"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits)
    s390/pci: Delete an unnecessary check before the function call "pci_dev_put"
    s390/smp: clean up a condition
    s390/cio/chp : Remove deprecated create_singlethread_workqueue
    s390/chsc: improve channel path descriptor determination
    s390/chsc: sanitize fmt check for chp_desc determination
    s390/cio: make fmt1 channel path descriptor optional
    s390/chsc: fix ioctl CHSC_INFO_CU command
    s390/cio/device_ops: fix kernel doc
    s390/cio: allow to reset channel measurement block
    s390/console: Make preferred console handling more consistent
    s390/mm: fix gmap tlb flush issues
    s390/mm: add support for 2GB hugepages
    s390: have unique symbol for __switch_to address
    s390/cpuinfo: show maximum thread id
    s390/ptrace: clarify bits in the per_struct
    s390: stack address vs thread_info
    s390: remove pointless load within __switch_to
    s390: enable kcov support
    s390/cpumf: use basic block for ecctr inline assembly
    s390/hypfs: use basic block for diag inline assembly
    ...

    Linus Torvalds
     

13 Jul, 2016

1 commit

  • __tlb_flush_asce() should never be used if multiple asce belong to a mm.

    As this function changes the mm logic that determines whether local or
    global tlb flushes are needed, we might end up flushing only the gmap
    asce on all CPUs, while a follow-up mm asce flush will only flush on
    the local CPU, although that asce ran on multiple CPUs.

    The missing tlb flushes will provoke strange faults in user space and
    even low-address protection exceptions in user space, crashing the
    kernel.

    Fixes: 1b948d6caec4 ("s390/mm,tlb: optimize TLB flushing for zEC12")
    Cc: stable@vger.kernel.org # 3.15+
    Reported-by: Sascha Silbe
    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Martin Schwidefsky

    David Hildenbrand
     

28 Jun, 2016

1 commit

  • Use only simple inline assemblies which consist of a single basic
    block if the register asm construct is being used.

    Otherwise gcc can generate broken code when the compiler option
    -fsanitize-coverage=trace-pc is used (the pattern is sketched after
    this entry).

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
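
    A hedged sketch of the single-basic-block pattern; the diag function
    and its 0x123 code are hypothetical, only the shape of the asm
    matters (no branches between setting up the register asm variable
    and consuming it):

    static inline int diag_sketch(unsigned long cmd)
    {
            register unsigned long r1 asm("1") = cmd;
            int rc;

            asm volatile(
                    "       diag    %1,0,0x123\n"   /* illustrative */
                    "       ipm     %0\n"           /* fetch cc */
                    "       srl     %0,28\n"
                    : "=d" (rc)
                    : "d" (r1)
                    : "cc");
            return rc;
    }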
     

25 Jun, 2016

1 commit

  • __GFP_REPEAT has rather weak semantics, but since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    page_table_alloc() then uses the flag for a single page allocation.
    This means the flag has never actually been useful here, because it
    only ever applies to PAGE_ALLOC_COSTLY requests (see the sketch after
    this entry).

    Link: http://lkml.kernel.org/r/1464599699-30131-14-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
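
    A minimal sketch of the resulting cleanup; pgtable_alloc_page_sketch()
    is a stand-in for the relevant allocation in page_table_alloc():

    #include <linux/gfp.h>

    static struct page *pgtable_alloc_page_sketch(void)
    {
            /* order-0 allocation: __GFP_REPEAT was a no-op here,
             * so dropping it does not change behaviour.
             */
            return alloc_page(GFP_KERNEL);  /* was GFP_KERNEL | __GFP_REPEAT */
    }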
     

20 Jun, 2016

18 commits

  • Nested virtualization will have to enable its own gmaps. The current
    code would enable the wrong gmap whenever a VCPU is scheduled out and
    back in, resulting in the wrong gmap being active.

    This patch reenables the last enabled gmap, therefore avoiding having to
    touch vcpu->arch.gmap when enabling a different gmap.

    Acked-by: Christian Borntraeger
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • Let's not fault in everything in read-write but limit it to read-only
    where possible.

    When restricting access rights, we already have the required protection
    level in our hands. When reading from guest 2 storage (gmap_read_table),
    it is obviously PROT_READ. When shadowing a pte, the required protection
    level is given via the guest 2 provided pte.

    Based on an initial patch by Martin Schwidefsky.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • It will be very helpful to have a mechanism to check without any locks
    if a given gmap shadow is still valid and matches the given properties.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • For nested virtualization, we want to know if we are handling a protection
    exception, because these can directly be forwarded to the guest without
    additional checks.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • We have no known user of real-space designation and only support it to
    be architecture compliant.

    Gmap shadows with real-space designation are never unshadowed
    automatically, as there is nothing to protect for the top level table.

    So let's simply limit the number of such shadows to one by removing
    existing ones on creation of another one.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • We can easily support real-space designation just like EDAT1 and EDAT2,
    so guest2 can provide guest3 with an asce that has the real-space
    control set.

    We simply have to allocate the biggest page table possible and fake all
    levels.

    There is no protection to consider. If we exceed guest memory, vsie code
    will inject an addressing exception (via program intercept). In the future,
    we could limit the fake table level to the gmap page table.

    As the top level page table can never go away, such gmap shadows will
    never get unshadowed; we'll have to come up with another way to limit
    the number of kept gmap shadows.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • If the guest is enabled for EDAT2, we can easily create shadows for
    guest2 -> guest3 provided tables that make use of EDAT2.

    If guest2 references a 2GB page, this memory looks contiguous to
    guest2, but it does not have to be so for us. Therefore we have to
    create fake segment and page tables.

    This works just like EDAT1 support, so page tables are removed when the
    parent table (r3t table entry) is changed.

    We don't have to care about:
    - ACCF-Validity Control in RTTE
    - Access-Control Bits in RTTE
    - Fetch-Protection Bit in RTTE
    - Common-Region Bit in RTTE

    Just like for EDAT1, all of these bits might be dropped, and there is
    no guarantee that they are active.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • If the guest is enabled for EDAT1, we can easily create shadows for
    guest2 -> guest3 provided tables that make use of EDAT1.

    If guest2 references a 1MB page, this memory looks contiguous to guest2,
    but it might not be so for us. Therefore we have to create fake page
    tables.

    We can easily add that to our existing infrastructure. The invalidation
    mechanism will make sure that fake page tables are removed when the parent
    table (sgt table entry) is changed.

    As EDAT1 also introduced protection on all page table levels, we have to
    also shadow these correctly.

    We don't have to care about:
    - ACCF-Validity Control in STE
    - Access-Control Bits in STE
    - Fetch-Protection Bit in STE
    - Common-Segment Bit in STE

    All of these bits might be dropped, and there is no guarantee that
    they are active ("unpredictable whether the CPU uses these bits",
    "may be used"). Without using EDAT1 in the shadow ourselves
    (STE-format control == 0), simply shadowing these bits would not be
    enough anyway; they would be ignored.

    Please note that we are using the "fake" flag to make this look
    consistent with further changes (EDAT2, real-space designation
    support) and to keep the shadow functions from handling fc=1 STEs.

    In the future, with huge pages in the host, gmap_shadow_pgt() could simply
    try to map a huge host page if "fake" is set to one and indicate via return
    value that no lower fake tables / shadow ptes are required.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • In preparation for EDAT1/EDAT2 support for gmap shadows, we have to store
    the requested edat level in the gmap shadow.

    The edat level used during shadow translation is a property of the gmap
    shadow. Depending on that level, the gmap shadow will look differently for
    the same guest tables. We have to store it internally in order to support
    it later.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • Before any thread is allowed to use a gmap_shadow, it has to be fully
    initialized. However, for invalidation to work properly, we have to
    register the new gmap_shadow before we protect the parent gmap table.

    Because locking is tricky, and we have to avoid duplicate gmaps, let's
    introduce an initialized field that signals other threads whether that
    gmap_shadow can already be used or whether they have to retry.

    Let's properly return errors using ERR_PTR() instead of simply
    returning NULL, so a caller can react properly to the error (see the
    sketch after this entry).

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
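
    A hedged sketch of the lookup pattern described above; the structure,
    field and helper names are placeholders, and the real code does the
    check under the appropriate lock:

    #include <linux/err.h>

    struct gmap_shadow_sketch {
            bool initialized;   /* set once the shadow is fully set up */
            /* ... */
    };

    static struct gmap_shadow_sketch *find_shadow_sketch(void);

    static struct gmap_shadow_sketch *get_shadow_sketch(void)
    {
            struct gmap_shadow_sketch *sg = find_shadow_sketch();

            if (!sg)
                    return ERR_PTR(-ENOMEM);   /* caller sees the reason */
            if (!READ_ONCE(sg->initialized))
                    return ERR_PTR(-EAGAIN);   /* not usable yet: retry */
            return sg;
    }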
     
  • We have to unlock sg->guest_table_lock in order to call
    gmap_protect_rmap(). If we sleep just before that call, another VCPU
    might pick up that shadowed page table (while it is not protected yet)
    and use it.

    In order to avoid these races, we have to introduce a third state -
    "origin set but still invalid" for an entry. This way, we can avoid
    another thread already using the entry before the table is fully protected.
    As soon as everything is set up, we can clear the invalid bit - if we
    had no race with the unshadowing code.

    Suggested-by: Martin Schwidefsky
    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • We really want to avoid manually handling protection for nested
    virtualization. By shadowing pages with the protection the guest asked us
    for, the SIE can handle most protection-related actions for us (e.g.
    special handling for MVPG) and we can directly forward protection
    exceptions to the guest.

    PTEs will now always be shadowed with the correct _PAGE_PROTECT flag.
    Unshadowing will take care of any guest changes to the parent PTE and
    any host changes to the host PTE. If the host PTE doesn't have the
    fitting access rights or is not available, we have to fix it up.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • For now, the tlb of a shadow gmap is only flushed when the parent is
    removed, not when the shadow itself is removed upfront. Therefore other
    shadow gmaps can reuse the tables without the tlb getting flushed.

    Fix this by simply flushing the tlb
    1. before the shadow tables are removed (analogous to other unshadow
       functions), and
    2. when the gmap is freed and therefore the top level pages are freed.

    Acked-by: Martin Schwidefsky
    Signed-off-by: David Hildenbrand
    Signed-off-by: Christian Borntraeger

    David Hildenbrand
     
  • For a nested KVM guest the outer KVM host needs to create shadow
    page tables for the nested guest. This patch adds the basic support
    to the guest address space (gmap) code.

    For each guest address space the inner KVM host creates, the first
    outer KVM host needs to create shadow page tables. The address space
    is identified by the ASCE loaded into the control register 1 at the
    time the inner SIE instruction for the second nested KVM guest is
    executed. The outer KVM host creates the shadow tables starting with
    the table identified by the ASCE on a on-demand basis. The outer KVM
    host will get repeated faults for all the shadow tables needed to
    run the second KVM guest.

    While a shadow page table for the second KVM guest is active, access
    to the origin region, segment and page tables needs to be restricted
    for the first KVM guest. For region, segment and page tables the first
    KVM guest may read the memory, but a write attempt has to lead to an
    unshadow. This is done using the page invalid and read-only bits in the
    page table of the first KVM guest. If the first guest re-accesses one of
    the origin pages of a shadow, it gets a fault and the affected parts of
    the shadow page table hierarchy need to be removed again.

    PGSTE tables don't have to be shadowed, as the interpretation assists
    can't deal with the invalid bits in the shadow pte being set differently
    than the original ones provided by the first KVM guest.

    Many bug fixes and improvements by David Hildenbrand.

    Reviewed-by: David Hildenbrand
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Christian Borntraeger

    Martin Schwidefsky
     
  • Let's use a reference counter mechanism to control the lifetime of
    gmap structures. This will be needed for further changes related to
    gmap shadows (the get/put pattern is sketched after this entry).

    Reviewed-by: David Hildenbrand
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Christian Borntraeger

    Martin Schwidefsky
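
    A hedged sketch of the get/put lifetime pattern; the field and helper
    names are placeholders for the actual gmap code:

    #include <linux/atomic.h>

    struct gmap_sketch {
            atomic_t ref_count;
            /* ... */
    };

    static void gmap_free_sketch(struct gmap_sketch *gmap);

    static struct gmap_sketch *gmap_get_sketch(struct gmap_sketch *gmap)
    {
            atomic_inc(&gmap->ref_count);
            return gmap;
    }

    static void gmap_put_sketch(struct gmap_sketch *gmap)
    {
            /* free only when the last reference is dropped */
            if (atomic_dec_and_test(&gmap->ref_count))
                    gmap_free_sketch(gmap);
    }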
     
  • The current gmap pte notifier forces a pte into a read-write state.
    If the pte is invalidated, the gmap notifier is called to inform KVM
    that the mapping will go away.

    Extend this approach to allow read-write, read-only and no-access
    as possible target states and call the pte notifier for any change
    to the pte.

    This mechanism is used to temporarily set specific access rights for
    a pte without doing the heavy work of a true mprotect call.

    Reviewed-by: David Hildenbrand
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Christian Borntraeger

    Martin Schwidefsky
     
  • The gmap notifier list and the gmap list in the mm_struct change
    rarely. Use RCU to optimize the readers of these lists (sketched
    after this entry).

    Reviewed-by: David Hildenbrand
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Christian Borntraeger

    Martin Schwidefsky
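
    A hedged sketch of the RCU-protected list pattern; the notifier
    structure and names are illustrative:

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/spinlock.h>

    struct gmap_nb_sketch {
            struct list_head list;
            void (*notify)(unsigned long gaddr);
    };

    static LIST_HEAD(nb_list);
    static DEFINE_SPINLOCK(nb_lock);

    /* writers: rare, serialized by a lock */
    static void nb_register_sketch(struct gmap_nb_sketch *nb)
    {
            spin_lock(&nb_lock);
            list_add_tail_rcu(&nb->list, &nb_list);
            spin_unlock(&nb_lock);
    }

    /* readers: frequent, lockless under rcu_read_lock() */
    static void nb_call_all_sketch(unsigned long gaddr)
    {
            struct gmap_nb_sketch *nb;

            rcu_read_lock();
            list_for_each_entry_rcu(nb, &nb_list, list)
                    nb->notify(gaddr);
            rcu_read_unlock();
    }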
     
  • Pass an address range to the page table invalidation notifier
    for KVM. This allows notifying about changes that affect a larger
    virtual memory area, e.g. for 1MB pages (see the sketch after this
    entry).

    Reviewed-by: David Hildenbrand
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Christian Borntraeger

    Martin Schwidefsky
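
    A hedged sketch of the interface change; the structure name follows
    the gmap notifier mentioned above, details are simplified:

    #include <linux/list.h>

    struct gmap;    /* opaque here */

    struct gmap_notifier_sketch {
            struct list_head list;
            /* was: a single guest address argument; now a [start, end]
             * range, so one callback can cover e.g. a whole 1MB segment
             * invalidation.
             */
            void (*notifier_call)(struct gmap *gmap,
                                  unsigned long start, unsigned long end);
    };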
     

14 Jun, 2016

1 commit

  • The usual problem with code that is ifdef'ed out is that it doesn't
    compile after a while. That's also the case for the storage key
    initialisation code, if it were used (PAGE_DEFAULT_KEY set to
    something other than zero):

    ./arch/s390/include/asm/page.h: In function 'storage_key_init_range':
    ./arch/s390/include/asm/page.h:36:2: error: implicit declaration of function '__storage_key_init_range'

    Since the code itself has been useful for debugging purposes several
    times, remove the ifdefs and make sure the code gets compile
    coverage. The cost for this is eight bytes (the resulting inline is
    sketched after this entry).

    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
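
    A sketch of the resulting inline, assuming the mainline shape of this
    helper; with PAGE_DEFAULT_KEY == 0 the compiler folds the call away,
    but the body still gets compile coverage:

    static inline void storage_key_init_range(unsigned long start,
                                              unsigned long end)
    {
            if (PAGE_DEFAULT_KEY != 0)
                    __storage_key_init_range(start, end);
    }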