08 Jun, 2016

1 commit

  • This adds a REQ_OP_FLUSH operation that is sent to request_fn
    based drivers by the block layer's flush code, instead of
    sending requests with the request->cmd_flags REQ_FLUSH bit set.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     

05 Jun, 2016

4 commits

  • Signed-off-by: Helge Deller

    Helge Deller
     
  • One of the Debian buildd servers had this crash in the syslog without
    any other information:

    Unaligned handler failed, ret = -2
    clock_adjtime (pid 22578): Unaligned data reference (code 28)
    CPU: 1 PID: 22578 Comm: clock_adjtime Tainted: G E 4.5.0-2-parisc64-smp #1 Debian 4.5.4-1
    task: 000000007d9960f8 ti: 00000001bde7c000 task.ti: 00000001bde7c000

    YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
    PSW: 00001000000001001111100000001111 Tainted: G E
    r00-03 000000ff0804f80f 00000001bde7c2b0 00000000402d2be8 00000001bde7c2b0
    r04-07 00000000409e1fd0 00000000fa6f7fff 00000001bde7c148 00000000fa6f7fff
    r08-11 0000000000000000 00000000ffffffff 00000000fac9bb7b 000000000002b4d4
    r12-15 000000000015241c 000000000015242c 000000000000002d 00000000fac9bb7b
    r16-19 0000000000028800 0000000000000001 0000000000000070 00000001bde7c218
    r20-23 0000000000000000 00000001bde7c210 0000000000000002 0000000000000000
    r24-27 0000000000000000 0000000000000000 00000001bde7c148 00000000409e1fd0
    r28-31 0000000000000001 00000001bde7c320 00000001bde7c350 00000001bde7c218
    sr00-03 0000000001200000 0000000001200000 0000000000000000 0000000001200000
    sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000

    IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d2e84 00000000402d2e88
    IIR: 0ca0d089 ISR: 0000000001200000 IOR: 00000000fa6f7fff
    CPU: 1 CR30: 00000001bde7c000 CR31: ffffffffffffffff
    ORIG_R28: 00000002369fe628
    IAOQ[0]: compat_get_timex+0x2dc/0x3c0
    IAOQ[1]: compat_get_timex+0x2e0/0x3c0
    RP(r2): compat_get_timex+0x40/0x3c0
    Backtrace:
    [] compat_SyS_clock_adjtime+0x40/0xc0
    [] syscall_exit+0x0/0x14

    This means the userspace program clock_adjtime called the clock_adjtime()
    syscall and then crashed inside the compat_get_timex() function.
    Syscalls should never crash programs, but instead return EFAULT.

    The IIR register contains the executed instruction, which disassembles
    into "ldw 0(sr3,r5),r9".
    This load-word instruction is part of __get_user() which tried to read the word
    at %r5/IOR (0xfa6f7fff). This means the unaligned handler jumped in. The
    unaligned handler is able to emulate all ldw instructions, but it fails if it
    cannot read the source, e.g. because of a page fault.

    The following program reproduces the problem:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/mman.h>

    int main(void) {
        /* allocate 8k */
        char *ptr = mmap(NULL, 2*4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        /* free second half (upper 4k) and make it invalid. */
        munmap(ptr+4096, 4096);
        /* syscall where first int is unaligned and clobbers into invalid memory region */
        /* syscall should return EFAULT */
        return syscall(__NR_clock_adjtime, 0, ptr+4095);
    }

    To fix this issue we simply need to check if the faulting instruction address
    is in the exception fixup table when the unaligned handler failed. If it
    is, call the fixup routine instead of crashing.

    While looking at the unaligned handler I found another issue as well: The
    target register should not be modified if the handler was unsuccessful.
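
A minimal user-space sketch of the idea described above: when emulation of the unaligned access fails, look up the faulting instruction address in a fixup table and continue at the fixup routine instead of crashing. The table layout, the linear search, and the names `fixup_entry`, `find_fixup` and `handle_unaligned_failure` are assumptions made for illustration, not the kernel's actual data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup entry: faulting instruction address -> recovery address. */
struct fixup_entry {
    uintptr_t insn;
    uintptr_t fixup;
};

/* Linear search; returns NULL if addr has no registered fixup. */
const struct fixup_entry *find_fixup(const struct fixup_entry *tbl,
                                     size_t n, uintptr_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].insn == addr)
            return &tbl[i];
    return NULL;
}

/*
 * Called after emulation of an unaligned load has failed.
 * Returns 1 and redirects *pc to the fixup code if an entry exists,
 * 0 if the fault is unrecoverable. The target register is left
 * untouched in both cases, since the load did not succeed.
 */
int handle_unaligned_failure(const struct fixup_entry *tbl, size_t n,
                             uintptr_t *pc, unsigned long *target_reg)
{
    const struct fixup_entry *e = find_fixup(tbl, n, *pc);

    (void)target_reg; /* deliberately not written on failure */
    if (!e)
        return 0;
    *pc = e->fixup;
    return 1;
}
```

In this sketch the fixup code is what would eventually make the syscall return EFAULT to user space.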

    Signed-off-by: Helge Deller
    Cc: stable@vger.kernel.org

    Helge Deller
     
  • Avoid showing invalid printk time stamps during boot.

    Signed-off-by: Helge Deller
    Reviewed-by: Aaro Koskinen

    Helge Deller
     
  • This patch fixes backtrace on PA-RISC

    There were several problems:

    1) The code that decodes instructions handles instructions that subtract
    from the stack pointer incorrectly. If the instruction subtracts the
    number X from the stack pointer the code increases the frame size by
    (0x100000000-X). This results in invalid accesses to memory and
    recursive page faults.

    2) Because gcc reorders blocks, handling instructions that subtract from
    the frame pointer is incorrect. For example, this function
    int f(int a)
    {
        if (__builtin_expect(a, 1))
            return a;
        g();
        return a;
    }
    is compiled in such a way, that the code that decreases the stack
    pointer for the first "return a" is placed before the code for "g" call.
    If we recognize this decrement, we mistakenly believe that the frame
    size for the "g" call is zero.

    To fix problems 1) and 2), the patch doesn't recognize instructions that
    decrease the stack pointer at all. To further safeguard the unwind code
    against nonsense values, we don't allow frame size larger than
    Total_frame_size.

    3) The backtrace is not locked. If stack dump races with module unload,
    invalid table can be accessed.

    This patch adds a spinlock when processing module tables.
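
The arithmetic behind problem 1) and the two safeguards can be sketched as follows. This is an illustration only: the function names are hypothetical, the displacement is assumed to be already decoded into a signed 32-bit value, and clamping (rather than rejecting) oversized frames is an assumption.

```c
#include <stdint.h>

/*
 * Problem 1: a displacement of -x that is widened as an unsigned
 * 32-bit value becomes 0x100000000 - x instead of a small negative number.
 */
uint64_t misdecoded_adjustment(int32_t displacement)
{
    return (uint64_t)(uint32_t)displacement;
}

/*
 * Sketch of the fix: instructions that decrease the stack pointer are
 * not recognized at all, and the accumulated frame size is never
 * allowed to exceed total_frame_size.
 */
uint64_t update_frame_size(uint64_t frame_size, int32_t displacement,
                           uint64_t total_frame_size)
{
    if (displacement <= 0)
        return frame_size;              /* decrement: ignored */
    frame_size += (uint64_t)displacement;
    if (frame_size > total_frame_size)
        frame_size = total_frame_size;  /* guard against nonsense values */
    return frame_size;
}
```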

    Note that a correct backtrace requires recent binutils.
    Binutils 2.18 from Debian 5 produces garbage unwind tables.
    Binutils 2.21 works better (it sometimes forgets function frames, but at
    least it doesn't generate garbage).

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Helge Deller

    Mikulas Patocka
     

04 Jun, 2016

4 commits

  • Pull irq fixes from Thomas Gleixner:
    - a few simple fixes for fallout from the recent gic-v3 changes
    - a workaround for a Cavium thunderX erratum
    - a bugfix for the pic32 irqchip to make external interrupts work properly
    - a missing return value in the generic IPI management code

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/irq-pic32-evic: Fix bug with external interrupts.
    irqchip/gicv3-its: numa: Enable workaround for Cavium thunderx erratum 23144
    irqchip/gic-v3: Fix quiescence check in gic_enable_redist
    irqchip/gic-v3: Fix copy+paste mistakes in defines
    irqchip/gic-v3: Fix ICC_SGI1R_EL1.INTID decoding mask
    genirq: Fix missing return value in irq_destroy_ipi()

    Linus Torvalds
     
  • Pull ARM fix from Russell King:
    "Just one fix to the ptrace code, spotted by Simon Marchi, where if a
    thread migrates to a different CPU and the VFP registers are changed
    through ptrace, the application doesn't see the updated VFP registers"

    * 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
    ARM: fix PTRACE_SETVFPREGS on SMP systems

    Linus Torvalds
     
  • Pull arm64 fixes from Will Deacon:
    "The main thing here is reviving hugetlb support using contiguous ptes,
    which we ended up reverting at the last minute in 4.5 pending a fix
    which went into the core mm/ code during the recent merge window.

    - Revert a previous revert and get hugetlb going with contiguous hints
    - Wire up missing compat syscalls
    - Enable CONFIG_SET_MODULE_RONX by default
    - Add missing line to our compat /proc/cpuinfo output
    - Clarify levels in our page table dumps
    - Fix booting with RANDOMIZE_TEXT_OFFSET enabled
    - Misc fixes to the ARM CPU PMU driver (refcounting, probe failure)
    - Remove some dead code and update a comment"

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: fix alignment when RANDOMIZE_TEXT_OFFSET is enabled
    arm64: move {PAGE,CONT}_SHIFT into Kconfig
    arm64: mm: dump: log span level
    arm64: update stale PAGE_OFFSET comment
    drivers/perf: arm_pmu: Avoid leaking pmu->irq_affinity on error
    drivers/perf: arm_pmu: Defer the setting of __oprofile_cpu_pmu
    drivers/perf: arm_pmu: Fix reference count of a device_node in of_pmu_irq_cfg
    arm64: report CPU number in bad_mode
    arm64: unistd32.h: wire up missing syscalls for compat tasks
    arm64: Provide "model name" in /proc/cpuinfo for PER_LINUX32 tasks
    arm64: enable CONFIG_SET_MODULE_RONX by default
    arm64: Remove orphaned __addr_ok() definition
    Revert "arm64: hugetlb: partial revert of 66b3923a1a0f"

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:
    - Handle RTAS delay requests in configure_bridge from Russell Currey
    - Refactor the configure_bridge RTAS tokens from Russell Currey
    - Fix definition of SIAR and SDAR registers from Thomas Huth
    - Use privileged SPR number for MMCR2 from Thomas Huth
    - Update LPCR only if it is powernv from Aneesh Kumar K.V
    - Fix the reference bit update when handling hash fault from Aneesh
    Kumar K.V
    - Add missing tlb flush from Aneesh Kumar K.V
    - Add POWER8NVL support to ibm,client-architecture-support call from
    Thomas Huth

    * tag 'powerpc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/pseries: Add POWER8NVL support to ibm,client-architecture-support call
    powerpc/mm/radix: Add missing tlb flush
    powerpc/mm/hash: Fix the reference bit update when handling hash fault
    powerpc/mm/radix: Update LPCR only if it is powernv
    powerpc: Use privileged SPR number for MMCR2
    powerpc: Fix definition of SIAR and SDAR registers
    powerpc/pseries/eeh: Refactor the configure_bridge RTAS tokens
    powerpc/pseries/eeh: Handle RTAS delay requests in configure_bridge

    Linus Torvalds
     

03 Jun, 2016

7 commits

  • With ARM64_64K_PAGES and RANDOMIZE_TEXT_OFFSET enabled, we hit the
    following issue during boot:

    kernel BUG at arch/arm64/mm/mmu.c:480!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.6.0 #310
    Hardware name: ARM Juno development board (r2) (DT)
    task: ffff000008d58a80 ti: ffff000008d30000 task.ti: ffff000008d30000
    PC is at map_kernel_segment+0x44/0xb0
    LR is at paging_init+0x84/0x5b0
    pc : [] lr : [] pstate: 600002c5

    Call trace:
    [] map_kernel_segment+0x44/0xb0
    [] paging_init+0x84/0x5b0
    [] setup_arch+0x198/0x534
    [] start_kernel+0x70/0x388
    [] __primary_switched+0x30/0x74

    Commit 7eb90f2ff7e3 ("arm64: cover the .head.text section in the .text
    segment mapping") removed the alignment between the .head.text and .text
    sections, and used the _text rather than the _stext interval for mapping
    the .text segment.

    Prior to this commit _stext was always section aligned and didn't cause
    any issue even when RANDOMIZE_TEXT_OFFSET was enabled. Since that
    alignment has been removed and _text is used to map the .text segment,
    we need to ensure _text is always page aligned when RANDOMIZE_TEXT_OFFSET
    is enabled.

    This patch adds logic to TEXT_OFFSET fuzzing to ensure that the offset
    is always aligned to the kernel page size. To ensure this, we rely on
    the PAGE_SHIFT being available via Kconfig.
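
The alignment requirement can be illustrated with a small sketch. It maps an arbitrary random value to an offset that is a multiple of the page size. The 2 MiB window and the name `fuzz_text_offset` are assumptions for the example; the commit message only states that the offset must be aligned to the kernel page size and that PAGE_SHIFT is taken from Kconfig.

```c
#include <stdint.h>

/* Assumed fuzzing window of 2 MiB; not stated in the commit message. */
#define TEXT_OFFSET_WINDOW (2u * 1024u * 1024u)

/*
 * Map an arbitrary random value to an offset in [0, window) that is a
 * multiple of the page size (1 << page_shift).
 */
uint32_t fuzz_text_offset(uint32_t random_value, unsigned int page_shift)
{
    uint32_t page_size = 1u << page_shift;
    uint32_t pages = TEXT_OFFSET_WINDOW / page_size;

    return (random_value % pages) * page_size;
}
```

With a page shift of 16 (64K pages) every result is a multiple of 65536, which is the property the BUG above depends on.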

    Signed-off-by: Mark Rutland
    Reported-by: Sudeep Holla
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Will Deacon
    Fixes: 7eb90f2ff7e3 ("arm64: cover the .head.text section in the .text segment mapping")
    Signed-off-by: Will Deacon

    Mark Rutland
     
  • In some cases (e.g. the awk for CONFIG_RANDOMIZE_TEXT_OFFSET) we would
    like to make use of PAGE_SHIFT outside of code that can include the
    usual header files.

    Add a new CONFIG_ARM64_PAGE_SHIFT for this, likewise with
    ARM64_CONT_SHIFT for consistency.

    Signed-off-by: Mark Rutland
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Sudeep Holla
    Cc: Will Deacon
    Signed-off-by: Will Deacon

    Mark Rutland
     
  • The page table dump code logs spans of entries at the same level
    (pgd/pud/pmd/pte) which have the same attributes. While we log the
    (decoded) attributes, we don't log the level, which leaves the output
    ambiguous and/or confusing in some cases.

    For example:

    0xffff800800000000-0xffff800980000000 6G RW NX SHD AF BLK UXN MEM/NORMAL

    If using 4K pages, this may describe a span of 6 1G block entries at the
    PGD/PUD level, or 3072 2M block entries at the PMD level.

    This patch adds the page table level to each output line, removing this
    ambiguity. For the example above, this will produce:

    0xffffffc800000000-0xffffffc980000000 6G PUD RW NX SHD AF BLK UXN MEM/NORMAL
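
The ambiguity in the first example is simple arithmetic, sketched here for 4K pages, where a PUD-level block covers 1 GiB (shift 30) and a PMD-level block covers 2 MiB (shift 21). The helper name `entries_in_span` is hypothetical.

```c
#include <stdint.h>

/* Number of block entries of size (1 << level_shift) covering [start, end). */
uint64_t entries_in_span(uint64_t start, uint64_t end, unsigned int level_shift)
{
    return (end - start) >> level_shift;
}
```

For the span 0xffff800800000000-0xffff800980000000 this gives 6 entries at shift 30 and 3072 entries at shift 21, so the same output line fits both layouts unless the level is printed.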

    When 3 level tables are in use, and we use the asm-generic/nopud.h
    definitions, the dump code treats each entry in the PGD as a 1 element
    table at the PUD level, and logs spans as being PUDs, which can be
    confusing. To counteract this, the "PUD" mnemonic is replaced with "PGD"
    when CONFIG_PGTABLE_LEVELS <= 3.

    Cc: Catalin Marinas
    Cc: Huang Shijie
    Cc: Laura Abbott
    Cc: Steve Capper
    Cc: Will Deacon
    Signed-off-by: Will Deacon

    Mark Rutland
     
  • Commit ab893fb9f1b17f02 ("arm64: introduce KIMAGE_VADDR as the virtual
    base of the kernel region") logically split KIMAGE_VADDR from
    PAGE_OFFSET, and since commit f9040773b7bbbd9e ("arm64: move kernel
    image to base of vmalloc area") the two have been distinct values.

    Unfortunately, neither commit updated the comment above these
    definitions, which now erroneously states that PAGE_OFFSET is the start
    of the kernel image rather than the start of the linear mapping.

    This patch fixes said comment, and introduces an explanation of
    KIMAGE_VADDR.

    Signed-off-by: Mark Rutland
    Cc: Will Deacon
    Cc: Catalin Marinas
    Cc: Marc Zyngier
    Signed-off-by: Will Deacon

    Mark Rutland
     
  • If we take an exception we don't expect (e.g. SError), we report this in
    the bad_mode handler with pr_crit. Depending on the configured log
    level, we may or may not log additional information in functions called
    subsequently. Notably, the messages in dump_stack (including the CPU
    number) are printed with KERN_DEFAULT and may not appear.

    Some exceptions have an IMPLEMENTATION DEFINED ESR_ELx.ISS encoding, and
    knowing the CPU number is crucial to correctly decode them. To ensure
    that this is always possible, we should log the CPU number along with
    the ESR_ELx value, so we are not reliant on subsequent logs or
    additional printk configuration options.

    This patch logs the CPU number in bad_mode such that it is possible for
    a developer to decode these exceptions, provided access to sufficient
    documentation.

    Signed-off-by: Mark Rutland
    Reported-by: Al Grant
    Cc: Catalin Marinas
    Cc: Dave Martin
    Cc: Robin Murphy
    Cc: Will Deacon
    Signed-off-by: Will Deacon

    Mark Rutland
     
  • Pull KVM fixes from Radim Krčmář:
    "ARM:
    - two fixes for 4.6 vgic [Christoffer] (cc stable)

    - six fixes for 4.7 vgic [Marc]

    x86:
    - six fixes from syzkaller reports [Paolo] (two of them cc stable)

    - allow OS X to boot [Dmitry]

    - don't trust compilers [Nadav]"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: fix OOPS after invalid KVM_SET_DEBUGREGS
    KVM: x86: avoid vmalloc(0) in the KVM_SET_CPUID
    KVM: irqfd: fix NULL pointer dereference in kvm_irq_map_gsi
    KVM: fail KVM_SET_VCPU_EVENTS with invalid exception number
    KVM: x86: avoid vmalloc(0) in the KVM_SET_CPUID
    kvm: x86: avoid warning on repeated KVM_SET_TSS_ADDR
    KVM: Handle MSR_IA32_PERF_CTL
    KVM: x86: avoid write-tearing of TDP
    KVM: arm/arm64: vgic-new: Removel harmful BUG_ON
    arm64: KVM: vgic-v3: Relax synchronization when SRE==1
    arm64: KVM: vgic-v3: Prevent the guest from messing with ICC_SRE_EL1
    arm64: KVM: Make ICC_SRE_EL1 access return the configured SRE value
    KVM: arm/arm64: vgic-v3: Always resample level interrupts
    KVM: arm/arm64: vgic-v2: Always resample level interrupts
    KVM: arm/arm64: vgic-v3: Clear all dirty LRs
    KVM: arm/arm64: vgic-v2: Clear all dirty LRs

    Linus Torvalds
     
  • The erratum workaround fixes the hang of the ITS SYNC command by avoiding
    inter-node I/O and collections/CPU mapping on ThunderX dual-socket platforms.

    This fix is only applicable for Cavium's ThunderX dual-socket platform.

    Reviewed-by: Robert Richter
    Signed-off-by: Ganapatrao Kulkarni
    Signed-off-by: Robert Richter
    Signed-off-by: Marc Zyngier

    Ganapatrao Kulkarni
     

02 Jun, 2016

8 commits

  • MOV to DR6 or DR7 causes a #GP if an attempt is made to write a 1 to
    any of bits 63:32. However, this is not detected at KVM_SET_DEBUGREGS
    time, and the next KVM_RUN oopses:

    general protection fault: 0000 [#1] SMP
    CPU: 2 PID: 14987 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
    Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
    [...]
    Call Trace:
    [] kvm_arch_vcpu_ioctl_run+0x141d/0x14e0 [kvm]
    [] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
    [] do_vfs_ioctl+0x298/0x480
    [] SyS_ioctl+0x79/0x90
    [] entry_SYSCALL_64_fastpath+0x12/0x71
    Code: 55 83 ff 07 48 89 e5 77 27 89 ff ff 24 fd 90 87 80 81 0f 23 fe 5d c3 0f 23 c6 5d c3 0f 23 ce 5d c3 0f 23 d6 5d c3 0f 23 de 5d c3 23 f6 5d c3 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
    RIP [] native_set_debugreg+0x2b/0x40
    RSP

    Testcase (beautified/reduced from syzkaller output):

    #include <string.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    long r[8];

    int main()
    {
        struct kvm_debugregs dr = { 0 };

        r[2] = open("/dev/kvm", O_RDONLY);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);

        /* fill the debug registers with junk, including bits 63:32 of DR6/DR7 */
        memcpy(&dr,
               "\x5d\x6a\x6b\xe8\x57\x3b\x4b\x7e\xcf\x0d\xa1\x72"
               "\xa3\x4a\x29\x0c\xfc\x6d\x44\x00\xa7\x52\xc7\xd8"
               "\x00\xdb\x89\x9d\x78\xb5\x54\x6b\x6b\x13\x1c\xe9"
               "\x5e\xd3\x0e\x40\x6f\xb4\x66\xf7\x5b\xe3\x36\xcb",
               48);
        r[7] = ioctl(r[4], KVM_SET_DEBUGREGS, &dr);
        r[6] = ioctl(r[4], KVM_RUN, 0);
    }
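
The missing check can be sketched as a plain validation function: reject DR6 or DR7 values that have any of bits 63:32 set before they ever reach the hardware. The function name and the -1/0 return convention are assumptions for illustration; the commit message does not show the actual code.

```c
#include <stdint.h>

/* Returns 0 if dr6 and dr7 could be loaded without a #GP, -1 otherwise. */
int check_debugregs(uint64_t dr6, uint64_t dr7)
{
    const uint64_t upper = 0xffffffff00000000ULL; /* bits 63:32 */

    if ((dr6 & upper) || (dr7 & upper))
        return -1;
    return 0;
}
```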

    Reported-by: Dmitry Vyukov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     
  • This cannot be returned by KVM_GET_VCPU_EVENTS, so it is okay to return
    EINVAL. It causes a WARN from exception_type:

    WARNING: CPU: 3 PID: 16732 at arch/x86/kvm/x86.c:345 exception_type+0x49/0x50 [kvm]()
    CPU: 3 PID: 16732 Comm: a.out Tainted: G W 4.4.6-300.fc23.x86_64 #1
    Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
    0000000000000286 000000006308a48b ffff8800bec7fcf8 ffffffff813b542e
    0000000000000000 ffffffffa0966496 ffff8800bec7fd30 ffffffff810a40f2
    ffff8800552a8000 0000000000000000 00000000002c267c 0000000000000001
    Call Trace:
    [] dump_stack+0x63/0x85
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] exception_type+0x49/0x50 [kvm]
    [] kvm_arch_vcpu_ioctl_run+0x10a2/0x14e0 [kvm]
    [] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
    [] do_vfs_ioctl+0x298/0x480
    [] SyS_ioctl+0x79/0x90
    [] entry_SYSCALL_64_fastpath+0x12/0x71
    ---[ end trace b1a0391266848f50 ]---

    Testcase (beautified/reduced from syzkaller output):

    #include <string.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    long r[31];

    int main()
    {
        memset(r, -1, sizeof(r));
        r[2] = open("/dev/kvm", O_RDONLY);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[7] = ioctl(r[3], KVM_CREATE_VCPU, 0);

        /* inject an exception with an out-of-range vector number */
        struct kvm_vcpu_events ve = {
            .exception.injected = 1,
            .exception.nr = 0xd4
        };
        r[27] = ioctl(r[7], KVM_SET_VCPU_EVENTS, &ve);
        r[30] = ioctl(r[7], KVM_RUN, 0);
        return 0;
    }
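
A sketch of the kind of check that makes the ioctl fail with EINVAL. The reduced structure and the bound of 31 are assumptions: x86 reserves vectors 0-31 for exceptions, but the commit message does not state which values the patch actually rejects.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the exception part of struct kvm_vcpu_events. */
struct exception_event {
    uint8_t injected;
    uint8_t nr;
};

/* Returns 0 if the event is acceptable, -EINVAL otherwise. */
int validate_exception_event(const struct exception_event *ev)
{
    /* Assumed bound: only vectors 0..31 are architectural exceptions. */
    if (ev->injected && ev->nr > 31)
        return -EINVAL;
    return 0;
}
```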

    Reported-by: Dmitry Vyukov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     
  • This causes an ugly dmesg splat. Beautified syzkaller testcase:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    long r[8];

    int main()
    {
        struct kvm_cpuid2 c = { 0 };
        r[2] = open("/dev/kvm", O_RDWR);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[4] = ioctl(r[3], KVM_CREATE_VCPU, 0x8);
        r[7] = ioctl(r[4], KVM_SET_CPUID, &c);
        return 0;
    }

    Reported-by: Dmitry Vyukov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     
  • Found by syzkaller:

    WARNING: CPU: 3 PID: 15175 at arch/x86/kvm/x86.c:7705 __x86_set_memory_region+0x1dc/0x1f0 [kvm]()
    CPU: 3 PID: 15175 Comm: a.out Tainted: G W 4.4.6-300.fc23.x86_64 #1
    Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
    0000000000000286 00000000950899a7 ffff88011ab3fbf0 ffffffff813b542e
    0000000000000000 ffffffffa0966496 ffff88011ab3fc28 ffffffff810a40f2
    00000000000001fd 0000000000003000 ffff88014fc50000 0000000000000000
    Call Trace:
    [] dump_stack+0x63/0x85
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] __x86_set_memory_region+0x1dc/0x1f0 [kvm]
    [] x86_set_memory_region+0x3b/0x60 [kvm]
    [] vmx_set_tss_addr+0x3c/0x150 [kvm_intel]
    [] kvm_arch_vm_ioctl+0x654/0xbc0 [kvm]
    [] kvm_vm_ioctl+0x9a/0x6f0 [kvm]
    [] do_vfs_ioctl+0x298/0x480
    [] SyS_ioctl+0x79/0x90
    [] entry_SYSCALL_64_fastpath+0x12/0x71

    Testcase:

    #include <string.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    long r[8];

    int main()
    {
        memset(r, -1, sizeof(r));
        r[2] = open("/dev/kvm", O_RDONLY|O_TRUNC);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);
        /* setting the TSS address twice triggers the warning */
        r[5] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
        r[7] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
        return 0;
    }

    Reported-by: Dmitry Vyukov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     
  • Intel CPUs having Turbo Boost feature implement an MSR to provide a
    control interface via rdmsr/wrmsr instructions. One could detect the
    presence of this feature by issuing one of these instructions and
    handling the #GP exception which is generated in case the referenced MSR
    is not implemented by the CPU.

    KVM's vCPU model behaves exactly as a real CPU in this case by injecting
    a fault when MSR_IA32_PERF_CTL is accessed (which KVM does not support).
    However, some operating systems use this register during an early boot
    stage in which their kernel is not capable of handling #GP correctly,
    causing #DF and finally a triple fault, effectively resetting the vCPU.

    This patch implements a dummy handler for MSR_IA32_PERF_CTL to avoid the
    crashes.
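
What a dummy handler means here can be sketched as follows: accesses to MSR_IA32_PERF_CTL succeed without doing anything, while other unsupported MSRs still report a fault. The function names, the return convention (0 for success, 1 for "inject #GP") and the choice to read the register as zero are assumptions; only the MSR index 0x199 is architectural.

```c
#include <stdint.h>

#define MSR_IA32_PERF_CTL 0x00000199u  /* architectural MSR index */

/* Returns 0 on success, 1 to signal that a #GP should be injected. */
int sketch_get_msr(uint32_t index, uint64_t *value)
{
    switch (index) {
    case MSR_IA32_PERF_CTL:
        *value = 0;   /* dummy: reads as zero (assumption) */
        return 0;
    default:
        return 1;     /* unsupported MSR */
    }
}

/* Returns 0 on success, 1 to signal that a #GP should be injected. */
int sketch_set_msr(uint32_t index, uint64_t value)
{
    (void)value;
    switch (index) {
    case MSR_IA32_PERF_CTL:
        return 0;     /* dummy: write is accepted and ignored */
    default:
        return 1;     /* unsupported MSR */
    }
}
```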

    Signed-off-by: Dmitry Bilunov
    Signed-off-by: Radim Krčmář

    Dmitry Bilunov
     
  • In theory, nothing prevents the compiler from write-tearing PTEs, or
    splitting PTE writes. These partially-modified PTEs can be fetched by other
    cores and cause mayhem. I have not actually encountered such a case in
    real life, but it does seem possible.

    For example, the compiler may try to do something creative for
    kvm_set_pte_rmapp() and perform multiple writes to the PTE.
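
One common way to discourage the compiler from tearing a store is to perform it through a volatile-qualified lvalue, which is the idea behind the kernel's WRITE_ONCE() macro. The sketch below is a user-space illustration with a hypothetical helper name; whether the patch uses exactly this form is not stated in the message above.

```c
#include <stdint.h>

/*
 * Store a 64-bit entry through a volatile lvalue so that the compiler
 * emits the store as written instead of splitting or repeating it.
 * This relies on the entry being naturally aligned and on the target
 * having a single 64-bit store instruction.
 */
void set_entry_once(uint64_t *entry, uint64_t value)
{
    *(volatile uint64_t *)entry = value;
}
```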

    Signed-off-by: Nadav Amit
    Signed-off-by: Radim Krčmář

    Nadav Amit
     
  • PTRACE_SETVFPREGS fails to properly mark the VFP register set to be
    reloaded, because it undoes one of the effects of vfp_flush_hwstate().

    Specifically vfp_flush_hwstate() sets thread->vfpstate.hard.cpu to
    an invalid CPU number, but vfp_set() overwrites this with the original
    CPU number, thereby rendering the hardware state as apparently "valid",
    even though the software state is more recent.

    Fix this by reverting the previous change.
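
The failure mode can be modelled in a few lines of user-space C. Everything here is a simplified assumption: the structure, the invalid-CPU marker and the function names are hypothetical, and the "fixed" variant only shows one ordering that keeps the flush effective.

```c
#define INVALID_CPU 0xffffffffu  /* hypothetical "no CPU holds this state" marker */

struct vfp_hard {
    unsigned long long fpregs[2];
    unsigned int cpu;            /* CPU whose hardware holds this state */
};

/* Mark the hardware copy as stale so the software state gets reloaded. */
void flush_hwstate(struct vfp_hard *hard)
{
    hard->cpu = INVALID_CPU;
}

int hw_state_looks_valid(const struct vfp_hard *hard, unsigned int this_cpu)
{
    return hard->cpu == this_cpu;
}

/* Buggy ordering: the snapshot carries the old CPU number and undoes the flush. */
void set_regs_buggy(struct vfp_hard *hard, unsigned long long value)
{
    struct vfp_hard tmp = *hard;

    tmp.fpregs[0] = value;
    flush_hwstate(hard);
    *hard = tmp;                 /* overwrites cpu with the original value */
}

/* Ordering that keeps the flush effective: copy first, then invalidate. */
void set_regs_fixed(struct vfp_hard *hard, unsigned long long value)
{
    struct vfp_hard tmp = *hard;

    tmp.fpregs[0] = value;
    *hard = tmp;
    flush_hwstate(hard);
}
```

After `set_regs_buggy()` the hardware state still looks valid for the original CPU, so the updated registers would not be reloaded; after `set_regs_fixed()` it does not.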

    Cc:
    Fixes: 8130b9d7b9d8 ("ARM: 7308/1: vfp: flush thread hwstate before copying ptrace registers")
    Acked-by: Will Deacon
    Tested-by: Simon Marchi
    Signed-off-by: Russell King

    Russell King
     
  • We're missing entries for mlock2, copy_file_range, preadv2 and pwritev2
    in our compat syscall table, so hook them up. Only the last two need
    compat wrappers.

    Signed-off-by: Will Deacon

    Will Deacon
     

01 Jun, 2016

7 commits

  • Pull sparc fixes from David Miller:
    "sparc64 mmu context allocation and trap return bug fixes"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc64: Fix return from trap window fill crashes.
    sparc: Harden signal return frame checks.
    sparc64: Take ctx_alloc_lock properly in hugetlb_setup().

    Linus Torvalds
     
  • If we do not provide the PVR for POWER8NVL, a guest on this system
    currently ends up in PowerISA 2.06 compatibility mode on KVM, since QEMU
    does not provide a generic PowerISA 2.07 mode yet. So some new
    instructions from POWER8 (like "mtvsrd") get disabled for the guest,
    resulting in crashes when using code compiled explicitly for
    POWER8 (e.g. with the "-mcpu=power8" option of GCC).

    Fixes: ddee09c099c3 ("powerpc: Add PVR for POWER8NVL processor")
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Thomas Huth
    Signed-off-by: Michael Ellerman

    Thomas Huth
     
  • This should not have any impact on hash, because hash does tlb
    invalidate with every pte update and we don't implement
    flush_tlb_* functions for hash. With radix we should make an explicit
    call to flush tlb outside pte update.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     
  • When we converted the asm routines to C functions, we missed updating
    HPTE_R_R based on _PAGE_ACCESSED. ASM code used to copy over the lower
    bits from pte via:

    andi. r3,r30,0x1fe /* Get basic set of flags */

    We also update the code such that we won't update the Change bit ('C'
    bit) always. This was added by commit c5cf0e30bf3d8 ("powerpc: Fix
    buglet with MMU hash management").

    With hash64, we need to make sure that hardware doesn't do a pte update
    directly. This is because we do end up with entries in TLB with no hash
    page table entry. This happens because when we find a hash bucket full,
    we "evict" a more/less random entry from it. When we do that we don't
    invalidate the TLB (hpte_remove) because we assume the old translation
    is still technically "valid". For more info look at commit
    0608d692463 ("powerpc/mm: Always invalidate tlb on hpte invalidate and
    update").

    Thus it's critical that valid hash PTEs always have reference bit set
    and writeable ones have change bit set. We do this by hashing a
    non-dirty linux PTE as read-only and always setting _PAGE_ACCESSED (and
    thus R) when hashing anything else in. Any attempt by Linux at clearing
    those bits also removes the corresponding hash entry.

    Commit c5cf0e30bf3d8 did that for 'C' bit by enabling 'C' bit always.
    We don't really need to do that because we never map a RW pte entry
    without setting 'C' bit. On READ fault on a RW pte entry, we still map
    it READ only, hence a store update in the page will still cause a hash
    pte fault.

    This patch reverts part of commit c5cf0e30bf3d8 ("[PATCH] powerpc:
    Fix buglet with MMU hash management") and retains the updatepp part.

    - If we hit the updatepp path on native, the old code without that
    commit would fail to set C, because native_hpte_updatepp()
    was implemented to filter the same bits as H_PROTECT and not let C
    through. Thus we would "upgrade" a RO HPTE to RW without setting C,
    causing the bug. So the real fix in that commit was the change
    to native_hpte_updatepp().

    Fixes: 89ff725051d1 ("powerpc/mm: Convert __hash_page_64K to C")
    Cc: stable@vger.kernel.org # v4.5+
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     
  • LPCR cannot be updated when running in guest mode.

    Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines")
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     
  • This patch brings the PER_LINUX32 /proc/cpuinfo format more in line with
    the 32-bit ARM one by providing an additional line:

    model name : ARMv8 Processor rev X (v8l)

    Cc:
    Acked-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas
     
  • Pull s390 fixes from Martin Schwidefsky:
    "Three bug fixes and an update for the default configuration"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390: fix info leak in do_sigsegv
    s390/config: update default configuration
    s390/bpf: fix recache skb->data/hlen for skb_vlan_push/pop
    s390/bpf: reduce maximum program size to 64 KB

    Linus Torvalds
     

31 May, 2016

9 commits

  • The GICv3 backend of the vgic is quite barrier heavy, in order
    to ensure synchronization of the system registers and the
    memory mapped view for a potential GICv2 guest.

    But when the guest is using a GICv3 model, there is absolutely
    no need to execute all these heavy barriers, and it is actually
    beneficial to avoid them altogether.

    This patch makes the synchronization conditional, and ensures
    that we do not change the EL1 SRE settings if we do not need to.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • Both our GIC emulations are "strict", in the sense that we either
    emulate a GICv2 or a GICv3, and not a GICv3 with GICv2 legacy
    support.

    But when running on a GICv3 host, we still allow the guest to
    tinker with the ICC_SRE_EL1 register during its time slice:
    it can switch SRE off, observe that it is off, and yet on the
    next world switch, find the SRE bit to be set again. Not very
    nice.

    An obvious solution is to always trap accesses to ICC_SRE_EL1
    (by clearing ICC_SRE_EL2.Enable), and to let the handler return
    the programmed value on a read, or ignore the write.

    That way, the guest can always observe that our GICv3 is SRE==1
    only.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • When we trap ICC_SRE_EL1, we handle it as RAZ/WI. It would be
    more correct to actually make it RO, and return the configured
    value when read.
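
The difference between RAZ/WI and read-only can be shown with a small sketch of a trap handler. The function name and parameters are hypothetical; the point is only that a read returns the configured value instead of zero, while a write is still ignored.

```c
#include <stdint.h>

/*
 * Hypothetical handler for a trapped ICC_SRE_EL1 access.
 * configured: the SRE value programmed for this guest.
 * is_write:   non-zero if the guest attempted a write.
 * reg:        guest register that supplies or receives the value.
 */
void handle_sre_access(uint64_t configured, int is_write, uint64_t *reg)
{
    if (is_write)
        return;          /* read-only: the write is ignored */
    *reg = configured;   /* a read returns the configured value, not zero */
}
```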

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • When saving the state of the list registers, it is critical to
    reset them to zero, as we could otherwise leave unexpected EOI
    interrupts pending for virtual level interrupts.

    Cc: stable@vger.kernel.org # v4.6+
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The SET_MODULE_RONX protections are effectively the same as the
    DEBUG_RODATA protections we enabled by default back in commit
    57efac2f7108e325 ("arm64: enable CONFIG_DEBUG_RODATA by default"). It
    seems unusual to have one but not the other.

    As evidenced by the help text, the rationale appears to be that
    SET_MODULE_RONX interacts poorly with tracing and patching, but both of
    these make use of the insn framework, which takes SET_MODULE_RONX into
    account. Any remaining issues are bugs which should be fixed regardless
    of the default state of the option.

    This patch enables DEBUG_SET_MODULE_RONX by default, and replaces the
    help text with a new wording derived from the DEBUG_RODATA help text,
    which better describes the functionality. Previously, the DEBUG_RODATA
    entry was inconsistently indented with spaces, which are replaced with
    tabs as with the other Kconfig entries.

    Additionally, the wording of recommended defaults is made consistent for
    all options. These are placed in a new paragraph, unquoted, as a full
    sentence (with a period/full stop) as this appears to be the most common
    form per $(git grep 'in doubt').

    Cc: Catalin Marinas
    Cc: Laura Abbott
    Acked-by: Kees Cook
    Acked-by: Ard Biesheuvel
    Signed-off-by: Mark Rutland
    Signed-off-by: Will Deacon

    Mark Rutland
     
  • Since commit 12a0ef7b0ac3 ("arm64: use generic strnlen_user and
    strncpy_from_user functions"), the definition of __addr_ok() has been
    languishing unused; eradicate the sucker.

    CC: Catalin Marinas
    Signed-off-by: Robin Murphy
    Signed-off-by: Will Deacon

    Robin Murphy
     
  • We are already using the privileged versions of MMCR0, MMCR1
    and MMCRA in the kernel, so for MMCR2 we should use
    the privileged version, too, to be consistent.

    Fixes: 240686c13687 ("powerpc: Initialise PMU related regs on Power8")
    Cc: stable@vger.kernel.org # v3.10+
    Suggested-by: Paul Mackerras
    Signed-off-by: Thomas Huth
    Acked-by: Paul Mackerras
    Signed-off-by: Michael Ellerman

    Thomas Huth
     
  • The SIAR and SDAR registers are available twice, one time as SPRs
    780 / 781 (unprivileged, but read-only), and one time as the SPRs
    796 / 797 (privileged, but read and write). The Linux kernel code
    currently uses the unprivileged SPRs - while this is OK for reading,
    writing to that register of course does not work.
    Since the KVM code tries to write to this register, too (see the mtspr
    in book3s_hv_rmhandlers.S), the contents of this register sometimes get
    lost for the guests, e.g. during migration of a VM.
    To fix this issue, simply switch to the privileged SPR numbers instead.

    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Huth
    Acked-by: Paul Mackerras
    Signed-off-by: Michael Ellerman

    Thomas Huth
     
  • This reverts commit ff7925848b50050732ac0401e0acf27e8b241d7b.

    Now that the contiguous-hint hugetlb regression has been debugged and
    fixed upstream by 66ee95d16a7f ("mm: exclude HugeTLB pages from THP
    page_mapped() logic"), we can revert the previous partial revert of this
    feature.

    Signed-off-by: Will Deacon

    Will Deacon