07 Jul, 2016

1 commit


03 Jul, 2016

2 commits

  • Pull MIPS fix from Ralf Baechle:
    "Only a single fix for 4.7 pending at this point. It fixes an issue
    that may lead to corruption of the cache mode bits in the page table"

    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
    MIPS: Fix possible corruption of cache mode by mprotect.

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:

    - tm: Always reclaim in start_thread() for exec() class syscalls from
    Cyril Bur

    - tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 from Michael
    Neuling

    - eeh: Fix wrong argument passed to eeh_rmv_device() from Gavin Shan

    - Initialise pci_io_base as early as possible from Darren Stevens

    * tag 'powerpc-4.7-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc: Initialise pci_io_base as early as possible
    powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
    powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()
    powerpc/tm: Always reclaim in start_thread() for exec() class syscalls

    Linus Torvalds
     

02 Jul, 2016

1 commit

  • The following testcase may result in page table entries with an invalid
    CCA field being generated:

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Not defined in the original posting; one 4 KiB page is an assumed value. */
    #define BINDSTACK_SIZE 4096

    static void *bindstack;

    static int sysrqfd;

    static void protect_low(int protect)
    {
        mprotect(bindstack, BINDSTACK_SIZE, protect);
    }

    static void sigbus_handler(int signal, siginfo_t * info, void *context)
    {
        void *addr = info->si_addr;

        write(sysrqfd, "x", 1);

        printf("sigbus, fault address %p (should not happen, but might)\n",
               addr);
        abort();
    }

    static void run_bind_test(void)
    {
        unsigned int *p = bindstack;

        p[0] = 0xf001f001;

        write(sysrqfd, "x", 1);

        /* Set trap on access to p[0] */
        protect_low(PROT_NONE);

        write(sysrqfd, "x", 1);

        /* Clear trap on access to p[0] */
        protect_low(PROT_READ | PROT_WRITE | PROT_EXEC);

        write(sysrqfd, "x", 1);

        /* Check the contents of p[0] */
        if (p[0] != 0xf001f001) {
            write(sysrqfd, "x", 1);

            /* Reached, but shouldn't be */
            printf("badness, shouldn't happen but does\n");
            abort();
        }
    }

    int main(void)
    {
        struct sigaction sa;

        sysrqfd = open("/proc/sysrq-trigger", O_WRONLY);

        /* Fetch the current signal mask into sa.sa_mask */
        if (sigprocmask(SIG_BLOCK, NULL, &sa.sa_mask)) {
            perror("sigprocmask");
            return 0;
        }

        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO | SA_NODEFER | SA_RESTART;
        if (sigaction(SIGBUS, &sa, NULL)) {
            perror("sigaction");
            return 0;
        }

        bindstack = mmap(NULL,
                         BINDSTACK_SIZE,
                         PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (bindstack == MAP_FAILED) {
            perror("mmap bindstack");
            return 0;
        }

        printf("bindstack: %p\n", bindstack);

        run_bind_test();

        printf("done\n");

        return 0;
    }

    There are multiple ingredients for this:

    1) PAGE_NONE is defined to _CACHE_CACHABLE_NONCOHERENT, which is CCA 3
    on all platforms except SB1 where it's CCA 5.
    2) _page_cachable_default must have bits set which are not set in
    _CACHE_CACHABLE_NONCOHERENT.
    3) Either the defective version of pte_modify for XPA or the standard
    version must be in use. However, pte_modify for the 36-bit address
    space support is not affected.

    In that case additional bits in the final CCA mode may generate an invalid
    value for the CCA field. On the R10000 system where this was tracked
    down for example a CCA 7 has been observed, which is Uncached Accelerated.

    Fixed by:

    1) Using the proper CCA mode for PAGE_NONE just like for all the other
    PAGE_* pte/pmd bits.
    2) Fixing the two affected variants of pte_modify.

    Further code inspection also shows the same issue to exist in pmd_modify
    which would affect huge page systems.
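
    The way stray bits end up in the CCA field can be sketched in a few
    lines of C. This is only an illustration: the bit layout, the mask
    names and the two helper functions below are hypothetical and do not
    match any real MIPS PTE format. It assumes the old PTE carries CCA 5
    and PAGE_NONE carries CCA 3.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout only: real MIPS PTE bit positions differ per CPU. */
#define CCA_SHIFT 3
#define CCA_MASK  (UINT32_C(7) << CCA_SHIFT)
#define PFN_MASK  UINT32_C(0xfffff000)
#define PROT_BITS UINT32_C(0x7)          /* hypothetical R/W/X bits */

/* Extract the cache coherency attribute from a PTE value. */
static uint32_t cca_of(uint32_t pte)
{
    return (pte & CCA_MASK) >> CCA_SHIFT;
}

/* Defective shape: the old CCA is kept and the new CCA is OR-ed on top. */
static uint32_t pte_modify_buggy(uint32_t pte, uint32_t newprot)
{
    uint32_t keep = PFN_MASK | CCA_MASK;

    return (pte & keep) | newprot;
}

/* Corrected shape: bits preserved from the old PTE are masked out of newprot. */
static uint32_t pte_modify_fixed(uint32_t pte, uint32_t newprot)
{
    uint32_t keep = PFN_MASK | CCA_MASK;

    assert((keep & PROT_BITS) == 0);     /* preserved and replaced bits must not overlap */
    return (pte & keep) | (newprot & ~keep);
}
```

    With these assumed values, 5 | 3 gives 7 in the buggy variant, while
    the masked variant leaves the original CCA untouched.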

    Issue in pte_modify tracked down by Alastair Bridgewater, PAGE_NONE
    and pmd_modify issue found by me.

    The history of this goes back beyond Linus' git history. Chris Dearman's
    commit 351336929ccf222ae38ff0cb7a8dd5fd5c6236a0 ("[MIPS] Allow setting of
    the cache attribute at run time.") missed the opportunity to fix this
    but it was originally introduced in lmo commit
    d523832cf12007b3242e50bb77d0c9e63e0b6518 ("Missing from last commit.")
    and 32cc38229ac7538f2346918a09e75413e8861f87 ("New configuration option
    CONFIG_MIPS_UNCACHED.")

    Signed-off-by: Ralf Baechle
    Reported-by: Alastair Bridgewater

    Ralf Baechle
     

01 Jul, 2016

2 commits

  • Pull KVM fixes from Paolo Bonzini:
    "ARM and x86 fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode.
    KVM: LAPIC: cap __delay at lapic_timer_advance_ns
    KVM: x86: move nsec_to_cycles from x86.c to x86.h
    pvclock: Get rid of __pvclock_read_cycles in function pvclock_read_flags
    pvclock: Cleanup to remove function pvclock_get_nsec_offset
    pvclock: Add CPU barriers to get correct version value
    KVM: arm/arm64: Stop leaking vcpu pid references
    arm64: KVM: fix build with CONFIG_ARM_PMU disabled

    Linus Torvalds
     
  • Pull ARC fix from Vineet Gupta:
    "Reinstate dwarf unwinder/loadable-modules with new gnu tools"

    * tag 'arc-4.7-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
    arc: unwind: warn only once if DW2_UNWIND is disabled
    ARC: unwind: ensure that .debug_frame is generated (vs. .eh_frame)

    Linus Torvalds
     

30 Jun, 2016

3 commits

  • …t/kvmarm/kvmarm into kvm-master

    KVM/ARM Fixes for v4.7-rc6:

    Fixes a build issue without CONFIG_ARM_PMU and plugs pid leak on arm/arm64.

    Paolo Bonzini
     
  • Several cases of overlapping changes, except the packet scheduler
    conflicts which deal with the addition of the free list parameter
    to qdisc_enqueue().

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for
    radix") turned kernel memory and IO addresses from #defined constants to
    variables initialised at runtime.

    On PA6T (pasemi) systems the setup_arch() machine call initialises the
    onboard PCI-e root-ports, and uses pci_io_base to do this, which is now
    before its value has been set, resulting in a panic early in boot before
    console IO is initialised.

    Move the pci_io_base initialisation to the same place as vmalloc ranges
    are set (hash__early_init_mmu()/radix__early_init_mmu()) - this is the
    earliest possible place we can initialise it.

    Fixes: d6a9996e84ac ("powerpc/mm: vmalloc abstraction in preparation for radix")
    Reported-by: Christian Zigotzky
    Signed-off-by: Darren Stevens
    Reviewed-by: Aneesh Kumar K.V
    [mpe: Add #ifdef CONFIG_PCI, massage change log slightly]
    Signed-off-by: Michael Ellerman

    Darren Stevens
     

29 Jun, 2016

1 commit

  • Currently we have 2 segments that are bolted for the kernel linear
    mapping (ie 0xc000... addresses). This is 0 to 1TB and also the kernel
    stacks. Anything accessed outside of these regions may need to be
    faulted in. (In practice machines with TM always have 1T segments)

    If a machine has < 2TB of memory we never fault on the kernel linear
    mapping as these two segments cover all physical memory. If a machine
    has > 2TB of memory, there may be structures outside of these two
    segments that need to be faulted in. This faulting can occur when
    running as a guest as the hypervisor may remove any SLB that's not
    bolted.

    When we treclaim and trecheckpoint we have a window where we need to
    run with the userspace GPRs. This means that we no longer have a valid
    stack pointer in r1. For this window we therefore clear MSR RI to
    indicate that any exceptions taken at this point won't be able to be
    handled. This means that we can't take segment misses in this RI=0
    window.

    In this RI=0 region, we currently access the thread_struct for the
    process being context switched to or from. This thread_struct access
    may cause a segment fault since it's not guaranteed to be covered by
    the two bolted segment entries described above.

    We've seen this with a crash when running as a guest with > 2TB of
    memory on PowerVM:

    Unrecoverable exception 4100 at c00000000004f138
    Oops: Unrecoverable exception, sig: 6 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    CPU: 1280 PID: 7755 Comm: kworker/1280:1 Tainted: G X 4.4.13-46-default #1
    task: c000189001df4210 ti: c000189001d5c000 task.ti: c000189001d5c000
    NIP: c00000000004f138 LR: 0000000010003a24 CTR: 0000000010001b20
    REGS: c000189001d5f730 TRAP: 4100 Tainted: G X (4.4.13-46-default)
    MSR: 8000000100001031 CR: 24000048 XER: 00000000
    CFAR: c00000000004ed18 SOFTE: 0
    GPR00: ffffffffc58d7b60 c000189001d5f9b0 00000000100d7d00 000000003a738288
    GPR04: 0000000000002781 0000000000000006 0000000000000000 c0000d1f4d889620
    GPR08: 000000000000c350 00000000000008ab 00000000000008ab 00000000100d7af0
    GPR12: 00000000100d7ae8 00003ffe787e67a0 0000000000000000 0000000000000211
    GPR16: 0000000010001b20 0000000000000000 0000000000800000 00003ffe787df110
    GPR20: 0000000000000001 00000000100d1e10 0000000000000000 00003ffe787df050
    GPR24: 0000000000000003 0000000000010000 0000000000000000 00003fffe79e2e30
    GPR28: 00003fffe79e2e68 00000000003d0f00 00003ffe787e67a0 00003ffe787de680
    NIP [c00000000004f138] restore_gprs+0xd0/0x16c
    LR [0000000010003a24] 0x10003a24
    Call Trace:
    [c000189001d5f9b0] [c000189001d5f9f0] 0xc000189001d5f9f0 (unreliable)
    [c000189001d5fb90] [c00000000001583c] tm_recheckpoint+0x6c/0xa0
    [c000189001d5fbd0] [c000000000015c40] __switch_to+0x2c0/0x350
    [c000189001d5fc30] [c0000000007e647c] __schedule+0x32c/0x9c0
    [c000189001d5fcb0] [c0000000007e6b58] schedule+0x48/0xc0
    [c000189001d5fce0] [c0000000000deabc] worker_thread+0x22c/0x5b0
    [c000189001d5fd80] [c0000000000e7000] kthread+0x110/0x130
    [c000189001d5fe30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
    Instruction dump:
    7cb103a6 7cc0e3a6 7ca222a6 78a58402 38c00800 7cc62838 08860000 7cc000a6
    38a00006 78c60022 7cc62838 0b060000 7ccff120 e8270078 e8a70098
    ---[ end trace 602126d0a1dedd54 ]---

    This fixes this by copying the required data from the thread_struct to
    the stack before we clear MSR RI. Then once we clear RI, we only access
    the stack, guaranteeing there's no segment miss.

    We also tighten the region over which we set RI=0 on the treclaim()
    path. This may have a slight performance impact since we're adding an
    mtmsr instruction.

    Fixes: 090b9284d725 ("powerpc/tm: Clear MSR RI in non-recoverable TM code")
    Signed-off-by: Michael Neuling
    Reviewed-by: Cyril Bur
    Signed-off-by: Michael Ellerman

    Michael Neuling
     

28 Jun, 2016

6 commits

  • Add "ti,cpsw-mdio" for am335x/am437x/dra7 SoCs where MDIO is
    implemented as part of TI CPSW and, this way, enable PM runtime auto
    suspend for Davinci MDIO driver on these platforms.

    Signed-off-by: Grygorii Strashko
    Signed-off-by: David S. Miller

    Grygorii Strashko
     
  • When calling eeh_rmv_device() in eeh_reset_device() for partial hotplug
    case, @rmv_data instead of its address is the proper argument.
    Otherwise, the stack frame is corrupted when writing to
    @rmv_data (actually its address) in eeh_rmv_device(). It results in
    kernel crash as observed.

    This fixes the issue by passing @rmv_data, not its address to
    eeh_rmv_device() in eeh_reset_device().
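
    The class of bug can be shown with a small stand-alone sketch. The
    struct and both functions below are hypothetical stand-ins, not the
    kernel's definitions; the point is only the difference between passing
    a pointer and passing the address of the variable holding it through a
    void * parameter.

```c
#include <assert.h>
#include <stddef.h>

struct rmv_data {            /* hypothetical stand-in */
    int removed;
};

/* Callback in the style of a device walker: userdata is an opaque pointer. */
static void *rmv_device(void *dev, void *userdata)
{
    struct rmv_data *rmv = userdata;

    assert(rmv != NULL);
    (void)dev;
    rmv->removed++;
    return NULL;
}

static int reset_device(struct rmv_data *rmv_data)
{
    int dummy_dev = 0;

    /*
     * Correct: pass the pointer itself. Passing &rmv_data would hand the
     * callback the address of this local pointer variable, and the
     * increment would then write into reset_device()'s own stack frame.
     * The compiler cannot catch this because the parameter is void *.
     */
    rmv_device(&dummy_dev, rmv_data);
    return rmv_data->removed;
}
```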

    Fixes: 67086e32b564 ("powerpc/eeh: powerpc/eeh: Support error recovery for VF PE")
    Reported-by: Pridhiviraj Paidipeddi
    Signed-off-by: Gavin Shan
    Signed-off-by: Michael Ellerman

    Gavin Shan
     
  • The test_fp_ctl function is used to test if a given value is a valid
    floating-point control. The inline assembly in test_fp_ctl uses an
    incorrect constraint for the 'orig_fpc' variable. If the compiler
    chooses the same register for 'fpc' and 'orig_fpc' the test_fp_ctl()
    function always returns true. This allows user space to trigger
    kernel oopses with invalid floating-point control values on the
    signal stack.

    This problem has been introduced with git commit 4725c86055f5bbdcdf
    "s390: fix save and restore of the floating-point-control register"

    Cc: stable@vger.kernel.org # v3.13+
    Reviewed-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • This reverts commit 852ffd0f4e23248b47531058e531066a988434b5.

    There are use cases where an intermediate boot kernel (1) uses kexec
    to boot the final production kernel (2). For this scenario we should
    provide the original boot information to the production kernel (2).
    Therefore clearing the boot information during kexec() should not
    be done.

    Cc: stable@vger.kernel.org # v3.17+
    Reported-by: Steffen Maier
    Signed-off-by: Michael Holzheu
    Reviewed-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Michael Holzheu
     
  • If CONFIG_ARC_DW2_UNWIND is disabled, every time arc_unwind_core()
    gets called the following message gets printed in debug console:
    ----------------->8---------------
    CONFIG_ARC_DW2_UNWIND needs to be enabled
    ----------------->8---------------

    That message makes sense if the user indeed wants to see a backtrace or
    get nice function call-graphs in perf, but what if the user disabled
    the unwinder on purpose? Why pollute his debug console?

    So instead we'll warn user about possibly missing feature once and
    let him decide if that was what he or she really wanted.
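
    A warn-once pattern of this kind can be sketched in plain C as follows.
    The function name and the counter are hypothetical and exist only so
    the behaviour can be observed; the kernel has its own helpers for this.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how often the message was actually emitted (for illustration). */
static int messages_printed;

/* Hypothetical stub standing in for an unwinder built without DW2 support. */
static void unwind_core_stub(void)
{
    static bool warned;      /* keeps its value across calls */

    if (!warned) {
        warned = true;
        fputs("CONFIG_ARC_DW2_UNWIND needs to be enabled\n", stderr);
        messages_printed++;
    }
}
```

    However often the stub is called, the message is emitted on the first
    call only.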

    Signed-off-by: Alexey Brodkin
    Cc: stable@vger.kernel.org
    Signed-off-by: Vineet Gupta

    Alexey Brodkin
     
  • With recent binutils update to support dwarf CFI pseudo-ops in gas, we
    now get .eh_frame vs. .debug_frame. Although the call frame info is
    exactly the same in both, the CIE differs, which the current kernel
    unwinder can't cope with.

    This broke both the kernel unwinder as well as loadable modules (latter
    because of a new unhandled relo R_ARC_32_PCREL from .rela.eh_frame in
    the module loader)

    The ideal solution would be to switch unwinder to .eh_frame.
    For now however we can make do by just ensuring .debug_frame is
    generated by removing -fasynchronous-unwind-tables

    .eh_frame generated with -gdwarf-2 -fasynchronous-unwind-tables
    .debug_frame generated with -gdwarf-2

    Fixes STAR 9001058196

    Cc: stable@vger.kernel.org
    Signed-off-by: Vineet Gupta

    Vineet Gupta
     

27 Jun, 2016

8 commits

  • I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was
    getting a GP(0) on a rather unspecial vmread from Xen:

    (XEN) ----[ Xen-4.7.0-rc x86_64 debug=n Not tainted ]----
    (XEN) CPU: 1
    (XEN) RIP: e008:[] vmx_get_segment_register+0x14e/0x450
    (XEN) RFLAGS: 0000000000010202 CONTEXT: hypervisor (d1v0)
    (XEN) rax: ffff82d0801e6288 rbx: ffff83003ffbfb7c rcx: fffffffffffab928
    (XEN) rdx: 0000000000000000 rsi: 0000000000000000 rdi: ffff83000bdd0000
    (XEN) rbp: ffff83000bdd0000 rsp: ffff83003ffbfab0 r8: ffff830038813910
    (XEN) r9: ffff83003faf3958 r10: 0000000a3b9f7640 r11: ffff83003f82d418
    (XEN) r12: 0000000000000000 r13: ffff83003ffbffff r14: 0000000000004802
    (XEN) r15: 0000000000000008 cr0: 0000000080050033 cr4: 00000000001526e0
    (XEN) cr3: 000000003fc79000 cr2: 0000000000000000
    (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
    (XEN) Xen code around (vmx_get_segment_register+0x14e/0x450):
    (XEN) 00 00 41 be 02 48 00 00 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
    (XEN) Xen stack trace from rsp=ffff83003ffbfab0:

    ...

    (XEN) Xen call trace:
    (XEN) [] vmx_get_segment_register+0x14e/0x450
    (XEN) [] get_page_from_gfn_p2m+0x165/0x300
    (XEN) [] hvmemul_get_seg_reg+0x52/0x60
    (XEN) [] hvm_emulate_prepare+0x53/0x70
    (XEN) [] handle_mmio+0x2b/0xd0
    (XEN) [] emulate.c#_hvm_emulate_one+0x111/0x2c0
    (XEN) [] handle_hvm_io_completion+0x274/0x2a0
    (XEN) [] __get_gfn_type_access+0xfa/0x270
    (XEN) [] timer.c#add_entry+0x4b/0xb0
    (XEN) [] timer.c#remove_entry+0x7c/0x90
    (XEN) [] hvm_do_resume+0x23/0x140
    (XEN) [] vmx_do_resume+0xa7/0x140
    (XEN) [] context_switch+0x13b/0xe40
    (XEN) [] schedule.c#schedule+0x22e/0x570
    (XEN) [] softirq.c#__do_softirq+0x5c/0x90
    (XEN) [] domain.c#idle_loop+0x25/0x50
    (XEN)
    (XEN)
    (XEN) ****************************************
    (XEN) Panic on CPU 1:
    (XEN) GENERAL PROTECTION FAULT
    (XEN) [error_code=0000]
    (XEN) ****************************************

    Tracing my host KVM showed it was the one injecting the GP(0) when
    emulating the VMREAD and checking the destination segment permissions in
    get_vmx_mem_address():

    3) | vmx_handle_exit() {
    3) | handle_vmread() {
    3) | nested_vmx_check_permission() {
    3) | vmx_get_segment() {
    3) 0.074 us | vmx_read_guest_seg_base();
    3) 0.065 us | vmx_read_guest_seg_selector();
    3) 0.066 us | vmx_read_guest_seg_ar();
    3) 1.636 us | }
    3) 0.058 us | vmx_get_rflags();
    3) 0.062 us | vmx_read_guest_seg_ar();
    3) 3.469 us | }
    3) | vmx_get_cs_db_l_bits() {
    3) 0.058 us | vmx_read_guest_seg_ar();
    3) 0.662 us | }
    3) | get_vmx_mem_address() {
    3) 0.068 us | vmx_cache_reg();
    3) | vmx_get_segment() {
    3) 0.074 us | vmx_read_guest_seg_base();
    3) 0.068 us | vmx_read_guest_seg_selector();
    3) 0.071 us | vmx_read_guest_seg_ar();
    3) 1.756 us | }
    3) | kvm_queue_exception_e() {
    3) 0.066 us | kvm_multiple_exception();
    3) 0.684 us | }
    3) 4.085 us | }
    3) 9.833 us | }
    3) + 10.366 us | }

    Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
    Developer Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
    Control Structure", I found that we're enforcing that the destination
    operand is NOT located in a read-only data segment or any code segment when
    the L1 is in long mode - BUT that check should only happen when it is in
    protected mode.
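
    The rule described above can be condensed into a small predicate. This
    is a simplified sketch, not the KVM code: the enum and the function
    name are hypothetical, and limit and canonical-address checks are left
    out on purpose.

```c
#include <stdbool.h>

/* Hypothetical, simplified classification of the operand's segment. */
enum seg_kind { SEG_DATA_WRITABLE, SEG_DATA_READ_ONLY, SEG_CODE };

/*
 * Returns true when emulating a write to the memory operand should
 * raise #GP(0) because of the segment type alone.
 */
static bool dest_segment_type_faults(bool l1_in_long_mode, enum seg_kind kind)
{
    if (l1_in_long_mode)
        return false;    /* segment type checks apply in protected mode only */

    return kind == SEG_DATA_READ_ONLY || kind == SEG_CODE;
}
```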

    Shuffling the code a bit to make our emulation follow the specification
    allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
    without problems.

    Fixes: f9eb4af67c9d ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
    Signed-off-by: Quentin Casasnovas
    Cc: Eugene Korenevsky
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: linux-stable
    Signed-off-by: Paolo Bonzini

    Quentin Casasnovas
     
  • The host timer which emulates the guest LAPIC TSC deadline
    timer has its expiration diminished by lapic_timer_advance_ns
    nanoseconds. Therefore if, at wait_lapic_expire, a difference
    larger than lapic_timer_advance_ns is encountered, delay at most
    lapic_timer_advance_ns.

    This fixes a problem where the guest can cause the host
    to delay for large amounts of time.
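
    The capping logic amounts to taking a minimum. The helper below is a
    hypothetical sketch that works in nanoseconds for readability; the
    function and parameter names are assumptions, and the real code may
    well operate on TSC cycles instead.

```c
#include <stdint.h>

/* Hypothetical helper: how long to busy-wait before the timer is due. */
static uint64_t capped_wait_ns(uint64_t deadline_ns, uint64_t now_ns,
                               uint64_t advance_ns)
{
    uint64_t remaining;

    if (now_ns >= deadline_ns)
        return 0;                        /* deadline already passed */

    remaining = deadline_ns - now_ns;

    /* Never wait longer than the amount the host timer was advanced by. */
    return remaining < advance_ns ? remaining : advance_ns;
}
```

    A guest-programmed deadline far in the future therefore costs the host
    at most advance_ns of busy-waiting.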

    Reported-by: Alan Jenkins
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Paolo Bonzini

    Marcelo Tosatti
     
  • Move the inline function nsec_to_cycles from x86.c to x86.h, as
    the next patch uses it from lapic.c.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Paolo Bonzini

    Marcelo Tosatti
     
  • There is a generic function __pvclock_read_cycles to be used to get both
    flags and cycles. For function pvclock_read_flags, it's useless to get
    the cycles value. To make this function more efficient, read the flags
    directly in the function.

    Signed-off-by: Minfei Huang
    Signed-off-by: Paolo Bonzini

    Minfei Huang
     
  • Function __pvclock_read_cycles is short enough, so there is no need to
    have another function pvclock_get_nsec_offset to calculate tsc delta.
    It's better to combine it into function __pvclock_read_cycles.

    Remove useless variables in function __pvclock_read_cycles.

    Signed-off-by: Minfei Huang
    Signed-off-by: Paolo Bonzini

    Minfei Huang
     
  • Protocol for the "version" fields is: hypervisor raises it (making it
    uneven) before it starts updating the fields and raises it again (making
    it even) when it is done. Thus the guest can make sure the time values
    it got are consistent by checking the version before and after reading
    them.

    Add CPU barriers after getting the version value just like function
    vread_pvclock does, because all of the callees in this function are inline.
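
    The reader side of such a version protocol can be sketched with C11
    atomics. Everything here is an assumption made for illustration: the
    struct layout, the field names and the use of C11 fences in place of
    the kernel's own barrier primitives.

```c
#include <stdatomic.h>
#include <stdint.h>

struct time_info {                   /* hypothetical stand-in for the shared area */
    _Atomic uint32_t version;
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time;
};

struct snapshot {
    uint64_t tsc_timestamp;
    uint64_t system_time;
};

/* Retry until the version is even and unchanged across the data reads. */
static struct snapshot read_time(struct time_info *ti)
{
    struct snapshot s;
    uint32_t v;

    do {
        v = atomic_load_explicit(&ti->version, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);   /* version read before data */
        s.tsc_timestamp = atomic_load_explicit(&ti->tsc_timestamp,
                                               memory_order_relaxed);
        s.system_time = atomic_load_explicit(&ti->system_time,
                                             memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);   /* data read before re-check */
    } while ((v & 1) ||
             v != atomic_load_explicit(&ti->version, memory_order_relaxed));

    return s;
}
```

    Without the fences, the data loads could be reordered around the
    version loads, and a torn update might pass the version check.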

    Fixes: 502dfeff239e8313bfbe906ca0a1a6827ac8481b
    Cc: stable@vger.kernel.org
    Signed-off-by: Minfei Huang
    Signed-off-by: Paolo Bonzini

    Minfei Huang
     
  • kvm provides kvm_vcpu_uninit(), which amongst other things, releases the
    last reference to the struct pid of the task that was last running the vcpu.

    On arm64 built with CONFIG_DEBUG_KMEMLEAK, starting a guest with kvmtool,
    then killing it with SIGKILL results (after some considerable time) in:
    > cat /sys/kernel/debug/kmemleak
    > unreferenced object 0xffff80007d5ea080 (size 128):
    > comm "lkvm", pid 2025, jiffies 4294942645 (age 1107.776s)
    > hex dump (first 32 bytes):
    > 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    > backtrace:
    > [] create_object+0xfc/0x278
    > [] kmemleak_alloc+0x34/0x70
    > [] kmem_cache_alloc+0x16c/0x1d8
    > [] alloc_pid+0x34/0x4d0
    > [] copy_process.isra.6+0x79c/0x1338
    > [] _do_fork+0x74/0x320
    > [] SyS_clone+0x18/0x20
    > [] el0_svc_naked+0x24/0x28
    > [] 0xffffffffffffffff

    On x86 kvm_vcpu_uninit() is called on the path from kvm_arch_destroy_vm(),
    on arm no equivalent call is made. Add the call to kvm_arch_vcpu_free().

    Signed-off-by: James Morse
    Fixes: 749cf76c5a36 ("KVM: ARM: Initial skeleton to compile KVM support")
    Cc: # 3.10+
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    James Morse
     
  • Userspace can quite legitimately perform an exec() syscall with a
    suspended transaction. exec() does not return to the old process, rather
    it loads a new one and starts that, the expectation therefore is that the
    new process starts not in a transaction. Currently exec() is not treated
    any differently to any other syscall which creates problems.

    Firstly it could allow a new process to start with a suspended
    transaction for a binary that no longer exists. This means that the
    checkpointed state won't be valid and if the suspended transaction were
    ever to be resumed and subsequently aborted (a possibility which is
    exceedingly likely as exec()ing will likely doom the transaction) the
    new process will jump to invalid state.

    Secondly the incorrect attempt to keep the transactional state while
    still zeroing state for the new process creates at least two TM Bad
    Things. The first triggers on the rfid to return to userspace as
    start_thread() has given the new process a 'clean' MSR but the suspend
    will still be set in the hardware MSR. The second TM Bad Thing triggers
    in __switch_to() as the processor is still transactionally suspended but
    __switch_to() wants to zero the TM sprs for the new process.

    This is an example of the outcome of calling exec() with a suspended
    transaction. Note the first 700 is likely the first TM bad thing
    described earlier, only the kernel can't report it as we've loaded
    userspace registers. c000000000009980 is the rfid in
    fast_exception_return()

    Bad kernel stack pointer 3fffcfa1a370 at c000000000009980
    Oops: Bad kernel stack pointer, sig: 6 [#1]
    CPU: 0 PID: 2006 Comm: tm-execed Not tainted
    NIP: c000000000009980 LR: 0000000000000000 CTR: 0000000000000000
    REGS: c00000003ffefd40 TRAP: 0700 Not tainted
    MSR: 8000000300201031 CR: 00000000 XER: 00000000
    CFAR: c0000000000098b4 SOFTE: 0
    PACATMSCRATCH: b00000010000d033
    GPR00: 0000000000000000 00003fffcfa1a370 0000000000000000 0000000000000000
    GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR12: 00003fff966611c0 0000000000000000 0000000000000000 0000000000000000
    NIP [c000000000009980] fast_exception_return+0xb0/0xb8
    LR [0000000000000000] (null)
    Call Trace:
    Instruction dump:
    f84d0278 e9a100d8 7c7b03a6 e84101a0 7c4ff120 e8410170 7c5a03a6 e8010070
    e8410080 e8610088 e8810090 e8210078 48000000 e8610178 88ed023b

    Kernel BUG at c000000000043e80 [verbose debug info unavailable]
    Unexpected TM Bad Thing exception at c000000000043e80 (msr 0x201033)
    Oops: Unrecoverable exception, sig: 6 [#2]
    CPU: 0 PID: 2006 Comm: tm-execed Tainted: G D
    task: c0000000fbea6d80 ti: c00000003ffec000 task.ti: c0000000fb7ec000
    NIP: c000000000043e80 LR: c000000000015a24 CTR: 0000000000000000
    REGS: c00000003ffef7e0 TRAP: 0700 Tainted: G D
    MSR: 8000000300201033 CR: 28002828 XER: 00000000
    CFAR: c000000000015a20 SOFTE: 0
    PACATMSCRATCH: b00000010000d033
    GPR00: 0000000000000000 c00000003ffefa60 c000000000db5500 c0000000fbead000
    GPR04: 8000000300001033 2222222222222222 2222222222222222 00000000ff160000
    GPR08: 0000000000000000 800000010000d033 c0000000fb7e3ea0 c00000000fe00004
    GPR12: 0000000000002200 c00000000fe00000 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 0000000000000000 0000000000000000 c0000000fbea7410 00000000ff160000
    GPR24: c0000000ffe1f600 c0000000fbea8700 c0000000fbea8700 c0000000fbead000
    GPR28: c000000000e20198 c0000000fbea6d80 c0000000fbeab680 c0000000fbea6d80
    NIP [c000000000043e80] tm_restore_sprs+0xc/0x1c
    LR [c000000000015a24] __switch_to+0x1f4/0x420
    Call Trace:
    Instruction dump:
    7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8
    4e800020 e80304a8 7c0023a6 e80304b0 e80304b8 7c0123a6 4e800020

    This fixes CVE-2016-5828.

    Fixes: bc2a9408fa65 ("powerpc: Hook in new transactional memory code")
    Cc: stable@vger.kernel.org # v3.9+
    Signed-off-by: Cyril Bur
    Signed-off-by: Michael Ellerman

    Cyril Bur
     

25 Jun, 2016

16 commits

  • Pull x86 kprobe fix from Thomas Gleixner:
    "A single fix clearing the TF bit when a fault is single stepped"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    kprobes/x86: Clear TF bit in fault on single-stepping

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:
    "mm/radix (Aneesh Kumar K.V):
    - Update to tlb functions ric argument
    - Flush page walk cache when freeing page table
    - Update Radix tree size as per ISA 3.0

    mm/hash (Aneesh Kumar K.V):
    - Use the correct PPP mask when updating HPTE
    - Don't add memory coherence if cache inhibited is set

    eeh (Gavin Shan):
    - Fix invalid cached PE primary bus

    bpf/jit (Naveen N. Rao):
    - Disable classic BPF JIT on ppc64le

    .. and fix faults caused by radix patching of SLB miss handler
    (Michael Ellerman)"

    * tag 'powerpc-4.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/bpf/jit: Disable classic BPF JIT on ppc64le
    powerpc: Fix faults caused by radix patching of SLB miss handler
    powerpc/eeh: Fix invalid cached PE primary bus
    powerpc/mm/radix: Update Radix tree size as per ISA 3.0
    powerpc/mm/hash: Don't add memory coherence if cache inhibited is set
    powerpc/mm/hash: Use the correct PPP mask when updating HPTE
    powerpc/mm/radix: Flush page walk cache when freeing page table
    powerpc/mm/radix: Update to tlb functions ric argument

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "Two weeks worth of fixes here"

    * emailed patches from Andrew Morton : (41 commits)
    init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
    autofs: don't get stuck in a loop if vfs_write() returns an error
    mm/page_owner: avoid null pointer dereference
    tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
    fs/nilfs2: fix potential underflow in call to crc32_le
    oom, suspend: fix oom_reaper vs. oom_killer_disable race
    ocfs2: disable BUG assertions in reading blocks
    mm, compaction: abort free scanner if split fails
    mm: prevent KASAN false positives in kmemleak
    mm/hugetlb: clear compound_mapcount when freeing gigantic pages
    mm/swap.c: flush lru pvecs on compound page arrival
    memcg: css_alloc should return an ERR_PTR value on error
    memcg: mem_cgroup_migrate() may be called with irq disabled
    hugetlb: fix nr_pmds accounting with shared page tables
    Revert "mm: disable fault around on emulated access bit architecture"
    Revert "mm: make faultaround produce old ptes"
    mailmap: add Boris Brezillon's email
    mailmap: add Antoine Tenart's email
    mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
    mm: mempool: kasan: don't poot mempool objects in quarantine
    ...

    Linus Torvalds
     
  • Pull xen bug fixes from David Vrabel:

    - fix x86 PV dom0 crash during early boot on some hardware

    - fix two pciback bugs affecting certain devices

    - fix potential overflow when clearing page tables in x86 PV

    * tag 'for-linus-4.7b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen-pciback: return proper values during BAR sizing
    x86/xen: avoid m2p lookup when setting early page table entries
    xen/pciback: Fix conf_space read/write overlap check.
    x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
    xen/balloon: Fix declared-but-not-defined warning

    Linus Torvalds
     
  • Pull arm64 fixes from Will Deacon:
    "Here are a few more arm64 fixes, but things do finally appear to be
    slowing down. The main fix is avoiding hibernation in a previously
    unanticipated situation where we have CPUs parked in the kernel, but
    it's all good stuff.

    - Fix icache/dcache sync for anonymous pages under migration
    - Correct the ASID limit check
    - Fix parallel builds of Image and Image.gz
    - Refuse to hibernate when we have CPUs that we can't offline"

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: hibernate: Don't hibernate on systems with stuck CPUs
    arm64: smp: Add function to determine if cpus are stuck in the kernel
    arm64: mm: remove page_mapping check in __sync_icache_dcache
    arm64: fix boot image dependencies to not generate invalid images
    arm64: update ASID limit

    Linus Torvalds
     
  • __GFP_REPEAT has a rather weak semantic but since it has been introduced
    around 2.6.12 it has been ignored for low order allocations.

    PGALLOC_GFP uses __GFP_REPEAT but it is only used in pte_alloc_one and
    pte_alloc_one_kernel, which do order-0 requests. This means that this
    flag has never been actually useful here because it has always been used
    only for PAGE_ALLOC_COSTLY requests.
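
    The reasoning can be written down as a one-line predicate. The helper
    name is hypothetical; PAGE_ALLOC_COSTLY_ORDER is 3 in the kernel, and
    the sketch assumes, as the text says, that the flag is only honoured
    for requests above that order.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3    /* value used by the kernel */

/* Hypothetical helper: does __GFP_REPEAT change anything for this order? */
static bool gfp_repeat_matters(unsigned int order)
{
    return order > PAGE_ALLOC_COSTLY_ORDER;
}
```

    An order-0 page table allocation falls below the threshold, so dropping
    the flag there does not change behaviour.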

    Link: http://lkml.kernel.org/r/1464599699-30131-17-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    pgtable_alloc_one uses the __GFP_REPEAT flag for L2_USER_PGTABLE_ORDER,
    but the order is either 0 or 3 if L2_KERNEL_PGTABLE_SHIFT for HPAGE_SHIFT.
    This means the flag has never actually been useful here, because it has
    only ever been honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-16-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Chris Metcalf [for tile]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    PGALLOC_GFP uses __GFP_REPEAT, but {pgd,pmd}_alloc allocate from
    {pgd,pmd}_cache, and both caches allocate objects of at most PAGE_SIZE.
    This means the flag has never actually been useful here, because it has
    only ever been honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-15-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    page_table_alloc uses the flag for a single-page allocation. This means
    the flag has never actually been useful here, because it has only ever
    been honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-14-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    {pud,pmd}_alloc_one use __GFP_REPEAT, but they always allocate from
    pgtable_cache, which is initialized to PAGE_SIZE objects. This means
    the flag has never actually been useful here, because it has only ever
    been honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-13-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    {pud,pmd}_alloc_one allocate from {PGT,PUD}_CACHE, initialized in
    pgtable_cache_init, whose objects are no larger than
    sizeof(void *) << 12; that fits into a !costly allocation request.

    PGALLOC_GFP is used only in radix__pgd_alloc, which makes either order-0
    or order-4 requests. The first one doesn't need the flag while the
    second does. Drop __GFP_REPEAT from PGALLOC_GFP and add it for the
    order-4 one.

    This means the flag has never actually been useful here, because it has
    only ever been honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-12-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
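
To make the size argument above concrete, here is a minimal sketch of the arithmetic, assuming 8-byte pointers: sizeof(void *) << 12 is then 32 KiB, which is an order-3 request with 4 KiB pages and an order-0 request with 64 KiB pages. Both stay at or below PAGE_ALLOC_COSTLY_ORDER (3). The function order_for_size is a hypothetical userspace stand-in for the kernel's get_order(), taking the page shift as an explicit parameter.

```c
#include <assert.h>

/*
 * Hypothetical userspace stand-in for get_order(): the smallest order
 * such that (page size << order) is at least `size` bytes.
 * `page_shift` is log2 of the page size (12 for 4 KiB, 16 for 64 KiB).
 */
static unsigned int order_for_size(unsigned long size, unsigned int page_shift)
{
    unsigned int order = 0;

    assert(size > 0);
    while ((1UL << (page_shift + order)) < size)
        order++;
    return order;
}
```

For example, order_for_size(8UL << 12, 12) gives the order of a 32 KiB object on 4 KiB pages, which is the largest case the commit message describes for these caches.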
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    pte_alloc_one{_kernel} allocate with PTE_ORDER, which is 0. This means
    the flag has never actually been useful here, because it has only ever
    been honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-11-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Chen Liqin
    Cc: Lennox Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    pmd_alloc_one allocates with PMD_ORDER, which is 1. This means the flag
    has never actually been useful here, because it has only ever been
    honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    pte_alloc_one{_kernel} allocate with PTE_ORDER, which is 0. This means
    the flag has never actually been useful here, because it has only ever
    been honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Ley Foon Tan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    pte_alloc_one{_kernel} and pmd_alloc_one allocate with PTE_ORDER and
    PMD_ORDER respectively, and neither is larger than 1. This means the
    flag has never actually been useful here, because it has only ever been
    honored for PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: John Crispin
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __GFP_REPEAT has rather weak semantics, and since its introduction
    around 2.6.12 it has been ignored for low-order allocations.

    pte_alloc_one_kernel uses __get_order_pte, but this is always zero
    because BITS_FOR_PTE is not larger than 9 while the page size is
    always larger than 4K. This means the flag has never actually been
    useful here, because it has only ever been honored for
    PAGE_ALLOC_COSTLY requests.

    Link: http://lkml.kernel.org/r/1464599699-30131-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko