23 Jan, 2017

1 commit


21 Jan, 2017

1 commit

  • Pull KVM fixes from Radim Krčmář:
    "ARM:
    - Fix for timer setup on VHE machines
    - Drop spurious warning when the timer races against the vcpu running
    again
    - Prevent a vgic deadlock when the initialization fails (for stable)

    s390:
    - Fix a kernel memory exposure (for stable)

    x86:
    - Fix exception injection when hypercall instruction cannot be
    patched"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: s390: do not expose random data via facility bitmap
    KVM: x86: fix fixing of hypercalls
    KVM: arm/arm64: vgic: Fix deadlock on error handling
    KVM: arm64: Access CNTHCTL_EL2 bit fields correctly on VHE systems
    KVM: arm/arm64: Fix occasional warning from the timer work function

    Linus Torvalds
     

20 Jan, 2017

1 commit

  • Pull PCI fixes from Bjorn Helgaas:

    - recognize that a PCI-to-PCIe bridge originates a PCIe hierarchy, so
    we enumerate that hierarchy correctly

    - X-Gene: fix a change merged for v4.10 that broke MSI

    - Keystone: avoid reading undefined registers, which can cause
    asynchronous external aborts

    - Supermicro X8DTH-i/6/iF/6F: ignore broken _CRS that caused us to
    change (and break) existing I/O port assignments

    * tag 'pci-v4.10-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
    PCI/MSI: pci-xgene-msi: Fix CPU hotplug registration handling
    PCI: Enumerate switches below PCI-to-PCIe bridges
    x86/PCI: Ignore _CRS on Supermicro X8DTH-i/6/iF/6F
    PCI: designware: Check for iATU unroll only on platforms that use ATU

    Linus Torvalds
     

19 Jan, 2017

1 commit

  • Pull SMP hotplug update from Thomas Gleixner:
    "This contains a trivial typo fix and an extension to the core code for
    dynamically allocating states in the prepare stage.

    The extension is necessary right now because we need a proper way to
    unbreak LTTNG, which is currently non functional due to the removal of
    the notifiers. Surely it's out of tree, but it's widely used by
    distros.

    The simple solution would have been to reserve a state for LTTNG, but
    I'm not fond about unused crap in the kernel and the dynamic range,
    which we admittedly should have done right away, allows us to remove
    quite some of the hardcoded states, i.e. those which have no ordering
    requirements. So doing the right thing now is better than having a
    smaller intermediate solution which needs to be reworked anyway"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Provide dynamic range for prepare stage
    perf/x86/amd/ibs: Fix typo after cleanup state names in cpu/hotplug

    Linus Torvalds
     

18 Jan, 2017

1 commit

  • commit d32932d02e18 removed the irq_retrigger callback from the IO-APIC
    chip and did not add it to the new IO-APIC-IR irq chip.

    Unfortunately the software resend fallback is not enabled on X86, so edge
    interrupts which are received during the lazy disabled state of the
    interrupt line are not retriggered and therefore lost.

    Restore the callbacks.

    [ tglx: Massaged changelog ]

    Fixes: d32932d02e18 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
    Signed-off-by: Ruslan Ruslichenko
    Cc: xe-linux-external@cisco.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1484662432-13580-1-git-send-email-rruslich@cisco.com
    Signed-off-by: Thomas Gleixner

    Ruslan Ruslichenko
     

17 Jan, 2017

2 commits

  • emulator_fix_hypercall() replaces hypercall with vmcall instruction,
    but it does not handle a GP exception properly when it writes the new instruction.
    It can return X86EMUL_PROPAGATE_FAULT without setting exception information.
    This leads to incorrect emulation and triggers
    WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn()
    as discovered by syzkaller fuzzer:

    WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558
    Call Trace:
    warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
    x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572
    x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618
    emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline]
    handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762
    vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625
    vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline]
    vcpu_run arch/x86/kvm/x86.c:6947 [inline]

    Set exception information when write in emulator_fix_hypercall() fails.

    Signed-off-by: Dmitry Vyukov
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: kvm@vger.kernel.org
    Cc: syzkaller@googlegroups.com
    Signed-off-by: Radim Krčmář

    Dmitry Vyukov
     
  • The CPU hotplug function intel_pmu_cpu_starting() sets
    cpu_hw_events.excl_thread_id unconditionally to 1 when the shared exclusive
    counters data structure is already available for the sibling thread.

    This works during the boot process because the first sibling gets threadid
    0 assigned and the second sibling which shares the data structure gets 1.

    But when the first thread of the core is offlined and onlined again it
    shares the data structure with the second thread and gets exclusive thread
    id 1 assigned as well.

    Prevent this by checking the threadid of the already online thread.
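
    The rule described above can be sketched as follows. This is a minimal
    userspace model, not the kernel code: the helper name
    pick_excl_thread_id() is hypothetical, and it assumes exactly two
    sibling threads per core with ids 0 and 1.

```c
#include <assert.h>

/*
 * Hypothetical helper: choose the exclusive thread id for a CPU that is
 * coming online, given the id already held by its online sibling.
 * Assigning 1 unconditionally would collide when the sibling already
 * holds 1 (the offline/online case from the changelog).
 */
int pick_excl_thread_id(int online_sibling_id)
{
    assert(online_sibling_id == 0 || online_sibling_id == 1);
    return online_sibling_id == 0 ? 1 : 0;
}
```

    With this rule, a first thread that is offlined and onlined again gets
    id 0 back, because its sibling still holds id 1.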

    [ tglx: Rewrote changelog ]

    Signed-off-by: Zhou Chengming
    Cc: NuoHan Qiao
    Cc: ak@linux.intel.com
    Cc: peterz@infradead.org
    Cc: kan.liang@intel.com
    Cc: dave.hansen@linux.intel.com
    Cc: eranian@google.com
    Cc: qiaonuohan@huawei.com
    Cc: davidcc@google.com
    Cc: guohanjun@huawei.com
    Link: http://lkml.kernel.org/r/1484536871-3131-1-git-send-email-zhouchengming1@huawei.com
    Signed-off-by: Thomas Gleixner
    ---
    arch/x86/events/intel/core.c | 7 +++++--
    1 file changed, 5 insertions(+), 2 deletions(-)

    Zhou Chengming
     

16 Jan, 2017

3 commits

  • Pull x86 fixes from Ingo Molnar:
    "Misc fixes:

    - unwinder fixes
    - AMD CPU topology enumeration fixes
    - microcode loader fixes
    - x86 embedded platform fixes
    - fix for a bootup crash that may trigger when clearcpuid= is used
    with invalid values"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mpx: Use compatible types in comparison to fix sparse error
    x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()
    x86/entry: Fix the end of the stack for newly forked tasks
    x86/unwind: Include __schedule() in stack traces
    x86/unwind: Disable KASAN checks for non-current tasks
    x86/unwind: Silence warnings for non-current tasks
    x86/microcode/intel: Use correct buffer size for saving microcode data
    x86/microcode/intel: Fix allocation size of struct ucode_patch
    x86/microcode/intel: Add a helper which gives the microcode revision
    x86/microcode: Use native CPUID to tickle out microcode revision
    x86/CPU: Add native CPUID variants returning a single datum
    x86/boot: Add missing declaration of string functions
    x86/CPU/AMD: Fix Bulldozer topology
    x86/platform/intel-mid: Rename 'spidev' to 'mrfld_spidev'
    x86/cpu: Fix typo in the comment for Anniedale
    x86/cpu: Fix bootup crashes by sanitizing the argument of the 'clearcpuid=' command-line option

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU
    driver fixes, plus miscellaneous tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Reject non sampling events with precise_ip
    perf/x86/intel: Account interrupts for PEBS errors
    perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
    perf/core: Fix sys_perf_event_open() vs. hotplug
    perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
    perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
    perf/x86: Set pmu->module in Intel PMU modules
    perf probe: Fix to probe on gcc generated symbols for offline kernel
    perf probe: Fix --funcs to show correct symbols for offline module
    perf symbols: Robustify reading of build-id from sysfs
    perf tools: Install tools/lib/traceevent plugins with install-bin
    tools lib traceevent: Fix prev/next_prio for deadline tasks
    perf record: Fix --switch-output documentation and comment
    perf record: Make __record_options static
    tools lib subcmd: Add OPT_STRING_OPTARG_SET option
    perf probe: Fix to get correct modname from elf header
    samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
    samples/bpf sock_example: Avoid getting ethhdr from two includes
    perf sched timehist: Show total scheduling time

    Linus Torvalds
     
  • Pull EFI fixes from Ingo Molnar:
    "A number of regression fixes:

    - Fix a boot hang on machines that have somewhat unusual memory map
    entries of phys_addr=0x0 num_pages=0, which broke due to a recent
    commit. This commit got cherry-picked from the v4.11 queue because
    the bug is affecting real machines.

    - Fix a boot hang also reported by KASAN, caused by incorrect init
    ordering introduced by a recent optimization.

    - Fix a recent robustification fix to allocate_new_fdt_and_exit_boot()
    that introduced an invalid assumption. Neither bug was seen in
    the wild AFAIK"

    * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    efi/x86: Prune invalid memory map entries and fix boot regression
    x86/efi: Don't allocate memmap through memblock after mm_init()
    efi/libstub/arm*: Pass latest memory map to the kernel

    Linus Torvalds
     

14 Jan, 2017

6 commits

  • Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
    (2.28), include memory map entries with phys_addr=0x0 and num_pages=0.

    These machines fail to boot after the following commit,

    commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")

    Fix this by removing such bogus entries from the memory map.

    Furthermore, currently the log output for this case (with efi=debug)
    looks like:

    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)

    This is clearly wrong, and also not as informative as it could be. This
    patch changes it so that if we find obviously invalid memory map
    entries, we print an error and skip those entries. It also detects the
    display of the address range calculation overflow, so the new output is:

    [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[0x0000000000000000-0x0000000000000000] (invalid)

    It also detects memory map sizes that would overflow the physical
    address, for example phys_addr=0xfffffffffffff000 and
    num_pages=0x0200000000000001, and prints:

    [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
    [ 0.000000] efi: mem45: [Reserved | | | | | | | | | | | | ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)

    It then removes these entries from the memory map.
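
    The two checks described above (zero-length entries and ranges that
    overflow the physical address space) can be sketched like this. It is a
    standalone illustration, not the code from the patch: the function names
    are hypothetical, and the only assumption carried over from EFI is the
    4 KiB page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EFI_PAGE_SHIFT 12   /* EFI pages are 4 KiB */

/* Hypothetical check: true if the entry describes a usable range. */
bool efi_entry_is_valid(uint64_t phys_addr, uint64_t num_pages)
{
    uint64_t size;

    if (num_pages == 0)
        return false;                       /* empty entry */
    if (num_pages > (UINT64_MAX >> SKETCH_EFI_PAGE_SHIFT))
        return false;                       /* byte size does not fit in 64 bits */

    size = num_pages << SKETCH_EFI_PAGE_SHIFT;
    if (phys_addr > UINT64_MAX - (size - 1))
        return false;                       /* end address would wrap around */

    return true;
}

/* Last byte covered by a valid entry. */
uint64_t efi_entry_end(uint64_t phys_addr, uint64_t num_pages)
{
    assert(efi_entry_is_valid(phys_addr, num_pages));
    return phys_addr + (num_pages << SKETCH_EFI_PAGE_SHIFT) - 1;
}
```

    Both example entries from the changelog (phys_addr=0x0 with
    num_pages=0, and phys_addr=0xfffffffffffff000 with
    num_pages=0x0200000000000001) are rejected by this check.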

    Signed-off-by: Peter Jones
    Signed-off-by: Ard Biesheuvel
    [ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
    Signed-off-by: Matt Fleming
    [Matt: Include bugzilla info in commit log]
    Cc: # v4.9+
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
    Signed-off-by: Ingo Molnar

    Peter Jones
     
  • As Peter suggested [1], reject non-sampling PEBS events,
    because they don't make any sense and could cause bugs
    in the NMI handler [2].

    [1] http://lkml.kernel.org/r/20170103094059.GC3093@worktop
    [2] http://lkml.kernel.org/r/1482931866-6018-3-git-send-email-jolsa@kernel.org

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20170103142454.GA26251@krava
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • It's possible to set up PEBS events to get only errors and not
    any data, like on SNB-X (model 45) and IVB-EP (model 62)
    via 2 perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

    This leads to a soft lock up, because the error path of the
    intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
    for error PEBS interrupts, so in case you're getting ONLY
    errors you don't have a way to stop the event when it's over
    the max_samples_per_tick limit:

    NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
    ...
    RIP: 0010:[] [] smp_call_function_single+0xe2/0x140
    ...
    Call Trace:
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    ? perf_cgroup_attach+0x70/0x70
    perf_install_in_context+0x199/0x1b0
    ? ctx_resched+0x90/0x90
    SYSC_perf_event_open+0x641/0xf90
    SyS_perf_event_open+0x9/0x10
    do_syscall_64+0x6c/0x1f0
    entry_SYSCALL64_slow_path+0x25/0x25

    Add perf_event_account_interrupt() which does the interrupt
    and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
    error path.

    We keep the pending_kill and pending_wakeup logic only in the
    __perf_event_overflow() path, because they make sense only if
    there's any data to deliver.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • info->si_addr is of type void __user *, so it should be compared against
    something from the same address space.

    This fixes the following sparse error:

    arch/x86/mm/mpx.c:296:27: error: incompatible types in comparison expression (different address spaces)

    Signed-off-by: Tobias Klauser
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Tobias Klauser
     
  • The Intel Denverton microserver uses a 25 MHz TSC crystal,
    so we can derive its exact [*] TSC frequency
    using CPUID and some arithmetic, e.g.:

    TSC: 1800 MHz (25000000 Hz * 216 / 3 / 1000000)

    [*] 'exact' is only as good as the crystal, which should be +/- 20ppm
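
    The arithmetic above can be reproduced with a small helper. This is a
    sketch only: the function name is hypothetical, and it assumes the
    ratio comes from CPUID leaf 0x15, where EBX holds the numerator and EAX
    the denominator of the TSC to crystal clock ratio.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: TSC frequency in Hz from the crystal frequency
 * and the CPUID leaf 0x15 ratio (numerator = EBX, denominator = EAX).
 */
uint64_t tsc_hz_from_crystal(uint64_t crystal_hz, uint32_t numerator,
                             uint32_t denominator)
{
    assert(denominator != 0);   /* a zero denominator means no ratio is reported */
    return crystal_hz * numerator / denominator;
}
```

    For the values quoted above, tsc_hz_from_crystal(25000000, 216, 3)
    yields 1800000000 Hz, i.e. 1800 MHz.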

    Signed-off-by: Len Brown
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/306899f94804aece6d8fa8b4223ede3b48dbb59c.1484287748.git.len.brown@intel.com
    Signed-off-by: Ingo Molnar

    Len Brown
     
  • Pull KVM fixes from Paolo Bonzini:

    - fix for module unload vs deferred jump labels (note: there might be
    other buggy modules!)

    - two NULL pointer dereferences from syzkaller

    - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
    made worse during this merge window, "just" kernel memory leak on
    releases

    - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: x86: fix emulation of "MOV SS, null selector"
    KVM: x86: fix NULL deref in vcpu_scan_ioapic
    KVM: eventfd: fix NULL deref irqbypass consumer
    KVM: x86: Introduce segmented_write_std
    KVM: x86: flush pending lapic jump label updates on module unload
    jump_labels: API for flushing deferred jump label updates

    Linus Torvalds
     

12 Jan, 2017

9 commits

  • This is CVE-2017-2583. On Intel this causes a failed vmentry because
    SS's type is neither 3 nor 7 (even though the manual says this check is
    only done for usable SS, and the dmesg splat says that SS is unusable!).
    On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

    The fix fabricates a data segment descriptor when SS is set to a null
    selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
    Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
    this in turn ensures CPL < 3 because RPL must be equal to CPL.
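
    The RPL rule stated above can be sketched in isolation. This is an
    illustrative model only, with hypothetical helper names; it covers just
    the "null selector with RPL < 3" condition and none of the other checks
    the emulator performs when loading SS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A selector is null when its index and table-indicator bits are all zero. */
bool selector_is_null(uint16_t sel)
{
    return (sel & 0xfffc) == 0;
}

/* The requested privilege level is the low two bits of the selector. */
unsigned int selector_rpl(uint16_t sel)
{
    return sel & 3;
}

/* Hypothetical check: may SS be loaded with this null selector? */
bool null_ss_load_allowed(uint16_t sel)
{
    assert(selector_is_null(sel));
    return selector_rpl(sel) < 3;
}
```

    Under this rule the null selectors 0, 1 and 2 are accepted and 3 is
    rejected.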

    Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
    the bug and deciphering the manuals.

    Reported-by: Xiaohan Zhang
    Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011
    Cc: stable@nongnu.org
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Reported by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
    IP: _raw_spin_lock+0xc/0x30
    PGD 3e28eb067
    PUD 3f0ac6067
    PMD 0
    Oops: 0002 [#1] SMP
    CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3
    Call Trace:
    ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
    kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
    ? pick_next_task_fair+0xe1/0x4e0
    ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
    kvm_vcpu_ioctl+0x33a/0x600 [kvm]
    ? hrtimer_try_to_cancel+0x29/0x130
    ? do_nanosleep+0x97/0xf0
    do_vfs_ioctl+0xa1/0x5d0
    ? __hrtimer_init+0x90/0x90
    ? do_nanosleep+0x5b/0xf0
    SyS_ioctl+0x79/0x90
    do_syscall_64+0x6e/0x180
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0

    The syzkaller folks reported a NULL pointer dereference due to
    ENABLE_CAP succeeding even without an irqchip. The Hyper-V
    synthetic interrupt controller is activated, resulting in a
    wrong request to rescan the ioapic and a NULL pointer dereference.

    /* Headers required by the calls below (the original names did not survive on this page). */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    #ifndef KVM_CAP_HYPERV_SYNIC
    #define KVM_CAP_HYPERV_SYNIC 123
    #endif

    /* Races KVM_ENABLE_CAP against KVM_RUN on the same vcpu fd. */
    void* thr(void* arg)
    {
        struct kvm_enable_cap cap;
        cap.flags = 0;
        cap.cap = KVM_CAP_HYPERV_SYNIC;
        ioctl((long)arg, KVM_ENABLE_CAP, &cap);
        return 0;
    }

    int main()
    {
        /* Byte pointer so that the guest page can be indexed below. */
        unsigned char *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
                                       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        int kvmfd = open("/dev/kvm", 0);
        int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
        struct kvm_userspace_memory_region memreg;
        memreg.slot = 0;
        memreg.flags = 0;
        memreg.guest_phys_addr = 0;
        memreg.memory_size = 0x1000;
        memreg.userspace_addr = (unsigned long)host_mem;
        host_mem[0] = 0xf4; /* hlt */
        ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
        int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
        struct kvm_sregs sregs;
        ioctl(cpufd, KVM_GET_SREGS, &sregs);
        sregs.cr0 = 0;
        sregs.cr4 = 0;
        sregs.efer = 0;
        sregs.cs.selector = 0;
        sregs.cs.base = 0;
        ioctl(cpufd, KVM_SET_SREGS, &sregs);
        struct kvm_regs regs = { .rflags = 2 };
        ioctl(cpufd, KVM_SET_REGS, &regs);
        ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
        pthread_t th;
        pthread_create(&th, 0, thr, (void*)(long)cpufd);
        usleep(rand() % 10000);
        ioctl(cpufd, KVM_RUN, 0);
        pthread_join(th, 0);
        return 0;
    }

    This patch fixes it by failing ENABLE_CAP if there is no irqchip.

    Reported-by: Dmitry Vyukov
    Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller)
    Cc: stable@vger.kernel.org # 4.5+
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Dmitry Vyukov
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • Introduces segmented_write_std.

    Switches from emulated reads/writes to standard read/writes in fxsave,
    fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding
    kernel memory leak.

    Since commit 283c95d0e389 ("KVM: x86: emulate FXSAVE and FXRSTOR",
    2016-11-09), which is luckily not yet in any final release, this would
    also be an exploitable kernel memory *write*!

    Reported-by: Dmitry Vyukov
    Cc: stable@vger.kernel.org
    Fixes: 96051572c819194c37a8367624b285be10297eca
    Fixes: 283c95d0e3891b64087706b344a4b545d04a6e62
    Suggested-by: Paolo Bonzini
    Signed-off-by: Steve Rutherford
    Signed-off-by: Paolo Bonzini

    Steve Rutherford
     
  • KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
    These are implemented with delayed_work structs which can still be
    pending when the KVM module is unloaded. We've seen this cause kernel
    panics when the kvm_intel module is quickly reloaded.

    Use the new static_key_deferred_flush() API to flush pending updates on
    module unload.

    Signed-off-by: David Matlack
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    David Matlack
     
  • When unwinding a task, the end of the stack is always at the same offset
    right below the saved pt_regs, regardless of which syscall was used to
    enter the kernel. That convention allows the unwinder to verify that a
    stack is sane.

    However, newly forked tasks don't always follow that convention, as
    reported by the following unwinder warning seen by Dave Jones:

    WARNING: kernel stack frame pointer at ffffc90001443f30 in kworker/u8:8:30468 has bad value (null)

    The warning was due to the following call chain:

    (ftrace handler)
    call_usermodehelper_exec_async+0x5/0x140
    ret_from_fork+0x22/0x30

    The problem is that ret_from_fork() doesn't create a stack frame before
    calling other functions. Fix that by carefully using the frame pointer
    macros.

    In addition to conforming to the end of stack convention, this also
    makes related stack traces more sensible by making it clear to the user
    that ret_from_fork() was involved.

    Reported-by: Dave Jones
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Miroslav Benes
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/8854cdaab980e9700a81e9ebf0d4238e4bbb68ef.1483978430.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • In the following commit:

    0100301bfdf5 ("sched/x86: Rewrite the switch_to() code")

    ... the layout of the 'inactive_task_frame' struct was designed to have
    a frame pointer header embedded in it, so that the unwinder could use
    the 'bp' and 'ret_addr' fields to report __schedule() on the stack (or
    ret_from_fork() for newly forked tasks which haven't actually run yet).

    Finish the job by changing get_frame_pointer() to return a pointer to
    inactive_task_frame's 'bp' field rather than 'bp' itself. This allows
    the unwinder to start one frame higher on the stack, so that it properly
    reports __schedule().

    Reported-by: Miroslav Benes
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Jones
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/598e9f7505ed0aba86e8b9590aa528c6c7ae8dcd.1483978430.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • There are a handful of callers to save_stack_trace_tsk() and
    show_stack() which try to unwind the stack of a task other than current.
    In such cases, it's remotely possible that the task is running on one
    CPU while the unwinder is reading its stack from another CPU, causing
    the unwinder to see stack corruption.

    These cases seem to be mostly harmless. The unwinder has checks which
    prevent it from following bad pointers beyond the bounds of the stack.
    So it's not really a bug as long as the caller understands that
    unwinding another task will not always succeed.

    In such cases, it's possible that the unwinder may read a KASAN-poisoned
    region of the stack. Account for that by using READ_ONCE_NOCHECK() when
    reading the stack of another task.

    Use READ_ONCE() when reading the stack of the current task, since KASAN
    warnings can still be useful for finding bugs in that case.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Jones
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Miroslav Benes
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • There are a handful of callers to save_stack_trace_tsk() and
    show_stack() which try to unwind the stack of a task other than current.
    In such cases, it's remotely possible that the task is running on one
    CPU while the unwinder is reading its stack from another CPU, causing
    the unwinder to see stack corruption.

    These cases seem to be mostly harmless. The unwinder has checks which
    prevent it from following bad pointers beyond the bounds of the stack.
    So it's not really a bug as long as the caller understands that
    unwinding another task will not always succeed.

    Since stack "corruption" on another task's stack isn't necessarily a
    bug, silence the warnings when unwinding tasks other than current.

    Reported-by: Dave Jones
    Signed-off-by: Josh Poimboeuf
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Miroslav Benes
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/00d8c50eea3446c1524a2a755397a3966629354c.1483978430.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • Pull crypto fix from Herbert Xu:
    "This fixes a regression in aesni that renders it useless if it's
    built-in with a modular pcbc configuration"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: aesni - Fix failure when built-in with modular pcbc

    Linus Torvalds
     

11 Jan, 2017

3 commits

  • When x86_pmu.num_counters is 32, the shift of the integer constant 1
    exceeds 32 bits and is therefore undefined behaviour.

    Fix this by shifting 1ULL instead of 1.
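
    A minimal sketch of the difference, assuming the value being built is
    a mask of the low n bits. The helper name counter_mask() is
    hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: mask with the low n bits set.
 * The constant is 1ULL, so the shift is carried out in 64 bits and
 * n == 32 is well defined. With a plain int constant 1, shifting by 32
 * would be undefined behaviour where int is 32 bits wide.
 */
uint64_t counter_mask(unsigned int n)
{
    assert(n < 64);             /* a shift by 64 or more is still undefined */
    return (1ULL << n) - 1;
}
```

    With 32 counters, counter_mask(32) evaluates to 0xffffffff.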

    Reported-by: CoverityScan CID#1192105 ("Bad bit shift operation")
    Signed-off-by: Colin Ian King
    Cc: Andi Kleen
    Cc: Peter Zijlstra
    Cc: Kan Liang
    Cc: Stephane Eranian
    Cc: Alexander Shishkin
    Link: http://lkml.kernel.org/r/20170111114310.17928-1-colin.king@canonical.com
    Signed-off-by: Thomas Gleixner

    Colin King
     
  • Martin reported that the Supermicro X8DTH-i/6/iF/6F advertises incorrect
    host bridge windows via _CRS:

    pci_root PNP0A08:00: host bridge window [io 0xf000-0xffff]
    pci_root PNP0A08:01: host bridge window [io 0xf000-0xffff]

    Both bridges advertise the 0xf000-0xffff window, which cannot be correct.
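
    Why the two windows conflict can be shown with a simple overlap test on
    inclusive ranges. This is an illustration only: the helper name is
    hypothetical, and the kernel quirk itself does not perform this check,
    it simply ignores _CRS on the affected boards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: do two inclusive I/O port ranges share any port? */
bool io_windows_overlap(uint32_t a_start, uint32_t a_end,
                        uint32_t b_start, uint32_t b_end)
{
    assert(a_start <= a_end && b_start <= b_end);
    return a_start <= b_end && b_start <= a_end;
}
```

    For the two windows above, io_windows_overlap(0xf000, 0xffff, 0xf000,
    0xffff) is true, so two host bridges would claim the same ports.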

    Work around this by ignoring _CRS on this system. The downside is that we
    may not assign resources correctly to hot-added PCI devices (if they are
    possible on this system).

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=42606
    Reported-by: Martin Burnicki
    Signed-off-by: Bjorn Helgaas
    CC: stable@vger.kernel.org

    Bjorn Helgaas
     
  • hswep_uncore_cpu_init() uses a hardcoded physical package id 0 for the boot
    cpu. This works as long as the boot CPU is actually on the physical package
    0, which is normally the case after power on / reboot.

    But it fails with a NULL pointer dereference when a kdump kernel is started
    on a secondary socket which has a different physical package id because the
    logical package translation for physical package 0 does not exist.

    Use the logical package id of the boot cpu instead of hard coded 0.

    [ tglx: Rewrote changelog once more ]

    Fixes: cf6d445f6897 ("perf/x86/uncore: Track packages, not per CPU data")
    Signed-off-by: Prarit Bhargava
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Harish Chegondi
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1483628965-2890-1-git-send-email-prarit@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Prarit Bhargava
     

10 Jan, 2017

6 commits

  • In generic_load_microcode(), curr_mc_size is the size of the last
    allocated buffer and since we have this performance "optimization"
    there to vmalloc a new buffer only when the current one is bigger,
    curr_mc_size ends up becoming the size of the biggest buffer we've seen
    so far.

    However, we end up saving the microcode patch which matches our CPU
    and its size is not curr_mc_size but the respective mc_size during the
    iteration while we're staring at it.

    So save that mc_size into a separate variable and use it to store the
    previously found microcode buffer.
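
    The distinction between the two sizes can be modelled in userspace as
    below. This is a simplified sketch, not the loader code: struct blob,
    the function name and the scratch_capacity out-parameter are
    hypothetical, and the real buffer handling is left out.

```c
#include <assert.h>
#include <stddef.h>

struct blob {
    size_t size;        /* size of this microcode blob */
    int matches_cpu;    /* nonzero if it applies to the running CPU */
};

/*
 * Walk the blobs and return the size to use when saving the matching
 * patch. curr_mc_size only ever grows (it models the capacity of the
 * reused scratch buffer), so it must not be used as the patch size.
 */
size_t size_of_patch_to_save(const struct blob *blobs, size_t n,
                             size_t *scratch_capacity)
{
    size_t curr_mc_size = 0;    /* biggest blob seen so far */
    size_t new_mc_size = 0;     /* size of the blob that matched */

    assert(blobs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        size_t mc_size = blobs[i].size;

        if (mc_size > curr_mc_size)
            curr_mc_size = mc_size;     /* the buffer would be reallocated here */
        if (blobs[i].matches_cpu)
            new_mc_size = mc_size;      /* remember the matching blob's own size */
    }
    if (scratch_capacity)
        *scratch_capacity = curr_mc_size;
    return new_mc_size;
}
```

    If a 4096-byte blob that does not match is followed by a 2048-byte
    blob that does, the scratch capacity ends up at 4096 while the size to
    save is 2048. Copying 4096 bytes from the smaller patch would read past
    its end, which is consistent with the oops shown below.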

    Without this fix, we could get oops like this:

    BUG: unable to handle kernel paging request at ffffc9000e30f000
    IP: __memcpy+0x12/0x20
    ...
    Call Trace:
    ? kmemdup+0x43/0x60
    __alloc_microcode_buf+0x44/0x70
    save_microcode_patch+0xd4/0x150
    generic_load_microcode+0x1b8/0x260
    request_microcode_user+0x15/0x20
    microcode_write+0x91/0x100
    __vfs_write+0x34/0x120
    vfs_write+0xc1/0x130
    SyS_write+0x56/0xc0
    do_syscall_64+0x6c/0x160
    entry_SYSCALL64_slow_path+0x25/0x25

    Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading")
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/4f33cbfd-44f2-9bed-3b66-7446cd14256f@ce.jp.nec.com
    Signed-off-by: Thomas Gleixner

    Junichi Nomura
     
  • We allocate struct ucode_patch here. @size is the size of microcode data
    and used for kmemdup() later in this function.

    Fixes: 06b8534cb728 ("x86/microcode: Rework microcode loading")
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/7a730dc9-ac17-35c4-fe76-dfc94e5ecd95@ce.jp.nec.com
    Signed-off-by: Thomas Gleixner

    Junichi Nomura
     
  • Since on Intel we're required to do CPUID(1) first, before reading
    the microcode revision MSR, let's add a special helper which does the
    required steps so that we don't forget to do them next time, when we
    want to read the microcode revision.

    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/20170109114147.5082-4-bp@alien8.de
    Signed-off-by: Thomas Gleixner

    Borislav Petkov
     
  • Intel supplies the microcode revision value in MSR 0x8b
    (IA32_BIOS_SIGN_ID) after CPUID(1) has been executed. Execute it each
    time before reading that MSR.

    It used to do sync_core() which did do CPUID but

    c198b121b1a1 ("x86/asm: Rewrite sync_core() to use IRET-to-self")

    changed the sync_core() implementation so we better make the microcode
    loading case explicit, as the SDM documents it.

    Reported-and-tested-by: Jun'ichi Nomura
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/20170109114147.5082-3-bp@alien8.de
    Signed-off-by: Thomas Gleixner

    Borislav Petkov
     
  • ... similarly to the cpuid_() variants.

    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/20170109114147.5082-2-bp@alien8.de
    Signed-off-by: Thomas Gleixner

    Borislav Petkov
     
  • Pull networking fixes from David Miller:

    1) Fix dumping of nft_quota entries, from Pablo Neira Ayuso.

    2) Fix out of bounds access in nf_tables discovered by KASAN, from
    Florian Westphal.

    3) Fix IRQ enabling in dp83867 driver, from Grygorii Strashko.

    4) Fix unicast filtering in be2net driver, from Ivan Vecera.

    5) tg3_get_stats64() can race with driver close and ethtool
    reconfigurations, fix from Michael Chan.

    6) Fix error handling when pass limit is reached in bpf code gen on
    x86. From Daniel Borkmann.

    7) Don't clobber switch ops and use proper MDIO nested reads and writes
    in bcm_sf2 driver, from Florian Fainelli.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
    net: dsa: bcm_sf2: Utilize nested MDIO read/write
    net: dsa: bcm_sf2: Do not clobber b53_switch_ops
    net: stmmac: fix maxmtu assignment to be within valid range
    bpf: change back to orig prog on too many passes
    tg3: Fix race condition in tg3_get_stats64().
    be2net: fix unicast list filling
    be2net: fix accesses to unicast list
    netlabel: add CALIPSO to the list of built-in protocols
    vti6: fix device register to report IFLA_INFO_KIND
    net: phy: dp83867: fix irq generation
    amd-xgbe: Fix IRQ processing when running in single IRQ mode
    sh_eth: R8A7740 supports packet shecksumming
    sh_eth: fix EESIPR values for SH77{34|63}
    r8169: fix the typo in the comment
    nl80211: fix sched scan netlink socket owner destruction
    bridge: netfilter: Fix dropping packets that moving through bridge interface
    netfilter: ipt_CLUSTERIP: check duplicate config when initializing
    netfilter: nft_payload: mangle ckecksum if NFT_PAYLOAD_L4CSUM_PSEUDOHDR is set
    netfilter: nf_tables: fix oob access
    netfilter: nft_queue: use raw_smp_processor_id()
    ...

    Linus Torvalds
     

09 Jan, 2017

2 commits

  • Add the missing declarations of basic string functions to string.h to allow
    a clean build.

    Fixes: 5be865661516 ("String-handling functions for the new x86 setup code.")
    Signed-off-by: Nicholas Mc Guire
    Link: http://lkml.kernel.org/r/1483781911-21399-1-git-send-email-hofrat@osadl.org
    Signed-off-by: Thomas Gleixner

    Nicholas Mc Guire
     
  • If after too many passes still no image could be emitted, then
    swap back to the original program as we do in all other cases
    and don't use the one with blinding.

    Fixes: 959a75791603 ("bpf, x86: add support for constant blinding")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
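
The fallback described above can be sketched as follows. This is a simplified user-space model, not the x86 JIT: struct prog, emit_image() and jit_compile() are hypothetical, and the pass loop only pretends to converge after a given number of passes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified program handle. */
struct prog {
    const char *name;
    bool jited;
};

/*
 * Stand-in for the JIT pass loop: the image "converges" on pass number
 * passes_needed, and the loop gives up after max_passes.
 */
static bool emit_image(int passes_needed, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        if (pass == passes_needed)
            return true;
    }
    return false;
}

/*
 * Compile the blinded clone if there is one, otherwise the original.
 * When no image could be emitted, hand back the original program so the
 * blinded clone is not used.
 */
static struct prog *jit_compile(struct prog *orig, struct prog *blinded,
                                int passes_needed, int max_passes)
{
    struct prog *prog = blinded ? blinded : orig;

    if (!emit_image(passes_needed, max_passes))
        return orig;    /* too many passes: swap back to the original */

    prog->jited = true;
    return prog;
}
```

The failure path returns the same program as every other error path would, which is the consistency the fix is after.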
     

07 Jan, 2017

3 commits

  • With the following commit:

    4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")

    ... efi_bgrt_init() calls into the memblock allocator through
    efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.

    Indeed, KASAN reports a bad read access later on in efi_free_boot_services():

    BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
    at addr ffff88022de12740
    Read of size 4 by task swapper/0/0
    page:ffffea0008b78480 count:0 mapcount:-127
    mapping: (null) index:0x1 flags: 0x5fff8000000000()
    [...]
    Call Trace:
    dump_stack+0x68/0x9f
    kasan_report_error+0x4c8/0x500
    kasan_report+0x58/0x60
    __asan_load4+0x61/0x80
    efi_free_boot_services+0xae/0x24c
    start_kernel+0x527/0x562
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x157/0x17a
    start_cpu+0x5/0x14

    The instruction at the given address is the first read from the memmap's
    memory, i.e. the read of md->type in efi_free_boot_services().

    Note that the writes earlier in efi_arch_mem_reserve() don't splat because
    they're done through early_memremap()ed addresses.

    So, after memblock is gone, allocations should be done through the "normal"
    page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
    it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
    of consistency, from efi_fake_memmap() as well.

    Note that for the latter, the memmap allocations cease to be page aligned.
    Page alignment isn't needed there, though.

    Tested-by: Dan Williams
    Signed-off-by: Nicolai Stange
    Reviewed-by: Ard Biesheuvel
    Cc: # v4.9
    Cc: Dave Young
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mika Penttilä
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-efi@vger.kernel.org
    Fixes: 4bc9f92e64c8 ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
    Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
    Signed-off-by: Ingo Molnar

    Nicolai Stange
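
The idea behind such a helper can be sketched in user space as below. The flag page_allocator_ready, the enum, and memmap_alloc() are hypothetical stand-ins; in this model malloc() plays the role of both kernel allocators and only the recorded source differs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

enum alloc_source {
    FROM_MEMBLOCK,          /* early boot allocator */
    FROM_PAGE_ALLOCATOR     /* "normal" allocator, usable after mm_init() */
};

struct memmap_buf {
    void *addr;
    size_t size;
    enum alloc_source source;
};

/* Hypothetical flag: set once the early allocator has been retired. */
static bool page_allocator_ready;

/*
 * Single entry point for memmap allocations, so callers never have to
 * know which boot phase they are running in.
 */
static struct memmap_buf memmap_alloc(size_t num_entries, size_t desc_size)
{
    struct memmap_buf buf;

    buf.size = num_entries * desc_size;
    buf.source = page_allocator_ready ? FROM_PAGE_ALLOCATOR : FROM_MEMBLOCK;
    buf.addr = malloc(buf.size);    /* stand-in for either allocator */
    return buf;
}
```

Routing every caller through one function keeps the phase check in a single place instead of repeating it at each call site.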
     
  • Pull KVM fixes from Radim Krčmář:
    "MIPS:
    - fix host kernel crashes when receiving a signal with 64-bit
    userspace

    - flush instruction cache on all vcpus after generating entry code

    (both for stable)

    x86:
    - fix NULL dereference in MMU caused by SMM transitions (for stable)

    - correct guest instruction pointer after emulating some VMX errors

    - minor cleanup"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: VMX: remove duplicated declaration
    KVM: MIPS: Flush KVM entry code from icache globally
    KVM: MIPS: Don't clobber CP0_Status.UX
    KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
    KVM: nVMX: fix instruction skipping during emulated vm-entry

    Linus Torvalds
     
  • Pull swiotlb fixes from Konrad Rzeszutek Wilk:
    "This has one fix to make i915 work when using Xen SWIOTLB, and a
    feature from Geert to aid in debugging of devices that can't do DMA
    outside the 32-bit address space.

    The feature from Geert is on top of v4.10 merge window commit
    (specifically you pulling my previous branch), as his changes were
    dependent on the Documentation/ movement patches.

    I figured it would just be easier than me trying to cherry-pick the
    Documentation patches to satisfy git.

    The patches have been soaking since 12/20, albeit I updated the last
    patch due to linux-next catching a compiler error and adding a
    Tested-and-Reported-by tag"

    * 'stable/for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
    swiotlb: Export swiotlb_max_segment to users
    swiotlb: Add swiotlb=noforce debug option
    swiotlb: Convert swiotlb_force from int to enum
    x86, swiotlb: Simplify pci_swiotlb_detect_override()

    Linus Torvalds
     

06 Jan, 2017

1 commit

  • The following commit:

    8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")

    ... broke the initial strategy for Bulldozer-based cores' topology,
    where we consider each thread of a compute unit a standalone core
    and not a HT or SMT thread.

    Revert to the firmware-supplied core_id numbering and do not make
    them thread siblings as we don't consider them for such even if they
    technically are, more or less.

    Reported-and-tested-by: Brice Goglin
    Tested-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Cc: # v4.6+
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 8196dab4fc15 ("x86/cpu: Get rid of compute_unit_id")
    Link: http://lkml.kernel.org/r/20170105092638.5247-1-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov