06 Sep, 2019

1 commit


03 Sep, 2019

1 commit

  • When the 'start' parameter is >= 0xFF000000 on 32-bit
    systems, or >= 0xFFFFFFFF'FF000000 on 64-bit systems,
    fill_gva_list() gets into an infinite loop.

    With such inputs, 'cur' overflows after adding HV_TLB_FLUSH_UNIT
    and always compares as less than end. Memory is filled with
    guest virtual addresses until the system crashes.

    Fix this by never incrementing 'cur' to be larger than 'end'.
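
    The wraparound can be reproduced in isolation. The following is a minimal
    sketch, not the kernel's fill_gva_list(): it models a 32-bit address space
    with uint32_t, assumes a flush unit of 16 MiB, and uses the hypothetical
    names chunks_old() and chunks_fixed(). The 'max_iter' cap exists only so
    the old loop shape terminates in the demonstration.

```c
#include <stdint.h>

/* Assumed flush unit: 4096 pages of 4 KiB each (16 MiB). */
#define FLUSH_UNIT 0x1000000u

/*
 * Old loop shape on a 32-bit address space: 'cur' always advances by
 * FLUSH_UNIT and may wrap past zero. 'max_iter' is a safety cap that
 * exists only in this sketch so the demonstration terminates.
 */
unsigned int chunks_old(uint32_t start, uint32_t end, unsigned int max_iter)
{
	uint32_t cur = start;
	unsigned int n = 0;

	do {
		cur += FLUSH_UNIT;	/* wraps modulo 2^32 */
		n++;
	} while (cur < end && n < max_iter);

	return n;
}

/* Fixed loop shape: 'cur' is never advanced beyond 'end'. */
unsigned int chunks_fixed(uint32_t start, uint32_t end)
{
	uint32_t cur = start;
	unsigned int n = 0;

	do {
		uint32_t diff = end > cur ? end - cur : 0;

		if (diff >= FLUSH_UNIT)
			cur += FLUSH_UNIT;	/* cannot pass 'end' */
		else if (diff)
			cur = end;		/* final partial chunk */
		n++;
	} while (cur < end);

	return n;
}
```

    With start = 0xFF000000 and end = 0xFFFFFFFF the old shape only stops at
    the cap, while the fixed shape finishes after a single chunk.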

    Reported-by: Jong Hyun Park
    Signed-off-by: Tianyu Lan
    Reviewed-by: Michael Kelley
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 2ffd9e33ce4a ("x86/hyper-v: Use hypercall for remote TLB flush")
    Signed-off-by: Ingo Molnar

    Tianyu Lan
     

02 Sep, 2019

4 commits

  • Identical to __put_user(); the __get_user() argument evaluation will too
    leak UBSAN crud into the __uaccess_begin() / __uaccess_end() region.
    While uncommon this was observed to happen for:

    drivers/xen/gntdev.c: if (__get_user(old_status, batch->status[i]))

    where UBSAN added array bound checking.

    This complements commit:

    6ae865615fc4 ("x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation")
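
    The ordering problem can be shown with plain macros. This is a sketch under
    stated assumptions, not the kernel's uaccess code: 'window_open' stands in
    for the AC flag, checked() plays the role of an instrumented argument such
    as a UBSAN bounds check, and GET_LEAKY/GET_FIXED are hypothetical names.

```c
/* Stand-in for the AC flag: nonzero while the user-access window is open. */
int window_open;
/* Set when an argument expression runs inside the window. */
int leaked;

/* Hypothetical instrumented argument, playing the role of a UBSAN check. */
const int *checked(const int *p)
{
	if (window_open)
		leaked = 1;
	return p;
}

/* Leaky shape: 'ptr' is expanded, and thus evaluated, inside the window. */
#define GET_LEAKY(dst, ptr) do {		\
	window_open = 1;			\
	(dst) = *(ptr);				\
	window_open = 0;			\
} while (0)

/* Fixed shape: 'ptr' is evaluated into a local before the window opens. */
#define GET_FIXED(dst, ptr) do {		\
	const int *p_ = (ptr);			\
	window_open = 1;			\
	(dst) = *p_;				\
	window_open = 0;			\
} while (0)

/* Returns 1 if the leaky macro evaluated its argument inside the window. */
int leaky_case(void)
{
	int src = 7, dst = 0;

	leaked = 0;
	GET_LEAKY(dst, checked(&src));
	return leaked && dst == 7;
}

/* Returns 1 if the fixed macro kept argument evaluation outside the window. */
int fixed_case(void)
{
	int src = 7, dst = 0;

	leaked = 0;
	GET_FIXED(dst, checked(&src));
	return !leaked && dst == 7;
}
```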

    Tested-by: Sedat Dilek
    Reported-by: Randy Dunlap
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Thomas Gleixner
    Cc: broonie@kernel.org
    Cc: sfr@canb.auug.org.au
    Cc: akpm@linux-foundation.org
    Cc: Randy Dunlap
    Cc: mhocko@suse.cz
    Cc: Josh Poimboeuf
    Link: https://lkml.kernel.org/r/20190829082445.GM2369@hirez.programming.kicks-ass.net

    Peter Zijlstra
     
  • Commit

    a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else")

    now zeroes the secure boot setting information (enabled/disabled/...)
    passed by the boot loader or by the kernel's EFI handover mechanism.

    The problem manifests itself with signed kernels using the EFI handoff
    protocol with grub and the kernel loses the information whether secure
    boot is enabled in the firmware, i.e., the log message "Secure boot
    enabled" becomes "Secure boot could not be determined".

    efi_main() in arch/x86/boot/compressed/eboot.c sets this field early but it
    is subsequently zeroed by the above referenced commit.

    Include boot_params.secure_boot in the preserve field list.
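
    The "save listed fields, zero everything else" pattern can be sketched as
    follows. The struct is a hypothetical, heavily reduced stand-in for
    boot_params, and keep_field(), sanitize() and secure_boot_after() are
    illustrative names rather than kernel functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct boot_params. */
struct params {
	unsigned char secure_boot;
	unsigned char scratch[7];
	unsigned int hdr;
};

/* Copy one preserved field, given by offset and length, from src to dst. */
void keep_field(struct params *dst, const struct params *src,
		size_t off, size_t len)
{
	memcpy((char *)dst + off, (const char *)src + off, len);
}

/* Zero everything, then restore only the fields on the preserve list. */
void sanitize(struct params *p, int preserve_secure_boot)
{
	struct params saved = *p;

	memset(p, 0, sizeof(*p));
	keep_field(p, &saved, offsetof(struct params, hdr), sizeof(saved.hdr));
	if (preserve_secure_boot)
		keep_field(p, &saved, offsetof(struct params, secure_boot),
			   sizeof(saved.secure_boot));
}

/* Returns secure_boot after sanitizing, or -1 if other fields misbehave. */
int secure_boot_after(int preserve_secure_boot)
{
	struct params p;

	memset(&p, 0xAA, sizeof(p));
	p.secure_boot = 3;
	p.hdr = 5;
	sanitize(&p, preserve_secure_boot);
	if (p.hdr != 5 || p.scratch[0] != 0)
		return -1;
	return p.secure_boot;
}
```

    Without the field on the preserve list the value set by the loader is
    lost; with it, the value survives sanitizing.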

    [ bp: restructure commit message and massage. ]

    Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else")
    Signed-off-by: John S. Gruber
    Signed-off-by: Borislav Petkov
    Reviewed-by: John Hubbard
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Juergen Gross
    Cc: Mark Brown
    Cc: stable
    Cc: Thomas Gleixner
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/CAPotdmSPExAuQcy9iAHqX3js_fc4mMLQOTr5RBGvizyCOPcTQQ@mail.gmail.com

    John S. Gruber
     
  • Pull x86 fixes from Thomas Gleixner:
    "A set of fixes for x86:

    - Fix the bogus detection of 32bit user mode for uretprobes which
    caused corruption of the user return address resulting in
    application crashes. In the uprobes handler in_ia32_syscall() is
    obviously always returning false on a 64bit kernel. Use
    user_64bit_mode() instead which works correctly.

    - Prevent large page splitting when ftrace flips RW/RO on the kernel
    text which caused iTLB performance issues. Ftrace wants to be
    converted to text_poke() which avoids the problem, but for now
    allow large page preservation in the static protections check when
    the change request spans a full large page.

    - Prevent arch_dynirq_lower_bound() from returning 0 when the IOAPIC
    is configured via device tree. In the device tree case the GSI 1:1
    mapping is meaningless therefore the lower bound which protects the
    GSI range on ACPI machines is irrelevant. Return the lower bound
    which the core hands to the function instead of blindly returning 0
    which causes the core to allocate the invalid virtual interrupt
    number 0 which in turn prevents all drivers from allocating and
    requesting an interrupt.

    - Remove the bogus initialization of LDR and DFR in the 32bit bigsmp
    APIC driver. That uses physical destination mode where LDR/DFR are
    ignored, but the initialization and the missing clear of LDR caused
    the APIC to be left in an inconsistent state on kexec/reboot.

    - Clear LDR when clearing the APIC registers so the APIC is in a well
    defined state.

    - Initialize variables properly in the find_trampoline_placement()
    code.

    - Silence GCC9 build warning for the real mode part of the build"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
    x86/build: Add -Wnoaddress-of-packed-member to REALMODE_CFLAGS, to silence GCC9 build warning
    x86/boot/compressed/64: Fix missing initialization in find_trampoline_placement()
    x86/apic: Include the LDR when clearing out APIC registers
    x86/apic: Do not initialize LDR and DFR for bigsmp
    uprobes/x86: Fix detection of 32-bit user mode
    x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machines

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "Two fixes for perf x86 hardware implementations:

    - Restrict the period on Nehalem machines to prevent perf from
    hogging the CPU

    - Prevent the AMD IBS driver from overwriting the hardware controlled
    and pre-seeded reserved bits (0-6) in the count register which
    caused a sample bias for dispatched micro-ops"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/amd/ibs: Fix sample bias for dispatched micro-ops
    perf/x86/intel: Restrict period on Nehalem

    Linus Torvalds
     

01 Sep, 2019

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Small fixes and minor cleanups for tracing:

    - Make exported ftrace function not static

    - Fix NULL pointer dereference in reading probes as they are created

    - Fix NULL pointer dereference in k/uprobe clean up path

    - Various documentation fixes"

    * tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Correct kdoc formats
    ftrace/x86: Remove mcount() declaration
    tracing/probe: Fix null pointer dereference
    tracing: Make exported ftrace_set_clr_event non-static
    ftrace: Check for successful allocation of hash
    ftrace: Check for empty hash and comment the race with registering probes
    ftrace: Fix NULL pointer dereference in t_probe_next()

    Linus Torvalds
     

31 Aug, 2019

1 commit

  • Commit 562e14f72292 ("ftrace/x86: Remove mcount support") removed the
    support for using mcount, so we can remove the mcount() declaration
    to clean up.

    Link: http://lkml.kernel.org/r/20190826170150.10f101ba@xhacker.debian

    Signed-off-by: Jisheng Zhang
    Signed-off-by: Steven Rostedt (VMware)

    Jisheng Zhang
     

30 Aug, 2019

3 commits

  • When counting dispatched micro-ops with cnt_ctl=1, in order to prevent
    sample bias, IBS hardware preloads the least significant 7 bits of
    current count (IbsOpCurCnt) with random values, such that, after the
    interrupt is handled and counting resumes, the next sample taken
    will be slightly perturbed.

    The current count bitfield is in the IBS execution control h/w register,
    alongside the maximum count field.

    Currently, the IBS driver writes that register with the maximum count,
    leaving zeroes to fill the current count field, thereby overwriting
    the random bits the hardware preloaded for itself.

    Fix the driver to actually retain and carry those random bits from the
    read of the IBS control register, through to its write, instead of
    overwriting the lower current count bits with zeroes.
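
    The difference between the two write paths comes down to one mask. The
    sketch below assumes that the current count field starts at bit 32 of the
    control register and that its low 7 bits are the randomized ones; the
    constant and the function names ctl_old()/ctl_new() are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout for this sketch: the current count field starts at
 * bit 32, and its low 7 bits are the ones the hardware randomizes.
 */
#define CUR_CNT_RAND (0x7FULL << 32)

/* Old behavior: write only the maximum count, zeroing the current count. */
uint64_t ctl_old(uint64_t hw_value, uint64_t max_cnt)
{
	(void)hw_value;		/* value read from hardware is discarded */
	return max_cnt;
}

/* New behavior: carry the randomized bits from the read into the write. */
uint64_t ctl_new(uint64_t hw_value, uint64_t max_cnt)
{
	return max_cnt | (hw_value & CUR_CNT_RAND);
}
```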

    Tested with:

    perf record -c 100001 -e ibs_op/cnt_ctl=1/pp -a -C 0 taskset -c 0

    'perf annotate' output before:

    15.70 65: addsd %xmm0,%xmm1
    17.30 add $0x1,%rax
    15.88 cmp %rdx,%rax
    je 82
    17.32 72: test $0x1,%al
    jne 7c
    7.52 movapd %xmm1,%xmm0
    5.90 jmp 65
    8.23 7c: sqrtsd %xmm1,%xmm0
    12.15 jmp 65

    'perf annotate' output after:

    16.63 65: addsd %xmm0,%xmm1
    16.82 add $0x1,%rax
    16.81 cmp %rdx,%rax
    je 82
    16.69 72: test $0x1,%al
    jne 7c
    8.30 movapd %xmm1,%xmm0
    8.13 jmp 65
    8.24 7c: sqrtsd %xmm1,%xmm0
    8.39 jmp 65

    Tested on Family 15h and 17h machines.

    Machines prior to family 10h Rev. C don't have the RDWROPCNT capability,
    and have the IbsOpCurCnt bitfield reserved, so this patch shouldn't
    affect their operation.

    It is unknown why commit db98c5faf8cb ("perf/x86: Implement 64-bit
    counter support for IBS") ignored the lower 4 bits of the IbsOpCurCnt
    field; the number of preloaded random bits has always been 7, AFAICT.

    Signed-off-by: Kim Phillips
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Arnaldo Carvalho de Melo"
    Cc:
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Thomas Gleixner
    Cc: "Borislav Petkov"
    Cc: Stephane Eranian
    Cc: Alexander Shishkin
    Cc: "Namhyung Kim"
    Cc: "H. Peter Anvin"
    Link: https://lkml.kernel.org/r/20190826195730.30614-1-kim.phillips@amd.com

    Kim Phillips
     
  • We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in
    some cases when using perf:

    perfevents: irq loop stuck!
    WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 intel_pmu_handle_irq+0x37b/0x530
    ...
    RIP: 0010:intel_pmu_handle_irq+0x37b/0x530
    ...
    Call Trace:

    ? perf_event_nmi_handler+0x2e/0x50
    ? intel_pmu_save_and_restart+0x50/0x50
    perf_event_nmi_handler+0x2e/0x50
    nmi_handle+0x6e/0x120
    default_do_nmi+0x3e/0x100
    do_nmi+0x102/0x160
    end_repeat_nmi+0x16/0x50
    ...
    ? native_write_msr+0x6/0x20
    ? native_write_msr+0x6/0x20

    intel_pmu_enable_event+0x1ce/0x1f0
    x86_pmu_start+0x78/0xa0
    x86_pmu_enable+0x252/0x310
    __perf_event_task_sched_in+0x181/0x190
    ? __switch_to_asm+0x41/0x70
    ? __switch_to_asm+0x35/0x70
    ? __switch_to_asm+0x41/0x70
    ? __switch_to_asm+0x35/0x70
    finish_task_switch+0x158/0x260
    __schedule+0x2f6/0x840
    ? hrtimer_start_range_ns+0x153/0x210
    schedule+0x32/0x80
    schedule_hrtimeout_range_clock+0x8a/0x100
    ? hrtimer_init+0x120/0x120
    ep_poll+0x2f7/0x3a0
    ? wake_up_q+0x60/0x60
    do_epoll_wait+0xa9/0xc0
    __x64_sys_epoll_wait+0x1a/0x20
    do_syscall_64+0x4e/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fdeb1e96c03
    ...
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: acme@kernel.org
    Cc: Josh Hunt
    Cc: bpuranda@akamai.com
    Cc: mingo@redhat.com
    Cc: jolsa@redhat.com
    Cc: tglx@linutronix.de
    Cc: namhyung@kernel.org
    Cc: alexander.shishkin@linux.intel.com
    Link: https://lkml.kernel.org/r/1566256411-18820-1-git-send-email-johunt@akamai.com

    Josh Hunt
     
  • ftrace does not use text_poke() for enabling trace functionality. It uses
    its own mechanism and flips the whole kernel text to RW and back to RO.

    The CPA rework removed a loop based check of 4k pages which tried to
    preserve a large page by checking each 4k page whether the change would
    actually cover all pages in the large page.

    This resulted in endless loops for nothing as in testing it turned out that
    it actually never preserved anything. Of course, testing failed to include
    ftrace, which is the one and only case which benefitted from the 4k loop.

    As a consequence enabling function tracing or ftrace based kprobes results
    in a full 4k split of the kernel text, which affects iTLB performance.

    The kernel RO protection is the only valid case where this can actually
    preserve large pages.

    All other static protections (RO data, data NX, PCI, BIOS) are truly
    static. So a conflict with those protections which results in a split
    should only ever happen when a change of memory next to a protected region
    is attempted. But these conflicts are rightfully splitting the large page
    to preserve the protected regions. In fact a change to the protected
    regions itself is a bug and is warned about.

    Add an exception for the static protection check for kernel text RO when
    the to-be-changed region spans a full large page, which allows preserving
    the large mappings. This also prevents the syslog from being spammed about CPA
    violations when ftrace is used.
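
    The "spans a full large page" condition is a simple range check. A minimal
    sketch, assuming 2 MiB large pages and a half-open range [start, end);
    covers_large_page() is a hypothetical helper, not the function used in
    the CPA code.

```c
#include <stdint.h>

/* Assumed large page size for this sketch: 2 MiB. */
#define LARGE_PAGE_SIZE 0x200000ULL

/* Nonzero if [start, end) covers the whole large page containing 'addr'. */
int covers_large_page(uint64_t start, uint64_t end, uint64_t addr)
{
	uint64_t lp_start = addr & ~(LARGE_PAGE_SIZE - 1);
	uint64_t lp_end = lp_start + LARGE_PAGE_SIZE;

	return start <= lp_start && end >= lp_end;
}
```

    Only when the check succeeds can the large mapping be kept; a change that
    touches part of the page still has to split it.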

    The exception needs to be removed once ftrace has switched over to text_poke()
    which avoids the whole issue.

    Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
    Reported-by: Song Liu
    Signed-off-by: Thomas Gleixner
    Tested-by: Song Liu
    Reviewed-by: Song Liu
    Acked-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282355340.1938@nanos.tec.linutronix.de

    Thomas Gleixner
     

28 Aug, 2019

3 commits

  • One of the very few warnings I have in the current build comes from
    arch/x86/boot/edd.c, where I get the following with a gcc9 build:

    arch/x86/boot/edd.c: In function ‘query_edd’:
    arch/x86/boot/edd.c:148:11: warning: taking address of packed member of ‘struct boot_params’ may result in an unaligned pointer value [-Waddress-of-packed-member]
    148 | mbrptr = boot_params.edd_mbr_sig_buffer;
    | ^~~~~~~~~~~

    This warning triggers because we throw away all the CFLAGS and then make
    a new set for REALMODE_CFLAGS, so the -Wno-address-of-packed-member we
    added in the following commit is not present:

    6f303d60534c ("gcc-9: silence 'address-of-packed-member' warning")

    The simplest solution for now is to adjust the warning for this version
    of CFLAGS as well, but it would definitely make sense to examine whether
    REALMODE_CFLAGS could be derived from CFLAGS, so that it picks up changes
    in the compiler flags environment automatically.

    Signed-off-by: Linus Torvalds
    Acked-by: Borislav Petkov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Linus Torvalds
     
  • Don't advance RIP or inject a single-step #DB if emulation signals a
    fault. This logic applies to all state updates that are conditional on
    clean retirement of the emulation instruction, e.g. updating RFLAGS was
    previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update
    EFLAGS on faulting emulation").

    Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with
    ctxt->_eip until emulation "retires" anyways. Skipping #DB injection
    fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to
    invalid state with EFLAGS.TF=1 would loop indefinitely due to emulation
    overwriting the #UD with #DB and thus restarting the bad SYSCALL over
    and over.

    Cc: Nadav Amit
    Cc: stable@vger.kernel.org
    Reported-by: Andy Lutomirski
    Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Radim Krčmář

    Sean Christopherson
     
  • If kvm_intel is loaded with nested=0 parameter an attempt to perform
    KVM_GET_SUPPORTED_HV_CPUID results in OOPS as nested_get_evmcs_version hook
    in kvm_x86_ops is NULL (we assign it in nested_vmx_hardware_setup() and
    this only happens in case nested is enabled).

    Check that kvm_x86_ops->nested_get_evmcs_version is not NULL before
    calling it. With this, we can remove the stub from svm as it is no
    longer needed.
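
    The guard is the usual pattern for an optional callback in an ops table.
    A minimal sketch with a hypothetical 'struct ops' and a made-up version
    value; it only illustrates the NULL check, not the KVM data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table with one optional hook. */
struct ops {
	uint16_t (*get_evmcs_version)(void);
};

/* Call the hook only if it was assigned; report 0 otherwise. */
uint16_t evmcs_version(const struct ops *ops)
{
	return ops->get_evmcs_version ? ops->get_evmcs_version() : 0;
}

/* Made-up hook implementation, used only for the demonstration. */
uint16_t fake_version(void)
{
	return 0x0101;
}

uint16_t version_with_hook(void)
{
	struct ops o = { fake_version };

	return evmcs_version(&o);
}

uint16_t version_without_hook(void)
{
	struct ops o = { NULL };

	return evmcs_version(&o);
}
```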

    Cc:
    Fixes: e2e871ab2f02 ("x86/kvm/hyper-v: Introduce nested_get_evmcs_version() helper")
    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Jim Mattson
    Signed-off-by: Radim Krčmář

    Vitaly Kuznetsov
     

27 Aug, 2019

3 commits

  • Gustavo noticed that 'new' can be left uninitialized if 'bios_start'
    happens to be less than or equal to 'entry->addr + entry->size'.

    Initialize the variable at the beginning of the iteration to the current value
    of 'bios_start'.
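
    The shape of the fix can be sketched for a single table entry. This is an
    illustration under assumptions: candidate() is a hypothetical helper, a
    4 KiB page size is assumed, and the surrounding loop over the E820 table
    is left out.

```c
/* Assumed page size for this sketch. */
#define PAGE_SZ 4096UL

/* Candidate placement for one entry described by 'addr' and 'size'. */
unsigned long candidate(unsigned long bios_start, unsigned long addr,
			unsigned long size)
{
	/* Initialized up front, so it is defined on every path. */
	unsigned long new = bios_start;

	if (bios_start > addr + size)
		new = addr + size;

	return new & ~(PAGE_SZ - 1);	/* round down to a page boundary */
}
```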

    Fixes: 0a46fff2f910 ("x86/boot/compressed/64: Fix boot on machines with broken E820 table")
    Reported-by: "Gustavo A. R. Silva"
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190826133326.7cxb4vbmiawffv2r@box

    Kirill A. Shutemov
     
  • Although APIC initialization will typically clear out the LDR before
    setting it, the APIC cleanup code should reset the LDR.

    This was discovered with a 32-bit KVM guest jumping into a kdump
    kernel. The stale bits in the LDR triggered a bug in the KVM APIC
    implementation which caused the destination mapping for VCPUs to be
    corrupted.

    Note that this isn't intended to paper over the KVM APIC bug. The kernel
    has to clear the LDR when resetting the APIC registers except when X2APIC
    is enabled.

    This lacks a Fixes tag because the failure to clear LDR goes way back into pre
    git history.

    [ tglx: Made x2apic_enabled a function call as required ]

    Signed-off-by: Bandan Das
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190826101513.5080-3-bsd@redhat.com

    Bandan Das
     
  • Legacy apic init uses bigsmp for smp systems with 8 and more CPUs. The
    bigsmp APIC implementation uses physical destination mode, but it
    nevertheless initializes LDR and DFR. The LDR even ends up incorrectly with
    multiple bits being set.

    This does not cause a functional problem because LDR and DFR are ignored
    when physical destination mode is active, but it triggered a problem on a
    32-bit KVM guest which jumps into a kdump kernel.

    The multiple bits set unearthed a bug in the KVM APIC implementation. The
    code which creates the logical destination map for VCPUs ignores the
    disabled state of the APIC and ends up overwriting an existing valid entry
    and as a result, APIC calibration hangs in the guest during kdump
    initialization.

    Remove the bogus LDR/DFR initialization.

    This is not intended to work around the KVM APIC bug. The LDR/DFR
    initialization is wrong on its own.

    The issue goes back into the pre git history. The fixes tag is the commit
    in the bitkeeper import which introduced bigsmp support in 2003.

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git

    Fixes: db7b9e9f26b8 ("[PATCH] Clustered APIC setup for >8 CPU systems")
    Suggested-by: Thomas Gleixner
    Signed-off-by: Bandan Das
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190826101513.5080-2-bsd@redhat.com

    Bandan Das
     

26 Aug, 2019

4 commits

  • 32-bit processes running on a 64-bit kernel are not always detected
    correctly, causing the process to crash when uretprobes are installed.

    The reason for the crash is that in_ia32_syscall() is used to determine the
    process's mode, which only works correctly when called from a syscall.

    In the case of uretprobes, however, the function is called from an exception
    and always returns 'false' on a 64-bit kernel. In consequence this leads to
    corruption of the process's return address.

    Fix this by using user_64bit_mode() instead of in_ia32_syscall(), which
    is correct in any situation.
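
    Why a wrong mode decision corrupts the return address can be illustrated
    with a toy stack. The sketch assumes the corruption comes from treating a
    4-byte return slot as 8 bytes wide; the stack layout, the values and the
    helper neighbour_survives() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy 32-bit stack: a 4-byte return slot followed by a 4-byte neighbour.
 * Writes a replacement address of 'width' bytes into the return slot and
 * reports whether the neighbouring slot still holds its original value.
 */
int neighbour_survives(size_t width)
{
	unsigned char stack[8];
	const uint32_t neighbour = 0xAABBCCDDu;
	const uint64_t trampoline = 0x1000;
	uint32_t after;

	memset(stack, 0, sizeof(stack));
	memcpy(stack + 4, &neighbour, sizeof(neighbour));
	memcpy(stack, &trampoline, width);	/* width is 4 or 8 */
	memcpy(&after, stack + 4, sizeof(after));

	return after == neighbour;
}
```

    A width of 4 leaves the neighbouring slot intact, while a width of 8
    overwrites it.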

    [ tglx: Add a comment and the following historical info ]

    This should have been detected by the rename which happened in commit

    abfb9498ee13 ("x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()")

    which states in the changelog:

    The is_ia32_task()/is_x32_task() function names are a big misnomer: they
    suggests that the compat-ness of a system call is a task property, which
    is not true, the compatness of a system call purely depends on how it
    was invoked through the system call layer.
    .....

    and then it went and blindly renamed every call site.

    Sadly enough this was already mentioned here:

    8faaed1b9f50 ("uprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and
    arch_uretprobe_hijack_return_addr()")

    where the changelog says:

    TODO: is_ia32_task() is not what we actually want, TS_COMPAT does
    not necessarily mean 32bit. Fortunately syscall-like insns can't be
    probed so it actually works, but it would be better to rename and
    use is_ia32_frame().

    and goes all the way back to:

    0326f5a94dde ("uprobes/core: Handle breakpoint and singlestep exceptions")

    Oh well. 7+ years until someone actually tried a uretprobe on a 32bit
    process on a 64bit kernel....

    Fixes: 0326f5a94dde ("uprobes/core: Handle breakpoint and singlestep exceptions")
    Signed-off-by: Sebastian Mayr
    Signed-off-by: Thomas Gleixner
    Cc: Masami Hiramatsu
    Cc: Dmitry Safonov
    Cc: Oleg Nesterov
    Cc: Srikar Dronamraju
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190728152617.7308-1-me@sam.st

    Sebastian Mayr
     
  • Rahul Tanwar reported the following bug on DT systems:

    > 'ioapic_dynirq_base' contains the virtual IRQ base number. Presently, it is
    > updated to the end of hardware IRQ numbers but this is done only when IOAPIC
    > configuration type is IOAPIC_DOMAIN_LEGACY or IOAPIC_DOMAIN_STRICT. There is
    > a third type IOAPIC_DOMAIN_DYNAMIC which applies when IOAPIC configuration
    > comes from devicetree.
    >
    > See dtb_add_ioapic() in arch/x86/kernel/devicetree.c
    >
    > In case of IOAPIC_DOMAIN_DYNAMIC (DT/OF based system), 'ioapic_dynirq_base'
    > remains to zero initialized value. This means that for OF based systems,
    > virtual IRQ base will get set to zero.

    Such systems will very likely not even boot.

    For DT enabled machines ioapic_dynirq_base is irrelevant and not
    updated, so simply map the IRQ base 1:1 instead.
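
    The resulting rule fits in one expression. A minimal sketch with the state
    passed in as a parameter; dynirq_lower_bound() is a hypothetical stand-in
    and leaves out any other checks the real function may perform.

```c
/*
 * If the dynamic IRQ base was never set up (it is still zero, as on
 * devicetree systems), hand back the bound the caller passed in.
 */
unsigned int dynirq_lower_bound(unsigned int dynirq_base, unsigned int from)
{
	return dynirq_base ? dynirq_base : from;
}
```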

    Reported-by: Rahul Tanwar
    Tested-by: Rahul Tanwar
    Tested-by: Andy Shevchenko
    Signed-off-by: Thomas Gleixner
    Cc: Alexander Shishkin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: alan@linux.intel.com
    Cc: bp@alien8.de
    Cc: cheol.yong.kim@intel.com
    Cc: qi-ming.wu@intel.com
    Cc: rahul.tanwar@intel.com
    Cc: rppt@linux.ibm.com
    Cc: tony.luck@intel.com
    Link: http://lkml.kernel.org/r/20190821081330.1187-1-rahul.tanwar@linux.intel.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Pull x86 fixes from Thomas Gleixner:
    "A few fixes for x86:

    - Fix a boot regression caused by the recent bootparam sanitizing
    change, which escaped the attention of all people who reviewed that
    code.

    - Address a boot problem on machines with broken E820 tables caused
    by an underflow which ended up placing the trampoline start at
    physical address 0.

    - Handle machines which do not advertise a legacy timer of any form,
    but need calibration of the local APIC timer gracefully by making
    the calibration routine independent from the tick interrupt. Marked
    for stable as well as there seems to be quite some new laptops
    rolled out which expose this.

    - Clear the RDRAND CPUID bit on AMD family 15h and 16h CPUs which are
    affected by broken firmware which does not initialize RDRAND
    correctly after resume. Add a command line parameter to override
    this for machines which either do not use suspend/resume or have a
    fixed BIOS. Unfortunately there is no way to detect this on boot,
    so the only safe decision is to turn it off by default.

    - Prevent RFLAGS from being clobbered in CALL_NOSPEC on 32bit which
    caused fast KVM instruction emulation to break.

    - Explain the Intel CPU model naming convention so that the repeating
    discussions come to an end"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386
    x86/boot: Fix boot regression caused by bootparam sanitizing
    x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
    x86/boot/compressed/64: Fix boot on machines with broken E820 table
    x86/apic: Handle missing global clockevent gracefully
    x86/cpu: Explain Intel model naming convention

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "Two small fixes for kprobes and perf:

    - Prevent a deadlock in kprobe_optimizer() caused by reverse lock
    ordering

    - Fix a comment typo"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    kprobes: Fix potential deadlock in kprobe_optimizer()
    perf/x86: Fix typo in comment

    Linus Torvalds
     

23 Aug, 2019

1 commit

  • Use 'lea' instead of 'add' when adjusting %rsp in CALL_NOSPEC so as to
    avoid clobbering flags.

    KVM's emulator makes indirect calls into a jump table of sorts, where
    the destination of the CALL_NOSPEC is a small blob of code that performs
    fast emulation by executing the target instruction with fixed operands.

    adcb_al_dl:
    0x000339f8 : adc %dl,%al
    0x000339fa : ret

    A major motivation for doing fast emulation is to leverage the CPU to
    handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is
    both an input and output to the target of CALL_NOSPEC. Clobbering flags
    results in all sorts of incorrect emulation, e.g. Jcc instructions often
    take the wrong path. Sans the nops...

    asm("push %[flags]; popf; " CALL_NOSPEC " ; pushf; pop %[flags]\n"
    0x0003595a : mov 0xc0(%ebx),%eax
    0x00035960 : mov 0x60(%ebx),%edx
    0x00035963 : mov 0x90(%ebx),%ecx
    0x00035969 : push %edi
    0x0003596a : popf
    0x0003596b : call *%esi
    0x000359a0 : pushf
    0x000359a1 : pop %edi
    0x000359a2 : mov %eax,0xc0(%ebx)
    0x000359b1 : mov %edx,0x60(%ebx)

    ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
    0x000359a8 : mov -0x10(%ebp),%eax
    0x000359ab : and $0x8d5,%edi
    0x000359b4 : and $0xfffff72a,%eax
    0x000359b9 : or %eax,%edi
    0x000359bd : mov %edi,0x4(%ebx)

    For the most part this has gone unnoticed as emulation of guest code
    that can trigger fast emulation is effectively limited to MMIO when
    running on modern hardware, and MMIO is rarely, if ever, accessed by
    instructions that affect or consume flags.

    Breakage is almost instantaneous when running with unrestricted guest
    disabled, in which case KVM must emulate all instructions when the guest
    has invalid state, e.g. when the guest is in Big Real Mode during early
    BIOS.

    Fixes: 776b043848fd2 ("x86/retpoline: Add initial retpoline support")
    Fixes: 1a29b5b7f347a ("KVM: x86: Make indirect calls in emulator speculation safe")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190822211122.27579-1-sean.j.christopherson@intel.com

    Sean Christopherson
     

22 Aug, 2019

2 commits

  • commit a90118c445cc ("x86/boot: Save fields explicitly, zero out everything
    else") had two errors:

    * It preserved boot_params.acpi_rsdp_addr, and
    * It failed to preserve boot_params.hdr

    Therefore, zero out acpi_rsdp_addr, and preserve hdr.

    Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else")
    Reported-by: Neil MacLeod
    Suggested-by: Thomas Gleixner
    Signed-off-by: John Hubbard
    Signed-off-by: Thomas Gleixner
    Tested-by: Neil MacLeod
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190821192513.20126-1-jhubbard@nvidia.com

    John Hubbard
     
  • Pull KVM fixes from Paolo Bonzini:
    "A couple bugfixes, and mostly selftests changes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    selftests/kvm: make platform_info_test pass on AMD
    Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"
    selftests: kvm: fix state save/load on processors without XSAVE
    selftests: kvm: fix vmx_set_nested_state_test
    selftests: kvm: provide common function to enable eVMCS
    selftests: kvm: do not try running the VM in vmx_set_nested_state_test
    KVM: x86: svm: remove redundant assignment of var new_entry
    MAINTAINERS: add KVM x86 reviewers
    MAINTAINERS: change list for KVM/s390
    kvm: x86: skip populating logical dest map if apic is not sw enabled

    Linus Torvalds
     

21 Aug, 2019

1 commit

  • This reverts commit 4e103134b862314dc2f2f18f2fb0ab972adc3f5f.
    Alex Williamson reported regressions with device assignment with
    this patch. Even though the bug is probably elsewhere and still
    latent, this is needed to fix the regression.

    Fixes: 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot", 2019-02-05)
    Reported-by: Alex Williamson
    Cc: stable@vger.kernel.org
    Cc: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

20 Aug, 2019

2 commits

  • There have been reports of RDRAND issues after resuming from suspend on
    some AMD family 15h and family 16h systems. This issue stems from a BIOS
    not performing the proper steps during resume to ensure RDRAND continues
    to function properly.

    RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
    reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
    support using CPUID, including the kernel, will believe that RDRAND is
    not supported.
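
    The two bit positions named above translate into plain masks. The sketch
    operates on values passed in by the caller and does not execute CPUID or
    access any MSR; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Bit positions taken from the text above. */
#define CPUID_ECX_RDRAND	(1u << 30)	/* CPUID Fn00000001_ECX[30] */
#define MSR_RDRAND_BIT		(1ULL << 62)	/* MSR C001_1004[62] */

/* Nonzero if a CPUID leaf 1 ECX value advertises RDRAND. */
int cpuid_reports_rdrand(uint32_t ecx)
{
	return (ecx & CPUID_ECX_RDRAND) != 0;
}

/* MSR value with bit 62 cleared, which hides RDRAND from CPUID. */
uint64_t msr_hide_rdrand(uint64_t msr)
{
	return msr & ~MSR_RDRAND_BIT;
}
```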

    Update the CPU initialization to clear the RDRAND CPUID bit for any family
    15h and 16h processor that supports RDRAND. If it is known that the family
    15h or family 16h system does not have an RDRAND resume issue or that the
    system will not be placed in suspend, the "rdrand=force" kernel parameter
    can be used to stop the clearing of the RDRAND CPUID bit.

    Additionally, update the suspend and resume path to save and restore the
    MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
    place after resuming from suspend.

    Note, that clearing the RDRAND CPUID bit does not prevent a processor
    that normally supports the RDRAND instruction from executing it. So any
    code that determined the support based on family and model won't #UD.

    Signed-off-by: Tom Lendacky
    Signed-off-by: Borislav Petkov
    Cc: Andrew Cooper
    Cc: Andrew Morton
    Cc: Chen Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Kees Cook
    Cc: "linux-doc@vger.kernel.org"
    Cc: "linux-pm@vger.kernel.org"
    Cc: Nathan Chancellor
    Cc: Paolo Bonzini
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc:
    Cc: Thomas Gleixner
    Cc: "x86@kernel.org"
    Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com

    Tom Lendacky
     
  • Pull networking fixes from David Miller:

    1) Fix jmp to 1st instruction in x64 JIT, from Alexei Starovoitov.

    2) Several kTLS fixes in mlx5 driver, from Tariq Toukan.

    3) Fix severe performance regression due to lack of SKB coalescing of
    fragments during local delivery, from Guillaume Nault.

    4) Error path memory leak in sch_taprio, from Ivan Khoronzhuk.

    5) Fix batched events in skbedit packet action, from Roman Mashak.

    6) Propagate VLAN TX offload to hw_enc_features in bond and team
    drivers, from Yue Haibing.

    7) RXRPC local endpoint refcounting fix and read after free in
    rxrpc_queue_local(), from David Howells.

    8) Fix endian bug in ibmveth multicast list handling, from Thomas
    Falcon.

    9) Oops, make nlmsg_parse() wrap around the correct function,
    __nlmsg_parse not __nla_parse(). Fix from David Ahern.

    10) Memleak in sctp_send_reset_streams(), from Zheng Bin.

    11) Fix memory leak in cxgb4, from Wenwen Wang.

    12) Yet another race in AF_PACKET, from Eric Dumazet.

    13) Fix false detection of retransmit failures in tipc, from Tuong
    Lien.

    14) Use after free in ravb_tstamp_skb, from Tho Vu.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
    ravb: Fix use-after-free ravb_tstamp_skb
    netfilter: nf_tables: map basechain priority to hardware priority
    net: sched: use major priority number as hardware priority
    wimax/i2400m: fix a memory leak bug
    net: cavium: fix driver name
    ibmvnic: Unmap DMA address of TX descriptor buffers after use
    bnxt_en: Fix to include flow direction in L2 key
    bnxt_en: Use correct src_fid to determine direction of the flow
    bnxt_en: Suppress HWRM errors for HWRM_NVM_GET_VARIABLE command
    bnxt_en: Fix handling FRAG_ERR when NVM_INSTALL_UPDATE cmd fails
    bnxt_en: Improve RX doorbell sequence.
    bnxt_en: Fix VNIC clearing logic for 57500 chips.
    net: kalmia: fix memory leaks
    cx82310_eth: fix a memory leak bug
    bnx2x: Fix VF's VLAN reconfiguration in reload.
    Bluetooth: Add debug setting for changing minimum encryption key size
    tipc: fix false detection of retransmit failures
    lan78xx: Fix memory leaks
    MAINTAINERS: r8169: Update path to the driver
    MAINTAINERS: PHY LIBRARY: Update files in the record
    ...

    Linus Torvalds
     

19 Aug, 2019

3 commits

  • BIOS on Samsung 500C Chromebook reports very rudimentary E820 table that
    consists of 2 entries:

    BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] usable
    BIOS-e820: [mem 0x00000000fffff000-0x00000000ffffffff] reserved

    It breaks logic in find_trampoline_placement(): bios_start lands on the
    end of the first 4k page and trampoline start gets placed below 0.

    Detect the underflow and don't touch bios_start in such cases. This makes
    the kernel ignore the E820 table on machines that don't have two usable
    pages below BIOS_START_MAX.
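
    The underflow check can be illustrated with a small sketch. It assumes
    a two-page trampoline and 4 KiB pages; the function and macro names are
    hypothetical, and the real find_trampoline_placement() walks the E820
    table rather than taking a single address.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE        UINT64_C(0x1000)
#define DEMO_TRAMPOLINE_SIZE  (2 * DEMO_PAGE_SIZE)  /* assumed: two pages */

/*
 * Hypothetical helper: compute where a trampoline ending at bios_start
 * would begin. Returns false instead of wrapping below address 0.
 */
static bool demo_trampoline_start(uint64_t bios_start, uint64_t *start)
{
    if (bios_start < DEMO_TRAMPOLINE_SIZE)
        return false;               /* subtraction would underflow */

    *start = bios_start - DEMO_TRAMPOLINE_SIZE;
    return true;
}
```

    With the E820 table quoted above, bios_start ends up at 0x1000, which
    is smaller than two pages, so the helper reports failure rather than
    producing a wrapped address.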

    Fixes: 1b3a62643660 ("x86/boot/compressed/64: Validate trampoline placement against E820")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: x86-ml
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=203463
    Link: https://lkml.kernel.org/r/20190813131654.24378-1-kirill.shutemov@linux.intel.com

    Kirill A. Shutemov
     
  • Some newer machines do not advertise legacy timers. The kernel can handle
    that situation if the TSC and the CPU frequency are enumerated by CPUID or
    MSRs and the CPU supports TSC deadline timer. If the CPU does not support
    TSC deadline timer the local APIC timer frequency has to be known as well.

    Some Ryzen machines do not advertise legacy timers, but there is no
    reliable way to determine the bus frequency which feeds the local APIC
    timer when the machine allows overclocking of that frequency.

    As there is no legacy timer, the local APIC timer calibration crashes due to
    a NULL pointer dereference when accessing the global clock event device,
    which is not installed.

    Switch the calibration loop to a non-interrupt-based one, which polls
    either TSC (if frequency is known) or jiffies. The latter requires a global
    clockevent. As the machines which do not have a global clockevent installed
    have a known TSC frequency this is a non-issue. For older machines where
    TSC frequency is not known, there is no known case where the legacy timers
    do not exist as that would have been reported long ago.
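
    The arithmetic behind a TSC-polled calibration can be sketched as
    follows. This is only an illustration under the assumption that the
    APIC timer count and the TSC were both sampled at the start and end of
    the same polling window; the function name and parameters are
    hypothetical and this is not the kernel's calibration code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive the APIC timer frequency from the number
 * of APIC timer ticks (apic_delta) observed while the TSC advanced by
 * tsc_delta cycles, given a known TSC frequency in Hz.
 * Returns 0 if the window was empty.
 */
static uint64_t demo_apic_timer_hz(uint64_t apic_delta, uint64_t tsc_delta,
                                   uint64_t tsc_hz)
{
    if (tsc_delta == 0)
        return 0;

    /* window length in seconds = tsc_delta / tsc_hz */
    return apic_delta * tsc_hz / tsc_delta;
}
```

    No interrupt is needed for this: the loop only has to read both
    counters until the TSC has advanced far enough.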

    Reported-by: Daniel Drake
    Reported-by: Jiri Slaby
    Signed-off-by: Thomas Gleixner
    Tested-by: Daniel Drake
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908091443030.21433@nanos.tec.linutronix.de
    Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1142926#c12

    Thomas Gleixner
     
  • No functional change.

    Signed-off-by: Su Yanjun
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1565945001-4413-1-git-send-email-suyj.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Su Yanjun
     

17 Aug, 2019

1 commit

  • Dave Hansen spelled out the rules in an e-mail:

    https://lkml.kernel.org/r/91eefbe4-e32b-d762-be4d-672ff915db47@intel.com

    Copy those right into the file to make it easy for
    people to find them.

    Suggested-by: Borislav Petkov
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: x86-ml
    Link: https://lkml.kernel.org/r/20190815224704.GA10025@agluck-desk2.amr.corp.intel.com

    Tony Luck
     

16 Aug, 2019

1 commit

  • Recent gcc compilers (gcc 9.1) generate warnings about an out of bounds
    memset, if the memset goes across several fields of a struct. This
    generated a couple of warnings on x86_64 builds in sanitize_boot_params().

    Fix this by explicitly saving the fields in struct boot_params
    that are intended to be preserved, and zeroing all the rest.
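
    The "save explicitly, zero everything else" pattern can be shown on a
    toy struct. The struct and field names below are hypothetical
    stand-ins, not the layout of struct boot_params, and the sketch does
    not claim to match how sanitize_boot_params() is written.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a struct with scattered fields to preserve. */
struct demo_params {
    uint8_t  scratch_a[8];
    uint32_t keep_me;
    uint8_t  scratch_b[8];
    uint16_t keep_me_too;
    uint8_t  scratch_c[4];
};

static void demo_sanitize(struct demo_params *p)
{
    /* Save the fields that must survive. */
    uint32_t saved_keep_me = p->keep_me;
    uint16_t saved_keep_me_too = p->keep_me_too;

    /* One memset over the whole object stays within its bounds. */
    memset(p, 0, sizeof(*p));

    /* Restore the preserved fields. */
    p->keep_me = saved_keep_me;
    p->keep_me_too = saved_keep_me_too;
}
```

    Because the memset covers exactly one object, the compiler has no
    reason to warn about writing past the end of an individual field.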

    [ tglx: Tagged for stable as it breaks the warning free build there as well ]

    Suggested-by: Thomas Gleixner
    Suggested-by: H. Peter Anvin
    Signed-off-by: John Hubbard
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190731054627.5627-2-jhubbard@nvidia.com

    John Hubbard
     

14 Aug, 2019

2 commits

    new_entry is assigned a new value on the next line, so the
    earlier assignment is redundant. Remove it.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Paolo Bonzini

    Miaohe Lin
     
    recalculate_apic_map does not sanitize ldr and it's possible that
    multiple bits are set. In that case, a previous valid entry
    can potentially be overwritten by an invalid one.

    This condition is hit when booting a 32 bit, >8 CPU, RHEL6 guest and then
    triggering a crash to boot a kdump kernel. This is the sequence of
    events:
    1. Linux boots in bigsmp mode and enables PhysFlat, however, it still
    writes to the LDR which probably will never be used.
    2. However, when booting into kdump, the stale LDR values remain as
    they are not cleared by the guest and there isn't an APIC reset.
    3. kdump boots with 1 cpu, and uses Logical Destination Mode but the
    logical map has been overwritten and points to an inactive vcpu.
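
    One way to picture the sanitizing step is to accept a logical
    destination mask only when exactly one bit is set. The sketch below is
    an assumption about the shape of such a check, with hypothetical names
    and a 16-slot map; it is not the code in recalculate_apic_map().

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SLOTS 16

/* A mask is usable only if exactly one bit is set. */
static bool demo_ldr_mask_valid(uint16_t mask)
{
    return mask != 0 && (mask & (mask - 1)) == 0;
}

/*
 * Hypothetical helper: record vcpu_id in the slot selected by mask.
 * Masks with zero or several bits set are rejected, so they cannot
 * overwrite an entry that was valid before.
 */
static bool demo_map_insert(int slots[DEMO_SLOTS], uint16_t mask, int vcpu_id)
{
    unsigned int bit = 0;

    if (!demo_ldr_mask_valid(mask))
        return false;

    while (!(mask & 1u)) {
        mask >>= 1;
        bit++;
    }
    slots[bit] = vcpu_id;
    return true;
}
```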

    Signed-off-by: Radim Krcmar
    Signed-off-by: Bandan Das
    Signed-off-by: Paolo Bonzini

    Radim Krcmar
     

13 Aug, 2019

2 commits

  • /home/tglx/work/kernel/linus/linux/arch/x86/math-emu/errors.c: In function ‘FPU_printall’:
    /home/tglx/work/kernel/linus/linux/arch/x86/math-emu/errors.c:187:9: warning: this statement may fall through [-Wimplicit-fallthrough=]
    tagi = FPU_Special(r);
    ~~~~~^~~~~~~~~~~~~~~~
    /home/tglx/work/kernel/linus/linux/arch/x86/math-emu/errors.c:188:3: note: here
    case TAG_Valid:
    ^~~~
    /home/tglx/work/kernel/linus/linux/arch/x86/math-emu/fpu_trig.c: In function ‘fyl2xp1’:
    /home/tglx/work/kernel/linus/linux/arch/x86/math-emu/fpu_trig.c:1353:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (denormal_operand() < 0)
    ^
    /home/tglx/work/kernel/linus/linux/arch/x86/math-emu/fpu_trig.c:1356:3: note: here
    case TAG_Zero:

    Remove the pointless 'break;' after 'continue;' while at it.
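
    For readers unfamiliar with the warning, here is a minimal sketch of an
    intentional fall-through marked with a comment, which GCC's
    -Wimplicit-fallthrough recognizes at its default level. The enum and
    function are hypothetical and unrelated to the math-emu sources.

```c
/* Hypothetical tags, loosely modelled on the cases named in the warning. */
enum demo_tag { DEMO_TAG_SPECIAL, DEMO_TAG_VALID, DEMO_TAG_ZERO };

static int demo_classify(enum demo_tag tag)
{
    int score = 0;

    switch (tag) {
    case DEMO_TAG_SPECIAL:
        score += 10;
        /* fall through */
    case DEMO_TAG_VALID:
        score += 1;
        break;
    case DEMO_TAG_ZERO:
        break;
    }
    return score;
}
```

    Without the comment, the compiler cannot tell whether the missing
    'break;' after the first case is deliberate.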

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Fix

    arch/x86/kernel/apic/probe_32.c: In function ‘default_setup_apic_routing’:
    arch/x86/kernel/apic/probe_32.c:146:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (!APIC_XAPIC(version)) {
    ^
    arch/x86/kernel/apic/probe_32.c:151:3: note: here
    case X86_VENDOR_HYGON:
    ^~~~

    for 32-bit builds.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190811154036.29805-1-bp@alien8.de

    Borislav Petkov
     

12 Aug, 2019

2 commits

  • Currently, failure of cpuhp_setup_state() is ignored and the syscore ops
    and the control interfaces can still be added even after the failure. But,
    this error handling will cause a few issues:

    1. The CPUs may have different values in the IA32_UMWAIT_CONTROL
    MSR because there is no way to roll back the control MSR on
    the CPUs which already set the MSR before the failure.

    2. If the sysfs interface is added successfully, there will be a mismatch
    between the global control value and the control MSR:
    - The interface shows the default global control value. But,
    the control MSR is not set to the value because the CPU online
    function, which is supposed to set the MSR to the value,
    is not installed.
    - If the sysadmin changes the global control value through
    the interface, the control MSR on all current online CPUs is
    set to the new value. But, the control MSR on newly onlined CPUs
    after the value change will not be set to the new value due to
    lack of the CPU online function.

    3. On resume from suspend/hibernation, the boot CPU restores the control
    MSR to the global control value through the syscore ops. But, the
    control MSR on all APs is not set due to lack of the CPU online
    function.

    To solve the issues and enforce consistent behavior on the failure
    of the CPU hotplug setup, make the following changes:

    1. Cache the original control MSR value which is configured by
    hardware or BIOS before kernel boot. This value is likely to
    be 0. But it could be a different number as well. Cache the
    control MSR only once before the MSR is changed.
    2. Add the CPU offline function so that the MSR is restored to the
    original control value on all CPUs on the failure.
    3. On the failure, exit from umwait_init() so that the syscore ops
    and the control interfaces are not added.
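
    The three changes above can be sketched as a small simulation. An
    array stands in for the per-CPU control MSR and a flag stands in for
    registering the syscore ops and sysfs interface; all names are
    hypothetical and the real code relies on the CPU hotplug framework
    rather than an explicit loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4

struct demo_state {
    uint32_t msr[DEMO_NR_CPUS];     /* stand-in for the per-CPU control MSR */
    uint32_t orig_value;            /* value found before the first change */
    bool     orig_cached;
    bool     interfaces_registered; /* stand-in for syscore ops and sysfs */
};

/* fail_at_cpu < 0 means no simulated failure. */
static int demo_init(struct demo_state *s, uint32_t new_value, int fail_at_cpu)
{
    int cpu;

    /* 1. Cache the original value only once, before any MSR is changed. */
    if (!s->orig_cached) {
        s->orig_value = s->msr[0];
        s->orig_cached = true;
    }

    for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
        if (cpu == fail_at_cpu)
            goto rollback;
        s->msr[cpu] = new_value;
    }

    s->interfaces_registered = true;
    return 0;

rollback:
    /* 2. Restore the original value on every CPU already modified. */
    while (cpu-- > 0)
        s->msr[cpu] = s->orig_value;

    /* 3. Bail out without registering any interface. */
    return -1;
}
```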

    Reported-by: Valdis Kletnieks
    Suggested-by: Thomas Gleixner
    Signed-off-by: Fenghua Yu
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1565401237-60936-1-git-send-email-fenghua.yu@intel.com

    Fenghua Yu
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2019-08-11

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) x64 JIT code generation fix for backward-jumps to 1st insn, from Alexei.

    2) Fix buggy multi-closing of BTF file descriptor in libbpf, from Andrii.

    3) Fix libbpf_num_possible_cpus() to make it thread safe, from Takshak.

    4) Fix bpftool to dump an error if pinning fails, from Jakub.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

11 Aug, 2019

1 commit

  • Pull x86 fixes from Thomas Gleixner:
    "A few fixes for x86:

    - Don't reset the carefully adjusted build flags for the purgatory
    and remove the unwanted flags instead. The 'reset all' approach led
    to build fails under certain circumstances.

    - Unbreak CLANG build of the purgatory by avoiding the builtin
    memcpy/memset implementations.

    - Address missing prototype warnings by including the proper header

    - Fix yet more fall-through issues"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/lib/cpu: Address missing prototypes warning
    x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS
    x86/purgatory: Do not use __builtin_memcpy and __builtin_memset
    x86: mtrr: cyrix: Mark expected switch fall-through
    x86/ptrace: Mark expected switch fall-through

    Linus Torvalds
     

10 Aug, 2019

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "Bugfixes (arm and x86) and cleanups"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    selftests: kvm: Adding config fragments
    KVM: selftests: Update gitignore file for latest changes
    kvm: remove unnecessary PageReserved check
    KVM: arm/arm64: vgic: Reevaluate level sensitive interrupts on enable
    KVM: arm: Don't write junk to CP15 registers on reset
    KVM: arm64: Don't write junk to sysregs on reset
    KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block
    x86: kvm: remove useless calls to kvm_para_available
    KVM: no need to check return value of debugfs_create functions
    KVM: remove kvm_arch_has_vcpu_debugfs()
    KVM: Fix leak vCPU's VMCS value into other pCPU
    KVM: Check preempted_in_kernel for involuntary preemption
    KVM: LAPIC: Don't need to wakeup vCPU twice after timer fire
    arm64: KVM: hyp: debug-sr: Mark expected switch fall-through
    KVM: arm64: Update kvm_arm_exception_class and esr_class_str for new EC
    KVM: arm: vgic-v3: Mark expected switch fall-through
    arm64: KVM: regmap: Fix unexpected switch fall-through
    KVM: arm/arm64: Introduce kvm_pmu_vcpu_init() to setup PMU counter index

    Linus Torvalds