20 Jan, 2021

2 commits

  • [ Upstream commit a8f7e08a81708920a928664a865208fdf451c49f ]

    The IN and OUT instructions with a port address as an immediate operand
    only use an 8-bit immediate (imm8). The current VC handler uses the
    entire 32-bit immediate value, but these instructions consume only the
    first byte.

    Cast the operand to a u8 to account for that.

    [ bp: Massage commit message. ]

    Fixes: 25189d08e5168 ("x86/sev-es: Add support for handling IOIO exceptions")
    Signed-off-by: Peter Gonda
    Signed-off-by: Borislav Petkov
    Acked-by: David Rientjes
    Link: https://lkml.kernel.org/r/20210105163311.221490-1-pgonda@google.com
    Signed-off-by: Sasha Levin

    Peter Gonda
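
    A minimal, self-contained C sketch of the truncation described in the
    entry above. The function name and layout are illustrative assumptions,
    not the kernel's VC-handler code; it only shows why the port number must
    come from the low byte of the 32-bit immediate.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint8_t u8;
    typedef uint32_t u32;

    /* Illustrative only: derive the port from an IN/OUT imm8 operand. */
    static u8 ioio_port_from_imm(u32 imm32)
    {
        /* Only the first byte of the immediate is architecturally used. */
        return (u8)imm32;
    }

    int main(void)
    {
        u32 imm = 0x12345670;  /* upper bytes are whatever the decoder left */

        printf("raw immediate: 0x%08x, port actually addressed: 0x%02x\n",
               (unsigned)imm, (unsigned)ioio_port_from_imm(imm));
        return 0;
    }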
     
  • commit ad0a6bad44758afa3b440c254a24999a0c7e35d5 upstream.

    We've observed crashes due to an empty cpu mask in
    hyperv_flush_tlb_others. Obviously the cpu mask in question is changed
    between the cpumask_empty call at the beginning of the function and when
    it is actually used later.

    One theory is that an interrupt comes in between and a code path ends up
    changing the mask. Move the check to after interrupts have been disabled
    to see if that fixes the issue.

    Signed-off-by: Wei Liu
    Cc: stable@kernel.org
    Link: https://lore.kernel.org/r/20210105175043.28325-1-wei.liu@kernel.org
    Reviewed-by: Michael Kelley
    Signed-off-by: Greg Kroah-Hartman

    Wei Liu
     

17 Jan, 2021

1 commit

  • commit 2ca408d9c749c32288bc28725f9f12ba30299e8f upstream.

    Commit

    121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")

    converted native x86-32 syscalls which take 64-bit arguments to use the
    compat handlers to allow conversion to passing args via pt_regs.
    sys_fanotify_mark() was however missed, as it has a general compat
    handler. Add a config option that will use the syscall wrapper that
    takes the split args for native 32-bit.

    [ bp: Fix typo in Kconfig help text. ]

    Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
    Reported-by: Paweł Jasiak
    Signed-off-by: Brian Gerst
    Signed-off-by: Borislav Petkov
    Acked-by: Jan Kara
    Acked-by: Andy Lutomirski
    Link: https://lkml.kernel.org/r/20201130223059.101286-1-brgerst@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Brian Gerst
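
    As a hedged illustration of the "split args" idea above: on native
    32-bit, a 64-bit syscall argument arrives as two 32-bit halves and must
    be reassembled by the wrapper. The helper below is an invented sketch,
    not the actual sys_fanotify_mark() wrapper, and which half comes first
    is ABI-specific.

    #include <stdint.h>
    #include <stdio.h>

    /* Reassemble a 64-bit mask passed as two 32-bit register-sized halves. */
    static uint64_t merge_u64(uint32_t hi, uint32_t lo)
    {
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint32_t mask_hi = 0x00000001, mask_lo = 0x00000008;

        printf("merged 64-bit mask: 0x%016llx\n",
               (unsigned long long)merge_u64(mask_hi, mask_lo));
        return 0;
    }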
     

13 Jan, 2021

9 commits

  • commit 2f80d502d627f30257ba7e3655e71c373b7d1a5a upstream.

    Since we know that e >= s, we can reassociate the left shift,
    changing the shifted number from 1 to 2 in exchange for
    decreasing the right hand side by 1.

    Reported-by: syzbot+e87846c48bf72bc85311@syzkaller.appspotmail.com
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
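
    A small self-contained check of the arithmetic in the entry above:
    because e >= s, ((1ULL << (e - s + 1)) - 1) << s equals
    ((2ULL << (e - s)) - 1) << s, and only the second form stays defined
    when the range spans all 64 bits. The function names are illustrative,
    not KVM's.

    #include <stdint.h>
    #include <stdio.h>

    /* Old form: undefined behaviour when e - s + 1 == 64 (shift == width). */
    static uint64_t mask_old(int s, int e)
    {
        return ((1ULL << (e - s + 1)) - 1) << s;
    }

    /* New form: for e >= s the largest shift is e - s <= 63, always defined. */
    static uint64_t mask_new(int s, int e)
    {
        return ((2ULL << (e - s)) - 1) << s;
    }

    int main(void)
    {
        /* For ordinary ranges both forms agree... */
        printf("old(12, 51) = 0x%016llx\n", (unsigned long long)mask_old(12, 51));
        printf("new(12, 51) = 0x%016llx\n", (unsigned long long)mask_new(12, 51));

        /* ...but only the new form is well defined for the full 0..63 range. */
        printf("new(0, 63)  = 0x%016llx\n", (unsigned long long)mask_new(0, 63));
        return 0;
    }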
     
  • commit cb7f4a8b1fb426a175d1708f05581939c61329d4 upstream.

    In mtrr_type_lookup(), if the input memory address region is not in the
    MTRR, over 4GB, and not over the top of memory, a write-back attribute
    is returned. These condition checks are for ensuring the input memory
    address region is actually mapped to the physical memory.

    However, if the end address is exactly aligned with the top of memory,
    the condition check treats the address as being over the top of memory,
    and the write-back attribute is not returned.

    This hits a real use case with NVDIMM: the nd_pmem module tries to map
    NVDIMMs as cacheable memory when NVDIMMs are connected. If an NVDIMM is
    the last of the DIMMs, the performance of this NVDIMM becomes very low
    since it is aligned with the top of memory and its memory type is
    uncached-minus.

    Move the input end address change to inclusive up into
    mtrr_type_lookup(), before checking for the top of memory in either
    mtrr_type_lookup_{variable,fixed}() helpers.

    [ bp: Massage commit message. ]

    Fixes: 0cc705f56e40 ("x86/mm/mtrr: Clean up mtrr_type_lookup()")
    Signed-off-by: Ying-Tsun Huang
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20201215070721.4349-1-ying-tsun.huang@amd.com
    Signed-off-by: Greg Kroah-Hartman

    Ying-Tsun Huang
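
    A hedged sketch of the exclusive-versus-inclusive end-address issue
    described above; top_of_mem and the region values are invented, and this
    is not the mtrr_type_lookup() code itself.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t top_of_mem = 0x100000000ULL;   /* first address above RAM */
        uint64_t start      = 0x0ff000000ULL;   /* region ending exactly at the top */
        uint64_t size       = 0x001000000ULL;

        uint64_t end_excl = start + size;       /* one past the last byte */
        uint64_t end_incl = start + size - 1;   /* the last byte itself   */

        /* With an exclusive end, a region ending at the top looks out of range. */
        bool beyond_excl = end_excl > top_of_mem - 1;
        /* With an inclusive end, the same region is correctly within memory.   */
        bool beyond_incl = end_incl > top_of_mem - 1;

        printf("exclusive end says beyond top of memory: %d\n", beyond_excl);
        printf("inclusive end says beyond top of memory: %d\n", beyond_incl);
        return 0;
    }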
     
  • commit a0195f314a25582b38993bf30db11c300f4f4611 upstream.

    Shakeel Butt reported in [1] that a user can request a task to be moved
    to a resource group even if the task is already in the group. The move
    operation just wastes time, and it can be costly since it may send an
    IPI to a different CPU.

    Add a sanity check to ensure that the move operation only happens when
    the task is not already in the resource group.

    [1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/

    Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
    Reported-by: Shakeel Butt
    Signed-off-by: Fenghua Yu
    Signed-off-by: Reinette Chatre
    Signed-off-by: Borislav Petkov
    Reviewed-by: Tony Luck
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/962ede65d8e95be793cb61102cca37f7bb018e66.1608243147.git.reinette.chatre@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Fenghua Yu
     
  • commit ae28d1aae48a1258bd09a6f707ebb4231d79a761 upstream.

    Currently, when moving a task to a resource group the PQR_ASSOC MSR is
    updated with the new closid and rmid in an added task callback. If the
    task is running, the work is run as soon as possible. If the task is not
    running, the work is executed later in the kernel exit path when the
    kernel returns to the task again.

    Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
    is running on is the right thing to do. Queueing work for a task that is
    not running is unnecessary (the PQR_ASSOC MSR is already updated when
    the task is scheduled in) and causes system resource waste because of
    the way it is implemented: work to update the PQR_ASSOC register is
    queued every time the user writes a task id to the "tasks" file, even if
    the task already belongs to the resource group.

    This can result in multiple pending work items associated with a single
    task even though they are all identical and only a single update with
    the most recent values is needed. Specifically, if a task is moved
    between different resource groups while it is sleeping, only the last
    move is relevant, yet a work item is queued during each move.

    This unnecessary queueing of work items could result in significant
    system resource waste, especially on tasks sleeping for a long time.
    For example, as demonstrated by Shakeel Butt in [1] writing the same
    task id to the "tasks" file can quickly consume significant memory. The
    same problem (wasted system resources) occurs when moving a task between
    different resource groups.

    As pointed out by Valentin Schneider in [2] there is an additional issue
    with the way in which the queueing of work is done in that the task_struct
    update is currently done after the work is queued, resulting in a race with
    the register update possibly done before the data needed by the update is
    available.

    To solve these issues, update the PQR_ASSOC MSR in a synchronous way
    right after the new closid and rmid are ready during the task movement,
    only if the task is running. If a moved task is not running, nothing
    is done since the PQR_ASSOC MSR will be updated the next time the task
    is scheduled. This is the same approach used to update the register when
    tasks are moved as part of resource group removal.

    [1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
    [2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com

    [ bp: Massage commit message and drop the two update_task_closid_rmid()
    variants. ]

    Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
    Reported-by: Shakeel Butt
    Reported-by: Valentin Schneider
    Signed-off-by: Fenghua Yu
    Signed-off-by: Reinette Chatre
    Signed-off-by: Borislav Petkov
    Reviewed-by: Tony Luck
    Reviewed-by: James Morse
    Reviewed-by: Valentin Schneider
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/17aa2fb38fc12ce7bb710106b3e7c7b45acb9e94.1608243147.git.reinette.chatre@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Fenghua Yu
     
  • commit a889ea54b3daa63ee1463dc19ed699407d61458b upstream.

    Many TDP MMU functions which need to perform some action on all TDP MMU
    roots hold a reference on that root so that they can safely drop the MMU
    lock in order to yield to other threads. However, when releasing the
    reference on the root, there is a bug: the root will not be freed even
    if its reference count (root_count) is reduced to 0.

    To simplify acquiring and releasing references on TDP MMU root pages, and
    to ensure that these roots are properly freed, move the get/put operations
    into another TDP MMU root iterator macro.

    Moving the get/put operations into an iterator macro also helps
    simplify control flow when a root does need to be freed. Note that using
    the list_for_each_entry_safe macro would not have been appropriate in
    this situation because it could keep a pointer to the next root across
    an MMU lock release + reacquire, during which time that root could be
    freed.

    Reported-by: Maciej S. Szmigiero
    Suggested-by: Paolo Bonzini
    Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
    Fixes: 063afacd8730 ("kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU")
    Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
    Fixes: 14881998566d ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU")
    Signed-off-by: Ben Gardon
    Message-Id:
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Ben Gardon
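
    The "move the get/put into the iterator" idea can be shown with an
    ordinary refcounted list in userspace. Everything below (struct root,
    get_root(), put_root(), the macro name) is an invented stand-in for the
    TDP MMU code and sketches only the pattern.

    #include <stddef.h>
    #include <stdio.h>

    struct root {
        int id;
        int refcount;
        struct root *next;
    };

    static void get_root(struct root *r)
    {
        r->refcount++;
    }

    static void put_root(struct root *r)
    {
        /* A real implementation would free the root when the count hits 0. */
        r->refcount--;
    }

    static struct root *first_root(struct root *head)
    {
        if (head)
            get_root(head);
        return head;
    }

    /* Take a reference on the next element before dropping the current one. */
    static struct root *next_root(struct root *cur)
    {
        struct root *next = cur->next;

        if (next)
            get_root(next);
        put_root(cur);
        return next;
    }

    /*
     * The loop body always runs with a reference held on 'r', so it may drop
     * and reacquire an outer lock without 'r' disappearing underneath it; the
     * reference is released automatically when advancing.
     */
    #define for_each_root_ref(r, head) \
        for ((r) = first_root(head); (r); (r) = next_root(r))

    int main(void)
    {
        struct root c = { 3, 0, NULL }, b = { 2, 0, &c }, a = { 1, 0, &b };
        struct root *r;

        for_each_root_ref(r, &a)
            printf("visiting root %d (refcount %d)\n", r->id, r->refcount);
        return 0;
    }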
     
  • commit 39b4d43e6003cee51cd119596d3c33d0449eb44c upstream.

    Get the so called "root" level from the low level shadow page table
    walkers instead of manually attempting to calculate it higher up the
    stack, e.g. in get_mmio_spte(). When KVM is using PAE shadow paging,
    the starting level of the walk, from the caller's perspective, is not
    the CR3 root but rather the PDPTR "root". Checking for reserved bits
    from the CR3 root causes get_mmio_spte() to consume uninitialized stack
    data due to indexing into sptes[] for a level that was not filled by
    get_walk(). This can result in false positives and/or negatives
    depending on what garbage happens to be on the stack.

    Opportunistically nuke a few extra newlines.

    Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
    Reported-by: Richard Herbert
    Cc: Ben Gardon
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 2aa078932ff6c66bf10cc5b3144440dbfa7d813d upstream.

    Return -1 from the get_walk() helpers if the shadow walk doesn't fill at
    least one spte, which can theoretically happen if the walk hits a
    not-present PDPTR. Returning the root level in such a case will cause
    get_mmio_spte() to return garbage (uninitialized stack data). In
    practice, such a scenario should be impossible as KVM shouldn't get a
    reserved-bit page fault with a not-present PDPTR.

    Note, using mmu->root_level in get_walk() is wrong for other reasons,
    too, but that's now a moot point.

    Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
    Cc: Ben Gardon
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
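
    A hedged userspace sketch of the "return how many levels were filled and
    check it" pattern from the two entries above; get_walk() and sptes[] here
    are stand-ins, not KVM's functions.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_LEVELS 5

    /*
     * Stand-in for a page-table walk: fills sptes[] from the leaf upward and
     * returns the highest index written, or -1 if nothing was filled (e.g. a
     * not-present intermediate entry stopped the walk immediately).
     */
    static int get_walk(const uint64_t *table, int present_levels, uint64_t *sptes)
    {
        int level, filled = -1;

        for (level = 0; level < present_levels && level < MAX_LEVELS; level++) {
            sptes[level] = table[level];
            filled = level;
        }
        return filled;
    }

    int main(void)
    {
        uint64_t table[MAX_LEVELS] = { 0x1111, 0x2222, 0x3333, 0x4444, 0x5555 };
        uint64_t sptes[MAX_LEVELS];             /* deliberately uninitialized */
        int leaf = get_walk(table, 0, sptes);   /* a walk that filled nothing */

        if (leaf < 0) {
            /* Without this check we would index sptes[] with stack garbage. */
            printf("walk filled no levels, bailing out\n");
            return 0;
        }
        printf("top filled level %d holds 0x%llx\n",
               leaf, (unsigned long long)sptes[leaf]);
        return 0;
    }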
     
  • commit d1c5246e08eb64991001d97a3bd119c93edbc79a upstream.

    Commit

    28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")

    introduced a new location where a pmd was released, but neglected to
    run the pmd page destructor. In fact, this happened previously for a
    different pmd release path and was fixed by commit:

    c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").

    This issue was hidden until recently because the failure mode is silent,
    but commit:

    b2b29d6d0119 ("mm: account PMD tables like PTE tables")

    turns the failure mode into this signature:

    BUG: Bad page state in process lt-pmem-ns pfn:15943d
    page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d
    flags: 0xaffff800000000()
    raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000
    raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000
    page dumped because: nonzero mapcount
    [..]
    dump_stack+0x8b/0xb0
    bad_page.cold+0x63/0x94
    free_pcp_prepare+0x224/0x270
    free_unref_page+0x18/0xd0
    pud_free_pmd_page+0x146/0x160
    ioremap_pud_range+0xe3/0x350
    ioremap_page_range+0x108/0x160
    __ioremap_caller.constprop.0+0x174/0x2b0
    ? memremap+0x7a/0x110
    memremap+0x7a/0x110
    devm_memremap+0x53/0xa0
    pmem_attach_disk+0x4ed/0x530 [nd_pmem]
    ? __devm_release_region+0x52/0x80
    nvdimm_bus_probe+0x85/0x210 [libnvdimm]

    Given this is a repeat occurrence it seemed prudent to look for other
    places where this destructor might be missing and whether a better
    helper is needed. try_to_free_pmd_page() looks like a candidate, but
    testing with setting up and tearing down pmd mappings via the dax unit
    tests is thus far not triggering the failure.

    As for a better helper, pmd_free() is close, but it is a messy fit
    due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call
    paravirt_tlb_remove_table() instead of free_page(), so open-coded
    pgtable_pmd_page_dtor() seems the best way forward for now.

    Debugged together with Matthew Wilcox.

    Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
    Signed-off-by: Dan Williams
    Signed-off-by: Borislav Petkov
    Tested-by: Yi Zhang
    Acked-by: Peter Zijlstra (Intel)
    Cc:
    Link: https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • [ Upstream commit 87dbc209ea04645fd2351981f09eff5d23f8e2e9 ]

    Make <asm/local64.h> mandatory in include/asm-generic/Kbuild and
    remove all arch/*/include/asm/local64.h arch-specific files since they
    only #include <asm-generic/local64.h>.

    This fixes build errors on arch/c6x/ and arch/nios2/ for
    block/blk-iocost.c.

    Build-tested on 21 of 25 arch-es. (tools problems on the others)

    Yes, we could even rename <asm-generic/local64.h> to <linux/local64.h>
    and change all #includes to use <linux/local64.h> instead.

    Link: https://lkml.kernel.org/r/20201227024446.17018-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap
    Suggested-by: Christoph Hellwig
    Reviewed-by: Masahiro Yamada
    Cc: Jens Axboe
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Peter Zijlstra
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Randy Dunlap
     

30 Dec, 2020

11 commits

  • [ Upstream commit 028c221ed1904af9ac3c5162ee98f48966de6b3d ]

    AMD systems provide a "NodeId" value that represents a global ID
    indicating to which "Node" a logical CPU belongs. The "Node" is a
    physical structure equivalent to a Die, and it should not be confused
    with logical structures like NUMA nodes. Logical nodes can be adjusted
    based on firmware or other settings whereas the physical nodes/dies are
    fixed based on hardware topology.

    The NodeId value can be used when a physical ID is needed by software.

    Save the AMD NodeId to struct cpuinfo_x86.cpu_die_id. Use the value
    from CPUID or MSR as appropriate. Default to phys_proc_id otherwise.
    Do so for both AMD and Hygon systems.

    Drop the node_id parameter from cacheinfo_*_init_llc_id() as it is no
    longer needed.

    Update the x86 topology documentation.

    Suggested-by: Borislav Petkov
    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20201109210659.754018-2-Yazen.Ghannam@amd.com
    Signed-off-by: Sasha Levin

    Yazen Ghannam
     
  • commit 9d4747d02376aeb8de38afa25430de79129c5799 upstream.

    When both KVM support and the CCP driver are built into the kernel instead
    of as modules, KVM initialization can happen before CCP initialization. As
    a result, sev_platform_status() will return a failure when it is called
    from sev_hardware_setup(), when this isn't really an error condition.

    Since sev_platform_status() doesn't need to be called at this time anyway,
    remove the invocation from sev_hardware_setup().

    Signed-off-by: Tom Lendacky
    Message-Id:
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Tom Lendacky
     
  • commit 39485ed95d6b83b62fa75c06c2c4d33992e0d971 upstream.

    Until commit e7c587da1252 ("x86/speculation: Use synthetic bits for
    IBRS/IBPB/STIBP"), KVM was testing both Intel and AMD CPUID bits before
    allowing the guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.
    Testing only Intel bits on VMX processors, or only AMD bits on SVM
    processors, fails if the guests are created with the "opposite" vendor
    as the host.

    While at it, also tweak the host CPU check to use the vendor-agnostic
    feature bit X86_FEATURE_IBPB, since we only care about the availability
    of the MSR on the host here and not about specific CPUID bits.

    Fixes: e7c587da1252 ("x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP")
    Cc: stable@vger.kernel.org
    Reported-by: Denis V. Lunev
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit f8129cd958b395575e5543ce25a8434874b04d3a upstream.

    The cycle count of a timed LBR is always 1 in perf record -D.

    The cycle count is stored in the first 16 bits of the IA32_LBR_x_INFO
    register, but get_lbr_cycles() returns a boolean type.

    Use u16 to replace the Boolean type.

    Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR")
    Reported-by: Stephane Eranian
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20201125213720.15692-2-kan.liang@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Kan Liang
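
    A tiny self-contained demonstration of the bool-versus-u16 truncation the
    fix above addresses. The function names echo the commit, but the snippet
    is illustrative rather than the perf driver code.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t u16;
    typedef uint64_t u64;

    /* Buggy shape: a bool return collapses any nonzero cycle count to 1. */
    static bool get_lbr_cycles_as_bool(u64 info)
    {
        return info & 0xffff;   /* cycle count lives in the low 16 bits */
    }

    /* Fixed shape: a u16 return preserves the full cycle count. */
    static u16 get_lbr_cycles_as_u16(u64 info)
    {
        return info & 0xffff;
    }

    int main(void)
    {
        u64 info = 0x1a4;       /* 420 cycles in the low 16 bits */

        printf("as bool: %u, as u16: %u\n",
               (unsigned)get_lbr_cycles_as_bool(info),
               (unsigned)get_lbr_cycles_as_u16(info));
        return 0;
    }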
     
  • commit 46b72e1bf4fc571da0c29c6fb3e5b2a2107a4c26 upstream.

    According to the event list from icelake_core_v1.09.json, the encoding
    of the RTM_RETIRED.ABORTED event on Ice Lake should be,
    "EventCode": "0xc9",
    "UMask": "0x04",
    "EventName": "RTM_RETIRED.ABORTED",

    Correct the wrong encoding.

    Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support")
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20201125213720.15692-1-kan.liang@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Kan Liang
     
  • commit 306e3e91edf1c6739a55312edd110d298ff498dd upstream.

    The event CYCLE_ACTIVITY.STALLS_MEM_ANY (0x14a3) should be available on
    all 8 GP counters on ICL, but it's only scheduled on the first four
    counters due to the current ICL constraint table.

    Add a line for the CYCLE_ACTIVITY.STALLS_MEM_ANY event in the ICL
    constraint table.
    Correct the comments for the CYCLE_ACTIVITY.CYCLES_MEM_ANY event.

    Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support")
    Reported-by: Andi Kleen
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20201019164529.32154-1-kan.liang@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Kan Liang
     
  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

    CPU 0                                    CPU 1
    get_user_pages_fast()
     internal_get_user_pages_fast()
                                             copy_page_range()
                                               pte_alloc_map_lock()
                                                 copy_present_page()
                                                   atomic_read(has_pinned) == 0
                                                   page_maybe_dma_pinned() == false
     atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                                   pte = pte_wrprotect(pte)
                                                   set_pte_at();
                                               pte_unmap_unlock()
     // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()").

    Instead, wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
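
    Below is a hedged, userspace-only sketch of the seqcount write/read-retry
    pattern the fix relies on, using C11 atomics rather than the kernel's
    seqcount_t API; the names are invented and the memory ordering is kept
    deliberately conservative (sequentially consistent).

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_uint write_protect_seq;   /* even = idle, odd = fork copying */

    /* Writer side: bracket the page-table copy, as copy_page_range() would. */
    static void write_seq_begin(void) { atomic_fetch_add(&write_protect_seq, 1); }
    static void write_seq_end(void)   { atomic_fetch_add(&write_protect_seq, 1); }

    /* Reader side: sample the counter, do the lockless walk, then recheck. */
    static unsigned int read_seq_begin(void)
    {
        unsigned int s;

        while ((s = atomic_load(&write_protect_seq)) & 1)
            ;   /* a writer is mid-copy, wait for an even value */
        return s;
    }

    static bool read_seq_retry(unsigned int start)
    {
        return atomic_load(&write_protect_seq) != start;
    }

    int main(void)
    {
        unsigned int seq = read_seq_begin();

        /* ... a lockless fast-GUP style walk would run here ... */
        write_seq_begin();      /* pretend a fork raced with the walk */
        write_seq_end();

        if (read_seq_retry(seq))
            printf("collision detected: fall back to the slow path\n");
        else
            printf("no writer raced with the walk\n");
        return 0;
    }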
     
  • [ Upstream commit 78ff2733ff352175eb7f4418a34654346e1b6cd2 ]

    Restore the BTF flag if single-stepping causes a page fault and is
    cancelled.

    Normally, the BTF flag is restored when single-stepping completes
    (in resume_execution()). However, if a page fault happens on the
    single-stepped instruction, the fault handler is invoked and
    single-stepping is cancelled. Thus, the BTF flag is not
    restored.

    Fixes: 1ecc798c6764 ("x86: debugctlmsr kprobes")
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/160389546985.106936.12727996109376240993.stgit@devnote2
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit 15af36596ae305aefc8c502c2d3e8c58221709eb ]

    Commit

    c9c6d216ed28 ("x86/mce: Rename "first" function as "early"")

    changed the enumeration of MCE notifier priorities. Correct the check
    for notifier priorities to cover the new range.

    [ bp: Rewrite commit message, remove superfluous brackets in
    conditional. ]

    Fixes: c9c6d216ed28 ("x86/mce: Rename "first" function as "early"")
    Signed-off-by: Zhen Lei
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20201106141216.2062-2-thunder.leizhen@huawei.com
    Signed-off-by: Sasha Levin

    Zhen Lei
     
  • [ Upstream commit 26573a97746c7a99f394f9d398ce91a8853b3b89 ]

    Currently, Linux as a hypervisor guest will enable x2apic only if there are
    no CPUs present at boot time with an APIC ID above 255.

    Hotplugging a CPU later with a higher APIC ID would result in a CPU which
    cannot be targeted by external interrupts.

    Add a filter in x2apic_apic_id_valid() which can be used to prevent such
    CPUs from coming online, and allow x2apic to be enabled even if they are
    present at boot time.

    Fixes: ce69a784504 ("x86/apic: Enable x2APIC without interrupt remapping under KVM")
    Signed-off-by: David Woodhouse
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20201024213535.443185-2-dwmw2@infradead.org
    Signed-off-by: Sasha Levin

    David Woodhouse
     
  • [ Upstream commit 1fcd009102ee02e217f2e7635ab65517d785da8e ]

    Commit

    ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support")

    added ident_p4d_init() to support 5-level paging, but this function
    doesn't check and return errors from ident_pud_init().

    For example, the decompressor stub uses this code to create an identity
    mapping. If it runs out of pages while trying to allocate a PMD
    pagetable, the error is currently ignored.

    Fix this to propagate errors.

    [ bp: Space out statements for better readability. ]

    Fixes: ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support")
    Signed-off-by: Arvind Sankar
    Signed-off-by: Borislav Petkov
    Reviewed-by: Joerg Roedel
    Acked-by: Kirill A. Shutemov
    Link: https://lkml.kernel.org/r/20201027230648.1885111-1-nivedita@alum.mit.edu
    Signed-off-by: Sasha Levin

    Arvind Sankar
     

26 Dec, 2020

1 commit

  • commit e14fd4ba8fb47fcf5f244366ec01ae94490cd86a upstream.

    When a split lock is detected always make sure to disable interrupts
    before returning from the trap handler.

    The kernel exit code assumes that all exits run with interrupts
    disabled, otherwise the SWAPGS sequence can race against interrupts and
    cause recursing page faults and later panics.

    The problem will only happen on CPUs with split lock disable
    functionality, so Icelake Server, Tiger Lake, Snow Ridge, Jacobsville.

    Fixes: ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code")
    Fixes: bce9b042ec73 ("x86/traps: Disable interrupts in exc_aligment_check()") # v5.8+
    Signed-off-by: Andi Kleen
    Cc: Peter Zijlstra
    Cc: Fenghua Yu
    Cc: Tony Luck
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

14 Dec, 2020

1 commit

  • Pull x86 fixes from Thomas Gleixner:
    "A set of x86 and membarrier fixes:

    - Correct a few problems in the x86 and the generic membarrier
    implementation. Small corrections for assumptions about visibility
    which have turned out not to be true.

    - Make the PAT bits for memory encryption correct vs 4K and 2M/1G
    page table entries as they are at a different location.

    - Fix a concurrency issue in the local bandwidth readout of
    resource control leading to incorrect values

    - Fix the ordering of allocating a vector for an interrupt. The order
    failed to respect the provided cpumask when the first attempt at
    allocating a node-local vector in the mask fails. It then tries the node
    instead of trying the full provided mask first. This leads to
    erroneous error messages and breaks the (user) supplied affinity
    request. Reorder it.

    - Make the INT3 padding detection in optprobe work correctly"

    * tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/kprobes: Fix optprobe to detect INT3 padding correctly
    x86/apic/vector: Fix ordering in vector assignment
    x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
    x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
    membarrier: Execute SYNC_CORE on the calling thread
    membarrier: Explicitly sync remote cores when SYNC_CORE is requested
    membarrier: Add an actual barrier before rseq_preempt()
    x86/membarrier: Get rid of a dubious optimization

    Linus Torvalds
     

13 Dec, 2020

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "Bugfixes for ARM, x86 and tools"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    tools/kvm_stat: Exempt time-based counters
    KVM: mmu: Fix SPTE encoding of MMIO generation upper half
    kvm: x86/mmu: Use cpuid to determine max gfn
    kvm: svm: de-allocate svm_cpu_data for all cpus in svm_cpu_uninit()
    selftests: kvm/set_memory_region_test: Fix race in move region test
    KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
    KVM: arm64: Fix handling of merging tables into a block entry
    KVM: arm64: Fix memory leak on stage2 update of a valid PTE

    Linus Torvalds
     

12 Dec, 2020

2 commits

  • Commit

    7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")

    changed the padding bytes between functions from NOP to INT3. However,
    when optprobe decodes a target function it finds INT3 and gives up the
    jump optimization.

    Instead of giving up any INT3 detection, check whether the rest of the
    bytes to the end of the function are INT3. If all of them are INT3,
    those come from the linker. In that case, continue the optprobe jump
    optimization.

    [ bp: Massage commit message. ]

    Fixes: 7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
    Reported-by: Adam Zabrocki
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Reviewed-by: Steven Rostedt (VMware)
    Reviewed-by: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/160767025681.3880685.16021570341428835411.stgit@devnote2

    Masami Hiramatsu
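
    A hedged sketch of the "are the remaining bytes all INT3 padding?" test
    described above, done over a plain byte buffer in userspace rather than
    kernel text; the helper name is invented.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define INT3_OPCODE 0xcc

    /*
     * Return true if every byte from 'pos' to the end of the function body
     * is INT3, i.e. it is linker fill rather than real instructions.
     */
    static bool rest_is_int3_padding(const uint8_t *func, size_t len, size_t pos)
    {
        for (; pos < len; pos++)
            if (func[pos] != INT3_OPCODE)
                return false;
        return true;
    }

    int main(void)
    {
        /* ret (0xc3) followed by linker padding up to the next function. */
        const uint8_t body[] = { 0x55, 0x89, 0xe5, 0x5d, 0xc3, 0xcc, 0xcc, 0xcc };

        printf("only padding after the ret:  %d\n",
               rest_is_int3_padding(body, sizeof(body), 5));
        printf("only padding from the start: %d\n",
               rest_is_int3_padding(body, sizeof(body), 0));
        return 0;
    }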
     
  • Commit cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
    cleaned up the computation of MMIO generation SPTE masks, however it
    introduced a bug in how the upper part was encoded:
    SPTE bits 52-61 were supposed to contain bits 10-19 of the current
    generation number, however a missing shift encoded bits 1-10 there instead
    (mostly duplicating the lower part of the encoded generation number that
    then consisted of bits 1-9).

    In the meantime, the upper part was shrunk by one bit and moved by
    subsequent commits to become an upper half of the encoded generation number
    (bits 9-17 of bits 0-17 encoded in a SPTE).

    In addition to the above, commit 56871d444bc4 ("KVM: x86: fix overlap between SPTE_MMIO_MASK and generation")
    has changed the SPTE bit range assigned to encode the generation number and
    the total number of bits encoded but did not update them in the comment
    attached to their defines, nor in the KVM MMU doc.
    Let's do it here, too, since it is too trivial a change to warrant a
    separate commit.

    Fixes: cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
    Signed-off-by: Maciej S. Szmigiero
    Message-Id:
    Cc: stable@vger.kernel.org
    [Reorganize macros so that everything is computed from the bit ranges. - Paolo]
    Signed-off-by: Paolo Bonzini

    Maciej S. Szmigiero
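
    To make the "missing shift" concrete, here is a self-contained sketch of
    packing a generation number into two disjoint SPTE bit ranges. The field
    positions and widths below are illustrative assumptions, not KVM's actual
    mask layout.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative layout: low half in bits 3-11, high half in bits 52-60. */
    #define GEN_LOW_BITS    9
    #define GEN_HIGH_BITS   9
    #define GEN_LOW_SHIFT   3
    #define GEN_HIGH_SHIFT  52

    static uint64_t encode_gen(uint64_t gen, int buggy)
    {
        uint64_t lo = gen & ((1ULL << GEN_LOW_BITS) - 1);
        uint64_t hi = (gen >> GEN_LOW_BITS) & ((1ULL << GEN_HIGH_BITS) - 1);

        if (buggy)  /* the missing shift: low bits re-encoded as the high half */
            hi = gen & ((1ULL << GEN_HIGH_BITS) - 1);

        return (lo << GEN_LOW_SHIFT) | (hi << GEN_HIGH_SHIFT);
    }

    static uint64_t decode_gen(uint64_t spte)
    {
        uint64_t lo = (spte >> GEN_LOW_SHIFT) & ((1ULL << GEN_LOW_BITS) - 1);
        uint64_t hi = (spte >> GEN_HIGH_SHIFT) & ((1ULL << GEN_HIGH_BITS) - 1);

        return lo | (hi << GEN_LOW_BITS);
    }

    int main(void)
    {
        uint64_t gen = 0x2a5;   /* needs more than GEN_LOW_BITS bits */

        printf("correct: gen 0x%llx -> decoded 0x%llx\n",
               (unsigned long long)gen,
               (unsigned long long)decode_gen(encode_gen(gen, 0)));
        printf("buggy:   gen 0x%llx -> decoded 0x%llx\n",
               (unsigned long long)gen,
               (unsigned long long)decode_gen(encode_gen(gen, 1)));
        return 0;
    }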
     

11 Dec, 2020

2 commits

  • Prarit reported that depending on the affinity setting the

    ' irq $N: Affinity broken due to vector space exhaustion.'

    message is showing up in dmesg, but the vector space on the CPUs in the
    affinity mask is definitely not exhausted.

    Shung-Hsi provided traces and analysis which pinpoints the problem:

    The ordering of trying to assign an interrupt vector in
    assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
    valid node assigned. It does:

    1) Try the intersection of affinity mask and node mask
    2) Try the node mask
    3) Try the full affinity mask
    4) Try the full online mask

    Obviously #2 and #3 are in the wrong order as the requested affinity
    mask has to take precedence.

    In the observed cases #1 failed because the affinity mask did not contain
    CPUs from node 0. That made it allocate a vector from node 0, thereby
    breaking affinity and emitting the misleading message.

    Revert the order of #2 and #3 so the full affinity mask without the node
    intersection is tried before affinity is actually broken.

    If no node is assigned then only the full affinity mask and if that fails
    the full online mask is tried.

    Fixes: d6ffc6ac83b1 ("x86/vector: Respect affinity mask in irq descriptor")
    Reported-by: Prarit Bhargava
    Reported-by: Shung-Hsi Yu
    Signed-off-by: Thomas Gleixner
    Tested-by: Shung-Hsi Yu
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de

    Thomas Gleixner
     
  • The MBA software controller (mba_sc) is a feedback loop which
    periodically reads MBM counters and tries to restrict the bandwidth
    below a user-specified value. It tags along the MBM counter overflow
    handler to do the updates with 1s interval in mbm_update() and
    update_mba_bw().

    The purpose of mbm_update() is to periodically read the MBM counters to
    make sure that the hardware counter doesn't wrap around more than once
    between user samplings. mbm_update() calls __mon_event_count() for local
    bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
    instead when mba_sc is enabled. __mon_event_count() will not be called
    for local bandwidth updating in MBM counter overflow handler, but it is
    still called when reading MBM local bandwidth counter file
    'mbm_local_bytes', the call path is as below:

    rdtgroup_mondata_show()
    mon_event_read()
    mon_event_count()
    __mon_event_count()

    In __mon_event_count(), m->chunks is updated with delta chunks calculated
    from the previous MSR value (m->prev_msr) and the current MSR value.
    When mba_sc is enabled, m->chunks is also updated in mbm_update(), by
    mistake, with delta chunks calculated from m->prev_bw_msr instead of
    m->prev_msr. But m->chunks is not used in update_mba_bw() in the mba_sc
    feedback loop.

    When reading the MBM local bandwidth counter file, m->chunks has already
    been changed unexpectedly by mbm_bw_count(). As a result, an incorrect
    local bandwidth counter, calculated from the incorrect m->chunks, is
    shown to the user.

    Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
    MBM counter overflow handler, and always calling __mon_event_count() in
    mbm_update() to make sure that the hardware local bandwidth counter
    doesn't wrap around.

    Test steps:
    # Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
    git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat
    && make
    ./tools/membw/membw -c 0 -b 10000 --read

    # Enable MBA software controller
    mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl

    # Create control group c1
    mkdir /sys/fs/resctrl/c1

    # Set MB throttle to 6 GB/s
    echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata

    # Write PID of the workload to tasks file
    echo `pidof membw` > /sys/fs/resctrl/c1/tasks

    # Read local bytes counters twice with 1s interval, the calculated
    # local bandwidth is not as expected (approaching to 6 GB/s):
    local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
    sleep 1
    local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
    echo "local b/w (bytes/s):" `expr $local_2 - $local_1`

    Before fix:
    local b/w (bytes/s): 11076796416

    After fix:
    local b/w (bytes/s): 5465014272

    Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
    Signed-off-by: Xiaochen Shen
    Signed-off-by: Borislav Petkov
    Reviewed-by: Tony Luck
    Cc:
    Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com

    Xiaochen Shen
     

10 Dec, 2020

1 commit

  • The PAT bit is in different locations for 4k and 2M/1G page table
    entries.

    Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
    caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
    and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
    index for write-protected pages.

    Fixes: 6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place")
    Signed-off-by: Arvind Sankar
    Signed-off-by: Borislav Petkov
    Tested-by: Tom Lendacky
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu

    Arvind Sankar
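
    As background, a minimal sketch of why the cache-attribute mask differs
    between 4K and large entries: PWT and PCD stay at bits 3 and 4, but the
    PAT bit moves from bit 7 to bit 12 for 2M/1G entries because bit 7 is the
    page-size bit there. The macro names mirror the kernel's, but the snippet
    is illustrative only.

    #include <stdint.h>
    #include <stdio.h>

    /* x86 page-table attribute bits (architectural positions). */
    #define _PAGE_PWT               (1ULL << 3)
    #define _PAGE_PCD               (1ULL << 4)
    #define _PAGE_PAT               (1ULL << 7)    /* 4K entries    */
    #define _PAGE_PAT_LARGE         (1ULL << 12)   /* 2M/1G entries */

    #define _PAGE_CACHE_MASK        (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
    #define _PAGE_LARGE_CACHE_MASK  (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT_LARGE)

    int main(void)
    {
        printf("4K cache mask:    0x%llx\n", (unsigned long long)_PAGE_CACHE_MASK);
        printf("large cache mask: 0x%llx\n", (unsigned long long)_PAGE_LARGE_CACHE_MASK);
        return 0;
    }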
     

09 Dec, 2020

1 commit

  • sync_core_before_usermode() had an incorrect optimization. If the kernel
    returns from an interrupt, it can get to usermode without IRET. It just has
    to schedule to a different task in the same mm and do SYSRET. Fortunately,
    there were no callers of sync_core_before_usermode() that could have had
    in_irq() or in_nmi() equal to true, because it's only ever called from the
    scheduler.

    While at it, clarify a related comment.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org

    Andy Lutomirski
     

07 Dec, 2020

3 commits

  • Pull x86 fixes from Thomas Gleixner:
    "A set of fixes for x86:

    - Make the AMD L3 QoS code and data priorization enable/disable
    mechanism work correctly.

    The control bit was only set/cleared on one of the CPUs in a L3
    domain, but it has to be modified on all CPUs in the domain. The
    initial documentation was not clear about this, but the updated one
    from Oct 2020 spells it out.

    - Fix an off by one in the UV platform detection code which causes
    the UV hubs to be identified wrongly.

    The chip revisions start at 1 not at 0.

    - Fix a long standing bug in the evaluation of prefixes in the
    uprobes code which fails to handle repeated prefixes properly.

    The aggregate size of the prefixes can be larger than the bytes
    array but the code blindly iterated over the aggregate size beyond
    the array boundary. Add a macro to handle this case properly and
    use it at the affected places"

    * tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
    x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
    x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
    x86/platform/uv: Fix UV4 hub revision adjustment
    x86/resctrl: Fix AMD L3 QOS CDP enable/disable

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "Two fixes for performance monitoring on X86:

    - Add recursion protection to another callchain invoked from
    x86_pmu_stop() which can recurse back into x86_pmu_stop(). The
    first attempt to fix this missed this extra code path.

    - Use the already filtered status variable to check for PEBS counter
    overflow bits and not the unfiltered full status read from
    IA32_PERF_GLOBAL_STATUS which can have unrelated bits set which
    would be evaluated incorrectly"

    * tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel: Check PEBS status correctly
    perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS

    Linus Torvalds
     
  • …t/masahiroy/linux-kbuild

    Pull Kbuild fixes from Masahiro Yamada:

    - Move -Wcast-align to W=3, which tends to be false-positive and there
    is no tree-wide solution.

    - Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a
    preprocessor option and makes sense for .S files as well.

    - Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings.

    - Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings.

    - Fix undesirable line breaks in *.mod files.

    * tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: avoid split lines in .mod files
    kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1
    kbuild: Hoist '--orphan-handling' into Kconfig
    Kbuild: do not emit debug info for assembly with LLVM_IAS=1
    kbuild: use -fmacro-prefix-map for .S sources
    Makefile.extrawarn: move -Wcast-align to W=3

    Linus Torvalds
     

06 Dec, 2020

3 commits

  • Since insn.prefixes.nbytes can be bigger than the size of
    insn.prefixes.bytes[] when a prefix is repeated, the proper
    check must be:

    insn.prefixes.bytes[i] != 0 and i < 4

    instead of using insn.prefixes.nbytes. Use the new
    for_each_insn_prefix() macro which does it correctly.

    Debugged by Kees Cook.

    [ bp: Massage commit message. ]

    Fixes: 25189d08e516 ("x86/sev-es: Add support for handling IOIO exceptions")
    Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/160697106089.3146288.2052422845039649176.stgit@devnote2

    Masami Hiramatsu
     
  • Since insn.prefixes.nbytes can be bigger than the size of
    insn.prefixes.bytes[] when a prefix is repeated, the proper check must
    be

    insn.prefixes.bytes[i] != 0 and i < 4

    instead of using insn.prefixes.nbytes. Use the new
    for_each_insn_prefix() macro which does it correctly.

    Debugged by Kees Cook.

    [ bp: Massage commit message. ]

    Fixes: 32d0b95300db ("x86/insn-eval: Add utility functions to get segment selector")
    Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/160697104969.3146288.16329307586428270032.stgit@devnote2

    Masami Hiramatsu
     
  • Since insn.prefixes.nbytes can be bigger than the size of
    insn.prefixes.bytes[] when a prefix is repeated, the proper check must
    be

    insn.prefixes.bytes[i] != 0 and i < 4

    instead of using insn.prefixes.nbytes.

    Introduce a for_each_insn_prefix() macro for this purpose. Debugged by
    Kees Cook.

    [ bp: Massage commit message, sync with the respective header in tools/
    and drop "we". ]

    Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
    Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Borislav Petkov
    Reviewed-by: Srikar Dronamraju
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2

    Masami Hiramatsu
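
    A hedged userspace sketch of what a for_each_insn_prefix()-style iterator
    has to guard against: stop at the first zero byte and never run past the
    four-byte prefixes array, even when nbytes is larger because a prefix was
    repeated. The struct and macro here are simplified stand-ins, not the
    kernel's insn.h definitions.

    #include <stdint.h>
    #include <stdio.h>

    struct insn_prefixes {
        uint8_t bytes[4];   /* at most 4 distinct legacy prefixes are stored */
        uint8_t nbytes;     /* counts repeats, so it can exceed 4 */
    };

    /* Bound the loop by the array size AND by the first zero byte. */
    #define for_each_insn_prefix(p, idx, prefix)                        \
        for ((idx) = 0;                                                 \
             (idx) < 4 && ((prefix) = (p)->bytes[(idx)]) != 0;          \
             (idx)++)

    int main(void)
    {
        /* A repeated prefix can push nbytes past the array size. */
        struct insn_prefixes p = { .bytes = { 0xf3, 0x66 }, .nbytes = 6 };
        uint8_t prefix;
        int i;

        for_each_insn_prefix(&p, i, prefix)
            printf("prefix[%d] = 0x%02x\n", i, (unsigned)prefix);
        return 0;
    }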
     

04 Dec, 2020

2 commits

  • In the TDP MMU, use shadow_phys_bits to determine the maximum possible GFN
    mapped in the guest for zapping operations. boot_cpu_data.x86_phys_bits
    may be reduced in the case of HW features that steal HPA bits for other
    purposes. However, this doesn't necessarily reduce GPA space that can be
    accessed via TDP. So zap based on a maximum gfn calculated with MAXPHYADDR
    retrieved from CPUID. This is already stored in shadow_phys_bits, so use
    it instead of x86_phys_bits.

    Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
    Signed-off-by: Rick Edgecombe
    Message-Id:
    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Rick Edgecombe
     
  • The cpu arg for svm_cpu_uninit() was previously ignored, resulting in the
    per-cpu structure svm_cpu_data not being de-allocated for all cpus.

    Signed-off-by: Jacob Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jacob Xu