30 Sep, 2016

1 commit

  • We use __read_cr4() vs __read_cr4_safe() inconsistently. On
    CR4-less CPUs, all CR4 bits are effectively clear, so we can make
    the code simpler and more robust by making __read_cr4() always fix
    up faults on 32-bit kernels.

    This may fix some bugs on old 486-like CPUs, but I don't have any
    easy way to test that.
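
    [ ed: a minimal sketch of the fixup idea, shown on a
    native_read_cr4()-style helper and assuming the kernel's
    _ASM_EXTABLE exception-table macro; on CPUs where the read faults,
    execution resumes at the fixup label with the preloaded value 0: ]

    static inline unsigned long native_read_cr4(void)
    {
            unsigned long val;
    #ifdef CONFIG_X86_32
            /*
             * Reading CR4 faults on CPUs that don't have it.  A missing
             * CR4 is equivalent to CR4 == 0, so preload 0 and let the
             * exception table skip the faulting instruction.
             */
            asm volatile("1: mov %%cr4, %0\n"
                         "2:\n"
                         _ASM_EXTABLE(1b, 2b)
                         : "=r" (val) : "0" (0));
    #else
            /* CR4 always exists on 64-bit CPUs. */
            asm volatile("mov %%cr4, %0" : "=r" (val));
    #endif
            return val;
    }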

    Signed-off-by: Andy Lutomirski
    Cc: Brian Gerst
    Cc: Borislav Petkov
    Cc: david@saggiorato.net
    Link: http://lkml.kernel.org/r/ea647033d357d9ce2ad2bbde5a631045f5052fb6.1475178370.git.luto@kernel.org
    Signed-off-by: Thomas Gleixner

    Andy Lutomirski
     

16 Aug, 2016

1 commit


09 Aug, 2016

1 commit

  • The low-level resume-from-hibernation code on x86-64 uses
    kernel_ident_mapping_init() to create the temporary identity mapping,
    but that function assumes that the offset between kernel virtual
    addresses and physical addresses is aligned on the PGD level.

    However, with a randomized identity mapping base, it may be aligned
    on the PUD level and if that happens, the temporary identity mapping
    created by set_up_temporary_mappings() will not reflect the actual
    kernel identity mapping and the image restoration will fail as a
    result (leading to a kernel panic most of the time).

    To fix this problem, rework kernel_ident_mapping_init() to support
    unaligned offsets between KVA and PA up to the PMD level and make
    set_up_temporary_mappings() use it as appropriate.
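
    [ ed: a sketch of the PMD-level part of such a rework; the offset
    field in struct x86_mapping_info follows the description above, the
    other details are illustrative: ]

    static void ident_pmd_init(struct x86_mapping_info *info, pmd_t *pmd_page,
                               unsigned long addr, unsigned long end)
    {
            addr &= PMD_MASK;
            for (; addr < end; addr += PMD_SIZE) {
                    pmd_t *pmd = pmd_page + pmd_index(addr);

                    if (pmd_present(*pmd))
                            continue;
                    /*
                     * Subtract the (PMD-aligned) KVA-to-PA offset so that
                     * virtual address 'addr' maps to the physical address
                     * 'addr - offset'.
                     */
                    set_pmd(pmd, __pmd((addr - info->offset) | info->pmd_flag));
            }
    }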

    Reported-and-tested-by: Thomas Garnier
    Reported-by: Borislav Petkov
    Suggested-by: Yinghai Lu
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Yinghai Lu

    Rafael J. Wysocki
     

03 Aug, 2016

1 commit

  • When CONFIG_RANDOMIZE_MEMORY is set on x86-64, __PAGE_OFFSET becomes
    a variable and using it as a symbol in the image memory restoration
    assembly code under core_restore_code is not correct any more.

    To avoid that problem, modify set_up_temporary_mappings() to compute
    the physical address of the temporary page tables and store it in
    temp_level4_pgt, so that the value of that variable is ready to be
    written into CR3. Then, the assembly code doesn't have to worry
    about converting that value into a physical address and things work
    regardless of whether or not CONFIG_RANDOMIZE_MEMORY is set.
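
    [ ed: a sketch of the resulting shape of set_up_temporary_mappings();
    everything except the final __pa() store is elided or illustrative: ]

    static int set_up_temporary_mappings(void)
    {
            pgd_t *pgd = (pgd_t *)get_safe_page(GFP_ATOMIC);

            if (!pgd)
                    return -ENOMEM;

            /* ... build the temporary identity mapping in pgd ... */

            temp_level4_pgt = __pa(pgd);    /* ready to be written into CR3 */
            return 0;
    }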

    Reported-and-tested-by: Thomas Garnier
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

29 Jul, 2016

1 commit

  • In kernel bug 150021, a kernel panic was reported when restoring a
    hibernate image. Only a picture of the oops was reported, so I can't
    paste the whole thing here. But here are the most interesting parts:

    kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
    BUG: unable to handle kernel paging request at ffff8804615cfd78
    ...
    RIP: ffff8804615cfd78
    RSP: ffff8804615f0000
    RBP: ffff8804615cfdc0
    ...
    Call Trace:
    do_signal+0x23
    exit_to_usermode_loop+0x64
    ...

    The RIP is on the same page as RBP, so it apparently started executing
    on the stack.

    The bug was bisected to commit ef0f3ed5a4ac (x86/asm/power: Create
    stack frames in hibernate_asm_64.S), which in retrospect seems quite
    dangerous, since that code saves and restores the stack pointer from a
    global variable ('saved_context').

    There are a lot of moving parts in the hibernate save and restore paths,
    so I don't know exactly what caused the panic. Presumably, a FRAME_END
    was executed without the corresponding FRAME_BEGIN, or vice versa. That
    would corrupt the return address on the stack and would be consistent
    with the details of the above panic.

    [ rjw: One major problem is that by the time the FRAME_BEGIN in
    restore_registers() is executed, the stack pointer value may not
    be valid any more. Namely, the stack area pointed to by it
    previously may have been overwritten by some image memory contents
    and that page frame may now be used for whatever different purpose
    it had been allocated for before hibernation. In that case, the
    FRAME_BEGIN will corrupt that memory. ]

    Instead of doing the frame pointer save/restore around the bounds of the
    affected functions, just do it around the call to swsusp_save().

    That has the same effect of ensuring that if swsusp_save() sleeps, the
    frame pointers will be correct. It's also a much more obviously safe
    way to do it than the original patch. And objtool still doesn't report
    any warnings.

    Fixes: ef0f3ed5a4ac (x86/asm/power: Create stack frames in hibernate_asm_64.S)
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=150021
    Cc: stable@vger.kernel.org # 4.6+
    Reported-by: Andre Reinke
    Tested-by: Andre Reinke
    Signed-off-by: Josh Poimboeuf
    Acked-by: Ingo Molnar
    Signed-off-by: Rafael J. Wysocki

    Josh Poimboeuf
     

16 Jul, 2016

1 commit

  • On Intel hardware, native_play_dead() uses mwait_play_dead() by
    default and only falls back to the other methods if that fails.
    That also happens during resume from hibernation, when the restore
    (boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
    except for the boot one offline.

    However, that is problematic, because the address passed to
    __monitor() in mwait_play_dead() is likely to be written to in the
    last phase of hibernate image restoration and that causes the "dead"
    CPU to start executing instructions again. Unfortunately, the page
    containing the address in that CPU's instruction pointer may not be
    valid any more at that point.

    First, that page may have been overwritten with image kernel memory
    contents already, so the instructions the CPU attempts to execute may
    simply be invalid. Second, the page tables previously used by that
    CPU may have been overwritten by image kernel memory contents, so the
    address in its instruction pointer is impossible to resolve then.

    A report from Varun Koyyalagunta and investigation carried out by
    Chen Yu show that the latter sometimes happens in practice.

    To prevent it from happening, temporarily change the smp_ops.play_dead
    pointer during resume from hibernation so that it points to a special
    "play dead" routine which uses hlt_play_dead() and avoids the
    inadvertent "revivals" of "dead" CPUs this way.

    A slightly unpleasant consequence of this change is that if the
    system is hibernated with one or more CPUs offline, it will generally
    draw more power after resume than it did before hibernation, because
    the physical state entered by CPUs via hlt_play_dead() is higher-power
    than the mwait_play_dead() one in the majority of cases. It is
    possible to work around this, but it is unclear how much of a problem
    that's going to be in practice, so the workaround will be implemented
    later if it turns out to be necessary.
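
    [ ed: a sketch of the temporary override; the function names mirror
    the description above but the exact wiring is illustrative: ]

    static void resume_play_dead(void)
    {
            play_dead_common();
            hlt_play_dead();        /* never monitors memory, so no "revival" */
    }

    int hibernate_resume_nonboot_cpu_disable(void)
    {
            void (*play_dead)(void) = smp_ops.play_dead;
            int ret;

            /* Swap in the hlt-based routine only for this resume path. */
            smp_ops.play_dead = resume_play_dead;
            ret = disable_nonboot_cpus();
            smp_ops.play_dead = play_dead;
            return ret;
    }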

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
    Reported-by: Varun Koyyalagunta
    Original-by: Chen Yu
    Tested-by: Chen Yu
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Ingo Molnar

    Rafael J. Wysocki
     

01 Jul, 2016

1 commit

  • Logan Gunthorpe reports that hibernation stopped working reliably for
    him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
    and rodata).

    That turns out to be a consequence of a long-standing issue with the
    64-bit image restoration code on x86: the temporary page tables set
    up by it (to avoid page table corruption while the last bits of the
    image kernel's memory contents are copied into their original page
    frames) re-use the boot kernel's text mapping, but that mapping may
    very well get corrupted just like any other part of the page tables.
    Of course, if that happens, the final jump to the image kernel's
    entry point will go to nowhere.

    The exact reason why commit ab76f7b4ab23 matters here is that it
    sometimes causes a PMD of a large page to be split into PTEs
    that are allocated dynamically and get corrupted during image
    restoration as described above.

    To fix that issue, note that the code copying the last bits of the
    image kernel's memory contents to the page frames they previously
    occupied doesn't use the kernel text mapping, because it runs from
    a special page covered by the identity mapping set up for that code
    from scratch. Hence, the kernel text mapping is only needed before
    that code starts to run and then it will only be used just for the
    final jump to the image kernel's entry point.

    Accordingly, the temporary page tables set up in swsusp_arch_resume()
    on x86-64 need to contain the kernel text mapping too. That mapping
    is only going to be used for the final jump to the image kernel, so
    it only needs to cover the image kernel's entry point, because the
    first thing the image kernel does after getting control back is to
    switch over to its own original page tables. Moreover, the virtual
    address of the image kernel's entry point in that mapping has to be
    the same as the one mapped by the image kernel's page tables.

    With that in mind, modify the x86-64's arch_hibernation_header_save()
    and arch_hibernation_header_restore() routines to pass the physical
    address of the image kernel's entry point (in addition to its virtual
    address) to the boot kernel (a small piece of assembly code involved
    in passing the entry point's virtual address to the image kernel is
    not necessary any more after that, so drop it). Update RESTORE_MAGIC
    too to reflect the image header format change.

    Next, in set_up_temporary_mappings(), use the physical and virtual
    addresses of the image kernel's entry point passed in the image
    header to set up a minimum kernel text mapping (using memory pages
    that won't be overwritten by the image kernel's memory contents) that
    will map those addresses to each other as appropriate.
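
    [ ed: a sketch of such a minimum text mapping, assuming the
    restore_jump_address (virtual) and jump_address_phys (physical)
    values described above and pre-5-level page tables: ]

    static int set_up_temporary_text_mapping(pgd_t *pgd)
    {
            pud_t *pud;
            pmd_t *pmd;

            /* Use safe pages, which won't be overwritten by image data. */
            pud = (pud_t *)get_safe_page(GFP_ATOMIC);
            if (!pud)
                    return -ENOMEM;
            pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
            if (!pmd)
                    return -ENOMEM;

            /* Map the image kernel's entry point VA to its PA. */
            set_pmd(pmd + pmd_index(restore_jump_address),
                    __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
            set_pud(pud + pud_index(restore_jump_address),
                    __pud(__pa(pmd) | _KERNPG_TABLE));
            set_pgd(pgd + pgd_index(restore_jump_address),
                    __pgd(__pa(pud) | _KERNPG_TABLE));

            return 0;
    }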

    This makes the concern about the possible corruption of the original
    boot kernel text mapping go away, and if the minimum kernel text
    mapping used for the final jump marks the image kernel's entry point
    memory as executable, the jump to it is guaranteed to succeed.

    Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
    Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
    Reported-by: Logan Gunthorpe
    Reported-and-tested-by: Borislav Petkov
    Tested-by: Kees Cook
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

31 Mar, 2016

1 commit


24 Feb, 2016

1 commit

  • swsusp_arch_suspend() and restore_registers() are callable non-leaf
    functions which don't honor CONFIG_FRAME_POINTER, which can result in
    bad stack traces. Also they aren't annotated as ELF callable functions
    which can confuse tooling.

    Create a stack frame for them when CONFIG_FRAME_POINTER is enabled and
    give them proper ELF function annotations.

    Signed-off-by: Josh Poimboeuf
    Reviewed-by: Borislav Petkov
    Acked-by: Pavel Machek
    Acked-by: Rafael J. Wysocki
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Bernd Petrovitsch
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chris J Arges
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Slaby
    Cc: Linus Torvalds
    Cc: Michal Marek
    Cc: Namhyung Kim
    Cc: Pedro Alves
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: live-patching@vger.kernel.org
    Link: http://lkml.kernel.org/r/bdad00205897dc707aebe9e9e39757085e2bf999.1453405861.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

26 Nov, 2015

1 commit

  • A bug was reported that on certain Broadwell platforms, after
    resuming from S3, the CPU is running at an anomalously low
    speed.

    It turns out that the BIOS has modified the value of the
    THERM_CONTROL register during S3 and changed it from 0 to 0x10, thus
    enabling clock modulation (bit 4) but with an undefined CPU duty
    cycle (bits 1:3) - which causes the problem.

    Here is a simple scenario to reproduce the issue:

    1. Boot up the system
    2. Get MSR 0x19a, it should be 0
    3. Put the system into sleep, then wake it up
    4. Get MSR 0x19a, it shows 0x10, while it should be 0

    Although some BIOSen want to change the CPU Duty Cycle during
    S3, in our case we don't want the BIOS to do any modification.

    Fix this issue by introducing a more generic x86 framework to
    save/restore specified MSRs (THERM_CONTROL in this case) across
    suspend/resume. This allows us to fix similar bugs in a much
    simpler way in the future.

    When the kernel wants to protect certain MSRs during suspending,
    we simply add a quirk entry in msr_save_dmi_table, and customize
    the MSR registers inside the quirk callback, for example:

    u32 msr_id_need_to_save[] = {MSR_ID0, MSR_ID1, MSR_ID2...};

    and the quirk mechanism ensures that, once resumed from suspend,
    the MSRs indicated by these IDs will be restored to their
    original, pre-suspend values.

    Since both 64-bit and 32-bit kernels are affected, this patch
    covers the common 64/32-bit suspend/resume code path. And
    because the MSRs specified by the user might not be available or
    readable in every situation, we use rdmsrl_safe() to safely save
    these MSRs.
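
    [ ed: a sketch of the quirk wiring and the safe save loop; the DMI
    strings, msr_init_context() and the saved_msrs layout are
    illustrative: ]

    static u32 msr_id_need_to_save[] = { MSR_IA32_THERM_CONTROL };

    static int msr_initialize_bdw(const struct dmi_system_id *d)
    {
            return msr_init_context(msr_id_need_to_save,
                                    ARRAY_SIZE(msr_id_need_to_save));
    }

    static const struct dmi_system_id msr_save_dmi_table[] = {
            {
                    .callback = msr_initialize_bdw,
                    .ident    = "BROADWELL BDX_EP",
                    .matches  = { DMI_MATCH(DMI_PRODUCT_NAME, "GRANTLEY"), },
            },
            {}
    };

    static void msr_save_context(struct saved_context *ctxt)
    {
            struct saved_msr *msr = ctxt->saved_msrs.array;
            struct saved_msr *end = msr + ctxt->saved_msrs.num;

            while (msr < end) {
                    /* The MSR may not exist or be readable: save safely. */
                    msr->valid = !rdmsrl_safe(msr->info.msr_no,
                                              &msr->info.reg.q);
                    msr++;
            }
    }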

    Reported-and-tested-by: Marcin Kaszewski
    Signed-off-by: Chen Yu
    Acked-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bp@suse.de
    Cc: len.brown@intel.com
    Cc: linux@horizon.com
    Cc: luto@kernel.org
    Cc: rjw@rjwysocki.net
    Link: http://lkml.kernel.org/r/c9abdcbc173dd2f57e8990e304376f19287e92ba.1448382971.git.yu.c.chen@intel.com
    [ More edits to the naming of data structures. ]
    Signed-off-by: Ingo Molnar

    Chen Yu
     

31 Jul, 2015

1 commit

  • modify_ldt() has questionable locking and does not synchronize
    threads. Improve it: redesign the locking and synchronize all
    threads' LDTs using an IPI on all modifications.

    This will dramatically slow down modify_ldt in multithreaded
    programs, but there shouldn't be any multithreaded programs that
    care about modify_ldt's performance in the first place.

    This fixes some fallout from the CVE-2015-5157 fixes.
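
    [ ed: a sketch of the synchronization: publish the new LDT, then IPI
    every CPU running the mm so it reloads immediately; names follow the
    description but details are illustrative: ]

    static void flush_ldt(void *current_mm)
    {
            if (current->active_mm == current_mm)
                    load_mm_ldt(current->active_mm);
    }

    static void install_ldt(struct mm_struct *current_mm,
                            struct ldt_struct *ldt)
    {
            /* Pairs with the lockless read in load_mm_ldt(). */
            smp_store_release(&current_mm->context.ldt, ldt);

            /* Activate the new LDT on every CPU using this mm. */
            on_each_cpu_mask(mm_cpumask(current_mm), flush_ldt,
                             current_mm, true);
    }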

    Signed-off-by: Andy Lutomirski
    Reviewed-by: Borislav Petkov
    Cc: Andrew Cooper
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jan Beulich
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: security@kernel.org
    Cc:
    Cc: xen-devel
    Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

23 Jun, 2015

1 commit

  • Pull x86 core updates from Ingo Molnar:
    "There were so many changes in the x86/asm, x86/apic and x86/mm topics
    in this cycle that the topical separation of -tip broke down somewhat -
    so the result is a more traditional architecture pull request,
    collected into the 'x86/core' topic.

    The topics were still maintained separately as far as possible, so
    bisectability and conceptual separation should still be pretty good -
    but there were a handful of merge points to avoid excessive
    dependencies (and conflicts) that would have been poorly tested in the
    end.

    The next cycle will hopefully be much more quiet (or at least will
    have fewer dependencies).

    The main changes in this cycle were:

    * x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas
    Gleixner)

    - This is the second and most intrusive part of changes to the x86
    interrupt handling - full conversion to hierarchical interrupt
    domains:

      [IOAPIC domain]  -----
                             |
      [MSI domain]    --------[Remapping domain] ----- [ Vector domain ]
                             |   (optional)        |
      [HPET MSI domain] -----                      |
                             |                     |
      [DMAR domain]    -----------------------------
                             |
      [Legacy domain]  -----------------------------

    This now reflects the actual hardware and allowed us to disentangle
    the domain specific code from the underlying parent domain, which
    can be optional in the case of interrupt remapping. It's a clear
    separation of functionality and removes quite some duct tape
    constructs which plugged the remap code between ioapic/msi/hpet
    and the vector management.

    - Intel IOMMU IRQ remapping enhancements, to allow direct interrupt
    injection into guests (Feng Wu)

    * x86/asm changes:

    - Tons of cleanups and small speedups, micro-optimizations. This
    is in preparation to move a good chunk of the low level entry
    code from assembly to C code (Denys Vlasenko, Andy Lutomirski,
    Brian Gerst)

    - Moved all system entry related code to a new home under
    arch/x86/entry/ (Ingo Molnar)

    - Removal of the fragile and ugly CFI dwarf debuginfo annotations.
    Conversion to C will reintroduce many of them - but meanwhile
    they are only getting in the way, and the upstream kernel does
    not rely on them (Ingo Molnar)

    - NOP handling refinements. (Borislav Petkov)

    * x86/mm changes:

    - Big PAT and MTRR rework: making the code more robust and
    preparing to phase out exposing direct MTRR interfaces to drivers -
    in favor of using PAT driven interfaces (Toshi Kani, Luis R
    Rodriguez, Borislav Petkov)

    - New ioremap_wt()/set_memory_wt() interfaces to support
    Write-Through cached memory mappings. This is especially
    important for good performance on NVDIMM hardware (Toshi Kani)

    * x86/ras changes:

    - Add support for deferred errors on AMD (Aravind Gopalakrishnan)

    This is an important RAS feature which adds hardware support for
    poisoned data. That means roughly that the hardware marks data
    which it has detected as corrupted but wasn't able to correct, as
    poisoned data and raises an APIC interrupt to signal that in the
    form of a deferred error. It is the OS's responsibility then to
    take proper recovery action and thus prolong system lifetime as
    far as possible.

    - Add support for Intel "Local MCE"s: upcoming CPUs will support
    CPU-local MCE interrupts, as opposed to the traditional system-
    wide broadcasted MCE interrupts (Ashok Raj)

    - Misc cleanups (Borislav Petkov)

    * x86/platform changes:

    - Intel Atom SoC updates

    ... and lots of other cleanups, fixlets and other changes - see the
    shortlog and the Git log for details"

    * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits)
    x86/hpet: Use proper hpet device number for MSI allocation
    x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
    x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled
    x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled
    x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail
    genirq: Prevent crash in irq_move_irq()
    genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain
    iommu, x86: Properly handle posted interrupts for IOMMU hotplug
    iommu, x86: Provide irq_remapping_cap() interface
    iommu, x86: Setup Posted-Interrupts capability for Intel iommu
    iommu, x86: Add cap_pi_support() to detect VT-d PI capability
    iommu, x86: Avoid migrating VT-d posted interrupts
    iommu, x86: Save the mode (posted or remapped) of an IRTE
    iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
    iommu: dmar: Provide helper to copy shared irte fields
    iommu: dmar: Extend struct irte for VT-d Posted-Interrupts
    iommu: Add new member capability to struct irq_remap_ops
    x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code
    x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation
    x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry()
    ...

    Linus Torvalds
     

19 May, 2015

4 commits

  • There are a number of FPU internal function prototypes and an inline
    function in fpu/api.h, placed there mostly for historical reasons as
    the code grew over the years.

    Move them over into fpu/internal.h where they belong. (Add sched.h include
    to stackprotector.h which incorrectly relied on getting it from fpu/api.h.)

    fpu/api.h is now a pure file that only contains FPU APIs intended for driver
    use.

    Reviewed-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The suspend code accesses FPU state internals; add a helper for it
    and isolate it.

    Reviewed-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • So the 'pcntxt_mask' is a misnomer, it's essentially meaningless to anyone
    who doesn't know what it does exactly.

    Name it more descriptively as 'xfeatures_mask'.

    Reviewed-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This unifies all the FPU related header files under a unified,
    hierarchical naming scheme:

    - asm/fpu/types.h: FPU related data types, needed for 'struct task_struct',
    widely included in almost all kernel code, and hence kept
    as small as possible.

    - asm/fpu/api.h: FPU related 'public' methods exported to other subsystems.

    - asm/fpu/internal.h: FPU subsystem internal methods

    - asm/fpu/xsave.h: XSAVE support internal methods

    (Also standardize the header guard in asm/fpu/internal.h.)

    Reviewed-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

15 Apr, 2015

1 commit

  • ... so that they don't appear in the object file and thus in
    objdump output. They're local anyway and have a meaning only
    within that file.

    No functionality change.

    Signed-off-by: Borislav Petkov
    Acked-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: H. Peter Anvin
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: linux-pm@vger.kernel.org
    Link: http://lkml.kernel.org/r/1428867906-12016-1-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

06 Mar, 2015

1 commit

  • It has nothing to do with init -- there's only one TSS per cpu.

    Other names considered include:

    - current_tss: Confusing because we never switch the tss.
    - singleton_tss: Too long.

    This patch was generated with 's/init_tss/cpu_tss/g'. Followup
    patches will fix INIT_TSS and INIT_TSS_IST by hand.

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/da29fb2a793e4f649d93ce2d1ed320ebe8516262.1425611534.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

04 Feb, 2015

1 commit

  • Context switches and TLB flushes can change individual bits of CR4.
    CR4 reads take several cycles, so store a shadow copy of CR4 in a
    per-cpu variable.

    To avoid wasting a cache line, I added the CR4 shadow to
    cpu_tlbstate, which is already touched in switch_mm. The heaviest
    users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
    __switch_to_xtra is called shortly after switch_mm during context
    switch, so the cacheline is likely to be hot.
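
    [ ed: a sketch of the shadow accessors, assuming the cpu_tlbstate
    placement described above: ]

    static inline unsigned long cr4_read_shadow(void)
    {
            return this_cpu_read(cpu_tlbstate.cr4);
    }

    static inline void cr4_set_bits(unsigned long mask)
    {
            unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

            if ((cr4 | mask) != cr4) {
                    cr4 |= mask;
                    this_cpu_write(cpu_tlbstate.cr4, cr4);
                    __write_cr4(cr4);    /* keep hardware and shadow in sync */
            }
    }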

    Signed-off-by: Andy Lutomirski
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kees Cook
    Cc: Andrea Arcangeli
    Cc: Vince Weaver
    Cc: "hillf.zj"
    Cc: Valdis Kletnieks
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

10 Oct, 2014

1 commit

  • The different architectures used their own (and different) declarations:

    extern __visible const void __nosave_begin, __nosave_end;
    extern const void __nosave_begin, __nosave_end;
    extern long __nosave_begin, __nosave_end;

    Consolidate them using the first variant in <asm/sections.h>.

    Signed-off-by: Geert Uytterhoeven
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

17 Jul, 2014

1 commit

  • ftrace_stop() is used to stop function tracing during suspend and resume
    which removes a lot of possible debugging opportunities with tracing.
    The reason was that some function in the resume path was causing a triple
    fault if it were to be traced. The issue I found was that doing something
    as simple as calling smp_processor_id() would reboot the box!

    When function tracing was first created I didn't have a good way to figure
    out what function was having issues, or it looked to be multiple ones. To
    fix it, we just created a big hammer approach to the problem which was to
    add a flag in the mcount trampoline that could be checked and not call
    the traced functions.

    Lately I developed better ways to find problem functions and I can bisect
    down to see what function is causing the issue. I removed the flag that
    stopped tracing and proceeded to find the problem function and it ended
    up being restore_processor_state(). This function makes sense as when the
    CPU comes back online from a suspend it calls this function to set up
    registers, amongst them the GS register, which stores things such as
    what CPU the processor is (if you call smp_processor_id() without this
    set up properly, it would fault).

    By making restore_processor_state() notrace, the system can suspend and
    resume without the need of the big hammer tracing to stop.
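
    [ ed: the fix boils down to one annotation; a sketch: ]

    /*
     * notrace keeps the mcount/fentry hook out of this function, so it
     * can run before the GS base (and thus per-CPU data) is restored.
     */
    notrace void restore_processor_state(void)
    {
            __restore_processor_state(&saved_context);
    }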

    Link: http://lkml.kernel.org/r/3577662.BSnUZfboWb@vostro.rjw.lan

    Acked-by: "Rafael J. Wysocki"
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

06 May, 2014

1 commit


07 Aug, 2013

1 commit


03 May, 2013

1 commit

  • The git commit e7a5cd063c7b4c58417f674821d63f5eb6747e37
    ("x86-64, gdt: Store/load GDT for ACPI S3 or hibernate/resume path
    is not needed.") assumes that for the hibernate path the booting
    kernel and the resuming kernel MUST be the same. That is certainly
    the case for a 32-bit kernel (see check_image_kernel and the
    CONFIG_ARCH_HIBERNATION_HEADER config option).

    However, for 64-bit kernels it is OK to have a different kernel
    version (and size of the image) for the booting and resuming kernels.
    Hence the above mentioned git commit introduces a regression.

    This patch fixes it by reintroducing a 'struct desc_ptr gdt_desc'
    in 'struct saved_context'. However, instead of issuing store/load_gdt
    calls in both 'save_processor_state' and 'restore_processor_state',
    we only save the GDT in save_processor_state.

    For the restore path the lgdt operation is done in
    hibernate_asm_[32|64].S in the 'restore_registers' path.

    The apt reader of this description will recognize that only 64-bit
    kernels need this treatment, not 32-bit. This patch adds the logic
    in the 32-bit path to be more similar to 64-bit so that in the future
    the unification process can take advantage of this.

    [ hpa: this also reverts an inadvertent on-disk format change ]
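
    [ ed: a sketch of the shape of the fix; the gdt_desc field and the
    store_gdt() call follow the description above, the rest is elided: ]

    struct saved_context {
            /* ... segment and control registers ... */
            struct desc_ptr gdt_desc;
            /* ... */
    };

    static void __save_processor_state(struct saved_context *ctxt)
    {
            /* ... */
            store_gdt(&ctxt->gdt_desc);
            /*
             * No load_gdt() on the restore side: the lgdt is done by the
             * 'restore_registers' path in hibernate_asm_[32|64].S.
             */
    }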

    Suggested-by: "H. Peter Anvin"
    Acked-by: "Rafael J. Wysocki"
    Signed-off-by: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1367459610-9656-2-git-send-email-konrad.wilk@oracle.com
    Signed-off-by: H. Peter Anvin

    Konrad Rzeszutek Wilk
     

30 Apr, 2013

1 commit

  • Pull x86 paravirt update from Ingo Molnar:
    "Various paravirtualization related changes - the biggest one makes
    guest support optional via CONFIG_HYPERVISOR_GUEST"

    * 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, wakeup, sleep: Use pvops functions for changing GDT entries
    x86, xen, gdt: Remove the pvops variant of store_gdt.
    x86-32, gdt: Store/load GDT for ACPI S3 or hibernation/resume path is not needed
    x86-64, gdt: Store/load GDT for ACPI S3 or hibernate/resume path is not needed.
    x86: Make Linux guest support optional
    x86, Kconfig: Move PARAVIRT_DEBUG into the paravirt menu

    Linus Torvalds
     

12 Apr, 2013

3 commits

  • We check the TSS descriptor before we try to dereference it.
    Also we document what the value '9' actually means using the
    AMD64 Architecture Programmer's Manual Volume 2, pg 90:
    "Hex value 9: Available 64-bit TSS" and pg 91:
    "The available 32-bit TSS (09h), which is redefined as the
    available 64-bit TSS."

    Without this, on Xen, where the GDT is available as R/O (to
    protect the hypervisor from the guest modifying it), we end up
    with a pagetable fault.
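
    [ ed: a sketch of the checked, pvops-friendly update, instead of
    writing get_cpu_gdt_table(cpu)[GDT_ENTRY_TSS].type = 9 directly;
    details are illustrative: ]

    struct desc_struct *desc = get_cpu_gdt_table(cpu);
    tss_desc tss;

    /*
     * Copy, modify, and write back via the pvops helper rather than
     * poking the (possibly read-only) GDT in place.  Type 0x9 is the
     * "available 64-bit TSS" (AMD64 APM Vol. 2, pp. 90-91).
     */
    memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
    tss.type = 0x9;
    write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);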

    Signed-off-by: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1365194544-14648-5-git-send-email-konrad.wilk@oracle.com
    Cc: Rafael J. Wysocki
    Signed-off-by: H. Peter Anvin

    konrad@kernel.org
     
  • During the ACPI S3 suspend, we store the GDT in the wakeup_header
    (see wakeup_asm.S) field called 'pmode_gdt'.

    That is then used during the resume path, and it has the same exact
    value as what store/load_gdt do with the saved_context (which is
    saved/restored via save/restore_processor_state()).

    The flow during resume from ACPI S3 is simpler than the 64-bit
    counterpart. We only use the early bootstrap once (wakeup_gdt) and
    do various checks in real mode.

    After the checks are completed, we load the saved GDT ('pmode_gdt') and
    continue on with the resume (by heading to startup_32 in trampoline_32.S) -
    which quickly jumps to what was saved in 'pmode_entry'
    aka 'wakeup_pmode_return'.

    The 'wakeup_pmode_return' restores the GDT (saved_gdt) again (which was
    saved in do_suspend_lowlevel initially). After that it ends up calling
    the 'ret_point' which calls 'restore_processor_state()'.

    We have two opportunities to remove code where we restore the same GDT
    twice.

    Here is the call chain:
    wakeup_start
          |- lgdtl wakeup_gdt [the work-around broken BIOSes]
          |
          | - lgdtl pmode_gdt [the real one]
          |
          \-- startup_32 (in trampoline_32.S)
                \-- wakeup_pmode_return (in wakeup_32.S)
                      |- lgdtl saved_gdt [the real one]
                      \-- ret_point
                            |..
                            |- call restore_processor_state

    The hibernate path is much simpler. During the saving of the hibernation
    image we call save_processor_state() and save the contents of that
    along with the rest of the kernel in the hibernation image destination.
    We save the EIP of 'restore_registers' (restore_jump_address) and
    cr3 (restore_cr3).

    During hibernate resume, the 'restore_registers' (via the
    'restore_jump_address') in hibernate_asm_32.S is invoked which
    restores the contents of most registers. Naturally the resume path benefits
    from already being in 32-bit mode, so it does not have to reload the GDT.

    It only reloads the cr3 (from restore_cr3) and continues on. Note
    that the restoration of the restore image page-tables is done prior to
    this.

    After 'restore_registers' returns, we end up calling
    restore_processor_state() - where we reload the GDT. The reload of
    the GDT is not needed, as the boot kernel has already loaded a GDT
    which is at the same physical location as the restored kernel's.

    Note that the hibernation path assumes the GDT is correct during its
    'restore_registers'. The assumption in the code is that the restored
    image is the same as saved - meaning we are not trying to restore
    a different kernel in the virtual address space of a new kernel.

    Signed-off-by: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1365194544-14648-3-git-send-email-konrad.wilk@oracle.com
    Cc: Rafael J. Wysocki
    Signed-off-by: H. Peter Anvin

    Konrad Rzeszutek Wilk
     
  • During the ACPI S3 resume path the trampoline code handles it already.

    During the ACPI S3 suspend phase (acpi_suspend_lowlevel) we set:
    early_gdt_descr.address = (..)get_cpu_gdt_table(smp_processor_id());

    which is then used during the resume path and has the same exact
    value as what the store/load_gdt do with the saved_context
    (which is saved/restored via save/restore_processor_state()).

    The flow during resume is complex and for 64-bit kernels we use three
    GDTs - one early bootstrap GDT (wakeup_gdt) that we load to work
    around broken BIOSes, an early Protected Mode to Long Mode transition
    one (tr_gdt), and the final one - early_gdt_descr (which points to
    the real GDT).

    The early ('wakeup_gdt') is loaded in 'trampoline_start' for working
    around broken BIOSes, and then when we end up in Protected Mode in the
    startup_32 (in trampoline_64.S, not head_32.S) we use the 'tr_gdt'
    (still in trampoline_64.S). This 'tr_gdt' has a 32-bit code segment,
    64-bit code segment with L=1, and a 32-bit data segment.

    Once we have transitioned from Protected Mode to Long Mode we then
    set the GDT to 'early_gdt_desc' and then via an iretq emerge in
    wakeup_long64 (set via 'initial_code' variable in acpi_suspend_lowlevel).

    In the wakeup_long64 we end up restoring the %rip (which is set to
    'resume_point') and jump there.

    In 'resume_point' we call 'restore_processor_state' which does
    the load_gdt on the saved context. This load_gdt is redundant as the
    GDT loaded via early_gdt_desc is the same.

    Here is the call-chain:
    wakeup_start
          |- lgdtl wakeup_gdt [the work-around broken BIOSes]
          |
          \-- trampoline_start (trampoline_64.S)
                |- lgdtl tr_gdt
                |
                \-- startup_32 (trampoline_64.S)
                      |
                      \-- startup_64 (trampoline_64.S)
                            |
                            \-- secondary_startup_64
                                  |- lgdtl early_gdt_desc
                                  | ...
                                  |- movq initial_code(%rip), %eax
                                  |-.. lretq
                                  \-- wakeup_64
                                        |-- other registers are reloaded
                                        |-- call restore_processor_state

    The hibernate path is much simpler. During the saving of the hibernation
    image we call save_processor_state() and save the contents of that along
    with the rest of the kernel in the hibernation image destination.
    We save the EIP of 'restore_registers' (restore_jump_address) and cr3
    (restore_cr3).

    During hibernate resume, the 'restore_registers' (via the
    'restore_jump_address') in hibernate_asm_64.S is invoked which restores
    the contents of most registers. Naturally the resume path benefits from
    already being in 64-bit mode, so it does not have to load the GDT.

    It only reloads the cr3 (from restore_cr3) and continues on. Note that
    the restoration of the restore image page-tables is done prior to this.

    After 'restore_registers' returns, we end up calling
    restore_processor_state() - where we reload the GDT. The reload of
    the GDT is not needed, as the boot kernel has already loaded a GDT
    which is at the same physical location as the restored kernel's.

    Note that the hibernation path assumes the GDT is correct during its
    'restore_registers'. The assumption in the code is that the restored
    image is the same as saved - meaning we are not trying to restore
    a different kernel in the virtual address space of a new kernel.

    Signed-off-by: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1365194544-14648-2-git-send-email-konrad.wilk@oracle.com
    Cc: Rafael J. Wysocki
    Signed-off-by: H. Peter Anvin

    Konrad Rzeszutek Wilk
     

16 Mar, 2013

1 commit

  • This patch fixes a kernel crash when using precise sampling (PEBS)
    after a suspend/resume. Turns out the CPU notifier code is not invoked
    on CPU0 (BP). Therefore, the DS_AREA (used by PEBS) is not restored
    properly by the kernel and keeps its power-on/resume value of 0,
    causing any PEBS measurement to crash when running on CPU0.

    The workaround is to add a hook in the actual resume code to restore
    the DS Area MSR value. It is invoked for all CPUs. So for all but
    CPU0, the DS_AREA will be restored twice, but this is harmless.
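
    [ ed: a sketch of such a hook, invoked from restore_processor_state()
    for every CPU; names mirror the perf/DS code but are illustrative: ]

    void perf_restore_debug_store(void)
    {
            struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);

            if (!x86_pmu.bts && !x86_pmu.pebs)
                    return;

            /* Re-point the CPU at its DS buffer; harmless if done twice. */
            wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
    }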

    Reported-by: Linus Torvalds
    Signed-off-by: Stephane Eranian
    Signed-off-by: Linus Torvalds

    Stephane Eranian
     

01 Feb, 2013

2 commits


30 Jan, 2013

1 commit

  • We should set up mappings only for usable memory ranges under
    max_pfn; otherwise we cause the same problem that was fixed by

    x86, mm: Only direct map addresses that are marked as E820_RAM

    Make it map only the ranges in the pfn_mapped array.
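
    [ ed: a sketch of the reworked loop, iterating only over the
    pfn_mapped ranges; variable names are illustrative: ]

    for (i = 0; i < nr_pfn_mapped; i++) {
            unsigned long mstart = pfn_mapped[i].start << PAGE_SHIFT;
            unsigned long mend   = pfn_mapped[i].end << PAGE_SHIFT;

            result = kernel_ident_mapping_init(&info, pgd, mstart, mend);
            if (result)
                    return result;
    }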

    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/1359058816-7615-34-git-send-email-yinghai@kernel.org
    Cc: Pavel Machek
    Cc: Rafael J. Wysocki
    Cc: linux-pm@vger.kernel.org
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

15 Nov, 2012

2 commits

  • CONFIG_DEBUG_HOTPLUG_CPU0 is for debugging the CPU0 hotplug feature. The switch
    offlines CPU0 as soon as possible and boots userspace up with CPU0 offlined.
    User can online CPU0 back after boot time. The default value of the switch is
    off.

    To debug CPU0 hotplug, you need to enable CPU0 offline/online feature by either
    turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during compilation or giving
    cpu0_hotplug kernel parameter at boot.

    It's a safe and early place to take down CPU0, after all hotplug
    notifiers are installed and SMP is booted.

    Please note that some applications or drivers, e.g. some versions of udevd,
    during boot time may put CPU0 online again in this CPU0 hotplug debug mode.

    In this debug mode, setup_local_APIC() may report a max_loops warning
    when CPU0 is brought back online.

    Link: http://lkml.kernel.org/r/1352835171-3958-15-git-send-email-fenghua.yu@intel.com
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
     
  • Because x86 BIOS requires CPU0 to resume from sleep, suspend or hibernate can't
    be executed if CPU0 is detected offline. To make suspend or hibernate and
    further resume succeed, CPU0 must be online.
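
    [ ed: a sketch of the resulting guard in the suspend/hibernate entry
    paths; the message text is illustrative: ]

    if (!cpu_online(0)) {
            pr_err("PM: CPU0 is offline, refusing to suspend/hibernate\n");
            return -ENODEV;
    }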

    Signed-off-by: Fenghua Yu
    Link: http://lkml.kernel.org/r/1352835171-3958-6-git-send-email-fenghua.yu@intel.com
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
     

02 Apr, 2012

1 commit

  • s2ram broke due to this KVM commit:

    b74f05d61b73 x86: kvmclock: abstract save/restore sched_clock_state

    restore_sched_clock_state() methods use percpu data, therefore
    they must run after %gs is initialized, but before mtrr_bp_restore()
    (due to lockstat using sched_clock).

    Move it to the correct place.
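
    [ ed: a sketch of the corrected ordering at the end of
    __restore_processor_state(); surrounding code is elided: ]

    /* ... %gs restored above: per-CPU data is usable again ... */
    x86_platform.restore_sched_clock_state();

    /* mtrr_bp_restore() may take locks; lockstat needs sched_clock. */
    mtrr_bp_restore();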

    Reported-and-tested-by: Konstantin Khlebnikov
    Signed-off-by: Marcelo Tosatti
    Cc: Avi Kivity
    Signed-off-by: Ingo Molnar

    Marcelo Tosatti
     

29 Mar, 2012

3 commits

  • …m/linux/kernel/git/dhowells/linux-asm_system

    Pull "Disintegrate and delete asm/system.h" from David Howells:
    "Here are a bunch of patches to disintegrate asm/system.h into a set of
    separate bits to relieve the problem of circular inclusion
    dependencies.

    I've built all the working defconfigs from all the arches that I can
    and made sure that they don't break.

    The reason for these patches is that I recently encountered a circular
    dependency problem that came about when I produced some patches to
    optimise get_order() by rewriting it to use ilog2().

    This uses bitops - and on the SH arch asm/bitops.h drags in
    asm-generic/get_order.h by a circuitous route involving asm/system.h.

    The main difficulty seems to be asm/system.h. It holds a number of
    low level bits with no/few dependencies that are commonly used (eg.
    memory barriers) and a number of bits with more dependencies that
    aren't used in many places (eg. switch_to()).

    These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

    Move memory barriers here. This is already done for MIPS and Alpha.

    (2) asm/switch_to.h

    Move switch_to() and related stuff here.

    (3) asm/exec.h

    Move arch_align_stack() here. Other process execution related bits
    could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

    Move xchg() and cmpxchg() here as they're full word atomic ops and
    frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

    Move die() and related bits.

    (6) asm/auxvec.h

    Move AT_VECTOR_SIZE_ARCH here.

    Other arch headers are created as needed on a per-arch basis."

    Fixed up some conflicts from other header file cleanups and moving code
    around that has happened in the meantime, so David's testing is somewhat
    weakened by that. We'll find out anything that got broken and fix it..

    * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
    Delete all instances of asm/system.h
    Remove all #inclusions of asm/system.h
    Add #includes needed to permit the removal of asm/system.h
    Move all declarations of free_initmem() to linux/mm.h
    Disintegrate asm/system.h for OpenRISC
    Split arch_align_stack() out from asm-generic/system.h
    Split the switch_to() wrapper out of asm-generic/system.h
    Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
    Create asm-generic/barrier.h
    Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
    Disintegrate asm/system.h for Xtensa
    Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
    Disintegrate asm/system.h for Tile
    Disintegrate asm/system.h for Sparc
    Disintegrate asm/system.h for SH
    Disintegrate asm/system.h for Score
    Disintegrate asm/system.h for S390
    Disintegrate asm/system.h for PowerPC
    Disintegrate asm/system.h for PA-RISC
    Disintegrate asm/system.h for MN10300
    ...

    Linus Torvalds
     
  • Pull kvm updates from Avi Kivity:
    "Changes include timekeeping improvements, support for assigning host
    PCI devices that share interrupt lines, s390 user-controlled guests, a
    large ppc update, and random fixes."

    This is with the sign-off's fixed, hopefully next merge window we won't
    have rebased commits.

    * 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
    KVM: Convert intx_mask_lock to spin lock
    KVM: x86: fix kvm_write_tsc() TSC matching thinko
    x86: kvmclock: abstract save/restore sched_clock_state
    KVM: nVMX: Fix erroneous exception bitmap check
    KVM: Ignore the writes to MSR_K7_HWCR(3)
    KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask
    KVM: PMU: add proper support for fixed counter 2
    KVM: PMU: Fix raw event check
    KVM: PMU: warn when pin control is set in eventsel msr
    KVM: VMX: Fix delayed load of shared MSRs
    KVM: use correct tlbs dirty type in cmpxchg
    KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
    KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
    KVM: x86 emulator: Allow PM/VM86 switch during task switch
    KVM: SVM: Fix CPL updates
    KVM: x86 emulator: VM86 segments must have DPL 3
    KVM: x86 emulator: Fix task switch privilege checks
    arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice
    KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation
    KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
    ...

    Linus Torvalds
     
  • Disintegrate asm/system.h for X86.

    Signed-off-by: David Howells
    Acked-by: H. Peter Anvin
    cc: x86@kernel.org

    David Howells
     

20 Mar, 2012

1 commit

  • Upon resume from hibernation, CPU 0's hvclock area contains the old
    values for system_time and tsc_timestamp. It is necessary for the
    hypervisor to update these values with up-to-date ones before the
    CPU uses them.

    Abstract TSC's save/restore sched_clock_state functions and use
    restore_state to write to KVM_SYSTEM_TIME MSR, forcing an update.
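
    [ ed: a sketch of the abstraction on the kvmclock side; re-registering
    the clock writes the KVM_SYSTEM_TIME MSR, forcing the update: ]

    static void kvm_save_sched_clock_state(void)
    {
    }

    static void kvm_restore_sched_clock_state(void)
    {
            kvm_register_clock("primary cpu clock, resume");
    }

    void __init kvmclock_init(void)
    {
            /* ... */
            x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
            x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
    }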

    Also move restore_sched_clock_state before __restore_processor_state,
    since the latter calls CONFIG_LOCK_STAT's lockstat_clock (also for
    TSC). Thanks to Igor Mammedov for tracking it down.

    Fixes suspend-to-disk with kvmclock.

    Reviewed-by: Thomas Gleixner
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     

22 Feb, 2012

1 commit

  • While various modules include <asm/i387.h> to get access to things
    we actually *intend* for them to use, most of that header file was
    really pretty low-level internal stuff that we really don't want to
    expose to others.

    So split the header file into two: the small exported interfaces
    remain in <asm/i387.h>, while the internal definitions that are only
    used by core architecture code are now in <asm/fpu-internal.h>.

    The guiding principle for this was to expose functions that we
    export to modules, and leave them in <asm/i387.h>, while stuff that
    is used by task switching or was marked GPL-only is in
    <asm/fpu-internal.h>.

    The fpu-internal.h file could be further split up too, especially since
    arch/x86/kvm/ uses some of the remaining stuff for its module. But that
    kvm usage should probably be abstracted out a bit, and at least now the
    internal FPU accessor functions are much more contained. Even if it
    isn't perhaps as contained as it _could_ be.

    Signed-off-by: Linus Torvalds
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
    Signed-off-by: H. Peter Anvin

    Linus Torvalds