11 Aug, 2016

6 commits

  • There's no need to run setup_real_mode() as early as we run it.
    Defer it to the same early_initcall that sets up the page
    permissions for the real mode code.

    This should be a code size reduction. More importantly, it gives us
    a longer window in which we can allocate the real mode trampoline.

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Mario Limonciello
    Cc: Matt Fleming
    Cc: Matthew Garrett
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/fd62f0da4f79357695e9bf3e365623736b05f119.1470821230.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • The initialization process for trampoline_cr4_features and
    mmu_cr4_features was confusing. The intent is for mmu_cr4_features
    and *trampoline_cr4_features to stay in sync, but
    trampoline_cr4_features is NULL until setup_real_mode() runs. The
    old code synchronized *trampoline_cr4_features *twice*, once in
    setup_real_mode() and once in setup_arch(). It also initialized
    mmu_cr4_features in setup_real_mode(), which causes the actual value
    of mmu_cr4_features to potentially depend on when setup_real_mode()
    is called.

    With this patch, mmu_cr4_features is initialized directly in
    setup_arch(), and *trampoline_cr4_features is synchronized to
    mmu_cr4_features when the trampoline is set up.

    After this patch, it should be safe to defer setup_real_mode().

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Mario Limonciello
    Cc: Matt Fleming
    Cc: Matthew Garrett
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/d48a263f9912389b957dd495a7127b009259ffe0.1470821230.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • reserve_bios_regions() is a quirk that reserves memory that we might
    otherwise think is available. There's no need to run it so early,
    and running it before we have the memory map initialized with its
    non-quirky inputs makes it hard to make reserve_bios_regions() more
    intelligent.

    Move it right after we populate the memblock state.

    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Mario Limonciello
    Cc: Matt Fleming
    Cc: Matthew Garrett
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/59f58618911005c799c6c9979ce6ae4881d907c2.1470821230.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • Since commit:

    52aec3308db8 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")

    the TLB remote shootdown is done through call function vector. That
    commit didn't take care of irq_tlb_count, which a later commit:

    fd0f5869724f ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")

    ... tried to fix.

    The fix assumes every increase of irq_tlb_count has a corresponding
    increase of irq_call_count. So the irq_call_count is always bigger than
    irq_tlb_count and we could subtract irq_tlb_count from irq_call_count.

    Unfortunately this is not true for the smp_call_function_single() case.
    The IPI is only sent if the target CPU's call_single_queue is empty when
    adding a csd into it in generic_exec_single. That means if two threads
    are both adding flush tlb csds to the same CPU's call_single_queue, only
    one IPI is sent. In other words, the irq_call_count is incremented by 1
    but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
    bigger than irq_call_count and the subtraction will produce a very large
    irq_call_count value due to overflow.

    Considering that:

    1) it's not worth sending more IPIs for the sake of accurate counting of
    irq_call_count in generic_exec_single();

    2) it's not easy to tell if the call function interrupt is for TLB
    shootdown in __smp_call_function_single_interrupt().

    Not excluding TLB shootdowns from the call function count seems to be
    the simplest fix, and this patch does just that.

    This bug was found by LKP's cyclic performance regression tracking recently
    with the vm-scalability test suite. I have bisected to commit:

    3dec0ba0be6a ("mm/rmap: share the i_mmap_rwsem")

    This commit didn't do anything wrong but revealed the irq_call_count
    problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
    concurrent with multiple threads. When remap_one is try_to_unmap_one(),
    then multiple threads could queue flush TLB to the same CPU but only
    one IPI will be sent.

    Since the commit was added in Linux v3.19, the counting problem only
    shows up from v3.19 onwards.
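
    The wrap-around described above can be sketched in plain C. This is
    only an illustration: struct irq_counts and the helper names are
    hypothetical stand-ins, not the kernel's per-CPU irq_stat code, and
    the counters are assumed to be unsigned int.

```c
#include <assert.h>

/* Hypothetical counters, loosely modelled on irq_call_count/irq_tlb_count. */
struct irq_counts {
    unsigned int call;  /* one increment per IPI actually received */
    unsigned int tlb;   /* one increment per TLB flush request handled */
};

/* Several flush requests queued to the same CPU are served by one IPI. */
static void coalesced_flush(struct irq_counts *c, unsigned int requests)
{
    assert(requests > 0);
    c->call += 1;
    c->tlb += requests;
}

/* Old reporting: function call interrupts minus TLB shootdowns. */
static unsigned int reported_calls_old(const struct irq_counts *c)
{
    return c->call - c->tlb;    /* wraps around once tlb > call */
}

/* New reporting: TLB shootdowns are no longer excluded. */
static unsigned int reported_calls_new(const struct irq_counts *c)
{
    return c->call;
}
```

    With two requests coalesced into one IPI, the old calculation
    computes 1 - 2 in unsigned arithmetic and reports a huge value,
    while the new one reports 1.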

    Signed-off-by: Aaron Lu
    Cc: Alex Shi
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Davidlohr Bueso
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Huang Ying
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tomoki Sekiyama
    Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com
    Signed-off-by: Ingo Molnar

    Aaron Lu
     
  • A recent patch changed the format of a swap PTE.

    The comment explaining the format of the swap PTE is wrong about
    the bits used for the swap type field. Amusingly, the ASCII art
    and the patch description are correct, but the comment itself
    is wrong.

    As I was looking at this, I also noticed that the
    SWP_OFFSET_FIRST_BIT has an off-by-one error. This does not
    really hurt anything. It just wasted a bit of space in the PTE,
    giving us 2^59 bytes of addressable space in our swapfiles
    instead of 2^60. But, it doesn't match with the comments, and it
    wastes a bit of space, so fix it.

    Signed-off-by: Dave Hansen
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Fixes: 00839ee3b299 ("x86/mm: Move swap offset/type up in PTE to work around erratum")
    Link: http://lkml.kernel.org/r/20160810172325.E56AD7DA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • debug_putstr() is used to output strings without using printf-like
    formatting but debug_putstr(v) is defined as early_printk(v) in
    arch/x86/lib/kaslr.c.

    This makes clang report the following warning when building
    with -Wformat-security:

    arch/x86/lib/kaslr.c:57:15: warning: format string is not a string
    literal (potentially insecure) [-Wformat-security]
    debug_putstr(purpose);
    ^~~~~~~

    Fix this by using "%s" in early_printk().
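
    A user-space sketch of the same idea, assuming snprintf() as a
    stand-in for early_printk(); sketch_putstr() is a hypothetical name
    and not the kernel macro:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for debug_putstr(). The string is passed as an
 * argument to a literal "%s" format, so any '%' it contains is copied
 * verbatim. The form -Wformat-security complains about would be
 * snprintf(buf, len, v), where v itself is parsed as the format.
 */
static int sketch_putstr(char *buf, size_t len, const char *v)
{
    assert(buf != NULL && v != NULL);
    return snprintf(buf, len, "%s", v);
}
```

    A string such as "100%d done" comes out unchanged through this
    helper, since no conversion specifier in it is ever interpreted.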

    Signed-off-by: Nicolas Iooss
    Acked-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160806102039.27221-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Ingo Molnar

    Nicolas Iooss
     

10 Aug, 2016

18 commits

  • The Memory Protection Keys "rights register" (PKRU) is
    XSAVE-managed, and is saved/restored along with the FPU state.

    When kernel code accesses FPU registers, it does a delicate
    dance with preempt. Otherwise, the context switching code can
    get confused as to whether the most up-to-date state is in the
    registers themselves or in the XSAVE buffer.

    But, PKRU is not a normal FPU register. Using it does not
    generate the normal device-not-available (#NM) exceptions which
    means we cannot manage it lazily, and the kernel completely
    disallows using lazy mode when it is enabled.

    The dance with preempt *only* occurs when managing the FPU
    lazily. Since we never manage PKRU lazily, we do not have to do
    the dance with preempt; we can access it directly. Doing it
    this way saves a ton of complicated code (and is faster too).

    Further, the XSAVES reenabling failed to patch a bit of code
    in fpu__xfeature_set_state() that checked for compacted buffers.
    That check caused fpu__xfeature_set_state() to silently refuse to
    work when the kernel is using compacted XSAVE buffers. This
    broke execute-only and future pkey_mprotect() support when using
    compact XSAVE buffers.

    But, removing fpu__xfeature_set_state() gets rid of this issue,
    in addition to the nice cleanup and speedup.

    This fixes the same thing as a fix that Sai posted:

    https://lkml.org/lkml/2016/7/25/637

    The fix that he posted is much more obviously correct, but I
    think we should just do this instead.

    Reported-by: Sai Praneeth Prakhya
    Signed-off-by: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Quentin Casasnovas
    Cc: Ravi Shankar
    Cc: Thomas Gleixner
    Cc: Yu-Cheng Yu
    Link: http://lkml.kernel.org/r/20160727232040.7D060DAD@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • Building an X86_64 kernel with W=1 throws a total of 9,948 lines of warnings of
    this form for both 32-bit and 64-bit syscall tables. Given that the entire rest
    of the build for my config only generates 8,375 lines of output, this is a big
    reduction in the warnings generated.

    The warnings follow this pattern:

    ./arch/x86/include/generated/asm/syscalls_32.h:885:21: warning: initialized field overwritten [-Woverride-init]
    __SYSCALL_I386(379, compat_sys_pwritev2, )
    ^
    arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386'
    #define __SYSCALL_I386(nr, sym, qual) [nr] = sym,
    ^~~
    ./arch/x86/include/generated/asm/syscalls_32.h:885:21: note: (near initialization for 'ia32_sys_call_table[379]')
    __SYSCALL_I386(379, compat_sys_pwritev2, )
    ^
    arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386'
    #define __SYSCALL_I386(nr, sym, qual) [nr] = sym,

    Since we intentionally build the syscall tables this way, ignore that one
    warning in the two files.
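
    The table construction that triggers the warning can be sketched as
    follows. All sketch_* names are hypothetical, the range designator
    is a GNU C extension, and the pragma is only an illustrative way to
    silence the warning in a standalone file; how the patch itself
    disables it is not shown here.

```c
#include <assert.h>

typedef long (*sketch_syscall_fn)(void);

static long sketch_ni_syscall(void) { return -38; /* -ENOSYS on Linux */ }
static long sketch_getpid(void)     { return 1234; }

#define SKETCH_NR_MAX 3

/* The override below is intentional, so ignore that one warning here. */
#pragma GCC diagnostic ignored "-Woverride-init"

/* Fill every slot with the default first, then override known entries. */
static const sketch_syscall_fn sketch_table[SKETCH_NR_MAX + 1] = {
    [0 ... SKETCH_NR_MAX] = sketch_ni_syscall,
    [2] = sketch_getpid,    /* "initialized field overwritten" */
};

static long sketch_dispatch(unsigned int nr)
{
    assert(nr <= SKETCH_NR_MAX);
    return sketch_table[nr]();
}
```

    Every slot that is not explicitly listed falls back to the default
    handler, which is why the overwrite is wanted rather than a bug.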

    Signed-off-by: Valdis Kletnieks
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/7464.1470021890@turing-police.cc.vt.edu
    Signed-off-by: Ingo Molnar

    Valdis Kletnieks
     
  • The latest UV kernel support panics when RHEL7 kexec's the kdump kernel
    to make a dumpfile. This patch fixes the problem by turning off all UV
    support if NUMA is off.

    Tested-by: Frank Ramsay
    Tested-by: John Estabrook
    Signed-off-by: Mike Travis
    Reviewed-by: Dimitri Sivanich
    Reviewed-by: Nathan Zimmer
    Cc: Alex Thorlton
    Cc: Andrew Banman
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Russ Anderson
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160801184050.577755634@asylum.americas.sgi.com
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • There are some circumstances where the UV4 BIOS cannot provide the
    correct Proximity Node values to associate with specific Sockets and
    Physical Nodes. The decision was made to remove these values from BIOS
    and for the kernel to get these values from the standard ACPI tables.

    Tested-by: Frank Ramsay
    Tested-by: John Estabrook
    Signed-off-by: Mike Travis
    Reviewed-by: Dimitri Sivanich
    Reviewed-by: Nathan Zimmer
    Cc: Alex Thorlton
    Cc: Andrew Banman
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Russ Anderson
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160801184050.414210079@asylum.americas.sgi.com
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Save the uv_systab::size field before doing the iounmap()
    of the struct pointer, to avoid a NULL dereference crash.
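
    The ordering issue can be sketched in user space, assuming
    malloc()/free() as stand-ins for the ioremap()/iounmap() pair;
    struct sketch_systab and the helpers are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical table header; only the size field matters here. */
struct sketch_systab {
    unsigned int size;
};

/* Stand-in for mapping the table. */
static struct sketch_systab *sketch_map(unsigned int size)
{
    struct sketch_systab *tab = malloc(sizeof(*tab));

    if (tab)
        tab->size = size;
    return tab;
}

/* Copy the field out first, then drop the mapping. */
static unsigned int sketch_size_then_unmap(struct sketch_systab *tab)
{
    unsigned int size;

    assert(tab != NULL);
    size = tab->size;   /* read while the mapping is still valid */
    free(tab);          /* after this point tab must not be touched */
    return size;
}
```

    Reading tab->size after the release instead would access memory
    that is no longer mapped.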

    Tested-by: Frank Ramsay
    Tested-by: John Estabrook
    Signed-off-by: Mike Travis
    Reviewed-by: Dimitri Sivanich
    Reviewed-by: Nathan Zimmer
    Cc: Alex Thorlton
    Cc: Andrew Banman
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Russ Anderson
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160801184050.250424783@asylum.americas.sgi.com
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • The UV4 Socket IDs are not guaranteed to equate to Node values which
    can cause the GAM (Global Addressable Memory) table lookups to fail.
    Fix this by using an independent index into the GAM table instead of
    the Socket ID to reference the base address.

    Tested-by: Frank Ramsay
    Tested-by: John Estabrook
    Signed-off-by: Mike Travis
    Reviewed-by: Dimitri Sivanich
    Reviewed-by: Nathan Zimmer
    Cc: Alex Thorlton
    Cc: Andrew Banman
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Russ Anderson
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160801184050.048755337@asylum.americas.sgi.com
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Clarify why exactly RF cannot be restored properly by SYSRET to avoid
    confusion.

    No functionality change.

    Signed-off-by: Borislav Petkov
    Acked-by: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160803171429.GA2590@nazgul.tnic
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     
  • There's a subtle preemption race on UP kernels:

    Usually current->mm (and therefore mm->pgd) stays the same during the
    lifetime of a task so it does not matter if a task gets preempted during
    the read and write of the CR3.

    But then, there is this scenario on x86-UP:

    TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:

    -> mmput()
    -> exit_mmap()
    -> tlb_finish_mmu()
    -> tlb_flush_mmu()
    -> tlb_flush_mmu_tlbonly()
    -> tlb_flush()
    -> flush_tlb_mm_range()
    -> __flush_tlb_up()
    -> __flush_tlb()
    -> __native_flush_tlb()

    At this point current->mm is NULL but current->active_mm still points to
    the "old" mm.

    Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
    own mm so CR3 has changed.

    Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
    mm and so CR3 remains unchanged. Once taskA gets active it continues
    where it was interrupted and that means it writes its old CR3 value
    back. Everything is fine because userland won't need its memory
    anymore.

    Now the fun part:

    Let's preempt taskA one more time and get back to taskB. This
    time switch_mm() won't do a thing because oldmm (->active_mm)
    is the same as mm (as per context_switch()). So we remain
    with a bad CR3 / PGD and return to userland.

    The next thing that happens is handle_mm_fault() with an address for
    the execution of its code in userland. handle_mm_fault() realizes that
    it has a PTE with proper rights so it returns doing nothing. But the
    CPU looks at the wrong PGD and insists that something is wrong and
    faults again. And again. And one more time…

    This pagefault cycle continues until the scheduler gets tired of it and
    puts another task on the CPU. It gets a little difficult if the task is
    an RT task with a high priority. The system will either freeze or it gets
    fixed by the software watchdog thread which usually runs at RT-max prio.
    But waiting for the watchdog will increase the latency of the RT task
    which is no good.

    Fix this by disabling preemption across the critical code section.
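
    The interleaving above can be replayed deterministically in a small
    user-space model. Everything here is a simplification under stated
    assumptions: CR3 is a plain variable, PGD_A and PGD_B are made-up
    values, and the scheduler is replaced by straight-line code.

```c
#include <assert.h>
#include <stdbool.h>

enum { PGD_A = 0xA, PGD_B = 0xB };

/* Hypothetical model of the one register that matters here. */
struct sketch_cpu {
    unsigned long cr3;
};

/* Returns the CR3 value taskB ends up running with. */
static unsigned long sketch_replay(bool preempt_disabled)
{
    struct sketch_cpu cpu = { .cr3 = PGD_A };
    unsigned long saved;

    /* taskA: first half of the flush, read CR3. */
    saved = cpu.cr3;
    assert(saved == PGD_A);

    if (preempt_disabled) {
        /* Read and write-back cannot be separated. */
        cpu.cr3 = saved;
        /* A later switch to taskB really changes the mm. */
        cpu.cr3 = PGD_B;
        return cpu.cr3;
    }

    /* taskB preempts taskA after the read: CR3 changes. */
    cpu.cr3 = PGD_B;
    /* Back to taskA: it has no ->mm and borrows taskB's, CR3 untouched.
     * taskA then finishes the flush by writing its stale value back. */
    cpu.cr3 = saved;
    /* Switch to taskB again: oldmm == mm, so nothing is reloaded. */
    return cpu.cr3;
}
```

    Without the protection taskB returns to userland on taskA's old
    PGD; with preemption disabled across the read and write it gets
    its own.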

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Acked-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
    [ Prettified the changelog. ]
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     
    The default implementation expects that at most 6 pages are needed for
    low page allocations. If KASLR memory randomization is enabled, the
    worst-case e820 layout would require 12 pages (no large pages). This is
    due to the PUD level randomization and the variable e820 memory layout.

    This bug was found while doing extensive testing of KASLR memory
    randomization on different types of hardware.

    Signed-off-by: Thomas Garnier
    Cc: Aleksey Makarov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Christian Borntraeger
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dave Young
    Cc: Denys Vlasenko
    Cc: Fabian Frederick
    Cc: H. Peter Anvin
    Cc: Joerg Roedel
    Cc: Josh Poimboeuf
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Lv Zheng
    Cc: Mark Salter
    Cc: Peter Zijlstra
    Cc: Rafael J . Wysocki
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: kernel-hardening@lists.openwall.com
    Fixes: 021182e52fe0 ("Enable KASLR for physical mapping memory regions")
    Link: http://lkml.kernel.org/r/1470762665-88032-2-git-send-email-thgarnie@google.com
    Signed-off-by: Ingo Molnar

    Thomas Garnier
     
    Initialize KASLR memory randomization after max_pfn is initialized. Also
    ensure the size is rounded up. Otherwise, problems could occur at certain
    random addresses on machines with more than 1 TB of memory.
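
    Why rounding matters can be shown with a small calculation. This is
    a sketch under assumptions: the size is taken to be expressed in
    whole terabytes, and SKETCH_TB_SHIFT and sketch_bytes_to_tb() are
    hypothetical names rather than the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TB_SHIFT 40  /* 1 TB = 2^40 bytes */

/* Convert a byte count to whole terabytes, truncating or rounding up. */
static uint64_t sketch_bytes_to_tb(uint64_t bytes, int round_up)
{
    const uint64_t tb = UINT64_C(1) << SKETCH_TB_SHIFT;

    assert(bytes <= UINT64_MAX - (tb - 1));     /* no overflow below */
    return round_up ? (bytes + tb - 1) / tb : bytes / tb;
}
```

    For 1.5 TB of memory, truncation yields 1 and leaves the upper half
    terabyte unaccounted for, while rounding up yields 2.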

    Signed-off-by: Thomas Garnier
    Cc: Aleksey Makarov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Christian Borntraeger
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dave Young
    Cc: Denys Vlasenko
    Cc: Fabian Frederick
    Cc: H. Peter Anvin
    Cc: Joerg Roedel
    Cc: Josh Poimboeuf
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Lv Zheng
    Cc: Mark Salter
    Cc: Peter Zijlstra
    Cc: Rafael J . Wysocki
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: kernel-hardening@lists.openwall.com
    Fixes: 021182e52fe0 ("Enable KASLR for physical mapping memory regions")
    Link: http://lkml.kernel.org/r/1470762665-88032-1-git-send-email-thgarnie@google.com
    Signed-off-by: Ingo Molnar

    Thomas Garnier
     
  • Dmitry Vyukov has reported unexpected KASAN stackdepot growth:

    https://github.com/google/kasan/issues/36

    ... which is caused by the APIC handlers not being present in .irqentry.text:

    When building with CONFIG_FUNCTION_GRAPH_TRACER=y or CONFIG_KASAN=y, put the
    APIC interrupt handlers into the .irqentry.text section. This is needed
    because both KASAN and function graph tracer use __irqentry_text_start and
    __irqentry_text_end to determine whether a function is an IRQ entry point.
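
    The kind of range test involved can be sketched like this. The
    function name and the half-open interval are assumptions made for
    illustration; in the kernel the bounds come from the linker-provided
    __irqentry_text_start and __irqentry_text_end symbols.

```c
#include <assert.h>
#include <stdint.h>

/* True if addr lies inside the half-open section [start, end). */
static int sketch_in_irqentry_text(uintptr_t addr,
                                   uintptr_t start, uintptr_t end)
{
    assert(start <= end);
    return addr >= start && addr < end;
}
```

    A handler placed outside that section is never recognized as an IRQ
    entry point by such a test, so a stack walk does not stop at it.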

    Reported-by: Dmitry Vyukov
    Signed-off-by: Alexander Potapenko
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: aryabinin@virtuozzo.com
    Cc: kasan-dev@googlegroups.com
    Cc: kcc@google.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1468575763-144889-1-git-send-email-glider@google.com
    [ Minor edits. ]
    Signed-off-by: Ingo Molnar

    Alexander Potapenko
     
  • This reverts commit 874f9c7da9a4acbc1b9e12ca722579fb50e4d142.

    Geert Uytterhoeven reports:
    "This change seems to have an (unintendent?) side-effect.

    Before, pr_*() calls without a trailing newline characters would be
    printed with a newline character appended, both on the console and in
    the output of the dmesg command.

    After this commit, no new line character is appended, and the output
    of the next pr_*() call of the same type may be appended, like in:

    - Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000
    - Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)
    + Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)"

    Joe Perches says:
    "No, that is not intentional.

    The newline handling code inside vprintk_emit is a bit involved and
    for now I suggest a revert until this has all the same behavior as
    earlier"

    Reported-by: Geert Uytterhoeven
    Requested-by: Joe Perches
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull tracing fix from Steven Rostedt:
    "Fix tick_stop tracepoint symbols for user export.

    Luiz Capitulino noticed that the tick_stop tracepoint wasn't being
    parsed properly by the tracing user space tools.

    This was due to the TRACE_DEFINE_ENUM() being set to a define, when it
    should have been set to the enum itself. The define was of the MASK
    that used the BIT to shift. The BIT was the enum and by adding that,
    everything gets converted nicely. The MASK is still kept just in case
    it gets converted to an enum in the future"

    * tag 'trace-v4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix tick_stop tracepoint symbols for user export

    Linus Torvalds
     
  • …inux/kernel/git/kees/linux

    Pull gcc plugin improvements from Kees Cook:
    "Several fixes/improvements for the gcc plugin infrastructure:

    - fix a problem with gcc plugins interfering with cc-option tests.

    - abort more gracefully when gcc plugin headers or compiler support
    is missing.

    - improve the gcc plugin rule generation to be more dynamic, pass
    arguments, and build from subdirectories"

    * tag 'gcc-plugin-infrastructure-v4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    gcc-plugins: Add support for plugin subdirectories
    gcc-plugins: Automate make rule generation
    gcc-plugins: Add support for passing plugin arguments
    gcc-plugins: abort builds cleanly when not supported
    kbuild: no gcc-plugins during cc-option tests

    Linus Torvalds
     
  • …linux-platform-drivers-x86

    Pull x86 platform driver update from Darren Hart:
    "dell-wmi: ignore battery remove/insert event"

    * tag 'platform-drivers-x86-v4.8-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
    dell-wmi: Ignore WMI event 0xe00e

    Linus Torvalds
     
  • Pull drm fixes from Dave Airlie:
    "This contains a bunch of amdgpu fixes, and some i915 regression fixes.

    It also contains some fixes for an older regression with some EDID
    changes and some 6bpc panels.

    Then there are the lockdep, cirrus and rcar-du regression fixes from
    this window"

    * tag 'drm-fixes-for-4.8-rc2' of git://people.freedesktop.org/~airlied/linux:
    drm/cirrus: Fix NULL pointer dereference when registering the fbdev
    drm/edid: Set 8 bpc color depth for displays with "DFP 1.x compliant TMDS".
    drm/i915/dp: Revert "drm/i915/dp: fall back to 18 bpp when sink capability is unknown"
    drm/edid: Add 6 bpc quirk for display AEO model 0.
    drm: Paper over locking inversion after registration rework
    drm: rcar-du: Link HDMI encoder with bridge
    drm/ttm: Wait for a BO to become idle before unbinding it from GTT
    drm/i915/fbdev: Check for the framebuffer before use
    drm/amdgpu: update golden setting of polaris10
    drm/amdgpu: update golden setting of stoney
    drm/amdgpu: update golden setting of polaris11
    drm/amdgpu: update golden setting of carrizo
    drm/amdgpu: update golden setting of iceland
    drm/amd/amdgpu: change pptable output format from ASCII to binary
    drm/amdgpu/ci: add mullins to default case for smc ucode
    drm/amdgpu/gmc7: add missing mullins case
    drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB
    drm/i915: Wait up to 3ms for the pcu to ack the cdclk change request on SKL

    Linus Torvalds
     
  • Commit b195d5e2bffd ("ipr: Wait to do async scan until scsi host is
    initialized") fixed async scan for ipr, but broke sync scan for ipr.

    This fixes sync scan back up.

    Signed-off-by: Brian King
    Reported-and-tested-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Brian King
     
  • To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
    which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
    in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
    with __GFP_ACCOUNT, including those that aren't actually charged to any
    cgroup, i.e. allocated from the root cgroup context. To avoid overhead
    in case cgroups are not used, we only do that if memcg_kmem_enabled() is
    true. The latter is set iff there are kmem-enabled memory cgroups
    (online or offline). The root cgroup is not considered kmem-enabled.

    As a result, if a page is allocated with __GFP_ACCOUNT for the root
    cgroup when there are kmem-enabled memory cgroups and is freed after all
    kmem-enabled memory cgroups were removed, e.g.

    # no memory cgroups has been created yet, create one
    mkdir /sys/fs/cgroup/memory/test
    # run something allocating pages with __GFP_ACCOUNT, e.g.
    # a program using pipe
    dmesg | tail
    # remove the memory cgroup
    rmdir /sys/fs/cgroup/memory/test

    we'll get a bad page state bug complaining about page->_mapcount != -1:

    BUG: Bad page state in process swapper/0 pfn:1fd945c
    page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
    flags: 0x1000000000000000()

    To avoid that, let's mark with PageKmemcg only those pages that are
    actually charged to and hence pin a non-root memory cgroup.

    Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Aug, 2016

16 commits

  • The symbols used in the tick_stop tracepoint were not being converted
    properly into integers in the trace_stop format file. Instead we had this:

    print fmt: "success=%d dependency=%s", REC->success,
    __print_symbolic(REC->dependency, { 0, "NONE" },
    { (1 << TICK_DEP_BIT_POSIX_TIMER), "POSIX_TIMER" },
    { (1 << TICK_DEP_BIT_PERF_EVENTS), "PERF_EVENTS" },
    { (1 << TICK_DEP_BIT_SCHED), "SCHED" },
    { (1 << TICK_DEP_BIT_CLOCK_UNSTABLE), "CLOCK_UNSTABLE" })

    User space tools have no idea how to parse "TICK_DEP_BIT_SCHED" or the other
    symbols used to do the bit shifting. The reason is that the conversion was
    done using the TICK_DEP_MASK_* symbols, which are just macros that
    convert to the BIT shift itself (with the exception of NONE, which was
    converted properly, because it doesn't use bits, and is defined as zero).

    The TICK_DEP_BIT_* needs to be denoted by TRACE_DEFINE_ENUM() in order to
    have this properly converted for user space tools to parse this event.
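
    The effect can be reproduced with the C preprocessor alone. The
    SKETCH_* names below are hypothetical; the point is only that
    expanding a mask macro still leaves an enum name in the text, which
    a tool without the kernel headers cannot evaluate.

```c
#include <assert.h>
#include <string.h>

enum sketch_tick_dep_bits {
    SKETCH_DEP_BIT_POSIX_TIMER = 0,
    SKETCH_DEP_BIT_PERF_EVENTS = 1,
    SKETCH_DEP_BIT_SCHED       = 2,
};

/* A mask macro that expands to a shift by an enum constant. */
#define SKETCH_DEP_MASK_SCHED (1 << SKETCH_DEP_BIT_SCHED)

#define SKETCH_STR_(x) #x
#define SKETCH_STR(x)  SKETCH_STR_(x)

/* Text that results when only the macro layer is expanded. */
static const char *sketch_format_text(void)
{
    return SKETCH_STR(SKETCH_DEP_MASK_SCHED);
}

/* A tool without the headers can only evaluate digits and operators. */
static int sketch_tool_can_parse(const char *text)
{
    assert(text != NULL);
    return strspn(text, "0123456789 ()<|") == strlen(text);
}
```

    The macro expands to the text "(1 << SKETCH_DEP_BIT_SCHED)", which
    fails the check above, whereas "(1 << 2)" passes it.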

    Cc: stable@vger.kernel.org
    Cc: Frederic Weisbecker
    Fixes: e6e6cc22e067 ("nohz: Use enum code for tick stop failure tracing message")
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • cirrus_modeset_init() is initializing/registering the emulated fbdev
    and, since commit c61b93fe51b1 ("drm/atomic: Fix remaining places where
    !funcs->best_encoder is valid"), DRM internals can access/test some of
    the fields in mode_config->funcs as part of the fbdev registration
    process.
    Make sure dev->mode_config.funcs is properly set to avoid dereferencing
    a NULL pointer.

    Reported-by: Mike Marshall
    Reported-by: Eric W. Biederman
    Signed-off-by: Boris Brezillon
    Fixes: c61b93fe51b1 ("drm/atomic: Fix remaining places where !funcs->best_encoder is valid")
    Signed-off-by: Dave Airlie

    Boris Brezillon
     
  • This adds support for building more complex gcc plugins that live in a
    subdirectory instead of just in a single source file.

    Reported-by: PaX Team
    Signed-off-by: Emese Revfy
    [kees: clarified commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • There's no reason to repeat the same names in the Makefile when the .so
    files have already been listed. The .o list can be generated from them.

    Reported-by: PaX Team
    Signed-off-by: Emese Revfy
    [kees: clarified commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • The latent_entropy plugin needs to pass arguments, so this adds the
    support.

    Signed-off-by: Emese Revfy
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • When the compiler doesn't support gcc plugins (either due to missing
    headers or too old a version), report the problem and abort the build
    instead of emitting a warning and letting the build founder with arcane
    compiler errors.

    Signed-off-by: Kees Cook

    Kees Cook
     
  • The gcc-plugins arguments should not be included when performing
    cc-option tests.

    Steps to reproduce:
    1) make mrproper
    2) make defconfig
    3) enable GCC_PLUGINS, GCC_PLUGIN_CYC_COMPLEXITY
    4) enable FUNCTION_TRACER (it will select other options as well)
    5) make && make modules

    Build errors:
    MODPOST 18 modules
    ERROR: "__fentry__" [net/netfilter/xt_nat.ko] undefined!
    ERROR: "__fentry__" [net/netfilter/xt_mark.ko] undefined!
    ERROR: "__fentry__" [net/netfilter/xt_addrtype.ko] undefined!
    ERROR: "__fentry__" [net/netfilter/xt_LOG.ko] undefined!
    ERROR: "__fentry__" [net/netfilter/nf_nat_sip.ko] undefined!
    ERROR: "__fentry__" [net/netfilter/nf_nat_irc.ko] undefined!
    ERROR: "__fentry__" [net/netfilter/nf_nat_ftp.ko] undefined!
    ERROR: "__fentry__" [net/netfilter/nf_nat.ko] undefined!

    Reported-by: Laura Abbott
    Signed-off-by: Emese Revfy
    [kees: renamed variable, clarified commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     
  • According to E-EDID spec 1.3, table 3.9, a digital video sink with the
    "DFP 1.x compliant TMDS" bit set is "signal compatible with VESA DFP 1.x
    TMDS CRGB, 1 pixel / clock, up to 8 bits / color MSB aligned".

    For such displays, the DFP spec 1.0, section 3.10 "EDID support" says:

    "If the DFP monitor only supports EDID 1.X (1.1, 1.2, etc.)
    without extensions, the host will make the following assumptions:

    1. 24-bit MSB-aligned RGB TFT
    2. DE polarity is active high
    3. H and V syncs are active high
    4. Established CRT timings will be used
    5. Dithering will not be enabled on the host"

    So if we don't know the bit depth of the display from additional
    colorimetry info we should assume 8 bpc / 24 bpp by default.

    This patch adds an info->bpc = 8 assignment for that case.

    Signed-off-by: Mario Kleiner
    Cc: Jani Nikula
    Cc: Ville Syrjälä
    Cc: Daniel Vetter
    Signed-off-by: Dave Airlie

    Mario Kleiner
     
  • This reverts commit 013dd9e03872
    ("drm/i915/dp: fall back to 18 bpp when sink capability is unknown")

    This commit introduced a regression into stable kernels,
    as it reduces output color depth to 6 bpc for any video
    sink connected to a DisplayPort connector if that sink
    doesn't report a specific color depth via EDID, or if
    our EDID parser doesn't actually recognize the proper
    bpc from EDID.

    Affected are active DisplayPort->VGA converters and
    active DisplayPort->DVI converters. Both should be
    able to handle 8 bpc, but are degraded to 6 bpc with
    this patch.

    The reverted commit was meant to fix
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=105331

    A followup patch implements a fix for that specific bug,
    which is caused by a faulty EDID of the affected DP panel
    by adding a new EDID quirk for that panel.

    DP 18 bpp fallback handling and other improvements to
    DP sink bpc detection will be handled for future
    kernels in a separate series of patches.

    Please backport to stable.

    Signed-off-by: Mario Kleiner
    Acked-by: Jani Nikula
    Cc: stable@vger.kernel.org
    Cc: Ville Syrjälä
    Cc: Daniel Vetter
    Signed-off-by: Dave Airlie

    Mario Kleiner
     
  • Bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=105331
    reports that the "AEO model 0" display is driven with 8 bpc
    without dithering by default, which looks bad because that
    panel is apparently a 6 bpc DP panel with faulty EDID.

    A fix for this was made by commit 013dd9e03872
    ("drm/i915/dp: fall back to 18 bpp when sink capability is unknown").

    That commit triggers new regressions in precision for DP->DVI and
    DP->VGA displays. A patch is out to revert that commit, but it will
    revert video output for the AEO model 0 panel to 8 bpc without
    dithering.

    The EDID 1.3 of that panel, as decoded from the xrandr output
    attached to that bugzilla bug report, is somewhat faulty, and among
    other problems also sets the "DFP 1.x compliant TMDS" bit, which
    according to DFP spec means to drive the panel with 8 bpc and
    no dithering in absence of other colorimetry information.

    Try to make the original bug reporter happy despite the
    faulty EDID by adding a quirk to mark that panel as 6 bpc,
    so 6 bpc output with dithering creates a nice picture.
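
    The quirk idea can be sketched as a small lookup table keyed on the
    vendor ID and product code. This is an illustration only: the struct,
    the flag name, and the helper functions are hypothetical, and the
    single table entry is taken from the "AEO model 0" description above.

```c
#include <stddef.h>
#include <string.h>

#define QUIRK_FORCE_6BPC (1u << 0)   /* hypothetical flag name */

struct panel_quirk {
    const char *vendor;      /* three-letter vendor ID from the EDID */
    unsigned int product;    /* product code from the EDID */
    unsigned int flags;
};

/* One entry, for the panel named in the bug report. */
static const struct panel_quirk quirks[] = {
    { "AEO", 0, QUIRK_FORCE_6BPC },
};

/* Return the quirk flags for a panel, or 0 if it has no entry. */
static unsigned int lookup_quirks(const char *vendor, unsigned int product)
{
    size_t i;

    for (i = 0; i < sizeof(quirks) / sizeof(quirks[0]); i++) {
        if (strcmp(quirks[i].vendor, vendor) == 0 &&
            quirks[i].product == product)
            return quirks[i].flags;
    }
    return 0;
}

/* A quirk overrides whatever depth the (possibly faulty) EDID implies. */
static unsigned int pick_bpc(const char *vendor, unsigned int product,
                             unsigned int edid_bpc)
{
    if (lookup_quirks(vendor, product) & QUIRK_FORCE_6BPC)
        return 6;
    return edid_bpc;
}
```

    With this table, pick_bpc("AEO", 0, 8) yields 6, while any panel
    without an entry keeps the depth derived from its EDID.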

    Tested by injecting the edid from the fdo bug into a DP connector
    via drm_kms_helper.edid_firmware and verifying the 6 bpc + dithering
    is selected.

    This patch should be backported to stable.

    Signed-off-by: Mario Kleiner
    Cc: stable@vger.kernel.org
    Cc: Jani Nikula
    Cc: Ville Syrjälä
    Cc: Daniel Vetter
    Signed-off-by: Dave Airlie

    Mario Kleiner
     
  • Pull lkdtm update from Kees Cook:
    "Fix rebuild problem with LKDTM's rodata test"

    [ This, and the usercopy branch, both came in before the merge window
    closed, but ended up in my 'need to look more' queue and thus got
    merged only after rc1 was out ]

    * tag 'lkdtm-v4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    lkdtm: Fix targets for objcopy usage
    lkdtm: fix false positive warning from -Wmaybe-uninitialized

    Linus Torvalds
     
  • Pull usercopy protection from Kees Cook:
    "This implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds
     
  • When I initially added the unsafe_[get|put]_user() helpers in commit
    5b24a7a2aa20 ("Add 'unsafe' user access functions for batched
    accesses"), I made the mistake of modeling the interface on our
    traditional __[get|put]_user() functions, which return zero on success,
    or -EFAULT on failure.

    That interface is fairly easy to use, but it's actually fairly nasty for
    good code generation, since it essentially forces the caller to check
    the error value for each access.

    In particular, since the error handling is already internally
    implemented with an exception handler, and we already use "asm goto" for
    various other things, we could fairly easily make the error cases just
    jump directly to an error label instead, and avoid the need for explicit
    checking after each operation.

    So switch the interface to pass in an error label, rather than checking
    the error value in the caller. Best do it now before we start growing
    more users (the signal handling code in particular would be a good place
    to use the new interface).

    So rather than

    if (unsafe_get_user(x, ptr))
    ... handle error ..

    the interface is now

    unsafe_get_user(x, ptr, label);

    where an error during the user mode fetch will now just cause a jump to
    'label' in the caller.

    Right now the actual _implementation_ of this all still ends up being an
    "if (err) goto label", and does not take advantage of any exception
    label tricks, but for "unsafe_put_user()" in particular it should be
    fairly straightforward to convert to using the exception table model.
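
    A user-space sketch of the label-based calling convention, assuming
    nothing about the real kernel macros. Here mock_fetch() and
    mock_unsafe_get_user() are hypothetical stand-ins: a NULL pointer
    plays the role of a faulting user address, and the macro is the plain
    "if (err) goto label" form described above.

```c
#include <stddef.h>

/* Hypothetical stand-in for a user access: a NULL source "faults". */
static int mock_fetch(int *dst, const int *src)
{
    if (src == NULL)
        return -1;          /* plays the role of -EFAULT */
    *dst = *src;
    return 0;
}

/* Label-based form: on failure, jump to the caller's error label. */
#define mock_unsafe_get_user(x, ptr, label)     \
    do {                                        \
        if (mock_fetch(&(x), (ptr)))            \
            goto label;                         \
    } while (0)

/* Fetch two values with one shared error path and no per-access checks. */
static int sum_two(const int *a, const int *b, int *out)
{
    int x, y;

    mock_unsafe_get_user(x, a, efault);
    mock_unsafe_get_user(y, b, efault);
    *out = x + y;
    return 0;

efault:
    return -1;
}
```

    The caller's straight-line code contains no explicit error tests;
    every failed access funnels into the single efault label.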

    Note that "unsafe_get_user()" is much harder to convert to a clever
    exception table model, because current versions of gcc do not allow the
    use of "asm goto" (for the exception) with output values (for the actual
    value to be fetched). But that is hopefully not a limitation in the
    long term.

    [ Also note that it might be a good idea to switch unsafe_get_user() to
    actually _return_ the value it fetches from user space, but this
    commit only changes the error handling semantics ]

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • In commit 874f9c7da9a4 ("printk: create pr_ functions"), new
    pr_level defines were added to printk.c.

    These new defines are guarded by an #ifdef CONFIG_PRINTK - however,
    there is already a surrounding #ifdef CONFIG_PRINTK starting a lot
    earlier in line 249 which means the newly introduced #ifdef is
    unnecessary.
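
    The redundancy can be illustrated with a reduced example. This is
    not the printk.c code: the function names are hypothetical, and
    CONFIG_PRINTK is simply defined by hand for the sketch.

```c
#define CONFIG_PRINTK 1   /* assumed enabled for this sketch */

#ifdef CONFIG_PRINTK      /* outer guard, opened much earlier in the file */

/* Guarded only by the outer #ifdef. */
static int outer_only(void)
{
    return 1;
}

#ifdef CONFIG_PRINTK      /* inner guard: always true here, so redundant */
static int inner_guarded(void)
{
    return 1;
}
#endif

#endif /* CONFIG_PRINTK */
```

    Both functions are compiled in exactly the same configurations, so
    dropping the inner guard changes nothing.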

    Let's remove it to avoid confusion.

    Signed-off-by: Andreas Ziegler
    Cc: Joe Perches
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Ziegler
     
  • WMI event 0xe00e is received when the battery is removed or inserted.

    Signed-off-by: Pali Rohár
    Signed-off-by: Darren Hart

    Pali Rohár
     
  • The caller expects %rdi to remain intact, push+pop it to make that happen.

    Fixes the following kind of explosions on my core2duo machine when
    trying to reboot or shut down:

    general protection fault: 0000 [#1] PREEMPT SMP
    Modules linked in: i915 i2c_algo_bit drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm netconsole configfs binfmt_misc iTCO_wdt psmouse pcspkr snd_hda_codec_idt e100 coretemp hwmon snd_hda_codec_generic i2c_i801 mii i2c_smbus lpc_ich mfd_core snd_hda_intel uhci_hcd snd_hda_codec snd_hwdep snd_hda_core ehci_pci 8250 ehci_hcd snd_pcm 8250_base usbcore evdev serial_core usb_common parport_pc parport snd_timer snd soundcore
    CPU: 0 PID: 3070 Comm: reboot Not tainted 4.8.0-rc1-perf-dirty #69
    Hardware name: /D946GZIS, BIOS TS94610J.86A.0087.2007.1107.1049 11/07/2007
    task: ffff88012a0b4080 task.stack: ffff880123850000
    RIP: 0010:[] [] x86_perf_event_update+0x52/0xc0
    RSP: 0018:ffff880123853b60 EFLAGS: 00010087
    RAX: 0000000000000001 RBX: ffff88012fc0a3c0 RCX: 000000000000001e
    RDX: 0000000000000000 RSI: 0000000040000000 RDI: ffff88012b014800
    RBP: ffff880123853b88 R08: ffffffffffffffff R09: 0000000000000000
    R10: ffffea0004a012c0 R11: ffffea0004acedc0 R12: ffffffff80000001
    R13: ffff88012b0149c0 R14: ffff88012b014800 R15: 0000000000000018
    FS: 00007f8b155cd700(0000) GS:ffff88012fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8b155f5000 CR3: 000000012a2d7000 CR4: 00000000000006f0
    Stack:
    ffff88012fc0a3c0 ffff88012b014800 0000000000000004 0000000000000001
    ffff88012fc1b750 ffff880123853bb0 ffffffff81003d59 ffff88012b014800
    ffff88012fc0a3c0 ffff88012b014800 ffff880123853bd8 ffffffff81003e13
    Call Trace:
    [] x86_pmu_stop+0x59/0xd0
    [] x86_pmu_del+0x43/0x140
    [] event_sched_out.isra.105+0xbd/0x260
    [] __perf_remove_from_context+0x2d/0xb0
    [] __perf_event_exit_context+0x4d/0x70
    [] generic_exec_single+0xb6/0x140
    [] ? __perf_remove_from_context+0xb0/0xb0
    [] ? __perf_remove_from_context+0xb0/0xb0
    [] smp_call_function_single+0xdf/0x140
    [] perf_event_exit_cpu_context+0x87/0xc0
    [] perf_reboot+0x13/0x40
    [] notifier_call_chain+0x4a/0x70
    [] __blocking_notifier_call_chain+0x47/0x60
    [] blocking_notifier_call_chain+0x16/0x20
    [] kernel_restart_prepare+0x1d/0x40
    [] kernel_restart+0x12/0x60
    [] SYSC_reboot+0xf6/0x1b0
    [] ? mntput_no_expire+0x2c/0x1b0
    [] ? mntput+0x24/0x40
    [] ? __fput+0x16c/0x1e0
    [] ? ____fput+0xe/0x10
    [] ? task_work_run+0x83/0xa0
    [] ? exit_to_usermode_loop+0x53/0xc0
    [] ? trace_hardirqs_on_thunk+0x1a/0x1c
    [] SyS_reboot+0xe/0x10
    [] entry_SYSCALL_64_fastpath+0x18/0xa3
    Code: 7c 4c 8d af c0 01 00 00 49 89 fe eb 10 48 09 c2 4c 89 e0 49 0f b1 55 00 4c 39 e0 74 35 4d 8b a6 c0 01 00 00 41 8b 8e 60 01 00 00 33 8b 35 6e 02 8c 00 48 c1 e2 20 85 f6 7e d2 48 89 d3 89 cf
    RIP [] x86_perf_event_update+0x52/0xc0
    RSP
    ---[ end trace 7ec95181faf211be ]---
    note: reboot[3070] exited with preempt_count 2

    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Andy Lutomirski
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Fixes: f5967101e9de ("x86/hweight: Get rid of the special calling convention")
    Signed-off-by: Ville Syrjälä
    Signed-off-by: Linus Torvalds

    Ville Syrjälä