18 Aug, 2018

1 commit

  • commit 785a19f9d1dd8a4ab2d0633be4656653bd3de1fc upstream.

    The following kernel panic was observed on an ARM64 platform due to a
    stale TLB entry:

    1. ioremap with 4K size, a valid pte page table is set.
    2. iounmap it, its pte entry is set to 0.
    3. ioremap the same address with 2M size, update its pmd entry with
    a new value.
    4. CPU may hit an exception because the old pmd entry is still in the
    TLB, which leads to a kernel panic.

    Commit b6bdb7517c3d ("mm/vmalloc: add interfaces to free unmapped page
    table") has addressed this panic by falling back to pte mappings in the
    above case on ARM64.

    To support pmd mappings in all cases, TLB purge needs to be performed
    in this case on ARM64.

    Add a new arg, 'addr', to pud_free_pmd_page() and pmd_free_pte_page()
    so that TLB purge can be added later in separate patches.
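
    As a rough sketch, the interface change amounts to threading the
    virtual address through the two helpers (simplified; the real
    prototypes live in the arch pgtable headers):

    /* before */
    int pud_free_pmd_page(pud_t *pud);
    int pmd_free_pte_page(pmd_t *pmd);

    /* after: 'addr' lets a follow-up patch issue the TLB purge */
    int pud_free_pmd_page(pud_t *pud, unsigned long addr);
    int pmd_free_pte_page(pmd_t *pmd, unsigned long addr);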

    [toshi.kani@hpe.com: merge changes, rewrite patch description]
    Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
    Signed-off-by: Chintan Pandya
    Signed-off-by: Toshi Kani
    Signed-off-by: Thomas Gleixner
    Cc: mhocko@suse.com
    Cc: akpm@linux-foundation.org
    Cc: hpa@zytor.com
    Cc: linux-mm@kvack.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Will Deacon
    Cc: Joerg Roedel
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: "H. Peter Anvin"
    Cc:
    Link: https://lkml.kernel.org/r/20180627141348.21777-3-toshi.kani@hpe.com
    Signed-off-by: Greg Kroah-Hartman

    Chintan Pandya
     

16 Aug, 2018

2 commits

  • commit 6c26fcd2abfe0a56bbd95271fce02df2896cfd24 upstream.

    pfn_modify_allowed() and arch_has_pfn_modify_check() are outside of the
    !__ASSEMBLY__ section in include/asm-generic/pgtable.h, which confuses
    the assembler on archs that don't have __HAVE_ARCH_PFN_MODIFY_ALLOWED
    (e.g. ia64) and breaks the build:

    include/asm-generic/pgtable.h: Assembler messages:
    include/asm-generic/pgtable.h:538: Error: Unknown opcode `static inline bool pfn_modify_allowed(unsigned long pfn,pgprot_t prot)'
    include/asm-generic/pgtable.h:540: Error: Unknown opcode `return true'
    include/asm-generic/pgtable.h:543: Error: Unknown opcode `static inline bool arch_has_pfn_modify_check(void)'
    include/asm-generic/pgtable.h:545: Error: Unknown opcode `return false'
    arch/ia64/kernel/entry.S:69: Error: `mov' does not fit into bundle

    Move those two static inlines into the !__ASSEMBLY__ section so that they
    don't confuse the asm build pass.
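
    The fix is the usual guard pattern for headers shared between C and
    assembly; a minimal sketch:

    #ifndef __ASSEMBLY__
    static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
    {
            return true;
    }

    static inline bool arch_has_pfn_modify_check(void)
    {
            return false;
    }
    #endif /* !__ASSEMBLY__ */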

    Fixes: 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings")
    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     
  • commit 42e4089c7890725fcd329999252dc489b72f2921 upstream

    For L1TF, PROT_NONE mappings are protected by inverting the PFN in the
    page table entry. This sets the high bits in the CPU's address space,
    thus making sure that an unmapped entry does not point to valid cached
    memory.

    Some server system BIOSes put the MMIO mappings high up in the physical
    address space. If such a high mapping were exposed to unprivileged
    users, they could attack low memory by setting such a mapping to
    PROT_NONE. This could happen through a special device driver which is
    not access protected. Normal /dev/mem is of course access protected.

    To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.

    Valid page mappings are allowed because the system is then unsafe anyways.

    It's not expected that users commonly use PROT_NONE on MMIO. But to
    minimize any impact, this is only enforced if the mapping actually
    refers to a high MMIO address (defined as the MAX_PA-1 bit being set),
    and the check is also skipped for root.

    For mmaps this is straightforward and can be handled in vm_insert_pfn()
    and in remap_pfn_range().

    For mprotect it's a bit trickier. At the point where the actual PTEs
    are accessed, a lot of state has been changed and it would be difficult
    to undo on an error. Since this is an uncommon case, use a separate
    early page table walk pass for MMIO PROT_NONE mappings that checks for
    this condition early. For non-MMIO and non-PROT_NONE mappings there are
    no changes.
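
    In rough, hedged pseudocode (the helper name matches upstream, but the
    cutoff logic is simplified and 'max_pa_bits' is a stand-in):

    static bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
    {
            /* Only non-present (PROT_NONE, inverted) entries are risky */
            if (pgprot_val(prot) & _PAGE_PRESENT)
                    return true;
            /* Real RAM is always allowed; the system is unsafe anyway */
            if (pfn_valid(pfn))
                    return true;
            /* Forbid high MMIO PFNs (MAX_PA-1 bit set) for non-root */
            return !(pfn >> (max_pa_bits - 1)) || capable(CAP_SYS_ADMIN);
    }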

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Josh Poimboeuf
    Acked-by: Dave Hansen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

30 May, 2018

1 commit

  • [ Upstream commit 173a3efd3edb2ef6ef07471397c5f542a360e9c1 ]

    Looking at functions with large stack frames across all architectures
    led me to discover that BUG() suffers from the same problem as
    fortify_panic(), which I've already added a workaround for.

    In short, variables that go out of scope by calling a noreturn function
    or __builtin_unreachable() keep using stack space in functions
    afterwards.

    A workaround that was identified is to insert an empty assembler
    statement just before calling the function that doesn't return. I'm
    adding a macro "barrier_before_unreachable()" to document this, and
    inserting calls to it in all instances of BUG() that currently suffer
    from this problem.
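
    The workaround itself is tiny; a sketch of the macro and of its use in
    a BUG() implementation (simplified from the real headers):

    /* An empty asm statement keeps gcc from carrying the stack state of
     * dead scopes across the noreturn call. */
    #define barrier_before_unreachable() asm volatile("")

    #define BUG() do {                              \
            barrier_before_unreachable();           \
            panic("BUG!");                          \
    } while (0)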

    The files that saw the largest change from this had these frame sizes
    before, and much less with my patch:

    fs/ext4/inode.c:82:1: warning: the frame size of 1672 bytes is larger than 800 bytes [-Wframe-larger-than=]
    fs/ext4/namei.c:434:1: warning: the frame size of 904 bytes is larger than 800 bytes [-Wframe-larger-than=]
    fs/ext4/super.c:2279:1: warning: the frame size of 1160 bytes is larger than 800 bytes [-Wframe-larger-than=]
    fs/ext4/xattr.c:146:1: warning: the frame size of 1168 bytes is larger than 800 bytes [-Wframe-larger-than=]
    fs/f2fs/inode.c:152:1: warning: the frame size of 1424 bytes is larger than 800 bytes [-Wframe-larger-than=]
    net/netfilter/ipvs/ip_vs_core.c:1195:1: warning: the frame size of 1068 bytes is larger than 800 bytes [-Wframe-larger-than=]
    net/netfilter/ipvs/ip_vs_core.c:395:1: warning: the frame size of 1084 bytes is larger than 800 bytes [-Wframe-larger-than=]
    net/netfilter/ipvs/ip_vs_ftp.c:298:1: warning: the frame size of 928 bytes is larger than 800 bytes [-Wframe-larger-than=]
    net/netfilter/ipvs/ip_vs_ftp.c:418:1: warning: the frame size of 908 bytes is larger than 800 bytes [-Wframe-larger-than=]
    net/netfilter/ipvs/ip_vs_lblcr.c:718:1: warning: the frame size of 960 bytes is larger than 800 bytes [-Wframe-larger-than=]
    drivers/net/xen-netback/netback.c:1500:1: warning: the frame size of 1088 bytes is larger than 800 bytes [-Wframe-larger-than=]

    In the case of ARC and CRIS, it turns out that the BUG() implementation
    actually does return (or at least the compiler thinks it does),
    resulting in lots of warnings about uninitialized variable use and
    about control reaching the end of noreturn functions, such as:

    block/cfq-iosched.c: In function 'cfq_async_queue_prio':
    block/cfq-iosched.c:3804:1: error: control reaches end of non-void function [-Werror=return-type]
    include/linux/dmaengine.h: In function 'dma_maxpq':
    include/linux/dmaengine.h:1123:1: error: control reaches end of non-void function [-Werror=return-type]

    This makes them call __builtin_trap() instead, which should normally
    dump the stack and kill the current process, like some of the other
    architectures already do.

    I tried adding barrier_before_unreachable() to panic() and
    fortify_panic() as well, but that had very little effect, so I'm not
    submitting that patch.

    Vineet said:

    : For ARC, it is double win.
    :
    : 1. Fixes 3 -Wreturn-type warnings
    :
    : | ../net/core/ethtool.c:311:1: warning: control reaches end of non-void function
    : [-Wreturn-type]
    : | ../kernel/sched/core.c:3246:1: warning: control reaches end of non-void function
    : [-Wreturn-type]
    : | ../include/linux/sunrpc/svc_xprt.h:180:1: warning: control reaches end of
    : non-void function [-Wreturn-type]
    :
    : 2. bloat-o-meter reports code size improvements as gcc elides the
    : generated code for stack return.

    Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
    Link: http://lkml.kernel.org/r/20171219114112.939391-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Vineet Gupta [arch/arc]
    Tested-by: Vineet Gupta [arch/arc]
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: "David S. Miller"
    Cc: Christopher Li
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: Josh Poimboeuf
    Cc: Will Deacon
    Cc: "Steven Rostedt (VMware)"
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     

02 May, 2018

1 commit

  • commit dd709e72cb934eefd44de8d9969097173fbf45dc upstream.

    Commit 99492c39f39f ("earlycon: Fix __earlycon_table stride") tried to fix
    __earlycon_table stride by forcing the earlycon_id struct alignment to 32
    and asking the linker to 32-byte align the __earlycon_table symbol. This
    fix was based on commit 07fca0e57fca92 ("tracing: Properly align linker
    defined symbols") which tried a similar fix for the tracing subsystem.

    However, this fix doesn't quite work because there is no guarantee that
    gcc will pack the structures into an array format. In fact, gcc 4.9
    chooses to 64-byte align these structs by inserting additional padding
    between the entries, because it has no clue that they are supposed to
    be in an array. If we are unlucky, the linker will assign symbol
    "__earlycon_table" to a 32-byte aligned address which does not
    correspond to the 64-byte aligned contents of section
    "__earlycon_table".

    To address this same problem, the fix to the tracing system was
    subsequently re-implemented using a more robust table of pointers approach
    by commits:
    3d56e331b653 ("tracing: Replace syscall_meta_data struct array with pointer array")
    654986462939 ("tracepoints: Fix section alignment using pointer array")
    e4a9ea5ee7c8 ("tracing: Replace trace_event struct array with pointer array")

    Let's use this same "array of pointers to structs" approach for
    EARLYCON_TABLE.
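
    A hedged sketch of the pattern (the names here are illustrative, not
    the actual macros): the structs stay ordinary statics, and only
    pointers to them go into the table section, so every entry is exactly
    sizeof(void *) and the array stride is guaranteed:

    static const struct earlycon_id __earlycon_id_foo = {
            .name  = "foo",
            .setup = foo_setup,        /* hypothetical setup hook */
    };

    /* The compiler cannot pad a pointer array behind our back. */
    static const struct earlycon_id *__earlycon_ptr_foo
            __attribute__((used, section("__earlycon_table"))) =
            &__earlycon_id_foo;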

    Fixes: 99492c39f39f ("earlycon: Fix __earlycon_table stride")
    Signed-off-by: Daniel Kurtz
    Suggested-by: Aaron Durbin
    Reviewed-by: Rob Herring
    Tested-by: Guenter Roeck
    Reviewed-by: Guenter Roeck
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman

    Daniel Kurtz
     

26 Apr, 2018

1 commit

  • [ Upstream commit c58f0bb77ed8bf93dfdde762b01cb67eebbdfc29 ]

    Patch series "Do not lose dirty bit on THP pages", v4.

    Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
    the dirty and access bits if the CPU sets them after the pmdp
    dereference, but before set_pmd_at().

    The bug can lead to data loss, but the race window is tiny and I
    haven't seen any reports suggesting that it happens in reality. So I
    don't think it is worth sending to stable.

    Unfortunately, there's no way to address the issue in a generic way. We
    need to fix all architectures that support THP one by one.

    All architectures that have THP support have to provide an atomic
    pmdp_invalidate() that returns the previous value.

    If the generic implementation of pmdp_invalidate() is used, the
    architecture needs to provide an atomic pmdp_establish().

    pmdp_establish() is not used outside the generic implementation of
    pmdp_invalidate() so far, but I think this can change in the future.

    This patch (of 12):

    This is an implementation of pmdp_establish() that is only suitable for
    an architecture that doesn't have hardware dirty/accessed bits. In this
    case we cannot race with a CPU setting these bits, so a non-atomic
    approach is fine.
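
    A sketch of such a non-atomic implementation, close to the generic
    helper this patch adds:

    pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
                    unsigned long address, pmd_t *pmdp, pmd_t entry)
    {
            pmd_t old_pmd = *pmdp;
            set_pmd_at(vma->vm_mm, address, pmdp, entry);
            return old_pmd;
    }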

    Link: http://lkml.kernel.org/r/20171213105756.69879-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V
    Cc: Catalin Marinas
    Cc: David Daney
    Cc: David Miller
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Nitin Gupta
    Cc: Ralf Baechle
    Cc: Thomas Gleixner
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

29 Mar, 2018

1 commit

  • commit b6bdb7517c3d3f41f20e5c2948d6bc3f8897394e upstream.

    On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
    create pud/pmd mappings. A kernel panic was observed on arm64 systems
    with Cortex-A75 in the following steps, as described by Hanjun Guo:

    1. ioremap with 4K size, a valid page table is built;
    2. iounmap it, pte0 is set to 0;
    3. ioremap the same address with 2M size, pgd/pmd is unchanged,
    then a new value is set for the pmd;
    4. pte0 is leaked;
    5. CPU may hit an exception because the old pmd is still in the TLB,
    which leads to a kernel panic.

    This panic is not reproducible on x86: INVLPG, called from iounmap,
    purges all levels of entries associated with the purged address. x86
    still has a memory leak, however.

    The patch changes the ioremap path to free unmapped page table(s) since
    doing so in the unmap path has the following issues:

    - The iounmap() path is shared with vunmap(). Since vmap() only
    supports pte mappings, making vunmap() free a pte page would be an
    overhead for regular vmap users, as they do not need a pte page freed
    up.

    - Checking if all entries in a pte page are cleared in the unmap path
    is racy, and serializing this check is expensive.

    - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
    Clearing a pud/pmd entry before the lazy TLB purges needs an extra TLB
    purge.

    Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
    clear a given pud/pmd entry and free up a page for the lower level
    entries.

    This patch implements stub functions for them on x86 and arm64, which
    act as a workaround.
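
    A sketch of the x86 pmd-level stub (simplified; the pud variant is
    analogous):

    int pmd_free_pte_page(pmd_t *pmd)
    {
            pte_t *pte;

            if (pmd_none(*pmd))
                    return 1;

            /* Detach the pte page from the pmd, then free it. */
            pte = (pte_t *)pmd_page_vaddr(*pmd);
            pmd_clear(pmd);
            free_page((unsigned long)pte);

            return 1;
    }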

    [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
    Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
    Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
    Reported-by: Lei Li
    Signed-off-by: Toshi Kani
    Cc: Catalin Marinas
    Cc: Wang Xuefeng
    Cc: Will Deacon
    Cc: Hanjun Guo
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Borislav Petkov
    Cc: Matthew Wilcox
    Cc: Chintan Pandya
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Toshi Kani
     

30 Dec, 2017

2 commits

  • commit 613e396bc0d4c7604fba23256644e78454c68cf6 upstream.

    init_espfix_bsp() needs to be invoked before the page table isolation
    initialization. Move it into mm_init() which is the place where pti_init()
    will be added.

    While at it, get rid of the #ifdeffery and provide proper stub
    functions.
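
    The stub pattern referred to above, sketched:

    #ifdef CONFIG_X86_ESPFIX64
    extern void init_espfix_bsp(void);
    #else
    static inline void init_espfix_bsp(void) { }
    #endif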

    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit c10e83f598d08046dd1ebc8360d4bb12d802d51b upstream.

    In order to sanitize the LDT initialization on x86, arch_dup_mmap()
    must be allowed to fail. Fix up all instances.
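
    A sketch of the signature change (the generic fallback simply reports
    success):

    /* before */
    static inline void arch_dup_mmap(struct mm_struct *oldmm,
                                     struct mm_struct *mm) { }

    /* after: a failing implementation can now propagate an error */
    static inline int arch_dup_mmap(struct mm_struct *oldmm,
                                    struct mm_struct *mm)
    {
            return 0;
    }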

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Andy Lutomirsky
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Dave Hansen
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: dan.j.williams@intel.com
    Cc: hughd@google.com
    Cc: keescook@google.com
    Cc: kirill.shutemov@linux.intel.com
    Cc: linux-mm@kvack.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Dec, 2017

1 commit

  • commit 11af847446ed0d131cf24d16a7ef3d5ea7a49554 upstream.

    Rename the unwinder config options from:

    CONFIG_ORC_UNWINDER
    CONFIG_FRAME_POINTER_UNWINDER
    CONFIG_GUESS_UNWINDER

    to:

    CONFIG_UNWINDER_ORC
    CONFIG_UNWINDER_FRAME_POINTER
    CONFIG_UNWINDER_GUESS

    ... in order to give them a more logical config namespace.

    Suggested-by: Ingo Molnar
    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/73972fc7e2762e91912c6b9584582703d6f1b8cc.1507924831.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Josh Poimboeuf
     

10 Dec, 2017

1 commit

  • [ Upstream commit 564c9cc84e2adf8a6671c1937f0a9fe3da2a4b0e ]

    Using .text.unlikely for refcount exceptions isn't safe because gcc may
    move entire functions into .text.unlikely (e.g. in6_dev_get()), which
    would cause any uses of a protected refcount_t function to stay inline
    with the function, triggering the protection unconditionally:

    .section .text.unlikely,"ax",@progbits
    .type in6_dev_get, @function
    in6_dev_getx:
    .LFB4673:
    .loc 2 4128 0
    .cfi_startproc
    ...
    lock; incl 480(%rbx)
    js 111f
    .pushsection .text.unlikely
    111: lea 480(%rbx), %rcx
    112: .byte 0x0f, 0xff
    .popsection
    113:

    This creates a unique .text..refcount section and adds an additional
    test to the exception handler to WARN in the case of having none of OF,
    SF, nor ZF set, so we can see things like this more easily in the
    future.

    The double dot for the section name keeps it out of the TEXT_MAIN macro
    namespace, to avoid collisions and so it can be put at the end with
    text.unlikely to keep the cold code together.

    See commit:

    cb87481ee89db ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured")

    ... which matches C names: [a-zA-Z0-9_] but not ".".
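
    As an illustration of the double-dot trick (the function here is
    hypothetical; the real use is in the x86 refcount asm macros):

    /* ".text..refcount" cannot be matched by linker-script wildcards
     * that only accept C-identifier characters, so whole cold functions
     * are never folded into it. */
    static __attribute__((used, section(".text..refcount")))
    void refcount_error_path(void)
    {
            /* cold path: saturate the refcount and WARN */
    }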

    Reported-by: Mike Galbraith
    Signed-off-by: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Elena
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch
    Fixes: 7a46ec0e2f48 ("locking/refcounts, x86/asm: Implement fast refcount overflow protection")
    Link: http://lkml.kernel.org/r/1504382986-49301-2-git-send-email-keescook@chromium.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

05 Dec, 2017

1 commit

  • commit 1501899a898dfb5477c55534bdfd734c046da06d upstream.

    Currently only get_user_pages_fast() can safely handle the writable gup
    case, due to its use of pud_access_permitted() to check whether the pud
    entry is writable. In the gup slow path pud_write() is used instead of
    pud_access_permitted(); to date it has been unimplemented and just
    calls BUG_ON().

    kernel BUG at ./include/linux/hugetlb.h:244!
    [..]
    RIP: 0010:follow_devmap_pud+0x482/0x490
    [..]
    Call Trace:
    follow_page_mask+0x28c/0x6e0
    __get_user_pages+0xe4/0x6c0
    get_user_pages_unlocked+0x130/0x1b0
    get_user_pages_fast+0x89/0xb0
    iov_iter_get_pages_alloc+0x114/0x4a0
    nfs_direct_read_schedule_iovec+0xd2/0x350
    ? nfs_start_io_direct+0x63/0x70
    nfs_file_direct_read+0x1e0/0x250
    nfs_file_read+0x90/0xc0

    For now this just implements a simple check for the _PAGE_RW bit similar
    to pmd_write. However, this implies that the gup-slow-path check is
    missing the extra checks that the gup-fast-path performs with
    pud_access_permitted. Later patches will align all checks to use the
    'access_permitted' helper if the architecture provides it.

    Note that the generic 'access_permitted' helper fallback is the simple
    _PAGE_RW check on architectures that do not define the
    'access_permitted' helper(s).
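
    The simple check, as a sketch (mirroring pmd_write() on x86):

    static inline int pud_write(pud_t pud)
    {
            return pud_flags(pud) & _PAGE_RW;
    }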

    [dan.j.williams@intel.com: fix powerpc compile error]
    Link: http://lkml.kernel.org/r/151129126165.37405.16031785266675461397.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: http://lkml.kernel.org/r/151043109938.2842.14834662818213616199.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
    Signed-off-by: Dan Williams
    Reported-by: Stephen Rothwell
    Acked-by: Thomas Gleixner [x86]
    Cc: Kirill A. Shutemov
    Cc: Catalin Marinas
    Cc: "David S. Miller"
    Cc: Dave Hansen
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boilerplate
    text.
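
    For reference, the identifier is a single comment at the top of each
    file, e.g. for a C source file:

    // SPDX-License-Identifier: GPL-2.0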

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

26 Sep, 2017

1 commit

  • As raw_cpu_generic_read() is a plain read from a raw_cpu_ptr() address,
    it's possible (albeit unlikely) that the compiler will split the access
    across multiple instructions.

    In this_cpu_generic_read() we disable preemption but not interrupts
    before calling raw_cpu_generic_read(). Thus, an interrupt could be taken
    in the middle of the split load instructions. If a this_cpu_write() or
    RMW this_cpu_*() op is made to the same variable in the interrupt
    handling path, this_cpu_read() will return a torn value.

    For native word types, we can avoid tearing using READ_ONCE(), but this
    won't work in all cases (e.g. 64-bit types on most 32-bit platforms).
    This patch reworks this_cpu_generic_read() to use READ_ONCE() where
    possible, otherwise falling back to disabling interrupts.
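
    A sketch of the native-word case (simplified; the real header also
    provides an irq-disabling variant for types that cannot be read in a
    single access):

    #define this_cpu_generic_read(pcp)                                  \
    ({                                                                  \
            typeof(pcp) ___ret;                                         \
            preempt_disable_notrace();                                  \
            /* single-copy atomic read: no tearing for native words */  \
            ___ret = READ_ONCE(*raw_cpu_ptr(&(pcp)));                   \
            preempt_enable_notrace();                                   \
            ___ret;                                                     \
    })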

    Signed-off-by: Mark Rutland
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Pranith Kumar
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Tejun Heo

    Mark Rutland
     

10 Sep, 2017

1 commit

  • Pull MTD updates from Boris Brezillon:
    "General updates:
    - Constify pci_device_id in various drivers
    - Constify device_type
    - Remove pad control code from the Gemini driver
    - Use %pOF to print OF node full_name
    - Various fixes in the physmap_of driver
    - Remove unused vars in mtdswap
    - Check devm_kzalloc() return value in the spear_smi driver
    - Check clk_prepare_enable() return code in the st_spi_fsm driver
    - Create per-MTD-device debugfs entries

    NAND updates, from Boris Brezillon:
    - Fix memory leaks in the core
    - Remove unused NAND locking support
    - Rename nand.h into rawnand.h (preparing support for spi NANDs)
    - Use NAND_MAX_ID_LEN where appropriate
    - Fix support for 20nm Hynix chips
    - Fix support for Samsung and Hynix SLC NANDs
    - Various cleanup, improvements and fixes in the qcom driver
    - Fixes for bugs detected by various static code analysis tools
    - Fix mxc ooblayout definition
    - Add a new part_parsers to tmio and sharpsl platform data in order
    to define a custom list of partition parsers
    - Request the reset line in exclusive mode in the sunxi driver
    - Fix a build error in the orion-nand driver when compiled for ARMv4
    - Allow 64-bit mvebu platforms to select the PXA3XX driver

    SPI NOR updates, from Cyrille Pitchen and Marek Vasut:
    - add support to the JEDEC JESD216B specification (SFDP tables).
    - add support to the Intel Denverton SPI flash controller.
    - fix error recovery for Spansion/Cypress SPI NOR memories.
    - fix 4-byte address management for the Aspeed SPI controller.
    - add support to some Microchip SST26 memory parts
    - remove unneeded pinctrl header"

    * tag 'for-linus-20170904' of git://git.infradead.org/linux-mtd: (74 commits)
    mtd: nand: complain loudly when chip->bits_per_cell is not correctly initialized
    mtd: nand: make Samsung SLC NAND usable again
    mtd: nand: tmio: Register partitions using the parsers
    mfd: tmio: Add partition parsers platform data
    mtd: nand: sharpsl: Register partitions using the parsers
    mtd: nand: sharpsl: Add partition parsers platform data
    mtd: nand: qcom: Support for IPQ8074 QPIC NAND controller
    mtd: nand: qcom: support for IPQ4019 QPIC NAND controller
    dt-bindings: qcom_nandc: IPQ8074 QPIC NAND documentation
    dt-bindings: qcom_nandc: IPQ4019 QPIC NAND documentation
    dt-bindings: qcom_nandc: fix the ipq806x device tree example
    mtd: nand: qcom: support for different DEV_CMD register offsets
    mtd: nand: qcom: QPIC data descriptors handling
    mtd: nand: qcom: enable BAM or ADM mode
    mtd: nand: qcom: erased codeword detection configuration
    mtd: nand: qcom: support for read location registers
    mtd: nand: qcom: support for passing flags in DMA helper functions
    mtd: nand: qcom: add BAM DMA descriptor handling
    mtd: nand: qcom: allocate BAM transaction
    mtd: nand: qcom: DMA mapping support for register read buffer
    ...

    Linus Torvalds
     

09 Sep, 2017

2 commits

    The soft dirty bit is designed to be preserved across page migration.
    This patch makes it work in the same manner for THP migration too.
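
    The arch helpers this relies on look roughly like the following x86
    sketch (encoding the soft dirty bit in the swap pmd):

    static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
    {
            return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
    }

    static inline int pmd_swp_soft_dirty(pmd_t pmd)
    {
            return pmd_flags(pmd) & _PAGE_SWP_SOFT_DIRTY;
    }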

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().
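
    The new predicate is tiny; a sketch:

    /* A pmd that is neither none nor present holds a migration (swap)
     * entry. */
    static inline int is_swap_pmd(pmd_t pmd)
    {
            return !pmd_none(pmd) && !pmd_present(pmd);
    }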

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

05 Sep, 2017

4 commits

  • Pull x86 mm changes from Ingo Molnar:
    "PCID support, 5-level paging support, Secure Memory Encryption support

    The main changes in this cycle are support for three new, complex
    hardware features of x86 CPUs:

    - Add 5-level paging support, which is a new hardware feature on
    upcoming Intel CPUs allowing up to 128 PB of virtual address space
    and 4 PB of physical RAM space - a 512-fold increase over the old
    limits. (Supercomputers of the future forecasting hurricanes on an
    ever warming planet can certainly make good use of more RAM.)

    Many of the necessary changes went upstream in previous cycles,
    v4.14 is the first kernel that can enable 5-level paging.

    This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
    default.

    (By Kirill A. Shutemov)

    - Add 'encrypted memory' support, which is a new hardware feature on
    upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
    RAM to be encrypted and decrypted (mostly) transparently by the
    CPU, with a little help from the kernel to transition to/from
    encrypted RAM. Such RAM should be more secure against various
    attacks like RAM access via the memory bus and should make the
    radio signature of memory bus traffic harder to intercept (and
    decrypt) as well.

    This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
    by default.

    (By Tom Lendacky)

    - Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
    hardware feature that attaches an address space tag to TLB entries
    and thus allows to skip TLB flushing in many cases, even if we
    switch mm's.

    (By Andy Lutomirski)

    All three of these features were in the works for a long time, and
    it's coincidence of the three independent development paths that they
    are all enabled in v4.14 at once"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
    x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
    x86/mm: Use pr_cont() in dump_pagetable()
    x86/mm: Fix SME encryption stack ptr handling
    kvm/x86: Avoid clearing the C-bit in rsvd_bits()
    x86/CPU: Align CR3 defines
    x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
    acpi, x86/mm: Remove encryption mask from ACPI page protection type
    x86/mm, kexec: Fix memory corruption with SME on successive kexecs
    x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
    x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
    x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
    x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
    x86/mm: Allow userspace have mappings above 47-bit
    x86/mm: Prepare to expose larger address space to userspace
    x86/mpx: Do not allow MPX if we have mappings above 47-bit
    x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
    x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
    x86/mm/dump_pagetables: Fix printout of p4d level
    x86/mm/dump_pagetables: Generalize address normalization
    x86/boot: Fix memremap() related build failure
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:

    - Add 'cross-release' support to lockdep, which allows APIs like
    completions, where it's not the 'owner' who releases the lock, to be
    tracked. It's all activated automatically under
    CONFIG_PROVE_LOCKING=y.

    - Clean up (restructure) the x86 atomics op implementation to be more
    readable, in preparation of KASAN annotations. (Dmitry Vyukov)

    - Fix static keys (Paolo Bonzini)

    - Add killable versions of down_read() et al (Kirill Tkhai)

    - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

    - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

    - Remove smp_mb__before_spinlock() and convert its usages, introduce
    smp_mb__after_spinlock() (Peter Zijlstra)

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    locking/lockdep/selftests: Fix mixed read-write ABBA tests
    sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
    acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
    locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
    smp: Avoid using two cache lines for struct call_single_data
    locking/lockdep: Untangle xhlock history save/restore from task independence
    locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
    futex: Remove duplicated code and fix undefined behaviour
    Documentation/locking/atomic: Finish the document...
    locking/lockdep: Fix workqueue crossrelease annotation
    workqueue/lockdep: 'Fix' flush_work() annotation
    locking/lockdep/selftests: Add mixed read-write ABBA tests
    mm, locking/barriers: Clarify tlb_flush_pending() barriers
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
    locking/lockdep: Explicitly initialize wq_barrier::done::map
    locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
    locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
    locking/refcounts, x86/asm: Implement fast refcount overflow protection
    locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
    ...

    Linus Torvalds
     
  • Pull x86 asm updates from Ingo Molnar:

    - Introduce the ORC unwinder, which can be enabled via
    CONFIG_ORC_UNWINDER=y.

    The ORC unwinder is a lightweight, Linux kernel specific debuginfo
    implementation, which aims to be DWARF done right for unwinding.
    Objtool is used to generate the ORC unwinder tables during build, so
    the data format is flexible and kernel internal: there's no
    dependency on debuginfo created by an external toolchain.

    The ORC unwinder is almost two orders of magnitude faster than the
    (out of tree) DWARF unwinder - which is important for perf call graph
    profiling. It is also significantly simpler and is coded defensively:
    there has not been a single ORC related kernel crash so far, even
    with early versions. (knock on wood!)

    But the main advantage is that enabling the ORC unwinder allows
    CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel
    measurably:

    With frame pointers disabled, GCC does not have to add frame pointer
    instrumentation code to every function in the kernel. The kernel's
    .text size decreases by about 3.2%, resulting in better cache
    utilization and fewer instructions executed, resulting in a broad
    kernel-wide speedup. Average speedup of system calls should be
    roughly in the 1-3% range - measurements by Mel Gorman [1] have shown
    a speedup of 5-10% for some function execution intense workloads.

    The main cost of the unwinder is that the unwinder data has to be
    stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel
    config - which is a modest cost on modern x86 systems.

    Given how young the ORC unwinder code is it's not enabled by default
    - but given the performance advantages the plan is to eventually make
    it the default unwinder on x86.

    See Documentation/x86/orc-unwinder.txt for more details.

    - Remove lguest support: its intended role was that of a temporary
    proof of concept for virtualization, plus its removal will enable the
    reduction (removal) of the paravirt API as well, so Rusty agreed to
    its removal. (Juergen Gross)

    - Clean up and fix FSGS related functionality (Andy Lutomirski)

    - Clean up IO access APIs (Andy Shevchenko)

    - Enhance the symbol namespace (Jiri Slaby)

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
    objtool: Handle GCC stack pointer adjustment bug
    x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone()
    x86/fpu/math-emu: Add ENDPROC to functions
    x86/boot/64: Extract efi_pe_entry() from startup_64()
    x86/boot/32: Extract efi_pe_entry() from startup_32()
    x86/lguest: Remove lguest support
    x86/paravirt/xen: Remove xen_patch()
    objtool: Fix objtool fallthrough detection with function padding
    x86/xen/64: Fix the reported SS and CS in SYSCALL
    objtool: Track DRAP separately from callee-saved registers
    objtool: Fix validate_branch() return codes
    x86: Clarify/fix no-op barriers for text_poke_bp()
    x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
    selftests/x86/fsgsbase: Test selectors 1, 2, and 3
    x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
    x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
    x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
    x86/asm/32: Fix regs_get_register() on segment registers
    x86/xen/64: Rearrange the SYSCALL entries
    x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

04 Sep, 2017

3 commits

  • Pull perf updates from Ingo Molnar:
    "Kernel side changes:

    - Add branch type profiling/tracing support. (Jin Yao)

    - Add the PERF_SAMPLE_PHYS_ADDR ABI to allow the tracing/profiling of
    physical memory addresses, where the PMU supports it. (Kan Liang)

    - Export some PMU capability details in the new
    /sys/bus/event_source/devices/cpu/caps/ sysfs directory. (Andi
    Kleen)

    - Aux data fixes and updates (Will Deacon)

    - kprobes fixes and updates (Masami Hiramatsu)

    - AMD uncore PMU driver fixes and updates (Janakarajan Natarajan)

    On the tooling side, here's a (limited!) list of highlights - there
    were many other changes that I could not list, see the shortlog and
    git history for details:

    UI improvements:

    - Implement a visual marker for fused x86 instructions in the
    annotate TUI browser, available now in 'perf report', more work
    needed to have it available as well in 'perf top' (Jin Yao)

    Further explanation from one of Jin's patches:

    │ ┌──cmpl $0x0,argp_program_version_hook
    81.93 │ ├──je 20
    │ │ lock cmpxchg %esi,0x38a9a4(%rip)
    │ │↓ jne 29
    │ │↓ jmp 43
    11.47 │20:└─→cmpxch %esi,0x38a999(%rip)

    That means the cmpl+je is a fused instruction pair and they should
    be considered together.

    - Record the branch type and then show statistics and info about in
    callchain entries (Jin Yao)

    Example from one of Jin's patches:

    # perf record -g -j any,save_type
    # perf report --branch-history --stdio --no-children

    38.50% div.c:45 [.] main div
    |
    ---main div.c:42 (RET CROSS_2M cycles:2)
    compute_flag div.c:28 (cycles:2)
    compute_flag div.c:27 (RET CROSS_2M cycles:1)
    rand rand.c:28 (cycles:1)
    rand rand.c:28 (RET CROSS_2M cycles:1)
    __random random.c:298 (cycles:1)
    __random random.c:297 (COND_BWD CROSS_2M cycles:1)
    __random random.c:295 (cycles:1)
    __random random.c:295 (COND_BWD CROSS_2M cycles:1)
    __random random.c:295 (cycles:1)
    __random random.c:295 (RET CROSS_2M cycles:9)

    namespaces support:

    - Add initial support for namespaces, using setns to access files in
    namespaces, grabbing their build-ids, etc. (Krister Johansen)

    perf trace enhancements:

    - Beautify pkey_{alloc,free,mprotect} arguments in 'perf trace'
    (Arnaldo Carvalho de Melo)

    - Add initial 'clone' syscall args beautifier in 'perf trace'
    (Arnaldo Carvalho de Melo)

    - Ignore 'fd' and 'offset' args for MAP_ANONYMOUS in 'perf trace'
    (Arnaldo Carvalho de Melo)

    - Beautifiers for the 'cmd' arg of several ioctl types, including:
    sound, DRM, KVM, vhost virtio and perf_events. (Arnaldo Carvalho de
    Melo)

    - Add PERF_SAMPLE_CALLCHAIN and PERF_RECORD_MMAP[2] to 'perf data'
    CTF conversion, allowing CTF trace visualization tools to show
    callchains and to resolve symbols (Geneviève Bastien)

    - Beautify the fcntl syscall, which is an interesting one in the
    sense that infrastructure had to be put in place to change the
    formatters of some arguments according to the value in a previous
    one, i.e. cmd dictates how arg and the syscall return will be
    formatted. (Arnaldo Carvalho de Melo)

    perf stat enhancements:

    - Use group read for event groups in 'perf stat', reducing overhead
    when groups are defined in the event specification, i.e. when using
    {} to enclose a list of events, asking them to be read at the same
    time, e.g.: "perf stat -e '{cycles,instructions}'" (Jiri Olsa)

    pipe mode improvements:

    - Process tracing data in 'perf annotate' pipe mode (David
    Carrillo-Cisneros)

    - Add header record types to pipe-mode, now this command:

    $ perf record -o - -e cycles sleep 1 | perf report --stdio --header

    Will show the same as in non-pipe mode, i.e. involving a perf.data
    file (David Carrillo-Cisneros)

    Vendor specific hardware event support updates/enhancements:

    - Update POWER9 vendor events tables (Sukadev Bhattiprolu)

    - Add POWER9 PMU events Sukadev (Bhattiprolu)

    - Support additional POWER8+ PVR in PMU mapfile (Shriya)

    - Add Skylake server uncore JSON vendor events (Andi Kleen)

    - Support exporting Intel PT data to sqlite3 with python perf
    scripts, this is in addition to the postgresql support that was
    already there (Adrian Hunter)"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (253 commits)
    perf symbols: Fix plt entry calculation for ARM and AARCH64
    perf probe: Fix kprobe blacklist checking condition
    perf/x86: Fix caps/ for !Intel
    perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR
    perf/core, pt, bts: Get rid of itrace_started
    perf trace beauty: Beautify pkey_{alloc,free,mprotect} arguments
    tools headers: Sync cpu features kernel ABI headers with tooling headers
    perf tools: Pass full path of FEATURES_DUMP
    perf tools: Robustify detection of clang binary
    tools lib: Allow external definition of CC, AR and LD
    perf tools: Allow external definition of flex and bison binary names
    tools build tests: Don't hardcode gcc name
    perf report: Group stat values on global event id
    perf values: Zero value buffers
    perf values: Fix allocation check
    perf values: Fix thread index bug
    perf report: Add dump_read function
    perf record: Set read_format for inherit_stat
    perf c2c: Fix remote HITM detection for Skylake
    perf tools: Fix static build with newer toolchains
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main RCU related changes in this cycle were:

    - Removal of spin_unlock_wait()
    - SRCU updates
    - RCU torture-test updates
    - RCU Documentation updates
    - Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
    - Miscellaneous RCU fixes
    - CPU-hotplug fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
    arch: Remove spin_unlock_wait() arch-specific definitions
    locking: Remove spin_unlock_wait() generic definitions
    drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
    ipc: Replace spin_unlock_wait() with lock/unlock pair
    exit: Replace spin_unlock_wait() with lock/unlock pair
    completion: Replace spin_unlock_wait() with lock/unlock pair
    doc: Set down RCU's scheduling-clock-interrupt needs
    doc: No longer allowed to use rcu_dereference on non-pointers
    doc: Add RCU files to docbook-generation files
    doc: Update memory-barriers.txt for read-to-write dependencies
    doc: Update RCU documentation
    membarrier: Provide expedited private command
    rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
    rcu: Add warning to rcu_idle_enter() for irqs enabled
    rcu: Make rcu_idle_enter() rely on callers disabling irqs
    rcu: Add assertions verifying blocked-tasks list
    rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
    rcu: Add TPS() protection for _rcu_barrier_trace strings
    rcu: Use idle versions of swait to make idle-hack clear
    swait: Add idle variants which don't contribute to load average
    ...

    Linus Torvalds
     
  • Conflicts:
    mm/page_alloc.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Aug, 2017

3 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • When !NUMA, cpumask_of_node(@node) equals cpu_online_mask regardless of
    @node. The assumption seems to be that, if !NUMA, there shouldn't be
    more than one node and thus reporting cpu_online_mask regardless of
    @node is correct. However, that assumption was broken years ago to
    support DISCONTIGMEM, and whether a system has multiple nodes or not is
    separately controlled by NEED_MULTIPLE_NODES.

    This means that, on a system with !NUMA && NEED_MULTIPLE_NODES,
    cpumask_of_node() will report cpu_online_mask for all possible nodes,
    indicating that the CPUs are associated with multiple nodes which is an
    impossible configuration.

    This bug has been around forever but doesn't look like it has caused any
    noticeable symptoms. However, it triggers a WARN recently added to
    workqueue to verify NUMA affinity configuration.

    Fix it by reporting empty cpumask on non-zero nodes if !NUMA.
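
    A sketch of the fix in the generic topology header:

    #ifndef cpumask_of_node
      #ifdef CONFIG_NEED_MULTIPLE_NODES
        /* Without NUMA, only node 0 has any CPUs. */
        #define cpumask_of_node(node)  ((node) == 0 ? cpu_online_mask \
                                                    : cpu_none_mask)
      #else
        #define cpumask_of_node(node)  ((void)(node), cpu_online_mask)
      #endif
    #endif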

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Geert Uytterhoeven
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Merge v4.13-rc7 back to resolve merge conflicts in
    drivers/mtd/nand/nandsim.c and include/asm-generic/vmlinux.lds.h.

    Boris Brezillon
     

26 Aug, 2017

2 commits

  • Conflicts:
    arch/x86/kernel/head64.c
    arch/x86/mm/mmap.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • There is code duplicated over all architectures' headers for
    futex_atomic_op_inuser: namely op decoding, the access_ok() check for
    uaddr, and comparison of the result.

    Remove this duplication and leave to the arches only the needed
    assembly, which is now in arch_futex_atomic_op_inuser().

    This effectively distributes Will Deacon's arm64 fix for the undefined
    behaviour reported by UBSAN to all architectures. The fix was done in
    commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
    FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

    And as suggested by Thomas, check for a negative oparg too, because it
    was also reported to cause an undefined behaviour report.

    Note that s390 removed the access_ok check in d12a29703 ("s390/uaccess:
    remove pointless access_ok() checks"), as access_ok there returns true.
    We introduce it back to the helper for the sake of simplicity (it gets
    optimized away anyway).
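
    A hedged sketch of the now-shared logic (simplified from the generic
    helper; the arch hook performs only the atomic operation itself):

    static int futex_atomic_op_inuser(unsigned int encoded_op,
                                      u32 __user *uaddr)
    {
            unsigned int op  = (encoded_op & 0x70000000) >> 28;
            unsigned int cmp = (encoded_op & 0x0f000000) >> 24;
            int oparg  = sign_extend32((encoded_op & 0x00fff000) >> 12, 11);
            int cmparg = sign_extend32(encoded_op & 0x00000fff, 11);
            int oldval, ret;

            if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
                    /* Reject negative shifts instead of invoking UB. */
                    if (oparg < 0 || oparg > 31)
                            return -EINVAL;
                    oparg = 1 << oparg;
            }

            /* The arch hook does the access check and the atomic op. */
            ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
            if (ret)
                    return ret;

            switch (cmp) {
            case FUTEX_OP_CMP_EQ: return oldval == cmparg;
            case FUTEX_OP_CMP_NE: return oldval != cmparg;
            /* remaining comparisons elided in this sketch */
            default:              return -ENOSYS;
            }
    }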

    Signed-off-by: Jiri Slaby
    Signed-off-by: Thomas Gleixner
    Acked-by: Russell King
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Heiko Carstens [s390]
    Acked-by: Chris Metcalf [for tile]
    Reviewed-by: Darren Hart (VMware)
    Reviewed-by: Will Deacon [core/arm64]
    Cc: linux-mips@linux-mips.org
    Cc: Rich Felker
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: peterz@infradead.org
    Cc: Benjamin Herrenschmidt
    Cc: Max Filippov
    Cc: Paul Mackerras
    Cc: sparclinux@vger.kernel.org
    Cc: Jonas Bonn
    Cc: linux-s390@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: Yoshinori Sato
    Cc: linux-hexagon@vger.kernel.org
    Cc: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Catalin Marinas
    Cc: Matt Turner
    Cc: linux-snps-arc@lists.infradead.org
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: linux-xtensa@linux-xtensa.org
    Cc: Stefan Kristiansson
    Cc: openrisc@lists.librecores.org
    Cc: Ivan Kokshaysky
    Cc: Stafford Horne
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Richard Henderson
    Cc: Chris Zankel
    Cc: Michal Simek
    Cc: Tony Luck
    Cc: linux-parisc@vger.kernel.org
    Cc: Vineet Gupta
    Cc: Ralf Baechle
    Cc: Richard Kuo
    Cc: linux-alpha@vger.kernel.org
    Cc: Martin Schwidefsky
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: "David S. Miller"
    Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz

    Jiri Slaby
     

15 Aug, 2017

1 commit

  • When XIP_KERNEL is enabled, some functions are defined in the .data
    ELF section because we require them to be in RAM whenever we communicate
    with the flash chip. However this causes problems when FTRACE is
    enabled and gcc emits calls to __gnu_mcount_nc in the function
    prolog:

    drivers/built-in.o: In function `cfi_chip_setup':
    :(.data+0x272fc): relocation truncated to fit: R_ARM_CALL against symbol `__gnu_mcount_nc' defined in .text section in arch/arm/kernel/built-in.o
    drivers/built-in.o: In function `cfi_probe_chip':
    :(.data+0x27de8): relocation truncated to fit: R_ARM_CALL against symbol `__gnu_mcount_nc' defined in .text section in arch/arm/kernel/built-in.o
    /tmp/ccY172rP.s: Assembler messages:
    /tmp/ccY172rP.s:70: Warning: ignoring changed section attributes for .data
    /tmp/ccY172rP.s: Error: 1 warning, treating warnings as errors
    make[5]: *** [drivers/mtd/chips/cfi_probe.o] Error 1
    /tmp/ccK4rjeO.s: Assembler messages:
    /tmp/ccK4rjeO.s:421: Warning: ignoring changed section attributes for .data
    /tmp/ccK4rjeO.s: Error: 1 warning, treating warnings as errors
    make[5]: *** [drivers/mtd/chips/cfi_util.o] Error 1
    /tmp/ccUvhCYR.s: Assembler messages:
    /tmp/ccUvhCYR.s:1895: Warning: ignoring changed section attributes for .data
    /tmp/ccUvhCYR.s: Error: 1 warning, treating warnings as errors

    Specifically, this does not work because the .data section is not
    marked executable, which leads LD to not generate trampolines for
    long calls.

    This moves the __xipram functions into their own .xiptext section instead.
    The section is still placed next to .data and located in RAM but is marked
    executable, which avoids the build errors.

    Also, we only need to place the XIP functions into a separate section
    if both CONFIG_XIP_KERNEL and CONFIG_MTD_XIP are set: when only MTD_XIP
    is used, the whole kernel is still in RAM and we do not need to worry
    about pulling the rug out from under it. When only XIP_KERNEL but not
    MTD_XIP is set, the kernel is in some form of ROM, but we never write
    to it.
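
    A sketch of the annotation this results in (simplified):

    #if defined(CONFIG_XIP_KERNEL) && defined(CONFIG_MTD_XIP)
    /* Functions that must run from RAM while the flash is busy live in
     * the executable .xiptext section placed next to .data. */
    #define __xipram noinline __attribute__((__section__(".xiptext")))
    #else
    #define __xipram
    #endif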

    Note that MTD_XIP has been broken on ARM since around 2011 or 2012. I
    have sent another patch[2] to fix compilation, which I plan to merge
    through arm-soc unless there are objections. The obvious alternative
    to that would be to completely rip out the MTD_XIP support from the
    kernel, since obviously nobody has been using it in a long while.

    Link: [1] https://patchwork.kernel.org/patch/8109771/
    Link: [2] https://patchwork.kernel.org/patch/9855225/
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Boris Brezillon

    Arnd Bergmann
     

11 Aug, 2017

2 commits

  • Conflicts:
    include/linux/mm_types.h
    mm/huge_memory.c

    I removed the smp_mb__before_spinlock() like the following commit does:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and fixed up the affected commits.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
    problem and Mel fixed it[1] and found same problem on MADV_FREE[2].

    Quote from Mel Gorman:
    "The race in question is CPU 0 running madv_free and updating some PTEs
    while CPU 1 is also running madv_free and looking at the same PTEs.
    CPU 1 may have writable TLB entries for a page but fail the pte_dirty
    check (because CPU 0 has updated it already) and potentially fail to
    flush.

    Hence, when madv_free on CPU 1 returns, there are still potentially
    writable TLB entries and the underlying PTE is still present so that a
    subsequent write does not necessarily propagate the dirty bit to the
    underlying PTE any more. Reclaim at some unknown time at the future
    may then see that the PTE is still clean and discard the page even
    though a write has happened in the meantime. I think this is possible
    but I could have missed some protection in madv_free that prevents it
    happening."

    This patch aims for solving both problems all at once and is ready for
    other problem with KSM, MADV_FREE and soft-dirty story[3].

    The TLB batch API (tlb_[gather|finish]_mmu) uses
    [inc|dec]_tlb_flush_pending and mm_tlb_flush_pending so that when
    tlb_finish_mmu() is called, we can detect that parallel threads are
    going on. In that case, forcefully flush the TLB to prevent a user from
    accessing memory via a stale TLB entry even though we failed to gather
    the page table entry.
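
    A sketch of the detection (names follow the upstream helpers):

    /* More than one concurrent gather means another thread may have
     * cleared ptes we did not see; tlb_finish_mmu() must then flush
     * even if this thread gathered nothing. */
    static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
    {
            return atomic_read(&mm->tlb_flush_pending) > 1;
    }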

    I confirmed this patch works with the test program Nadav gave [4], so
    this patch supersedes "mm: Always flush VMA ranges affected by
    zap_page_range v2" in current mmotm.

    NOTE:

    This patch modifies the arch-specific TLB gathering interface (x86,
    ia64, s390, sh, um). Most architectures are straightforward, but s390
    needs care because tlb_flush_mmu() works only if mm->context.flush_mm
    is set to non-zero, which happens only when a pte entry really is
    cleared by ptep_get_and_clear() and friends. However, this problem
    never changes the pte entries, but we still need to flush to prevent
    memory access through a stale TLB.

    [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
    [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
    [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
    [4] https://patchwork.kernel.org/patch/9861621/

    [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
    Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
    Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com
    Signed-off-by: Minchan Kim
    Signed-off-by: Nadav Amit
    Reported-by: Nadav Amit
    Reported-by: Mel Gorman
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: Russell King
    Cc: Tony Luck
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Jeff Dike
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim