09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "This implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds
     

03 Aug, 2016

6 commits

  • It doesn't trim just symbols that are totally unused in-tree - it trims
    the symbols unused by any in-tree modules actually built. If you've
    done a 'make localmodconfig' and only build a hundred or so modules,
    it's pretty likely that your out-of-tree module will come up lacking
    something...

    Hopefully this will save the next guy from a Homer Simpson "D'oh!"
    moment.

    Link: http://lkml.kernel.org/r/10177.1469787292@turing-police.cc.vt.edu
    Signed-off-by: Valdis Kletnieks
    Cc: Michal Marek
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Valdis Kletnieks
     
  • Developing patches against a compiled allmodconfig kernel and committing
    them to a local tree has an unfortunate consequence: the kernel version
    changes (as it should), which forces several files to be recompiled and
    relinked even if they weren't touched (or interesting at all). This, and
    "git-whatever" figuring out the current version, slows down compilation
    for no good reason.

    But let's face it, "allmodconfig" kernels don't care about the kernel
    version; they are simply compile-check guinea pigs.

    Make LOCALVERSION_AUTO depend on !COMPILE_TEST, so it doesn't sneak into
    allmodconfig .config.

    Link: http://lkml.kernel.org/r/20160707214954.GC31678@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Michal Marek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • sprint_symbol_no_offset() returns the string "function_name
    [module_name]" where [module_name] is not printed for built in kernel
    functions. This means that the blacklisting code will fail when
    comparing module function names with the extended string.

    This patch adds the functionality to block a module's module_init()
    function by finding the space in the string and truncating the
    comparison to that length.
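
    The truncated comparison described above can be sketched in plain C as
    follows. This is a minimal user-space illustration, not the kernel's
    code: the helper name initcall_name_matches() is hypothetical, and it
    only assumes the "function_name [module_name]" string layout stated in
    this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel function): returns true when the
 * blacklist entry names the function in a symbol string of the form
 * "function_name" or "function_name [module_name]".
 */
bool initcall_name_matches(const char *entry, const char *symbol)
{
    assert(entry != NULL && symbol != NULL);

    /* Length of the function name: everything before the first space. */
    size_t name_len = strcspn(symbol, " ");

    /* Compare only the function-name part and require equal lengths. */
    return strlen(entry) == name_len &&
           strncmp(entry, symbol, name_len) == 0;
}
```

    With this sketch, the entry "foo_init" matches both "foo_init" and
    "foo_init [foo]", while "foo" does not match "foo_init [foo]".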

    Link: http://lkml.kernel.org/r/1466124387-20446-1-git-send-email-prarit@redhat.com
    Signed-off-by: Prarit Bhargava
    Cc: Thomas Gleixner
    Cc: Yang Shi
    Cc: Prarit Bhargava
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Rasmus Villemoes
    Cc: Kees Cook
    Cc: Yaowei Bai
    Cc: Andrey Ryabinin
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     
  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary.
    (One patch per tree and check in 1 month or 2 to remove old definitions)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • UML is a bit special since it has neither iomem nor DMA. That means a
    lot of drivers will not build if they miss a dependency on HAS_IOMEM.
    s390 used to have the same issues but since it gained PCI support UML is
    the only stranger.

    We are tired of patching dozens of new drivers after every merge window
    just to un-break allmod/yesconfig UML builds. One could argue that a
    decent driver has to know on what it depends and therefore a missing
    HAS_IOMEM dependency is a clear driver bug. But the dependency is not
    obvious and not everyone does UML builds with COMPILE_TEST enabled when
    developing a device driver.

    A possible solution to make these builds succeed on UML would be
    providing stub functions for ioremap() and friends which fail at
    runtime. Another one is simply disabling COMPILE_TEST for UML. Since
    it is the least hassle and does not force us to fake iomem support,
    let's do the latter.

    Link: http://lkml.kernel.org/r/1466152995-28367-1-git-send-email-richard@nod.at
    Signed-off-by: Richard Weinberger
    Acked-by: Arnd Bergmann
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • The cgroup documentation path has changed to "cgroup-v1". Update it.

    Link: http://lkml.kernel.org/r/1470148443-6509-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     

29 Jul, 2016

1 commit

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    fat: fix error message for bogus number of directory entries
    fat: fix typo s/supeblock/superblock/
    ASoC: max9877: Remove unused function declaration
    dw2102: don't output spurious blank lines to the kernel log
    init: fix Kconfig text
    ARM: io: fix comment grammar
    ocfs: fix ocfs2_xattr_user_get() argument name
    scsi/qla2xxx: Remove erroneous unused macro qla82xx_get_temp_val1()

    Linus Torvalds
     

27 Jul, 2016

3 commits

  • Implements freelist randomization for the SLUB allocator. It was
    previously implemented for the SLAB allocator. Both use the same
    configuration option (CONFIG_SLAB_FREELIST_RANDOM).

    The list is randomized during initialization of a new set of pages. The
    order on different freelist sizes is pre-computed at boot for
    performance. Each kmem_cache has its own randomized freelist.
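
    As an illustration of pre-computing a random order, here is a minimal
    user-space sketch using a Fisher-Yates shuffle. It is an assumption
    that this resembles the kernel's approach; the names demo_rand() and
    precompute_freelist_order() are hypothetical, and the xorshift
    generator merely stands in for a real entropy source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator standing in for a real entropy source. */
uint32_t demo_rand(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Fill order[0..count-1] with a random permutation of 0..count-1 using a
 * Fisher-Yates shuffle. Each entry is the index of the object to hand out
 * next when a new set of pages is initialized.
 */
void precompute_freelist_order(unsigned int *order, size_t count,
                               uint32_t seed)
{
    assert(order != NULL && seed != 0); /* xorshift needs nonzero state */

    uint32_t state = seed;

    for (size_t i = 0; i < count; i++)
        order[i] = (unsigned int)i;

    for (size_t i = count; i > 1; i--) {
        /* Pick j in [0, i); the small modulo bias is ignored here. */
        size_t j = demo_rand(&state) % i;
        unsigned int tmp = order[i - 1];

        order[i - 1] = order[j];
        order[j] = tmp;
    }
}
```

    Computing one such permutation per freelist size up front means the
    hot allocation path only has to walk an existing array.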

    This security feature reduces the predictability of the kernel SLUB
    allocator against heap overflows, rendering attacks much less stable.

    For example, these attacks exploit the predictability of the heap:
    - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
    - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

    Performance results:

    The slab_test impact is between 3% and 4% on average for 100000
    attempts without SMP. This is a very focused test; kernbench shows that
    the overall impact on the system is much lower.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
    100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
    100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
    100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
    100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
    100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
    100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
    100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
    100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
    100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 70 cycles
    100000 times kmalloc(16)/kfree -> 70 cycles
    100000 times kmalloc(32)/kfree -> 70 cycles
    100000 times kmalloc(64)/kfree -> 70 cycles
    100000 times kmalloc(128)/kfree -> 70 cycles
    100000 times kmalloc(256)/kfree -> 69 cycles
    100000 times kmalloc(512)/kfree -> 70 cycles
    100000 times kmalloc(1024)/kfree -> 73 cycles
    100000 times kmalloc(2048)/kfree -> 72 cycles
    100000 times kmalloc(4096)/kfree -> 71 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
    100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
    100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
    100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
    100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
    100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
    100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
    100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
    100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
    100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
    2. Kmalloc: alloc/free test
    100000 times kmalloc(8)/kfree -> 66 cycles
    100000 times kmalloc(16)/kfree -> 66 cycles
    100000 times kmalloc(32)/kfree -> 66 cycles
    100000 times kmalloc(64)/kfree -> 66 cycles
    100000 times kmalloc(128)/kfree -> 65 cycles
    100000 times kmalloc(256)/kfree -> 67 cycles
    100000 times kmalloc(512)/kfree -> 67 cycles
    100000 times kmalloc(1024)/kfree -> 64 cycles
    100000 times kmalloc(2048)/kfree -> 67 cycles
    100000 times kmalloc(4096)/kfree -> 67 cycles

    Kernbench, before:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 101.873 (1.16069)
    User Time 1045.22 (1.60447)
    System Time 88.969 (0.559195)
    Percent CPU 1112.9 (13.8279)
    Context Switches 189140 (2282.15)
    Sleeps 99008.6 (768.091)

    After:

    Average Optimal load -j 12 Run (std deviation):
    Elapsed Time 102.47 (0.562732)
    User Time 1045.3 (1.34263)
    System Time 88.311 (0.342554)
    Percent CPU 1105.8 (6.49444)
    Context Switches 189081 (2355.78)
    Sleeps 99231.5 (800.358)

    Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
    Signed-off-by: Thomas Garnier
    Reviewed-by: Kees Cook
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLUB allocator to catch any copies that may span objects. Includes a
    redzone handling fix discovered by Michael Ellerman.
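
    The idea of catching copies that span objects can be sketched as
    below. This is a simplified user-space model under stated assumptions,
    not the SLUB implementation: struct demo_slab and
    copy_within_one_object() are hypothetical, objects are assumed to be
    laid out at a fixed stride, and redzone handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one slab: equally spaced objects from base. */
struct demo_slab {
    uintptr_t base;      /* address of the first object */
    size_t object_size;  /* usable bytes in each object */
    size_t stride;       /* distance between objects, >= object_size */
    size_t count;        /* number of objects in the slab */
};

/*
 * Returns true when the n bytes starting at ptr lie entirely inside the
 * usable area of a single object of the slab.
 */
bool copy_within_one_object(const struct demo_slab *slab, uintptr_t ptr,
                            size_t n)
{
    assert(slab != NULL && slab->stride > 0 &&
           slab->stride >= slab->object_size);

    if (ptr < slab->base)
        return false;

    size_t off = ptr - slab->base;

    if (off >= slab->stride * slab->count)
        return false;

    /* Offset of ptr inside the object it points into. */
    size_t in_obj = off % slab->stride;

    /* Written this way so that in_obj + n cannot overflow. */
    return in_obj <= slab->object_size &&
           n <= slab->object_size - in_obj;
}
```

    A copy that starts inside one object and runs past its usable size is
    rejected, which is the class of bug the check is meant to catch.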

    Based on code from PaX and grsecurity.

    Signed-off-by: Kees Cook
    Tested-by: Michael Ellerman
    Reviewed-by: Laura Abbott

    Kees Cook
     
  • Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
    SLAB allocator to catch any copies that may span objects.

    Based on code from PaX and grsecurity.

    Signed-off-by: Kees Cook
    Tested-by: Valdis Kletnieks

    Kees Cook
     

26 Jul, 2016

2 commits

  • Pull NOHZ updates from Ingo Molnar:

    - fix system/idle cputime leaked on cputime accounting (all nohz
    configs) (Rik van Riel)

    - remove the messy, ad-hoc irqtime account on nohz-full and make it
    compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

    - cleanups (Frederic Weisbecker)

    - remove unnecessary irq disablement in the irqtime code (Rik van Riel)

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
    sched/cputime: Reorganize vtime native irqtime accounting headers
    sched/cputime: Clean up the old vtime gen irqtime accounting completely
    sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
    sched/cputime: Count actually elapsed irq & softirq time

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - documentation updates

    - miscellaneous fixes

    - minor reorganization of code

    - torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
    rcu: Correctly handle sparse possible cpus
    rcu: sysctl: Panic on RCU Stall
    rcu: Fix a typo in a comment
    rcu: Make call_rcu_tasks() tolerate first call with irqs disabled
    rcu: Disable TASKS_RCU for usermode Linux
    rcu: No ordering for rcu_assign_pointer() of NULL
    rcutorture: Fix error return code in rcu_perf_init()
    torture: Inflict default jitter
    rcuperf: Don't treat gp_exp mis-setting as a WARN
    rcutorture: Drop "-soundhw pcspkr" from x86 boot arguments
    rcutorture: Don't specify the cpu type of QEMU on PPC
    rcutorture: Make -soundhw a x86 specific option
    rcutorture: Use vmlinux as the fallback kernel image
    rcutorture/doc: Create initrd using dracut
    torture: Stop onoff task if there is only one cpu
    torture: Add starvation events to error summary
    torture: Break online and offline functions out of torture_onoff()
    torture: Forgive lengthy trace dumps and preemption
    torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code
    torture: Simplify code, eliminate RCU_PERF_TEST_RUNNABLE
    ...

    Linus Torvalds
     

14 Jul, 2016

1 commit

  • The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
    appear to currently work right.

    On CPUs without nohz_full=, only tick based irq time sampling is
    done, which breaks down when dealing with a nohz_idle CPU.

    On firewalls and similar systems, no ticks may happen on a CPU for a
    while, and the irq time spent may never get accounted properly. This
    can cause issues with capacity planning and power saving, which use
    the CPU statistics as inputs in decision making.

    Remove the VTIME_GEN vtime irq time code, and replace it with the
    IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.

    Signed-off-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

07 Jul, 2016

1 commit

  • The "expert" menu was broken (split) such that all entries in it after
    KALLSYMS were displayed in the "General setup" area instead of in the
    "Expert users" area. Fix this by adding one kconfig dependency.

    Yes, the Expert users menu is fragile. Problems like this have happened
    several times in the past. I will attempt to isolate the Expert users
    menu if there is interest in that.

    Fixes: 4d5d5664c900 ("x86: kallsyms: disable absolute percpu symbols on !SMP")
    Signed-off-by: Randy Dunlap
    Cc: Ard Biesheuvel
    Cc: stable@vger.kernel.org # 4.6
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

30 Jun, 2016

1 commit


25 Jun, 2016

3 commits

  • Merge misc fixes from Andrew Morton:
    "Two weeks worth of fixes here"

    * emailed patches from Andrew Morton : (41 commits)
    init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
    autofs: don't get stuck in a loop if vfs_write() returns an error
    mm/page_owner: avoid null pointer dereference
    tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
    fs/nilfs2: fix potential underflow in call to crc32_le
    oom, suspend: fix oom_reaper vs. oom_killer_disable race
    ocfs2: disable BUG assertions in reading blocks
    mm, compaction: abort free scanner if split fails
    mm: prevent KASAN false positives in kmemleak
    mm/hugetlb: clear compound_mapcount when freeing gigantic pages
    mm/swap.c: flush lru pvecs on compound page arrival
    memcg: css_alloc should return an ERR_PTR value on error
    memcg: mem_cgroup_migrate() may be called with irq disabled
    hugetlb: fix nr_pmds accounting with shared page tables
    Revert "mm: disable fault around on emulated access bit architecture"
    Revert "mm: make faultaround produce old ptes"
    mailmap: add Boris Brezillon's email
    mailmap: add Antoine Tenart's email
    mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
    mm: mempool: kasan: don't put mempool objects in quarantine
    ...

    Linus Torvalds
     
  • When I replaced kasprintf("%pf") with a direct call to
    sprint_symbol_no_offset I must have broken the initcall blacklisting
    feature on the arches where dereference_function_descriptor() is
    non-trivial.

    Fixes: c8cdd2be213f (init/main.c: simplify initcall_blacklisted())
    Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Cc: Yang Shi
    Cc: Prarit Bhargava
    Cc: Petr Mladek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • We've had the thread info allocated together with the thread stack for
    most architectures for a long time (since the thread_info was split off
    from the task struct), but that is about to change.

    But the patches that move the thread info to be off-stack (and a part of
    the task struct instead) made it clear how confused the allocator and
    freeing functions are.

    Because the common case was that we share an allocation with the thread
    stack and the thread_info, the two pointers were identical. That
    identity then meant that we would have things like

    ti = alloc_thread_info_node(tsk, node);
    ...
    tsk->stack = ti;

    which certainly _worked_ (since stack and thread_info have the same
    value), but is rather confusing: why are we assigning a thread_info to
    the stack? And if we move the thread_info away, the "confusing" code
    just gets to be entirely bogus.

    So remove all this confusion, and make it clear that we are doing the
    stack allocation by renaming and clarifying the function names to be
    about the stack. The fact that the thread_info then shares the
    allocation is an implementation detail, and not really about the
    allocation itself.

    This is a pure renaming and type fix: we pass in the same pointer, it's
    just that we clarify what the pointer means.

    The ia64 code that actually only has one single allocation (for all of
    task_struct, thread_info and kernel thread stack) now looks a bit odd,
    but since "tsk->stack" is actually not even used there, that oddity
    doesn't matter. It would be a separate thing to clean that up, I
    intentionally left the ia64 changes as a pure brute-force renaming and
    type change.

    Acked-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jun, 2016

1 commit


16 Jun, 2016

1 commit

  • Usermode Linux currently does not implement arch_irqs_disabled_flags(),
    which results in a build failure in TASKS_RCU. Therefore, this commit
    disables the TASKS_RCU Kconfig option in usermode Linux builds. The
    usermode Linux maintainers expect to merge arch_irqs_disabled_flags()
    into 4.8, at which point this commit may be reverted.

    Signed-off-by: Paul E. McKenney
    Cc: Jeff Dike
    Acked-by: Richard Weinberger

    Paul E. McKenney
     

28 May, 2016

1 commit

  • page_ext_init() checks suitable pages with pfn_to_nid(), but
    pfn_to_nid() depends on memmap, which will not be fully set up until
    page_alloc_init_late() is done. Use early_pfn_to_nid() instead of
    pfn_to_nid() so that page extension can still be used early, even when
    CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, and can catch early page
    allocation call sites.

    Suggested by Joonsoo Kim [1], this fix basically undoes the change
    introduced by commit b8f1a75d61d840 ("mm: call page_ext_init() after all
    struct pages are initialized") and fixes the same problem with a better
    approach.

    [1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com

    Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

27 May, 2016

1 commit

  • Pull kbuild updates from Michal Marek:

    - new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and
    unexports symbols which are not used in the current config [Nicolas
    Pitre]

    - several kbuild rule cleanups [Masahiro Yamada]

    - warning option adjustments for gcov etc [Arnd Bergmann]

    - a few more small fixes

    * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits)
    kbuild: move -Wunused-const-variable to W=1 warning level
    kbuild: fix if_change and friends to consider argument order
    kbuild: fix adjust_autoksyms.sh for modules that need only one symbol
    kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line
    gcov: disable -Wmaybe-uninitialized warning
    gcov: disable tree-loop-im to reduce stack usage
    gcov: disable for COMPILE_TEST
    Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES
    Kbuild: change CC_OPTIMIZE_FOR_SIZE definition
    kbuild: forbid kernel directory to contain spaces and colons
    kbuild: adjust ksym_dep_filter for some cmd_* renames
    kbuild: Fix dependencies for final vmlinux link
    kbuild: better abstract vmlinux sequential prerequisites
    kbuild: fix call to adjust_autoksyms.sh when output directory specified
    kbuild: Get rid of KBUILD_STR
    kbuild: rename cmd_as_s_S to cmd_cpp_s_S
    kbuild: rename cmd_cc_i_c to cmd_cpp_i_c
    kbuild: drop redundant "PHONY += FORCE"
    kbuild: delete unnecessary "@:"
    kbuild: mark help target as PHONY
    ...

    Linus Torvalds
     

21 May, 2016

4 commits

  • Using kasprintf to get the function name makes us look up the name
    twice, along with all the vsnprintf overhead of parsing the format
    string etc. It also means there is an allocation failure case to deal
    with. Since symbol_string in vsprintf.c would anyway allocate an array
    of size KSYM_SYMBOL_LEN on the stack, that might as well be done up
    here.

    Moreover, since this is a debug feature and the blacklisted_initcalls
    list is usually empty, we might as well test that and thus avoid looking
    up the symbol name even once in the common case.

    Signed-off-by: Rasmus Villemoes
    Acked-by: Rusty Russell
    Acked-by: Prarit Bhargava
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Testing has shown that the backtrace sometimes does not fit into the 4kB
    temporary buffer that is used in NMI context. The warnings are gone
    when I double the temporary buffer size.

    This patch doubles the buffer size and makes it configurable.

    Note that this problem existed even in the x86-specific implementation
    that was added by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI
    stack trace on all CPUs"). Nobody noticed it because it did not print
    any warnings.

    Signed-off-by: Petr Mladek
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Russell King
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • printk() takes some locks and cannot be used in a safe way in NMI
    context.

    The chance of a deadlock is real especially when printing stacks from
    all CPUs. This particular problem has been addressed on x86 by the
    commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all
    CPUs").

    The patchset brings two big advantages. First, it makes the NMI
    backtraces safe on all architectures for free. Second, it makes all NMI
    messages almost safe on all architectures (the temporary buffer is
    limited, so we should still keep the number of messages in NMI context
    to a minimum).

    Note that there already are several messages printed in NMI context:
    WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
    handlers. These are not easy to avoid.

    This patch reuses most of the code and makes it generic. It is useful
    for all messages and architectures that support NMI.

    The alternative printk_func is set when entering NMI context and is
    reset when leaving it. It queues IRQ work to copy the messages into the
    main ring buffer in a safe context.

    __printk_nmi_flush() copies all available messages and resets the
    buffer. This way a simple cmpxchg operation is enough to synchronize
    with writers. A spinlock is also used to synchronize with other
    flushers.
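
    The writer/flusher protocol can be modelled in user space with C11
    atomics, as in the sketch below. It is an illustration only and rests
    on assumptions: struct demo_nmi_buf, demo_nmi_write() and
    demo_nmi_flush() are hypothetical names, the per-CPU aspect and the
    spinlock between flushers are omitted, and a compare-and-exchange
    stands in for the kernel's cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 64

struct demo_nmi_buf {
    atomic_size_t len;          /* bytes of valid data in data[] */
    char data[DEMO_BUF_SIZE];
};

/* Writer: append msg; returns bytes stored (0 if the buffer is full). */
size_t demo_nmi_write(struct demo_nmi_buf *b, const char *msg)
{
    assert(b != NULL && msg != NULL);

    size_t len, add;

    do {
        len = atomic_load(&b->len);
        if (len >= DEMO_BUF_SIZE)
            return 0;
        add = strlen(msg);
        if (add > DEMO_BUF_SIZE - len)
            add = DEMO_BUF_SIZE - len;  /* truncate to the free space */
        memcpy(b->data + len, msg, add);
        /* Publish only if len is unchanged; otherwise start over. */
    } while (!atomic_compare_exchange_strong(&b->len, &len, len + add));

    return add;
}

/* Flusher: copy buffered bytes to out, then reset the buffer to empty. */
size_t demo_nmi_flush(struct demo_nmi_buf *b, char *out, size_t out_size)
{
    assert(b != NULL && out != NULL);

    size_t len, done = 0;

    do {
        len = atomic_load(&b->len);
        /* Copy only the bytes that arrived since the previous pass. */
        while (done < len && done < out_size) {
            out[done] = b->data[done];
            done++;
        }
        /* Reset only if no writer appended in the meantime. */
    } while (!atomic_compare_exchange_strong(&b->len, &len, 0));

    return done;
}
```

    The design point is that both sides retry when the length changed
    underneath them, so neither needs to take a lock against the other.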

    We no longer use seq_buf because it depends on an external lock. It
    would be hard to make all supported operations safe for lockless use,
    and it would be confusing and error prone to make only some operations
    safe.

    The code is put into separate printk/nmi.c as suggested by Steven
    Rostedt. It needs a per-CPU buffer and is compiled only on
    architectures that call nmi_enter(). This is achieved by the new
    HAVE_NMI Kconfig flag.

    The exceptions are the MN10300 and Xtensa architectures. We need to
    clean up NMI handling there first. Let's do it separately.

    The patch is heavily based on the draft from Peter Zijlstra, see

    https://lkml.org/lkml/2015/6/10/327

    [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
    [akpm@linux-foundation.org: min_t->min - all types are size_t here]
    Signed-off-by: Petr Mladek
    Suggested-by: Peter Zijlstra
    Suggested-by: Steven Rostedt
    Cc: Jan Kara
    Acked-by: Russell King [arm part]
    Cc: Daniel Thompson
    Cc: Jiri Kosina
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: David Miller
    Cc: Daniel Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • When DEFERRED_STRUCT_PAGE_INIT is enabled, only a subset of memmap is
    initialized at boot; the rest is initialized in parallel by starting a
    one-off "pgdatinitX" kernel thread for each node X.

    If page_ext_init() is called before that, some pages will not have a
    valid extension, which may lead to the following kernel oops at boot:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] free_pcppages_bulk+0x2d2/0x8d0
    PGD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
    Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
    task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
    RIP: 0010:[] [] free_pcppages_bulk+0x2d2/0x8d0
    RSP: 0000:ffff88017c087c48 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
    RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
    RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
    R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
    R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
    FS: 0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
    Call Trace:
    free_hot_cold_page+0x192/0x1d0
    __free_pages+0x5c/0x90
    __free_pages_boot_core+0x11a/0x14e
    deferred_free_range+0x50/0x62
    deferred_init_memmap+0x220/0x3c3
    kthread+0xf8/0x110
    ret_from_fork+0x22/0x40
    Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
    RIP [] free_pcppages_bulk+0x2d2/0x8d0
    RSP
    CR2: 0000000000000000

    Move page_ext_init() after page_alloc_init_late() to make sure page extension
    is setup for all pages.

    Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

20 May, 2016

1 commit

  • Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
    the SLAB freelist. The list is randomized during initialization of a
    new set of pages. The order on different freelist sizes is pre-computed
    at boot for performance. Each kmem_cache has its own randomized
    freelist. Before pre-computed lists are available, freelists are
    generated dynamically. This security feature reduces the predictability
    of the kernel SLAB allocator against heap overflows, rendering attacks
    much less stable.

    For example this attack against SLUB (also applicable against SLAB)
    would be affected:

    https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/

    Also, since v4.6 the freelist has been moved to the end of the SLAB.
    This means a controllable heap is open to new attacks not yet publicly
    discussed. A kernel heap overflow can be transformed into multiple
    use-after-frees. This feature makes this type of attack harder too.

    To generate entropy, we use get_random_bytes_arch because 0 bits of
    entropy are available in the boot stage. In the worst case this
    function will fall back to the get_random_bytes sub API. We also
    generate a random shift value to shift the pre-computed freelist for
    each new set of pages.
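
    Shifting a pre-computed list for each new set of pages can be pictured
    as a rotation, as in this minimal sketch. It assumes the shift is
    applied as a starting offset into the pre-computed order with
    wrap-around; rotate_freelist_order() is a hypothetical helper, not a
    kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Write into out[] the pre-computed order rotated by shift positions, so
 * that every new set of pages starts at a different point of the list.
 */
void rotate_freelist_order(const unsigned int *precomputed, size_t count,
                           size_t shift, unsigned int *out)
{
    assert(precomputed != NULL && out != NULL && count > 0);

    for (size_t i = 0; i < count; i++)
        out[i] = precomputed[(i + shift % count) % count];
}
```

    A rotation is far cheaper than a fresh shuffle per page set, which fits
    the stated goal of pre-computing the order for performance.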

    The config option name is not specific to the SLAB as this approach will
    be extended to other allocators like SLUB.

    Performance results highlighted no major changes:

    Hackbench (running 90 10 times):

    Before average: 0.0698
    After average: 0.0663 (-5.01%)

    slab_test, 1 run on boot. A difference is only seen on the 2048 size
    test, which is the worst-case scenario covered by freelist
    randomization: new slab pages are constantly being created over the
    10000 allocations. Variance should be mainly due to getting new pages
    every few allocations.

    Before:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
    10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
    10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
    10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
    10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
    10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
    10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
    10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
    10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
    10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
    10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
    10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 121 cycles
    10000 times kmalloc(64)/kfree -> 121 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 121 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    After:

    Single thread testing
    =====================
    1. Kmalloc: Repeatedly allocate then free test
    10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
    10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
    10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
    10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
    10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
    10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
    10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
    10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
    10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
    10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
    10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
    10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
    2. Kmalloc: alloc/free test
    10000 times kmalloc(8)/kfree -> 121 cycles
    10000 times kmalloc(16)/kfree -> 121 cycles
    10000 times kmalloc(32)/kfree -> 123 cycles
    10000 times kmalloc(64)/kfree -> 142 cycles
    10000 times kmalloc(128)/kfree -> 121 cycles
    10000 times kmalloc(256)/kfree -> 119 cycles
    10000 times kmalloc(512)/kfree -> 119 cycles
    10000 times kmalloc(1024)/kfree -> 119 cycles
    10000 times kmalloc(2048)/kfree -> 119 cycles
    10000 times kmalloc(4096)/kfree -> 119 cycles
    10000 times kmalloc(8192)/kfree -> 119 cycles
    10000 times kmalloc(16384)/kfree -> 119 cycles

    [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
    Signed-off-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Greg Thelen
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Garnier
     

10 May, 2016

1 commit

  • CC_OPTIMIZE_FOR_SIZE disables the often useful -Wmaybe-uninitialized
    warning, because that causes a ridiculous amount of false positives when
    combined with -Os.

    This means a lot of warnings don't show up in testing by the developers
    that should see them with an 'allmodconfig' kernel that has
    CC_OPTIMIZE_FOR_SIZE enabled, but only later in randconfig builds
    that don't.

    This changes the Kconfig logic around CC_OPTIMIZE_FOR_SIZE to make
    it a 'choice' statement defaulting to CC_OPTIMIZE_FOR_PERFORMANCE
    that gets added for this purpose. The allmodconfig and allyesconfig
    kernels now default to -O2 with the maybe-uninitialized warning enabled.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Michal Marek

    Arnd Bergmann
     

02 Apr, 2016

1 commit

  • Newer Fedora and OpenSUSE didn't boot with my standard configuration.
    It took me some time to figure out why, in fact I had to write a script
    to try different config options systematically.

    The problem is that something (systemd) in dracut depends on
    CONFIG_FHANDLE, which adds open by file handle syscalls.

    While it is set in defconfigs it is very easy to miss when updating
    older configs because it is not default y.

    Make it default y and also depend on EXPERT, as dracut use is likely
    widespread.

    Signed-off-by: Andi Kleen
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

30 Mar, 2016

1 commit


19 Mar, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup changes for v4.6-rc1. No userland visible behavior changes in
    this pull request. I'll send out a separate pull request for the
    addition of cgroup namespace support.

    - The biggest change is the revamping of cgroup core task migration
    and controller handling logic. There are quite a few places where
    controllers and tasks are manipulated. Previously, many of those
    places implemented custom operations for each specific use case
    assuming specific starting conditions. While this worked, it makes
    the code fragile and difficult to follow.

    The bulk of this pull request restructures these operations so that
    most related operations are performed through common helpers which
    implement recursive (subtrees are always processed consistently)
    and idempotent (they make cgroup hierarchy converge to the target
    state rather than performing operations assuming specific starting
    conditions). This makes the code a lot easier to understand,
    verify and extend.

    - Implicit controller support is added. This is primarily for using
    perf_event on the v2 hierarchy so that perf can match cgroup v2
    path without requiring the user to do anything special. The kernel
    portion of perf_event changes is acked but userland changes are
    still pending review.

    - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
    certain environments.

    - There is a regression introduced during v4.4 devel cycle where
    attempts to migrate zombie tasks can mess up internal object
    management. This was fixed earlier this week and included in this
    pull request w/ stable cc'd.

    - Misc non-critical fixes and improvements"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
    cgroup: avoid false positive gcc-6 warning
    cgroup: ignore css_sets associated with dead cgroups during migration
    Documentation: cgroup v2: Trivial heading correction.
    cgroup: implement cgroup_subsys->implicit_on_dfl
    cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
    cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
    cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
    cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
    cgroup: Trivial correction to reflect controller.
    cgroup: remove stale item in cgroup-v1 document INDEX file.
    cgroup: update css iteration in cgroup_update_dfl_csses()
    cgroup: allocate 2x cgrp_cset_links when setting up a new root
    cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
    cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
    cgroup: use cgroup_apply_enable_control() in cgroup creation path
    cgroup: combine cgroup_mutex locking and offline css draining
    cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
    cgroup: introduce cgroup_{save|propagate|restore}_control()
    cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
    cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Pull security layer updates from James Morris:
    "There are a bunch of fixes to the TPM, IMA, and Keys code, with minor
    fixes scattered across the subsystem.

    IMA now requires signed policy, and that policy is also now measured
    and appraised"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (67 commits)
    X.509: Make algo identifiers text instead of enum
    akcipher: Move the RSA DER encoding check to the crypto layer
    crypto: Add hash param to pkcs1pad
    sign-file: fix build with CMS support disabled
    MAINTAINERS: update tpmdd urls
    MODSIGN: linux/string.h should be #included to get memcpy()
    certs: Fix misaligned data in extra certificate list
    X.509: Handle midnight alternative notation in GeneralizedTime
    X.509: Support leap seconds
    Handle ISO 8601 leap seconds and encodings of midnight in mktime64()
    X.509: Fix leap year handling again
    PKCS#7: fix unitialized boolean 'want'
    firmware: change kernel read fail to dev_dbg()
    KEYS: Use the symbol value for list size, updated by scripts/insert-sys-cert
    KEYS: Reserve an extra certificate symbol for inserting without recompiling
    modsign: hide openssl output in silent builds
    tpm_tis: fix build warning with tpm_tis_resume
    ima: require signed IMA policy
    ima: measure and appraise the IMA policy itself
    ima: load policy using path
    ...

    Linus Torvalds
     

17 Mar, 2016

1 commit

  • Merge first patch-bomb from Andrew Morton:

    - some misc things

    - ocfs2 updates

    - about half of MM

    - checkpatch updates

    - autofs4 update

    * emailed patches from Andrew Morton: (120 commits)
    autofs4: fix string.h include in auto_dev-ioctl.h
    autofs4: use pr_xxx() macros directly for logging
    autofs4: change log print macros to not insert newline
    autofs4: make autofs log prints consistent
    autofs4: fix some white space errors
    autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
    autofs4: fix coding style line length in autofs4_wait()
    autofs4: fix coding style problem in autofs4_get_set_timeout()
    autofs4: coding style fixes
    autofs: show pipe inode in mount options
    kallsyms: add support for relative offsets in kallsyms address table
    kallsyms: don't overload absolute symbol type for percpu symbols
    x86: kallsyms: disable absolute percpu symbols on !SMP
    checkpatch: fix another left brace warning
    checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
    checkpatch: warn on bare unsigned or signed declarations without int
    checkpatch: exclude asm volatile from complex macro check
    mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
    mm: migrate: consolidate mem_cgroup_migrate() calls
    mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
    ...

    Linus Torvalds
     

16 Mar, 2016

4 commits

  • Similar to how relative extables are implemented, it is possible to emit
    the kallsyms table in such a way that it contains offsets relative to
    some anchor point in the kernel image rather than absolute addresses.

    On 64-bit architectures, it cuts the size of the kallsyms address table
    in half, since offsets between kernel symbols can typically be expressed
    in 32 bits. This saves several hundreds of kilobytes of permanent
    .rodata on average. In addition, the kallsyms address table is no
    longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
    effect, so the relocation work done after decompression now doesn't have
    to do relocation updates for all these values. This saves up to 24
    bytes (i.e., the size of a ELF64 RELA relocation table entry) per value,
    which easily adds up to a couple of megabytes of uncompressed __init
    data on ppc64 or arm64. Even if these relocation entries typically
    compress well, the combined size reduction of 2.8 MB uncompressed for a
    ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
    KB space saving in the compressed image.

    Since it is useful for some architectures (like x86) to retain the
    ability to emit absolute values as well, this patch also adds support
    for capturing both absolute and relative values when
    KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
    addresses as positive 32-bit values, and addresses relative to the
    lowest encountered relative symbol as negative values, which are
    subtracted from the runtime address of this base symbol to produce the
    actual address.

    Support for the above is enabled by default for all architectures except
    IA-64 and Tile-GX, whose symbols are too far apart to capture in this
    manner.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • scripts/kallsyms.c has a special --absolute-percpu command line option
    which deals with the zero based per cpu offsets that are used when
    building for SMP on x86_64. This means that the option should only be
    passed in that case, so add a Kconfig symbol with the correct predicate,
    and use that instead.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Acked-by: Rusty Russell
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • Use list_for_each_entry() instead of list_for_each() to simplify the code.

    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     
  • Pull cpu hotplug updates from Thomas Gleixner:
    "This is the first part of the ongoing cpu hotplug rework:

    - Initial implementation of the state machine

    - Runs all online and prepare down callbacks on the plugged cpu and
    not on some random processor

    - Replaces busy loop waiting with completions

    - Adds tracepoints so the states can be followed"

    More detailed commentary on this work from an earlier email:
    "What's wrong with the current cpu hotplug infrastructure?

    - Asymmetry

    The hotplug notifier mechanism is asymmetric versus the bringup and
    teardown. This is mostly caused by the notifier mechanism.

    - Largely undocumented dependencies

    While some notifiers use explicitly defined notifier priorities,
    we have quite some notifiers which use numerical priorities to
    express dependencies without any documentation why.

    - Control processor driven

    Most of the bringup/teardown of a cpu is driven by a control
    processor. While it is understandable that preparatory steps,
    like idle thread creation, memory allocation for and initialization
    of essential facilities, need to be done before a cpu can boot,
    there is no reason why everything else must run on a control
    processor. Before this patch series, bringup looks like this:

    Control CPU                   Booting CPU

    do preparatory steps
    kick cpu into life

                                  do low level init

    sync with booting cpu         sync with control cpu

    bring the rest up

    - All or nothing approach

    There is no way to do partial bringups. That's something which is
    really desired because we waste e.g. at boot substantial amount of
    time just busy waiting that the cpu comes to life. That's stupid
    as we could very well do preparatory steps and the initial IPI for
    other cpus and then go back and do the necessary low level
    synchronization with the freshly booted cpu.

    - Minimal debuggability

    Due to the notifier based design, it's impossible to switch between
    two stages of the bringup/teardown back and forth in order to test
    the correctness. So in many hotplug notifiers the cancel
    mechanisms are either not existent or completely untested.

    - Notifier [un]registering is tedious

    To [un]register notifiers we need to protect against hotplug at
    every callsite. There is no mechanism that bringup/teardown
    callbacks are issued on the online cpus, so every caller needs to
    do it itself. That also includes error rollback.

    What's the new design?

    The base of the new design is a symmetric state machine, where both
    the control processor and the booting/dying cpu execute a well
    defined set of states. Each state is symmetric in the end, except
    for some well defined exceptions, and the bringup/teardown can be
    stopped and reversed at almost all states.

    So the bringup of a cpu will look like this in the future:

    Control CPU                   Booting CPU

    do preparatory steps
    kick cpu into life

                                  do low level init

    sync with booting cpu         sync with control cpu

                                  bring itself up

    The synchronization step does not require the control cpu to wait.
    That mechanism can be done asynchronously via a worker or some
    other mechanism.

    The teardown can be made very similar, so that the dying cpu cleans
    up and brings itself down. Cleanups which need to be done after
    the cpu is gone, can be scheduled asynchronously as well.

    There is a long way to this, as we need to refactor the notion when a
    cpu is available. Today we set the cpu online right after it comes
    out of the low level bringup, which is not really correct.

    The proper mechanism is to set it to available, i.e. cpu local
    threads, like softirqd, hotplug thread etc. can be scheduled on that
    cpu, and once it finished all booting steps, it's set to online, so
    general workloads can be scheduled on it. The reverse happens on
    teardown. First thing to do is to forbid scheduling of general
    workloads, then teardown all the per cpu resources and finally shut it
    off completely.

    This patch series implements the basic infrastructure for this at the
    core level. This includes the following:

    - Basic state machine implementation with well defined states, so
    ordering and prioritization can be expressed.

    - Interfaces to [un]register state callbacks

    This invokes the bringup/teardown callback on all online cpus with
    the proper protection in place and [un]installs the callbacks in
    the state machine array.

    For callbacks which have no particular ordering requirement we have
    a dynamic state space, so that drivers don't have to register an
    explicit hotplug state.

    If a callback fails, the code automatically does a rollback to the
    previous state.

    - Sysfs interface to drive the state machine to a particular step.

    This is only partially functional today. Full functionality and
    therefore testability will be achieved once we have converted all
    existing hotplug notifiers over to the new scheme.

    - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
    processor:

    Control CPU                   Booting CPU

    do preparatory steps
    kick cpu into life

                                  do low level init

    sync with booting cpu         sync with control cpu
    wait for boot
                                  bring itself up

                                  Signal completion to control cpu

    In a previous step of this work we've done a full tree mechanical
    conversion of all hotplug notifiers to the new scheme. The balance
    is a net removal of about 4000 lines of code.

    This is not included in this series, as we decided to take a
    different approach. Instead of mechanically converting everything
    over, we will do a proper overhaul of the usage sites one by one so
    they nicely fit into the symmetric callback scheme.

    I decided to do that after I looked at the ugliness of some of the
    converted sites and figured out that their hotplug mechanism is
    completely buggered anyway. So there is no point to do a
    mechanical conversion first as we need to go through the usage
    sites one by one again in order to achieve a full symmetric and
    testable behaviour"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    cpu/hotplug: Document states better
    cpu/hotplug: Fix smpboot thread ordering
    cpu/hotplug: Remove redundant state check
    cpu/hotplug: Plug death reporting race
    rcu: Make CPU_DYING_IDLE an explicit call
    cpu/hotplug: Make wait for dead cpu completion based
    cpu/hotplug: Let upcoming cpu bring itself fully up
    arch/hotplug: Call into idle with a proper state
    cpu/hotplug: Move online calls to hotplugged cpu
    cpu/hotplug: Create hotplug threads
    cpu/hotplug: Split out the state walk into functions
    cpu/hotplug: Unpark smpboot threads from the state machine
    cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
    cpu/hotplug: Implement setup/removal interface
    cpu/hotplug: Make target state writeable
    cpu/hotplug: Add sysfs state interface
    cpu/hotplug: Hand in target state to _cpu_up/down
    cpu/hotplug: Convert the hotplugged cpu work to a state machine
    cpu/hotplug: Convert to a state machine for the control processor
    cpu/hotplug: Add tracepoints
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Pull read-only kernel memory updates from Ingo Molnar:
    "This tree adds two (security related) enhancements to the kernel's
    handling of read-only kernel memory:

    - extend read-only kernel memory to a new class of formerly writable
    kernel data: 'post-init read-only memory' via the __ro_after_init
    attribute, and mark the ARM and x86 vDSO as such read-only memory.

    This kind of attribute can be used for data that requires a once
    per bootup initialization sequence, but is otherwise never modified
    after that point.

    This feature was based on the work by PaX Team and Brad Spengler.

    (by Kees Cook, the ARM vDSO bits by David Brown.)

    - make CONFIG_DEBUG_RODATA always enabled on x86 and remove the
    Kconfig option. This simplifies the kernel and also signals that
    read-only memory is the default model and a first-class citizen.
    (Kees Cook)"

    * 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM/vdso: Mark the vDSO code read-only after init
    x86/vdso: Mark the vDSO code read-only after init
    lkdtm: Verify that '__ro_after_init' works correctly
    arch: Introduce post-init read-only memory
    x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option
    mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
    asm-generic: Consolidate mark_rodata_ro()

    Linus Torvalds
     

05 Mar, 2016

1 commit