16 Aug, 2018

1 commit

  • commit b5b1404d0815894de0690de8a1ab58269e56eae6 upstream.

    This is purely a preparatory patch for upcoming changes during the 4.19
    merge window.

    We have a function called "boot_cpu_state_init()" that isn't really
    about the bootup cpu state: that is done much earlier by the similarly
    named "boot_cpu_init()" (note lack of "state" in name).

    This function initializes some hotplug CPU state, and needs to run after
    the percpu data has been properly initialized. It even has a comment to
    that effect.

    Except it _doesn't_ actually run after the percpu data has been properly
    initialized. On x86 it happens to do that, but on at least arm and
    arm64, the percpu base pointers are initialized by the arch-specific
    'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().

    This had some unexpected results, and in particular we have a patch
    pending for the merge window that did the obvious cleanup of using
    'this_cpu_write()' in the cpu hotplug init code:

    - per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
    + this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);

    which is obviously the right thing to do. Except because of the
    ordering issue, it actually failed miserably and unexpectedly on arm64.

    So this just fixes the ordering, and changes the name of the function to
    be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
    hotplug state, because the core CPU state was supposed to have already
    been done earlier.
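
The ordering problem can be modelled in userspace. The sketch below is only an illustration under stated assumptions: every `sim_` name is hypothetical, slot 0 of `sim_areas` stands in for the unrelocated copy of the variable, and a single `sim_this_cpu_base` variable stands in for the arch-specific percpu base pointer that the commit says is set up by smp_prepare_boot_cpu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_NR_CPUS 2

enum { SIM_CPUHP_OFFLINE = 0, SIM_CPUHP_ONLINE = 1 };

struct sim_cpuhp_state { int state; };

/* Slot 0 models the unrelocated copy; slots 1..SIM_NR_CPUS are per-CPU copies. */
static struct sim_cpuhp_state sim_areas[SIM_NR_CPUS + 1];
static size_t sim_offset_table[SIM_NR_CPUS]; /* filled when percpu areas are set up */
static size_t sim_this_cpu_base;             /* filled by the arch hook */

void sim_reset(void)
{
	for (size_t i = 0; i <= SIM_NR_CPUS; i++)
		sim_areas[i].state = SIM_CPUHP_OFFLINE;
	for (size_t cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sim_offset_table[cpu] = 0;
	sim_this_cpu_base = 0;
}

/* Models percpu area setup: each CPU gets its own slot. */
void sim_setup_per_cpu_areas(void)
{
	for (size_t cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sim_offset_table[cpu] = cpu + 1;
}

/* Models the arch hook that loads the boot CPU's percpu base pointer. */
void sim_smp_prepare_boot_cpu(void)
{
	sim_this_cpu_base = sim_offset_table[0];
}

/* Models per_cpu_ptr(): goes through the offset table. */
struct sim_cpuhp_state *sim_per_cpu_ptr(size_t cpu)
{
	assert(cpu < SIM_NR_CPUS);
	return &sim_areas[sim_offset_table[cpu]];
}

/* Models this_cpu_write(): goes through the arch base pointer. */
void sim_this_cpu_write_state(int value)
{
	sim_areas[sim_this_cpu_base].state = value;
}

/* Returns the boot CPU's state as later readers would see it. */
int sim_boot(bool fixed_order)
{
	sim_reset();
	sim_setup_per_cpu_areas();
	if (fixed_order)
		sim_smp_prepare_boot_cpu();
	sim_this_cpu_write_state(SIM_CPUHP_ONLINE); /* the hotplug init step */
	if (!fixed_order)
		sim_smp_prepare_boot_cpu();
	return sim_per_cpu_ptr(0)->state;
}
```

In this model `sim_boot(false)` leaves the boot CPU's copy untouched because the write lands in slot 0, while `sim_boot(true)` updates the intended copy.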

    Marked for stable, since the (not yet merged) patch that will show this
    problem is marked for stable.

    Reported-by: Vlastimil Babka
    Reported-by: Mian Yousaf Kaukab
    Suggested-by: Catalin Marinas
    Acked-by: Thomas Gleixner
    Cc: Will Deacon
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

21 Jun, 2018

1 commit

  • [ Upstream commit ae646f0b9ca135b87bc73ff606ef996c3029780a ]

    load_module() creates W+X mappings via __vmalloc_node_range() (from
    layout_and_allocate()->move_module()->module_alloc()) by using
    PAGE_KERNEL_EXEC. These mappings are later cleaned up via
    "call_rcu_sched(&freeinit->rcu, do_free_init)" from do_init_module().

    This is a problem because call_rcu_sched() queues work, which can be run
    after debug_checkwx() is run, resulting in a race condition. If hit,
    the race results in a nasty splat about insecure W+X mappings, which
    results in a poor user experience as these are not the mappings that
    debug_checkwx() is intended to catch.

    This issue is observed on multiple arm64 platforms, and has been
    artificially triggered on an x86 platform.

    Address the race by flushing the queued work before running the
    arch-defined mark_rodata_ro() which then calls debug_checkwx().
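
A small userspace model of the race may help. It is a sketch, not the kernel code: the `sim_` names, the fixed-size callback queue and the `struct sim_mapping` type are all assumptions made for illustration, and the commit text does not say which kernel primitive performs the flush.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_DEFERRED 8

struct sim_mapping {
	bool writable;
	bool executable;
	bool live; /* false once the mapping has been freed */
};

typedef void (*sim_deferred_fn)(struct sim_mapping *);

static struct {
	sim_deferred_fn fn;
	struct sim_mapping *arg;
} sim_queue[SIM_MAX_DEFERRED];
static int sim_queued;

/* Models queueing a cleanup callback that runs at some later point. */
void sim_defer(sim_deferred_fn fn, struct sim_mapping *arg)
{
	assert(sim_queued < SIM_MAX_DEFERRED);
	sim_queue[sim_queued].fn = fn;
	sim_queue[sim_queued].arg = arg;
	sim_queued++;
}

/* Runs every queued callback now: the "flush" the fix adds. */
void sim_flush_deferred(void)
{
	for (int i = 0; i < sim_queued; i++)
		sim_queue[i].fn(sim_queue[i].arg);
	sim_queued = 0;
}

/* Models the deferred freeing of a module's init mapping. */
void sim_free_init(struct sim_mapping *m)
{
	m->live = false;
}

/* Models the W+X check: counts live mappings that are writable and executable. */
int sim_count_wx(const struct sim_mapping *maps, int n)
{
	int hits = 0;

	for (int i = 0; i < n; i++)
		if (maps[i].live && maps[i].writable && maps[i].executable)
			hits++;
	return hits;
}

/* Models the read-only marking step, with or without the flush. */
int sim_mark_readonly(struct sim_mapping *maps, int n, bool flush_first)
{
	if (flush_first)
		sim_flush_deferred();
	return sim_count_wx(maps, n);
}
```

Without the flush, a mapping whose cleanup is still queued is counted as a W+X hit; with the flush, the same mapping is already gone when the check runs.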

    Link: http://lkml.kernel.org/r/1525103946-29526-1-git-send-email-jhugo@codeaurora.org
    Fixes: e1a58320a38d ("x86/mm: Warn on W^X mappings")
    Signed-off-by: Jeffrey Hugo
    Reported-by: Timur Tabi
    Reported-by: Jan Glauber
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Acked-by: Will Deacon
    Acked-by: Laura Abbott
    Cc: Mark Rutland
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Stephen Smalley
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jeffrey Hugo
     

22 Feb, 2018

1 commit

  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

03 Jan, 2018

1 commit

  • commit aa8c6248f8c75acfd610fe15d8cae23cf70d9d09 upstream.

    Add the initial files for kernel page table isolation, with a minimal init
    function and the boot time detection for this misfeature.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: David Laight
    Cc: Denys Vlasenko
    Cc: Eduardo Valentin
    Cc: Greg KH
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: aliguori@amazon.com
    Cc: daniel.gruss@iaik.tugraz.at
    Cc: hughd@google.com
    Cc: keescook@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

30 Dec, 2017

1 commit

  • commit 613e396bc0d4c7604fba23256644e78454c68cf6 upstream.

    init_espfix_bsp() needs to be invoked before the page table isolation
    initialization. Move it into mm_init() which is the place where pti_init()
    will be added.

    While at it get rid of the #ifdeffery and provide proper stub functions.

    Signed-off-by: Thomas Gleixner
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

09 Sep, 2017

2 commits

  • Feed the boot command-line to the /dev/random entropy pool

    Existing Android bootloaders usually pass data which may not be known by
    an external attacker on the kernel command-line. It may also be the
    case on other embedded systems. Sample command-line from a Google Pixel
    running CopperheadOS....

    console=ttyHSL0,115200,n8 androidboot.console=ttyHSL0
    androidboot.hardware=sailfish user_debug=31 ehci-hcd.park=3
    lpm_levels.sleep_disabled=1 cma=32M@0-0xffffffff buildvariant=user
    veritykeyid=id:dfcb9db0089e5b3b4090a592415c28e1cb4545ab
    androidboot.bootdevice=624000.ufshc androidboot.verifiedbootstate=yellow
    androidboot.veritymode=enforcing androidboot.keymaster=1
    androidboot.serialno=FA6CE0305299 androidboot.baseband=msm
    mdss_mdp.panel=1:dsi:0:qcom,mdss_dsi_samsung_ea8064tg_1080p_cmd:1:none:cfg:single_dsi
    androidboot.slot_suffix=_b fpsimd.fpsimd_settings=0
    app_setting.use_app_setting=0 kernelflag=0x00000000 debugflag=0x00000000
    androidboot.hardware.revision=PVT radioflag=0x00000000
    radioflagex1=0x00000000 radioflagex2=0x00000000 cpumask=0x00000000
    androidboot.hardware.ddr=4096MB,Hynix,LPDDR4 androidboot.ddrinfo=00000006
    androidboot.ddrsize=4GB androidboot.hardware.color=GRA00
    androidboot.hardware.ufs=32GB,Samsung androidboot.msm.hw_ver_id=268824801
    androidboot.qf.st=2 androidboot.cid=11111111 androidboot.mid=G-2PW4100
    androidboot.bootloader=8996-012001-1704121145
    androidboot.oem_unlock_support=1 androidboot.fp_src=1
    androidboot.htc.hrdump=detected androidboot.ramdump.opt=mem@2g:2g,mem@4g:2g
    androidboot.bootreason=reboot androidboot.ramdump_enable=0 ro
    root=/dev/dm-0 dm="system none ro,0 1 android-verity /dev/sda34"
    rootwait skip_initramfs init=/init androidboot.wificountrycode=US
    androidboot.boottime=1BLL:85,1BLE:669,2BLL:0,2BLE:1777,SW:6,KL:8136

    Among other things, it contains a value unique to the device
    (androidboot.serialno=FA6CE0305299), unique to the OS builds for the
    device variant (veritykeyid=id:dfcb9db0089e5b3b4090a592415c28e1cb4545ab)
    and timings from the bootloader stages in milliseconds
    (androidboot.boottime=1BLL:85,1BLE:669,2BLL:0,2BLE:1777,SW:6,KL:8136).
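
To illustrate why device-unique fields in the command line are useful input, here is a toy sketch. It assumes a hypothetical `struct sim_pool` whose mixing step is plain 64-bit FNV-1a; this is a stand-in chosen for brevity, not the mixing the kernel's entropy pool performs, and all `sim_` names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sim_pool { uint64_t state; };

void sim_pool_init(struct sim_pool *p)
{
	assert(p != NULL);
	p->state = UINT64_C(0xcbf29ce484222325); /* FNV-1a offset basis */
}

/* Mix arbitrary bytes into the pool state (toy FNV-1a, not cryptographic). */
void sim_mix_bytes(struct sim_pool *p, const void *buf, size_t len)
{
	const unsigned char *bytes = buf;

	for (size_t i = 0; i < len; i++) {
		p->state ^= bytes[i];
		p->state *= UINT64_C(0x100000001b3); /* FNV-1a prime */
	}
}

/* Pool state after feeding in a whole command line. */
uint64_t sim_pool_after_cmdline(const char *cmdline)
{
	struct sim_pool p;

	sim_pool_init(&p);
	sim_mix_bytes(&p, cmdline, strlen(cmdline));
	return p.state;
}
```

Two devices that differ only in `androidboot.serialno` end up with different pool states in this model, which is the property the changelog relies on. The command line is treated as extra input that an external attacker may not know, not as a guaranteed amount of entropy.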

    [tytso@mit.edu: changelog tweak]
    [labbott@redhat.com: line-wrapped command line]
    Link: http://lkml.kernel.org/r/20170816231458.2299-3-labbott@redhat.com
    Signed-off-by: Daniel Micay
    Signed-off-by: Laura Abbott
    Acked-by: Kees Cook
    Cc: "Theodore Ts'o"
    Cc: Laura Abbott
    Cc: Nick Kralevich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Micay
     
  • Patch series "Command line randomness", v3.

    A series to add the kernel command line as a source of randomness.

    This patch (of 2):

    Stack canary initialization involves getting a random number. Getting this
    random number may involve accessing caches or other architecture-specific
    features which are not available until after the architecture is set up.
    Move the stack canary initialization later to accommodate this.

    Link: http://lkml.kernel.org/r/20170816231458.2299-2-labbott@redhat.com
    Signed-off-by: Laura Abbott
    Signed-off-by: Laura Abbott
    Acked-by: Kees Cook
    Cc: "Theodore Ts'o"
    Cc: Daniel Micay
    Cc: Nick Kralevich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     

07 Sep, 2017

2 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of changes for percpu this time around. percpu inherited the
    same area allocator from the original pre-virtual-address-mapped
    implementation. This was from the time when percpu allocator wasn't
    used all that much and the implementation was focused on simplicity,
    with the unfortunate computational complexity of O(number of areas
    allocated from the chunk) per alloc / free.

    With the increase in percpu usage, we're hitting cases where the lack
    of scalability is hurting. The most prominent one right now is bpf
    percpu map creation / destruction which may allocate and free a lot of
    entries consecutively and it's likely that the problem will become
    more prominent in the future.

    To address the issue, Dennis replaced the area allocator with a hinted
    bitmap allocator which is more consistent. While the new allocator
    does perform a bit worse in some cases, it outperforms the old
    allocator way more than an order of magnitude in other more common
    scenarios while staying mostly flat in CPU overhead and completely
    flat in memory consumption"

    * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (27 commits)
    percpu: update header to contain bitmap allocator explanation.
    percpu: update pcpu_find_block_fit to use an iterator
    percpu: use metadata blocks to update the chunk contig hint
    percpu: update free path to take advantage of contig hints
    percpu: update alloc path to only scan if contig hints are broken
    percpu: keep track of the best offset for contig hints
    percpu: skip chunks if the alloc does not fit in the contig hint
    percpu: add first_bit to keep track of the first free in the bitmap
    percpu: introduce bitmap metadata blocks
    percpu: replace area map allocator with bitmap
    percpu: generalize bitmap (un)populated iterators
    percpu: increase minimum percpu allocation size and align first regions
    percpu: introduce nr_empty_pop_pages to help empty page accounting
    percpu: change the number of pages marked in the first_chunk pop bitmap
    percpu: combine percpu address checks
    percpu: modify base_addr to be region specific
    percpu: setup_first_chunk rename schunk/dchunk to chunk
    percpu: end chunk area maps page aligned for the populated bitmap
    percpu: unify allocation of schunk and dchunk
    percpu: setup_first_chunk remove dyn_size and consolidate logic
    ...

    Linus Torvalds
     
  • build_all_zonelists gets a zone parameter to initialize zone's pagesets.
    There is only a single user which gives a non-NULL zone parameter and
    that one doesn't really need the rest of the build_all_zonelists (see
    commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
    onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This also removes a pointless zonelist
    rebuild, which is always good.

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Sep, 2017

2 commits

  • Pull x86 mm changes from Ingo Molnar:
    "PCID support, 5-level paging support, Secure Memory Encryption support

    The main changes in this cycle are support for three new, complex
    hardware features of x86 CPUs:

    - Add 5-level paging support, which is a new hardware feature on
    upcoming Intel CPUs allowing up to 128 PB of virtual address space
    and 4 PB of physical RAM space - a 512-fold increase over the old
    limits. (Supercomputers of the future forecasting hurricanes on an
    ever warming planet can certainly make good use of more RAM.)

    Many of the necessary changes went upstream in previous cycles,
    v4.14 is the first kernel that can enable 5-level paging.

    This feature is activated via CONFIG_X86_5LEVEL=y - disabled by
    default.

    (By Kirill A. Shutemov)

    - Add 'encrypted memory' support, which is a new hardware feature on
    upcoming AMD CPUs ('Secure Memory Encryption', SME) allowing system
    RAM to be encrypted and decrypted (mostly) transparently by the
    CPU, with a little help from the kernel to transition to/from
    encrypted RAM. Such RAM should be more secure against various
    attacks like RAM access via the memory bus and should make the
    radio signature of memory bus traffic harder to intercept (and
    decrypt) as well.

    This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled
    by default.

    (By Tom Lendacky)

    - Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a
    hardware feature that attaches an address space tag to TLB entries
    and thus allows to skip TLB flushing in many cases, even if we
    switch mm's.

    (By Andy Lutomirski)

    All three of these features were in the works for a long time, and
    it's coincidence of the three independent development paths that they
    are all enabled in v4.14 at once"
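
The 512-fold figure for virtual address space follows from adding one page-table level. The arithmetic sketch below assumes 4 KiB pages and 9 index bits per level on x86-64; neither number appears in the quoted text, and the `sim_` helpers are hypothetical names used only here. The physical-memory figure is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PAGE_SHIFT     12 /* assumed: 4 KiB pages */
#define SIM_BITS_PER_LEVEL 9  /* assumed: 512 entries per page table */

/* Virtual address width for a given number of page-table levels. */
unsigned int sim_va_bits(unsigned int levels)
{
	assert(levels >= 1 && levels <= 5);
	return SIM_PAGE_SHIFT + SIM_BITS_PER_LEVEL * levels;
}

/* Growth factor of the virtual address space when adding levels. */
uint64_t sim_va_growth(unsigned int from_levels, unsigned int to_levels)
{
	assert(to_levels >= from_levels);
	return UINT64_C(1) << (sim_va_bits(to_levels) - sim_va_bits(from_levels));
}
```

Under these assumptions, four levels give 48 bits and five levels give 57 bits; 2^57 bytes is the 128 PB quoted above, and the growth factor is 2^9 = 512.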

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (65 commits)
    x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)
    x86/mm: Use pr_cont() in dump_pagetable()
    x86/mm: Fix SME encryption stack ptr handling
    kvm/x86: Avoid clearing the C-bit in rsvd_bits()
    x86/CPU: Align CR3 defines
    x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
    acpi, x86/mm: Remove encryption mask from ACPI page protection type
    x86/mm, kexec: Fix memory corruption with SME on successive kexecs
    x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt
    x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y
    x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID
    x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y
    x86/mm: Allow userspace have mappings above 47-bit
    x86/mm: Prepare to expose larger address space to userspace
    x86/mpx: Do not allow MPX if we have mappings above 47-bit
    x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
    x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
    x86/mm/dump_pagetables: Fix printout of p4d level
    x86/mm/dump_pagetables: Generalize address normalization
    x86/boot: Fix memremap() related build failure
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - fix affine wakeups (Peter Zijlstra)

    - improve CPU onlining (and general bootup) scalability on systems
    with ridiculous number (thousands) of CPUs (Peter Zijlstra)

    - sched/numa updates (Rik van Riel)

    - sched/deadline updates (Byungchul Park)

    - sched/cpufreq enhancements and related cleanups (Viresh Kumar)

    - sched/debug enhancements (Xie XiuQi)

    - various fixes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    sched/debug: Optimize sched_domain sysctl generation
    sched/topology: Avoid pointless rebuild
    sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
    sched/topology: Improve comments
    sched/topology: Fix memory leak in __sdt_alloc()
    sched/completion: Document that reinit_completion() must be called after complete_all()
    sched/autogroup: Fix error reporting printk text in autogroup_create()
    sched/fair: Fix wake_affine() for !NUMA_BALANCING
    sched/debug: Intruduce task_state_to_char() helper function
    sched/debug: Show task state in /proc/sched_debug
    sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
    sched/core: Remove unnecessary initialization init_idle_bootup_task()
    sched/deadline: Change return value of cpudl_find()
    sched/deadline: Make find_later_rq() choose a closer CPU in topology
    sched/numa: Scale scan period with tasks in group and shared/private
    sched/numa: Slow down scan rate if shared faults dominate
    sched/pelt: Fix false running accounting
    sched: Mark pick_next_task_dl() and build_sched_domain() as static
    sched/cpupri: Don't re-initialize 'struct cpupri'
    sched/deadline: Don't re-initialize 'struct cpudl'
    ...

    Linus Torvalds
     

14 Aug, 2017

1 commit

  • The allocated debug objects are either on the free list or in the
    hashed bucket lists. So they won't get lost. However if both debug
    objects and kmemleak are enabled and kmemleak scanning is done
    while some of the debug objects are transitioning from one list to
    the others, false positive reporting of memory leaks may happen for
    those objects. For example,

    [38687.275678] kmemleak: 12 new suspected memory leaks (see
    /sys/kernel/debug/kmemleak)
    unreferenced object 0xffff92e98aabeb68 (size 40):
    comm "ksmtuned", pid 4344, jiffies 4298403600 (age 906.430s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 d0 bc db 92 e9 92 ff ff ................
    01 00 00 00 00 00 00 00 38 36 8a 61 e9 92 ff ff ........86.a....
    backtrace:
    [] kmemleak_alloc+0x4a/0xa0
    [] kmem_cache_alloc+0xe9/0x320
    [] __debug_object_init+0x3e6/0x400
    [] debug_object_activate+0x131/0x210
    [] __call_rcu+0x3f/0x400
    [] call_rcu_sched+0x1d/0x20
    [] put_object+0x2c/0x40
    [] __delete_object+0x3c/0x50
    [] delete_object_full+0x1d/0x20
    [] kmemleak_free+0x32/0x80
    [] kmem_cache_free+0x77/0x350
    [] unlink_anon_vmas+0x82/0x1e0
    [] free_pgtables+0xa1/0x110
    [] exit_mmap+0xc1/0x170
    [] mmput+0x80/0x150
    [] do_exit+0x2a9/0xd20

    The references in the debug objects may also hide a real memory leak.

    As there is no point in having kmemleak to track debug object
    allocations, kmemleak checking is now disabled for debug objects.

    Signed-off-by: Waiman Long
    Signed-off-by: Thomas Gleixner
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1502718733-8527-1-git-send-email-longman@redhat.com

    Waiman Long
     

10 Aug, 2017

1 commit

  • init_idle_bootup_task() is called in rest_init() to switch
    the scheduling class of the boot thread to the idle class.

    The function only sets:

    idle->sched_class = &idle_sched_class;

    which has been set in init_idle() called by sched_init():

    /*
    * The idle tasks have their own, simple scheduling class:
    */
    idle->sched_class = &idle_sched_class;

    We've already set the boot thread to idle class in
    start_kernel()->sched_init()->init_idle()
    so it's unnecessary to set it again in
    start_kernel()->rest_init()->init_idle_bootup_task()

    Signed-off-by: Cheng Jian
    Signed-off-by: Xie XiuQi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1501838377-109720-1-git-send-email-cj.chengjian@huawei.com
    Signed-off-by: Ingo Molnar

    Cheng Jian
     

27 Jul, 2017

1 commit

  • The percpu memory allocator is experiencing scalability issues when
    allocating and freeing large numbers of counters as in BPF.
    Additionally, there is a corner case where iteration is triggered over
    all chunks if the contig_hint is the right size, but wrong alignment.

    This patch replaces the area map allocator with a basic bitmap allocator
    implementation. Each subsequent patch will introduce new features and
    replace full scanning functions with faster non-scanning options when
    possible.

    Implementation:
    This patchset removes the area map allocator in favor of a bitmap
    allocator backed by metadata blocks. The primary goal is to provide
    consistency in performance and memory footprint with a focus on small
    allocations (< 64 bytes). The bitmap removes the heavy memmove from the
    freeing critical path and provides a consistent memory footprint. The
    metadata blocks provide a bound on the amount of scanning required by
    maintaining a set of hints.
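
As a rough illustration of the basic idea, here is a minimal first-fit bitmap allocator. It is a sketch under stated assumptions, not the percpu implementation: a chunk is a single 64-bit word, one bit per minimum-size unit, there are no metadata blocks or hints, and every `sim_` name is hypothetical. It does a full scan on each allocation, which is the part the changelog says later patches replace with faster options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_UNITS 64 /* allocation units tracked by one chunk */

struct sim_chunk { uint64_t map; }; /* bit i set means unit i is allocated */

static bool sim_range_free(const struct sim_chunk *c, int start, int n)
{
	for (int i = start; i < start + n; i++)
		if (c->map & (UINT64_C(1) << i))
			return false;
	return true;
}

/* First-fit scan; returns the first unit index, or -1 if nothing fits. */
int sim_alloc(struct sim_chunk *c, int n, int align)
{
	assert(n > 0 && n <= SIM_UNITS && align > 0);
	for (int start = 0; start + n <= SIM_UNITS; start += align) {
		if (sim_range_free(c, start, n)) {
			for (int i = start; i < start + n; i++)
				c->map |= UINT64_C(1) << i;
			return start;
		}
	}
	return -1;
}

/* Freeing only clears bits: no list entries to shift, no memmove. */
void sim_free(struct sim_chunk *c, int start, int n)
{
	assert(start >= 0 && n > 0 && start + n <= SIM_UNITS);
	for (int i = start; i < start + n; i++)
		c->map &= ~(UINT64_C(1) << i);
}
```

The free path touches only the bits of the freed range, which is why the bitmap keeps free times flat regardless of the order in which objects are released.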

    In an effort to make freeing fast, the metadata is updated on the free
    path if the new free area makes a page free, a block free, or spans
    across blocks. This causes the chunk's contig hint to potentially be
    smaller than what it could allocate by up to the smaller of a page or a
    block. If the chunk's contig hint is contained within a block, a check
    occurs and the hint is kept accurate. Metadata is always kept accurate
    on allocation, so there will not be a situation where a chunk has a
    larger contig hint than available.

    Evaluation:
    I have primarily done testing against a simple workload of allocation of
    1 million objects (2^20) of varying size. Deallocation was done in
    order, alternating, and in reverse. These numbers were collected after
    rebasing on top of a80099a152. I present the worst-case numbers here:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             310 |           4770
            16B |             557 |           1325
            64B |             436 |            273
           256B |             776 |            131
          1024B |            3280 |            122

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             490 |             70
            16B |             515 |             75
            64B |             610 |             80
           256B |             950 |            100
          1024B |            3520 |            200

    This data demonstrates the inability of the area map allocator to
    handle less than ideal situations. In the best case of reverse
    deallocation, the area map allocator was able to perform within range
    of the bitmap allocator. In the worst case situation, freeing took
    nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
    dramatically improves the consistency of the free path. The small
    allocations performed nearly identically regardless of the freeing
    pattern.

    While it does add to the allocation latency, the allocation scenario
    here is optimal for the area map allocator. The area map allocator runs
    into trouble when it is allocating in chunks where the latter half is
    full. It is difficult to replicate this, so I present a variant where
    the pages are second half filled. Freeing was done sequentially. Below
    are the numbers for this scenario:

    Area Map Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |            4118 |           4892
            16B |            1651 |           1163
            64B |             598 |            285
           256B |             771 |            158
          1024B |            3034 |            160

    Bitmap Allocator:

    Object Size | Alloc Time (ms) | Free Time (ms)
    ----------------------------------------------
             4B |             481 |             67
            16B |             506 |             69
            64B |             636 |             75
           256B |             892 |             90
          1024B |            3262 |            147

    The data shows a parabolic curve of performance for the area map
    allocator. This is due to the memmove operation being the dominant cost
    with the lower object sizes as more objects are packed in a chunk and at
    higher object sizes, the traversal of the chunk slots is the dominating
    cost. The bitmap allocator suffers this problem as well. The above data
    shows the inability to scale for the allocation path with the area map
    allocator and that the bitmap allocator demonstrates consistent
    performance in general.

    The second problem of additional scanning can result in the area map
    allocator completing in 52 minutes when trying to allocate 1 million
    4-byte objects with 8-byte alignment. The same workload takes
    approximately 16 seconds to complete for the bitmap allocator.

    V2:
    Fixed a bug in pcpu_alloc_first_chunk where end_offset was setting the
    bitmap using bytes instead of bits.

    Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Tejun Heo

    Dennis Zhou (Facebook)
     

18 Jul, 2017

1 commit

  • Since DMA addresses will effectively look like 48-bit addresses when the
    memory encryption mask is set, SWIOTLB is needed if the DMA mask of the
    device performing the DMA does not support 48 bits. SWIOTLB will be
    initialized to create decrypted bounce buffers for use by these devices.
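
The condition described above can be sketched as a simple mask check. This is an illustration under stated assumptions: the encryption bit position `SIM_ENC_BIT` is hypothetical (chosen so that addresses need 48 bits, as the text describes), and the `sim_` helpers are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_ENC_BIT 47 /* hypothetical position of the encryption mask bit */

/* DMA mask for a device that can drive the given number of address bits. */
uint64_t sim_dma_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* A physical address as the device would see it with the mask applied. */
uint64_t sim_encrypted_dma_addr(uint64_t phys)
{
	return phys | (UINT64_C(1) << SIM_ENC_BIT);
}

/* True if the device cannot reach the address and must use a bounce buffer. */
bool sim_needs_bounce(uint64_t dma_addr, uint64_t mask)
{
	return (dma_addr & ~mask) != 0;
}
```

In this model a 32-bit device needs a bounce buffer for any address carrying the encryption bit, while a 48-bit capable device does not.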

    Signed-off-by: Tom Lendacky
    Reviewed-by: Thomas Gleixner
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brijesh Singh
    Cc: Dave Young
    Cc: Dmitry Vyukov
    Cc: Jonathan Corbet
    Cc: Konrad Rzeszutek Wilk
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Michael S. Tsirkin
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Toshimitsu Kani
    Cc: kasan-dev@googlegroups.com
    Cc: kvm@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/aa2d29b78ae7d508db8881e46a3215231b9327a7.1500319216.git.thomas.lendacky@amd.com
    Signed-off-by: Ingo Molnar

    Tom Lendacky
     

13 Jul, 2017

1 commit

  • The add_device_randomness() function would ignore incoming bytes if the
    crng wasn't ready. This additionally makes sure to make an early enough
    call to add_latent_entropy() to influence the initial stack canary,
    which is especially important on non-x86 systems where it stays the same
    through the life of the boot.

    Link: http://lkml.kernel.org/r/20170626233038.GA48751@beast
    Signed-off-by: Kees Cook
    Cc: "Theodore Ts'o"
    Cc: Arnd Bergmann
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Jessica Yu
    Cc: Steven Rostedt (VMware)
    Cc: Viresh Kumar
    Cc: Tejun Heo
    Cc: Prarit Bhargava
    Cc: Lokesh Vutla
    Cc: Nicholas Piggin
    Cc: AKASHI Takahiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

23 May, 2017

2 commits

  • might_sleep() and smp_processor_id() checks are enabled after the boot
    process is done. That hides bugs in the SMP bringup and driver
    initialization code.

    Enable it right when the scheduler starts working, i.e. when init task and
    kthreadd have been created and right before the idle task enables
    preemption.

    Tested-by: Mark Rutland
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20170516184736.272225698@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Some of the boot code in init_kernel_freeable() which runs before SMP
    bringup assumes (rightfully) that it runs on the boot CPU and therefore can
    use smp_processor_id() in preemptible context.

    That works so far because the smp_processor_id() check starts to be
    effective after smp bringup. That's just wrong. Starting with SMP bringup
    and the ability to move threads around, smp_processor_id() in preemptible
    context is broken.

    Aside from that it does not make sense to allow init to run on all CPUs
    before sched_smp_init() has been run.

    Pin the init to the boot CPU so the existing code can continue to use
    smp_processor_id() without triggering the checks when the enabling of those
    checks starts earlier.

    Tested-by: Mark Rutland
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20170516184734.943149935@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

09 May, 2017

1 commit

  • Pull tty/serial updates from Greg KH:
    "Here is the "big" TTY/Serial patch updates for 4.12-rc1

    Not a lot of new things here, the normal number of serial driver
    updates and additions, tiny bugs fixed, and some core files split up
    to make future changes a bit easier for Nicolas's "tiny-tty" work.

    All of these have been in linux-next for a while"

    * tag 'tty-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (62 commits)
    serial: small Makefile reordering
    tty: split job control support into a file of its own
    tty: move baudrate handling code to a file of its own
    console: move console_init() out of tty_io.c
    serial: 8250_early: Add earlycon support for Palmchip UART
    tty: pl011: use "qdf2400_e44" as the earlycon name for QDF2400 E44
    vt: make mouse selection of non-ASCII consistent
    vt: set mouse selection word-chars to gpm's default
    imx-serial: Reduce RX DMA startup latency when opening for reading
    serial: omap: suspend device on probe errors
    serial: omap: fix runtime-pm handling on unbind
    tty: serial: omap: add UPF_BOOT_AUTOCONF flag for DT init
    serial: samsung: Remove useless spinlock
    serial: samsung: Add missing checks for dma_map_single failure
    serial: samsung: Use right device for DMA-mapping calls
    serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
    tty: fix comment typo s/repsonsible/responsible/
    tty: amba-pl011: Fix spurious TX interrupts
    serial: xuartps: Enable clocks in the pm disable case also
    serial: core: Re-use struct uart_port {name} field
    ...

    Linus Torvalds
     

04 May, 2017

1 commit

  • Pull tracing updates from Steven Rostedt:
    "New features for this release:

    - Pretty much a full rewrite of the processing of function plugins.
    i.e. echo do_IRQ:stacktrace > set_ftrace_filter

    - The rewrite was needed to allow plugins to be unique to tracing
    instances. i.e. mkdir instances/foo; cd instances/foo; echo
    do_IRQ:stacktrace > set_ftrace_filter. The old way was written very
    hacky. This removes a lot of those hacks.

    - New "function-fork" tracing option. When set, the children of
    processes whose pids are listed in the set_ftrace_pid file are added
    to that file when those processes fork.

    - Exposure of "maxactive" for kretprobe in kprobe_events

    - Allow for builtin init functions to be traced by the function
    tracer (via the kernel command line). Module init function tracing
    will come in the next release.

    - Added more selftests, and have selftests also test in an instance"
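
    The set_ftrace_filter commands quoted above differ only in the
    directory they are run from. As a rough userspace sketch (the helper
    name write_ftrace_filter is made up for illustration and is not part
    of the kernel or of any library), the same write can be expressed in
    C with the directory as a parameter:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a filter expression such as "do_IRQ:stacktrace" to
 * <dir>/set_ftrace_filter. Returns 0 on success, -1 on failure.
 * Hypothetical helper, shown only to illustrate the per-directory idea.
 */
static int write_ftrace_filter(const char *dir, const char *expr)
{
    char path[512];

    assert(dir != NULL && expr != NULL);

    int len = snprintf(path, sizeof(path), "%s/set_ftrace_filter", dir);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;              /* path would have been truncated */

    /* "w" truncates the file first, like the shell's '>' redirection. */
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs(expr, f) >= 0;
    if (fclose(f) != 0)         /* write errors may only surface here */
        ok = 0;
    return ok ? 0 : -1;
}
```

    Assuming tracefs is mounted at /sys/kernel/tracing and the caller has
    the needed privileges, passing that directory would target the
    top-level tracer, while passing /sys/kernel/tracing/instances/foo
    would target only the "foo" instance.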

    * tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (60 commits)
    ring-buffer: Return reader page back into existing ring buffer
    selftests: ftrace: Allow some event trigger tests to run in an instance
    selftests: ftrace: Have some basic tests run in a tracing instance too
    selftests: ftrace: Have event tests also run in an tracing instance
    selftests: ftrace: Make func_event_triggers and func_traceonoff_triggers tests do instances
    selftests: ftrace: Allow some tests to be run in a tracing instance
    tracing/ftrace: Allow for instances to trigger their own stacktrace probes
    tracing/ftrace: Allow for the traceonoff probe be unique to instances
    tracing/ftrace: Enable snapshot function trigger to work with instances
    tracing/ftrace: Allow instances to have their own function probes
    tracing/ftrace: Add a better way to pass data via the probe functions
    ftrace: Dynamically create the probe ftrace_ops for the trace_array
    tracing: Pass the trace_array into ftrace_probe_ops functions
    tracing: Have the trace_array hold the list of registered func probes
    ftrace: If the hash for a probe fails to update then free what was initialized
    ftrace: Have the function probes call their own function
    ftrace: Have each function probe use its own ftrace_ops
    ftrace: Have unregister_ftrace_function_probe_func() return a value
    ftrace: Add helper function ftrace_hash_move_and_update_ops()
    ftrace: Remove data field from ftrace_func_probe structure
    ...

    Linus Torvalds
     

03 May, 2017

1 commit

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    tty: fix comment for __tty_alloc_driver()
    init/main: properly align the multi-line comment
    init/main: Fix double "the" in comment
    Fix dead URLs to ftp.kernel.org
    drivers: Clean up duplicated email address
    treewide: Fix typo in xml/driver-api/basics.xml
    tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
    selftests/timers: Spelling s/privledges/privileges/
    HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
    net: phy: dp83848: Fix Typo
    UBI: Fix typos
    Documentation: ftrace.txt: Correct nice value of 120 priority
    net: fec: Fix typo in error msg and comment
    treewide: Fix typos in printk

    Linus Torvalds
     

24 Apr, 2017

2 commits


19 Apr, 2017

1 commit


04 Apr, 2017

1 commit

  • Relying on free_reserved_area() to call ftrace to free init memory proved to
    not be sufficient. The issue is that on x86, when debug_pagealloc is
    enabled, the init memory is not freed, but simply set as not present. Since
    ftrace was uninformed of this, starting function tracing still tries to
    update pages that are not present according to the page tables, causing
    ftrace to bug, as well as killing the kernel itself.

    Instead of relying on free_reserved_area(), have init/main.c call ftrace
    directly just before it frees the init memory. Then it needs to use
    __init_begin and __init_end to know where the init memory location is.
    Looking at all archs (and testing what I can), it appears that this should
    work for each of them.
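
    The idea of telling ftrace which address range is about to go away
    can be modelled in userspace as below. This is only a sketch under
    assumptions: struct site_table and release_range() are hypothetical
    names, and the real ftrace bookkeeping is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a table of patchable call-site addresses. */
struct site_table {
    uintptr_t *addr;    /* call-site addresses */
    size_t     count;   /* number of valid entries */
};

/*
 * Forget every entry whose address lies in [start, end), keeping the
 * remaining entries in their original order. Returns the number of
 * entries removed.
 */
static size_t release_range(struct site_table *t, uintptr_t start,
                            uintptr_t end)
{
    size_t kept = 0;

    assert(t != NULL && start <= end);

    for (size_t i = 0; i < t->count; i++) {
        uintptr_t a = t->addr[i];

        if (a >= start && a < end)
            continue;           /* inside the region being freed */
        t->addr[kept++] = a;
    }

    size_t removed = t->count - kept;
    t->count = kept;
    return removed;
}
```

    In this model the caller plays the role of init/main.c: it passes the
    bounds of the init region (the analogue of __init_begin and
    __init_end) before the memory is released, so no later update touches
    an address in that range.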

    Reported-by: kernel test robot
    Reported-by: Fengguang Wu
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

01 Apr, 2017

1 commit

  • Yang Li has reported that drain_all_pages triggers a WARN_ON which means
    that this function is called earlier than the mm_percpu_wq is
    initialized on arm64 with CMA configured:

    WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
    Modules linked in:
    CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
    Hardware name: Freescale Layerscape 2088A RDB Board (DT)
    task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
    PC is at drain_all_pages+0x244/0x25c
    LR is at start_isolate_page_range+0x14c/0x1f0
    [...]
    drain_all_pages+0x244/0x25c
    start_isolate_page_range+0x14c/0x1f0
    alloc_contig_range+0xec/0x354
    cma_alloc+0x100/0x1fc
    dma_alloc_from_contiguous+0x3c/0x44
    atomic_pool_init+0x7c/0x208
    arm64_dma_init+0x44/0x4c
    do_one_initcall+0x38/0x128
    kernel_init_freeable+0x1a0/0x240
    kernel_init+0x10/0xfc
    ret_from_fork+0x10/0x20

    Fix this by moving the whole setup_vmstat, which is an initcall right
    now, to init_mm_internals, which will be called right after the WQ
    subsystem is initialized.
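
    The ordering problem can be illustrated with a small userspace model.
    All names below (wq_ready, model_init_mm_internals,
    model_drain_all_pages) are invented for this sketch and only mimic
    the dependency described above; they are not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag: true once the per-cpu workqueue has been set up. */
static bool wq_ready;

/* Stand-in for the setup step that has to run before any drain. */
static void model_init_mm_internals(void)
{
    assert(!wq_ready);          /* the setup is expected to run once */
    wq_ready = true;
}

/* Returns 0 if the drain could be queued, -1 if it was called too early. */
static int model_drain_all_pages(void)
{
    if (!wq_ready) {
        fprintf(stderr, "WARNING: drain requested before workqueue init\n");
        return -1;
    }
    return 0;
}
```

    Calling model_drain_all_pages() before model_init_mm_internals()
    takes the warning path, which corresponds to the initcall-time caller
    in the backtrace; once the setup has run, the same call succeeds.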

    Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Yang Li
    Tested-by: Yang Li
    Tested-by: Xiaolong Ye
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Mar, 2017

2 commits


12 Mar, 2017

1 commit

  • Pull random updates from Ted Ts'o:
    "Change get_random_{int,long} to use the CRNG used by /dev/urandom and
    getrandom(2). It's faster and arguably more secure than the cut-down
    MD5 that we had been using.

    Also do some code cleanup"

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
    random: move random_min_urandom_seed into CONFIG_SYSCTL ifdef block
    random: convert get_random_int/long into get_random_u32/u64
    random: use chacha20 for get_random_int/long
    random: fix comment for unused random_min_urandom_seed
    random: remove variable limit
    random: remove stale urandom_init_wait
    random: remove stale maybe_reseed_primary_crng

    Linus Torvalds
     

02 Mar, 2017

5 commits


01 Mar, 2017

1 commit

  • Pull IDR rewrite from Matthew Wilcox:
    "The most significant part of the following is the patch to rewrite the
    IDR & IDA to be clients of the radix tree. But there's much more,
    including an enhancement of the IDA to be significantly more space
    efficient, an IDR & IDA test suite, some improvements to the IDR API
    (and driver changes to take advantage of those improvements), several
    improvements to the radix tree test suite and RCU annotations.

    The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
    for most of the last cycle. Coupled with the IDR test suite, I feel
    pretty confident that any remaining bugs are quite hard to hit. 0-day
    did a great job of watching my git tree and pointing out problems; as
    it hit them, I added new test-cases to be sure not to be caught the
    same way twice"

    Willy goes on to expand a bit on the IDR rewrite rationale:
    "The radix tree and the IDR use very similar data structures.

    Merging the two codebases lets us share the memory allocation pools,
    and results in a net deletion of 500 lines of code. It also opens up
    the possibility of exposing more of the features of the radix tree to
    users of the IDR (and I have some interesting patches along those
    lines waiting for 4.12)

    It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
    will shrink a fair few data structures that embed an IDR"

    * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
    radix tree test suite: Add config option for map shift
    idr: Add missing __rcu annotations
    radix-tree: Fix __rcu annotations
    radix-tree: Add rcu_dereference and rcu_assign_pointer calls
    radix tree test suite: Run iteration tests for longer
    radix tree test suite: Fix split/join memory leaks
    radix tree test suite: Fix leaks in regression2.c
    radix tree test suite: Fix leaky tests
    radix tree test suite: Enable address sanitizer
    radix_tree_iter_resume: Fix out of bounds error
    radix-tree: Store a pointer to the root in each node
    radix-tree: Chain preallocated nodes through ->parent
    radix tree test suite: Dial down verbosity with -v
    radix tree test suite: Introduce kmalloc_verbose
    idr: Return the deleted entry from idr_remove
    radix tree test suite: Build separate binaries for some tests
    ida: Use exceptional entries for small IDAs
    ida: Move ida_bitmap to a percpu variable
    Reimplement IDR and IDA using the radix tree
    radix-tree: Add radix_tree_iter_delete
    ...

    Linus Torvalds
     

28 Feb, 2017

2 commits

  • This patch makes arch-independent testcases for RODATA. Both x86 and
    x86_64 already have testcases for RODATA, but they are arch-specific
    because they use inline assembly directly.

    Also, cacheflush.h is not a suitable location for rodata-test related
    things. Since they were in cacheflush.h, changing the state of
    CONFIG_DEBUG_RODATA_TEST adds overhead to the kernel build.

    To solve the above issues, write arch-independent testcases and move
    them to a shared location.

    [jinb.park7@gmail.com: fix config dependency]
    Link: http://lkml.kernel.org/r/20170209131625.GA16954@pjb1027-Latitude-E5410
    Link: http://lkml.kernel.org/r/20170129105436.GA9303@pjb1027-Latitude-E5410
    Signed-off-by: Jinbum Park
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Arjan van de Ven
    Cc: Laura Abbott
    Cc: Russell King
    Cc: Valentin Rothberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jinbum Park
     
  • Commit 4a9d4b024a31 ("switch fput to task_work_add") implements a
    schedule_work() for completing fput(), but did not guarantee calling
    __fput() after unpacking initramfs. Because of this, there is a
    possibility that during boot a driver can see ETXTBSY when it tries to
    load a binary from initramfs as fput() is still pending on that binary.

    This patch makes sure that fput() is completed after unpacking initramfs
    and removes the call to flush_delayed_fput() in kernel_init() which
    happens very late after unpacking initramfs.

    Link: http://lkml.kernel.org/r/20170201140540.22051-1-lokeshvutla@ti.com
    Signed-off-by: Lokesh Vutla
    Reported-by: Murali Karicheri
    Cc: Al Viro
    Cc: Tero Kristo
    Cc: Sekhar Nori
    Cc: Nishanth Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lokesh Vutla
     

23 Feb, 2017

1 commit

  • Pull printk updates from Petr Mladek:

    - Add Petr Mladek, Sergey Senozhatsky as printk maintainers, and Steven
    Rostedt as the printk reviewer. This idea came up after the
    discussion about printk issues at Kernel Summit. It was formulated
    and discussed at lkml[1].

    - Extend the lock-less NMI per-cpu buffers idea to handle recursive
    printk() calls, by Sergey Senozhatsky[2]. It is the first step in
    sanitizing printk as discussed at Kernel Summit.

    The change makes it possible to see messages that would normally get
    ignored or would cause a deadlock.

    It also makes it possible to enable lockdep in printk(). This has
    already paid off: the testing in linux-next helped to discover two
    old problems that were hidden before[3][4].

    - Remove unused parameter by Sergey Senozhatsky. Clean up after a past
    change.

    [1] http://lkml.kernel.org/r/1481798878-31898-1-git-send-email-pmladek@suse.com
    [2] http://lkml.kernel.org/r/20161227141611.940-1-sergey.senozhatsky@gmail.com
    [3] http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com
    [4] http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
    printk: drop call_console_drivers() unused param
    printk: convert the rest to printk-safe
    printk: remove zap_locks() function
    printk: use printk_safe buffers in printk
    printk: report lost messages in printk safe/nmi contexts
    printk: always use deferred printk when flush printk_safe lines
    printk: introduce per-cpu safe_print seq buffer
    printk: rename nmi.c and exported api
    printk: use vprintk_func in vprintk()
    MAINTAINERS: Add printk maintainers

    Linus Torvalds
     

22 Feb, 2017

2 commits

  • Pull rodata updates from Kees Cook:
    "This renames the (now inaccurate) DEBUG_RODATA and related
    SET_MODULE_RONX configs to the more sensible STRICT_KERNEL_RWX and
    STRICT_MODULE_RWX"

    * tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
    arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

    Linus Torvalds
     
  • Pull exception table module split from Paul Gortmaker:
    "Final extable.h related changes.

    This completes the separation of exception table content from the
    module.h header file. This is achieved with the final commit that
    removes the one line back compatible change that sourced extable.h
    into the module.h file.

    The commits are unchanged since January, with the exception of a
    couple Acks that came in for the last two commits a bit later. The
    changes have been in linux-next for quite some time[1] and have got
    widespread arch coverage via toolchains I have and also from
    additional ones the kbuild bot has.

    Maintainers of the various arch were Cc'd during the postings to
    lkml[2] and informed that the intention was to take the remaining arch
    specific changes and lump them together with the final two non-arch
    specific changes and submit for this merge window.

    The ia64 diffstat stands out and probably warrants a mention. In an
    earlier review, Al Viro made a valid comment that the original header
    separation of content left something to be desired, and that it get
    fixed as a part of this change, hence the larger diffstat"

    * tag 'extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (21 commits)
    module.h: remove extable.h include now users have migrated
    core: migrate exception table users off module.h and onto extable.h
    cris: migrate exception table users off module.h and onto extable.h
    hexagon: migrate exception table users off module.h and onto extable.h
    microblaze: migrate exception table users off module.h and onto extable.h
    unicore32: migrate exception table users off module.h and onto extable.h
    score: migrate exception table users off module.h and onto extable.h
    metag: migrate exception table users off module.h and onto extable.h
    arc: migrate exception table users off module.h and onto extable.h
    nios2: migrate exception table users off module.h and onto extable.h
    sparc: migrate exception table users onto extable.h
    openrisc: migrate exception table users off module.h and onto extable.h
    frv: migrate exception table users off module.h and onto extable.h
    sh: migrate exception table users off module.h and onto extable.h
    xtensa: migrate exception table users off module.h and onto extable.h
    mn10300: migrate exception table users off module.h and onto extable.h
    alpha: migrate exception table users off module.h and onto extable.h
    arm: migrate exception table users off module.h and onto extable.h
    m32r: migrate exception table users off module.h and onto extable.h
    ia64: ensure exception table search users include extable.h
    ...

    Linus Torvalds