11 Jul, 2021

1 commit

  • Pull Kbuild updates from Masahiro Yamada:

    - Increase the -falign-functions alignment for the debug option.

    - Remove ugly libelf checks from the top Makefile.

    - Make the silent build (-s) more silent.

    - Re-compile the kernel if KBUILD_BUILD_TIMESTAMP is specified.

    - Various script cleanups

    * tag 'kbuild-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (27 commits)
    scripts: add generic syscallnr.sh
    scripts: check duplicated syscall number in syscall table
    sparc: syscalls: use pattern rules to generate syscall headers
    parisc: syscalls: use pattern rules to generate syscall headers
    nds32: add arch/nds32/boot/.gitignore
    kbuild: mkcompile_h: consider timestamp if KBUILD_BUILD_TIMESTAMP is set
    kbuild: modpost: Explicitly warn about unprototyped symbols
    kbuild: remove trailing slashes from $(KBUILD_EXTMOD)
    kconfig.h: explain IS_MODULE(), IS_ENABLED()
    kconfig: constify long_opts
    scripts/setlocalversion: simplify the short version part
    scripts/setlocalversion: factor out 12-chars hash construction
    scripts/setlocalversion: add more comments to -dirty flag detection
    scripts/setlocalversion: remove workaround for old make-kpkg
    scripts/setlocalversion: remove mercurial, svn and git-svn supports
    kbuild: clean up ${quiet} checks in shell scripts
    kbuild: sink stdout from cmd for silent build
    init: use $(call cmd,) for generating include/generated/compile.h
    kbuild: merge scripts/mkmakefile to top Makefile
    sh: move core-y in arch/sh/Makefile to arch/sh/Kbuild
    ...

    Linus Torvalds
     

09 Jul, 2021

2 commits

  • Parse the kernel's build ID at initialization so that other code can print
    a hex format string representation of the running kernel's build ID. This
    will be used in the kdump and dump_stack code so that developers can
    easily locate the vmlinux debug symbols for a crash/stacktrace.
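
    As a rough illustration of the "hex format string representation"
    mentioned above, the following userspace sketch turns a raw build ID
    into a lowercase hex string. The function name, the toy_ prefix and the
    20-byte size are assumptions made for this example; they are not taken
    from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUILD_ID_SIZE 20   /* assumed SHA-1 sized ID; other sizes exist */

/*
 * Write the bytes of id as lowercase hex into out, always NUL-terminating.
 * Output is truncated to whole bytes if out is too small.
 */
static void toy_build_id_to_hex(const unsigned char *id, size_t len,
                                char *out, size_t out_sz)
{
    static const char digits[] = "0123456789abcdef";
    size_t i;

    if (out_sz == 0)
        return;

    /* Each byte needs two characters, plus room for the final NUL. */
    for (i = 0; i < len && 2 * i + 2 < out_sz; i++) {
        out[2 * i]     = digits[id[i] >> 4];
        out[2 * i + 1] = digits[id[i] & 0xf];
    }
    out[2 * i] = '\0';
}
```

    A caller would size the buffer as 2 * TOY_BUILD_ID_SIZE + 1 and print
    the result next to a stack trace so the matching vmlinux can be looked
    up by its ID.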

    [swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()]
    Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com

    Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org
    Signed-off-by: Stephen Boyd
    Acked-by: Baoquan He
    Cc: Jiri Olsa
    Cc: Alexei Starovoitov
    Cc: Jessica Yu
    Cc: Evan Green
    Cc: Hsin-Yi Wang
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: Andy Shevchenko
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Ingo Molnar
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: Petr Mladek
    Cc: Rasmus Villemoes
    Cc: Sasha Levin
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Boyd
     
  • Many stack traces are similar so there are many similar arrays.
    Stackdepot saves each unique stack only once.

    Replace field addrs in struct track with depot_stack_handle_t handle. Use
    stackdepot to save stack trace.

    The benefits are a smaller memory overhead and the possibility of
    aggregating per-cache statistics in the future using the stackdepot
    handle instead of matching stacks manually.
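
    The deduplication idea can be sketched in userspace as follows. This is
    a toy model, not the stackdepot implementation: it uses a linear scan
    where a real depot would hash, and all toy_ names and size limits are
    assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DEPTH   8    /* frames kept per trace (toy limit) */
#define TOY_MAX_STACKS  64   /* capacity of the toy depot */

typedef unsigned int toy_handle_t;   /* 0 means "no trace stored" */

struct toy_stack {
    size_t nr;
    unsigned long entries[TOY_MAX_DEPTH];
};

static struct toy_stack toy_depot[TOY_MAX_STACKS];
static size_t toy_depot_used;

/* Return the handle of an identical stored trace, or store a new one. */
static toy_handle_t toy_depot_save(const unsigned long *entries, size_t nr)
{
    if (nr == 0 || nr > TOY_MAX_DEPTH)
        return 0;

    for (size_t i = 0; i < toy_depot_used; i++) {
        if (toy_depot[i].nr == nr &&
            memcmp(toy_depot[i].entries, entries, nr * sizeof(*entries)) == 0)
            return (toy_handle_t)(i + 1);   /* already known: reuse handle */
    }

    if (toy_depot_used == TOY_MAX_STACKS)
        return 0;                           /* depot full */

    toy_depot[toy_depot_used].nr = nr;
    memcpy(toy_depot[toy_depot_used].entries, entries, nr * sizeof(*entries));
    toy_depot_used++;
    return (toy_handle_t)toy_depot_used;
}

/* A tracking record then holds one handle instead of an address array. */
struct toy_track {
    toy_handle_t handle;
};
```

    Two records with the same call chain end up with equal handles, which
    is also what would make per-cache aggregation a simple comparison.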

    [rdunlap@infradead.org: rename save_stack_trace()]
    Link: https://lkml.kernel.org/r/20210513051920.29320-1-rdunlap@infradead.org
    [vbabka@suse.cz: fix lockdep splat]
    Link: https://lkml.kernel.org/r/20210516195150.26740-1-vbabka@suse.cz

    Link: https://lkml.kernel.org/r/20210414163434.4376-1-glittao@gmail.com

    Signed-off-by: Oliver Glitta
    Signed-off-by: Randy Dunlap
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver Glitta
     

05 Jul, 2021

1 commit

  • …git/paulmck/linux-rcu

    Pull RCU updates from Paul McKenney:

    - Bitmap parsing support for "all" as an alias for all bits

    - Documentation updates

    - Miscellaneous fixes, including some that overlap into mm and lockdep

    - kvfree_rcu() updates

    - mem_dump_obj() updates, with acks from one of the slab-allocator
    maintainers

    - RCU NOCB CPU updates, including limited deoffloading

    - SRCU updates

    - Tasks-RCU updates

    - Torture-test updates

    * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
    tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
    rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
    rcu: Add missing __releases() annotation
    rcu: Remove obsolete rcu_read_unlock() deadlock commentary
    rcu: Improve comments describing RCU read-side critical sections
    rcu: Create an unrcu_pointer() to remove __rcu from a pointer
    srcu: Early test SRCU polling start
    rcu: Fix various typos in comments
    rcu/nocb: Unify timers
    rcu/nocb: Prepare for fine-grained deferred wakeup
    rcu/nocb: Only cancel nocb timer if not polling
    rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
    rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
    rcu/nocb: Allow de-offloading rdp leader
    rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
    rcu: Don't penalize priority boosting when there is nothing to boost
    rcu: Point to documentation of ordering guarantees
    rcu: Make rcu_gp_cleanup() be noinline for tracing
    rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
    rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
    ...

    Linus Torvalds
     

04 Jul, 2021

1 commit

  • Pull tracing updates from Steven Rostedt:

    - Added option for per CPU threads to the hwlat tracer

    - Have hwlat tracer handle hotplug CPUs

    - New tracer: osnoise, that detects latency caused by interrupts,
    softirqs and scheduling of other tasks.

    - Added timerlat tracer that creates a thread and measures in detail
    what sources of latency it has for wake ups.

    - Removed the "success" field of the sched_wakeup trace event. This has
    been hardcoded as "1" since 2015, so no tooling should be looking at it
    now. If one exists, we can revert this commit, fix that tool and try
    to remove it again in the future.

    - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.

    - New boot command line option "tp_printk_stop", as tp_printk causes
    trace events to write to console. When user space starts, this can
    easily live lock the system. Having a boot option to stop just after
    boot up is useful to prevent that from happening.

    - Have ftrace_dump_on_oops boot command line option take numbers that
    match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.

    - Bootconfig clean ups, fixes and enhancements.

    - New ktest script that tests bootconfig options.

    - Add tracepoint_probe_register_may_exist() to register a tracepoint
    without triggering a WARN*() if it already exists. BPF has a path
    from user space that can do this. All other paths are considered a
    bug.

    - Small clean ups and fixes

    * tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
    tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
    tracing: Simplify & fix saved_tgids logic
    treewide: Add missing semicolons to __assign_str uses
    tracing: Change variable type as bool for clean-up
    trace/timerlat: Fix indentation on timerlat_main()
    trace/osnoise: Make 'noise' variable s64 in run_osnoise()
    tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
    tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
    Documentation: Fix a typo on trace/osnoise-tracer
    trace/osnoise: Fix return value on osnoise_init_hotplug_support
    trace/osnoise: Make interval u64 on osnoise_main
    trace/osnoise: Fix 'no previous prototype' warnings
    tracing: Have osnoise_main() add a quiescent state for task rcu
    seq_buf: Make trace_seq_putmem_hex() support data longer than 8
    seq_buf: Fix overflow in seq_buf_putmem_hex()
    trace/osnoise: Support hotplug operations
    trace/hwlat: Support hotplug operations
    trace/hwlat: Protect kdata->kthread with get/put_online_cpus
    trace: Add timerlat tracer
    trace: Add osnoise tracer
    ...

    Linus Torvalds
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton: (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

02 Jul, 2021

1 commit

  • It is easy to foobar setting a kernel parameter on the command line
    without realizing it, and there's not much output that you can use to
    assess what the kernel did with that parameter by default.

    Make it a little more explicit which parameters on the command line
    _looked_ like a valid parameter for the kernel, but did not match anything
    and ultimately got tossed to init. This is very similar to the unknown
    parameter message received when loading a module.

    This assumes the parameters are processed in a normal fashion. Some
    parameters (dyndbg= for example) don't register their parameter with the
    rest of the kernel's parameters, and therefore always show up in this list
    (and are also given to init - like the rest of this list).

    Another example is BOOT_IMAGE=, which is highlighted as an offender. It
    technically is one, but it is passed by LILO and GRUB, so most systems
    will see that complaint.

    An example output where "foobared" and "unrecognized" are intentionally
    invalid parameters:

    Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo
    Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo
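
    The classification behind that output can be approximated in userspace
    as below. This is only a sketch: the table of known names is
    hypothetical, quoting is ignored, and the ordering (bare words first,
    then key=value pairs) is an assumption inferred from the sample output
    above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of names that some handler has registered. */
static const char *const toy_known_params[] = { "debug", "log_buf_len" };

static int toy_is_known(const char *tok, size_t name_len)
{
    size_t n = sizeof(toy_known_params) / sizeof(toy_known_params[0]);

    for (size_t i = 0; i < n; i++) {
        if (strlen(toy_known_params[i]) == name_len &&
            strncmp(toy_known_params[i], tok, name_len) == 0)
            return 1;
    }
    return 0;
}

/* Append tok to out, space separated; silently skip it if it does not fit. */
static void toy_append(char *out, size_t out_sz, const char *tok, size_t len)
{
    size_t used = strlen(out);

    if (used + len + 2 > out_sz)
        return;
    if (used)
        out[used++] = ' ';
    memcpy(out + used, tok, len);
    out[used + len] = '\0';
}

/* Collect unknown parameters: pass 0 takes bare words, pass 1 key=value. */
static void toy_collect_unknown(const char *cmdline, char *out, size_t out_sz)
{
    out[0] = '\0';
    for (int pass = 0; pass < 2; pass++) {
        const char *p = cmdline;

        while (*p) {
            size_t len = strcspn(p, " ");

            if (len) {
                const char *eq = memchr(p, '=', len);
                size_t name_len = eq ? (size_t)(eq - p) : len;
                int has_value = eq != NULL;

                if (!toy_is_known(p, name_len) && has_value == pass)
                    toy_append(out, out_sz, p, len);
            }
            p += len;
            while (*p == ' ')
                p++;
        }
    }
}
```

    Feeding it the sample command line yields the same list as the
    "Unknown command line parameters" line, including BOOT_IMAGE=.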

    Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com
    Signed-off-by: Andrew Halaney
    Suggested-by: Steven Rostedt
    Suggested-by: Borislav Petkov
    Acked-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Halaney
     

01 Jul, 2021

2 commits

  • Pull clang feature updates from Kees Cook:

    - Add CC_HAS_NO_PROFILE_FN_ATTR in preparation for PGO support in the
    face of the noinstr attribute, paving the way for PGO and fixing
    GCOV. (Nick Desaulniers)

    - x86_64 LTO coverage is expanded to 32-bit x86. (Nathan Chancellor)

    - Small fixes to CFI. (Mark Rutland, Nathan Chancellor)

    * tag 'clang-features-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute
    Kconfig: Introduce ARCH_WANTS_NO_INSTR and CC_HAS_NO_PROFILE_FN_ATTR
    compiler_attributes.h: cleanups for GCC 4.9+
    compiler_attributes.h: define __no_profile, add to noinstr
    x86, lto: Enable Clang LTO for 32-bit as well
    CFI: Move function_nocfi() into compiler.h
    MAINTAINERS: Add Clang CFI section

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - disk events cleanup (Christoph)

    - gendisk and request queue allocation simplifications (Christoph)

    - bdev_disk_changed cleanups (Christoph)

    - IO priority improvements (Bart)

    - Chained bio completion trace fix (Edward)

    - blk-wbt fixes (Jan)

    - blk-wbt enable/disable fix (Zhang)

    - Scheduler dispatch improvements (Jan, Ming)

    - Shared tagset scheduler improvements (John)

    - BFQ updates (Paolo, Luca, Pietro)

    - BFQ lock inversion fix (Jan)

    - Documentation improvements (Kir)

    - CLONE_IO block cgroup fix (Tejun)

    - Removal of ancient and deprecated block dump feature (zhangyi)

    - Discard merge fix (Ming)

    - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
    Yang)

    * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
    block: fix discard request merge
    block/mq-deadline: Remove a WARN_ON_ONCE() call
    blk-mq: update hctx->dispatch_busy in case of real scheduler
    blk: Fix lock inversion between ioc lock and bfqd lock
    bfq: Remove merged request already in bfq_requests_merged()
    block: pass a gendisk to bdev_disk_changed
    block: move bdev_disk_changed
    block: add the events* attributes to disk_attrs
    block: move the disk events code to a separate file
    block: fix trace completion for chained bio
    block/partitions/msdos: Fix typo inidicator -> indicator
    block, bfq: reset waker pointer with shared queues
    block, bfq: check waker only for queues with no in-flight I/O
    block, bfq: avoid delayed merge of async queues
    block, bfq: boost throughput by extending queue-merging times
    block, bfq: consider also creation time in delayed stable merge
    block, bfq: fix delayed stable merge check
    block, bfq: let also stably merged queues enjoy weight raising
    blk-wbt: make sure throttle is enabled properly
    blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
    ...

    Linus Torvalds
     

23 Jun, 2021

1 commit

  • We don't want compiler instrumentation to touch noinstr functions,
    which are annotated with the no_profile_instrument_function function
    attribute. Add a Kconfig test for this and make GCOV depend on it, and
    in the future, PGO.

    If an architecture is using noinstr, it should denote that via this
    Kconfig value. That makes Kconfigs that depend on noinstr able to express
    dependencies in an architecturally agnostic way.

    Cc: Masahiro Yamada
    Link: https://lore.kernel.org/lkml/YMTn9yjuemKFLbws@hirez.programming.kicks-ass.net/
    Link: https://lore.kernel.org/lkml/YMcssV%2Fn5IBGv4f0@hirez.programming.kicks-ass.net/
    Suggested-by: Nathan Chancellor
    Suggested-by: Peter Zijlstra
    Signed-off-by: Nick Desaulniers
    Reviewed-by: Peter Oberparleiter
    Reviewed-by: Nathan Chancellor
    Acked-by: Mark Rutland
    Acked-by: Heiko Carstens
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/20210621231822.2848305-4-ndesaulniers@google.com

    Nick Desaulniers
     

18 Jun, 2021

2 commits

  • Change the type and name of task_struct::state. Drop the volatile and
    shrink it to an 'unsigned int'. Rename it in order to find all uses
    such that we can use READ_ONCE/WRITE_ONCE as appropriate.
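
    A minimal userspace sketch of the access pattern this enables: the
    field itself is a plain 'unsigned int' and every access goes through a
    once-only accessor. The TOY_ macros are simplified stand-ins (the
    kernel's versions do more checking), the struct and field names are
    made up for the example, and __typeof__ is a GNU C extension.

```c
#include <assert.h>

/* Simplified stand-ins: force exactly one access via a volatile cast. */
#define TOY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define TOY_TASK_RUNNING        0x0000u
#define TOY_TASK_INTERRUPTIBLE  0x0001u

struct toy_task {
    unsigned int toy_state;   /* plain field, no volatile qualifier */
};

/* Writers publish the new state with a single store. */
static void toy_set_state(struct toy_task *t, unsigned int state)
{
    TOY_WRITE_ONCE(t->toy_state, state);
}

/* Readers take a single snapshot instead of re-reading the field. */
static unsigned int toy_get_state(struct toy_task *t)
{
    return TOY_READ_ONCE(t->toy_state);
}
```

    Dropping volatile from the declaration means the compiler is only
    constrained at the marked accesses, which is why renaming the field
    helps: any unconverted user fails to build and gets noticed.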

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Will Deacon
    Acked-by: Daniel Thompson
    Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org

    Peter Zijlstra
     
  • This commit in sched/urgent moved the cfs_rq_is_decayed() function:

    a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

    and this fresh commit in sched/core modified it in the old location:

    9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")

    Merge the two variants.

    Conflicts:
    kernel/sched/fair.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

11 Jun, 2021

1 commit


05 Jun, 2021

1 commit

  • During boot, kernel_init_freeable() initializes `cad_pid` to the init
    task's struct pid. Later on, we may change `cad_pid` via a sysctl, and
    when this happens proc_do_cad_pid() will increment the refcount on the
    new pid via get_pid(), and will decrement the refcount on the old pid
    via put_pid(). As we never called get_pid() when we initialized
    `cad_pid`, we decrement a reference we never incremented, and can therefore
    free the init task's struct pid early. As there can be dangling
    references to the struct pid, we can later encounter a use-after-free
    (e.g. when delivering signals).

    This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
    have been around since the conversion of `cad_pid` to struct pid in
    commit 9ec52099e4b8 ("[PATCH] replace cad_pid by a struct pid") from the
    pre-KASAN stone age of v2.6.19.

    Fix this by getting a reference to the init task's struct pid when we
    assign it to `cad_pid`.
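
    The imbalance can be reproduced with a small userspace model. All
    toy_ names are invented for this sketch, and a 'freed' flag stands in
    for the actual free so the effect can be observed safely.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pid {
    int  count;   /* reference count */
    bool freed;   /* set when the last reference is dropped */
};

static struct toy_pid *toy_get_pid(struct toy_pid *pid)
{
    if (pid)
        pid->count++;
    return pid;
}

static void toy_put_pid(struct toy_pid *pid)
{
    if (pid && --pid->count == 0)
        pid->freed = true;   /* the kernel would free the object here */
}

static struct toy_pid *toy_cad_pid;

/* Models the sysctl path: reference the new pid, then drop the old one. */
static void toy_set_cad_pid(struct toy_pid *new_pid)
{
    struct toy_pid *old = toy_cad_pid;

    toy_cad_pid = toy_get_pid(new_pid);
    toy_put_pid(old);
}

/* Boot-time assignment: take_ref=false is the buggy variant, true the fix. */
static void toy_init_cad_pid(struct toy_pid *init_pid, bool take_ref)
{
    toy_cad_pid = take_ref ? toy_get_pid(init_pid) : init_pid;
}
```

    With take_ref=false, a single later toy_set_cad_pid() call drops the
    init task's only reference and marks the pid freed while the task
    still points at it; with take_ref=true the count returns to 1.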

    Full KASAN splat below.

    ==================================================================
    BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
    BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
    Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273

    CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    ns_of_pid include/linux/pid.h:153 [inline]
    task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
    do_notify_parent+0x308/0xe60 kernel/signal.c:1950
    exit_notify kernel/exit.c:682 [inline]
    do_exit+0x2334/0x2bd0 kernel/exit.c:845
    do_group_exit+0x108/0x2c8 kernel/exit.c:922
    get_signal+0x4e4/0x2a88 kernel/signal.c:2781
    do_signal arch/arm64/kernel/signal.c:882 [inline]
    do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
    work_pending+0xc/0x2dc

    Allocated by task 0:
    slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
    slab_alloc_node mm/slub.c:2907 [inline]
    slab_alloc mm/slub.c:2915 [inline]
    kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
    alloc_pid+0xdc/0xc00 kernel/pid.c:180
    copy_process+0x2794/0x5e18 kernel/fork.c:2129
    kernel_clone+0x194/0x13c8 kernel/fork.c:2500
    kernel_thread+0xd4/0x110 kernel/fork.c:2552
    rest_init+0x44/0x4a0 init/main.c:687
    arch_call_rest_init+0x1c/0x28
    start_kernel+0x520/0x554 init/main.c:1064
    0x0

    Freed by task 270:
    slab_free_hook mm/slub.c:1562 [inline]
    slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
    slab_free mm/slub.c:3161 [inline]
    kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
    put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
    put_pid+0x30/0x48 kernel/pid.c:109
    proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
    proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
    proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
    call_write_iter include/linux/fs.h:1977 [inline]
    new_sync_write+0x3ac/0x510 fs/read_write.c:518
    vfs_write fs/read_write.c:605 [inline]
    vfs_write+0x9c4/0x1018 fs/read_write.c:585
    ksys_write+0x124/0x240 fs/read_write.c:658
    __do_sys_write fs/read_write.c:670 [inline]
    __se_sys_write fs/read_write.c:667 [inline]
    __arm64_sys_write+0x78/0xb0 fs/read_write.c:667
    __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
    el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
    do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
    el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
    el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
    el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701

    The buggy address belongs to the object at ffff23794dda0000
    which belongs to the cache pid of size 224
    The buggy address is located 4 bytes inside of
    224-byte region [ffff23794dda0000, ffff23794dda00e0)
    The buggy address belongs to the page:
    page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
    head:(____ptrval____) order:1 compound_mapcount:0
    flags: 0x3fffc0000010200(slab|head)
    raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
    raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
    ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
    ==================================================================

    Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
    Fixes: 9ec52099e4b8678a ("[PATCH] replace cad_pid by a struct pid")
    Signed-off-by: Mark Rutland
    Acked-by: Christian Brauner
    Cc: Cedric Le Goater
    Cc: Christian Brauner
    Cc: Eric W. Biederman
    Cc: Kees Cook
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     

04 Jun, 2021

1 commit


01 Jun, 2021

2 commits

  • Extend 8fb12156b8db ("init: Pin init task to the boot CPU, initially")
    to cover the new PF_NO_SETAFFINITY requirement.

    While there, move wait_for_completion(&kthreadd_done) into kernel_init()
    to make it absolutely clear it is the very first thing done by the init
    thread.

    Fixes: 570a752b7a9b ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed")
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Tested-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/YLS4mbKUrA3Gnb4t@hirez.programming.kicks-ass.net

    Peter Zijlstra
     
  • Add a helper to find the dev_t for a disk + partno tuple.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Link: https://lore.kernel.org/r/20210525061301.2242282-8-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 May, 2021

1 commit


12 May, 2021

3 commits

  • As pointed out by commit

    de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread")

    init_idle() can and will be invoked more than once on the same idle
    task. At boot time, it is invoked for the boot CPU thread by
    sched_init(). Then smp_init() creates the threads for all the secondary
    CPUs and invokes init_idle() on them.

    As the hotplug machinery brings the secondaries to life, it will issue
    calls to idle_thread_get(), which itself invokes init_idle() yet again.
    In this case it's invoked twice more per secondary: at _cpu_up(), and at
    bringup_cpu().

    Given smp_init() already initializes the idle tasks for all *possible*
    CPUs, no further initialization should be required. Now, removing
    init_idle() from idle_thread_get() exposes some interesting expectations
    with regards to the idle task's preempt_count: the secondary startup always
    issues a preempt_disable(), requiring some reset of the preempt count to 0
    between hot-unplug and hotplug, which is currently served by
    idle_thread_get() -> init_idle().

    Given the idle task is supposed to have preemption disabled once and never
    see it re-enabled, it seems that what we actually want is to initialize its
    preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
    init_idle() from idle_thread_get().

    Secondary startups were patched via coccinelle:

    @begone@
    @@

    -preempt_disable();
    ...
    cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

    Signed-off-by: Valentin Schneider
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com

    Valentin Schneider
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2021-05-11

    The following pull-request contains BPF updates for your *net* tree.

    We've added 13 non-merge commits during the last 8 day(s) which contain
    a total of 21 files changed, 817 insertions(+), 382 deletions(-).

    The main changes are:

    1) Fix multiple ringbuf bugs in particular to prevent writable mmap of
    read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo.

    2) Fix verifier alu32 known-const subregister bound tracking for bitwise
    operations and/or/xor, from Daniel Borkmann.

    3) Reject trampoline attachment for functions with variable arguments,
    and also add a deny list of other forbidden functions, from Jiri Olsa.

    4) Fix nested bpf_bprintf_prepare() calls used by various helpers by
    switching to per-CPU buffers, from Florent Revest.

    5) Fix kernel compilation with BTF debug info on ppc64 due to pahole
    missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau.

    6) Add a kconfig entry to provide an option to disallow unprivileged
    BPF by default, from Daniel Borkmann.

    7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY()
    macro is not available, from Arnaldo Carvalho de Melo.

    8) Migrate test_tc_redirect to test_progs framework as prep work
    for upcoming skb_change_head() fix & selftest, from Jussi Maki.

    9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not
    present, from Ian Rogers.

    10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame
    size, from Magnus Karlsson.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Right now, all core BPF related options are scattered in different Kconfig
    locations mainly due to historic reasons. Moving forward, let's add a proper
    subsystem entry under ...

    General setup --->
    BPF subsystem --->

    ... in order to have all knobs in a single location and thus ease BPF related
    configuration. Networking related bits such as sockmap are out of scope for
    the general setup and therefore better suited to remain in net/Kconfig.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/f23f58765a4d59244ebd8037da7b6a6b2fb58446.1620765074.git.daniel@iogearbox.net

    Daniel Borkmann
     

11 May, 2021

1 commit

  • Once srcu_init() is called, the SRCU core will make use of delayed
    workqueues, which rely on timers. However init_timers() is called
    several steps after rcu_init(). This means that a call_srcu() after
    rcu_init() but before init_timers() would find itself within a dangerously
    uninitialized timer core.

    This commit therefore creates a separate call to srcu_init() after
    init_timers() completes, which ensures that we stay in early SRCU mode
    until timers are safe(r).
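
    The ordering constraint can be modelled in userspace roughly as
    follows. This is a sketch under stated assumptions, not the SRCU code:
    the toy_ functions, the fixed-size early list and the counter are all
    invented to show why callbacks are parked until timers are usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_EARLY 16

typedef void (*toy_cb_t)(void);

static bool     toy_timers_ready;
static bool     toy_srcu_ready;
static toy_cb_t toy_early[TOY_MAX_EARLY];
static size_t   toy_early_nr;
static int      toy_delayed_work_queued;

static void toy_noop_cb(void) { }

/* Delayed work depends on the timer core being initialized. */
static void toy_queue_delayed_work(toy_cb_t cb)
{
    (void)cb;
    assert(toy_timers_ready);
    toy_delayed_work_queued++;
}

/* In early mode, callbacks are only parked on a list. */
static bool toy_call_srcu(toy_cb_t cb)
{
    if (!toy_srcu_ready) {
        if (toy_early_nr == TOY_MAX_EARLY)
            return false;
        toy_early[toy_early_nr++] = cb;
        return true;
    }
    toy_queue_delayed_work(cb);
    return true;
}

static void toy_init_timers(void)
{
    toy_timers_ready = true;
}

/* Leaving early mode: must run after toy_init_timers(). */
static void toy_srcu_init(void)
{
    toy_srcu_ready = true;
    for (size_t i = 0; i < toy_early_nr; i++)
        toy_queue_delayed_work(toy_early[i]);
    toy_early_nr = 0;
}
```

    Calling toy_srcu_init() before toy_init_timers() with a parked
    callback would trip the assertion, which is the hazard the separate,
    later call avoids.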

    Signed-off-by: Frederic Weisbecker
    Cc: Uladzislau Rezki
    Cc: Boqun Feng
    Cc: Lai Jiangshan
    Cc: Neeraj Upadhyay
    Cc: Josh Triplett
    Cc: Joel Fernandes
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
     

07 May, 2021

3 commits

  • Merge yet more updates from Andrew Morton:
    "This is everything else from -mm for this merge window.

    90 patches.

    Subsystems affected by this patch series: mm (cleanups and slub),
    alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat,
    checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov,
    panic, delayacct, gdb, resource, selftests, async, initramfs, ipc,
    drivers/char, and spelling"

    * emailed patches from Andrew Morton: (90 commits)
    mm: fix typos in comments
    mm: fix typos in comments
    treewide: remove editor modelines and cruft
    ipc/sem.c: spelling fix
    fs: fat: fix spelling typo of values
    kernel/sys.c: fix typo
    kernel/up.c: fix typo
    kernel/user_namespace.c: fix typos
    kernel/umh.c: fix some spelling mistakes
    include/linux/pgtable.h: few spelling fixes
    mm/slab.c: fix spelling mistake "disired" -> "desired"
    scripts/spelling.txt: add "overflw"
    scripts/spelling.txt: Add "diabled" typo
    scripts/spelling.txt: add "overlfow"
    arm: print alloc free paths for address in registers
    mm/vmalloc: remove vwrite()
    mm: remove xlate_dev_kmem_ptr()
    drivers/char: remove /dev/kmem for good
    mm: fix some typos and code style problems
    ipc/sem.c: mundane typo fixes
    ...

    Linus Torvalds
     
  • Allow the developer to specify the initial value of the modprobe_path[]
    string. This can be used to set it to the empty string initially, thus
    effectively disabling request_module() during early boot until userspace
    writes a new value via the /proc/sys/kernel/modprobe interface. [1]

    When building a custom kernel (often for an embedded target), it's normal
    to build everything into the kernel that is needed for booting, and indeed
    the initramfs often contains no modules at all, so every such
    request_module() done before userspace init has mounted the real rootfs is
    a waste of time.

    This is particularly useful when combined with the previous patch, which
    made the initramfs unpacking asynchronous - for that to work, it had to
    make any usermodehelper call wait for the unpacking to finish before
    attempting to invoke the userspace helper. By eliminating all such
    (known-to-be-futile) calls of usermodehelper, the initramfs unpacking and
    the {device,late}_initcalls can proceed in parallel for much longer.

    For a relatively slow ppc board I'm working on, the two patches combined
    lead to 0.2s faster boot - but more importantly, the fact that the
    initramfs unpacking proceeds completely in the background while devices
    get probed means I get to handle the gpio watchdog in time without getting
    reset.

    [1] __request_module() already has an early -ENOENT return when
    modprobe_path is the empty string.
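
    A userspace sketch of the behaviour described in [1], assuming the
    initial value has been configured to the empty string. The toy_
    functions and the spawn counter are stand-ins invented for this
    example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Assumed build-time default for this sketch: the empty string. */
static char toy_modprobe_path[256] = "";
static int  toy_helper_spawns;

/* Early return when no helper is configured; otherwise "run" the helper. */
static int toy_request_module(const char *name)
{
    (void)name;
    if (toy_modprobe_path[0] == '\0')
        return -ENOENT;
    toy_helper_spawns++;   /* stand-in for spawning the userspace helper */
    return 0;
}

/* Models userspace writing a new path through the sysctl interface. */
static void toy_write_modprobe_sysctl(const char *path)
{
    snprintf(toy_modprobe_path, sizeof(toy_modprobe_path), "%s", path);
}
```

    Until userspace writes a path, every request fails fast without
    spawning anything, so nothing has to wait for the initramfs.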

    Link: https://lkml.kernel.org/r/20210313212528.2956377-3-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Jessica Yu
    Acked-by: Luis Chamberlain
    Cc: Borislav Petkov
    Cc: Jonathan Corbet
    Cc: Nick Desaulniers
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.

    These two patches are independent, but better-together.

    The second is a rather trivial patch that simply allows the developer to
    change "/sbin/modprobe" to something else - e.g. the empty string, so
    that all request_module() during early boot return -ENOENT early, without
    even spawning a usermode helper, needlessly synchronizing with the
    initramfs unpacking.

    The first patch delegates decompressing the initramfs to a worker thread,
    allowing do_initcalls() in main.c to proceed to the device_ and late_
    initcalls without waiting for that decompression (and populating of
    rootfs) to finish. Obviously, some of those later calls may rely on the
    initramfs being available, so I've added synchronization points in the
    firmware loader and usermodehelper paths - there might be other places
    that would need this, but so far no one has been able to think of any
    places I have missed.

    There's not much to win if most of the functionality needed during boot is
    only available as modules. But systems with a custom-made .config and
    initramfs can boot faster, partly due to utilizing more than one cpu
    earlier, partly by avoiding known-futile modprobe calls (which would still
    trigger synchronization with the initramfs unpacking, thus eliminating
    most of the first benefit).

    This patch (of 2):

    Most of the boot process doesn't actually need anything from the
    initramfs, until of course PID1 is to be executed. So instead of doing
    the decompressing and populating of the initramfs synchronously in
    populate_rootfs() itself, push that off to a worker thread.

    This is primarily motivated by an embedded ppc target, where unpacking
    even the rather modest sized initramfs takes 0.6 seconds, which is long
    enough that the external watchdog becomes unhappy that it doesn't get
    attention soon enough. By doing the initramfs decompression in a worker
    thread, we get to do the device_initcalls and hence start petting the
    watchdog much sooner.

    Normal desktops might benefit as well. On my mostly stock Ubuntu kernel,
    my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
    That takes almost two seconds:

    [ 0.201454] Trying to unpack rootfs image as initramfs...
    [ 1.976633] Freeing initrd memory: 29416K

    Before this patch, these lines occur consecutively in dmesg. With this
    patch, the timestamps on these two lines are roughly the same as above, but
    with 172 lines in between - so more than one cpu has been kept busy doing
    work that would otherwise only happen after populate_rootfs() had
    finished.

    Should one of the initcalls done after rootfs_initcall time (i.e., device_
    and late_ initcalls) need something from the initramfs (say, a kernel
    module or a firmware blob), it will simply wait for the initramfs
    unpacking to be done before proceeding, which should in theory make this
    completely safe.

    But if some driver pokes around in the filesystem directly and not via one
    of the official kernel interfaces (i.e. request_firmware*(),
    call_usermodehelper*) that theory may not hold - also, I certainly might
    have missed a spot when sprinkling wait_for_initramfs(). So there is an
    escape hatch in the form of an initramfs_async= command line parameter.
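
    The pattern described above (start the unpacking in a worker, let later
    init work proceed, and block only at the points that need the result)
    can be sketched in userspace with POSIX threads. Every name below is
    hypothetical and the "unpacking" is a placeholder; this is an
    illustration of the synchronization idea, not the kernel implementation.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion-style state shared by the worker and its waiters. */
struct sketch_unpack {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t done_cond;
    bool done;
    int files_unpacked;
};

/* Worker: stands in for decompressing and populating the rootfs. */
static void *sketch_unpack_worker(void *arg)
{
    struct sketch_unpack *u = arg;

    pthread_mutex_lock(&u->lock);
    u->files_unpacked = 3;              /* placeholder for the real work */
    u->done = true;
    pthread_cond_broadcast(&u->done_cond);
    pthread_mutex_unlock(&u->lock);
    return NULL;
}

/* Kick off unpacking and return at once so later init work can proceed. */
static int sketch_populate_rootfs_async(struct sketch_unpack *u)
{
    pthread_mutex_init(&u->lock, NULL);
    pthread_cond_init(&u->done_cond, NULL);
    u->done = false;
    u->files_unpacked = 0;
    return pthread_create(&u->thread, NULL, sketch_unpack_worker, u);
}

/* Synchronization point: block until the worker has finished. */
static void sketch_wait_for_initramfs(struct sketch_unpack *u)
{
    pthread_mutex_lock(&u->lock);
    while (!u->done)
        pthread_cond_wait(&u->done_cond, &u->lock);
    pthread_mutex_unlock(&u->lock);
}
```

    Code that reads from the filesystem without first calling the wait
    function is exactly the case the paragraph above warns about.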

    Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
    Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Reviewed-by: Luis Chamberlain
    Cc: Jessica Yu
    Cc: Borislav Petkov
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Nick Desaulniers
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

06 May, 2021

2 commits

  • Merge more updates from Andrew Morton:
    "The remainder of the main mm/ queue.

    143 patches.

    Subsystems affected by this patch series (all mm): pagecache, hugetlb,
    userfaultfd, vmscan, compaction, migration, cma, ksm, vmstat, mmap,
    kconfig, util, memory-hotplug, zswap, zsmalloc, highmem, cleanups, and
    kfence"

    * emailed patches from Andrew Morton : (143 commits)
    kfence: use power-efficient work queue to run delayed work
    kfence: maximize allocation wait timeout duration
    kfence: await for allocation using wait_event
    kfence: zero guard page after out-of-bounds access
    mm/process_vm_access.c: remove duplicate include
    mm/mempool: minor coding style tweaks
    mm/highmem.c: fix coding style issue
    btrfs: use memzero_page() instead of open coded kmap pattern
    iov_iter: lift memzero_page() to highmem.h
    mm/zsmalloc: use BUG_ON instead of if condition followed by BUG.
    mm/zswap.c: switch from strlcpy to strscpy
    arm64/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
    x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
    mm,memory_hotplug: add kernel boot option to enable memmap_on_memory
    acpi,memhotplug: enable MHP_MEMMAP_ON_MEMORY when supported
    mm,memory_hotplug: allocate memmap from the added memory range
    mm,memory_hotplug: factor out adjusting present pages into adjust_present_page_count()
    mm,memory_hotplug: relax fully spanned sections check
    drivers/base/memory: introduce memory_block_{online,offline}
    mm/memory_hotplug: remove broken locking of zone PCP structures during hot remove
    ...

    Linus Torvalds
     
  • Patch series "userfaultfd: add minor fault handling", v9.

    Overview
    ========

    This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
    When enabled (via the UFFDIO_API ioctl), this feature means that any
    hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
    get events for "minor" faults. By "minor" fault, I mean the following
    situation:

    Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
    memory). One of the mappings is registered with userfaultfd (in minor
    mode), and the other is not. Via the non-UFFD mapping, the underlying
    pages have already been allocated & filled with some contents. The UFFD
    mapping has not yet been faulted in; when it is touched for the first
    time, this results in what I'm calling a "minor" fault. As a concrete
    example, when working with hugetlbfs, we have huge_pte_none(), but
    find_lock_page() finds an existing page.

    We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE. The idea
    is, userspace resolves the fault by either a) doing nothing if the
    contents are already correct, or b) updating the underlying contents using
    the second, non-UFFD mapping (via memcpy/memset or similar, or something
    fancier like RDMA). In either case, userspace issues
    UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
    correct, carry on setting up the mapping".

    Use Case
    ========

    Consider the use case of VM live migration (e.g. under QEMU/KVM):

    1. While a VM is still running, we copy the contents of its memory to a
    target machine. The pages are populated on the target by writing to the
    non-UFFD mapping, using the setup described above. The VM is still running
    (and therefore its memory is likely changing), so this may be repeated
    several times, until we decide the target is "up to date enough".

    2. We pause the VM on the source, and start executing on the target machine.
    During this gap, the VM's user(s) will *see* a pause, so it is desirable to
    minimize this window.

    3. Between the last time any page was copied from the source to the target, and
    when the VM was paused, the contents of that page may have changed - and
    therefore the copy we have on the target machine is out of date. Although we
    can keep track of which pages are out of date, for VMs with large amounts of
    memory, it is "slow" to transfer this information to the target machine. We
    want to resume execution before such a transfer would complete.

    4. So, the guest begins executing on the target machine. The first time it
    touches its memory (via the UFFD-registered mapping), userspace wants to
    intercept this fault. Userspace checks whether or not the page is up to date,
    and if not, copies the updated page from the source machine, via the non-UFFD
    mapping. Finally, whether a copy was performed or not, userspace issues a
    UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
    are correct, carry on setting up the mapping".

    We don't have to do all of the final updates on-demand. The userfaultfd manager
    can, in the background, also copy over updated pages once it receives the map of
    which pages are up-to-date or not.
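
    Step 4 above can be sketched as a small in-memory model. The struct,
    the tiny page size, and the "mapped" flag (standing in for a completed
    UFFDIO_CONTINUE) are all hypothetical; no real userfaultfd calls are
    made here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8   /* tiny "pages" keep the example readable */
#define SKETCH_NPAGES    4

struct sketch_migration {
    unsigned char source[SKETCH_NPAGES][SKETCH_PAGE_SIZE]; /* memory on the source host */
    unsigned char target[SKETCH_NPAGES][SKETCH_PAGE_SIZE]; /* non-UFFD mapping on the target */
    bool stale[SKETCH_NPAGES];   /* page changed after its last copy */
    bool mapped[SKETCH_NPAGES];  /* stands in for a completed UFFDIO_CONTINUE */
};

/*
 * Handle a minor fault on one page: refresh the contents through the
 * non-UFFD mapping only if they are out of date, then always "continue".
 * Returns true if a copy was needed.
 */
static bool sketch_handle_minor_fault(struct sketch_migration *m, size_t page)
{
    bool copied = false;

    if (m->stale[page]) {
        memcpy(m->target[page], m->source[page], SKETCH_PAGE_SIZE);
        m->stale[page] = false;
        copied = true;
    }
    m->mapped[page] = true;  /* whether or not a copy was performed */
    return copied;
}
```

    The same function could be driven by a background thread walking the
    stale map, which corresponds to the non-on-demand updates mentioned
    above.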

    Interaction with Existing APIs
    ==============================

    Because this is a feature, a registered VMA could potentially receive both
    missing and minor faults. I spent some time thinking through how the
    existing API interacts with the new feature:

    UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
    allocate a new page. If UFFDIO_CONTINUE is used on a non-minor fault:

    - For non-shared memory or shmem, -EINVAL is returned.
    - For hugetlb, -EFAULT is returned.

    UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
    Without modifications, the existing codepath assumes a new page needs to
    be allocated. This is okay, since userspace must have a second
    non-UFFD-registered mapping anyway, thus there isn't much reason to want
    to use these in any case (just memcpy or memset or similar).

    - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
    - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
    in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
    - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
    -ENOENT in that case (regardless of the kind of fault).
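
    The error codes listed above can be collected into one lookup
    function. This is a sketch only: the enum and function names are
    hypothetical, success is modeled as 0, and combinations the text does
    not mention (for example a successful UFFDIO_COPY on a missing fault)
    are assumptions. UFFDIO_WRITEPROTECT is left out.

```c
#include <errno.h>

enum sketch_ioctl { SKETCH_CONTINUE, SKETCH_COPY, SKETCH_ZEROPAGE };
enum sketch_fault { SKETCH_FAULT_MISSING, SKETCH_FAULT_MINOR };
enum sketch_mem   { SKETCH_MEM_PRIVATE, SKETCH_MEM_SHMEM, SKETCH_MEM_HUGETLB };

/* Return the modeled result of using ioctl `op` to resolve `fault` on `mem`. */
static int sketch_resolve_result(enum sketch_ioctl op, enum sketch_fault fault,
                                 enum sketch_mem mem)
{
    switch (op) {
    case SKETCH_CONTINUE:
        if (fault == SKETCH_FAULT_MINOR)
            return 0;
        /* CONTINUE allocates nothing, so it cannot fix a missing page. */
        return mem == SKETCH_MEM_HUGETLB ? -EFAULT : -EINVAL;
    case SKETCH_COPY:
        return fault == SKETCH_FAULT_MINOR ? -EEXIST : 0;
    case SKETCH_ZEROPAGE:
        if (mem == SKETCH_MEM_HUGETLB)
            return -EINVAL;  /* unsupported on hugetlb in any case */
        return fault == SKETCH_FAULT_MINOR ? -EEXIST : 0;
    }
    return -EINVAL;          /* unknown operation */
}
```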

    Future Work
    ===========

    This series only supports hugetlbfs. I have a second series in flight to
    support shmem as well, extending the functionality. This series is more
    mature than the shmem support at this point, and the functionality works
    fully on hugetlbfs, so this series can be merged first and then shmem
    support will follow.

    This patch (of 6):

    This feature allows userspace to intercept "minor" faults. By "minor"
    faults, I mean the following situation:

    Let there exist two mappings (i.e., VMAs) to the same page(s). One of the
    mappings is registered with userfaultfd (in minor mode), and the other is
    not. Via the non-UFFD mapping, the underlying pages have already been
    allocated & filled with some contents. The UFFD mapping has not yet been
    faulted in; when it is touched for the first time, this results in what
    I'm calling a "minor" fault. As a concrete example, when working with
    hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
    page.

    This commit adds the new registration mode, and sets the relevant flag on
    the VMAs being registered. In the hugetlb fault path, if we find that we
    have huge_pte_none(), but find_lock_page() does indeed find an existing
    page, then we have a "minor" fault, and if the VMA has the userfaultfd
    registration flag, we call into userfaultfd to handle it.
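
    A sketch of the decision made in the fault path, using plain booleans
    in place of huge_pte_none() and the find_lock_page() result. The types
    and names are hypothetical, and the handling of the missing-fault
    branch is an assumption added for contrast.

```c
#include <stdbool.h>

enum sketch_fault_action {
    SKETCH_NORMAL_FAULT,     /* the kernel resolves the fault itself */
    SKETCH_NOTIFY_MISSING,   /* hand a "missing" fault to userfaultfd */
    SKETCH_NOTIFY_MINOR,     /* hand a "minor" fault to userfaultfd */
};

/* Hypothetical per-VMA registration flags. */
struct sketch_vma {
    bool uffd_missing;
    bool uffd_minor;
};

static enum sketch_fault_action
sketch_classify_fault(const struct sketch_vma *vma, bool pte_is_none,
                      bool page_in_cache)
{
    if (!pte_is_none)
        return SKETCH_NORMAL_FAULT;  /* mapping already exists */

    if (page_in_cache)               /* page exists but is not mapped here */
        return vma->uffd_minor ? SKETCH_NOTIFY_MINOR : SKETCH_NORMAL_FAULT;

    return vma->uffd_missing ? SKETCH_NOTIFY_MISSING : SKETCH_NORMAL_FAULT;
}
```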

    This is implemented as a new registration mode, instead of an API feature.
    This is because the alternative implementation has significant drawbacks
    [1].

    However, doing it this way requires we allocate a VM_* flag for the new
    registration mode. On 32-bit systems, there are no unused bits, so this
    feature is only supported on architectures with
    CONFIG_ARCH_USES_HIGH_VMA_FLAGS. When attempting to register a VMA in
    MINOR mode on 32-bit architectures, we return -EINVAL.

    [1] https://lore.kernel.org/patchwork/patch/1380226/

    [peterx@redhat.com: fix minor fault page leak]
    Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

    Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
    Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
    Signed-off-by: Axel Rasmussen
    Reviewed-by: Peter Xu
    Reviewed-by: Mike Kravetz
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrea Arcangeli
    Cc: Anshuman Khandual
    Cc: Catalin Marinas
    Cc: Chinwen Chang
    Cc: Huang Ying
    Cc: Ingo Molnar
    Cc: Jann Horn
    Cc: Jerome Glisse
    Cc: Lokesh Gidra
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Michael Ellerman
    Cc: "Michal Koutný"
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Nicholas Piggin
    Cc: Peter Xu
    Cc: Shaohua Li
    Cc: Shawn Anastasio
    Cc: Steven Rostedt
    Cc: Steven Price
    Cc: Vlastimil Babka
    Cc: Adam Ruprecht
    Cc: Axel Rasmussen
    Cc: Cannon Matthews
    Cc: "Dr . David Alan Gilbert"
    Cc: David Rientjes
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     

04 May, 2021

1 commit

  • Pull tracing updates from Steven Rostedt:
    "New feature:

    - A new "func-no-repeats" option in tracefs/options directory.

    When set the function tracer will detect if the current function
    being traced is the same as the previous one, and instead of
    recording it, it will keep track of the number of times that the
    function is repeated in a row. And when another function is
    recorded, it will write a new event that shows the function that
    repeated, the number of times it repeated and the time stamp of
    when the last repeated function occurred.

    Enhancements:

    - In order to implement the above "func-no-repeats" option, the ring
    buffer timestamp can now give the accurate timestamp of the event
    as it is being recorded, instead of having to record an absolute
    timestamp for all events. This helps the histogram code which no
    longer needs to waste ring buffer space.

    - New validation logic to make sure all trace events that access
    dereferenced pointers do so in a safe way, and will warn otherwise.

    Fixes:

    - No longer limit the PIDs of tasks that are recorded for
    "saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
    for a much larger range. This caused the mapping of PIDs to the
    task names to be dropped for all tasks with a PID greater than
    32768.

    - Change trace_clock_global() to never block. This caused a deadlock.

    Clean ups:

    - Typos, prototype fixes, and removing of duplicate or unused code.

    - Better management of ftrace_page allocations"
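
    The "func-no-repeats" behaviour described above can be sketched as a
    run-length collapse over a stream of traced functions. The event
    layout and all names are hypothetical and do not reflect the actual
    ring buffer format.

```c
#include <stddef.h>
#include <stdint.h>

enum sketch_event_type { SKETCH_EV_FUNC, SKETCH_EV_FUNC_REPEATS };

struct sketch_event {
    enum sketch_event_type type;
    uintptr_t ip;          /* traced function */
    uint64_t ts;           /* for REPEATS: time of the last repeat */
    unsigned int count;    /* for REPEATS: number of suppressed calls */
};

struct sketch_tracer {
    struct sketch_event *buf;
    size_t cap, len;
    uintptr_t last_ip;
    uint64_t last_ts;
    unsigned int repeats;
    int have_last;
};

static void sketch_emit(struct sketch_tracer *t, enum sketch_event_type type,
                        uintptr_t ip, uint64_t ts, unsigned int count)
{
    if (t->len < t->cap)   /* silently drop events once the buffer is full */
        t->buf[t->len++] = (struct sketch_event){ type, ip, ts, count };
}

/* Record one function entry, collapsing consecutive repeats. */
static void sketch_trace_function(struct sketch_tracer *t, uintptr_t ip,
                                  uint64_t ts)
{
    if (t->have_last && ip == t->last_ip) {
        t->repeats++;      /* count it instead of recording it */
        t->last_ts = ts;
        return;
    }
    if (t->repeats) {      /* a different function: flush the pending run */
        sketch_emit(t, SKETCH_EV_FUNC_REPEATS, t->last_ip, t->last_ts,
                    t->repeats);
        t->repeats = 0;
    }
    sketch_emit(t, SKETCH_EV_FUNC, ip, ts, 0);
    t->last_ip = ip;
    t->last_ts = ts;
    t->have_last = 1;
}
```

    Tracing the same function three times and then a different one yields
    three events in this model: the first call, one repeat event with a
    count of two, and the new function.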

    * tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
    tracing: Restructure trace_clock_global() to never block
    tracing: Map all PIDs to command lines
    ftrace: Reuse the output of the function tracer for func_repeats
    tracing: Add "func_no_repeats" option for function tracing
    tracing: Unify the logic for function tracing options
    tracing: Add method for recording "func_repeats" events
    tracing: Add "last_func_repeats" to struct trace_array
    tracing: Define new ftrace event "func_repeats"
    tracing: Define static void trace_print_time()
    ftrace: Simplify the calculation of page number for ftrace_page->records some more
    ftrace: Store the order of pages allocated in ftrace_page
    tracing: Remove unused argument from "ring_buffer_time_stamp()
    tracing: Remove duplicate struct declaration in trace_events.h
    tracing: Update create_system_filter() kernel-doc comment
    tracing: A minor cleanup for create_system_filter()
    kernel: trace: Mundane typo fixes in the file trace_events_filter.c
    tracing: Fix various typos in comments
    scripts/recordmcount.pl: Make vim and emacs indent the same
    scripts/recordmcount.pl: Make indent spacing consistent
    tracing: Add a verifier to check string pointers for trace events
    ...

    Linus Torvalds
     

02 May, 2021

1 commit

  • Pull IMA updates from Mimi Zohar:
    "In addition to loading the kernel module signing key onto the builtin
    keyring, load it onto the IMA keyring as well.

    Also six trivial changes and bug fixes"

    * tag 'integrity-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
    ima: ensure IMA_APPRAISE_MODSIG has necessary dependencies
    ima: Fix fall-through warnings for Clang
    integrity: Add declarations to init_once void arguments.
    ima: Fix function name error in comment.
    ima: enable loading of build time generated key on .ima keyring
    ima: enable signing of modules with build time generated key
    keys: cleanup build time module signing keys
    ima: Fix the error code for restoring the PCR value
    ima: without an IMA policy loaded, return quickly

    Linus Torvalds
     

01 May, 2021

2 commits

  • mem_init_print_info() is called in mem_init() on each architecture and is
    always passed a NULL argument, so give it a void argument and move the call
    into mm_init().

    Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Dave Hansen [x86]
    Reviewed-by: Christophe Leroy [powerpc]
    Acked-by: David Hildenbrand
    Tested-by: Anatoly Pugachev [sparc64]
    Acked-by: Russell King [arm]
    Acked-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Richard Henderson
    Cc: Guo Ren
    Cc: Yoshinori Sato
    Cc: Huacai Chen
    Cc: Jonas Bonn
    Cc: Palmer Dabbelt
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: "Peter Zijlstra"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     
  • This changes the awkward approach where architectures provide init
    functions to determine which levels they can provide large mappings for,
    to one where the arch is queried for each call.

    This removes code and indirection, and allows constant-folding of dead
    code for unsupported levels.

    This also adds a prot argument to the arch query. This is unused
    currently but could help with some architectures (e.g., some powerpc
    processors can't map uncacheable memory with large pages).
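
    A sketch of the "query the arch on each call" shape described above.
    The level enum, the protection type, the 2 MiB and 4 KiB sizes, and
    the rule that uncached memory cannot use large mappings are all
    hypothetical; the commit itself notes that the prot argument is
    currently unused.

```c
#include <stdbool.h>

enum sketch_level { SKETCH_LEVEL_PMD, SKETCH_LEVEL_PUD };

typedef unsigned long sketch_pgprot_t;   /* hypothetical protection bits */
#define SKETCH_PROT_UNCACHED 0x1UL

/*
 * Hypothetical arch hook: PMD-level large mappings only for cacheable
 * memory, PUD-level never. Because it is a static inline returning a
 * constant for PUD, a compiler can fold away the unsupported branch.
 */
static inline bool sketch_arch_vmap_supported(enum sketch_level level,
                                              sketch_pgprot_t prot)
{
    if (level == SKETCH_LEVEL_PUD)
        return false;
    return !(prot & SKETCH_PROT_UNCACHED);
}

/* Generic caller: choose a mapping size, asking the arch on every call. */
static unsigned long sketch_pick_map_size(unsigned long len,
                                          sketch_pgprot_t prot)
{
    const unsigned long pmd_size = 2UL << 20;   /* assumed 2 MiB */

    if (len >= pmd_size && sketch_arch_vmap_supported(SKETCH_LEVEL_PMD, prot))
        return pmd_size;
    return 4096UL;                              /* assumed base page size */
}
```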

    Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Ding Tianhong
    Acked-by: Catalin Marinas [arm64]
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Christoph Hellwig
    Cc: Miaohe Lin
    Cc: Michael Ellerman
    Cc: Russell King
    Cc: Uladzislau Rezki (Sony)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

30 Apr, 2021

3 commits

  • Pull Kconfig updates from Masahiro Yamada:

    - Change 'option defconfig' to the environment variable
    KCONFIG_DEFCONFIG_LIST

    - Refactor tinyconfig without using allnoconfig_y

    - Remove 'option allnoconfig_y' syntax

    - Change 'option modules' to 'modules'

    - Do not use /boot/config-* etc. as base config for cross-compilation

    - Fix a search bug in nconf

    - Various code cleanups

    * tag 'kconfig-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
    kconfig: refactor .gitignore
    kconfig: highlight xconfig 'comment' lines with '***'
    kconfig: highlight gconfig 'comment' lines with '***'
    kconfig: gconf: remove unused code
    kconfig: remove unused PACKAGE definition
    kconfig: nconf: stop endless search loops
    kconfig: split menu.c out of parser.y
    kconfig: nconf: refactor in print_in_middle()
    kconfig: nconf: remove meaningless wattrset() call from show_menu()
    kconfig: nconf: change set_config_filename() to void function
    kconfig: nconf: refactor attributes setup code
    kconfig: nconf: remove unneeded default for menu prompt
    kconfig: nconf: get rid of (void) casts from wattrset() calls
    kconfig: nconf: fix NORMAL attributes
    kconfig: mconf,nconf: remove unneeded '\0' termination after snprintf()
    kconfig: use /boot/config-* etc. as DEFCONFIG_LIST only for native build
    kconfig: change sym_change_count to a boolean flag
    kconfig: nconf: fix core dump when searching in empty menu
    kconfig: lxdialog: A spello fix and a punctuation added
    kconfig: streamline_config.pl: Couple of typo fixes
    ...

    Linus Torvalds
     
  • Pull Kbuild updates from Masahiro Yamada:

    - Evaluate $(call cc-option,...) etc. only for build targets

    - Add CONFIG_VMLINUX_MAP to generate .map file when linking vmlinux

    - Remove unnecessary --gcc-toolchains Clang flag because the --prefix
    flag finds the toolchains

    - Do not pass Clang's --prefix flag when using the integrated as

    - Check the assembler version in Kconfig time

    - Add new CONFIG options, AS_VERSION, AS_IS_GNU, AS_IS_LLVM to clean up
    some dependencies in Kconfig

    - Fix invalid Module.symvers creation when building only modules
    without vmlinux

    - Fix false-positive modpost warnings when CONFIG_TRIM_UNUSED_KSYMS is
    set, but there is no module to build

    - Refactor module installation Makefile

    - Support zstd for module compression

    - Convert alpha and ia64 to use generic shell scripts to generate the
    syscall headers

    - Add a new elfnote to indicate if the kernel was built with LTO, which
    will be used by pahole

    - Flatten the directory structure under include/config/ so CONFIG
    options and filenames match

    - Change the deb source package name from linux-$(KERNELRELEASE) to
    linux-upstream

    * tag 'kbuild-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (42 commits)
    kbuild: Add $(KBUILD_HOSTLDFLAGS) to 'has_libelf' test
    kbuild: deb-pkg: change the source package name to linux-upstream
    tools: do not include scripts/Kbuild.include
    kbuild: redo fake deps at include/config/*.h
    kbuild: remove TMPO from try-run
    MAINTAINERS: add pattern for dummy-tools
    kbuild: add an elfnote for whether vmlinux is built with lto
    ia64: syscalls: switch to generic syscallhdr.sh
    ia64: syscalls: switch to generic syscalltbl.sh
    alpha: syscalls: switch to generic syscallhdr.sh
    alpha: syscalls: switch to generic syscalltbl.sh
    sysctl: use min() helper for namecmp()
    kbuild: add support for zstd compressed modules
    kbuild: remove CONFIG_MODULE_COMPRESS
    kbuild: merge scripts/Makefile.modsign to scripts/Makefile.modinst
    kbuild: move module strip/compression code into scripts/Makefile.modinst
    kbuild: refactor scripts/Makefile.modinst
    kbuild: rename extmod-prefix to extmod_prefix
    kbuild: check module name conflict for external modules as well
    kbuild: show the target directory for depmod log
    ...

    Linus Torvalds
     
  • Pull networking updates from Jakub Kicinski:
    "Core:

    - bpf:
    - allow bpf programs calling kernel functions (initially to
    reuse TCP congestion control implementations)
    - enable task local storage for tracing programs - remove the
    need to store per-task state in hash maps, and allow tracing
    programs access to task local storage previously added for
    BPF_LSM
    - add bpf_for_each_map_elem() helper, allowing programs to walk
    all map elements in a more robust and easier to verify fashion
    - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
    redirection
    - lpm: add support for batched ops in LPM trie
    - add BTF_KIND_FLOAT support - mostly to allow use of BTF on
    s390 which has floats in its header files
    - improve BPF syscall documentation and extend the use of kdoc
    parsing scripts we already employ for bpf-helpers
    - libbpf, bpftool: support static linking of BPF ELF files
    - improve support for encapsulation of L2 packets

    - xdp: restructure redirect actions to avoid a runtime lookup,
    improving performance by 4-8% in microbenchmarks

    - xsk: build skb by page (aka generic zerocopy xmit) - improve
    performance of software AF_XDP path by 33% for devices which don't
    need headers in the linear skb part (e.g. virtio)

    - nexthop: resilient next-hop groups - improve path stability on
    next-hops group changes (incl. offload for mlxsw)

    - ipv6: segment routing: add support for IPv4 decapsulation

    - icmp: add support for RFC 8335 extended PROBE messages

    - inet: use bigger hash table for IP ID generation

    - tcp: deal better with delayed TX completions - make sure we don't
    give up on fast TCP retransmissions only because driver is slow in
    reporting that it completed transmitting the original

    - tcp: reorder tcp_congestion_ops for better cache locality

    - mptcp:
    - add sockopt support for common TCP options
    - add support for common TCP msg flags
    - include multiple address ids in RM_ADDR
    - add reset option support for resetting one subflow

    - udp: GRO L4 improvements - improve 'forward' / 'frag_list'
    co-existence with UDP tunnel GRO, allowing the first to take place
    correctly even for encapsulated UDP traffic

    - micro-optimize dev_gro_receive() and flow dissection, avoid
    retpoline overhead on VLAN and TEB GRO

    - use less memory for sysctls, add a new sysctl type, to allow using
    u8 instead of "int" and "long" and shrink networking sysctls

    - veth: allow GRO without XDP - this allows aggregating UDP packets
    before handing them off to routing, bridge, OvS, etc.

    - allow specifying ifindex when device is moved to another namespace

    - netfilter:
    - nft_socket: add support for cgroupsv2
    - nftables: add catch-all set element - special element used to
    define a default action in case normal lookup missed
    - use net_generic infra in many modules to avoid allocating
    per-ns memory unnecessarily

    - xps: improve the xps handling to avoid potential out-of-bound
    accesses and use-after-free when XPS changes race with other
    re-configuration under traffic

    - add a config knob to turn off per-cpu netdev refcnt to catch
    underflows in testing

    Device APIs:

    - add WWAN subsystem to organize the WWAN interfaces better and
    hopefully start driving towards more unified and vendor-
    independent APIs

    - ethtool:
    - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
    support)
    - allow network drivers to dump arbitrary SFP EEPROM data,
    current offset+length API was a poor fit for modern SFP which
    define EEPROM in terms of pages (incl. mlx5 support)

    - act_police, flow_offload: add support for packet-per-second
    policing (incl. offload for nfp)

    - psample: add additional metadata attributes like transit delay for
    packets sampled from switch HW (and corresponding egress and
    policy-based sampling in the mlxsw driver)

    - dsa: improve support for sandwiched LAGs with bridge and DSA

    - netfilter:
    - flowtable: use direct xmit in topologies with IP forwarding,
    bridging, vlans etc.
    - nftables: counter hardware offload support

    - Bluetooth:
    - improvements for firmware download w/ Intel devices
    - add support for reading AOSP vendor capabilities
    - add support for virtio transport driver

    - mac80211:
    - allow concurrent monitor iface and ethernet rx decap
    - set priority and queue mapping for injected frames

    - phy: add support for Clause-45 PHY Loopback

    - pci/iov: add sysfs MSI-X vector assignment interface to distribute
    MSI-X resources to VFs (incl. mlx5 support)

    New hardware/drivers:

    - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
    Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
    interfaces.

    - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
    BCM63xx switches

    - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches

    - ath11k: support for QCN9074, an 802.11ax device

    - Bluetooth: Broadcom BCM4330 and BCM4334

    - phy: Marvell 88X2222 transceiver support

    - mdio: add BCM6368 MDIO mux bus controller

    - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips

    - mana: driver for Microsoft Azure Network Adapter (MANA)

    - Actions Semi Owl Ethernet MAC

    - can: driver for ETAS ES58X CAN/USB interfaces

    Pure driver changes:

    - add XDP support to: enetc, igc, stmmac

    - add AF_XDP support to: stmmac

    - virtio:
    - page_to_skb() use build_skb when there's sufficient tailroom
    (21% improvement for 1000B UDP frames)
    - support XDP even without dedicated Tx queues - share the Tx
    queues with the stack when necessary

    - mlx5:
    - flow rules: add support for mirroring with conntrack, matching
    on ICMP, GTP, flex filters and more
    - support packet sampling with flow offloads
    - persist uplink representor netdev across eswitch mode changes
    - allow coexistence of CQE compression and HW time-stamping
    - add ethtool extended link error state reporting

    - ice, iavf: support flow filters, UDP Segmentation Offload

    - dpaa2-switch:
    - move the driver out of staging
    - add spanning tree (STP) support
    - add rx copybreak support
    - add tc flower hardware offload on ingress traffic

    - ionic:
    - implement Rx page reuse
    - support HW PTP time-stamping

    - octeon: support TC hardware offloads - flower matching on ingress
    and egress ratelimiting.

    - stmmac:
    - add RX frame steering based on VLAN priority in tc flower
    - support frame preemption (FPE)
    - intel: add cross time-stamping freq difference adjustment

    - ocelot:
    - support forwarding of MRP frames in HW
    - support multiple bridges
    - support PTP Sync one-step timestamping

    - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
    learning, flooding etc.

    - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
    SC7280 SoCs)

    - mt7601u: enable TDLS support

    - mt76:
    - add support for 802.3 rx frames (mt7915/mt7615)
    - mt7915 flash pre-calibration support
    - mt7921/mt7663 runtime power management fixes"

    * tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
    net: selftest: fix build issue if INET is disabled
    net: netrom: nr_in: Remove redundant assignment to ns
    net: tun: Remove redundant assignment to ret
    net: phy: marvell: add downshift support for M88E1240
    net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
    net/sched: act_ct: Remove redundant ct get and check
    icmp: standardize naming of RFC 8335 PROBE constants
    bpf, selftests: Update array map tests for per-cpu batched ops
    bpf: Add batched ops support for percpu array
    bpf: Implement formatted output helpers with bstr_printf
    seq_file: Add a seq_bprintf function
    sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
    net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
    net: fix a concurrency bug in l2tp_tunnel_register()
    net/smc: Remove redundant assignment to rc
    mpls: Remove redundant assignment to err
    llc2: Remove redundant assignment to rc
    net/tls: Remove redundant initialization of record
    rds: Remove redundant assignment to nr_sig
    dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
    ...

    Linus Torvalds
     

28 Apr, 2021

4 commits

  • Pull cgroup changes from Tejun Heo:
    "The only notable change is Vipin's new misc cgroup controller.

    This implements generic support for resources which can be controlled
    by simply counting and limiting the number of resource instances - i.e.
    there's X number of these on the system and this cgroup subtree can
    have up to Y of those.

    The first user is the address space IDs used for virtual machine
    memory encryption and expected future usages are similar - niche
    hardware features with concrete resource limits and simple usage
    models"
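
    The counting-and-limiting model described above can be sketched as a
    hierarchical charge: a request succeeds only if every cgroup from the
    requester up to the root stays within its limit. The struct, the
    function names, and the -EBUSY failure value are assumptions made for
    this illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-cgroup counter for a single resource type. */
struct sketch_misc_cg {
    struct sketch_misc_cg *parent;  /* NULL for the root */
    unsigned long usage;            /* instances currently charged */
    unsigned long max;              /* limit for this subtree */
};

/* Release `amount` instances from `cg` and all of its ancestors. */
static void sketch_misc_uncharge(struct sketch_misc_cg *cg,
                                 unsigned long amount)
{
    for (; cg; cg = cg->parent)
        cg->usage -= amount;
}

/* Charge `amount` instances to `cg` and every ancestor, or charge nothing. */
static int sketch_misc_try_charge(struct sketch_misc_cg *cg,
                                  unsigned long amount)
{
    struct sketch_misc_cg *i, *j;

    for (i = cg; i; i = i->parent) {
        if (i->usage + amount > i->max) {
            /* Roll back the levels already charged below `i`. */
            for (j = cg; j != i; j = j->parent)
                j->usage -= amount;
            return -EBUSY;
        }
        i->usage += amount;
    }
    return 0;
}
```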

    * 'for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: use tsk->in_iowait instead of delayacct_is_task_waiting_on_io()
    cgroup/cpuset: fix typos in comments
    cgroup: misc: mark dummy misc_cg_res_total_usage() static inline
    svm/sev: Register SEV and SEV-ES ASIDs to the misc controller
    cgroup: Miscellaneous cgroup documentation.
    cgroup: Add misc cgroup controller

    Linus Torvalds
     
  • BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
    and bpf_snprintf. Their signatures specify that all arguments are
    provided from the BPF world as u64s (in an array or as registers). All
    of these helpers are currently implemented by calling functions such as
    snprintf() whose signatures take a variable number of arguments, which
    the compiler then places in a va_list to call vsnprintf().

    "d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
    a bpf_printf_prepare function that fills an array of u64 sanitized
    arguments with an array of "modifiers" which indicate what the "real"
    size of each argument should be (given by the format specifier). The
    BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
    its real size. However, the C promotion rules implicitly cast them all
    back to u64s. Therefore, the arguments given to snprintf are u64s and
    the va_list constructed by the compiler will use 64 bits for each
    argument. On 64 bit machines, this happens to work well because 32 bit
    arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
    architectures this breaks the layout of the va_list expected by the
    called function and mangles values.

    In "88a5c690b6 bpf: fix bpf_trace_printk on 32 bit archs", this problem
    had been solved for bpf_trace_printk only with a "horrid workaround"
    that emitted multiple calls to trace_printk where each call had
    different argument types and generated different va_list layouts. One of
    the calls would be dynamically chosen at runtime. This was ok with the 3
    arguments that bpf_trace_printk takes but bpf_seq_printf and
    bpf_snprintf accept up to 12 arguments. Because this approach scales
    code exponentially, it is not a viable option anymore.

    Because the promotion rules are part of the language and because the
    construction of a va_list is an arch-specific ABI, it's best to just
    avoid variadic arguments and va_lists altogether. Thankfully the
    kernel's snprintf() has an alternative in the form of bstr_printf() that
    accepts arguments in a "binary buffer representation". These binary
    buffers are currently created by vbin_printf and used in the tracing
    subsystem to split the cost of printing into two parts: a fast one that
    only dereferences and remembers values, and a slower one, called later,
    that does the pretty-printing.

    This patch refactors bpf_printf_prepare to construct binary buffers of
    arguments consumable by bstr_printf() instead of arrays of arguments and
    modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
    bpf_printf_prepare usage but there are a few gotchas that change how
    bpf_printf_prepare needs to do things.

    Currently, bpf_printf_prepare uses a per cpu temporary buffer as
    generic storage for strings and IP addresses. With this refactoring, the
    temporary buffers now hold all the arguments in a structured binary
    format.

    To comply with the format expected by bstr_printf, certain format
    specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
    Because vsnprintf subroutines for these specifiers are hard to expose,
    we pre-format these arguments with calls to snprintf().
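
    The idea of replacing a va_list with a binary buffer can be sketched
    in user space as below. This is only an illustration under stated
    assumptions: pack_args() is a hypothetical helper, it understands
    only %d, %u and %x with optional l or ll modifiers, it assumes a
    4-byte int, and its unpadded layout is not claimed to match what
    vbin_printf() produces or what bstr_printf() expects.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Pack `args` (each handed over as a u64, the way BPF helpers receive
 * them) into `buf`, narrowing each value to the size that its
 * conversion specifier asks for. No variadic call is involved, so C
 * promotion rules and the va_list ABI never come into play.
 * Returns the number of bytes written, or -1 on error.
 */
static int pack_args(const char *fmt, const uint64_t *args, size_t nargs,
                     unsigned char *buf, size_t buflen)
{
    size_t used = 0, n = 0;

    assert(fmt && buf);
    for (const char *p = fmt; *p; p++) {
        int longs = 0;
        size_t size;

        if (*p != '%')
            continue;
        p++;
        if (*p == '%')          /* literal percent sign */
            continue;
        while (*p == 'l') {     /* count the l / ll modifier */
            longs++;
            p++;
        }
        if (longs > 2 || (*p != 'd' && *p != 'u' && *p != 'x'))
            return -1;          /* unsupported specifier */
        if (n >= nargs)
            return -1;          /* more specifiers than arguments */

        /* int is 4 bytes; long depends on the target; long long is 8. */
        size = (longs == 0 || (longs == 1 && sizeof(long) == 4)) ? 4 : 8;
        if (buflen - used < size)
            return -1;          /* buffer too small */

        if (size == 4) {
            uint32_t v = (uint32_t)args[n];
            memcpy(buf + used, &v, sizeof(v));
        } else {
            uint64_t v = args[n];
            memcpy(buf + used, &v, sizeof(v));
        }
        used += size;
        n++;
    }
    return (int)used;
}
```

    Because the size of every argument is decided from the format string
    while packing, a consumer that walks the same format string can read
    the values back with the same sizes on 32 bit and 64 bit targets
    alike.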

    Reported-by: Rasmus Villemoes
    Signed-off-by: Florent Revest
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org

    Florent Revest
     
  • Pull CFI on arm64 support from Kees Cook:
    "This builds on last cycle's LTO work, and allows the arm64 kernels to
    be built with Clang's Control Flow Integrity feature. This feature has
    happily lived in Android kernels for almost 3 years[1], so I'm excited
    to have it ready for upstream.

    The wide diffstat is mainly due to the treewide fixing of mismatched
    list_sort prototypes. Other things in core kernel are to address
    various CFI corner cases. The largest code portion is the CFI runtime
    implementation itself (which will be shared by all architectures
    implementing support for CFI). The arm64 pieces are Acked by arm64
    maintainers rather than coming through the arm64 tree since carrying
    this tree over there was going to be awkward.

    CFI support for x86 is still under development, but is pretty close.
    There are a handful of corner cases on x86 that need some improvements
    to Clang and objtool, but otherwise it works well.

    Summary:

    - Clean up list_sort prototypes (Sami Tolvanen)

    - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)"
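
    Why comparator prototypes matter under CFI can be sketched as below:
    an indirect call is checked against the function type the call site
    expects, so a comparator has to be declared with exactly that type
    rather than cast into it. This is a user-space illustration;
    struct list_node, struct item, cmp_func and sort_list() are
    hypothetical stand-ins, and the insertion sort is not the kernel's
    list_sort() algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel list link. */
struct list_node {
    struct list_node *next;
};

struct item {
    int key;
    struct list_node node;
};

/* The comparator type: both elements are passed as const pointers. */
typedef int (*cmp_func)(void *priv, const struct list_node *a,
                        const struct list_node *b);

/* Recover the enclosing item from a pointer to its embedded node. */
#define item_of(ptr) \
    ((const struct item *)((const char *)(ptr) - offsetof(struct item, node)))

/* Declared with the exact cmp_func prototype, so no cast is needed. */
static int item_cmp(void *priv, const struct list_node *a,
                    const struct list_node *b)
{
    (void)priv;
    return (item_of(a)->key > item_of(b)->key) -
           (item_of(a)->key < item_of(b)->key);
}

/* Stable insertion sort over a singly linked list; returns the new head. */
static struct list_node *sort_list(void *priv, struct list_node *head,
                                   cmp_func cmp)
{
    struct list_node *sorted = NULL;

    assert(cmp);
    while (head) {
        struct list_node *n = head, **pos = &sorted;

        head = head->next;
        /* Walk past every element that sorts before or equal to n. */
        while (*pos && cmp(priv, *pos, n) <= 0)
            pos = &(*pos)->next;
        n->next = *pos;
        *pos = n;
    }
    return sorted;
}
```

    Passing item_cmp directly, with no function pointer cast at the call
    site, is the pattern the const-pointer cleanup aims for.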

    * tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    arm64: allow CONFIG_CFI_CLANG to be selected
    KVM: arm64: Disable CFI for nVHE
    arm64: ftrace: use function_nocfi for ftrace_call
    arm64: add __nocfi to __apply_alternatives
    arm64: add __nocfi to functions that jump to a physical address
    arm64: use function_nocfi with __pa_symbol
    arm64: implement function_nocfi
    psci: use function_nocfi for cpu_resume
    lkdtm: use function_nocfi
    treewide: Change list_sort to use const pointers
    bpf: disable CFI in dispatcher functions
    kallsyms: strip ThinLTO hashes from static functions
    kthread: use WARN_ON_FUNCTION_MISMATCH
    workqueue: use WARN_ON_FUNCTION_MISMATCH
    module: ensure __cfi_check alignment
    mm: add generic function_nocfi macro
    cfi: add __cficanonical
    add support for Clang CFI

    Linus Torvalds
     
  • Pull seccomp updates from Kees Cook:

    - Fix "cacheable" typo in comments (Cui GaoSheng)

    - Fix CONFIG for /proc/$pid/status Seccomp_filters (Kenta.Tada@sony.com)

    * tag 'seccomp-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    seccomp: Fix "cacheable" typo in comments
    seccomp: Fix CONFIG tests for Seccomp_filters

    Linus Torvalds
     

25 Apr, 2021

2 commits

  • Make include/config/foo/bar.h fake deps files generation simpler.

    * delete .h suffix
    those aren't header files, shorten filenames,

    * delete tolower()
    Linux filesystems can deal with both upper and lowercase
    filenames very well,

    * put everything in 1 directory
    Presumably 'mkdir -p' split is from dark times when filesystems
    handled huge directories badly, disks were round adding to
    seek times.

    x86_64 allmodconfig lists 12364 files in include/config.

    ../obj/include/config/
    ├── 104_QUAD_8
    ├── 60XX_WDT
    ├── 64BIT
    ...
    ├── ZSWAP_DEFAULT_ON
    ├── ZSWAP_ZPOOL_DEFAULT
    └── ZSWAP_ZPOOL_DEFAULT_ZBUD

    0 directories, 12364 files
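
    The naming change can be sketched as two small mapping functions.
    This is an illustration only: old_dep_path() and new_dep_path() are
    hypothetical names, not the functions in the kernel tree, and the
    old mapping is inferred from the include/config/foo/bar.h pattern
    and the list of removed steps above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Old scheme (as described above): FOO_BAR -> include/config/foo/bar.h */
static int old_dep_path(const char *sym, char *out, size_t outlen)
{
    int n;
    size_t pos;

    assert(sym && out);
    n = snprintf(out, outlen, "include/config/");
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    pos = (size_t)n;

    for (; *sym; sym++) {
        if (pos + 1 >= outlen)
            return -1;
        /* '_' starts a new directory level, letters are lowercased. */
        out[pos++] = (*sym == '_') ? '/'
                                   : (char)tolower((unsigned char)*sym);
    }
    if (pos + 3 > outlen)           /* room for ".h" and the terminator */
        return -1;
    memcpy(out + pos, ".h", 3);
    return 0;
}

/* New scheme: FOO_BAR -> include/config/FOO_BAR */
static int new_dep_path(const char *sym, char *out, size_t outlen)
{
    int n;

    assert(sym && out);
    n = snprintf(out, outlen, "include/config/%s", sym);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

    With the new scheme the symbol name is the file name, so no
    subdirectories have to be created before touching the file.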

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Masahiro Yamada

    Alexey Dobriyan
     
  • Currently, a clang LTO-built vmlinux won't work with pahole.
    LTO introduced cross-cu dwarf tag references and broke the
    current pahole model, which handles one cu at a time.
    The solution is to merge all cu's into one pahole cu as in [1].
    We would like to do this merging only if cross-cu dwarf
    references happen. The LTO build mode is a pretty good
    indication for that.

    In an earlier version of this patch ([2]), the clang flag
    -grecord-gcc-switches was proposed as an addition to the compilation
    flags so that pahole could detect "-flto" and then merge cu's.
    This would increase the binary size by 1% without LTO, though.

    Arnaldo suggested using a note to indicate that the vmlinux
    is built with LTO. Such a cheap way to tell whether the vmlinux
    is built with LTO or not helps pahole but is also useful
    for tracing, as LTO may inline/delete/demote global functions,
    promote static functions, etc.

    So this patch added an elfnote with a new type LINUX_ELFNOTE_LTO_INFO.
    The owner of the note is "Linux".

    With gcc 8.4.1 and clang trunk, without LTO, I got
    $ readelf -n vmlinux
    Displaying notes found in: .notes
    Owner Data size Description
    ...
    Linux 0x00000004 func
    description data: 00 00 00 00
    ...
    With "readelf -x .notes vmlinux", I can verify that the above "func"
    has type code 0x101.

    With clang thin-LTO, I got the same as above except the following:
    description data: 01 00 00 00
    which indicates the vmlinux is built with LTO.
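
    A consumer could look for this note roughly as sketched below. This
    is a minimal illustration under stated assumptions:
    lto_info_from_notes() and LTO_INFO_NOTE_TYPE are hypothetical names,
    the 0x101 type code and the "Linux" owner are taken from the text
    above, the buffer is assumed to hold the raw contents of a .notes
    section in the host's byte order, and the standard ELF note layout
    (three 32-bit words, then name and descriptor each padded to 4
    bytes) is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LTO_INFO_NOTE_TYPE 0x101u   /* type code quoted in the text above */

/* Round a size up to the next multiple of 4. */
static size_t align4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/*
 * Walk a buffer of ELF notes and look for the "Linux" LTO info note.
 * Returns 1 if the note says LTO, 0 if it says no LTO, and -1 if the
 * note is missing or the buffer is malformed.
 */
static int lto_info_from_notes(const unsigned char *notes, size_t len)
{
    size_t off = 0;

    assert(notes);
    while (len - off >= 12) {
        uint32_t namesz, descsz, type, val;
        size_t name_len, desc_len;
        const unsigned char *name, *desc;

        /* Note header: name size, descriptor size, type. */
        memcpy(&namesz, notes + off, 4);
        memcpy(&descsz, notes + off + 4, 4);
        memcpy(&type, notes + off + 8, 4);
        off += 12;

        name_len = align4(namesz);
        desc_len = align4(descsz);
        if (name_len > len - off || desc_len > len - off - name_len)
            return -1;      /* note runs past the end of the buffer */

        name = notes + off;
        desc = name + name_len;
        if (type == LTO_INFO_NOTE_TYPE && namesz == sizeof("Linux") &&
            memcmp(name, "Linux", namesz) == 0 && descsz == 4) {
            memcpy(&val, desc, 4);
            return val != 0;
        }
        off += name_len + desc_len;
    }
    return -1;
}
```

    Reading notes from a vmlinux built for a different endianness would
    need byte swapping of the header words and the value, which this
    sketch leaves out.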

    [1] https://lore.kernel.org/bpf/20210325065316.3121287-1-yhs@fb.com/
    [2] https://lore.kernel.org/bpf/20210331001623.2778934-1-yhs@fb.com/

    Suggested-by: Arnaldo Carvalho de Melo
    Signed-off-by: Yonghong Song
    Reviewed-by: Nick Desaulniers
    Tested-by: Sedat Dilek # LLVM/Clang v12.0.0-rc4 (x86-64)
    Tested-by: Arnaldo Carvalho de Melo
    Signed-off-by: Masahiro Yamada

    Yonghong Song