17 Aug, 2022

1 commit

  • [ Upstream commit 375561bd6195a31bf4c109732bd538cb97a941f4 ]

    Fix the following Sparse warnings that got noticed when the PPC-dev
    patchwork was checking another patch (see the link below):

    init/main.c:862:1: warning: symbol 'randomize_kstack_offset' was not declared. Should it be static?
    init/main.c:864:1: warning: symbol 'kstack_offset' was not declared. Should it be static?

    These warnings are in fact triggered on all architectures that have
    HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET support (for instance x86, arm64,
    etc.).

    Link: https://lore.kernel.org/lkml/e7b0d68b-914d-7283-827c-101988923929@huawei.com/T/#m49b2d4490121445ce4bf7653500aba59eefcb67f
    Cc: Christophe Leroy
    Cc: Xiu Jianfeng
    Signed-off-by: GONG, Ruiqi
    Reviewed-by: Christophe Leroy
    Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall")
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/20220629060423.2515693-1-gongruiqi1@huawei.com
    Signed-off-by: Sasha Levin

    GONG, Ruiqi
     

30 May, 2022

2 commits

  • commit 2f14062bb14b0fcfcc21e6dc7d5b5c0d25966164 upstream.

    Currently, start_kernel() adds latent entropy and the command line to
    the entropy pool *after* the RNG has been initialized, deferring its
    use by things like stack canaries until the next time the pool is
    seeded. This surely is not intended.

    Rather than splitting up which entropy gets added where and when between
    start_kernel() and random_init(), just do everything in random_init(),
    which should eliminate these kinds of bugs in the future.

    While we're at it, rename the awkwardly titled "rand_initialize()" to
    the more standard "random_init()" nomenclature.

    Reviewed-by: Dominik Brodowski
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: Greg Kroah-Hartman

    Jason A. Donenfeld
     
  • commit fe222a6ca2d53c38433cba5d3be62a39099e708e upstream.

    Currently time_init() is called after rand_initialize(), but
    rand_initialize() makes use of the timer on various platforms, and
    sometimes this timer needs to be initialized by time_init() first. In
    order for random_get_entropy() to not return zero during early boot when
    it's potentially used as an entropy source, reverse the order of these
    two calls. The block doing random initialization was right before
    time_init() before, so changing the order shouldn't have any complicated
    effects.

    Cc: Andrew Morton
    Reviewed-by: Stafford Horne
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: Greg Kroah-Hartman

    Jason A. Donenfeld
     

14 Apr, 2022

2 commits

  • [ Upstream commit f9a40b0890658330c83c95511f9d6b396610defc ]

    initcall_blacklist() should return 1 to indicate that it handled its
    cmdline arguments.

    set_debug_rodata() should return 1 to indicate that it handled its
    cmdline arguments. Print a warning if the option string is invalid.

    This prevents these strings from being added to the 'init' program's
    environment as they are not init arguments/parameters.

    Link: https://lkml.kernel.org/r/20220221050901.23985-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap
    Reported-by: Igor Zhbanov
    Cc: Ingo Molnar
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Randy Dunlap
     
  • [ Upstream commit 9c1be1935fb68b2413796cdc03d019b8cf35ab51 ]

    While testing a patch that will follow later
    ("net: add netns refcount tracker to struct nsproxy")
    I found that devtmpfs_init() was called before init_net
    was initialized.

    This is a bug, because devtmpfs_setup() calls
    ksys_unshare(CLONE_NEWNS);

    This has the effect of increasing init_net refcount,
    which will be later overwritten to 1, as part of setup_net(&init_net)

    We had too many prior patches [1] trying to work around the root cause.

    Really, make sure init_net is in BSS section, and that net_ns_init()
    is called earlier at boot time.

    Note that another patch ("vfs: add netns refcount tracker
    to struct fs_context") also will need net_ns_init() being called
    before vfs_caches_init()

    As a bonus, this patch saves around 4KB in .data section.

    [1]

    f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace")
    b5082df8019a ("net: Initialise init_net.count to 1")
    734b65417b24 ("net: Statically initialize init_net.dev_base_head")

    v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     

19 Nov, 2021

1 commit

  • [ Upstream commit 8bc2b3dca7292347d8e715fb723c587134abe013 ]

    The prior message is confusing users, which is the exact opposite of the
    goal. If the message is being seen, one of the following situations is
    happening:

    1. the param is misspelled
    2. the param is not valid due to the kernel configuration
    3. the param is intended for init but isn't after the '--'
    delineator on the command line

    To make that more clear to the user, explicitly mention "kernel command
    line" and also note that the params are still passed to user space to
    avoid causing any alarm over params intended for init.

    Link: https://lkml.kernel.org/r/20211013223502.96756-1-ahalaney@redhat.com
    Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters")
    Signed-off-by: Andrew Halaney
    Suggested-by: Steven Rostedt (VMware)
    Acked-by: Randy Dunlap
    Cc: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Andrew Halaney
     

11 Oct, 2021

1 commit

    Free the unused memblock in an error case to fix a memblock leak
    in xbc_make_cmdline().

    Link: https://lkml.kernel.org/r/163177339181.682366.8713781325929549256.stgit@devnote2

    Fixes: 51887d03aca1 ("bootconfig: init: Allow admin to use bootconfig for kernel command line")
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     

23 Sep, 2021

1 commit


15 Sep, 2021

1 commit

  • The boot-time allocation interface for memblock is a mess, with
    'memblock_alloc()' returning a virtual pointer, but then you are
    supposed to free it with 'memblock_free()' that takes a _physical_
    address.

    Not only is that all kinds of strange and illogical, but it actually
    causes bugs, when people then use it like a normal allocation function,
    and it fails spectacularly on a NULL pointer:

    https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/

    or just random memory corruption if the debug checks don't catch it:

    https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/

    I really don't want to apply patches that treat the symptoms, when the
    fundamental cause is this horribly confusing interface.

    I started out looking at just automating a sane replacement sequence,
    but because of this mix of virtual and physical addresses, and because
    people have used the "__pa()" macro that can take either a regular
    kernel pointer, or just the raw "unsigned long" address, it's all quite
    messy.

    So this just introduces a new saner interface for freeing a virtual
    address that was allocated using 'memblock_alloc()', and that was kept
    as a regular kernel pointer. And then it converts a couple of users
    that are obvious and easy to test, including the 'xbc_nodes' case in
    lib/bootconfig.c that caused problems.

    Reported-by: kernel test robot
    Fixes: 40caa127f3c7 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
    Cc: Steven Rostedt
    Cc: Mike Rapoport
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Vlastimil Babka
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Sep, 2021

1 commit

  • Pull more tracing updates from Steven Rostedt:

    - Add migrate-disable counter to tracing header

    - Fix error handling in event probes

    - Fix missed unlock in osnoise in error path

    - Fix merge issue with tools/bootconfig

    - Clean up bootconfig data when init memory is removed

    - Fix bootconfig to loop only on subkeys

    - Have kernel command lines override bootconfig options

    - Increase field counts for synthetic events

    - Have histograms dynamic allocate event elements to save space

    - Fixes in testing and documentation

    * tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/boot: Fix to loop on only subkeys
    selftests/ftrace: Exclude "(fault)" in testing add/remove eprobe events
    tracing: Dynamically allocate the per-elt hist_elt_data array
    tracing: synth events: increase max fields count
    tools/bootconfig: Show whole test command for each test case
    bootconfig: Fix missing return check of xbc_node_compose_key function
    tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh
    docs: bootconfig: Add how to use bootconfig for kernel parameters
    init/bootconfig: Reorder init parameter from bootconfig and cmdline
    init: bootconfig: Remove all bootconfig data when the init memory is removed
    tracing/osnoise: Fix missed cpus_read_unlock() in start_per_cpu_kthreads()
    tracing: Fix some alloc_event_probe() error handling bugs
    tracing: Add migrate-disabled counter to tracing output.

    Linus Torvalds
     

09 Sep, 2021

5 commits

  • Merge more updates from Andrew Morton:
    "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

    Subsystems affected by this patch series: mm (memory-hotplug, rmap,
    ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
    alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
    checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
    selftests, ipc, and scripts"

    * emailed patches from Andrew Morton : (94 commits)
    scripts: check_extable: fix typo in user error message
    mm/workingset: correct kernel-doc notations
    ipc: replace costly bailout check in sysvipc_find_ipc()
    selftests/memfd: remove unused variable
    Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
    configs: remove the obsolete CONFIG_INPUT_POLLDEV
    prctl: allow to setup brk for et_dyn executables
    pid: cleanup the stale comment mentioning pidmap_init().
    kernel/fork.c: unexport get_{mm,task}_exe_file
    coredump: fix memleak in dump_vma_snapshot()
    fs/coredump.c: log if a core dump is aborted due to changed file permissions
    nilfs2: use refcount_dec_and_lock() to fix potential UAF
    nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
    nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
    nilfs2: fix NULL pointer in nilfs_##name##_attr_release
    nilfs2: fix memory leak in nilfs_sysfs_create_device_group
    trap: cleanup trap_init()
    init: move usermodehelper_enable() to populate_rootfs()
    ...

    Linus Torvalds
     
    Reorder the init parameters from bootconfig and the kernel cmdline
    so that the kernel cmdline is always the last part of the
    parameters, as below:

    " -- "[bootconfig init params][cmdline init params]

    This prevents bootconfig init params from overwriting the init
    params that the user gives on the command line.

    Link: https://lkml.kernel.org/r/163077085675.222577.5665176468023636160.stgit@devnote2

    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
    Since the bootconfig is used only in the init functions, there is
    no need to keep its data after boot. Free it when the init memory
    is removed.

    Link: https://lkml.kernel.org/r/163077084958.222577.5924961258513004428.stgit@devnote2

    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
    There are some empty trap_init() definitions in different architectures.
    Introduce a new weak trap_init() function to clean them up.

    Link: https://lkml.kernel.org/r/20210812123602.76356-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Russell King (Oracle) [arm32]
    Acked-by: Vineet Gupta [arc]
    Acked-by: Michael Ellerman [powerpc]
    Cc: Yoshinori Sato
    Cc: Ley Foon Tan
    Cc: Jonas Bonn
    Cc: Stefan Kristiansson
    Cc: Stafford Horne
    Cc: James E.J. Bottomley
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Anton Ivanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     
  • Currently, usermodehelper is enabled right before PID1 starts going
    through the initcalls. However, any call of a usermodehelper from a
    pure_, core_, postcore_, arch_, subsys_ or fs_ initcall is futile, as
    there is no filesystem contents yet.

    Up until commit e7cb072eb988 ("init/initramfs.c: do unpacking
    asynchronously"), such calls, whether via some request_module(), a
    legacy uevent "/sbin/hotplug" notification or something else, would
    just fail silently with (presumably) -ENOENT from
    kernel_execve(). However, that commit introduced the
    wait_for_initramfs() synchronization hook which must be called from
    the usermodehelper exec path right before the kernel_execve, in order
    that request_module() et al done from *after* rootfs_initcall()
    time (i.e. device_ and late_ initcalls) would continue to find a
    populated initramfs as they used to.

    Any call of wait_for_initramfs() done before the unpacking has been
    scheduled (i.e. before rootfs_initcall time) must just return
    immediately [and let the caller find an empty file system] in order
    not to deadlock the machine. I mistakenly thought, and my limited
    testing confirmed, that there were no such calls, so I added a
    pr_warn_once() in wait_for_initramfs(). It turns out that one can
    indeed hit request_module() as well as kobject_uevent_env() during
    those early init calls, leading to a user-visible warning in the
    kernel log emitted consistently for certain configurations.

    We could just remove the pr_warn_once(), but I think it's better to
    postpone enabling the usermodehelper framework until there is at least
    some chance of finding the executable. That is also a little more
    efficient in that a lot of work done in umh.c will be elided. However,
    it does change the error seen by those early callers from -ENOENT to
    -EBUSY, so there is a risk of a regression if any caller cares about
    the exact error value.

    Link: https://lkml.kernel.org/r/20210728134638.329060-1-linux@rasmusvillemoes.dk
    Fixes: e7cb072eb988 ("init/initramfs.c: do unpacking asynchronously")
    Signed-off-by: Rasmus Villemoes
    Reported-by: Alexander Egorenkov
    Reported-by: Bruno Goncalves
    Reported-by: Heiner Kallweit
    Cc: Luis Chamberlain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

20 Aug, 2021

1 commit


13 Aug, 2021

1 commit

    Since the 'bootconfig' command line parameter is handled before
    parsing the command line, it doesn't use early_param(). As a result,
    the kernel shows a spurious warning message about it:

    [ 0.013714] Kernel command line: ro console=ttyS0 bootconfig console=tty0
    [ 0.013741] Unknown command line parameters: bootconfig

    To suppress this message, add a dummy handler for 'bootconfig'.

    Link: https://lkml.kernel.org/r/162812945097.77369.1849780946468010448.stgit@devnote2

    Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters")
    Reviewed-by: Andrew Halaney
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     

03 Aug, 2021

1 commit

  • Now that ax88796.c exports the ax_NS8390_reinit() symbol, we can
    include 8390.h instead of lib8390.c, avoiding duplication of that
    function and killing a few compile warnings in the bargain.

    Fixes: 861928f4e60e826c ("net-next: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)")

    Signed-off-by: Michael Schmitz
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Michael Schmitz
     

09 Jul, 2021

1 commit

  • Parse the kernel's build ID at initialization so that other code can print
    a hex format string representation of the running kernel's build ID. This
    will be used in the kdump and dump_stack code so that developers can
    easily locate the vmlinux debug symbols for a crash/stacktrace.

    [swboyd@chromium.org: fix implicit declaration of init_vmlinux_build_id()]
    Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com

    Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org
    Signed-off-by: Stephen Boyd
    Acked-by: Baoquan He
    Cc: Jiri Olsa
    Cc: Alexei Starovoitov
    Cc: Jessica Yu
    Cc: Evan Green
    Cc: Hsin-Yi Wang
    Cc: Dave Young
    Cc: Vivek Goyal
    Cc: Andy Shevchenko
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Ingo Molnar
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: Petr Mladek
    Cc: Rasmus Villemoes
    Cc: Sasha Levin
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Boyd
     

05 Jul, 2021

1 commit

  • …git/paulmck/linux-rcu

    Pull RCU updates from Paul McKenney:

    - Bitmap parsing support for "all" as an alias for all bits

    - Documentation updates

    - Miscellaneous fixes, including some that overlap into mm and lockdep

    - kvfree_rcu() updates

    - mem_dump_obj() updates, with acks from one of the slab-allocator
    maintainers

    - RCU NOCB CPU updates, including limited deoffloading

    - SRCU updates

    - Tasks-RCU updates

    - Torture-test updates

    * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
    tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
    rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
    rcu: Add missing __releases() annotation
    rcu: Remove obsolete rcu_read_unlock() deadlock commentary
    rcu: Improve comments describing RCU read-side critical sections
    rcu: Create an unrcu_pointer() to remove __rcu from a pointer
    srcu: Early test SRCU polling start
    rcu: Fix various typos in comments
    rcu/nocb: Unify timers
    rcu/nocb: Prepare for fine-grained deferred wakeup
    rcu/nocb: Only cancel nocb timer if not polling
    rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
    rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
    rcu/nocb: Allow de-offloading rdp leader
    rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
    rcu: Don't penalize priority boosting when there is nothing to boost
    rcu: Point to documentation of ordering guarantees
    rcu: Make rcu_gp_cleanup() be noinline for tracing
    rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
    rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
    ...

    Linus Torvalds
     

04 Jul, 2021

1 commit

  • Pull tracing updates from Steven Rostedt:

    - Added option for per CPU threads to the hwlat tracer

    - Have hwlat tracer handle hotplug CPUs

    - New tracer: osnoise, that detects latency caused by interrupts,
    softirqs and scheduling of other tasks.

    - Added timerlat tracer that creates a thread and measures in detail
    what sources of latency it has for wake ups.

    - Removed the "success" field of the sched_wakeup trace event. This has
    been hardcoded as "1" since 2015, no tooling should be looking at it
    now. If one exists, we can revert this commit, fix that tool and try
    to remove it again in the future.

    - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.

    - New boot command line option "tp_printk_stop", as tp_printk causes
    trace events to write to console. When user space starts, this can
    easily live lock the system. Having a boot option to stop just after
    boot up is useful to prevent that from happening.

    - Have ftrace_dump_on_oops boot command line option take numbers that
    match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.

    - Bootconfig clean ups, fixes and enhancements.

    - New ktest script that tests bootconfig options.

    - Add tracepoint_probe_register_may_exist() to register a tracepoint
    without triggering a WARN*() if it already exists. BPF has a path
    from user space that can do this. All other paths are considered a
    bug.

    - Small clean ups and fixes

    * tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
    tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
    tracing: Simplify & fix saved_tgids logic
    treewide: Add missing semicolons to __assign_str uses
    tracing: Change variable type as bool for clean-up
    trace/timerlat: Fix indentation on timerlat_main()
    trace/osnoise: Make 'noise' variable s64 in run_osnoise()
    tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
    tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
    Documentation: Fix a typo on trace/osnoise-tracer
    trace/osnoise: Fix return value on osnoise_init_hotplug_support
    trace/osnoise: Make interval u64 on osnoise_main
    trace/osnoise: Fix 'no previous prototype' warnings
    tracing: Have osnoise_main() add a quiescent state for task rcu
    seq_buf: Make trace_seq_putmem_hex() support data longer than 8
    seq_buf: Fix overflow in seq_buf_putmem_hex()
    trace/osnoise: Support hotplug operations
    trace/hwlat: Support hotplug operations
    trace/hwlat: Protect kdata->kthread with get/put_online_cpus
    trace: Add timerlat tracer
    trace: Add osnoise tracer
    ...

    Linus Torvalds
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton : (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

02 Jul, 2021

1 commit

    It is easy to foobar setting a kernel parameter on the command line
    without realizing it; by default there's not much output you can use
    to assess what the kernel did with that parameter.

    Make it a little more explicit which parameters on the command line
    _looked_ like a valid parameter for the kernel, but did not match anything
    and ultimately got tossed to init. This is very similar to the unknown
    parameter message received when loading a module.

    This assumes the parameters are processed in a normal fashion, some
    parameters (dyndbg= for example) don't register their parameter with the
    rest of the kernel's parameters, and therefore always show up in this list
    (and are also given to init - like the rest of this list).

    Another example: BOOT_IMAGE= is highlighted as an offender, which it
    technically is, but it is passed by LILO and GRUB, so most systems
    will see that complaint.

    An example output where "foobared" and "unrecognized" are intentionally
    invalid parameters:

    Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo
    Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo

    Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com
    Signed-off-by: Andrew Halaney
    Suggested-by: Steven Rostedt
    Suggested-by: Borislav Petkov
    Acked-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Halaney
     

18 Jun, 2021

1 commit

  • This commit in sched/urgent moved the cfs_rq_is_decayed() function:

    a7b359fc6a37: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")

    and this fresh commit in sched/core modified it in the old location:

    9e077b52d86a: ("sched/pelt: Check that *_avg are null when *_sum are")

    Merge the two variants.

    Conflicts:
    kernel/sched/fair.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

11 Jun, 2021

1 commit


05 Jun, 2021

1 commit

  • During boot, kernel_init_freeable() initializes `cad_pid` to the init
    task's struct pid. Later on, we may change `cad_pid` via a sysctl, and
    when this happens proc_do_cad_pid() will increment the refcount on the
    new pid via get_pid(), and will decrement the refcount on the old pid
    via put_pid(). As we never called get_pid() when we initialized
    `cad_pid`, we decrement a reference we never incremented, and can
    therefore free the init task's struct pid early. As there can be dangling
    references to the struct pid, we can later encounter a use-after-free
    (e.g. when delivering signals).

    This was spotted when fuzzing v5.13-rc3 with Syzkaller, but seems to
    have been around since the conversion of `cad_pid` to struct pid in
    commit 9ec52099e4b8 ("[PATCH] replace cad_pid by a struct pid") from the
    pre-KASAN stone age of v2.6.19.

    Fix this by getting a reference to the init task's struct pid when we
    assign it to `cad_pid`.

    Full KASAN splat below.

    ==================================================================
    BUG: KASAN: use-after-free in ns_of_pid include/linux/pid.h:153 [inline]
    BUG: KASAN: use-after-free in task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
    Read of size 4 at addr ffff23794dda0004 by task syz-executor.0/273

    CPU: 1 PID: 273 Comm: syz-executor.0 Not tainted 5.12.0-00001-g9aef892b2d15 #1
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    ns_of_pid include/linux/pid.h:153 [inline]
    task_active_pid_ns+0xc0/0xc8 kernel/pid.c:509
    do_notify_parent+0x308/0xe60 kernel/signal.c:1950
    exit_notify kernel/exit.c:682 [inline]
    do_exit+0x2334/0x2bd0 kernel/exit.c:845
    do_group_exit+0x108/0x2c8 kernel/exit.c:922
    get_signal+0x4e4/0x2a88 kernel/signal.c:2781
    do_signal arch/arm64/kernel/signal.c:882 [inline]
    do_notify_resume+0x300/0x970 arch/arm64/kernel/signal.c:936
    work_pending+0xc/0x2dc

    Allocated by task 0:
    slab_post_alloc_hook+0x50/0x5c0 mm/slab.h:516
    slab_alloc_node mm/slub.c:2907 [inline]
    slab_alloc mm/slub.c:2915 [inline]
    kmem_cache_alloc+0x1f4/0x4c0 mm/slub.c:2920
    alloc_pid+0xdc/0xc00 kernel/pid.c:180
    copy_process+0x2794/0x5e18 kernel/fork.c:2129
    kernel_clone+0x194/0x13c8 kernel/fork.c:2500
    kernel_thread+0xd4/0x110 kernel/fork.c:2552
    rest_init+0x44/0x4a0 init/main.c:687
    arch_call_rest_init+0x1c/0x28
    start_kernel+0x520/0x554 init/main.c:1064
    0x0

    Freed by task 270:
    slab_free_hook mm/slub.c:1562 [inline]
    slab_free_freelist_hook+0x98/0x260 mm/slub.c:1600
    slab_free mm/slub.c:3161 [inline]
    kmem_cache_free+0x224/0x8e0 mm/slub.c:3177
    put_pid.part.4+0xe0/0x1a8 kernel/pid.c:114
    put_pid+0x30/0x48 kernel/pid.c:109
    proc_do_cad_pid+0x190/0x1b0 kernel/sysctl.c:1401
    proc_sys_call_handler+0x338/0x4b0 fs/proc/proc_sysctl.c:591
    proc_sys_write+0x34/0x48 fs/proc/proc_sysctl.c:617
    call_write_iter include/linux/fs.h:1977 [inline]
    new_sync_write+0x3ac/0x510 fs/read_write.c:518
    vfs_write fs/read_write.c:605 [inline]
    vfs_write+0x9c4/0x1018 fs/read_write.c:585
    ksys_write+0x124/0x240 fs/read_write.c:658
    __do_sys_write fs/read_write.c:670 [inline]
    __se_sys_write fs/read_write.c:667 [inline]
    __arm64_sys_write+0x78/0xb0 fs/read_write.c:667
    __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:49 [inline]
    el0_svc_common.constprop.1+0x16c/0x388 arch/arm64/kernel/syscall.c:129
    do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:168
    el0_svc+0x28/0x38 arch/arm64/kernel/entry-common.c:416
    el0_sync_handler+0x134/0x180 arch/arm64/kernel/entry-common.c:432
    el0_sync+0x154/0x180 arch/arm64/kernel/entry.S:701

    The buggy address belongs to the object at ffff23794dda0000
    which belongs to the cache pid of size 224
    The buggy address is located 4 bytes inside of
    224-byte region [ffff23794dda0000, ffff23794dda00e0)
    The buggy address belongs to the page:
    page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dda0
    head:(____ptrval____) order:1 compound_mapcount:0
    flags: 0x3fffc0000010200(slab|head)
    raw: 03fffc0000010200 dead000000000100 dead000000000122 ffff23794d40d080
    raw: 0000000000000000 0000000000190019 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff23794dd9ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff23794dd9ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff23794dda0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff23794dda0080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
    ffff23794dda0100: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
    ==================================================================

    Link: https://lkml.kernel.org/r/20210524172230.38715-1-mark.rutland@arm.com
    Fixes: 9ec52099e4b8678a ("[PATCH] replace cad_pid by a struct pid")
    Signed-off-by: Mark Rutland
    Acked-by: Christian Brauner
    Cc: Cedric Le Goater
    Cc: Christian Brauner
    Cc: Eric W. Biederman
    Cc: Kees Cook
    Cc: Paul Mackerras
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     

01 Jun, 2021

1 commit

  • Extend 8fb12156b8db ("init: Pin init task to the boot CPU, initially")
    to cover the new PF_NO_SETAFFINITY requirement.

    While there, move wait_for_completion(&kthreadd_done) into kernel_init()
    to make it absolutely clear it is the very first thing done by the init
    thread.

    Fixes: 570a752b7a9b ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed")
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Tested-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/YLS4mbKUrA3Gnb4t@hirez.programming.kicks-ass.net

    Peter Zijlstra
     

12 May, 2021

1 commit

  • As pointed out by commit

    de9b8f5dcbd9 ("sched: Fix crash trying to dequeue/enqueue the idle thread")

    init_idle() can and will be invoked more than once on the same idle
    task. At boot time, it is invoked for the boot CPU thread by
    sched_init(). Then smp_init() creates the threads for all the secondary
    CPUs and invokes init_idle() on them.

    As the hotplug machinery brings the secondaries to life, it will issue
    calls to idle_thread_get(), which itself invokes init_idle() yet again.
    In this case it's invoked twice more per secondary: at _cpu_up(), and at
    bringup_cpu().

    Given smp_init() already initializes the idle tasks for all *possible*
    CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
    with regard to the idle task's preempt_count: the secondary startup always
    issues a preempt_disable(), requiring some reset of the preempt count to 0
    between hot-unplug and hotplug, which is currently served by
    idle_thread_get() -> init_idle().

    Given the idle task is supposed to have preemption disabled once and never
    see it re-enabled, it seems that what we actually want is to initialize its
    preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
    init_idle() from idle_thread_get().

    Secondary startups were patched via coccinelle:

    @begone@
    @@

    -preempt_disable();
    ...
    cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

    Signed-off-by: Valentin Schneider
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com

    Valentin Schneider
     

11 May, 2021

1 commit

  • Once srcu_init() is called, the SRCU core will make use of delayed
    workqueues, which rely on timers. However init_timers() is called
    several steps after rcu_init(). This means that a call_srcu() after
    rcu_init() but before init_timers() would find itself within a dangerously
    uninitialized timer core.

    This commit therefore creates a separate call to srcu_init() after
    init_timers() completes, which ensures that we stay in early SRCU mode
    until timers are safe(r).
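
    The "early SRCU mode" idea above can be pictured with a small userspace
    sketch (all names here are illustrative stand-ins, not the kernel's SRCU
    code): callbacks submitted before the timer core is ready are parked on a
    plain list, and the later init call flushes them once timers exist.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-ins for the kernel's early-SRCU state (illustrative only). */
    #define MAX_EARLY_CB 16

    typedef void (*callback_fn)(void);

    static bool timers_ready;                 /* set once init_timers() has run */
    static callback_fn early_queue[MAX_EARLY_CB];
    static size_t early_count;

    /* Analogue of call_srcu() before vs. after srcu_init(). */
    static void call_srcu_demo(callback_fn cb)
    {
        if (!timers_ready) {
            early_queue[early_count++] = cb;  /* stay in early mode: park it */
            return;
        }
        cb();                                 /* timers exist: run via "workqueue" */
    }

    /* Analogue of the srcu_init() call moved after init_timers(). */
    static void srcu_init_demo(void)
    {
        timers_ready = true;
        for (size_t i = 0; i < early_count; i++)
            early_queue[i]();                 /* flush everything parked so far */
        early_count = 0;
    }
    ```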

    Signed-off-by: Frederic Weisbecker
    Cc: Uladzislau Rezki
    Cc: Boqun Feng
    Cc: Lai Jiangshan
    Cc: Neeraj Upadhyay
    Cc: Josh Triplett
    Cc: Joel Fernandes
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
     

07 May, 2021

2 commits

  • Merge yet more updates from Andrew Morton:
    "This is everything else from -mm for this merge window.

    90 patches.

    Subsystems affected by this patch series: mm (cleanups and slub),
    alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat,
    checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov,
    panic, delayacct, gdb, resource, selftests, async, initramfs, ipc,
    drivers/char, and spelling"

    * emailed patches from Andrew Morton : (90 commits)
    mm: fix typos in comments
    mm: fix typos in comments
    treewide: remove editor modelines and cruft
    ipc/sem.c: spelling fix
    fs: fat: fix spelling typo of values
    kernel/sys.c: fix typo
    kernel/up.c: fix typo
    kernel/user_namespace.c: fix typos
    kernel/umh.c: fix some spelling mistakes
    include/linux/pgtable.h: few spelling fixes
    mm/slab.c: fix spelling mistake "disired" -> "desired"
    scripts/spelling.txt: add "overflw"
    scripts/spelling.txt: Add "diabled" typo
    scripts/spelling.txt: add "overlfow"
    arm: print alloc free paths for address in registers
    mm/vmalloc: remove vwrite()
    mm: remove xlate_dev_kmem_ptr()
    drivers/char: remove /dev/kmem for good
    mm: fix some typos and code style problems
    ipc/sem.c: mundane typo fixes
    ...

    Linus Torvalds
     
  • Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3.

    These two patches are independent, but better together.

    The second is a rather trivial patch that simply allows the developer to
    change "/sbin/modprobe" to something else - e.g. the empty string, so
    that all request_module() calls during early boot return -ENOENT early,
    without even spawning a usermode helper, which would needlessly
    synchronize with the initramfs unpacking.

    The first patch delegates decompressing the initramfs to a worker thread,
    allowing do_initcalls() in main.c to proceed to the device_ and late_
    initcalls without waiting for that decompression (and populating of
    rootfs) to finish. Obviously, some of those later calls may rely on the
    initramfs being available, so I've added synchronization points in the
    firmware loader and usermodehelper paths - there might be other places
    that would need this, but so far no one has been able to think of any
    places I have missed.

    There's not much to win if most of the functionality needed during boot is
    only available as modules. But systems with a custom-made .config and
    initramfs can boot faster, partly due to utilizing more than one cpu
    earlier, partly by avoiding known-futile modprobe calls (which would still
    trigger synchronization with the initramfs unpacking, thus eliminating
    most of the first benefit).

    This patch (of 2):

    Most of the boot process doesn't actually need anything from the
    initramfs, until of course PID1 is to be executed. So instead of doing
    the decompressing and populating of the initramfs synchronously in
    populate_rootfs() itself, push that off to a worker thread.

    This is primarily motivated by an embedded ppc target, where unpacking
    even the rather modest sized initramfs takes 0.6 seconds, which is long
    enough that the external watchdog becomes unhappy that it doesn't get
    attention soon enough. By doing the initramfs decompression in a worker
    thread, we get to do the device_initcalls and hence start petting the
    watchdog much sooner.

    Normal desktops might benefit as well. On my mostly stock Ubuntu kernel,
    my initramfs is a 26M xz-compressed blob, decompressing to around 126M.
    That takes almost two seconds:

    [ 0.201454] Trying to unpack rootfs image as initramfs...
    [ 1.976633] Freeing initrd memory: 29416K

    Before this patch, these lines occur consecutively in dmesg. With this
    patch, the timestamps on these two lines are roughly the same as above, but
    with 172 lines in between - so more than one cpu has been kept busy doing
    work that would otherwise only happen after populate_rootfs()
    finished.

    Should one of the initcalls done after rootfs_initcall time (i.e., device_
    and late_ initcalls) need something from the initramfs (say, a kernel
    module or a firmware blob), it will simply wait for the initramfs
    unpacking to be done before proceeding, which should in theory make this
    completely safe.

    But if some driver pokes around in the filesystem directly and not via one
    of the official kernel interfaces (i.e. request_firmware*(),
    call_usermodehelper*) that theory may not hold - also, I certainly might
    have missed a spot when sprinkling wait_for_initramfs(). So there is an
    escape hatch in the form of an initramfs_async= command line parameter.
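
    The worker-plus-synchronization-point scheme described above can be
    sketched in userspace with POSIX threads (the function names below are
    illustrative analogues, not the kernel's actual implementation): unpacking
    runs in a worker, and consumers block on a completion before touching
    rootfs.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
    static bool rootfs_populated;

    /* Worker: analogue of the asynchronous part of populate_rootfs(). */
    static void *unpack_rootfs(void *arg)
    {
        (void)arg;
        /* ... decompress the initramfs cpio archive here ... */
        pthread_mutex_lock(&lock);
        rootfs_populated = true;
        pthread_cond_broadcast(&done_cv);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Analogue of wait_for_initramfs(): called from the firmware loader
     * and usermodehelper paths before they touch the filesystem. */
    static void wait_for_initramfs_demo(void)
    {
        pthread_mutex_lock(&lock);
        while (!rootfs_populated)
            pthread_cond_wait(&done_cv, &lock);
        pthread_mutex_unlock(&lock);
    }
    ```

    Boot proceeds in the "main thread" while the worker decompresses; only
    consumers that actually need rootfs pay the wait.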

    Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk
    Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Reviewed-by: Luis Chamberlain
    Cc: Jessica Yu
    Cc: Borislav Petkov
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Nick Desaulniers
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

04 May, 2021

1 commit

  • Pull tracing updates from Steven Rostedt:
    "New feature:

    - A new "func-no-repeats" option in tracefs/options directory.

    When set the function tracer will detect if the current function
    being traced is the same as the previous one, and instead of
    recording it, it will keep track of the number of times that the
    function is repeated in a row. And when another function is
    recorded, it will write a new event that shows the function that
    repeated, the number of times it repeated and the time stamp of
    when the last repeated function occurred.

    Enhancements:

    - In order to implement the above "func-no-repeats" option, the ring
    buffer timestamp can now give the accurate timestamp of the event
    as it is being recorded, instead of having to record an absolute
    timestamp for all events. This helps the histogram code which no
    longer needs to waste ring buffer space.

    - New validation logic to make sure all trace events that access
    dereferenced pointers do so in a safe way, and will warn otherwise.

    Fixes:

    - No longer limit the PIDs of tasks that are recorded for
    "saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows
    for a much larger range. This caused the mapping of PIDs to the
    task names to be dropped for all tasks with a PID greater than
    32768.

    - Change trace_clock_global() to never block. This caused a deadlock.

    Clean ups:

    - Typos, prototype fixes, and removing of duplicate or unused code.

    - Better management of ftrace_page allocations"

    * tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits)
    tracing: Restructure trace_clock_global() to never block
    tracing: Map all PIDs to command lines
    ftrace: Reuse the output of the function tracer for func_repeats
    tracing: Add "func_no_repeats" option for function tracing
    tracing: Unify the logic for function tracing options
    tracing: Add method for recording "func_repeats" events
    tracing: Add "last_func_repeats" to struct trace_array
    tracing: Define new ftrace event "func_repeats"
    tracing: Define static void trace_print_time()
    ftrace: Simplify the calculation of page number for ftrace_page->records some more
    ftrace: Store the order of pages allocated in ftrace_page
    tracing: Remove unused argument from ring_buffer_time_stamp()
    tracing: Remove duplicate struct declaration in trace_events.h
    tracing: Update create_system_filter() kernel-doc comment
    tracing: A minor cleanup for create_system_filter()
    kernel: trace: Mundane typo fixes in the file trace_events_filter.c
    tracing: Fix various typos in comments
    scripts/recordmcount.pl: Make vim and emacs indent the same
    scripts/recordmcount.pl: Make indent spacing consistent
    tracing: Add a verifier to check string pointers for trace events
    ...

    Linus Torvalds
     

01 May, 2021

2 commits

    mem_init_print_info() is called from mem_init() on each architecture,
    always with a NULL argument, so make it take void and move the call into
    mm_init().

    Link: https://lkml.kernel.org/r/20210317015210.33641-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Dave Hansen [x86]
    Reviewed-by: Christophe Leroy [powerpc]
    Acked-by: David Hildenbrand
    Tested-by: Anatoly Pugachev [sparc64]
    Acked-by: Russell King [arm]
    Acked-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Richard Henderson
    Cc: Guo Ren
    Cc: Yoshinori Sato
    Cc: Huacai Chen
    Cc: Jonas Bonn
    Cc: Palmer Dabbelt
    Cc: Heiko Carstens
    Cc: "David S. Miller"
    Cc: "Peter Zijlstra"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     
  • This changes the awkward approach where architectures provide init
    functions to determine which levels they can provide large mappings for,
    to one where the arch is queried for each call.

    This removes code and indirection, and allows constant-folding of dead
    code for unsupported levels.

    This also adds a prot argument to the arch query. This is unused
    currently but could help with some architectures (e.g., some powerpc
    processors can't map uncacheable memory with large pages).

    Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Ding Tianhong
    Acked-by: Catalin Marinas [arm64]
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Christoph Hellwig
    Cc: Miaohe Lin
    Cc: Michael Ellerman
    Cc: Russell King
    Cc: Uladzislau Rezki (Sony)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

08 Apr, 2021

1 commit

  • This provides the ability for architectures to enable kernel stack base
    address offset randomization. This feature is controlled by the boot
    param "randomize_kstack_offset=on/off", with its default value set by
    CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.

    This feature is based on the original idea from the last public release
    of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
    All the credit for the original idea goes to the PaX team. Note that
    the design and implementation of this upstream randomize_kstack_offset
    feature differs greatly from the RANDKSTACK feature (see below).

    Reasoning for the feature:

    This feature aims to make the various stack-based attacks that rely on
    deterministic stack structure harder. We have had many such attacks in
    the past (just to name a few):

    https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
    https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
    https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html

    As Linux kernel stack protections have been constantly improving
    (vmap-based stack allocation with guard pages, removal of thread_info,
    STACKLEAK), attackers have had to find new ways for their exploits
    to work. They have done so, continuing to rely on the kernel's stack
    determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
    were not relevant. For example, the following recent attacks would have
    been hampered if the stack offset was non-deterministic between syscalls:

    https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
    (page 70: targeting the pt_regs copy with linear stack overflow)

    https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
    (leaked stack address from one syscall as a target during next syscall)

    The main idea is that since the stack offset is randomized on each system
    call, it is harder for an attack to reliably land in any particular place
    on the thread stack, even with address exposures, as the stack base will
    change on the next syscall. Also, since randomization is performed after
    placing pt_regs, the ptrace-based approach[1] to discover the randomized
    offset during a long-running syscall should not be possible.

    Design description:

    During most of the kernel's execution, it runs on the "thread stack",
    which is pretty deterministic in its structure: it is fixed in size,
    and on every entry from userspace to kernel on a syscall the thread
    stack starts construction from an address fetched from the per-cpu
    cpu_current_top_of_stack variable. The first element to be pushed to the
    thread stack is the pt_regs struct that stores all required CPU registers
    and syscall parameters. Finally the specific syscall function is called,
    with the stack being used as the kernel executes the resulting request.

    The goal of randomize_kstack_offset feature is to add a random offset
    after the pt_regs has been pushed to the stack and before the rest of the
    thread stack is used during the syscall processing, and to change it every
    time a process issues a syscall. The source of randomness is currently
    architecture-defined (but x86 is using the low byte of rdtsc()). Future
    improvements for different entropy sources are possible, but out of scope
    for this patch. Furthermore, to add more unpredictability, new offsets
    are chosen at the end of syscalls (the timing of which should be less
    easy to measure from userspace than at syscall entry time), and stored
    in a per-CPU variable, so that the life of the value does not stay
    explicitly tied to a single task.

    As suggested by Andy Lutomirski, the offset is added using alloca()
    and an empty asm() statement with an output constraint, since it avoids
    changes to assembly syscall entry code, to the unwinder, and provides
    correct stack alignment as defined by the compiler.

    In order to make this available by default with zero performance impact
    for those that don't want it, it is boot-time selectable with static
    branches. This way, if the overhead is not wanted, it can just be
    left turned off with no performance impact.

    The generated assembly for x86_64 with GCC looks like this:

    ...
    ffffffff81003977: 65 8b 05 02 ea 00 7f mov %gs:0x7f00ea02(%rip),%eax
    # 12380
    ffffffff8100397e: 25 ff 03 00 00 and $0x3ff,%eax
    ffffffff81003983: 48 83 c0 0f add $0xf,%rax
    ffffffff81003987: 25 f8 07 00 00 and $0x7f8,%eax
    ffffffff8100398c: 48 29 c4 sub %rax,%rsp
    ffffffff8100398f: 48 8d 44 24 0f lea 0xf(%rsp),%rax
    ffffffff81003994: 48 83 e0 f0 and $0xfffffffffffffff0,%rax
    ...

    As a result of the above stack alignment, this patch introduces about
    5 bits of randomness after pt_regs is spilled to the thread stack on
    x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
    stack alignment). The amount of entropy could be adjusted based on how
    much of the stack space we wish to trade for security.
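
    A userspace approximation of the alloca()-plus-empty-asm trick can show
    the effect (this is a simplified stand-in; the kernel's actual helpers and
    entropy source differ, and the LCG below is purely for the demo):

    ```c
    #include <alloca.h>
    #include <assert.h>
    #include <stdint.h>

    static uint32_t kstack_offset;     /* per-CPU in the kernel; global here */

    /* Kept out of line so its frame lands below the alloca()'d pad,
     * just as the real syscall function's frame does. */
    __attribute__((noinline))
    static uintptr_t probe_stack(void)
    {
        volatile char local;           /* stands in for syscall stack usage */
        return (uintptr_t)&local;
    }

    static uintptr_t handle_syscall_demo(void)
    {
        /* Add 0-1023 bytes of offset; the compiler realigns the result. */
        uint32_t offset = kstack_offset & 0x3FF;
        char *pad = alloca(offset);
        /* Empty asm with the pointer as input keeps the allocation from
         * being optimized away, without emitting extra instructions. */
        asm volatile("" :: "r"(pad) : "memory");

        uintptr_t addr = probe_stack();

        /* Choose the next offset at syscall *exit*, which is harder to
         * time from userspace (demo-only LCG as the entropy source). */
        kstack_offset = kstack_offset * 1664525u + 1013904223u;
        return addr;
    }
    ```

    Two consecutive "syscalls" then observe different stack addresses for the
    same local variable, which is the whole point of the feature.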

    My measure of syscall performance overhead (on x86_64):

    lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
    randomize_kstack_offset=y Simple syscall: 0.7082 microseconds
    randomize_kstack_offset=n Simple syscall: 0.7016 microseconds

    So, roughly 0.9% overhead growth for a no-op syscall, which is very
    manageable. And for people that don't want this, it's off by default.

    There are two gotchas with using the alloca() trick. First,
    compilers that have Stack Clash protection (-fstack-clash-protection)
    enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to
    any dynamic stack allocations. While the randomization offset is
    always less than a page, the resulting assembly would still contain
    (unreachable!) probing routines, bloating the resulting assembly. To
    avoid this, -fno-stack-clash-protection is unconditionally added to
    the kernel Makefile since this is the only dynamic stack allocation in
    the kernel (now that VLAs have been removed) and it is provably safe
    from Stack Clash style attacks.

    The second gotcha with alloca() is a negative interaction with
    -fstack-protector*, in that it sees the alloca() as an array allocation,
    which triggers the unconditional addition of the stack canary function
    pre/post-amble which slows down syscalls regardless of the static
    branch. In order to avoid adding this unneeded check and its associated
    performance impact, architectures need to carefully remove uses of
    -fstack-protector-strong (or -fstack-protector) in the compilation units
    that use the add_random_kstack_offset() macro and to audit the resulting stack
    mitigation coverage (to make sure no desired coverage disappears). No
    change is visible for this on x86 because the stack protector is already
    unconditionally disabled for the compilation unit, but the change is
    required on arm64. There is, unfortunately, no attribute that can be
    used to disable stack protector for specific functions.

    Comparison to PaX RANDKSTACK feature:

    The RANDKSTACK feature randomizes the location of the stack start
    (cpu_current_top_of_stack), i.e. including the location of pt_regs
    structure itself on the stack. Initially this patch followed the same
    approach, but during the recent discussions[2], it has been determined
    to be of little value since, if ptrace functionality is available for
    an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
    different offsets in the pt_regs struct, observe the cache behavior of
    the pt_regs accesses, and figure out the random stack offset. Another
    difference is that the random offset is stored in a per-cpu variable,
    rather than having it be per-thread. As a result, these implementations
    differ a fair bit in their implementation details and results, though
    obviously the intent is similar.

    [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/
    [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
    [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html

    Co-developed-by: Elena Reshetova
    Signed-off-by: Elena Reshetova
    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org

    Kees Cook
     

19 Mar, 2021

1 commit


27 Feb, 2021

3 commits

    Currently, breakpoints in the kernel .init.text section are not handled
    correctly, as they are allowed to remain (and be removed) even after the
    corresponding pages have been freed.

    Fix it via killing .init.text section breakpoints just prior to initmem
    pages being freed.

    Doug: "HW breakpoints aren't handled by this patch but it's probably
    not such a big deal".

    Link: https://lkml.kernel.org/r/20210224081652.587785-1-sumit.garg@linaro.org
    Signed-off-by: Sumit Garg
    Suggested-by: Doug Anderson
    Acked-by: Doug Anderson
    Acked-by: Daniel Thompson
    Tested-by: Daniel Thompson
    Cc: Masami Hiramatsu
    Cc: Steven Rostedt (VMware)
    Cc: Jason Wessel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sumit Garg
     
    Add a kernel parameter stack_depot_disable to disable stack depot, so
    that the stack hash table doesn't consume any memory when stack depot is
    disabled.

    The use case is CONFIG_PAGE_OWNER without page_owner=on. Without this
    patch, stackdepot will consume the memory for the hashtable. By default,
    it's 8M, which is not trivial.

    With this option, on a system with CONFIG_PAGE_OWNER configured but
    page_owner=off and stack_depot_disable on the kernel command line, we can
    save the memory otherwise wasted on the hashtable.
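
    The saving works because the table is only allocated when the parameter
    was not given. A minimal userspace sketch of that pattern (names and
    table size are hypothetical, not the real stackdepot code):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    #define STACK_HASH_SIZE (1u << 20)   /* illustrative, not the real size */

    static bool stack_depot_disable;
    static void **stack_table;

    /* Analogue of an early_param() handler for "stack_depot_disable". */
    static void parse_boot_arg(const char *arg)
    {
        if (strcmp(arg, "stack_depot_disable") == 0)
            stack_depot_disable = true;
    }

    /* Table memory is only committed when the depot stays enabled. */
    static void stack_depot_init_demo(void)
    {
        if (!stack_depot_disable)
            stack_table = calloc(STACK_HASH_SIZE, sizeof(*stack_table));
    }
    ```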

    [akpm@linux-foundation.org: fix CONFIG_STACKDEPOT=n build]

    Link: https://lkml.kernel.org/r/1611749198-24316-2-git-send-email-vjitta@codeaurora.org
    Signed-off-by: Vinayak Menon
    Signed-off-by: Vijayanand Jitta
    Cc: Alexander Potapenko
    Cc: Minchan Kim
    Cc: Yogesh Lal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vijayanand Jitta
     
  • Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7.

    This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
    low-overhead sampling-based memory safety error detector of heap
    use-after-free, invalid-free, and out-of-bounds access errors. This
    series enables KFENCE for the x86 and arm64 architectures, and adds
    KFENCE hooks to the SLAB and SLUB allocators.

    KFENCE is designed to be enabled in production kernels, and has near
    zero performance overhead. Compared to KASAN, KFENCE trades performance
    for precision. The main motivation behind KFENCE's design is that with
    enough total uptime KFENCE will detect bugs in code paths not typically
    exercised by non-production test workloads. One way to quickly achieve a
    large enough total uptime is when the tool is deployed across a large
    fleet of machines.

    KFENCE objects each reside on a dedicated page, at either the left or
    right page boundaries. The pages to the left and right of the object
    page are "guard pages", whose attributes are changed to a protected
    state, and cause page faults on any attempted access to them. Such page
    faults are then intercepted by KFENCE, which handles the fault
    gracefully by reporting a memory access error.
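
    The guard-page mechanism can be demonstrated in userspace with
    mmap()/mprotect(), with a SIGSEGV handler standing in for the kernel's
    fault interception (a sketch of the concept only, not KFENCE's
    implementation):

    ```c
    #include <assert.h>
    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static sigjmp_buf fault_env;
    static volatile sig_atomic_t fault_seen;

    static void on_fault(int sig)
    {
        (void)sig;
        fault_seen = 1;
        siglongjmp(fault_env, 1);   /* "report" and recover, as KFENCE does */
    }

    static void install_fault_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_fault;
        sigaction(SIGSEGV, &sa, NULL);
    }

    /* Map one object page flanked by two PROT_NONE guard pages. */
    static char *map_guarded_object(void)
    {
        long ps = sysconf(_SC_PAGESIZE);
        char *base = mmap(NULL, 3 * ps, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(base != MAP_FAILED);
        assert(mprotect(base, ps, PROT_NONE) == 0);          /* left guard */
        assert(mprotect(base + 2 * ps, ps, PROT_NONE) == 0); /* right guard */
        return base + ps;            /* object lives on the middle page */
    }
    ```

    Any access that strays off the object page hits a guard page and faults,
    which is exactly the property KFENCE exploits for detection.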

    Guarded allocations are set up based on a sample interval (can be set
    via kfence.sample_interval). After expiration of the sample interval,
    the next allocation through the main allocator (SLAB or SLUB) returns a
    guarded allocation from the KFENCE object pool. At this point, the timer
    is reset, and the next allocation is set up after the expiration of the
    interval.

    To enable/disable a KFENCE allocation through the main allocator's
    fast-path without overhead, KFENCE relies on static branches via the
    static keys infrastructure. The static branch is toggled to redirect the
    allocation to KFENCE.

    The KFENCE memory pool is of fixed size, and if the pool is exhausted no
    further KFENCE allocations occur. The default config is conservative
    with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
    pages).

    We have verified by running synthetic benchmarks (sysbench I/O,
    hackbench) and production server-workload benchmarks that a kernel with
    KFENCE (using sample intervals 100-500ms) is performance-neutral
    compared to a non-KFENCE baseline kernel.

    KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
    properties. The name "KFENCE" is a homage to the Electric Fence Malloc
    Debugger [2].

    For more details, see Documentation/dev-tools/kfence.rst added in the
    series -- also viewable here:

    https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst

    [1] http://llvm.org/docs/GwpAsan.html
    [2] https://linux.die.net/man/3/efence

    This patch (of 9):

    This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
    low-overhead sampling-based memory safety error detector of heap
    use-after-free, invalid-free, and out-of-bounds access errors.

    KFENCE is designed to be enabled in production kernels, and has near
    zero performance overhead. Compared to KASAN, KFENCE trades performance
    for precision. The main motivation behind KFENCE's design is that with
    enough total uptime KFENCE will detect bugs in code paths not typically
    exercised by non-production test workloads. One way to quickly achieve a
    large enough total uptime is when the tool is deployed across a large
    fleet of machines.

    KFENCE objects each reside on a dedicated page, at either the left or
    right page boundaries. The pages to the left and right of the object
    page are "guard pages", whose attributes are changed to a protected
    state, and cause page faults on any attempted access to them. Such page
    faults are then intercepted by KFENCE, which handles the fault
    gracefully by reporting a memory access error. To detect out-of-bounds
    writes to memory within the object's page itself, KFENCE also uses
    pattern-based redzones. The following figure illustrates the page
    layout:

    ---+-----------+-----------+-----------+-----------+-----------+---
    | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx |
    | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx |
    | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x |
    | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx |
    | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx |
    | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx |
    ---+-----------+-----------+-----------+-----------+-----------+---

    Guarded allocations are set up based on a sample interval (can be set
    via kfence.sample_interval). After expiration of the sample interval, a
    guarded allocation from the KFENCE object pool is returned to the main
    allocator (SLAB or SLUB). At this point, the timer is reset, and the
    next allocation is set up after the expiration of the interval.

    To enable/disable a KFENCE allocation through the main allocator's
    fast-path without overhead, KFENCE relies on static branches via the
    static keys infrastructure. The static branch is toggled to redirect the
    allocation to KFENCE. To date, we have verified by running synthetic
    benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE
    is performance-neutral compared to the non-KFENCE baseline.

    For more details, see Documentation/dev-tools/kfence.rst (added later in
    the series).

    [elver@google.com: fix parameter description for kfence_object_start()]
    Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com
    [elver@google.com: avoid stalling work queue task without allocations]
    Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com
    Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com
    [elver@google.com: fix potential deadlock due to wake_up()]
    Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com
    Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com
    [elver@google.com: add option to use KFENCE without static keys]
    Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com
    [elver@google.com: add missing copyright and description headers]
    Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com

    Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com
    Signed-off-by: Marco Elver
    Signed-off-by: Alexander Potapenko
    Reviewed-by: Dmitry Vyukov
    Reviewed-by: SeongJae Park
    Co-developed-by: Marco Elver
    Reviewed-by: Jann Horn
    Cc: "H. Peter Anvin"
    Cc: Paul E. McKenney
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Eric Dumazet
    Cc: Greg Kroah-Hartman
    Cc: Hillf Danton
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Joonsoo Kim
    Cc: Joern Engel
    Cc: Kees Cook
    Cc: Mark Rutland
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

16 Feb, 2021

1 commit