22 Jul, 2022

1 commit

  • [ Upstream commit a382f8fee42ca10c9bfce0d2352d4153f931f5dc ]

    These are indeed "should not happen" situations, but it turns out recent
    changes made the 'task_is_stopped_or_traced()' case trigger (a fix for
    that exists and is pending more testing), and the BUG_ON() makes it
    unnecessarily hard to actually debug for no good reason.

    It's been that way for a long time, but let's make it clear: BUG_ON() is
    not good for debugging, and should never be used in situations where you
    could just say "this shouldn't happen, but we can continue".

    Use WARN_ON_ONCE() instead to make sure it gets logged, and then just
    continue running, instead of making the system basically unusable
    because you crashed the machine while potentially holding some very
    core locks (e.g. this function is commonly called while holding
    'tasklist_lock' for writing).
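A minimal userspace sketch of the behavioral difference (the `WARN_ON_ONCE` macro below is a hypothetical stand-in for the kernel's, and `handle_task` is an invented example function):

```c
#include <stdio.h>

/* Hypothetical userspace stand-in for the kernel's WARN_ON_ONCE():
 * log the condition the first time it is true, then keep running. */
#define WARN_ON_ONCE(cond) ({                        \
    static int warned;                               \
    int c = !!(cond);                                \
    if (c && !warned) {                              \
        warned = 1;                                  \
        fprintf(stderr, "WARNING: %s\n", #cond);     \
    }                                                \
    c;                                               \
})

/* Invented example: a BUG_ON(!state_ok) here would have crashed the
 * whole process; the WARN_ON_ONCE() variant logs once and continues. */
int handle_task(int state_ok)
{
    if (WARN_ON_ONCE(!state_ok))
        return -1;   /* "should not happen, but we can continue" */
    return 0;
}
```

The caller survives the anomalous case instead of taking the machine down while holding core locks.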

    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
     

09 Jun, 2022

1 commit

  • [ Upstream commit 78ed93d72ded679e3caf0758357209887bda885f ]

    With SIGTRAP on perf events, we have encountered termination of
    processes due to user space attempting to block delivery of SIGTRAP.
    Consider this case:


    ...
    sigset_t s;
    sigemptyset(&s);
    sigaddset(&s, SIGTRAP);
    sigprocmask(SIG_BLOCK, &s, ...);
    ...

    When the perf event triggers, while SIGTRAP is blocked, force_sig_perf()
    will force the signal, but revert back to the default handler, thus
    terminating the task.

    This makes sense for error conditions, but not so much for explicitly
    requested monitoring. However, the expectation is still that signals
    generated by perf events are synchronous, which will no longer be the
    case if the signal is blocked and delivered later.

    To give user space the ability to clearly distinguish synchronous from
    asynchronous signals, introduce siginfo_t::si_perf_flags and
    TRAP_PERF_FLAG_ASYNC (opted for flags in case more binary information is
    required in future).

    The resolution to the problem is then to (a) no longer force the signal
    (avoiding the terminations), but (b) tell user space via si_perf_flags
    if the signal was synchronous or not, so that such signals can be
    handled differently (e.g. let user space decide to ignore or consider
    the data imprecise).

    The alternative of making the kernel ignore SIGTRAP on perf events if
    the signal is blocked may work for some use cases, but likely causes
    issues in others that then have to fall back to interception of
    sigprocmask() (which we want to avoid). [ A concrete example: when using
    breakpoint perf events to track data-flow, in a region of code where
    signals are blocked, data-flow can no longer be tracked accurately.
    When a relevant asynchronous signal is received after unblocking the
    signal, the data-flow tracking logic needs to know its state is
    imprecise. ]

    Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Marco Elver
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Geert Uytterhoeven
    Tested-by: Dmitry Vyukov
    Link: https://lore.kernel.org/r/20220404111204.935357-1-elver@google.com
    Signed-off-by: Sasha Levin

    Marco Elver
     

09 Mar, 2022

1 commit

  • [ Upstream commit e7f7c99ba911f56bc338845c1cd72954ba591707 ]

    Recently while investigating a problem with rr and signals I noticed
    that siglock is dropped in ptrace_signal and get_signal does not jump
    to relock.

    Looking further to see if the problem is anywhere else, I see that
    do_signal_stop also returns if signal_group_exit is true. I believe
    that test can now never be true, but it is a bit hard to trace
    through and be certain.

    Testing signal_group_exit is not expensive, so move the test for
    signal_group_exit into the for loop inside of get_signal to ensure
    the test is never skipped improperly.
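    The shape of the fix, in a hedged sketch of get_signal's main loop (not the literal kernel diff):

```c
for (;;) {
        /* Re-checked on every iteration, so the test cannot be skipped
         * when ptrace_signal drops siglock without jumping back to
         * relock. */
        if (signal_group_exit(signal)) {
                /* take the fatal-signal / group-exit path */
        }
        /* ... rest of signal dequeue and delivery ... */
}
```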

    This has been a potential problem since the test for
    signal_group_exit was added.

    Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
    Reviewed-by: Kees Cook
    Link: https://lkml.kernel.org/r/875yssekcd.fsf_-_@email.froward.int.ebiederm.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     

16 Feb, 2022

1 commit

  • commit 5c72263ef2fbe99596848f03758ae2dc593adf2c upstream.

    Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
    were not being delivered to ptraced pid namespace init processes. Make
    sure the SIGNAL_UNKILLABLE doesn't get set for these cases.

    Reported-by: Robert Święcki
    Suggested-by: "Eric W. Biederman"
    Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Reviewed-by: "Eric W. Biederman"
    Link: https://lore.kernel.org/lkml/878rui8u4a.fsf@email.froward.int.ebiederm.org
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

25 Nov, 2021

3 commits

  • commit fcb116bc43c8c37c052530ead79872f8b2615711 upstream.

    Recently, to prevent issues with SECCOMP_RET_KILL and similar signals
    being changed before they are delivered, SA_IMMUTABLE was added.

    Unfortunately this broke debuggers[1][2] which reasonably expect
    to be able to trap synchronous SIGTRAP and SIGSEGV even when
    the target process is not configured to handle those signals.

    Add force_exit_sig and use it instead of force_fatal_sig where
    historically the code has directly called do_exit. This has the
    implementation benefits of going through the signal exit path
    (including generating core dumps) without the danger of allowing
    userspace to ignore or change these signals.

    This avoids userspace regressions as older kernels exited with do_exit
    which debuggers also can not intercept.

    In the future it should be possible to improve the quality of
    implementation of the kernel by changing some of these force_exit_sig
    calls to force_fatal_sig. That can be done where it matters on
    a case-by-case basis with careful analysis.

    Reported-by: Kyle Huey
    Reported-by: kernel test robot
    [1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
    [2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
    Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
    Fixes: a3616a3c0272 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
    Fixes: 83a1f27ad773 ("signal/powerpc: On swapcontext failure force SIGSEGV")
    Fixes: 9bc508cf0791 ("signal/s390: Use force_sigsegv in default_trap_handler")
    Fixes: 086ec444f866 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
    Fixes: c317d306d550 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
    Fixes: 695dd0d634df ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
    Fixes: 1fbd60df8a85 ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
    Fixes: 941edc5bf174 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
    Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Tested-by: Kyle Huey
    Signed-off-by: "Eric W. Biederman"
    Cc: Thomas Backlund
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit e349d945fac76bddc78ae1cb92a0145b427a87ce upstream.

    Recently, to prevent issues with SECCOMP_RET_KILL and similar signals
    being changed before they are delivered, SA_IMMUTABLE was added.

    Unfortunately this broke debuggers[1][2] which reasonably expect to be
    able to trap synchronous SIGTRAP and SIGSEGV even when the target
    process is not configured to handle those signals.

    Update force_sig_to_task to support both the case when we can allow
    the debugger to intercept and possibly ignore the signal and the case
    when it is not safe to let userspace know about the signal until the
    process has exited.

    Suggested-by: Linus Torvalds
    Reported-by: Kyle Huey
    Reported-by: kernel test robot
    Cc: stable@vger.kernel.org
    [1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
    [2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
    Fixes: 00b06da29cf9 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
    Link: https://lkml.kernel.org/r/877dd5qfw5.fsf_-_@email.froward.int.ebiederm.org
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Tested-by: Kyle Huey
    Signed-off-by: "Eric W. Biederman"
    Cc: Thomas Backlund
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 26d5badbccddcc063dc5174a2baffd13a23322aa upstream.

    Add a simple helper force_fatal_sig that causes a signal to be
    delivered to a process as if the signal handler was set to SIG_DFL.

    Reimplement force_sigsegv based upon this new helper. This fixes
    force_sigsegv so that when it forces the default signal handler
    to be used the code now forces the signal to be unblocked as well.

    Reusing the tested logic in force_sig_info_to_task that was built for
    force_sig_seccomp makes the implementation trivial.

    This is interesting both because it makes force_sigsegv simpler and
    because there are a couple of buggy places in the kernel that call
    do_exit(SIGILL) or do_exit(SIGSYS) because there is no straightforward
    way today for those places to simply force the exit of a
    process with the chosen signal. Creating force_fatal_sig allows
    those places to be implemented with normal signal exits.

    Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
    Signed-off-by: Eric W. Biederman
    Cc: Thomas Backlund
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

19 Nov, 2021

2 commits

  • commit 00b06da29cf9dc633cdba87acd3f57f4df3fd5c7 upstream.

    As Andy pointed out, there are races between force_sig_info_to_task
    and sigaction[1]. As Kees discovered[2], ptrace is also able to
    change these signals.

    In the case of seccomp killing a process with a signal, it is a
    security violation to allow the signal to be caught or manipulated.

    Solve this problem by introducing a new flag SA_IMMUTABLE that
    prevents sigaction and ptrace from modifying these forced signals.
    This flag is carefully made kernel internal so that no new ABI is
    introduced.
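    The enforcement amounts to a small check in the sigaction path; a sketch, with error handling elided (not the literal diff):

```c
/* do_sigaction() (sketch): a handler forced with the kernel-internal
 * SA_IMMUTABLE flag can no longer be replaced from userspace. */
if (k->sa.sa_flags & SA_IMMUTABLE)
        return -EINVAL;
```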

    Longer term I think this can be solved by guaranteeing short circuit
    delivery of signals in this case. Unfortunately reliable and
    guaranteed short circuit delivery of these signals is still a ways off
    from being implemented, tested, and merged. So I have implemented a much
    simpler alternative for now.

    [1] https://lkml.kernel.org/r/b5d52d25-7bde-4030-a7b1-7c6f8ab90660@www.fastmail.com
    [2] https://lkml.kernel.org/r/202110281136.5CE65399A7@keescook
    Cc: stable@vger.kernel.org
    Fixes: 307d522f5eb8 ("signal/seccomp: Refactor seccomp signal and coredump generation")
    Tested-by: Andrea Righi
    Tested-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 7d613f9f72ec8f90ddefcae038fdae5adb8404b3 upstream.

    The existence of sigkill_pending is a little silly as it is
    functionally a duplicate of fatal_signal_pending that is used in
    exactly one place.

    Checking for pending fatal signals and returning early in ptrace_stop
    is actively harmful. It causes the ptrace_stop called by
    ptrace_signal to return early before setting current->exit_code.
    Later, when ptrace_signal reads the signal number from
    current->exit_code, the value is undefined, making it unpredictable
    what will happen.

    Instead rely on the fact that schedule will not sleep if there is a
    pending signal that can awaken a task.

    Removing the explicit sigkill_pending test fixes ptrace_signal
    when ptrace_stop does not stop, because current->exit_code is
    always set to signr.

    Cc: stable@vger.kernel.org
    Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
    Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop")
    Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133
    Reviewed-by: Kees Cook
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

22 Oct, 2021

1 commit

  • …el/git/ebiederm/user-namespace

    Pull ucounts fixes from Eric Biederman:
    "There has been one very hard to track down bug in the ucount code that
    we have been tracking since roughly v5.14 was released. Alex managed
    to find a reliable reproducer a few days ago and then I was able to
    instrument the code and figure out what the issue was.

    It turns out the sigqueue_alloc single-atomic-operation optimization
    did not play nicely with the multiple-level rlimits of ucounts.
    Either sigqueue_alloc or sigqueue_free could be operating on
    multiple levels and trigger the conditions for the optimization on
    more than one level at the same time.

    To deal with that situation I have introduced inc_rlimit_get_ucounts
    and dec_rlimit_put_ucounts, which focus just on the optimization and
    the rlimit and ucount changes.

    While looking into the big bug I found a couple of other little
    issues, so I am including those fixes here as well.

    When I have time I would very much like to dig into process ownership
    of the shared signal queue and see if we could pick a single owner for
    the entire queue so that all of the rlimits can count to that owner.
    That should entirely remove the need to call get_ucounts and
    put_ucounts in sigqueue_alloc and sigqueue_free. It is difficult
    because Linux unlike POSIX supports setuid that works on a single
    thread"

    * 'ucount-fixes-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring
    ucounts: Proper error handling in set_cred_ucounts
    ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_creds
    ucounts: Fix signal ucount refcounting

    Linus Torvalds
     

19 Oct, 2021

1 commit

  • In commit fda31c50292a ("signal: avoid double atomic counter
    increments for user accounting") Linus made a clever optimization to
    how rlimits are accounted against the struct user_struct.
    Unfortunately that
    optimization does not work in the obvious way when moved to nested
    rlimits. The problem is that the last decrement of the per user
    namespace per user sigpending counter might also be the last decrement
    of the sigpending counter in the parent user namespace as well. Which
    means that simply freeing the leaf ucount in __free_sigqueue is not
    enough.

    Maintain the optimization and handle the tricky cases by introducing
    inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.

    By moving the entire optimization into functions that perform all of
    the work it becomes possible to ensure that every level is handled
    properly.

    The new function inc_rlimit_get_ucounts returns 0 on failure to
    increment the ucount. This is different from inc_rlimit_ucounts,
    which increments the ucounts and returns LONG_MAX if the ucount
    counter has exceeded its maximum or wrapped (to indicate the counter
    needs to be decremented).

    I wish we had a single user to account all pending signals to across
    all of the threads of a process, so that this complexity was not
    necessary.
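    A hypothetical single-level model of that contract (the real functions walk every ucounts level; every name suffixed `_model` here is invented):

```c
/* Hypothetical single-level model: inc_... returns 0 on failure,
 * meaning nothing was taken and the caller must NOT call dec_...;
 * nonzero on success, to be paired with exactly one dec_... later. */
enum { UCOUNT_MAX_MODEL = 2 };

static long pending_model;

int inc_rlimit_get_ucounts_model(void)
{
    if (pending_model >= UCOUNT_MAX_MODEL)
        return 0;               /* over the limit: nothing to undo */
    pending_model++;
    return 1;
}

void dec_rlimit_put_ucounts_model(void)
{
    pending_model--;
}
```

Keeping the increment, the limit check, and the reference acquisition inside one function is what lets every level be handled consistently.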

    Cc: stable@vger.kernel.org
    Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
    v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
    Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
    Reviewed-by: Alexey Gladkov
    Tested-by: Rune Kleveland
    Tested-by: Yu Zhao
    Tested-by: Jordan Glover
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Sep, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton : (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
  • When a user sends a signal to another process it forces the kernel to
    allocate memory for 'struct sigqueue' objects. The number of signals is
    limited by the RLIMIT_SIGPENDING resource limit, but even the default
    settings allow each user to consume up to several megabytes of memory.

    It makes sense to account for these allocations to restrict the host's
    memory consumption from inside the memcg-limited container.
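    The accounting itself amounts to one flag when the sigqueue cache is created; a sketch of the kernel-side change (the companion flags shown are illustrative, not a literal diff):

```c
/* kernel/signal.c, signals_init() (sketch): SLAB_ACCOUNT makes every
 * 'struct sigqueue' allocation charged to the allocating task's memcg. */
sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT);
```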

    Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com
    Signed-off-by: Vasily Averin
    Reviewed-by: Shakeel Butt
    Cc: Alexander Viro
    Cc: Alexey Dobriyan
    Cc: Andrei Vagin
    Cc: Borislav Petkov
    Cc: Christian Brauner
    Cc: Dmitry Safonov
    Cc: "Eric W. Biederman"
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: "J. Bruce Fields"
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Jiri Slaby
    Cc: Johannes Weiner
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Roman Gushchin
    Cc: Serge Hallyn
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Yutian Yang
    Cc: Zefan Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     

02 Sep, 2021

2 commits

  • …nel/git/ebiederm/user-namespace

    Pull exit cleanups from Eric Biederman:
    "In preparation of doing something about PTRACE_EVENT_EXIT I have
    started cleaning up various pieces of code related to do_exit. Most of
    that code I did not manage to get tested and reviewed before the merge
    window opened but a handful of very useful cleanups are ready to be
    merged.

    The first change is simply the removal of the bdflush system call. The
    code has now been disabled long enough that even the oldest working
    userspace setups anyone can find to test are fine with the bdflush
    system call being removed.

    Changing m68k fpsp040_die to use force_sigsegv(SIGSEGV) instead of
    calling do_exit directly is interesting only in that it is nearly the
    most difficult of the incorrect uses of do_exit to remove.

    The change to the seccomp code to simply send a signal instead of
    calling do_coredump directly is a very nice little cleanup made
    possible by realizing the existing signal sending helpers were missing
    a little bit of functionality that is easy to provide"

    * 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    signal/seccomp: Dump core when there is only one live thread
    signal/seccomp: Refactor seccomp signal and coredump generation
    signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
    exit/bdflush: Remove the deprecated bdflush system call

    Linus Torvalds
     
  • …/kernel/git/ebiederm/user-namespace

    Pull siginfo si_trapno updates from Eric Biederman:
    "The full set of si_trapno changes was not appropriate as a fix for the
    newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the
    related cleanups.

    This is the rest of the cleanups for si_trapno that reduces it from
    being a really weird arch special case that is expected to always be
    present (but isn't) on the architectures that support it to being yet
    another field in the _sigfault union of struct siginfo.

    The changes have been reviewed and marinated in linux-next. With the
    removal of this awkward special case new code (like SIGTRAP TRAP_PERF)
    that works across architectures should be easier to write and
    maintain"

    * 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency
    signal: Verify the alignment and size of siginfo_t
    signal: Remove the generic __ARCH_SI_TRAPNO support
    signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK
    signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP
    arm64: Add compile-time asserts for siginfo_t offsets
    arm: Add compile-time asserts for siginfo_t offsets
    sparc64: Add compile-time asserts for siginfo_t offsets

    Linus Torvalds
     

26 Aug, 2021

1 commit

  • Factor out force_sig_seccomp from the seccomp signal generation and
    place it in kernel/signal.c. The function force_sig_seccomp takes a
    parameter force_coredump to indicate that the sigaction field should
    be reset to SIGDFL so that a coredump will be generated when the
    signal is delivered.

    force_sig_seccomp is then used to replace both seccomp_send_sigsys
    and seccomp_init_siginfo.

    force_sig_info_to_task gains an extra parameter to force using
    the default signal action.

    With this change seccomp is no longer a special case, and there is
    exactly one place do_coredump is called from.

    Further, it is no longer necessary for __seccomp_filter
    to call do_group_exit.

    Acked-by: Kees Cook
    Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Aug, 2021

1 commit

  • Starting the process-wide cputime counter needs to be done in the same
    sighand locking sequence as actually arming the related timer; otherwise
    this races against concurrent timers setting/expiring in the same
    threadgroup.

    Detecting that the cputime counter is started without holding the sighand
    lock is a first step toward debugging such situations.

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org

    Frederic Weisbecker
     

24 Jul, 2021

4 commits

  • It helps to know which part of the siginfo structure the siginfo_layout
    value is talking about.

    v1: https://lkml.kernel.org/r/m18s4zs7nu.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210505141101.11519-9-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/87zgumw8cc.fsf_-_@disp2133
    Acked-by: Marco Elver
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • Now that __ARCH_SI_TRAPNO is no longer set by any architecture remove
    all of the code it enabled from the kernel.

    On alpha and sparc a more explicit approach is taken: using
    send_sig_fault_trapno or force_sig_fault_trapno in the very limited
    circumstances where si_trapno was set to a non-zero value.

    The generic support that is being removed always set si_trapno on all
    fault signals. With only SIGILL ILL_ILLTRP on sparc and SIGFPE and
    SIGTRAP TRAP_UNK on alpha providing si_trapno values, asking all
    senders of fault signals to provide an si_trapno value does not make
    sense.

    Making si_trapno an ordinary extension of the fault siginfo layout has
    enabled the architecture generic implementation of SIGTRAP TRAP_PERF,
    and enables other faulting signals to grow architecture generic
    senders as well.

    v1: https://lkml.kernel.org/r/m18s4zs7nu.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210505141101.11519-8-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/87bl73xx6x.fsf_-_@disp2133
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • While reviewing the signal handlers on alpha it became clear that
    si_trapno is only set to a non-zero value when sending SIGFPE and when
    sending SIGTRAP with si_code TRAP_UNK.

    Add send_sig_fault_trapno and send SIGTRAP TRAP_UNK, and SIGFPE with it.

    Remove the define of __ARCH_SI_TRAPNO and remove the always zero
    si_trapno parameter from send_sig_fault and force_sig_fault.

    v1: https://lkml.kernel.org/r/m1eeers7q7.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210505141101.11519-7-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/87h7gvxx7l.fsf_-_@disp2133
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • While reviewing the signal handlers on sparc it became clear that
    si_trapno is only set to a non-zero value when sending SIGILL with
    si_code ILL_ILLTRP.

    Add force_sig_fault_trapno and send SIGILL ILL_ILLTRP with it.

    Remove the define of __ARCH_SI_TRAPNO and remove the always zero
    si_trapno parameter from send_sig_fault and force_sig_fault.

    v1: https://lkml.kernel.org/r/m1eeers7q7.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210505141101.11519-7-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/87mtqnxx89.fsf_-_@disp2133
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Jul, 2021

1 commit

  • We must properly handle errors when we increase the rlimit counter
    and the ucounts reference counter. We have to do this with RCU
    protection to prevent a possible use-after-free that could occur due
    to a concurrent put_cred_rcu().

    The following reproducer triggers the problem:

    $ cat testcase.sh
    case "${STEP:-0}" in
    0)
        ulimit -Si 1
        ulimit -Hi 1
        STEP=1 unshare -rU "$0"
        killall sleep
        ;;
    1)
        for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
        ;;
    esac

    with the KASAN report being along the lines of

    BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
    Write of size 4 at addr ffff8880045f031c by task swapper/2/0

    CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
    Call Trace:

    put_ucounts+0x17/0xa0
    put_cred_rcu+0xd5/0x190
    rcu_core+0x3bf/0xcb0
    __do_softirq+0xe3/0x341
    irq_exit_rcu+0xbe/0xe0
    sysvec_apic_timer_interrupt+0x6a/0x90

    asm_sysvec_apic_timer_interrupt+0x12/0x20
    default_idle_call+0x53/0x130
    do_idle+0x311/0x3c0
    cpu_startup_entry+0x14/0x20
    secondary_startup_64_no_verify+0xc2/0xcb

    Allocated by task 127:
    kasan_save_stack+0x1b/0x40
    __kasan_kmalloc+0x7c/0x90
    alloc_ucounts+0x169/0x2b0
    set_cred_ucounts+0xbb/0x170
    ksys_unshare+0x24c/0x4e0
    __x64_sys_unshare+0x16/0x20
    do_syscall_64+0x37/0x70
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Freed by task 0:
    kasan_save_stack+0x1b/0x40
    kasan_set_track+0x1c/0x30
    kasan_set_free_info+0x20/0x30
    __kasan_slab_free+0xeb/0x120
    kfree+0xaa/0x460
    put_cred_rcu+0xd5/0x190
    rcu_core+0x3bf/0xcb0
    __do_softirq+0xe3/0x341

    The buggy address belongs to the object at ffff8880045f0300
    which belongs to the cache kmalloc-192 of size 192
    The buggy address is located 28 bytes inside of
    192-byte region [ffff8880045f0300, ffff8880045f03c0)
    The buggy address belongs to the page:
    page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
    flags: 0x100000000000200(slab|node=0|zone=1)
    raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
    raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================
    Disabling lock debugging due to kernel taint

    Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
    Cc: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Alexey Gladkov
    Signed-off-by: Linus Torvalds

    Alexey Gladkov
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton : (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

02 Jul, 2021

1 commit

  • Currently we handle SS_AUTODISARM as soon as we have stored the altstack
    settings into the sigframe - that's the point when we have set things up
    for an eventual sigreturn to restore the old settings. And if we manage
    to set the sigframe up (we are not done with that yet), everything's fine.
    However, in case of failure we end up with sigframe-to-be abandoned and
    SIGSEGV force-delivered. And in that case we end up with inconsistent
    rules - late failures have altstack reset, early ones do not.

    It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
    we have set the sigframe up and are committed to entering the handler,
    i.e. in signal_delivered().

    Link: https://lore.kernel.org/lkml/20200404170604.GN23230@ZenIV.linux.org.uk/
    Link: https://github.com/ClangBuiltLinux/linux/issues/876
    Link: https://lkml.kernel.org/r/20210422230846.1756380-1-ndesaulniers@google.com
    Signed-off-by: Al Viro
    Signed-off-by: Nick Desaulniers
    Acked-by: Oleg Nesterov
    Tested-by: Nathan Chancellor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
     

29 Jun, 2021

2 commits

  • Pull user namespace rlimit handling update from Eric Biederman:
    "This is the work mainly by Alexey Gladkov to limit rlimits to the
    rlimits of the user that created a user namespace, and to allow users
    to have stricter limits on the resources created within a user
    namespace."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    cred: add missing return error code when set_cred_ucounts() failed
    ucounts: Silence warning in dec_rlimit_ucounts
    ucounts: Set ucount_max to the largest positive value the type can hold
    kselftests: Add test to check for rlimit changes in different user namespaces
    Reimplement RLIMIT_MEMLOCK on top of ucounts
    Reimplement RLIMIT_SIGPENDING on top of ucounts
    Reimplement RLIMIT_MSGQUEUE on top of ucounts
    Reimplement RLIMIT_NPROC on top of ucounts
    Use atomic_t for ucounts reference counting
    Add a reference to ucounts for each cred
    Increase size of ucounts to atomic_long_t

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
    coordinated scheduling across SMT siblings. This is a much
    requested feature for cloud computing platforms, to allow the
    flexible utilization of SMT siblings, without exposing untrusted
    domains to information leaks & side channels, plus to ensure more
    deterministic computing performance on SMT systems used by
    heterogeneous workloads.

    There are new prctls to set core scheduling groups, which allows
    more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
    wakeups and rename it to ->__state in the process to catch new
    abuses.

    - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
    workloads.

    - "Age" (decay) average idle time, to better track & improve
    workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
    bursty CPU-bound workloads to borrow a bit against their future
    quota to improve overall latencies & batching. Can be tweaked via
    /sys/fs/cgroup/cpu//cpu.cfs_burst_us.

    - Rework asymmetric topology/capacity detection & handling.

    - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
    runtime if tooling needs it. Use static keys and other
    optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

    - Misc cleanups and fixes.

    * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/doc: Update the CPU capacity asymmetry bits
    sched/topology: Rework CPU capacity asymmetry detection
    sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
    psi: Fix race between psi_trigger_create/destroy
    sched/fair: Introduce the burstable CFS controller
    sched/uclamp: Fix uclamp_tg_restrict()
    sched/rt: Fix Deadline utilization tracking during policy change
    sched/rt: Fix RT utilization tracking during policy change
    sched: Change task_struct::state
    sched,arch: Remove unused TASK_STATE offsets
    sched,timer: Use __set_current_state()
    sched: Add get_current_state()
    sched,perf,kvm: Fix preemption condition
    sched: Introduce task_is_running()
    sched: Unbreak wakeups
    sched/fair: Age the average idle time
    sched/cpufreq: Consider reduced CPU capacity in energy calculation
    sched/fair: Take thermal pressure into account while estimating energy
    thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
    sched/fair: Return early from update_tg_cfs_load() if delta == 0
    ...

    Linus Torvalds
     

28 Jun, 2021

1 commit

  • This reverts commits 4bad58ebc8bc4f20d89cff95417c9b4674769709 (and
    399f8dd9a866e107639eabd3c1979cd526ca3a98, which tried to fix it).

    I do not believe these are correct, and I'm about to release 5.13, so am
    reverting them out of an abundance of caution.

    The locking is odd, and appears broken.

    On the allocation side (in __sigqueue_alloc()), the locking is somewhat
    straightforward: it depends on sighand->siglock. Since one caller
    doesn't hold that lock, it then tests 'sigqueue_flags' to avoid
    the case with no locks held.

    On the freeing side (in sigqueue_cache_or_free()), there is no locking
    at all, and the logic instead depends on 'current' being a single
    thread, and not able to race with itself.

    To make things more exciting, there's also the data race between freeing
    a signal and allocating one, which is handled by using WRITE_ONCE() and
    READ_ONCE(), and being mutually exclusive wrt the initial state (ie
    freeing will only free if the old state was NULL, while allocating will
    obviously only use the value if it was non-NULL, so only one or the
    other will actually act on the value).

    However, while the free->alloc paths do seem mutually exclusive thanks
    to just the data value dependency, it's not clear what the memory
    ordering constraints are on it. Could writes from the previous
    allocation possibly be delayed and seen by the new allocation later,
    causing logical inconsistencies?

    So it's all very exciting and unusual.

    And in particular, it seems that the freeing side is incorrect in
    depending on "current" being single-threaded. Yes, 'current' is a
    single thread, but in the presence of asynchronous events even a single
    thread can have data races.

    And such asynchronous events can and do happen, with interrupts causing
    signals to be flushed and thus free'd (for example - sending a
    SIGCONT/SIGSTOP can happen from interrupt context, and can flush
    previously queued process control signals).

    So regardless of all the other questions about the memory ordering and
    locking for this new cached allocation, the sigqueue_cache_or_free()
    assumptions seem to be fundamentally incorrect.

    It may be that people will show me the errors of my ways, and tell me
    why this is all safe after all. We can reinstate it if so. But my
    current belief is that the WRITE_ONCE() that sets the cached entry needs
    to be a smp_store_release(), and the READ_ONCE() that finds a cached
    entry needs to be a smp_load_acquire() to handle memory ordering
    correctly.

    And the sequence in sigqueue_cache_or_free() would need to either use a
    lock or at least be interrupt-safe some way (perhaps by using something
    like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
    percpu operations it needs to be interrupt-safe).
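    As a rough illustration of the release/acquire pairing described above, a
    hypothetical one-slot cache in C11 atomics might look like the following
    (names are illustrative, not the kernel's, and this sketch sidesteps the
    interrupt-safety question by using a full compare-exchange):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct sigqueue_like { int data; };

/* One-slot per-task cache. NULL means "empty". */
static _Atomic(struct sigqueue_like *) cache = NULL;

/* Allocation side: take the cached entry if present, else fall back to
 * the allocator. The acquire exchange pairs with the release store in
 * cache_free(), so writes made before caching are visible here. */
struct sigqueue_like *cache_alloc(void)
{
    struct sigqueue_like *q =
        atomic_exchange_explicit(&cache, NULL, memory_order_acquire);
    if (!q)
        q = malloc(sizeof(*q));
    return q;
}

/* Freeing side: park the entry only if the slot is still empty,
 * otherwise hand it back to the allocator. */
void cache_free(struct sigqueue_like *q)
{
    struct sigqueue_like *expected = NULL;
    if (!atomic_compare_exchange_strong_explicit(&cache, &expected, q,
            memory_order_release, memory_order_relaxed))
        free(q);
}
```

    A freed entry parked in the slot is handed back by the next allocation,
    with the acquire load guaranteeing that everything written before the
    release is visible to the new owner.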

    Fixes: 399f8dd9a866 ("signal: Prevent sigqueue caching after task got released")
    Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct")
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Christian Brauner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Jun, 2021

1 commit

  • syzbot reported a memory leak related to sigqueue caching.

    The assumption that a task cannot cache a sigqueue after the signal handler
    has been dropped and exit_task_sigqueue_cache() has been invoked turns out
    to be wrong.

    Such a task can still invoke release_task(other_task), which cleans up the
    signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in
    turn will cache the signal because task->sigqueue_cache is NULL. That's
    obviously bogus because nothing will free the cached signal of that task
    anymore, so the cached item is leaked.

    This happens when e.g. the last non-leader thread exits and reaps the
    zombie leader.

    Prevent this by setting tsk::sigqueue_cache to an error pointer value in
    exit_task_sigqueue_cache() which forces any subsequent invocation of
    sigqueue_cache_or_free() from that task to hand the sigqueue back to the
    kmemcache.

    Add comments to all relevant places.
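    The sentinel idea can be sketched in plain C (all names hypothetical; the
    kernel uses an ERR_PTR() value rather than a literal 1):

```c
#include <stdlib.h>

struct sq { int pad; };

/* Sentinel that is neither NULL nor a valid pointer: once the cache is
 * torn down, the slot holds this value so that later frees go straight
 * back to the allocator instead of repopulating a dead cache. */
#define SQ_CACHE_DEAD ((struct sq *)1)

static struct sq *cache_slot;  /* NULL = empty, SQ_CACHE_DEAD = torn down */

void exit_task_sq_cache(void)
{
    struct sq *q = cache_slot;

    cache_slot = SQ_CACHE_DEAD;        /* poison: no further caching */
    if (q && q != SQ_CACHE_DEAD)
        free(q);
}

/* Returns 1 if 'q' was cached, 0 if the caller must free it. */
int sq_cache_or_free(struct sq *q)
{
    if (cache_slot == NULL) {          /* empty: cache it */
        cache_slot = q;
        return 1;
    }
    return 0;                          /* occupied or dead: hand it back */
}
```

    After exit_task_sq_cache() has poisoned the slot, every later
    sq_cache_or_free() call refuses to cache, which is exactly what closes
    the leak described above.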

    Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct")
    Reported-by: syzbot+0bac5fec63d4f399ba98@syzkaller.appspotmail.com
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Oleg Nesterov
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/878s32g6j5.ffs@nanos.tec.linutronix.de

    Thomas Gleixner
     

18 Jun, 2021

1 commit

  • Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
    task_is_running(p).
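    The shape of the helper, sketched here with a stand-in task structure
    rather than the kernel's struct task_struct:

```c
#define TASK_RUNNING 0x0000

struct task { unsigned int state; };

/* Replaces open-coded 'p->state == TASK_RUNNING' comparisons with a
 * single named predicate, so later changes to the state encoding have
 * one place to go. */
static inline int task_is_running(const struct task *p)
{
    return p->state == TASK_RUNNING;
}
```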

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Davidlohr Bueso
    Acked-by: Geert Uytterhoeven
    Acked-by: Will Deacon
    Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org

    Peter Zijlstra
     

22 May, 2021

1 commit

  • …iederm/user-namespace

    Pull siginfo fix from Eric Biederman:
    "During the merge window an issue with si_perf and the siginfo ABI came
    up. The alpha and sparc siginfo structure layout had changed with the
    addition of SIGTRAP TRAP_PERF and the new field si_perf.

    The reason only alpha and sparc were affected is that they are the
    only architectures that use si_trapno.

    Looking deeper it was discovered that si_trapno is used for only a few
    select signals on alpha and sparc, and that none of the other
    _sigfault fields past si_addr are used at all. Which means technically
    no regression on alpha and sparc.

    While the alignment concerns might be dismissed the abuse of si_errno
    by SIGTRAP TRAP_PERF does have the potential to cause regressions in
    existing userspace.

    While we still have time before userspace starts using and depending
    on the new definition siginfo for SIGTRAP TRAP_PERF this set of
    changes cleans up siginfo_t.

    - The si_trapno field is demoted from magic alpha and sparc status
    and made an ordinary union member of the _sigfault member of
    siginfo_t. Without moving it of course.

    - si_perf is replaced with si_perf_data and si_perf_type ending the
    abuse of si_errno.

    - Unnecessary additions to signalfd_siginfo are removed"

    * 'for-v5.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo
    signal: Deliver all of the siginfo perf data in _perf
    signal: Factor force_sig_perf out of perf_sigtrap
    signal: Implement SIL_FAULT_TRAPNO
    siginfo: Move si_trapno inside the union inside _si_fault

    Linus Torvalds
     

19 May, 2021

4 commits

  • Don't abuse si_errno and deliver all of the perf data in _perf member
    of siginfo_t.

    Note: The data field in the perf data structures is a u64, to allow a
    pointer to be encoded without needing to implement a 32-bit and 64-bit
    version of the same structure. There already exist 32-bit and 64-bit
    versions of siginfo_t, and the 32-bit version cannot include a 64-bit
    member as it only has 32-bit alignment. So unsigned long is used in
    siginfo_t instead of a u64, as unsigned long can encode a pointer on
    all architectures Linux supports.
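    The property the note relies on, that unsigned long can round-trip a
    pointer on every supported architecture, can be sketched as (helper names
    hypothetical):

```c
/* Linux targets only data models where unsigned long is at least as
 * wide as a pointer (ILP32 and LP64), so a pointer survives a round
 * trip through unsigned long. */
unsigned long encode_ptr(const void *p)
{
    return (unsigned long)p;
}

const void *decode_ptr(unsigned long v)
{
    return (const void *)v;
}
```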

    v1: https://lkml.kernel.org/r/m11rarqqx2.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210503203814.25487-10-ebiederm@xmission.com
    v3: https://lkml.kernel.org/r/20210505141101.11519-11-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20210517195748.8880-4-ebiederm@xmission.com
    Reviewed-by: Marco Elver
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Separate filling in siginfo for TRAP_PERF from deciding that the
    signal needs to be sent.

    There are enough little details that need to be correct when
    properly filling in siginfo_t that it is easy to make mistakes
    if filling in the siginfo_t is in the same function with other
    logic. So factor out force_sig_perf to reduce the cognitive
    load on reviewers, maintainers and implementors.

    v1: https://lkml.kernel.org/r/m17dkjqqxz.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210505141101.11519-10-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20210517195748.8880-3-ebiederm@xmission.com
    Reviewed-by: Marco Elver
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Now that si_trapno is part of the union in _si_fault and available on
    all architectures, add SIL_FAULT_TRAPNO and update siginfo_layout to
    return SIL_FAULT_TRAPNO when the code assumes si_trapno is valid.

    There is room for future changes to reduce when si_trapno is valid but
    this is all that is needed to make si_trapno and the other members of
    the union in _sigfault mutually exclusive.

    Update the code that uses siginfo_layout to deal with SIL_FAULT_TRAPNO
    and have the same code ignore si_trapno in all other cases.

    v1: https://lkml.kernel.org/r/m1o8dvs7s7.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210505141101.11519-6-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20210517195748.8880-2-ebiederm@xmission.com
    Reviewed-by: Marco Elver
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • It turns out that Linux uses si_trapno very sparingly, and as such it
    can be considered extra information for a very narrow selection of
    signals, rather than information that is present with every fault
    reported in siginfo.

    As such move si_trapno inside the union inside of _si_fault. This
    results in no change in placement, and makes it easier
    to extend _si_fault in the future as this reduces the number of
    special cases. In particular with si_trapno included in the union it
    is no longer a concern that the union must be pointer aligned on most
    architectures because the union follows immediately after si_addr
    which is a pointer.
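    The "no change in placement" argument can be illustrated with a simplified
    stand-in for _sigfault (not the real siginfo_t layout; field names are
    abbreviated):

```c
#include <stddef.h>

/* si_trapno moves inside the union that follows si_addr without its
 * offset changing: the union's first alternative starts exactly where
 * the standalone field used to live, and because the union sits right
 * after a pointer member, its pointer alignment comes for free. */
struct sigfault_like {
    void *si_addr;
    union {
        int trapno;                  /* was a standalone field here */
        struct { char lsb; } addr_bnd_like;
        unsigned long perf_data;
    } u;
};
```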

    This change results in a difference in siginfo field placement on
    sparc and alpha for the fields si_addr_lsb, si_lower, si_upper,
    si_pkey, and si_perf. These architectures do not implement the
    signals that would use si_addr_lsb, si_lower, si_upper, si_pkey, and
    si_perf. Further these architecture have not yet implemented the
    userspace that would use si_perf.

    The point of this change is in fact to correct these placement issues
    before sparc or alpha grow userspace that cares. This change was
    discussed[1] and the agreement is that this change is currently safe.

    [1]: https://lkml.kernel.org/r/CAK8P3a0+uKYwL1NhY6Hvtieghba2hKYGD6hcKx5n8=4Gtt+pHA@mail.gmail.com
    Acked-by: Marco Elver
    v1: https://lkml.kernel.org/r/m1tunns7yf.fsf_-_@fess.ebiederm.org
    v2: https://lkml.kernel.org/r/20210505141101.11519-5-ebiederm@xmission.com
    Link: https://lkml.kernel.org/r/20210517195748.8880-1-ebiederm@xmission.com
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

01 May, 2021

1 commit

  • The rlimit counter is tied to uid in the user_namespace. This allows
    rlimit values to be specified in userns even if they are already
    globally exceeded by the user. However, the value of the previous
    user_namespaces cannot be exceeded.
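    The hierarchical limit rule, where a charge only succeeds if every
    ancestor namespace's limit also allows it, can be sketched as follows (all
    names hypothetical; the real implementation lives in the kernel's ucounts
    code):

```c
#include <stddef.h>

struct ns { struct ns *parent; long count, max; };

/* Charge one unit against 'ns' and every ancestor. Fails (and charges
 * nothing) if any level would exceed its limit, so a child namespace
 * can never exceed the limits of the user that created it. */
int try_charge(struct ns *ns)
{
    struct ns *n;

    for (n = ns; n; n = n->parent)
        if (n->count + 1 > n->max)
            return 0;              /* over some ancestor's limit */
    for (n = ns; n; n = n->parent)
        n->count++;
    return 1;
}
```

    A child namespace may advertise a generous limit of its own, but the walk
    up the parent chain still enforces the stricter ancestor limits.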

    Changelog

    v11:
    * Revert most of changes to fix performance issues.

    v10:
    * Fix memory leak on get_ucounts failure.

    Signed-off-by: Alexey Gladkov
    Link: https://lkml.kernel.org/r/df9d7764dddd50f28616b7840de74ec0f81711a8.1619094428.git.legion@kernel.org
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

29 Apr, 2021

2 commits

  • Pull scheduler updates from Ingo Molnar:

    - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and
    debugfs interfaces to a unified debugfs interface.

    - Signals: Allow caching one sigqueue object per task, to improve
    performance & latencies.

    - Improve newidle_balance() irq-off latencies on systems with a large
    number of CPU cgroups.

    - Improve energy-aware scheduling

    - Improve the PELT metrics for certain workloads

    - Reintroduce select_idle_smt() to improve load-balancing locality -
    but without the previous regressions

    - Add 'scheduler latency debugging': warn after long periods of pending
    need_resched. This is an opt-in feature that requires the enabling of
    the LATENCY_WARN scheduler feature, or the use of the
    resched_latency_warn_ms=xx boot parameter.

    - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix
    remaining balance_push() vs. hotplug holes/races

    - PSI fixes, plus allow /proc/pressure/ files to be written by
    CAP_SYS_RESOURCE tasks as well

    - Fix/improve various load-balancing corner cases vs. capacity margins

    - Fix sched topology on systems with NUMA diameter of 3 or above

    - Fix PF_KTHREAD vs to_kthread() race

    - Minor rseq optimizations

    - Misc cleanups, optimizations, fixes and smaller updates

    * tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
    cpumask/hotplug: Fix cpu_dying() state tracking
    kthread: Fix PF_KTHREAD vs to_kthread() race
    sched/debug: Fix cgroup_path[] serialization
    sched,psi: Handle potential task count underflow bugs more gracefully
    sched: Warn on long periods of pending need_resched
    sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning
    sched/debug: Rename the sched_debug parameter to sched_verbose
    sched,fair: Alternative sched_slice()
    sched: Move /proc/sched_debug to debugfs
    sched,debug: Convert sysctl sched_domains to debugfs
    debugfs: Implement debugfs_create_str()
    sched,preempt: Move preempt_dynamic to debug.c
    sched: Move SCHED_DEBUG sysctl to debugfs
    sched: Don't make LATENCYTOP select SCHED_DEBUG
    sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG
    sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG
    sched: Use cpu_dying() to fix balance_push vs hotplug-rollback
    cpumask: Introduce DYING mask
    cpumask: Make cpu_{online,possible,present,active}() inline
    rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()
    ...

    Linus Torvalds
     
  • Pull perf event updates from Ingo Molnar:

    - Improve Intel uncore PMU support:

    - Parse uncore 'discovery tables' - a new hardware capability
    enumeration method introduced on the latest Intel platforms. This
    table is in a well-defined PCI namespace location and is read via
    MMIO. It is organized in an rbtree.

    These uncore tables will allow the discovery of standard counter
    blocks, but fancier counters still need to be enumerated
    explicitly.

    - Add Alder Lake support

    - Improve IIO stacks to PMON mapping support on Skylake servers

    - Add Intel Alder Lake PMU support - which requires the introduction of
    'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
    and Gracemont ('small' - Atom derived) cores.

    The CPU-side feature set is entirely symmetrical - but on the PMU
    side there's core type dependent PMU functionality.

    - Reduce data loss with CPU level hardware tracing on Intel PT / AUX
    profiling, by fixing the AUX allocation watermark logic.

    - Improve ring buffer allocation on NUMA systems

    - Put 'struct perf_event' into their separate kmem_cache pool

    - Add support for synchronous signals for select perf events. The
    immediate motivation is to support low-overhead sampling-based race
    detection for user-space code. The feature consists of the following
    main changes:

    - Add thread-only event inheritance via
    perf_event_attr::inherit_thread, which limits inheritance of
    events to CLONE_THREAD.

    - Add the ability for events to not leak through exec(), via
    perf_event_attr::remove_on_exec.

    - Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
    extend siginfo with an u64 ::si_perf, and add the breakpoint
    information to ::si_addr and ::si_perf if the event is
    PERF_TYPE_BREAKPOINT.

    The siginfo support is adequate for breakpoints right now - but the
    new field can be used to introduce support for other types of
    metadata passed over siginfo as well.

    - Misc fixes, cleanups and smaller updates.

    * tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    signal, perf: Add missing TRAP_PERF case in siginfo_layout()
    signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
    perf/x86: Allow for 8<16
    perf/x86/rapl: Add support for Intel Alder Lake
    perf/x86/cstate: Add Alder Lake CPU support
    perf/x86/msr: Add Alder Lake CPU support
    perf/x86/intel/uncore: Add Alder Lake support
    perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
    perf/x86/intel: Add Alder Lake Hybrid support
    perf/x86: Support filter_match callback
    perf/x86/intel: Add attr_update for Hybrid PMUs
    perf/x86: Add structures for the attributes of Hybrid PMUs
    perf/x86: Register hybrid PMUs
    perf/x86: Factor out x86_pmu_show_pmu_cap
    perf/x86: Remove temporary pmu assignment in event_init
    perf/x86/intel: Factor out intel_pmu_check_extra_regs
    perf/x86/intel: Factor out intel_pmu_check_event_constraints
    perf/x86/intel: Factor out intel_pmu_check_num_counters
    perf/x86: Hybrid PMU support for extra_regs
    perf/x86: Hybrid PMU support for event constraints
    ...

    Linus Torvalds
     

28 Apr, 2021

1 commit


23 Apr, 2021

1 commit

  • Add the missing TRAP_PERF case in siginfo_layout() for interpreting the
    layout correctly as SIL_PERF_EVENT instead of just SIL_FAULT. This
    ensures the si_perf field is copied and not just the si_addr field.
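    The fix amounts to adding a case to a layout dispatch on si_code; a
    simplified sketch (the real siginfo_layout() covers many more signals and
    codes, and TRAP_PERF's value here mirrors the UAPI definition):

```c
/* Which union layout is live decides which fields a 32<->64-bit compat
 * copy must carry across. */
enum sil { SIL_FAULT, SIL_PERF_EVENT };

#define TRAP_PERF 6    /* per the UAPI siginfo headers */

enum sil layout_for_sigtrap(int si_code)
{
    switch (si_code) {
    case TRAP_PERF:
        return SIL_PERF_EVENT;     /* copy si_addr AND si_perf */
    default:
        return SIL_FAULT;          /* copy si_addr only */
    }
}
```

    Without the TRAP_PERF case, a compat copy falls into the SIL_FAULT default
    and silently drops si_perf, which is the bug this commit fixes.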

    This was caught and tested by running the perf_events/sigtrap_threads
    kselftest as a 32-bit binary with a 64-bit kernel.

    Fixes: fb6cc127e0b6 ("signal: Introduce TRAP_PERF si_code and si_perf to siginfo")
    Signed-off-by: Marco Elver
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20210422191823.79012-2-elver@google.com

    Marco Elver
     

20 Apr, 2021

1 commit