09 Jan, 2021

1 commit

  • [ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]

    Recently syzbot reported[0] that there is a deadlock amongst the users
    of exec_update_mutex. The problematic lock ordering found by lockdep
    was:

    perf_event_open  (exec_update_mutex -> ovl_i_mutex)
    chown            (ovl_i_mutex       -> sb_writers)
    sendfile         (sb_writers        -> p->lock)
      by reading from a proc file and writing to overlayfs
    proc_pid_syscall (p->lock           -> exec_update_mutex)

    While looking at possible solutions it occurred to me that all of the
    users and possible users involved only wanted the state of the given
    process to remain the same. They are all readers. The only writer is
    exec.

    There is no reason for readers to block on each other. So fix
    this deadlock by transforming exec_update_mutex into a rw_semaphore
    named exec_update_lock that only exec takes for writing.
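
    For illustration, a minimal sketch of the reader/writer pattern described
    above, assuming the lock sits in signal_struct as exec_update_lock (as in
    mainline); the two functions are illustrative, not the actual call sites:

    #include <linux/rwsem.h>
    #include <linux/sched/signal.h>

    /* Readers (perf_event_open(), proc_pid_syscall(), ...) no longer block
     * each other; they only exclude a concurrent exec. */
    static int inspect_task(struct task_struct *task)
    {
            int err = down_read_killable(&task->signal->exec_update_lock);

            if (err)
                    return err;
            /* ... look at credentials / mm that must not change under us ... */
            up_read(&task->signal->exec_update_lock);
            return 0;
    }

    /* exec is the only writer. */
    static void exec_update(struct task_struct *task)
    {
            down_write(&task->signal->exec_update_lock);
            /* ... install the new credentials and mm ... */
            up_write(&task->signal->exec_update_lock);
    }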

    Cc: Jann Horn
    Cc: Vasiliy Kulikov
    Cc: Al Viro
    Cc: Bernd Edlinger
    Cc: Oleg Nesterov
    Cc: Christopher Yeoh
    Cc: Cyrill Gorcunov
    Cc: Sargun Dhillon
    Cc: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
    [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
    Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
    Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     

01 Oct, 2020

1 commit

  • Grab actual references to the files_struct. To avoid circular reference
    issues due to this, we add a per-task note that keeps track of what
    io_uring contexts a task has used. When the task execs or exits its
    assigned files, we cancel requests based on this tracking.

    With that, we can grab proper references to the files table, and no
    longer need to rely on stashing away ring_fd and ring_file to check
    if the ring_fd may have been closed.
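
    A rough sketch of the tracking idea only; the structure, the list that
    would hang off task_struct, and the cancel callback below are hypothetical
    placeholders, not the actual io_uring symbols:

    #include <linux/errno.h>
    #include <linux/list.h>
    #include <linux/slab.h>

    struct io_ring_ctx;                     /* opaque here */

    /* Hypothetical per-task note recording which io_uring contexts were used. */
    struct ctx_note {
            struct list_head        list;
            struct io_ring_ctx      *ctx;
    };

    /* Hypothetical: record the first use of a ring; 'notes' would be a
     * list_head kept on the task. */
    static int note_ctx_use(struct list_head *notes, struct io_ring_ctx *ctx)
    {
            struct ctx_note *note = kmalloc(sizeof(*note), GFP_KERNEL);

            if (!note)
                    return -ENOMEM;
            note->ctx = ctx;
            list_add(&note->list, notes);
            return 0;
    }

    /* Hypothetical: walk the notes on exec/exit and cancel pending requests. */
    static void cancel_noted_requests(struct list_head *notes,
                                      void (*cancel)(struct io_ring_ctx *))
    {
            struct ctx_note *note, *tmp;

            list_for_each_entry_safe(note, tmp, notes, list) {
                    cancel(note->ctx);
                    list_del(&note->list);
                    kfree(note);
            }
    }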

    Cc: stable@vger.kernel.org # v5.5+
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Aug, 2020

1 commit

  • Pull OpenRISC updates from Stafford Horne:
    "A few patches all over the place during this cycle, mostly bug and
    sparse warning fixes for OpenRISC, but a few enhancements too. Note,
    there are two non-OpenRISC-specific fixups.

    Non OpenRISC fixes:

    - In init we need to align the init_task correctly to fix an issue
    with MUTEX_FLAGS, reviewed by Peter Z. No one picked this up so I
    kept it on my tree.

    - In asm-generic/io.h I fixed up some sparse warnings, OK'd by Arnd.
    Arnd asked to merge it via my tree.

    OpenRISC fixes:

    - Many fixes for OpenRISC sparse warnings.

    - Add support for OpenRISC SMP TLB flushing rather than always flushing
    the entire TLB on every CPU.

    - Fix bug when dumping stack via /proc/xxx/stack of user threads"

    * tag 'for-linus' of git://github.com/openrisc/linux:
    openrisc: uaccess: Add user address space check to access_ok
    openrisc: signal: Fix sparse address space warnings
    openrisc: uaccess: Remove unused macro __addr_ok
    openrisc: uaccess: Use static inline function in access_ok
    openrisc: uaccess: Fix sparse address space warnings
    openrisc: io: Fixup defines and move include to the end
    asm-generic/io.h: Fix sparse warnings on big-endian architectures
    openrisc: Implement proper SMP tlb flushing
    openrisc: Fix oops caused when dumping stack
    openrisc: Add support for external initrd images
    init: Align init_task to avoid conflict with MUTEX_FLAGS
    openrisc: fix __user in raw_copy_to_user()'s prototype

    Linus Torvalds
     

11 Aug, 2020

1 commit

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers, contrary to seqlock writers, must be externally
    serialized, which usually happens via locking - except for strict
    per-CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks.
    The sequence count now has lock-type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled, the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside of the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful, this comes with a Trojan Horse twist: on RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well-known reader-preempts-writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from <asm/smp.h>
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     

04 Aug, 2020

1 commit

  • When booting on 32-bit machines (seen on OpenRISC) I saw this warning
    with CONFIG_DEBUG_MUTEXES turned on.

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 0 at kernel/locking/mutex.c:1242 __mutex_unlock_slowpath+0x328/0x3ec
    DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-rc1-simple-smp-00005-g2864e2171db4-dirty #179
    Call trace:
    [] dump_stack+0x34/0x48
    [] __warn+0x104/0x158
    [] ? __mutex_unlock_slowpath+0x328/0x3ec
    [] warn_slowpath_fmt+0x7c/0x94
    [] __mutex_unlock_slowpath+0x328/0x3ec
    [] mutex_unlock+0x18/0x28
    [] __cpuhp_setup_state_cpuslocked.part.0+0x29c/0x2f4
    [] ? page_alloc_cpu_dead+0x0/0x30
    [] ? start_kernel+0x0/0x684
    [] __cpuhp_setup_state+0x4c/0x5c
    [] page_alloc_init+0x34/0x68
    [] ? start_kernel+0x1a0/0x684
    [] ? early_init_dt_scan_nodes+0x60/0x70
    irq event stamp: 0

    I traced this to kernel/locking/mutex.c storing 3 bits of MUTEX_FLAGS in
    the task_struct pointer (mutex.owner). There is a comment saying that
    task_structs are always aligned to L1_CACHE_BYTES. This is not true for
    the init_task.

    On 64-bit machines this is not a problem because symbol addresses are
    naturally aligned to 64 bits, providing 3 bits for MUTEX_FLAGS. However,
    for 32-bit machines the symbol address only has 2 bits available.

    Fix this by setting init_task alignment to at least L1_CACHE_BYTES.
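
    For illustration, the relevant pieces in shape (a sketch, not the full
    patch): the flag mask and helper mirror kernel/locking/mutex.c, and the
    fix boils down to an explicit alignment attribute on init_task's
    definition (real initializers omitted here):

    #include <linux/cache.h>
    #include <linux/sched.h>

    #define MUTEX_FLAGS     0x07UL  /* waiters | handoff | pickup, low bits */

    /* Owner pointer and flags share one word, so task_struct addresses need
     * at least 8-byte alignment for the 3 flag bits to be spare. */
    static inline struct task_struct *__owner_task(unsigned long owner)
    {
            return (struct task_struct *)(owner & ~MUTEX_FLAGS);
    }

    /* Shape of the fix: the statically allocated init_task gets the same
     * alignment guarantee that dynamically allocated task_structs have. */
    struct task_struct init_task __aligned(L1_CACHE_BYTES);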

    Signed-off-by: Stafford Horne
    Acked-by: Peter Zijlstra (Intel)

    Stafford Horne
     

29 Jul, 2020

1 commit

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_spinlock_t data type, which allows associating a
    spinlock with the sequence counter. This enables lockdep to verify that
    the spinlock used for writer serialization is held when the write side
    critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.
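
    A minimal usage sketch of the seqcount_spinlock_t type described above
    (illustrative data structure, not code from the patch itself):

    #include <linux/seqlock.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct foo_state {
            spinlock_t              lock;
            seqcount_spinlock_t     seq;    /* associated with ->lock */
            u64                     a, b;
    };

    static void foo_init(struct foo_state *f)
    {
            spin_lock_init(&f->lock);
            seqcount_spinlock_init(&f->seq, &f->lock);
    }

    static void foo_update(struct foo_state *f, u64 a, u64 b)
    {
            spin_lock(&f->lock);            /* writer serialization ... */
            write_seqcount_begin(&f->seq);  /* ... which lockdep can now verify */
            f->a = a;
            f->b = b;
            write_seqcount_end(&f->seq);
            spin_unlock(&f->lock);
    }

    static u64 foo_read_sum(struct foo_state *f)
    {
            unsigned int seq;
            u64 sum;

            do {
                    seq = read_seqcount_begin(&f->seq);
                    sum = f->a + f->b;
            } while (read_seqcount_retry(&f->seq, seq));
            return sum;
    }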

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de

    Ahmed S. Darwish
     

11 Jul, 2020

1 commit


12 Jun, 2020

1 commit

  • Merge the state of the locking kcsan branch before the read/write_once()
    and the atomics modifications got merged.

    Squash the fallout of the rebase on top of the read/write once and atomic
    fallback work into the merge. The history of the original branch is
    preserved in tag locking-kcsan-2020-06-02.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

10 Jun, 2020

1 commit

  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdefs magic.
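
    The override pattern referred to above looks roughly like this in the
    generic header (illustrative; the generic body matches the example shown
    earlier):

    /* Generic version, compiled only if the architecture did not provide one. */
    #ifndef pmd_index
    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }
    #define pmd_index pmd_index
    #endif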

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <linux/mm.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

02 Jun, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "A sizeable pile of arm64 updates for 5.8.

    Summary below, but the big two features are support for Branch Target
    Identification and Clang's Shadow Call stack. The latter is currently
    arm64-only, but the high-level parts are all in core code so it could
    easily be adopted by other architectures pending toolchain support

    Branch Target Identification (BTI):

    - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
    branch targets to limit the types of branch from which they can be
    called and additionally prevents branching to arbitrary code,
    although kernel support requires a very recent toolchain.

    - Function annotation via SYM_FUNC_START() so that assembly functions
    are wrapped with the relevant "landing pad" instructions.

    - BPF and vDSO updates to use the new instructions.

    - Addition of a new HWCAP and exposure of BTI capability to userspace
    via ID register emulation, along with ELF loader support for the
    BTI feature in .note.gnu.property.

    - Non-critical fixes to CFI unwind annotations in the sigreturn
    trampoline.

    Shadow Call Stack (SCS):

    - Support for Clang's Shadow Call Stack feature, which reserves
    platform register x18 to point at a separate stack for each task
    that holds only return addresses. This protects function return
    control flow from buffer overruns on the main stack.

    - Save/restore of x18 across problematic boundaries (user-mode,
    hypervisor, EFI, suspend, etc).

    - Core support for SCS, should other architectures want to use it
    too.

    - SCS overflow checking on context-switch as part of the existing
    stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

    CPU feature detection:

    - Removed numerous "SANITY CHECK" errors when running on a system
    with mismatched AArch32 support at EL1. This is primarily a concern
    for KVM, which disabled support for 32-bit guests on such a system.

    - Addition of new ID registers and fields as the architecture has
    been extended.

    Perf and PMU drivers:

    - Minor fixes and cleanups to system PMU drivers.

    Hardware errata:

    - Unify KVM workarounds for VHE and nVHE configurations.

    - Sort vendor errata entries in Kconfig.

    Secure Monitor Call Calling Convention (SMCCC):

    - Update to the latest specification from Arm (v1.2).

    - Allow PSCI code to query the SMCCC version.

    Software Delegated Exception Interface (SDEI):

    - Unexport a bunch of unused symbols.

    - Minor fixes to handling of firmware data.

    Pointer authentication:

    - Add support for dumping the kernel PAC mask in vmcoreinfo so that
    the stack can be unwound by tools such as kdump.

    - Simplification of key initialisation during CPU bringup.

    BPF backend:

    - Improve immediate generation for logical and add/sub instructions.

    vDSO:

    - Minor fixes to the linker flags for consistency with other
    architectures and support for LLVM's unwinder.

    - Clean up logic to initialise and map the vDSO into userspace.

    ACPI:

    - Work around for an ambiguity in the IORT specification relating to
    the "num_ids" field.

    - Support _DMA method for all named components rather than only PCIe
    root complexes.

    - Minor other IORT-related fixes.

    Miscellaneous:

    - Initialise debug traps early for KGDB and fix KDB cacheflushing
    deadlock.

    - Minor tweaks to early boot state (documentation update, set
    TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

    - Refactoring and cleanup"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
    KVM: arm64: Check advertised Stage-2 page size capability
    arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
    ACPI/IORT: Remove the unused __get_pci_rid()
    arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
    arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
    arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
    arm64/cpufeature: Introduce ID_MMFR5 CPU register
    arm64/cpufeature: Introduce ID_DFR1 CPU register
    arm64/cpufeature: Introduce ID_PFR2 CPU register
    arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
    arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
    arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
    arm64: mm: Add asid_gen_match() helper
    firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
    arm64: vdso: Fix CFI directives in sigreturn trampoline
    arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
    ...

    Linus Torvalds
     

15 May, 2020

1 commit

  • This change adds generic support for Clang's Shadow Call Stack,
    which uses a shadow stack to protect return addresses from being
    overwritten by an attacker. Details are available here:

    https://clang.llvm.org/docs/ShadowCallStack.html

    Note that security guarantees in the kernel differ from the ones
    documented for user space. The kernel must store addresses of
    shadow stacks in memory, which means an attacker capable of reading
    and writing arbitrary memory may be able to locate them and hijack
    control flow by modifying the stacks.

    Signed-off-by: Sami Tolvanen
    Reviewed-by: Kees Cook
    Reviewed-by: Miguel Ojeda
    [will: Numerous cosmetic changes]
    Signed-off-by: Will Deacon

    Sami Tolvanen
     

08 May, 2020

1 commit


28 Apr, 2020

2 commits

  • This commit splits ->trc_reader_need_end by using the rcu_special union.
    This change permits readers to check to see if a memory barrier is
    required without any added overhead in the common case where no such
    barrier is required. This commit also adds the read-side checking.
    Later commits will add the machinery to properly set the new
    ->trc_reader_special.b.need_mb field.

    This commit also makes rcu_read_unlock_trace_special() tolerate nested
    read-side critical sections within interrupt and NMI handlers.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Because RCU does not watch exception early-entry/late-exit, idle-loop,
    or CPU-hotplug execution, protection of tracing and BPF operations is
    needlessly complicated. This commit therefore adds a variant of
    Tasks RCU that:

    o Has explicit read-side markers to allow finite grace periods in
    the face of in-kernel loops for PREEMPT=n builds. These markers
    are rcu_read_lock_trace() and rcu_read_unlock_trace(); a usage
    sketch follows below.

    o Protects code in the idle loop, exception entry/exit, and
    CPU-hotplug code paths. In this respect, RCU-tasks trace is
    similar to SRCU, but with lighter-weight readers.

    o Avoids expensive read-side instructions, having overhead similar
    to that of Preemptible RCU.

    There are of course downsides:

    o The grace-period code can send IPIs to CPUs, even when those
    CPUs are in the idle loop or in nohz_full userspace. This is
    mitigated by later commits.

    o It is necessary to scan the full tasklist, much as for Tasks RCU.

    o There is a single callback queue guarded by a single lock,
    again, much as for Tasks RCU. However, those early use cases
    that request multiple grace periods in quick succession are
    expected to do so from a single task, which makes the single
    lock almost irrelevant. If needed, multiple callback queues
    can be provided using any number of schemes.

    Perhaps most important, this variant of RCU does not affect the vanilla
    flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace
    readers can operate from idle, offline, and exception entry/exit in no
    way enables rcu_preempt and rcu_sched readers to do so.
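
    A usage sketch of the read-side markers named above, paired with the
    flavor's update-side wait; the data structure is illustrative, not from
    the patch:

    #include <linux/rcupdate_trace.h>
    #include <linux/slab.h>

    struct trampoline {
            void (*func)(void);
    };

    static struct trampoline *active_tramp;         /* updated below */

    static void call_active_tramp(void)
    {
            struct trampoline *t;

            rcu_read_lock_trace();                  /* explicit read-side marker */
            t = READ_ONCE(active_tramp);
            if (t)
                    t->func();
            rcu_read_unlock_trace();
    }

    static void replace_tramp(struct trampoline *new)
    {
            struct trampoline *old = active_tramp;

            WRITE_ONCE(active_tramp, new);          /* publish the new trampoline */
            synchronize_rcu_tasks_trace();          /* wait for all trace readers */
            kfree(old);                             /* old can no longer be in use */
    }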

    The memory ordering was outlined here:
    https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/

    This effort benefited greatly from off-list discussions of BPF
    requirements with Alexei Starovoitov and Andrii Nakryiko. At least
    some of the on-list discussions are captured in the Link: tags below.
    In addition, KCSAN was quite helpful in finding some early bugs.

    Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
    Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
    Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
    Cc: Alexei Starovoitov
    Cc: Andrii Nakryiko
    [ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
    [ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
    [ paulmck: Fix locking issue reported by rcutorture. ]
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

14 Apr, 2020

1 commit

  • This adds support for scoped accesses, where the memory range is checked
    for the duration of the scope. The feature is implemented by inserting
    the relevant access information into a list of scoped accesses for
    the current execution context, which are then checked (until removed)
    on every call (through instrumentation) into the KCSAN runtime.

    An alternative, more complex, implementation could set up a watchpoint for
    the scoped access, and keep the watchpoint set up. This, however, would
    require first exposing a handle to the watchpoint, as well as dealing
    with cases such as accesses by the same thread while the watchpoint is
    still set up (and several more cases). It is also doubtful if this would
    provide any benefit, since the majority of delay where the watchpoint
    is set up is likely due to the injected delays by KCSAN. Therefore,
    the implementation in this patch is simpler and avoids hurting KCSAN's
    main use-case (normal data race detection); it also implicitly increases
    scoped-access race-detection-ability due to increased probability of
    setting up watchpoints by repeatedly calling __kcsan_check_access()
    throughout the scope of the access.

    The implementation required adding an additional conditional branch to
    the fast-path. However, the microbenchmark showed a *speedup* of ~5%
    on the fast-path. This appears to be due to subtly improved codegen by
    GCC from moving get_ctx() and associated load of preempt_count earlier.
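
    A usage sketch of the scoped assertions built on this mechanism (the
    ASSERT_EXCLUSIVE_*_SCOPED macros come from the same series; the struct
    below is illustrative):

    #include <linux/kcsan-checks.h>

    struct stats {
            unsigned long calls;
            unsigned long bytes;
    };

    /* The scoped assertion registers the access range once and KCSAN keeps
     * re-checking it for the rest of the function scope, as described above. */
    static void stats_add(struct stats *s, unsigned long len)
    {
            ASSERT_EXCLUSIVE_WRITER_SCOPED(*s);

            s->calls++;
            s->bytes += len;
    }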

    Suggested-by: Boqun Feng
    Suggested-by: Paul E. McKenney
    Signed-off-by: Marco Elver
    Signed-off-by: Paul E. McKenney

    Marco Elver
     

13 Apr, 2020

1 commit


25 Mar, 2020

1 commit

  • The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace. The possible indefinite waits for
    userspace that I have identified are: The cred_guard_mutex is held in
    PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
    held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
    cred_guard_mutex is held over "get_user(futex_offset, ...") in
    exit_robust_list. The cred_guard_mutex held over copy_strings.

    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.

    In any of those cases the userspace process that the kernel is waiting
    for might make a different system call that winds up taking the
    cred_guard_mutex and result in deadlock.

    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary. Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.

    The plan is to switch the users of cred_guard_mutex to
    exec_update_mutex one by one. This lets us move forward while still
    being careful and not introducing any regressions.

    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Reviewed-by: Kirill Tkhai
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

21 Mar, 2020

1 commit

  • When setting up an access mask with kcsan_set_access_mask(), KCSAN will
    only report races if concurrent changes to bits set in access_mask are
    observed. Conveying access_mask via a separate call avoids introducing
    overhead in the common-case fast-path.
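
    Illustrative use of the access mask through ASSERT_EXCLUSIVE_BITS(),
    introduced alongside kcsan_set_access_mask(); the flag layout below is
    made up:

    #include <linux/kcsan-checks.h>

    #define MODE_MASK       0x000000ffUL    /* made-up field in the flags word */

    /* Report a race only if another thread concurrently changes bits inside
     * MODE_MASK; changes to the remaining bits are deliberately ignored. */
    static void set_mode(unsigned long *flags, unsigned long mode)
    {
            ASSERT_EXCLUSIVE_BITS(*flags, MODE_MASK);
            *flags = (*flags & ~MODE_MASK) | (mode & MODE_MASK);
    }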

    Acked-by: John Hubbard
    Signed-off-by: Marco Elver
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Marco Elver
     

16 Nov, 2019

1 commit

  • Kernel Concurrency Sanitizer (KCSAN) is a dynamic data-race detector for
    kernel space. KCSAN is a sampling watchpoint-based data-race detector.
    See the included Documentation/dev-tools/kcsan.rst for more details.

    This patch adds basic infrastructure, but does not yet enable KCSAN for
    any architecture.

    Signed-off-by: Marco Elver
    Acked-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney

    Marco Elver
     

18 Sep, 2019

1 commit

  • Pull core timer updates from Thomas Gleixner:
    "Timers and timekeeping updates:

    - A large overhaul of the posix CPU timer code which is a preparation
    for moving the CPU timer expiry out into task work so it can be
    properly accounted on the task/process.

    An update to the bogus permission checks will come later during the
    merge window as feedback was not complete before heading off for
    travel.

    - Switch the timerqueue code to use cached rbtrees and get rid of the
    homebrewn caching of the leftmost node.

    - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
    single function

    - Implement the separation of hrtimers to be forced to expire in hard
    interrupt context even when PREEMPT_RT is enabled and mark the
    affected timers accordingly.

    - Implement a mechanism for hrtimers and the timer wheel to protect
    RT against priority inversion and live lock issues when a (hr)timer
    which should be canceled is currently executing the callback.
    Instead of infinitely spinning, the task which tries to cancel the
    timer blocks on a per cpu base expiry lock which is held and
    released by the (hr)timer expiry code.

    - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
    resulting in faster access to timekeeping functions.

    - Updates to various clocksource/clockevent drivers and their device
    tree bindings.

    - The usual small improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    posix-cpu-timers: Fix permission check regression
    posix-cpu-timers: Always clear head pointer on dequeue
    hrtimer: Add a missing bracket and hide `migration_base' on !SMP
    posix-cpu-timers: Make expiry_active check actually work correctly
    posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
    tick: Mark sched_timer to expire in hard interrupt context
    hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
    x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
    posix-cpu-timers: Utilize timerqueue for storage
    posix-cpu-timers: Move state tracking to struct posix_cputimers
    posix-cpu-timers: Deduplicate rlimit handling
    posix-cpu-timers: Remove pointless comparisons
    posix-cpu-timers: Get rid of 64bit divisions
    posix-cpu-timers: Consolidate timer expiry further
    posix-cpu-timers: Get rid of zero checks
    rlimit: Rewrite non-sensical RLIMIT_CPU comment
    posix-cpu-timers: Respect INFINITY for hard RTTIME limit
    posix-cpu-timers: Switch thread group sampling to array
    posix-cpu-timers: Restructure expiry array
    posix-cpu-timers: Remove cputime_expires
    ...

    Linus Torvalds
     

28 Aug, 2019

1 commit


01 Aug, 2019

1 commit

  • CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
    CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
    functionality which today depends on CONFIG_PREEMPT.

    Switch the preemption code, scheduler and init task over to use
    CONFIG_PREEMPTION.

    That's the first step towards RT in that area. The more complex changes are
    coming separately.
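
    The conversion itself is mechanical; in shape (illustrative snippet, not
    a specific hunk from the patch):

    /* A check that used to read "#ifdef CONFIG_PREEMPT". */
    static const char *preempt_model(void)
    {
    #ifdef CONFIG_PREEMPTION        /* selected by both PREEMPT and PREEMPT_RT */
            return "preemptible";
    #else
            return "non-preemptible";
    #endif
    }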

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20190726212124.117528401@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

09 Jul, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - Remove the unused per rq load array and all its infrastructure, by
    Dietmar Eggemann.

    - Add utilization clamping support by Patrick Bellasi. This is a
    refinement of the energy aware scheduling framework with support for
    boosting of interactive and capping of background workloads: to make
    sure critical GUI threads get maximum frequency ASAP, and to make
    sure background processing doesn't unnecessarily move the cpufreq
    governor to higher frequencies and less energy-efficient CPU modes.

    - Add the bare minimum of tracepoints required for LISA EAS regression
    testing, by Qais Yousef - which allows automated testing of various
    power management features, including energy aware scheduling.

    - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
    kernel used to modify the scheduler's CPU affinity logic such as
    migrate_disable() - introduce the task->cpus_ptr value instead of
    taking the address of &task->cpus_allowed directly - by Sebastian
    Andrzej Siewior.

    - Misc optimizations, fixes, cleanups and small enhancements - see the
    Git log for details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    sched/uclamp: Add uclamp support to energy_compute()
    sched/uclamp: Add uclamp_util_with()
    sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
    sched/uclamp: Set default clamps for RT tasks
    sched/uclamp: Reset uclamp values on RESET_ON_FORK
    sched/uclamp: Extend sched_setattr() to support utilization clamping
    sched/core: Allow sched_setattr() to use the current policy
    sched/uclamp: Add system default clamps
    sched/uclamp: Enforce last task's UCLAMP_MAX
    sched/uclamp: Add bucket local max tracking
    sched/uclamp: Add CPU's clamp buckets refcounting
    sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
    sched/debug: Export the newly added tracepoints
    sched/debug: Add sched_overutilized tracepoint
    sched/debug: Add new tracepoint to track PELT at se level
    sched/debug: Add new tracepoints to track PELT at rq level
    sched/debug: Add a new sched_trace_*() helper functions
    sched/autogroup: Make autogroup_path() always available
    sched/wait: Deduplicate code with do-while
    sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
    ...

    Linus Torvalds
     

03 Jun, 2019

3 commits

  • Chain keys are computed using the Jenkins hash function, which needs an
    initial hash to start with. Dedicate a macro to make this clear and
    configurable. A later patch changes this initial chain key.

    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bvanassche@acm.org
    Cc: frederic@kernel.org
    Cc: ming.lei@redhat.com
    Cc: will.deacon@arm.com
    Link: https://lkml.kernel.org/r/20190506081939.74287-9-duyuyang@gmail.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     
  • Despite the fact that there is a lockdep_init_task() which does nothing,
    lockdep initializes tasks by assigning lockdep fields directly and does so
    inconsistently. Fix this by using lockdep_init_task().

    Signed-off-by: Yuyang Du
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bvanassche@acm.org
    Cc: frederic@kernel.org
    Cc: ming.lei@redhat.com
    Cc: will.deacon@arm.com
    Link: https://lkml.kernel.org/r/20190506081939.74287-8-duyuyang@gmail.com
    Signed-off-by: Ingo Molnar

    Yuyang Du
     
  • In commit:

    4b53a3412d66 ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper")

    the tsk_nr_cpus_allowed() wrapper was removed. There was not
    much difference in !RT, but in RT we used this to implement
    migrate_disable(). Within a migrate_disable() section the CPU mask is
    restricted to a single CPU while the "normal" CPU mask remains untouched.

    As an alternative implementation Ingo suggested to use:

    struct task_struct {
            const cpumask_t         *cpus_ptr;
            cpumask_t               cpus_mask;
    };

    with

    t->cpus_ptr = &t->cpus_mask;

    In -RT we then can switch the cpus_ptr to:

    t->cpus_ptr = &cpumask_of(task_cpu(p));

    in a migration disabled region. The rules are simple:

    - Code that 'uses' ->cpus_allowed would use the pointer.
    - Code that 'modifies' ->cpus_allowed would use the direct mask.
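
    In code, the two rules look roughly like this (illustrative helpers, not
    hunks from the patch):

    #include <linux/cpumask.h>
    #include <linux/sched.h>

    /* 'uses': read through the pointer, which -RT can point at a single-CPU
     * mask inside a migrate_disable() section. */
    static bool task_allowed_on(struct task_struct *p, int cpu)
    {
            return cpumask_test_cpu(cpu, p->cpus_ptr);
    }

    /* 'modifies': affinity changes still write the real mask directly. */
    static void task_set_allowed(struct task_struct *p,
                                 const struct cpumask *new_mask)
    {
            cpumask_copy(&p->cpus_mask, new_mask);
    }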

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

08 Mar, 2019

1 commit

  • Pull audit updates from Paul Moore:
    "A lucky 13 audit patches for v5.1.

    Despite the rather large diffstat, most of the changes are from two
    bug fix patches that move code from one Kconfig option to another.

    Beyond that bit of churn, the remaining changes are largely cleanups
    and bug-fixes as we slowly march towards container auditing. It isn't
    all boring though, we do have a couple of new things: file
    capabilities v3 support, and expanded support for filtering on
    filesystems to solve problems with remote filesystems.

    All changes pass the audit-testsuite. Please merge for v5.1"

    * tag 'audit-pr-20190305' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: mark expected switch fall-through
    audit: hide auditsc_get_stamp and audit_serial prototypes
    audit: join tty records to their syscall
    audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL
    audit: remove unused actx param from audit_rule_match
    audit: ignore fcaps on umount
    audit: clean up AUDITSYSCALL prototypes and stubs
    audit: more filter PATH records keyed on filesystem magic
    audit: add support for fcaps v3
    audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT
    audit: add syscall information to CONFIG_CHANGE records
    audit: hand taken context to audit_kill_trees for syscall logging
    audit: give a clue what CONFIG_CHANGE op was involved

    Linus Torvalds
     

07 Mar, 2019

1 commit

  • Merge misc updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton: (159 commits)
    tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
    proc: more robust bulk read test
    proc: test /proc/*/maps, smaps, smaps_rollup, statm
    proc: use seq_puts() everywhere
    proc: read kernel cpu stat pointer once
    proc: remove unused argument in proc_pid_lookup()
    fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
    fs/proc/self.c: code cleanup for proc_setup_self()
    proc: return exit code 4 for skipped tests
    mm,mremap: bail out earlier in mremap_to under map pressure
    mm/sparse: fix a bad comparison
    mm/memory.c: do_fault: avoid usage of stale vm_area_struct
    writeback: fix inode cgroup switching comment
    mm/huge_memory.c: fix "orig_pud" set but not used
    mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
    mm/memcontrol.c: fix bad line in comment
    mm/cma.c: cma_declare_contiguous: correct err handling
    mm/page_ext.c: fix an imbalance with kmemleak
    mm/compaction: pass pgdat to too_many_isolated() instead of zone
    mm: remove zone_lru_lock() function, access ->lru_lock directly
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though implicitly understood, it is always better to
    have macros for it. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places redirecting
    them to a common definition.
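
    The replacement is a one-liner wherever it occurs; for example
    (illustrative function):

    #include <linux/numa.h>
    #include <linux/topology.h>

    static int resolve_node(int nid)
    {
            /* was: if (nid == -1) */
            if (nid == NUMA_NO_NODE)
                    nid = numa_node_id();
            return nid;
    }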

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

04 Feb, 2019

3 commits

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable task_struct.stack_refcount is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the task_struct.stack_refcount it might make a difference
    in the following places:

    - try_get_task_stack(): increment in refcount_inc_not_zero() only
    guarantees control dependency on success vs. fully ordered
    atomic counterpart
    - put_task_stack(): decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart
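
    The shape of the conversion, modeled on the try_get_task_stack() /
    put_task_stack() pair named above (illustrative type and release helper,
    not the actual patch):

    #include <linux/refcount.h>

    struct stack_holder {
            refcount_t      stack_refcount;         /* was: atomic_t */
            void            *stack;
    };

    static void release_stack(struct stack_holder *h)
    {
            /* illustrative: free the stack resource here */
    }

    static void holder_init(struct stack_holder *h)
    {
            refcount_set(&h->stack_refcount, 1);    /* was: atomic_set(..., 1) */
    }

    static void *try_get_stack(struct stack_holder *h)
    {
            /* was: atomic_inc_not_zero(); only a control dependency on success */
            return refcount_inc_not_zero(&h->stack_refcount) ? h->stack : NULL;
    }

    static void put_stack(struct stack_holder *h)
    {
            /* was: atomic_dec_and_test(); RELEASE + control dependency on success */
            if (refcount_dec_and_test(&h->stack_refcount))
                    release_stack(h);
    }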

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable task_struct.usage is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the task_struct.usage it might make a difference
    in the following places:

    - put_task_struct(): decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable signal_struct.sigcnt is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the signal_struct.sigcnt it might make a difference
    in the following places:

    - put_signal_struct(): decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-3-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     

26 Jan, 2019

1 commit

  • loginuid and sessionid (and audit_log_session_info) should be part of
    CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since it is used in
    CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of
    which are otherwise dependent on AUDITSYSCALL.

    Please see github issue
    https://github.com/linux-audit/audit-kernel/issues/104

    Signed-off-by: Richard Guy Briggs
    [PM: tweaked subject line for better grep'ing]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

10 Aug, 2018

1 commit

  • Wen Yang and majiang
    report that a periodic signal received during fork can cause fork to
    continually restart, preventing an application from making progress.

    The code was being overly pessimistic. Fork needs to guarantee that a
    signal sent to multiple processes is logically delivered before the
    fork and just to the forking process or logically delivered after the
    fork to both the forking process and its newly spawned child. For
    signals like periodic timers that are always delivered to a single
    process, fork can safely complete and let them appear to be logically
    delivered after the fork().

    While examining this issue I also discovered that fork today will miss
    signals delivered to multiple processes during the fork and handled by
    another thread. Similarly the current code will also miss blocked
    signals that are delivered to multiple processes, as those signals will
    not appear pending during fork.

    Add a list of each thread that is currently forking, and keep on that
    list a signal set that records all of the signals sent to multiple
    processes. When fork completes, initialize the new process's
    shared_pending signal set with it. The calculate_sigpending function
    will see those signals and set TIF_SIGPENDING, causing the new task to
    take the slow path to userspace to handle those signals, making it
    appear as if those signals were received immediately after the fork.

    It is not possible to send real time signals to multiple processes and
    exceptions don't go to multiple processes, which means that there are
    no signals sent to multiple processes that require siginfo. This
    means it is safe to not bother collecting siginfo on signals sent
    during fork.

    The sigaction of a child of fork is initially the same as the
    sigaction of the parent process. So a signal the parent ignores, the
    child will also initially ignore. Therefore it is safe to ignore
    signals sent to multiple processes and ignored by the forking process.

    Signals sent to only a single process or only a single thread and delivered
    during fork are treated as if they are received after the fork, and generally
    not dealt with. They won't cause any problems.

    V2: Added removal from the multiprocess list on failure.
    V3: Use -ERESTARTNOINTR directly
    V4: - Don't queue both SIGCONT and SIGSTOP
    - Initialize signal_struct.multiprocess in init_task
    - Move setting of shared_pending to before the new task
    is visible to signals. This prevents signals from coming
    in before shared_pending.signal is set to delayed.signal
    and being lost.
    V5: - rework list add and delete to account for idle threads
    v6: - Use sigdelsetmask when removing stop signals

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
    Reported-by: Wen Yang
    Reported-by: majiang
    Fixes: 4a2c7a7837da ("[PATCH] make fork() atomic wrt pgrp/session signals")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 Jul, 2018

3 commits

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes, so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • To access these fields the code always has to go to the group leader, so
    going to the signal struct is no loss and is actually a fundamental simplification.

    This saves a little bit of memory by only allocating the pid pointer array
    once instead of once for every thread, and even better this removes a
    few potential races caused by the fact that group_leader can be changed
    by de_thread, while signal_struct can not.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • This is cheap and has no runtime cost, so we might as well.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

15 May, 2018

1 commit


17 Jan, 2018

2 commits