01 Oct, 2020

1 commit

  • [ Upstream commit eea9673250db4e854e9998ef9da6d4584857f0ea ]

    The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace. The possible indefinite waits for
    userspace that I have identified are:

    - The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the
    tracer.

    - The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)"
    in exit_mm().

    - The cred_guard_mutex is held over "get_user(futex_offset, ...)" in
    exit_robust_list().

    - The cred_guard_mutex is held over copy_strings().

    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.

    In any of those cases the userspace process that the kernel is waiting
    for might make a different system call that winds up taking the
    cred_guard_mutex and result in deadlock.

    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary. Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.

    The plan is to switch the users of cred_guard_mutex to
    exec_update_mutex one by one. This lets us move forward while still
    being careful and not introducing any regressions.
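
    A sketch of the shape of the change (abbreviated from the patch;
    initialization and the unlock paths are elided):

        /* include/linux/sched/signal.h: a second, narrower mutex
           next to cred_guard_mutex */
        struct mutex exec_update_mutex;

        /* fs/exec.c, exec_mmap(): held only while the task's mm,
           credentials and related objects are being swapped */
        ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
        if (ret)
                return ret;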

    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Reviewed-by: Kirill Tkhai
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     

20 May, 2020

1 commit

  • commit f87d1c9559164294040e58f5e3b74a162bf7c6e8 upstream.

    I goofed when I added mm->user_ns support to would_dump. I missed the
    fact that in the case of binfmt_loader, binfmt_em86, binfmt_misc, and
    binfmt_script bprm->file is reassigned. Which made the move of
    would_dump from setup_new_exec to __do_execve_file before exec_binprm
    incorrect as it can result in would_dump running on the script instead
    of the interpreter of the script.

    The net result is that the code stopped making unreadable interpreters
    undumpable. Which allows them to be ptraced and written to disk
    without special permissions. Oops.

    The move was necessary because the call in setup_new_exec was after
    bprm->mm was no longer valid.

    To correct this mistake move the misplaced would_dump from
    __do_execve_file into flush_old_exec, before exec_mmap is called.

    I tested and confirmed that without this fix I can attach with gdb to
    a script with an unreadable interpreter, and with this fix I can not.
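
    The shape of the fix (a sketch of the upstream hunk; the misplaced
    call in __do_execve_file is simply deleted):

        /* fs/exec.c, flush_old_exec(): bprm->file now points at the
           final interpreter, and bprm->mm is still valid here */
        would_dump(bprm, bprm->file);

        retval = exec_mmap(bprm->mm);
        if (retval)
                goto out;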

    Cc: stable@vger.kernel.org
    Fixes: f84df2a6f268 ("exec: Ensure mm->user_ns contains the execed files")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

17 Apr, 2020

1 commit

  • commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream.

    Replace the 32bit exec_id with a 64bit exec_id to make it impossible
    to wrap the exec_id counter. With care an attacker can cause exec_id
    wrap and send arbitrary signals to a newly exec'd parent. This
    bypasses the signal sending checks if the parent changes their
    credentials during exec.

    The severity of this problem can be seen in that, in my limited
    testing of a 32bit exec_id, it can take as little as 19s to exec 65536
    times. Which means that it can take as little as 14 days to wrap a
    32bit exec_id. Adam Zabrocki has succeeded in wrapping the self_exe_id
    in 7 days. Even my slower timing is within the uptime of a typical
    server. Which means self_exec_id is simply a speed bump today, and if
    exec gets noticeably faster self_exec_id won't even be a speed bump.

    Extending self_exec_id to 64bits introduces a problem on 32bit
    architectures where reading self_exec_id is no longer atomic and can
    take two read instructions. Which means that it is possible to hit
    a window where the read value of exec_id does not match the written
    value. So with very lucky timing, after this change this still
    remains exploitable.

    I have updated the update of exec_id on exec to use WRITE_ONCE
    and the read of exec_id in do_notify_parent to use READ_ONCE
    to make it clear that there is no locking between these two
    locations.
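
    In sketch form (the counter is now u64; the surrounding conditions
    in do_notify_parent() are abbreviated):

        /* fs/exec.c, on exec: the only writer of self_exec_id */
        WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);

        /* kernel/signal.c, do_notify_parent(): lockless reader */
        if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
                sig = SIGCHLD;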

    Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
    Fixes: 2.3.23pre2
    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

29 Nov, 2019

1 commit

  • commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream.

    mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case PF_EXITING and the futex state is updated. In the
    exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.
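
    The resulting split looks roughly like this (a sketch; the
    futex_exit_release()/futex_exec_release() bodies are filled in by
    the later patches of this series):

        static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
        {
                /* common teardown: uprobes, clear_child_tid, vfork done */
        }

        void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm)
        {
                futex_exit_release(tsk);        /* exit: PF_EXITING side */
                mm_release(tsk, mm);
        }

        void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm)
        {
                futex_exec_release(tsk);        /* exec: no exit state */
                mm_release(tsk, mm);
        }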

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Sep, 2019

1 commit

  • The membarrier_state field is located within the mm_struct, which
    is not guaranteed to exist when used from runqueue-lock-free iteration
    on runqueues by the membarrier system call.

    Copy the membarrier_state from the mm_struct into the scheduler runqueue
    when the scheduler switches between mm.

    When registering membarrier for mm, after setting the registration bit
    in the mm membarrier state, issue a synchronize_rcu() to ensure the
    scheduler observes the change. In order to take care of the case
    where a runqueue keeps executing the target mm without switching to
    another mm, iterate over each runqueue and issue an IPI to copy the
    membarrier_state from the mm_struct into each runqueue whose current
    mm is the one whose state has just been modified.

    Move the mm membarrier_state field closer to pgd in mm_struct to use
    a cache line already touched by the scheduler switch_mm.

    The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
    clear the runqueue's membarrier state in addition to clearing the mm
    membarrier state, so move its implementation into the scheduler
    membarrier code so it can access the runqueue structure.

    Add memory barrier in membarrier_exec_mmap() prior to clearing
    the membarrier state, ensuring memory accesses executed prior to exec
    are not reordered with the stores clearing the membarrier state.

    As suggested by Linus, move all membarrier.c RCU read-side locks outside
    of the for each cpu loops.
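
    The runqueue-side copy is a small helper on the context-switch path
    (sketch of the code added to kernel/sched/sched.h):

        static inline void membarrier_switch_mm(struct rq *rq,
                                                struct mm_struct *prev_mm,
                                                struct mm_struct *next_mm)
        {
                int membarrier_state;

                if (prev_mm == next_mm)
                        return;

                membarrier_state = atomic_read(&next_mm->membarrier_state);
                if (READ_ONCE(rq->membarrier_state) == membarrier_state)
                        return;

                WRITE_ONCE(rq->membarrier_state, membarrier_state);
        }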

    Suggested-by: Linus Torvalds
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     

25 Jul, 2019

1 commit

  • When going through execve(), zero out the NUMA fault statistics instead of
    freeing them.

    During execve, the task is reachable through procfs and the scheduler. A
    concurrent /proc/*/sched reader can read data from a freed ->numa_faults
    allocation (confirmed by KASAN) and write it back to userspace.
    I believe that it would also be possible for a use-after-free read to occur
    through a race between a NUMA fault and execve(): task_numa_fault() can
    lead to task_numa_compare(), which invokes task_weight() on the currently
    running task of a different CPU.

    Another way to fix this would be to make ->numa_faults RCU-managed or add
    extra locking, but it seems easier to wipe the NUMA fault statistics on
    execve.
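
    Roughly the shape of the fix (a sketch: task_numa_free() gains a
    'final' flag, true only from the exit path; the numa_group accounting
    and the exact array bound are abbreviated):

        void task_numa_free(struct task_struct *p, bool final)
        {
                unsigned long *numa_faults = p->numa_faults;
                int i;

                if (!numa_faults)
                        return;

                /* numa_group accounting drops out here */

                if (final) {
                        /* exit: really free the allocation */
                        p->numa_faults = NULL;
                        kfree(numa_faults);
                } else {
                        /* exec: keep the allocation, wipe the contents */
                        p->total_numa_faults = 0;
                        for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
                                numa_faults[i] = 0;
                }
        }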

    Signed-off-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()")
    Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
    Signed-off-by: Ingo Molnar

    Jann Horn
     

09 Jul, 2019

1 commit

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because
    force_sig takes a task parameter, the function has been abused for
    sending other kinds of signals over the years. Slowly those
    have been fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"
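
    The signature change at the core of the series, in sketch form:

        /* before: a task parameter that was only safe when p == current */
        void force_sig(int sig, struct task_struct *p);

        /* after: can only ever act on current */
        void force_sig(int sig);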

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
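
    Concretely, each affected file now opens with a tag of this form
    (comment style varies with the file type):

        // SPDX-License-Identifier: GPL-2.0-only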

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 Mar, 2019

2 commits

  • Large enterprise clients often run applications out of networked file
    systems where the IT mandated layout of project volumes can end up
    leading to paths that are longer than 128 characters. Bumping this up
    to the next order of two solves this problem in all but the most
    egregious case while still fitting into a 512b slab.
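
    The change itself is a one-liner in include/uapi/linux/binfmts.h
    (sketch):

        /* sizeof(linux_binprm->buf): the #! interpreter line must fit */
        -#define BINPRM_BUF_SIZE 128
        +#define BINPRM_BUF_SIZE 256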

    [oleg@redhat.com: update comment, per Kees]
    Link: http://lkml.kernel.org/r/20181112160956.GA28472@redhat.com
    Signed-off-by: Oleg Nesterov
    Reported-by: Ben Woodard
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Kees Cook
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Link: http://lkml.kernel.org/r/1548275584-18096-2-git-send-email-vgupta@synopsys.com
    Link: http://lkml.kernel.org/g/20150807115710.GA16897@redhat.com
    Signed-off-by: Vineet Gupta
    Reviewed-by: Anthony Yznaga
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Peter Zijlstra (Intel)
    Cc: Chris Wilson
    Cc: Ingo Molnar
    Cc: Jani Nikula
    Cc: Miklos Szeredi
    Cc: Theodore Ts'o
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     

07 Mar, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - refcount conversions

    - Solve the rq->leaf_cfs_rq_list can of worms for real.

    - improve power-aware scheduling

    - add sysctl knob for Energy Aware Scheduling

    - documentation updates

    - misc other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    kthread: Do not use TIMER_IRQSAFE
    kthread: Convert worker lock to raw spinlock
    sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
    sched/fair: Remove unused 'sd' parameter from select_idle_smt()
    sched/wait: Use freezable_schedule() when possible
    sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
    sched/fair: Explain LLC nohz kick condition
    sched/fair: Simplify nohz_balancer_kick()
    sched/topology: Fix percpu data types in struct sd_data & struct s_data
    sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
    sched/fair: Fix O(nr_cgroups) in the load balancing path
    sched/fair: Optimize update_blocked_averages()
    sched/fair: Fix insertion in rq->leaf_cfs_rq_list
    sched/fair: Add tmp_alone_branch assertion
    sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
    sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
    sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
    sched/fair: Update scale invariance of PELT
    sched/fair: Move the rq_of() helper function
    sched/core: Convert task_struct.stack_refcount to refcount_t
    ...

    Linus Torvalds
     

19 Feb, 2019

1 commit

  • syzkaller reports this:
    BUG: memory leak
    unreferenced object 0xffffc9000488d000 (size 9195520):
    comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
    hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00 ................
    02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff ..........z.....
    backtrace:
    [] __vmalloc_node mm/vmalloc.c:1795 [inline]
    [] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
    [] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
    [] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
    [] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
    [] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
    [] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    It should goto the 'out_free' label to free the allocated buf when
    kernel_read fails.
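
    The fix, in sketch form (the kernel_read() failure path in
    fs/exec.c:kernel_read_file()):

        bytes = kernel_read(file, *buf + pos, i_size - pos, &pos);
        if (bytes < 0) {
                ret = bytes;
                goto out_free;  /* was 'goto out', leaking the buffer */
        }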

    Fixes: 39d637af5aa7 ("vfs: forbid write access when reading a file into memory")
    Signed-off-by: YueHaibing
    Signed-off-by: Al Viro

    YueHaibing
     

04 Feb, 2019

1 commit

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable sighand_struct.count is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts.

    The full comparison can be seen in
    https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
    in state to be merged to the documentation tree.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.

    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the sighand_struct.count it might make a difference
    in following places:

    - __cleanup_sighand: decrement in refcount_dec_and_test() only
    provides RELEASE ordering and control dependency on success
    vs. fully ordered atomic counterpart
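
    In sketch form, the conversion is mechanical:

        -       atomic_t                count;  /* struct sighand_struct */
        +       refcount_t              count;

        -       atomic_set(&sig->count, 1);
        +       refcount_set(&sig->count, 1);

        -       if (atomic_dec_and_test(&sighand->count))
        +       if (refcount_dec_and_test(&sighand->count))
                        kmem_cache_free(sighand_cachep, sighand);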

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Reviewed-by: Andrea Parri
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     

05 Jan, 2019

2 commits

  • This is already done for us internally by the signal machinery.

    [akpm@linux-foundation.org: fix fs/buffer.c]
    Link: http://lkml.kernel.org/r/20181116002713.8474-7-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • get_arg_page() checks bprm->rlim_stack.rlim_cur and re-calculates the
    "extra" size for argv/envp pointers every time; this is a bit ugly and
    not even strictly correct: acct_arg_size() must not account this size.

    Remove all the rlimit code in get_arg_page(). Instead, add bprm->argmin
    calculated once at the start of __do_execve_file() and change
    copy_strings to check bprm->p >= bprm->argmin.

    The patch adds the new helper, prepare_arg_pages() which initializes
    bprm->argc/envc and bprm->argmin.
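
    In sketch form (abbreviated; the E2BIG check and the !CONFIG_MMU
    variant are elided):

        /* prepare_arg_pages(): compute the floor once */
        limit = min_t(unsigned long, _STK_LIM / 4 * 3,
                      bprm->rlim_stack.rlim_cur / 4);
        /* leave room for the argv/envp pointer arrays themselves */
        limit -= (bprm->argc + bprm->envc) * sizeof(void *);
        bprm->argmin = bprm->p - limit;

        /* copy_strings(): the only check left on the copy path */
        if (bprm->p < bprm->argmin)
                goto out;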

    [oleg@redhat.com: fix !CONFIG_MMU version of get_arg_page()]
    Link: http://lkml.kernel.org/r/20181126122307.GA1660@redhat.com
    [akpm@linux-foundation.org: use max_t]
    Link: http://lkml.kernel.org/r/20181112160910.GA28440@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Tested-by: Guenter Roeck
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

04 Dec, 2018

1 commit

  • Revert commit c22397888f1e "exec: make de_thread() freezable" as
    requested by Ingo Molnar:

    "So there's a new regression in v4.20-rc4, my desktop produces this
    lockdep splat:

    [ 1772.588771] WARNING: pkexec/4633 still has locks held!
    [ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
    [ 1772.588775] ------------------------------------
    [ 1772.588776] 1 lock held by pkexec/4633:
    [ 1772.588778] #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
    [ 1772.588786] stack backtrace:
    [ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
    [ 1772.588792] Call Trace:
    [ 1772.588800] dump_stack+0x85/0xcb
    [ 1772.588803] flush_old_exec+0x116/0x890
    [ 1772.588807] ? load_elf_phdrs+0x72/0xb0
    [ 1772.588809] load_elf_binary+0x291/0x1620
    [ 1772.588815] ? sched_clock+0x5/0x10
    [ 1772.588817] ? search_binary_handler+0x6d/0x240
    [ 1772.588820] search_binary_handler+0x80/0x240
    [ 1772.588823] load_script+0x201/0x220
    [ 1772.588825] search_binary_handler+0x80/0x240
    [ 1772.588828] __do_execve_file.isra.32+0x7d2/0xa60
    [ 1772.588832] ? strncpy_from_user+0x40/0x180
    [ 1772.588835] __x64_sys_execve+0x34/0x40
    [ 1772.588838] do_syscall_64+0x60/0x1c0

    The warning gets triggered by an ancient lockdep check in the freezer:

    (gdb) list *0xffffffff812ece06
    0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
    52 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
    53 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
    54 */
    55 static inline bool try_to_freeze_unsafe(void)
    56 {
    57 might_sleep();
    58 if (likely(!freezing(current)))
    59 return false;
    60 return __refrigerator(false);
    61 }

    I reviewed the ->cred_guard_mutex code, and the mutex is held across all
    of exec() - and we always did this.

    But there's this recent -rc4 commit:

    > Chanho Min (1):
    > exec: make de_thread() freezable

    c22397888f1e: exec: make de_thread() freezable

    I believe this commit is bogus, you cannot call try_to_freeze() from
    de_thread(), because it's holding the ->cred_guard_mutex."

    Reported-by: Ingo Molnar
    Tested-by: Ingo Molnar
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

19 Nov, 2018

1 commit

  • Suspend fails due to the exec family of functions blocking the freezer.
    The cause is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
    all sub-threads to die, and we have a deadlock if one of them is frozen.
    This can also occur with the schedule() waiting for the group thread leader
    to exit if it is frozen.

    On our machine, it causes a freeze timeout as below.

    Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
    setcpushares-ls D ffffffc00008ed70 0 5817 1483 0x0040000d
    Call trace:
    [] __switch_to+0x88/0xa0
    [] __schedule+0x1bc/0x720
    [] schedule+0x40/0xa8
    [] flush_old_exec+0xdc/0x640
    [] load_elf_binary+0x2a8/0x1090
    [] search_binary_handler+0x9c/0x240
    [] load_script+0x20c/0x228
    [] search_binary_handler+0x9c/0x240
    [] do_execveat_common.isra.14+0x4f8/0x6e8
    [] compat_SyS_execve+0x38/0x48
    [] el0_svc_naked+0x24/0x28

    To fix this, make de_thread() freezable. It looks safe and works fine.
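
    The change itself is tiny (sketch): both wait loops in de_thread()
    swap a bare schedule() for its freezer-aware variant.

        -               schedule();
        +               freezable_schedule();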

    Suggested-by: Oleg Nesterov
    Signed-off-by: Chanho Min
    Acked-by: Oleg Nesterov
    Acked-by: Pavel Machek
    Acked-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Chanho Min
     

11 Oct, 2018

1 commit

  • On 32-bit systems, the buffer allocated by kernel_read_file() is too
    small if the file size is > SIZE_MAX, due to truncation to size_t.

    Fortunately, since the 'count' argument to kernel_read() is also
    truncated to size_t, only the allocated space is filled; then, -EIO is
    returned since 'pos != i_size' after the read loop.

    But this is not obvious and seems incidental. We should be more
    explicit about this case. So, fail early if i_size > SIZE_MAX.
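
    A sketch of the early check (as described above; surrounding context
    abbreviated):

        i_size = i_size_read(file_inode(file));
        if (i_size <= 0) {
                ret = -EINVAL;
                goto out;
        }
        if (i_size > SIZE_MAX || (max_size > 0 && i_size > max_size)) {
                ret = -EFBIG;
                goto out;
        }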

    Signed-off-by: Eric Biggers
    Signed-off-by: Mimi Zohar

    Eric Biggers
     

22 Aug, 2018

1 commit

  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal where signals are
    actually delivered we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit

  • vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
    VMA. This is unreliable as ->mmap may not set ->vm_ops.

    False-positive vma_is_anonymous() may lead to crashes:

    next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
    prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
    pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
    flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
    ------------[ cut here ]------------
    kernel BUG at mm/memory.c:1422!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
    RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
    RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
    RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
    Call Trace:
    unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
    zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
    unmap_mapping_range_vma mm/memory.c:2792 [inline]
    unmap_mapping_range_tree mm/memory.c:2813 [inline]
    unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
    unmap_mapping_range+0x48/0x60 mm/memory.c:2880
    truncate_pagecache+0x54/0x90 mm/truncate.c:800
    truncate_setsize+0x70/0xb0 mm/truncate.c:826
    simple_setattr+0xe9/0x110 fs/libfs.c:409
    notify_change+0xf13/0x10f0 fs/attr.c:335
    do_truncate+0x1ac/0x2b0 fs/open.c:63
    do_sys_ftruncate+0x492/0x560 fs/open.c:205
    __do_sys_ftruncate fs/open.c:215 [inline]
    __se_sys_ftruncate fs/open.c:213 [inline]
    __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reproducer:

    [The reproducer is a kcov-based C program; its #include lines and
    main() body did not survive formatting, and only the ioctl
    definitions below are recoverable. See the upstream commit for the
    complete source.]

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE _IO('c', 100)
    #define KCOV_DISABLE _IO('c', 101)

    This can be fixed by assigning anonymous VMAs own vm_ops and not relying
    on it being NULL.

    If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
    dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
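
    In sketch form, the invariant described above:

        /* include/linux/mm.h: anonymous VMAs are marked explicitly */
        static inline void vma_set_anonymous(struct vm_area_struct *vma)
        {
                vma->vm_ops = NULL;
        }

        /* mm/mmap.c: a file mapping may never keep a NULL ->vm_ops */
        static const struct vm_operations_struct dummy_vm_ops = {};

        /* in mmap_region(), after call_mmap(file, vma): */
        if (!vma->vm_ops)
                vma->vm_ops = &dummy_vm_ops;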

    Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jul, 2018

2 commits

  • Like vm_area_dup(), it initializes the anon_vma_chain head, and the
    basic mm pointer.

    The rest of the fields end up being different for different users,
    although the plan is to also initialize the 'vm_ops' field to a dummy
    entry.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The vm_area_struct is one of the most fundamental memory management
    objects, but the management of it is entirely open-coded everywhere,
    ranging from allocation and freeing (using kmem_cache_[z]alloc and
    kmem_cache_free) to initializing all the fields.

    We want to unify this in order to end up having some unified
    initialization of the vmas, and the first step to this is to at least
    have basic allocation functions.

    Right now those functions are literally just wrappers around the
    kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

    to the point where the old vma passed in to the vm_area_dup() function
    isn't even used yet (because I've left all the old manual initialization
    alone).
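
    At this stage the wrappers are literal (sketch; later patches fold
    the initialization in):

        struct vm_area_struct *vm_area_alloc(void)
        {
                return kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
        }

        struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
        {
                return kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
        }

        void vm_area_free(struct vm_area_struct *vma)
        {
                kmem_cache_free(vm_area_cachep, vma);
        }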

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jul, 2018

1 commit

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.
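
    The enumeration afterwards (sketch):

        enum pid_type {
                PIDTYPE_PID,
                PIDTYPE_TGID,   /* new first-class member */
                PIDTYPE_PGID,
                PIDTYPE_SID,
                PIDTYPE_MAX,
        };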

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

11 Jun, 2018

1 commit

  • Pull restartable sequence support from Thomas Gleixner:
    "The restartable sequences syscall (finally):

    After a lot of back and forth discussion and massive delays caused by
    the speculative distraction of maintainers, the core set of
    restartable sequences has finally reached a consensus.

    It comes with the basic non-disputed core implementation along with
    support for arm, powerpc and x86, and a full set of selftests.

    It was exposed to linux-next earlier this week, so it does not fully
    comply with the merge window requirements, but there is really no
    point to drag it out for yet another cycle"

    * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rseq/selftests: Provide Makefile, scripts, gitignore
    rseq/selftests: Provide parametrized tests
    rseq/selftests: Provide basic percpu ops test
    rseq/selftests: Provide basic test
    rseq/selftests: Provide rseq library
    selftests/lib.mk: Introduce OVERRIDE_TARGETS
    powerpc: Wire up restartable sequences system call
    powerpc: Add syscall detection for restartable sequences
    powerpc: Add support for restartable sequences
    x86: Wire up restartable sequence system call
    x86: Add support for restartable sequences
    arm: Wire up restartable sequences system call
    arm: Add syscall detection for restartable sequences
    arm: Add restartable sequences support
    rseq: Introduce restartable sequences system call
    uapi/headers: Provide types_32_64.h

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.

    * Restartable sequences (per-cpu atomics)

    Restartable sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.

    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.

    Here are benchmarks of various rseq use-cases.

    Test hardware:

    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

    The following benchmarks were all performed on a single thread.

    * Per-CPU statistic counter increment

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 344.0 31.4 11.0
    x86-64: 15.3 2.0 7.7

    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
    per-cpu buffer

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 2502.0 2250.0 1.1
    x86-64: 117.4 98.0 1.2

    * liburcu percpu: lock-unlock pair, dereference, read/compare word

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 751.0 128.5 5.8
    x86-64: 53.4 28.6 1.9

    * jemalloc memory allocator adapted to use rseq

    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):

    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.

    * Reading the current CPU number

    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up to date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.
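
    A user-space sketch of that fast path (assuming kernel headers that
    ship the rseq uapi; the RSEQ_SIG value is the application's choice):

        #include <linux/rseq.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #define RSEQ_SIG 0x53053053     /* must match on unregister */

        static __thread volatile struct rseq rseq_area
                        __attribute__((aligned(32)));

        static int rseq_register(void)  /* once per thread */
        {
                return syscall(__NR_rseq, &rseq_area, sizeof(rseq_area),
                               0, RSEQ_SIG);
        }

        static int rseq_current_cpu(void)
        {
                return rseq_area.cpu_id;  /* plain load; kernel keeps it fresh */
        }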

    Keeping the current cpu id in a memory area shared between kernel and
    user-space is an improvement over current mechanisms available to read
    the current CPU number, which has the following benefits over
    alternative approaches:

    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
    executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
    assembly, which makes it a useful building block for restartable
    sequences.
    - The approach of reading the cpu id through memory mapping shared
    between kernel and user-space is portable (e.g. ARM), which is not the
    case for the lsl-based x86 vdso.

    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.

    Benchmarking various approaches for reading the current CPU number:

    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop): 8.4 ns
    - Read CPU from rseq cpu_id: 16.7 ns
    - Read CPU from rseq cpu_id (lazy register): 19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
    - getcpu system call: 234.9 ns

    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop): 0.8 ns
    - Read CPU from rseq cpu_id: 0.8 ns
    - Read CPU from rseq cpu_id (lazy register): 0.8 ns
    - Read using gs segment selector: 0.8 ns
    - "lsl" inline assembly: 13.0 ns
    - glibc 2.19-0ubuntu6 getcpu: 16.6 ns
    - getcpu system call: 53.9 ns

    - Speed (benchmark taken on v8 of patchset)

    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.

    * CONFIG_RSEQ=n

    avg.: 41.37 s
    std.dev.: 0.36 s

    * CONFIG_RSEQ=y

    avg.: 40.46 s
    std.dev.: 0.33 s

    - Size

    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.

    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Joel Fernandes
    Cc: Catalin Marinas
    Cc: Dave Watson
    Cc: Will Deacon
    Cc: Andi Kleen
    Cc: "H . Peter Anvin"
    Cc: Chris Lameter
    Cc: Russell King
    Cc: Andrew Hunter
    Cc: Michael Kerrisk
    Cc: "Paul E . McKenney"
    Cc: Paul Turner
    Cc: Boqun Feng
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Ben Maurer
    Cc: Alexander Viro
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     

24 May, 2018

1 commit

  • Introduce helper:
    int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
    struct umh_info {
    struct file *pipe_to_umh;
    struct file *pipe_from_umh;
    pid_t pid;
    };

    that GPLed kernel modules (signed or unsigned) can use to execute part
    of their own data as a swappable user mode process.

    The kernel will do:
    - allocate a unique file in tmpfs
    - populate that file with [data, data + len] bytes
    - user-mode-helper code will do_execve that file and, before the process
    starts, the kernel will create two unix pipes for bidirectional
    communication between kernel module and umh
    - close tmpfs file, effectively deleting it
    - the fork_usermode_blob will return zero on success and populate
    'struct umh_info' with two unix pipes and the pid of the user process
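
    A sketch of how a module might call it (the blob bounds here are
    hypothetical symbols; real users link the umh image into the module):

        extern char my_umh_start[], my_umh_end[];  /* hypothetical */
        static struct umh_info info;

        static int __init my_module_init(void)
        {
                return fork_usermode_blob(my_umh_start,
                                          my_umh_end - my_umh_start,
                                          &info);
        }
        /* module exit must kill info.pid and close the two pipes */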

    As the first step in the development of the bpfilter project
    the fork_usermode_blob() helper is introduced to allow user mode code
    to be invoked from a kernel module. The idea is that user mode code plus
    normal kernel module code are built as part of the kernel build
    and installed as traditional kernel module into distro specified location,
    such that from a distribution point of view, there is
    no difference between regular kernel modules and kernel modules + umh code.
    Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
    by a kernel module doesn't make it any special from kernel and user space
    tooling point of view.

    Such approach enables kernel to delegate functionality traditionally done
    by the kernel modules into the user space processes (either root or !root) and
    reduces security attack surface of the new code. The buggy umh code would crash
    the user process, but not the kernel. Another advantage is that umh code
    of the kernel module can be debugged and tested out of user space
    (e.g. opening the possibility to run clang sanitizers, fuzzers or
    user space test suites on the umh code).
    In case of the bpfilter project such architecture allows complex control plane
    to be done in the user space while bpf based data plane stays in the kernel.

    Since umh can crash, can be oom-ed by the kernel, killed by the admin,
    the kernel module that uses them (like bpfilter) needs to manage life
    time of umh on its own via two unix pipes and the pid of umh.

    The module exit code of such a kernel module should kill the umh it
    started, so that rmmod of the kernel module will clean up the
    corresponding umh.
    Just like if the kernel module does kmalloc() it should kfree() it
    in the exit code.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

12 Apr, 2018

3 commits

  • Since the stack rlimit is used in multiple places during exec and it can
    be changed via other threads (via setrlimit()) or processes (via
    prlimit()), the assumption that the value doesn't change cannot be made.
    This leads to races with mm layout selection and argument size
    calculations. This changes the exec path to use the rlimit stored in
    bprm instead of in current. Before starting the thread, the bprm stack
    rlimit is stored back to current.
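
    The snapshot side, in sketch form (taken once, early in exec):

        /* fs/exec.c, bprm_mm_init() */
        task_lock(current->group_leader);
        bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
        task_unlock(current->group_leader);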

    Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Andy Lutomirski
    Reported-by: Brad Spengler
    Acked-by: Michal Hocko
    Cc: Ben Hutchings
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Provide a final callback into fs/exec.c before start_thread() takes
    over, to handle any last-minute changes, like the coming restoration of
    the stack limit.
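
    With the stack-limit pinning applied on top of this, the hook body
    ends up as (sketch), called from the binfmt loaders just before
    start_thread():

        void finalize_exec(struct linux_binprm *bprm)
        {
                /* Store any stack rlimit changes before starting thread. */
                task_lock(current->group_leader);
                current->signal->rlim[RLIMIT_STACK] = bprm->rlim_stack;
                task_unlock(current->group_leader);
        }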

    Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.
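
    The plumbing itself is a signature change (sketch):

        -void arch_pick_mmap_layout(struct mm_struct *mm);
        +void arch_pick_mmap_layout(struct mm_struct *mm,
        +                          struct rlimit *rlim_stack);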

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

19 Mar, 2018

1 commit

  • The LSM check should happen after the file has been confirmed to be
    unchanging. Without this, we could have a race between the Time of Check
    (the call to security_kernel_read_file() which could read the file and
    make access policy decisions) and the Time of Use (starting with
    kernel_read_file()'s reading of the file contents). In theory, file
    contents could change between the two.
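
    A sketch of the reordering in kernel_read_file() (error handling
    abbreviated):

        /* first: freeze the contents */
        ret = deny_write_access(file);
        if (ret)
                return ret;

        /* only then: let the LSM look at a file that can't change */
        ret = security_kernel_read_file(file, id);
        if (ret)
                goto out;       /* re-allows write access */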

    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Signed-off-by: James Morris

    Kees Cook
     

04 Jan, 2018

1 commit

  • This is a logical revert of commit e37fdb785a5f ("exec: Use secureexec
    for setting dumpability")

    This weakens dumpability back to checking only for uid/gid changes in
    current (which is useless), but userspace depends on dumpability not
    being tied to secureexec.

    https://bugzilla.redhat.com/show_bug.cgi?id=1528633

    Reported-by: Tom Horsley
    Fixes: e37fdb785a5f ("exec: Use secureexec for setting dumpability")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

18 Dec, 2017

1 commit

  • This reverts commit 04e35f4495dd560db30c25efca4eecae8ec8c375.

    SELinux runs with secureexec for all non-"noatsecure" domain transitions,
    which means lots of processes end up hitting the stack hard-limit change
    that was introduced in order to fix a race with prlimit(). That race fix
    will need to be redesigned.

    Reported-by: Laura Abbott
    Reported-by: Tomáš Trnka
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

15 Dec, 2017

1 commit

  • gcc-8 warns about using strncpy() with the source size as the limit:

    fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]

    This is indeed slightly suspicious, as it protects us from source
    arguments without NUL-termination, but does not guarantee that the
    destination is terminated.

    This keeps the strncpy() to ensure we have properly padded target
    buffer, but ensures that we use the correct length, by passing the
    actual length of the destination buffer as well as adding a build-time
    check to ensure it is exactly TASK_COMM_LEN.

    There are only 23 callsites, all of which I reviewed to ensure this is
    currently the case. We could get away with doing only the check or
    passing the right length, but it doesn't hurt to do both.
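
    The resulting helper pair, in sketch form:

        char *__get_task_comm(char *buf, size_t buf_size,
                              struct task_struct *tsk)
        {
                task_lock(tsk);
                strncpy(buf, tsk->comm, buf_size);
                task_unlock(tsk);
                return buf;
        }

        #define get_task_comm(buf, tsk) ({                      \
                BUILD_BUG_ON(sizeof(buf) != TASK_COMM_LEN);     \
                __get_task_comm(buf, sizeof(buf), tsk);         \
        })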

    Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Cc: Alexander Viro
    Cc: Peter Zijlstra
    Cc: Serge Hallyn
    Cc: James Morris
    Cc: Aleksa Sarai
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

30 Nov, 2017

1 commit

  • While the defense-in-depth RLIMIT_STACK limit on setuid processes was
    protected against races from other threads calling setrlimit(), I missed
    protecting it against races from external processes calling prlimit().
    This adds locking around the change and makes sure that rlim_max is set
    too.
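
    A sketch of the guarded clamp on the secureexec path (placement and
    surrounding context abbreviated):

        struct rlimit *rlim_stack;

        task_lock(current->group_leader);
        rlim_stack = &current->signal->rlim[RLIMIT_STACK];
        if (rlim_stack->rlim_cur > _STK_LIM)
                rlim_stack->rlim_cur = _STK_LIM;
        if (rlim_stack->rlim_max > _STK_LIM)
                rlim_stack->rlim_max = _STK_LIM;
        task_unlock(current->group_leader);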

    Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Brad Spengler
    Acked-by: Serge Hallyn
    Cc: James Morris
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: Jiri Slaby
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if users register their intent to use this functionality early on, for
    example, before the process spawns the second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.
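
    From user-space the resulting two-step usage looks like this
    (a minimal sketch):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* once, early -- e.g. before the second thread is spawned */
        syscall(__NR_membarrier,
                MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);

        /* fast path; fails with EPERM if the process never registered */
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);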

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers