Eric Lee / smarc-fsl-linux-kernel

25 Dec, 2019

1 commit

66528a457 rseq: Reject unknown flags on rseq unregister

It is preferable to reject unknown flags within rseq unregistration
rather than to ignore them. Ignoring them was an oversight caused by the
fact that the check for unknown flags came after the rseq unregister
flag check.
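
The intended ordering can be illustrated with a small user-space model.
This is only a sketch, not the kernel code: model_rseq_check_flags(),
enum model_action and MODEL_FLAG_UNREGISTER are hypothetical names, and
the constant is assumed to mirror RSEQ_FLAG_UNREGISTER as bit 0.

```c
#include <errno.h>

/* Assumption: stands in for RSEQ_FLAG_UNREGISTER (taken to be bit 0). */
#define MODEL_FLAG_UNREGISTER (1 << 0)

enum model_action { MODEL_REGISTER, MODEL_UNREGISTER };

/*
 * Hypothetical model of the flag handling: unknown bits are rejected
 * first, so they can no longer slip through on the unregister path.
 */
static int model_rseq_check_flags(int flags, enum model_action *action)
{
    /* Any bit other than the unregister flag is unknown. */
    if (flags & ~MODEL_FLAG_UNREGISTER)
        return -EINVAL;

    *action = (flags & MODEL_FLAG_UNREGISTER) ? MODEL_UNREGISTER
                                              : MODEL_REGISTER;
    return 0;
}
```

With this ordering, a call passing the unregister flag together with any
other bit fails with -EINVAL instead of silently unregistering.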

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20191211161713.4490-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar

Mathieu Desnoyers
2019-12-25 17:41:20 +0800

27 May, 2019

2 commits

3cf5d076f signal: Remove task parameter from force_sig

All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.

This also makes it clear force_sig passes current into force_sig_info.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2019-05-27 22:36:28 +0800
cb44c9a0a signal: Remove task parameter from force_sigsegv

The function force_sigsegv is always called on the current task
so passing in current is redundant and not passing in current
makes this fact obvious.

This also makes it clear force_sigsegv always calls force_sig
on the current task.

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2019-05-27 22:36:28 +0800

19 Apr, 2019

2 commits

83b0b15bc rseq: Remove superfluous rseq_len from task_struct

The rseq system call, when invoked with flags of "0" or
"RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
specified in uapi linux/rseq.h.

Expecting a fixed size for rseq_len is a design choice that ensures
multiple libraries and applications defining __rseq_abi in the same
process agree on its exact size.

Considering that this size is and will always be the same value, there
is no point in saving this value within task_struct rseq_len. Remove
this field from task_struct.

No change in functionality intended.
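
As an illustration of why the saved length carries no information, here
is a user-space sketch of the fixed-size check. struct model_rseq is a
simplified, hypothetical stand-in for the real struct rseq from uapi
linux/rseq.h, and model_check_rseq_len() is an invented helper name.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fixed-layout struct rseq. */
struct model_rseq {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;
    uint32_t flags;
};

/*
 * The only accepted length is the size of the structure itself, so the
 * value is a compile-time constant and need not be stored per task.
 */
static int model_check_rseq_len(size_t rseq_len)
{
    return rseq_len == sizeof(struct model_rseq) ? 0 : -EINVAL;
}
```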

Signed-off-by: Mathieu Desnoyers
Acked-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Ben Maurer
Cc: Boqun Feng
Cc: Catalin Marinas
Cc: Chris Lameter
Cc: Dave Watson
Cc: H. Peter Anvin
Cc: Joel Fernandes
Cc: Josh Triplett
Cc: Linus Torvalds
Cc: Michael Kerrisk
Cc: Paul E. McKenney
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Russell King
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20190305194755.2602-3-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar

Mathieu Desnoyers
2019-04-19 18:39:32 +0800
bff9504bf rseq: Clean up comments by reflecting removal of event counter

The "event counter" was removed from rseq before it was merged upstream.
However, a few comments in the source code still refer to it. Adapt the
comments to match reality.

Signed-off-by: Mathieu Desnoyers
Acked-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Ben Maurer
Cc: Boqun Feng
Cc: Catalin Marinas
Cc: Chris Lameter
Cc: Dave Watson
Cc: H. Peter Anvin
Cc: Joel Fernandes
Cc: Josh Triplett
Cc: Linus Torvalds
Cc: Michael Kerrisk
Cc: Paul E. McKenney
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Russell King
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20190305194755.2602-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar

Mathieu Desnoyers
2019-04-19 18:39:31 +0800

04 Jan, 2019

1 commit

96d4f267e Remove 'type' argument from access_ok() function

Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted in this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

- csky still had the old "verify_area()" name as an alias.

- the iter_iov code had magical hardcoded knowledge of the actual
  values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
  really used it)

- microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
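
To make the shape of the conversion concrete, here is a user-space model
of a range check that takes only an address and a size. It is a sketch
under stated assumptions, not any architecture's real access_ok():
model_access_ok() and MODEL_TASK_SIZE are hypothetical, and the limit is
an arbitrary example value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: arbitrary example limit for the user address space. */
#define MODEL_TASK_SIZE UINT64_C(0x0000800000000000)

/*
 * Old call form: access_ok(VERIFY_READ, ptr, len)
 * New call form: access_ok(ptr, len)
 *
 * Hypothetical model of the new form. The range [addr, addr + size)
 * must end at or below the limit; the comparison is written so that
 * addr + size is never computed and therefore cannot overflow.
 */
static bool model_access_ok(uint64_t addr, uint64_t size)
{
    return size <= MODEL_TASK_SIZE && addr <= MODEL_TASK_SIZE - size;
}
```

The result depends only on the address range, which is why dropping the
read/write type is expected to be a no-op.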

Signed-off-by: Linus Torvalds

Linus Torvalds
2019-01-04 10:57:57 +0800

11 Jul, 2018

4 commits

ec9c82e03 rseq: uapi: Declare rseq_cs field as union, update includes

Declaring the rseq_cs field as a union between __u64 and two __u32
allows both 32-bit and 64-bit kernels to read the full __u64, and
therefore validate that a 32-bit user-space cleared the upper 32
bits, thus ensuring a consistent behavior between native 32-bit
kernels and 32-bit compat tasks on 64-bit kernels.

Check that the rseq_cs value read is < TASK_SIZE.
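
A simplified user-space sketch of the idea follows. The union below is
a hypothetical stand-in for the uapi declaration, the byte-order test
relies on the GCC/Clang predefined __BYTE_ORDER__ macros, and
MODEL_TASK_SIZE_32 is an arbitrary example limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: example limit for a 32-bit task's address space. */
#define MODEL_TASK_SIZE_32 UINT64_C(0xC0000000)

/* Hypothetical stand-in for the rseq_cs union in struct rseq. */
union model_rseq_cs_field {
    uint64_t ptr64;              /* what the kernel always reads */
    struct {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        uint32_t padding;        /* upper 32 bits */
        uint32_t ptr32;          /* pointer as seen by 32-bit code */
#else
        uint32_t ptr32;          /* pointer as seen by 32-bit code */
        uint32_t padding;        /* upper 32 bits */
#endif
    } ptr;
};

/* Reading the full 64 bits exposes any garbage left in the padding. */
static bool model_rseq_cs_valid(const union model_rseq_cs_field *f)
{
    return f->ptr64 < MODEL_TASK_SIZE_32;
}
```

If 32-bit user-space stores only ptr32 and leaves the padding
uninitialized, the 64-bit value fails the check on both kinds of kernel.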

The asm/byteorder.h header needs to be included by rseq.h, now
that it is not using linux/types_32_64.h anymore.

Considering that only __u32 and __u64 types are declared in linux/rseq.h,
the linux/types.h header should always be included for both kernel and
user-space code: including stdint.h is just for u64 and u32, which are
not used in this header at all.

Use copy_from_user()/clear_user() to interact with a 64-bit field,
because arm32 does not implement 64-bit __get_user, and ppc32 does not
implement 64-bit get_user. Considering that the rseq_cs pointer does not
need to
be loaded/stored with single-copy atomicity from the kernel anymore, we
can simply use copy_from_user()/clear_user().

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Thomas Gleixner
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra
Cc: "Paul E . McKenney"
Cc: Boqun Feng
Cc: Andy Lutomirski
Cc: Dave Watson
Cc: Paul Turner
Cc: Andrew Morton
Cc: Russell King
Cc: "H . Peter Anvin"
Cc: Andi Kleen
Cc: Chris Lameter
Cc: Ben Maurer
Cc: Steven Rostedt
Cc: Josh Triplett
Cc: Linus Torvalds
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Michael Kerrisk
Cc: Joel Fernandes
Cc: "Paul E. McKenney"
Cc: "H. Peter Anvin"
Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com

Mathieu Desnoyers
2018-07-11 04:18:52 +0800
0fb9a1abc rseq: uapi: Update uapi comments

Update rseq uapi header comments to reflect that user-space needs to do
thread-local loads/stores from/to the struct rseq fields.

As a consequence of this added requirement, the kernel does not need
to perform loads/stores with single-copy atomicity.

Update the comment associated with the "flags" fields to describe
more accurately that they are only useful to facilitate single-stepping
through rseq critical sections with debuggers.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Thomas Gleixner
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra
Cc: "Paul E . McKenney"
Cc: Boqun Feng
Cc: Andy Lutomirski
Cc: Dave Watson
Cc: Paul Turner
Cc: Andrew Morton
Cc: Russell King
Cc: "H . Peter Anvin"
Cc: Andi Kleen
Cc: Chris Lameter
Cc: Ben Maurer
Cc: Steven Rostedt
Cc: Josh Triplett
Cc: Linus Torvalds
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Michael Kerrisk
Cc: Joel Fernandes
Cc: "Paul E. McKenney"
Cc: "H. Peter Anvin"
Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com

Mathieu Desnoyers
2018-07-11 04:18:52 +0800
8f2817701 rseq: Use get_user/put_user rather than __get_user/__put_user

__get_user()/__put_user() are used to read values for address ranges that
were already checked with access_ok() on rseq registration.

It has been recognized that __get_user/__put_user are optimizing the
wrong thing. Replace them with get_user/put_user across rseq instead.

If those end up showing up in benchmarks, the proper approach would be to
use user_access_begin() / unsafe_{get,put}_user() / user_access_end()
anyway.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Thomas Gleixner
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra
Cc: "Paul E . McKenney"
Cc: Boqun Feng
Cc: Andy Lutomirski
Cc: Dave Watson
Cc: Paul Turner
Cc: Andrew Morton
Cc: Russell King
Cc: "H . Peter Anvin"
Cc: Andi Kleen
Cc: Chris Lameter
Cc: Ben Maurer
Cc: Steven Rostedt
Cc: Josh Triplett
Cc: Linus Torvalds
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Michael Kerrisk
Cc: Joel Fernandes
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com

Mathieu Desnoyers
2018-07-11 04:18:52 +0800
e96d71359 rseq: Use __u64 for rseq_cs fields, validate user inputs

Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
than ignoring the 32 upper bits on 32-bit kernels. This ensures we have a
consistent behavior for a 32-bit binary executed on 32-bit kernels and in
compat mode on 64-bit kernels.

Validating the value of abort_ip field to be below TASK_SIZE ensures the
kernel does not return to an invalid address when returning to userspace
after an abort. I don't fully trust each architecture code to consistently
deal with invalid return addresses.

Validating the value of the start_ip and post_commit_offset fields
prevents overflow on arithmetic performed on those values, used to
check whether abort_ip is within the rseq critical section.

If validation fails, the process is killed with a segmentation fault.

When the signature encountered before abort_ip does not match the expected
signature, return -EINVAL rather than -EPERM to be consistent with other
input validation return codes from rseq_get_rseq_cs().
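
The checks described above can be modelled in user-space roughly as
follows. This is a sketch, not the kernel function: struct model_rseq_cs
keeps only the three fields named in this message, and
model_validate_rseq_cs() and MODEL_TASK_SIZE are hypothetical, with an
arbitrary example limit.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: arbitrary example limit for the user address space. */
#define MODEL_TASK_SIZE UINT64_C(0x0000800000000000)

/* Hypothetical reduced descriptor with the fields discussed here. */
struct model_rseq_cs {
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
};

static int model_validate_rseq_cs(const struct model_rseq_cs *cs)
{
    /* All addresses must lie below the end of user address space. */
    if (cs->start_ip >= MODEL_TASK_SIZE ||
        cs->start_ip + cs->post_commit_offset >= MODEL_TASK_SIZE ||
        cs->abort_ip >= MODEL_TASK_SIZE)
        return -EINVAL;

    /* start_ip + post_commit_offset must not wrap around. */
    if (cs->start_ip + cs->post_commit_offset < cs->start_ip)
        return -EINVAL;

    /*
     * abort_ip must not be inside the critical section. The unsigned
     * subtraction wraps to a huge value when abort_ip < start_ip, so a
     * single comparison covers both sides of the range.
     */
    if (cs->abort_ip - cs->start_ip < cs->post_commit_offset)
        return -EINVAL;

    return 0;
}
```

In this model the caller would turn a non-zero return into the
segmentation fault mentioned above.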

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Thomas Gleixner
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra
Cc: "Paul E . McKenney"
Cc: Boqun Feng
Cc: Andy Lutomirski
Cc: Dave Watson
Cc: Paul Turner
Cc: Andrew Morton
Cc: Russell King
Cc: "H . Peter Anvin"
Cc: Andi Kleen
Cc: Chris Lameter
Cc: Ben Maurer
Cc: Steven Rostedt
Cc: Josh Triplett
Cc: Linus Torvalds
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Michael Kerrisk
Cc: Joel Fernandes
Cc: "Paul E. McKenney"
Cc: "H. Peter Anvin"
Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com

Mathieu Desnoyers
2018-07-11 04:18:51 +0800

23 Jun, 2018

1 commit

784e0300f rseq: Avoid infinite recursion when delivering SIGSEGV

When delivering a signal to a task that is using rseq, we call into
__rseq_handle_notify_resume() so that the registers pushed in the
sigframe are updated to reflect the state of the restartable sequence
(for example, ensuring that the signal returns to the abort handler if
necessary).

However, if the rseq management fails due to an unrecoverable fault when
accessing userspace or certain combinations of RSEQ_CS_* flags, then we
will attempt to deliver a SIGSEGV. This has the potential for infinite
recursion if the rseq code continuously fails on signal delivery.

Avoid this problem by using force_sigsegv() instead of force_sig(), which
is explicitly designed to reset the SEGV handler to SIG_DFL in the case
of a recursive fault. In doing so, remove rseq_signal_deliver() from the
internal rseq API and have an optional struct ksignal * parameter to
rseq_handle_notify_resume() instead.
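
The recursion-breaking behaviour can be pictured with a toy user-space
model. Everything here is hypothetical (struct model_task,
model_force_sigsegv(), model_user_handler()); it only mimics the idea
that a failure while delivering SIGSEGV resets the handler to SIG_DFL
before SIGSEGV is forced.

```c
#include <signal.h>

/* Hypothetical per-task state: the SIGSEGV handler and the forced signal. */
struct model_task {
    void (*segv_handler)(int);
    int forced_sig;
};

/* Placeholder for a handler installed by the application. */
static void model_user_handler(int sig)
{
    (void)sig;
}

/*
 * sig is the signal whose delivery was in progress when rseq handling
 * failed. If that signal was already SIGSEGV, fall back to the default
 * action so the user handler cannot be re-entered forever.
 */
static void model_force_sigsegv(struct model_task *t, int sig)
{
    if (sig == SIGSEGV)
        t->segv_handler = SIG_DFL;
    t->forced_sig = SIGSEGV;
}
```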

Signed-off-by: Will Deacon
Signed-off-by: Thomas Gleixner
Acked-by: Mathieu Desnoyers
Cc: peterz@infradead.org
Cc: paulmck@linux.vnet.ibm.com
Cc: boqun.feng@gmail.com
Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com

Will Deacon
2018-06-23 01:04:22 +0800

06 Jun, 2018

1 commit

d7822b1e2 rseq: Introduce restartable sequences system call

Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work was started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitate porting to other
architectures and speed up the user-space fast path.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                          344.0            31.4       11.0
x86-64:                          15.3             2.0        7.7
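
For orientation, the "getcpu+atomic" baseline could look roughly like
the sketch below. The benchmark source is not part of this message, so
this is an assumption about its general shape: model_counters,
model_counter_inc(), model_counter_sum() and MODEL_MAX_CPUS are
hypothetical names, and the CPU number is fetched with the getcpu system
call on Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for the syscall() declaration in strict modes */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: upper bound on CPU numbers for this sketch. */
#define MODEL_MAX_CPUS 1024

static uint64_t model_counters[MODEL_MAX_CPUS];

/*
 * Look up the current CPU, then increment that CPU's slot. The thread
 * may migrate between the two steps, so the increment has to be atomic.
 * That atomic operation is the cost rseq aims to avoid.
 */
static void model_counter_inc(void)
{
    unsigned int cpu = 0;

    if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0 || cpu >= MODEL_MAX_CPUS)
        cpu = 0; /* fall back to slot 0 if the lookup fails */
    __atomic_fetch_add(&model_counters[cpu], 1, __ATOMIC_RELAXED);
}

/* Sum all per-CPU slots to obtain the counter value. */
static uint64_t model_counter_sum(void)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < MODEL_MAX_CPUS; i++)
        sum += __atomic_load_n(&model_counters[i], __ATOMIC_RELAXED);
    return sum;
}
```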

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                         2502.0          2250.0        1.1
x86-64:                         117.4            98.0        1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                          751.0           128.5        5.8
x86-64:                          53.4            28.6        1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload response time improves by 1-2% in average
latency, and the overall P99 latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
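
The data flow can be pictured with a small simulation. This is a
user-space toy, not the kernel mechanism: struct model_rseq_area keeps
only the cpu_id field named above, and model_notify_resume() merely
plays the role of the kernel's notify-resume handler; both names, and
model_read_cpu(), are hypothetical.

```c
#include <stdint.h>

/* Hypothetical reduced view of the registered memory area. */
struct model_rseq_area {
    uint32_t cpu_id;
};

/*
 * Stand-in for the kernel side: after preemption, on return to
 * user-space, the current CPU number is written into the area.
 */
static void model_notify_resume(struct model_rseq_area *area, uint32_t cpu)
{
    area->cpu_id = cpu;
}

/*
 * User-space side: a plain load from memory, with no system call. The
 * volatile access keeps the compiler from caching a stale value.
 */
static uint32_t model_read_cpu(const struct model_rseq_area *area)
{
    return *(const volatile uint32_t *)&area->cpu_id;
}
```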

Keeping the current cpu id in a memory area shared between kernel and
user-space has the following benefits over the mechanisms currently
available to read the current CPU number:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.: 41.37 s
std.dev.: 0.36 s

* CONFIG_RSEQ=y

avg.: 40.46 s
std.dev.: 0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra (Intel)
Cc: Joel Fernandes
Cc: Catalin Marinas
Cc: Dave Watson
Cc: Will Deacon
Cc: Andi Kleen
Cc: "H . Peter Anvin"
Cc: Chris Lameter
Cc: Russell King
Cc: Andrew Hunter
Cc: Michael Kerrisk
Cc: "Paul E . McKenney"
Cc: Paul Turner
Cc: Boqun Feng
Cc: Josh Triplett
Cc: Steven Rostedt
Cc: Ben Maurer
Cc: Alexander Viro
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski
Cc: Andrew Morton
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

Mathieu Desnoyers
2018-06-06 17:58:31 +0800