25 Dec, 2019

1 commit

  • It is preferable to reject unknown flags within rseq unregistration
    rather than to ignore them. It is an oversight caused by the fact that
    the check for unknown flags is after the rseq unregister flag check.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20191211161713.4490-2-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
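
The ordering issue described in the commit above can be sketched in user-space C. This is only a model under stated assumptions, not the kernel's code: `model_check_flags`, `enum model_action` and `MODEL_RSEQ_FLAG_UNREGISTER` are hypothetical names, and the flag is assumed to occupy bit 0. It shows unknown flags being rejected inside the unregister branch rather than only after it.

```c
#include <errno.h>

/* Hypothetical stand-in for RSEQ_FLAG_UNREGISTER (assumed to be bit 0). */
#define MODEL_RSEQ_FLAG_UNREGISTER (1U << 0)

enum model_action { MODEL_REGISTER, MODEL_UNREGISTER };

/*
 * Decide what a call with the given flags would do.
 * Returns 0 and fills *action, or -EINVAL when any unknown flag is set.
 */
int model_check_flags(unsigned int flags, enum model_action *action)
{
    if (flags & MODEL_RSEQ_FLAG_UNREGISTER) {
        /* Reject unknown bits before acting on the unregister request. */
        if (flags & ~MODEL_RSEQ_FLAG_UNREGISTER)
            return -EINVAL;
        *action = MODEL_UNREGISTER;
        return 0;
    }
    /* Registration path: no flag bits are accepted in this model. */
    if (flags)
        return -EINVAL;
    *action = MODEL_REGISTER;
    return 0;
}
```

If the unknown-flag check only sat after the unregister branch, a call combining the unregister flag with an unknown bit would be treated as a plain unregistration, which is the oversight the commit describes.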
     

27 May, 2019

2 commits


19 Apr, 2019

2 commits

  • The rseq system call, when invoked with flags of "0" or
    "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
    be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
    specified in uapi linux/rseq.h.

    Expecting a fixed size for rseq_len is a design choice that ensures
    multiple libraries and applications defining __rseq_abi in the same
    process agree on its exact size.

    Considering that this size is and will always be the same value, there
    is no point in saving this value within task_struct rseq_len. Remove
    this field from task_struct.

    No change in functionality intended.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Ben Maurer
    Cc: Boqun Feng
    Cc: Catalin Marinas
    Cc: Chris Lameter
    Cc: Dave Watson
    Cc: H. Peter Anvin
    Cc: Joel Fernandes
    Cc: Josh Triplett
    Cc: Linus Torvalds
    Cc: Michael Kerrisk
    Cc: Paul E. McKenney
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/20190305194755.2602-3-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
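
A minimal sketch of the fixed-size argument made above, written as user-space C. `struct model_rseq` and `model_check_rseq_len` are hypothetical names; the field layout and the 32-byte size are assumptions about the uapi structure at the time of this commit, not something stated in the commit message. The alignment attribute is GCC/Clang syntax.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed mirror of the fixed-layout structure (illustration only). */
struct model_rseq {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;
    uint32_t flags;
} __attribute__((aligned(4 * sizeof(uint64_t))));

/* The alignment pads the structure to a single, constant size. */
_Static_assert(sizeof(struct model_rseq) == 32,
               "model_rseq is expected to be 32 bytes in this sketch");

/* Accept only the one permitted length; anything else is invalid. */
int model_check_rseq_len(uint32_t rseq_len)
{
    return rseq_len == sizeof(struct model_rseq) ? 0 : -EINVAL;
}
```

Because the accepted length is a compile-time constant, storing it per task adds no information, which is the reasoning behind dropping the task_struct field.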
     
  • The "event counter" was removed from rseq before it was merged upstream.
    However, a few comments in the source code still refer to it. Adapt the
    comments to match reality.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Ben Maurer
    Cc: Boqun Feng
    Cc: Catalin Marinas
    Cc: Chris Lameter
    Cc: Dave Watson
    Cc: H. Peter Anvin
    Cc: Joel Fernandes
    Cc: Josh Triplett
    Cc: Linus Torvalds
    Cc: Michael Kerrisk
    Cc: Paul E. McKenney
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/20190305194755.2602-2-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Jul, 2018

4 commits

  • Declaring the rseq_cs field as a union between __u64 and two __u32
    allows both 32-bit and 64-bit kernels to read the full __u64, and
    therefore validate that a 32-bit user-space cleared the upper 32
    bits, thus ensuring a consistent behavior between native 32-bit
    kernels and 32-bit compat tasks on 64-bit kernels.

    Check that the rseq_cs value read is < TASK_SIZE.

    The asm/byteorder.h header needs to be included by rseq.h, now
    that it is not using linux/types_32_64.h anymore.

    Considering that only __u32 and __u64 types are declared in linux/rseq.h,
    the linux/types.h header should always be included for both kernel and
    user-space code: including stdint.h is just for u64 and u32, which are
    not used in this header at all.

    Use copy_from_user()/clear_user() to interact with a 64-bit field,
    because arm32 does not implement 64-bit __get_user, and ppc32 does not
    implement 64-bit get_user. Considering that the rseq_cs pointer does not need to
    be loaded/stored with single-copy atomicity from the kernel anymore, we
    can simply use copy_from_user()/clear_user().

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Cc: linux-api@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: "Paul E . McKenney"
    Cc: Boqun Feng
    Cc: Andy Lutomirski
    Cc: Dave Watson
    Cc: Paul Turner
    Cc: Andrew Morton
    Cc: Russell King
    Cc: "H . Peter Anvin"
    Cc: Andi Kleen
    Cc: Chris Lameter
    Cc: Ben Maurer
    Cc: Steven Rostedt
    Cc: Josh Triplett
    Cc: Linus Torvalds
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Kerrisk
    Cc: Joel Fernandes
    Cc: "Paul E. McKenney"
    Cc: "H. Peter Anvin"
    Link: https://lkml.kernel.org/r/20180709195155.7654-5-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
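
The union described in the commit above can be sketched in user-space C as follows. This is an illustrative model, not the uapi header: `union model_rseq_cs_field` and `model_read_rseq_cs` are hypothetical names, `memcpy` stands in for copy_from_user(), and the address-space limit is passed in as a parameter instead of using TASK_SIZE. The `__BYTE_ORDER__` macros are GCC/Clang predefines.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* One 64-bit view of the field, plus a 32-bit pointer with explicit padding. */
union model_rseq_cs_field {
    uint64_t ptr64;
    struct {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
        uint32_t padding;   /* upper 32 bits, must be cleared by user-space */
        uint32_t ptr32;
#else
        uint32_t ptr32;
        uint32_t padding;   /* upper 32 bits, must be cleared by user-space */
#endif
    } ptr;
};

/*
 * Read the whole 64-bit field and reject values at or above task_size.
 * Returns 0 and stores the value in *out, or -EINVAL.
 */
int model_read_rseq_cs(const union model_rseq_cs_field *user_field,
                       uint64_t task_size, uint64_t *out)
{
    uint64_t value;

    memcpy(&value, user_field, sizeof(value)); /* models copy_from_user() */
    if (value >= task_size)
        return -EINVAL;
    *out = value;
    return 0;
}
```

Because the full 64 bits are always read, a 32-bit task that leaves garbage in the padding word is rejected in the same way whether it runs natively or in compat mode.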
     
  • Update rseq uapi header comments to reflect that user-space needs to do
    thread-local loads/stores from/to the struct rseq fields.

    As a consequence of this added requirement, the kernel does not need
    to perform loads/stores with single-copy atomicity.

    Update the comment associated with the "flags" field to describe
    more accurately that it's only useful to facilitate single-stepping
    through rseq critical sections with debuggers.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Cc: linux-api@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: "Paul E . McKenney"
    Cc: Boqun Feng
    Cc: Andy Lutomirski
    Cc: Dave Watson
    Cc: Paul Turner
    Cc: Andrew Morton
    Cc: Russell King
    Cc: "H . Peter Anvin"
    Cc: Andi Kleen
    Cc: Chris Lameter
    Cc: Ben Maurer
    Cc: Steven Rostedt
    Cc: Josh Triplett
    Cc: Linus Torvalds
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Kerrisk
    Cc: Joel Fernandes
    Cc: "Paul E. McKenney"
    Cc: "H. Peter Anvin"
    Link: https://lkml.kernel.org/r/20180709195155.7654-4-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     
  • __get_user()/__put_user() is used to read values for address ranges that
    were already checked with access_ok() on rseq registration.

    It has been recognized that __get_user/__put_user are optimizing the
    wrong thing. Replace them by get_user/put_user across rseq instead.

    If those end up showing up in benchmarks, the proper approach would be to
    use user_access_begin() / unsafe_{get,put}_user() / user_access_end()
    anyway.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Cc: linux-api@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: "Paul E . McKenney"
    Cc: Boqun Feng
    Cc: Andy Lutomirski
    Cc: Dave Watson
    Cc: Paul Turner
    Cc: Andrew Morton
    Cc: Russell King
    Cc: "H . Peter Anvin"
    Cc: Andi Kleen
    Cc: Chris Lameter
    Cc: Ben Maurer
    Cc: Steven Rostedt
    Cc: Josh Triplett
    Cc: Linus Torvalds
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Kerrisk
    Cc: Joel Fernandes
    Cc: linux-arm-kernel@lists.infradead.org
    Link: https://lkml.kernel.org/r/20180709195155.7654-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     
  • Change the rseq ABI so rseq_cs start_ip, post_commit_offset and abort_ip
    fields are seen as 64-bit fields by both 32-bit and 64-bit kernels rather
    than ignoring the 32 upper bits on 32-bit kernels. This ensures we have a
    consistent behavior for a 32-bit binary executed on 32-bit kernels and in
    compat mode on 64-bit kernels.

    Validating the value of abort_ip field to be below TASK_SIZE ensures the
    kernel doesn't return to an invalid address when returning to userspace
    after an abort. I don't fully trust each architecture code to consistently
    deal with invalid return addresses.

    Validating the value of the start_ip and post_commit_offset fields
    prevents overflow on arithmetic performed on those values, used to
    check whether abort_ip is within the rseq critical section.

    If validation fails, the process is killed with a segmentation fault.

    When the signature encountered before abort_ip does not match the expected
    signature, return -EINVAL rather than -EPERM to be consistent with other
    input validation return codes from rseq_get_rseq_cs().

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Cc: linux-api@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: "Paul E . McKenney"
    Cc: Boqun Feng
    Cc: Andy Lutomirski
    Cc: Dave Watson
    Cc: Paul Turner
    Cc: Andrew Morton
    Cc: Russell King
    Cc: "H . Peter Anvin"
    Cc: Andi Kleen
    Cc: Chris Lameter
    Cc: Ben Maurer
    Cc: Steven Rostedt
    Cc: Josh Triplett
    Cc: Linus Torvalds
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Kerrisk
    Cc: Joel Fernandes
    Cc: "Paul E. McKenney"
    Cc: "H. Peter Anvin"
    Link: https://lkml.kernel.org/r/20180709195155.7654-2-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
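
The validation steps listed in the commit above can be sketched as a user-space C model. The names `struct model_rseq_cs` and `model_validate_rseq_cs` are hypothetical, the address-space limit is a parameter standing in for TASK_SIZE, and the exact order of checks is an assumption made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Model of a critical-section descriptor with 64-bit address fields. */
struct model_rseq_cs {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
};

/* Returns 0 if the descriptor looks sane, -EINVAL otherwise. */
int model_validate_rseq_cs(const struct model_rseq_cs *cs, uint64_t task_size)
{
    uint64_t end_ip = cs->start_ip + cs->post_commit_offset;

    /* Both entry points must be user-space addresses. */
    if (cs->start_ip >= task_size || cs->abort_ip >= task_size)
        return -EINVAL;
    /* Unsigned wraparound: the end must not fall below the start. */
    if (end_ip < cs->start_ip)
        return -EINVAL;
    if (end_ip >= task_size)
        return -EINVAL;
    /* abort_ip must lie outside [start_ip, start_ip + post_commit_offset). */
    if (cs->abort_ip - cs->start_ip < cs->post_commit_offset)
        return -EINVAL;
    return 0;
}
```

The last comparison relies on unsigned subtraction: when abort_ip is below start_ip the difference wraps to a very large value, so a single comparison covers both sides of the range.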
     

23 Jun, 2018

1 commit

  • When delivering a signal to a task that is using rseq, we call into
    __rseq_handle_notify_resume() so that the registers pushed in the
    sigframe are updated to reflect the state of the restartable sequence
    (for example, ensuring that the signal returns to the abort handler if
    necessary).

    However, if the rseq management fails due to an unrecoverable fault when
    accessing userspace or certain combinations of RSEQ_CS_* flags, then we
    will attempt to deliver a SIGSEGV. This has the potential for infinite
    recursion if the rseq code continuously fails on signal delivery.

    Avoid this problem by using force_sigsegv() instead of force_sig(), which
    is explicitly designed to reset the SEGV handler to SIG_DFL in the case
    of a recursive fault. In doing so, remove rseq_signal_deliver() from the
    internal rseq API and have an optional struct ksignal * parameter to
    rseq_handle_notify_resume() instead.

    Signed-off-by: Will Deacon
    Signed-off-by: Thomas Gleixner
    Acked-by: Mathieu Desnoyers
    Cc: peterz@infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Cc: boqun.feng@gmail.com
    Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com

    Will Deacon
     

06 Jun, 2018

1 commit

  • Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.

    * Restartable sequences (per-cpu atomics)

    Restartable sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.

    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.

    Here are benchmarks of various rseq use-cases.

    Test hardware:

    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

    The following benchmarks were all performed on a single thread.

    * Per-CPU statistic counter increment

              getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:                    344.0            31.4       11.0
    x86-64:                    15.3             2.0        7.7

    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
    per-cpu buffer

              getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:                   2502.0          2250.0        1.1
    x86-64:                   117.4            98.0        1.2

    * liburcu percpu: lock-unlock pair, dereference, read/compare word

              getcpu+atomic (ns/op)    rseq (ns/op)    speedup
    arm32:                    751.0           128.5        5.8
    x86-64:                    53.4            28.6        1.9

    * jemalloc memory allocator adapted to use rseq

    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):

    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.

    * Reading the current CPU number

    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up to date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.

    Keeping the current cpu id in a memory area shared between kernel and
    user-space improves on the mechanisms currently available to read the
    current CPU number. It has the following benefits over alternative
    approaches:

    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
    executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
    assembly, which makes it a useful building block for restartable
    sequences.
    - The approach of reading the cpu id through memory mapping shared
    between kernel and user-space is portable (e.g. ARM), which is not the
    case for the lsl-based x86 vdso.

    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.

    Benchmarking various approaches for reading the current CPU number:

    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop): 8.4 ns
    - Read CPU from rseq cpu_id: 16.7 ns
    - Read CPU from rseq cpu_id (lazy register): 19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
    - getcpu system call: 234.9 ns

    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop): 0.8 ns
    - Read CPU from rseq cpu_id: 0.8 ns
    - Read CPU from rseq cpu_id (lazy register): 0.8 ns
    - Read using gs segment selector: 0.8 ns
    - "lsl" inline assembly: 13.0 ns
    - glibc 2.19-0ubuntu6 getcpu: 16.6 ns
    - getcpu system call: 53.9 ns

    - Speed (benchmark taken on v8 of patchset)

    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.

    * CONFIG_RSEQ=n

    avg.: 41.37 s
    std.dev.: 0.36 s

    * CONFIG_RSEQ=y

    avg.: 40.46 s
    std.dev.: 0.33 s

    - Size

    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.

    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Joel Fernandes
    Cc: Catalin Marinas
    Cc: Dave Watson
    Cc: Will Deacon
    Cc: Andi Kleen
    Cc: "H . Peter Anvin"
    Cc: Chris Lameter
    Cc: Russell King
    Cc: Andrew Hunter
    Cc: Michael Kerrisk
    Cc: "Paul E . McKenney"
    Cc: Paul Turner
    Cc: Boqun Feng
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Ben Maurer
    Cc: Alexander Viro
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
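
The CPU-number caching mechanism described in the commit above can be sketched as a small user-space C model. Nothing here is kernel code: `struct model_rseq_area`, `struct model_thread` and the three functions are hypothetical names, and the boolean `notify_resume` merely stands in for the TIF_NOTIFY_RESUME flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the per-thread memory area registered with the kernel. */
struct model_rseq_area {
    uint32_t cpu_id;
};

/* Model of the kernel-side view of one thread. */
struct model_thread {
    uint32_t running_cpu;          /* CPU the thread is scheduled on */
    bool notify_resume;            /* stands in for TIF_NOTIFY_RESUME */
    struct model_rseq_area *rseq;  /* registered area, or NULL */
};

/* Preemption: record the new CPU and request work on return to user-space. */
void model_preempt(struct model_thread *t, uint32_t next_cpu)
{
    t->running_cpu = next_cpu;
    t->notify_resume = true;
}

/* Return to user-space: the notify-resume handler refreshes the cached id. */
void model_return_to_user(struct model_thread *t)
{
    if (!t->notify_resume)
        return;
    t->notify_resume = false;
    if (t->rseq)
        t->rseq->cpu_id = t->running_cpu;
}

/* User-space side: reading the CPU number is a plain memory load. */
uint32_t model_read_cpu(const struct model_rseq_area *area)
{
    return area->cpu_id;
}
```

In this model the cached value is refreshed only on the return-to-user path, so the read itself needs no system call, which is the property the benchmarks above measure.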