19 Oct, 2020

1 commit

  • There is usecase that System Management Software(SMS) want to give a
    memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
    case of Android, it is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon(ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall
    process_madvise(2). It uses pidfd of an external process to give the
    hint. It also supports vector address range because Android app has
    thousands of vmas due to zygote so it's totally waste of CPU and power if
    we should call the syscall one by one for each vma.(With testing 2000-vma
    syscall vs 1-vector syscall, it showed 15% performance improvement. I
    think it would be bigger in real practice because the testing ran very
    cache friendly environment).

    Another potential use case for the vector range is to amortize the cost
    ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
    benefit users like TCP receive zerocopy and malloc implementations. In
    future, we could find more usecases for other advises so let's make it
    happens as API since we introduce a new syscall at this moment. With
    that, existing madvise(2) user could replace it with process_madvise(2)
    with their own pid if they want to have batch address ranges support
    feature.

    ince it could affect other process's address range, only privileged
    process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
    UID) gives it the right to ptrace the process could use it successfully.
    The flag argument is reserved for future use if we need to extend the API.

    I think supporting all hints madvise has/will supported/support to
    process_madvise is rather risky. Because we are not sure all hints make
    sense from external process and implementation for the hint may rely on
    the caller being in the current context so it could be error-prone. Thus,
    I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

    If someone want to add other hints, we could hear the usecase and review
    it for each hint. It's safer for maintenance rather than introducing a
    buggy syscall but hard to fix it later.

    So finally, the API is as follows,

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
    unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about the address ranges from external process as well as
    local process. It provides the advice to address ranges of process
    described by iovec and vlen. The goal of such advice is to improve
    system or application performance.

    The pidfd selects the process referred to by the PID file descriptor
    specified in pidfd. (See pidofd_open(2) for further information)

    The pointer iovec points to an array of iovec structures, defined in
    as:

    struct iovec {
    void *iov_base; /* starting address */
    size_t iov_len; /* number of bytes to be advised */
    };

    The iovec describes address ranges beginning at address(iov_base)
    and with size length of bytes(iov_len).

    The vlen represents the number of elements in iovec.

    The advice is indicated in the advice argument, which is one of the
    following at this moment if the target process specified by pidfd is
    external.

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    The process_madvise supports every advice madvise(2) has if target
    process is in same thread group with calling process so user could
    use process_madvise(2) to extend existing madvise(2) to support
    vector address ranges.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised.
    This return value may be less than the total number of requested
    bytes, if an error occurred. The caller should check return value
    to determine whether a partial advice occurred.

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How to guarantee the race(i.e., object validation) between when
    giving a hint from an external process and get the hint from the target
    process?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the space
    target process can run between the time the process_madvise process
    inspects the target process address space and the time that
    process_madvise is actually called, process_madvise may operate on
    memory regions that the calling process does not expect. It's the
    responsibility of the process calling process_madvise to close this
    race condition. For example, the calling process can suspend the
    target process with ptrace, SIGSTOP, or the freezer cgroup so that it
    doesn't have an opportunity to change its own address space before
    process_madvise is called. Another option is to operate on memory
    regions that the caller knows a priori will be unchanged in the target
    process. Yet another option is to accept the race for certain
    process_madvise calls after reasoning that mistargeting will do no
    harm. The suggested API itself does not provide synchronization. It
    also apply other APIs like move_pages, process_vm_write.

    The race isn't really a problem though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody
    objects to write(2) merely because it's possible for two processes to
    open the same file and clobber each other's writes --- instead, we tell
    people to use flock or something. Think about mmap. It never
    guarantees newly allocated address space is still valid when the user
    tries to access it because other threads could unmap the memory right
    before. That's where we need synchronization by using other API or
    design from userside. It shouldn't be part of API itself. If someone
    needs more fine-grained synchronization rather than process level,
    there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
    applicable via using last reserved argument of the API but I don't
    think it's necessary right now since we have already ways to prevent
    the race so don't want to add additional complexity with more
    fine-grained optimization model.

    To make the API extend, it reserved an unsigned long as last argument
    so we could support it in future if someone really needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such injected madvise would have to be executed by the
    target process, which means that process would have to be runnable and
    that creates the risk of the abovementioned race and hinting a wrong
    VMA. Furthermore, we want to act the hint in caller's context, not the
    callee's, because the callee is usually limited in cpuset/cgroups or
    even freezed state so they can't act by themselves quick enough, which
    causes more thrashing/kill. It doesn't work if the target process are
    ptraced(e.g., strace, debugger, minidump) because a process can have at
    most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

18 Sep, 2020

1 commit


15 Aug, 2020

1 commit

  • Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"),
    sys_sysctl is actually unavailable: any input can only return an error.

    We have been warning about people using the sysctl system call for years
    and believe there are no more users. Even if there are users of this
    interface if they have not complained or fixed their code by now they
    probably are not going to, so there is no point in warning them any
    longer.

    So completely remove sys_sysctl on all architectures.

    [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu]
    Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com

    Signed-off-by: Xiaoming Ni
    Signed-off-by: Andrew Morton
    Acked-by: Will Deacon [arm/arm64]
    Acked-by: "Eric W. Biederman"
    Cc: Aleksa Sarai
    Cc: Alexander Shishkin
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Bin Meng
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Catalin Marinas
    Cc: chenzefeng
    Cc: Christian Borntraeger
    Cc: Christian Brauner
    Cc: Chris Zankel
    Cc: David Howells
    Cc: David S. Miller
    Cc: Diego Elio Pettenò
    Cc: Dmitry Vyukov
    Cc: Dominik Brodowski
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Iurii Zaikin
    Cc: Ivan Kokshaysky
    Cc: James Bottomley
    Cc: Jens Axboe
    Cc: Jiri Olsa
    Cc: Kars de Jong
    Cc: Kees Cook
    Cc: Krzysztof Kozlowski
    Cc: Luis Chamberlain
    Cc: Marco Elver
    Cc: Mark Rutland
    Cc: Martin K. Petersen
    Cc: Masahiro Yamada
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Miklos Szeredi
    Cc: Minchan Kim
    Cc: Namhyung Kim
    Cc: Naveen N. Rao
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Olof Johansson
    Cc: Paul Burton
    Cc: "Paul E. McKenney"
    Cc: Paul Mackerras
    Cc: Peter Zijlstra (Intel)
    Cc: Randy Dunlap
    Cc: Ravi Bangoria
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Sami Tolvanen
    Cc: Sargun Dhillon
    Cc: Stephen Rothwell
    Cc: Sudeep Holla
    Cc: Sven Schnelle
    Cc: Thiago Jung Bauermann
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Yoshinori Sato
    Cc: Zhou Yanjie
    Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com
    Signed-off-by: Linus Torvalds

    Xiaoming Ni
     

15 Nov, 2019

1 commit

  • At the moment, the compilation of the old time32 system calls depends
    purely on the architecture. As systems with new libc based on 64-bit
    time_t are getting deployed, even architectures that previously supported
    these (notably x86-32 and arm32 but also many others) no longer depend on
    them, and removing them from a kernel image results in a smaller kernel
    binary, the same way we can leave out many other optional system calls.

    More importantly, on an embedded system that needs to keep working
    beyond year 2038, any user space program calling these system calls
    is likely a bug, so removing them from the kernel image does provide
    an extra debugging help for finding broken applications.

    I've gone back and forth on hiding this option unless CONFIG_EXPERT
    is set. This version leaves it visible based on the logic that
    eventually it will be turned off indefinitely.

    Acked-by: Christian Brauner
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

21 Jun, 2019

1 commit

  • This cleanly handles arches who do not yet define clone3.

    clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the
    assumption that this would cleanly handle all architectures. It does
    not.
    Architectures such as nios2 or h8300 simply take the asm-generic syscall
    definitions and generate their syscall table from it. Since they don't
    define __ARCH_WANT_SYS_CLONE the build would fail complaining about
    sys_clone3 missing. The reason this doesn't happen for legacy clone is
    that nios2 and h8300 provide assembly stubs for sys_clone. This seems to
    be done for architectural reasons.

    The build failures for nios2 and h8300 were caught int -next luckily.
    The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can
    add. Additionally, we need a cond_syscall(clone3) for architectures such
    as nios2 or h8300 that generate their syscall table in the way I
    explained above.

    Fixes: 8f3220a80654 ("arch: wire-up clone3() syscall")
    Signed-off-by: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Kees Cook
    Cc: David Howells
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Adrian Reber
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Florian Weimer
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: x86@kernel.org

    Christian Brauner
     

07 May, 2019

1 commit

  • Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
    patch pidfd_send_signal() becomes independent of procfs. This fullfils
    the request made when we merged the pidfd_send_signal() patchset. The
    pidfd_send_signal() syscall is now always available allowing for it to
    be used by users without procfs mounted or even users without procfs
    support compiled into the kernel.

    Signed-off-by: Christian Brauner
    Co-developed-by: Jann Horn
    Signed-off-by: Jann Horn
    Acked-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro

    Christian Brauner
     

17 Mar, 2019

1 commit

  • Pull pidfd system call from Christian Brauner:
    "This introduces the ability to use file descriptors from /proc//
    as stable handles on struct pid. Even if a pid is recycled the handle
    will not change. For a start these fds can be used to send signals to
    the processes they refer to.

    With the ability to use /proc/ fds as stable handles on struct
    pid we can fix a long-standing issue where after a process has exited
    its pid can be reused by another process. If a caller sends a signal
    to a reused pid it will end up signaling the wrong process.

    With this patchset we enable a variety of use cases. One obvious
    example is that we can now safely delegate an important part of
    process management - sending signals - to processes other than the
    parent of a given process by sending file descriptors around via scm
    rights and not fearing that the given process will have been recycled
    in the meantime. It also allows for easy testing whether a given
    process is still alive or not by sending signal 0 to a pidfd which is
    quite handy.

    There has been some interest in this feature e.g. from systems
    management (systemd, glibc) and container managers. I have requested
    and gotten comments from glibc to make sure that this syscall is
    suitable for their needs as well. In the future I expect it to take on
    most other pid-based signal syscalls. But such features are left for
    the future once they are needed.

    This has been sitting in linux-next for quite a while and has not
    caused any issues. It comes with selftests which verify basic
    functionality and also test that a recycled pid cannot be signaled via
    a pidfd.

    Jon has written about a prior version of this patchset. It should
    cover the basic functionality since not a lot has changed since then:

    https://lwn.net/Articles/773459/

    The commit message for the syscall itself is extensively documenting
    the syscall, including it's functionality and extensibility"

    * tag 'pidfd-v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    selftests: add tests for pidfd_send_signal()
    signal: add pidfd_send_signal() syscall

    Linus Torvalds
     

09 Mar, 2019

1 commit

  • Pull io_uring IO interface from Jens Axboe:
    "Second attempt at adding the io_uring interface.

    Since the first one, we've added basic unit testing of the three
    system calls, that resides in liburing like the other unit tests that
    we have so far. It'll take a while to get full coverage of it, but
    we're working towards it. I've also added two basic test programs to
    tools/io_uring. One uses the raw interface and has support for all the
    various features that io_uring supports outside of standard IO, like
    fixed files, fixed IO buffers, and polled IO. The other uses the
    liburing API, and is a simplified version of cp(1).

    This adds support for a new IO interface, io_uring.

    io_uring allows an application to communicate with the kernel through
    two rings, the submission queue (SQ) and completion queue (CQ) ring.
    This allows for very efficient handling of IOs, see the v5 posting for
    some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

    Outside of just efficiency, the interface is also flexible and
    extendable, and allows for future use cases like the upcoming NVMe
    key-value store API, networked IO, and so on. It also supports async
    buffered IO, something that we've always failed to support in the
    kernel.

    Outside of basic IO features, it supports async polled IO as well.
    This particular feature has already been tested at Facebook months ago
    for flash storage boxes, with 25-33% improvements. It makes polled IO
    actually useful for real world use cases, where even basic flash sees
    a nice win in terms of efficiency, latency, and performance. These
    boxes were IOPS bound before, now they are not.

    This series adds three new system calls. One for setting up an
    io_uring instance (io_uring_setup(2)), one for submitting/completing
    IO (io_uring_enter(2)), and one for aux functions like registrating
    file sets, buffers, etc (io_uring_register(2)). Through the help of
    Arnd, I've coordinated the syscall numbers so merge on that front
    should be painless.

    Jon did a writeup of the interface a while back, which (except for
    minor details that have been tweaked) is still accurate. Find that
    here:

    https://lwn.net/Articles/776703/

    Huge thanks to Al Viro for helping getting the reference cycle code
    correct, and to Jann Horn for his extensive reviews focused on both
    security and bugs in general.

    There's a userspace library that provides basic functionality for
    applications that don't need or want to care about how to fiddle with
    the rings directly. It has helpers to allow applications to easily set
    up an io_uring instance, and submit/complete IO through it without
    knowing about the intricacies of the rings. It also includes man pages
    (thanks to Jeff Moyer), and will continue to grow support helper
    functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

    Fio has full support for the raw interface, both in the form of an IO
    engine (io_uring), but also with a small test application (t/io_uring)
    that can exercise and benchmark the interface"

    * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
    io_uring: add a few test tools
    io_uring: allow workqueue item to handle multiple buffered requests
    io_uring: add support for IORING_OP_POLL
    io_uring: add io_kiocb ref count
    io_uring: add submission polling
    io_uring: add file set registration
    net: split out functions related to registering inflight socket files
    io_uring: add support for pre-mapped user IO buffers
    block: implement bio helper to add iter bvec pages to bio
    io_uring: batch io_kiocb allocation
    io_uring: use fget/fput_many() for file references
    fs: add fget_many() and fput_many()
    io_uring: support for IO polling
    io_uring: add fsync support
    Add io_uring IO interface

    Linus Torvalds
     

06 Mar, 2019

2 commits

  • As John Stultz noticed, my y2038 syscall series caused a link
    failure when CONFIG_SYSVIPC is disabled but CONFIG_COMPAT is
    enabled:

    arch/arm64/kernel/sys32.o:(.rodata+0x960): undefined reference to `__arm64_compat_sys_old_semctl'
    arch/arm64/kernel/sys32.o:(.rodata+0x980): undefined reference to `__arm64_compat_sys_old_msgctl'
    arch/arm64/kernel/sys32.o:(.rodata+0x9a0): undefined reference to `__arm64_compat_sys_old_shmctl'

    Add the missing entries in kernel/sys_ni.c for the new system
    calls.

    Cc: Laura Abbott
    Cc: John Stultz
    Cc: Thomas Gleixner
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • The kill() syscall operates on process identifiers (pid). After a process
    has exited its pid can be reused by another process. If a caller sends a
    signal to a reused pid it will end up signaling the wrong process. This
    issue has often surfaced and there has been a push to address this problem [1].

    This patch uses file descriptors (fd) from proc/ as stable handles on
    struct pid. Even if a pid is recycled the handle will not change. The fd
    can be used to send signals to the process it refers to.
    Thus, the new syscall pidfd_send_signal() is introduced to solve this
    problem. Instead of pids it operates on process fds (pidfd).

    /* prototype and argument /*
    long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);

    /* syscall number 424 */
    The syscall number was chosen to be 424 to align with Arnd's rework in his
    y2038 to minimize merge conflicts (cf. [25]).

    In addition to the pidfd and signal argument it takes an additional
    siginfo_t and flags argument. If the siginfo_t argument is NULL then
    pidfd_send_signal() is equivalent to kill(, ). If it
    is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
    The flags argument is added to allow for future extensions of this syscall.
    It currently needs to be passed as 0. Failing to do so will cause EINVAL.

    /* pidfd_send_signal() replaces multiple pid-based syscalls */
    The pidfd_send_signal() syscall currently takes on the job of
    rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
    positive pid is passed to kill(2). It will however be possible to also
    replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.

    /* sending signals to threads (tid) and process groups (pgid) */
    Specifically, the pidfd_send_signal() syscall does currently not operate on
    process groups or threads. This is left for future extensions.
    In order to extend the syscall to allow sending signal to threads and
    process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
    PIDFD_TYPE_TID) should be added. This implies that the flags argument will
    determine what is signaled and not the file descriptor itself. Put in other
    words, grouping in this api is a property of the flags argument not a
    property of the file descriptor (cf. [13]). Clarification for this has been
    requested by Eric (cf. [19]).
    When appropriate extensions through the flags argument are added then
    pidfd_send_signal() can additionally replace the part of kill(2) which
    operates on process groups as well as the tgkill(2) and
    rt_tgsigqueueinfo(2) syscalls.
    How such an extension could be implemented has been very roughly sketched
    in [14], [15], and [16]. However, this should not be taken as a commitment
    to a particular implementation. There might be better ways to do it.
    Right now this is intentionally left out to keep this patchset as simple as
    possible (cf. [4]).

    /* naming */
    The syscall had various names throughout iterations of this patchset:
    - procfd_signal()
    - procfd_send_signal()
    - taskfd_send_signal()
    In the last round of reviews it was pointed out that given that if the
    flags argument decides the scope of the signal instead of different types
    of fds it might make sense to either settle for "procfd_" or "pidfd_" as
    prefix. The community was willing to accept either (cf. [17] and [18]).
    Given that one developer expressed strong preference for the "pidfd_"
    prefix (cf. [13]) and with other developers less opinionated about the name
    we should settle for "pidfd_" to avoid further bikeshedding.

    The "_send_signal" suffix was chosen to reflect the fact that the syscall
    takes on the job of multiple syscalls. It is therefore intentional that the
    name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
    fomer because it might imply that pidfd_send_signal() is a replacement for
    kill(2), and not the latter because it is a hassle to remember the correct
    spelling - especially for non-native speakers - and because it is not
    descriptive enough of what the syscall actually does. The name
    "pidfd_send_signal" makes it very clear that its job is to send signals.

    /* zombies */
    Zombies can be signaled just as any other process. No special error will be
    reported since a zombie state is an unreliable state (cf. [3]). However,
    this can be added as an extension through the @flags argument if the need
    ever arises.

    /* cross-namespace signals */
    The patch currently enforces that the signaler and signalee either are in
    the same pid namespace or that the signaler's pid namespace is an ancestor
    of the signalee's pid namespace. This is done for the sake of simplicity
    and because it is unclear to what values certain members of struct
    siginfo_t would need to be set to (cf. [5], [6]).

    /* compat syscalls */
    It became clear that we would like to avoid adding compat syscalls
    (cf. [7]). The compat syscall handling is now done in kernel/signal.c
    itself by adding __copy_siginfo_from_user_generic() which lets us avoid
    compat syscalls (cf. [8]). It should be noted that the addition of
    __copy_siginfo_from_user_any() is caused by a bug in the original
    implementation of rt_sigqueueinfo(2) (cf. 12).
    With upcoming rework for syscall handling things might improve
    significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
    any additional callers.

    /* testing */
    This patch was tested on x64 and x86.

    /* userspace usage */
    An asciinema recording for the basic functionality can be found under [9].
    With this patch a process can be killed via:

    #define _GNU_SOURCE
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
    unsigned int flags)
    {
    #ifdef __NR_pidfd_send_signal
    return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
    #else
    return -ENOSYS;
    #endif
    }

    int main(int argc, char *argv[])
    {
    int fd, ret, saved_errno, sig;

    if (argc < 3)
    exit(EXIT_FAILURE);

    fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
    if (fd < 0) {
    printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
    exit(EXIT_FAILURE);
    }

    sig = atoi(argv[2]);

    printf("Sending signal %d to process %s\n", sig, argv[1]);
    ret = do_pidfd_send_signal(fd, sig, NULL, 0);

    saved_errno = errno;
    close(fd);
    errno = saved_errno;

    if (ret < 0) {
    printf("%s - Failed to send signal %d to process %s\n",
    strerror(errno), sig, argv[1]);
    exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
    }

    /* Q&A
    * Given that it seems the same questions get asked again by people who are
    * late to the party it makes sense to add a Q&A section to the commit
    * message so it's hopefully easier to avoid duplicate threads.
    *
    * For the sake of progress please consider these arguments settled unless
    * there is a new point that desperately needs to be addressed. Please make
    * sure to check the links to the threads in this commit message whether
    * this has not already been covered.
    */
    Q-01: (Florian Weimer [20], Andrew Morton [21])
    What happens when the target process has exited?
    A-01: Sending the signal will fail with ESRCH (cf. [22]).

    Q-02: (Andrew Morton [21])
    Is the task_struct pinned by the fd?
    A-02: No. A reference to struct pid is kept. struct pid - as far as I
    understand - was created exactly for the reason to not require to
    pin struct task_struct (cf. [22]).

    Q-03: (Andrew Morton [21])
    Does the entire procfs directory remain visible? Just one entry
    within it?
    A-03: The same thing that happens right now when you hold a file descriptor
    to /proc/ open (cf. [22]).

    Q-04: (Andrew Morton [21])
    Does the pid remain reserved?
    A-04: No. This patchset guarantees a stable handle not that pids are not
    recycled (cf. [22]).

    Q-05: (Andrew Morton [21])
    Do attempts to signal that fd return errors?
    A-05: See {Q,A}-01.

    Q-06: (Andrew Morton [22])
    Is there a cleaner way of obtaining the fd? Another syscall perhaps.
    A-06: Userspace can already trivially retrieve file descriptors from procfs
    so this is something that we will need to support anyway. Hence,
    there's no immediate need to add another syscalls just to make
    pidfd_send_signal() not dependent on the presence of procfs. However,
    adding a syscalls to get such file descriptors is planned for a
    future patchset (cf. [22]).

    Q-07: (Andrew Morton [21] and others)
    This fd-for-a-process sounds like a handy thing and people may well
    think up other uses for it in the future, probably unrelated to
    signals. Are the code and the interface designed to permit such
    future applications?
    A-07: Yes (cf. [22]).

    Q-08: (Andrew Morton [21] and others)
    Now I think about it, why a new syscall? This thing is looking
    rather like an ioctl?
    A-08: This has been extensively discussed. It was agreed that a syscall is
    preferred for a variety or reasons. Here are just a few taken from
    prior threads. Syscalls are safer than ioctl()s especially when
    signaling to fds. Processes are a core kernel concept so a syscall
    seems more appropriate. The layout of the syscall with its four
    arguments would require the addition of a custom struct for the
    ioctl() thereby causing at least the same amount or even more
    complexity for userspace than a simple syscall. The new syscall will
    replace multiple other pid-based syscalls (see description above).
    The file-descriptors-for-processes concept introduced with this
    syscall will be extended with other syscalls in the future. See also
    [22], [23] and various other threads already linked in here.

    Q-09: (Florian Weimer [24])
    What happens if you use the new interface with an O_PATH descriptor?
    A-09:
    pidfds opened as O_PATH fds cannot be used to send signals to a
    process (cf. [2]). Signaling processes through pidfds is the
    equivalent of writing to a file. Thus, this is not an operation that
    operates "purely at the file descriptor level" as required by the
    open(2) manpage. See also [4].

    /* References */
    [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
    [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
    [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
    [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
    [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
    [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
    [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
    [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
    [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
    [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
    [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
    [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
    [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
    [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
    [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
    [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
    [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
    [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
    [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
    [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
    [23]: https://lwn.net/Articles/773459/
    [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
    [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/

    Cc: "Eric W. Biederman"
    Cc: Jann Horn
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Florian Weimer
    Signed-off-by: Christian Brauner
    Reviewed-by: Tycho Andersen
    Reviewed-by: Kees Cook
    Reviewed-by: David Howells
    Acked-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Acked-by: Serge Hallyn
    Acked-by: Aleksa Sarai

    Christian Brauner
     

28 Feb, 2019

2 commits

  • If we have fixed user buffers, we can map them into the kernel when we
    setup the io_uring. That avoids the need to do get_user_pages() for
    each and every IO.

    To utilize this feature, the application must call io_uring_register()
    after having setup an io_uring instance, passing in
    IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
    an iovec array, and the nr_args should contain how many iovecs the
    application wishes to map.

    If successful, these buffers are now mapped into the kernel, eligible
    for IO. To use these fixed buffers, the application must use the
    IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
    set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
    must point to somewhere inside the indexed buffer.

    The application may register buffers throughout the lifetime of the
    io_uring instance. It can call io_uring_register() with
    IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
    buffers, and then register a new set. The application need not
    unregister buffers explicitly before shutting down the io_uring
    instance.

    It's perfectly valid to setup a larger buffer, and then sometimes only
    use parts of it for an IO. As long as the range is within the originally
    mapped region, it will work just fine.

    For now, buffers must not be file backed. If file backed buffers are
    passed in, the registration will fail with -1/EOPNOTSUPP. This
    restriction may be relaxed in the future.

    RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
    arbitrary 1G per buffer size is also imposed.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The submission queue (SQ) and completion queue (CQ) rings are shared
    between the application and the kernel. This eliminates the need to
    copy data back and forth to submit and complete IO.

    IO submissions use the io_uring_sqe data structure, and completions
    are generated in the form of io_uring_cqe data structures. The SQ
    ring is an index into the io_uring_sqe array, which makes it possible
    to submit a batch of IOs without them being contiguous in the ring.
    The CQ ring is always contiguous, as completion events are inherently
    unordered, and hence any io_uring_cqe entry can point back to an
    arbitrary submission.

    Two new system calls are added for this:

    io_uring_setup(entries, params)
    Sets up an io_uring instance for doing async IO. On success,
    returns a file descriptor that the application can mmap to
    gain access to the SQ ring, CQ ring, and io_uring_sqes.

    io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
    Initiates IO against the rings mapped to this fd, or waits for
    them to complete, or both. The behavior is controlled by the
    parameters passed in. If 'to_submit' is non-zero, then we'll
    try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
    kernel will wait for 'min_complete' events, if they aren't
    already available. It's valid to set IORING_ENTER_GETEVENTS
    and 'min_complete' == 0 at the same time, this allows the
    kernel to return already completed events without waiting
    for them. This is useful only for polling, as for IRQ
    driven IO, the application can just check the CQ ring
    without entering the kernel.

    With this setup, it's possible to do async IO with a single system
    call. Future developments will enable polled IO with this interface,
    and polled submission as well. The latter will enable an application
    to do IO without doing ANY system calls at all.

    For IRQ driven IO, an application only needs to enter the kernel for
    completions if it wants to wait for them to occur.

    Each io_uring is backed by a workqueue, to support buffered async IO
    as well. We will only punt to an async context if the command would
    need to wait for IO on the device side. Any data that can be accessed
    directly in the page cache is done inline. This avoids the slowness
    issue of usual threadpools, since cached data is accessed as quickly
    as a sync interface.

    Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.

    In this step, only 64-bit architectures are changed, doing this rename
    first lets us avoid touching the 32-bit architectures twice.

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

26 Jan, 2019

1 commit

  • The behavior of these system calls is slightly different between
    architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
    symbol. Most architectures that implement the split IPC syscalls don't set
    that symbol and only get the modern version, but alpha, arm, microblaze,
    mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.

    For the architectures that so far only implement sys_ipc(), i.e. m68k,
    mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
    when adding the split syscalls, so we need to distinguish between the
    two groups of architectures.

    The method I picked for this distinction is to have a separate system call
    entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
    does not. The system call tables of the five architectures are changed
    accordingly.

    As an additional benefit, we no longer need the configuration specific
    definition for ipc_parse_version(), it always does the same thing now,
    but simply won't get called on architectures with the modern interface.

    A small downside is that on architectures that do set
    ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
    that are never called. They only add a few bytes of bloat, so it seems
    better to keep them compared to adding yet another Kconfig symbol.
    I considered adding new syscall numbers for the IPC_64 variants for
    consistency, but decided against that for now.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

18 Jan, 2019

1 commit

  • The sys_ipc() and compat_ksys_ipc() functions are meant to only
    be used from the system call table, not called by another function.

    Introduce ksys_*() interfaces for this purpose, as we have done
    for many other system calls.

    Link: https://lore.kernel.org/lkml/20190116131527.2071570-3-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Heiko Carstens
    [heiko.carstens@de.ibm.com: compile fix for !CONFIG_COMPAT]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Arnd Bergmann
     

18 Dec, 2018

1 commit

  • recvmmsg() takes two arguments to pointers of structures that differ
    between 32-bit and 64-bit architectures: mmsghdr and timespec.

    For y2038 compatbility, we are changing the native system call from
    timespec to __kernel_timespec with a 64-bit time_t (in another patch),
    and use the existing compat system call on both 32-bit and 64-bit
    architectures for compatibility with traditional 32-bit user space.

    As we now have two variants of recvmmsg() for 32-bit tasks that are both
    different from the variant that we use on 64-bit tasks, this means we
    also require two compat system calls!

    The solution I picked is to flip things around: The existing
    compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
    and now handles the case for old user space on all architectures that
    have set CONFIG_COMPAT_32BIT_TIME. A new compat_sys_recvmmsg_time64()
    call gets added in the old place for 64-bit architectures only, this
    one handles the case of a compat mmsghdr structure combined with
    __kernel_timespec.

    In the indirect sys_socketcall(), we now need to call either
    do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
    architecture we are on. For compat_sys_socketcall(), no such change is
    needed, we always call __compat_sys_recvmmsg().

    I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
    implementation for 64-bit time_t will need significant changes including
    an updated asm/unistd.h, and it seems better to consistently use the
    separate syscalls that configuration, leaving the socketcall only for
    backward compatibility with 32-bit time_t based libc.

    The naming is asymmetric for the moment, so both existing syscalls
    entry points keep their names, while the new ones are recvmmsg_time32
    and compat_recvmmsg_time64 respectively. I expect that we will rename
    the compat syscalls later as we start using generated syscall tables
    everywhere and add these entry points.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

11 Jun, 2018

1 commit

  • Pull restartable sequence support from Thomas Gleixner:
    "The restartable sequences syscall (finally):

    After a lot of back and forth discussion and massive delays caused by
    the speculative distraction of maintainers, the core set of
    restartable sequences has finally reached a consensus.

    It comes with the basic non disputed core implementation along with
    support for arm, powerpc and x86 and a full set of selftests

    It was exposed to linux-next earlier this week, so it does not fully
    comply with the merge window requirements, but there is really no
    point to drag it out for yet another cycle"

    * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rseq/selftests: Provide Makefile, scripts, gitignore
    rseq/selftests: Provide parametrized tests
    rseq/selftests: Provide basic percpu ops test
    rseq/selftests: Provide basic test
    rseq/selftests: Provide rseq library
    selftests/lib.mk: Introduce OVERRIDE_TARGETS
    powerpc: Wire up restartable sequences system call
    powerpc: Add syscall detection for restartable sequences
    powerpc: Add support for restartable sequences
    x86: Wire up restartable sequence system call
    x86: Add support for restartable sequences
    arm: Wire up restartable sequences system call
    arm: Add syscall detection for restartable sequences
    arm: Add restartable sequences support
    rseq: Introduce restartable sequences system call
    uapi/headers: Provide types_32_64.h

    Linus Torvalds
     

08 Jun, 2018

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "Notable changes:

    - Support for split PMD page table lock on 64-bit Book3S (Power8/9).

    - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
    live patching again.

    - Add support for patching barrier_nospec in copy_from_user() and
    syscall entry.

    - A couple of fixes for our data breakpoints on Book3S.

    - A series from Nick optimising TLB/mm handling with the Radix MMU.

    - Numerous small cleanups to squash sparse/gcc warnings from Mathieu
    Malaterre.

    - Several series optimising various parts of the 32-bit code from
    Christophe Leroy.

    - Removal of support for two old machines, "SBC834xE" and "C2K"
    ("GEFanuc,C2K"), which is why the diffstat has so many deletions.

    And many other small improvements & fixes.

    There's a few out-of-area changes. Some minor ftrace changes OK'ed by
    Steve, and a fix to our powernv cpuidle driver. Then there's a series
    touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
    around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
    been in next for several weeks.

    Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
    Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
    Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
    Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
    Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
    Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
    Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
    Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
    Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
    Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
    Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
    Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
    Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
    Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"

    * tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
    powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
    cpuidle: powernv: Fix promotion from snooze if next state disabled
    powerpc: fix build failure by disabling attribute-alias warning in pci_32
    ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
    powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
    powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
    powerpc/pkeys: Detach execute_only key on !PROT_EXEC
    powerpc/powernv: copy/paste - Mask SO bit in CR
    powerpc: Remove core support for Marvell mv64x60 hostbridges
    powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
    powerpc/boot: Remove support for Marvell mv64x60 i2c controller
    powerpc/boot: Remove support for Marvell MPSC serial controller
    powerpc/embedded6xx: Remove C2K board support
    powerpc/lib: optimise PPC32 memcmp
    powerpc/lib: optimise 32 bits __clear_user()
    powerpc/time: inline arch_vtime_task_switch()
    powerpc/Makefile: set -mcpu=860 flag for the 8xx
    powerpc: Implement csum_ipv6_magic in assembly
    powerpc/32: Optimise __csum_partial()
    powerpc/lib: Adjust .balign inside string functions for PPC32
    ...

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.

    * Restartable sequences (per-cpu atomics)

    Restartables sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.

    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.

    Here are benchmarks of various rseq use-cases.

    Test hardware:

    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

    The following benchmarks were all performed on a single thread.

    * Per-CPU statistic counter increment

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 344.0 31.4 11.0
    x86-64: 15.3 2.0 7.7

    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
    per-cpu buffer

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 2502.0 2250.0 1.1
    x86-64: 117.4 98.0 1.2

    * liburcu percpu: lock-unlock pair, dereference, read/compare word

    getcpu+atomic (ns/op) rseq (ns/op) speedup
    arm32: 751.0 128.5 5.8
    x86-64: 53.4 28.6 1.9

    * jemalloc memory allocator adapted to use rseq

    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):

    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.

    * Reading the current CPU number

    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up do date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.

    Keeping the current cpu id in a memory area shared between kernel and
    user-space is an improvement over current mechanisms available to read
    the current CPU number, which has the following benefits over
    alternative approaches:

    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
    executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
    assembly, which makes it a useful building block for restartable
    sequences.
    - The approach of reading the cpu id through memory mapping shared
    between kernel and user-space is portable (e.g. ARM), which is not the
    case for the lsl-based x86 vdso.

    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.

    Benchmarking various approaches for reading the current CPU number:

    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop): 8.4 ns
    - Read CPU from rseq cpu_id: 16.7 ns
    - Read CPU from rseq cpu_id (lazy register): 19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
    - getcpu system call: 234.9 ns

    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop): 0.8 ns
    - Read CPU from rseq cpu_id: 0.8 ns
    - Read CPU from rseq cpu_id (lazy register): 0.8 ns
    - Read using gs segment selector: 0.8 ns
    - "lsl" inline assembly: 13.0 ns
    - glibc 2.19-0ubuntu6 getcpu: 16.6 ns
    - getcpu system call: 53.9 ns

    - Speed (benchmark taken on v8 of patchset)

    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.

    * CONFIG_RSEQ=n

    avg.: 41.37 s
    std.dev.: 0.36 s

    * CONFIG_RSEQ=y

    avg.: 40.46 s
    std.dev.: 0.33 s

    - Size

    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.

    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Joel Fernandes
    Cc: Catalin Marinas
    Cc: Dave Watson
    Cc: Will Deacon
    Cc: Andi Kleen
    Cc: "H . Peter Anvin"
    Cc: Chris Lameter
    Cc: Russell King
    Cc: Andrew Hunter
    Cc: Michael Kerrisk
    Cc: "Paul E . McKenney"
    Cc: Paul Turner
    Cc: Boqun Feng
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Ben Maurer
    Cc: Alexander Viro
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     

10 May, 2018

1 commit


03 May, 2018

1 commit

  • This is the io_getevents equivalent of ppoll/pselect and allows to
    properly mix signals and aio completions (especially with IOCB_CMD_POLL)
    and atomically executes the following sequence:

    sigset_t origmask;

    pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
    ret = io_getevents(ctx, min_nr, nr, events, timeout);
    pthread_sigmask(SIG_SETMASK, &origmask, NULL);

    Note that unlike many other signal related calls we do not pass a sigmask
    size, as that would get us to 7 arguments, which aren't easily supported
    by the syscall infrastructure. It seems a lot less painful to just add a
    new syscall variant in the unlikely case we're going to increase the
    sigset size.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Darrick J. Wong

    Christoph Hellwig
     

05 Apr, 2018

1 commit

  • It may be useful for an architecture to override the definitions of the
    COMPAT_SYSCALL_DEFINE0() and __COMPAT_SYSCALL_DEFINEx() macros in
    , in particular to use a different calling convention
    for syscalls. This patch provides a mechanism to do so, based on the
    previously introduced CONFIG_ARCH_HAS_SYSCALL_WRAPPER. If it is enabled,
    is included in and may be used
    to define the macros mentioned above. Moreover, as the syscall calling
    convention may be different if CONFIG_ARCH_HAS_SYSCALL_WRAPPER is set,
    the compat syscall function prototypes in are #ifndef'd
    out in that case.

    As some of the syscalls and/or compat syscalls may not be present,
    the COND_SYSCALL() and COND_SYSCALL_COMPAT() macros in kernel/sys_ni.c
    as well as the SYS_NI() and COMPAT_SYS_NI() macros in
    kernel/time/posix-stubs.c can be re-defined in iff
    CONFIG_ARCH_HAS_SYSCALL_WRAPPER is enabled.

    Signed-off-by: Dominik Brodowski
    Acked-by: Linus Torvalds
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180405095307.3730-5-linux@dominikbrodowski.net
    Signed-off-by: Ingo Molnar

    Dominik Brodowski
     

03 Apr, 2018

3 commits


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

23 Dec, 2016

1 commit

  • ... and fix the minor buglet in compat io_submit() - native one
    kills ioctx as cleanup when put_user() fails. Get rid of
    bogus compat_... in !CONFIG_AIO case, while we are at it - they
    should simply fail with ENOSYS, same as for native counterparts.

    Signed-off-by: Al Viro

    Al Viro
     

13 Sep, 2016

1 commit

  • Guenter Roeck reported breakage on the h8300 and c6x architectures (among
    others) caused by the new memory protection keys syscalls. This patch does
    what Arnd suggested and adds them to kernel/sys_ni.c.

    Fixes: a60f7b69d92c ("generic syscalls: Wire up memory protection keys syscalls")
    Reported-and-tested-by: Guenter Roeck
    Signed-off-by: Dave Hansen
    Acked-by: Arnd Bergmann
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160912203842.48E7AC50@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

02 Dec, 2015

1 commit

  • Add a copy_file_range() system call for offloading copies between
    regular files.

    This gives an interface to underlying layers of the storage stack which
    can copy without reading and writing all the data. There are a few
    candidates that should support copy offloading in the nearer term:

    - btrfs shares extent references with its clone ioctl
    - NFS has patches to add a COPY command which copies on the server
    - SCSI has a family of XCOPY commands which copy in the device

    This system call avoids the complexity of also accelerating the creation
    of the destination file by operating on an existing destination file
    descriptor, not a path.

    Currently the high level vfs entry point limits copy offloading to files
    on the same mount and super (and not in the same file). This can be
    relaxed if we get implementations which can copy between file systems
    safely.

    Signed-off-by: Zach Brown
    [Anna Schumaker: Change -EINVAL to -EBADF during file verification,
    Change flags parameter from int to unsigned int,
    Add function to include/linux/syscalls.h,
    Check copy len after file open mode,
    Don't forbid ranges inside the same file,
    Use rw_verify_area() to veriy ranges,
    Use file_out rather than file_in,
    Add COPY_FR_REFLINK flag]
    Signed-off-by: Anna Schumaker
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Zach Brown
     

06 Nov, 2015

1 commit

  • With the refactored mlock code, introduce a new system call for mlock.
    The new call will allow the user to specify what lock states are being
    added. mlock2 is trivial at the moment, but a follow on patch will add a
    new mlock state making it useful.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Heiko Carstens
    Cc: Geert Uytterhoeven
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Cc: Guenter Roeck
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pairs:

    Thread A Thread B
    previous mem accesses previous mem accesses
    smp_mb() smp_mb()
    following mem accesses following mem accesses

    After the change, these pairs become:

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
    prev mem accesses
    barrier()
    follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every process
    threads (as we currently do), we diminish the number of unnecessary wake
    ups and only issue the memory barriers on active threads. Non-running
    threads do not need to execute such barrier anyway, because these are
    implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all pro=E2=80=90
    cesses running on the system. This command returns 0.

    The flags argument needs to be 0. For future extensions.

    All memory accesses performed in program order from each targeted
    thread is guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

    barrier() smp_mb() sys_membarrier()
    barrier() X X O
    smp_mb() X O O
    sys_membarrier() O O O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

05 Sep, 2015

1 commit

  • This activates the userfaultfd syscall.

    [sfr@canb.auug.org.au: activate syscall fix]
    [akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

31 Jul, 2015

1 commit

  • The modify_ldt syscall exposes a large attack surface and is
    unnecessary for modern userspace. Make it optional.

    Signed-off-by: Andy Lutomirski
    Reviewed-by: Kees Cook
    Cc: Andrew Cooper
    Cc: Andy Lutomirski
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jan Beulich
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: security@kernel.org
    Cc: xen-devel
    Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org
    [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ]
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

16 Apr, 2015

1 commit

  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, capable function was moved in order to avoid
    adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     

14 Dec, 2014

1 commit

  • This patchset adds execveat(2) for x86, and is derived from Meredydd
    Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

    The primary aim of adding an execveat syscall is to allow an
    implementation of fexecve(3) that does not rely on the /proc filesystem,
    at least for executables (rather than scripts). The current glibc version
    of fexecve(3) is implemented via /proc, which causes problems in sandboxed
    or otherwise restricted environments.

    Given the desire for a /proc-free fexecve() implementation, HPA suggested
    (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
    an appropriate generalization.

    Also, having a new syscall means that it can take a flags argument without
    back-compatibility concerns. The current implementation just defines the
    AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
    added in future -- for example, flags for new namespaces (as suggested at
    https://lkml.org/lkml/2006/7/11/474).

    Related history:
    - https://lkml.org/lkml/2006/12/27/123 is an example of someone
    realizing that fexecve() is likely to fail in a chroot environment.
    - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
    documenting the /proc requirement of fexecve(3) in its manpage, to
    "prevent other people from wasting their time".
    - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
    problem where a process that did setuid() could not fexecve()
    because it no longer had access to /proc/self/fd; this has since
    been fixed.

    This patch (of 4):

    Add a new execveat(2) system call. execveat() is to execve() as openat()
    is to open(): it takes a file descriptor that refers to a directory, and
    resolves the filename relative to that.

    In addition, if the filename is empty and AT_EMPTY_PATH is specified,
    execveat() executes the file to which the file descriptor refers. This
    replicates the functionality of fexecve(), which is a system call in other
    UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/" (and
    so relies on /proc being mounted).

    The filename fed to the executed program as argv[0] (or the name of the
    script fed to a script interpreter) will be of the form "/dev/fd/"
    (for an empty filename) or "/dev/fd//", effectively
    reflecting how the executable was found. This does however mean that
    execution of a script in a /proc-less environment won't work; also, script
    execution via an O_CLOEXEC file descriptor fails (as the file will not be
    accessible after exec).

    Based on patches by Meredydd Luff.

    Signed-off-by: David Drysdale
    Cc: Meredydd Luff
    Cc: Shuah Khan
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Alexander Viro
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Kees Cook
    Cc: Arnd Bergmann
    Cc: Rich Felker
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Drysdale
     

19 Nov, 2014

1 commit


09 Oct, 2014

1 commit

  • Pull networking updates from David Miller:
    "Most notable changes in here:

    1) By far the biggest accomplishment, thanks to a large range of
    contributors, is the addition of multi-send for transmit. This is
    the result of discussions back in Chicago, and the hard work of
    several individuals.

    Now, when the ->ndo_start_xmit() method of a driver sees
    skb->xmit_more as true, it can choose to defer the doorbell
    telling the driver to start processing the new TX queue entires.

    skb->xmit_more means that the generic networking is guaranteed to
    call the driver immediately with another SKB to send.

    There is logic added to the qdisc layer to dequeue multiple
    packets at a time, and the handling mis-predicted offloads in
    software is now done with no locks held.

    Finally, pktgen is extended to have a "burst" parameter that can
    be used to test a multi-send implementation.

    Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
    virtio_net

    Adding support is almost trivial, so export more drivers to
    support this optimization soon.

    I want to thank, in no particular or implied order, Jesper
    Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
    Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
    David Tat, Hannes Frederic Sowa, and Rusty Russell.

    2) PTP and timestamping support in bnx2x, from Michal Kalderon.

    3) Allow adjusting the rx_copybreak threshold for a driver via
    ethtool, and add rx_copybreak support to enic driver. From
    Govindarajulu Varadarajan.

    4) Significant enhancements to the generic PHY layer and the bcm7xxx
    driver in particular (EEE support, auto power down, etc.) from
    Florian Fainelli.

    5) Allow raw buffers to be used for flow dissection, allowing drivers
    to determine the optimal "linear pull" size for devices that DMA
    into pools of pages. The objective is to get exactly the
    necessary amount of headers into the linear SKB area pre-pulled,
    but no more. The new interface drivers use is eth_get_headlen().
    From WANG Cong, with driver conversions (several had their own
    by-hand duplicated implementations) by Alexander Duyck and Eric
    Dumazet.

    6) Support checksumming more smoothly and efficiently for
    encapsulations, and add "foo over UDP" facility. From Tom
    Herbert.

    7) Add Broadcom SF2 switch driver to DSA layer, from Florian
    Fainelli.

    8) eBPF now can load programs via a system call and has an extensive
    testsuite. Alexei Starovoitov and Daniel Borkmann.

    9) Major overhaul of the packet scheduler to use RCU in several major
    areas such as the classifiers and rate estimators. From John
    Fastabend.

    10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
    Duyck.

    11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
    Dumazet.

    12) Add Datacenter TCP congestion control algorithm support, From
    Florian Westphal.

    13) Reorganize sk_buff so that __copy_skb_header() is significantly
    faster. From Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
    netlabel: directly return netlbl_unlabel_genl_init()
    net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
    net: description of dma_cookie cause make xmldocs warning
    cxgb4: clean up a type issue
    cxgb4: potential shift wrapping bug
    i40e: skb->xmit_more support
    net: fs_enet: Add NAPI TX
    net: fs_enet: Remove non NAPI RX
    r8169:add support for RTL8168EP
    net_sched: copy exts->type in tcf_exts_change()
    wimax: convert printk to pr_foo()
    af_unix: remove 0 assignment on static
    ipv6: Do not warn for informational ICMP messages, regardless of type.
    Update Intel Ethernet Driver maintainers list
    bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
    tipc: fix bug in multicast congestion handling
    net: better IFF_XMIT_DST_RELEASE support
    net/mlx4_en: remove NETDEV_TX_BUSY
    3c59x: fix bad split of cpu_to_le32(pci_map_single())
    net: bcmgenet: fix Tx ring priority programming
    ...

    Linus Torvalds
     

27 Sep, 2014

1 commit


18 Aug, 2014

1 commit

  • Many embedded systems will not need these syscalls, and omitting them
    saves space. Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
    (default y) to support compiling them out.

    bloat-o-meter:
    add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
    function old new delta
    sys_fadvise64 57 - -57
    sys_fadvise64_64 691 - -691
    sys_madvise 1502 - -1502

    Signed-off-by: Josh Triplett

    Josh Triplett
     

09 Aug, 2014

1 commit

  • This is the new syscall kexec_file_load() declaration/interface. I have
    reserved the syscall number only for x86_64 so far. Other architectures
    (including i386) can reserve syscall number when they enable the support
    for this new syscall.

    Signed-off-by: Vivek Goyal
    Cc: Michael Kerrisk
    Cc: Borislav Petkov
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal