22 Dec, 2020

3 commits

  • Buffers that are passed to read_actions_logged() and write_actions_logged()
    are in kernel memory; the sysctl core takes care of copying from/to
    userspace.

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Reviewed-by: Tyler Hicks
    Signed-off-by: Jann Horn
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com
    (cherry picked from commit fab686eb0307121e7a2890b6d6c57edd2457863d)
    Signed-off-by: Jeff Vander Stoep
    Change-Id: I378d1eb73ff01928d3255191ed4443cd7b87c720
    Bug: 176068146

    Jann Horn
     
  • SECCOMP_CACHE will only operate on syscalls that do not access
    any syscall arguments or instruction pointer. To facilitate
    this we need a static analyser to know whether a filter will
    return allow regardless of syscall arguments for a given
    architecture number / syscall number pair. This is implemented
    here with a pseudo-emulator, and stored in a per-filter bitmap.

    In order to build this bitmap at filter attach time, each filter is
    emulated for every syscall (under each possible architecture), and
    checked for any accesses of struct seccomp_data that are not the "arch"
    nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
    the program returns allow, then we can be sure that the filter must
    return allow independent from syscall arguments.

    Nearly all seccomp filters are built from these cBPF instructions:

    BPF_LD | BPF_W | BPF_ABS
    BPF_JMP | BPF_JEQ | BPF_K
    BPF_JMP | BPF_JGE | BPF_K
    BPF_JMP | BPF_JGT | BPF_K
    BPF_JMP | BPF_JSET | BPF_K
    BPF_JMP | BPF_JA
    BPF_RET | BPF_K
    BPF_ALU | BPF_AND | BPF_K

    Each of these instructions is emulated. Any weirdness or loading
    from a syscall argument will cause the emulator to bail.

    The emulation is also halted if it reaches a return. In that case,
    if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.

    Emulator structure and comments are from Kees [1] and Jann [2].

    Emulation is done at attach time. If a filter depends on more
    filters, and if the dependee does not guarantee to allow the
    syscall, then we skip the emulation of this syscall.
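
    The emulation described above can be sketched in userspace C. This is
    a minimal illustration built on the uapi headers, not the kernel's
    code: the function name filter_always_allows() and its calling
    convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_* opcode macros */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_ALLOW */

/*
 * Hypothetical helper: return 1 if the cBPF program provably returns
 * SECCOMP_RET_ALLOW for this (arch, nr) pair while reading only the
 * "arch" and "nr" members of seccomp_data; return 0 when unsure.
 */
int filter_always_allows(const struct sock_filter *insns, size_t len,
                         uint32_t arch, uint32_t nr)
{
    uint32_t reg_a = 0;  /* emulated accumulator */
    size_t pc;

    assert(insns != NULL);
    for (pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &insns[pc];
        uint32_t k = insn->k;
        int taken;

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (k == offsetof(struct seccomp_data, nr))
                reg_a = nr;
            else if (k == offsetof(struct seccomp_data, arch))
                reg_a = arch;
            else
                return 0;  /* argument or instruction pointer access: bail */
            break;
        case BPF_RET | BPF_K:
            return k == SECCOMP_RET_ALLOW;  /* emulation halts at a return */
        case BPF_JMP | BPF_JA:
            pc += k;  /* offset is relative to the next instruction */
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
        case BPF_JMP | BPF_JGE | BPF_K:
        case BPF_JMP | BPF_JGT | BPF_K:
        case BPF_JMP | BPF_JSET | BPF_K:
            switch (BPF_OP(insn->code)) {
            case BPF_JEQ:  taken = (reg_a == k); break;
            case BPF_JGE:  taken = (reg_a >= k); break;
            case BPF_JGT:  taken = (reg_a > k);  break;
            case BPF_JSET: taken = ((reg_a & k) != 0); break;
            default:       return 0;
            }
            pc += taken ? insn->jt : insn->jf;
            break;
        case BPF_ALU | BPF_AND | BPF_K:
            reg_a &= k;
            break;
        default:
            return 0;  /* any instruction outside the list above: bail */
        }
    }
    return 0;  /* ran off the end without reaching a return */
}
```

    A caller would run this once per architecture and syscall number at
    attach time and record each positive answer in the per-filter
    bitmap. A result of 0 only means "not provable", so the real filter
    still has to run for that syscall.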

    [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
    [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

    Suggested-by: Jann Horn
    Signed-off-by: YiFei Zhu
    Reviewed-by: Jann Horn
    Co-developed-by: Kees Cook
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu
    (cherry picked from commit 8e01b51a31a1e08e2c3e8fcc0ef6790441be2f61)
    Signed-off-by: Jeff Vander Stoep
    Change-Id: I5047f7f0d6502e5de6c047743f1053fda3025a6e
    Bug: 176068146

    YiFei Zhu
     
  • The overhead of running Seccomp filters has been part of some past
    discussions [1][2][3]. Oftentimes, the filters have a large number
    of instructions that check syscall numbers one by one and jump based
    on that. Some users chain BPF filters, which further enlarges the
    overhead. A recent work [6] comprehensively measures the Seccomp
    overhead and shows that the overhead is non-negligible and has a
    non-trivial impact on application performance.

    We observed some common filters, such as docker's [4] or
    systemd's [5], will make most decisions based only on the syscall
    numbers, and as past discussions considered, a bitmap where each bit
    represents a syscall makes most sense for these filters.

    The fast (common) path for seccomp should be that the filter permits
    the syscall to pass through, and failing seccomp is expected to be
    an exceptional case; it is not expected for userspace to call a
    denylisted syscall over and over.

    When it can be concluded that an allow must occur for the given
    architecture and syscall pair (this determination is introduced in
    the next commit), seccomp will immediately allow the syscall,
    bypassing further BPF execution.

    Each architecture number has its own bitmap. The architecture
    number in seccomp_data is checked against the defined architecture
    number constant before proceeding to test the bit against the
    bitmap with the syscall number as the index of the bit in the
    bitmap, and if the bit is set, seccomp returns allow. The bitmaps
    are all clear in this patch and will be initialized in the next
    commit.

    When only one architecture exists, the check against architecture
    number is skipped, as suggested by Kees Cook [7].
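
    As a rough illustration of the lookup described above, the following
    userspace sketch keeps one bitmap per architecture number. The
    struct layout, the SKETCH_NR_SYSCALLS bound and both function names
    are assumptions for this example. The helper that sets bits stands
    in for the initialization that the next commit adds.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_SYSCALLS 512  /* hypothetical upper bound on syscall numbers */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One cache entry: the architecture number plus its allow bitmap. */
struct arch_cache {
    uint32_t arch;
    unsigned long allow[(SKETCH_NR_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

/* Stand-in for the bitmap setup done at filter attach time. */
void cache_mark_allow(struct arch_cache *c, uint32_t nr)
{
    assert(c != NULL);
    if (nr < SKETCH_NR_SYSCALLS)
        c->allow[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Fast path: 1 means "allow without running BPF", 0 means "run the filter". */
int cache_check_allow(const struct arch_cache *c, uint32_t arch, uint32_t nr)
{
    assert(c != NULL);
    if (arch != c->arch)           /* architecture is checked first */
        return 0;
    if (nr >= SKETCH_NR_SYSCALLS)  /* out-of-range numbers are never cached */
        return 0;
    return (int)((c->allow[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}
```

    With a freshly zeroed struct arch_cache every lookup returns 0,
    which matches the all-clear bitmaps of this patch. On a
    configuration with a single architecture, the arch comparison could
    be dropped.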

    [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
    [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
    [3] https://github.com/seccomp/libseccomp/issues/116
    [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
    [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
    [6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
    [7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

    Co-developed-by: Dimitrios Skarlatos
    Signed-off-by: Dimitrios Skarlatos
    Signed-off-by: YiFei Zhu
    Reviewed-by: Jann Horn
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu
    (cherry picked from commit f9d480b6ffbeb336bf7f6ce44825c00f61b3abae)
    Signed-off-by: Jeff Vander Stoep
    Change-Id: I50b6682e17dc6e91b5e92017361200d722282825
    Bug: 176068146

    YiFei Zhu
     

18 Nov, 2020

1 commit

  • Replace the use of security_capable(current_cred(), ...) with
    ns_capable_noaudit(), which sets PF_SUPERPRIV.

    Since commit 98f368e9e263 ("kernel: Add noaudit variant of
    ns_capable()"), a new ns_capable_noaudit() helper is available. Let's
    use it!

    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Tyler Hicks
    Cc: Will Drewry
    Cc: stable@vger.kernel.org
    Fixes: e2cfabdfd075 ("seccomp: add system call filtering using BPF")
    Signed-off-by: Mickaël Salaün
    Reviewed-by: Jann Horn
    Signed-off-by: Kees Cook
    Link: https://lore.kernel.org/r/20201030123849.770769-3-mic@digikod.net

    Mickaël Salaün
     

09 Oct, 2020

1 commit

  • Currently, init_listener() tries to prevent adding a filter with
    SECCOMP_FILTER_FLAG_NEW_LISTENER if one of the existing filters already
    has a listener. However, this check happens without holding any lock that
    would prevent another thread from concurrently installing a new filter
    (potentially with a listener) on top of the ones we already have.

    Theoretically, this is also a data race: The plain load from
    current->seccomp.filter can race with concurrent writes to the same
    location.

    Fix it by moving the check into the region that holds the siglock to guard
    against concurrent TSYNC.

    (The "Fixes" tag points to the commit that introduced the theoretical
    data race; concurrent installation of another filter with TSYNC only
    became possible later, in commit 51891498f2da ("seccomp: allow TSYNC and
    USER_NOTIF together").)

    Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
    Reviewed-by: Tycho Andersen
    Signed-off-by: Jann Horn
    Signed-off-by: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20201005014401.490175-1-jannh@google.com

    Jann Horn
     

09 Sep, 2020

4 commits

  • As described in commit a3460a59747c ("new helper: current_pt_regs()"):
    - arch versions are "optimized versions".
    - some architectures have task_pt_regs() working only for traced tasks
    blocked on signal delivery. current_pt_regs() needs to work for *all*
    processes.

    In preparation for adding a coccinelle rule for using current_*(), instead
    of raw accesses to current members, modify seccomp_do_user_notification(),
    __seccomp_filter(), __secure_computing() to use current_pt_regs().

    Signed-off-by: Denis Efremov
    Link: https://lore.kernel.org/r/20200824125921.488311-1-efremov@linux.com
    [kees: Reworded commit log, add comment to populate_seccomp_data()]
    Signed-off-by: Kees Cook

    Denis Efremov
     
  • Asynchronous termination of a thread outside of the userspace thread
    library's knowledge is an unsafe operation that leaves the process in
    an inconsistent, corrupt, and possibly unrecoverable state. In order
    to make new actions that may be added in the future safe on kernels
    not aware of them, change the default action from
    SECCOMP_RET_KILL_THREAD to SECCOMP_RET_KILL_PROCESS.

    Signed-off-by: Rich Felker
    Link: https://lore.kernel.org/r/20200829015609.GA32566@brightrain.aerifal.cx
    [kees: Fixed up coredump selection logic to match]
    Signed-off-by: Kees Cook

    Rich Felker
     
  • Christian and Kees both pointed out that this is a bit sloppy to open-code
    both places, and Christian points out that we leave a dangling pointer to
    ->notif if file allocation fails. Since we check ->notif for null in order
    to determine if it's ok to install a filter, this means people won't be
    able to install a filter if the file allocation fails for some reason, even
    if they subsequently should be able to.

    To fix this, let's hoist this free+null into its own little helper and use
    it.
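
    The free+null pattern can be sketched in plain userspace C. The
    struct and function names below are stand-ins invented for this
    example, and free() takes the place of the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the kernel structures; the fields are illustrative only. */
struct notification { int next_id; };
struct filter { struct notification *notif; };

/* Free the listener state and clear the pointer in one place. */
void filter_notify_free(struct filter *f)
{
    assert(f != NULL);
    free(f->notif);
    f->notif = NULL;  /* no dangling pointer is left behind */
}

/* A filter counts as having a listener only while ->notif is non-NULL. */
int filter_has_listener(const struct filter *f)
{
    assert(f != NULL);
    return f->notif != NULL;
}
```

    Because the helper always resets the pointer, a later
    filter_has_listener() check sees a clean state even after a failed
    setup, and calling the helper twice is harmless.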

    Reported-by: Kees Cook
    Reported-by: Christian Brauner
    Signed-off-by: Tycho Andersen
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200902140953.1201956-1-tycho@tycho.pizza
    Signed-off-by: Kees Cook

    Tycho Andersen
     
  • In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize
    the listener fd, then check to see if we can actually use it later in
    seccomp_may_assign_mode(), which can fail if anyone else in our thread
    group has installed a filter and caused some divergence. If we can't, we
    partially clean up the newly allocated file: we put the fd, put the file,
    but don't actually clean up the *memory* that was allocated at
    filter->notif. Let's clean that up too.

    To accomplish this, let's hoist the actual "detach a notifier from a
    filter" code to its own helper out of seccomp_notify_release(), so that in
    case anyone adds stuff to init_listener(), they only have to add the
    cleanup code in one spot. This does a bit of extra locking and such on the
    failure path when the filter is not attached, but it's a slow failure path
    anyway.

    Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together")
    Reported-by: syzbot+3ad9614a12f80994c32e@syzkaller.appspotmail.com
    Signed-off-by: Tycho Andersen
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200902014017.934315-1-tycho@tycho.pizza
    Signed-off-by: Kees Cook

    Tycho Andersen
     

15 Jul, 2020

1 commit

  • The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over
    an fd. It is often used in settings where a supervising task emulates
    syscalls on behalf of a supervised task in userspace, either to further
    restrict the supervisee's syscall abilities or to circumvent kernel
    enforced restrictions the supervisor deems safe to lift (e.g. actually
    performing a mount(2) for an unprivileged container).

    While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall,
    only a certain subset of syscalls could be correctly emulated. Over the
    last few development cycles, the set of syscalls which can't be emulated
    has been reduced due to the addition of pidfd_getfd(2). With this we are
    now able to, for example, intercept syscalls that require the supervisor
    to operate on file descriptors of the supervisee such as connect(2).

    However, syscalls that cause new file descriptors to be installed can not
    currently be correctly emulated since there is no way for the supervisor
    to inject file descriptors into the supervisee. This patch adds a
    new addfd ioctl to remove this restriction by allowing the supervisor to
    install file descriptors into the intercepted task. By implementing this
    feature via seccomp the supervisor effectively instructs the supervisee
    to install a set of file descriptors into its own file descriptor table
    during the intercepted syscall. This way it is possible to intercept
    syscalls such as open() or accept(), and install (or replace, like
    dup2(2)) the supervisor's resulting fd into the supervisee. One
    replacement use-case would be to redirect the stdout and stderr of a
    supervisee into log file descriptors opened by the supervisor.
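
    From the supervisor's side, a call to the new ioctl might look like
    the sketch below. It assumes uapi headers that already carry struct
    seccomp_notif_addfd and SECCOMP_IOCTL_NOTIF_ADDFD. The wrapper name
    supervisor_add_fd() and its convention of returning -errno are
    choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Install src_fd (a descriptor of the supervisor) into the task that
 * triggered notification notif_id. If target_fd >= 0, request that
 * exact descriptor number, replacing whatever is there (dup2-like).
 * Returns the descriptor number in the supervisee, or -errno.
 */
int supervisor_add_fd(int notify_fd, __u64 notif_id, int src_fd, int target_fd)
{
    struct seccomp_notif_addfd addfd;
    int ret;

    assert(src_fd >= 0);
    memset(&addfd, 0, sizeof(addfd));  /* unused fields stay zero */
    addfd.id = notif_id;
    addfd.srcfd = (__u32)src_fd;
    if (target_fd >= 0) {
        addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
        addfd.newfd = (__u32)target_fd;
    }

    ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
    return ret < 0 ? -errno : ret;
}
```

    A supervisor emulating open() would open the file itself, call this
    helper with target_fd set to -1, and then send the returned number
    back as the syscall's return value in its notification response.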

    The ioctl handling is based on the discussions[1] of how Extensible
    Arguments should interact with ioctls. Instead of building size into
    the addfd structure, make it a function of the ioctl command (which
    is how sizes are normally passed to ioctls). To support forward and
    backward compatibility, just mask out the direction and size, and match
    everything. The size (and any future direction) checks are done along
    with copy_struct_from_user() logic.
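
    The "mask out the direction and size" step can be illustrated with
    the generic ioctl encoding macros. The helper names and the example
    command numbers are assumptions for this sketch.

```c
#include <assert.h>
#include <linux/ioctl.h>  /* _IOW, IOC_INOUT, IOCSIZE_MASK */

/* Keep only the type and number bits of an ioctl command. */
unsigned int ioctl_strip_dir_size(unsigned int cmd)
{
    return cmd & ~(IOC_INOUT | IOCSIZE_MASK);
}

/* Two commands belong to the same family if they differ only in dir/size. */
int same_ioctl_family(unsigned int a, unsigned int b)
{
    return ioctl_strip_dir_size(a) == ioctl_strip_dir_size(b);
}

/* Illustration: a struct that grows changes the command's size bits only. */
void ioctl_match_example(void)
{
    unsigned int v1 = _IOW('!', 3, int);        /* hypothetical "old" size */
    unsigned int v2 = _IOW('!', 3, long long);  /* hypothetical "new" size */

    assert(v1 != v2);
    assert(same_ioctl_family(v1, v2));
    assert(!same_ioctl_family(v1, _IOW('!', 4, int)));
}
```

    A dispatcher that switches on the stripped value accepts both older
    and newer userspace. The size encoded in the original command is
    then validated separately when the structure is copied in.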

    As a note, the seccomp_notif_addfd structure is laid out based on 8-byte
    alignment without requiring packing as there have been packing issues
    with uapi highlighted before[2][3]. Although we could overload the
    newfd field and use -1 to indicate that it is not to be used, doing
    so requires changing the size of the fd field, and introduces struct
    packing complexity.

    [1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/
    [2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/
    [3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal

    Cc: Christoph Hellwig
    Cc: Christian Brauner
    Cc: Tycho Andersen
    Cc: Jann Horn
    Cc: Robert Sesek
    Cc: Chris Palmer
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-api@vger.kernel.org
    Suggested-by: Matt Denton
    Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me
    Signed-off-by: Sargun Dhillon
    Reviewed-by: Will Drewry
    Co-developed-by: Kees Cook
    Signed-off-by: Kees Cook

    Sargun Dhillon
     

11 Jul, 2020

9 commits

  • The terminator for the mode 1 syscalls list was a 0, but that could be
    a valid syscall number (e.g. x86_64 __NR_read). By luck, __NR_read was
    listed first and the loop construct would not test it, so there was no
    bug. However, this is fragile. Replace the terminator with -1 instead,
    and make the variable name for mode 1 syscall lists more descriptive.
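
    A small userspace sketch of a list terminated by -1. The array
    contents and names here are illustrative and use the libc SYS_*
    constants rather than the kernel's own definitions.

```c
#include <assert.h>
#include <sys/syscall.h>  /* SYS_read, SYS_write, ... */

/* Syscalls permitted in strict mode, terminated by -1 (never a valid nr). */
static const int mode1_allowed[] = {
    SYS_read, SYS_write, SYS_exit, SYS_rt_sigreturn,
    -1,
};

/* Return 1 if nr is in the list, 0 otherwise. */
int mode1_syscall_allowed(int nr)
{
    const int *p;

    for (p = mode1_allowed; *p != -1; p++) {
        if (*p == nr)
            return 1;
    }
    return 0;
}

/* SYS_read may be 0 on some architectures and is still matched. */
void mode1_example(void)
{
    assert(mode1_syscall_allowed(SYS_read));
    assert(!mode1_syscall_allowed(SYS_getpid));
}
```

    With a 0 terminator the same loop would stop at the first entry
    whose value is 0, so correctness would depend on the order of the
    list. The -1 terminator removes that dependency.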

    Cc: Andy Lutomirski
    Cc: Will Drewry
    Signed-off-by: Kees Cook

    Kees Cook
     
  • When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong
    direction flag set. While this isn't a big deal as nothing currently
    enforces these bits in the kernel, it should be defined correctly. Fix
    the define and provide support for the old command until it is no longer
    needed for backward compatibility.

    Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
    Signed-off-by: Kees Cook

    Kees Cook
     
  • Avoid open-coding "seccomp: " prefixes for pr_*() calls.

    Signed-off-by: Kees Cook

    Kees Cook
     
  • We've been making heavy use of the seccomp notifier to intercept and
    handle certain syscalls for containers. This patch allows a syscall
    supervisor listening on a given notifier to be notified when a seccomp
    filter has become unused.

    A container is often managed by a singleton supervisor process, the
    so-called "monitor". This monitor process has an event loop which has
    various event handlers registered. If the user specified a seccomp
    profile that included a notifier for various syscalls then we also
    register a seccomp notify event handler. For any container using a
    separate pid namespace the lifecycle of the seccomp notifier is bound to
    the init process of the pid namespace, i.e. when the init process exits
    the filter must be unused.

    If a new process attaches to a container we force it to assume a seccomp
    profile. This can either be the same seccomp profile as the container
    was started with or a modified one. If the attaching process makes use
    of the seccomp notifier we will register a new seccomp notifier handler
    in the monitor's event loop. However, when the attaching process exits
    we can't simply delete the handler since other child processes could've
    been created (daemons spawned etc.) that have inherited the seccomp
    filter and so we need to keep the seccomp notifier fd alive in the event
    loop. But this is problematic since we don't get a notification when the
    seccomp filter has become unused and so we currently never remove the
    seccomp notifier fd from the event loop and just keep accumulating fds
    in the event loop. We've had this issue for a while but it has recently
    become more pressing as more and larger users make use of this.

    To fix this, we introduce a new "users" reference counter that tracks any
    tasks and dependent filters making use of a filter. When a notifier is
    registered, waiting tasks will be notified that the filter is now empty
    by receiving a (E)POLLHUP event.
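
    On the monitor side, detecting the hangup could look like the sketch
    below. The function names are assumptions for this example. Since a
    real notifier fd needs an installed filter, the demo uses a pipe as
    a stand-in, relying only on the generic poll() behaviour of
    reporting POLLHUP once the other end is gone.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if the fd reports POLLHUP, 0 if not, -1 if poll() failed. */
int notifier_hung_up(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int ret;

    assert(notify_fd >= 0);
    ret = poll(&pfd, 1, timeout_ms);
    if (ret < 0)
        return -1;
    return (pfd.revents & POLLHUP) ? 1 : 0;
}

/* Stand-in demo: a pipe read end hangs up once its writer is closed. */
int demo_pipe_hangup(void)
{
    int p[2], before, after;

    if (pipe(p) != 0)
        return -1;
    before = notifier_hung_up(p[0], 0);  /* writer still open */
    close(p[1]);
    after = notifier_hung_up(p[0], 0);   /* writer closed */
    close(p[0]);
    return (before == 0 && after == 1) ? 1 : 0;
}
```

    An event loop would typically register the notifier fd once and, on
    seeing the hangup, remove it from the loop and close it instead of
    keeping it around indefinitely.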

    The concept this patch introduces is the same as for signal_struct,
    i.e. reference counting for life-cycle management is decoupled from
    reference counting tasks using the object. There's probably some trickery
    possible but the second counter is just the correct way of doing this
    IMHO and has precedent.

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200531115031.391515-3-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
     
  • Lift the wait_queue from struct notification into struct seccomp_filter.
    This is cleaner overall and lets us avoid having to take the notifier
    mutex in the future for EPOLLHUP notifications since we need to neither
    read nor modify the notifier specific aspects of the seccomp filter. In
    the exit path I'd very much like to avoid having to take the notifier mutex
    for each filter in the task's filter hierarchy.

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Signed-off-by: Kees Cook

    Christian Brauner
     
  • The seccomp filter used to be released in free_task() which is called
    asynchronously via call_rcu() and assorted mechanisms. Since we need
    to inform tasks waiting on the seccomp notifier when a filter goes empty
    we will notify them as soon as a task has been marked fully dead in
    release_task(). To not split seccomp cleanup into two parts, move
    filter release out of free_task() and into release_task() after we've
    unhashed struct task from struct pid, exited signals, and unlinked it
    from the threadgroups' thread list. We'll put the empty filter
    notification infrastructure into it in a follow up patch.

    This also renames put_seccomp_filter() to seccomp_filter_release() which
    is a more descriptive name of what we're doing here especially once
    we've added the empty filter notification mechanism in there.

    We're also NULL-ing the task's filter tree entrypoint which seems
    cleaner than leaving a dangling pointer in there. Note that this shouldn't
    need any memory barriers since we're calling this when the task is in
    release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
    filters anymore. You can also see this from the point where we're calling
    seccomp_filter_release(). It's after __exit_signal() and at this point,
    tsk->sighand will already have been NULLed which is required for
    thread-sync and filter installation alike.

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
     
  • Naming the lifetime counter of a seccomp filter "usage" suggests a
    little too strongly that it's about tasks that are using this filter
    while it also tracks other references such as the user notifier or
    ptrace. This also updates the documentation to note this fact.

    We'll be introducing an actual usage counter in a follow-up patch.

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200531115031.391515-1-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
     
  • This adds a helper which can iterate through a seccomp_filter to
    find a notification matching an ID. It removes several replicated
    chunks of code.

    Signed-off-by: Sargun Dhillon
    Acked-by: Christian Brauner
    Reviewed-by: Tycho Andersen
    Cc: Matt Denton
    Cc: Kees Cook
    Cc: Jann Horn
    Cc: Robert Sesek
    Cc: Chris Palmer
    Cc: Christian Brauner
    Cc: Tycho Andersen
    Link: https://lore.kernel.org/r/20200601112532.150158-1-sargun@sargun.me
    Signed-off-by: Kees Cook

    Sargun Dhillon
     
  • A common question asked when debugging seccomp filters is "how many
    filters are attached to your process?" Provide a way to easily answer
    this question through /proc/$pid/status with a "Seccomp_filters" line.
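
    A debugging tool could pick the new line out of the status text as
    sketched below. The function name is an assumption, and the parser
    takes the file contents as a string so it stays independent of how
    the file is read.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the "Seccomp_filters:" line in the text of a status file and
 * return its numeric value, or -1 if the line is missing or malformed.
 */
long parse_seccomp_filters(const char *status_text)
{
    static const char key[] = "Seccomp_filters:";
    const char *line = status_text;

    assert(status_text != NULL);
    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, sizeof(key) - 1) == 0) {
            long n;

            /* %ld skips the whitespace between the key and the value */
            if (sscanf(line + sizeof(key) - 1, "%ld", &n) == 1)
                return n;
            return -1;
        }
        line = strchr(line, '\n');  /* advance to the next line, if any */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

    On kernels that predate this commit the line is absent, so callers
    should treat -1 as "unknown" rather than as zero filters.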

    Signed-off-by: Kees Cook

    Kees Cook
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass through the data to one of the common handlers
    a lot of the changes are mechanical.

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

01 Apr, 2020

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Fix the iwlwifi regression, from Johannes Berg.

    2) Support BSS coloring and 802.11 encapsulation offloading in
    hardware, from John Crispin.

    3) Fix some potential Spectre issues in qtnfmac, from Sergey
    Matyukevich.

    4) Add TTL decrement action to openvswitch, from Matteo Croce.

    5) Allow parallelization through flow_action setup by not taking the
    RTNL mutex, from Vlad Buslov.

    6) A lot of zero-length array to flexible-array conversions, from
    Gustavo A. R. Silva.

    7) Align XDP statistics names across several drivers for consistency,
    from Lorenzo Bianconi.

    8) Add various pieces of infrastructure for offloading conntrack, and
    make use of it in mlx5 driver, from Paul Blakey.

    9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.

    10) Lots of parallelization improvements during configuration changes
    in mlxsw driver, from Ido Schimmel.

    11) Add support to devlink for generic packet traps, which report
    packets dropped during ACL processing. And use them in mlxsw
    driver. From Jiri Pirko.

    12) Support bcmgenet on ACPI, from Jeremy Linton.

    13) Make BPF compatible with RT, from Thomas Gleixner, Alexei
    Starovoitov, and yours truly.

    14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.

    15) Fix sysfs permissions when network devices change namespaces, from
    Christian Brauner.

    16) Add a flags element to ethtool_ops so that drivers can more simply
    indicate which coalescing parameters they actually support, and
    therefore the generic layer can validate the user's ethtool
    request. Use this in all drivers, from Jakub Kicinski.

    17) Offload FIFO qdisc in mlxsw, from Petr Machata.

    18) Support UDP sockets in sockmap, from Lorenz Bauer.

    19) Fix stretch ACK bugs in several TCP congestion control modules,
    from Pengcheng Yang.

    20) Support virtual functions in octeontx2 driver, from Tomasz
    Duszynski.

    21) Add region operations for devlink and use it in ice driver to dump
    NVM contents, from Jacob Keller.

    22) Add support for hw offload of MACSEC, from Antoine Tenart.

    23) Add support for BPF programs that can be attached to LSM hooks,
    from KP Singh.

    24) Support for multiple paths, path managers, and counters in MPTCP.
    From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
    and others.

    25) More progress on adding the netlink interface to ethtool, from
    Michal Kubecek"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
    net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
    cxgb4/chcr: nic-tls stats in ethtool
    net: dsa: fix oops while probing Marvell DSA switches
    net/bpfilter: remove superfluous testing message
    net: macb: Fix handling of fixed-link node
    net: dsa: ksz: Select KSZ protocol tag
    netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
    net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
    net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
    net: stmmac: create dwmac-intel.c to contain all Intel platform
    net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
    net: dsa: bcm_sf2: Add support for matching VLAN TCI
    net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
    net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
    net: dsa: bcm_sf2: Disable learning for ASP port
    net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
    net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
    net: dsa: b53: Restore VLAN entries upon (re)configuration
    net: dsa: bcm_sf2: Fix overflow checks
    hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
    ...

    Linus Torvalds
     

30 Mar, 2020

1 commit

  • Executing the seccomp_bpf testsuite under a 64-bit kernel with 32-bit
    userland (both s390 and x86) doesn't work because there's no compat_ioctl
    handler defined. Add the handler.

    Signed-off-by: Sven Schnelle
    Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200310123332.42255-1-svens@linux.ibm.com
    Signed-off-by: Kees Cook

    Sven Schnelle
     

05 Mar, 2020

1 commit

  • The restriction introduced in 7a0df7fbc145 ("seccomp: Make NEW_LISTENER and
    TSYNC flags exclusive") is mostly artificial: there is enough information
    in a seccomp user notification to tell which thread triggered a
    notification. The reason it was introduced is because TSYNC makes the
    syscall return a thread-id on failure, and NEW_LISTENER returns an fd, and
    there's no way to distinguish between these two cases (well, I suppose the
    caller could check all fds it has, then do the syscall, and if the return
    value was an fd that already existed, then it must be a thread id, but
    bleh).

    Matthew would like to use these two flags together in the Chrome sandbox
    which wants to use TSYNC for video drivers and NEW_LISTENER to proxy
    syscalls.

    So, let's fix this ugliness by adding another flag, TSYNC_ESRCH, which
    tells the kernel to just return -ESRCH on a TSYNC error. This way,
    NEW_LISTENER (and any subsequent seccomp() commands that want to return
    positive values) don't conflict with each other.
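
    From userspace, combining the flags and interpreting the result
    might look like the sketch below. It assumes uapi headers that
    define SECCOMP_FILTER_FLAG_TSYNC_ESRCH. The function and enum names
    are invented for this example, and treating ESRCH as "TSYNC failed"
    follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>

/* Flags for installing a filter on all threads and getting a listener fd. */
unsigned int listener_tsync_flags(void)
{
    return SECCOMP_FILTER_FLAG_TSYNC |
           SECCOMP_FILTER_FLAG_NEW_LISTENER |
           SECCOMP_FILTER_FLAG_TSYNC_ESRCH;
}

enum install_result {
    INSTALL_GOT_LISTENER_FD,  /* ret is the new listener fd */
    INSTALL_TSYNC_FAILED,     /* a thread could not be synchronized */
    INSTALL_OTHER_ERROR,
};

/*
 * Classify the outcome of the install call, given the libc-style return
 * value (-1 on error) and the errno value observed right after it.
 */
enum install_result classify_install(long ret, int err)
{
    assert(ret >= -1);
    if (ret >= 0)
        return INSTALL_GOT_LISTENER_FD;
    if (err == ESRCH)
        return INSTALL_TSYNC_FAILED;
    return INSTALL_OTHER_ERROR;
}
```

    With TSYNC_ESRCH set, a non-negative return value can only be the
    listener fd, so the ambiguity with a returned thread id described
    above no longer arises.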

    Suggested-by: Matthew Denton
    Signed-off-by: Tycho Andersen
    Link: https://lore.kernel.org/r/20200304180517.23867-1-tycho@tycho.ws
    Signed-off-by: Kees Cook

    Tycho Andersen
     

25 Feb, 2020

1 commit

  • All of these cases are strictly of the form:

    preempt_disable();
    BPF_PROG_RUN(...);
    preempt_enable();

    Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
    with:

    migrate_disable();
    BPF_PROG_RUN(...);
    migrate_enable();

    On non RT enabled kernels this maps to preempt_disable/enable() and on RT
    enabled kernels this solely prevents migration, which is sufficient as
    there is no requirement to prevent reentrancy to any BPF program from a
    preempting task. The only requirement is that the program stays on the same
    CPU.

    Therefore, this is a trivially correct transformation.

    The seccomp loop does not need protection over the loop. It only needs
    protection per BPF filter program.

    [ tglx: Converted to bpf_prog_run_pin_on_cpu() ]

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de

    David Miller
     

03 Jan, 2020

1 commit

  • This patch is a small change in enforcement of the uapi for
    SECCOMP_IOCTL_NOTIF_RECV ioctl. Specifically, the datastructure which
    is passed (seccomp_notif) must be zeroed out. Previously any of its
    members could be set to nonsense values, and we would ignore it.

    This ensures all fields are set to their zero value.
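
    For userspace this means clearing the structure before every
    receive. A minimal sketch follows. The wrapper name and its -errno
    return convention are assumptions for this example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/*
 * Receive the next notification from notify_fd into *req.
 * Returns 0 on success or -errno on failure.
 */
int recv_notification(int notify_fd, struct seccomp_notif *req)
{
    assert(req != NULL);
    /* The structure must be all zeroes when passed in; stale data from
     * a previous iteration would otherwise be handed to the kernel. */
    memset(req, 0, sizeof(*req));

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) < 0)
        return -errno;
    return 0;
}
```

    Doing the memset() inside the wrapper keeps a receive loop that
    reuses one struct seccomp_notif across iterations from passing
    leftover data by accident.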

    Signed-off-by: Sargun Dhillon
    Reviewed-by: Christian Brauner
    Reviewed-by: Aleksa Sarai
    Acked-by: Tycho Andersen
    Link: https://lore.kernel.org/r/20191229062451.9467-2-sargun@sargun.me
    Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook

    Sargun Dhillon
     

11 Oct, 2019

1 commit

  • This allows the seccomp notifier to continue a syscall. A positive
    discussion about this feature was triggered by a post to the
    ksummit-discuss mailing list (cf. [3]) and took place during KSummit
    (cf. [1]) and again at the containers/checkpoint-restore
    micro-conference at Linux Plumbers.

    Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
    which enables a process (watchee) to retrieve an fd for its seccomp
    filter. This fd can then be handed to another (usually more privileged)
    process (watcher). The watcher will then be able to receive seccomp
    messages about the syscalls having been performed by the watchee.

    This feature is heavily used in some userspace workloads. For example,
    it is currently used to intercept mknod() syscalls in user namespaces
    aka in containers.
    The mknod() syscall can be easily filtered based on dev_t. This allows
    us to only intercept a very specific subset of mknod() syscalls.
    Furthermore, mknod() is not possible in user namespaces toto coelo and
    so intercepting and denying syscalls that are not in the whitelist on
    accident is not a big deal. The watchee won't notice a difference.

    In contrast to mknod(), a lot of other syscalls we intercept (e.g.
    setxattr()) cannot be easily filtered like mknod() because they have
    pointer arguments. Additionally, some of them might actually succeed in
    user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
    currently cannot tell seccomp to continue from a user notifier we are
    stuck with performing all of the syscalls in lieu of the container. This
    is a huge security liability since it is extremely difficult to
    correctly assume all of the necessary privileges of the calling task
    such that the syscall can be successfully emulated without escaping
    other additional security restrictions (think missing CAP_MKNOD for
    mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
    telling seccomp to resume the syscall.

    One thing that came up in the discussion was the problem that another
    thread could change the memory after userspace has decided to let the
    syscall continue. This is a well-known TOCTOU with seccomp that is
    already present in other ways.
    The discussion showed that this feature is already very useful for any
    syscall without pointer arguments. For any accidentally intercepted
    non-pointer syscall it is safe to continue.
    For syscalls with pointer arguments there is a race, but for any cautious
    userspace and the main use cases the race doesn't matter. The notifier
    is intended to be used in a scenario where a more privileged watcher
    supervises the syscalls of a less privileged watchee to allow it to get
    around kernel-enforced limitations by performing the syscall for it
    whenever deemed safe by the watcher. Hence, if a user tricks the watcher
    into allowing a syscall they will either get a deny based on
    kernel-enforced restrictions later or they will have changed the
    arguments in such a way that they manage to perform a syscall with
    arguments that they would've been allowed to do anyway.
    In general, it is good to point out again, that the notifier fd was not
    intended to allow userspace to implement a security policy but rather to
    work around kernel security mechanisms in cases where the watcher knows
    that a given action is safe to perform.

    /* References */
    [1]: https://linuxplumbersconf.org/event/4/contributions/560
    [2]: https://linuxplumbersconf.org/event/4/contributions/477
    [3]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns33qx@brauner.io
    [4]: commit 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")

    Co-developed-by: Kees Cook
    Signed-off-by: Christian Brauner
    Reviewed-by: Tycho Andersen
    Cc: Andy Lutomirski
    Cc: Will Drewry
    CC: Tyler Hicks
    Link: https://lore.kernel.org/r/20190920083007.11475-2-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
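
How a watcher might build its reply can be sketched as follows. The commit text above does not name the flag; the sketch assumes it is `SECCOMP_USER_NOTIF_FLAG_CONTINUE` from the uapi header (a fallback define is provided for older headers), and that the error and val fields have to stay zero when it is set. `resp_continue()` and `resp_deny()` are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Fallback for older uapi headers; value assumed to be bit 0. */
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/*
 * Hypothetical helper: build a reply that lets the kernel resume the
 * intercepted syscall. error and val are left at zero.
 */
void resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/*
 * Hypothetical helper: build a reply that fails the syscall with the
 * given positive errno value (stored negated, kernel style).
 */
void resp_deny(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}
```

Per the discussion above, a cautious watcher would reserve `resp_continue()` for syscalls without pointer arguments, or for cases where a later kernel check would deny the call anyway.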
     

29 May, 2019

1 commit

  • force_sig_info always delivers to the current task and the signal
    parameter always matches info.si_signo. So remove those parameters to
    make it a simpler, less error-prone interface, and to make it clear
    that none of the callers are doing anything clever.

    This guarantees that force_sig_info will not grow any new buggy
    callers that attempt to call force_sig on a non-current task, or that
    pass a signal number that does not match info.si_signo.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 May, 2019

1 commit

  • Pull audit updates from Paul Moore:
    "We've got a reasonably broad set of audit patches for the v5.2 merge
    window, the highlights are below:

    - The biggest change, and the source of all the arch/* changes, is
    the patchset from Dmitry to help enable some of the work he is
    doing around PTRACE_GET_SYSCALL_INFO.

    To be honest, including this in the audit tree is a bit of a
    stretch, but it does help move audit a little further along towards
    proper syscall auditing for all arches, and everyone else seemed to
    agree that audit was a "good" spot for this to land (or maybe they
    just didn't want to merge it? dunno.).

    - We can now audit time/NTP adjustments.

    - We continue the work to connect associated audit records into a
    single event"

    * tag 'audit-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: (21 commits)
    audit: fix a memory leak bug
    ntp: Audit NTP parameters adjustment
    timekeeping: Audit clock adjustments
    audit: purge unnecessary list_empty calls
    audit: link integrity evm_write_xattrs record to syscall event
    syscall_get_arch: add "struct task_struct *" argument
    unicore32: define syscall_get_arch()
    Move EM_UNICORE to uapi/linux/elf-em.h
    nios2: define syscall_get_arch()
    nds32: define syscall_get_arch()
    Move EM_NDS32 to uapi/linux/elf-em.h
    m68k: define syscall_get_arch()
    hexagon: define syscall_get_arch()
    Move EM_HEXAGON to uapi/linux/elf-em.h
    h8300: define syscall_get_arch()
    c6x: define syscall_get_arch()
    arc: define syscall_get_arch()
    Move EM_ARCOMPACT and EM_ARCV2 to uapi/linux/elf-em.h
    audit: Make audit_log_cap and audit_copy_inode static
    audit: connect LOGIN record to its syscall record
    ...

    Linus Torvalds
     

07 May, 2019

1 commit

  • Pull security subsystem updates from James Morris:
    "Just a few bugfixes and documentation updates"

    * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    seccomp: fix up grammar in comment
    Revert "security: inode: fix a missing check for securityfs_create_file"
    Yama: mark function as static
    security: inode: fix a missing check for securityfs_create_file
    keys: safe concurrent user->{session,uid}_keyring access
    security: don't use RCU accessors for cred->session_keyring
    Yama: mark local symbols as static
    LSM: lsm_hooks.h: fix documentation format
    LSM: fix documentation for the shm_* hooks
    LSM: fix documentation for the sem_* hooks
    LSM: fix documentation for the msg_queue_* hooks
    LSM: fix documentation for the audit_* hooks
    LSM: fix documentation for the path_chmod hook
    LSM: fix documentation for the socket_getpeersec_dgram hook
    LSM: fix documentation for the task_setscheduler hook
    LSM: fix documentation for the socket_post_create hook
    LSM: fix documentation for the syslog hook
    LSM: fix documentation for sb_copy_data hook

    Linus Torvalds
     

30 Apr, 2019

1 commit

  • Pull seccomp fixes from Kees Cook:
    "Syzbot found a use-after-free bug in seccomp due to flags that should
    not be allowed to be used together.

    Tycho fixed this, I updated the self-tests, and the syzkaller PoC has
    been running for several days without triggering KASan (before this
    fix, it would reproduce). These patches have also been in -next for
    almost a week, just to be sure.

    - Add logic for making some seccomp flags exclusive (Tycho)

    - Update selftests for exclusivity testing (Kees)"

    * tag 'seccomp-v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    seccomp: Make NEW_LISTENER and TSYNC flags exclusive
    selftests/seccomp: Prepare for exclusive seccomp flags

    Linus Torvalds
     

26 Apr, 2019

1 commit

  • As the comment notes, the return codes for TSYNC and NEW_LISTENER
    conflict, because they both return positive values, one in the case of
    success and one in the case of error. So, let's disallow both of these
    flags together.

    While this is technically a userspace break, all the users I know
    of are still waiting on me to land this feature in libseccomp, so I
    think it'll be safe. Also, at present my use case doesn't require
    TSYNC at all, so this isn't a big deal to disallow. If someone
    wanted to support this, a path forward would be to add a new flag like
    TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN,
    but the use cases are so different I don't see it really happening.

    Finally, it's worth noting that this does actually fix a UAF issue: at the
    end of seccomp_set_mode_filter(), we have:

    if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
        if (ret < 0) {
            listener_f->private_data = NULL;
            fput(listener_f);
            put_unused_fd(listener);
        } else {
            fd_install(listener, listener_f);
            ret = listener;
        }
    }
    out_free:
        seccomp_filter_free(prepared);

    But if ret > 0 because TSYNC raced, we'll install the listener fd and then
    free the filter out from underneath it, causing a UAF when the task closes
    it or dies. This patch also switches the condition to be simply if (ret),
    so that if someone does add the flag mentioned above, they won't have to
    remember to fix this too.

    Reported-by: syzbot+b562969adb2e04af3442@syzkaller.appspotmail.com
    Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
    CC: stable@vger.kernel.org # v5.0+
    Signed-off-by: Tycho Andersen
    Signed-off-by: Kees Cook
    Acked-by: James Morris

    Tycho Andersen
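
The two changes described above (rejecting the flag combination, and treating any non-zero ret as failure) can be modelled in userspace. This is only a sketch of the logic as of this commit, not the kernel code: `check_filter_flags()` and `should_install_listener()` are hypothetical names, and the fallback flag values are assumed from the uapi header.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>

/* Fallbacks for older uapi headers; values assumed from the header. */
#ifndef SECCOMP_FILTER_FLAG_TSYNC
#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
#endif
#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif

/*
 * Hypothetical model of the flag check: both flags report results via
 * positive return values, so requesting them together is rejected.
 */
int check_filter_flags(unsigned long flags)
{
    if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
        (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER))
        return -EINVAL;
    return 0;
}

/*
 * Hypothetical model of the cleanup condition: the listener fd is only
 * installed when ret is exactly 0. Any other value, negative or
 * positive, takes the teardown path.
 */
int should_install_listener(long ret)
{
    return ret == 0;
}
```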
     

24 Apr, 2019

1 commit


05 Apr, 2019

1 commit

  • At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
    function call syscall_get_arguments() implemented in x86 was horribly
    written and not optimized for the standard case of passing in 0 and 6 for
    the starting index and the number of arguments to get. When looking at
    all the users of this function, I discovered that all instances pass in only
    0 and 6 for these arguments. Instead of having this function handle
    different cases that are never used, simply rewrite it to return the first 6
    arguments of a system call.

    This should help out the performance of tracing system calls by ptrace,
    ftrace and perf.

    Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org

    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Dominik Brodowski
    Cc: Dave Martin
    Cc: "Dmitry V. Levin"
    Cc: x86@kernel.org
    Cc: linux-snps-arc@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: uclinux-h8-devel@lists.sourceforge.jp
    Cc: linux-hexagon@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-mips@vger.kernel.org
    Cc: nios2-dev@lists.rocketboards.org
    Cc: openrisc@lists.librecores.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-riscv@lists.infradead.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: sparclinux@vger.kernel.org
    Cc: linux-um@lists.infradead.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: linux-arch@vger.kernel.org
    Acked-by: Paul Burton # MIPS parts
    Acked-by: Max Filippov # For xtensa changes
    Acked-by: Will Deacon # For the arm64 bits
    Reviewed-by: Thomas Gleixner # for x86
    Reviewed-by: Dmitry V. Levin
    Reported-by: Andy Lutomirski
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (Red Hat)
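
The interface change can be illustrated with a userspace mock. This is a sketch under stated assumptions, not the kernel implementation: `struct mock_regs` is a hypothetical stand-in for the x86-64 register state that holds only the six syscall argument registers, and both function names are invented for the comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pt_regs: x86-64 syscall argument registers. */
struct mock_regs {
    unsigned long di, si, dx, r10, r8, r9;
};

/* Old-style interface: copy n arguments starting at index i. */
void get_arguments_range(const struct mock_regs *regs, unsigned int i,
                         unsigned int n, unsigned long *args)
{
    const unsigned long all[6] = {
        regs->di, regs->si, regs->dx, regs->r10, regs->r8, regs->r9
    };

    for (unsigned int k = 0; k < n && i + k < 6; k++)
        args[k] = all[i + k];
}

/* New-style interface: always return the first six arguments. */
void get_arguments(const struct mock_regs *regs, unsigned long args[6])
{
    args[0] = regs->di;
    args[1] = regs->si;
    args[2] = regs->dx;
    args[3] = regs->r10;
    args[4] = regs->r8;
    args[5] = regs->r9;
}
```

With every caller passing 0 and 6, the range form degenerates to the fixed form, which needs no bounds handling or loop.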
     

21 Mar, 2019

1 commit

  • This argument is required to extend the generic ptrace API with
    PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going
    to be called from ptrace_request() along with syscall_get_nr(),
    syscall_get_arguments(), syscall_get_error(), and
    syscall_get_return_value() functions with a tracee as their argument.

    The primary intent is that the triple (audit_arch, syscall_nr, arg1..arg6)
    should describe what system call is being called and what its arguments
    are.

    Reverts: 5e937a9ae913 ("syscall_get_arch: remove useless function arguments")
    Reverts: 1002d94d3076 ("syscall.h: fix doc text for syscall_get_arch()")
    Reviewed-by: Andy Lutomirski # for x86
    Reviewed-by: Palmer Dabbelt
    Acked-by: Paul Moore
    Acked-by: Paul Burton # MIPS parts
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Kees Cook # seccomp parts
    Acked-by: Mark Salter # for the c6x bit
    Cc: Elvira Khabirova
    Cc: Eugene Syromyatnikov
    Cc: Oleg Nesterov
    Cc: x86@kernel.org
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-snps-arc@lists.infradead.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: uclinux-h8-devel@lists.sourceforge.jp
    Cc: linux-hexagon@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: linux-mips@vger.kernel.org
    Cc: nios2-dev@lists.rocketboards.org
    Cc: openrisc@lists.librecores.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-riscv@lists.infradead.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: sparclinux@vger.kernel.org
    Cc: linux-um@lists.infradead.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-audit@redhat.com
    Signed-off-by: Dmitry V. Levin
    Signed-off-by: Paul Moore

    Dmitry V. Levin
     

08 Mar, 2019

1 commit

  • Pull security subsystem updates from James Morris:

    - Extend LSM stacking to allow sharing of cred, file, ipc, inode, and
    task blobs. This paves the way for more full-featured LSMs to be
    merged, and is specifically aimed at LandLock and SARA LSMs. This
    work is from Casey and Kees.

    - There's a new LSM from Micah Morton: "SafeSetID gates the setid
    family of syscalls to restrict UID/GID transitions from a given
    UID/GID to only those approved by a system-wide whitelist." This
    feature is currently shipping in ChromeOS.

    * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (62 commits)
    keys: fix missing __user in KEYCTL_PKEY_QUERY
    LSM: Update list of SECURITYFS users in Kconfig
    LSM: Ignore "security=" when "lsm=" is specified
    LSM: Update function documentation for cap_capable
    security: mark expected switch fall-throughs and add a missing break
    tomoyo: Bump version.
    LSM: fix return value check in safesetid_init_securityfs()
    LSM: SafeSetID: add selftest
    LSM: SafeSetID: remove unused include
    LSM: SafeSetID: 'depend' on CONFIG_SECURITY
    LSM: Add 'name' field for SafeSetID in DEFINE_LSM
    LSM: add SafeSetID module that gates setid calls
    LSM: add SafeSetID module that gates setid calls
    tomoyo: Allow multiple use_group lines.
    tomoyo: Coding style fix.
    tomoyo: Swicth from cred->security to task_struct->security.
    security: keys: annotate implicit fall throughs
    security: keys: annotate implicit fall throughs
    security: keys: annotate implicit fall through
    capabilities:: annotate implicit fall through
    ...

    Linus Torvalds
     

22 Feb, 2019

1 commit


23 Jan, 2019

1 commit


16 Jan, 2019

1 commit

  • On the failure path, we do an fput() of the listener fd if the filter fails
    to install (e.g. because of a TSYNC race that's lost, or if the thread is
    killed, etc.). fput() doesn't actually release the fd, it just adds it to a
    work queue. Then the thread proceeds to free the filter, even though the
    listener struct file has a reference to it.

    To fix this, on the failure path let's set the private data to null, so we
    know in ->release() to ignore the filter.

    Reported-by: syzbot+981c26489b2d1c6316ba@syzkaller.appspotmail.com
    Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
    Signed-off-by: Tycho Andersen
    Acked-by: Kees Cook
    Signed-off-by: James Morris

    Tycho Andersen
     

11 Jan, 2019

1 commit

  • This patch provides a general mechanism for passing flags to the
    security_capable LSM hook. It replaces the specific 'audit' flag that is
    used to tell security_capable whether it should log an audit message for
    the given capability check. The reason for generalizing this flag
    passing is so we can add an additional flag that signifies whether
    security_capable is being called by a setid syscall (which is needed by
    the proposed SafeSetID LSM).

    Signed-off-by: Micah Morton
    Reviewed-by: Kees Cook
    Signed-off-by: James Morris

    Micah Morton
     

14 Dec, 2018

1 commit

  • sparse complains,

    kernel/seccomp.c:1172:13: warning: incorrect type in assignment (different base types)
    kernel/seccomp.c:1172:13: expected restricted __poll_t [usertype] ret
    kernel/seccomp.c:1172:13: got int
    kernel/seccomp.c:1173:13: warning: restricted __poll_t degrades to integer

    Instead of assigning this to ret, since we don't use this anywhere, let's
    just test it against 0 directly.

    Signed-off-by: Tycho Andersen
    Reported-by: 0day robot
    Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
    Signed-off-by: Kees Cook

    Tycho Andersen