23 May, 2018

1 commit

  • commit 00a02d0c502a06d15e07b857f8ff921e3e402675 upstream

    If a seccomp user is not interested in Speculative Store Bypass mitigation
    by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when
    adding filters.
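
    A minimal sketch of what such a loader might look like (the filter
    contents and the helper name are placeholders, not part of this commit):

        #include <linux/filter.h>
        #include <linux/seccomp.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Install 'prog' without enabling Speculative Store Bypass
         * mitigation for this task. 'prog' is assumed to be a valid,
         * already-built classic BPF filter. */
        static int load_filter_spec_allow(const struct sock_fprog *prog)
        {
                return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                               SECCOMP_FILTER_FLAG_SPEC_ALLOW, prog);
        }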

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

15 Aug, 2017

1 commit

  • Add a new filter flag, SECCOMP_FILTER_FLAG_LOG, that enables logging for
    all actions except for SECCOMP_RET_ALLOW for the given filter.

    SECCOMP_RET_KILL actions are always logged when "kill" is in the
    actions_logged sysctl, and SECCOMP_RET_ALLOW actions are never logged,
    regardless of this flag.

    This flag can be used to create noisy filters that result in all
    non-allowed actions being logged. A process may have one noisy filter,
    which is loaded with this flag, as well as a quiet filter that's not
    loaded with this flag. This allows for the actions in a set of filters
    to be selectively conveyed to the admin.

    Since a system could have a large number of allocated seccomp_filter
    structs, struct packing was taken into consideration. On 64 bit x86, the
    new log member takes up one byte of an existing four byte hole in the
    struct. On 32 bit x86, the new log member creates a new four byte hole
    (unavoidable) and consumes one of those bytes.

    Unfortunately, the tests added for SECCOMP_FILTER_FLAG_LOG are not
    capable of inspecting the audit log to verify that the actions taken in
    the filter were logged.

    With this patch, the logic for deciding if an action will be logged is:

    if action == RET_ALLOW:
            do not log
    else if action == RET_KILL && RET_KILL in actions_logged:
            log
    else if filter-requests-logging && action in actions_logged:
            log
    else if audit_enabled && process-is-being-audited:
            log
    else:
            do not log
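
    As a sketch (assuming the seccomp(2) syscall and a prebuilt filter; the
    helper name is illustrative only), a noisy filter would be loaded like
    this:

        #include <linux/filter.h>
        #include <linux/seccomp.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Load 'prog' as a noisy filter: every action it returns other
         * than SECCOMP_RET_ALLOW is logged, subject to actions_logged. */
        static int load_noisy_filter(const struct sock_fprog *prog)
        {
                return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                               SECCOMP_FILTER_FLAG_LOG, prog);
        }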

    Signed-off-by: Tyler Hicks
    Signed-off-by: Kees Cook

    Tyler Hicks
     

15 Jun, 2016

2 commits


28 Oct, 2015

1 commit

  • This patch adds support for dumping a process' (classic BPF) seccomp
    filters via ptrace.

    PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF
    seccomp filters. addr should be an integer which represents the ith seccomp
    filter (0 is the most recently installed filter). data should be a struct
    sock_filter * with enough room for the ith filter, or NULL, in which case
    the filter is not saved. The return value for this command is the number of
    BPF instructions the program represents, or negative in the case of errors.
    Command-specific errors are ENOENT, which indicates that there is no ith
    filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith
    filter was not installed as a classic BPF filter.

    A caveat with this approach is that there is no way to get explicitly at
    the hierarchy of seccomp filters, and users need to memcmp() filters to
    decide which are inherited. This means that a task which installs two of
    the same filter can potentially confuse users of this interface.
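
    A sketch of the two-call pattern a tracer might use (the helper name and
    error handling are illustrative; PTRACE_SECCOMP_GET_FILTER may need to
    come from linux/ptrace.h on older libcs):

        #include <linux/filter.h>
        #include <stdlib.h>
        #include <sys/ptrace.h>
        #include <sys/types.h>

        /* Dump the tracee's most recently installed filter (index 0).
         * Returns the instruction count and stores the program in *out
         * (caller frees), or -1 on error. */
        static long dump_latest_filter(pid_t pid, struct sock_filter **out)
        {
                long cnt = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);

                if (cnt < 0)
                        return -1;
                *out = calloc(cnt, sizeof(**out));
                if (!*out)
                        return -1;
                return ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, *out);
        }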

    v2: * make save_orig const
    * check that the orig_prog exists (not necessary right now, but when
    seccomp grows eBPF support it will be)
    * s/n/filter_off and make it an unsigned long to match ptrace
    * count "down" the tree instead of "up" when passing a filter offset

    v3: * don't take the current task's lock for inspecting its seccomp mode
    * use a 0x42** constant for the ptrace command value

    v4: * don't copy to userspace while holding spinlocks

    v5: * add another condition to WARN_ON

    v6: * rebase on net-next

    Signed-off-by: Tycho Andersen
    Acked-by: Kees Cook
    CC: Will Drewry
    Reviewed-by: Oleg Nesterov
    CC: Andy Lutomirski
    CC: Pavel Emelyanov
    CC: Serge E. Hallyn
    CC: Alexei Starovoitov
    CC: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Tycho Andersen
     

16 Jul, 2015

1 commit


04 Sep, 2014

3 commits

  • populate_seccomp_data is expensive: it works by inspecting
    task_pt_regs and various other bits to piece together all the
    information, and it does so in multiple partially redundant steps.

    Arch-specific code in the syscall entry path can do much better.

    Admittedly this adds a bit of additional room for error, but the
    speedup should be worth it.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Kees Cook

    Andy Lutomirski
     
  • The reason I did this is to add a seccomp API that will be usable
    for an x86 fast path. The x86 entry code needs to use a rather
    expensive slow path for a syscall that might be visible to things
    like ptrace. By splitting seccomp into two phases, we can check
    whether we need the slow path and then use the fast path if the
    filter allows the syscall or just returns some errno.

    As a side effect, I think the new code is much easier to understand
    than the old code.

    This has one user-visible effect: the audit record written for
    SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
    happened. It used to depend in a complicated way on what the tracer
    did. I couldn't make much sense of it.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Kees Cook

    Andy Lutomirski
     
  • The secure_computing function took a syscall number parameter, but
    it only paid any attention to that parameter if seccomp mode 1 was
    enabled. Rather than coming up with a kludge to get the parameter
    to work in mode 2, just remove the parameter.

    To avoid churn in arches that don't have seccomp filters (and may
    not even support syscall_get_nr right now), this leaves the
    parameter in secure_computing_strict, which is now a real function.

    For ARM, this is a bit ugly due to the fact that ARM conditionally
    supports seccomp filters. Fixing that would probably only be a
    couple of lines of code, but it should be coordinated with the audit
    maintainers.

    This will be a slight slowdown on some arches. The right fix is to
    pass in all of seccomp_data instead of trying to make just the
    syscall nr part be fast.

    This is a prerequisite for making two-phase seccomp work cleanly.

    Cc: Russell King
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux-s390@vger.kernel.org
    Cc: x86@kernel.org
    Cc: Kees Cook
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Kees Cook

    Andy Lutomirski
     

19 Jul, 2014

2 commits

  • Applying restrictive seccomp filter programs to large or diverse
    codebases often requires handling threads which may be started early in
    the process lifetime (e.g., by code that is linked in). While it is
    possible to apply permissive programs prior to process start up, it is
    difficult to further restrict the kernel ABI to those threads after that
    point.

    This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
    synchronizing thread group seccomp filters at filter installation time.

    When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
    filter) an attempt will be made to synchronize all threads in current's
    threadgroup to its new seccomp filter program. This is possible iff all
    threads are using a filter that is an ancestor to the filter current is
    attempting to synchronize to. NULL filters (where the task is running as
    SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
    transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS,
    ...) has been set on the calling thread, no_new_privs will be set for
    all synchronized threads too. On success, 0 is returned. On failure,
    the pid of one of the failing threads will be returned and no filters
    will have been applied.
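
    A minimal sketch of such a call (the helper name is illustrative; it
    assumes no_new_privs has already been set and 'prog' is a valid filter):

        #include <linux/filter.h>
        #include <linux/seccomp.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Try to apply 'prog' to every thread in the thread group.
         * Returns 0 on success, or the pid of a thread whose current
         * filter is not an ancestor of 'prog'. */
        static long filter_all_threads(const struct sock_fprog *prog)
        {
                return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                               SECCOMP_FILTER_FLAG_TSYNC, prog);
        }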

    The race conditions against another thread are:
    - requesting TSYNC (already handled by sighand lock)
    - performing a clone (already handled by sighand lock)
    - changing its filter (already handled by sighand lock)
    - calling exec (handled by cred_guard_mutex)
    The clone case is assisted by the fact that new threads will have their
    seccomp state duplicated from their parent before appearing on the tasklist.

    Holding cred_guard_mutex means that seccomp filters cannot be assigned
    while in the middle of another thread's exec (potentially bypassing
    no_new_privs or similar). The call to de_thread() may kill threads waiting
    for the mutex.

    Changes across threads to the filter pointer include a barrier.

    Based on patches by Will Drewry.

    Suggested-by: Julien Tinnes
    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     
  • Normally, task_struct.seccomp.filter is only ever read or modified by
    the task that owns it (current). This property aids in fast access
    during system call filtering as read access is lockless.

    Updating the pointer from another task, however, opens up race
    conditions. To allow cross-thread filter pointer updates, writes to the
    seccomp fields are now protected by the sighand spinlock (which is shared
    by all threads in the thread group). Read access remains lockless because
    pointer updates themselves are atomic. However, writes (or cloning)
    often entail additional checking (like maximum instruction counts)
    which require locking to perform safely.

    In the case of cloning threads, the child is invisible to the system
    until it enters the task list. To make sure a child can't be cloned from
    a thread and left in a prior state, seccomp duplication is additionally
    moved under the sighand lock. Then parent and child are certain to have
    the same seccomp state when they exit the lock.

    Based on patches by Will Drewry and David Drysdale.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     

31 Mar, 2014

1 commit

    This patch replaces/reworks the kernel-internal BPF interpreter with
    an optimized BPF instruction set format that is modelled more closely
    on native instruction sets and is designed to be JITed with a one-to-one
    mapping. Thus, the new interpreter is noticeably faster than the
    current implementation of sk_run_filter(), mainly for two reasons:

    1. Fall-through jumps:

    BPF jump instructions are forced to go to either the 'true' or the
    'false' branch, which causes a branch-miss penalty. The new BPF jump
    instructions have only one branch and fall-through otherwise,
    which fits the CPU branch predictor logic better. `perf stat`
    shows drastic difference for branch-misses between the old and
    new code.

    2. Jump-threaded implementation of interpreter vs switch
    statement:

    Instead of single table-jump at the top of 'switch' statement,
    gcc will now generate multiple table-jump instructions, which
    helps CPU branch predictor logic.

    Note that the verification of filters is still being done through
    sk_chk_filter() in classical BPF format, so filters from user- or
    kernel space are verified in the same way as we do now, and the same
    restrictions/constraints hold as well.

    We reuse current BPF JIT compilers in a way that this upgrade would
    even be fine as is, but nevertheless allows for a successive upgrade
    of BPF JIT compilers to the new format.

    The internal instruction set migration is being done after the
    probing for JIT compilation, so in case JIT compilers are able to
    create a native opcode image, we're going to use that, and in all
    other cases we're doing a follow-up migration of the BPF program's
    instruction set, so that it can be transparently run in the new
    interpreter.

    In short, the *internal* format extends BPF in the following way (more
    details can be taken from the appended documentation):

    - Number of registers increases from 2 to 10
    - Register width increases from 32-bit to 64-bit
    - Conditional jt/jf targets replaced with jt/fall-through
    - Adds signed > and >= insns
    - 16 4-byte stack slots for register spill-fill replaced
    with up to 512 bytes of multi-use stack space
    - Introduction of bpf_call insn and register passing convention
    for zero overhead calls from/to other kernel functions
    - Adds arithmetic right shift and endianness conversion insns
    - Adds atomic_add insn
    - Old tax/txa insns are replaced with 'mov dst,src' insn

    Performance of two BPF filters, generated by libpcap and bpf_asm
    respectively, was measured on x86_64, i386 and arm32 (other libpcap
    have similar performance differences):

    fprog #1 is taken from Documentation/networking/filter.txt:
    tcpdump -i eth0 port 22 -dd

    fprog #2 is taken from 'man tcpdump':
    tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' -dd

    Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
    same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
    smaller is better:

    --x86_64--
                 fprog #1   fprog #1    fprog #2   fprog #2
                 cache-hit  cache-miss  cache-hit  cache-miss
    old BPF          90        101         192        202
    new BPF          31         71          47         97
    old BPF jit      12         34          17         44
    new BPF jit     TBD

    --i386--
                 fprog #1   fprog #1    fprog #2   fprog #2
                 cache-hit  cache-miss  cache-hit  cache-miss
    old BPF         107        136         227        252
    new BPF          40        119          69        172

    --arm32--
                 fprog #1   fprog #1    fprog #2   fprog #2
                 cache-hit  cache-miss  cache-hit  cache-miss
    old BPF         202        300         475        540
    new BPF         180        270         330        470
    old BPF jit      26        182          37        202
    new BPF jit     TBD

    Thus, without changing any userland BPF filters, applications on
    top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
    classifier, netfilter's xt_bpf, team driver's load-balancing mode,
    and many more will have better interpreter filtering performance.

    While we are replacing the internal BPF interpreter, we also need
    to convert seccomp BPF in the same step to make use of the new
    internal structure since it makes use of lower-level API details
    without being further decoupled through higher-level calls like
    sk_unattached_filter_{create,destroy}(), for example.

    Just as for normal socket filtering, also seccomp BPF experiences
    a time-to-verdict speedup:

    05-sim-long_jumps.c of libseccomp was used as micro-benchmark:

        seccomp_rule_add_exact(ctx,...
        seccomp_rule_add_exact(ctx,...

        rc = seccomp_load(ctx);

        for (i = 0; i < 10000000; i++)
                syscall(199, 100);

    'short filter' has 2 rules
    'large filter' has 200 rules

    'short filter' performance is slightly better on x86_64/i386/arm32
    'large filter' is much faster on x86_64 and i386 and shows no
    difference on arm32

    --x86_64-- short filter
    old BPF: 2.7 sec
    39.12% bench libc-2.15.so [.] syscall
    8.10% bench [kernel.kallsyms] [k] sk_run_filter
    6.31% bench [kernel.kallsyms] [k] system_call
    5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
    4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
    3.70% bench [kernel.kallsyms] [k] __secure_computing
    3.67% bench [kernel.kallsyms] [k] lock_is_held
    3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
    new BPF: 2.58 sec
    42.05% bench libc-2.15.so [.] syscall
    6.91% bench [kernel.kallsyms] [k] system_call
    6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
    6.07% bench [kernel.kallsyms] [k] __secure_computing
    5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp

    --arm32-- short filter
    old BPF: 4.0 sec
    39.92% bench [kernel.kallsyms] [k] vector_swi
    16.60% bench [kernel.kallsyms] [k] sk_run_filter
    14.66% bench libc-2.17.so [.] syscall
    5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
    5.10% bench [kernel.kallsyms] [k] __secure_computing
    new BPF: 3.7 sec
    35.93% bench [kernel.kallsyms] [k] vector_swi
    21.89% bench libc-2.17.so [.] syscall
    13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
    6.25% bench [kernel.kallsyms] [k] __secure_computing
    3.96% bench [kernel.kallsyms] [k] syscall_trace_exit

    --x86_64-- large filter
    old BPF: 8.6 seconds
    73.38% bench [kernel.kallsyms] [k] sk_run_filter
    10.70% bench libc-2.15.so [.] syscall
    5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
    1.97% bench [kernel.kallsyms] [k] system_call
    new BPF: 5.7 seconds
    66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
    16.75% bench libc-2.15.so [.] syscall
    3.31% bench [kernel.kallsyms] [k] system_call
    2.88% bench [kernel.kallsyms] [k] __secure_computing

    --i386-- large filter
    old BPF: 5.4 sec
    new BPF: 3.8 sec

    --arm32-- large filter
    old BPF: 13.5 sec
    73.88% bench [kernel.kallsyms] [k] sk_run_filter
    10.29% bench [kernel.kallsyms] [k] vector_swi
    6.46% bench libc-2.17.so [.] syscall
    2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
    1.19% bench [kernel.kallsyms] [k] __secure_computing
    0.87% bench [kernel.kallsyms] [k] sys_getuid
    new BPF: 13.5 sec
    76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
    10.98% bench [kernel.kallsyms] [k] vector_swi
    5.87% bench libc-2.17.so [.] syscall
    1.77% bench [kernel.kallsyms] [k] __secure_computing
    0.93% bench [kernel.kallsyms] [k] sys_getuid

    BPF filters generated by seccomp are very branchy, so the new
    internal BPF performance is better than the old one. Performance
    gains will be even higher when BPF JIT is committed for the
    new structure, which is planned in future work (as successive
    JIT migrations).

    BPF has also been stress-tested with trinity's BPF fuzzer.

    Joint work with Daniel Borkmann.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Cc: Hagen Paul Pfeifer
    Cc: Kees Cook
    Cc: Paul Moore
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: linux-kernel@vger.kernel.org
    Acked-by: Kees Cook
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

13 Oct, 2012

1 commit


18 Apr, 2012

1 commit

  • This change is inspired by
    https://lkml.org/lkml/2012/4/16/14
    which fixes the build warnings for arches that don't support
    CONFIG_HAVE_ARCH_SECCOMP_FILTER.

    In particular, there is no requirement for the return value of
    secure_computing() to be checked unless the architecture supports
    seccomp filter. Instead of silencing the warnings with (void),
    a new static inline is added to encode the expected behavior
    in a compiler- and human-friendly way.

    v2: - cleans things up with a static inline
    - removes sfr's signed-off-by since it is a different approach
    v1: - matches sfr's original change

    Reported-by: Stephen Rothwell
    Signed-off-by: Will Drewry
    Acked-by: Kees Cook
    Signed-off-by: James Morris

    Will Drewry
     

17 Apr, 2012

1 commit


14 Apr, 2012

5 commits

  • This change adds support for a new ptrace option, PTRACE_O_TRACESECCOMP,
    and a new return value for seccomp BPF programs, SECCOMP_RET_TRACE.

    When a tracer specifies the PTRACE_O_TRACESECCOMP ptrace option, the
    tracer will be notified, via PTRACE_EVENT_SECCOMP, for any syscall that
    results in a BPF program returning SECCOMP_RET_TRACE. The 16-bit
    SECCOMP_RET_DATA mask of the BPF program return value will be passed as
    the ptrace_message and may be retrieved using PTRACE_GETEVENTMSG.

    If the subordinate process is not using a seccomp filter, then no
    system call notifications will occur even if the option is specified.

    If there is no tracer with PTRACE_O_TRACESECCOMP when SECCOMP_RET_TRACE
    is returned, the system call will not be executed and an -ENOSYS errno
    will be returned to userspace.

    This change adds a dependency on the system call slow path. Any future
    efforts to use the system call fast path for seccomp filter will need to
    address this restriction.
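
    A sketch of how a tracer might consume these events (the waitpid status
    decoding follows the usual PTRACE_EVENT convention; the helper name is
    illustrative):

        #include <signal.h>
        #include <sys/ptrace.h>
        #include <sys/types.h>
        #include <sys/wait.h>

        /* Wait for the next stop of 'pid'. Returns 1 and fills *msg with
         * the filter's SECCOMP_RET_DATA on a seccomp event, 0 for other
         * stops, -1 on error. Assumes PTRACE_O_TRACESECCOMP was set via
         * PTRACE_SETOPTIONS. */
        static int wait_seccomp_event(pid_t pid, unsigned long *msg)
        {
                int status;

                if (waitpid(pid, &status, 0) < 0)
                        return -1;
                if (!WIFSTOPPED(status) ||
                    status >> 8 != (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8)))
                        return 0;
                return ptrace(PTRACE_GETEVENTMSG, pid, 0, msg) < 0 ? -1 : 1;
        }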

    Signed-off-by: Will Drewry
    Acked-by: Eric Paris

    v18: - rebase
    - comment fatal_signal check
    - acked-by
    - drop secure_computing_int comment
    v17: - ...
    v16: - update PT_TRACE_MASK to 0xbf4 so that STOP isn't clear on SETOPTIONS call (indan@nul.nu)
    [note PT_TRACE_MASK disappears in linux-next]
    v15: - add audit support for non-zero return codes
    - clean up style (indan@nul.nu)
    v14: - rebase/nochanges
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    (Brings back a change to ptrace.c and the masks.)
    v12: - rebase to linux-next
    - use ptrace_event and update arch/Kconfig to mention slow-path dependency
    - drop all tracehook changes and inclusion (oleg@redhat.com)
    v11: - invert the logic to just make it a PTRACE_SYSCALL accelerator
    (indan@nul.nu)
    v10: - moved to PTRACE_O_SECCOMP / PT_TRACE_SECCOMP
    v9: - n/a
    v8: - guarded PTRACE_SECCOMP use with an ifdef
    v7: - introduced
    Signed-off-by: James Morris

    Will Drewry
     
  • Adds a new return value to seccomp filters that triggers a SIGSYS to be
    delivered with the new SYS_SECCOMP si_code.

    This allows in-process system call emulation, including just specifying
    an errno or cleanly dumping core, rather than just dying.
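
    A sketch of a handler making use of the new si_code (SYS_SECCOMP may need
    to be defined by hand on older headers; the handler body is a placeholder):

        #include <signal.h>
        #include <unistd.h>

        #ifndef SYS_SECCOMP
        #define SYS_SECCOMP 1
        #endif

        static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
        {
                (void)sig; (void)ctx;
                if (info->si_code != SYS_SECCOMP)
                        return;
                /* emulate the blocked call, log it, or dump core here */
                _exit(1);
        }

        static void install_sigsys_handler(void)
        {
                struct sigaction act;

                sigemptyset(&act.sa_mask);
                act.sa_sigaction = sigsys_handler;
                act.sa_flags = SA_SIGINFO;
                sigaction(SIGSYS, &act, NULL);
        }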

    Suggested-by: Markus Gutschke
    Suggested-by: Julien Tinnes
    Signed-off-by: Will Drewry
    Acked-by: Eric Paris

    v18: - acked-by, rebase
    - don't mention secure_computing_int() anymore
    v15: - use audit_seccomp/skip
    - pad out error spacing; clean up switch (indan@nul.nu)
    v14: - n/a
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - rebase on to linux-next
    v11: - clarify the comment (indan@nul.nu)
    - s/sigtrap/sigsys
    v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig
    note suggested-by (though original suggestion had other behaviors)
    v9: - changes to SIGILL
    v8: - clean up based on changes to dependent patches
    v7: - introduction
    Signed-off-by: James Morris

    Will Drewry
     
  • This change adds the SECCOMP_RET_ERRNO as a valid return value from a
    seccomp filter. Additionally, it makes the first use of the lower
    16-bits for storing a filter-supplied errno. 16-bits is more than
    enough for the errno-base.h calls.

    Returning errors instead of immediately terminating processes that
    violate seccomp policy allows for broader use of this functionality
    for kernel attack surface reduction. For example, a Linux container
    could maintain a whitelist of pre-existing system calls but drop
    all new ones with errnos. This would keep a logically static attack
    surface while providing errnos that may allow for graceful failure
    without the downside of do_exit() on a bad call.
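
    A sketch of what such a rule looks like in a classic BPF filter
    (__NR_uselib is an arbitrary example syscall; a real filter would also
    check seccomp_data->arch before trusting the syscall number):

        #include <errno.h>
        #include <linux/filter.h>
        #include <linux/seccomp.h>
        #include <stddef.h>
        #include <sys/syscall.h>

        /* Fail uselib(2) with EPERM, allow everything else. */
        static struct sock_filter errno_filter[] = {
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct seccomp_data, nr)),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_uselib, 0, 1),
                BPF_STMT(BPF_RET | BPF_K,
                         SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };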

    This change also changes the signature of __secure_computing. It
    appears the only direct caller is the arm entry code and it clobbers
    any possible return value (register) immediately.

    Signed-off-by: Will Drewry
    Acked-by: Serge Hallyn
    Reviewed-by: Kees Cook
    Acked-by: Eric Paris

    v18: - fix up comments and rebase
    - fix bad var name which was fixed in later revs
    - remove _int() and just change the __secure_computing signature
    v16-v17: ...
    v15: - use audit_seccomp and add a skip label. (eparis@redhat.com)
    - clean up and pad out return codes (indan@nul.nu)
    v14: - no change/rebase
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - move to WARN_ON if filter is NULL
    (oleg@redhat.com, luto@mit.edu, keescook@chromium.org)
    - return immediately for filter==NULL (keescook@chromium.org)
    - change evaluation to only compare the ACTION so that layered
    errnos don't result in the lowest one being returned.
    (keeschook@chromium.org)
    v11: - check for NULL filter (keescook@chromium.org)
    v10: - change loaders to fn
    v9: - n/a
    v8: - update Kconfig to note new need for syscall_set_return_value.
    - reordered such that TRAP behavior follows on later.
    - made the for loop a little less indent-y
    v7: - introduced
    Signed-off-by: James Morris

    Will Drewry
     
  • [This patch depends on luto@mit.edu's no_new_privs patch:
    https://lkml.org/lkml/2012/1/30/264
    The whole series including Andrew's patches can be found here:
    https://github.com/redpig/linux/tree/seccomp
    Complete diff here:
    https://github.com/redpig/linux/compare/1dc65fed...seccomp
    ]

    This patch adds support for seccomp mode 2. Mode 2 introduces the
    ability for unprivileged processes to install system call filtering
    policy expressed in terms of a Berkeley Packet Filter (BPF) program.
    This program is evaluated in the kernel for each system call the
    task makes and computes a result based on data in the format of
    struct seccomp_data.

    A filter program may be installed by calling:

        struct sock_fprog fprog = { ... };
        ...
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);

    The return value of the filter program determines whether the system call
    is allowed to proceed or is denied. If the first filter program installed
    allows prctl(2) calls, then the above call may be made repeatedly
    by a task to further reduce its access to the kernel. All attached
    programs must be evaluated before a system call will be allowed to
    proceed.
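
    A minimal sketch of the full sequence, assuming the no_new_privs prctl
    from the dependency noted above (the allow-everything filter is only a
    placeholder):

        #include <linux/filter.h>
        #include <linux/seccomp.h>
        #include <sys/prctl.h>

        static int install_allow_all(void)
        {
                struct sock_filter insn =
                        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
                struct sock_fprog prog = { .len = 1, .filter = &insn };

                /* Required for unprivileged tasks (see below). */
                if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                        return -1;
                return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
        }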

    Filter programs will be inherited across fork/clone and execve.
    However, if the task attaching the filter is unprivileged
    (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
    ensures that unprivileged tasks cannot attach filters that affect
    privileged tasks (e.g., setuid binary).

    There are a number of benefits to this approach, a few of which are
    as follows:
    - BPF has been exposed to userland for a long time
    - BPF optimization (and JIT'ing) are well understood
    - Userland already knows its ABI: system call numbers and desired
    arguments
    - No time-of-check-time-of-use vulnerable data accesses are possible.
    - system call arguments are loaded on access only to minimize copying
    required for system call policy decisions.

    Mode 2 support is restricted to architectures that enable
    HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
    syscall_get_arguments(). The full desired scope of this feature will
    add a few minor additional requirements expressed later in this series.
    Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
    the desired additional functionality.

    No architectures are enabled in this patch.

    Signed-off-by: Will Drewry
    Acked-by: Serge Hallyn
    Reviewed-by: Indan Zupancic
    Acked-by: Eric Paris
    Reviewed-by: Kees Cook

    v18: - rebase to v3.4-rc2
    - s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org)
    - allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu)
    - add a comment for get_u32 regarding endianness (akpm@)
    - fix other typos, style mistakes (akpm@)
    - added acked-by
    v17: - properly guard seccomp filter needed headers (leann@ubuntu.com)
    - tighten return mask to 0x7fff0000
    v16: - no change
    v15: - add a 4 instr penalty when counting a path to account for seccomp_filter
    size (indan@nul.nu)
    - drop the max insns to 256KB (indan@nul.nu)
    - return ENOMEM if the max insns limit has been hit (indan@nul.nu)
    - move IP checks after args (indan@nul.nu)
    - drop !user_filter check (indan@nul.nu)
    - only allow explicit bpf codes (indan@nul.nu)
    - exit_code -> exit_sig
    v14: - put/get_seccomp_filter takes struct task_struct
    (indan@nul.nu,keescook@chromium.org)
    - adds seccomp_chk_filter and drops general bpf_run/chk_filter user
    - add seccomp_bpf_load for use by net/core/filter.c
    - lower max per-process/per-hierarchy: 1MB
    - moved nnp/capability check prior to allocation
    (all of the above: indan@nul.nu)
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com)
    - removed copy_seccomp (keescook@chromium.org,indan@nul.nu)
    - reworded the prctl_set_seccomp comment (indan@nul.nu)
    v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com)
    - style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu)
    - do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu)
    - pare down Kconfig doc reference.
    - extra comment clean up
    v10: - seccomp_data has changed again to be more aesthetically pleasing
    (hpa@zytor.com)
    - calling convention is noted in a new u32 field using syscall_get_arch.
    This allows for cross-calling convention tasks to use seccomp filters.
    (hpa@zytor.com)
    - lots of clean up (thanks, Indan!)
    v9: - n/a
    v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
    - Lots of fixes courtesy of indan@nul.nu:
    -- fix up load behavior, compat fixups, and merge alloc code,
    -- renamed pc and dropped __packed, use bool compat.
    -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
    dependencies
    v7: (massive overhaul thanks to Indan, others)
    - added CONFIG_HAVE_ARCH_SECCOMP_FILTER
    - merged into seccomp.c
    - minimal seccomp_filter.h
    - no config option (part of seccomp)
    - no new prctl
    - doesn't break seccomp on systems without asm/syscall.h
    (works but arg access always fails)
    - dropped seccomp_init_task, extra free functions, ...
    - dropped the no-asm/syscall.h code paths
    - merges with network sk_run_filter and sk_chk_filter
    v6: - fix memory leak on attach compat check failure
    - require no_new_privs || CAP_SYS_ADMIN prior to filter
    installation. (luto@mit.edu)
    - s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com)
    - cleaned up Kconfig (amwang@redhat.com)
    - on block, note if the call was compat (so the # means something)
    v5: - uses syscall_get_arguments
    (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
    - uses union-based arg storage with hi/lo struct to
    handle endianness. Compromises between the two alternate
    proposals to minimize extra arg shuffling and account for
    endianness assuming userspace uses offsetof().
    (mcgrathr@chromium.org, indan@nul.nu)
    - update Kconfig description
    - add include/seccomp_filter.h and add its installation
    - (naive) on-demand syscall argument loading
    - drop seccomp_t (eparis@redhat.com)
    v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
    - now uses current->no_new_privs
    (luto@mit.edu,torvalds@linux-foundation.com)
    - assign names to seccomp modes (rdunlap@xenotime.net)
    - fix style issues (rdunlap@xenotime.net)
    - reworded Kconfig entry (rdunlap@xenotime.net)
    v3: - macros to inline (oleg@redhat.com)
    - init_task behavior fixed (oleg@redhat.com)
    - drop creator entry and extra NULL check (oleg@redhat.com)
    - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
    - adds tentative use of "always_unprivileged" as per
    torvalds@linux-foundation.org and luto@mit.edu
    v2: - (patch 2 only)
    Signed-off-by: James Morris

    Will Drewry
     
  • Replaces the seccomp_t typedef with struct seccomp to match modern
    kernel style.

    Signed-off-by: Will Drewry
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Acked-by: Eric Paris

    v18: rebase
    ...
    v14: rebase/nochanges
    v13: rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: rebase on to linux-next
    v8-v11: no changes
    v7: struct seccomp_struct -> struct seccomp
    v6: original inclusion in this series.
    Signed-off-by: James Morris

    Will Drewry
     

07 Jun, 2011

1 commit

  • There's a fair amount of code in the vsyscall page. It contains
    a syscall instruction (in the gettimeofday fallback) and who
    knows what will happen if an exploit jumps into the middle of
    some other code.

    Reduce the risk by replacing the vsyscalls with short magic
    incantations that cause the kernel to emulate the real
    vsyscalls. These incantations are useless if entered in the
    middle.

    This causes vsyscalls to be a little more expensive than real
    syscalls. Fortunately sensible programs don't use them.
    The only exception is time() which is still called by glibc
    through the vsyscall - but calling time() millions of times
    per second is not sensible. glibc has this fixed in the
    development tree.

    This patch is not perfect: the vread_tsc and vread_hpet
    functions are still at a fixed address. Fixing that might
    involve making alternative patching work in the vDSO.

    Signed-off-by: Andy Lutomirski
    Acked-by: Linus Torvalds
    Cc: Jesper Juhl
    Cc: Borislav Petkov
    Cc: Arjan van de Ven
    Cc: Jan Beulich
    Cc: richard -rw- weinberger
    Cc: Mikael Pettersson
    Cc: Andi Kleen
    Cc: Brian Gerst
    Cc: Louis Rilling
    Cc: Valdis.Kletnieks@vt.edu
    Cc: pageexec@freemail.hu
    Link: http://lkml.kernel.org/r/e64e1b3c64858820d12c48fa739efbd1485e79d5.1307292171.git.luto@mit.edu
    [ Removed the CONFIG option - it's simpler to just do it unconditionally. Tidied up the code as well. ]
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

20 Apr, 2009

1 commit


17 Jul, 2007

2 commits

    This follows a suggestion from Chuck Ebbert on how to make seccomp
    absolutely zero-cost in schedule too. The only remaining footprint of
    seccomp is the bzImage size, which becomes a few bytes (perhaps
    even a few kbytes) larger; measure it if you care about embedded use.

    Signed-off-by: Andrea Arcangeli
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    This reduces the memory footprint and it enforces that only the current
    task can enable seccomp on itself (this is a requirement for a
    straightforward [modulo preempt ;) ] TIF_NOTSC implementation).

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Apr, 2006

1 commit


09 Jan, 2006

1 commit

  • Remove various things which were checking for gcc-1.x and gcc-2.x compilers.

    From: Adrian Bunk

    Some documentation updates, and removal of some code paths for gcc < 3.2.

    Acked-by: Russell King
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

28 Jun, 2005

1 commit

    I believe at least for seccomp it's worth turning off the TSC, not just
    for HT but for the L2 cache too. So it's up to you: either turn it off
    completely (which isn't very nice IMHO), or apply the patch below, which
    is what I recommend.

    This has been tested successfully on x86-64 against current cogito
    repository (i686 compiles so I didn't bother testing ;). People selling
    the cpu through cpushare may appreciate this bit for peace of mind.

    There's no way to get any timing info anymore with this applied
    (gettimeofday is forbidden of course). The seccomp environment is
    completely deterministic so it can't be allowed to get timing info, it has
    to be deterministic so in the future I can enable a computing mode that
    does a parallel computing for each task with server side transparent
    checkpointing and verification that the output is the same from all the 2/3
    seller computers for each task, without the buyer even noticing (for now
    the verification is left to the buyer client side and there's no
    checkpointing, since that would require more kernel changes to track the
    dirty bits but it'll be easy to extend once the basic mode is finished).

    Eliminating a cold-cache read of the cr4 global variable will save one
    cacheline during the tlb flush while making the code per-cpu-safe at the
    same time. Thanks to Mikael Pettersson for noticing the tlb flush wasn't
    per-cpu-safe.

    The global tlb flush can run from irq (IPI calling do_flush_tlb_all) but
    it'll be transparent to the switch_to code since the IPI won't make any
    change to the cr4 contents from the point of view of the interrupted code
    and since it's now all per-cpu stuff, it will not race. So no need to
    disable irqs in switch_to slow path.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds