31 Aug, 2016

1 commit

  • This fixes a ptrace vs fatal pending signals bug as manifested in
    seccomp now that seccomp was reordered to happen after ptrace. The
    short version is that seccomp should not attempt to call do_exit()
    while fatal signals are pending under a tracer. The existing code was
    trying to be as defensively paranoid as possible, but it now ends up
    confusing ptrace. Instead, the syscall can just be skipped (which solves
    the original concern that the do_exit() was addressing) and normal signal
    handling, tracer notification, and process death can happen.

    Paraphrasing from the original bug report:

    If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed
    after such a trap but not yet been scheduled, and another task in the
    thread-group calls exit_group(), then the tracee task exits without the
    ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here:
    https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

    The bug happens because when __seccomp_filter() detects
    fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
    signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
    that task is descheduled, __schedule() notices that there is a fatal
    signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
    That prevents the ptracer's waitpid() from returning the ptrace event.
    A more detailed analysis is here:
    https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

    Reported-by: Robert O'Callahan
    Reported-by: Kyle Huey
    Tested-by: Kyle Huey
    Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")
    Signed-off-by: Kees Cook
    Acked-by: Oleg Nesterov
    Acked-by: James Morris

    Kees Cook
     

04 Aug, 2016

1 commit

  • The use of config_enabled() against config options is ambiguous. In
    practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
    author might have used it for the meaning of IS_ENABLED(). Using
    IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention
    clearer.

    This commit replaces config_enabled() with IS_ENABLED() where possible.
    This commit is only touching bool config options.

    I noticed two cases where config_enabled() is used against a tristate
    option:

    - config_enabled(CONFIG_HWMON)
    [ drivers/net/wireless/ath/ath10k/thermal.c ]

    - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
    [ drivers/gpu/drm/gma500/opregion.c ]

    I did not touch them because they should be converted to IS_BUILTIN()
    in order to keep the logic, but I was not sure it was the authors'
    intention.
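
    The difference between the three helpers can be seen in a small
    userspace sketch. The macro bodies below are a simplified re-creation
    written for illustration, not a copy of the kernel header, and
    CONFIG_DEMO_BOOL and CONFIG_DEMO_TRISTATE are hypothetical option names.
    It assumes the usual Kconfig convention that a built-in option defines
    CONFIG_FOO as 1 and a modular option defines CONFIG_FOO_MODULE as 1:

```c
#include <assert.h>

/*
 * Simplified userspace re-creation of the kconfig helpers. This is an
 * illustrative assumption, not the kernel's exact definitions.
 */
#define ARG_PLACEHOLDER_1 0,
#define TAKE_SECOND_ARG(ignored, val, ...) val
#define IS_DEFINED_3(arg1_or_junk) TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define IS_DEFINED_2(val) IS_DEFINED_3(ARG_PLACEHOLDER_##val)
#define IS_DEFINED_1(x) IS_DEFINED_2(x)

#define IS_BUILTIN(option) IS_DEFINED_1(option)
#define IS_MODULE(option) IS_DEFINED_1(option##_MODULE)
#define IS_ENABLED(option) (IS_BUILTIN(option) || IS_MODULE(option))

/* Hypothetical options: one bool set to y, one tristate set to m. */
#define CONFIG_DEMO_BOOL 1
#define CONFIG_DEMO_TRISTATE_MODULE 1

/* For a bool option the built-in and enabled checks agree. */
int demo_bool_checks_agree(void)
{
    return IS_BUILTIN(CONFIG_DEMO_BOOL) == IS_ENABLED(CONFIG_DEMO_BOOL);
}

/* For a modular tristate option they differ: enabled, but not built in. */
int demo_tristate_checks_differ(void)
{
    assert(IS_MODULE(CONFIG_DEMO_TRISTATE));
    return IS_ENABLED(CONFIG_DEMO_TRISTATE) &&
           !IS_BUILTIN(CONFIG_DEMO_TRISTATE);
}
```

    In this sketch both functions return 1, which is why a blind
    conversion is only safe for bool options: for a tristate set to m,
    the built-in check and the enabled check give different answers.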

    Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Acked-by: Kees Cook
    Cc: Stas Sergeev
    Cc: Matt Redfearn
    Cc: Joshua Kinard
    Cc: Jiri Slaby
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Markos Chandras
    Cc: "Dmitry V. Levin"
    Cc: yu-cheng yu
    Cc: James Hogan
    Cc: Brian Gerst
    Cc: Johannes Berg
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Will Drewry
    Cc: Nikolay Martynov
    Cc: Huacai Chen
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Daniel Borkmann
    Cc: Leonid Yegoshin
    Cc: Rafal Milecki
    Cc: James Cowgill
    Cc: Greg Kroah-Hartman
    Cc: Ralf Baechle
    Cc: Alex Smith
    Cc: Adam Buchbinder
    Cc: Qais Yousef
    Cc: Jiang Liu
    Cc: Mikko Rapeli
    Cc: Paul Gortmaker
    Cc: Denys Vlasenko
    Cc: Brian Norris
    Cc: Hidehiro Kawai
    Cc: "Luis R. Rodriguez"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Roland McGrath
    Cc: Paul Burton
    Cc: Kalle Valo
    Cc: Viresh Kumar
    Cc: Tony Wu
    Cc: Huaitong Han
    Cc: Sumit Semwal
    Cc: Alexei Starovoitov
    Cc: Juergen Gross
    Cc: Jason Cooper
    Cc: "David S. Miller"
    Cc: Oleg Nesterov
    Cc: Andrea Gelmini
    Cc: David Woodhouse
    Cc: Marc Zyngier
    Cc: Rabin Vincent
    Cc: "Maciej W. Rozycki"
    Cc: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

15 Jun, 2016

3 commits


20 May, 2016

1 commit

  • Pull MIPS updates from Ralf Baechle:
    "This is the main pull request for MIPS for 4.7. Here's the summary of
    the changes:

    - ATH79: Support for DTB passing using the UHI boot protocol
    - ATH79: Remove support for builtin DTB.
    - ATH79: Add zboot debug serial support.
    - ATH79: Add initial support for Dragino MS14 (Dragino 2), Onion Omega
    and DPT-Module.
    - ATH79: Update devicetree clock support for AR9132 and AR9331.
    - ATH79: Cleanup the DT code.
    - ATH79: Support newer SOCs in ath79_ddr_ctrl_init.
    - ATH79: Fix regression in PCI window initialization.
    - BCM47xx: Move SPROM driver to drivers/firmware/
    - BCM63xx: Enable partition parser in defconfig.
    - BMIPS: BMIPS5000 has I cache filling from D cache
    - BMIPS: Add cpu-feature-overrides.h
    - BMIPS: Add Whirlwind support
    - BMIPS: Adjust mips-hpt-frequency for BCM7435
    - BMIPS: Remove maxcpus from BCM97435SVMB DTS
    - BMIPS: Add missing 7038 L1 register cells to BCM7435
    - BMIPS: Various tweaks to initialization code.
    - BMIPS: Enable partition parser in defconfig.
    - BMIPS: Cache tweaks.
    - BMIPS: Add UART, I2C and SATA devices to DT.
    - BMIPS: Add BCM6358 and BCM63268 support
    - BMIPS: Add device tree example for BCM6358.
    - BMIPS: Improve BCM6328 and BCM6368 device trees
    - Lantiq: Add support for device tree file from boot loader
    - Lantiq: Allow build with no built-in DT.
    - Loongson 3: Reserve 32MB for RS780E integrated GPU.
    - Loongson 3: Fix build error after ld-version.sh modification
    - Loongson 3: Move chipset ACPI code from drivers to arch.
    - Loongson 3: Speedup irq processing.
    - Loongson 3: Add basic Loongson 3A support.
    - Loongson 3: Set cache flush handlers to nop.
    - Loongson 3: Invalidate special TLBs when needed.
    - Loongson 3: Fast TLB refill handler.
    - MT7620: Fallback strategy for invalid syscfg0.
    - Netlogic: Fix CP0_EBASE redefinition warnings
    - Octeon: Initialization fixes
    - Octeon: Add DTS files for the D-Link DSR-1000N and EdgeRouter Lite
    - Octeon: Enable add Octeon-drivers in cavium_octeon_defconfig
    - Octeon: Correctly handle endian-swapped initramfs images.
    - Octeon: Support CN73xx, CN75xx and CN78xx.
    - Octeon: Remove dead code from cvmx-sysinfo.
    - Octeon: Extend number of supported CPUs past 32.
    - Octeon: Remove some code limiting NR_IRQS to 255.
    - Octeon: Simplify octeon_irq_ciu_gpio_set_type.
    - Octeon: Mark some functions __init in smp.c
    - Octeon: Add Octeon III CN7xxx interface detection
    - PIC32: Add serial driver and bindings for it.
    - PIC32: Add PIC32 deadman timer driver and bindings.
    - PIC32: Add PIC32 clock timer driver and bindings.
    - Pistachio: Determine SoC revision during boot
    - Sibyte: Fix Kconfig dependencies of SIBYTE_BUS_WATCHER.
    - Sibyte: Strip redundant comments from bcm1480_regs.h.
    - Panic immediately if panic_on_oops is set.
    - module: fix incorrect IS_ERR_VALUE macro usage.
    - module: Make consistent use of pr_*
    - Remove no longer needed work_on_cpu() call.
    - Remove CONFIG_IPV6_PRIVACY from defconfigs.
    - Fix registers of non-crashing CPUs in dumps.
    - Handle MIPSisms in new vmcore_elf32_check_arch.
    - Select CONFIG_HANDLE_DOMAIN_IRQ and make it work.
    - Allow RIXI to be used on non-R2 or R6 cores.
    - Reserve nosave data for hibernation
    - Fix siginfo.h to use strict POSIX types.
    - Don't unwind user mode with EVA.
    - Fix watchpoint restoration
    - Ptrace watchpoints for R6.
    - Sync icache when it fills from dcache
    - I6400 I-cache fills from dcache.
    - Various MSA fixes.
    - Cleanup MIPS_CPU_* definitions.
    - Signal: Move generic copy_siginfo to signal.h
    - Signal: Fix uapi include in exported asm/siginfo.h
    - Timer fixes for sake of KVM.
    - XPA TLB refill fixes.
    - Treat perf counter feature
    - Update John Crispin's email address
    - Add PIC32 watchdog and bindings.
    - Handle R10000 LL/SC bug in set_pte()
    - cpufreq: Various fixes for Loongson1.
    - R6: Fix R2 emulation.
    - mathemu: Cosmetic fix to ADDIUPC emulation, plenty of other small fixes
    - ELF: ABI and FP fixes.
    - Allow for relocatable kernel and use that to support KASLR.
    - Fix CPC_BASE_ADDR mask
    - Plenty of smp-cps, CM, R6 and M6250 fixes.
    - Make reset_control_ops const.
    - Fix kernel command line handling of leading whitespace.
    - Cleanups to cache handling.
    - Add brcm, bcm6345-l1-intc device tree bindings.
    - Use generic clkdev.h header
    - Remove CLK_IS_ROOT usage.
    - Misc small cleanups.
    - CM: Fix compilation error when !MIPS_CM
    - oprofile: Fix a preemption issue
    - Detect DSP ASE v3 support"

    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (275 commits)
    MIPS: pic32mzda: fix getting timer clock rate.
    MIPS: ath79: fix regression in PCI window initialization
    MIPS: ath79: make ath79_ddr_ctrl_init() compatible for newer SoCs
    MIPS: Fix VZ probe gas errors with binutils of MSA context in non-MSA kernels
    MIPS: cevt-r4k: Dynamically calculate min_delta_ns
    MIPS: malta-time: Take seconds into account
    MIPS: malta-time: Start GIC count before syncing to RTC
    MIPS: Force CPUs to lose FP context during mode switches
    ...

    Linus Torvalds
     

13 May, 2016

2 commits

  • These values are constant and should be marked as such.

    Signed-off-by: Matt Redfearn
    Acked-by: Kees Cook
    Cc: Will Drewry
    Cc: Andy Lutomirski
    Cc: IMG-MIPSLinuxKerneldevelopers@imgtec.com
    Cc: linux-kernel@vger.kernel.org
    Patchwork: https://patchwork.linux-mips.org/patch/12979/
    Signed-off-by: Ralf Baechle

    Matt Redfearn
     
  • Move retrieval of compat syscall numbers into inline function defined in
    asm-generic header so that arches may override it.

    [ralf@linux-mips.org: Resolve merge conflict.]

    Suggested-by: Paul Burton
    Signed-off-by: Matt Redfearn
    Acked-by: Kees Cook
    Cc: IMG-MIPSLinuxKerneldevelopers@imgtec.com
    Cc: Arnd Bergmann
    Cc: Andy Lutomirski
    Cc: Will Drewry
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Patchwork: https://patchwork.linux-mips.org/patch/12978/
    Signed-off-by: Ralf Baechle

    Matt Redfearn
     

05 May, 2016

1 commit


23 Mar, 2016

1 commit

  • Seccomp wants to know the syscall bitness, not the caller task bitness,
    when it selects the syscall whitelist.

    As far as I know, this makes no difference on any architecture, so it's
    not a security problem. (It generates identical code everywhere except
    sparc, and, on sparc, the syscall numbering is the same for both ABIs.)

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

27 Jan, 2016

1 commit

  • Before this patch, a process with some permissive seccomp filter
    that was applied by root without NO_NEW_PRIVS was able to add
    more filters to itself without setting NO_NEW_PRIVS by setting
    the new filter from a throwaway thread with NO_NEW_PRIVS.

    Signed-off-by: Jann Horn
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook

    Jann Horn
     

28 Oct, 2015

1 commit

  • This patch adds support for dumping a process' (classic BPF) seccomp
    filters via ptrace.

    PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF
    seccomp filters. addr should be an integer which represents the ith seccomp
    filter (0 is the most recently installed filter). data should be a struct
    sock_filter * with enough room for the ith filter, or NULL, in which case
    the filter is not saved. The return value for this command is the number of
    BPF instructions the program represents, or negative in the case of errors.
    Command specific errors are ENOENT, which indicates that there is no ith
    filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith
    filter was not installed as a classic BPF filter.
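
    The lookup rules described above can be sketched in userspace. This is
    a model under stated assumptions, not the kernel implementation:
    struct demo_insn, struct demo_chain_filter and demo_get_filter() are
    hypothetical names, and the sketch assumes a Linux errno.h that
    provides EMEDIUMTYPE:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a classic BPF instruction. */
struct demo_insn {
    unsigned short code;
    unsigned char jt, jf;
    unsigned int k;
};

/* Hypothetical stand-in for one entry in a task's filter chain. */
struct demo_chain_filter {
    struct demo_chain_filter *prev; /* older filter, NULL at the root */
    const struct demo_insn *orig;   /* saved classic program, NULL if none */
    size_t len;                     /* instruction count of orig */
};

/*
 * Return the instruction count of the filter_off-th filter, counting from
 * the most recently installed one (offset 0), or a negative errno. When
 * data is NULL the program is not copied out.
 */
long demo_get_filter(const struct demo_chain_filter *newest,
                     unsigned long filter_off, struct demo_insn *data)
{
    const struct demo_chain_filter *f = newest;

    while (f != NULL && filter_off > 0) {
        f = f->prev;
        filter_off--;
    }
    if (f == NULL)
        return -ENOENT;         /* no ith filter in this chain */
    if (f->orig == NULL)
        return -EMEDIUMTYPE;    /* not installed as classic BPF */
    if (data != NULL)
        memcpy(data, f->orig, f->len * sizeof(*data));
    return (long)f->len;
}
```

    A caller following this model would typically call once with a NULL
    buffer to learn the instruction count, then again with a buffer of
    that size.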

    A caveat with this approach is that there is no way to get explicitly at
    the hierarchy of seccomp filters, and users need to memcmp() filters to
    decide which are inherited. This means that a task which installs two of
    the same filter can potentially confuse users of this interface.

    v2: * make save_orig const
    * check that the orig_prog exists (not necessary right now, but when
    grows eBPF support it will be)
    * s/n/filter_off and make it an unsigned long to match ptrace
    * count "down" the tree instead of "up" when passing a filter offset

    v3: * don't take the current task's lock for inspecting its seccomp mode
    * use a 0x42** constant for the ptrace command value

    v4: * don't copy to userspace while holding spinlocks

    v5: * add another condition to WARN_ON

    v6: * rebase on net-next

    Signed-off-by: Tycho Andersen
    Acked-by: Kees Cook
    CC: Will Drewry
    Reviewed-by: Oleg Nesterov
    CC: Andy Lutomirski
    CC: Pavel Emelyanov
    CC: Serge E. Hallyn
    CC: Alexei Starovoitov
    CC: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Tycho Andersen
     

05 Oct, 2015

1 commit

  • The current ongoing effort to dump existing cBPF seccomp filters back
    to user space requires to hold the pre-transformed instructions like
    we do in case of socket filters from sk_attach_filter() side, so they
    can be reloaded in original form at a later point in time by utilities
    such as criu.

    To prepare for this, simply extend the bpf_prog_create_from_user()
    API to hold a flag that tells whether we should store the original
    or not. Also, fanout filters could make use of that in future for
    things like diag. While fanout filters already use bpf_prog_destroy(),
    move seccomp over to them as well to handle original programs when
    present.

    Signed-off-by: Daniel Borkmann
    Cc: Tycho Andersen
    Cc: Pavel Emelyanov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Alexei Starovoitov
    Tested-by: Tycho Andersen
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

20 Jul, 2015

1 commit


16 Jul, 2015

3 commits

  • For clarity, if CONFIG_SECCOMP isn't defined, seccomp_mode() is returning
    "disabled". This makes that more clear, along with another 0-use, and
    results in no operational change.

    Signed-off-by: Kees Cook

    Kees Cook
     
  • This patch is the first step in enabling checkpoint/restore of processes
    with seccomp enabled.

    One of the things CRIU does while dumping tasks is inject code into them
    via ptrace to collect information that is only available to the process
    itself. However, if we are in a seccomp mode where these processes are
    prohibited from making these syscalls, then what CRIU does kills the task.

    This patch adds a new ptrace option, PTRACE_O_SUSPEND_SECCOMP, that enables
    a task from the init user namespace which has CAP_SYS_ADMIN and no seccomp
    filters to disable (and re-enable) seccomp filters for another task so that
    they can be successfully dumped (and restored). We restrict the set of
    processes that can disable seccomp through ptrace because although today
    ptrace can be used to bypass seccomp, there is some discussion of closing
    this loophole in the future and we would like this patch to not depend on
    that behavior and be future proofed for when it is removed.

    Note that seccomp can be suspended before any filters are actually
    installed; this behavior is useful on criu restore, so that we can suspend
    seccomp, restore the filters, unmap our restore code from the restored
    process' address space, and then resume the task by detaching and have the
    filters resumed as well.
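
    The restriction on who may set the option can be sketched as a small
    userspace model. struct demo_tracer and demo_may_suspend_seccomp() are
    hypothetical names, and returning -EPERM for every refusal is an
    assumption of this sketch rather than something stated in the text
    above:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of the tracer state that the check depends on. */
struct demo_tracer {
    bool has_cap_sys_admin;  /* CAP_SYS_ADMIN in the init user namespace */
    bool seccomp_enabled;    /* tracer runs under seccomp itself */
    bool seccomp_suspended;  /* tracer's own seccomp is suspended */
};

/*
 * Return 0 if the tracer may set PTRACE_O_SUSPEND_SECCOMP on a tracee,
 * or a negative errno (assumed to be -EPERM) if it may not.
 */
int demo_may_suspend_seccomp(const struct demo_tracer *t)
{
    if (!t->has_cap_sys_admin)
        return -EPERM;
    if (t->seccomp_enabled || t->seccomp_suspended)
        return -EPERM;
    return 0;
}
```

    Because the permission is tied to a ptrace option rather than a
    one-shot command, the suspension in this model ends as soon as the
    tracer detaches or dies.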

    v2 changes:

    * require that the tracer have no seccomp filters installed
    * drop TIF_NOTSC manipulation from the patch
    * change from ptrace command to a ptrace option and use this ptrace option
    as the flag to check. This means that as soon as the tracer
    detaches/dies, seccomp is re-enabled and as a corollary that one can not
    disable seccomp across PTRACE_ATTACHs.

    v3 changes:

    * get rid of various #ifdefs everywhere
    * report more sensible errors when PTRACE_O_SUSPEND_SECCOMP is incorrectly
    used

    v4 changes:

    * get rid of may_suspend_seccomp() in favor of a capable() check in ptrace
    directly

    v5 changes:

    * check that seccomp is not enabled (or suspended) on the tracer

    Signed-off-by: Tycho Andersen
    CC: Will Drewry
    CC: Roland McGrath
    CC: Pavel Emelyanov
    CC: Serge E. Hallyn
    Acked-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    [kees: access seccomp.mode through seccomp_mode() instead]
    Signed-off-by: Kees Cook

    Tycho Andersen
     
  • Recently lockless_dereference() was added which can be used in place of
    hard-coding smp_read_barrier_depends(). The following PATCH makes the change.

    Signed-off-by: Pranith Kumar
    Signed-off-by: Kees Cook

    Pranith Kumar
     

10 May, 2015

2 commits

  • Seccomp has always been a special candidate when it comes to preparation
    of its filters in seccomp_prepare_filter(). Due to the extra checks and
    filter rewrite it partially duplicates code and has BPF internals exposed.

    This patch adds a generic API inside the BPF code that seccomp can use
    and thus keep its filter preparation code minimal and better maintainable.
    The other side-effect is that now classic JITs can add seccomp support as
    well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.

    Tested with seccomp and BPF test suites.

    Signed-off-by: Daniel Borkmann
    Cc: Nicolas Schichan
    Cc: Alexei Starovoitov
    Cc: Kees Cook
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Remove the calls to bpf_check_classic(), bpf_convert_filter() and
    bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
    instead.

    seccomp_check_filter() is passed to bpf_prepare_filter() so that it
    gets called from there, after bpf_check_classic().

    We can now remove exposure of two internal classic BPF functions
    previously used by seccomp. The export of bpf_check_classic() symbol,
    previously known as sk_chk_filter(), was there since pre git times,
    and no in-tree module was using it, therefore remove it.

    Joint work with Daniel Borkmann.

    Signed-off-by: Nicolas Schichan
    Signed-off-by: Daniel Borkmann
    Cc: Alexei Starovoitov
    Cc: Kees Cook
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Nicolas Schichan
     

18 Feb, 2015

1 commit

  • The value resulting from the SECCOMP_RET_DATA mask could exceed MAX_ERRNO
    when setting errno during a SECCOMP_RET_ERRNO filter action. This makes
    sure we have a reliable value being set, so that an invalid errno will not
    be ignored by userspace.
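
    A minimal sketch of the clamping, assuming the usual constant values
    (SECCOMP_RET_DATA is the low 16 bits of the filter return value and
    MAX_ERRNO is 4095). demo_errno_from_filter() is a hypothetical helper
    name, not a kernel function:

```c
/* Values mirrored from the seccomp and err.h definitions. */
#ifndef SECCOMP_RET_ERRNO
#define SECCOMP_RET_ERRNO 0x00050000U
#endif
#ifndef SECCOMP_RET_DATA
#define SECCOMP_RET_DATA 0x0000ffffU
#endif
#ifndef MAX_ERRNO
#define MAX_ERRNO 4095
#endif

/*
 * Extract the errno carried by a SECCOMP_RET_ERRNO filter result and clamp
 * it to MAX_ERRNO, so the negated value is always a valid error code.
 */
int demo_errno_from_filter(unsigned int filter_ret)
{
    unsigned int data = filter_ret & SECCOMP_RET_DATA;

    if (data > MAX_ERRNO)
        data = MAX_ERRNO;
    return -(int)data;
}
```

    Without the clamp, a 16-bit data value such as 0xffff would produce a
    return value outside the range userspace treats as an error.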

    Signed-off-by: Kees Cook
    Reported-by: Dmitry V. Levin
    Cc: Andy Lutomirski
    Cc: Will Drewry
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

14 Oct, 2014

1 commit

  • Pull x86 seccomp changes from Ingo Molnar:
    "This tree includes x86 seccomp filter speedups and related preparatory
    work, which touches core seccomp facilities as well.

    The main idea is to split seccomp into two phases, to be able to enter
    a simple fast path for syscalls with ptrace side effects.

    There's no substantial user-visible (and ABI) effects expected from
    this, except a change in how we emit a better audit record for
    SECCOMP_RET_TRACE events"

    * 'x86-seccomp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscalls
    x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls
    x86: Split syscall_trace_enter into two phases
    x86, entry: Only call user_exit if TIF_NOHZ
    x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit
    seccomp: Document two-phase seccomp and arch-provided seccomp_data
    seccomp: Allow arch code to provide seccomp_data
    seccomp: Refactor the filter callback and the API
    seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing

    Linus Torvalds
     

06 Sep, 2014

1 commit

  • With eBPF getting more extended and exposure to user space is on it's way,
    hardening the memory range the interpreter uses to steer its command flow
    seems appropriate. This patch moves the to be interpreted bytecode to
    read-only pages.

    In case we execute a corrupted BPF interpreter image for some reason e.g.
    caused by an attacker which got past a verifier stage, it would not only
    provide arbitrary read/write memory access but arbitrary function calls
    as well. After setting up the BPF interpreter image, its contents do not
    change until destruction time, thus we can setup the image on immutable
    made pages in order to mitigate modifications to that code. The idea
    is derived from commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit
    against spraying attacks").

    This is possible because bpf_prog is not part of sk_filter anymore.
    After setup bpf_prog cannot be altered during its life-time. This prevents
    any modifications to the entire bpf_prog structure (incl. function/JIT
    image pointer).

    Every eBPF program (including classic BPF that are migrated) has to call
    bpf_prog_select_runtime() to select either interpreter or a JIT image
    as a last setup step, and they all are being freed via bpf_prog_free(),
    including non-JIT. Therefore, we can easily integrate this into the
    eBPF life-time, plus since we directly allocate a bpf_prog, we have no
    performance penalty.

    Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual
    inspection of kernel_page_tables. Brad Spengler proposed the same idea
    via Twitter during development of this patch.

    Joint work with Hannes Frederic Sowa.

    Suggested-by: Brad Spengler
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Hannes Frederic Sowa
    Cc: Alexei Starovoitov
    Cc: Kees Cook
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

04 Sep, 2014

3 commits

  • populate_seccomp_data is expensive: it works by inspecting
    task_pt_regs and various other bits to piece together all the
    information, and it does so in multiple partially redundant steps.

    Arch-specific code in the syscall entry path can do much better.

    Admittedly this adds a bit of additional room for error, but the
    speedup should be worth it.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Kees Cook

    Andy Lutomirski
     
  • The reason I did this is to add a seccomp API that will be usable
    for an x86 fast path. The x86 entry code needs to use a rather
    expensive slow path for a syscall that might be visible to things
    like ptrace. By splitting seccomp into two phases, we can check
    whether we need the slow path and then use the fast path if the
    filter allows the syscall or just returns some errno.

    As a side effect, I think the new code is much easier to understand
    than the old code.

    This has one user-visible effect: the audit record written for
    SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
    happened. It used to depend in a complicated way on what the tracer
    did. I couldn't make much sense of it.
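
    The split can be sketched as a first phase that classifies the filter
    result. This is a deliberately simplified userspace model: the enum
    and demo_seccomp_phase1() are hypothetical names, and every action
    other than allow or errno is lumped into "needs the slow path", which
    is an assumption of the sketch and coarser than what real entry code
    would do:

```c
/* Action values mirrored from the seccomp uapi definitions. */
#define DEMO_RET_ERRNO  0x00050000U
#define DEMO_RET_TRACE  0x7ff00000U
#define DEMO_RET_ALLOW  0x7fff0000U
#define DEMO_RET_ACTION 0x7fff0000U

/* Hypothetical phase-1 verdicts. */
enum demo_phase1 {
    DEMO_PHASE1_OK,             /* run the syscall on the fast path */
    DEMO_PHASE1_SKIP,           /* skip the syscall, errno already chosen */
    DEMO_PHASE1_NEED_SLOW_PATH  /* hand over to phase 2 */
};

/* Classify a filter return value without touching any ptrace state. */
enum demo_phase1 demo_seccomp_phase1(unsigned int filter_ret)
{
    switch (filter_ret & DEMO_RET_ACTION) {
    case DEMO_RET_ALLOW:
        return DEMO_PHASE1_OK;
    case DEMO_RET_ERRNO:
        return DEMO_PHASE1_SKIP;
    default:
        return DEMO_PHASE1_NEED_SLOW_PATH;
    }
}
```

    Only the last verdict forces the expensive entry path in this model,
    which is the point of splitting the work into two phases.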

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Kees Cook

    Andy Lutomirski
     
  • The secure_computing function took a syscall number parameter, but
    it only paid any attention to that parameter if seccomp mode 1 was
    enabled. Rather than coming up with a kludge to get the parameter
    to work in mode 2, just remove the parameter.

    To avoid churn in arches that don't have seccomp filters (and may
    not even support syscall_get_nr right now), this leaves the
    parameter in secure_computing_strict, which is now a real function.

    For ARM, this is a bit ugly due to the fact that ARM conditionally
    supports seccomp filters. Fixing that would probably only be a
    couple of lines of code, but it should be coordinated with the audit
    maintainers.

    This will be a slight slowdown on some arches. The right fix is to
    pass in all of seccomp_data instead of trying to make just the
    syscall nr part be fast.

    This is a prerequisite for making two-phase seccomp work cleanly.

    Cc: Russell King
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux-s390@vger.kernel.org
    Cc: x86@kernel.org
    Cc: Kees Cook
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Kees Cook

    Andy Lutomirski
     

12 Aug, 2014

1 commit

  • Current upstream kernel hangs with mips and powerpc targets in
    uniprocessor mode if SECCOMP is configured.

    Bisect points to commit dbd952127d11 ("seccomp: introduce writer locking").
    Turns out that code such as
        BUG_ON(!spin_is_locked(&list_lock));
    can not be used in uniprocessor mode because spin_is_locked() always
    returns false in this configuration, and that assert_spin_locked()
    exists for that very purpose and must be used instead.
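
    The failure mode can be reproduced in a userspace model. The names
    below are hypothetical, and DEMO_SMP stands in for the kernel's SMP
    configuration; with it set to 0 the "is locked" query is compiled to
    a constant 0, as the text above describes for uniprocessor builds:

```c
#include <assert.h>

#ifndef DEMO_SMP
#define DEMO_SMP 0  /* model a uniprocessor build by default */
#endif

/* Hypothetical lock type; on a UP build the state is never consulted. */
struct demo_spinlock {
    int locked;
};

#if DEMO_SMP
#define demo_spin_is_locked(l) ((l)->locked != 0)
#define demo_assert_spin_locked(l) assert(demo_spin_is_locked(l))
#else
/* UP model: the query always says "not locked", the assert is a no-op. */
#define demo_spin_is_locked(l) ((void)(l), 0)
#define demo_assert_spin_locked(l) do { (void)(l); } while (0)
#endif

/* Nonzero means a BUG_ON(!spin_is_locked(...)) style check would fire. */
int demo_bug_on_would_fire(struct demo_spinlock *l)
{
    return !demo_spin_is_locked(l);
}
```

    In the UP model demo_bug_on_would_fire() reports a failure even for a
    lock that is held, while demo_assert_spin_locked() stays silent.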

    Fixes: dbd952127d11 ("seccomp: introduce writer locking")
    Cc: Kees Cook
    Signed-off-by: Guenter Roeck
    Signed-off-by: Kees Cook

    Guenter Roeck
     

07 Aug, 2014

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Steady transitioning of the BPF infrastructure to a generic spot so
    all kernel subsystems can make use of it, from Alexei Starovoitov.

    2) SFC driver supports busy polling, from Alexandre Rames.

    3) Take advantage of hash table in UDP multicast delivery, from David
    Held.

    4) Lighten locking, in particular by getting rid of the LRU lists, in
    inet frag handling. From Florian Westphal.

    5) Add support for various RFC6458 control messages in SCTP, from
    Geir Ola Vaagland.

    6) Allow to filter bridge forwarding database dumps by device, from
    Jamal Hadi Salim.

    7) virtio-net also now supports busy polling, from Jason Wang.

    8) Some low level optimization tweaks in pktgen from Jesper Dangaard
    Brouer.

    9) Add support for ipv6 address generation modes, so that userland
    can have some input into the process. From Jiri Pirko.

    10) Consolidate common TCP connection request code in ipv4 and ipv6,
    from Octavian Purdila.

    11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

    12) Generic resizable RCU hash table, with initial users in netlink and
    nftables. From Thomas Graf.

    13) Maintain a name assignment type so that userspace can see where a
    network device name came from (enumerated by kernel, assigned
    explicitly by userspace, etc.) From Tom Gundersen.

    14) Automatic flow label generation on transmit in ipv6, from Tom
    Herbert.

    15) New packet timestamping facilities from Willem de Bruijn, meant to
    assist in measuring latencies going into/out-of the packet
    scheduler, latency from TCP data transmission to ACK, etc"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
    cxgb4 : Disable recursive mailbox commands when enabling vi
    net: reduce USB network driver config options.
    tg3: Modify tg3_tso_bug() to handle multiple TX rings
    amd-xgbe: Perform phy connect/disconnect at dev open/stop
    amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
    net: sun4i-emac: fix memory leak on bad packet
    sctp: fix possible seqlock seadlock in sctp_packet_transmit()
    Revert "net: phy: Set the driver when registering an MDIO bus device"
    cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
    team: Simplify return path of team_newlink
    bridge: Update outdated comment on promiscuous mode
    net-timestamp: ACK timestamp for bytestreams
    net-timestamp: TCP timestamping
    net-timestamp: SCHED timestamp on entering packet scheduler
    net-timestamp: add key to disambiguate concurrent datagrams
    net-timestamp: move timestamp flags out of sk_flags
    net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
    cxgb4i : Move stray CPL definitions to cxgb4 driver
    tcp: reduce spurious retransmits due to transient SACK reneging
    qlcnic: Initialize dcbnl_ops before register_netdev
    ...

    Linus Torvalds
     

03 Aug, 2014

3 commits

  • clean up names related to socket filtering and bpf in the following way:
    - everything that deals with sockets keeps 'sk_*' prefix
    - everything that is pure BPF is changed to 'bpf_*' prefix

    split 'struct sk_filter' into
        struct sk_filter {
            atomic_t refcnt;
            struct rcu_head rcu;
            struct bpf_prog *prog;
        };
    and
        struct bpf_prog {
            u32 jited:1,
                len:31;
            struct sock_fprog_kern *orig_prog;
            unsigned int (*bpf_func)(const struct sk_buff *skb,
                                     const struct bpf_insn *filter);
            union {
                struct sock_filter insns[0];
                struct bpf_insn insnsi[0];
                struct work_struct work;
            };
        };
    so that 'struct bpf_prog' can be used independent of sockets and cleans up
    'unattached' bpf use cases

    split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

    __sk_filter_release(struct sk_filter *) gains
    __bpf_prog_release(struct bpf_prog *) helper function

    also perform related renames for the functions that work
    with 'struct bpf_prog *', since they're on the same lines:

    sk_filter_size -> bpf_prog_size
    sk_filter_select_runtime -> bpf_prog_select_runtime
    sk_filter_free -> bpf_prog_free
    sk_unattached_filter_create -> bpf_prog_create
    sk_unattached_filter_destroy -> bpf_prog_destroy
    sk_store_orig_filter -> bpf_prog_store_orig_filter
    sk_release_orig_filter -> bpf_release_orig_filter
    __sk_migrate_filter -> bpf_migrate_filter
    __sk_prepare_filter -> bpf_prepare_filter

    API for attaching classic BPF to a socket stays the same:
    sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
    and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
    which is used by sockets, tun, af_packet

    API for 'unattached' BPF programs becomes:
    bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
    and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
    which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • to indicate that this function is converting classic BPF into eBPF
    and not related to sockets

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • trivial rename to indicate that this function performs classic BPF checking

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

25 Jul, 2014

1 commit


19 Jul, 2014

9 commits

  • Applying restrictive seccomp filter programs to large or diverse
    codebases often requires handling threads which may be started early in
    the process lifetime (e.g., by code that is linked in). While it is
    possible to apply permissive programs prior to process start up, it is
    difficult to further restrict the kernel ABI to those threads after that
    point.

    This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
    synchronizing thread group seccomp filters at filter installation time.

    When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
    filter) an attempt will be made to synchronize all threads in current's
    threadgroup to its new seccomp filter program. This is possible iff all
    threads are using a filter that is an ancestor to the filter current is
    attempting to synchronize to. NULL filters (where the task is running as
    SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
    transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS,
    ...) has been set on the calling thread, no_new_privs will be set for
    all synchronized threads too. On success, 0 is returned. On failure,
    the pid of one of the failing threads will be returned and no filters
    will have been applied.

    The race conditions against another thread are:
    - requesting TSYNC (already handled by sighand lock)
    - performing a clone (already handled by sighand lock)
    - changing its filter (already handled by sighand lock)
    - calling exec (handled by cred_guard_mutex)
    The clone case is assisted by the fact that new threads will have their
    seccomp state duplicated from their parent before appearing on the tasklist.

    Holding cred_guard_mutex means that seccomp filters cannot be assigned
    while in the middle of another thread's exec (potentially bypassing
    no_new_privs or similar). The call to de_thread() may kill threads waiting
    for the mutex.

    Changes across threads to the filter pointer include a barrier.

    Based on patches by Will Drewry.

    Suggested-by: Julien Tinnes
    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
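The ancestor rule described above can be sketched as a small userspace model. This is not the kernel code: `model_filter`, `model_thread`, `model_is_ancestor` and `model_tsync` are hypothetical names, and locking is left out entirely. It only illustrates the all-or-nothing behaviour: either every thread's filter is an ancestor of the target (with NULL counting as an ancestor) and all threads are moved, or the id of a failing thread is returned and nothing changes.

```c
#include <stddef.h>

/* Hypothetical stand-in for a seccomp filter: each filter points at the
 * filter it was stacked on top of (NULL for the first one). */
struct model_filter {
    struct model_filter *prev;
};

/* Hypothetical stand-in for one thread of the thread group. */
struct model_thread {
    int pid;
    struct model_filter *filter;   /* NULL models SECCOMP_MODE_NONE */
};

/* Returns 1 if `parent` is `child` itself or appears in child's prev
 * chain. A NULL parent is treated as an ancestor of every filter. */
static int model_is_ancestor(const struct model_filter *parent,
                             const struct model_filter *child)
{
    if (parent == NULL)
        return 1;
    for (; child != NULL; child = child->prev)
        if (child == parent)
            return 1;
    return 0;
}

/* Returns 0 and moves every thread to `target` if all threads qualify.
 * Otherwise returns the pid of the first failing thread and applies
 * nothing. */
static int model_tsync(struct model_thread *threads, size_t n,
                       struct model_filter *target)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (!model_is_ancestor(threads[i].filter, target))
            return threads[i].pid;

    for (i = 0; i < n; i++)
        threads[i].filter = target;
    return 0;
}
```

Checking every thread before touching any of them is what gives the "no filters will have been applied" guarantee on failure in this model.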
     
  • This changes the mode setting helper to allow threads to change the
    seccomp mode from another thread. We must maintain barriers to keep
    TIF_SECCOMP synchronized with the rest of the seccomp state.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     
  • Normally, task_struct.seccomp.filter is only ever read or modified by
    the task that owns it (current). This property aids in fast access
    during system call filtering as read access is lockless.

    Updating the pointer from another task, however, opens up race
    conditions. To allow cross-thread filter pointer updates, writes to the
    seccomp fields are now protected by the sighand spinlock (which is shared
    by all threads in the thread group). Read access remains lockless because
    pointer updates themselves are atomic. However, writes (or cloning)
    often entail additional checking (like maximum instruction counts)
    which requires locking to perform safely.

    In the case of cloning threads, the child is invisible to the system
    until it enters the task list. To make sure a child can't be cloned from
    a thread and left in a prior state, seccomp duplication is additionally
    moved under the sighand lock. Then parent and child are certain to have
    the same seccomp state when they exit the lock.

    Based on patches by Will Drewry and David Drysdale.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
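The split between lockless reads and locked writes can be illustrated with a userspace analogy. Everything here is an assumption made for illustration: `lock_task`, `lock_filter`, `MODEL_MAX_INSNS` and the helper names are hypothetical, a C11 `atomic_flag` spinlock stands in for the sighand lock, and the instruction accounting is simplified to a plain sum.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_MAX_INSNS 8u   /* hypothetical cap on total instructions */

struct lock_filter {
    struct lock_filter *prev;   /* filter this one is stacked on */
    unsigned int len;           /* instruction count of this filter */
};

struct lock_task {
    atomic_flag *siglock;                  /* shared by the thread group */
    _Atomic(struct lock_filter *) filter;  /* read without the lock */
};

/* Reader: a single atomic pointer load, no lock taken. */
static struct lock_filter *model_read_filter(struct lock_task *t)
{
    return atomic_load_explicit(&t->filter, memory_order_acquire);
}

/* Writer: the length check needs a stable view of the chain, so it runs
 * under the lock. The new pointer is published only after the check
 * passes. Returns 0 on success, -1 if the cap would be exceeded. */
static int model_attach_filter(struct lock_task *t, struct lock_filter *f)
{
    struct lock_filter *prev, *walk;
    unsigned int total = f->len;
    int ret = -1;

    while (atomic_flag_test_and_set_explicit(t->siglock,
                                             memory_order_acquire))
        ;   /* spin until the lock is free */

    prev = atomic_load_explicit(&t->filter, memory_order_relaxed);
    for (walk = prev; walk != NULL; walk = walk->prev)
        total += walk->len;

    if (total <= MODEL_MAX_INSNS) {
        f->prev = prev;
        atomic_store_explicit(&t->filter, f, memory_order_release);
        ret = 0;
    }

    atomic_flag_clear_explicit(t->siglock, memory_order_release);
    return ret;
}
```

The release store paired with the acquire load is one way to make sure a reader that sees the new pointer also sees the filter's fields; the kernel's actual barrier choices may differ.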
     
  • In preparation for adding seccomp locking, move filter creation away
    from where it is checked and applied. This will allow for locking where
    no memory allocation is happening. The validation, filter attachment,
    and seccomp mode setting can all happen under the future locks.

    For extreme defensiveness, I've added a BUG_ON check for the calculated
    size of the buffer allocation in case BPF_MAXINSNS ever changes, which
    shouldn't ever happen. The compiler should actually optimize out this
    check since the test above it makes it impossible.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
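A minimal userspace sketch of the defensive size check described above, assuming a length test followed by a redundant overflow guard. The names `model_insn`, `MODEL_MAXINSNS` and `model_prepare_filter` are hypothetical, and `abort()` stands in for BUG_ON.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAXINSNS 4096u   /* stand-in for BPF_MAXINSNS */

/* Stand-in for one classic BPF instruction (8 bytes). */
struct model_insn {
    unsigned short code;
    unsigned char jt, jf;
    unsigned int k;
};

/* Allocates room for `len` instructions outside of any lock.
 * Returns NULL if the length is out of range or allocation fails;
 * on success stores the byte size in *size_out. */
static struct model_insn *model_prepare_filter(unsigned int len,
                                               size_t *size_out)
{
    if (len == 0 || len > MODEL_MAXINSNS)
        return NULL;

    /* Redundant with the range test above as long as MODEL_MAXINSNS
     * stays small; it only fires if the limit is raised far enough for
     * len * sizeof(struct model_insn) to exceed INT_MAX. */
    if (INT_MAX / len < sizeof(struct model_insn))
        abort();

    *size_out = (size_t)len * sizeof(struct model_insn);
    return calloc(len, sizeof(struct model_insn));
}
```

Because the range test bounds `len` by a compile-time constant, a compiler can prove the second condition false and drop it, which matches the expectation stated in the commit message.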
     
  • Since seccomp transitions between threads require updates to the
    no_new_privs flag to be atomic, the flag must be part of an atomic flag
    set. This moves the nnp flag into a separate task field, and introduces
    accessors.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
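The idea of keeping no_new_privs in an atomic flag word with accessors can be sketched in userspace with C11 atomics. The names `nnp_task`, `MODEL_PFA_NO_NEW_PRIVS`, `nnp_get` and `nnp_set` are hypothetical, and the bit position is an arbitrary choice for this sketch.

```c
#include <stdatomic.h>

#define MODEL_PFA_NO_NEW_PRIVS 0   /* hypothetical bit index */

/* Stand-in for the task: one word of flags that other threads may
 * update concurrently. */
struct nnp_task {
    atomic_ulong atomic_flags;
};

/* Accessor: returns 1 if the flag is set, 0 otherwise. */
static int nnp_get(struct nnp_task *t)
{
    unsigned long flags = atomic_load(&t->atomic_flags);

    return (int)((flags >> MODEL_PFA_NO_NEW_PRIVS) & 1UL);
}

/* Accessor: sets the flag with an atomic read-modify-write, so a
 * concurrent update to another bit in the same word is not lost. */
static void nnp_set(struct nnp_task *t)
{
    atomic_fetch_or(&t->atomic_flags, 1UL << MODEL_PFA_NO_NEW_PRIVS);
}
```

A plain bitfield shared with other flags would need a non-atomic read-modify-write of the containing word, which is the problem that a dedicated atomic field avoids.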
     
  • This adds the new "seccomp" syscall with both an "operation" and "flags"
    parameter for future expansion. The third argument is a pointer value,
    used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
    be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

    In addition to the TSYNC flag later in this patch series, there is a
    non-zero chance that this syscall could be used for configuring a fixed
    argument area for seccomp-tracer-aware processes to pass syscall arguments
    in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
    for this syscall. Additionally, this syscall uses operation, flags,
    and user pointer for arguments because strictly passing arguments via
    a user pointer would mean seccomp itself would be unable to trivially
    filter the seccomp syscall itself.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
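From userspace, the call shape described above might look like the following sketch. It assumes a Linux system whose headers define `__NR_seccomp` and `SECCOMP_SET_MODE_FILTER`; the helper names `build_allow_all` and `install_allow_all` are hypothetical, and the filter is a deliberately trivial one-instruction program that allows every syscall. The raw `syscall()` interface is used rather than relying on a libc wrapper.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fills `prog` with a one-instruction classic BPF program that returns
 * SECCOMP_RET_ALLOW for every syscall. */
void build_allow_all(struct sock_fprog *prog)
{
    static struct sock_filter insns[] = {
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    prog->len = (unsigned short)(sizeof(insns) / sizeof(insns[0]));
    prog->filter = insns;
}

/* Installs the filter through the seccomp syscall: operation first,
 * then flags, then the user pointer. Returns the raw syscall result
 * (-1 with errno set on error). */
int install_allow_all(unsigned int flags)
{
    struct sock_fprog prog;

    build_allow_all(&prog);

    /* An unprivileged caller is expected to set no_new_privs before
     * installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    return (int)syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, flags, &prog);
}
```

Calling `install_allow_all(0)` corresponds to the flags-must-be-0 case in this commit. Because the operation and flags travel in registers rather than behind the pointer, a filter can match on them directly.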
     
  • Separates the two mode setting paths to make things more readable with
    fewer #ifdefs within function bodies.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     
  • To support splitting mode 1 from mode 2, extract the mode checking and
    assignment logic into common functions.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
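One way to picture the extracted check and assignment helpers is the sketch below. It is a simplified model rather than the kernel's code: `mode_task` and the `model_*` functions are hypothetical names, and the rule shown (a mode may be assigned only if none is set yet or the same mode is requested again) is an assumption about what the common check enforces.

```c
enum {
    MODEL_MODE_DISABLED = 0,
    MODEL_MODE_STRICT   = 1,   /* "mode 1" */
    MODEL_MODE_FILTER   = 2    /* "mode 2" */
};

/* Stand-in for the per-task seccomp state. */
struct mode_task {
    int mode;
};

/* Common check: allowed if no mode is active or the requested mode
 * matches the active one. */
static int model_may_assign_mode(const struct mode_task *t, int mode)
{
    return t->mode == MODEL_MODE_DISABLED || t->mode == mode;
}

/* Common assignment, kept separate so both paths share it. */
static void model_assign_mode(struct mode_task *t, int mode)
{
    t->mode = mode;
}

/* What each mode-setting path would do with the helpers.
 * Returns 0 on success, -1 if the transition is refused. */
static int model_set_mode(struct mode_task *t, int mode)
{
    if (!model_may_assign_mode(t, mode))
        return -1;
    model_assign_mode(t, mode);
    return 0;
}
```

With the check and the assignment factored out, the strict and filter paths can each be a short function that differs only in the work done between the two calls.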
     
  • In preparation for having other callers of the seccomp mode setting
    logic, split the prctl entry point away from the core logic that performs
    seccomp mode setting.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook