Eric Lee / smarc-fsl-linux-kernel

31 Aug, 2016

1 commit

485a252a5 seccomp: Fix tracer exit notifications during fatal signals

This fixes a ptrace vs fatal pending signals bug as manifested in
seccomp now that seccomp was reordered to happen after ptrace. The
short version is that seccomp should not attempt to call do_exit()
while fatal signals are pending under a tracer. The existing code was
trying to be as defensively paranoid as possible, but it now ends up
confusing ptrace. Instead, the syscall can just be skipped (which solves
the original concern that the do_exit() was addressing) and normal signal
handling, tracer notification, and process death can happen.

Paraphrasing from the original bug report:

If a tracee task is in a PTRACE_EVENT_SECCOMP trap, or has been resumed
after such a trap but not yet been scheduled, and another task in the
thread-group calls exit_group(), then the tracee task exits without the
ptracer receiving a PTRACE_EVENT_EXIT notification. Test case here:
https://gist.github.com/khuey/3c43ac247c72cef8c956ca73281c9be7

The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that there is a fatal
signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
That prevents the ptracer's waitpid() from returning the ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

Reported-by: Robert O'Callahan
Reported-by: Kyle Huey
Tested-by: Kyle Huey
Fixes: 93e35efb8de4 ("x86/ptrace: run seccomp after ptrace")
Signed-off-by: Kees Cook
Acked-by: Oleg Nesterov
Acked-by: James Morris

Kees Cook
2016-08-31 07:12:46 +0800

04 Aug, 2016

1 commit

97f2645f3 tree-wide: replace config_enabled() with IS_ENABLED()

The use of config_enabled() against config options is ambiguous. In
practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
author might have used it for the meaning of IS_ENABLED(). Using
IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention
clearer.

This commit replaces config_enabled() with IS_ENABLED() where possible.
This commit is only touching bool config options.

I noticed two cases where config_enabled() is used against a tristate
option:

- config_enabled(CONFIG_HWMON)
[ drivers/net/wireless/ath/ath10k/thermal.c ]

- config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
[ drivers/gpu/drm/gma500/opregion.c ]

I did not touch them because they should be converted to IS_BUILTIN()
in order to keep the logic, but I was not sure it was the authors'
intention.
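The distinction between the three helpers can be illustrated with a small userspace sketch of the preprocessor trick behind them. The `sketch_` and `SKETCH_` names and the two `CONFIG_SKETCH_*` options are hypothetical stand-ins, not the kernel's own identifiers; the sketch only assumes the usual Kconfig convention that a built-in option defines `CONFIG_FOO` and a modular one defines `CONFIG_FOO_MODULE`.

```c
/* Hypothetical options: one built in (=y), one built as a module (=m). */
#define CONFIG_SKETCH_BUILTIN_OPT 1
#define CONFIG_SKETCH_MODULAR_OPT_MODULE 1

/* If the option expands to 1, the placeholder inserts an extra "0," argument,
 * which shifts the literal 1 into the second argument position. */
#define SKETCH_ARG_PLACEHOLDER_1 0,
#define sketch_take_second_arg(ignored, val, ...) val
#define sketch_is_defined(x) sketch_is_defined_1(x)
#define sketch_is_defined_1(val) sketch_is_defined_2(SKETCH_ARG_PLACEHOLDER_##val)
#define sketch_is_defined_2(arg1_or_junk) \
	sketch_take_second_arg(arg1_or_junk 1, 0, 0)

/* True only for =y. */
#define SKETCH_IS_BUILTIN(option) sketch_is_defined(option)
/* True only for =m. */
#define SKETCH_IS_MODULE(option) sketch_is_defined(option##_MODULE)
/* True for =y or =m. */
#define SKETCH_IS_ENABLED(option) \
	(SKETCH_IS_BUILTIN(option) || SKETCH_IS_MODULE(option))
```

For a bool option the built-in and enabled tests agree, which is why the conversion is safe there; for a tristate option built as a module they differ, which is the reason the two tristate users above were left alone.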

Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada
Acked-by: Kees Cook
Cc: Stas Sergeev
Cc: Matt Redfearn
Cc: Joshua Kinard
Cc: Jiri Slaby
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Markos Chandras
Cc: "Dmitry V. Levin"
Cc: yu-cheng yu
Cc: James Hogan
Cc: Brian Gerst
Cc: Johannes Berg
Cc: Peter Zijlstra
Cc: Al Viro
Cc: Will Drewry
Cc: Nikolay Martynov
Cc: Huacai Chen
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: Daniel Borkmann
Cc: Leonid Yegoshin
Cc: Rafal Milecki
Cc: James Cowgill
Cc: Greg Kroah-Hartman
Cc: Ralf Baechle
Cc: Alex Smith
Cc: Adam Buchbinder
Cc: Qais Yousef
Cc: Jiang Liu
Cc: Mikko Rapeli
Cc: Paul Gortmaker
Cc: Denys Vlasenko
Cc: Brian Norris
Cc: Hidehiro Kawai
Cc: "Luis R. Rodriguez"
Cc: Andy Lutomirski
Cc: Ingo Molnar
Cc: Dave Hansen
Cc: "Kirill A. Shutemov"
Cc: Roland McGrath
Cc: Paul Burton
Cc: Kalle Valo
Cc: Viresh Kumar
Cc: Tony Wu
Cc: Huaitong Han
Cc: Sumit Semwal
Cc: Alexei Starovoitov
Cc: Juergen Gross
Cc: Jason Cooper
Cc: "David S. Miller"
Cc: Oleg Nesterov
Cc: Andrea Gelmini
Cc: David Woodhouse
Cc: Marc Zyngier
Cc: Rabin Vincent
Cc: "Maciej W. Rozycki"
Cc: David Daney
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Masahiro Yamada
2016-08-04 20:50:07 +0800

15 Jun, 2016

3 commits

ce6526e8a seccomp: recheck the syscall after RET_TRACE

When RET_TRACE triggers, a tracer may change a syscall into something that
should be filtered by seccomp. This re-runs seccomp after a trace event
to make sure things continue to pass.
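The control flow can be modelled in userspace as follows. This is a toy sketch, not the kernel code: the three-valued policy, the syscall numbers, and every `sketch_` name are hypothetical, and the model assumes that a second TRACE verdict on the recheck is treated as "allowed" rather than notifying the tracer again.

```c
enum sketch_action { SKETCH_ALLOW, SKETCH_TRACE, SKETCH_KILL };

/* Hypothetical policy: 1 is allowed, 2 is handed to the tracer,
 * everything else is fatal. */
enum sketch_action sketch_filter(int nr)
{
	if (nr == 1)
		return SKETCH_ALLOW;
	if (nr == 2)
		return SKETCH_TRACE;
	return SKETCH_KILL;
}

/* A tracer callback returns the (possibly rewritten) syscall number. */
typedef int (*sketch_tracer_fn)(int nr);

int sketch_tracer_keep(int nr) { return nr; }
int sketch_tracer_to_allowed(int nr) { (void)nr; return 1; }
int sketch_tracer_to_forbidden(int nr) { (void)nr; return 3; }

enum sketch_action sketch_check(int nr, sketch_tracer_fn tracer)
{
	enum sketch_action act = sketch_filter(nr);

	if (act != SKETCH_TRACE)
		return act;

	/* The tracer runs and may change the syscall number. */
	nr = tracer(nr);

	/* Re-run the filter on whatever the tracer left behind. */
	act = sketch_filter(nr);
	return act == SKETCH_TRACE ? SKETCH_ALLOW : act;
}
```

Without the second `sketch_filter()` call, a tracer could turn a traced syscall into one the policy forbids and have it executed unchecked.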

Signed-off-by: Kees Cook
Cc: Andy Lutomirski

Kees Cook
2016-06-15 01:54:41 +0800

8112c4f14 seccomp: remove 2-phase API

Since nothing is using the 2-phase API, and it adds more complexity than
benefit, remove it.

Signed-off-by: Kees Cook
Cc: Andy Lutomirski

Kees Cook
2016-06-15 01:54:40 +0800

2f275de5d seccomp: Add a seccomp_data parameter secure_computing()

Currently, if arch code wants to supply seccomp_data directly to
seccomp (which is generally much faster than having seccomp do it
using the syscall_get_xyz() API), it has to use the two-phase
seccomp hooks. Add it to the easy hooks, too.

Cc: linux-arch@vger.kernel.org
Signed-off-by: Andy Lutomirski
Signed-off-by: Kees Cook

Andy Lutomirski
2016-06-15 01:54:39 +0800

20 May, 2016

1 commit

07b75260e Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus

Pull MIPS updates from Ralf Baechle:
"This is the main pull request for MIPS for 4.7. Here's the summary of
the changes:

- ATH79: Support for DTB passing using the UHI boot protocol
- ATH79: Remove support for builtin DTB.
- ATH79: Add zboot debug serial support.
- ATH79: Add initial support for Dragino MS14 (Dragino 2), Onion Omega
and DPT-Module.
- ATH79: Update devicetree clock support for AR9132 and AR9331.
- ATH79: Cleanup the DT code.
- ATH79: Support newer SOCs in ath79_ddr_ctrl_init.
- ATH79: Fix regression in PCI window initialization.
- BCM47xx: Move SPROM driver to drivers/firmware/
- BCM63xx: Enable partition parser in defconfig.
- BMIPS: BMIPS5000 has I cache filling from D cache
- BMIPS: Add cpu-feature-overrides.h
- BMIPS: Add Whirlwind support
- BMIPS: Adjust mips-hpt-frequency for BCM7435
- BMIPS: Remove maxcpus from BCM97435SVMB DTS
- BMIPS: Add missing 7038 L1 register cells to BCM7435
- BMIPS: Various tweaks to initialization code.
- BMIPS: Enable partition parser in defconfig.
- BMIPS: Cache tweaks.
- BMIPS: Add UART, I2C and SATA devices to DT.
- BMIPS: Add BCM6358 and BCM63268 support
- BMIPS: Add device tree example for BCM6358.
- BMIPS: Improve BCM6328 and BCM6368 device trees
- Lantiq: Add support for device tree file from boot loader
- Lantiq: Allow build with no built-in DT.
- Loongson 3: Reserve 32MB for RS780E integrated GPU.
- Loongson 3: Fix build error after ld-version.sh modification
- Loongson 3: Move chipset ACPI code from drivers to arch.
- Loongson 3: Speedup irq processing.
- Loongson 3: Add basic Loongson 3A support.
- Loongson 3: Set cache flush handlers to nop.
- Loongson 3: Invalidate special TLBs when needed.
- Loongson 3: Fast TLB refill handler.
- MT7620: Fallback strategy for invalid syscfg0.
- Netlogic: Fix CP0_EBASE redefinition warnings
- Octeon: Initialization fixes
- Octeon: Add DTS files for the D-Link DSR-1000N and EdgeRouter Lite
- Octeon: Enable add Octeon-drivers in cavium_octeon_defconfig
- Octeon: Correctly handle endian-swapped initramfs images.
- Octeon: Support CN73xx, CN75xx and CN78xx.
- Octeon: Remove dead code from cvmx-sysinfo.
- Octeon: Extend number of supported CPUs past 32.
- Octeon: Remove some code limiting NR_IRQS to 255.
- Octeon: Simplify octeon_irq_ciu_gpio_set_type.
- Octeon: Mark some functions __init in smp.c
- Octeon: Add Octeon III CN7xxx interface detection
- PIC32: Add serial driver and bindings for it.
- PIC32: Add PIC32 deadman timer driver and bindings.
- PIC32: Add PIC32 clock timer driver and bindings.
- Pistachio: Determine SoC revision during boot
- Sibyte: Fix Kconfig dependencies of SIBYTE_BUS_WATCHER.
- Sibyte: Strip redundant comments from bcm1480_regs.h.
- Panic immediately if panic_on_oops is set.
- module: fix incorrect IS_ERR_VALUE macro usage.
- module: Make consistent use of pr_*
- Remove no longer needed work_on_cpu() call.
- Remove CONFIG_IPV6_PRIVACY from defconfigs.
- Fix registers of non-crashing CPUs in dumps.
- Handle MIPSisms in new vmcore_elf32_check_arch.
- Select CONFIG_HANDLE_DOMAIN_IRQ and make it work.
- Allow RIXI to be used on non-R2 or R6 cores.
- Reserve nosave data for hibernation
- Fix siginfo.h to use strict POSIX types.
- Don't unwind user mode with EVA.
- Fix watchpoint restoration
- Ptrace watchpoints for R6.
- Sync icache when it fills from dcache
- I6400 I-cache fills from dcache.
- Various MSA fixes.
- Cleanup MIPS_CPU_* definitions.
- Signal: Move generic copy_siginfo to signal.h
- Signal: Fix uapi include in exported asm/siginfo.h
- Timer fixes for sake of KVM.
- XPA TLB refill fixes.
- Treat perf counter feature
- Update John Crispin's email address
- Add PIC32 watchdog and bindings.
- Handle R10000 LL/SC bug in set_pte()
- cpufreq: Various fixes for Loongson1.
- R6: Fix R2 emulation.
- mathemu: Cosmetic fix to ADDIUPC emulation, plenty of other small fixes
- ELF: ABI and FP fixes.
- Allow for relocatable kernel and use that to support KASLR.
- Fix CPC_BASE_ADDR mask
- Plenty of smp-cps, CM, R6 and M6250 fixes.
- Make reset_control_ops const.
- Fix kernel command line handling of leading whitespace.
- Cleanups to cache handling.
- Add brcm, bcm6345-l1-intc device tree bindings.
- Use generic clkdev.h header
- Remove CLK_IS_ROOT usage.
- Misc small cleanups.
- CM: Fix compilation error when !MIPS_CM
- oprofile: Fix a preemption issue
- Detect DSP ASE v3 support."

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (275 commits)
MIPS: pic32mzda: fix getting timer clock rate.
MIPS: ath79: fix regression in PCI window initialization
MIPS: ath79: make ath79_ddr_ctrl_init() compatible for newer SoCs
MIPS: Fix VZ probe gas errors with binutils of MSA context in non-MSA kernels
MIPS: cevt-r4k: Dynamically calculate min_delta_ns
MIPS: malta-time: Take seconds into account
MIPS: malta-time: Start GIC count before syncing to RTC
MIPS: Force CPUs to lose FP context during mode switches
...

Linus Torvalds
2016-05-20 01:02:26 +0800

13 May, 2016

2 commits

cb4253aa0 seccomp: Constify mode1 syscall whitelist

These values are constant and should be marked as such.

Signed-off-by: Matt Redfearn
Acked-by: Kees Cook
Cc: Will Drewry
Cc: Andy Lutomirski
Cc: IMG-MIPSLinuxKerneldevelopers@imgtec.com
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12979/
Signed-off-by: Ralf Baechle

Matt Redfearn
2016-05-13 20:02:01 +0800

c983f0e86 seccomp: Get compat syscalls from asm-generic header

Move retrieval of compat syscall numbers into inline function defined in
asm-generic header so that arches may override it.

[ralf@linux-mips.org: Resolve merge conflict.]

Suggested-by: Paul Burton
Signed-off-by: Matt Redfearn
Acked-by: Kees Cook
Cc: IMG-MIPSLinuxKerneldevelopers@imgtec.com
Cc: Arnd Bergmann
Cc: Andy Lutomirski
Cc: Will Drewry
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12978/
Signed-off-by: Ralf Baechle

Matt Redfearn
2016-05-13 20:02:00 +0800

05 May, 2016

1 commit

470bf1f27 seccomp: Fix comment typo

Drop accidentally repeated word in comment.

Signed-off-by: Mickaël Salaün
Cc: Kees Cook
Cc: Andy Lutomirski
Cc: Will Drewry

Mickaël Salaün
2016-05-05 01:54:04 +0800

23 Mar, 2016

1 commit

5c38065e0 seccomp: check in_compat_syscall, not is_compat_task, in strict mode

Seccomp wants to know the syscall bitness, not the caller task bitness,
when it selects the syscall whitelist.

As far as I know, this makes no difference on any architecture, so it's
not a security problem. (It generates identical code everywhere except
sparc, and, on sparc, the syscall numbering is the same for both ABIs.)

Signed-off-by: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Lutomirski
2016-03-23 06:36:02 +0800

27 Jan, 2016

1 commit

103502a35 seccomp: always propagate NO_NEW_PRIVS on tsync

Before this patch, a process with some permissive seccomp filter
that was applied by root without NO_NEW_PRIVS was able to add
more filters to itself without setting NO_NEW_PRIVS by setting
the new filter from a throwaway thread with NO_NEW_PRIVS.
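For readers unfamiliar with the bit in question, the following minimal userspace sketch shows how a thread sets NO_NEW_PRIVS on itself and reads the value back through `prctl()`. The function name is hypothetical; it only demonstrates the per-thread flag that this patch makes thread synchronization propagate, and it does not reproduce the bug.

```c
#include <sys/prctl.h>

/* Set no_new_privs on the calling thread, then query it.
 * Returns 1 when the bit is set, or -1 if setting it failed. */
int sketch_set_no_new_privs(void)
{
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;

	/* PR_GET_NO_NEW_PRIVS returns the current value of the bit. */
	return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Once set, the bit cannot be cleared again, so the call is safe to repeat.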

Signed-off-by: Jann Horn
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook

Jann Horn
2016-01-27 23:38:25 +0800

28 Oct, 2015

1 commit

f8e529ed9 seccomp, ptrace: add support for dumping seccomp filters

This patch adds support for dumping a process' (classic BPF) seccomp
filters via ptrace.

PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF
seccomp filters. addr should be an integer which represents the ith seccomp
filter (0 is the most recently installed filter). data should be a struct
sock_filter * with enough room for the ith filter, or NULL, in which case
the filter is not saved. The return value for this command is the number of
BPF instructions the program represents, or negative in the case of errors.
Command specific errors are ENOENT: which indicates that there is no ith
filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith
filter was not installed as a classic BPF filter.
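A tracer-side call might look like the sketch below. The helper name is hypothetical, and the request number is spelled out locally (0x420c, matching the "0x42**" range mentioned in the changelog) on the assumption that older libc headers may not define it. Passing NULL as data only asks for the instruction count, as described above.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* Local copy of the request number; assumed to match the kernel's
 * PTRACE_SECCOMP_GET_FILTER. */
#define SKETCH_PTRACE_SECCOMP_GET_FILTER 0x420c

/* Ask how many classic BPF instructions the index-th filter of an
 * already-attached, stopped tracee contains (0 = most recent filter).
 * Returns the count, or -1 with errno set (for example ENOENT when there
 * is no such filter). */
long sketch_seccomp_filter_len(pid_t pid, unsigned long index)
{
	return ptrace(SKETCH_PTRACE_SECCOMP_GET_FILTER, pid,
		      (void *)index, NULL);
}
```

A caller would typically use the returned count to allocate an array of `struct sock_filter` and repeat the request with that buffer as the data argument.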

A caveat with this approach is that there is no way to get explicitly at
the hierarchy of seccomp filters, and users need to memcmp() filters to
decide which are inherited. This means that a task which installs two of
the same filter can potentially confuse users of this interface.

v2: * make save_orig const
    * check that the orig_prog exists (not necessary right now, but when
      grows eBPF support it will be)
    * s/n/filter_off and make it an unsigned long to match ptrace
    * count "down" the tree instead of "up" when passing a filter offset

v3: * don't take the current task's lock for inspecting its seccomp mode
    * use a 0x42** constant for the ptrace command value

v4: * don't copy to userspace while holding spinlocks

v5: * add another condition to WARN_ON

v6: * rebase on net-next

Signed-off-by: Tycho Andersen
Acked-by: Kees Cook
CC: Will Drewry
Reviewed-by: Oleg Nesterov
CC: Andy Lutomirski
CC: Pavel Emelyanov
CC: Serge E. Hallyn
CC: Alexei Starovoitov
CC: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Tycho Andersen
2015-10-28 10:55:13 +0800

05 Oct, 2015

1 commit

bab189918 bpf, seccomp: prepare for upcoming criu support

The current ongoing effort to dump existing cBPF seccomp filters back
to user space requires to hold the pre-transformed instructions like
we do in case of socket filters from sk_attach_filter() side, so they
can be reloaded in original form at a later point in time by utilities
such as criu.

To prepare for this, simply extend the bpf_prog_create_from_user()
API to hold a flag that tells whether we should store the original
or not. Also, fanout filters could make use of that in future for
things like diag. While fanout filters already use bpf_prog_destroy(),
move seccomp over to them as well to handle original programs when
present.

Signed-off-by: Daniel Borkmann
Cc: Tycho Andersen
Cc: Pavel Emelyanov
Cc: Kees Cook
Cc: Andy Lutomirski
Cc: Alexei Starovoitov
Tested-by: Tycho Andersen
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-10-05 21:47:05 +0800

20 Jul, 2015

1 commit

fe6c59dc1 Merge tag 'seccomp-next' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux into next

James Morris
2015-07-20 15:19:19 +0800

16 Jul, 2015

3 commits

221272f97 seccomp: swap hard-coded zeros to defined name

For clarity, if CONFIG_SECCOMP isn't defined, seccomp_mode() is returning
"disabled". This makes that more clear, along with another 0-use, and
results in no operational change.

Signed-off-by: Kees Cook

Kees Cook
2015-07-16 02:52:54 +0800

13c4a9011 seccomp: add ptrace options for suspend/resume

This patch is the first step in enabling checkpoint/restore of processes
with seccomp enabled.

One of the things CRIU does while dumping tasks is inject code into them
via ptrace to collect information that is only available to the process
itself. However, if we are in a seccomp mode where these processes are
prohibited from making these syscalls, then what CRIU does kills the task.

This patch adds a new ptrace option, PTRACE_O_SUSPEND_SECCOMP, that enables
a task from the init user namespace which has CAP_SYS_ADMIN and no seccomp
filters to disable (and re-enable) seccomp filters for another task so that
they can be successfully dumped (and restored). We restrict the set of
processes that can disable seccomp through ptrace because although today
ptrace can be used to bypass seccomp, there is some discussion of closing
this loophole in the future and we would like this patch to not depend on
that behavior and be future proofed for when it is removed.

Note that seccomp can be suspended before any filters are actually
installed; this behavior is useful on criu restore, so that we can suspend
seccomp, restore the filters, unmap our restore code from the restored
process' address space, and then resume the task by detaching and have the
filters resumed as well.
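From the tracer's side, requesting the option at attach time could be sketched as below. The helper name is hypothetical, and the option value is defined locally as bit 21 on the assumption that libc headers may not yet carry it; the privilege requirements described above still apply, so the call is expected to fail for an ordinary tracer.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>

/* Local copy of the option bit; assumed to match the kernel's
 * PTRACE_O_SUSPEND_SECCOMP. */
#define SKETCH_PTRACE_O_SUSPEND_SECCOMP (1UL << 21)

/* Attach to pid with seccomp suspended for as long as the tracer stays
 * attached. PTRACE_SEIZE takes the option mask in its data argument.
 * Returns 0 on success, or -1 with errno set. */
long sketch_seize_with_seccomp_suspended(pid_t pid)
{
	return ptrace(PTRACE_SEIZE, pid, NULL,
		      (void *)SKETCH_PTRACE_O_SUSPEND_SECCOMP);
}
```

Because the suspension is tied to a ptrace option rather than a one-shot command, detaching (or the tracer dying) re-enables the filters, as the v2 notes below point out.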

v2 changes:

* require that the tracer have no seccomp filters installed
* drop TIF_NOTSC manipulation from the patch
* change from ptrace command to a ptrace option and use this ptrace option
as the flag to check. This means that as soon as the tracer
detaches/dies, seccomp is re-enabled and as a corollary that one can not
disable seccomp across PTRACE_ATTACHs.

v3 changes:

* get rid of various #ifdefs everywhere
* report more sensible errors when PTRACE_O_SUSPEND_SECCOMP is incorrectly
used

v4 changes:

* get rid of may_suspend_seccomp() in favor of a capable() check in ptrace
directly

v5 changes:

* check that seccomp is not enabled (or suspended) on the tracer

Signed-off-by: Tycho Andersen
CC: Will Drewry
CC: Roland McGrath
CC: Pavel Emelyanov
CC: Serge E. Hallyn
Acked-by: Oleg Nesterov
Acked-by: Andy Lutomirski
[kees: access seccomp.mode through seccomp_mode() instead]
Signed-off-by: Kees Cook

Tycho Andersen
2015-07-16 02:52:52 +0800

8225d3853 seccomp: Replace smp_read_barrier_depends() with lockless_dereference()

Recently lockless_dereference() was added which can be used in place of
hard-coding smp_read_barrier_depends(). The following PATCH makes the change.

Signed-off-by: Pranith Kumar
Signed-off-by: Kees Cook

Pranith Kumar
2015-07-16 02:52:51 +0800

10 May, 2015

2 commits

ac67eb2c5 seccomp, filter: add and use bpf_prog_create_from_user from seccomp

Seccomp has always been a special candidate when it comes to preparation
of its filters in seccomp_prepare_filter(). Due to the extra checks and
filter rewrite it partially duplicates code and has BPF internals exposed.

This patch adds a generic API inside the BPF code that seccomp can use
and thus keep its filter preparation code minimal and more maintainable.
The other side-effect is that now classic JITs can add seccomp support as
well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.

Tested with seccomp and BPF test suites.

Signed-off-by: Daniel Borkmann
Cc: Nicolas Schichan
Cc: Alexei Starovoitov
Cc: Kees Cook
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-05-10 05:35:05 +0800

d9e12f42e seccomp: simplify seccomp_prepare_filter and reuse bpf_prepare_filter

Remove the calls to bpf_check_classic(), bpf_convert_filter() and
bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
instead.

seccomp_check_filter() is passed to bpf_prepare_filter() so that it
gets called from there, after bpf_check_classic().

We can now remove exposure of two internal classic BPF functions
previously used by seccomp. The export of bpf_check_classic() symbol,
previously known as sk_chk_filter(), was there since pre git times,
and no in-tree module was using it, therefore remove it.

Joint work with Daniel Borkmann.

Signed-off-by: Nicolas Schichan
Signed-off-by: Daniel Borkmann
Cc: Alexei Starovoitov
Cc: Kees Cook
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Nicolas Schichan
2015-05-10 05:35:05 +0800

18 Feb, 2015

1 commit

580c57f10 seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO

The value resulting from the SECCOMP_RET_DATA mask could exceed MAX_ERRNO
when setting errno during a SECCOMP_RET_ERRNO filter action. This makes
sure we have a reliable value being set, so that an invalid errno will not
be ignored by userspace.
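The clamping can be sketched in a few lines of userspace C. The constants are local copies (the low 16 bits of a filter return value carry the data, and MAX_ERRNO is assumed to be 4095 as in the kernel's err.h); the function name is hypothetical.

```c
#include <stdint.h>

#define SKETCH_SECCOMP_RET_ERRNO 0x00050000U /* action bits */
#define SKETCH_SECCOMP_RET_DATA  0x0000ffffU /* data mask   */
#define SKETCH_MAX_ERRNO         4095U

/* Extract the errno value from a filter return value, capping it so that
 * it always falls inside the range userspace recognises as an error. */
uint32_t sketch_errno_from_filter(uint32_t filter_ret)
{
	uint32_t data = filter_ret & SKETCH_SECCOMP_RET_DATA;

	if (data > SKETCH_MAX_ERRNO)
		data = SKETCH_MAX_ERRNO;
	return data;
}
```

The mask alone admits values up to 65535, so without the cap a filter could request an errno that the syscall return convention cannot represent as an error.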

Signed-off-by: Kees Cook
Reported-by: Dmitry V. Levin
Cc: Andy Lutomirski
Cc: Will Drewry
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2015-02-18 06:34:55 +0800

14 Oct, 2014

1 commit

ba1a96fc7 Merge branch 'x86-seccomp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 seccomp changes from Ingo Molnar:
"This tree includes x86 seccomp filter speedups and related preparatory
work, which touches core seccomp facilities as well.

The main idea is to split seccomp into two phases, to be able to enter
a simple fast path for syscalls with ptrace side effects.

There's no substantial user-visible (and ABI) effects expected from
this, except a change in how we emit a better audit record for
SECCOMP_RET_TRACE events"

* 'x86-seccomp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86_64, entry: Use split-phase syscall_trace_enter for 64-bit syscalls
x86_64, entry: Treat regs->ax the same in fastpath and slowpath syscalls
x86: Split syscall_trace_enter into two phases
x86, entry: Only call user_exit if TIF_NOHZ
x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit
seccomp: Document two-phase seccomp and arch-provided seccomp_data
seccomp: Allow arch code to provide seccomp_data
seccomp: Refactor the filter callback and the API
seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing

Linus Torvalds
2014-10-14 08:27:06 +0800

06 Sep, 2014

1 commit

60a3b2253 net: bpf: make eBPF interpreter images read-only

With eBPF getting more extended and exposure to user space is on its way,
hardening the memory range the interpreter uses to steer its command flow
seems appropriate. This patch moves the to be interpreted bytecode to
read-only pages.

In case we execute a corrupted BPF interpreter image for some reason e.g.
caused by an attacker which got past a verifier stage, it would not only
provide arbitrary read/write memory access but arbitrary function calls
as well. After setting up the BPF interpreter image, its contents do not
change until destruction time, thus we can setup the image on immutable
made pages in order to mitigate modifications to that code. The idea
is derived from commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks").

This is possible because bpf_prog is not part of sk_filter anymore.
After setup bpf_prog cannot be altered during its life-time. This prevents
any modifications to the entire bpf_prog structure (incl. function/JIT
image pointer).

Every eBPF program (including classic BPF that are migrated) has to call
bpf_prog_select_runtime() to select either interpreter or a JIT image
as a last setup step, and they all are being freed via bpf_prog_free(),
including non-JIT. Therefore, we can easily integrate this into the
eBPF life-time, plus since we directly allocate a bpf_prog, we have no
performance penalty.

Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual
inspection of kernel_page_tables. Brad Spengler proposed the same idea
via Twitter during development of this patch.

Joint work with Hannes Frederic Sowa.

Suggested-by: Brad Spengler
Signed-off-by: Daniel Borkmann
Signed-off-by: Hannes Frederic Sowa
Cc: Alexei Starovoitov
Cc: Kees Cook
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2014-09-06 03:02:48 +0800

04 Sep, 2014

3 commits

d39bd00de seccomp: Allow arch code to provide seccomp_data

populate_seccomp_data is expensive: it works by inspecting
task_pt_regs and various other bits to piece together all the
information, and it does so in multiple partially redundant steps.

Arch-specific code in the syscall entry path can do much better.

Admittedly this adds a bit of additional room for error, but the
speedup should be worth it.

Signed-off-by: Andy Lutomirski
Signed-off-by: Kees Cook

Andy Lutomirski
2014-09-04 05:58:17 +0800

13aa72f0f seccomp: Refactor the filter callback and the API

The reason I did this is to add a seccomp API that will be usable
for an x86 fast path. The x86 entry code needs to use a rather
expensive slow path for a syscall that might be visible to things
like ptrace. By splitting seccomp into two phases, we can check
whether we need the slow path and then use the fast path if the
filter allows the syscall or just returns some errno.

As a side effect, I think the new code is much easier to understand
than the old code.

This has one user-visible effect: the audit record written for
SECCOMP_RET_TRACE is now a simple indication that SECCOMP_RET_TRACE
happened. It used to depend in a complicated way on what the tracer
did. I couldn't make much sense of it.

Signed-off-by: Andy Lutomirski
Signed-off-by: Kees Cook

Andy Lutomirski
2014-09-04 05:58:17 +0800

a4412fc94 seccomp,x86,arm,mips,s390: Remove nr parameter from secure_computing

The secure_computing function took a syscall number parameter, but
it only paid any attention to that parameter if seccomp mode 1 was
enabled. Rather than coming up with a kludge to get the parameter
to work in mode 2, just remove the parameter.

To avoid churn in arches that don't have seccomp filters (and may
not even support syscall_get_nr right now), this leaves the
parameter in secure_computing_strict, which is now a real function.

For ARM, this is a bit ugly due to the fact that ARM conditionally
supports seccomp filters. Fixing that would probably only be a
couple of lines of code, but it should be coordinated with the audit
maintainers.

This will be a slight slowdown on some arches. The right fix is to
pass in all of seccomp_data instead of trying to make just the
syscall nr part be fast.

This is a prerequisite for making two-phase seccomp work cleanly.

Cc: Russell King
Cc: linux-arm-kernel@lists.infradead.org
Cc: Ralf Baechle
Cc: linux-mips@linux-mips.org
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
Cc: Kees Cook
Signed-off-by: Andy Lutomirski
Signed-off-by: Kees Cook

Andy Lutomirski
2014-09-04 05:58:17 +0800

12 Aug, 2014

1 commit

69f6a34bd seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lock

Current upstream kernel hangs with mips and powerpc targets in
uniprocessor mode if SECCOMP is configured.

Bisect points to commit dbd952127d11 ("seccomp: introduce writer locking").
Turns out that code such as
	BUG_ON(!spin_is_locked(&list_lock));
can not be used in uniprocessor mode because spin_is_locked() always
returns false in this configuration, and that assert_spin_locked()
exists for that very purpose and must be used instead.
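The difference can be modelled with a toy in userspace. Everything here is hypothetical (the struct, the `smp` flag, and the `sketch_` names); the model only encodes the behaviour stated above, namely that the query always reports "not locked" on uniprocessor builds while the assertion helper is written to pass there.

```c
#include <stdbool.h>

struct sketch_spinlock {
	bool held;
};

/* Query helper: on a uniprocessor build there is no lock word to look at,
 * so the answer is always false, even while the lock is held. */
bool sketch_spin_is_locked(const struct sketch_spinlock *lock, bool smp)
{
	return smp ? lock->held : false;
}

/* Assertion helper: returns true when the assertion passes. On a
 * uniprocessor build it cannot check anything and therefore passes. */
bool sketch_assert_spin_locked(const struct sketch_spinlock *lock, bool smp)
{
	return smp ? lock->held : true;
}
```

In this model, `BUG_ON(!sketch_spin_is_locked(...))` fires on every uniprocessor call even with the lock correctly held, whereas the assertion helper stays quiet.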

Fixes: dbd952127d11 ("seccomp: introduce writer locking")
Cc: Kees Cook
Signed-off-by: Guenter Roeck
Signed-off-by: Kees Cook

Guenter Roeck
2014-08-12 04:29:12 +0800

07 Aug, 2014

1 commit

ae045e245 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights:

1) Steady transitioning of the BPF infrastructure to a generic spot so
all kernel subsystems can make use of it, from Alexei Starovoitov.

2) SFC driver supports busy polling, from Alexandre Rames.

3) Take advantage of hash table in UDP multicast delivery, from David
Held.

4) Lighten locking, in particular by getting rid of the LRU lists, in
inet frag handling. From Florian Westphal.

5) Add support for various RFC6458 control messages in SCTP, from
Geir Ola Vaagland.

6) Allow to filter bridge forwarding database dumps by device, from
Jamal Hadi Salim.

7) virtio-net also now supports busy polling, from Jason Wang.

8) Some low level optimization tweaks in pktgen from Jesper Dangaard
Brouer.

9) Add support for ipv6 address generation modes, so that userland
can have some input into the process. From Jiri Pirko.

10) Consolidate common TCP connection request code in ipv4 and ipv6,
from Octavian Purdila.

11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

12) Generic resizable RCU hash table, with initial users in netlink and
nftables. From Thomas Graf.

13) Maintain a name assignment type so that userspace can see where a
network device name came from (enumerated by kernel, assigned
explicitly by userspace, etc.) From Tom Gundersen.

14) Automatic flow label generation on transmit in ipv6, from Tom
Herbert.

15) New packet timestamping facilities from Willem de Bruijn, meant to
assist in measuring latencies going into/out-of the packet
scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
cxgb4 : Disable recursive mailbox commands when enabling vi
net: reduce USB network driver config options.
tg3: Modify tg3_tso_bug() to handle multiple TX rings
amd-xgbe: Perform phy connect/disconnect at dev open/stop
amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
net: sun4i-emac: fix memory leak on bad packet
sctp: fix possible seqlock deadlock in sctp_packet_transmit()
Revert "net: phy: Set the driver when registering an MDIO bus device"
cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
team: Simplify return path of team_newlink
bridge: Update outdated comment on promiscuous mode
net-timestamp: ACK timestamp for bytestreams
net-timestamp: TCP timestamping
net-timestamp: SCHED timestamp on entering packet scheduler
net-timestamp: add key to disambiguate concurrent datagrams
net-timestamp: move timestamp flags out of sk_flags
net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
cxgb4i : Move stray CPL definitions to cxgb4 driver
tcp: reduce spurious retransmits due to transient SACK reneging
qlcnic: Initialize dcbnl_ops before register_netdev
...

Linus Torvalds
2014-08-07 00:38:14 +0800

03 Aug, 2014

3 commits

7ae457c1e net: filter: split 'struct sk_filter' into socket and bpf parts

clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix

split 'struct sk_filter' into
    struct sk_filter {
        atomic_t refcnt;
        struct rcu_head rcu;
        struct bpf_prog *prog;
    };
and
    struct bpf_prog {
        u32 jited:1,
            len:31;
        struct sock_fprog_kern *orig_prog;
        unsigned int (*bpf_func)(const struct sk_buff *skb,
                                 const struct bpf_insn *filter);
        union {
            struct sock_filter insns[0];
            struct bpf_insn insnsi[0];
            struct work_struct work;
        };
    };
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases

split SK_RUN_FILTER macro into:
SK_RUN_FILTER to be used with 'struct sk_filter *' and
BPF_PROG_RUN to be used with 'struct bpf_prog *'

__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function

also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:

sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter

API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet

API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
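
The relationship between the two run macros can be pictured with a small
user-space model. This is only a sketch under stated assumptions: the structs
are trimmed stand-ins for the kernel ones, `struct demo_ctx` and
`demo_accept_len()` are hypothetical, and the macro bodies mirror the shape
described above rather than the exact kernel definitions.

```c
#include <stddef.h>

/* Trimmed user-space stand-ins; the real kernel structs carry more fields. */
struct bpf_insn { unsigned char code; };   /* placeholder instruction type */
struct demo_ctx { unsigned int len; };     /* hypothetical stand-in for struct sk_buff */

/* Pure BPF part: the program and the function that runs it. */
struct bpf_prog {
    unsigned int len;
    unsigned int (*bpf_func)(const struct demo_ctx *ctx,
                             const struct bpf_insn *insns);
    const struct bpf_insn *insnsi;
};

/* Socket part: a reference-counted wrapper pointing at a program. */
struct sk_filter {
    int refcnt;
    struct bpf_prog *prog;
};

/* Run a bare program, or reach the program through the socket wrapper. */
#define BPF_PROG_RUN(prog, ctx)    ((prog)->bpf_func((ctx), (prog)->insnsi))
#define SK_RUN_FILTER(filter, ctx) BPF_PROG_RUN((filter)->prog, (ctx))

/* Hypothetical "program" that accepts the whole packet length. */
static unsigned int demo_accept_len(const struct demo_ctx *ctx,
                                    const struct bpf_insn *insns)
{
    (void)insns;
    return ctx->len;
}
```

In this model an 'unattached' user holds only a `struct bpf_prog *` and calls
BPF_PROG_RUN, while a socket-side user holds a `struct sk_filter *` and calls
SK_RUN_FILTER; both end up in the same `bpf_func`.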

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-08-03 06:03:58 +0800
8fb575ca3 net: filter: rename sk_convert_filter() -> bpf_convert_filter()

to indicate that this function is converting classic BPF into eBPF
and not related to sockets

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-08-03 06:02:38 +0800
4df95ff48 net: filter: rename sk_chk_filter() -> bpf_check_classic()

trivial rename to indicate that this function performs classic BPF checking

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-08-03 06:02:38 +0800

25 Jul, 2014

1 commit

2695fb552 net: filter: rename 'struct sock_filter_int' into 'struct bpf_insn'

eBPF is used by socket filtering, seccomp and soon by tracing and
exposed to userspace, therefore 'sock_filter_int' name is not accurate.
Rename it to 'bpf_insn'

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2014-07-25 14:27:17 +0800

19 Jul, 2014

9 commits

c2e1f2e30 seccomp: implement SECCOMP_FILTER_FLAG_TSYNC

Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.
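
The ancestor rule above can be sketched as a small user-space model. This is
an illustration under assumptions, not the kernel code: `struct demo_filter`,
`struct demo_thread`, `demo_is_ancestor()` and `demo_can_sync()` are
hypothetical names, and locking is left out entirely.

```c
#include <stddef.h>

/* Hypothetical model: each filter points at the filter it was stacked on. */
struct demo_filter {
    const struct demo_filter *prev;
};

/* Hypothetical model of a thread: its pid and its current filter (or NULL). */
struct demo_thread {
    int pid;
    const struct demo_filter *filter;
};

/*
 * Returns 1 if 'parent' is 'child' itself or appears in child's prev chain.
 * A NULL parent (thread with no filter installed) counts as an ancestor.
 */
static int demo_is_ancestor(const struct demo_filter *parent,
                            const struct demo_filter *child)
{
    if (parent == NULL)
        return 1;
    for (; child != NULL; child = child->prev)
        if (child == parent)
            return 1;
    return 0;
}

/*
 * Returns 0 if every thread's filter is an ancestor of the caller's filter,
 * otherwise the pid of the first thread that cannot be synchronized.
 */
static int demo_can_sync(const struct demo_thread *threads, size_t n,
                         const struct demo_filter *caller_filter)
{
    for (size_t i = 0; i < n; i++)
        if (!demo_is_ancestor(threads[i].filter, caller_filter))
            return threads[i].pid;
    return 0;
}
```

A thread that has installed a sibling filter (one that is not in the caller's
chain) makes the check fail with that thread's pid, which matches the
all-or-nothing behaviour described above.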

The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.

Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.

Changes across threads to the filter pointer include a barrier.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes
Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:40 +0800
3ba2530cc seccomp: allow mode setting across threads

This changes the mode setting helper to allow threads to change the
seccomp mode from another thread. We must maintain barriers to keep
TIF_SECCOMP synchronized with the rest of the seccomp state.

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:40 +0800
dbd952127 seccomp: introduce writer locking

Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to the
seccomp fields are now protected by the sighand spinlock (which is shared
by all threads in the thread group). Read access remains lockless because
pointer updates themselves are atomic. However, writes (or cloning)
often entail additional checks (like maximum instruction counts)
that require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain to have
the same seccomp state when they exit the lock.
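
The "locked writers, lockless readers" pattern can be sketched in user space
with C11 atomics. This is an analogy under assumptions, not the kernel
implementation: the spinlock built on `atomic_flag` stands in for the sighand
lock, and all `demo_*` names and the `DEMO_MAX_INSNS` limit are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_MAX_INSNS 32768u   /* illustrative limit on total instructions */

struct demo_filter {
    struct demo_filter *prev;   /* filter this one is stacked on */
    unsigned int insn_count;
};

/* Stand-in for the lock shared by all threads in a thread group. */
struct demo_group_lock { atomic_flag locked; };

struct demo_task {
    struct demo_group_lock *lock;
    _Atomic(struct demo_filter *) filter;
};

static void demo_lock(struct demo_group_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;   /* spin until the flag is released */
}

static void demo_unlock(struct demo_group_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Reader path: a single atomic load, no lock taken. */
static struct demo_filter *demo_read_filter(struct demo_task *t)
{
    return atomic_load_explicit(&t->filter, memory_order_acquire);
}

/* Writer path: the instruction-count check and the publish happen under the lock. */
static int demo_attach_filter(struct demo_task *t, struct demo_filter *f)
{
    int ret = -1;

    demo_lock(t->lock);
    struct demo_filter *cur = atomic_load_explicit(&t->filter, memory_order_relaxed);
    unsigned int total = f->insn_count;
    for (struct demo_filter *w = cur; w != NULL; w = w->prev)
        total += w->insn_count;
    if (total <= DEMO_MAX_INSNS) {
        f->prev = cur;
        atomic_store_explicit(&t->filter, f, memory_order_release);
        ret = 0;
    }
    demo_unlock(t->lock);
    return ret;
}
```

The check has to see a stable chain, so it runs under the lock; the reader only
ever observes either the old or the new pointer.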

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:39 +0800
c8bee430d seccomp: split filter prep from check and apply

In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

For extreme defensiveness, I've added a BUG_ON check for the calculated
size of the buffer allocation in case BPF_MAXINSN ever changes, which
shouldn't ever happen. The compiler should actually optimize out this
check since the test above it makes it impossible.

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:39 +0800
1d4457f99 sched: move no_new_privs into new atomic flags

Since seccomp transitions between threads require updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.
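
The idea of a dedicated atomic flag word with accessors can be sketched in user
space as follows. This is a model under assumptions: `struct demo_nnp_task`,
the bit index and the accessor names are hypothetical, and C11 atomics stand in
for the kernel's bit operations.

```c
#include <stdatomic.h>

#define DEMO_PFA_NO_NEW_PRIVS 0   /* hypothetical bit index within the flag word */

/* Hypothetical task with a flag word that is only touched atomically. */
struct demo_nnp_task {
    atomic_ulong atomic_flags;
};

/* Accessor: read the no_new_privs bit. */
static int demo_task_no_new_privs(struct demo_nnp_task *t)
{
    unsigned long flags = atomic_load(&t->atomic_flags);
    return (int)((flags >> DEMO_PFA_NO_NEW_PRIVS) & 1UL);
}

/* Accessor: set the no_new_privs bit without disturbing other bits. */
static void demo_task_set_no_new_privs(struct demo_nnp_task *t)
{
    atomic_fetch_or(&t->atomic_flags, 1UL << DEMO_PFA_NO_NEW_PRIVS);
}
```

Because the set is a single atomic read-modify-write, another thread can set
the flag on a task's behalf without racing against updates to neighbouring
bits.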

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:38 +0800
48dc92b9f seccomp: add "seccomp" syscall

This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

In addition to the TSYNC flag later in this patch series, there is a
non-zero chance that this syscall could be used for configuring a fixed
argument area for seccomp-tracer-aware processes to pass syscall arguments
in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.
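
From user space, the three-argument form might be exercised roughly like this.
It is a minimal sketch, assuming kernel headers that define SYS_seccomp and
SECCOMP_SET_MODE_FILTER; the `demo_*` names are hypothetical, and the
allow-everything filter exists only to keep the example short.

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* One-instruction classic BPF program that allows every syscall. */
static struct sock_filter demo_allow_all[] = {
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog demo_prog = {
    .len = sizeof(demo_allow_all) / sizeof(demo_allow_all[0]),
    .filter = demo_allow_all,
};

/* Hypothetical thin wrapper: operation, flags, user pointer. */
static long demo_seccomp(unsigned int op, unsigned int flags, void *args)
{
    return syscall(SYS_seccomp, op, flags, args);
}

/* Set no_new_privs, then install the filter with the given flags. */
static long demo_install(unsigned int flags)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return demo_seccomp(SECCOMP_SET_MODE_FILTER, flags, &demo_prog);
}
```

Calling `demo_install(0)` corresponds to the flags-must-be-0 case described
above. Because the operation and flags travel in registers rather than behind
the user pointer, a seccomp filter can inspect them directly.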

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:37 +0800
3b23dd128 seccomp: split mode setting routines

Separates the two mode setting paths to make things more readable with
fewer #ifdefs within function bodies.

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:37 +0800
1f41b4504 seccomp: extract check/assign mode helpers

To support splitting mode 1 from mode 2, extract the mode checking and
assignment logic into common functions.

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:36 +0800
d78ab02c2 seccomp: create internal mode-setting function

In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook
Reviewed-by: Oleg Nesterov
Reviewed-by: Andy Lutomirski

Kees Cook
2014-07-19 03:13:36 +0800