21 Dec, 2012

1 commit

  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
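
    For orientation, the user-visible side of what this series consolidates
    is sigaltstack(2) and its SS_ONSTACK/SS_DISABLE flags. The following is
    a minimal userspace sketch, not kernel code; the helper names
    install_altstack(), altstack_installed() and on_altstack() are made up
    for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Register [mem, mem + size) as the alternate signal stack. Returns 0 on success. */
static int install_altstack(void *mem, size_t size)
{
    stack_t ss;

    assert(mem != NULL);
    ss.ss_sp = mem;
    ss.ss_size = size;
    ss.ss_flags = 0;
    return sigaltstack(&ss, NULL);
}

/* 1 if an alternate stack is installed (SS_DISABLE clear), 0 if not, -1 on error. */
static int altstack_installed(void)
{
    stack_t cur;

    if (sigaltstack(NULL, &cur) != 0)
        return -1;
    return (cur.ss_flags & SS_DISABLE) ? 0 : 1;
}

/* 1 if the caller is currently running on the alternate stack, 0 if not, -1 on error. */
static int on_altstack(void)
{
    stack_t cur;

    if (sigaltstack(NULL, &cur) != 0)
        return -1;
    return (cur.ss_flags & SS_ONSTACK) ? 1 : 0;
}
```

    Outside a signal handler registered with SA_ONSTACK, on_altstack() is
    expected to report 0 even after install_altstack() succeeds.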
     

20 Dec, 2012

2 commits


19 Dec, 2012

1 commit

  • Pull preparatory GCC intrinsics bswap patch from David Woodhouse:
    "This single patch is effectively a no-op for now. It enables
    architectures to opt in to using GCC's __builtin_bswapXX() intrinsics
    for byteswapping, and if we merge this now then the architecture
    maintainers can enable it for their arch during the next cycle without
    dependency issues.

    It's worth making it a per-arch opt-in, because although in *theory*
    the compiler should never do worse than hand-coded assembler (and of
    course it also ought to do a lot better on platforms like Atom and
    PowerPC which have load-and-swap or store-and-swap instructions), that
    isn't always the case. See

    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46453

    for example."

    * tag 'byteswap-for-linus-20121219' of git://git.infradead.org/users/dwmw2/byteswap:
    byteorder: allow arch to opt to use GCC intrinsics for byteswapping

    Linus Torvalds
     

18 Dec, 2012

1 commit

  • Currently only block_dev and uprobes use percpu_rw_semaphore,
    add the config option selected by BLOCK || UPROBES.

    Signed-off-by: Oleg Nesterov
    Cc: Anton Arapov
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Michal Marek
    Cc: Mikulas Patocka
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

13 Dec, 2012

1 commit

  • Pull big execve/kernel_thread/fork unification series from Al Viro:
    "All architectures are converted to new model. Quite a bit of that
    stuff is actually shared with architecture trees; in such cases it's
    literally shared branch pulled by both, not a cherry-pick.

    A lot of ugliness and black magic is gone (-3KLoC total in this one):

    - kernel_thread()/kernel_execve()/sys_execve() redesign.

    We don't do syscalls from kernel anymore for either kernel_thread()
    or kernel_execve():

    kernel_thread() is essentially clone(2) with callback run before we
    return to userland, the callbacks either never return or do
    successful do_execve() before returning.

    kernel_execve() is a wrapper for do_execve() - it doesn't need to
    do transition to user mode anymore.

    As a result kernel_thread() and kernel_execve() are
    arch-independent now - they live in kernel/fork.c and fs/exec.c
    resp. sys_execve() is also in fs/exec.c and it's completely
    architecture-independent.

    - daemonize() is gone, along with its parts in fs/*.c

    - struct pt_regs * is no longer passed to do_fork/copy_process/
    copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

    - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
    still need wrappers (ones with callee-saved registers not saved in
    pt_regs on syscall entry), but the main part of those suckers is in
    kernel/fork.c now."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
    do_coredump(): get rid of pt_regs argument
    print_fatal_signal(): get rid of pt_regs argument
    ptrace_signal(): get rid of unused arguments
    get rid of ptrace_signal_deliver() arguments
    new helper: signal_pt_regs()
    unify default ptrace_signal_deliver
    flagday: kill pt_regs argument of do_fork()
    death to idle_regs()
    don't pass regs to copy_process()
    flagday: don't pass regs to copy_thread()
    bfin: switch to generic vfork, get rid of pointless wrappers
    xtensa: switch to generic clone()
    openrisc: switch to use of generic fork and clone
    unicore32: switch to generic clone(2)
    score: switch to generic fork/vfork/clone
    c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
    take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
    mn10300: switch to generic fork/vfork/clone
    h8300: switch to generic fork/vfork/clone
    tile: switch to generic clone()
    ...

    Conflicts:
    arch/microblaze/include/asm/Kbuild

    Linus Torvalds
     

06 Dec, 2012

1 commit

  • Since GCC 4.4, there have been __builtin_bswap32() and __builtin_bswap64()
    intrinsics. A __builtin_bswap16() came a little later (4.6 for PowerPC,
    4.8 for other platforms).

    By using these instead of the inline assembler that most architectures
    have in their __arch_swabXX() macros, we let the compiler see what's
    actually happening. The resulting code should be at least as good, and
    much *better* in the cases where it can be combined with a nearby load
    or store, using a load-and-byteswap or store-and-byteswap instruction
    (e.g. lwbrx/stwbrx on PowerPC, movbe on Atom).

    When GCC is sufficiently recent *and* the architecture opts in to using
    the intrinsics by setting CONFIG_ARCH_USE_BUILTIN_BSWAP, they will be
    used in preference to the __arch_swabXX() macros. An architecture which
    does not set ARCH_USE_BUILTIN_BSWAP will continue to use its own
    hand-crafted macros.

    Signed-off-by: David Woodhouse
    Acked-by: H. Peter Anvin

    David Woodhouse
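
    The opt-in described above can be sketched in plain C. This is an
    illustration under assumptions, not the kernel's header:
    USE_BUILTIN_BSWAP stands in for CONFIG_ARCH_USE_BUILTIN_BSWAP, and
    my_swab32(), my_swab64(), swab32_generic() and load_swapped32() are
    hypothetical names. Only __builtin_bswap32() and __builtin_bswap64()
    are real GCC intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-arch opt-in switch. */
#define USE_BUILTIN_BSWAP 1

/* Shift-and-mask swap, playing the role of a hand-written __arch_swab32(). */
static inline uint32_t swab32_generic(uint32_t x)
{
    return ((x & 0x000000ffu) << 24) |
           ((x & 0x0000ff00u) <<  8) |
           ((x & 0x00ff0000u) >>  8) |
           ((x & 0xff000000u) >> 24);
}

static inline uint32_t my_swab32(uint32_t x)
{
#if USE_BUILTIN_BSWAP && defined(__GNUC__)
    return __builtin_bswap32(x);   /* the compiler sees a byte swap */
#else
    return swab32_generic(x);
#endif
}

static inline uint64_t my_swab64(uint64_t x)
{
#if USE_BUILTIN_BSWAP && defined(__GNUC__)
    return __builtin_bswap64(x);
#else
    /* Swap each half, then exchange the halves. */
    return ((uint64_t)swab32_generic((uint32_t)x) << 32) |
           swab32_generic((uint32_t)(x >> 32));
#endif
}

/*
 * A load followed by a swap: with the builtin, the compiler is free to
 * fuse the two into one load-and-byteswap instruction where the target
 * has one.
 */
static inline uint32_t load_swapped32(const uint32_t *p)
{
    assert(p != NULL);
    return my_swab32(*p);
}
```

    Whether the fused instruction is actually emitted depends on the
    compiler version and target flags, which is the reason the commit keeps
    this a per-architecture choice.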
     

01 Dec, 2012

1 commit

  • Create a new subsystem that probes on kernel boundaries
    to keep track of the transitions between level contexts
    with two basic initial contexts: user or kernel.

    This is an abstraction of some RCU code that uses such tracking
    to implement its userspace extended quiescent state.

    We need to pull this up from RCU into this new level of indirection
    because this tracking is also going to be used to implement an "on
    demand" generic virtual cputime accounting. This is a necessary step
    to shut down the tick while still accounting the cputime.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Li Zhong
    Cc: Gilad Ben-Yossef
    Reviewed-by: Steven Rostedt
    [ paulmck: fix whitespace error and email address. ]
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
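
    The idea of tracking user/kernel transitions can be pictured with a toy
    state machine. This is only a sketch under assumptions: struct toy_cpu,
    toy_user_enter() and toy_user_exit() are invented for illustration and
    do not reproduce the subsystem's real data structures or hooks.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: names are illustrative, not the kernel's definitions. */
enum toy_ctx { TOY_IN_KERNEL, TOY_IN_USER };

struct toy_cpu {
    enum toy_ctx ctx;        /* which side of the boundary this CPU is on */
    int rcu_eqs;             /* 1 while RCU may treat the CPU as quiescent */
    unsigned long crossings; /* boundary crossings, e.g. for a cputime consumer */
};

/* Probe called on the kernel -> user boundary. */
static void toy_user_enter(struct toy_cpu *c)
{
    assert(c != NULL);
    if (c->ctx == TOY_IN_USER)
        return;              /* already accounted, nothing to do */
    c->ctx = TOY_IN_USER;
    c->rcu_eqs = 1;
    c->crossings++;
}

/* Probe called on the user -> kernel boundary. */
static void toy_user_exit(struct toy_cpu *c)
{
    assert(c != NULL);
    if (c->ctx == TOY_IN_KERNEL)
        return;
    c->ctx = TOY_IN_KERNEL;
    c->rcu_eqs = 0;
    c->crossings++;
}
```

    Keeping the state in one place is what lets more than one consumer
    (RCU today, cputime accounting later, per the message above) hang off
    the same two probes.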
     

29 Nov, 2012

1 commit

  • ... and get rid of idiotic struct pt_regs * in asm-generic/syscalls.h
    prototypes of the same, while we are at it. Eventually we want those
    in linux/syscalls.h, of course, but that'll have to wait a bit.

    Note that there are *three* variants of sys_clone() order of arguments.
    Braindamage galore...

    Signed-off-by: Al Viro

    Al Viro
     

15 Oct, 2012

1 commit

  • Pull module signing support from Rusty Russell:
    "module signing is the highlight, but it's an all-over David Howells frenzy..."

    Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.

    * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
    X.509: Fix indefinite length element skip error handling
    X.509: Convert some printk calls to pr_devel
    asymmetric keys: fix printk format warning
    MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
    MODSIGN: Make mrproper should remove generated files.
    MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
    MODSIGN: Use the same digest for the autogen key sig as for the module sig
    MODSIGN: Sign modules during the build process
    MODSIGN: Provide a script for generating a key ID from an X.509 cert
    MODSIGN: Implement module signature checking
    MODSIGN: Provide module signing public keys to the kernel
    MODSIGN: Automatically generate module signing keys if missing
    MODSIGN: Provide Kconfig options
    MODSIGN: Provide gitignore and make clean rules for extra files
    MODSIGN: Add FIPS policy
    module: signature checking hook
    X.509: Add a crypto key parser for binary (DER) X.509 certificates
    MPILIB: Provide a function to read raw data into an MPI
    X.509: Add an ASN.1 decoder
    X.509: Add simple ASN.1 grammar compiler
    ...

    Linus Torvalds
     

13 Oct, 2012

2 commits

  • Pull third pile of kernel_execve() patches from Al Viro:
    "The last bits of infrastructure for kernel_thread() et al., with
    alpha/arm/x86 use of those. Plus sanitizing the asm glue and
    do_notify_resume() on alpha, fixing the "disabled irq while running
    task_work stuff" breakage there.

    At that point the rest of kernel_thread/kernel_execve/sys_execve work
    can be done independently for different architectures. The only
    pending bits that do depend on having all architectures converted are
    restricted to fs/* and kernel/* - that'll obviously have to wait for
    the next cycle.

    I thought we'd have to wait for all of them done before we start
    eliminating the longjump-style insanity in kernel_execve(), but it
    turned out there's a very simple way to do that without flagday-style
    changes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to saner kernel_execve() semantics
    arm: switch to saner kernel_execve() semantics
    x86, um: convert to saner kernel_execve() semantics
    infrastructure for saner ret_from_kernel_thread semantics
    make sure that kernel_thread() callbacks call do_exit() themselves
    make sure that we always have a return path from kernel_execve()
    ppc: eeh_event should just use kthread_run()
    don't bother with kernel_thread/kernel_execve for launching linuxrc
    alpha: get rid of switch_stack argument of do_work_pending()
    alpha: don't bother passing switch_stack separately from regs
    alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
    alpha: simplify TIF_NEED_RESCHED handling

    Linus Torvalds
     
  • * allow kernel_execve() to leave the actual return to userland to
    the caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers
    updated accordingly.
    * an architecture that selects GENERIC_KERNEL_EXECVE in its
    Kconfig should have its ret_from_kernel_thread() do this:
        - call schedule_tail
        - call the callback left for it by copy_thread(); if it ever
          returns, that's because it has just done a successful kernel_execve()
        - jump to return from syscall
    IOW, its only difference from ret_from_fork() is that it does call the
    callback.
    * such an architecture should also get rid of ret_from_kernel_execve()
    and __ARCH_WANT_KERNEL_EXECVE

    This is the last part of infrastructure patches in that area - from
    that point on work on different architectures can live independently.

    Signed-off-by: Al Viro

    Al Viro
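
    The ordering required of ret_from_kernel_thread() can be modelled in
    userspace C. This is a toy sketch under assumptions, since the real
    routine is per-architecture assembly: every toy_* name below is
    invented, and the "steps" are just entries recorded in a trace array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model; nothing here is the kernel's real code. */
enum toy_step { TOY_SCHEDULE_TAIL, TOY_CALLBACK, TOY_RET_FROM_SYSCALL };

struct toy_task {
    int (*fn)(void *);       /* callback left by copy_thread() */
    void *arg;
    enum toy_step trace[4];  /* steps taken, in order */
    size_t ntrace;
};

static void toy_record(struct toy_task *t, enum toy_step s)
{
    assert(t->ntrace < sizeof(t->trace) / sizeof(t->trace[0]));
    t->trace[t->ntrace++] = s;
}

/* Path taken by an ordinary fork()/clone() child. */
static void toy_ret_from_fork(struct toy_task *t)
{
    toy_record(t, TOY_SCHEDULE_TAIL);
    toy_record(t, TOY_RET_FROM_SYSCALL);
}

/* Path taken by a kernel thread: the same, plus the callback in the middle. */
static void toy_ret_from_kernel_thread(struct toy_task *t)
{
    toy_record(t, TOY_SCHEDULE_TAIL);
    toy_record(t, TOY_CALLBACK);
    t->fn(t->arg);           /* returning models a successful kernel_execve() */
    toy_record(t, TOY_RET_FROM_SYSCALL);
}

/* Example callback that pretends its kernel_execve() succeeded. */
static int toy_init_fn(void *arg)
{
    *(int *)arg = 1;
    return 0;
}
```

    In the model, a callback that never returns (a pure kernel thread)
    would simply never reach the TOY_RET_FROM_SYSCALL step.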
     

10 Oct, 2012

1 commit

  • Pull generic execve() changes from Al Viro:
    "This introduces the generic kernel_thread() and kernel_execve()
    functions, and switches x86, arm, alpha, um and s390 over to them."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits)
    s390: convert to generic kernel_execve()
    s390: switch to generic kernel_thread()
    s390: fold kernel_thread_helper() into ret_from_fork()
    s390: fold execve_tail() into start_thread(), convert to generic sys_execve()
    um: switch to generic kernel_thread()
    x86, um/x86: switch to generic sys_execve and kernel_execve
    x86: split ret_from_fork
    alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    alpha: switch to generic kernel_thread()
    alpha: switch to generic sys_execve()
    arm: get rid of execve wrapper, switch to generic execve() implementation
    arm: optimized current_pt_regs()
    arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
    arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk]
    generic sys_execve()
    generic kernel_execve()
    new helper: current_pt_regs()
    preparation for generic kernel_thread()
    um: kill thread->forking
    um: let signal_delivered() do SIGTRAP on singlestepping into handler
    ...

    Linus Torvalds
     

09 Oct, 2012

1 commit

  • Cleanup patch in preparation for transparent hugepage support on s390.
    Adding new architectures to the TRANSPARENT_HUGEPAGE config option can
    make the "depends" line rather ugly, like "depends on (X86 || (S390 &&
    64BIT)) && MMU".

    This patch adds a HAVE_ARCH_TRANSPARENT_HUGEPAGE instead. x86 already has
    MMU "def_bool y", so the MMU check is superfluous there and
    HAVE_ARCH_TRANSPARENT_HUGEPAGE can be selected in arch/x86/Kconfig.

    Signed-off-by: Gerald Schaefer
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

02 Oct, 2012

2 commits

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     
  • Pull perf update from Ingo Molnar:
    "Lots of changes in this cycle as well, with hundreds of commits from
    over 30 contributors. Most of the activity was on the tooling side.

    Higher level changes:

    - New 'perf kvm' analysis tool, from Xiao Guangrong.

    - New 'perf trace' system-wide tracing tool

    - uprobes fixes + cleanups from Oleg Nesterov.

    - Lots of patches to make perf build on Android out of box, from
    Irina Tirdea

    - Extend ftrace function tracing utility to be more dynamic for its
    users. It allows for data passing to the callback functions, as
    well as reading regs as if a breakpoint were to trigger at function
    entry.

    The main goal of this patch series was to allow kprobes to use
    ftrace as an optimized probe point when a probe is placed on an
    ftrace nop. With lots of help from Masami Hiramatsu, and going
    through lots of iterations, we finally came up with a good
    solution.

    - Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng.

    - Various tracing updates from Steve Rostedt

    - Clean up and improve 'perf sched' performance by eliminating lots
    of needless calls to libtraceevent.

    - Event group parsing support, from Jiri Olsa

    - UI/gtk refactorings and improvements from Namhyung Kim

    - Add support for non-tracepoint events in perf script python, from
    Feng Tang

    - Add --symbols to 'script', similar to the one in 'report', from
    Feng Tang.

    Infrastructure enhancements and fixes:

    - Convert the trace builtins to use the growing evsel/evlist
    tracepoint infrastructure, removing several open coded constructs
    like switch like series of strcmp to dispatch events, etc.
    Basically what had already been showcased in 'perf sched'.

    - Add evsel constructor for tracepoints, that uses libtraceevent just
    to parse the /format events file, use it in a new 'perf test' to
    make sure the libtraceevent format parsing regressions can be more
    readily caught.

    - Some strange errors were happening in some builds, but not on the
    next, reported by several people. The problem was that some parser
    related files, generated during the build, didn't have proper make
    deps, fix from Eric Sandeen.

    - Introduce struct and cache information about the environment where
    a perf.data file was captured, from Namhyung Kim.

    - Fix handling of unresolved samples when --symbols is used in
    'report', from Feng Tang.

    - Add union member access support to 'probe', from Hyeoncheol Lee.

    - Fixups to die() removal, from Namhyung Kim.

    - Render fixes for the TUI, from Namhyung Kim.

    - Don't enable annotation in non symbolic view, from Namhyung Kim.

    - Fix pipe mode in 'report', from Namhyung Kim.

    - Move related stats code from stat to util/, will be used by the
    'stat' kvm tool, from Xiao Guangrong.

    - Remove die()/exit() calls from several tools.

    - Resolve vdso callchains, from Jiri Olsa

    - Don't pass const char pointers to basename, so that we can
    unconditionally use libgen.h and thus avoid ifdef BIONIC lines,
    from David Ahern

    - Refactor hist formatting so that it can be reused with the GTK
    browser, From Namhyung Kim

    - Fix build for another rbtree.c change, from Adrian Hunter.

    - Make 'perf diff' command work with evsel hists, from Jiri Olsa.

    - Use the only field_sep var that is set up: symbol_conf.field_sep,
    fix from Jiri Olsa.

    - .gitignore compiled python binaries, from Namhyung Kim.

    - Get rid of die() in more libtraceevent places, from Namhyung Kim.

    - Rename libtraceevent 'private' struct member to 'priv' so that it
    works in C++, from Steven Rostedt

    - Remove lots of exit()/die() calls from tools so that the main perf
    exit routine can take place, from David Ahern

    - Fix x86 build on x86-64, from David Ahern.

    - {int,str,rb}list fixes from Suzuki K Poulose

    - perf.data header fixes from Namhyung Kim

    - Allow user to indicate objdump path, needed in cross environments,
    from Maciek Borzecki

    - Fix hardware cache event name generation, fix from Jiri Olsa

    - Add round trip test for sw, hw and cache event names, catching the
    problem Jiri fixed, after Jiri's patch, the test passes
    successfully.

    - Clean target should do clean for lib/traceevent too, fix from David
    Ahern

    - Check the right variable for allocation failure, fix from Namhyung
    Kim

    - Set up evsel->tp_format regardless of evsel->name being set
    already, fix from Namhyung Kim

    - Oprofile fixes from Robert Richter.

    - Remove perf_event_attr needless version inflation, from Jiri Olsa

    - Introduce libtraceevent strerror like error reporting facility,
    from Namhyung Kim

    - Add pmu mappings to perf.data header and use event names from cmd
    line, from Robert Richter

    - Fix include order for bison/flex-generated C files, from Ben
    Hutchings

    - Build fixes and documentation corrections from David Ahern

    - Assorted cleanups from Robert Richter

    - Let O= makes handle relative paths, from Steven Rostedt

    - perf script python fixes, from Feng Tang.

    - Initial bash completion support, from Frederic Weisbecker

    - Allow building without libelf, from Namhyung Kim.

    - Support DWARF CFI based unwind to have callchains when %bp based
    unwinding is not possible, from Jiri Olsa.

    - Symbol resolution fixes. While fixing support for PPC64 files with
    an .opd ELF section was the end goal, several fixes for code that
    handles all architectures and cleanups are included, from Cody
    Schafer.

    - Assorted fixes for Documentation and build in 32 bit, from Robert
    Richter

    - Cache the libtraceevent event_format associated to each evsel
    early, so that we avoid relookups, i.e. calling pevent_find_event
    repeatedly when processing tracepoint events.

    [ This is to reduce the surface contact with libtraceevent and
    make clear what it is that the perf tools need from that lib: so
    far parsing the common and per event fields. ]

    - Don't stop the build if the audit libraries are not installed, fix
    from Namhyung Kim.

    - Fix bfd.h/libbfd detection with recent binutils, from Markus
    Trippelsdorf.

    - Improve warning message when libunwind devel packages not present,
    from Jiri Olsa"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
    perf trace: Add aliases for some syscalls
    perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables
    perf tools: Check libaudit availability for perf-trace builtin
    perf hists: Add missing period_* fields when collapsing a hist entry
    perf trace: New tool
    perf evsel: Export the event_format constructor
    perf evsel: Introduce rawptr() method
    perf tools: Use perf_evsel__newtp in the event parser
    perf evsel: The tracepoint constructor should store sys:name
    perf evlist: Introduce set_filter() method
    perf evlist: Renane set_filters method to apply_filters
    perf test: Add test to check we correctly parse and match syscall open parms
    perf evsel: Handle endianity in intval method
    perf evsel: Know if byte swap is needed
    perf tools: Allow handling a NULL cpu_map as meaning "all cpus"
    perf evsel: Improve tracepoint constructor setup
    tools lib traceevent: Fix error path on pevent_parse_event
    perf test: Fix build failure
    trace: Move trace event enable from fs_initcall to core_initcall
    tracing: Add an option for disabling markers
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Let architectures select GENERIC_KERNEL_THREAD and have their copy_thread()
    treat NULL regs as "it came from kernel_thread(), sp argument contains
    the function new thread will be calling and stack_size - the argument for
    that function". Switching the architectures begins shortly...

    Signed-off-by: Al Viro

    Al Viro
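
    The argument convention described above (NULL regs means the sp and
    stack_size arguments are reused to carry a function and its argument)
    can be sketched as a toy. Everything here is an assumption for
    illustration: struct toy_regs, struct toy_thread, toy_copy_thread() and
    toy_worker() are invented, uintptr_t replaces the kernel's unsigned
    long, and the non-NULL branch is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins; not the kernel's struct pt_regs or thread structures. */
struct toy_regs { uintptr_t sp; };

struct toy_thread {
    int is_kernel_thread;
    int (*fn)(void *);   /* what a kernel thread will call */
    void *fn_arg;        /* argument for fn */
    uintptr_t user_sp;   /* user stack pointer for an ordinary child */
};

static void toy_copy_thread(struct toy_thread *t, uintptr_t sp,
                            uintptr_t stack_size, const struct toy_regs *regs)
{
    assert(t != NULL);
    if (regs == NULL) {
        /* Came from kernel_thread(): the two integers carry fn and its argument. */
        t->is_kernel_thread = 1;
        t->fn = (int (*)(void *))sp;
        t->fn_arg = (void *)stack_size;
        t->user_sp = 0;
    } else {
        /* Ordinary fork/clone: sp really is a stack pointer (0 = inherit). */
        t->is_kernel_thread = 0;
        t->fn = NULL;
        t->fn_arg = NULL;
        t->user_sp = sp ? sp : regs->sp;
    }
}

/* Example thread function used to exercise the kernel-thread branch. */
static int toy_worker(void *arg)
{
    return *(int *)arg + 1;
}
```

    Round-tripping a function pointer through an integer is not guaranteed
    by ISO C; the sketch relies on it working on common POSIX targets.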
     

28 Sep, 2012

1 commit

  • Move the mapping of Elf_[SPE]hdr, Elf_Addr, Elf_Sym, Elf_Dyn, Elf_Rel/Rela,
    ELF_R_TYPE() and ELF_R_SYM() to either the 32-bit version or the 64-bit version
    into asm-generic/module.h for all arches bar MIPS.

    Also, use the generic definition mod_arch_specific where possible.

    To this end, I've defined three new config bools:

    (*) HAVE_MOD_ARCH_SPECIFIC

    Arches define this if they don't want to use the empty generic
    mod_arch_specific struct.

    (*) MODULES_USE_ELF_RELA

    Arches define this if their modules can contain RELA records. This causes
    the Elf_Rela mapping to be emitted and allows apply_relocate_add() to be
    defined by the arch rather than have the core emit an error message.

    (*) MODULES_USE_ELF_REL

    Arches define this if their modules can contain REL records. This causes
    the Elf_Rel mapping to be emitted and allows apply_relocate() to be
    defined by the arch rather than have the core emit an error message.

    Note that it is possible to allow both REL and RELA records: m68k and mips are
    two arches that do this.

    With this, some arch asm/module.h files can be deleted entirely and replaced
    with a generic-y marker in the arch Kbuild file.

    Additionally, I have removed the bits from m32r and score that handle the
    unsupported type of relocation record as that's now handled centrally.

    Signed-off-by: David Howells
    Acked-by: Sam Ravnborg
    Signed-off-by: Rusty Russell

    David Howells
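
    The kind of width mapping this commit centralises can be illustrated in
    userspace with the C library's <elf.h>. This is a sketch under
    assumptions, not the contents of asm-generic/module.h: the pointer-width
    test, the ELF_R_INFO() alias and the two rela_*() helpers are additions
    for the example.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the 32-bit or 64-bit ELF definitions once, by native pointer width. */
#if UINTPTR_MAX > 0xffffffffu
#define Elf_Ehdr        Elf64_Ehdr
#define Elf_Sym         Elf64_Sym
#define Elf_Rela        Elf64_Rela
#define ELF_R_SYM(i)    ELF64_R_SYM(i)
#define ELF_R_TYPE(i)   ELF64_R_TYPE(i)
#define ELF_R_INFO(s, t) ELF64_R_INFO(s, t)
#else
#define Elf_Ehdr        Elf32_Ehdr
#define Elf_Sym         Elf32_Sym
#define Elf_Rela        Elf32_Rela
#define ELF_R_SYM(i)    ELF32_R_SYM(i)
#define ELF_R_TYPE(i)   ELF32_R_TYPE(i)
#define ELF_R_INFO(s, t) ELF32_R_INFO(s, t)
#endif

/* Width-independent helpers: callers never mention 32 or 64. */
static unsigned long rela_symbol_index(const Elf_Rela *r)
{
    assert(r != NULL);
    return (unsigned long)ELF_R_SYM(r->r_info);
}

static unsigned long rela_type(const Elf_Rela *r)
{
    assert(r != NULL);
    return (unsigned long)ELF_R_TYPE(r->r_info);
}
```

    Code written against the unsuffixed names compiles unchanged for either
    width, which is what allows per-arch headers that only repeated this
    mapping to be dropped.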
     

26 Sep, 2012

1 commit

  • Create a new config option under the RCU menu that puts
    CPUs under RCU extended quiescent state (as in dynticks
    idle mode) when they run in userspace. This requires
    some contribution from architectures to hook into kernel
    and userspace boundaries.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     

25 Sep, 2012

1 commit

  • There is no known reason for this option to be unavailable on
    archs other than x86. They just need to call enable_sched_clock_irqtime()
    if they have a sufficiently finegrained clock to make it work.

    Move it to the general option and let the user choose between
    it and pure tick based or virtual cputime accounting.

    Note that virtual cputime accounting already performs a finegrained
    irqtime accounting. CONFIG_IRQ_TIME_ACCOUNTING is a kind of middle ground
    between tick and virtual based accounting. So CONFIG_IRQ_TIME_ACCOUNTING
    and CONFIG_VIRT_CPU_ACCOUNTING are mutually exclusive choices.

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

17 Aug, 2012

1 commit

  • S390, ia64 and powerpc all define their own version
    of CONFIG_VIRT_CPU_ACCOUNTING. Generalize the config
    and its description to a single place to avoid
    duplication.

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

10 Aug, 2012

2 commits

  • Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
    of the user level stack on sample. The size of the dump is specified by
    sample_stack_user value.

    Being able to dump parts of the user stack, starting from the stack
    pointer, will be useful to make a post mortem dwarf CFI based stack
    unwinding.

    Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
    architecture provides user stack dump on perf event samples. This needs
    access to the user stack pointer which is not unified across
    architectures. Enabling this for x86 architecture.

    Signed-off-by: Jiri Olsa
    Original-patch-by: Frederic Weisbecker
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
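
    The "dump from the stack pointer, up to a requested size" behaviour can
    be pictured with a small userspace toy. This is a sketch under
    assumptions: toy_dump_user_stack() and its clamping to the end of the
    stack region are invented for illustration and say nothing about how
    the kernel actually copies user memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy up to sample_stack_user bytes starting at sp into out, without
 * reading past stack_top. Returns the number of bytes copied.
 */
static size_t toy_dump_user_stack(const uint8_t *sp, const uint8_t *stack_top,
                                  size_t sample_stack_user, uint8_t *out)
{
    size_t avail, n;

    assert(sp <= stack_top);
    avail = (size_t)(stack_top - sp);
    n = sample_stack_user < avail ? sample_stack_user : avail;
    memcpy(out, sp, n);
    return n;
}
```

    A post-mortem unwinder would then walk the copied bytes together with
    the sampled register values instead of the live process.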
     
  • This brings a new API to help the selective dump of registers on event
    sampling, and its implementation for x86 arch.

    Added HAVE_PERF_REGS config option to determine if the architecture
    provides perf registers ABI.

    The information about desired registers will be passed in u64 mask.
    It's up to the architecture to map the registers into the mask bits.

    For the x86 arch implementation, both 32 and 64 bit registers bits are
    defined within single enum to ensure 64 bit system can provide register
    dump for compat task if needed in the future.

    Original-patch-by: Frederic Weisbecker
    [ Added missing linux/errno.h include ]
    Signed-off-by: Jiri Olsa
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-2-git-send-email-jolsa@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
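
    Selecting registers through a u64 mask can be sketched as follows. The
    register numbering here is hypothetical (the commit leaves the mapping
    to each architecture), and toy_validate_mask() and toy_sample_regs()
    are illustrative names, not the API the patch adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register numbering; a real arch defines its own enum. */
enum toy_reg { TOY_REG_AX, TOY_REG_BX, TOY_REG_SP, TOY_REG_IP, TOY_REG_MAX };

/* Reject masks that name registers this toy architecture does not have. */
static int toy_validate_mask(uint64_t mask)
{
    return (mask >> TOY_REG_MAX) ? -1 : 0;
}

/*
 * Copy the registers selected by mask into out, lowest bit first.
 * Returns how many values were written.
 */
static size_t toy_sample_regs(const uint64_t regs[TOY_REG_MAX], uint64_t mask,
                              uint64_t *out)
{
    size_t n = 0;
    unsigned int bit;

    assert(out != NULL);
    for (bit = 0; bit < TOY_REG_MAX; bit++) {
        if (mask & (UINT64_C(1) << bit))
            out[n++] = regs[bit];
    }
    return n;
}
```

    Because values are emitted in bit order, a consumer holding the same
    mask can tell which register each slot of the dump belongs to.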
     

31 Jul, 2012

1 commit

  • Rather than #define the options manually in the architecture code, add
    Kconfig options for them and select them there instead. This also allows
    us to select the compat IPC version parsing automatically for platforms
    using the old compat IPC interface.

    Reported-by: Andrew Morton
    Signed-off-by: Will Deacon
    Cc: Arnd Bergmann
    Cc: Chris Metcalf
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     

26 May, 2012

1 commit

  • Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
    "These patches contain two major updates for DMA mapping subsystem
    (mainly for ARM architecture). First one is Contiguous Memory
    Allocator (CMA) which makes it possible for device drivers to allocate
    big contiguous chunks of memory after the system has booted.

    The main difference from the similar frameworks is the fact that CMA
    allows to transparently reuse the memory region reserved for the big
    chunk allocation as a system memory, so no memory is wasted when no
    big chunk is allocated. Once the alloc request is issued, the
    framework migrates system pages to create space for the required big
    chunk of physically contiguous memory.

    For more information one can refer to nice LWN articles:

    - 'A reworked contiguous memory allocator':
    http://lwn.net/Articles/447405/

    - 'CMA and ARM':
    http://lwn.net/Articles/450286/

    - 'A deep dive into CMA':
    http://lwn.net/Articles/486301/

    - and the following thread with the patches and links to all previous
    versions:
    https://lkml.org/lkml/2012/4/3/204

    The main client for this new framework is ARM DMA-mapping subsystem.

    The second part provides a complete redesign in ARM DMA-mapping
    subsystem. The core implementation has been changed to use common
    struct dma_map_ops based infrastructure with the recent updates for
    new dma attributes merged in v3.4-rc2. This allows using more than
    one implementation of dma-mapping calls and changing/selecting them on a
    per-struct-device basis. The first client of this new infrastructure is
    dmabounce implementation which has been completely cut out of the
    core, common code.

    The last patch of this redesign update introduces a new, experimental
    implementation of dma-mapping calls on top of generic IOMMU framework.
    This lets ARM sub-platforms transparently use an IOMMU for DMA-mapping
    calls if the required IOMMU hardware is provided.

    For more information please refer to the following thread:
    http://www.spinics.net/lists/arm-kernel/msg175729.html

    The last patch merges changes from both updates and provides a
    resolution for the conflicts which cannot be avoided when patches have
    been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."

    Acked by Andrew Morton:
    "Yup, this one please. It's had much work, plenty of review and I
    think even Russell is happy with it."

    * 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
    ARM: dma-mapping: use PMD size for section unmap
    cma: fix migration mode
    ARM: integrate CMA with DMA-mapping subsystem
    X86: integrate CMA with DMA-mapping subsystem
    drivers: add Contiguous Memory Allocator
    mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
    mm: extract reclaim code from __alloc_pages_direct_reclaim()
    mm: Serialize access to min_free_kbytes
    mm: page_isolation: MIGRATE_CMA isolation functions added
    mm: mmzone: MIGRATE_CMA migration type added
    mm: page_alloc: change fallbacks array handling
    mm: page_alloc: introduce alloc_contig_range()
    mm: compaction: export some of the functions
    mm: compaction: introduce isolate_freepages_range()
    mm: compaction: introduce map_pages()
    mm: compaction: introduce isolate_migratepages_range()
    mm: page_alloc: remove trailing whitespace
    ARM: dma-mapping: add support for IOMMU mapper
    ARM: dma-mapping: use alloc, mmap, free from dma_ops
    ARM: dma-mapping: remove redundant code and do the cleanup
    ...

    Conflicts:
    arch/x86/include/asm/dma-mapping.h

    Linus Torvalds
     

25 May, 2012

1 commit

  • Pull user-space probe instrumentation from Ingo Molnar:
    "The uprobes code originates from SystemTap and has been used for years
    in Fedora and RHEL kernels. This version is much rewritten, reviews
    from PeterZ, Oleg and myself shaped the end result.

    This tree includes uprobes support in 'perf probe' - but SystemTap
    (and other tools) can take advantage of user probe points as well.

    Sample usage of uprobes via perf, for example to profile malloc()
    calls without modifying user-space binaries.

    First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.

    If you don't know which function you want to probe you can pick one
    from 'perf top' or can get a list of all functions that can be probed
    within libc (binaries can be specified as well):

    $ perf probe -F -x /lib/libc.so.6

    To probe libc's malloc():

    $ perf probe -x /lib64/libc.so.6 malloc
    Added new event:
    probe_libc:malloc (on 0x7eac0)

    You can now use it in all perf tools, such as:

    perf record -e probe_libc:malloc -aR sleep 1

    Make use of it to create a call graph (as the flat profile is going to
    look very boring):

    $ perf record -e probe_libc:malloc -gR make
    [ perf record: Woken up 173 times to write data ]
    [ perf record: Captured and wrote 44.190 MB perf.data (~1930712

    $ perf report | less

    32.03% git libc-2.15.so [.] malloc
    |
    --- malloc

    29.49% cc1 libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--0.95%-- 0x208eb1000000000
    |
    |--0.63%-- htab_traverse_noresize

    11.04% as libc-2.15.so [.] malloc
    |
    --- malloc
    |

    7.15% ld libc-2.15.so [.] malloc
    |
    --- malloc
    |

    5.07% sh libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.99% python-config libc-2.15.so [.] malloc
    |
    --- malloc
    |
    4.54% make libc-2.15.so [.] malloc
    |
    --- malloc
    |
    |--7.34%-- glob
    | |
    | |--93.18%-- 0x41588f
    | |
    | --6.82%-- glob
    | 0x41588f

    ...

    Or:

    $ perf report -g flat | less

    # Overhead Command Shared Object Symbol
    # ........ ............. ............. ..........
    #
    32.03% git libc-2.15.so [.] malloc
    27.19%
    malloc

    29.49% cc1 libc-2.15.so [.] malloc
    24.77%
    malloc

    11.04% as libc-2.15.so [.] malloc
    11.02%
    malloc

    7.15% ld libc-2.15.so [.] malloc
    6.57%
    malloc

    ...

    The core uprobes design is fairly straightforward: uprobes probe
    points register themselves at (inode:offset) addresses of
    libraries/binaries, after which all existing (or new) vmas that map
    that address will have a software breakpoint injected at that address.
    vmas are COW-ed to preserve original content. The probe points are
    kept in an rbtree.

    If user-space executes the probed inode:offset instruction address
    then an event is generated which can be recovered from the regular
    perf event channels and mmap-ed ring-buffer.

    Multiple probes at the same address are supported, they create a
    dynamic callback list of event consumers.

    The basic model is further complicated by the XOL speedup: the
    original instruction that is probed is copied (in an architecture
    specific fashion) and executed out of line when the probe triggers.
    The XOL area is a single vma per process, with a fixed number of
    entries (which limits probe execution parallelism).

    The API: uprobes are installed/removed via
    /sys/kernel/debug/tracing/uprobe_events, the API is integrated to
    align with the kprobes interface as much as possible, but is separate
    from it.

    Injecting a probe point is a privileged operation, which can be relaxed
    by setting perf_paranoid to -1.

    You can use multiple probes as well and mix them with kprobes and
    regular PMU events or tracepoints, when instrumenting a task."

    Fix up trivial conflicts in mm/memory.c due to previous cleanup of
    unmap_single_vma().

    * 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf probe: Detect probe target when m/x options are absent
    perf probe: Provide perf interface for uprobes
    tracing: Fix kconfig warning due to a typo
    tracing: Provide trace events interface for uprobes
    tracing: Extract out common code for kprobes/uprobes trace events
    tracing: Modify is_delete, is_return from int to bool
    uprobes/core: Decrement uprobe count before the pages are unmapped
    uprobes/core: Make background page replacement logic account for rss_stat counters
    uprobes/core: Optimize probe hits with the help of a counter
    uprobes/core: Allocate XOL slots for uprobes use
    uprobes/core: Handle breakpoint and singlestep exceptions
    uprobes/core: Rename bkpt to swbp
    uprobes/core: Make order of function parameters consistent across functions
    uprobes/core: Make macro names consistent
    uprobes: Update copyright notices
    uprobes/core: Move insn to arch specific structure
    uprobes/core: Remove uprobe_opcode_sz
    uprobes/core: Make instruction tables volatile
    uprobes: Move to kernel/events/
    uprobes/core: Clean up, refactor and improve the code
    ...

    Linus Torvalds
     

22 May, 2012

1 commit

  • Pull security subsystem updates from James Morris:
    "New notable features:
    - The seccomp work from Will Drewry
    - PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
    - Longer security labels for Smack from Casey Schaufler
    - Additional ptrace restriction modes for Yama by Kees Cook"

    Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
    apparmor: fix long path failure due to disconnected path
    apparmor: fix profile lookup for unconfined
    ima: fix filename hint to reflect script interpreter name
    KEYS: Don't check for NULL key pointer in key_validate()
    Smack: allow for significantly longer Smack labels v4
    gfp flags for security_inode_alloc()?
    Smack: recursive tramsmute
    Yama: replace capable() with ns_capable()
    TOMOYO: Accept manager programs which do not start with / .
    KEYS: Add invalidation support
    KEYS: Do LRU discard in full keyrings
    KEYS: Permit in-place link replacement in keyring list
    KEYS: Perform RCU synchronisation on keys prior to key destruction
    KEYS: Announce key type (un)registration
    KEYS: Reorganise keys Makefile
    KEYS: Move the key config into security/keys/Kconfig
    KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
    Yama: remove an unused variable
    samples/seccomp: fix dependencies on arch macros
    Yama: add additional ptrace scopes
    ...

    Linus Torvalds
     

21 May, 2012

1 commit

  • The Contiguous Memory Allocator is a set of helper functions for the DMA
    mapping framework that improves allocations of contiguous memory chunks.

    CMA grabs memory on system boot, marks it with the MIGRATE_CMA migrate
    type and gives it back to the system. The kernel is allowed to allocate
    only movable pages within CMA's managed memory, so that it can be used,
    for example, for page cache when DMA mapping does not use it. On a
    dma_alloc_from_contiguous() request such pages are migrated out of the
    CMA area to free the required contiguous block and fulfill the request.
    This allows large contiguous chunks of memory to be allocated at any
    time, assuming that there is enough free memory available in the system.

    This code is heavily based on earlier works by Michal Nazarewicz.

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Signed-off-by: Michal Nazarewicz
    Acked-by: Arnd Bergmann
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     

08 May, 2012

2 commits

  • Commit f3f096cfe ("tracing: Provide trace events interface for
    uprobes") throws a warning about unmet dependencies.

    The exact warning message is:
    warning: (UPROBE_EVENT) selects UPROBES which has unmet direct dependencies (UPROBE_EVENTS && PERF_EVENTS)

    This is due to a typo in arch/Kconfig file. Fix similar typos in
    the uprobetracer documentation.

    Also add sample format of a uprobe event in the uprobetracer
    documentation as suggested by Masami Hiramatsu.

    Reported-by: Stephen Boyd
    Reported-by: Ingo Molnar
    Signed-off-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Ananth N Mavinakayanahalli
    Cc: Oleg Nesterov
    Cc: Christoph Hellwig
    Cc: Steven Rostedt
    Cc: Arnaldo Carvalho de Melo
    Cc: Masami Hiramatsu
    Cc: Anton Arapov
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120508111126.21004.38285.sendpatchset@srdronam.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • Replace __HAVE_ARCH_TASK_ALLOCATOR and __HAVE_ARCH_THREAD_ALLOCATOR
    with proper config switches.

    Signed-off-by: Thomas Gleixner
    Cc: Sam Ravnborg
    Cc: Tony Luck
    Link: http://lkml.kernel.org/r/20120505150142.371309416@linutronix.de

    Thomas Gleixner
     

07 May, 2012

1 commit

  • Implements trace_event support for uprobes. In its current form
    it can be used to put probes at a specified offset in a file and
    dump the required registers when the code flow reaches the
    probed address.

    The following example shows how to dump the instruction pointer
    and the %ax register at the probed text address. Here we are
    trying to probe zfree in /bin/zsh:

    # cd /sys/kernel/debug/tracing/
    # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
    00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
    # objdump -T /bin/zsh | grep -w zfree
    0000000000446420 g DF .text 0000000000000012 Base zfree
    # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
    # cat uprobe_events
    p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
    # echo 1 > events/uprobes/enable
    # sleep 20
    # echo 0 > events/uprobes/enable
    # cat trace
    # tracer: nop
    #
    # TASK-PID CPU# TIMESTAMP FUNCTION
    # | | | | |
    zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
    zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
    zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
    zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
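
    The offset 0x46420 written to uprobe_events above is a file offset, not
    the virtual address 0x446420 printed by objdump. A minimal sketch of
    that arithmetic follows; uprobe_file_offset() is a hypothetical helper
    written for this note, and it assumes a non-relocated binary whose
    symbol addresses equal its runtime addresses, as in the zsh example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate a virtual address inside an executable mapping into the
 * file offset expected by uprobe_events. map_start, map_end and
 * map_file_off correspond to the address range and offset columns of
 * the matching /proc/<pid>/maps line.
 */
uint64_t uprobe_file_offset(uint64_t vaddr, uint64_t map_start,
                            uint64_t map_end, uint64_t map_file_off)
{
    /* The address must lie inside the mapping it is translated against. */
    assert(vaddr >= map_start && vaddr < map_end);

    return vaddr - map_start + map_file_off;
}
```

    For the mapping 00400000-0048a000 with file offset 00000000, the call
    uprobe_file_offset(0x446420, 0x400000, 0x48a000, 0) gives 0x46420.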

    Signed-off-by: Srikar Dronamraju
    Acked-by: Steven Rostedt
    Acked-by: Masami Hiramatsu
    Cc: Linus Torvalds
    Cc: Ananth N Mavinakayanahalli
    Cc: Jim Keniston
    Cc: Linux-mm
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Christoph Hellwig
    Cc: Arnaldo Carvalho de Melo
    Cc: Anton Arapov
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120411103043.GB29437@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

05 May, 2012

2 commits

  • Now that all archs except ia64 are converted, replace the config and
    let the ia64 select CONFIG_ARCH_INIT_TASK

    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120503085035.867948914@linutronix.de

    Thomas Gleixner
     
  • All archs define init_task in the same way (except ia64, but there is
    no particular reason why ia64 cannot use the common version). Create a
    generic instance so all archs can be converted over.

    The config switch is temporary and will be removed when all archs are
    converted over.

    Signed-off-by: Thomas Gleixner
    Cc: Benjamin Herrenschmidt
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: David S. Miller
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hirokazu Takata
    Cc: James E.J. Bottomley
    Cc: Jesper Nilsson
    Cc: Jonas Bonn
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Matt Turner
    Cc: Michal Simek
    Cc: Mike Frysinger
    Cc: Paul Mundt
    Cc: Ralf Baechle
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de

    Thomas Gleixner
     

26 Apr, 2012

1 commit

  • All SMP architectures have magic to fork the idle task and to store it
    for reuse when cpu hotplug is enabled. Provide a generic
    infrastructure for it.

    Create/reinit the idle thread for the cpu which is brought up in the
    generic code and hand the thread pointer to the architecture code via
    __cpu_up().

    Note that fork_idle() is called via a workqueue, because this
    guarantees that the idle thread does not get a reference to a user
    space VM. This can happen when the boot process did not bring up all
    possible cpus and a later cpu_up() is initiated via the sysfs
    interface. In that case fork_idle() would be called in the context of
    the user space task and take a reference on the user space VM.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Paul E. McKenney
    Cc: Srivatsa S. Bhat
    Cc: Matt Turner
    Cc: Russell King
    Cc: Mike Frysinger
    Cc: Jesper Nilsson
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Hirokazu Takata
    Cc: Ralf Baechle
    Cc: David Howells
    Cc: James E.J. Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Chris Metcalf
    Cc: Richard Weinberger
    Cc: x86@kernel.org
    Acked-by: Venkatesh Pallipadi
    Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de

    Thomas Gleixner
     

14 Apr, 2012

5 commits

  • Merge in latest upstream (and the latest perf development tree),
    to prepare for tooling changes, and also to pick up v3.4 MM
    changes that the uprobes code needs to take care of.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This change adds support for a new ptrace option, PTRACE_O_TRACESECCOMP,
    and a new return value for seccomp BPF programs, SECCOMP_RET_TRACE.

    When a tracer specifies the PTRACE_O_TRACESECCOMP ptrace option, the
    tracer will be notified, via PTRACE_EVENT_SECCOMP, for any syscall that
    results in a BPF program returning SECCOMP_RET_TRACE. The 16-bit
    SECCOMP_RET_DATA mask of the BPF program return value will be passed as
    the ptrace_message and may be retrieved using PTRACE_GETEVENTMSG.

    If the subordinate process is not using seccomp filter, then no
    system call notifications will occur even if the option is specified.

    If there is no tracer with PTRACE_O_TRACESECCOMP when SECCOMP_RET_TRACE
    is returned, the system call will not be executed and an -ENOSYS errno
    will be returned to userspace.

    This change adds a dependency on the system call slow path. Any future
    efforts to use the system call fast path for seccomp filter will need to
    address this restriction.

    Signed-off-by: Will Drewry
    Acked-by: Eric Paris

    v18: - rebase
    - comment fatal_signal check
    - acked-by
    - drop secure_computing_int comment
    v17: - ...
    v16: - update PT_TRACE_MASK to 0xbf4 so that STOP isn't clear on SETOPTIONS call (indan@nul.nu)
    [note PT_TRACE_MASK disappears in linux-next]
    v15: - add audit support for non-zero return codes
    - clean up style (indan@nul.nu)
    v14: - rebase/nochanges
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    (Brings back a change to ptrace.c and the masks.)
    v12: - rebase to linux-next
    - use ptrace_event and update arch/Kconfig to mention slow-path dependency
    - drop all tracehook changes and inclusion (oleg@redhat.com)
    v11: - invert the logic to just make it a PTRACE_SYSCALL accelerator
    (indan@nul.nu)
    v10: - moved to PTRACE_O_SECCOMP / PT_TRACE_SECCOMP
    v9: - n/a
    v8: - guarded PTRACE_SECCOMP use with an ifdef
    v7: - introduced
    Signed-off-by: James Morris

    Will Drewry
     
  • Adds a new return value to seccomp filters that triggers a SIGSYS to be
    delivered with the new SYS_SECCOMP si_code.

    This allows in-process system call emulation, including just specifying
    an errno or cleanly dumping core, rather than just dying.

    Suggested-by: Markus Gutschke
    Suggested-by: Julien Tinnes
    Signed-off-by: Will Drewry
    Acked-by: Eric Paris

    v18: - acked-by, rebase
    - don't mention secure_computing_int() anymore
    v15: - use audit_seccomp/skip
    - pad out error spacing; clean up switch (indan@nul.nu)
    v14: - n/a
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - rebase on to linux-next
    v11: - clarify the comment (indan@nul.nu)
    - s/sigtrap/sigsys
    v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig
    note suggested-by (though original suggestion had other behaviors)
    v9: - changes to SIGILL
    v8: - clean up based on changes to dependent patches
    v7: - introduction
    Signed-off-by: James Morris

    Will Drewry
     
  • This change adds SECCOMP_RET_ERRNO as a valid return value from a
    seccomp filter. Additionally, it makes the first use of the lower
    16 bits for storing a filter-supplied errno. 16 bits is more than
    enough for the errno-base.h values.

    Returning errors instead of immediately terminating processes that
    violate seccomp policy allows for broader use of this functionality
    for kernel attack surface reduction. For example, a Linux container
    could maintain a whitelist of pre-existing system calls but drop
    all new ones with errnos. This would keep a logically static attack
    surface while providing errnos that may allow for graceful failure
    without the downside of do_exit() on a bad call.
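
    To make the split between action and errno concrete, here is a small
    sketch of how a return value could be composed and taken apart in
    userspace. The three helper functions are hypothetical names invented
    for this note; the SECCOMP_RET_* constants come from the
    <linux/seccomp.h> UAPI header as it exists in later kernels, which is
    assumed to be installed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Build a filter return value that fails the syscall with 'err'. */
uint32_t seccomp_ret_errno(uint16_t err)
{
    /* The action bits and the data bits must not overlap. */
    assert((SECCOMP_RET_ERRNO & SECCOMP_RET_DATA) == 0);

    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA);
}

/* Upper bits: what the kernel should do with the system call. */
uint32_t seccomp_ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION;
}

/* Lower 16 bits: the filter-supplied errno (or other data). */
uint32_t seccomp_ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}
```

    For example, seccomp_ret_errno(EPERM) carries the SECCOMP_RET_ERRNO
    action in its upper bits and EPERM in its lower 16 bits.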

    This change also changes the signature of __secure_computing. It
    appears the only direct caller is the arm entry code and it clobbers
    any possible return value (register) immediately.

    Signed-off-by: Will Drewry
    Acked-by: Serge Hallyn
    Reviewed-by: Kees Cook
    Acked-by: Eric Paris

    v18: - fix up comments and rebase
    - fix bad var name which was fixed in later revs
    - remove _int() and just change the __secure_computing signature
    v16-v17: ...
    v15: - use audit_seccomp and add a skip label. (eparis@redhat.com)
    - clean up and pad out return codes (indan@nul.nu)
    v14: - no change/rebase
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - move to WARN_ON if filter is NULL
    (oleg@redhat.com, luto@mit.edu, keescook@chromium.org)
    - return immediately for filter==NULL (keescook@chromium.org)
    - change evaluation to only compare the ACTION so that layered
    errnos don't result in the lowest one being returned.
    (keescook@chromium.org)
    v11: - check for NULL filter (keescook@chromium.org)
    v10: - change loaders to fn
    v9: - n/a
    v8: - update Kconfig to note new need for syscall_set_return_value.
    - reordered such that TRAP behavior follows on later.
    - made the for loop a little less indent-y
    v7: - introduced
    Signed-off-by: James Morris

    Will Drewry
     
  • [This patch depends on luto@mit.edu's no_new_privs patch:
    https://lkml.org/lkml/2012/1/30/264
    The whole series including Andrew's patches can be found here:
    https://github.com/redpig/linux/tree/seccomp
    Complete diff here:
    https://github.com/redpig/linux/compare/1dc65fed...seccomp
    ]

    This patch adds support for seccomp mode 2. Mode 2 introduces the
    ability for unprivileged processes to install system call filtering
    policy expressed in terms of a Berkeley Packet Filter (BPF) program.
    This program will be evaluated in the kernel for each system call
    the task makes and computes a result based on data in the format
    of struct seccomp_data.

    A filter program may be installed by calling:
    struct sock_fprog fprog = { ... };
    ...
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
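
    As an illustration of what might go into the elided fprog above, the
    following sketch fills in a four-instruction program that fails one
    system call with an errno and allows everything else. It relies on
    SECCOMP_RET_ERRNO, which is only added later in this series, and on
    the <linux/filter.h> and <linux/seccomp.h> UAPI headers of later
    kernels. build_errno_filter() and ERRNO_FILTER_LEN are hypothetical
    names. A real filter would normally also check seccomp_data.arch
    first; that check is left out here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define ERRNO_FILTER_LEN 4

/*
 * Fill 'insns' (room for ERRNO_FILTER_LEN entries) with a program that
 * returns SECCOMP_RET_ERRNO|err for 'syscall_nr' and SECCOMP_RET_ALLOW
 * for every other system call. Returns the number of instructions.
 */
size_t build_errno_filter(struct sock_filter *insns,
                          unsigned int syscall_nr, unsigned int err)
{
    assert(insns != NULL);

    /* A = seccomp_data.nr (the system call number) */
    insns[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, nr));
    /* if (A == syscall_nr) fall through, else skip one instruction */
    insns[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                            syscall_nr, 0, 1);
    /* matched: fail the call with the requested errno */
    insns[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                            SECCOMP_RET_ERRNO |
                                            (err & SECCOMP_RET_DATA));
    /* not matched: let the call proceed */
    insns[3] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                            SECCOMP_RET_ALLOW);

    return ERRNO_FILTER_LEN;
}
```

    The resulting array and its length would then be stored in the filter
    and len members of struct sock_fprog before the prctl() call.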

    The return value of the filter program determines if the system call is
    allowed to proceed or denied. If the first filter program installed
    allows prctl(2) calls, then the above call may be made repeatedly
    by a task to further reduce its access to the kernel. All attached
    programs must be evaluated before a system call will be allowed to
    proceed.
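
    One way to picture how the results of several attached programs
    combine is sketched below. It assumes that the numerically lowest
    action wins and that only the action bits are compared, which is how
    the v12 note in the SECCOMP_RET_ERRNO patch of this series reads;
    combine_filter_results() is a hypothetical userspace model, not
    kernel code, and it uses constants from the later <linux/seccomp.h>
    UAPI header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * Model of evaluating every attached filter: start from "allow" and
 * keep the result whose action bits are lowest. Because only the
 * action is compared, the first of several equal actions is kept
 * together with its own data bits.
 */
uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t ret = SECCOMP_RET_ALLOW;

    assert(results != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if ((results[i] & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
            ret = results[i];
    }
    return ret;
}
```

    Under this model a single filter returning a kill action overrides
    any number of filters that allow the call, and a task can only
    tighten its policy by attaching more filters.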

    Filter programs will be inherited across fork/clone and execve.
    However, if the task attaching the filter is unprivileged
    (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
    ensures that unprivileged tasks cannot attach filters that affect
    privileged tasks (e.g., setuid binary).

    There are a number of benefits to this approach. A few of them are
    as follows:
    - BPF has been exposed to userland for a long time
    - BPF optimization (and JIT'ing) are well understood
    - Userland already knows its ABI: system call numbers and desired
    arguments
    - No time-of-check-time-of-use vulnerable data accesses are possible.
    - system call arguments are loaded on access only to minimize copying
    required for system call policy decisions.

    Mode 2 support is restricted to architectures that enable
    HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
    syscall_get_arguments(). The full desired scope of this feature will
    add a few minor additional requirements expressed later in this series.
    Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
    the desired additional functionality.

    No architectures are enabled in this patch.

    Signed-off-by: Will Drewry
    Acked-by: Serge Hallyn
    Reviewed-by: Indan Zupancic
    Acked-by: Eric Paris
    Reviewed-by: Kees Cook

    v18: - rebase to v3.4-rc2
    - s/chk/check/ (akpm@linux-foundation.org,jmorris@namei.org)
    - allocate with GFP_KERNEL|__GFP_NOWARN (indan@nul.nu)
    - add a comment for get_u32 regarding endianness (akpm@)
    - fix other typos, style mistakes (akpm@)
    - added acked-by
    v17: - properly guard seccomp filter needed headers (leann@ubuntu.com)
    - tighten return mask to 0x7fff0000
    v16: - no change
    v15: - add a 4 instr penalty when counting a path to account for seccomp_filter
    size (indan@nul.nu)
    - drop the max insns to 256KB (indan@nul.nu)
    - return ENOMEM if the max insns limit has been hit (indan@nul.nu)
    - move IP checks after args (indan@nul.nu)
    - drop !user_filter check (indan@nul.nu)
    - only allow explicit bpf codes (indan@nul.nu)
    - exit_code -> exit_sig
    v14: - put/get_seccomp_filter takes struct task_struct
    (indan@nul.nu,keescook@chromium.org)
    - adds seccomp_chk_filter and drops general bpf_run/chk_filter user
    - add seccomp_bpf_load for use by net/core/filter.c
    - lower max per-process/per-hierarchy: 1MB
    - moved nnp/capability check prior to allocation
    (all of the above: indan@nul.nu)
    v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
    v12: - added a maximum instruction count per path (indan@nul.nu,oleg@redhat.com)
    - removed copy_seccomp (keescook@chromium.org,indan@nul.nu)
    - reworded the prctl_set_seccomp comment (indan@nul.nu)
    v11: - reorder struct seccomp_data to allow future args expansion (hpa@zytor.com)
    - style clean up, @compat dropped, compat_sock_fprog32 (indan@nul.nu)
    - do_exit(SIGSYS) (keescook@chromium.org, luto@mit.edu)
    - pare down Kconfig doc reference.
    - extra comment clean up
    v10: - seccomp_data has changed again to be more aesthetically pleasing
    (hpa@zytor.com)
    - calling convention is noted in a new u32 field using syscall_get_arch.
    This allows for cross-calling convention tasks to use seccomp filters.
    (hpa@zytor.com)
    - lots of clean up (thanks, Indan!)
    v9: - n/a
    v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
    - Lots of fixes courtesy of indan@nul.nu:
    -- fix up load behavior, compat fixups, and merge alloc code,
    -- renamed pc and dropped __packed, use bool compat.
    -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
    dependencies
    v7: (massive overhaul thanks to Indan, others)
    - added CONFIG_HAVE_ARCH_SECCOMP_FILTER
    - merged into seccomp.c
    - minimal seccomp_filter.h
    - no config option (part of seccomp)
    - no new prctl
    - doesn't break seccomp on systems without asm/syscall.h
    (works but arg access always fails)
    - dropped seccomp_init_task, extra free functions, ...
    - dropped the no-asm/syscall.h code paths
    - merges with network sk_run_filter and sk_chk_filter
    v6: - fix memory leak on attach compat check failure
    - require no_new_privs || CAP_SYS_ADMIN prior to filter
    installation. (luto@mit.edu)
    - s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com)
    - cleaned up Kconfig (amwang@redhat.com)
    - on block, note if the call was compat (so the # means something)
    v5: - uses syscall_get_arguments
    (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org)
    - uses union-based arg storage with hi/lo struct to
    handle endianness. Compromises between the two alternate
    proposals to minimize extra arg shuffling and account for
    endianness assuming userspace uses offsetof().
    (mcgrathr@chromium.org, indan@nul.nu)
    - update Kconfig description
    - add include/seccomp_filter.h and add its installation
    - (naive) on-demand syscall argument loading
    - drop seccomp_t (eparis@redhat.com)
    v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
    - now uses current->no_new_privs
    (luto@mit.edu,torvalds@linux-foundation.com)
    - assign names to seccomp modes (rdunlap@xenotime.net)
    - fix style issues (rdunlap@xenotime.net)
    - reworded Kconfig entry (rdunlap@xenotime.net)
    v3: - macros to inline (oleg@redhat.com)
    - init_task behavior fixed (oleg@redhat.com)
    - drop creator entry and extra NULL check (oleg@redhat.com)
    - alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com)
    - adds tentative use of "always_unprivileged" as per
    torvalds@linux-foundation.org and luto@mit.edu
    v2: - (patch 2 only)
    Signed-off-by: James Morris

    Will Drewry
     

30 Mar, 2012

1 commit