02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Sep, 2017

2 commits

  • The entire file is now conditionally compiled only when CONFIG_MODULES is
    enabled, and this this is a bool. Just move this conditional to the
    Makefile as its easier to read this way.

    Link: http://lkml.kernel.org/r/20170810180618.22457-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: Dmitry Torokhov
    Cc: Jessica Yu
    Cc: Rusty Russell
    Cc: Michal Marek
    Cc: Petr Mladek
    Cc: Miroslav Benes
    Cc: Josh Poimboeuf
    Cc: Guenter Roeck
    Cc: "Eric W. Biederman"
    Cc: Matt Redfearn
    Cc: Dan Carpenter
    Cc: Colin Ian King
    Cc: Daniel Mentz
    Cc: David Binderman
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Patch series "kmod: few code cleanups to split out umh code"

    The usermode helper has a provenance from the old usb code which first
    required a usermode helper. Eventually this was shoved into kmod.c and
    the kernel's modprobe calls was converted over eventually to share the
    same code. Over time the list of usermode helpers in the kernel has grown
    -- so kmod is just but one user of the API.

    This series is a simple logical cleanup which acknowledges the code
    evolution of the usermode helper and shoves the UMH API into its own
    dedicated file. This way users of the API can later just include umh.h
    instead of kmod.h.

    Note despite the diff state the first patch really is just a code shove,
    no functional changes are done there. I did use git format-patch -M to
    generate the patch, but in the end the split was not enough for git to
    consider it a rename hence the large diffstat.

    I've put this through 0-day and it gives me their machine compilation
    blessings with all tests as OK.

    This patch (of 4):

    There's a slew of usermode helper users and kmod is just one of them.
    Split out the usermode helper code into its own file to keep the logic and
    focus split up.

    This change provides no functional changes.

    Link: http://lkml.kernel.org/r/20170810180618.22457-2-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: Dmitry Torokhov
    Cc: Jessica Yu
    Cc: Rusty Russell
    Cc: Michal Marek
    Cc: Petr Mladek
    Cc: Miroslav Benes
    Cc: Josh Poimboeuf
    Cc: Guenter Roeck
    Cc: "Eric W. Biederman"
    Cc: Matt Redfearn
    Cc: Dan Carpenter
    Cc: Colin Ian King
    Cc: Daniel Mentz
    Cc: David Binderman
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     

17 Aug, 2017

1 commit

  • Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
    from all runqueues for which current thread's mm is the same as the
    thread calling sys_membarrier. It executes faster than the non-expedited
    variant (no blocking). It also works on NOHZ_FULL configurations.

    Scheduler-wise, it requires a memory barrier before and after context
    switching between processes (which have different mm). The memory
    barrier before context switch is already present. For the barrier after
    context switch:

    * Our TSO archs can do RELEASE without being a full barrier. Look at
    x86 spin_unlock() being a regular STORE for example. But for those
    archs, all atomics imply smp_mb and all of them have atomic ops in
    switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
    barrier.

    * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
    the rest does indeed do smp_mb(), so there the spin_unlock() is a full
    barrier and we're good.

    * ARM64 has a very heavy barrier in switch_to(), which suffices.

    * PPC just removed its barrier from switch_to(), but appears to be
    talking about adding something to switch_mm(). So add a
    smp_mb__after_unlock_lock() for now, until this is settled on the PPC
    side.

    Changes since v3:
    - Properly document the memory barriers provided by each architecture.

    Changes since v2:
    - Address comments from Peter Zijlstra,
    - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
    finish_task_switch() to add the memory barrier we need after storing
    to rq->curr. This is much simpler than the previous approach relying
    on atomic_dec_and_test() in mmdrop(), which actually added a memory
    barrier in the common case of switching between userspace processes.
    - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
    kernel, rather than having the whole membarrier system call returning
    -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
    Adapt the CMD_QUERY mask accordingly.

    Changes since v1:
    - move membarrier code under kernel/sched/ because it uses the
    scheduler runqueue,
    - only add the barrier when we switch from a kernel thread. The case
    where we switch from a user-space thread is already handled by
    the atomic_dec_and_test() in mmdrop().
    - add a comment to mmdrop() documenting the requirement on the implicit
    memory barrier.

    CC: Peter Zijlstra
    CC: Paul E. McKenney
    CC: Boqun Feng
    CC: Andrew Hunter
    CC: Maged Michael
    CC: gromer@google.com
    CC: Avi Kivity
    CC: Benjamin Herrenschmidt
    CC: Paul Mackerras
    CC: Michael Ellerman
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Paul E. McKenney
    Tested-by: Dave Watson

    Mathieu Desnoyers
     

13 Jul, 2017

1 commit

  • Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
    HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

    LOCKUP_DETECTOR implies the general boot, sysctl, and programming
    interfaces for the lockup detectors.

    An architecture that wants to use a hard lockup detector must define
    HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.

    Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
    minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
    does not implement the LOCKUP_DETECTOR interfaces.

    sparc is unusual in that it has started to implement some of the
    interfaces, but not fully yet. It should probably be converted to a full
    HAVE_HARDLOCKUP_DETECTOR_ARCH.

    [npiggin@gmail.com: fix]
    Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
    Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Don Zickus
    Reviewed-by: Babu Moger
    Tested-by: Babu Moger [sparc]
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

09 May, 2017

1 commit

  • Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and
    reuse crashkernel parameter for fadump", v4.

    Traditionally, kdump is used to save vmcore in case of a crash. Some
    architectures like powerpc can save vmcore using architecture specific
    support instead of kexec/kdump mechanism. Such architecture specific
    support also needs to reserve memory, to be used by dump capture kernel.
    crashkernel parameter can be a reused, for memory reservation, by such
    architecture specific infrastructure.

    This patchset removes dependency with CONFIG_KEXEC for crashkernel
    parameter and vmcoreinfo related code as it can be reused without kexec
    support. Also, crashkernel parameter is reused instead of
    fadump_reserve_mem to reserve memory for fadump.

    The first patch moves crashkernel parameter parsing and vmcoreinfo
    related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE. The
    second patch reuses the definitions of append_elf_note() & final_note()
    functions under CONFIG_CRASH_CORE in IA64 arch code. The third patch
    removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump)
    in powerpc. The next patch reuses crashkernel parameter for reserving
    memory for fadump, instead of the fadump_reserve_mem parameter. This
    has the advantage of using all syntaxes crashkernel parameter supports,
    for fadump as well. The last patch updates fadump kernel documentation
    about use of crashkernel parameter.

    This patch (of 5):

    Traditionally, kdump is used to save vmcore in case of a crash. Some
    architectures like powerpc can save vmcore using architecture specific
    support instead of kexec/kdump mechanism. Such architecture specific
    support also needs to reserve memory, to be used by dump capture kernel.
    crashkernel parameter can be a reused, for memory reservation, by such
    architecture specific infrastructure.

    But currently, code related to vmcoreinfo and parsing of crashkernel
    parameter is built under CONFIG_KEXEC_CORE. This patch introduces
    CONFIG_CRASH_CORE and moves the above mentioned code under this config,
    allowing code reuse without dependency on CONFIG_KEXEC. There is no
    functional change with this patch.

    Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.com
    Signed-off-by: Hari Bathini
    Acked-by: Dave Young
    Cc: Fenghua Yu
    Cc: Tony Luck
    Cc: Eric Biederman
    Cc: Mahesh Salgaonkar
    Cc: Vivek Goyal
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hari Bathini
     

28 Dec, 2016

1 commit

  • They're growing to be too many and planned to get split further. Move
    them under their own directory.

    kernel/cgroup.c -> kernel/cgroup/cgroup.c
    kernel/cgroup_freezer.c -> kernel/cgroup/freezer.c
    kernel/cgroup_pids.c -> kernel/cgroup/pids.c
    kernel/cpuset.c -> kernel/cgroup/cpuset.c

    Signed-off-by: Tejun Heo
    Acked-by: Acked-by: Zefan Li

    Tejun Heo
     

15 Dec, 2016

2 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton : (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Separate hardlockup code from watchdog.c and move it to watchdog_hld.c.
    It is mostly straight forward. Remove everything inside
    CONFIG_HARDLOCKUP_DETECTORS. This code will go to file watchdog_hld.c.
    Also update the makefile accordigly.

    Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com
    Signed-off-by: Babu Moger
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Jiri Kosina
    Cc: Andi Kleen
    Cc: Yaowei Bai
    Cc: Aaron Tomlin
    Cc: Ulrich Obergfell
    Cc: Tejun Heo
    Cc: Hidehiro Kawai
    Cc: Josh Hunt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Babu Moger
     

14 Dec, 2016

1 commit

  • The build system stopped generating ikconfig.h in v2.6.8. Remove an entry
    for it in dontdiff. There's also a reference to it in a small comment.
    Remove that comment too, as it is of little help in any case.

    Signed-off-by: Paul Bolle
    Signed-off-by: Jiri Kosina

    Paul Bolle
     

09 Aug, 2016

1 commit


24 May, 2016

1 commit

  • CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
    following linker errors:

    arch/mips/built-in.o: In function `elf_core_dump':
    binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
    binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
    binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
    binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'

    CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
    linker errors:

    arch/mips/built-in.o: In function `elf_core_dump':
    binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
    binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
    binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
    binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'

    This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
    elfcore but for these configurations elfcore will not be built.

    Fixed by making elfcore selectable by a separate config symbol which
    unlike the current mechanism can also be used from other directories
    than kernel/, then having each flavor of ELF that relies on elfcore.o,
    select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32
    which fixes this issue.

    Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
    Signed-off-by: Ralf Baechle
    Reviewed-by: James Hogan
    Cc: "Maciej W. Rozycki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is function of syscall inputs.
    To achieve this goal it does not collect coverage in soft/hard
    interrupts and instrumentation of some inherently non-deterministic or
    non-interesting parts of kernel is disbled (e.g. scheduler, locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on kernel side. The complimentary
    compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov. Typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen of basic blocks (e.g. an invalid
    input). In such context gcov becomes prohibitively expensive as
    reset/collect coverage steps depend on total number of basic
    blocks/edges in program (in case of kernel it is about 2M). Cost of
    kcov depends only on number of executed basic blocks/edges. On top of
    that, kernel requires per-thread coverage because there are always
    background threads and unrelated processes that also produce coverage.
    With inlined gcov instrumentation per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

31 Jan, 2016

1 commit


12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pairs:

    Thread A Thread B
    previous mem accesses previous mem accesses
    smp_mb() smp_mb()
    following mem accesses following mem accesses

    After the change, these pairs become:

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
    prev mem accesses
    barrier()
    follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every process
    threads (as we currently do), we diminish the number of unnecessary wake
    ups and only issue the memory barriers on active threads. Non-running
    threads do not need to execute such barrier anyway, because these are
    implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all pro=E2=80=90
    cesses running on the system. This command returns 0.

    The flags argument needs to be 0. For future extensions.

    All memory accesses performed in program order from each targeted
    thread is guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

    barrier() smp_mb() sys_membarrier()
    barrier() X X O
    smp_mb() X O O
    sys_membarrier() O O O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

11 Sep, 2015

2 commits

  • There are two kexec load syscalls, kexec_load another and kexec_file_load.
    kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
    split kexec_load syscall code to kernel/kexec.c.

    And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
    use kexec_file_load only, or vice verse.

    The original requirement is from Ted Ts'o, he want kexec kernel signature
    being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
    kexec_load syscall can bypass the checking.

    Vivek Goyal proposed to create a common kconfig option so user can compile
    in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
    KEXEC_CORE so that old config files still work.

    Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
    architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
    KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
    kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • Split kexec_file syscall related code to another file kernel/kexec_file.c
    so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.

    Sharing variables and functions are moved to kernel/kexec_internal.h per
    suggestion from Vivek and Petr.

    [akpm@linux-foundation.org: fix bisectability]
    [akpm@linux-foundation.org: declare the various arch_kexec functions]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

09 Sep, 2015

3 commits

  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     
  • Pull audit update from Paul Moore:
    "This is one of the larger audit patchsets in recent history,
    consisting of eight patches and almost 400 lines of changes.

    The bulk of the patchset is the new "audit by executable"
    functionality which allows admins to set an audit watch based on the
    executable on disk. Prior to this, admins could only track an
    application by PID, which has some obvious limitations.

    Beyond the new functionality we also have some refcnt fixes and a few
    minor cleanups"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    fixup: audit: implement audit by executable
    audit: implement audit by executable
    audit: clean simple fsnotify implementation
    audit: use macros for unset inode and device values
    audit: make audit_del_rule() more robust
    audit: fix uninitialized variable in audit_add_rule()
    audit: eliminate unnecessary extra layer of watch parent references
    audit: eliminate unnecessary extra layer of watch references

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

15 Aug, 2015

1 commit

  • Existing users of ioremap_cache() are mapping memory that is known in
    advance to not have i/o side effects. These users are forced to cast
    away the __iomem annotation, or otherwise neglect to fix the sparse
    errors thrown when dereferencing pointers to this memory. Provide
    memremap() as a non __iomem annotated ioremap_*() in the case when
    ioremap is otherwise a pointer to cacheable memory. Empirically,
    ioremap_() call sites are seeking memory-like semantics
    (e.g. speculative reads, and prefetching permitted).

    memremap() is a break from the ioremap implementation pattern of adding
    a new memremap_() for each mapping type and having silent
    compatibility fall backs. Instead, the implementation defines flags
    that are passed to the central memremap() and if a mapping type is not
    supported by an arch memremap returns NULL.

    We introduce a memremap prototype as a trivial wrapper of
    ioremap_cache() and ioremap_wt(). Later, once all ioremap_cache() and
    ioremap_wt() usage has been removed from drivers we teach archs to
    implement arch_memremap() with the ability to strictly enforce the
    mapping type.

    Cc: Arnd Bergmann
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

14 Aug, 2015

1 commit


13 Aug, 2015

1 commit


07 Aug, 2015

5 commits

  • Let the user explicitly provide a file containing trusted keys, instead of
    just automatically finding files matching *.x509 in the build tree and
    trusting whatever we find. This really ought to be an *explicit*
    configuration, and the build rules for dealing with the files were
    fairly painful too.

    Fix applied from James Morris that removes an '=' from a macro definition
    in kernel/Makefile as this is a feature that only exists from GNU make 3.82
    onwards.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • The current rule for generating signing_key.priv and signing_key.x509 is
    a classic example of a bad rule which has a tendency to break parallel
    make. When invoked to create *either* target, it generates the other
    target as a side-effect that make didn't predict.

    So let's switch to using a single file signing_key.pem which contains
    both key and certificate. That matches what we do in the case of an
    external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
    slightly cleaner.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Where an external PEM file or PKCS#11 URI is given, we can get the cert
    from it for ourselves instead of making the user drop signing_key.x509
    in place for us.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • This is to be used to audit by executable path rules, but audit watches should
    be able to share this code eventually.

    At the moment the audit watch code is a lot more complex. That code only
    creates one fsnotify watch per parent directory. That 'audit_parent' in
    turn has a list of 'audit_watches' which contain the name, ino, dev of
    the specific object we care about. This just creates one fsnotify watch
    per object we care about. So if you watch 100 inodes in /etc this code
    will create 100 fsnotify watches on /etc. The audit_watch code will
    instead create 1 fsnotify watch on /etc (the audit_parent) and then 100
    individual watches chained from that fsnotify mark.

    We should be able to convert the audit_watch code to do one fsnotify
    mark per watch and simplify things/remove a whole lot of code. After
    that conversion we should be able to convert the audit_fsnotify code to
    support that hierarchy if the optimization is necessary.

    Move the access to the entry for audit_match_signal() to the beginning of
    the audit_del_rule() function in case the entry found is the same one passed
    in. This will enable it to be used by audit_autoremove_mark_rule(),
    kill_rules() and audit_remove_parent_watches().

    This is a heavily modified and merged version of two patches originally
    submitted by Eric Paris.

    Cc: Peter Moody
    Cc: Eric Paris
    Signed-off-by: Richard Guy Briggs
    [PM: added a space after a declaration to keep ./scripts/checkpatch happy]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

15 Jul, 2015

1 commit

  • Adds a new single-purpose PIDs subsystem to limit the number of
    tasks that can be forked inside a cgroup. Essentially this is an
    implementation of RLIMIT_NPROC that applies to a cgroup rather than a
    process tree.

    However, it should be noted that organisational operations (adding and
    removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
    the number of tasks in the hierarchy cannot exceed the limit through
    forking. This is due to the fact that, in the unified hierarchy, attach
    cannot fail (and it is not possible for a task to overcome its PIDs
    cgroup policy limit by attaching to a child cgroup -- even if migrating
    mid-fork it must be able to fork in the parent first).

    PIDs are fundamentally a global resource, and it is possible to reach
    PID exhaustion inside a cgroup without hitting any reasonable kmemcg
    policy. Once you've hit PID exhaustion, you're only in a marginally
    better state than OOM. This subsystem allows PID exhaustion inside a
    cgroup to be prevented.

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

03 Jul, 2015

1 commit


01 May, 2015

1 commit


16 Apr, 2015

1 commit

  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, capable function was moved in order to avoid
    adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     

12 Feb, 2015

2 commits

  • Pull security layer updates from James Morris:
    "Highlights:

    - Smack adds secmark support for Netfilter
    - /proc/keys is now mandatory if CONFIG_KEYS=y
    - TPM gets its own device class
    - Added TPM 2.0 support
    - Smack file hook rework (all Smack users should review this!)"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
    cipso: don't use IPCB() to locate the CIPSO IP option
    SELinux: fix error code in policydb_init()
    selinux: add security in-core xattr support for pstore and debugfs
    selinux: quiet the filesystem labeling behavior message
    selinux: Remove unused function avc_sidcmp()
    ima: /proc/keys is now mandatory
    Smack: Repair netfilter dependency
    X.509: silence asn1 compiler debug output
    X.509: shut up about included cert for silent build
    KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
    MAINTAINERS: email update
    tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
    smack: fix possible use after frees in task_security() callers
    smack: Add missing logging in bidirectional UDS connect check
    Smack: secmark support for netfilter
    Smack: Rework file hooks
    tpm: fix format string error in tpm-chip.c
    char/tpm/tpm_crb: fix build error
    smack: Fix a bidirectional UDS connect check typo
    smack: introduce a special case for tmpfs in smack_d_instantiate()
    ...

    Linus Torvalds
     
  • Pull s390 updates from Martin Schwidefsky:

    - The remaining patches for the z13 machine support: kernel build
    option for z13, the cache synonym avoidance, SMT support,
    compare-and-delay for spinloops and the CES5S crypto adapater.

    - The ftrace support for function tracing with the gcc hotpatch option.
    This touches common code Makefiles, Steven is ok with the changes.

    - The hypfs file system gets an extension to access diagnose 0x0c data
    in user space for performance analysis for Linux running under z/VM.

    - The iucv hvc console gets wildcard spport for the user id filtering.

    - The cacheinfo code is converted to use the generic infrastructure.

    - Cleanup and bug fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
    s390/process: free vx save area when releasing tasks
    s390/hypfs: Eliminate hypfs interval
    s390/hypfs: Add diagnose 0c support
    s390/cacheinfo: don't use smp_processor_id() in preemptible context
    s390/zcrypt: fixed domain scanning problem (again)
    s390/smp: increase maximum value of NR_CPUS to 512
    s390/jump label: use different nop instruction
    s390/jump label: add sanity checks
    s390/mm: correct missing space when reporting user process faults
    s390/dasd: cleanup profiling
    s390/dasd: add locking for global_profile access
    s390/ftrace: hotpatch support for function tracing
    ftrace: let notrace function attribute disable hotpatching if necessary
    ftrace: allow architectures to specify ftrace compile options
    s390: reintroduce diag 44 calls for cpu_relax()
    s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
    s390/zcrypt: Number of supported ap domains is not retrievable.
    s390/spinlock: add compare-and-delay to lock wait loops
    s390/tape: remove redundant if statement
    s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
    ...

    Linus Torvalds
     

29 Jan, 2015

1 commit

  • If the kernel is compiled with function tracer support the -pg compile option
    is passed to gcc to generate extra code into the prologue of each function.

    This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
    makefile variable which architectures can override if a different option
    should be used for code generation.

    Acked-by: Steven Rostedt
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     

23 Jan, 2015

1 commit

  • Every kernel build that includes X.509 support prints out
    a message like

    - Including cert signing_key.x509

    This may be useful for some cases, but when doing automated
    build tests, it just means noise.

    To hide the message, this uses '$(kecho)' for printing the
    message, which means we still see it when building with V=1,
    but not at the normal level or when building with 'make -s'.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: David Howells

    Arnd Bergmann
     

22 Dec, 2014

1 commit

  • This commit introduces code for the live patching core. It implements
    an ftrace-based mechanism and kernel interface for doing live patching
    of kernel and kernel module functions.

    It represents the greatest common functionality set between kpatch and
    kgraft and can accept patches built using either method.

    This first version does not implement any consistency mechanism that
    ensures that old and new code do not run together. In practice, ~90% of
    CVEs are safe to apply in this way, since they simply add a conditional
    check. However, any function change that can not execute safely with
    the old version of the function can _not_ be safely applied in this
    version.

    [ jkosina@suse.cz: due to the number of contributions that got folded into
    this original patch from Seth Jennings, add SUSE's copyright as well, as
    discussed via e-mail ]

    Signed-off-by: Seth Jennings
    Signed-off-by: Josh Poimboeuf
    Reviewed-by: Miroslav Benes
    Reviewed-by: Petr Mladek
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Miroslav Benes
    Signed-off-by: Petr Mladek
    Signed-off-by: Jiri Kosina

    Seth Jennings
     

11 Dec, 2014

1 commit

  • All memory accounting and limiting has been switched over to the
    lockless page counters. Bye, res_counter!

    [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
    [mhocko@suse.cz: ditch the last remainings of res_counter]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Paul Bolle
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

28 Oct, 2014

1 commit

  • introduce two configs:
    - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
    depend on
    - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

    that solves several problems:
    - tracing and others that wish to use eBPF don't need to depend on NET.
    They can use BPF_SYSCALL to allow loading from userspace or select BPF
    to use it directly from kernel in NET-less configs.
    - in 3.18 programs cannot be attached to events yet, so don't force it on
    - when the rest of eBPF infra is there in 3.19+, it's still useful to
    switch it off to minimize kernel size

    bloat-o-meter on x64 shows:
    add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

    tested with many different config combinations. Hopefully didn't miss anything.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

09 Aug, 2014

1 commit

  • This patch series does not do kernel signature verification yet. I plan
    to post another patch series for that. Now distributions are already
    signing PE/COFF bzImage with PKCS7 signature I plan to parse and verify
    those signatures.

    Primary goal of this patchset is to prepare groundwork so that kernel
    image can be signed and signatures be verified during kexec load. This
    should help with two things.

    - It should allow kexec/kdump on secureboot enabled machines.

    - In general it can help even without secureboot. By being able to verify
    kernel image signature in kexec, it should help with avoiding module
    signing restrictions. Matthew Garret showed how to boot into a custom
    kernel, modify first kernel's memory and then jump back to old kernel and
    bypass any policy one wants to.

    This patch (of 15):

    Kexec wants to use bin2c and it wants to use it really early in the build
    process. See arch/x86/purgatory/ code in later patches.

    So move bin2c in scripts/basic so that it can be built very early and
    be usable by arch/x86/purgatory/

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal