09 Aug, 2016

1 commit


24 May, 2016

1 commit

  • CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
    following linker errors:

    arch/mips/built-in.o: In function `elf_core_dump':
    binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
    binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
    binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
    binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'

    CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
    linker errors:

    arch/mips/built-in.o: In function `elf_core_dump':
    binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
    binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
    binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
    binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'

    This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
    elfcore but for these configurations elfcore will not be built.

    Fixed by making elfcore selectable by a separate config symbol which
    unlike the current mechanism can also be used from other directories
    than kernel/, then having each flavor of ELF that relies on elfcore.o,
    select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32
    which fixes this issue.

    Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
    Signed-off-by: Ralf Baechle
    Reviewed-by: James Hogan
    Cc: "Maciej W. Rozycki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is function of syscall inputs.
    To achieve this goal it does not collect coverage in soft/hard
    interrupts and instrumentation of some inherently non-deterministic or
    non-interesting parts of kernel is disbled (e.g. scheduler, locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on kernel side. The complimentary
    compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov. Typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen of basic blocks (e.g. an invalid
    input). In such context gcov becomes prohibitively expensive as
    reset/collect coverage steps depend on total number of basic
    blocks/edges in program (in case of kernel it is about 2M). Cost of
    kcov depends only on number of executed basic blocks/edges. On top of
    that, kernel requires per-thread coverage because there are always
    background threads and unrelated processes that also produce coverage.
    With inlined gcov instrumentation per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

31 Jan, 2016

1 commit


12 Sep, 2015

1 commit

  • Here is an implementation of a new system call, sys_membarrier(), which
    executes a memory barrier on all threads running on the system. It is
    implemented by calling synchronize_sched(). It can be used to
    distribute the cost of user-space memory barriers asymmetrically by
    transforming pairs of memory barriers into pairs consisting of
    sys_membarrier() and a compiler barrier. For synchronization primitives
    that distinguish between read-side and write-side (e.g. userspace RCU
    [1], rwlocks), the read-side can be accelerated significantly by moving
    the bulk of the memory barrier overhead to the write-side.

    The existing applications of which I am aware that would be improved by
    this system call are as follows:

    * Through Userspace RCU library (http://urcu.so)
    - DNS server (Knot DNS) https://www.knot-dns.cz/
    - Network sniffer (http://netsniff-ng.org/)
    - Distributed object storage (https://sheepdog.github.io/sheepdog/)
    - User-space tracing (http://lttng.org)
    - Network storage system (https://www.gluster.org/)
    - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
    - Financial software (https://lkml.org/lkml/2015/3/23/189)

    Those projects use RCU in userspace to increase read-side speed and
    scalability compared to locking. Especially in the case of RCU used by
    libraries, sys_membarrier can speed up the read-side by moving the bulk of
    the memory barrier cost to synchronize_rcu().

    * Direct users of sys_membarrier
    - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

    Microsoft core dotnet GC developers are planning to use the mprotect()
    side-effect of issuing memory barriers through IPIs as a way to implement
    Windows FlushProcessWriteBuffers() on Linux. They are referring to
    sys_membarrier in their github thread, specifically stating that
    sys_membarrier() is what they are looking for.

    To explain the benefit of this scheme, let's introduce two example threads:

    Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
    Thread B (frequent, e.g. executing liburcu
    rcu_read_lock()/rcu_read_unlock())

    In a scheme where all smp_mb() in thread A are ordering memory accesses
    with respect to smp_mb() present in Thread B, we can change each
    smp_mb() within Thread A into calls to sys_membarrier() and each
    smp_mb() within Thread B into compiler barriers "barrier()".

    Before the change, we had, for each smp_mb() pairs:

    Thread A Thread B
    previous mem accesses previous mem accesses
    smp_mb() smp_mb()
    following mem accesses following mem accesses

    After the change, these pairs become:

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    As we can see, there are two possible scenarios: either Thread B memory
    accesses do not happen concurrently with Thread A accesses (1), or they
    do (2).

    1) Non-concurrent Thread A vs Thread B accesses:

    Thread A Thread B
    prev mem accesses
    sys_membarrier()
    follow mem accesses
    prev mem accesses
    barrier()
    follow mem accesses

    In this case, thread B accesses will be weakly ordered. This is OK,
    because at that point, thread A is not particularly interested in
    ordering them with respect to its own accesses.

    2) Concurrent Thread A vs Thread B accesses

    Thread A Thread B
    prev mem accesses prev mem accesses
    sys_membarrier() barrier()
    follow mem accesses follow mem accesses

    In this case, thread B accesses, which are ensured to be in program
    order thanks to the compiler barrier, will be "upgraded" to full
    smp_mb() by synchronize_sched().

    * Benchmarks

    On Intel Xeon E5405 (8 cores)
    (one thread is calling sys_membarrier, the other 7 threads are busy
    looping)

    1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.

    * User-space user of this system call: Userspace RCU library

    Both the signal-based and the sys_membarrier userspace RCU schemes
    permit us to remove the memory barrier from the userspace RCU
    rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
    accelerating them. These memory barriers are replaced by compiler
    barriers on the read-side, and all matching memory barriers on the
    write-side are turned into an invocation of a memory barrier on all
    active threads in the process. By letting the kernel perform this
    synchronization rather than dumbly sending a signal to every process
    threads (as we currently do), we diminish the number of unnecessary wake
    ups and only issue the memory barriers on active threads. Non-running
    threads do not need to execute such barrier anyway, because these are
    implied by the scheduler context switches.

    Results in liburcu:

    Operations in 10s, 6 readers, 2 writers:

    memory barriers in reader: 1701557485 reads, 2202847 writes
    signal-based scheme: 9830061167 reads, 6700 writes
    sys_membarrier: 9952759104 reads, 425 writes
    sys_membarrier (dyn. check): 7970328887 reads, 425 writes

    The dynamic sys_membarrier availability check adds some overhead to
    the read-side compared to the signal-based scheme, but besides that,
    sys_membarrier slightly outperforms the signal-based scheme. However,
    this non-expedited sys_membarrier implementation has a much slower grace
    period than signal and memory barrier schemes.

    Besides diminishing the number of wake-ups, one major advantage of the
    membarrier system call over the signal-based scheme is that it does not
    need to reserve a signal. This plays much more nicely with libraries,
    and with processes injected into for tracing purposes, for which we
    cannot expect that signals will be unused by the application.

    An expedited version of this system call can be added later on to speed
    up the grace period. Its implementation will likely depend on reading
    the cpu_curr()->mm without holding each CPU's rq lock.

    This patch adds the system call to x86 and to asm-generic.

    [1] http://urcu.so

    membarrier(2) man page:

    MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2)

    NAME
    membarrier - issue memory barriers on a set of threads

    SYNOPSIS
    #include

    int membarrier(int cmd, int flags);

    DESCRIPTION
    The cmd argument is one of the following:

    MEMBARRIER_CMD_QUERY
    Query the set of supported commands. It returns a bitmask of
    supported commands.

    MEMBARRIER_CMD_SHARED
    Execute a memory barrier on all threads running on the system.
    Upon return from system call, the caller thread is ensured that
    all running threads have passed through a state where all memory
    accesses to user-space addresses match program order between
    entry to and return from the system call (non-running threads
    are de facto in such a state). This covers threads from all pro=E2=80=90
    cesses running on the system. This command returns 0.

    The flags argument needs to be 0. For future extensions.

    All memory accesses performed in program order from each targeted
    thread is guaranteed to be ordered with respect to sys_membarrier(). If
    we use the semantic "barrier()" to represent a compiler barrier forcing
    memory accesses to be performed in program order across the barrier,
    and smp_mb() to represent explicit memory barriers forcing full memory
    ordering across the barrier, we have the following ordering table for
    each pair of barrier(), sys_membarrier() and smp_mb():

    The pair ordering is detailed as (O: ordered, X: not ordered):

    barrier() smp_mb() sys_membarrier()
    barrier() X X O
    smp_mb() X O O
    sys_membarrier() O O O

    RETURN VALUE
    On success, these system calls return zero. On error, -1 is returned,
    and errno is set appropriately. For a given command, with flags
    argument set to 0, this system call is guaranteed to always return the
    same value until reboot.

    ERRORS
    ENOSYS System call is not implemented.

    EINVAL Invalid arguments.

    Linux 2015-04-15 MEMBARRIER(2)

    Signed-off-by: Mathieu Desnoyers
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Cc: KOSAKI Motohiro
    Cc: Steven Rostedt
    Cc: Nicholas Miell
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Lai Jiangshan
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Pranith Kumar
    Cc: Michael Kerrisk
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

11 Sep, 2015

2 commits

  • There are two kexec load syscalls, kexec_load another and kexec_file_load.
    kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
    split kexec_load syscall code to kernel/kexec.c.

    And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
    use kexec_file_load only, or vice verse.

    The original requirement is from Ted Ts'o, he want kexec kernel signature
    being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
    kexec_load syscall can bypass the checking.

    Vivek Goyal proposed to create a common kconfig option so user can compile
    in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
    KEXEC_CORE so that old config files still work.

    Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
    architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
    KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
    kexec_load syscall.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • Split kexec_file syscall related code to another file kernel/kexec_file.c
    so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.

    Sharing variables and functions are moved to kernel/kexec_internal.h per
    suggestion from Vivek and Petr.

    [akpm@linux-foundation.org: fix bisectability]
    [akpm@linux-foundation.org: declare the various arch_kexec functions]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

09 Sep, 2015

3 commits

  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     
  • Pull audit update from Paul Moore:
    "This is one of the larger audit patchsets in recent history,
    consisting of eight patches and almost 400 lines of changes.

    The bulk of the patchset is the new "audit by executable"
    functionality which allows admins to set an audit watch based on the
    executable on disk. Prior to this, admins could only track an
    application by PID, which has some obvious limitations.

    Beyond the new functionality we also have some refcnt fixes and a few
    minor cleanups"

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    fixup: audit: implement audit by executable
    audit: implement audit by executable
    audit: clean simple fsnotify implementation
    audit: use macros for unset inode and device values
    audit: make audit_del_rule() more robust
    audit: fix uninitialized variable in audit_add_rule()
    audit: eliminate unnecessary extra layer of watch parent references
    audit: eliminate unnecessary extra layer of watch references

    Linus Torvalds
     
  • Pull security subsystem updates from James Morris:
    "Highlights:

    - PKCS#7 support added to support signed kexec, also utilized for
    module signing. See comments in 3f1e1bea.

    ** NOTE: this requires linking against the OpenSSL library, which
    must be installed, e.g. the openssl-devel on Fedora **

    - Smack
    - add IPv6 host labeling; ignore labels on kernel threads
    - support smack labeling mounts which use binary mount data

    - SELinux:
    - add ioctl whitelisting (see
    http://kernsec.org/files/lss2015/vanderstoep.pdf)
    - fix mprotect PROT_EXEC regression caused by mm change

    - Seccomp:
    - add ptrace options for suspend/resume"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (57 commits)
    PKCS#7: Add OIDs for sha224, sha284 and sha512 hash algos and use them
    Documentation/Changes: Now need OpenSSL devel packages for module signing
    scripts: add extract-cert and sign-file to .gitignore
    modsign: Handle signing key in source tree
    modsign: Use if_changed rule for extracting cert from module signing key
    Move certificate handling to its own directory
    sign-file: Fix warning about BIO_reset() return value
    PKCS#7: Add MODULE_LICENSE() to test module
    Smack - Fix build error with bringup unconfigured
    sign-file: Document dependency on OpenSSL devel libraries
    PKCS#7: Appropriately restrict authenticated attributes and content type
    KEYS: Add a name for PKEY_ID_PKCS7
    PKCS#7: Improve and export the X.509 ASN.1 time object decoder
    modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
    extract-cert: Cope with multiple X.509 certificates in a single file
    sign-file: Generate CMS message as signature instead of PKCS#7
    PKCS#7: Support CMS messages also [RFC5652]
    X.509: Change recorded SKID & AKID to not include Subject or Issuer
    PKCS#7: Check content type and versions
    MAINTAINERS: The keyrings mailing list has moved
    ...

    Linus Torvalds
     

15 Aug, 2015

1 commit

  • Existing users of ioremap_cache() are mapping memory that is known in
    advance to not have i/o side effects. These users are forced to cast
    away the __iomem annotation, or otherwise neglect to fix the sparse
    errors thrown when dereferencing pointers to this memory. Provide
    memremap() as a non __iomem annotated ioremap_*() in the case when
    ioremap is otherwise a pointer to cacheable memory. Empirically,
    ioremap_() call sites are seeking memory-like semantics
    (e.g. speculative reads, and prefetching permitted).

    memremap() is a break from the ioremap implementation pattern of adding
    a new memremap_() for each mapping type and having silent
    compatibility fall backs. Instead, the implementation defines flags
    that are passed to the central memremap() and if a mapping type is not
    supported by an arch memremap returns NULL.

    We introduce a memremap prototype as a trivial wrapper of
    ioremap_cache() and ioremap_wt(). Later, once all ioremap_cache() and
    ioremap_wt() usage has been removed from drivers we teach archs to
    implement arch_memremap() with the ability to strictly enforce the
    mapping type.

    Cc: Arnd Bergmann
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

14 Aug, 2015

1 commit


13 Aug, 2015

1 commit


07 Aug, 2015

5 commits

  • Let the user explicitly provide a file containing trusted keys, instead of
    just automatically finding files matching *.x509 in the build tree and
    trusting whatever we find. This really ought to be an *explicit*
    configuration, and the build rules for dealing with the files were
    fairly painful too.

    Fix applied from James Morris that removes an '=' from a macro definition
    in kernel/Makefile as this is a feature that only exists from GNU make 3.82
    onwards.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • The current rule for generating signing_key.priv and signing_key.x509 is
    a classic example of a bad rule which has a tendency to break parallel
    make. When invoked to create *either* target, it generates the other
    target as a side-effect that make didn't predict.

    So let's switch to using a single file signing_key.pem which contains
    both key and certificate. That matches what we do in the case of an
    external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
    slightly cleaner.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Where an external PEM file or PKCS#11 URI is given, we can get the cert
    from it for ourselves instead of making the user drop signing_key.x509
    in place for us.

    Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • Signed-off-by: David Woodhouse
    Signed-off-by: David Howells

    David Woodhouse
     
  • This is to be used to audit by executable path rules, but audit watches should
    be able to share this code eventually.

    At the moment the audit watch code is a lot more complex. That code only
    creates one fsnotify watch per parent directory. That 'audit_parent' in
    turn has a list of 'audit_watches' which contain the name, ino, dev of
    the specific object we care about. This just creates one fsnotify watch
    per object we care about. So if you watch 100 inodes in /etc this code
    will create 100 fsnotify watches on /etc. The audit_watch code will
    instead create 1 fsnotify watch on /etc (the audit_parent) and then 100
    individual watches chained from that fsnotify mark.

    We should be able to convert the audit_watch code to do one fsnotify
    mark per watch and simplify things/remove a whole lot of code. After
    that conversion we should be able to convert the audit_fsnotify code to
    support that hierarchy if the optimization is necessary.

    Move the access to the entry for audit_match_signal() to the beginning of
    the audit_del_rule() function in case the entry found is the same one passed
    in. This will enable it to be used by audit_autoremove_mark_rule(),
    kill_rules() and audit_remove_parent_watches().

    This is a heavily modified and merged version of two patches originally
    submitted by Eric Paris.

    Cc: Peter Moody
    Cc: Eric Paris
    Signed-off-by: Richard Guy Briggs
    [PM: added a space after a declaration to keep ./scripts/checkpatch happy]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

15 Jul, 2015

1 commit

  • Adds a new single-purpose PIDs subsystem to limit the number of
    tasks that can be forked inside a cgroup. Essentially this is an
    implementation of RLIMIT_NPROC that applies to a cgroup rather than a
    process tree.

    However, it should be noted that organisational operations (adding and
    removing tasks from a PIDs hierarchy) will *not* be prevented. Rather,
    the number of tasks in the hierarchy cannot exceed the limit through
    forking. This is due to the fact that, in the unified hierarchy, attach
    cannot fail (and it is not possible for a task to overcome its PIDs
    cgroup policy limit by attaching to a child cgroup -- even if migrating
    mid-fork it must be able to fork in the parent first).

    PIDs are fundamentally a global resource, and it is possible to reach
    PID exhaustion inside a cgroup without hitting any reasonable kmemcg
    policy. Once you've hit PID exhaustion, you're only in a marginally
    better state than OOM. This subsystem allows PID exhaustion inside a
    cgroup to be prevented.

    Signed-off-by: Aleksa Sarai
    Signed-off-by: Tejun Heo

    Aleksa Sarai
     

03 Jul, 2015

1 commit


01 May, 2015

1 commit


16 Apr, 2015

1 commit

  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, capable function was moved in order to avoid
    adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     

12 Feb, 2015

2 commits

  • Pull security layer updates from James Morris:
    "Highlights:

    - Smack adds secmark support for Netfilter
    - /proc/keys is now mandatory if CONFIG_KEYS=y
    - TPM gets its own device class
    - Added TPM 2.0 support
    - Smack file hook rework (all Smack users should review this!)"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
    cipso: don't use IPCB() to locate the CIPSO IP option
    SELinux: fix error code in policydb_init()
    selinux: add security in-core xattr support for pstore and debugfs
    selinux: quiet the filesystem labeling behavior message
    selinux: Remove unused function avc_sidcmp()
    ima: /proc/keys is now mandatory
    Smack: Repair netfilter dependency
    X.509: silence asn1 compiler debug output
    X.509: shut up about included cert for silent build
    KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
    MAINTAINERS: email update
    tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
    smack: fix possible use after frees in task_security() callers
    smack: Add missing logging in bidirectional UDS connect check
    Smack: secmark support for netfilter
    Smack: Rework file hooks
    tpm: fix format string error in tpm-chip.c
    char/tpm/tpm_crb: fix build error
    smack: Fix a bidirectional UDS connect check typo
    smack: introduce a special case for tmpfs in smack_d_instantiate()
    ...

    Linus Torvalds
     
  • Pull s390 updates from Martin Schwidefsky:

    - The remaining patches for the z13 machine support: kernel build
    option for z13, the cache synonym avoidance, SMT support,
    compare-and-delay for spinloops and the CES5S crypto adapater.

    - The ftrace support for function tracing with the gcc hotpatch option.
    This touches common code Makefiles, Steven is ok with the changes.

    - The hypfs file system gets an extension to access diagnose 0x0c data
    in user space for performance analysis for Linux running under z/VM.

    - The iucv hvc console gets wildcard spport for the user id filtering.

    - The cacheinfo code is converted to use the generic infrastructure.

    - Cleanup and bug fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
    s390/process: free vx save area when releasing tasks
    s390/hypfs: Eliminate hypfs interval
    s390/hypfs: Add diagnose 0c support
    s390/cacheinfo: don't use smp_processor_id() in preemptible context
    s390/zcrypt: fixed domain scanning problem (again)
    s390/smp: increase maximum value of NR_CPUS to 512
    s390/jump label: use different nop instruction
    s390/jump label: add sanity checks
    s390/mm: correct missing space when reporting user process faults
    s390/dasd: cleanup profiling
    s390/dasd: add locking for global_profile access
    s390/ftrace: hotpatch support for function tracing
    ftrace: let notrace function attribute disable hotpatching if necessary
    ftrace: allow architectures to specify ftrace compile options
    s390: reintroduce diag 44 calls for cpu_relax()
    s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
    s390/zcrypt: Number of supported ap domains is not retrievable.
    s390/spinlock: add compare-and-delay to lock wait loops
    s390/tape: remove redundant if statement
    s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
    ...

    Linus Torvalds
     

29 Jan, 2015

1 commit

  • If the kernel is compiled with function tracer support the -pg compile option
    is passed to gcc to generate extra code into the prologue of each function.

    This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
    makefile variable which architectures can override if a different option
    should be used for code generation.

    Acked-by: Steven Rostedt
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky

    Heiko Carstens
     

23 Jan, 2015

1 commit

  • Every kernel build that includes X.509 support prints out
    a message like

    - Including cert signing_key.x509

    This may be useful for some cases, but when doing automated
    build tests, it just means noise.

    To hide the message, this uses '$(kecho)' for printing the
    message, which means we still see it when building with V=1,
    but not at the normal level or when building with 'make -s'.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: David Howells

    Arnd Bergmann
     

22 Dec, 2014

1 commit

  • This commit introduces code for the live patching core. It implements
    an ftrace-based mechanism and kernel interface for doing live patching
    of kernel and kernel module functions.

    It represents the greatest common functionality set between kpatch and
    kgraft and can accept patches built using either method.

    This first version does not implement any consistency mechanism that
    ensures that old and new code do not run together. In practice, ~90% of
    CVEs are safe to apply in this way, since they simply add a conditional
    check. However, any function change that can not execute safely with
    the old version of the function can _not_ be safely applied in this
    version.

    [ jkosina@suse.cz: due to the number of contributions that got folded into
    this original patch from Seth Jennings, add SUSE's copyright as well, as
    discussed via e-mail ]

    Signed-off-by: Seth Jennings
    Signed-off-by: Josh Poimboeuf
    Reviewed-by: Miroslav Benes
    Reviewed-by: Petr Mladek
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Miroslav Benes
    Signed-off-by: Petr Mladek
    Signed-off-by: Jiri Kosina

    Seth Jennings
     

11 Dec, 2014

1 commit

  • All memory accounting and limiting has been switched over to the
    lockless page counters. Bye, res_counter!

    [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
    [mhocko@suse.cz: ditch the last remainings of res_counter]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Paul Bolle
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

28 Oct, 2014

1 commit

  • introduce two configs:
    - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
    depend on
    - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

    that solves several problems:
    - tracing and others that wish to use eBPF don't need to depend on NET.
    They can use BPF_SYSCALL to allow loading from userspace or select BPF
    to use it directly from kernel in NET-less configs.
    - in 3.18 programs cannot be attached to events yet, so don't force it on
    - when the rest of eBPF infra is there in 3.19+, it's still useful to
    switch it off to minimize kernel size

    bloat-o-meter on x64 shows:
    add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

    tested with many different config combinations. Hopefully didn't miss anything.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

09 Aug, 2014

1 commit

  • This patch series does not do kernel signature verification yet. I plan
    to post another patch series for that. Now distributions are already
    signing PE/COFF bzImage with PKCS7 signature I plan to parse and verify
    those signatures.

    Primary goal of this patchset is to prepare groundwork so that kernel
    image can be signed and signatures be verified during kexec load. This
    should help with two things.

    - It should allow kexec/kdump on secureboot enabled machines.

    - In general it can help even without secureboot. By being able to verify
    kernel image signature in kexec, it should help with avoiding module
    signing restrictions. Matthew Garret showed how to boot into a custom
    kernel, modify first kernel's memory and then jump back to old kernel and
    bypass any policy one wants to.

    This patch (of 15):

    Kexec wants to use bin2c and it wants to use it really early in the build
    process. See arch/x86/purgatory/ code in later patches.

    So move bin2c in scripts/basic so that it can be built very early and
    be usable by arch/x86/purgatory/

    Signed-off-by: Vivek Goyal
    Cc: Borislav Petkov
    Cc: Michael Kerrisk
    Cc: Yinghai Lu
    Cc: Eric Biederman
    Cc: H. Peter Anvin
    Cc: Matthew Garrett
    Cc: Greg Kroah-Hartman
    Cc: Dave Young
    Cc: WANG Chao
    Cc: Baoquan He
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     

07 Aug, 2014

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Steady transitioning of the BPF instructure to a generic spot so
    all kernel subsystems can make use of it, from Alexei Starovoitov.

    2) SFC driver supports busy polling, from Alexandre Rames.

    3) Take advantage of hash table in UDP multicast delivery, from David
    Held.

    4) Lighten locking, in particular by getting rid of the LRU lists, in
    inet frag handling. From Florian Westphal.

    5) Add support for various RFC6458 control messages in SCTP, from
    Geir Ola Vaagland.

    6) Allow to filter bridge forwarding database dumps by device, from
    Jamal Hadi Salim.

    7) virtio-net also now supports busy polling, from Jason Wang.

    8) Some low level optimization tweaks in pktgen from Jesper Dangaard
    Brouer.

    9) Add support for ipv6 address generation modes, so that userland
    can have some input into the process. From Jiri Pirko.

    10) Consolidate common TCP connection request code in ipv4 and ipv6,
    from Octavian Purdila.

    11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

    12) Generic resizable RCU hash table, with intial users in netlink and
    nftables. From Thomas Graf.

    13) Maintain a name assignment type so that userspace can see where a
    network device name came from (enumerated by kernel, assigned
    explicitly by userspace, etc.) From Tom Gundersen.

    14) Automatic flow label generation on transmit in ipv6, from Tom
    Herbert.

    15) New packet timestamping facilities from Willem de Bruijn, meant to
    assist in measuring latencies going into/out-of the packet
    scheduler, latency from TCP data transmission to ACK, etc"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
    cxgb4 : Disable recursive mailbox commands when enabling vi
    net: reduce USB network driver config options.
    tg3: Modify tg3_tso_bug() to handle multiple TX rings
    amd-xgbe: Perform phy connect/disconnect at dev open/stop
    amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
    net: sun4i-emac: fix memory leak on bad packet
    sctp: fix possible seqlock seadlock in sctp_packet_transmit()
    Revert "net: phy: Set the driver when registering an MDIO bus device"
    cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
    team: Simplify return path of team_newlink
    bridge: Update outdated comment on promiscuous mode
    net-timestamp: ACK timestamp for bytestreams
    net-timestamp: TCP timestamping
    net-timestamp: SCHED timestamp on entering packet scheduler
    net-timestamp: add key to disambiguate concurrent datagrams
    net-timestamp: move timestamp flags out of sk_flags
    net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
    cxgb4i : Move stray CPL definitions to cxgb4 driver
    tcp: reduce spurious retransmits due to transient SACK reneging
    qlcnic: Initialize dcbnl_ops before register_netdev
    ...

    Linus Torvalds
     

24 Jul, 2014

1 commit

  • BPF is used in several kernel components. This split creates logical boundary
    between generic eBPF core and the rest

    kernel/bpf/core.c: eBPF interpreter

    net/core/filter.c: classic->eBPF converter, classic verifiers, socket filters

    This patch only moves functions.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

23 Jun, 2014

1 commit


01 Apr, 2014

2 commits

  • Pull x86 LTO changes from Peter Anvin:
    "More infrastructure work in preparation for link-time optimization
    (LTO). Most of these changes is to make sure symbols accessed from
    assembly code are properly marked as visible so the linker doesn't
    remove them.

    My understanding is that the changes to support LTO are still not
    upstream in binutils, but are on the way there. This patchset should
    conclude the x86-specific changes, and remaining patches to actually
    enable LTO will be fed through the Kbuild tree (other than keeping up
    with changes to the x86 code base, of course), although not
    necessarily in this merge window"

    * 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    Kbuild, lto: Handle basic LTO in modpost
    Kbuild, lto: Disable LTO for asm-offsets.c
    Kbuild, lto: Add a gcc-ld script to let run gcc as ld
    Kbuild, lto: add ld-version and ld-ifversion macros
    Kbuild, lto: Drop .number postfixes in modpost
    Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
    lto: Disable LTO for sys_ni
    lto: Handle LTO common symbols in module loader
    lto, workaround: Add workaround for initcall reordering
    lto: Make asmlinkage __visible
    x86, lto: Disable LTO for the x86 VDSO
    initconst, x86: Fix initconst mistake in ts5500 code
    initconst: Fix initconst mistake in dcdbas
    asmlinkage: Make trace_hardirqs_on/off_caller visible
    asmlinkage, x86: Fix 32bit memcpy for LTO
    asmlinkage Make __stack_chk_failed and memcmp visible
    asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
    asmlinkage: Make main_extable_sort_needed visible
    asmlinkage, mutex: Mark __visible
    asmlinkage: Make trace_hardirq visible
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "Bigger changes:

    - sched/idle restructuring: they are WIP preparation for deeper
    integration between the scheduler and idle state selection, by
    Nicolas Pitre.

    - add NUMA scheduling pseudo-interleaving, by Rik van Riel.

    - optimize cgroup context switches, by Peter Zijlstra.

    - RT scheduling enhancements, by Thomas Gleixner.

    The rest is smaller changes, non-urgnt fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
    sched: Clean up the task_hot() function
    sched: Remove double calculation in fix_small_imbalance()
    sched: Fix broken setscheduler()
    sparc64, sched: Remove unused sparc64_multi_core
    sched: Remove unused mc_capable() and smt_capable()
    sched/numa: Move task_numa_free() to __put_task_struct()
    sched/fair: Fix endless loop in idle_balance()
    sched/core: Fix endless loop in pick_next_task()
    sched/fair: Push down check for high priority class task into idle_balance()
    sched/rt: Fix picking RT and DL tasks from empty queue
    trace: Replace hardcoding of 19 with MAX_NICE
    sched: Guarantee task priority in pick_next_task()
    sched/idle: Remove stale old file
    sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
    cpuidle/arm64: Remove redundant cpuidle_idle_call()
    cpuidle/powernv: Remove redundant cpuidle_idle_call()
    sched, nohz: Exclude isolated cores from load balancing
    sched: Fix select_task_rq_fair() description comments
    workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    ...

    Linus Torvalds
     

24 Feb, 2014

1 commit

  • Because rcu_torture_random() will be used by the locking equivalent to
    rcutorture, pull it out into its own module. This new module cannot
    be separately configured, instead, use the Kconfig "select" statement
    from the Kconfig options of tests depending on it.

    Suggested-by: Rusty Russell
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

14 Feb, 2014

1 commit

  • The assembler alias code in cond_syscall does not work
    when compiled for LTO. Just disable LTO for that file.

    Signed-off-by: Andi Kleen
    Link: http://lkml.kernel.org/r/1391846481-31491-6-git-send-email-ak@linux.intel.com
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     

11 Feb, 2014

1 commit

  • Integration of cpuidle with the scheduler requires that the idle loop be
    closely integrated with the scheduler proper. Moving cpu/idle.c into the
    sched directory will allow for a smoother integration, and eliminate a
    subdirectory which contained only one source file.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/alpine.LFD.2.11.1401301102210.1652@knanqh.ubzr
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     

13 Dec, 2013

2 commits

  • Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David Howells

    Kirill Tkhai
     
  • Fix the gathering of certificates from both the source tree and the build tree
    to correctly calculate the pathnames of all the certificates.

    The problem was that if the default generated cert, signing_key.x509, didn't
    exist then it would not have a path attached and if it did, it would have a
    path attached.

    This means that the contents of kernel/.x509.list would change between the
    first compilation in a directory and the second. After the second it would
    remain stable because the signing_key.x509 file exists.

    The consequence was that the kernel would get relinked unconditionally on the
    second recompilation. The second recompilation would also show something like
    this:

    X.509 certificate list changed
    CERTS kernel/x509_certificate_list
    - Including cert /home/torvalds/v2.6/linux/signing_key.x509
    AS kernel/system_certificates.o
    LD kernel/built-in.o

    which is why the relink would happen.

    Unfortunately, it isn't a simple matter of just sticking a path on the front
    of the filename of the certificate in the build directory as make can't then
    work out how to build it.

    So the path has to be prepended to the name for sorting and duplicate
    elimination and then removed for the make rule if it is in the build tree.

    Reported-by: Linus Torvalds
    Signed-off-by: David Howells

    David Howells