23 Mar, 2016

9 commits

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is function of syscall inputs.
    To achieve this goal it does not collect coverage in soft/hard
    interrupts and instrumentation of some inherently non-deterministic or
    non-interesting parts of kernel is disbled (e.g. scheduler, locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on kernel side. The complimentary
    compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov. Typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen of basic blocks (e.g. an invalid
    input). In such context gcov becomes prohibitively expensive as
    reset/collect coverage steps depend on total number of basic
    blocks/edges in program (in case of kernel it is about 2M). Cost of
    kcov depends only on number of executed basic blocks/edges. On top of
    that, kernel requires per-thread coverage because there are always
    background threads and unrelated processes that also produce coverage.
    With inlined gcov instrumentation per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • A couple of functions and variables in the profile implementation are
    used only on SMP systems by the procfs code, but are unused if either
    procfs is disabled or in uniprocessor kernels. gcc prints a harmless
    warning about the unused symbols:

    kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_flip_buffers(void)
    ^
    kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_discard_flip_buffers(void)
    ^
    kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
    static int profile_cpu_callback(struct notifier_block *info,
    ^

    This adds further #ifdef to the file, to annotate exactly in which cases
    they are used. I have done several thousand ARM randconfig kernels with
    this patch applied and no longer get any warnings in this file.

    Signed-off-by: Arnd Bergmann
    Cc: Vlastimil Babka
    Cc: Robin Holt
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Commit 1717f2096b54 ("panic, x86: Fix re-entrance problem due to panic
    on NMI") and commit 58c5661f2144 ("panic, x86: Allow CPUs to save
    registers even if looping in NMI context") introduced nmi_panic() which
    prevents concurrent/recursive execution of panic(). It also saves
    registers for the crash dump on x86.

    However, there are some cases where NMI handlers still use panic().
    This patch set partially replaces them with nmi_panic() in those cases.

    Even this patchset is applied, some NMI or similar handlers (e.g. MCE
    handler) continue to use panic(). This is because I can't test them
    well and actual problems won't happen. For example, the possibility
    that normal panic and panic on MCE happen simultaneously is very low.

    This patch (of 3):

    Convert nmi_panic() to a proper function and export it instead of
    exporting internal implementation details to modules, for obvious
    reasons.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Borislav Petkov
    Acked-by: Michal Nazarewicz
    Cc: Michal Hocko
    Cc: Rasmus Villemoes
    Cc: Nicolas Iooss
    Cc: Javi Merino
    Cc: Gobinda Charan Maji
    Cc: "Steven Rostedt (Red Hat)"
    Cc: Thomas Gleixner
    Cc: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • This test-case (simplified version of generated by syzkaller)

    #include
    #include
    #include

    void test(void)
    {
    for (;;) {
    if (fork()) {
    wait(NULL);
    continue;
    }

    ptrace(PTRACE_SEIZE, getppid(), 0, 0);
    ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
    _exit(0);
    }
    }

    int main(void)
    {
    int np;

    for (np = 0; np < 8; ++np)
    if (!fork())
    test();

    while (wait(NULL) > 0)
    ;
    return 0;
    }

    triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The
    problem is that __ptrace_unlink() clears task->jobctl under siglock but
    task->ptrace is cleared without this lock held; this fools the "else"
    branch which assumes that !PT_SEIZED means PT_PTRACED.

    Note also that most of other PTRACE_SEIZE checks can race with detach
    from the exiting tracer too. Say, the callers of ptrace_trap_notify()
    assume that SEIZED can't go away after it was checked.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Tejun Heo
    Cc: syzkaller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Except on SPARC, this is what the code always did. SPARC compat seccomp
    was buggy, although the impact of the bug was limited because SPARC
    32-bit and 64-bit syscall numbers are the same.

    Signed-off-by: Andy Lutomirski
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo
    translation should check ptrace() ABI, not caller task ABI.

    This is an ABI change on SPARC. Let's hope that no one relied on the
    old buggy ABI.

    Signed-off-by: Andy Lutomirski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Seccomp wants to know the syscall bitness, not the caller task bitness,
    when it selects the syscall whitelist.

    As far as I know, this makes no difference on any architecture, so it's
    not a security problem. (It generates identical code everywhere except
    sparc, and, on sparc, the syscall numbering is the same for both ABIs.)

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • When new timeout is written to /proc/sys/kernel/hung_task_timeout_secs,
    khungtaskd is interrupted and again sleeps for full timeout duration.

    This means that hang task will not be checked if new timeout is written
    periodically within old timeout duration and/or checking of hang task
    will be delayed for up to previous timeout duration. Fix this by
    remembering last time khungtaskd checked hang task.

    This change will allow other watchdog tasks (if any) to share khungtaskd
    by sleeping for minimal timeout diff of all watchdog tasks. Doing more
    watchdog tasks from khungtaskd will reduce the possibility of printk()
    collisions by multiple watchdog threads.

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

21 Mar, 2016

2 commits

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     
  • Pull 'objtool' stack frame validation from Ingo Molnar:
    "This tree adds a new kernel build-time object file validation feature
    (ONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
    It was written by and is maintained by Josh Poimboeuf.

    The motivation: there's a category of hard to find kernel bugs, most
    of them in assembly code (but also occasionally in C code), that
    degrades the quality of kernel stack dumps/backtraces. These bugs are
    hard to detect at the source code level. Such bugs result in
    incorrect/incomplete backtraces most of time - but can also in some
    rare cases result in crashes or other undefined behavior.

    The build time correctness checking is done via the new 'objtool'
    user-space utility that was written for this purpose and which is
    hosted in the kernel repository in tools/objtool/. The tool's (very
    simple) UI and source code design is shaped after Git and perf and
    shares quite a bit of infrastructure with tools/perf (which tooling
    infrastructure sharing effort got merged via perf and is already
    upstream). Objtool follows the well-known kernel coding style.

    Objtool does not try to check .c or .S files, it instead analyzes the
    resulting .o generated machine code from first principles: it decodes
    the instruction stream and interprets it. (Right now objtool supports
    the x86-64 architecture.)

    From tools/objtool/Documentation/stack-validation.txt:

    "The kernel CONFIG_STACK_VALIDATION option enables a host tool named
    objtool which runs at compile time. It has a "check" subcommand
    which analyzes every .o file and ensures the validity of its stack
    metadata. It enforces a set of rules on asm code and C inline
    assembly code so that stack traces can be reliable.

    Currently it only checks frame pointer usage, but there are plans to
    add CFI validation for C files and CFI generation for asm files.

    For each function, it recursively follows all possible code paths
    and validates the correct frame pointer state at each instruction.

    It also follows code paths involving special sections, like
    .altinstructions, __jump_table, and __ex_table, which can add
    alternative execution paths to a given instruction (or set of
    instructions). Similarly, it knows how to follow switch statements,
    for which gcc sometimes uses jump tables."

    When this new kernel option is enabled (it's disabled by default), the
    tool, if it finds any suspicious assembly code pattern, outputs
    warnings in compiler warning format:

    warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
    warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
    warning: objtool:__schedule()+0x3c0: duplicate frame pointer save
    warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer

    ... so that scripts that pick up compiler warnings will notice them.
    All known warnings triggered by the tool are fixed by the tree, most
    of the commits in fact prepare the kernel to be warning-free. Most of
    them are bugfixes or cleanups that stand on their own, but there are
    also some annotations of 'special' stack frames for justified cases
    such entries to JIT-ed code (BPF) or really special boot time code.

    There are two other long-term motivations behind this tool as well:

    - To improve the quality and reliability of kernel stack frames, so
    that they can be used for optimized live patching.

    - To create independent infrastructure to check the correctness of
    CFI stack frames at build time. CFI debuginfo is notoriously
    unreliable and we cannot use it in the kernel as-is without extra
    checking done both on the kernel side and on the build side.

    The quality of kernel stack frames matters to debuggability as well,
    so IMO we can merge this without having to consider the live patching
    or CFI debuginfo angle"

    * 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    objtool: Only print one warning per function
    objtool: Add several performance improvements
    tools: Copy hashtable.h into tools directory
    objtool: Fix false positive warnings for functions with multiple switch statements
    objtool: Rename some variables and functions
    objtool: Remove superflous INIT_LIST_HEAD
    objtool: Add helper macros for traversing instructions
    objtool: Fix false positive warnings related to sibling calls
    objtool: Compile with debugging symbols
    objtool: Detect infinite recursion
    objtool: Prevent infinite recursion in noreturn detection
    objtool: Detect and warn if libelf is missing and don't break the build
    tools: Support relative directory path for 'O='
    objtool: Support CROSS_COMPILE
    x86/asm/decoder: Use explicitly signed chars
    objtool: Enable stack metadata validation on 64-bit x86
    objtool: Add CONFIG_STACK_VALIDATION option
    objtool: Add tool to perform compile-time stack metadata validation
    x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
    sched: Always inline context_switch()
    ...

    Linus Torvalds
     

20 Mar, 2016

2 commits

  • Pull audit updates from Paul Moore:
    "A small set of patches for audit this time; just three in total and
    one is a spelling fix.

    The two patches with actual content are designed to help prevent new
    instances of auditd from displacing an existing, functioning auditd
    and to generate a log of the attempt. Not to worry, dead/stuck auditd
    instances can still be replaced by a new instance without problem.

    Nothing controversial, and everything passes our regression suite"

    * 'stable-4.6' of git://git.infradead.org/users/pcmoore/audit:
    audit: Fix typo in comment
    audit: log failed attempts to change audit_pid configuration
    audit: stop an old auditd being starved out by a new auditd

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and arrap maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a messaged based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address less for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
     

19 Mar, 2016

3 commits

  • Pull cgroup updates from Tejun Heo:
    "cgroup changes for v4.6-rc1. No userland visible behavior changes in
    this pull request. I'll send out a separate pull request for the
    addition of cgroup namespace support.

    - The biggest change is the revamping of cgroup core task migration
    and controller handling logic. There are quite a few places where
    controllers and tasks are manipulated. Previously, many of those
    places implemented custom operations for each specific use case
    assuming specific starting conditions. While this worked, it makes
    the code fragile and difficult to follow.

    The bulk of this pull request restructures these operations so that
    most related operations are performed through common helpers which
    implement recursive (subtrees are always processed consistently)
    and idempotent (they make cgroup hierarchy converge to the target
    state rather than performing operations assuming specific starting
    conditions). This makes the code a lot easier to understand,
    verify and extend.

    - Implicit controller support is added. This is primarily for using
    perf_event on the v2 hierarchy so that perf can match cgroup v2
    path without requiring the user to do anything special. The kernel
    portion of perf_event changes is acked but userland changes are
    still pending review.

    - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
    certain environments.

    - There is a regression introduced during v4.4 devel cycle where
    attempts to migrate zombie tasks can mess up internal object
    management. This was fixed earlier this week and included in this
    pull request w/ stable cc'd.

    - Misc non-critical fixes and improvements"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
    cgroup: avoid false positive gcc-6 warning
    cgroup: ignore css_sets associated with dead cgroups during migration
    Documentation: cgroup v2: Trivial heading correction.
    cgroup: implement cgroup_subsys->implicit_on_dfl
    cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
    cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
    cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
    cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
    cgroup: Trivial correction to reflect controller.
    cgroup: remove stale item in cgroup-v1 document INDEX file.
    cgroup: update css iteration in cgroup_update_dfl_csses()
    cgroup: allocate 2x cgrp_cset_links when setting up a new root
    cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
    cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
    cgroup: use cgroup_apply_enable_control() in cgroup creation path
    cgroup: combine cgroup_mutex locking and offline css draining
    cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
    cgroup: introduce cgroup_{save|propagate|restore}_control()
    cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
    cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "Three trivial workqueue changes"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Fix comment for work_on_cpu()
    sched/core: Get rid of 'cpu' argument in wq_worker_sleeping()
    workqueue: Replace usage of init_name with dev_set_name()

    Linus Torvalds
     
  • Merge second patch-bomb from Andrew Morton:

    - a couple of hotfixes

    - the rest of MM

    - a new timer slack control in procfs

    - a couple of procfs fixes

    - a few misc things

    - some printk tweaks

    - lib/ updates, notably to radix-tree.

    - add my and Nick Piggin's old userspace radix-tree test harness to
    tools/testing/radix-tree/. Matthew said it was a godsend during the
    radix-tree work he did.

    - a few code-size improvements, switching to __always_inline where gcc
    screwed up.

    - partially implement character sets in sscanf

    * emailed patches from Andrew Morton : (118 commits)
    sscanf: implement basic character sets
    lib/bug.c: use common WARN helper
    param: convert some "on"/"off" users to strtobool
    lib: add "on"/"off" support to kstrtobool
    lib: update single-char callers of strtobool()
    lib: move strtobool() to kstrtobool()
    include/linux/unaligned: force inlining of byteswap operations
    include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
    include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
    usb: common: convert to use match_string() helper
    ide: hpt366: convert to use match_string() helper
    ata: hpt366: convert to use match_string() helper
    power: ab8500: convert to use match_string() helper
    power: charger_manager: convert to use match_string() helper
    drm/edid: convert to use match_string() helper
    pinctrl: convert to use match_string() helper
    device property: convert to use match_string() helper
    lib/string: introduce match_string() helper
    radix-tree tests: add test for radix_tree_iter_next
    radix-tree tests: add regression3 test
    ...

    Linus Torvalds
     

18 Mar, 2016

14 commits

  • Pull livepatching update from Jiri Kosina:

    - cleanup of module notifiers; this depends on a module.c cleanup which
    has been acked by Rusty; from Jessica Yu

    - small assorted fixes and MAINTAINERS update

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch/module: remove livepatch module notifier
    modules: split part of complete_formation() into prepare_coming_module()
    livepatch: Update maintainers
    livepatch: Fix the error message about unresolvable ambiguity
    klp: remove CONFIG_LIVEPATCH dependency from klp headers
    klp: remove superfluous errors in asm/livepatch.h

    Linus Torvalds
     
  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    drivers/rtc: broken link fix
    drm/i915 Fix typos in i915_gem_fence.c
    Docs: fix missing word in REPORTING-BUGS
    lib+mm: fix few spelling mistakes
    MAINTAINERS: add git URL for APM driver
    treewide: Fix typo in printk

    Linus Torvalds
     
  • The traceoff_on_warning option doesn't have any effect on s390, powerpc,
    arm64, parisc, and sh because there are two different types of WARN
    implementations:

    1) The above mentioned architectures treat WARN() as a special case of a
    BUG() exception. They handle warnings in report_bug() in lib/bug.c.

    2) All other architectures just call warn_slowpath_*() directly. Their
    warnings are handled in warn_slowpath_common() in kernel/panic.c.

    Support traceoff_on_warning on all architectures and prevent any future
    divergence by using a single common function to emit the warning.

    Also remove the '()' from '%pS()', because the parentheses look funky:

    [ 45.607629] WARNING: at /root/warn_mod/warn_mod.c:17 .init_dummy+0x20/0x40 [warn_mod]()

    Reported-by: Chunyu Hu
    Signed-off-by: Josh Poimboeuf
    Acked-by: Heiko Carstens
    Tested-by: Prarit Bhargava
    Acked-by: Prarit Bhargava
    Acked-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Poimboeuf
     
  • This changes several users of manual "on"/"off" parsing to use
    strtobool.

    Some side-effects:
    - these uses will now parse y/n/1/0 meaningfully too
    - the early_param uses will now bubble up parse errors

    Signed-off-by: Kees Cook
    Acked-by: Heiko Carstens
    Acked-by: Michael Ellerman
    Cc: Amitkumar Karwar
    Cc: Andy Shevchenko
    Cc: Daniel Borkmann
    Cc: Joe Perches
    Cc: Kalle Valo
    Cc: Martin Schwidefsky
    Cc: Nishant Sarmukadam
    Cc: Rasmus Villemoes
    Cc: Steve French
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • This allows us to extract from the vmcore only the messages emitted
    since the last time the ring buffer was cleared. We just have to make
    sure its value is always up-to-date, when old messages are discarded to
    free space in log_make_free_space() for example.

    Signed-off-by: Zeyu Zhao
    Signed-off-by: Ivan Delalande
    Cc: Kay Sievers
    Cc: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ivan Delalande
     
  • have_callable_console() must also test CON_ENABLED bit, not just
    CON_ANYTIME. We may have disabled CON_ANYTIME console so printk can
    wrongly assume that it's safe to call_console_drivers().

    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Petr Mladek
    Cc: Jan Kara
    Cc: Tejun Heo
    Cc: Kyle McMartin
    Cc: Dave Jones
    Cc: Calvin Owens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • console_unlock() allows to cond_resched() if its caller has set
    `console_may_schedule' to 1, since 8d91f8b15361 ("printk: do
    cond_resched() between lines while outputting to consoles").

    The rules are:
    -- console_lock() always sets `console_may_schedule' to 1
    -- console_trylock() always sets `console_may_schedule' to 0

    However, console_trylock() callers (among them is printk()) do not
    always call printk() from atomic contexts, and some of them can
    cond_resched() in console_unlock(), so console_trylock() can set
    `console_may_schedule' to 1 for such processes.

    For !CONFIG_PREEMPT_COUNT kernels, however, console_trylock() always
    sets `console_may_schedule' to 0.

    It's possible to drop explicit preempt_disable()/preempt_enable() in
    vprintk_emit(), because console_unlock() and console_trylock() are now
    smart enough:
    a) console_unlock() does not cond_resched() when it's unsafe
    (console_trylock() takes care of that)
    b) console_unlock() does can_use_console() check.

    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Petr Mladek
    Cc: Jan Kara
    Cc: Tejun Heo
    Cc: Kyle McMartin
    Cc: Dave Jones
    Cc: Calvin Owens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • console_unlock() allows to cond_resched() if its caller has set
    `console_may_schedule' to 1 (this functionality is present since
    8d91f8b15361 ("printk: do cond_resched() between lines while outputting
    to consoles").

    The rules are:
    -- console_lock() always sets `console_may_schedule' to 1
    -- console_trylock() always sets `console_may_schedule' to 0

    printk() calls console_unlock() with preemption desabled, which
    basically can lead to RCU stalls, watchdog soft lockups, etc. if
    something is simultaneously calling printk() frequent enough (IOW,
    console_sem owner always has new data to send to console divers and
    can't leave console_unlock() for a long time).

    printk()->console_trylock() callers do not necessarily execute in atomic
    contexts, and some of them can cond_resched() in console_unlock().
    console_trylock() can set `console_may_schedule' to 1 (allow
    cond_resched() later in consoe_unlock()) when it's safe.

    This patch (of 3):

    vprintk_emit() disables preemption around console_trylock_for_printk()
    and console_unlock() calls for a strong reason -- can_use_console()
    check. The thing is that vprintl_emit() can be called on a CPU that is
    not fully brought up yet (!cpu_online()), which potentially can cause
    problems if console driver wants to access per-cpu data. A console
    driver can explicitly state that it's safe to call it from !online cpu
    by setting CON_ANYTIME bit in console ->flags. That's why for
    !cpu_online() can_use_console() iterates all the console to find out if
    there is a CON_ANYTIME console, otherwise console_unlock() must be
    avoided.

    can_use_console() ensures that console_unlock() call is safe in
    vprintk_emit() only; console_lock() and console_trylock() are not
    covered by this check. Even though call_console_drivers(), invoked from
    console_cont_flush() and console_unlock(), tests `!cpu_online() &&
    CON_ANYTIME' for_each_console(), it may be too late, which can result in
    messages loss.

    Assume that we have 2 cpus -- CPU0 is online, CPU1 is !online, and no
    CON_ANYTIME consoles available.

    CPU0 online CPU1 !online
    console_trylock()
    ...
    console_unlock()
    console_cont_flush
    spin_lock logbuf_lock
    if (!cont.len) {
    spin_unlock logbuf_lock
    return
    }
    for (;;) {
    vprintk_emit
    spin_lock logbuf_lock
    log_store
    spin_unlock logbuf_lock
    spin_lock logbuf_lock
    !console_trylock_for_printk msg_print_text
    return console_idx = log_next()
    console_seq++
    console_prev = msg->flags
    spin_unlock logbuf_lock

    call_console_drivers()
    for_each_console(con) {
    if (!cpu_online() &&
    !(con->flags & CON_ANYTIME))
    continue;
    }
    /*
    * no message printed, we lost it
    */
    vprintk_emit
    spin_lock logbuf_lock
    log_store
    spin_unlock logbuf_lock
    !console_trylock_for_printk
    return
    /*
    * go to the beginning of the loop,
    * find out there are new messages,
    * lose it
    */
    }

    console_trylock()/console_lock() call on CPU1 may come from cpu
    notifiers registered on that CPU. Since notifiers are not getting
    unregistered when CPU is going DOWN, all of the notifiers receive
    notifications during CPU UP. For example, on my x86_64, I see around 50
    notification sent from offline CPU to itself

    [swapper/2] from cpu:2 to:2 action:CPU_STARTING hotplug_hrtick
    [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_main_cpu_notify
    [swapper/2] from cpu:2 to:2 action:CPU_STARTING blk_mq_queue_reinit_notify
    [swapper/2] from cpu:2 to:2 action:CPU_STARTING console_cpu_notify

    while doing
    echo 0 > /sys/devices/system/cpu/cpu2/online
    echo 1 > /sys/devices/system/cpu/cpu2/online

    So grabbing the console_sem lock while CPU is !online is possible,
    in theory.

    This patch moves can_use_console() check out of
    console_trylock_for_printk(). Instead it calls it in console_unlock(),
    so now console_lock()/console_unlock() are also 'protected' by
    can_use_console(). This also means that console_trylock_for_printk() is
    not really needed anymore and can be removed.

    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Petr Mladek
    Cc: Jan Kara
    Cc: Tejun Heo
    Cc: Kyle McMartin
    Cc: Dave Jones
    Cc: Calvin Owens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • This patchset introduces a /proc//timerslack_ns interface which
    would allow controlling processes to be able to set the timerslack value
    on other processes in order to save power by avoiding wakeups (Something
    Android currently does via out-of-tree patches).

    The first patch tries to fix the internal timer_slack_ns usage which was
    defined as a long, which limits the slack range to ~4 seconds on 32bit
    systems. It converts it to a u64, which provides the same basically
    unlimited slack (500 years) on both 32bit and 64bit machines.

    The second patch introduces the /proc//timerslack_ns interface
    which allows the full 64bit slack range for a task to be read or set on
    both 32bit and 64bit machines.

    With these two patches, on a 32bit machine, after setting the slack on
    bash to 10 seconds:

    $ time sleep 1

    real 0m10.747s
    user 0m0.001s
    sys 0m0.005s

    The first patch is a little ugly, since I had to chase the slack delta
    arguments through a number of functions converting them to u64s. Let me
    know if it makes sense to break that up more or not.

    Other than that things are fairly straightforward.

    This patch (of 2):

    The timer_slack_ns value in the task struct is currently a unsigned
    long. This means that on 32bit applications, the maximum slack is just
    over 4 seconds. However, on 64bit machines, its much much larger (~500
    years).

    This disparity could make application development a little (as well as
    the default_slack) to a u64. This means both 32bit and 64bit systems
    have the same effective internal slack range.

    Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
    the interface as a unsigned long, so we preserve that limitation on
    32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
    long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
    actually larger then what can be stored by an unsigned long.

    This patch also modifies hrtimer functions which specified the slack
    delta as a unsigned long.

    Signed-off-by: John Stultz
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Kees Cook
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Show how much memory is allocated to kernel stacks.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • While working on a script to restore all sysctl params before a series of
    tests I found that writing any value into the
    /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
    causes them to call proc_watchdog_update().

    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

    There doesn't appear to be a reason for doing this work every time a write
    occurs, so only do it when the values change.

    Signed-off-by: Josh Hunt
    Acked-by: Don Zickus
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Cc: [4.1.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joshua Hunt
     
  • Pull tty/serial updates from Greg KH:
    "Here's the big tty/serial driver pull request for 4.6-rc1.

    Lots of changes in here, Peter has been on a tear again, with lots of
    refactoring and bugs fixes, many thanks to the great work he has been
    doing. Lots of driver updates and fixes as well, full details in the
    shortlog.

    All have been in linux-next for a while with no reported issues"

    * tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (220 commits)
    serial: 8250: describe CONFIG_SERIAL_8250_RSA
    serial: samsung: optimize UART rx fifo access routine
    serial: pl011: add mark/space parity support
    serial: sa1100: make sa1100_register_uart_fns a function
    tty: serial: 8250: add MOXA Smartio MUE boards support
    serial: 8250: convert drivers to use up_to_u8250p()
    serial: 8250/mediatek: fix building with SERIAL_8250=m
    serial: 8250/ingenic: fix building with SERIAL_8250=m
    serial: 8250/uniphier: fix modular build
    Revert "drivers/tty/serial: make 8250/8250_ingenic.c explicitly non-modular"
    Revert "drivers/tty/serial: make 8250/8250_mtk.c explicitly non-modular"
    serial: mvebu-uart: initial support for Armada-3700 serial port
    serial: mctrl_gpio: Add missing module license
    serial: ifx6x60: avoid uninitialized variable use
    tty/serial: at91: fix bad offset for UART timeout register
    tty/serial: at91: restore dynamic driver binding
    serial: 8250: Add hardware dependency to RT288X option
    TTY, devpts: document pty count limiting
    tty: goldfish: support platform_device with id -1
    drivers: tty: goldfish: Add device tree bindings
    ...

    Linus Torvalds
     
  • Pull security layer updates from James Morris:
    "There are a bunch of fixes to the TPM, IMA, and Keys code, with minor
    fixes scattered across the subsystem.

    IMA now requires signed policy, and that policy is also now measured
    and appraised"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (67 commits)
    X.509: Make algo identifiers text instead of enum
    akcipher: Move the RSA DER encoding check to the crypto layer
    crypto: Add hash param to pkcs1pad
    sign-file: fix build with CMS support disabled
    MAINTAINERS: update tpmdd urls
    MODSIGN: linux/string.h should be #included to get memcpy()
    certs: Fix misaligned data in extra certificate list
    X.509: Handle midnight alternative notation in GeneralizedTime
    X.509: Support leap seconds
    Handle ISO 8601 leap seconds and encodings of midnight in mktime64()
    X.509: Fix leap year handling again
    PKCS#7: fix unitialized boolean 'want'
    firmware: change kernel read fail to dev_dbg()
    KEYS: Use the symbol value for list size, updated by scripts/insert-sys-cert
    KEYS: Reserve an extra certificate symbol for inserting without recompiling
    modsign: hide openssl output in silent builds
    tpm_tis: fix build warning with tpm_tis_resume
    ima: require signed IMA policy
    ima: measure and appraise the IMA policy itself
    ima: load policy using path
    ...

    Linus Torvalds
     

17 Mar, 2016

7 commits

  • Remove the livepatch module notifier in favor of directly enabling and
    disabling patches to modules in the module loader. Hard-coding the
    function calls ensures that ftrace_module_enable() is run before
    klp_module_coming() during module load, and that klp_module_going() is
    run before ftrace_release_mod() during module unload. This way, ftrace
    and livepatch code is run in the correct order during the module
    load/unload sequence without dependence on the module notifier call chain.

    Signed-off-by: Jessica Yu
    Reviewed-by: Petr Mladek
    Acked-by: Josh Poimboeuf
    Acked-by: Rusty Russell
    Signed-off-by: Jiri Kosina

    Jessica Yu
     
  • Put all actions in complete_formation() that are performed after
    module->state is set to MODULE_STATE_COMING into a separate function
    prepare_coming_module(). This split prepares for the removal of the
    livepatch module notifiers in favor of hard-coding function calls to
    klp_module_{coming,going} in the module loader.

    The complete_formation -> prepare_coming_module split will also make error
    handling easier since we can jump to the appropriate error label to do any
    module GOING cleanup after all the COMING-actions have completed.

    Signed-off-by: Jessica Yu
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Petr Mladek
    Acked-by: Rusty Russell
    Signed-off-by: Jiri Kosina

    Jessica Yu
     
  • Pull libnvdimm updates from Dan Williams:

    - Asynchronous address range scrub:

    Given the capacities of next generation persistent memory devices a
    scrub operation to find all poison may take 10s of seconds. We
    want this scrub work to be done asynchronously with the rest of
    system initialization, so we move it out of line from the NFIT
    probing, i.e. acpi_nfit_add().

    - Clear poison:

    ACPI 6.1 introduces the ability to send "clear error" commands to
    the ACPI0012:00 device representing the root of an "nvdimm bus".
    Similar to relocating a bad block on a disk, this support clears
    media errors in response to a write.

    - Persistent memory resource tracking:

    A persistent memory range may be designated as simply "reserved" by
    platform firmware in the efi/e820 memory map. Later when the NFIT
    driver loads it discovers that the range is "Persistent Memory".

    The NFIT bus driver inserts a resource to advertise that
    "persistent" attribute in the system resource tree for /proc/iomem
    and kernel-internal usages.

    - Miscellaneous cleanups and fixes:

    Workaround section misaligned pmem ranges when allocating a struct
    page memmap, fix handling of the read-only case in the ioctl path,
    and clean up block device major number allocation.

    * tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, pmem: clear poison on write
    libnvdimm, pmem: fix kmap_atomic() leak in error path
    nvdimm/btt: don't allocate unused major device number
    nvdimm/blk: don't allocate unused major device number
    pmem: don't allocate unused major device number
    ACPI: Change NFIT driver to insert new resource
    resource: Export insert_resource and remove_resource
    resource: Add remove_resource interface
    resource: Change __request_region to inherit from immediate parent
    libnvdimm, pmem: fix ia64 build, use PHYS_PFN
    nfit, libnvdimm: clear poison command support
    libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices
    libnvdimm, pmem: adjust for section collisions with 'System RAM'
    libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
    libnvdimm: Fix security issue with DSM IOCTL.
    libnvdimm: Clean-up access mode check.
    tools/testing/nvdimm: expand ars unit testing
    nfit: disable userspace initiated ars during scrub
    nfit: scrub and register regions in a workqueue
    nfit, libnvdimm: async region scrub workqueue
    ...

    Linus Torvalds
     
  • Pull power management and ACPI updates from Rafael Wysocki:
    "This time the majority of changes go into cpufreq and they are
    significant.

    First off, the way CPU frequency updates are triggered is different
    now. Instead of having to set up and manage a deferrable timer for
    each CPU in the system to evaluate and possibly change its frequency
    periodically, cpufreq governors set up callbacks to be invoked by the
    scheduler on a regular basis (basically on utilization updates). The
    "old" governors, "ondemand" and "conservative", still do all of their
    work in process context (although that is triggered by the scheduler
    now), but intel_pstate does it all in the callback invoked by the
    scheduler with no need for any additional asynchronous processing.

    Of course, this eliminates the overhead related to the management of
    all those timers, but also it allows the cpufreq governor code to be
    simplified quite a bit. On top of that, the common code and data
    structures used by the "ondemand" and "conservative" governors are
    cleaned up and made more straightforward and some long-standing and
    quite annoying problems are addressed. In particular, the handling of
    governor sysfs attributes is modified and the related locking becomes
    more fine grained which allows some concurrency problems to be avoided
    (particularly deadlocks with the core cpufreq code).

    In principle, the new mechanism for triggering frequency updates
    allows utilization information to be passed from the scheduler to
    cpufreq. Although the current code doesn't make use of it, in the
    works is a new cpufreq governor that will make decisions based on the
    scheduler's utilization data. That should allow the scheduler and
    cpufreq to work more closely together in the long run.

    In addition to the core and governor changes, cpufreq drivers are
    updated too. Fixes and optimizations go into intel_pstate, the
    cpufreq-dt driver is updated on top of some modification in the
    Operating Performance Points (OPP) framework and there are fixes and
    other updates in the powernv cpufreq driver.

    Apart from the cpufreq updates there is some new ACPICA material,
    including a fix for a problem introduced by previous ACPICA updates,
    and some less significant changes in the ACPI code, like CPPC code
    optimizations, ACPI processor driver cleanups and support for loading
    ACPI tables from initrd.

    Also updated are the generic power domains framework, the Intel RAPL
    power capping driver and the turbostat utility and we have a bunch of
    traditional assorted fixes and cleanups.

    Specifics:

    - Redesign of cpufreq governors and the intel_pstate driver to make
    them use callbacks invoked by the scheduler to trigger CPU
    frequency evaluation instead of using per-CPU deferrable timers for
    that purpose (Rafael Wysocki).

    - Reorganization and cleanup of cpufreq governor code to make it more
    straightforward and fix some concurrency problems in it (Rafael
    Wysocki, Viresh Kumar).

    - Cleanup and improvements of locking in the cpufreq core (Viresh
    Kumar).

    - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
    Kumar, Eric Biggers).

    - intel_pstate driver updates including fixes, optimizations and a
    modification to make it enable enable hardware-coordinated P-state
    selection (HWP) by default if supported by the processor (Philippe
    Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
    Franciosi).

    - Operating Performance Points (OPP) framework updates to improve its
    handling of voltage regulators and device clocks and updates of the
    cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).

    - Updates of the powernv cpufreq driver to fix initialization and
    cleanup problems in it and correct its worker thread handling with
    respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
    Bhat).

    - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).

    - ACPICA updates including one fix for a regression introduced by
    previos changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
    Colin Ian King).

    - Support for installing ACPI tables from initrd (Lv Zheng).

    - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
    Chaugule).

    - Support for _HID(ACPI0010) devices (ACPI processor containers) and
    ACPI processor driver cleanups (Sudeep Holla).

    - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
    Aleksey Makarov).

    - Modification of the ACPI PCI IRQ management code to make it treat
    255 in the Interrupt Line register as "not connected" on x86 (as
    per the specification) and avoid attempts to use that value as a
    valid interrupt vector (Chen Fan).

    - ACPI APEI fixes related to resource leaks (Josh Hunt).

    - Removal of modularity from a few ACPI drivers (BGRT, GHES,
    intel_pmic_crc) that cannot be built as modules in practice (Paul
    Gortmaker).

    - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
    as a valid resource type (Harb Abdulhamid).

    - New device ID (future AMD I2C controller) in the ACPI driver for
    AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).

    - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).

    - cpuidle menu governor optimization to avoid a square root
    computation in it (Rasmus Villemoes).

    - Fix for potential use-after-free in the generic device properties
    framework (Heikki Krogerus).

    - Updates of the generic power domains (genpd) framework including
    support for multiple power states of a domain, fixes and debugfs
    output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
    Geert Uytterhoeven).

    - Intel RAPL power capping driver updates to reduce IPI overhead in
    it (Jacob Pan).

    - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
    Sengar).

    - Year 2038 fix for the process freezer (Abhilash Jindal).

    - turbostat utility updates including new features (decoding of more
    registers and CPUID fields, sub-second intervals support, GFX MHz
    and RC6 printout, --out command line option), fixes (syscall jitter
    detection and workaround, reductioin of the number of syscalls
    made, fixes related to Xeon x200 processors, compiler warning
    fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"

    * tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
    tools/power turbostat: bugfix: TDP MSRs print bits fixing
    tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
    tools/power turbostat: call __cpuid() instead of __get_cpuid()
    tools/power turbostat: indicate SMX and SGX support
    tools/power turbostat: detect and work around syscall jitter
    tools/power turbostat: show GFX%rc6
    tools/power turbostat: show GFXMHz
    tools/power turbostat: show IRQs per CPU
    tools/power turbostat: make fewer systems calls
    tools/power turbostat: fix compiler warnings
    tools/power turbostat: add --out option for saving output in a file
    tools/power turbostat: re-name "%Busy" field to "Busy%"
    tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
    tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
    tools/power turbostat: allow sub-sec intervals
    ACPI / APEI: ERST: Fixed leaked resources in erst_init
    ACPI / APEI: Fix leaked resources
    intel_pstate: Do not skip samples partially
    intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
    intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
    ...

    Linus Torvalds
     
  • When all subsystems are disabled, gcc notices that cgroup_subsys_enabled_key
    is a zero-length array and that any access to it must be out of bounds:

    In file included from ../include/linux/cgroup.h:19:0,
    from ../kernel/cgroup.c:31:
    ../kernel/cgroup.c: In function 'cgroup_add_cftypes':
    ../kernel/cgroup.c:261:53: error: array subscript is above array bounds [-Werror=array-bounds]
    return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
    ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
    ../include/linux/jump_label.h:271:40: note: in definition of macro 'static_key_enabled'
    static_key_count((struct static_key *)x) > 0; \
    ^

    We should never call the function in this particular case, so this is
    not a bug. In order to silence the warning, this adds an explicit check
    for the CGROUP_SUBSYS_COUNT==0 case.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Tejun Heo

    Arnd Bergmann
     
  • Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
    original cgroups"), all dead tasks were associated with init_css_set.
    If a zombie task is requested for migration, while migration prep
    operations would still be performed on init_css_set, the actual
    migration would ignore zombie tasks. As init_css_set is always valid,
    this worked fine.

    However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
    associated with at the time of death. Let's say a task T associated
    with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
    becomes a zombie, it would still remain associated with A and B. If A
    only contains zombie tasks, it can be removed. On removal, A gets
    marked offline but stays pinned until all zombies are drained. At
    this point, if migration is initiated on T to a cgroup C on hierarchy
    H-2, migration path would try to prepare T's css_set for migration and
    trigger the following.

    WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
    CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
    ...
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x78/0xb0
    [] warn_slowpath_null+0x15/0x20
    [] cgroup_get+0x121/0x160
    [] link_css_set+0x7b/0x90
    [] find_css_set+0x3bc/0x5e0
    [] cgroup_migrate_prepare_dst+0x89/0x1f0
    [] cgroup_attach_task+0x157/0x230
    [] __cgroup_procs_write+0x2b7/0x470
    [] cgroup_tasks_write+0xc/0x10
    [] cgroup_file_write+0x30/0x1b0
    [] kernfs_fop_write+0x13c/0x180
    [] __vfs_write+0x23/0xe0
    [] vfs_write+0xa4/0x1a0
    [] SyS_write+0x44/0xa0
    [] entry_SYSCALL_64_fastpath+0x12/0x6f

    It doesn't make sense to prepare migration for css_sets pointing to
    dead cgroups as they are guaranteed to contain only zombies which are
    ignored later during migration. This patch makes cgroup destruction
    path mark all affected css_sets as dead and updates the migration path
    to ignore them during preparation.

    Signed-off-by: Tejun Heo
    Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
    Cc: stable@vger.kernel.org # v4.4+

    Tejun Heo
     
  • Merge first patch-bomb from Andrew Morton:

    - some misc things

    - ofs2 updates

    - about half of MM

    - checkpatch updates

    - autofs4 update

    * emailed patches from Andrew Morton : (120 commits)
    autofs4: fix string.h include in auto_dev-ioctl.h
    autofs4: use pr_xxx() macros directly for logging
    autofs4: change log print macros to not insert newline
    autofs4: make autofs log prints consistent
    autofs4: fix some white space errors
    autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
    autofs4: fix coding style line length in autofs4_wait()
    autofs4: fix coding style problem in autofs4_get_set_timeout()
    autofs4: coding style fixes
    autofs: show pipe inode in mount options
    kallsyms: add support for relative offsets in kallsyms address table
    kallsyms: don't overload absolute symbol type for percpu symbols
    x86: kallsyms: disable absolute percpu symbols on !SMP
    checkpatch: fix another left brace warning
    checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
    checkpatch: warn on bare unsigned or signed declarations without int
    checkpatch: exclude asm volatile from complex macro check
    mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
    mm: migrate: consolidate mem_cgroup_migrate() calls
    mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
    ...

    Linus Torvalds
     

16 Mar, 2016

2 commits

  • Similar to how relative extables are implemented, it is possible to emit
    the kallsyms table in such a way that it contains offsets relative to
    some anchor point in the kernel image rather than absolute addresses.

    On 64-bit architectures, it cuts the size of the kallsyms address table
    in half, since offsets between kernel symbols can typically be expressed
    in 32 bits. This saves several hundreds of kilobytes of permanent
    .rodata on average. In addition, the kallsyms address table is no
    longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
    effect, so the relocation work done after decompression now doesn't have
    to do relocation updates for all these values. This saves up to 24
    bytes (i.e., the size of a ELF64 RELA relocation table entry) per value,
    which easily adds up to a couple of megabytes of uncompressed __init
    data on ppc64 or arm64. Even if these relocation entries typically
    compress well, the combined size reduction of 2.8 MB uncompressed for a
    ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
    KB space saving in the compressed image.

    Since it is useful for some architectures (like x86) to retain the
    ability to emit absolute values as well, this patch also adds support
    for capturing both absolute and relative values when
    KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
    addresses as positive 32-bit values, and addresses relative to the
    lowest encountered relative symbol as negative values, which are
    subtracted from the runtime address of this base symbol to produce the
    actual address.

    Support for the above is enabled by default for all architectures except
    IA-64 and Tile-GX, whose symbols are too far apart to capture in this
    manner.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • By default, page poisoning uses a poison value (0xaa) on free. If this
    is changed to 0, the page is not only sanitized but zeroing on alloc
    with __GFP_ZERO can be skipped as well. The tradeoff is that detecting
    corruption from the poisoning is harder to detect. This feature also
    cannot be used with hibernation since pages are not guaranteed to be
    zeroed after hibernation.

    Credit to Grsecurity/PaX team for inspiring this work

    Signed-off-by: Laura Abbott
    Acked-by: Rafael J. Wysocki
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Kees Cook
    Cc: Mathias Krause
    Cc: Dave Hansen
    Cc: Jianyu Zhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott