26 Mar, 2016

3 commits

  • KASAN needs to know whether the allocation happens in an IRQ handler.
    This lets us strip everything below the IRQ entry point to reduce the
    number of unique stack traces needed to be stored.

    Move the definition of __irq_entry to so that the
    users don't need to pull in . Also introduce the
    __softirq_entry macro which is similar to __irq_entry, but puts the
    corresponding functions to the .softirqentry.text section.
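
    As a rough illustration of the stripping step described above, here is
    a userspace sketch. The struct text_range type, the function names and
    the idea of passing section bounds as parameters are assumptions made
    for this example; in the kernel the bounds would come from the linker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address range standing in for the .irqentry.text or
 * .softirqentry.text section; the caller supplies the bounds. */
struct text_range {
    uintptr_t start;
    uintptr_t end; /* exclusive */
};

static int in_range(const struct text_range *r, uintptr_t pc)
{
    return pc >= r->start && pc < r->end;
}

/* Truncate a stack trace (innermost frame first) right after the first
 * frame that lies in an IRQ entry section. Returns the new length. */
size_t strip_below_irq_entry(const uintptr_t *trace, size_t len,
                             const struct text_range *irq,
                             const struct text_range *softirq)
{
    assert(trace != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (in_range(irq, trace[i]) || in_range(softirq, trace[i]))
            return i + 1; /* keep the entry frame, drop what it interrupted */
    }
    return len; /* no IRQ entry frame found: keep everything */
}
```

    Traces that differ only in the code that happened to be interrupted
    collapse to the same prefix, which is what reduces the number of unique
    traces to store.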

    Signed-off-by: Alexander Potapenko
    Acked-by: Steven Rostedt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • When oom_reaper manages to unmap all the eligible vmas there shouldn't
    be much of the freeable memory held by the oom victim left anymore so it
    makes sense to clear the TIF_MEMDIE flag for the victim and allow the
    OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending resp. PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore as everything reclaimable has been torn down already.

    This patch will allow us to cap the time an OOM victim can keep TIF_MEMDIE
    and thus hold off further global OOM killer actions granted the oom
    reaper is able to take mmap_sem for the associated mm struct. This is
    not guaranteed now but further steps should make sure that mmap_sem for
    write should be blocked killable which will help to reduce such a lock
    contention. This is not done by this patch.

    Note that exit_oom_victim might be called on a remote task from
    __oom_reap_task now so we have to check and clear the flag atomically
    otherwise we might race and underflow oom_victims or wake up waiters too
    early.
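
    The check-and-clear requirement can be sketched in userspace with C11
    atomics. This is only an illustration of the idea: struct task,
    FLAG_MEMDIE, mark_victim() and exit_victim() are names invented for the
    example, not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_MEMDIE 0x1u /* stand-in for TIF_MEMDIE; the value is illustrative */

struct task {
    atomic_uint flags;
};

static atomic_int oom_victims; /* number of tasks currently marked as victims */

void mark_victim(struct task *t)
{
    unsigned int old = atomic_fetch_or(&t->flags, FLAG_MEMDIE);

    if (!(old & FLAG_MEMDIE))
        atomic_fetch_add(&oom_victims, 1);
}

/* Test and clear the flag in one atomic step, so that two racing callers
 * (the task itself and a remote reaper) cannot both decrement the counter.
 * Returns true if this caller was the one that cleared the flag. */
bool exit_victim(struct task *t)
{
    unsigned int old = atomic_fetch_and(&t->flags, ~FLAG_MEMDIE);

    if (!(old & FLAG_MEMDIE))
        return false; /* someone else already cleared it */

    int left = atomic_fetch_sub(&oom_victims, 1) - 1;
    assert(left >= 0); /* would fire on the underflow described above */
    return true;
}
```

    A separate "is the flag set?" test followed by a later clear would let
    both callers pass the test before either clears the flag.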

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This will be needed in the patch "mm, oom: introduce oom reaper".

    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

25 Mar, 2016

5 commits

  • Pull more power management and ACPI updates from Rafael Wysocki:
    "The second batch of power management and ACPI updates for v4.6.

    Included are fixups on top of the previous PM/ACPI pull request and
    other material that didn't make it into it but still should go into 4.6.

    Among other things, there's a fix for an intel_pstate driver issue
    uncovered by recent cpufreq changes, a workaround for a boot hang on
    Skylake-H related to the handling of deep C-states by the platform and
    a PCI/ACPI fix for the handling of IO port resources on non-x86
    architectures plus some new device IDs and similar.

    Specifics:

    - Fix for an intel_pstate driver issue related to the handling of MSR
    updates uncovered by the recent cpufreq rework (Rafael Wysocki).

    - cpufreq core cleanups related to starting governors and frequency
    synchronization during resume from system suspend and a locking fix
    for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).

    - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
    Michael Neuling, Richard Cochran, Shilpasri Bhat).

    - intel_idle driver update preventing some Skylake-H systems from
    hanging during initialization by disabling deep C-states mishandled
    by the platform in the problematic configurations (Len Brown).

    - Intel Xeon Phi Processor x200 support for intel_idle
    (Dasaratharaman Chandramouli).

    - cpuidle menu governor updates to make it always honor PM QoS
    latency constraints (and prevent C1 from being used as the fallback
    C-state on x86 when they are set below its exit latency) and to
    restore the previous behavior to fall back to C1 if the next timer
    event is set far enough in the future that was changed in 4.4 which
    led to an energy consumption regression (Rik van Riel, Rafael
    Wysocki).

    - New device ID for a future AMD UART controller in the ACPI driver
    for AMD SoCs (Wang Hongcheng).

    - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
    scaling (AVS) driver (David Wu).

    - ACPI PCI resources management fix for the handling of IO space
    resources on architectures where the IO space is memory mapped
    (IA64 and ARM64) broken by the introduction of common ACPI
    resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).

    - Fix for the ACPI backend of the generic device properties API to
    make it parse non-device (data node only) children of an ACPI
    device correctly (Irina Tirdea).

    - Fixes for the handling of global suspend flags (introduced in 4.4)
    during hibernation and resume from it (Lukas Wunner).

    - Support for obtaining configuration information from Device Trees
    in the PM clocks framework (Jon Hunter).

    - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
    King, Geert Uytterhoeven)"

    * tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399
    intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
    intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate
    cpufreq: governor: Always schedule work on the CPU running update
    cpufreq: Always update current frequency before startig governor
    cpufreq: Introduce cpufreq_update_current_freq()
    cpufreq: Introduce cpufreq_start_governor()
    cpufreq: powernv: Add sysfs attributes to show throttle stats
    cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
    PCI: ACPI: IA64: fix IO port generic range check
    ACPI / util: cast data to u64 before shifting to fix sign extension
    cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
    cpuidle: menu: Fall back to polling if next timer event is near
    cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
    intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
    cpufreq: Make cpufreq_quick_get() safe to call
    ACPI / property: fix data node parsing in acpi_get_next_subnode()
    ACPI / APD: Add device HID for future AMD UART controller
    ...

    Linus Torvalds
     
  • * pm-avs:
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399

    * pm-clk:
    PM / clk: Add support for obtaining clocks from device-tree

    * pm-devfreq:
    PM / devfreq: Spelling s/frequnecy/frequency/

    * pm-sleep:
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate

    Rafael J. Wysocki
     
  • Pull tracing updates from Steven Rostedt:
    "Nothing major this round. Mostly small clean ups and fixes.

    Some visible changes:

    - A new flag was added to distinguish traces done in NMI context.

    - Preempt tracer now shows functions where preemption is disabled but
    interrupts are still enabled.

    Other notes:

    - Updates were done to function tracing to allow better performance
    with perf.

    - Infrastructure code has been added to allow for a new histogram
    feature for recording live trace event histograms that can be
    configured by simple user commands. The feature itself was just
    finished, but needs a round in linux-next before being pulled.

    This only includes some infrastructure changes that will be needed"

    * tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits)
    tracing: Record and show NMI state
    tracing: Fix trace_printk() to print when not using bprintk()
    tracing: Remove redundant reset per-CPU buff in irqsoff tracer
    x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c
    tracing: Fix crash from reading trace_pipe with sendfile
    tracing: Have preempt(irqs)off trace preempt disabled functions
    tracing: Fix return while holding a lock in register_tracer()
    ftrace: Use kasprintf() in ftrace_profile_tracefs()
    ftrace: Update dynamic ftrace calls only if necessary
    ftrace: Make ftrace_hash_rec_enable return update bool
    tracing: Fix typoes in code comment and printk in trace_nop.c
    tracing, writeback: Replace cgroup path to cgroup ino
    tracing: Use flags instead of bool in trigger structure
    tracing: Add an unreg_all() callback to trigger commands
    tracing: Add needs_rec flag to event triggers
    tracing: Add a per-event-trigger 'paused' field
    tracing: Add get_syscall_name()
    tracing: Add event record param to trigger_ops.func()
    tracing: Make event trigger functions available
    tracing: Make ftrace_event_field checking functions available
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "This tree contains various perf fixes on the kernel side, plus three
    hw/event-enablement late additions:

    - Intel Memory Bandwidth Monitoring events and handling
    - the AMD Accumulated Power Mechanism reporting facility
    - more IOMMU events

    ... and a final round of perf tooling updates/fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    perf llvm: Use strerror_r instead of the thread unsafe strerror one
    perf llvm: Use realpath to canonicalize paths
    perf tools: Unexport some methods unused outside strbuf.c
    perf probe: No need to use formatting strbuf method
    perf help: Use asprintf instead of adhoc equivalents
    perf tools: Remove unused perf_pathdup, xstrdup functions
    perf tools: Do not include stringify.h from the kernel sources
    tools include: Copy linux/stringify.h from the kernel
    tools lib traceevent: Remove redundant CPU output
    perf tools: Remove needless 'extern' from function prototypes
    perf tools: Simplify die() mechanism
    perf tools: Remove unused DIE_IF macro
    perf script: Remove lots of unused arguments
    perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve
    perf machine: Rename perf_event__preprocess_sample to machine__resolve
    perf tools: Add cpumode to struct perf_sample
    perf tests: Forward the perf_sample in the dwarf unwind test
    perf tools: Remove misplaced __maybe_unused
    perf list: Fix documentation of :ppp
    perf bench numa: Fix assertion for nodes bitfield
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: a cgroup fix, a fair-scheduler migration accounting fix, a
    cputime fix and two cpuacct cleanups"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cpuacct: Simplify the cpuacct code
    sched/cpuacct: Rename parameter in cpuusage_write() for readability
    sched/fair: Add comments to explain select_idle_sibling()
    sched/fair: Fix fairness issue on migration
    sched/cgroup: Fix/cleanup cgroup teardown/init
    sched/cputime: Fix steal time accounting vs. CPU hotplug

    Linus Torvalds
     

23 Mar, 2016

16 commits

  • When suspending to RAM, waking up and later suspending to disk,
    we gratuitously runtime resume devices after the thaw phase.
    This does not occur if we always suspend to RAM or always to disk.

    pm_complete_with_resume_check(), which gets called from
    pci_pm_complete() among others, schedules a runtime resume
    if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
    a suspend-to-RAM cycle. It is cleared at the beginning of
    the suspend-to-RAM cycle but not afterwards and it is not
    cleared during a suspend-to-disk cycle at all. Fix it.

    Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement)
    Signed-off-by: Lukas Wunner
    Cc: 4.4+ # 4.4+
    Signed-off-by: Rafael J. Wysocki

    Lukas Wunner
     
  • Use the more common logging method with the eventual goal of removing
    pr_warning altogether.

    Miscellanea:

    - Realign arguments
    - Coalesce formats
    - Add missing space between a few coalesced formats

    Signed-off-by: Joe Perches
    Acked-by: Rafael J. Wysocki [kernel/power/suspend.c]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Add a flag to memremap() for writecombine mappings. Mappings satisfied
    by this flag will not be cached, however writes may be delayed or
    combined into more efficient bursts. This is most suitable for buffers
    written sequentially by the CPU for use by other DMA devices.

    Signed-off-by: Brian Starkey
    Reviewed-by: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • These patches implement a MEMREMAP_WC flag for memremap(), which can be
    used to obtain writecombine mappings. This is then used for setting up
    dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.

    The motivation is to fix an alignment fault on arm64, and the suggestion
    to implement MEMREMAP_WC for this case was made at [1]. That particular
    issue is handled in patch 4, which makes sure that the appropriate
    memset function is used when zeroing allocations mapped as IO memory.

    This patch (of 4):

    Don't modify the flags input argument to memremap(). MEMREMAP_WB is
    already a special case so we can check for it directly instead of
    clearing flag bits in each mapper.
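
    A minimal sketch of that control flow, assuming a simplified model in
    which each mapping backend is represented only by the value it would
    return (0 meaning failure). struct backends and remap_sketch() are
    hypothetical, and the flag values are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits for the three mapping types. */
enum {
    MEMREMAP_WB = 1 << 0,
    MEMREMAP_WT = 1 << 1,
    MEMREMAP_WC = 1 << 2,
};

/* Hypothetical model: what each backend would return, 0 meaning failure. */
struct backends {
    int wb, wt, wc;
};

/* Try each requested mapping type in order of preference. 'flags' is
 * const and never edited; each step only tests the bit it cares about. */
int remap_sketch(const struct backends *b, const unsigned long flags)
{
    int addr = 0;

    assert(b != NULL);
    if (flags & MEMREMAP_WB)
        addr = b->wb;
    if (!addr && (flags & MEMREMAP_WT))
        addr = b->wt;
    if (!addr && (flags & MEMREMAP_WC))
        addr = b->wc;
    return addr;
}
```

    Because no step clears bits, later checks still see exactly what the
    caller asked for.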

    Signed-off-by: Brian Starkey
    Cc: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
    padding) of the part of the struct siginfo that is before the union, and
    it is then used to calculate the needed padding (SI_PAD_SIZE) to make
    the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.

    Depending on the target architecture and word width it equals either
    3 or 4 times sizeof(int).

    Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the
    parisc architecture for the 64bit kernel build. It's even more
    frustrating, because it can easily be checked at compile time if the
    value was defined correctly.

    This patch adds such a check for the correctness of
    __ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
    future architectures from running into the same problem.
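
    The kind of compile-time check meant here can be sketched on a mock
    structure. struct siginfo_sketch and PREAMBLE_SIZE are stand-ins
    invented for this example, not the kernel's definitions; the sketch
    only assumes that the union contains a pointer-sized member.

```c
#include <assert.h>
#include <stddef.h>

#define SI_MAX_SIZE 128
/* Three ints, rounded up to pointer alignment when pointers are 8 bytes. */
#define PREAMBLE_SIZE (sizeof(void *) == 8 ? 4 * sizeof(int) : 3 * sizeof(int))
#define SI_PAD_SIZE ((SI_MAX_SIZE - PREAMBLE_SIZE) / sizeof(int))

struct siginfo_sketch {
    int si_signo;
    int si_errno;
    int si_code;
    union {
        int pad[SI_PAD_SIZE];
        void *ptr; /* forces pointer alignment of the union */
    } fields;
};

/* Fails the build if the declared preamble size does not match the layout
 * the compiler actually produced. */
_Static_assert(offsetof(struct siginfo_sketch, fields) == PREAMBLE_SIZE,
               "preamble size does not match the offset of the union");
_Static_assert(sizeof(struct siginfo_sketch) == SI_MAX_SIZE,
               "padding does not bring the struct to SI_MAX_SIZE");
```

    With a wrong preamble constant, either the first assertion or the size
    assertion fails at build time.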

    I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
    copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
    make any difference and b) it's used in the Documentation/kmemcheck.txt
    example.

    I ran this patch through the 0-DAY kernel test infrastructure and only
    the parisc architecture triggered as expected. That means that this
    patch should be OK for all major architectures.

    Signed-off-by: Helge Deller
    Cc: Stephen Rothwell
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall inputs.
    To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic or
    non-interesting parts of the kernel is disabled (e.g. scheduler, locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on kernel side. The complementary
    compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen of basic blocks (e.g. an invalid
    input). In such context gcov becomes prohibitively expensive as
    reset/collect coverage steps depend on total number of basic
    blocks/edges in program (in case of kernel it is about 2M). Cost of
    kcov depends only on number of executed basic blocks/edges. On top of
    that, kernel requires per-thread coverage because there are always
    background threads and unrelated processes that also produce coverage.
    With inlined gcov instrumentation per-thread coverage is not possible.
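
    The cost argument can be illustrated with a small userspace model of a
    per-thread trace buffer. The layout (word 0 holds the count, the PCs
    follow) and all names below are assumptions made for this sketch; it
    does not use the kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define COVER_SIZE 1024 /* illustrative capacity, in words */

/* Model of a per-thread trace buffer: word 0 holds the number of
 * recorded PCs, words 1..n hold the PCs themselves. */
struct cover {
    unsigned long buf[COVER_SIZE];
};

/* Step (1), reset: constant time, independent of program size. */
void cover_reset(struct cover *c)
{
    assert(c != NULL);
    c->buf[0] = 0;
}

/* Called by instrumented code for each executed basic block.
 * PCs are dropped once the buffer is full. */
void cover_record(struct cover *c, unsigned long pc)
{
    unsigned long n = c->buf[0];

    if (n + 1 < COVER_SIZE) {
        c->buf[n + 1] = pc;
        c->buf[0] = n + 1;
    }
}

/* Step (3), collect: visits only the PCs that were actually executed. */
size_t cover_collect(const struct cover *c, unsigned long *out, size_t max)
{
    size_t n = c->buf[0] < max ? c->buf[0] : max;

    for (size_t i = 0; i < n; i++)
        out[i] = c->buf[i + 1];
    return n;
}
```

    A counter-per-block scheme would instead have to clear and scan every
    counter in the program on each iteration of the loop.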

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • A couple of functions and variables in the profile implementation are
    used only on SMP systems by the procfs code, but are unused if either
    procfs is disabled or in uniprocessor kernels. gcc prints a harmless
    warning about the unused symbols:

    kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_flip_buffers(void)
    ^
    kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_discard_flip_buffers(void)
    ^
    kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
    static int profile_cpu_callback(struct notifier_block *info,
    ^

    This adds further #ifdef to the file, to annotate exactly in which cases
    they are used. I have done several thousand ARM randconfig kernels with
    this patch applied and no longer get any warnings in this file.

    Signed-off-by: Arnd Bergmann
    Cc: Vlastimil Babka
    Cc: Robin Holt
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Commit 1717f2096b54 ("panic, x86: Fix re-entrance problem due to panic
    on NMI") and commit 58c5661f2144 ("panic, x86: Allow CPUs to save
    registers even if looping in NMI context") introduced nmi_panic() which
    prevents concurrent/recursive execution of panic(). It also saves
    registers for the crash dump on x86.

    However, there are some cases where NMI handlers still use panic().
    This patch set partially replaces them with nmi_panic() in those cases.

    Even with this patchset applied, some NMI or similar handlers (e.g. MCE
    handler) continue to use panic(). This is because I can't test them
    well and actual problems won't happen. For example, the possibility
    that normal panic and panic on MCE happen simultaneously is very low.

    This patch (of 3):

    Convert nmi_panic() to a proper function and export it instead of
    exporting internal implementation details to modules, for obvious
    reasons.
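
    How a single "panic owner" can prevent concurrent and recursive panics
    can be sketched with a compare-and-exchange. This is a userspace model
    under assumptions: the enum, nmi_panic_sketch() and the return-value
    convention are invented for the example and the real function does not
    return such a code.

```c
#include <assert.h>
#include <stdatomic.h>

#define PANIC_CPU_INVALID (-1)

enum panic_action {
    PANIC_PROCEED,   /* this CPU claimed the panic and runs the panic path */
    PANIC_STOP_SELF, /* another CPU is already panicking: park this one */
    PANIC_RETURN,    /* the NMI hit the CPU that is already panicking */
};

static atomic_int panic_cpu = PANIC_CPU_INVALID;

/* Decide what a CPU entering the panic path from NMI context should do. */
enum panic_action nmi_panic_sketch(int cpu)
{
    int expected = PANIC_CPU_INVALID;

    assert(cpu >= 0);
    if (atomic_compare_exchange_strong(&panic_cpu, &expected, cpu))
        return PANIC_PROCEED;
    /* On failure 'expected' holds the CPU that claimed the panic first. */
    return expected == cpu ? PANIC_RETURN : PANIC_STOP_SELF;
}
```

    Keeping panic_cpu private to one function is the point of the change:
    callers see one entry point rather than the owner variable.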

    Signed-off-by: Hidehiro Kawai
    Acked-by: Borislav Petkov
    Acked-by: Michal Nazarewicz
    Cc: Michal Hocko
    Cc: Rasmus Villemoes
    Cc: Nicolas Iooss
    Cc: Javi Merino
    Cc: Gobinda Charan Maji
    Cc: "Steven Rostedt (Red Hat)"
    Cc: Thomas Gleixner
    Cc: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.
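
    The three preconditions listed above can be written as a simple
    predicate, for example to audit a configuration. The function name and
    its parameters are hypothetical; reading the actual sysctl values is
    left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if a configuration meets all three conditions under which
 * the issue was exploitable before this fix. */
bool coredump_config_affected(int suid_dumpable, const char *core_pattern,
                              bool unpriv_userns_allowed)
{
    assert(core_pattern != NULL);
    return suid_dumpable == 2 &&
           core_pattern[0] == '/' && /* a "|/..." pipe handler is not affected */
           unpriv_userns_allowed;
}
```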

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • This test-case (a simplified version of one generated by syzkaller)

    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    void test(void)
    {
        for (;;) {
            if (fork()) {
                wait(NULL);
                continue;
            }

            ptrace(PTRACE_SEIZE, getppid(), 0, 0);
            ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
            _exit(0);
        }
    }

    int main(void)
    {
        int np;

        for (np = 0; np < 8; ++np)
            if (!fork())
                test();

        while (wait(NULL) > 0)
            ;
        return 0;
    }

    triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The
    problem is that __ptrace_unlink() clears task->jobctl under siglock but
    task->ptrace is cleared without this lock held; this fools the "else"
    branch which assumes that !PT_SEIZED means PT_PTRACED.

    Note also that most of the other PTRACE_SEIZE checks can race with detach
    from the exiting tracer too. Say, the callers of ptrace_trap_notify()
    assume that SEIZED can't go away after it was checked.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Tejun Heo
    Cc: syzkaller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Except on SPARC, this is what the code always did. SPARC compat seccomp
    was buggy, although the impact of the bug was limited because SPARC
    32-bit and 64-bit syscall numbers are the same.

    Signed-off-by: Andy Lutomirski
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo
    translation should check ptrace() ABI, not caller task ABI.

    This is an ABI change on SPARC. Let's hope that no one relied on the
    old buggy ABI.

    Signed-off-by: Andy Lutomirski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Seccomp wants to know the syscall bitness, not the caller task bitness,
    when it selects the syscall whitelist.

    As far as I know, this makes no difference on any architecture, so it's
    not a security problem. (It generates identical code everywhere except
    sparc, and, on sparc, the syscall numbering is the same for both ABIs.)

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • When a new timeout is written to /proc/sys/kernel/hung_task_timeout_secs,
    khungtaskd is interrupted and again sleeps for the full timeout duration.

    This means that hung tasks will not be checked if a new timeout is written
    periodically within the old timeout duration, and/or checking of hung tasks
    will be delayed for up to the previous timeout duration. Fix this by
    remembering the last time khungtaskd checked for hung tasks.

    This change will allow other watchdog tasks (if any) to share khungtaskd
    by sleeping for minimal timeout diff of all watchdog tasks. Doing more
    watchdog tasks from khungtaskd will reduce the possibility of printk()
    collisions by multiple watchdog threads.
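
    The "remember the last check" idea reduces to a small computation,
    sketched here in seconds rather than jiffies. The function name is
    hypothetical, and treating a timeout of 0 as "checking disabled" is an
    assumption of this sketch, not something stated above.

```c
#include <assert.h>
#include <limits.h>

/* Seconds the watchdog should still sleep, given when it last ran a check.
 * A result <= 0 means a check is already due. A timeout of 0 is treated
 * as "disabled" (assumption), so the caller sleeps indefinitely. */
long hung_timeout_remaining(long last_checked, long now, long timeout)
{
    assert(now >= last_checked);
    if (timeout == 0)
        return LONG_MAX;
    return last_checked + timeout - now;
}
```

    Since the deadline is derived from last_checked and not from the moment
    of the write, rewriting the timeout repeatedly can no longer postpone
    the next check.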

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • The latency tracer format has a nice column to indicate IRQ state, but
    this is not able to tell us about NMI state.

    When tracing perf interrupt handlers (which often run in NMI context)
    it is very useful to see how the events nest.

    Link: http://lkml.kernel.org/r/20160318153022.105068893@infradead.org

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Steven Rostedt

    Peter Zijlstra
     
  • The trace_printk() code will allocate extra buffers if the compiler detects
    that a trace_printk() is used. To do this, the format of the trace_printk()
    is saved to the __trace_printk_fmt section, and if that section is bigger
    than zero, the buffers are allocated (along with a message that this has
    happened).

    If trace_printk() uses a format that is not a constant, and thus something
    not guaranteed to be around when the print happens, the compiler optimizes
    the fmt out, as it is not used, and the __trace_printk_fmt section is not
    filled. This means the kernel will not allocate the special buffers needed
    for the trace_printk() and the trace_printk() will not write anything to the
    tracing buffer.

    Adding a "__used" to the variable in the __trace_printk_fmt section will
    keep it around, even though it is set to NULL. This will keep the string
    from being printed in the debugfs/tracing/printk_formats section as it is
    not needed.

    Reported-by: Vlastimil Babka
    Fixes: 07d777fe8c398 "tracing: Add percpu buffers for trace_printk()"
    Cc: stable@vger.kernel.org # v3.5+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

22 Mar, 2016

1 commit

  • Pull cgroup namespace support from Tejun Heo:
    "These are changes to implement namespace support for cgroup which has
    been pending for quite some time now. It is very straight-forward and
    only affects what part of cgroup hierarchies are visible.

    After unsharing, mounting a cgroup fs will be scoped to the cgroups
    the task belonged to at the time of unsharing and the cgroup paths
    exposed to userland would be adjusted accordingly"

    * 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix and restructure error handling in copy_cgroup_ns()
    cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
    Add FS_USERNS_FLAG to cgroup fs
    cgroup: Add documentation for cgroup namespaces
    cgroup: mount cgroupns-root when inside non-init cgroupns
    kernfs: define kernfs_node_dentry
    cgroup: cgroup namespace setns support
    cgroup: introduce cgroup namespaces
    sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
    kernfs: Add API to generate relative kernfs path

    Linus Torvalds
     

21 Mar, 2016

12 commits

  • - Use for() instead of while() loop in some functions
    to make the code simpler.

    - Use this_cpu_ptr() instead of per_cpu_ptr() to make the code
    cleaner and a bit faster.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Zhao Lei
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/d8a7ef9592f55224630cb26dea239f05b6398a4e.1458187654.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei
     
  • The name of the 'reset' parameter to cpuusage_write() is quite confusing,
    because the only valid value we allow is '0', so !reset is actually the
    case that resets ...

    Rename it to 'val' and explain it in a comment that we only allow 0.
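
    A simplified model of the resulting handler, to show why 'val' reads
    better than 'reset'. The per-CPU array, NR_CPUS_SKETCH and the function
    name are stand-ins invented for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4 /* illustrative CPU count */

static unsigned long long cpuusage[NR_CPUS_SKETCH];

/* Only 0 may be written: writing 0 resets the counters and any other
 * value is rejected, so the argument is a value and not a "reset" flag. */
int cpuusage_write_sketch(unsigned long long val)
{
    if (val != 0)
        return -EINVAL;

    for (size_t i = 0; i < NR_CPUS_SKETCH; i++)
        cpuusage[i] = 0;
    return 0;
}
```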

    Signed-off-by: Dongsheng Yang
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: cgroups@vger.kernel.org
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/1450696483-2864-1-git-send-email-yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     
  • It's not entirely obvious how the main loop in select_idle_sibling()
    works at first glance. Sprinkle a few comments to explain the design
    and intention behind the loop based on some conversations with Mike
    and Peter.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1457535548-15329-1-git-send-email-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • Pavan reported that in the presence of very light tasks (or cgroups)
    the placement of migrated tasks can cause severe fairness issues.

    The problem is that enqueue_entity() places the task before it updates
    time, thereby it can place the task far in the past (remember that
    light tasks will shoot virtual time forward at a high speed, so in
    relation to the pre-existing light task, we can land far in the past).

    This is done because update_curr() needs the current task, and we
    might be placing the current task.

    The obvious solution is to differentiate between the current and any
    other task; placing the current before we update time, and placing any
    other task after, such that !curr tasks end up at the current moment
    in time, and not in the past.
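
    The ordering can be modelled in a few lines. This is a deliberately
    simplified sketch: struct runqueue, the 'pending' field and
    update_time() are assumptions that stand in for the scheduler's real
    time accounting, and min_vruntime is advanced unconditionally here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct entity {
    unsigned long long vruntime; /* relative to min_vruntime until placed */
};

struct runqueue {
    unsigned long long min_vruntime;
    unsigned long long pending; /* virtual time not yet accounted */
    struct entity *curr;        /* currently running entity, may be NULL */
};

/* Stand-in for the time update: charge the pending virtual time to the
 * current entity and move the queue's notion of "now" forward. */
static void update_time(struct runqueue *rq)
{
    if (rq->curr)
        rq->curr->vruntime += rq->pending;
    rq->min_vruntime += rq->pending;
    rq->pending = 0;
}

void enqueue_sketch(struct runqueue *rq, struct entity *se)
{
    bool is_curr = (rq->curr == se);

    assert(rq != NULL && se != NULL);
    if (is_curr)
        se->vruntime += rq->min_vruntime; /* the update needs curr placed */
    update_time(rq);
    if (!is_curr)
        se->vruntime += rq->min_vruntime; /* placed at the updated "now" */
}
```

    Placing a non-current entity before update_time() would leave it
    'pending' units behind the queue, which is the unfairness described
    above.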

    Reported-by: Pavan Kondeti
    Tested-by: Pavan Kondeti
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Ben Segall
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: byungchul.park@lge.com
    Link: http://lkml.kernel.org/r/20160309120403.GK6344@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
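
    The ordering described in this commit can be illustrated with a toy
    model. This is only a sketch under simplifying assumptions, not the
    scheduler code: the struct names, the "pending" field and
    toy_update_time() are hypothetical stand-ins for the run queue, the
    not-yet-accounted time and update_curr().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a scheduling entity and its run queue. */
struct toy_entity {
    uint64_t vruntime;          /* relative offset until it is placed */
};

struct toy_rq {
    uint64_t min_vruntime;      /* the queue's notion of "now" */
    uint64_t pending;           /* virtual time not yet accounted */
    struct toy_entity *curr;    /* currently running entity, may be NULL */
};

/* Stand-in for update_curr(): account pending time to curr and the queue. */
static void toy_update_time(struct toy_rq *rq)
{
    if (rq->curr)
        rq->curr->vruntime += rq->pending;
    rq->min_vruntime += rq->pending;
    rq->pending = 0;
}

/* Place the current entity before updating time, any other entity after. */
static void toy_enqueue(struct toy_rq *rq, struct toy_entity *se)
{
    int is_curr = (rq->curr == se);

    if (is_curr)
        se->vruntime += rq->min_vruntime;   /* must be placed first */

    toy_update_time(rq);

    if (!is_curr)
        se->vruntime += rq->min_vruntime;   /* lands at the current moment */
}
```

    In this model a non-current entity is placed relative to the updated
    clock, so it cannot end up behind time that had already elapsed.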
     
  • The CPU controller hasn't kept up with the various changes in the whole
    cgroup initialization / destruction sequence, and commit:

    2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")

    caused it to explode.

    The reason for this is that zombies do not inhibit css_offline() from
    being called, but do stall css_released(). Now we tear down the cfs_rq
    structures on css_offline() but zombies can run after that, leading to
    use-after-free issues.

    The solution is to move the tear-down to css_released(), which
    guarantees nobody (including no zombies) is still using our cgroup.

    Furthermore, a few simple cleanups are possible too. There doesn't
    appear to be any point to us using css_online() (anymore?) so fold that
    into css_alloc().

    And since cgroup code guarantees an RCU grace period between
    css_released() and css_free() we can forgo using call_rcu() and free the
    stuff immediately.

    Suggested-by: Tejun Heo
    Reported-by: Kazuki Yamaguchi
    Reported-by: Niklas Cassel
    Tested-by: Niklas Cassel
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
    Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Document some of the hotplug notifier usage.

    Requested-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Sasha reported:

    [ 3494.030114] UBSAN: Undefined behaviour in kernel/events/ring_buffer.c:685:22
    [ 3494.030647] shift exponent -1 is negative

    Andrey spotted that this is because:

    It happens if nr_pages = 0:
    rb->page_order = ilog2(nr_pages);

    Fix it by making both assignments conditional on nr_pages; since
    otherwise they should both be 0 anyway, and will be because of the
    kzalloc() used to allocate the structure.

    Reported-by: Sasha Levin
    Reported-by: Andrey Ryabinin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20160129141751.GA407@worktop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
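
    A user-space sketch of the guard described above. It assumes an
    ilog2() that yields -1 for 0, which is what makes the later shift
    undefined. The struct, the function names and the second field are
    hypothetical; only page_order and nr_pages appear in the report.

```c
#include <stddef.h>

/* Stand-in for ilog2(): floor(log2(n)), and -1 when n is 0. */
static int ilog2_sketch(unsigned long n)
{
    int log = -1;

    while (n) {
        n >>= 1;
        log++;
    }
    return log;
}

/* Hypothetical, reduced version of the buffer structure. */
struct rb_sketch {
    unsigned long nr_pages;   /* placeholder for the second guarded field */
    int page_order;
};

static void rb_init_sketch(struct rb_sketch *rb, unsigned long nr_pages)
{
    /* Mimic a zeroing allocator such as kzalloc(). */
    rb->nr_pages = 0;
    rb->page_order = 0;

    /* Only compute the order when there is at least one page. */
    if (nr_pages) {
        rb->nr_pages = nr_pages;
        rb->page_order = ilog2_sketch(nr_pages);
    }
}
```

    With the guard in place, a request for zero pages leaves page_order
    at 0 instead of -1, so a later "1 << page_order" stays well defined.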
     
  • There were two problems with the dynamic interrupt throttle mechanism,
    both triggered by the same action.

    When you (or perf_fuzzer) write a huge value into
    /proc/sys/kernel/perf_event_max_sample_rate the computed
    perf_sample_allowed_ns becomes 0. This effectively disables the whole
    dynamic throttle.

    This is fixed by ensuring update_perf_cpu_limits() never sets the
    value to 0. However, we allow disabling of the dynamic throttle by
    writing 100 to /proc/sys/kernel/perf_cpu_time_max_percent. This will
    generate a warning in dmesg.

    The second problem is that by setting the max_sample_rate to a huge
    number, the adaptive process can take a few tries, since it halves the
    limit each time. Change that to directly compute a new value based on
    the observed duration.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
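
    Both problems are arithmetic in nature, so a small sketch may help.
    It is not the kernel's formula: the function names are hypothetical
    and the computation is an assumed simplification (sample period
    times the allowed CPU percentage).

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Allowed time per sample; never returns 0, so the throttle stays active. */
static uint64_t sample_allowed_ns(uint64_t max_sample_rate,
                                  unsigned int max_percent)
{
    uint64_t period_ns, allowed_ns;

    if (max_sample_rate == 0)
        max_sample_rate = 1;            /* sketch-only guard */
    period_ns = NSEC_PER_SEC / max_sample_rate;
    allowed_ns = period_ns * max_percent / 100;
    return allowed_ns ? allowed_ns : 1; /* a huge rate would give 0 here */
}

/* Number of halving steps needed to get at or below a sustainable rate. */
static unsigned int halvings_needed(uint64_t rate, uint64_t sustainable_rate)
{
    unsigned int steps = 0;

    while (rate > sustainable_rate) {
        rate /= 2;
        steps++;
    }
    return steps;
}

/* Compute a rate in one step from the observed cost of a single sample. */
static uint64_t direct_rate(uint64_t observed_ns, unsigned int max_percent)
{
    uint64_t budget_ns = NSEC_PER_SEC / 100 * max_percent;

    if (observed_ns == 0)
        observed_ns = 1;                /* sketch-only guard */
    return budget_ns / observed_ns;
}
```

    For example, with a 25% budget and samples that cost 10000 ns each,
    direct_rate() gives 25000 samples per second in one step, whereas
    halving down from 1000000000 takes 16 steps to get under that rate.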
     
  • It's possible to IOC_PERIOD while the event is throttled; this would
    re-start the event and the next tick would then try to unthrottle it,
    and find the event wasn't actually stopped anymore.

    This would tickle a WARN in the x86-pmu code which isn't expecting to
    start a !stopped event.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: dvyukov@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160310143924.GR6356@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
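
    The PROT_EXEC-only case mentioned in the pull request can be
    exercised from user space roughly as follows. This is a minimal
    sketch assuming a Linux system with anonymous mappings; the function
    name is hypothetical, and whether the range actually becomes
    unreadable depends on the CPU and kernel supporting protection keys.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map a writable region, then switch it to PROT_EXEC only (no
 * PROT_READ, no PROT_WRITE). Returns 0 on success and -1 on failure.
 */
static int make_exec_only(size_t len)
{
    void *p;
    int rc;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* A loader would copy code into the region here, while it is writable. */

    rc = mprotect(p, len, PROT_EXEC);

    munmap(p, len);
    return rc == 0 ? 0 : -1;
}
```

    The sketch deliberately does not read from the region after the
    mprotect() call: on a system with protection keys that access would
    be expected to fault.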
     
  • Pull 'objtool' stack frame validation from Ingo Molnar:
    "This tree adds a new kernel build-time object file validation feature
    (CONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
    It was written by and is maintained by Josh Poimboeuf.

    The motivation: there's a category of hard to find kernel bugs, most
    of them in assembly code (but also occasionally in C code), that
    degrades the quality of kernel stack dumps/backtraces. These bugs are
    hard to detect at the source code level. Such bugs result in
    incorrect/incomplete backtraces most of the time - but can also in some
    rare cases result in crashes or other undefined behavior.

    The build time correctness checking is done via the new 'objtool'
    user-space utility that was written for this purpose and which is
    hosted in the kernel repository in tools/objtool/. The tool's (very
    simple) UI and source code design is shaped after Git and perf and
    shares quite a bit of infrastructure with tools/perf (which tooling
    infrastructure sharing effort got merged via perf and is already
    upstream). Objtool follows the well-known kernel coding style.

    Objtool does not try to check .c or .S files, it instead analyzes the
    resulting .o generated machine code from first principles: it decodes
    the instruction stream and interprets it. (Right now objtool supports
    the x86-64 architecture.)

    From tools/objtool/Documentation/stack-validation.txt:

    "The kernel CONFIG_STACK_VALIDATION option enables a host tool named
    objtool which runs at compile time. It has a "check" subcommand
    which analyzes every .o file and ensures the validity of its stack
    metadata. It enforces a set of rules on asm code and C inline
    assembly code so that stack traces can be reliable.

    Currently it only checks frame pointer usage, but there are plans to
    add CFI validation for C files and CFI generation for asm files.

    For each function, it recursively follows all possible code paths
    and validates the correct frame pointer state at each instruction.

    It also follows code paths involving special sections, like
    .altinstructions, __jump_table, and __ex_table, which can add
    alternative execution paths to a given instruction (or set of
    instructions). Similarly, it knows how to follow switch statements,
    for which gcc sometimes uses jump tables."

    When this new kernel option is enabled (it's disabled by default), the
    tool, if it finds any suspicious assembly code pattern, outputs
    warnings in compiler warning format:

    warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
    warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
    warning: objtool: __schedule()+0x3c0: duplicate frame pointer save
    warning: objtool: __schedule()+0x3fd: sibling call from callable instruction with changed frame pointer

    ... so that scripts that pick up compiler warnings will notice them.
    All known warnings triggered by the tool are fixed by the tree, most
    of the commits in fact prepare the kernel to be warning-free. Most of
    them are bugfixes or cleanups that stand on their own, but there are
    also some annotations of 'special' stack frames for justified cases
    such as entries to JIT-ed code (BPF) or really special boot time code.

    There are two other long-term motivations behind this tool as well:

    - To improve the quality and reliability of kernel stack frames, so
    that they can be used for optimized live patching.

    - To create independent infrastructure to check the correctness of
    CFI stack frames at build time. CFI debuginfo is notoriously
    unreliable and we cannot use it in the kernel as-is without extra
    checking done both on the kernel side and on the build side.

    The quality of kernel stack frames matters to debuggability as well,
    so IMO we can merge this without having to consider the live patching
    or CFI debuginfo angle"

    * 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    objtool: Only print one warning per function
    objtool: Add several performance improvements
    tools: Copy hashtable.h into tools directory
    objtool: Fix false positive warnings for functions with multiple switch statements
    objtool: Rename some variables and functions
    objtool: Remove superflous INIT_LIST_HEAD
    objtool: Add helper macros for traversing instructions
    objtool: Fix false positive warnings related to sibling calls
    objtool: Compile with debugging symbols
    objtool: Detect infinite recursion
    objtool: Prevent infinite recursion in noreturn detection
    objtool: Detect and warn if libelf is missing and don't break the build
    tools: Support relative directory path for 'O='
    objtool: Support CROSS_COMPILE
    x86/asm/decoder: Use explicitly signed chars
    objtool: Enable stack metadata validation on 64-bit x86
    objtool: Add CONFIG_STACK_VALIDATION option
    objtool: Add tool to perform compile-time stack metadata validation
    x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
    sched: Always inline context_switch()
    ...

    Linus Torvalds
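
    The idea of following every code path and checking the frame pointer
    state at each instruction can be shown on a toy instruction set.
    This is not objtool's implementation: the opcodes, types and result
    codes below are hypothetical, and real objtool decodes x86-64
    machine code and handles many more cases.

```c
#include <stddef.h>

/* Toy opcodes standing in for decoded machine instructions. */
enum toy_op { T_SAVE_FP, T_RESTORE_FP, T_CALL, T_JMP, T_JCC, T_RET, T_OTHER };

struct toy_insn {
    enum toy_op op;
    int target;    /* destination index for T_JMP and T_JCC */
    int visited;   /* set once a path has reached this instruction */
    int fp_saved;  /* frame pointer state seen on the first visit */
};

enum toy_result {
    V_OK, V_MISMATCH, V_CALL_NO_FP, V_DUP_SAVE, V_BAD_RET, V_RAN_OFF_END
};

/* Follow all paths from instruction i, tracking whether the FP is saved. */
static enum toy_result toy_validate(struct toy_insn *code, int n,
                                    int i, int fp_saved)
{
    while (i >= 0 && i < n) {
        struct toy_insn *in = &code[i];
        enum toy_result r;

        /* A second path must arrive with the same state as the first. */
        if (in->visited)
            return in->fp_saved == fp_saved ? V_OK : V_MISMATCH;
        in->visited = 1;
        in->fp_saved = fp_saved;

        switch (in->op) {
        case T_SAVE_FP:
            if (fp_saved)
                return V_DUP_SAVE;
            fp_saved = 1;
            break;
        case T_RESTORE_FP:
            fp_saved = 0;
            break;
        case T_CALL:
            if (!fp_saved)
                return V_CALL_NO_FP;
            break;
        case T_JCC:
            /* Check the taken branch first, then fall through. */
            r = toy_validate(code, n, in->target, fp_saved);
            if (r != V_OK)
                return r;
            break;
        case T_JMP:
            i = in->target;
            continue;
        case T_RET:
            return fp_saved ? V_BAD_RET : V_OK;
        case T_OTHER:
            break;
        }
        i++;
    }
    return V_RAN_OFF_END;
}
```

    In this model a function that makes a call before saving the frame
    pointer yields V_CALL_NO_FP, and a branch that skips the save but
    rejoins code reached with the frame pointer saved yields V_MISMATCH.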
     

20 Mar, 2016

2 commits

  • Pull audit updates from Paul Moore:
    "A small set of patches for audit this time; just three in total and
    one is a spelling fix.

    The two patches with actual content are designed to help prevent new
    instances of auditd from displacing an existing, functioning auditd
    and to generate a log of the attempt. Not to worry, dead/stuck auditd
    instances can still be replaced by a new instance without problem.

    Nothing controversial, and everything passes our regression suite"

    * 'stable-4.6' of git://git.infradead.org/users/pcmoore/audit:
    audit: Fix typo in comment
    audit: log failed attempts to change audit_pid configuration
    audit: stop an old auditd being starved out by a new auditd

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and array maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a message-based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address list for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
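
    The arithmetic behind item 14 can be checked with a plain ones'
    complement checksum. This is a user-space sketch, not the kernel's
    checksum helpers: the function names are hypothetical, and it only
    demonstrates that a region whose checksum field has been filled in
    sums to the same constant (0xffff, the ones' complement "zero")
    whatever the payload is.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of big-endian 16-bit words, folded to 16 bits. */
static uint16_t csum_fold_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)                        /* odd trailing byte, zero padded */
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/*
 * Store the checksum of buf in the two bytes at the even offset off.
 * Returns the checksum value that was written.
 */
static uint16_t fill_checksum(uint8_t *buf, size_t len, size_t off)
{
    uint16_t c;

    buf[off] = 0;
    buf[off + 1] = 0;
    c = (uint16_t)~csum_fold_sum(buf, len);
    buf[off] = (uint8_t)(c >> 8);
    buf[off + 1] = (uint8_t)(c & 0xff);
    return c;
}
```

    Because the filled-in region always contributes the same value, an
    outer checksum computed over it depends only on the outer headers.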
     

19 Mar, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup changes for v4.6-rc1. No userland visible behavior changes in
    this pull request. I'll send out a separate pull request for the
    addition of cgroup namespace support.

    - The biggest change is the revamping of cgroup core task migration
    and controller handling logic. There are quite a few places where
    controllers and tasks are manipulated. Previously, many of those
    places implemented custom operations for each specific use case
    assuming specific starting conditions. While this worked, it made
    the code fragile and difficult to follow.

    The bulk of this pull request restructures these operations so that
    most related operations are performed through common helpers which
    implement recursive (subtrees are always processed consistently)
    and idempotent (they make cgroup hierarchy converge to the target
    state rather than performing operations assuming specific starting
    conditions). This makes the code a lot easier to understand,
    verify and extend.

    - Implicit controller support is added. This is primarily for using
    perf_event on the v2 hierarchy so that perf can match cgroup v2
    path without requiring the user to do anything special. The kernel
    portion of perf_event changes is acked but userland changes are
    still pending review.

    - cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
    certain environments.

    - There is a regression introduced during v4.4 devel cycle where
    attempts to migrate zombie tasks can mess up internal object
    management. This was fixed earlier this week and included in this
    pull request w/ stable cc'd.

    - Misc non-critical fixes and improvements"

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
    cgroup: avoid false positive gcc-6 warning
    cgroup: ignore css_sets associated with dead cgroups during migration
    Documentation: cgroup v2: Trivial heading correction.
    cgroup: implement cgroup_subsys->implicit_on_dfl
    cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
    cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
    cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
    cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
    cgroup: Trivial correction to reflect controller.
    cgroup: remove stale item in cgroup-v1 document INDEX file.
    cgroup: update css iteration in cgroup_update_dfl_csses()
    cgroup: allocate 2x cgrp_cset_links when setting up a new root
    cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
    cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
    cgroup: use cgroup_apply_enable_control() in cgroup creation path
    cgroup: combine cgroup_mutex locking and offline css draining
    cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
    cgroup: introduce cgroup_{save|propagate|restore}_control()
    cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
    cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
    ...

    Linus Torvalds
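
    The "recursive and idempotent" property described in the pull
    request can be shown on a toy tree. The structure and function below
    are hypothetical and unrelated to the real cgroup data structures;
    they only illustrate a helper that converges a whole subtree to a
    target state regardless of where it starts.

```c
#include <stddef.h>

/* Hypothetical tree node with a current and a desired state. */
struct toy_cg {
    unsigned int enabled;       /* current state */
    unsigned int target;        /* state to converge to */
    struct toy_cg *child;       /* first child */
    struct toy_cg *sibling;     /* next sibling */
};

/*
 * Walk the node, its siblings and all descendants, moving each one to
 * its target state. Returns how many nodes had to be changed.
 */
static int toy_apply_control(struct toy_cg *n)
{
    int changed = 0;

    for (; n; n = n->sibling) {
        if (n->enabled != n->target) {
            n->enabled = n->target;
            changed++;
        }
        changed += toy_apply_control(n->child);
    }
    return changed;
}
```

    Calling the helper a second time changes nothing and returns 0,
    which is the sense in which such an operation is idempotent.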