12 Oct, 2016

8 commits

  • As of Android N, SECCOMP is required. Without it, we will get
    mediaextractor error:

    E /system/bin/mediaextractor: libminijail: prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER): Invalid argument

    Link: http://lkml.kernel.org/r/20160908185934.18098-3-robh@kernel.org
    Signed-off-by: Rob Herring
    Acked-by: John Stultz
    Cc: Amit Pundir
    Cc: Dmitry Shmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Herring
     
  • Android won't boot without SELinux enabled, so make it the default.

    Link: http://lkml.kernel.org/r/20160908185934.18098-2-robh@kernel.org
    Signed-off-by: Rob Herring
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Herring
     
  • CONFIG_MD is in recommended, but other dependent options like DM_CRYPT and
    DM_VERITY options are in base. The result is the options in base don't
    get enabled when applying both base and recommended fragments. Move all
    the options to recommended.

    Link: http://lkml.kernel.org/r/20160908185934.18098-1-robh@kernel.org
    Signed-off-by: Rob Herring
    Acked-by: John Stultz
    Cc: Amit Pundir
    Cc: Dmitry Shmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Herring
     
  • Option is long gone, see commit 5d9efa7ee99e ("ipv6: Remove privacy
    config option.")

    Link: http://lkml.kernel.org/r/20160811170340.9859-1-bp@alien8.de
    Signed-off-by: Borislav Petkov
    Cc: Rob Herring
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • Relay avoids calling wake_up_interruptible() for doing the wakeup of
    readers/consumers, waiting for the generation of new data, from the
    context of a process which produced the data. This is apparently done to
    prevent the possibility of a deadlock in case the scheduler itself is
    generating data for the relay, after acquiring rq->lock.

    The following patch used a timer (to be scheduled at next jiffy), for
    delegating the wakeup to another context.
    commit 7c9cb38302e78d24e37f7d8a2ea7eed4ae5f2fa7
    Author: Tom Zanussi
    Date: Wed May 9 02:34:01 2007 -0700

    relay: use plain timer instead of delayed work

    relay doesn't need to use schedule_delayed_work() for waking readers
    when a simple timer will do.

    Scheduling a plain timer, at next jiffies boundary, to do the wakeup
    causes a significant wakeup latency for the Userspace client, which makes
    relay less suitable for the high-frequency low-payload use cases where the
    data gets generated at a very high rate, like multiple sub buffers getting
    filled within a millisecond. Moreover the timer is re-scheduled on every
    newly produced sub buffer so the timer keeps getting pushed out if sub
    buffers are filled in a very quick succession (less than a jiffy gap
    between filling of 2 sub buffers). As a result relay runs out of sub
    buffers to store the new data.

    By using irq_work it is ensured that the wakeup of the userspace client,
    blocked in the poll call, is done at the earliest (through self IPI or next
    timer tick), enabling it to always consume the data in time. Also this makes
    relay consistent with printk & ring buffers (trace), as they too use
    irq_work for deferred wake up of readers.

    [arnd@arndb.de: select CONFIG_IRQ_WORK]
    Link: http://lkml.kernel.org/r/20160912154035.3222156-1-arnd@arndb.de
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1472906487-1559-1-git-send-email-akash.goel@intel.com
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Akash Goel
    Cc: Tom Zanussi
    Cc: Chris Wilson
    Cc: Tvrtko Ursulin
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Daniel Walker reported problems which happen when the
    crash_kexec_post_notifiers kernel option is enabled
    (https://lkml.org/lkml/2015/6/24/44).

    In that case, smp_send_stop() is called before entering kdump routines
    which assume other CPUs are still online. As the result, for x86, kdump
    routines fail to save other CPUs' registers and disable virtualization
    extensions.

    To fix this problem, call a new kdump friendly function,
    crash_smp_send_stop(), instead of the smp_send_stop() when
    crash_kexec_post_notifiers is enabled. crash_smp_send_stop() is a weak
    function, and it just calls smp_send_stop(). Architecture code should
    override it so that kdump can work appropriately. This patch only
    provides an x86-specific version.
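
    The weak-default pattern described above can be sketched in user space
    as follows. This is only an illustration, assuming a GCC/Clang-style
    toolchain; stop_other_cpus() and the call counter are hypothetical
    helpers, and smp_send_stop() here is a stand-in that merely counts calls.

```c
/* Counts how often the generic stop path ran (illustration only). */
static int generic_stop_calls;

/* Stand-in for the kernel's smp_send_stop(); here it only counts calls. */
void smp_send_stop(void)
{
    generic_stop_calls++;
}

/*
 * Weak default: the linker uses it only when no other object file
 * provides a strong crash_smp_send_stop(). An architecture that needs
 * kdump-friendly behaviour would supply its own strong definition.
 */
__attribute__((weak)) void crash_smp_send_stop(void)
{
    smp_send_stop();
}

/*
 * Hypothetical caller that picks the stop routine based on the
 * crash_kexec_post_notifiers setting. Returns the number of times the
 * generic routine has run so far.
 */
int stop_other_cpus(int crash_kexec_post_notifiers)
{
    if (crash_kexec_post_notifiers)
        crash_smp_send_stop();
    else
        smp_send_stop();
    return generic_stop_calls;
}
```

    With no strong override linked in, both branches end up in the generic
    routine, which is the fallback behaviour the commit message describes
    for architectures that have not yet provided their own version.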

    For Xen's PV kernel, just keep the current behavior.

    NOTES:

    - Right solution would be to place crash_smp_send_stop() before
    __crash_kexec() invocation in all cases and remove smp_send_stop(), but
    we can't do that until all architectures implement own
    crash_smp_send_stop()

    - crash_smp_send_stop()-like work is still needed by
    machine_crash_shutdown() because crash_kexec() can be called without
    entering panic()

    Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" option)
    Link: http://lkml.kernel.org/r/20160810080948.11028.15344.stgit@sysi4-13.yrl.intra.hitachi.co.jp
    Signed-off-by: Hidehiro Kawai
    Reported-by: Daniel Walker
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Vivek Goyal
    Cc: Eric Biederman
    Cc: Masami Hiramatsu
    Cc: Daniel Walker
    Cc: Xunlei Pang
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Borislav Petkov
    Cc: David Vrabel
    Cc: Toshi Kani
    Cc: Ralf Baechle
    Cc: David Daney
    Cc: Aaro Koskinen
    Cc: "Steven J. Hill"
    Cc: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • On __ptrace_detach(), called from do_exit()->exit_notify()->
    forget_original_parent()->exit_ptrace(), the TIF_SYSCALL_TRACE in
    thread->flags of the tracee is not cleared up. This results in the
    tracehook_report_syscall_* being called (though there's no longer a tracer
    listening to that) upon its further syscalls.

    Example scenario - attach "strace" to a running process and kill it (the
    strace) with SIGKILL. You'll see that the syscall trace hooks are still
    being called.

    The clearing of this flag should be moved from ptrace_detach() to
    __ptrace_detach().
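
    Why moving the clearing helps can be sketched with a small user-space
    model. The struct, the model_* function names and the flag's bit value
    are all hypothetical; only the idea (clear the flag in the helper that
    every detach path goes through) is taken from the description above.

```c
#include <stdbool.h>

#define TIF_SYSCALL_TRACE (1u << 0) /* illustrative bit value, not the kernel's */

struct task {
    unsigned int flags; /* models the tracee's thread flags */
    bool ptraced;       /* true while a tracer is attached */
};

/* Common helper: every detach path ends up here, so the flag is cleared here. */
void model_common_detach(struct task *child)
{
    child->flags &= ~TIF_SYSCALL_TRACE;
    child->ptraced = false;
}

/* Path 1: the tracer explicitly asks to detach. */
void model_explicit_detach(struct task *child)
{
    model_common_detach(child);
}

/* Path 2: the tracer exits, e.g. strace killed with SIGKILL. */
void model_tracer_exit(struct task *child)
{
    model_common_detach(child);
}

/* True if syscall trace hooks would still fire for this task. */
bool syscall_hooks_active(const struct task *t)
{
    return (t->flags & TIF_SYSCALL_TRACE) != 0;
}
```

    If the clearing lived only in model_explicit_detach(), a task whose
    tracer died would keep the flag set, which is the bug being fixed.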

    Link: http://lkml.kernel.org/r/1472759493-20554-1-git-send-email-alnovak@suse.cz
    Signed-off-by: Ales Novak
    Acked-by: Oleg Nesterov
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ales Novak
     
  • asm-generic headers are generic implementations for architecture specific
    code and should not be included by common code. Thus use the asm/ version
    of sections.h to get at the linker sections.

    Link: http://lkml.kernel.org/r/1473602302-6208-1-git-send-email-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Acked-by: Masami Hiramatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

11 Oct, 2016

7 commits

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     
  • Pull protection keys syscall interface from Thomas Gleixner:
    "This is the final step of Protection Keys support which adds the
    syscalls so user space can actually allocate keys and protect memory
    areas with them. Details and usage examples can be found in the
    documentation.

    The mm side of this has been acked by Mel"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/pkeys: Update documentation
    x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
    x86/pkeys: Fix pkeys build breakage for some non-x86 arches
    x86/pkeys: Add self-tests
    x86/pkeys: Allow configuration of init_pkru
    x86/pkeys: Default to a restrictive init PKRU
    pkeys: Add details of system call use to Documentation/
    generic syscalls: Wire up memory protection keys syscalls
    x86: Wire up protection keys system calls
    x86/pkeys: Allocation/free syscalls
    x86/pkeys: Make mprotect_key() mask off additional vm_flags
    mm: Implement new pkey_mprotect() system call
    x86/pkeys: Add fault handling for PF_PK page fault bit

    Linus Torvalds
     
  • Pull scheduler fix from Thomas Gleixner:
    "A revert of a commit which pointlessly widened a preempt disabled
    section which in turn caused might_sleep() to trigger.

    The patch intended to prevent usage of smp_processor_id() in
    preemptible context, but the usage in that case is fine because the
    thread is pinned on a single cpu and therefore cannot be migrated off"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "sched/core: Do not use smp_processor_id() with preempt enabled in smpboot_thread_fn()"

    Linus Torvalds
     
  • Pull timer fix from Thomas Gleixner:
    "A single fix for a regression introduced in 4.8 which causes the
    trace/perf clock to return random nonsense if CONFIG_DEBUG_TIMEKEEPING
    is set"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timekeeping: Fix __ktime_get_fast_ns() regression

    Linus Torvalds
     
  • Merge my system logging cleanups, triggered by the broken '\n' patches.

    The line continuation handling has been broken basically forever, and
    the code to handle the system log records was both confusing and
    dubious. And it would do entirely the wrong thing unless you always had
    a terminating newline, partly because it couldn't actually see whether a
    message was marked KERN_CONT or not (but partly because the LOG_CONT
    handling in the recording code was rather confusing too).

    This re-introduces a real semantically meaningful KERN_CONT, and fixes
    the few places I noticed where it was missing. There are probably more
    missing cases, since KERN_CONT hasn't actually had any semantic meaning
    for at least four years (other than the checkpatch meaning of "no log
    level necessary, this is a continuation line").

    This also allows the combination of KERN_CONT and a log level. In that
    case the log level will be ignored if the merging with a previous line
    is successful, but if a new record is needed, that new record will now
    get the right log level.

    That also means that you can at least in theory combine KERN_CONT with
    the "pr_info()" style helpers, although any use of pr_fmt() prefixing
    would make that just result in a mess, of course (the prefix would end
    up in the middle of a continuing line).

    * printk-cleanups:
    printk: make reading the kernel log flush pending lines
    printk: re-organize log_output() to be more legible
    printk: split out core logging code into helper function
    printk: reinstate KERN_CONT for printing continuation lines

    Linus Torvalds
     

10 Oct, 2016

4 commits

  • That will mean that any possible subsequent continuation will now be
    broken up onto a line of its own (since reading the log has finalized
    the beginning of the line), but if user space has activated system
    logging (or if there's a kernel message dump going on) that is the right
    thing to do.

    And now that we actually get the continuation flags _right_ for this
    all, the user space logger that is reading the kernel messages can
    actually see the continuation marker. Not that anybody seems to really
    bother with it (or care), but in theory user space can do its own
    message stitching.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Avoid some duplicate logic now that we can return early, and update the
    comments for the new LOG_CONT world order.

    This also stops the continuation flushing from just using random record
    flags for the flushing action, instead taking the flags from the proper
    original line and updating them as we add continuations to it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The code that actually decides how to log the message (whether to put it
    directly into the record log, whether to append it to an existing
    buffered log, or whether to start a new buffered log) is fairly
    non-obvious code in the middle of the vprintk_emit() function.

    Splitting that code up into a helper function makes it easier to
    understand, but perhaps more importantly also allows for the code to
    just return early out of the helper function once it has made the
    decision about where the new log content goes.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Long long ago the kernel log buffer was a buffered stream of bytes, very
    much like stdio in user space. It supported log levels by scanning the
    stream and noticing the log level markers at the beginning of each line,
    but if you wanted to print a partial line in multiple chunks, you just
    did multiple printk() calls, and it just automatically worked.

    Except when it didn't, and you had very confusing output when different
    lines got all mixed up with each other. Then you got fragment lines
    mixing with each other, or with non-fragment lines, because it was
    traditionally impossible to tell whether a printk() call was a
    continuation or not.

    To at least help clarify the issue of continuation lines, we added a
    KERN_CONT marker back in 2007 to mark continuation lines:

    474925277671 ("printk: add KERN_CONT annotation").

    That continuation marker was initially an empty string, and didn't
    actually make any semantic difference. But it at least made it possible
    to annotate the source code, and have check-patch notice that a printk()
    didn't need or want a log level marker, because it was a continuation of
    a previous line.

    To avoid the ambiguity between a continuation line that had that
    KERN_CONT marker, and a printk with no level information at all, we then
    in 2009 made KERN_CONT be a real log level marker which meant that we
    could now reliably tell the difference between the two cases.

    5fd29d6ccbc9 ("printk: clean up handling of log-levels and newlines")

    and we could take advantage of that to make sure we didn't mix up
    continuation lines with lines that just didn't have any loglevel at all.

    Then, in 2012, the kernel log buffer was changed to be a "record" based
    log, where each line was a record that has a loglevel and a timestamp.

    You can see the beginning of that conversion in commits

    e11fea92e13f ("kmsg: export printk records to the /dev/kmsg interface")
    7ff9554bb578 ("printk: convert byte-buffer to variable-length record buffer")

    with a number of follow-up commits to fix some painful fallout from that
    conversion. Over all, it took a couple of months to sort out most of
    it. But the upside was that you could have concurrent readers (and
    writers) of the kernel log and not have lines with mixed output in them.

    And one particular pain-point for the record-based kernel logging was
    exactly the fragmentary lines that are generated in smaller chunks. In
    order to still log them as one record, the continuation lines need to be
    attached to the previous record properly.

    However the explicit continuation record marker that is actually useful
    for this exact case was actually removed at around the same time by commit

    61e99ab8e35a ("printk: remove the now unnecessary "C" annotation for KERN_CONT")

    due to the incorrect belief that KERN_CONT wasn't meaningful. The
    ambiguity between "is this a continuation line" or "is this a plain
    printk with no log level information" was reintroduced, and in fact
    became an even bigger pain point because there was now the whole
    record-level merging of kernel messages going on.

    This patch reinstates the KERN_CONT as a real non-empty string marker,
    so that the ambiguity is fixed once again.

    But it's not a plain revert of that original removal: in the four years
    since we made KERN_CONT an empty string again, not only has the format
    of the log level markers changed, we've also had some usage changes in
    this area.

    For example, some ACPI code seems to use KERN_CONT _together_ with a log
    level, and now uses both the KERN_CONT marker and (for example) a
    KERN_INFO marker to show that it's an informational continuation of a
    line.

    Which is actually not a bad idea - if the continuation line cannot be
    attached to its predecessor, without the log level information we don't
    know what log level to assign to it (and we traditionally just assigned
    it the default loglevel). So having both a log level and the KERN_CONT
    marker is not necessarily a bad idea, but it does mean that we need to
    actually iterate over potentially multiple markers, rather than just a
    single one.
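
    Iterating over multiple markers can be sketched like this. It is a
    user-space illustration, assuming each marker is a start-of-header byte
    followed by one character ('0'..'7' for a log level, 'c' for a
    continuation); the struct, the function name and LEVEL_DEFAULT are
    hypothetical and not the kernel's actual code.

```c
#define SOH '\001'         /* byte that introduces a marker (assumed encoding) */
#define LEVEL_DEFAULT (-1) /* hypothetical "no level given" value */

struct parsed_prefix {
    int level;        /* 0..7, or LEVEL_DEFAULT if no level marker was seen */
    int is_cont;      /* non-zero if a continuation marker was seen */
    const char *text; /* message text with the leading markers skipped */
};

/* Consume every leading marker instead of stopping after the first one. */
struct parsed_prefix parse_markers(const char *msg)
{
    struct parsed_prefix p = { LEVEL_DEFAULT, 0, msg };

    while (p.text[0] == SOH && p.text[1] != '\0') {
        char c = p.text[1];

        if (c >= '0' && c <= '7') {
            /* The first level marker wins; later ones are ignored. */
            if (p.level == LEVEL_DEFAULT)
                p.level = c - '0';
        } else if (c == 'c') {
            p.is_cont = 1;
        } else {
            break; /* unknown marker: stop and leave it in the text */
        }
        p.text += 2; /* each marker is two bytes long */
    }
    return p;
}
```

    A caller that fails to merge the fragment into the previous record can
    then fall back to p.level for the new record rather than the default
    level, which is the benefit of carrying both markers.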

    Also, since KERN_CONT was still conceptually needed, and encouraged, but
    didn't actually _do_ anything, we've also had the reverse problem:
    rather than having too many annotations it has too few, and there is bit
    rot with code that no longer marks the continuation lines with the
    KERN_CONT marker.

    So this patch not only re-instates the non-empty KERN_CONT marker, it
    also fixes up the cases of bit-rot I noticed in my own logs.

    There are probably other cases where KERN_CONT will be needed to be
    added, either because it is new code that never dealt with the need for
    KERN_CONT, or old code that has bitrotted without anybody noticing.

    That said, we should strive to avoid the need for KERN_CONT. It does
    result in real problems for logging, and should generally not be seen as
    a good feature. If we some day can get rid of the feature entirely,
    because nobody does any fragmented printk calls, that would be lovely.

    But until that point, let's at least mark the code that relies on the hacky
    multi-fragment kernel printk's. Not only does it avoid the ambiguity,
    it also annotates code as "maybe this would be good to fix some day".

    (That said, particularly during single-threaded bootup, the downsides of
    KERN_CONT are very limited. Things get much hairier when you have
    multiple threads going on and user level reading and writing logs too).

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Oct, 2016

14 commits

  • Merge updates from Andrew Morton:

    - fsnotify updates

    - ocfs2 updates

    - all of MM

    * emailed patches from Andrew Morton: (127 commits)
    console: don't prefer first registered if DT specifies stdout-path
    cred: simpler, 1D supplementary groups
    CREDITS: update Pavel's information, add GPG key, remove snail mail address
    mailmap: add Johan Hovold
    .gitattributes: set git diff driver for C source code files
    uprobes: remove function declarations from arch/{mips,s390}
    spelling.txt: "modeled" is spelt correctly
    nmi_backtrace: generate one-line reports for idle cpus
    arch/tile: adopt the new nmi_backtrace framework
    nmi_backtrace: do a local dump_stack() instead of a self-NMI
    nmi_backtrace: add more trigger_*_cpu_backtrace() methods
    min/max: remove sparse warnings when they're nested
    Documentation/filesystems/proc.txt: add more description for maps/smaps
    mm, proc: fix region lost in /proc/self/smaps
    proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
    proc: add LSM hook checks to /proc//timerslack_ns
    proc: relax /proc//timerslack_ns capability requirements
    meminfo: break apart a very long seq_printf with #ifdefs
    seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
    proc: faster /proc/*/status
    ...

    Linus Torvalds
     
  • If a device tree specifies a preferred device for kernel console output
    via the stdout-path or linux,stdout-path chosen node properties or the
    stdout alias then the kernel ought to honor it & output the kernel
    console to that device. As it stands, this isn't the case. Whilst we
    parse the stdout-path properties & set an of_stdout variable from
    of_alias_scan(), and use that from of_console_check() to determine
    whether to add a console device as a preferred console whilst
    registering it, we also prefer the first registered console if no other
    has been selected at the time of its registration.

    This means that if a console other than the one the device tree selects
    via stdout-path is registered first, we will switch to using it & when
    the stdout-path console is later registered the call to
    add_preferred_console() via of_console_check() is too late to do
    anything useful. In practice this seems to mean that we switch to the
    dummy console device fairly early & see no further console output:

    Console: colour dummy device 80x25
    console [tty0] enabled
    bootconsole [ns16550a0] disabled

    Fix this by not automatically preferring the first registered console if
    one is specified by the device tree. This allows consoles to be
    registered but not enabled, and once the driver for the console selected
    by stdout-path calls of_console_check() the driver will be added to the
    list of preferred consoles before any other console has been enabled.
    When that console is then registered via register_console() it will be
    enabled as expected.

    Link: http://lkml.kernel.org/r/20160809151937.26118-1-paul.burton@imgtec.com
    Signed-off-by: Paul Burton
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: Tejun Heo
    Cc: Sergey Senozhatsky
    Cc: Jiri Slaby
    Cc: Daniel Vetter
    Cc: Ivan Delalande
    Cc: Thierry Reding
    Cc: Borislav Petkov
    Cc: Jan Kara
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Greg Kroah-Hartman
    Cc: Rob Herring
    Cc: Frank Rowand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Burton
     
  • Current supplementary groups code can massively overallocate memory and
    is implemented in a way so that access to individual gid is done via 2D
    array.

    If number of gids is
    Cc: Vasily Kulikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • When doing an nmi backtrace of many cores, most of which are idle, the
    output is a little overwhelming and very uninformative. Suppress
    messages for cpus that are idling when they are interrupted and just
    emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

    We do this by grouping all the cpuidle code together into a new
    .cpuidle.text section, and then checking the address of the interrupted
    PC to see if it lies within that section.
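
    The section check amounts to a half-open range test on the interrupted
    PC. A minimal sketch, assuming the section bounds are available as two
    addresses; the struct and function name are hypothetical, and in the
    kernel the bounds would come from the linker script rather than a
    struct.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a text section's address range. */
struct text_section {
    uintptr_t start; /* first address inside the section */
    uintptr_t end;   /* one past the last address inside the section */
};

/*
 * True when the interrupted PC lies inside the idle-code section, in
 * which case the full backtrace can be replaced by a one-line note.
 */
bool interrupted_while_idle(const struct text_section *cpuidle_text,
                            uintptr_t pc)
{
    return pc >= cpuidle_text->start && pc < cpuidle_text->end;
}
```

    Grouping all idle routines into one section is what makes a single
    comparison pair sufficient, instead of a lookup per idle function.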

    This commit suitably tags x86 and tile idle routines, and only adds in
    the minimal framework for other architectures.

    Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Thompson [arm]
    Tested-by: Petr Mladek
    Cc: Aaron Tomlin
    Cc: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • The global zero page is used to satisfy an anonymous read fault. If
    THP(Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to touch
    the global counter in two cases:

    1 The first time it uses the global huge zero page;
    2 The time when mm_user of its mm_struct reaches zero.
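
    The two cases above can be modelled in user space roughly as follows.
    This is a simplified sketch using C11 atomics: the model_* names and
    struct mm_model are hypothetical, the boolean stands in for
    MMF_USED_HUGE_ZERO_PAGE, and races between threads sharing one mm are
    deliberately ignored.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Shared reference counter that every process would otherwise contend on. */
static atomic_int huge_zero_refcount;

struct mm_model {
    bool used_huge_zero_page; /* models MMF_USED_HUGE_ZERO_PAGE */
};

/* Case 1: only the first use by this mm touches the global counter. */
void model_mm_get_huge_zero_page(struct mm_model *mm)
{
    if (mm->used_huge_zero_page)
        return; /* fast path: no access to the shared counter */

    atomic_fetch_add(&huge_zero_refcount, 1);
    mm->used_huge_zero_page = true;
}

/* Case 2: the reference is dropped once, when the mm's last user goes away. */
void model_mm_put_huge_zero_page(struct mm_model *mm)
{
    if (mm->used_huge_zero_page) {
        atomic_fetch_sub(&huge_zero_refcount, 1);
        mm->used_huge_zero_page = false;
    }
}

/* Current value of the shared counter, for inspection. */
int model_huge_zero_refcount(void)
{
    return atomic_load(&huge_zero_refcount);
}
```

    Repeated faults from the same mm hit the fast path, so the shared
    counter changes at most twice per mm over its lifetime.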

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_user, the kthread is not eligible to use huge
    zero page either. Since no kthread is using huge zero page today, there
    is no difference after applying this patch. But if that is not desired,
    I can change it to when mm_count reaches zero.

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • There are no users of exit_oom_victim on !current task anymore so enforce
    the API to always work on the current.

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") has worked around an existing race between
    oom_killer_disable and oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper because we have stable mm
    available and hide the oom_reaped mm by MMF_OOM_SKIP flag. So let's
    remove exit_oom_victim and the race described in the above commit
    doesn't exist anymore.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    usecase because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on a way to exit for an
    unbounded amount of time). OOM killer can cope with that by checking mm
    flags and move on to another victim but we cannot do the same for
    oom_killer_disable as we would lose the guarantee of no further
    interference of the victim with the rest of the system. What we can do
    instead is to cap the maximum time the oom_killer_disable waits for
    victims. The only current user of this function (pm suspend) already
    has a concept of timeout for back off so we can reuse the same value
    there.

    Let's drop set_freezable for the oom_reaper kthread because it is no
    longer needed as the reaper doesn't wake or thaw any processes.

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with a MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep complains that __mmdrop is not safe from the softirq context:

    =================================
    [ INFO: inconsistent lock state ]
    4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0xa06/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    __change_page_attr_set_clr+0x2a5/0xacd
    change_page_attr_set_clr+0x16f/0x32c
    set_memory_nx+0x37/0x3a
    free_init_pages+0x9e/0xc7
    alternative_instructions+0xa2/0xb3
    check_bugs+0xe/0x2d
    start_kernel+0x3ce/0x3ea
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x17a/0x18d
    irq event stamp: 105916
    hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
    hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
    softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
    softirqs last disabled at (105879): irq_exit+0x6f/0xd1

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pgd_lock);

    lock(pgd_lock);

    *** DEADLOCK ***

    1 lock held by swapper/1/0:
    #0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:

    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0

    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

    Moreover, commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
    explicit that pgd_lock must not be taken from irq context. This
    means that __mmdrop called from free_signal_struct has to be postponed
    to a user context. We already have a similar mechanism for mmput_async
    so we can use it here as well. This is safe because mm_count is pinned
    by mm_users.
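
    The deferral described above can be sketched as a small user-space
    model in Python. This is only an illustration under stated
    assumptions: the MM class, the deque standing in for a kernel
    workqueue, and the helper names final_mmdrop and run_workqueue are
    hypothetical, and the real code is kernel C that this model does not
    reproduce.

```python
from collections import deque


class MM:
    """Toy stand-in for mm_struct: a reference count and a 'freed' marker."""

    def __init__(self):
        self.mm_count = 1
        self.freed = False


def final_mmdrop(mm):
    # In the kernel this step would take pgd_lock, which is why it must
    # not run from softirq context. Here it only marks the object freed.
    mm.freed = True


def mmdrop_async(mm, workqueue):
    # Drop one reference. If it was the last one, do NOT free inline;
    # queue the final teardown so it runs later from a safe context.
    mm.mm_count -= 1
    if mm.mm_count == 0:
        workqueue.append(mm)


def run_workqueue(workqueue):
    # Models the worker that later runs the queued teardown.
    while workqueue:
        final_mmdrop(workqueue.popleft())
```

    In this model the caller in "softirq context" only queues the work;
    nothing is freed until run_workqueue() is called.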

    This fixes a bug introduced by "oom: keep mm of the killed task available".

    Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_reap_task has to call exit_oom_victim in order to make sure that the
    oom victim will not block the oom killer forever. This, however,
    opens new problems (e.g. oom_killer_disable exclusion - see commit
    74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
    race")). Ideally, exit_oom_victim should only be called from the
    victim's context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
      exit_mm
        tsk->mm = NULL;
        mmput
          __mmput
            exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm into the
    signal_struct and bind it to the signal struct life time for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a reliable
    reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via tsk->signal->oom_mm.
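
    A minimal Python model of the idea in the paragraph above, assuming
    simplified stand-ins for the kernel structures. The class names and
    the single-threaded reference counting are illustrative assumptions;
    the function names follow the ones mentioned in this log, but the
    bodies are not the kernel's implementation.

```python
class MM:
    """Toy mm_struct with a structure reference count."""

    def __init__(self):
        self.mm_count = 1


class Signal:
    """Toy signal_struct: remembers the victim's mm once it is marked."""

    def __init__(self):
        self.oom_mm = None


class Task:
    def __init__(self, mm):
        self.mm = mm
        self.signal = Signal()


def mark_oom_victim(task):
    # Record the mm only once and pin it with an extra reference so it
    # stays valid for the lifetime of the signal struct.
    if task.signal.oom_mm is None and task.mm is not None:
        task.signal.oom_mm = task.mm
        task.mm.mm_count += 1


def exit_mm(task):
    # The exit path clears tsk->mm; the pinned copy is unaffected.
    task.mm = None


def is_oom_victim(task):
    # No need to look at task.mm (or search threads for a live mm).
    return task.signal.oom_mm is not None


def free_signal_struct(task):
    # Drop the pin when the signal struct goes away.
    if task.signal.oom_mm is not None:
        task.signal.oom_mm.mm_count -= 1
        task.signal.oom_mm = None
```

    The point of the model is that is_oom_victim() keeps working after
    exit_mm() has cleared task.mm, because the reference lives in the
    signal struct.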

    Increasing the signal_struct for something as unlikely as the oom killer
    is far from ideal but this approach will make the code much more
    reasonable and long term we even might want to move task->mm into the
    signal_struct anyway. In the next step we might want to make the oom
    killer exclusion and access to memory reserves completely independent
    which would be also nice.

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull VFS splice updates from Al Viro:
    "There's a bunch of branches this cycle, both mine and from other folks
    and I'd rather send pull requests separately.

    This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
    (and introduction of such). Gets rid of a lot of code in fs/splice.c
    and elsewhere; there will be followups, but these are for the next
    cycle... Some pipe/splice-related cleanups from Miklos in the same
    branch as well"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    pipe: fix comment in pipe_buf_operations
    pipe: add pipe_buf_steal() helper
    pipe: add pipe_buf_confirm() helper
    pipe: add pipe_buf_release() helper
    pipe: add pipe_buf_get() helper
    relay: simplify relay_file_read()
    switch default_file_splice_read() to use of pipe-backed iov_iter
    switch generic_file_splice_read() to use of ->read_iter()
    new iov_iter flavour: pipe-backed
    fuse_dev_splice_read(): switch to add_to_pipe()
    skb_splice_bits(): get rid of callback
    new helper: add_to_pipe()
    splice: lift pipe_lock out of splice_to_pipe()
    splice: switch get_iovec_page_array() to iov_iter
    splice_to_pipe(): don't open-code wakeup_pipe_readers()
    consistent treatment of EFAULT on O_DIRECT read/write

    Linus Torvalds
     
  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block layer changes in 4.9.

    As mentioned at the last merge window, I've changed things up and now
    do just one branch for core block layer changes, and driver changes.
    This avoids dependencies between the two branches. Outside of this
    main pull request, there are two topical branches coming as well.

    This pull request contains:

    - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

    - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
    Followup dependency fix from Geert.

    - General fixes from Bart, Baoyou, Guoqing, and Linus W.

    - CFQ async write starvation fix from Glauber.

    - Add support for delayed kick of the requeue list, from Mike.

    - Pull out the scalable bitmap code from blk-mq-tag.c and make it
    generally available under the name of sbitmap. Only blk-mq-tag uses
    it for now, but the blk-mq scheduling bits will use it as well.
    From Omar.

    - bdev thaw error propagation from Pierre.

    - Improve the blk polling statistics, and allow the user to clear
    them. From Stephen.

    - Set of minor cleanups from Christoph in block/blk-mq.

    - Set of cleanups and optimizations from me for block/blk-mq.

    - Various nvme/nvmet/nvmeof fixes from the various folks"

    * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
    fs/block_dev.c: return the right error in thaw_bdev()
    nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
    nvme/scsi: Remove power management support
    nvmet: Make dsm number of ranges zero based
    nvmet: Use direct IO for writes
    admin-cmd: Added smart-log command support.
    nvme-fabrics: Add host_traddr options field to host infrastructure
    nvme-fabrics: revise host transport option descriptions
    nvme-fabrics: rework nvmf_get_address() for variable options
    nbd: use BLK_MQ_F_BLOCKING
    blkcg: Annotate blkg_hint correctly
    cfq: fix starvation of asynchronous writes
    blk-mq: add flag for drivers wanting blocking ->queue_rq()
    blk-mq: remove non-blocking pass in blk_mq_map_request
    blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
    block: export bio_free_pages to other modules
    lightnvm: propagate device_add() error code
    lightnvm: expose device geometry through sysfs
    lightnvm: control life of nvm_dev in driver
    blk-mq: register device instead of disk
    ...

    Linus Torvalds
     
  • Pull trivial updates from Jiri Kosina:
    "The usual rocket science from the trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    tracing/syscalls: fix multiline in error message text
    lib/Kconfig.debug: fix DEBUG_SECTION_MISMATCH description
    doc: vfs: fix fadvise() sycall name
    x86/entry: spell EBX register correctly in documentation
    securityfs: fix securityfs_create_dir comment
    irq: Fix typo in tracepoint.xml

    Linus Torvalds
     
  • Pull livepatching updates from Jiri Kosina:

    - fix for patching modules that contain .altinstructions or
    .parainstructions sections, from Jessica Yu

    - make TAINT_LIVEPATCH a per-module flag (so that it's immediately
    clear which module caused the taint), from Josh Poimboeuf

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch/module: make TAINT_LIVEPATCH module-specific
    Documentation: livepatch: add section about arch-specific code
    livepatch/x86: apply alternatives and paravirt patches after relocations
    livepatch: use arch_klp_init_object_loaded() to finish arch-specific tasks

    Linus Torvalds
     

07 Oct, 2016

3 commits

  • Pull tracing updates from Steven Rostedt:
    "This release cycle is rather small. Just a few fixes to tracing.

    The big change is the addition of the hwlat tracer. It not only
    detects SMIs, but also other latency that's caused by the hardware. I
    have detected some latency from large boxes having bus contention"

    * tag 'trace-v4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Call traceoff trigger after event is recorded
    ftrace/scripts: Add helper script to bisect function tracing problem functions
    tracing: Have max_latency be defined for HWLAT_TRACER as well
    tracing: Add NMI tracing in hwlat detector
    tracing: Have hwlat trace migrate across tracing_cpumask CPUs
    tracing: Add documentation for hwlat_detector tracer
    tracing: Added hardware latency tracer
    ftrace: Access ret_stack->subtime only in the function profiler
    function_graph: Handle TRACE_BPUTS in print_graph_comment
    tracing/uprobe: Drop isdigit() check in create_trace_uprobe

    Linus Torvalds
     
  • Pull KVM updates from Radim Krčmář:
    "All architectures:
    - move `make kvmconfig` stubs from x86
    - use 64 bits for debugfs stats

    ARM:
    - Important fixes for not using an in-kernel irqchip
    - handle SError exceptions and present them to guests if appropriate
    - proxying of GICV access at EL2 if guest mappings are unsafe
    - GICv3 on AArch32 on ARMv8
    - preparations for GICv3 save/restore, including ABI docs
    - cleanups and a bit of optimizations

    MIPS:
    - A couple of fixes in preparation for supporting MIPS EVA host
    kernels
    - MIPS SMP host & TLB invalidation fixes

    PPC:
    - Fix the bug which caused guests to falsely report lockups
    - other minor fixes
    - a small optimization

    s390:
    - Lazy enablement of runtime instrumentation
    - up to 255 CPUs for nested guests
    - rework of machine check delivery
    - cleanups and fixes

    x86:
    - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery
    - Hyper-V TSC page
    - per-vcpu tsc_offset in debugfs
    - accelerated INS/OUTS in nVMX
    - cleanups and fixes"

    * tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits)
    KVM: MIPS: Drop dubious EntryHi optimisation
    KVM: MIPS: Invalidate TLB by regenerating ASIDs
    KVM: MIPS: Split kernel/user ASID regeneration
    KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
    KVM: arm/arm64: vgic: Don't flush/sync without a working vgic
    KVM: arm64: Require in-kernel irqchip for PMU support
    KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
    KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
    KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie
    KVM: PPC: BookE: Fix a sanity check
    KVM: PPC: Book3S HV: Take out virtual core piggybacking code
    KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
    ARM: gic-v3: Work around definition of gic_write_bpr1
    KVM: nVMX: Fix the NMI IDT-vectoring handling
    KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
    KVM: nVMX: Fix reload apic access page warning
    kvmconfig: add virtio-gpu to config fragment
    config: move x86 kvm_guest.config to a common location
    arm64: KVM: Remove duplicating init code for setting VMID
    ARM: KVM: Support vgic-v3
    ...

    Linus Torvalds
     
  • Pull namespace updates from Eric Biederman:
    "This set of changes is a number of smaller things that have been
    overlooked in other development cycles focused on more fundamental
    change. The devpts changes are small things that were a distraction
    until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a
    trivial regression fix to autofs for the unprivileged mount changes
    that went in last cycle. A pair of ioctls has been added by Andrey
    Vagin making it possible to discover the relationships between
    namespaces when referring to them through file descriptors.

    The big user visible change is starting to add simple resource limits
    to catch programs that misbehave. With namespaces in general and user
    namespaces in particular allowing users to use more kinds of
    resources, it has become important to have something to limit errant
    programs. Because the purpose of these limits is to catch errant
    programs the code needs to be inexpensive to use as it is always on, and
    the default limits need to be high enough that well behaved programs
    on well behaved systems don't encounter them.

    To this end, after some review I have implemented per-user,
    per-user-namespace limits, and use them to limit the number of
    namespaces. The limits being per user mean that one user cannot
    exhaust the limits of another user. The limits being per user
    namespace allow contexts where the limit is 0 and security conscious
    folks can remove from their threat analysis the code used to manage
    namespaces (as they have historically done as it is root only). At the
    same time the limits being per user namespace allow other parts of the
    system to use namespaces.

    Namespaces are increasingly being used in application sandboxing
    scenarios, so an all-or-nothing disable for the entire system for the
    security conscious folks makes increasing use of these sandboxes
    impossible.

    A limit has also been added on the maximum number of mounts present in
    a single mount namespace. It is nontrivial to guess what a reasonable
    system wide limit on the number of mount structures in the kernel would
    be, especially as it varies based on how a system is using
    containers. A limit on the number of mounts in a mount namespace
    however is much easier to understand and set. In most cases in
    practice only about 1000 mounts are used. Given that some autofs
    scenarios have the potential to be 30,000 to 50,000 mounts I have set
    the default limit for the number of mounts at 100,000 which is well
    above every known set of users but low enough that the mount hash
    tables don't degrade unreasonably.

    These limits are a start. I expect this establishes a pattern that
    other limits for resources that namespaces use will follow. There has
    been interest in making inotify event limits per user per user
    namespace as well as interest expressed in making details about what
    is going on in the kernel more visible"
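
    The per-user, per-user-namespace accounting described above can be
    modelled roughly as follows. This is a hedged Python sketch, not the
    kernel's code: the UserNS class, the charge() helper, and the small
    limit values are assumptions made for illustration. Only the general
    behaviour (a count per user in each user namespace, checked up the
    namespace hierarchy, with ENOSPC when a limit is reached) is taken
    from the summary and shortlog here.

```python
import errno
from collections import defaultdict


class UserNS:
    """Toy user namespace: parent link, creating uid, and a per-user limit."""

    def __init__(self, parent=None, owner_uid=0, limit=4):
        self.parent = parent
        self.owner_uid = owner_uid          # uid (in the parent) that created it
        self.limit = limit                  # max namespaces per user in this userns
        self.counts = defaultdict(int)      # uid -> namespaces currently charged


def charge(ns, uid):
    """Charge one new namespace to uid in ns and to each ancestor level.

    Returns 0 on success, or -ENOSPC if any level is already at its limit,
    in which case every charge made so far is rolled back.
    """
    charged = []
    while ns is not None:
        if ns.counts[uid] >= ns.limit:
            for charged_ns, charged_uid in charged:
                charged_ns.counts[charged_uid] -= 1
            return -errno.ENOSPC
        ns.counts[uid] += 1
        charged.append((ns, uid))
        # One level up, the charge goes to whoever created this namespace.
        uid, ns = ns.owner_uid, ns.parent
    return 0
```

    With a limit of 0 on a child namespace, every charge() inside it
    fails, which models the "limit is 0" context mentioned above, while
    other users in the parent namespace are unaffected.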

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
    autofs: Fix automounts by using current_real_cred()->uid
    mnt: Add a per mount namespace limit on the number of mounts
    netns: move {inc,dec}_net_namespaces into #ifdef
    nsfs: Simplify __ns_get_path
    tools/testing: add a test to check nsfs ioctl-s
    nsfs: add ioctl to get a parent namespace
    nsfs: add ioctl to get an owning user namespace for ns file descriptor
    kernel: add a helper to get an owning user namespace for a namespace
    devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
    devpts: Remove sync_filesystems
    devpts: Make devpts_kill_sb safe if fsi is NULL
    devpts: Simplify devpts_mount by using mount_nodev
    devpts: Move the creation of /dev/pts/ptmx into fill_super
    devpts: Move parse_mount_options into fill_super
    userns: When the per user per user namespace limit is reached return ENOSPC
    userns; Document per user per user namespace limits.
    mntns: Add a limit on the number of mount namespaces.
    netns: Add a limit on the number of net namespaces
    cgroupns: Add a limit on the number of cgroup namespaces
    ipcns: Add a limit on the number of ipc namespaces
    ...

    Linus Torvalds
     

06 Oct, 2016

2 commits

  • to hell with actors...

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull networking updates from David Miller:

    1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

    2) Do TCP Small Queues for retransmits, from Eric Dumazet.

    3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

    4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

    5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

    6) Support GMAC protocol in iwlwifi mvm, from Ayala Beker.

    7) Support ndo_poll_controller in mlx5, from Calvin Owens.

    8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

    9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

    10) Congestion control in RXRPC, from David Howells.

    11) Support geneve RX offload in ixgbe, from Emil Tantilov.

    12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rather than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

    13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

    14) Support RX network flow classification to igb, from Gangfeng Huang.

    15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

    16) New skbmod packet action, from Jamal Hadi Salim.

    17) Remove some inefficiencies in snmp proc output, from Jia He.

    18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

    19) New dsa driver for qca8xxx chips, from John Crispin.

    20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

    21) Add L3 mode to ipvlan, from Mahesh Bandewar.

    22) Support 802.1ad in mlx4, from Moshe Shemesh.

    23) Support hardware LRO in mediatek driver, from Nelson Chang.

    24) Add TC offloading to mlx5, from Or Gerlitz.

    25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

    26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

    27) NAPI support for ath10k, from Rajkumar Manoharan.

    28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

    29) UDP replicast support in TIPC, from Richard Alpe.

    30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

    31) Support BQL in thunderx driver, from Sunil Goutham.

    32) TSO support in alx driver, from Tobias Regnery.

    33) Add stream parser engine and use it in kcm.

    34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

    35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
    mlxsw: switchx2: Fix misuse of hard_header_len
    mlxsw: spectrum: Fix misuse of hard_header_len
    net/faraday: Stop NCSI device on shutdown
    net/ncsi: Introduce ncsi_stop_dev()
    net/ncsi: Rework the channel monitoring
    net/ncsi: Allow to extend NCSI request properties
    net/ncsi: Rework request index allocation
    net/ncsi: Don't probe on the reserved channel ID (0x1f)
    net/ncsi: Introduce NCSI_RESERVED_CHANNEL
    net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
    net: Add netdev all_adj_list refcnt propagation to fix panic
    net: phy: Add Edge-rate driver for Microsemi PHYs.
    vmxnet3: Wake queue from reset work
    i40e: avoid NULL pointer dereference and recursive errors on early PCI error
    qed: Add RoCE ll2 & GSI support
    qed: Add support for memory registeration verbs
    qed: Add support for QP verbs
    qed: PD,PKEY and CQ verb support
    qed: Add support for RoCE hw init
    qede: Add qedr framework
    ...

    Linus Torvalds
     

05 Oct, 2016

2 commits

  • In commit 27727df240c7 ("Avoid taking lock in NMI path with
    CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
    the timekeeping_get_ns() function, but I forgot to include
    the unit conversion from cycles to nanoseconds, breaking the
    function's output, which impacts users like perf.

    This results in bogus perf timestamps like:
    swapper 0 [000] 253.427536: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426573: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426687: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426800: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426905: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427022: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427127: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427239: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427346: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427463: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 255.426572: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

    Instead of more reasonable expected timestamps like:
    swapper 0 [000] 39.953768: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.064839: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.175956: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.287103: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.398217: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.509324: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.620437: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.731546: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.842654: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.953772: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 41.064881: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

    Add the proper use of timekeeping_delta_to_ns() to convert
    the cycle delta to nanoseconds as needed.
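
    To illustrate why the missing conversion matters, here is a small
    Python model of the mult/shift scaling that clocksources use. The
    function name cycles_to_ns and the example mult and shift values are
    assumptions chosen for the illustration (a hypothetical counter
    where one cycle is 100 ns); the kernel helper named above is
    timekeeping_delta_to_ns(), and this sketch is not its source.

```python
def cycles_to_ns(delta_cycles, mult, shift, xtime_nsec=0):
    """Scale a cycle delta to nanoseconds using fixed-point arithmetic.

    mult and shift encode the nanoseconds-per-cycle ratio as
    mult / 2**shift; xtime_nsec carries an already shifted
    sub-nanosecond remainder.
    """
    return (delta_cycles * mult + xtime_nsec) >> shift


# Hypothetical clocksource: 100 ns per cycle, expressed with shift = 24.
SHIFT = 24
MULT = 100 << SHIFT

delta = 5                                   # cycles since the last update
converted = cycles_to_ns(delta, MULT, SHIFT)
unconverted = delta                         # what a missing conversion yields
```

    Here converted is 500 (nanoseconds) while unconverted is 5, so
    treating raw cycles as nanoseconds distorts any timestamp built on
    top of the delta.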

    Thanks to Brendan and Alexei for finding this quickly after
    the v4.8 release. Unfortunately the problematic commit has
    landed in some -stable trees so they'll need this fix as
    well.

    Many apologies for this mistake. I'll be looking to add a
    perf-clock sanity test to the kselftest timers tests soon.

    Fixes: 27727df240c7 "timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"
    Reported-by: Brendan Gregg
    Reported-by: Alexei Starovoitov
    Tested-and-reviewed-by: Mathieu Desnoyers
    Signed-off-by: John Stultz
    Cc: Peter Zijlstra
    Cc: stable
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
     
  • Pull audit updates from Paul Moore:
    "Another relatively small pull request for v4.9 with just two patches.

    The patch from Richard updates the list of features we support and
    report back to userspace; this should have been sent earlier with the
    rest of the v4.8 patches but it got lost in my inbox.

    The second patch fixes a problem reported by our Android friends where
    we weren't very consistent in recording PIDs"

    * 'stable-4.9' of git://git.infradead.org/users/pcmoore/audit:
    audit: add exclude filter extension to feature bitmap
    audit: consistently record PIDs with task_tgid_nr()

    Linus Torvalds