21 Nov, 2015

2 commits

  • Commit 08d78658f393 ("panic: release stale console lock to always get the
    logbuf printed out") introduced an unwanted bad unlock balance report when
    panic() is called directly and not from OOPS (e.g. from out_of_memory()).
    The difference is that in the OOPS case we disable lock debugging in
    oops_enter(), whereas on a direct panic() call nobody does that.

    Fixes: 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out")
    Reported-by: kernel test robot
    Signed-off-by: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Masami Hiramatsu
    Cc: Jiri Kosina
    Cc: Baoquan He
    Cc: Prarit Bhargava
    Cc: Xie XiuQi
    Cc: Seth Jennings
    Cc: "K. Y. Srinivasan"
    Cc: Jan Kara
    Cc: Petr Mladek
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • sigsuspend() is used nowhere except in signal.c itself, so we can mark it
    static and not pollute the global namespace.

    But this patch is more than a boring cleanup patch; it fixes a real issue
    on UserModeLinux. UML has a special console driver to display ttys using
    xterm, or other terminal emulators, on the host side. Vegard reported
    that sometimes UML is unable to spawn an xterm and he's facing the
    following warning:

    WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()

    It turned out that this warning makes absolutely no sense, as the UML
    xterm code calls sigsuspend() on the host side, or at least it tries to.
    But as the kernel itself offers a sigsuspend() symbol, the linker chooses
    this one instead of the glibc wrapper. Interestingly, this code has
    always appeared to work but always blocked signals on the wrong side.
    Some recent kernel change made the WARN_ON() trigger and uncovered the
    bug.

    It is a wonderful example of how much works by chance on computers. :-)

    Fixes: 68f3f16d9ad0f1 ("new helper: sigsuspend()")
    Signed-off-by: Richard Weinberger
    Reported-by: Vegard Nossum
    Tested-by: Vegard Nossum
    Acked-by: Oleg Nesterov
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     

20 Nov, 2015

1 commit


16 Nov, 2015

3 commits

  • Pull perf updates from Thomas Gleixner:
    "Mostly updates to the perf tool plus two fixes to the kernel core code:

    - Handle tracepoint filters correctly for inherited events (Peter
    Zijlstra)

    - Prevent a deadlock in perf_lock_task_context (Paul McKenney)

    - Add missing newlines to some pr_err() calls (Arnaldo Carvalho de
    Melo)

    - Print full source file paths when using 'perf annotate --print-line
    --full-paths' (Michael Petlan)

    - Fix 'perf probe -d' when just one out of uprobes and kprobes is
    enabled (Wang Nan)

    - Add compiler.h to list.h to fix 'make perf-tar-src-pkg' generated
    tarballs, i.e. out of tree building (Arnaldo Carvalho de Melo)

    - Add the llvm-src-base.c and llvm-src-kbuild.c files, generated by
    the 'perf test' LLVM entries, when running it in-tree, to
    .gitignore (Yunlong Song)

    - libbpf error reporting improvements, using a strerror interface to
    more precisely tell the user about problems with the provided
    scriptlet, be it in C or as a ready made object file (Wang Nan)

    - Do not be case sensitive when searching for matching 'perf test'
    entries (Arnaldo Carvalho de Melo)

    - Inform the user about objdump failures in 'perf annotate' (Andi
    Kleen)

    - Improve the LLVM 'perf test' entry, introduce new ones for BPF
    and kbuild tests to check the environment used by clang to compile
    .c scriptlets (Wang Nan)"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
    perf/x86/intel/rapl: Remove the unused RAPL_EVENT_DESC() macro
    tools include: Add compiler.h to list.h
    perf probe: Verify parameters in two functions
    perf session: Add missing newlines to some pr_err() calls
    perf annotate: Support full source file paths for srcline fix
    perf test: Add llvm-src-base.c and llvm-src-kbuild.c to .gitignore
    perf: Fix inherited events vs. tracepoint filters
    perf: Disable IRQs across RCU RS CS that acquires scheduler lock
    perf test: Do not be case sensitive when searching for matching tests
    perf test: Add 'perf test BPF'
    perf test: Enhance the LLVM tests: add kbuild test
    perf test: Enhance the LLVM test: update basic BPF test program
    perf bpf: Improve BPF related error messages
    perf tools: Make fetch_kernel_version() publicly available
    bpf tools: Add new API bpf_object__get_kversion()
    bpf tools: Improve libbpf error reporting
    perf probe: Cleanup find_perf_probe_point_from_map to reduce redundancy
    perf annotate: Inform the user about objdump failures in --stdio
    perf stat: Make stat options global
    perf sched latency: Fix thread pid reuse issue
    ...

    Linus Torvalds
     
  • Pull scheduler fix from Thomas Gleixner:
    "A single fix to prevent math underflow in the numa balancing code"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Fix math underflow in task_tick_numa()

    Linus Torvalds
     
  • ….kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull irq and timer fixes from Thomas Gleixner:

    - An irq regression fix to restore the wakeup behaviour of chained
    interrupts.

    - A timer fix for a long standing race versus timers scheduled on a
    target cpu which got exposed by recent changes in the workqueue
    implementation.

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/PM: Restore system wake up from chained interrupts

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timers: Use proper base migration in add_timer_on()

    Linus Torvalds
     

13 Nov, 2015

2 commits

  • Pull trace cleanups from Steven Rostedt:
    "This contains three more clean up patches.

    One patch is needed to make tracing work without debugfs now that
    tracing uses its own tracefs.

    The second is removing an unused variable.

    The third is fixing a warning about unused variables when MAX_TRACER
    is not configured. Note, this warning shows up in gcc 6.0, but does
    not show up in gcc 4.9, as it seems that gcc does not complain about
    constants not being used"

    * tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: #ifdef out uses of max trace when CONFIG_TRACER_MAX_TRACE is not set
    tracing: Remove unused ftrace_cpu_disabled per cpu variable
    tracing: Make tracing work when debugfs is not configured in

    Linus Torvalds
     
  • Pull second batch of kvm updates from Paolo Bonzini:
    "Four changes:

    - x86: work around two nasty cases where a benign exception occurs
    while another is being delivered. The endless stream of exceptions
    causes an infinite loop in the processor, which not even NMIs or
    SMIs can interrupt; in the virt case, there is no possibility to
    exit to the host either.

    - x86: support for Skylake per-guest TSC rate. Long supported by
    AMD, the patches mostly move things from there to common
    arch/x86/kvm/ code.

    - generic: remove local_irq_save/restore from the guest entry and
    exit paths when context tracking is enabled. The patches are a few
    months old, but we discussed them again at kernel summit. Andy
    will pick up from here and, in 4.5, try to remove it from the user
    entry/exit paths.

    - PPC: Two bug fixes, see merge commit 370289756becc for details"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
    KVM: x86: rename update_db_bp_intercept to update_bp_intercept
    KVM: svm: unconditionally intercept #DB
    KVM: x86: work around infinite loop in microcode when #AC is delivered
    context_tracking: avoid irq_save/irq_restore on guest entry and exit
    context_tracking: remove duplicate enabled check
    KVM: VMX: Dump TSC multiplier in dump_vmcs()
    KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
    KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
    KVM: VMX: Enable and initialize VMX TSC scaling
    KVM: x86: Use the correct vcpu's TSC rate to compute time scale
    KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
    KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
    KVM: x86: Replace call-back compute_tsc_offset() with a common function
    KVM: x86: Replace call-back set_tsc_khz() with a common function
    KVM: x86: Add a common TSC scaling function
    KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
    KVM: x86: Collect information for setting TSC scaling ratio
    KVM: x86: declare a few variables as __read_mostly
    KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
    KVM: PPC: Book3S HV: Don't dynamically split core when already split
    ...

    Linus Torvalds
     

12 Nov, 2015

1 commit

  • With kASLR enabled, old_addr provided by the patch module is shifted
    accordingly so that the symbol lookups work. To have module relocations
    handled properly as well, the same transformation needs to be performed
    on relocation address information.
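
    As a rough illustration of the idea (not the actual livepatch code), the
    sketch below applies one offset both to a looked-up address and to a
    relocation value. The struct reloc_sketch type, the adjust_for_kaslr()
    and reloc_adjust() helpers, and the choice of which field is shifted are
    assumptions made for the example; the offset is passed in as a plain
    parameter instead of being read from the kernel.

```c
#include <stdint.h>

/* Hypothetical stand-in for one relocation record supplied by a patch module. */
struct reloc_sketch {
	uint64_t loc;	/* where the relocation is applied */
	uint64_t val;	/* address the relocation refers to, as known at build time */
};

/* Shift a build-time address by the randomization offset chosen at boot. */
uint64_t adjust_for_kaslr(uint64_t addr, uint64_t kaslr_off)
{
	return addr + kaslr_off;
}

/* Apply to the relocation value the same shift that old_addr already gets. */
void reloc_adjust(struct reloc_sketch *r, uint64_t kaslr_off)
{
	r->val = adjust_for_kaslr(r->val, kaslr_off);
}
```

    With an offset of zero (kASLR disabled) both helpers leave the addresses
    untouched, so the same code path serves both configurations.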

    [jkosina@suse.cz: extended / reworded changelog a bit]
    Reported-by: Cyril B.
    Signed-off-by: Zhou Chengming
    Acked-by: Josh Poimboeuf
    Signed-off-by: Jiri Kosina

    Zhou Chengming
     

11 Nov, 2015

3 commits

  • Pull networking fixes from David Miller:

    1) Fix null deref in xt_TEE netfilter module, from Eric Dumazet.

    2) Several spots need to get to the original listener for SYN-ACK
    packets; most spots got this right but some did not. Whilst covering
    the remaining cases, create a helper to do this. From Eric Dumazet.

    3) Missing check of return value from alloc_netdev() in CAIF SPI code,
    from Rasmus Villemoes.

    4) Don't sleep while != TASK_RUNNING in macvtap, from Vlad Yasevich.

    5) Use after free in mvneta driver, from Justin Maggard.

    6) Fix race on dst->flags access in dst_release(), from Eric Dumazet.

    7) Add missing ZLIB_INFLATE dependency for new qed driver. From Arnd
    Bergmann.

    8) Fix multicast getsockopt deadlock, from WANG Cong.

    9) Fix deadlock in btusb, from Kuba Pawlak.

    10) Some ipv6_add_dev() failure paths were not cleaning up the SNMP6
    counter state. From Sabrina Dubroca.

    11) Fix packet_bind() race, which can cause lost notifications, from
    Francesco Ruggeri.

    12) Fix MAC restoration in qlcnic driver during bonding mode changes,
    from Jarod Wilson.

    13) Revert bridging forward delay change which broke libvirt and other
    userspace things, from Vlad Yasevich.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    Revert "bridge: Allow forward delay to be cfgd when STP enabled"
    bpf_trace: Make dependent on PERF_EVENTS
    qed: select ZLIB_INFLATE
    net: fix a race in dst_release()
    net: mvneta: Fix memory use after free.
    net: Documentation: Fix default value tcp_limit_output_bytes
    macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
    mvneta: add FIXED_PHY dependency
    net: caif: check return value of alloc_netdev
    net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
    drivers: net: xgene: fix RGMII 10/100Mb mode
    netfilter: nft_meta: use skb_to_full_sk() helper
    net_sched: em_meta: use skb_to_full_sk() helper
    sched: cls_flow: use skb_to_full_sk() helper
    netfilter: xt_owner: use skb_to_full_sk() helper
    smack: use skb_to_full_sk() helper
    net: add skb_to_full_sk() helper and use it in selinux_netlbl_skbuff_setsid()
    bpf: doc: correct arch list for supported eBPF JIT
    dwc_eth_qos: Delete an unnecessary check before the function call "of_node_put"
    bonding: fix panic on non-ARPHRD_ETHER enslave failure
    ...

    Linus Torvalds
     
  • Arnd Bergmann reported:

    In my ARM randconfig tests, I'm getting a build error for
    newly added code in bpf_perf_event_read and bpf_perf_event_output
    whenever CONFIG_PERF_EVENTS is disabled:

    kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
    kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
    if (event->oncpu != smp_processor_id() ||
    ^
    kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
    event->pmu->count)

    This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
    is disabled. I'm not sure if that is a configuration we care
    about, otherwise we could prevent this case from occurring by
    adding Kconfig dependencies.

    Looking at this further, it's really that UPROBE_EVENT enables PERF_EVENTS.
    By just having BPF_EVENTS depend on PERF_EVENTS, then all is fine.

    Link: http://lkml.kernel.org/r/4525348.Aq9YoXkChv@wuerfel
    Reported-by: Arnd Bergmann
    Signed-off-by: Steven Rostedt
    Signed-off-by: David S. Miller

    Steven Rostedt
     
  • Pull libnvdimm updates from Dan Williams:
    "Outside of the new ACPI-NFIT hot-add support this pull request is more
    notable for what it does not contain, than what it does. There were a
    handful of development topics this cycle, dax get_user_pages, dax
    fsync, and raw block dax, that need more iteration and will wait
    for 4.5.

    The patches to make devm and the pmem driver NUMA aware have been in
    -next for several weeks. The hot-add support has not, but is
    contained to the NFIT driver and is passing unit tests. The coredump
    support is straightforward and was looked over by Jeff. All of it has
    received a 0day build success notification across 107 configs.

    Summary:

    - Add support for the ACPI 6.0 NFIT hot add mechanism to process
    updates of the NFIT at runtime.

    - Teach the coredump implementation how to filter out DAX mappings.

    - Introduce NUMA hints for allocations made by the pmem driver, and
    as a side effect all devm allocations now hint their NUMA node by
    default"

    * tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    coredump: add DAX filtering for FDPIC ELF coredumps
    coredump: add DAX filtering for ELF coredumps
    acpi: nfit: Add support for hot-add
    nfit: in acpi_nfit_init, break on a 0-length table
    pmem, memremap: convert to numa aware allocations
    devm_memremap_pages: use numa_mem_id
    devm: make allocations numa aware by default
    devm_memremap: convert to return ERR_PTR
    devm_memunmap: use devres_release()
    pmem: kill memremap_pmem()
    x86, mm: quiet arch_add_memory()

    Linus Torvalds
     

10 Nov, 2015

8 commits

  • tracing_max_lat_fops is used only when TRACER_MAX_TRACE is enabled, so
    also switch the related code. The related warning with defconfig under
    x86_64:

    CC kernel/trace/trace.o
    kernel/trace/trace.c:5466:37: warning: ‘tracing_max_lat_fops’ defined but not used [-Wunused-const-variable]
    static const struct file_operations tracing_max_lat_fops = {

    Signed-off-by: Chen Gang
    Signed-off-by: Steven Rostedt

    Chen Gang
     
  • Commit e509bd7da149 ("genirq: Allow migration of chained interrupts
    by installing default action") breaks PCS wake up IRQ behaviour on
    TI OMAP based platforms (dra7-evm).

    TI OMAP IRQ wake up configuration:
    GIC-irqchip->PCM_IRQ
      |- omap_prcm_register_chain_handler
         |- PRCM-irqchip -> PRCM_IO_IRQ
            |- pcs_irq_chain_handler
               |- pinctrl-irqchip -> PCS_uart1_wakeup_irq

    This happens because IRQ PM code (irq/pm.c) is expected to ignore
    chained interrupts by default:
    static bool suspend_device_irq(struct irq_desc *desc)
    {
            if (!desc->action || desc->no_suspend_depth)
                    return false;
    - it's expected !desc->action = true for chained interrupts;

    but, after above change, all chained interrupt descriptors will
    have default action handler installed - chained_action.
    As a result, chained interrupts will be silently disabled during system
    suspend.

    Hence, fix it by introducing helper function irq_desc_is_chained() and
    use it in suspend_device_irq() for chained interrupts identification
    and skip them, once detected.
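
    The check described above can be sketched in plain C as follows. The
    *_sketch types and names are cut-down, hypothetical stand-ins for the
    kernel structures, and the real suspend_device_irq() does considerably
    more than decide whether to skip a descriptor.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct irqaction_sketch {
	const char *name;
};

struct irq_desc_sketch {
	struct irqaction_sketch *action;
	unsigned int no_suspend_depth;
};

/* Default action that every chained descriptor shares after e509bd7da149. */
struct irqaction_sketch chained_action_sketch = { "chained" };

/* A descriptor is chained when its action is the shared default action. */
bool irq_desc_is_chained_sketch(const struct irq_desc_sketch *desc)
{
	return desc->action != NULL && desc->action == &chained_action_sketch;
}

/* Returns true when the interrupt should be handled by the suspend path. */
bool should_suspend_irq_sketch(const struct irq_desc_sketch *desc)
{
	if (desc->action == NULL || irq_desc_is_chained_sketch(desc) ||
	    desc->no_suspend_depth != 0)
		return false;
	return true;
}
```

    Comparing against the address of the shared action keeps the test cheap
    and avoids adding a new flag to the descriptor.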

    Fixes: e509bd7da149 ("genirq: Allow migration of chained interrupts..")
    Signed-off-by: Grygorii Strashko
    Reviewed-by: Mika Westerberg
    Cc: Tony Lindgren
    Cc: Tony Lindgren
    Link: http://lkml.kernel.org/r/1447149492-20699-1-git-send-email-grygorii.strashko@ti.com
    Signed-off-by: Thomas Gleixner

    Grygorii Strashko
     
  • guest_enter and guest_exit must be called with interrupts disabled,
    since they take the vtime_seqlock with write_seq{lock,unlock}.
    Therefore, it is not necessary to check for exceptions, nor to
    save/restore the IRQ state, when context tracking functions are
    called by guest_enter and guest_exit.

    Split the body of context_tracking_entry and context_tracking_exit
    out to __-prefixed functions, and use them from KVM.

    Rik van Riel has measured this to speed up a tight vmentry/vmexit
    loop by about 2%.

    Cc: Andy Lutomirski
    Cc: Frederic Weisbecker
    Cc: Paul McKenney
    Reviewed-by: Rik van Riel
    Tested-by: Rik van Riel
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • All calls to context_tracking_enter and context_tracking_exit
    are already checking context_tracking_is_enabled, except the
    context_tracking_user_enter and context_tracking_user_exit
    functions left in for the benefit of assembly calls.

    Pull the check up to those functions, by making them simple
    wrappers around the user_enter and user_exit inline functions.

    Cc: Frederic Weisbecker
    Cc: Paul McKenney
    Reviewed-by: Rik van Riel
    Tested-by: Rik van Riel
    Acked-by: Andy Lutomirski
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Merge third patch-bomb from Andrew Morton:
    "We're pretty much done over here - I'm still waiting for a nouveau
    merge so I can cleanly finish up Christoph's dma-mapping rework.

    - bunch of small misc stuff

    - fold abs64() into abs(), remove abs64()

    - new_valid_dev() cleanups

    - binfmt_elf_fdpic feature work"

    * emailed patches from Andrew Morton: (24 commits)
    fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries
    fs/stat.c: remove unnecessary new_valid_dev() check
    fs/reiserfs/namei.c: remove unnecessary new_valid_dev() check
    fs/nilfs2/namei.c: remove unnecessary new_valid_dev() check
    fs/ncpfs/dir.c: remove unnecessary new_valid_dev() check
    fs/jfs: remove unnecessary new_valid_dev() checks
    fs/hpfs/namei.c: remove unnecessary new_valid_dev() check
    fs/f2fs/namei.c: remove unnecessary new_valid_dev() check
    fs/ext2/namei.c: remove unnecessary new_valid_dev() check
    fs/exofs/namei.c: remove unnecessary new_valid_dev() check
    fs/btrfs/inode.c: remove unnecessary new_valid_dev() check
    fs/9p: remove unnecessary new_valid_dev() checks
    include/linux/kdev_t.h: old/new_valid_dev() can return bool
    include/linux/kdev_t.h: remove unused huge_valid_dev()
    kmap_atomic_to_page() has no users, remove it
    drivers/scsi/cxgbi: fix build with EXTRA_CFLAGS
    dma: remove external references to dma_supported
    Documentation/sysctl/vm.txt: fix misleading code reference of overcommit_memory
    remove abs64()
    kernel.h: make abs() work with 64-bit types
    ...

    Linus Torvalds
     
  • Pull module updates from Rusty Russell:
    "Nothing exciting, minor tweaks and cleanups"

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    scripts: [modpost] add new sections to white list
    modpost: Add flag -E for making section mismatches fatal
    params: don't ignore the rest of cmdline if parse_one() fails
    modpost: abort if a module symbol is too long

    Linus Torvalds
     
  • Switch everything to the new and more capable implementation of abs().
    Mainly to give the new abs() a bit of a workout.
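
    For readers unfamiliar with width-aware absolute value macros, here is a
    minimal userspace sketch using C11 _Generic. The name abs_any is
    hypothetical, and this is not how the kernel's abs() is written; it only
    shows one macro covering int, long and long long operands.

```c
#include <stdlib.h>

/*
 * Illustrative type-generic absolute value using C11 _Generic.
 * abs_any is a hypothetical name; the kernel's abs() is implemented
 * differently. Like the libc functions it dispatches to, the result is
 * undefined for the most negative value of each type.
 */
#define abs_any(x) _Generic((x),	\
	long long: llabs,		\
	long: labs,			\
	default: abs)(x)
```

    A call such as abs_any(-5000000000LL) dispatches to llabs(), so the
    64-bit operand is not truncated to int.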

    Cc: Michal Nazarewicz
    Cc: John Stultz
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Masami Hiramatsu
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Dan Williams
     

09 Nov, 2015

3 commits

  • The NUMA balancing code implements delays in scanning by
    advancing curr->node_stamp beyond curr->se.sum_exec_runtime.

    With unsigned math, that creates an underflow, which results
    in task_numa_work being queued all the time, even when we
    don't want to.

    Avoiding the math underflow makes it possible to reduce CPU
    overhead in the NUMA balancing code.
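
    The underflow can be reproduced in a few lines of userspace C. The
    function and parameter names below are made up for the illustration and
    only mirror the quantities named above (the current runtime, the
    node_stamp and the scan period); they are not copied from
    task_tick_numa().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Underflow-prone form: when node_stamp has been advanced past now,
 * the unsigned subtraction wraps to a huge value and the test passes.
 */
bool scan_due_wrapping(uint64_t now, uint64_t node_stamp, uint64_t period)
{
	return now - node_stamp > period;
}

/* Rearranged form: no subtraction, so a node_stamp in the future delays the scan. */
bool scan_due_safe(uint64_t now, uint64_t node_stamp, uint64_t period)
{
	return now > node_stamp + period;
}
```

    With now = 100, node_stamp = 500 and period = 50, the first form reports
    that a scan is due while the second correctly reports that it is not.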

    Reported-and-tested-by: Jan Stancek
    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/1446756983-28173-2-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Arnaldo reported that tracepoint filters seem to misbehave (ie. not
    apply) on inherited events.

    The fix is obvious: filters are only set on the actual (parent)
    event, so use the normal pattern of using this parent event for filters.
    This is safe because each child event has a reference to it.
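
    The pattern referred to above looks roughly like this. struct
    event_sketch and event_filter() are hypothetical, cut-down stand-ins
    used only to show the parent redirection; the real perf structures carry
    far more state.

```c
#include <stddef.h>

/* Hypothetical, cut-down event: only the fields needed for the illustration. */
struct event_sketch {
	struct event_sketch *parent;	/* NULL for the top-level event */
	const char *filter;		/* set only on the top-level event */
};

/* Look up the filter through the parent when the event was inherited. */
const char *event_filter(const struct event_sketch *event)
{
	if (event->parent != NULL)
		event = event->parent;
	return event->filter;
}
```

    An inherited (child) event therefore sees exactly the filter that was
    installed on its parent, even though its own filter field is empty.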

    Reported-by: Arnaldo Carvalho de Melo
    Tested-by: Arnaldo Carvalho de Melo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Adrian Hunter
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Frédéric Weisbecker
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Wang Nan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20151102095051.GN17308@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The perf_lock_task_context() function disables preemption across its
    RCU read-side critical section because that critical section acquires
    a scheduler lock. If there was a preemption during that RCU read-side
    critical section, the rcu_read_unlock() could attempt to acquire scheduler
    locks, resulting in deadlock.

    However, recent optimizations to expedited grace periods mean that IPI
    handlers that execute during preemptible RCU read-side critical sections
    can now cause the subsequent rcu_read_unlock() to acquire scheduler locks.
    Disabling preemption does nothing to prevent these IPI handlers from
    executing, so these optimizations introduced a deadlock. In theory,
    this deadlock could be avoided by pulling all wakeups and printk()s out
    from rnp->lock critical sections, but in practice this would re-introduce
    some RCU CPU stall warning bugs.

    Given that acquiring scheduler locks entails disabling interrupts, these
    deadlocks can be avoided by disabling interrupts (instead of disabling
    preemption) across any RCU read-side critical section that acquires scheduler
    locks and holds them across the rcu_read_unlock(). This commit therefore
    makes this change for perf_lock_task_context().

    Reported-by: Dave Jones
    Reported-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20151104134838.GR29027@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

08 Nov, 2015

2 commits

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton: (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     
  • Since the ring buffer is lockless, there is no need to disable ftrace on
    a CPU. And no one does so: after commit 68179686ac67cb ("tracing: Remove
    ftrace_disable/enable_cpu()") ftrace_cpu_disabled stays the same after
    initialization; nothing changes it.
    ftrace_cpu_disabled shouldn't be used by any external module since it
    disables only function and graph_function tracers but not any other
    tracer.

    Link: http://lkml.kernel.org/r/1446836846-22239-1-git-send-email-0x7f454c46@gmail.com

    Signed-off-by: Dmitry Safonov
    Signed-off-by: Steven Rostedt

    Dmitry Safonov
     

07 Nov, 2015

10 commits

  • In some cases we may end up killing the CPU holding the console lock
    while still having valuable data in logbuf. E.g. I'm observing the
    following:

    - A crash is happening on one CPU and console_unlock() is being called on
    some other.

    - console_unlock() tries to print out the buffer before releasing the lock
    and on a slow console it takes time.

    - in the meantime the crashing CPU does lots of printk()-s with valuable
    data (which go to the logbuf) and sends IPIs to all other CPUs.

    - console_unlock() finishes printing previous chunk and enables interrupts
    before trying to print out the rest, the CPU catches the IPI and never
    releases console lock.

    This is not the only possible case: in VT/fb subsystems we have many other
    console_lock()/console_unlock() users. Non-masked interrupts (or
    receiving NMI in case of extreme slowness) will have the same result.
    Getting the whole console buffer printed out on crash should be top
    priority.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Masami Hiramatsu
    Cc: Jiri Kosina
    Cc: Baoquan He
    Cc: Prarit Bhargava
    Cc: Xie XiuQi
    Cc: Seth Jennings
    Cc: "K. Y. Srinivasan"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
    the current pid namespace. This is in contrast to both the other modes of
    setpriority and the example of kill(-1). Fix this. getpriority and
    ioprio have the same failure mode, fix them too.

    Eric said:

    : After some more thinking about it this patch sounds justifiable.
    :
    : My goal with namespaces is not to build perfect isolation mechanisms
    : as that can get into ill defined territory, but to build well defined
    : mechanisms. And to handle the corner cases so you can use only
    : a single namespace with well defined results.
    :
    : In this case you have found the two interfaces I am aware of that
    : identify processes by uid instead of by pid. Which quite frankly is
    : weird. Unfortunately the weird unexpected cases are hard to handle
    : in the usual way.
    :
    : I was hoping for a little more information. Changes like this one we
    : have to be careful of because someone might be depending on the current
    : behavior. I don't think they are and I do think this make sense as part
    : of the pid namespace.

    Signed-off-by: Ben Segall
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Ambrose Feinstein
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Segall
     
  • kexec output messages lost the "kexec" prefix when Dave Young split the
    kexec code. Use the file name as the output message prefix instead.

    Currently, the format of output message:
    [ 140.290795] SYSC_kexec_load: hello, world
    [ 140.291534] kexec: sanity_check_segment_list: hello, world

    Ideally, the format of output message:
    [ 30.791503] kexec: SYSC_kexec_load, Hello, world
    [ 79.182752] kexec_core: sanity_check_segment_list, Hello, world

    Remove the custom prefix "kexec" in output message.
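
    A file-name prefix of this kind is typically produced by a pr_fmt()
    style macro. The userspace sketch below imitates that mechanism:
    MODNAME_SKETCH stands in for the module name the build system would
    provide, and pr_info_sketch() is a hypothetical helper that formats into
    a buffer instead of the kernel log. It relies on the GNU ##__VA_ARGS__
    extension.

```c
#include <stdio.h>

/* Stand-in for the per-file module name supplied by the build system. */
#define MODNAME_SKETCH "kexec_core"

/* Prepend the module name to every format string at compile time. */
#define pr_fmt(fmt) MODNAME_SKETCH ": " fmt

/* Userspace stand-in for pr_info(): format into buf, return snprintf()'s result. */
#define pr_info_sketch(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), ##__VA_ARGS__)
```

    Because the prefix is glued on by string-literal concatenation, call
    sites no longer need to spell out "kexec" themselves.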

    Signed-off-by: Minfei Huang
    Acked-by: Dave Young
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • task_will_free_mem() is wrong in many ways, and in particular the
    SIGNAL_GROUP_COREDUMP check is not reliable: a task can participate in the
    coredumping without SIGNAL_GROUP_COREDUMP bit set.

    Change zap_threads() paths to always set SIGNAL_GROUP_COREDUMP even if
    other CLONE_VM processes can't react to SIGKILL. Fortunately, at least
    the oom-kill case is fine; it kills all tasks sharing the same mm, so it
    should also kill the process which actually dumps the core.

    The change in prepare_signal() is not strictly necessary, it just ensures
    that the patch does not bring another subtle behavioural change. But it
    reminds us that this SIGNAL_GROUP_EXIT/COREDUMP case needs more changes.

    Signed-off-by: Oleg Nesterov
    Cc: David Rientjes
    Cc: Kyle Walker
    Acked-by: Michal Hocko
    Cc: Stanislav Kozina
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • It is hardly possible to enumerate all problems with block_all_signals()
    and unblock_all_signals(). Just for example,

    1. block_all_signals(SIGSTOP/etc) simply can't help if the caller is
    multithreaded. Another thread can dequeue the signal and force the
    group stop.

    2. Even if the caller is single-threaded, it will "stop" anyway. It
    will not sleep, but it will spin in kernel space until SIGCONT or
    SIGKILL.

    And a lot more. In short, this interface doesn't work at all, and has
    not for at least the last 10+ years.

    Daniel said:

    Yeah the only times I played around with the DRM_LOCK stuff was when
    old drivers accidentally deadlocked - my impression is that the entire
    DRM_LOCK thing was never really tested properly ;-) Hence I'm all for
    purging where this leaks out of the drm subsystem.

    Signed-off-by: Oleg Nesterov
    Acked-by: Daniel Vetter
    Acked-by: Dave Airlie
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The following statement of ABI/testing/dev-kmsg is not quite right:

    It is not possible to inject messages from userspace with the
    facility number LOG_KERN (0), to make sure that the origin of the
    messages can always be reliably determined.

    Userland actually can inject messages with a facility of 0 by abusing the
    fact that the facility is stored in a u8 data type. By using a facility
    which is a multiple of 256 the assignment of msg->facility in log_store()
    implicitly truncates it to 0, i.e. LOG_KERN, allowing users of /dev/kmsg
    to spoof kernel messages as shown below:

    The following call...
    # printf '<%d>Kernel panic - not syncing: beer empty\n' 0 >/dev/kmsg
    ...leads to the following log entry (dmesg -x | tail -n 1):
    user :emerg : [ 66.137758] Kernel panic - not syncing: beer empty

    However, this call...
    # printf '<%d>Kernel panic - not syncing: beer empty\n' 0x800 >/dev/kmsg
    ...leads to the slightly different log entry (note the kernel facility):
    kern :emerg : [ 74.177343] Kernel panic - not syncing: beer empty

    Fix that by limiting the user-provided facility to 8 bits right from the
    beginning, which catches the truncation early.
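
    The truncation can be modelled outside the kernel. The sketch below is
    only an illustration, not the kernel code: the function names are
    hypothetical, and the step that forces a zero facility to LOG_USER is an
    assumption inferred from the "user" facility in the first log entry.

```python
LOG_KERN = 0  # facility reserved for the kernel itself
LOG_USER = 1  # facility assumed to be enforced for /dev/kmsg writers


def parse_prefix(value):
    """Split a syslog prefix value into (facility, level)."""
    return value >> 3, value & 7


def store_before_fix(value):
    """Model of the old path: policy check first, u8 truncation on store."""
    facility, level = parse_prefix(value)
    if facility == LOG_KERN:
        facility = LOG_USER  # userspace may not claim kernel origin
    return facility & 0xFF, level  # a u8 field keeps only the low 8 bits


def store_after_fix(value):
    """Model of the fix: truncate to 8 bits first, then apply the policy."""
    facility, level = parse_prefix(value)
    facility &= 0xFF
    if facility == LOG_KERN:
        facility = LOG_USER
    return facility, level


if __name__ == "__main__":
    for value in (0, 0x800):
        print(hex(value), store_before_fix(value), store_after_fix(value))
```

    With a prefix of 0x800 the facility is 256, which passes the "not zero"
    check and only becomes 0 when stored in 8 bits. Truncating before the
    check closes that gap.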

    Fixes: 7ff9554bb578 ("printk: convert byte-buffer to variable-length...")
    Signed-off-by: Mathias Krause
    Cc: Greg Kroah-Hartman
    Cc: Petr Mladek
    Cc: Alex Elder
    Cc: Joe Perches
    Cc: Kay Sievers
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • Change the param_free_charp() function from static to exported.

    It is used by zswap in the next patch ("zswap: use charp for zswap param
    strings").

    Signed-off-by: Dan Streetman
    Acked-by: Rusty Russell
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • __GFP_WAIT was used to signal that the caller was in atomic context and
    could not sleep. Now it is possible to distinguish between true atomic
    context and callers that are not willing to sleep. The latter should
    clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity, clearing
    it prevents it.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
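
    The flag split described above can be sketched with plain bit masks.
    This is an illustration under stated assumptions, not the kernel
    definitions: the bit positions are invented, the names drop the leading
    "__GFP_" prefix, the composition of GFP_ATOMIC beyond the atomic flag is
    assumed, and the body of gfpflags_allow_blocking() is an assumed reading
    of the helper's purpose.

```python
# Illustrative bit positions only; the real values are defined by the kernel.
HIGH = 1 << 0            # stands in for __GFP_HIGH
ATOMIC = 1 << 1          # stands in for __GFP_ATOMIC
DIRECT_RECLAIM = 1 << 2  # stands in for __GFP_DIRECT_RECLAIM
KSWAPD_RECLAIM = 1 << 3  # stands in for __GFP_KSWAPD_RECLAIM

# Per the text: __GFP_WAIT now means "may enter direct reclaim and wake kswapd".
WAIT = DIRECT_RECLAIM | KSWAPD_RECLAIM

# Assumed composition of a truly atomic, high-priority request.
GFP_ATOMIC = HIGH | ATOMIC | KSWAPD_RECLAIM


def gfpflags_allow_blocking(flags):
    """Assumed helper behaviour: only direct reclaim means the caller may sleep."""
    return bool(flags & DIRECT_RECLAIM)


def historical_wait_check(flags):
    """Old-style test for any __GFP_WAIT bit; it can now give false positives."""
    return bool(flags & WAIT)


# A mempool-backed caller: no direct reclaim, but kswapd is still woken.
mempool_flags = WAIT & ~DIRECT_RECLAIM

if __name__ == "__main__":
    print(gfpflags_allow_blocking(GFP_ATOMIC), historical_wait_check(GFP_ATOMIC))
```

    In this model an atomic request still carries the kswapd bit, so the
    historical check reports it as able to wait while the helper does not.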

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Pull tracing updates from Steven Rostedt:
    "Most of the changes are clean ups and small fixes. Some of them have
    stable tags to them. I searched through my INBOX just as the merge
    window opened and found lots of patches to pull. I ran them through
    all my tests and they were in linux-next for a few days.

    Features added this release:
    ----------------------------

    - Module globbing. You can now filter function tracing to several
    modules. # echo '*:mod:*snd*' > set_ftrace_filter (Dmitry Safonov)

    - Tracer specific options are now visible even when the tracer is not
    active. It was rather annoying that you can only see and modify
    tracer options after enabling the tracer. Now they are in the
    options/ directory even when the tracer is not active. Although
    they are still only visible when the tracer is active in the
    trace_options file.

    - Trace options are now per instance (although some of the tracer
    specific options are global)

    - New tracefs file: set_event_pid. If any pid is added to this file,
    then all events in the instance will filter out events that are not
    part of this pid. sched_switch and sched_wakeup events handle next
    and the wakee pids"

    * tag 'trace-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
    tracefs: Fix refcount imbalance in start_creating()
    tracing: Put back comma for empty fields in boot string parsing
    tracing: Apply tracer specific options from kernel command line.
    tracing: Add some documentation about set_event_pid
    ring_buffer: Remove unneeded smp_wmb() before wakeup of reader benchmark
    tracing: Allow dumping traces without tracking trace started cpus
    ring_buffer: Fix more races when terminating the producer in the benchmark
    ring_buffer: Do no not complete benchmark reader too early
    tracing: Remove redundant TP_ARGS redefining
    tracing: Rename max_stack_lock to stack_trace_max_lock
    tracing: Allow arch-specific stack tracer
    recordmcount: arm64: Replace the ignored mcount call into nop
    recordmcount: Fix endianness handling bug for nop_mcount
    tracepoints: Fix documentation of RCU lockdep checks
    tracing: ftrace_event_is_function() can return boolean
    tracing: is_legal_op() can return boolean
    ring-buffer: rb_event_is_commit() can return boolean
    ring-buffer: rb_per_cpu_empty() can return boolean
    ring_buffer: ring_buffer_empty{cpu}() can return boolean
    ring-buffer: rb_is_reader_page() can return boolean
    ...

    Linus Torvalds
     

06 Nov, 2015

5 commits

  • Currently tracing_init_dentry() returns -ENODEV when debugfs is not
    configured in, which prevents tracefs from being populated with tracing
    files and directories, so we get an empty directory even after we
    manually mount tracefs.

    Make tracing_init_dentry() return NULL when debugfs is not configured in,
    so that a manually mounted tracefs is still populated. Still return
    -ENODEV if debugfs is configured in but is not initialized, or if the
    automount point cannot be created, as that would break backward
    compatibility with older tools.

    Link: http://lkml.kernel.org/r/1446797056-11683-1-git-send-email-hello.wjx@gmail.com

    Signed-off-by: Jiaxing Wang
    Signed-off-by: Steven Rostedt

    Jiaxing Wang
     
  • Merge patch-bomb from Andrew Morton:

    - inotify tweaks

    - some ocfs2 updates (many more are awaiting review)

    - various misc bits

    - kernel/watchdog.c updates

    - Some of mm. I have a huge number of MM patches this time and quite a
    lot of it is quite difficult and much will be held over to next time.

    * emailed patches from Andrew Morton: (162 commits)
    selftests: vm: add tests for lock on fault
    mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
    mm: introduce VM_LOCKONFAULT
    mm: mlock: add new mlock system call
    mm: mlock: refactor mlock, munlock, and munlockall code
    kasan: always taint kernel on report
    mm, slub, kasan: enable user tracking by default with KASAN=y
    kasan: use IS_ALIGNED in memory_is_poisoned_8()
    kasan: Fix a type conversion error
    lib: test_kasan: add some testcases
    kasan: update reference to kasan prototype repo
    kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
    kasan: various fixes in documentation
    kasan: update log messages
    kasan: accurately determine the type of the bad access
    kasan: update reported bug types for kernel memory accesses
    kasan: update reported bug types for not user nor kernel memory accesses
    mm/kasan: prevent deadlock in kasan reporting
    mm/kasan: don't use kasan shadow pointer in generic functions
    mm/kasan: MODULE_VADDR is not available on all archs
    ...

    Linus Torvalds
     
  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (probably applies to other statistical or graphical
    models as well). For the security example, any application transacting in
    data that cannot be swapped out (credit card data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.
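
    The relationship between the VMA flags and the FOLL flags can be
    sketched as below. This is a simplified model covering only the mlock
    case: the bit values are invented and fault_flags_for_vma() is a
    hypothetical helper, not a kernel function.

```python
# Illustrative bit values; the kernel's actual definitions differ.
VM_LOCKED = 1 << 0
VM_LOCKONFAULT = 1 << 1
FOLL_POPULATE = 1 << 0
FOLL_MLOCK = 1 << 1


def fault_flags_for_vma(vm_flags):
    """Hypothetical helper: derive FOLL_* flags from a VMA's lock state."""
    if not (vm_flags & VM_LOCKED):
        return 0  # VM_LOCKONFAULT has no effect without VM_LOCKED
    flags = FOLL_MLOCK  # communicates that the VMA is locked
    if not (vm_flags & VM_LOCKONFAULT):
        flags |= FOLL_POPULATE  # classic mlock: pre-fault missing pages
    return flags


if __name__ == "__main__":
    for vm_flags in (0, VM_LOCKED, VM_LOCKED | VM_LOCKONFAULT):
        print(bin(vm_flags), bin(fault_flags_for_vma(vm_flags)))
```

    Keeping the two FOLL flags separate lets a lock-on-fault VMA be treated
    as locked without forcing its missing pages to be faulted in.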

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • With the refactored mlock code, introduce a new system call for mlock.
    The new call will allow the user to specify what lock states are being
    added. mlock2 is trivial at the moment, but a follow-on patch will add a
    new mlock state making it useful.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Heiko Carstens
    Cc: Geert Uytterhoeven
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Cc: Guenter Roeck
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • The oom killer takes task_lock() in a couple of places solely to protect
    printing the task's comm.

    A process's comm, including current's comm, may change due to
    /proc/pid/comm or PR_SET_NAME.

    The comm will always be NUL-terminated, so the worst race scenario would
    only be during update. We can tolerate a comm being printed that is in
    the middle of an update to avoid taking the lock.

    Other locations in the kernel have already dropped task_lock() when
    printing comm, so this is consistent.

    Signed-off-by: David Rientjes
    Suggested-by: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Sergey Senozhatsky
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes