04 Jul, 2016

7 commits

  • Pull the irq affinity management code, which is kept in a separate branch
    for the block developers to pull.

    Thomas Gleixner
     
  • This is lifted from the blk-mq code and adapted to use the affinity mask
    concept just introduced in the irq handling code. It tries to keep the
    algorithm the same as the one currently used by blk-mq, but improvements
    such as assigning vectors on a per-node basis instead of just per sibling
    are possible with this simple move and refactoring.

    Signed-off-by: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-7-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • Allow the MSI code to provide affinity hints per MSI descriptor.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-6-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Use the affinity hint in the irqdesc allocator. The hint is used to determine
    the node for the allocation and to set the affinity of the interrupt.

    If multiple interrupts are allocated (multi-MSI), the allocator iterates
    over the cpumask and, for each set cpu, allocates the descriptor on that
    cpu's node and sets the initial affinity to that cpu.

    If a single interrupt is allocated (MSI-X) then the allocator uses the first
    cpu in the mask to compute the allocation node and uses the mask for the
    initial affinity setting.

    Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
    prevent userspace from messing with their affinity settings.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Add an extra argument to the irq(domain) allocation functions, so we can
    hand down affinity hints to the allocator. That's necessary to implement
    proper support for multiqueue devices.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-4-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Interrupts marked with this flag are excluded from user space interrupt
    affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel's
    internal affinity mechanism is not blocked.

    This flag will be used for multi-queue device interrupts.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-3-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • No user and we definitely don't want to grow one.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-2-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Jun, 2016

3 commits

  • Pull audit fixes from Paul Moore:
    "Two small patches to fix audit problems in 4.7-rcX: the first fixes a
    potential kref leak, the second removes some header file noise.

    The first is an important bug fix that really should go in before 4.7 is
    released; the second is not critical, but falls into the very-nice-to-have
    category, so I'm including it in the pull request.

    Both patches are straightforward, self-contained, and pass our
    testsuite without problem"

    * 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
    audit: move audit_get_tty to reduce scope and kabi changes
    audit: move calcs after alloc and check when logging set loginuid

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "I've been traveling, so this accumulates more than a week or so of bug
    fixing. It perhaps looks a little worse than it really is.

    1) Fix deadlock in ath10k driver, from Ben Greear.

    2) Increase scan timeout in iwlwifi, from Luca Coelho.

    3) Unbreak STP by properly reinjecting STP packets back into the
    stack. Regression fix from Ido Schimmel.

    4) Mediatek driver fixes (missing malloc failure checks, leaking of
    scratch memory, wrong indexing when mapping TX buffers, etc.) from
    John Crispin.

    5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
    Sowa.

    6) Fix hashing of flows in UDP in the reuseport case, from Xuemin Su.

    7) Fix netlink notifications in ovs for tunnels, delete link messages
    are never emitted because of how the device registry state is
    handled. From Nicolas Dichtel.

    8) Conntrack module leaks kmemcache on unload, from Florian Westphal.

    9) Prevent endless jump loops in nft rules, from Liping Zhang and
    Pablo Neira Ayuso.

    10) Not early enough spinlock initialization in mlx4, from Eric
    Dumazet.

    11) Bind refcount leak in act_ipt, from Cong WANG.

    12) Missing RCU locking in HTB scheduler, from Florian Westphal.

    13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
    barrier, using heap for SG and IV, and erroneous use of async flag
    when allocating AEAD context.)

    14) RCU handling fix in TIPC, from Ying Xue.

    15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
    SIT driver, from Simon Horman.

    16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.

    17) Fix potential deadlock in team enslave, from Ido Schimmel.

    18) Memory leak in KCM procfs handling, from Jiri Slaby.

    19) ESN generation fix in ipv4 ESP, from Herbert Xu.

    20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
    WANG.

    21) Use after free in netem, from Eric Dumazet.

    22) Uninitialized last assert time in multicast router code, from Tom
    Goff.

    23) Skip raw sockets in sock_diag destruction broadcast, from Willem
    de Bruijn.

    24) Fix link status reporting in thunderx, from Sunil Goutham.

    25) Limit resegmentation of retransmit queue so that we do not
    retransmit too large GSO frames. From Eric Dumazet.

    26) Delay bpf program release after grace period, from Daniel
    Borkmann"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
    openvswitch: fix conntrack netlink event delivery
    qed: Protect the doorbell BAR with the write barriers.
    neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
    e1000e: keep VLAN interfaces functional after rxvlan off
    cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
    qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
    bpf, perf: delay release of BPF prog after grace period
    net: bridge: fix vlan stats continue counter
    tcp: do not send too big packets at retransmit time
    ibmvnic: fix to use list_for_each_safe() when delete items
    net: thunderx: Fix TL4 configuration for secondary Qsets
    net: thunderx: Fix link status reporting
    net/mlx5e: Reorganize ethtool statistics
    net/mlx5e: Fix number of PFC counters reported to ethtool
    net/mlx5e: Prevent adding the same vxlan port
    net/mlx5e: Check for BlueFlame capability before allocating SQ uar
    net/mlx5e: Change enum to better reflect usage
    net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
    net/mlx5: Update command strings
    net: marvell: Add separate config ANEG function for Marvell 88E1111
    ...

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "Three fix patches. Two are for cgroup / css init failure path. The
    last one makes css_set_lock irq-safe as the deadline scheduler ends up
    calling put_css_set() from irq context"

    * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Disable IRQs while holding css_set_lock
    cgroup: set css->id to -1 during init
    cgroup: remove redundant cleanup in css_create

    Linus Torvalds
     

29 Jun, 2016

3 commits

  • Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved
    destruction of BPF program from free_event_rcu() callback to __free_event(),
    which is problematic if used with tail calls: if prog A is attached as
    trace event directly, but at the same time present in a tail call map used
    by another trace event program elsewhere, then we need to delay destruction
    via RCU grace period since it can still be in use by the program doing the
    tail call (the prog first needs to be dropped from the tail call map, then
    trace event with prog A attached destroyed, so we get immediate destruction).

    Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Jann Horn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The only users of audit_get_tty and audit_put_tty are internal to
    audit, so move them out of include/linux/audit.h into kernel/audit.h and
    create a proper function rather than inlining it. This also reduces
    kABI changes.

    Suggested-by: Paul Moore
    Signed-off-by: Richard Guy Briggs
    [PM: line wrapped description]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     
  • Move the calculations of values after the allocation in case the
    allocation fails. This avoids wasting effort in the rare case that it
    fails, but more importantly saves us extra logic to release the tty
    ref.

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

25 Jun, 2016

6 commits

  • Pull scheduler fixes from Thomas Gleixner:
    "A couple of scheduler fixes:

    - force watchdog reset while processing sysrq-w

    - fix a deadlock when enabling trace events in the scheduler

    - fixes to the throttled next buddy logic

    - fixes for the average accounting (missing serialization and
    underflow handling)

    - allow kernel threads for fallback to online but not active cpus"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Allow kthreads to fall back to online && !active cpus
    sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
    sched/fair: Initialize throttle_count for new task-groups lazily
    sched/fair: Fix cfs_rq avg tracking underflow
    kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
    sched/debug: Fix deadlock when enabling sched events
    sched/fair: Fix post_init_entity_util_avg() serialization

    Linus Torvalds
     
  • Pull locking fix from Thomas Gleixner:
    "A single fix to address a race in the static key logic"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/static_key: Fix concurrent static_key_slow_inc()

    Linus Torvalds
     
  • Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
    breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:

    kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
    kernel/fork.c:355:8: error: assignment from incompatible pointer type
    stack = alloc_thread_stack_node(tsk, node);
    ^

    Fix it by renaming free_stack() to free_thread_stack(), and updating the
    return type of alloc_thread_stack_node().

    Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • Merge misc fixes from Andrew Morton:
    "Two weeks worth of fixes here"

    * emailed patches from Andrew Morton : (41 commits)
    init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
    autofs: don't get stuck in a loop if vfs_write() returns an error
    mm/page_owner: avoid null pointer dereference
    tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
    fs/nilfs2: fix potential underflow in call to crc32_le
    oom, suspend: fix oom_reaper vs. oom_killer_disable race
    ocfs2: disable BUG assertions in reading blocks
    mm, compaction: abort free scanner if split fails
    mm: prevent KASAN false positives in kmemleak
    mm/hugetlb: clear compound_mapcount when freeing gigantic pages
    mm/swap.c: flush lru pvecs on compound page arrival
    memcg: css_alloc should return an ERR_PTR value on error
    memcg: mem_cgroup_migrate() may be called with irq disabled
    hugetlb: fix nr_pmds accounting with shared page tables
    Revert "mm: disable fault around on emulated access bit architecture"
    Revert "mm: make faultaround produce old ptes"
    mailmap: add Boris Brezillon's email
    mailmap: add Antoine Tenart's email
    mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
    mm: mempool: kasan: don't poot mempool objects in quarantine
    ...

    Linus Torvalds
     
  • Tetsuo has reported the following potential oom_killer_disable vs.
    oom_reaper race:

    (1) freeze_processes() starts freezing user space threads.
    (2) Somebody (maybe a kernel thread) calls out_of_memory().
    (3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
    (4) oom_killer_disable() sets oom_killer_disabled = true.
    (5) P1 leaves __refrigerator() and enters do_exit().
    (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim() itself.
    (7) oom_killer_disable() returns while P1 has not yet finished.
    (8) P1 performs IO / interferes with the freezer.

    This situation is unfortunate. We cannot move oom_killer_disable after
    all the freezable kernel threads are frozen because the oom victim might
    depend on some of those kthreads to make a forward progress to exit so
    we could deadlock. It is also far from trivial to teach the oom_reaper
    to not call exit_oom_victim() because then we would lose a guarantee of
    the OOM killer and oom_killer_disable forward progress because
    exit_mm->mmput might block and never call exit_oom_victim.

    It seems the easiest way forward is to workaround this race by calling
    try_to_freeze_tasks again after oom_killer_disable. This will make sure
    that all the tasks are frozen or it bails out.

    Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
    Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We've had the thread info allocated together with the thread stack for
    most architectures for a long time (since the thread_info was split off
    from the task struct), but that is about to change.

    But the patches that move the thread info to be off-stack (and a part of
    the task struct instead) made it clear how confused the allocator and
    freeing functions are.

    Because the common case was that we share an allocation with the thread
    stack and the thread_info, the two pointers were identical. That
    identity then meant that we would have things like

    ti = alloc_thread_info_node(tsk, node);
    ...
    tsk->stack = ti;

    which certainly _worked_ (since stack and thread_info have the same
    value), but is rather confusing: why are we assigning a thread_info to
    the stack? And if we move the thread_info away, the "confusing" code
    just gets to be entirely bogus.

    So remove all this confusion, and make it clear that we are doing the
    stack allocation by renaming and clarifying the function names to be
    about the stack. The fact that the thread_info then shares the
    allocation is an implementation detail, and not really about the
    allocation itself.

    This is a pure renaming and type fix: we pass in the same pointer, it's
    just that we clarify what the pointer means.

    The ia64 code that actually only has one single allocation (for all of
    task_struct, thread_info and kernel thread stack) now looks a bit odd,
    but since "tsk->stack" is actually not even used there, that oddity
    doesn't matter. Cleaning that up would be a separate change; I
    intentionally left the ia64 changes as a pure brute-force renaming and
    type change.

    Acked-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Jun, 2016

6 commits

  • During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
    online but not active. A CPU_ONLINE callback may create or bind a
    kthread so that its cpus_allowed mask only allows the CPU which is
    being brought online. The kthread may start executing before the CPU
    is made active and can end up in select_fallback_rq().

    In such cases, the expected behavior is selecting the CPU which is
    coming online; however, because select_fallback_rq() only chooses from
    active CPUs, it determines that the task doesn't have any viable CPU
    in its allowed mask and ends up overriding it to cpu_possible_mask.

    CPU_ONLINE callbacks should be able to put kthreads on the CPU which
    is coming online. Update select_fallback_rq() so that it follows
    cpu_online() rather than cpu_active() for kthreads.

    Reported-by: Gautham R Shenoy
    Tested-by: Gautham R. Shenoy
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Abdul Haleem
    Cc: Aneesh Kumar
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • The hierarchy could already be throttled at this point. A throttled next
    buddy could trigger a NULL pointer dereference in pick_next_task_fair().

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • A cgroup created inside a throttled group must inherit the current
    throttle_count. A broken throttle_count allows throttled entries to be
    nominated as the next buddy, which later leads to a NULL pointer
    dereference in pick_next_task_fair().

    This patch initializes cfs_rq->throttle_count at the first enqueue:
    laziness allows us to skip locking all runqueues at group creation. The
    lazy approach also allows skipping a full sub-tree scan when throttling
    the hierarchy (not in this patch).

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • The following scenario is possible:

    CPU 1                                  CPU 2
    static_key_slow_inc()
     atomic_inc_not_zero()
      -> key.enabled == 0, no increment
     jump_label_lock()
     atomic_inc_return()
      -> key.enabled == 1 now
                                           static_key_slow_inc()
                                            atomic_inc_not_zero()
                                             -> key.enabled == 1, inc to 2
                                            return
                                           ** static key is wrong!
     jump_label_update()
     jump_label_unlock()

    Testing the static key at the point marked by (**) will follow the
    wrong path for jumps that have not been patched yet. This can
    actually happen when creating many KVM virtual machines with userspace
    LAPIC emulation; just run several copies of the following program:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            int kvmfd = open("/dev/kvm", O_RDONLY);
            int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
            close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
            close(vmfd);
            close(kvmfd);
        }
        return 0;
    }

    Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
    The static key's purpose is to skip NULL pointer checks and indeed one
    of the processes eventually dereferences NULL.

    As explained in the commit that introduced the bug:

    706249c222f6 ("locking/static_keys: Rework update logic")

    jump_label_update() needs key.enabled to be true. The solution adopted
    here is to temporarily make key.enabled == -1, and go down the slow path
    when key.enabled <= 0.
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v4.3+
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 706249c222f6 ("locking/static_keys: Rework update logic")
    Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
    [ Small stylistic edits to the changelog and the code. ]
    Signed-off-by: Ingo Molnar

    Paolo Bonzini
     
  • While testing the deadline scheduler + cgroup setup I hit this
    warning.

    [ 132.612935] ------------[ cut here ]------------
    [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
    [ 132.612952] Modules linked in: (a ton of modules...)
    [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
    [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
    [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
    [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
    [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
    [ 132.612986] Call Trace:
    [ 132.612987] [] dump_stack+0x63/0x85
    [ 132.612994] [] __warn+0xcb/0xf0
    [ 132.612997] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.612999] [] warn_slowpath_null+0x1d/0x20
    [ 132.613000] [] __local_bh_enable_ip+0x6b/0x80
    [ 132.613008] [] _raw_write_unlock_bh+0x1a/0x20
    [ 132.613010] [] _raw_spin_unlock_bh+0xe/0x10
    [ 132.613015] [] put_css_set+0x5c/0x60
    [ 132.613016] [] cgroup_free+0x7f/0xa0
    [ 132.613017] [] __put_task_struct+0x42/0x140
    [ 132.613018] [] dl_task_timer+0xca/0x250
    [ 132.613027] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.613030] [] __hrtimer_run_queues+0xee/0x270
    [ 132.613031] [] hrtimer_interrupt+0xa8/0x190
    [ 132.613034] [] local_apic_timer_interrupt+0x38/0x60
    [ 132.613035] [] smp_apic_timer_interrupt+0x3d/0x50
    [ 132.613037] [] apic_timer_interrupt+0x8c/0xa0
    [ 132.613038] [] ? native_safe_halt+0x6/0x10
    [ 132.613043] [] default_idle+0x1e/0xd0
    [ 132.613044] [] arch_cpu_idle+0xf/0x20
    [ 132.613046] [] default_idle_call+0x2a/0x40
    [ 132.613047] [] cpu_startup_entry+0x2e7/0x340
    [ 132.613048] [] start_secondary+0x155/0x190
    [ 132.613049] ---[ end trace f91934d162ce9977 ]---

    The warning comes from the spin_(lock|unlock)_bh(&css_set_lock) calls
    being made in interrupt context. Convert the spin_lock_bh calls to
    spin_lock_irq(save) to avoid this problem - and other problems that come
    with sharing a spinlock with an interrupt.

    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Cc: cgroups@vger.kernel.org
    Cc: stable@vger.kernel.org # 4.5+
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Rik van Riel
    Reviewed-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Daniel Bristot de Oliveira
     
  • None of the code actually wants a thread_info, it all wants a
    task_struct, and it's just converting back and forth between the two
    ("ti->task" to get the task_struct from the thread_info, and
    "task_thread_info(task)" to go the other way).

    No semantic change.

    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Jun, 2016

1 commit

  • The function irq_create_of_mapping() is used to create an interrupt
    mapping. However, whether the irqdomain to which the interrupt belongs is
    part of a hierarchy determines whether the mapping is created by calling
    irq_domain_alloc_irqs() or irq_create_mapping().

    To dispose of the interrupt mapping, drivers call irq_dispose_mapping().
    However, this function does not check to see if the irqdomain is part
    of a hierarchy or not and simply assumes that it was mapped via calling
    irq_create_mapping() so calls irq_domain_disassociate() to unmap the
    interrupt.

    Fix this by checking to see if the irqdomain is part of a hierarchy and
    if so call irq_domain_free_irqs() to free/unmap the interrupt.

    Signed-off-by: Jon Hunter
    Cc: Marc Zyngier
    Cc: Jiang Liu
    Link: http://lkml.kernel.org/r/1466501002-16368-1-git-send-email-jonathanh@nvidia.com
    Signed-off-by: Thomas Gleixner

    Jon Hunter
     

21 Jun, 2016

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Two fixes for the tracing system:

    - When trace_printk() is used with a non-constant format descriptor,
    it adds a NULL pointer into the trace format section, and the code
    isn't prepared to deal with it. This bug appeared in a change that
    was added in v3.5.

    - The ftracetest (selftests section) can't handle testing histograms
    when histograms are not configured. Currently it shows that they
    fail the test, when they should state that they are unsupported.
    This bug was added in the 4.7 merge window with the addition of the
    histogram code"

    * tag 'trace-v4.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftracetest: Fix hist unsupported result in hist selftests
    tracing: Handle NULL formats in hold_module_trace_bprintk_format()

    Linus Torvalds
     

20 Jun, 2016

2 commits

  • If a task uses a non-constant string for the format parameter in
    trace_printk(), then the trace_printk_fmt variable is set to NULL. This
    variable is then saved in the __trace_printk_fmt section.

    The function hold_module_trace_bprintk_format() checks to see if duplicate
    formats are used by modules, and reuses them if so (saves them to the list
    if it is new). But this function calls lookup_format() that does a strcmp()
    to the value (which is now NULL) and can cause a kernel oops.

    This wasn't an issue till 3debb0a9ddb ("tracing: Fix trace_printk() to print
    when not using bprintk()") which added "__used" to the trace_printk_fmt
    variable, and before that, the kernel simply optimized it out (no NULL value
    was saved).

    The fix is simply to handle the NULL pointer in lookup_format() and have the
    caller ignore the value if it was NULL.

    Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com

    Reported-by: xingzhen
    Acked-by: Namhyung Kim
    Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
    Cc: stable@vger.kernel.org # v3.5+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • As per commit:

    b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

    > the code generated from update_cfs_rq_load_avg():
    >
    > if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    > sa->load_avg = max_t(long, sa->load_avg - r, 0);
    > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    > removed_load = 1;
    > }
    >
    > turns into:
    >
    > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    > ffffffff8108706b: 48 85 c0 test %rax,%rax
    > ffffffff8108706e: 74 40 je ffffffff810870b0
    > ffffffff81087070: 4c 89 f8 mov %r15,%rax
    > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.

    So I _should_ have looked at other unserialized users of ->load_avg, but
    alas. Luckily nikbor reported a similar divide-by-zero from task_h_load()
    which instantly triggered recollection of this here problem.

    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the sign
    bit. This reduces the effective width of the variables; IOW, it's
    effectively the same as having these variables be of signed type.

    This patch changes to a different means of unsigned underflow
    detection to not rely on the signed bit. This allows the variables to
    use the 'full' unsigned range. And it does so with explicit LOAD -
    STORE to ensure any intermediate value will never be visible in
    memory, allowing these unserialized loads.

    Note: GCC generates crap code for this, might warrant a look later.

    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
    maybe we should do clamping on add too.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

18 Jun, 2016

1 commit

  • This adds a software irq handler for controllers that multiplex
    interrupts from multiple devices, but don't know which device generated
    the interrupt. For these devices, the irq handler that demuxes must
    check every action for every software irq using the same h/w irq in order
    to find out which device generated the interrupt. This will inevitably
    trigger spurious interrupt detection if the irq is noted each time.

    The new irq handler does not track the handling for spurious interrupt
    detection. An irq that uses it also won't have stats tracked, since the
    demuxed handler didn't necessarily generate the interrupt, nor will it
    be added to the entropy pool, since such events are not random.

    Signed-off-by: Keith Busch
    Cc: Bjorn Helgaas
    Cc: linux-pci@vger.kernel.org
    Cc: Jon Derrick
    Link: http://lkml.kernel.org/r/1466200821-29159-1-git-send-email-keith.busch@intel.com
    Signed-off-by: Thomas Gleixner

    Keith Busch
     

17 Jun, 2016

1 commit

  • If percpu_ref initialization fails during css_create(), the free path
    can end up trying to free css->id of zero. As ID 0 is unused, it
    doesn't cause a critical breakage but it does trigger a warning
    message. Fix it by setting css->id to -1 from init_and_link_css().

    Signed-off-by: Tejun Heo
    Cc: Wenwei Tao
    Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Tejun Heo

    Tejun Heo
     

16 Jun, 2016

2 commits

  • Similar to bpf_perf_event_output(), the bpf_perf_event_read() helper
    needs to check the type of the perf_event before reading the counter.
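    The shape of such a check can be sketched in plain C. The type constants
    and names are illustrative; -22 stands in for -EINVAL:

```c
#include <assert.h>

enum fake_perf_type { FAKE_TYPE_HARDWARE, FAKE_TYPE_RAW, FAKE_TYPE_TRACEPOINT };

struct fake_event {
    enum fake_perf_type type;
    long long count;
};

/* Refuse to read the counter unless the event is a hardware/raw
 * counter type; other event types don't carry a meaningful count. */
static int fake_event_read(struct fake_event *ev, long long *out)
{
    if (ev->type != FAKE_TYPE_HARDWARE && ev->type != FAKE_TYPE_RAW)
        return -22;            /* wrong event type */
    *out = ev->count;
    return 0;
}
```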

    Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
    Reported-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The ctx structure passed into bpf programs differs depending on the bpf
    program type. The verifier incorrectly marked ctx->data and ctx->data_end
    accesses based on ctx offset alone. That caused loads in tracing programs
    like
    int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
    to be incorrectly marked as PTR_TO_PACKET, which later caused the verifier
    to reject a program that was actually valid in the tracing context.
    Fix this by doing program-type-specific matching of ctx offsets.
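    The gist of the fix can be modeled as follows; the offsets and type
    names are illustrative, not the real verifier's:

```c
#include <assert.h>

enum prog_type { PROG_SOCKET_FILTER, PROG_KPROBE };

#define CTX_DATA_OFF     76   /* offsetof(ctx, data) in a network ctx */
#define CTX_DATA_END_OFF 80

/* Whether a ctx offset means "packet pointer" depends on the program
 * type, not on the offset alone: a kprobe ctx is a pt_regs, and the
 * same offset there is just a saved register. */
static int is_packet_ptr(enum prog_type type, int off)
{
    if (type != PROG_SOCKET_FILTER)
        return 0;
    return off == CTX_DATA_OFF || off == CTX_DATA_END_OFF;
}
```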

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Reported-by: Sasha Goldshtein
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

15 Jun, 2016

1 commit

  • Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
    private data"), a debugfs file's file_operations methods get proxied
    through lifetime aware wrappers.

    However, only a certain subset of the file_operations members is supported
    by debugfs and ->mmap isn't among them -- it appears to be NULL from the
    VFS layer's perspective.

    This behaviour breaks the /sys/kernel/debug/kcov file introduced
    concurrently with commit 5c9a8750a640 ("kernel: add kcov code coverage").

    Since that file never gets removed, there is no file removal race and thus,
    a lifetime checking proxy isn't needed.

    Avoid the proxying for /sys/kernel/debug/kcov by creating it via
    debugfs_create_file_unsafe() rather than debugfs_create_file().
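    Why the proxying broke kcov can be modeled in userspace C: the
    lifetime-aware proxy only wraps a subset of file_operations, so ->mmap
    is lost, while the _unsafe variant keeps the original ops. Everything
    below is illustrative:

```c
#include <assert.h>
#include <stddef.h>

struct fake_fops {
    int (*read)(void);
    int (*mmap)(void);
};

static int kcov_read(void) { return 1; }
static int kcov_mmap(void) { return 2; }

/* The proxy supports only ->read; ->mmap is not in the wrapped subset,
 * so it appears NULL to the VFS layer. */
static struct fake_fops create_proxied(const struct fake_fops *ops)
{
    struct fake_fops proxy = { .read = ops->read, .mmap = NULL };
    return proxy;
}

/* The _unsafe variant installs the caller's ops verbatim. */
static struct fake_fops create_unsafe(const struct fake_fops *ops)
{
    return *ops;
}
```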

    Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
    Fixes: 5c9a8750a640 ("kernel: add kcov code coverage")
    Reported-by: Sasha Levin
    Signed-off-by: Nicolai Stange
    Signed-off-by: Greg Kroah-Hartman

    Nicolai Stange
     

14 Jun, 2016

3 commits

  • Lengthy sysrq-w output may take a lot of time on a slow serial console.

    Currently we reset the NMI watchdog on the current CPU to avoid spurious
    lockup messages. Sometimes this doesn't work, since the softlockup
    watchdog might trigger on another CPU which is waiting for an IPI to
    proceed. We do reset the softlockup watchdogs on all CPUs, but only
    after listing all tasks, and this may be too late on a busy system.

    So, reset the watchdogs on all CPUs earlier, in the
    for_each_process_thread() loop.
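    The effect of moving the reset into the loop can be modeled with
    abstract "ticks"; all names and numbers below are illustrative:

```c
#include <assert.h>

#define NCPUS 4

static long last_touch[NCPUS];
static long now;

static void touch_all_watchdogs(void)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        last_touch[cpu] = now;
}

/* Returns the worst watchdog staleness seen while "printing" ntasks
 * tasks, each costing cost ticks of console time.  With
 * touch_each_iter set, watchdogs are touched inside the per-task loop
 * (the change); otherwise only after the whole listing (old behaviour). */
static long show_state_model(int ntasks, long cost, int touch_each_iter)
{
    long worst = 0;
    touch_all_watchdogs();               /* everyone fresh at entry */
    for (int t = 0; t < ntasks; t++) {
        now += cost;                     /* slow serial output */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            if (now - last_touch[cpu] > worst)
                worst = now - last_touch[cpu];
        if (touch_each_iter)
            touch_all_watchdogs();       /* the change: touch per task */
    }
    touch_all_watchdogs();               /* old behaviour: only here */
    return worst;
}
```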

    Signed-off-by: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • I see a hang when enabling sched events:

    echo 1 > /sys/kernel/debug/tracing/events/sched/enable

    The printk buffer shows:

    BUG: spinlock recursion on CPU#1, swapper/1/0
    lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    ...
    Call Trace:
    [] dump_stack+0x85/0xc2
    [] spin_dump+0x78/0xc0
    [] do_raw_spin_lock+0x11a/0x150
    [] _raw_spin_lock+0x61/0x80
    [] ? try_to_wake_up+0x256/0x4e0
    [] try_to_wake_up+0x256/0x4e0
    [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
    [] wake_up_process+0x15/0x20
    [] insert_work+0x84/0xc0
    [] __queue_work+0x18f/0x660
    [] queue_work_on+0x46/0x90
    [] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
    [] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
    [] soft_cursor+0x1ad/0x230
    [] bit_cursor+0x649/0x680
    [] ? update_attr.isra.2+0x90/0x90
    [] fbcon_cursor+0x14a/0x1c0
    [] hide_cursor+0x28/0x90
    [] vt_console_print+0x3bf/0x3f0
    [] call_console_drivers.constprop.24+0x183/0x200
    [] console_unlock+0x3d4/0x610
    [] vprintk_emit+0x3c5/0x610
    [] vprintk_default+0x29/0x40
    [] printk+0x57/0x73
    [] enqueue_entity+0xc2e/0xc70
    [] enqueue_task_fair+0x59/0xab0
    [] ? kvm_sched_clock_read+0x9/0x20
    [] ? sched_clock+0x9/0x10
    [] activate_task+0x5c/0xa0
    [] ttwu_do_activate+0x54/0xb0
    [] sched_ttwu_pending+0x7a/0xb0
    [] scheduler_ipi+0x61/0x170
    [] smp_trace_reschedule_interrupt+0x4f/0x2a0
    [] trace_reschedule_interrupt+0x96/0xa0
    [] ? native_safe_halt+0x6/0x10
    [] ? trace_hardirqs_on+0xd/0x10
    [] default_idle+0x20/0x1a0
    [] arch_cpu_idle+0xf/0x20
    [] default_idle_call+0x2f/0x50
    [] cpu_startup_entry+0x37e/0x450
    [] start_secondary+0x160/0x1a0

    Note the hang only occurs when echoing the above from a physical serial
    console, not from an ssh session.

    The bug is caused by a deadlock: the task tries to grab the rq lock
    twice, because printk() calls aren't safe inside scheduler code.
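    The deadlock and the usual printk_deferred-style cure can be modeled in
    userspace C: logging while holding the runqueue lock must not wake the
    console, because the wakeup path takes the same lock; instead, queue the
    message and flush after the lock is dropped. All names are illustrative:

```c
#include <assert.h>
#include <string.h>

static int rq_locked;
static char pending[8][32];
static int npending;
static char flushed[8][32];
static int nflushed;

/* A direct printk here would recurse into the wakeup path and try to
 * take the already-held rq lock: deadlock. */
static int printk_direct_would_deadlock(void)
{
    return rq_locked;
}

/* Safe under the rq lock: just queue the message. */
static void printk_deferred_model(const char *msg)
{
    strncpy(pending[npending++], msg, 31);
}

/* Called with the lock dropped: now waking the console is fine. */
static void flush_deferred(void)
{
    for (int i = 0; i < npending; i++)
        strncpy(flushed[nflushed++], pending[i], 31);
    npending = 0;
}
```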

    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
    Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • Chris Wilson reported a divide by 0 at:

    post_init_entity_util_avg():

    > 725 if (cfs_rq->avg.util_avg != 0) {
    > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
    > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1);
    > 728
    > 729 if (sa->util_avg > cap)
    > 730 sa->util_avg = cap;
    > 731 } else {

    Which, given the lack of serialization and the code generated from
    update_cfs_rq_load_avg(), is entirely possible:

    if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    sa->load_avg = max_t(long, sa->load_avg - r, 0);
    sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    removed_load = 1;
    }

    turns into:

    ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    ffffffff8108706b: 48 85 c0 test %rax,%rax
    ffffffff8108706e: 74 40 je ffffffff810870b0
    ffffffff81087070: 4c 89 f8 mov %r15,%rax
    ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx

    Which you'll note ends up with 'sa->load_avg - r' in memory at
    ffffffff8108707a.

    By calling post_init_entity_util_avg() under rq->lock we're sure to be
    fully serialized against PELT updates and cannot observe intermediate
    state like this.

    Reported-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
    Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Jun, 2016

3 commits

  • …/arm-platforms into irq/core

    First drop of irqchip updates for 4.8 from Marc Zyngier:

    - Fix a few bugs in configuring the default trigger from the irqdomain layer
    - Make the genirq layer PM aware
    - Add PM capability to the ARM GIC driver
    - Add support for 2-level translation tables to the GICv3 ITS driver

    Thomas Gleixner
     
  • Some IRQ chips may be located in a power domain outside of the CPU
    subsystem and hence require device-specific runtime power management.
    In order to support such IRQ chips, add a pointer to a device structure
    to the irq_chip structure. If the IRQ chip driver populates this pointer
    and CONFIG_PM is selected in the kernel configuration, the
    pm_runtime_get/put APIs for the chip will be called when an IRQ is
    requested/freed, respectively.
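    The request/free pairing described above can be modeled as a refcount;
    the names are illustrative, not the genirq API:

```c
#include <assert.h>
#include <stddef.h>

struct pm_dev { int pm_refcount; };

struct fake_chip {
    struct pm_dev *parent;     /* NULL for chips needing no PM */
};

/* Requesting an irq takes a runtime-PM reference on the chip's device,
 * if it has one (models pm_runtime_get on request). */
static int fake_request_irq(struct fake_chip *chip)
{
    if (chip->parent)
        chip->parent->pm_refcount++;
    return 0;
}

/* Freeing the irq drops the reference (models pm_runtime_put). */
static void fake_free_irq(struct fake_chip *chip)
{
    if (chip->parent)
        chip->parent->pm_refcount--;
}
```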

    Reviewed-by: Kevin Hilman
    Signed-off-by: Jon Hunter
    Signed-off-by: Marc Zyngier

    Jon Hunter
     
  • Some IRQ chips, such as GPIO controllers or secondary level interrupt
    controllers, may require additional runtime power management control
    to ensure they are accessible. For such IRQ chips, it makes sense to
    enable the IRQ chip when interrupts are requested and disable it again
    once all interrupts have been freed.

    When mapping an IRQ, the IRQ type settings are read and then programmed.
    The mapping of the IRQ happens before the IRQ is requested, and so the
    programming of the type settings occurs before the IRQ is requested. This
    is a problem for IRQ chips that require additional power management
    control, because they may not be accessible yet. Therefore, when mapping
    the IRQ, don't program the type settings; just save them, and then program
    the saved settings when the IRQ is requested (so long as they are not
    overridden via the call to request the IRQ).

    Add a stub function for irq_domain_free_irqs() to avoid any compilation
    errors when CONFIG_IRQ_DOMAIN_HIERARCHY is not selected.
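    The save-then-program scheme can be sketched as follows; the type values
    and names are illustrative:

```c
#include <assert.h>

#define TYPE_NONE 0

static int hw_type_reg;       /* stands in for the chip's trigger register */
static int hw_writes;         /* how many times we touched hardware */

struct fake_irq { int saved_type; };

/* Mapping only records the trigger type: the chip may still be
 * powered down, so no hardware access happens here. */
static void fake_map_irq(struct fake_irq *irq, int fw_type)
{
    irq->saved_type = fw_type;
}

/* At request time the chip is powered, so program the hardware now.
 * A type supplied with the request overrides the saved one. */
static void fake_request_irq(struct fake_irq *irq, int req_type)
{
    int type = req_type != TYPE_NONE ? req_type : irq->saved_type;
    hw_type_reg = type;
    hw_writes++;
}
```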

    Signed-off-by: Jon Hunter
    Reviewed-by: Marc Zyngier
    Signed-off-by: Marc Zyngier

    Jon Hunter