04 Jul, 2016

7 commits

  • Pull the irq affinity management code, which is kept in a separate branch
    for the block developers to pull.

    Thomas Gleixner
     
  • This is lifted from the blk-mq code and adapted to use the affinity mask
    concept just introduced in the irq handling code. It tries to keep the
    algorithm the same as the one currently used by blk-mq, but improvements
    such as assigning vectors on a per-node basis instead of just per sibling
    are possible with this simple move and refactoring.

    Signed-off-by: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-7-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • Allow the MSI code to provide affinity hints per MSI descriptor.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-6-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Use the affinity hint in the irqdesc allocator. The hint is used to determine
    the node for the allocation and to set the affinity of the interrupt.

    If multiple interrupts are allocated (multi-MSI), the allocator iterates
    over the cpumask and, for each set cpu, allocates the descriptor on that
    cpu's node and sets the initial affinity to that cpu.

    If a single interrupt is allocated (MSI-X) then the allocator uses the first
    cpu in the mask to compute the allocation node and uses the mask for the
    initial affinity setting.

    Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
    prevent userspace from messing with their affinity settings.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Add an extra argument to the irq(domain) allocation functions, so we can
    hand down affinity hints to the allocator. That's necessary to implement
    proper support for multiqueue devices.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-4-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Interrupts marked with this flag are excluded from user space interrupt
    affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel's
    internal affinity mechanism is not blocked.

    This flag will be used for multi-queue device interrupts.

    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-3-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • No user and we definitely don't want to grow one.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: linux-block@vger.kernel.org
    Cc: linux-pci@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Cc: axboe@fb.com
    Cc: agordeev@redhat.com
    Link: http://lkml.kernel.org/r/1467621574-8277-2-git-send-email-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Jun, 2016

3 commits

  • Pull audit fixes from Paul Moore:
    "Two small patches to fix audit problems in 4.7-rcX: the first fixes a
    potential kref leak, the second removes some header file noise.

    The first is an important bug fix that really should go in before 4.7 is
    released; the second is not critical, but falls into the very-nice-to-have
    category, so I'm including it in the pull request.

    Both patches are straightforward, self-contained, and pass our
    testsuite without problem"

    * 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
    audit: move audit_get_tty to reduce scope and kabi changes
    audit: move calcs after alloc and check when logging set loginuid

    Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "I've been traveling, so this accumulates more than a week or so of bug
    fixing. It perhaps looks a little worse than it really is.

    1) Fix deadlock in ath10k driver, from Ben Greear.

    2) Increase scan timeout in iwlwifi, from Luca Coelho.

    3) Unbreak STP by properly reinjecting STP packets back into the
    stack. Regression fix from Ido Schimmel.

    4) Mediatek driver fixes (missing malloc failure checks, leaking of
    scratch memory, wrong indexing when mapping TX buffers, etc.) from
    John Crispin.

    5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
    Sowa.

    6) Fix hashing of flows in UDP in the reuseport case, from Xuemin Su.

    7) Fix netlink notifications in ovs for tunnels, delete link messages
    are never emitted because of how the device registry state is
    handled. From Nicolas Dichtel.

    8) Conntrack module leaks kmemcache on unload, from Florian Westphal.

    9) Prevent endless jump loops in nft rules, from Liping Zhang and
    Pablo Neira Ayuso.

    10) Not early enough spinlock initialization in mlx4, from Eric
    Dumazet.

    11) Bind refcount leak in act_ipt, from Cong WANG.

    12) Missing RCU locking in HTB scheduler, from Florian Westphal.

    13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
    barrier, using heap for SG and IV, and erroneous use of async flag
    when allocating AEAD context.)

    14) RCU handling fix in TIPC, from Ying Xue.

    15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
    SIT driver, from Simon Horman.

    16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.

    17) Fix potential deadlock in team enslave, from Ido Schimmel.

    18) Memory leak in KCM procfs handling, from Jiri Slaby.

    19) ESN generation fix in ipv4 ESP, from Herbert Xu.

    20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
    WANG.

    21) Use after free in netem, from Eric Dumazet.

    22) Uninitialized last assert time in multicast router code, from Tom
    Goff.

    23) Skip raw sockets in sock_diag destruction broadcast, from Willem
    de Bruijn.

    24) Fix link status reporting in thunderx, from Sunil Goutham.

    25) Limit resegmentation of retransmit queue so that we do not
    retransmit too large GSO frames. From Eric Dumazet.

    26) Delay bpf program release after grace period, from Daniel
    Borkmann"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
    openvswitch: fix conntrack netlink event delivery
    qed: Protect the doorbell BAR with the write barriers.
    neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
    e1000e: keep VLAN interfaces functional after rxvlan off
    cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
    qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
    bpf, perf: delay release of BPF prog after grace period
    net: bridge: fix vlan stats continue counter
    tcp: do not send too big packets at retransmit time
    ibmvnic: fix to use list_for_each_safe() when delete items
    net: thunderx: Fix TL4 configuration for secondary Qsets
    net: thunderx: Fix link status reporting
    net/mlx5e: Reorganize ethtool statistics
    net/mlx5e: Fix number of PFC counters reported to ethtool
    net/mlx5e: Prevent adding the same vxlan port
    net/mlx5e: Check for BlueFlame capability before allocating SQ uar
    net/mlx5e: Change enum to better reflect usage
    net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
    net/mlx5: Update command strings
    net: marvell: Add separate config ANEG function for Marvell 88E1111
    ...

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "Three fix patches. Two are for cgroup / css init failure path. The
    last one makes css_set_lock irq-safe as the deadline scheduler ends up
    calling put_css_set() from irq context"

    * 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Disable IRQs while holding css_set_lock
    cgroup: set css->id to -1 during init
    cgroup: remove redundant cleanup in css_create

    Linus Torvalds
     

29 Jun, 2016

3 commits

  • Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved
    destruction of BPF program from free_event_rcu() callback to __free_event(),
    which is problematic if used with tail calls: if prog A is attached as
    trace event directly, but at the same time present in a tail call map used
    by another trace event program elsewhere, then we need to delay destruction
    via RCU grace period since it can still be in use by the program doing the
    tail call (the prog first needs to be dropped from the tail call map, then
    trace event with prog A attached destroyed, so we get immediate destruction).

    Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Jann Horn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The only users of audit_get_tty and audit_put_tty are internal to
    audit, so move them out of include/linux/audit.h into kernel/audit.h and
    create a proper function rather than inlining it. This also reduces
    kABI changes.

    Suggested-by: Paul Moore
    Signed-off-by: Richard Guy Briggs
    [PM: line wrapped description]
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     
  • Move the calculations of values after the allocation in case the
    allocation fails. This avoids wasting effort in the rare case that it
    fails, but more importantly saves us extra logic to release the tty
    ref.

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: Paul Moore

    Richard Guy Briggs
     

25 Jun, 2016

6 commits

  • Pull scheduler fixes from Thomas Gleixner:
    "A couple of scheduler fixes:

    - force watchdog reset while processing sysrq-w

    - fix a deadlock when enabling trace events in the scheduler

    - fixes to the throttled next buddy logic

    - fixes for the average accounting (missing serialization and
    underflow handling)

    - allow kernel threads for fallback to online but not active cpus"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Allow kthreads to fall back to online && !active cpus
    sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
    sched/fair: Initialize throttle_count for new task-groups lazily
    sched/fair: Fix cfs_rq avg tracking underflow
    kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
    sched/debug: Fix deadlock when enabling sched events
    sched/fair: Fix post_init_entity_util_avg() serialization

    Linus Torvalds
     
  • Pull locking fix from Thomas Gleixner:
    "A single fix to address a race in the static key logic"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/static_key: Fix concurrent static_key_slow_inc()

    Linus Torvalds
     
  • Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
    breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:

    kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
    kernel/fork.c:355:8: error: assignment from incompatible pointer type
    stack = alloc_thread_stack_node(tsk, node);
    ^

    Fix it by renaming free_stack() to free_thread_stack(), and updating the
    return type of alloc_thread_stack_node().

    Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • Merge misc fixes from Andrew Morton:
    "Two weeks worth of fixes here"

    * emailed patches from Andrew Morton : (41 commits)
    init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
    autofs: don't get stuck in a loop if vfs_write() returns an error
    mm/page_owner: avoid null pointer dereference
    tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
    fs/nilfs2: fix potential underflow in call to crc32_le
    oom, suspend: fix oom_reaper vs. oom_killer_disable race
    ocfs2: disable BUG assertions in reading blocks
    mm, compaction: abort free scanner if split fails
    mm: prevent KASAN false positives in kmemleak
    mm/hugetlb: clear compound_mapcount when freeing gigantic pages
    mm/swap.c: flush lru pvecs on compound page arrival
    memcg: css_alloc should return an ERR_PTR value on error
    memcg: mem_cgroup_migrate() may be called with irq disabled
    hugetlb: fix nr_pmds accounting with shared page tables
    Revert "mm: disable fault around on emulated access bit architecture"
    Revert "mm: make faultaround produce old ptes"
    mailmap: add Boris Brezillon's email
    mailmap: add Antoine Tenart's email
    mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
    mm: mempool: kasan: don't poot mempool objects in quarantine
    ...

    Linus Torvalds
     
  • Tetsuo has reported the following potential oom_killer_disable vs.
    oom_reaper race:

    (1) freeze_processes() starts freezing user space threads.
    (2) Somebody (maybe a kernel thread) calls out_of_memory().
    (3) The OOM killer calls mark_oom_victim() on a user space thread
    P1 which is already in __refrigerator().
    (4) oom_killer_disable() sets oom_killer_disabled = true.
    (5) P1 leaves __refrigerator() and enters do_exit().
    (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
    exit_oom_victim() itself.
    (7) oom_killer_disable() returns while P1 has not yet finished.
    (8) P1 performs IO / interferes with the freezer.

    This situation is unfortunate. We cannot move oom_killer_disable after
    all the freezable kernel threads are frozen because the oom victim might
    depend on some of those kthreads to make a forward progress to exit so
    we could deadlock. It is also far from trivial to teach the oom_reaper
    to not call exit_oom_victim() because then we would lose a guarantee of
    the OOM killer and oom_killer_disable forward progress because
    exit_mm->mmput might block and never call exit_oom_victim.

    It seems the easiest way forward is to workaround this race by calling
    try_to_freeze_tasks again after oom_killer_disable. This will make sure
    that all the tasks are frozen or it bails out.

    Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
    Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We've had the thread info allocated together with the thread stack for
    most architectures for a long time (since the thread_info was split off
    from the task struct), but that is about to change.

    But the patches that move the thread info to be off-stack (and a part of
    the task struct instead) made it clear how confused the allocator and
    freeing functions are.

    Because the common case was that we share an allocation with the thread
    stack and the thread_info, the two pointers were identical. That
    identity then meant that we would have things like

    ti = alloc_thread_info_node(tsk, node);
    ...
    tsk->stack = ti;

    which certainly _worked_ (since stack and thread_info have the same
    value), but is rather confusing: why are we assigning a thread_info to
    the stack? And if we move the thread_info away, the "confusing" code
    just gets to be entirely bogus.

    So remove all this confusion, and make it clear that we are doing the
    stack allocation by renaming and clarifying the function names to be
    about the stack. The fact that the thread_info then shares the
    allocation is an implementation detail, and not really about the
    allocation itself.

    This is a pure renaming and type fix: we pass in the same pointer, it's
    just that we clarify what the pointer means.

    The ia64 code that actually only has one single allocation (for all of
    task_struct, thread_info and kernel thread stack) now looks a bit odd,
    but since "tsk->stack" is actually not even used there, that oddity
    doesn't matter. Cleaning that up would be a separate change; I
    intentionally left the ia64 changes as a pure brute-force renaming and
    type change.

    Acked-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Jun, 2016

6 commits

  • During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
    online but not active. A CPU_ONLINE callback may create or bind a
    kthread so that its cpus_allowed mask only allows the CPU which is
    being brought online. The kthread may start executing before the CPU
    is made active and can end up in select_fallback_rq().

    In such cases, the expected behavior is selecting the CPU which is
    coming online; however, because select_fallback_rq() only chooses from
    active CPUs, it determines that the task doesn't have any viable CPU
    in its allowed mask and ends up overriding it to cpu_possible_mask.

    CPU_ONLINE callbacks should be able to put kthreads on the CPU which
    is coming online. Update select_fallback_rq() so that it follows
    cpu_online() rather than cpu_active() for kthreads.

    Reported-by: Gautham R Shenoy
    Tested-by: Gautham R. Shenoy
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Abdul Haleem
    Cc: Aneesh Kumar
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • The hierarchy could already be throttled at this point. A throttled next
    buddy could trigger a NULL pointer dereference in pick_next_task_fair().

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • A cgroup created inside a throttled group must inherit the current
    throttle_count. A broken throttle_count allows throttled entries to be
    nominated as the next buddy, which later leads to a NULL pointer
    dereference in pick_next_task_fair().

    This patch initializes cfs_rq->throttle_count at the first enqueue:
    laziness allows us to skip locking all runqueues at group creation. The
    lazy approach also allows skipping a full sub-tree scan when throttling
    the hierarchy (not in this patch).

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • The following scenario is possible:

    CPU 1                                  CPU 2
    static_key_slow_inc()
     atomic_inc_not_zero()
      -> key.enabled == 0, no increment
     jump_label_lock()
     atomic_inc_return()
      -> key.enabled == 1 now
                                           static_key_slow_inc()
                                            atomic_inc_not_zero()
                                             -> key.enabled == 1, inc to 2
                                            return
                                           ** static key is wrong!
     jump_label_update()
     jump_label_unlock()

    Testing the static key at the point marked by (**) will follow the
    wrong path for jumps that have not been patched yet. This can
    actually happen when creating many KVM virtual machines with userspace
    LAPIC emulation; just run several copies of the following program:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            int kvmfd = open("/dev/kvm", O_RDONLY);
            int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
            close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
            close(vmfd);
            close(kvmfd);
        }
        return 0;
    }

    Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
    The static key's purpose is to skip NULL pointer checks and indeed one
    of the processes eventually dereferences NULL.

    As explained in the commit that introduced the bug:

    706249c222f6 ("locking/static_keys: Rework update logic")

    jump_label_update() needs key.enabled to be true. The solution adopted
    here is to temporarily make key.enabled == -1, and go down the slow path
    when key.enabled <= 0.
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v4.3+
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 706249c222f6 ("locking/static_keys: Rework update logic")
    Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
    [ Small stylistic edits to the changelog and the code. ]
    Signed-off-by: Ingo Molnar

    Paolo Bonzini
     
  • While testing the deadline scheduler + cgroup setup I hit this
    warning.

    [ 132.612935] ------------[ cut here ]------------
    [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
    [ 132.612952] Modules linked in: (a ton of modules...)
    [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
    [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
    [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
    [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
    [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
    [ 132.612986] Call Trace:
    [ 132.612987] [] dump_stack+0x63/0x85
    [ 132.612994] [] __warn+0xcb/0xf0
    [ 132.612997] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.612999] [] warn_slowpath_null+0x1d/0x20
    [ 132.613000] [] __local_bh_enable_ip+0x6b/0x80
    [ 132.613008] [] _raw_write_unlock_bh+0x1a/0x20
    [ 132.613010] [] _raw_spin_unlock_bh+0xe/0x10
    [ 132.613015] [] put_css_set+0x5c/0x60
    [ 132.613016] [] cgroup_free+0x7f/0xa0
    [ 132.613017] [] __put_task_struct+0x42/0x140
    [ 132.613018] [] dl_task_timer+0xca/0x250
    [ 132.613027] [] ? push_dl_task.part.32+0x170/0x170
    [ 132.613030] [] __hrtimer_run_queues+0xee/0x270
    [ 132.613031] [] hrtimer_interrupt+0xa8/0x190
    [ 132.613034] [] local_apic_timer_interrupt+0x38/0x60
    [ 132.613035] [] smp_apic_timer_interrupt+0x3d/0x50
    [ 132.613037] [] apic_timer_interrupt+0x8c/0xa0
    [ 132.613038] [] ? native_safe_halt+0x6/0x10
    [ 132.613043] [] default_idle+0x1e/0xd0
    [ 132.613044] [] arch_cpu_idle+0xf/0x20
    [ 132.613046] [] default_idle_call+0x2a/0x40
    [ 132.613047] [] cpu_startup_entry+0x2e7/0x340
    [ 132.613048] [] start_secondary+0x155/0x190
    [ 132.613049] ---[ end trace f91934d162ce9977 ]---

    The warning comes from the spin_(lock|unlock)_bh(&css_set_lock) calls
    being made in interrupt context. Convert the spin_lock_bh calls to
    spin_lock_irq(save) to avoid this problem - and other problems that come
    with sharing a spinlock with an interrupt.

    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Cc: cgroups@vger.kernel.org
    Cc: stable@vger.kernel.org # 4.5+
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Rik van Riel
    Reviewed-by: "Luis Claudio R. Goncalves"
    Signed-off-by: Daniel Bristot de Oliveira
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Daniel Bristot de Oliveira
     
  • None of the code actually wants a thread_info, it all wants a
    task_struct, and it's just converting back and forth between the two
    ("ti->task" to get the task_struct from the thread_info, and
    "task_thread_info(task)" to go the other way).

    No semantic change.

    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Jun, 2016

1 commit

  • The function irq_create_of_mapping() is used to create an interrupt
    mapping. However, whether the irqdomain to which the interrupt belongs is
    part of a hierarchy determines whether the mapping is created by calling
    irq_domain_alloc_irqs() or irq_create_mapping().

    To dispose of the interrupt mapping, drivers call irq_dispose_mapping().
    However, this function does not check to see if the irqdomain is part
    of a hierarchy or not and simply assumes that it was mapped via calling
    irq_create_mapping() so calls irq_domain_disassociate() to unmap the
    interrupt.

    Fix this by checking to see if the irqdomain is part of a hierarchy and
    if so call irq_domain_free_irqs() to free/unmap the interrupt.

    Signed-off-by: Jon Hunter
    Cc: Marc Zyngier
    Cc: Jiang Liu
    Link: http://lkml.kernel.org/r/1466501002-16368-1-git-send-email-jonathanh@nvidia.com
    Signed-off-by: Thomas Gleixner

    Jon Hunter
     

21 Jun, 2016

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Two fixes for the tracing system:

    - When trace_printk() is used with a non-constant format descriptor,
    it adds a NULL pointer into the trace format section, and the code
    isn't prepared to deal with it. This bug appeared in a change that
    was added in v3.5.

    - The ftracetest (selftests section) can't handle testing histograms
    when histograms are not configured. Currently it shows that they
    fail the test, when they should state that they are unsupported.
    This bug was added in the 4.7 merge window with the addition of the
    histogram code"

    * tag 'trace-v4.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftracetest: Fix hist unsupported result in hist selftests
    tracing: Handle NULL formats in hold_module_trace_bprintk_format()

    Linus Torvalds
     

20 Jun, 2016

2 commits

  • If a task uses a non-constant string for the format parameter in
    trace_printk(), then the trace_printk_fmt variable is set to NULL. This
    variable is then saved in the __trace_printk_fmt section.

    The function hold_module_trace_bprintk_format() checks to see if duplicate
    formats are used by modules, and reuses them if so (saves them to the list
    if it is new). But this function calls lookup_format() that does a strcmp()
    to the value (which is now NULL) and can cause a kernel oops.

    This wasn't an issue till 3debb0a9ddb ("tracing: Fix trace_printk() to print
    when not using bprintk()") which added "__used" to the trace_printk_fmt
    variable, and before that, the kernel simply optimized it out (no NULL value
    was saved).

    The fix is simply to handle the NULL pointer in lookup_format() and have the
    caller ignore the value if it was NULL.

    Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com

    Reported-by: xingzhen
    Acked-by: Namhyung Kim
    Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
    Cc: stable@vger.kernel.org # v3.5+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • As per commit:

    b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

    > the code generated from update_cfs_rq_load_avg():
    >
    > if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    > s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    > sa->load_avg = max_t(long, sa->load_avg - r, 0);
    > sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    > removed_load = 1;
    > }
    >
    > turns into:
    >
    > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    > ffffffff8108706b: 48 85 c0 test %rax,%rax
    > ffffffff8108706e: 74 40 je ffffffff810870b0
    > ffffffff81087070: 4c 89 f8 mov %r15,%rax
    > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.

    So I _should_ have looked at other unserialized users of ->load_avg, but
    alas. Luckily nikbor reported a similar divide-by-zero from task_h_load()
    which instantly triggered recollection of this here problem.

    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the sign
    bit. This reduces the effective width of the variables; IOW, it's
    effectively the same as having these variables be of signed type.

    This patch changes to a different means of unsigned underflow
    detection to not rely on the signed bit. This allows the variables to
    use the 'full' unsigned range. And it does so with explicit LOAD -
    STORE to ensure any intermediate value will never be visible in
    memory, allowing these unserialized loads.

    Note: GCC generates crap code for this, might warrant a look later.

    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
    maybe we should do clamping on add too.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

18 Jun, 2016

1 commit

  • This adds a software irq handler for controllers that multiplex
    interrupts from multiple devices, but don't know which device generated
    the interrupt. For these devices, the irq handler that demuxes must
    check every action for every software irq using the same h/w irq in order
    to find out which device generated the interrupt. This will inevitably
    trigger spurious interrupt detection if the irq is noted each time.

    The new irq handler does not track the handling for spurious interrupt
    detection. An irq that uses it also won't have stats tracked, since the
    demuxed handler didn't necessarily generate the interrupt, nor will it
    be added to the entropy pool, since such events are not random.

    Signed-off-by: Keith Busch
    Cc: Bjorn Helgaas
    Cc: linux-pci@vger.kernel.org
    Cc: Jon Derrick
    Link: http://lkml.kernel.org/r/1466200821-29159-1-git-send-email-keith.busch@intel.com
    Signed-off-by: Thomas Gleixner

    Keith Busch
     

17 Jun, 2016

1 commit

  • If percpu_ref initialization fails during css_create(), the free path
    can end up trying to free css->id of zero. As ID 0 is unused, it
    doesn't cause a critical breakage but it does trigger a warning
    message. Fix it by setting css->id to -1 from init_and_link_css().

    Signed-off-by: Tejun Heo
    Cc: Wenwei Tao
    Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Tejun Heo

    Tejun Heo
     

16 Jun, 2016

2 commits

  • Similar to bpf_perf_event_output(), the bpf_perf_event_read() helper
    needs to check the type of the perf_event before reading the counter.
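    The shape of such a check can be sketched in plain C. The type constants
    and names are illustrative; -22 stands in for -EINVAL:

```c
#include <assert.h>

enum fake_perf_type { FAKE_TYPE_HARDWARE, FAKE_TYPE_RAW, FAKE_TYPE_TRACEPOINT };

struct fake_event {
    enum fake_perf_type type;
    long long count;
};

/* Refuse to read the counter unless the event is a hardware/raw
 * counter type; other event types don't carry a meaningful count. */
static int fake_event_read(struct fake_event *ev, long long *out)
{
    if (ev->type != FAKE_TYPE_HARDWARE && ev->type != FAKE_TYPE_RAW)
        return -22;            /* wrong event type */
    *out = ev->count;
    return 0;
}
```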

    Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
    Reported-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The ctx structure passed into bpf programs differs depending on the bpf
    program type. The verifier incorrectly marked ctx->data and ctx->data_end
    accesses based on ctx offset alone. That caused loads in tracing programs
    like
    int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
    to be incorrectly marked as PTR_TO_PACKET, which later caused the verifier
    to reject a program that was actually valid in the tracing context.
    Fix this by doing program-type-specific matching of ctx offsets.
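    The gist of the fix can be modeled as follows; the offsets and type
    names are illustrative, not the real verifier's:

```c
#include <assert.h>

enum prog_type { PROG_SOCKET_FILTER, PROG_KPROBE };

#define CTX_DATA_OFF     76   /* offsetof(ctx, data) in a network ctx */
#define CTX_DATA_END_OFF 80

/* Whether a ctx offset means "packet pointer" depends on the program
 * type, not on the offset alone: a kprobe ctx is a pt_regs, and the
 * same offset there is just a saved register. */
static int is_packet_ptr(enum prog_type type, int off)
{
    if (type != PROG_SOCKET_FILTER)
        return 0;
    return off == CTX_DATA_OFF || off == CTX_DATA_END_OFF;
}
```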

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Reported-by: Sasha Goldshtein
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

15 Jun, 2016

1 commit

  • Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
    private data"), a debugfs file's file_operations methods get proxied
    through lifetime aware wrappers.

    However, only a certain subset of the file_operations members is supported
    by debugfs and ->mmap isn't among them -- it appears to be NULL from the
    VFS layer's perspective.

    This behaviour breaks the /sys/kernel/debug/kcov file introduced
    concurrently with commit 5c9a8750a640 ("kernel: add kcov code coverage").

    Since that file never gets removed, there is no file removal race and thus,
    a lifetime checking proxy isn't needed.

    Avoid the proxying for /sys/kernel/debug/kcov by creating it via
    debugfs_create_file_unsafe() rather than debugfs_create_file().
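    Why the proxying broke kcov can be modeled in userspace C: the
    lifetime-aware proxy only wraps a subset of file_operations, so ->mmap
    is lost, while the _unsafe variant keeps the original ops. Everything
    below is illustrative:

```c
#include <assert.h>
#include <stddef.h>

struct fake_fops {
    int (*read)(void);
    int (*mmap)(void);
};

static int kcov_read(void) { return 1; }
static int kcov_mmap(void) { return 2; }

/* The proxy supports only ->read; ->mmap is not in the wrapped subset,
 * so it appears NULL to the VFS layer. */
static struct fake_fops create_proxied(const struct fake_fops *ops)
{
    struct fake_fops proxy = { .read = ops->read, .mmap = NULL };
    return proxy;
}

/* The _unsafe variant installs the caller's ops verbatim. */
static struct fake_fops create_unsafe(const struct fake_fops *ops)
{
    return *ops;
}
```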

    Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
    Fixes: 5c9a8750a640 ("kernel: add kcov code coverage")
    Reported-by: Sasha Levin
    Signed-off-by: Nicolai Stange
    Signed-off-by: Greg Kroah-Hartman

    Nicolai Stange
     

14 Jun, 2016

3 commits

  • Lengthy sysrq-w output may take a lot of time on a slow serial console.

    Currently we reset the NMI watchdog on the current CPU to avoid spurious
    lockup messages. Sometimes this doesn't work, since the softlockup
    watchdog might trigger on another CPU which is waiting for an IPI to
    proceed. We do reset the softlockup watchdogs on all CPUs, but only
    after listing all tasks, and this may be too late on a busy system.

    So, reset the watchdogs on all CPUs earlier, in the
    for_each_process_thread() loop.
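    The effect of moving the reset into the loop can be modeled with
    abstract "ticks"; all names and numbers below are illustrative:

```c
#include <assert.h>

#define NCPUS 4

static long last_touch[NCPUS];
static long now;

static void touch_all_watchdogs(void)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        last_touch[cpu] = now;
}

/* Returns the worst watchdog staleness seen while "printing" ntasks
 * tasks, each costing cost ticks of console time.  With
 * touch_each_iter set, watchdogs are touched inside the per-task loop
 * (the change); otherwise only after the whole listing (old behaviour). */
static long show_state_model(int ntasks, long cost, int touch_each_iter)
{
    long worst = 0;
    touch_all_watchdogs();               /* everyone fresh at entry */
    for (int t = 0; t < ntasks; t++) {
        now += cost;                     /* slow serial output */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            if (now - last_touch[cpu] > worst)
                worst = now - last_touch[cpu];
        if (touch_each_iter)
            touch_all_watchdogs();       /* the change: touch per task */
    }
    touch_all_watchdogs();               /* old behaviour: only here */
    return worst;
}
```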

    Signed-off-by: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • I see a hang when enabling sched events:

    echo 1 > /sys/kernel/debug/tracing/events/sched/enable

    The printk buffer shows:

    BUG: spinlock recursion on CPU#1, swapper/1/0
    lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    ...
    Call Trace:
    [] dump_stack+0x85/0xc2
    [] spin_dump+0x78/0xc0
    [] do_raw_spin_lock+0x11a/0x150
    [] _raw_spin_lock+0x61/0x80
    [] ? try_to_wake_up+0x256/0x4e0
    [] try_to_wake_up+0x256/0x4e0
    [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
    [] wake_up_process+0x15/0x20
    [] insert_work+0x84/0xc0
    [] __queue_work+0x18f/0x660
    [] queue_work_on+0x46/0x90
    [] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
    [] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
    [] soft_cursor+0x1ad/0x230
    [] bit_cursor+0x649/0x680
    [] ? update_attr.isra.2+0x90/0x90
    [] fbcon_cursor+0x14a/0x1c0
    [] hide_cursor+0x28/0x90
    [] vt_console_print+0x3bf/0x3f0
    [] call_console_drivers.constprop.24+0x183/0x200
    [] console_unlock+0x3d4/0x610
    [] vprintk_emit+0x3c5/0x610
    [] vprintk_default+0x29/0x40
    [] printk+0x57/0x73
    [] enqueue_entity+0xc2e/0xc70
    [] enqueue_task_fair+0x59/0xab0
    [] ? kvm_sched_clock_read+0x9/0x20
    [] ? sched_clock+0x9/0x10
    [] activate_task+0x5c/0xa0
    [] ttwu_do_activate+0x54/0xb0
    [] sched_ttwu_pending+0x7a/0xb0
    [] scheduler_ipi+0x61/0x170
    [] smp_trace_reschedule_interrupt+0x4f/0x2a0
    [] trace_reschedule_interrupt+0x96/0xa0
    [] ? native_safe_halt+0x6/0x10
    [] ? trace_hardirqs_on+0xd/0x10
    [] default_idle+0x20/0x1a0
    [] arch_cpu_idle+0xf/0x20
    [] default_idle_call+0x2f/0x50
    [] cpu_startup_entry+0x37e/0x450
    [] start_secondary+0x160/0x1a0

    Note the hang only occurs when echoing the above from a physical serial
    console, not from an ssh session.

    The bug is caused by a deadlock: the task tries to grab the rq lock
    twice, because printk() calls aren't safe inside scheduler code.
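    The deadlock and the usual printk_deferred-style cure can be modeled in
    userspace C: logging while holding the runqueue lock must not wake the
    console, because the wakeup path takes the same lock; instead, queue the
    message and flush after the lock is dropped. All names are illustrative:

```c
#include <assert.h>
#include <string.h>

static int rq_locked;
static char pending[8][32];
static int npending;
static char flushed[8][32];
static int nflushed;

/* A direct printk here would recurse into the wakeup path and try to
 * take the already-held rq lock: deadlock. */
static int printk_direct_would_deadlock(void)
{
    return rq_locked;
}

/* Safe under the rq lock: just queue the message. */
static void printk_deferred_model(const char *msg)
{
    strncpy(pending[npending++], msg, 31);
}

/* Called with the lock dropped: now waking the console is fine. */
static void flush_deferred(void)
{
    for (int i = 0; i < npending; i++)
        strncpy(flushed[nflushed++], pending[i], 31);
    npending = 0;
}
```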

    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
    Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • Chris Wilson reported a divide by 0 at:

    post_init_entity_util_avg():

    > 725 if (cfs_rq->avg.util_avg != 0) {
    > 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
    > -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1);
    > 728
    > 729 if (sa->util_avg > cap)
    > 730 sa->util_avg = cap;
    > 731 } else {

    Which, given the lack of serialization and the code generated from
    update_cfs_rq_load_avg(), is entirely possible:

    if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    sa->load_avg = max_t(long, sa->load_avg - r, 0);
    sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    removed_load = 1;
    }

    turns into:

    ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    ffffffff8108706b: 48 85 c0 test %rax,%rax
    ffffffff8108706e: 74 40 je ffffffff810870b0
    ffffffff81087070: 4c 89 f8 mov %r15,%rax
    ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx

    Which you'll note ends up with 'sa->load_avg - r' in memory at
    ffffffff8108707a.

    By calling post_init_entity_util_avg() under rq->lock we're sure to be
    fully serialized against PELT updates and cannot observe intermediate
    state like this.

    Reported-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
    Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Jun, 2016

3 commits

  • …/arm-platforms into irq/core

    First drop of irqchip updates for 4.8 from Marc Zyngier:

    - Fix a few bugs in configuring the default trigger from the irqdomain layer
    - Make the genirq layer PM aware
    - Add PM capability to the ARM GIC driver
    - Add support for 2-level translation tables to the GICv3 ITS driver

    Thomas Gleixner
     
  • Some IRQ chips may be located in a power domain outside of the CPU
    subsystem and hence require device-specific runtime power management.
    In order to support such IRQ chips, add a pointer to a device structure
    to the irq_chip structure. If the IRQ chip driver populates this pointer
    and CONFIG_PM is selected in the kernel configuration, the
    pm_runtime_get/put APIs for the chip will be called when an IRQ is
    requested/freed, respectively.
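    The request/free pairing described above can be modeled as a refcount;
    the names are illustrative, not the genirq API:

```c
#include <assert.h>
#include <stddef.h>

struct pm_dev { int pm_refcount; };

struct fake_chip {
    struct pm_dev *parent;     /* NULL for chips needing no PM */
};

/* Requesting an irq takes a runtime-PM reference on the chip's device,
 * if it has one (models pm_runtime_get on request). */
static int fake_request_irq(struct fake_chip *chip)
{
    if (chip->parent)
        chip->parent->pm_refcount++;
    return 0;
}

/* Freeing the irq drops the reference (models pm_runtime_put). */
static void fake_free_irq(struct fake_chip *chip)
{
    if (chip->parent)
        chip->parent->pm_refcount--;
}
```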

    Reviewed-by: Kevin Hilman
    Signed-off-by: Jon Hunter
    Signed-off-by: Marc Zyngier

    Jon Hunter
     
  • Some IRQ chips, such as GPIO controllers or secondary level interrupt
    controllers, may require additional runtime power management control
    to ensure they are accessible. For such IRQ chips, it makes sense to
    enable the IRQ chip when interrupts are requested and disable it again
    once all interrupts have been freed.

    When mapping an IRQ, the IRQ type settings are read and then programmed.
    The mapping of the IRQ happens before the IRQ is requested, and so the
    programming of the type settings occurs before the IRQ is requested. This
    is a problem for IRQ chips that require additional power management
    control, because they may not be accessible yet. Therefore, when mapping
    the IRQ, don't program the type settings; just save them, and then program
    the saved settings when the IRQ is requested (so long as they are not
    overridden via the call to request the IRQ).

    Add a stub function for irq_domain_free_irqs() to avoid any compilation
    errors when CONFIG_IRQ_DOMAIN_HIERARCHY is not selected.
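    The save-then-program scheme can be sketched as follows; the type values
    and names are illustrative:

```c
#include <assert.h>

#define TYPE_NONE 0

static int hw_type_reg;       /* stands in for the chip's trigger register */
static int hw_writes;         /* how many times we touched hardware */

struct fake_irq { int saved_type; };

/* Mapping only records the trigger type: the chip may still be
 * powered down, so no hardware access happens here. */
static void fake_map_irq(struct fake_irq *irq, int fw_type)
{
    irq->saved_type = fw_type;
}

/* At request time the chip is powered, so program the hardware now.
 * A type supplied with the request overrides the saved one. */
static void fake_request_irq(struct fake_irq *irq, int req_type)
{
    int type = req_type != TYPE_NONE ? req_type : irq->saved_type;
    hw_type_reg = type;
    hw_writes++;
}
```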

    Signed-off-by: Jon Hunter
    Reviewed-by: Marc Zyngier
    Signed-off-by: Marc Zyngier

    Jon Hunter