04 Jul, 2016
7 commits
-
Pull the irq affinity managing code, which is in a separate branch for block
developers to pull.
-
This is lifted from the blk-mq code and adapted to use the affinity mask
concept just introduced in the irq handling code. It tries to keep the
algorithm the same as the one currently used by blk-mq, but improvements
like assigning vectors on a per-node basis instead of just per sibling
are possible with this simple move and refactoring.

Signed-off-by: Christoph Hellwig
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-7-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner
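The per-node spreading idea is easiest to see in toy form. Below is a minimal
user-space sketch: rotate vectors across nodes first, then across CPUs within
a node. The two-node cpu_node[] topology and every name in it are illustrative
assumptions, not the kernel implementation.

    #include <stdio.h>

    #define NR_CPUS  8
    #define NR_NODES 2

    static const int cpu_node[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

    static void spread_vectors(int nvec)
    {
        for (int vec = 0; vec < nvec; vec++) {
            int node = vec % NR_NODES;               /* rotate across nodes */
            int idx = (vec / NR_NODES) % (NR_CPUS / NR_NODES);
            int seen = 0;

            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                if (cpu_node[cpu] != node)
                    continue;
                if (seen++ == idx) {                 /* idx-th cpu of node  */
                    printf("vector %d -> cpu %d (node %d)\n", vec, cpu, node);
                    break;
                }
            }
        }
    }

    int main(void)
    {
        spread_vectors(4);
        return 0;
    }
-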
Allow the MSI code to provide affinity hints per MSI descriptor.
Signed-off-by: Thomas Gleixner
Cc: Christoph Hellwig
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-6-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner
-
Use the affinity hint in the irqdesc allocator. The hint is used to determine
the node for the allocation and to set the affinity of the interrupt.

If multiple interrupts are allocated (multi-MSI), then the allocator iterates
over the cpumask and, for each set cpu, it allocates on that cpu's node and
sets the initial affinity to that cpu.

If a single interrupt is allocated (MSI-X), then the allocator uses the first
cpu in the mask to compute the allocation node and uses the mask for the
initial affinity setting.

Interrupts set up this way are marked with the AFFINITY_MANAGED flag to
prevent userspace from messing with their affinity settings.

Signed-off-by: Thomas Gleixner
Cc: Christoph Hellwig
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-5-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner
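A toy rendering of the two policies, with a uint64_t standing in for the
cpumask; cpu_to_node() and all other names are illustrative assumptions, not
the kernel's irqdesc allocator:

    #include <stdint.h>
    #include <stdio.h>

    static int cpu_to_node(int cpu) { return cpu / 4; } /* 4 cpus per node */

    /* Assumes the mask has at least nirqs bits set. */
    static void alloc_descs(int nirqs, uint64_t affinity_mask)
    {
        if (nirqs > 1) {                 /* multi-MSI: one cpu per vector */
            int cpu = 0;
            for (int i = 0; i < nirqs; i++) {
                while (!(affinity_mask & (1ULL << cpu)))
                    cpu++;               /* next set cpu in the mask      */
                printf("irq %d: node %d, affinity cpu %d\n",
                       i, cpu_to_node(cpu), cpu);
                cpu++;
            }
        } else {                         /* MSI-X: node from first cpu,   */
            int first = __builtin_ctzll(affinity_mask); /* whole mask     */
            printf("irq 0: node %d, affinity mask %#llx\n",
                   cpu_to_node(first), (unsigned long long)affinity_mask);
        }
    }

    int main(void)
    {
        alloc_descs(2, 0x30);   /* cpus 4 and 5 */
        alloc_descs(1, 0x0f);   /* cpus 0-3     */
        return 0;
    }
-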
Add an extra argument to the irq(domain) allocation functions, so we can hand
down affinity hints to the allocator. That's necessary to implement proper
support for multiqueue devices.

Signed-off-by: Thomas Gleixner
Cc: Christoph Hellwig
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-4-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner
-
Interrupts marked with this flag are excluded from user space interrupt
affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
affinity mechanism is not blocked.

This flag will be used for multi-queue device interrupts.

Signed-off-by: Thomas Gleixner
Cc: Christoph Hellwig
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner
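In toy form the rule reads like this; the flag value and names are
illustrative assumptions, not the kernel code:

    #include <stdbool.h>
    #include <stdio.h>

    #define IRQF_AFFINITY_MANAGED (1u << 0)   /* illustrative flag value */

    static bool affinity_change_allowed(unsigned int flags, bool from_userspace)
    {
        if (from_userspace && (flags & IRQF_AFFINITY_MANAGED))
            return false;   /* e.g. writes to /proc/irq/N/smp_affinity */
        return true;        /* kernel-internal mechanism is not blocked */
    }

    int main(void)
    {
        printf("%d\n", affinity_change_allowed(IRQF_AFFINITY_MANAGED, true));  /* 0 */
        printf("%d\n", affinity_change_allowed(IRQF_AFFINITY_MANAGED, false)); /* 1 */
        return 0;
    }
-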
No user and we definitely don't want to grow one.
Signed-off-by: Thomas Gleixner
Reviewed-by: Bart Van Assche
Cc: Christoph Hellwig
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-2-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner
30 Jun, 2016
3 commits
-
Pull audit fixes from Paul Moore:
"Two small patches to fix audit problems in 4.7-rcX: the first fixes a
potential kref leak, the second removes some header file noise.The first is an important bug fix that really should go in before 4.7
is released, the second is not critical, but falls into the very-nice-
to-have category so I'm including in the pull request.Both patches are straightforward, self-contained, and pass our
testsuite without problem"* 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
audit: move audit_get_tty to reduce scope and kabi changes
audit: move calcs after alloc and check when logging set loginuid
-
Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than week or so of bug
fixing. It perhaps looks a little worse than it really is.1) Fix deadlock in ath10k driver, from Ben Greear.
2) Increase scan timeout in iwlwifi, from Luca Coelho.
3) Unbreak STP by properly reinjecting STP packets back into the
stack. Regression fix from Ido Schimmel.4) Mediatek driver fixes (missing malloc failure checks, leaking of
scratch memory, wrong indexing when mapping TX buffers, etc.) from
John Crispin.5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
Sowa.6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su.
7) Fix netlink notifications in ovs for tunnels, delete link messages
are never emitted because of how the device registry state is
handled. From Nicolas Dichtel.8) Conntrack module leaks kmemcache on unload, from Florian Westphal.
9) Prevent endless jump loops in nft rules, from Liping Zhang and
Pablo Neira Ayuso.10) Not early enough spinlock initialization in mlx4, from Eric
Dumazet.11) Bind refcount leak in act_ipt, from Cong WANG.
12) Missing RCU locking in HTB scheduler, from Florian Westphal.
13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
barrier, using heap for SG and IV, and erroneous use of async flag
when allocating AEAD conext.)14) RCU handling fix in TIPC, from Ying Xue.
15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
SIT driver, from Simon Horman.16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.
17) Fix potential deadlock in team enslave, from Ido Schimmel.
18) Memory leak in KCM procfs handling, from Jiri Slaby.
19) ESN generation fix in ipv4 ESP, from Herbert Xu.
20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
WANG.21) Use after free in netem, from Eric Dumazet.
22) Uninitialized last assert time in multicast router code, from Tom
Goff.23) Skip raw sockets in sock_diag destruction broadcast, from Willem
de Bruijn.24) Fix link status reporting in thunderx, from Sunil Goutham.
25) Limit resegmentation of retransmit queue so that we do not
retransmit too large GSO frames. From Eric Dumazet.26) Delay bpf program release after grace period, from Daniel
Borkmann"* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
openvswitch: fix conntrack netlink event delivery
qed: Protect the doorbell BAR with the write barriers.
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
e1000e: keep VLAN interfaces functional after rxvlan off
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
bpf, perf: delay release of BPF prog after grace period
net: bridge: fix vlan stats continue counter
tcp: do not send too big packets at retransmit time
ibmvnic: fix to use list_for_each_safe() when delete items
net: thunderx: Fix TL4 configuration for secondary Qsets
net: thunderx: Fix link status reporting
net/mlx5e: Reorganize ethtool statistics
net/mlx5e: Fix number of PFC counters reported to ethtool
net/mlx5e: Prevent adding the same vxlan port
net/mlx5e: Check for BlueFlame capability before allocating SQ uar
net/mlx5e: Change enum to better reflect usage
net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
net/mlx5: Update command strings
net: marvell: Add separate config ANEG function for Marvell 88E1111
...
-
Pull cgroup fixes from Tejun Heo:
"Three fix patches. Two are for cgroup / css init failure path. The
last one makes css_set_lock irq-safe as the deadline scheduler ends up
calling put_css_set() from irq context"* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Disable IRQs while holding css_set_lock
cgroup: set css->id to -1 during init
cgroup: remove redundant cleanup in css_create
29 Jun, 2016
3 commits
-
Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved
destruction of BPF program from free_event_rcu() callback to __free_event(),
which is problematic if used with tail calls: if prog A is attached as
trace event directly, but at the same time present in a tail call map used
by another trace event program elsewhere, then we need to delay destruction
via RCU grace period since it can still be in use by the program doing the
tail call (the prog first needs to be dropped from the tail call map, then
trace event with prog A attached destroyed, so we get immediate destruction).

Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister")
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Cc: Jann Horn
Signed-off-by: David S. Miller
-
The only users of audit_get_tty and audit_put_tty are internal to
audit, so move it out of include/linux/audit.h to kernel/audit.h and create
a proper function rather than inlining it. This also reduces kABI
changes.

Suggested-by: Paul Moore
Signed-off-by: Richard Guy Briggs
[PM: line wrapped description]
Signed-off-by: Paul Moore
-
Move the calculations of values after the allocation in case the
allocation fails. This avoids wasting effort in the rare case that it
fails, but more importantly saves us extra logic to release the tty
ref.

Signed-off-by: Richard Guy Briggs
Signed-off-by: Paul Moore
25 Jun, 2016
6 commits
-
Pull scheduler fixes from Thomas Gleixner:
"A couple of scheduler fixes:- force watchdog reset while processing sysrq-w
- fix a deadlock when enabling trace events in the scheduler
- fixes to the throttled next buddy logic
- fixes for the average accounting (missing serialization and
underflow handling)- allow kernel threads for fallback to online but not active cpus"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Allow kthreads to fall back to online && !active cpus
sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
sched/fair: Initialize throttle_count for new task-groups lazily
sched/fair: Fix cfs_rq avg tracking underflow
kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w
sched/debug: Fix deadlock when enabling sched events
sched/fair: Fix post_init_entity_util_avg() serialization
-
Pull locking fix from Thomas Gleixner:
"A single fix to address a race in the static key logic"* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/static_key: Fix concurrent static_key_slow_inc() -
Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:

  kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
  kernel/fork.c:355:8: error: assignment from incompatible pointer type
    stack = alloc_thread_stack_node(tsk, node);
          ^

Fix it by renaming free_stack() to free_thread_stack(), and updating the
return type of alloc_thread_stack_node().

Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
Signed-off-by: Michael Ellerman
Signed-off-by: Linus Torvalds
-
Merge misc fixes from Andrew Morton:
"Two weeks worth of fixes here"* emailed patches from Andrew Morton : (41 commits)
init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
autofs: don't get stuck in a loop if vfs_write() returns an error
mm/page_owner: avoid null pointer dereference
tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
fs/nilfs2: fix potential underflow in call to crc32_le
oom, suspend: fix oom_reaper vs. oom_killer_disable race
ocfs2: disable BUG assertions in reading blocks
mm, compaction: abort free scanner if split fails
mm: prevent KASAN false positives in kmemleak
mm/hugetlb: clear compound_mapcount when freeing gigantic pages
mm/swap.c: flush lru pvecs on compound page arrival
memcg: css_alloc should return an ERR_PTR value on error
memcg: mem_cgroup_migrate() may be called with irq disabled
hugetlb: fix nr_pmds accounting with shared page tables
Revert "mm: disable fault around on emulated access bit architecture"
Revert "mm: make faultaround produce old ptes"
mailmap: add Boris Brezillon's email
mailmap: add Antoine Tenart's email
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
mm: mempool: kasan: don't poot mempool objects in quarantine
...
-
Tetsuo has reported the following potential oom_killer_disable vs.
oom_reaper race:

 (1) freeze_processes() starts freezing user space threads.
 (2) Somebody (maybe a kernel thread) calls out_of_memory().
 (3) The OOM killer calls mark_oom_victim() on a user space thread
     P1 which is already in __refrigerator().
 (4) oom_killer_disable() sets oom_killer_disabled = true.
 (5) P1 leaves __refrigerator() and enters do_exit().
 (6) The OOM reaper calls exit_oom_victim(P1) before P1 can call
     exit_oom_victim(P1).
 (7) oom_killer_disable() returns while P1 has not yet finished.
 (8) P1 performs IO/interferes with the freezer.

This situation is unfortunate. We cannot move oom_killer_disable after
all the freezable kernel threads are frozen because the oom victim might
depend on some of those kthreads to make forward progress to exit, so
we could deadlock. It is also far from trivial to teach the oom_reaper
not to call exit_oom_victim(), because then we would lose the guarantee
of OOM killer and oom_killer_disable forward progress, since
exit_mm->mmput might block and never call exit_oom_victim.

It seems the easiest way forward is to work around this race by calling
try_to_freeze_tasks again after oom_killer_disable. This will make sure
that all the tasks are frozen or it bails out.

Fixes: 449d777d7ad6 ("mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper")
Link: http://lkml.kernel.org/r/1466597634-16199-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Tetsuo Handa
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
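As a rough sketch of the resulting freeze_processes() flow (an approximation
of the idea described above, not the literal upstream diff):

    int error;

    error = try_to_freeze_tasks(true);          /* freeze user space */
    if (!error && !oom_killer_disable())
        error = -EBUSY;
    if (!error)
        /*
         * Freeze again: any OOM victim that raced out of the
         * refrigerator while the killer was being disabled gets
         * frozen now, or we bail out with an error.
         */
        error = try_to_freeze_tasks(true);
-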
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.

But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.

Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical. That
identity then meant that we would have things like

    ti = alloc_thread_info_node(tsk, node);
    ...
    tsk->stack = ti;

which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.

So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack. The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.

This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.

The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter. It would be a separate thing to clean that up; I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.

Acked-by: Andy Lutomirski
Signed-off-by: Linus Torvalds
24 Jun, 2016
6 commits
-
During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
online but not active. A CPU_ONLINE callback may create or bind a
kthread so that its cpus_allowed mask only allows the CPU which is
being brought online. The kthread may start executing before the CPU
is made active and can end up in select_fallback_rq().

In such cases, the expected behavior is selecting the CPU which is
coming online; however, because select_fallback_rq() only chooses from
active CPUs, it determines that the task doesn't have any viable CPU
in its allowed mask and ends up overriding it to cpu_possible_mask.

CPU_ONLINE callbacks should be able to put kthreads on the CPU which
is coming online. Update select_fallback_rq() so that it follows
cpu_online() rather than cpu_active() for kthreads.

Reported-by: Gautham R Shenoy
Tested-by: Gautham R. Shenoy
Signed-off-by: Tejun Heo
Signed-off-by: Peter Zijlstra (Intel)
Cc: Abdul Haleem
Cc: Aneesh Kumar
Cc: Linus Torvalds
Cc: Michael Ellerman
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: kernel-team@fb.com
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
Signed-off-by: Ingo Molnar
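The selection rule in toy form, on a four-CPU model; everything here is an
illustrative assumption, not the kernel's select_fallback_rq():

    #include <stdbool.h>
    #include <stdio.h>

    struct task { bool kthread; };

    static bool cpu_online_state[4] = { true, true, true, true };
    static bool cpu_active_state[4] = { true, true, true, false }; /* cpu 3 coming up */

    static int fallback_cpu(const struct task *p, const bool *allowed)
    {
        for (int cpu = 0; cpu < 4; cpu++) {
            if (!allowed[cpu])
                continue;
            if (p->kthread ? cpu_online_state[cpu] : cpu_active_state[cpu])
                return cpu;   /* online is enough for kthreads */
        }
        return -1;            /* caller would widen the mask */
    }

    int main(void)
    {
        bool only_cpu3[4] = { false, false, false, true };
        struct task kthread = { .kthread = true }, utask = { .kthread = false };

        printf("kthread -> cpu %d\n", fallback_cpu(&kthread, only_cpu3)); /* 3  */
        printf("user    -> cpu %d\n", fallback_cpu(&utask, only_cpu3));   /* -1 */
        return 0;
    }
-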
Hierarchy could be already throttled at this point. Throttled next
buddy could trigger a NULL pointer dereference in pick_next_task_fair().

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Ben Segall
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
Signed-off-by: Ingo Molnar
-
A cgroup created inside a throttled group must inherit the current
throttle_count. A broken throttle_count allows throttled entries to be
nominated as the next buddy, which later leads to a NULL pointer
dereference in pick_next_task_fair().

This patch initializes cfs_rq->throttle_count at first enqueue: laziness
allows us to skip locking all rqs at group creation. The lazy approach
also allows skipping a full sub-tree scan when throttling the hierarchy
(not in this patch).

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: bsegall@google.com
Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
Signed-off-by: Ingo Molnar
-
The following scenario is possible:

    CPU 1                                CPU 2
    static_key_slow_inc()
     atomic_inc_not_zero()
      -> key.enabled == 0, no increment
     jump_label_lock()
     atomic_inc_return()
      -> key.enabled == 1 now
                                         static_key_slow_inc()
                                          atomic_inc_not_zero()
                                           -> key.enabled == 1, inc to 2
                                          return
                                          ** static key is wrong!
     jump_label_update()
     jump_label_unlock()

Testing the static key at the point marked by (**) will follow the
wrong path for jumps that have not been patched yet. This can
actually happen when creating many KVM virtual machines with userspace
LAPIC emulation; just run several copies of the following program:

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            int kvmfd = open("/dev/kvm", O_RDONLY);
            int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
            close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
            close(vmfd);
            close(kvmfd);
        }
        return 0;
    }

Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
The static key's purpose is to skip NULL pointer checks and indeed one
of the processes eventually dereferences NULL.

As explained in the commit that introduced the bug:
706249c222f6 ("locking/static_keys: Rework update logic")
jump_label_update() needs key.enabled to be true. The solution adopted
here is to temporarily make key.enabled == -1, and go down the slow
path when key.enabled <= 0.

Signed-off-by: Paolo Bonzini
Signed-off-by: Peter Zijlstra (Intel)
Cc: stable@vger.kernel.org # v4.3+
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: 706249c222f6 ("locking/static_keys: Rework update logic")
Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
[ Small stylistic edits to the changelog and the code. ]
Signed-off-by: Ingo Molnar
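The shape of the fix, in a user-space sketch; the names and the
patch_jump_sites() stand-in are illustrative assumptions, not the kernel's
jump label code. The first enabler publishes -1 while patching, so contenders
serialize on the lock instead of taking the not-yet-patched fast path:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int key_enabled;
    static pthread_mutex_t jump_label_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void patch_jump_sites(void) { /* stands in for jump_label_update() */ }

    static void static_key_slow_inc_sketch(void)
    {
        int v = atomic_load(&key_enabled);

        while (v > 0)           /* fast path only once fully enabled */
            if (atomic_compare_exchange_weak(&key_enabled, &v, v + 1))
                return;

        pthread_mutex_lock(&jump_label_mutex);
        if (atomic_load(&key_enabled) == 0) {
            atomic_store(&key_enabled, -1);  /* "patching in progress" */
            patch_jump_sites();
            atomic_store(&key_enabled, 1);
        } else {
            atomic_fetch_add(&key_enabled, 1);
        }
        pthread_mutex_unlock(&jump_label_mutex);
    }

    int main(void)
    {
        static_key_slow_inc_sketch();
        static_key_slow_inc_sketch();
        printf("enabled = %d\n", atomic_load(&key_enabled)); /* 2 */
        return 0;
    }
-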
While testing the deadline scheduler + cgroup setup I hit this
warning:

[ 132.612935] ------------[ cut here ]------------
[ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
[ 132.612952] Modules linked in: (a ton of modules...)
[ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
[ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
[ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
[ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
[ 132.612986] Call Trace:
[ 132.612987] [] dump_stack+0x63/0x85
[ 132.612994] [] __warn+0xcb/0xf0
[ 132.612997] [] ? push_dl_task.part.32+0x170/0x170
[ 132.612999] [] warn_slowpath_null+0x1d/0x20
[ 132.613000] [] __local_bh_enable_ip+0x6b/0x80
[ 132.613008] [] _raw_write_unlock_bh+0x1a/0x20
[ 132.613010] [] _raw_spin_unlock_bh+0xe/0x10
[ 132.613015] [] put_css_set+0x5c/0x60
[ 132.613016] [] cgroup_free+0x7f/0xa0
[ 132.613017] [] __put_task_struct+0x42/0x140
[ 132.613018] [] dl_task_timer+0xca/0x250
[ 132.613027] [] ? push_dl_task.part.32+0x170/0x170
[ 132.613030] [] __hrtimer_run_queues+0xee/0x270
[ 132.613031] [] hrtimer_interrupt+0xa8/0x190
[ 132.613034] [] local_apic_timer_interrupt+0x38/0x60
[ 132.613035] [] smp_apic_timer_interrupt+0x3d/0x50
[ 132.613037] [] apic_timer_interrupt+0x8c/0xa0
[ 132.613038] [] ? native_safe_halt+0x6/0x10
[ 132.613043] [] default_idle+0x1e/0xd0
[ 132.613044] [] arch_cpu_idle+0xf/0x20
[ 132.613046] [] default_idle_call+0x2a/0x40
[ 132.613047] [] cpu_startup_entry+0x2e7/0x340
[ 132.613048] [] start_secondary+0x155/0x190
[ 132.613049] ---[ end trace f91934d162ce9977 ]---

The warning is caused by the spin_(lock|unlock)_bh(&css_set_lock) in
interrupt context. Convert the spin_lock_bh to spin_lock_irq(save) to
avoid this problem, and other problems of sharing a spinlock with an
interrupt.

Cc: Tejun Heo
Cc: Li Zefan
Cc: Johannes Weiner
Cc: Juri Lelli
Cc: Steven Rostedt
Cc: cgroups@vger.kernel.org
Cc: stable@vger.kernel.org # 4.5+
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Rik van Riel
Reviewed-by: "Luis Claudio R. Goncalves"
Signed-off-by: Daniel Bristot de Oliveira
Acked-by: Zefan Li
Signed-off-by: Tejun Heo
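The shape of the conversion (illustrative, not the full upstream diff): every
css_set_lock critical section that used the _bh variants switches to the
irq-disabling variants, so dl_task_timer -> put_css_set() from hard-irq
context can no longer deadlock.

    unsigned long flags;

    /* was: spin_lock_bh(&css_set_lock); */
    spin_lock_irqsave(&css_set_lock, flags);
    /* ... manipulate the css_set ... */
    /* was: spin_unlock_bh(&css_set_lock); */
    spin_unlock_irqrestore(&css_set_lock, flags);
-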
None of the code actually wants a thread_info, it all wants a
task_struct, and it's just converting back and forth between the two
("ti->task" to get the task_struct from the thread_info, and
"task_thread_info(task)" to go the other way).No semantic change.
Acked-by: Peter Zijlstra
Signed-off-by: Linus Torvalds
23 Jun, 2016
1 commit
-
The function irq_create_of_mapping() is used to create an interrupt
mapping. However, whether the irqdomain to which the interrupt belongs
is part of a hierarchy determines whether the mapping is created by
calling irq_domain_alloc_irqs() or irq_create_mapping().

To dispose of the interrupt mapping, drivers call irq_dispose_mapping().
However, this function does not check to see if the irqdomain is part
of a hierarchy or not and simply assumes that it was mapped via calling
irq_create_mapping(), so it calls irq_domain_disassociate() to unmap the
interrupt.

Fix this by checking to see if the irqdomain is part of a hierarchy and
if so calling irq_domain_free_irqs() to free/unmap the interrupt.

Signed-off-by: Jon Hunter
Cc: Marc Zyngier
Cc: Jiang Liu
Link: http://lkml.kernel.org/r/1466501002-16368-1-git-send-email-jonathanh@nvidia.com
Signed-off-by: Thomas Gleixner
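A simplified sketch of the fixed dispose path (error handling and the
non-hierarchy descriptor cleanup are elided; treat as an approximation, not
the exact upstream code):

    void irq_dispose_mapping(unsigned int virq)
    {
        struct irq_data *irq_data = irq_get_irq_data(virq);
        struct irq_domain *domain = irq_data ? irq_data->domain : NULL;

        if (!virq || !domain)
            return;

        if (irq_domain_is_hierarchy(domain))
            irq_domain_free_irqs(virq, 1);          /* hierarchical unmap */
        else
            irq_domain_disassociate(domain, virq);  /* legacy unmap       */
    }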
21 Jun, 2016
1 commit
-
Pull tracing fixes from Steven Rostedt:
"Two fixes for the tracing system:- When trace_printk() is used with a non constant format descriptor,
it adds a NULL pointer into the trace format section, and the code
isn't prepared to deal with it. This bug appeared by a change that
was added in v3.5.- The ftracetest (selftests section) can't handle testing histograms
when histograms are not configured. Currently it shows that they
fail the test, when they should state that they are unsupported.
This bug was added in the 4.7 merge window with the addition of the
historgram code"* tag 'trace-v4.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftracetest: Fix hist unsupported result in hist selftests
tracing: Handle NULL formats in hold_module_trace_bprintk_format()
20 Jun, 2016
2 commits
-
If a task uses a non-constant string for the format parameter in
trace_printk(), then the trace_printk_fmt variable is set to NULL. This
variable is then saved in the __trace_printk_fmt section.

The function hold_module_trace_bprintk_format() checks to see if duplicate
formats are used by modules, and reuses them if so (saves them to the list
if it is new). But this function calls lookup_format() that does a strcmp()
to the value (which is now NULL) and can cause a kernel oops.

This wasn't an issue till 3debb0a9ddb ("tracing: Fix trace_printk() to print
when not using bprintk()") which added "__used" to the trace_printk_fmt
variable, and before that, the kernel simply optimized it out (no NULL value
was saved).

The fix is simply to handle the NULL pointer in lookup_format() and have the
caller ignore the value if it was NULL.

Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com
Reported-by: xingzhen
Acked-by: Namhyung Kim
Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
Cc: stable@vger.kernel.org # v3.5+
Signed-off-by: Steven Rostedt
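A sketch of what that amounts to; simplified, with names following the
description above rather than the exact upstream code:

    static struct trace_bprintk_fmt *lookup_format(const char *fmt)
    {
        struct trace_bprintk_fmt *pos;

        if (!fmt)
            return ERR_PTR(-EINVAL);  /* non-constant fmt: nothing to match */

        list_for_each_entry(pos, &trace_bprintk_fmt_list, list)
            if (!strcmp(pos->fmt, fmt))
                return pos;
        return NULL;
    }

    /* ...and in the caller, hold_module_trace_bprintk_format(): */
    tb_fmt = lookup_format(fmt);
    if (IS_ERR(tb_fmt))
        return NULL;   /* ignore NULL formats entirely */
-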
As per commit:
b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")
> the code generated from update_cfs_rq_load_avg():
>
> if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> sa->load_avg = max_t(long, sa->load_avg - r, 0);
> sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> removed_load = 1;
> }
>
> turns into:
>
> ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
> ffffffff8108706b: 48 85 c0 test %rax,%rax
> ffffffff8108706e: 74 40 je ffffffff810870b0
> ffffffff81087070: 4c 89 f8 mov %r15,%rax
> ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
> ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
> ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
> ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
> ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
> ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the signed
bit. This reduces the effective width of the variables, IOW it's
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above; if we end up at U*_MAX we'll still explode,
maybe we should do clamping on add too.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrey Ryabinin
Cc: Chris Wilson
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Yuyang Du
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
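In standalone form, the clamp reads like this (an approximation; the kernel
version is a macro built on READ_ONCE()/WRITE_ONCE()). One explicit load, one
explicit store, and wraparound detected without the sign bit:

    #include <stdint.h>
    #include <stdio.h>

    static void sub_positive(uint64_t *ptr, uint64_t val)
    {
        uint64_t old = *(volatile uint64_t *)ptr;  /* single LOAD            */
        uint64_t res = old - val;

        if (res > old)                             /* unsigned wrap detected */
            res = 0;
        *(volatile uint64_t *)ptr = res;           /* single STORE           */
    }

    int main(void)
    {
        uint64_t load_avg = 10;

        sub_positive(&load_avg, 25);               /* would underflow        */
        printf("%llu\n", (unsigned long long)load_avg); /* prints 0          */
        return 0;
    }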
18 Jun, 2016
1 commit
-
This adds a software irq handler for controllers that multiplex
interrupts from multiple devices, but don't know which device generated
the interrupt. For these devices, the irq handler that demuxes must
check every action for every software irq using the same h/w irq in order
to find out which device generated the interrupt. This will inevitably
trigger spurious interrupt detection if we are noting the irq.

The new irq handler does not track the handling for spurious interrupt
detection. An irq that uses this also won't get stats tracked, since it
didn't generate the interrupt, nor will it be added to randomness, since
it is not random.

Signed-off-by: Keith Busch
Cc: Bjorn Helgaas
Cc: linux-pci@vger.kernel.org
Cc: Jon Derrick
Link: http://lkml.kernel.org/r/1466200821-29159-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner
17 Jun, 2016
1 commit
-
If percpu_ref initialization fails during css_create(), the free path
can end up trying to free css->id of zero. As ID 0 is unused, it
doesn't cause a critical breakage but it does trigger a warning
message. Fix it by setting css->id to -1 from init_and_link_css().

Signed-off-by: Tejun Heo
Cc: Wenwei Tao
Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Tejun Heo
16 Jun, 2016
2 commits
-
Similar to bpf_perf_event_output(), the bpf_perf_event_read() helper
needs to check the type of the perf_event before reading the counter.

Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
Reported-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller
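A sketch of the check, roughly as the description places it at the top of
bpf_perf_event_read() (simplified; an approximation, not the exact upstream
hunk): only hardware and raw perf event types expose a counter that is safe
to read from this helper.

    if (event->attr.type != PERF_TYPE_HARDWARE &&
        event->attr.type != PERF_TYPE_RAW)
        return -EINVAL;   /* e.g. software or tracepoint events */
-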
The ctx structure passed into bpf programs is different depending on the
bpf program type. The verifier incorrectly marked ctx->data and
ctx->data_end access based on ctx offset only. That caused loads in
tracing programs

    int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }

to be incorrectly marked as PTR_TO_PACKET, which later caused the
verifier to reject a program that was actually valid in a tracing
context. Fix this by doing program-type-specific matching of ctx
offsets.

Fixes: 969bf05eb3ce ("bpf: direct packet access")
Reported-by: Sasha Goldshtein
Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller
15 Jun, 2016
1 commit
-
Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
private data"), a debugfs file's file_operations methods get proxied
through lifetime-aware wrappers.

However, only a certain subset of the file_operations members is supported
by debugfs and ->mmap isn't among them -- it appears to be NULL from the
VFS layer's perspective.

This behaviour breaks the /sys/kernel/debug/kcov file introduced
concurrently with commit 5c9a8750a640 ("kernel: add kcov code coverage").

Since that file never gets removed, there is no file removal race and thus
a lifetime-checking proxy isn't needed.

Avoid the proxying for /sys/kernel/debug/kcov by creating it via
debugfs_create_file_unsafe() rather than debugfs_create_file().

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
Fixes: 5c9a8750a640 ("kernel: add kcov code coverage")
Reported-by: Sasha Levin
Signed-off-by: Nicolai Stange
Signed-off-by: Greg Kroah-Hartman
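The fix reduces to a one-line switch where the file is created (a sketch,
with surrounding error handling abbreviated):

    /* was: debugfs_create_file("kcov", 0600, NULL, NULL, &kcov_fops)
     *
     * The _unsafe variant skips the lifetime-checking proxy -- safe here
     * because the file is never removed, and necessary because the proxy
     * hides ->mmap from the VFS. */
    debugfs_create_file_unsafe("kcov", 0600, NULL, NULL, &kcov_fops);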
14 Jun, 2016
3 commits
-
Lengthy output of sysrq-w may take a lot of time on slow serial console.
Currently we reset NMI-watchdog on the current CPU to avoid spurious
lockup messages. Sometimes this doesn't work since softlockup watchdog
might trigger on another CPU which is waiting for an IPI to proceed.
We reset softlockup watchdogs on all CPUs, but we do this only after
listing all tasks, and this may be too late on a busy system.

So, reset the watchdogs on all CPUs earlier, in the
for_each_process_thread() loop.

Signed-off-by: Andrey Ryabinin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc:
Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar
-
I see a hang when enabling sched events:

    echo 1 > /sys/kernel/debug/tracing/events/sched/enable

The printk buffer shows:

BUG: spinlock recursion on CPU#1, swapper/1/0
lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
...
Call Trace:
[] dump_stack+0x85/0xc2
[] spin_dump+0x78/0xc0
[] do_raw_spin_lock+0x11a/0x150
[] _raw_spin_lock+0x61/0x80
[] ? try_to_wake_up+0x256/0x4e0
[] try_to_wake_up+0x256/0x4e0
[] ? _raw_spin_unlock_irqrestore+0x4a/0x80
[] wake_up_process+0x15/0x20
[] insert_work+0x84/0xc0
[] __queue_work+0x18f/0x660
[] queue_work_on+0x46/0x90
[] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
[] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
[] soft_cursor+0x1ad/0x230
[] bit_cursor+0x649/0x680
[] ? update_attr.isra.2+0x90/0x90
[] fbcon_cursor+0x14a/0x1c0
[] hide_cursor+0x28/0x90
[] vt_console_print+0x3bf/0x3f0
[] call_console_drivers.constprop.24+0x183/0x200
[] console_unlock+0x3d4/0x610
[] vprintk_emit+0x3c5/0x610
[] vprintk_default+0x29/0x40
[] printk+0x57/0x73
[] enqueue_entity+0xc2e/0xc70
[] enqueue_task_fair+0x59/0xab0
[] ? kvm_sched_clock_read+0x9/0x20
[] ? sched_clock+0x9/0x10
[] activate_task+0x5c/0xa0
[] ttwu_do_activate+0x54/0xb0
[] sched_ttwu_pending+0x7a/0xb0
[] scheduler_ipi+0x61/0x170
[] smp_trace_reschedule_interrupt+0x4f/0x2a0
[] trace_reschedule_interrupt+0x96/0xa0
[] ? native_safe_halt+0x6/0x10
[] ? trace_hardirqs_on+0xd/0x10
[] default_idle+0x20/0x1a0
[] arch_cpu_idle+0xf/0x20
[] default_idle_call+0x2f/0x50
[] cpu_startup_entry+0x37e/0x450
[] start_secondary+0x160/0x1a0

Note the hang only occurs when echoing the above from a physical serial
console, not from an ssh session.

The bug is caused by a deadlock where the task is trying to grab the rq
lock twice, because printk()s aren't safe in sched code.

Signed-off-by: Josh Poimboeuf
Cc: Linus Torvalds
Cc: Matt Fleming
Cc: Mel Gorman
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Srikar Dronamraju
Cc: Thomas Gleixner
Cc: stable@vger.kernel.org
Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
Signed-off-by: Ingo Molnar
-
Chris Wilson reported a divide by 0 at:
post_init_entity_util_avg():
> 725 if (cfs_rq->avg.util_avg != 0) {
> 726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
> -> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1);
> 728
> 729 if (sa->util_avg > cap)
> 730 sa->util_avg = cap;
> 731 } else {

Which, given the lack of serialization, and the code generated from
update_cfs_rq_load_avg(), is entirely possible:

    if (atomic_long_read(&cfs_rq->removed_load_avg)) {
        s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
        sa->load_avg = max_t(long, sa->load_avg - r, 0);
        sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
        removed_load = 1;
    }

turns into:
ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
ffffffff8108706b: 48 85 c0 test %rax,%rax
ffffffff8108706e: 74 40 je ffffffff810870b0
ffffffff81087070: 4c 89 f8 mov %r15,%rax
ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx

Which you'll note ends up with 'sa->load_avg - r' in memory at
ffffffff8108707a.

By calling post_init_entity_util_avg() under rq->lock we're sure to be
fully serialized against PELT updates and cannot observe intermediate
state like this.

Reported-by: Chris Wilson
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrey Ryabinin
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Yuyang Du
Cc: bsegall@google.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
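In sketch form (an approximation of the reworked wake-up path, not the
literal upstream diff): run the initial util_avg setup with the task's
rq->lock held, so the unserialized store shown above can never be observed
mid-update.

    raw_spin_lock_irqsave(&rq->lock, flags);
    post_init_entity_util_avg(&p->se);
    raw_spin_unlock_irqrestore(&rq->lock, flags);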
13 Jun, 2016
3 commits
-
…/arm-platforms into irq/core
First drop of irqchip updates for 4.8 from Marc Zyngier:
- Fix a few bugs in configuring the default trigger from the irqdomain layer
- Make the genirq layer PM aware
- Add PM capability to the ARM GIC driver
- Add support for 2-level translation tables to the GICv3 ITS driver
-
Some IRQ chips may be located in a power domain outside of the CPU
subsystem and hence will require device specific runtime power
management. In order to support such IRQ chips, add a pointer for a
device structure to the irq_chip structure, and if this pointer is
populated by the IRQ chip driver and CONFIG_PM is selected in the kernel
configuration, then the pm_runtime_get/put APIs for this chip will be
called when an IRQ is requested/freed, respectively.

Reviewed-by: Kevin Hilman
Signed-off-by: Jon Hunter
Signed-off-by: Marc Zyngier
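A sketch of the request-side hook this describes; the names are modeled on
the description above and should be treated as an illustration, not the exact
upstream code:

    static int chip_pm_get(struct irq_data *data)
    {
        /* Take a runtime PM reference on the chip's device, if the
         * driver provided one, whenever an IRQ is requested. */
        if (IS_ENABLED(CONFIG_PM) && data->chip->parent_device)
            return pm_runtime_get_sync(data->chip->parent_device);
        return 0;   /* no device: nothing to power up */
    }
-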
Some IRQ chips, such as GPIO controllers or secondary level interrupt
controllers, may require additional runtime power management control to
ensure they are accessible. For such IRQ chips, it makes sense to enable
the IRQ chip when interrupts are requested and disable it again once all
interrupts have been freed.

When mapping an IRQ, the IRQ type settings are read and then programmed.
The mapping of the IRQ happens before the IRQ is requested, and so the
programming of the type settings occurs before the IRQ is requested. This
is a problem for IRQ chips that require additional power management
control, because they may not be accessible yet. Therefore, when mapping
the IRQ, don't program the type settings; just save them and then program
these saved settings when the IRQ is requested (so long as they are not
overridden via the call to request the IRQ).

Add a stub function for irq_domain_free_irqs() to avoid any compilation
errors when CONFIG_IRQ_DOMAIN_HIERARCHY is not selected.

Signed-off-by: Jon Hunter
Reviewed-by: Marc Zyngier
Signed-off-by: Marc Zyngier