03 Jan, 2018
6 commits
-
commit 466a2b42d67644447a1765276259a3ea5531ddff upstream.
Since the recent remote cpufreq callback work, its possible that a cpufreq
update is triggered from a remote CPU. For single policies however, the current
code uses the local CPU when trying to determine if the remote sg_cpu entered
idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
idle_calls counter of the remote CPU.Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
Acked-by: Viresh Kumar
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Joel Fernandes
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman -
commit ae415fa4c5248a8cf4faabd5a3c20576cb1ad607 upstream.
To free the reader page that is allocated with ring_buffer_alloc_read_page(),
ring_buffer_free_read_page() must be called. For faster performance, this
page can be reused by the ring buffer to avoid having to free and allocate
new pages.The issue arises when the page is used with a splice pipe into the
networking code. The networking code may up the page counter for the page,
and keep it active while sending it is queued to go to the network. The
incrementing of the page ref does not prevent it from being reused in the
ring buffer, and this can cause the page that is being sent out to the
network to be modified before it is sent by reading new data.Add a check to the page ref counter, and only reuse the page if it is not
being used anywhere else.Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Greg Kroah-Hartman -
commit 45d8b80c2ac5d21cd1e2954431fb676bc2b1e099 upstream.
Two info bits were added to the "commit" part of the ring buffer data page
when returned to be consumed. This was to inform the user space readers that
events have been missed, and that the count may be stored at the end of the
page.What wasn't handled, was the splice code that actually called a function to
return the length of the data in order to zero out the rest of the page
before sending it up to user space. These data bits were returned with the
length making the value negative, and that negative value was not checked.
It was compared to PAGE_SIZE, and only used if the size was less than
PAGE_SIZE. Luckily PAGE_SIZE is unsigned long which made the compare an
unsigned compare, meaning the negative size value did not end up causing a
large portion of memory to be randomly zeroed out.Fixes: 66a8cb95ed040 ("ring-buffer: Add place holder recording of dropped events")
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Greg Kroah-Hartman -
commit 24f2aaf952ee0b59f31c3a18b8b36c9e3d3c2cf5 upstream.
Double free of the ring buffer happens when it fails to alloc new
ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured.
The root cause is that the pointer is not set to NULL after the buffer
is freed in allocate_trace_buffers(), and the freeing of the ring
buffer is invoked again later if the pointer is not equal to Null,
as:instance_mkdir()
|-allocate_trace_buffers()
|-allocate_trace_buffer(tr, &tr->trace_buffer...)
|-allocate_trace_buffer(tr, &tr->max_buffer...)// allocate fail(-ENOMEM),first free
// and the buffer pointer is not set to null
|-ring_buffer_free(tr->trace_buffer.buffer)// out_free_tr
|-free_trace_buffers()
|-free_trace_buffer(&tr->trace_buffer);//if trace_buffer is not null, free again
|-ring_buffer_free(buf->buffer)
|-rb_free_cpu_buffer(buffer->buffers[cpu])
// ring_buffer_per_cpu is null, and
// crash in ring_buffer_per_cpu->pagesLink: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code")
Signed-off-by: Jing Xia
Signed-off-by: Chunyan Zhang
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Greg Kroah-Hartman -
commit 4397f04575c44e1440ec2e49b6302785c95fd2f8 upstream.
Jing Xia and Chunyan Zhang reported that on failing to allocate part of the
tracing buffer, memory is freed, but the pointers that point to them are not
initialized back to NULL, and later paths may try to free the freed memory
again. Jing and Chunyan fixed one of the locations that does this, but
missed a spot.Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com
Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code")
Reported-by: Jing Xia
Reported-by: Chunyan Zhang
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Greg Kroah-Hartman -
commit 6b7e633fe9c24682df550e5311f47fb524701586 upstream.
The ring_buffer_read_page() takes care of zeroing out any extra data in the
page that it returns. There's no need to zero it out again from the
consumer. It was removed from one consumer of this function, but
read_buffers_splice_read() did not remove it, and worse, it contained a
nasty bug because of it.Fixes: 2711ca237a084 ("ring-buffer: Move zeroing out excess in page to ring buffer code")
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Greg Kroah-Hartman
30 Dec, 2017
1 commit
-
commit c10e83f598d08046dd1ebc8360d4bb12d802d51b upstream.
In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andy Lutomirski
Cc: Andy Lutomirsky
Cc: Boris Ostrovsky
Cc: Borislav Petkov
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Dave Hansen
Cc: Dave Hansen
Cc: David Laight
Cc: Denys Vlasenko
Cc: Eduardo Valentin
Cc: Greg KH
Cc: H. Peter Anvin
Cc: Josh Poimboeuf
Cc: Juergen Gross
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
25 Dec, 2017
12 commits
-
From: Alexei Starovoitov
[ Upstream commit bb7f0f989ca7de1153bd128a40a71709e339fa03 ]
There were various issues related to the limited size of integers used in
the verifier:
- `off + size` overflow in __check_map_access()
- `off + reg->off` overflow in check_mem_access()
- `off + reg->var_off.value` overflow or 32-bit truncation of
`reg->var_off.value` in check_mem_access()
- 32-bit truncation in check_stack_boundary()Make sure that any integer math cannot overflow by not allowing
pointer math with large values.Also reduce the scope of "scalar op scalar" tracking.
Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
From: Jann Horn
[ Upstream commit 179d1c5602997fef5a940c6ddcf31212cbfebd14 ]
This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
From: Jann Horn
[ Upstream commit a5ec6ae161d72f01411169a938fa5f8baea16e8f ]
Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
From: Jann Horn
Prevent indirect stack accesses at non-constant addresses, which would
permit reading and corrupting spilled pointers.Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
From: Jann Horn
[ Upstream commit 468f6eafa6c44cb2c5d8aad35e12f06c240a812a ]
32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
Adjust the verifier accordingly.Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
From: Jann Horn
[ Upstream commit 0c17d1d2c61936401f4702e1846e2c19b200f958 ]
Properly handle register truncation to a smaller size.
The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.Debian assigned CVE-2017-16996 for this issue.
v2:
- flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
- add CVE number (Ben Hutchings)Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
Signed-off-by: Jann Horn
Acked-by: Edward Cree
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
From: Jann Horn
[ Upstream commit 95a762e2c8c942780948091f8f2a4f32fce1ac6f ]
Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.Debian assigned CVE-2017-16995 for this issue.
v3:
- add CVE number (Ben Hutchings)Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn
Acked-by: Edward Cree
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
From: Edward Cree
[ Upstream commit 4374f256ce8182019353c0c639bb8d0695b4c941 ]
Incorrect signed bounds were being computed.
If the old upper signed bound was positive and the old lower signed bound was
negative, this could cause the new upper signed bound to be too low,
leading to security issues.Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
Reported-by: Jann Horn
Signed-off-by: Edward Cree
Acked-by: Alexei Starovoitov
[jannh@google.com: changed description to reflect bug impact]
Signed-off-by: Jann Horn
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 283ca526a9bd75aed7350220d7b1f8027d99c3fd ]
When tracing and networking programs are both attached in the
system and both use event-output helpers that eventually call
into perf_event_output(), then we could end up in a situation
where the tracing attached program runs in user context while
a cls_bpf program is triggered on that same CPU out of softirq
context.Since both rely on the same per-cpu perf_sample_data, we could
potentially corrupt it. This can only ever happen in a combination
of the two types; all tracing programs use a bpf_prog_active
counter to bail out in case a program is already running on
that CPU out of a different context. XDP and cls_bpf programs
by themselves don't have this issue as they run in the same
context only. Therefore, split both perf_sample_data so they
cannot be accessed from each other.Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Reported-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Tested-by: Song Liu
Acked-by: Alexei Starovoitov
Signed-off-by: Alexei Starovoitov
Signed-off-by: Greg Kroah-Hartman -
From: Alexei Starovoitov
[ Upstream commit c131187db2d3fa2f8bf32fdf4e9a4ef805168467 ]
when the verifier detects that register contains a runtime constant
and it's compared with another constant it will prune exploration
of the branch that is guaranteed not to be taken at runtime.
This is all correct, but malicious program may be constructed
in such a way that it always has a constant comparison and
the other branch is never taken under any conditions.
In this case such path through the program will not be explored
by the verifier. It won't be taken at run-time either, but since
all instructions are JITed the malicious program may cause JITs
to complain about using reserved fields, etc.
To fix the issue we have to track the instructions explored by
the verifier and sanitize instructions that are dead at run time
with NOPs. We cannot reject such dead code, since llvm generates
it for valid C code, since it doesn't do as much data flow
analysis as the verifier does.Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: Daniel Borkmann
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit a15f7fc20389a8827d5859907568b201234d4b79 ]
There are a small number of 'generic fields' (comm/COMM/cpu/CPU) that
are found by trace_find_event_field() but are only meant for
filtering. Specifically, they unlike normal fields, they have a size
of 0 and thus wreak havoc when used as a histogram key.Exclude these (return -EINVAL) when used as histogram keys.
Link: http://lkml.kernel.org/r/956154cbc3e8a4f0633d619b886c97f0f0edf7b4.1506105045.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman -
commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream.
[ Note, this is a Git cherry-pick of the following commit:
506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")
... for easier x86 PTI code testing and back-porting. ]
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.Signed-off-by: Will Deacon
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
20 Dec, 2017
5 commits
-
[ Upstream commit 95b982b45122c57da2ee0b46cce70775e1d987af ]
Problem: This flag does not get cleared currently in the suspend or
resume path in the following cases:* In case some driver's suspend routine returns an error.
* Successful s2idle case
* etc?Why is this a problem: What happens is that the next suspend attempt
could fail even though the user did not enable the flag by writing to
/sys/power/wakeup_count. This is 1 use case how the issue can be seen
(but similar use case with driver suspend failure can be thought of):1. Read /sys/power/wakeup_count
2. echo count > /sys/power/wakeup_count
3. echo freeze > /sys/power/wakeup_count
4. Let the system suspend, and wakeup the system using some wake source
that calls pm_wakeup_event() e.g. power button or something.
5. Note that the combined wakeup count would be incremented due
to the pm_wakeup_event() in the resume path.
6. After resuming the events_check_enabled flag is still set.At this point if the user attempts to freeze again (without writing to
/sys/power/wakeup_count), the suspend would fail even though there has
been no wake event since the past resume.Address that by clearing the flag just before a resume is completed,
so that it is always cleared for the corner cases mentioned above.Signed-off-by: Rajat Jain
Acked-by: Pavel Machek
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman -
commit cef31d9af908243421258f1df35a4a644604efbe upstream.
timer_create() specifies via sigevent->sigev_notify the signal delivery for
the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
and (SIGEV_SIGNAL | SIGEV_THREAD_ID).The sanity check in good_sigevent() is only checking the valid combination
for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
not set it accepts any random value.This has no real effects on the posix timer and signal delivery code, but
it affects show_timer() which handles the output of /proc/$PID/timers. That
function uses a string array to pretty print sigev_notify. The access to
that array has no bound checks, so random sigev_notify cause access beyond
the array bounds.Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
masking from various code pathes as SIGEV_NONE can never be set in
combination with SIGEV_THREAD_ID.Reported-by: Eric Biggers
Reported-by: Dmitry Vyukov
Reported-by: Alexey Dobriyan
Signed-off-by: Thomas Gleixner
Cc: John Stultz
Signed-off-by: Greg Kroah-Hartman -
commit f73c52a5bcd1710994e53fbccc378c42b97a06b6 upstream.
Daniel Wagner reported a crash on the BeagleBone Black SoC.
This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.Reported-by: Daniel Wagner
Signed-off-by: Steven Rostedt (VMware)
Acked-by: Peter Zijlstra
Cc: Linus Torvalds
Cc: Sebastian Andrzej Siewior
Cc: Thomas Gleixner
Cc: linux-rt-users
Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 90e406f96f630c07d631a021fd4af10aac913e77 upstream.
The default NR_CPUS can be very large, but actual possible nr_cpu_ids
usually is very small. For my x86 distribution, the NR_CPUS is 8192 and
nr_cpu_ids is 4. About 2 pages are wasted.Most machines don't have so many CPUs, so define a array with NR_CPUS
just wastes memory. So let's allocate the buffer dynamically when need.With this change, the mutext tracing_cpumask_update_lock also can be
removed now, which was used to protect mask_str.Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com
Fixes: 36dfe9252bd4c ("ftrace: make use of tracing_cpumask")
Signed-off-by: Changbin Du
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Greg Kroah-Hartman -
commit bdcf0a423ea1c40bbb40e7ee483b50fc8aa3d758 upstream.
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.This patch:
- Make groups_sort globally visible.
- Move the call to groups_sort to the modifiers of group_info
- Remove the call to groups_sort from set_groupsLink: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker
Reviewed-by: Matthew Wilcox
Reviewed-by: NeilBrown
Acked-by: "J. Bruce Fields"
Cc: Al Viro
Cc: Martin Schwidefsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
17 Dec, 2017
2 commits
-
[ Upstream commit 173743dd99a49c956b124a74c8aacb0384739a4c ]
Prior to this patch we enabled audit in audit_init(), which is too
late for PID 1 as the standard initcalls are run after the PID 1 task
is forked. This means that we never allocate an audit_context (see
audit_alloc()) for PID 1 and therefore miss a lot of audit events
generated by PID 1.This patch enables audit as early as possible to help ensure that when
PID 1 is forked it can allocate an audit_context if required.Reviewed-by: Richard Guy Briggs
Signed-off-by: Paul Moore
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 33e8a907804428109ce1d12301c3365d619cc4df ]
The API to end auditing has historically been for auditd to set the
pid to 0. This patch restores that functionality.See: https://github.com/linux-audit/audit-kernel/issues/69
Reviewed-by: Richard Guy Briggs
Signed-off-by: Steve Grubb
Signed-off-by: Paul Moore
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
14 Dec, 2017
5 commits
-
[ Upstream commit 92ee46efeb505ead3ab06d3c5ce695637ed5f152 ]
Fengguang Wu reported that running the rcuperf test during boot can cause
the jump_label_test() to hit a WARN_ON(). The issue is that the core jump
label code relies on kernel_text_address() to detect when it can no longer
update branches that may be contained in __init sections. The
kernel_text_address() in turn assumes that if the system_state variable is
greter than or equal to SYSTEM_RUNNING then __init sections are no longer
valid (since the assumption is that they have been freed). However, when
rcuperf is setup to run in early boot it can call kernel_power_off() which
sets the system_state to SYSTEM_POWER_OFF.Since rcuperf initialization is invoked via a module_init(), we can make
the dependency of jump_label_test() needing to complete before rcuperf
explicit by calling it via early_initcall().Reported-by: Fengguang Wu
Signed-off-by: Jason Baron
Acked-by: Paul E. McKenney
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1510609727-2238-1-git-send-email-jbaron@akamai.com
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 89ad2fa3f043a1e8daae193bcb5fe34d5f8caf28 ]
pcpu_freelist_pop() needs the same lockdep awareness than
pcpu_freelist_populate() to avoid a false positive.[ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
(&htab->buckets[i].lock){......}, at: [] __htab_percpu_map_update_elem+0x1cb/0x300and this task is already holding:
(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [] __dev_queue_xmit+0
x868/0x1240
which would create a new lock dependency:
(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......}but this new dependency connects a SOFTIRQ-irq-safe lock:
(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
... which became SOFTIRQ-irq-safe at:
[] __lock_acquire+0x42b/0x1f10
[] lock_acquire+0xbc/0x1b0
[] _raw_spin_lock+0x38/0x50
[] __dev_queue_xmit+0x868/0x1240
[] dev_queue_xmit+0x10/0x20
[] ip_finish_output2+0x439/0x590
[] ip_finish_output+0x150/0x2f0
[] ip_output+0x7d/0x260
[] ip_local_out+0x5e/0xe0
[] ip_queue_xmit+0x205/0x620
[] tcp_transmit_skb+0x5a8/0xcb0
[] tcp_write_xmit+0x242/0x1070
[] __tcp_push_pending_frames+0x3c/0xf0
[] tcp_rcv_established+0x312/0x700
[] tcp_v4_do_rcv+0x11c/0x200
[] tcp_v4_rcv+0xaa2/0xc30
[] ip_local_deliver_finish+0xa7/0x240
[] ip_local_deliver+0x66/0x200
[] ip_rcv_finish+0xdd/0x560
[] ip_rcv+0x295/0x510
[] __netif_receive_skb_core+0x988/0x1020
[] __netif_receive_skb+0x21/0x70
[] process_backlog+0x6f/0x230
[] net_rx_action+0x229/0x420
[] __do_softirq+0xd8/0x43d
[] do_softirq_own_stack+0x1c/0x30
[] do_softirq+0x55/0x60
[] __local_bh_enable_ip+0xa8/0xb0
[] cpu_startup_entry+0x1c7/0x500
[] start_secondary+0x113/0x140to a SOFTIRQ-irq-unsafe lock:
(&head->lock){+.+...}
... which became SOFTIRQ-irq-unsafe at:
... [] __lock_acquire+0x82f/0x1f10
[] lock_acquire+0xbc/0x1b0
[] _raw_spin_lock+0x38/0x50
[] pcpu_freelist_pop+0x7a/0xb0
[] htab_map_alloc+0x50c/0x5f0
[] SyS_bpf+0x265/0x1200
[] entry_SYSCALL_64_fastpath+0x12/0x17other info that might help us debug this:
Chain exists of:
dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lockPossible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&head->lock);
local_irq_disable();
lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
lock(&htab->buckets[i].lock);
lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);*** DEADLOCK ***
Fixes: e19494edab82 ("bpf: introduce percpu_freelist")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 98159d977f71c3b3dee898d1c34e56f520b094e7 ]
Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.
While backporting Michael's "pipe: fix limit handling" patchset to a
distro-kernel, Mikulas noticed that current upstream pipe limit handling
contains a few problems:1 - procfs signed wrap: echo'ing a large number into
/proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
negative value.2 - round_pipe_size() nr_pages overflow on 32bit: this would
subsequently try roundup_pow_of_two(0), which is undefined.3 - visible non-rounded pipe-max-size value: there is no mutual
exclusion or protection between the time pipe_max_size is assigned
a raw value from proc_dointvec_minmax() and when it is rounded.4 - unsigned long -> unsigned int conversion makes for potential odd
return errors from do_proc_douintvec_minmax_conv() and
do_proc_dopipe_max_size_conv().This version underwent the same testing as v1:
https://marc.info/?l=linux-kernel&m=150643571406022&w=2This patch (of 4):
pipe_max_size is defined as an unsigned int:
unsigned int pipe_max_size = 1048576;
but its procfs/sysctl representation is an integer:
static struct ctl_table fs_table[] = {
...
{
.procname = "pipe-max-size",
.data = &pipe_max_size,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
...that is signed:
int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
size_t *lenp, loff_t *ppos)
{
...
ret = proc_dointvec_minmax(table, write, buf, lenp, ppos)This leads to signed results via procfs for large values of pipe_max_size:
% echo 2147483647 >/proc/sys/fs/pipe-max-size
% cat /proc/sys/fs/pipe-max-size
-2147483648Use unsigned operations on this variable to avoid such negative values.
Link: http://lkml.kernel.org/r/1507658689-11669-2-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence
Reported-by: Mikulas Patocka
Reviewed-by: Mikulas Patocka
Cc: Michael Kerrisk
Cc: Randy Dunlap
Cc: Al Viro
Cc: Jens Axboe
Cc: Josh Poimboeuf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman -
commit c07d35338081d107e57cf37572d8cc931a8e32e2 upstream.
kallsyms_symbol_next() returns a boolean (true on success). Currently
kdb_read() tests the return value with an inequality that
unconditionally evaluates to true.This is fixed in the obvious way and, since the conditional branch is
supposed to be unreachable, we also add a WARN_ON().Reported-by: Dan Carpenter
Signed-off-by: Daniel Thompson
Signed-off-by: Jason Wessel
Signed-off-by: Greg Kroah-Hartman -
commit 46febd37f9c758b05cd25feae8512f22584742fe upstream.
Commit 31487f8328f2 ("smp/cfd: Convert core to hotplug state machine")
accidently put this step on the wrong place. The step should be at the
cpuhp_ap_states[] rather than the cpuhp_bp_states[].grep smpcfd /sys/devices/system/cpu/hotplug/states
40: smpcfd:prepare
129: smpcfd:dying"smpcfd:dying" was missing before.
So was the invocation of the function smpcfd_dying_cpu().Fixes: 31487f8328f2 ("smp/cfd: Convert core to hotplug state machine")
Signed-off-by: Lai Jiangshan
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Richard Weinberger
Cc: Sebastian Andrzej Siewior
Cc: Boris Ostrovsky
Link: https://lkml.kernel.org/r/20171128131954.81229-1-jiangshanlai@gmail.com
Signed-off-by: Greg Kroah-Hartman
10 Dec, 2017
2 commits
-
[ Upstream commit a30b85df7d599f626973e9cd3056fe755bd778e0 ]
We want to wait for all potentially preempted kprobes trampoline
execution to have completed. This guarantees that any freed
trampoline memory is not in use by any task in the system anymore.
synchronize_rcu_tasks() gives such a guarantee, so use it.Also, this guarantees to wait for all potentially preempted tasks
on the instructions which will be replaced with a jump.Since this becomes a problem only when CONFIG_PREEMPT=y, enable
CONFIG_TASKS_RCU=y for synchronize_rcu_tasks() in that case.Signed-off-by: Masami Hiramatsu
Acked-by: Paul E. McKenney
Cc: Ananth N Mavinakayanahalli
Cc: Linus Torvalds
Cc: Naveen N . Rao
Cc: Paul E . McKenney
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/150845661962.5443.17724352636247312231.stgit@devbox
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit a9cd8194e1e6bd09619954721dfaf0f94fe2003e ]
Event timestamps are serialized using ctx->lock, make sure to hold it
over reading all values.Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
30 Nov, 2017
4 commits
-
commit 4f8413a3a799c958f7a10a6310a451e6b8aef5ad upstream.
When requesting a shared interrupt, we assume that the firmware
support code (DT or ACPI) has called irqd_set_trigger_type
already, so that we can retrieve it and check that the requester
is being reasonnable.Unfortunately, we still have non-DT, non-ACPI systems around,
and these guys won't call irqd_set_trigger_type before requesting
the interrupt. The consequence is that we fail the request that
would have worked before.We can either chase all these use cases (boring), or address it
in core code (easier). Let's have a per-irq_desc flag that
indicates whether irqd_set_trigger_type has been called, and
let's just check it when checking for a shared interrupt.
If it hasn't been set, just take whatever the interrupt
requester asks.Fixes: 382bd4de6182 ("genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs")
Reported-and-tested-by: Petr Cvek
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman -
commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Peter Zijlstra (Intel)
Cc: Clark Williams
Cc: Daniel Bristot de Oliveira
Cc: John Kacur
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Scott Wood
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):o CPU1 is waiting for expedited wait to complete:
sync_rcu_exp_select_cpus
rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
IPI sent to CPU5synchronize_sched_expedited_wait
ret = swait_event_timeout(rsp->expedited_wq,
sync_rcu_preempt_exp_done(rnp_root),
jiffies_stall);expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
Handles IPI
sync_sched_exp_handler
resched_cpu
returns while failing to try lock acquire rq->lock
need_resched is not seto CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
idle (schedule() is not called).o CPU 1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.Reported-by: Neeraj Upadhyay
Suggested-by: Neeraj Upadhyay
Signed-off-by: Paul E. McKenney
Acked-by: Steven Rostedt (VMware)
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Greg Kroah-Hartman -
commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There is a case where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.Consider this case for example:
- policy->cur is 1.2 GHz (Max)
- New request comes for 780 MHz and we store that in cached_raw_freq.
- Based on 780 MHz, we calculate the effective frequency as 800 MHz.
- We then see the CPU wasn't idle recently and choose to keep the next
freq as 1.2 GHz.
- Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
1.2 GHz.
- Now if the utilization doesn't change in then next request, then the
next target frequency will still be 780 MHz and it will match with
cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Signed-off-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman
24 Nov, 2017
1 commit
-
commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream.
The pending-callbacks check in rcu_prepare_for_idle() is backwards.
It should accelerate if there are pending callbacks, but the check
rather uselessly accelerates only if there are no callbacks. This commit
therefore inverts this check.Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
Signed-off-by: Neeraj Upadhyay
Signed-off-by: Paul E. McKenney
Signed-off-by: Greg Kroah-Hartman
10 Nov, 2017
1 commit
-
Pull final power management fixes from Rafael Wysocki:
"These fix a regression in the schedutil cpufreq governor introduced by
a recent change and blacklist Dell XPS13 9360 from using the Low Power
S0 Idle _DSM interface which triggers serious problems on one of these
machines.Specifics:
- Prevent the schedutil cpufreq governor from using the utilization
of a wrong CPU in some cases which started to happen after one of
the recent changes in it (Chris Redpath).- Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
interface as that causes serious issue (related to NVMe) to appear
on one of these machines, even though the other Dells XPS13 9360 in
somewhat different HW configurations behave correctly (Rafael
Wysocki)"* tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
cpufreq: schedutil: Examine the correct CPU when we update util
09 Nov, 2017
1 commit
-
* pm-cpufreq-sched:
cpufreq: schedutil: Examine the correct CPU when we update util