30 Nov, 2017
4 commits
-
commit 4f8413a3a799c958f7a10a6310a451e6b8aef5ad upstream.
When requesting a shared interrupt, we assume that the firmware
support code (DT or ACPI) has called irqd_set_trigger_type
already, so that we can retrieve it and check that the requester
is being reasonnable.Unfortunately, we still have non-DT, non-ACPI systems around,
and these guys won't call irqd_set_trigger_type before requesting
the interrupt. The consequence is that we fail the request that
would have worked before.We can either chase all these use cases (boring), or address it
in core code (easier). Let's have a per-irq_desc flag that
indicates whether irqd_set_trigger_type has been called, and
let's just check it when checking for a shared interrupt.
If it hasn't been set, just take whatever the interrupt
requester asks.Fixes: 382bd4de6182 ("genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs")
Reported-and-tested-by: Petr Cvek
Signed-off-by: Marc Zyngier
Signed-off-by: Greg Kroah-Hartman -
commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Peter Zijlstra (Intel)
Cc: Clark Williams
Cc: Daniel Bristot de Oliveira
Cc: John Kacur
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Scott Wood
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):o CPU1 is waiting for expedited wait to complete:
sync_rcu_exp_select_cpus
rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
IPI sent to CPU5synchronize_sched_expedited_wait
ret = swait_event_timeout(rsp->expedited_wq,
sync_rcu_preempt_exp_done(rnp_root),
jiffies_stall);expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
Handles IPI
sync_sched_exp_handler
resched_cpu
returns while failing to try lock acquire rq->lock
need_resched is not seto CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
idle (schedule() is not called).o CPU 1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.Reported-by: Neeraj Upadhyay
Suggested-by: Neeraj Upadhyay
Signed-off-by: Paul E. McKenney
Acked-by: Steven Rostedt (VMware)
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Greg Kroah-Hartman -
commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There is a case where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.Consider this case for example:
- policy->cur is 1.2 GHz (Max)
- New request comes for 780 MHz and we store that in cached_raw_freq.
- Based on 780 MHz, we calculate the effective frequency as 800 MHz.
- We then see the CPU wasn't idle recently and choose to keep the next
freq as 1.2 GHz.
- Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
1.2 GHz.
- Now if the utilization doesn't change in then next request, then the
next target frequency will still be 780 MHz and it will match with
cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Signed-off-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman
24 Nov, 2017
1 commit
-
commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream.
The pending-callbacks check in rcu_prepare_for_idle() is backwards.
It should accelerate if there are pending callbacks, but the check
rather uselessly accelerates only if there are no callbacks. This commit
therefore inverts this check.Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
Signed-off-by: Neeraj Upadhyay
Signed-off-by: Paul E. McKenney
Signed-off-by: Greg Kroah-Hartman
10 Nov, 2017
1 commit
-
Pull final power management fixes from Rafael Wysocki:
"These fix a regression in the schedutil cpufreq governor introduced by
a recent change and blacklist Dell XPS13 9360 from using the Low Power
S0 Idle _DSM interface which triggers serious problems on one of these
machines.Specifics:
- Prevent the schedutil cpufreq governor from using the utilization
of a wrong CPU in some cases which started to happen after one of
the recent changes in it (Chris Redpath).- Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
interface as that causes serious issue (related to NVMe) to appear
on one of these machines, even though the other Dells XPS13 9360 in
somewhat different HW configurations behave correctly (Rafael
Wysocki)"* tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
cpufreq: schedutil: Examine the correct CPU when we update util
09 Nov, 2017
1 commit
-
* pm-cpufreq-sched:
cpufreq: schedutil: Examine the correct CPU when we update util
07 Nov, 2017
1 commit
-
Pull workqueue fix from Tejun Heo:
"Another fix for a really old bug.It only affects drain_workqueue() which isn't used often and even then
triggers only during a pretty small race window, so it isn't too
surprising that it stayed hidden for so long.The fix is straight-forward and low-risk. Kudos to Li Bin for
reporting and fixing the bug"* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix NULL pointer dereference
06 Nov, 2017
1 commit
-
Pull perf fixes from Ingo Molnar:
"Various fixes:- synchronize kernel and tooling headers
- cgroup support fix
- two tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/headers: Synchronize kernel ABI headers
perf/cgroup: Fix perf cgroup hierarchy support
perf tools: Unwind properly location after REJECT
perf symbols: Fix memory corruption because of zero length symbols
05 Nov, 2017
1 commit
-
After commit 674e75411fc2 (sched: cpufreq: Allow remote cpufreq
callbacks) we stopped to always read the utilization for the CPU we
are running the governor on, and instead we read it for the CPU
which we've been told has updated utilization. This is stored in
sugov_cpu->cpu.The value is set in sugov_register() but we clear it in sugov_start()
which leads to always looking at the utilization of CPU0 instead of
the correct one.Fix this by consolidating the initialization code into sugov_start().
Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
Signed-off-by: Chris Redpath
Reviewed-by: Patrick Bellasi
Reviewed-by: Brendan Jackman
Acked-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
04 Nov, 2017
1 commit
-
We want to fix an objtool build warning that got introduced in the latest upstream kernel.
Signed-off-by: Ingo Molnar
03 Nov, 2017
2 commits
-
Signed-off-by: Ingo Molnar
-
…el/git/gregkh/driver-core
Pull initial SPDX identifiers from Greg KH:
"License cleanup: add SPDX license identifiers to some filesMany source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the
'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
binding shorthand, which can be used instead of the full boiler plate
text.This patch is based on work done by Thomas Gleixner and Kate Stewart
and Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset
of the use cases:- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to
license had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied
to a file was done in a spreadsheet of side by side results from of
the output of two independent scanners (ScanCode & Windriver)
producing SPDX tag:value files created by Philippe Ombredanne.
Philippe prepared the base worksheet, and did an initial spot review
of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537
files assessed. Kate Stewart did a file by file comparison of the
scanner results in the spreadsheet to determine which SPDX license
identifier(s) to be applied to the file. She confirmed any
determination that was not immediately clear with lawyers working with
the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained
>5 lines of source- File already had some variant of a license header in it (even if <5
lines).All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
was:SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that
became the concluded license(s).- when there was disagreement between the two scanners (one detected
a license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply
(and which scanner probably needed to revisit its heuristics).- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases,
confirmation by lawyers working with the Linux Foundation.Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.
The Windriver scanner is based on an older version of FOSSology in
part, so they are related.Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot
checks in about 15000 files.In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect
the correct identifier.Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial
patch version early this week with:- a full scancode scan run, collecting the matched texts, detected
license ids and scores- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct- reviewing anything where there was no detection but the patch
license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
applied SPDX license was correctThis produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
License cleanup: add SPDX license identifier to uapi header files with a license
License cleanup: add SPDX license identifier to uapi header files with no license
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
02 Nov, 2017
6 commits
-
In commit 30d6e0a4190d ("futex: Remove duplicated code and fix undefined
behaviour"), I let FUTEX_WAKE_OP to fail on invalid op. Namely when op
should be considered as shift and the shift is out of range (< 0 or > 31).But strace's test suite does this madness:
futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee);
futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xbadfaced);
futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xffffffff);When I pick the first 0xa0caffee, it decodes as:
0x80000000 & 0xa0caffee: oparg is shift
0x70000000 & 0xa0caffee: op is FUTEX_OP_OR
0x0f000000 & 0xa0caffee: cmp is FUTEX_OP_CMP_EQ
0x00fff000 & 0xa0caffee: oparg is sign-extended 0xcaf = -849
0x00000fff & 0xa0caffee: cmparg is sign-extended 0xfee = -18That means the op tries to do this:
(futex |= (1 << (-849))) == -18
which is completely bogus. The new check of op in the code is:
if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
if (oparg < 0 || oparg > 31)
return -EINVAL;
oparg = 1 << oparg;
}which results obviously in the "Invalid argument" errno:
FAIL: futex
===========futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee) = -1: Invalid argument
futex.test: failed test: ../futex failed with code 1So let us soften the failure to print only a (ratelimited) message, crop
the value and continue as if it were right. When userspace keeps up, we
can switch this to return -EINVAL again.[v2] Do not return 0 immediatelly, proceed with the cropped value.
Fixes: 30d6e0a4190d ("futex: Remove duplicated code and fix undefined behaviour")
Signed-off-by: Jiri Slaby
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Darren Hart
Signed-off-by: Linus Torvalds -
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.By default all files without license information are under the default
license of the kernel, which is GPL version 2.Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman -
Pull signal bugfix from Eric Biederman:
"When making the generic support for SIGEMT conditional on the presence
of SIGEMT I made a typo that causes it to fail to activate. It was
noticed comparatively quickly but the bug report just made it to me
today"* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Fix name of SIGEMT in #if defined() check -
Commit cc731525f26a ("signal: Remove kernel interal si_code magic")
added a check for SIGMET and NSIGEMT being defined. That SIGMET should
in fact be SIGEMT, with SIGEMT being defined in
arch/{alpha,mips,sparc}/include/uapi/asm/signal.hThis was actually pointed out by BenHutchings in a lwn.net comment
here https://lwn.net/Comments/734608/Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
Signed-off-by: Andrew Clayton
Signed-off-by: "Eric W. Biederman" -
Guenter reported:
There is still a problem. When running
echo 6 > /proc/sys/kernel/watchdog_thresh
echo 5 > /proc/sys/kernel/watchdog_thresh
repeatedly, the messageNMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
stops after a while (after ~10-30 iterations, with fluctuations).
Maybe watchdog_cpus needs to be atomic ?That's correct as this again is affected by the asynchronous nature of the
smpboot thread unpark mechanism.CPU 0 CPU1 CPU2
write(watchdog_thresh, 6)
stop()
park()
update()
start()
unpark()
thread->unpark()
cnt++;
write(watchdog_thresh, 5) thread->unpark()
stop()
park() thread->park()
cnt--; cnt++;
update()
start()
unpark()That's not a functional problem, it just affects the informational message.
Convert watchdog_cpus to atomic_t to prevent the problem
Reported-and-tested-by: Guenter Roeck
Signed-off-by: Don Zickus
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Link: https://lkml.kernel.org/r/20171101181126.j727fqjmdthjz4xk@redhat.com -
…fy deferred event destroy")
Guenter reported a crash in the watchdog/perf code, which is caused by
cleanup() and enable() running concurrently. The reason for this is:The watchdog functions are serialized via the watchdog_mutex and cpu
hotplug locking, but the enable of the perf based watchdog happens in
context of the unpark callback of the smpboot thread. But that unpark
function is not synchronous inside the locking. The unparking of the thread
just wakes it up and leaves so there is no guarantee when the thread is
executing.If it starts running _before_ the cleanup happened then it will create a
event and overwrite the dead event pointer. The new event is then cleaned
up because the event is marked dead.lock(watchdog_mutex);
lockup_detector_reconfigure();
cpus_read_lock();
stop();
park()
update();
start();
unpark()
cpus_read_unlock(); thread runs()
overwrite dead event ptr
cleanup();
free new event, which is active inside perf....
unlock(watchdog_mutex);The park side is safe as that actually waits for the thread to reach
parked state.Commit a33d44843d45 removed the protection against this kind of scenario
under the stupid assumption that the hotplug serialization and the
watchdog_mutex cover everything.Bring it back.
Reverts: a33d44843d45 ("watchdog/hardlockup/perf: Simplify deferred event destroy")
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Feels-stupid Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710312145190.1942@nanos
01 Nov, 2017
2 commits
-
Dmitry (through syzbot) reported being able to trigger the WARN in
get_pi_state() and a use-after-free on:raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
Both are due to this race:
exit_pi_state_list() put_pi_state()
lock(&curr->pi_lock)
while() {
pi_state = list_first_entry(head);
hb = hash_futex(&pi_state->key);
unlock(&curr->pi_lock);dec_and_test(&pi_state->refcount);
lock(&hb->lock)
lock(&pi_state->pi_mutex.wait_lock) // uaf if pi_state free'd
lock(&curr->pi_lock);....
unlock(&curr->pi_lock);
get_pi_state(); // WARN; refcount==0The problem is we take the reference count too late, and don't allow it
being 0. Fix it by using inc_not_zero() and simply retrying the loop
when we fail to get a refcount. In that case put_pi_state() should
remove the entry from the list.Reported-by: Dmitry Vyukov
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Cc: Gratian Crisan
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: dvhart@infradead.org
Cc: syzbot
Cc: syzkaller-bugs@googlegroups.com
Cc:
Fixes: c74aef2d06a9 ("futex: Fix pi_state->owner serialization")
Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdfhjs@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
Now that SK_REDIRECT is no longer a valid return code. Remove it
from the UAPI completely. Then do a namespace remapping internal
to sockmap so SK_REDIRECT is no longer externally visible.Patchs primary change is to do a namechange from SK_REDIRECT to
__SK_REDIRECTReported-by: Alexei Starovoitov
Signed-off-by: John Fastabend
Signed-off-by: David S. Miller
30 Oct, 2017
2 commits
-
When queue_work() is used in irq (not in task context), there is
a potential case that trigger NULL pointer dereference.
----------------------------------------------------------------
worker_thread()
|-spin_lock_irq()
|-process_one_work()
|-worker->current_pwq = pwq
|-spin_unlock_irq()
|-worker->current_func(work)
|-spin_lock_irq()
|-worker->current_pwq = NULL
|-spin_unlock_irq()//interrupt here
|-irq_handler
|-__queue_work()
//assuming that the wq is draining
|-is_chained_work(wq)
|-current_wq_worker()
//Here, 'current' is the interrupted worker!
|-current->current_pwq is NULL here!
|-schedule()
----------------------------------------------------------------Avoid it by checking for task context in current_wq_worker(), and
if not in task context, we shouldn't use the 'current' to check the
condition.Reported-by: Xiaofei Tan
Signed-off-by: Li Bin
Reviewed-by: Lai Jiangshan
Signed-off-by: Tejun Heo
Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
Cc: stable@vger.kernel.org # v3.9+ -
The following commit:
864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event
targets %current's cgroup.This breaks perf_event's hierarchical support because events which target one
of the ancestors get ignored.Fix it by using cgroup_is_descendant() test instead of equality.
Signed-off-by: Tejun Heo
Acked-by: Thomas Gleixner
Cc: Arnaldo Carvalho de Melo
Cc: David Carrillo-Cisneros
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: kernel-team@fb.com
Cc: stable@vger.kernel.org # v4.9+
Fixes: 864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
Link: http://lkml.kernel.org/r/20171028164237.GA972780@devbig577.frc2.facebook.com
Signed-off-by: Ingo Molnar
29 Oct, 2017
3 commits
-
Pull networking fixes from David Miller:
1) Fix route leak in xfrm_bundle_create().
2) In mac80211, validate user rate mask before configuring it. From
Johannes Berg.3) Properly enforce memory limits in fair queueing code, from Toke
Hoiland-Jorgensen.4) Fix lockdep splat in inet_csk_route_req(), from Eric Dumazet.
5) Fix TSO header allocation and management in mvpp2 driver, from Yan
Markman.6) Don't take socket lock in BH handler in strparser code, from Tom
Herbert.7) Don't show sockets from other namespaces in AF_UNIX code, from
Andrei Vagin.8) Fix double free in error path of tap_open(), from Girish Moodalbail.
9) Fix TX map failure path in igb and ixgbe, from Jean-Philippe Brucker
and Alexander Duyck.10) Fix DCB mode programming in stmmac driver, from Jose Abreu.
11) Fix err_count handling in various tunnels (ipip, ip6_gre). From Xin
Long.12) Properly align SKB head before building SKB in tuntap, from Jason
Wang.13) Avoid matching qdiscs with a zero handle during lookups, from Cong
Wang.14) Fix various endianness bugs in sctp, from Xin Long.
15) Fix tc filter callback races and add selftests which trigger the
problem, from Cong Wang.* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
selftests: Introduce a new test case to tc testsuite
selftests: Introduce a new script to generate tc batch file
net_sched: fix call_rcu() race on act_sample module removal
net_sched: add rtnl assertion to tcf_exts_destroy()
net_sched: use tcf_queue_work() in tcindex filter
net_sched: use tcf_queue_work() in rsvp filter
net_sched: use tcf_queue_work() in route filter
net_sched: use tcf_queue_work() in u32 filter
net_sched: use tcf_queue_work() in matchall filter
net_sched: use tcf_queue_work() in fw filter
net_sched: use tcf_queue_work() in flower filter
net_sched: use tcf_queue_work() in flow filter
net_sched: use tcf_queue_work() in cgroup filter
net_sched: use tcf_queue_work() in bpf filter
net_sched: use tcf_queue_work() in basic filter
net_sched: introduce a workqueue for RCU callbacks of tc filter
sctp: fix some type cast warnings introduced since very beginning
sctp: fix a type cast warnings that causes a_rwnd gets the wrong value
sctp: fix some type cast warnings introduced by transport rhashtable
sctp: fix some type cast warnings introduced by stream reconf
... -
Recent additions to support multiple programs in cgroups impose
a strict requirement, "all yes is yes, any no is no". To enforce
this the infrastructure requires the 'no' return code, SK_DROP in
this case, to be 0.To apply these rules to SK_SKB program types the sk_actions return
codes need to be adjusted.This fix adds SK_PASS and makes 'SK_DROP = 0'. Finally, remove
SK_ABORTED to remove any chance that the API may allow aborted
program flows to be passed up the stack. This would be incorrect
behavior and allow programs to break existing policies.Signed-off-by: John Fastabend
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
SK_SKB program types use bpf_compute_data to store the end of the
packet data. However, bpf_compute_data assumes the cb is stored in the
qdisc layer format. But, for SK_SKB this is the wrong layer of the
stack for this type.It happens to work (sort of!) because in most cases nothing happens
to be overwritten today. This is very fragile and error prone.
Fortunately, we have another hole in tcp_skb_cb we can use so lets
put the data_end value there.Note, SK_SKB program types do not use data_meta, they are failed by
sk_skb_is_valid_access().Signed-off-by: John Fastabend
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
23 Oct, 2017
1 commit
-
Pull workqueue fix from Tejun Heo:
"This is a fix for an old bug in workqueue. Workqueue used a mutex to
arbitrate who gets to be the manager of a pool. When the manager role
gets released, the mutex gets unlocked while holding the pool's
irqsafe spinlock. This can lead to deadlocks as mutex's internal
spinlock isn't irqsafe. This got discovered by recent fixes to mutex
lockdep annotations.The fix is a bit invasive for rc6 but if anything were wrong with the
fix it would likely have already blown up in -next, and we want the
fix in -stable anyway"* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: replace pool->manager_arb mutex with a flag
22 Oct, 2017
6 commits
-
Pull smp/hotplug fix from Thomas Gleixner:
"The recent rework of the callback invocation missed to cleanup the
leftovers of the operation, so under certain circumstances a
subsequent CPU hotplug operation accesses stale data and crashes.
Clean it up."* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Reset node state after operation -
Pull irq fixes from Thomas Gleixner:
"A set of small fixes mostly in the irq drivers area:- Make the tango irq chip work correctly, which requires a new
function in the generiq irq chip implementation- A set of updates to the GIC-V3 ITS driver removing a bogus BUG_ON()
and parsing the VCPU table size correctly"* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: generic chip: remove irq_gc_mask_disable_reg_and_ack()
irqchip/tango: Use irq_gc_mask_disable_and_ack_set
genirq: generic chip: Add irq_gc_mask_disable_and_ack_set()
irqchip/gic-v3-its: Add missing changes to support 52bit physical address
irqchip/gic-v3-its: Fix the incorrect parsing of VCPU table size
irqchip/gic-v3-its: Fix the incorrect BUG_ON in its_init_vpe_domain()
DT: arm,gic-v3: Update the ITS size in the examples -
Pull networking fixes from David Miller:
"A little more than usual this time around. Been travelling, so that is
part of it.Anyways, here are the highlights:
1) Deal with memcontrol races wrt. listener dismantle, from Eric
Dumazet.2) Handle page allocation failures properly in nfp driver, from Jaku
Kicinski.3) Fix memory leaks in macsec, from Sabrina Dubroca.
4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault.
5) Several fixes in bnxt_en driver, including preventing potential
NVRAM parameter corruption from Michael Chan.6) Fix for KRACK attacks in wireless, from Johannes Berg.
7) rtnetlink event generation fixes from Xin Long.
8) Deadlock in mlxsw driver, from Ido Schimmel.
9) Disallow arithmetic operations on context pointers in bpf, from
Jakub Kicinski.10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from
Xin Long.11) Only TCP is supported for sockmap, make that explicit with a
check, from John Fastabend.12) Fix IP options state races in DCCP and TCP, from Eric Dumazet.
13) Fix panic in packet_getsockopt(), also from Eric Dumazet.
14) Add missing locked in hv_sock layer, from Dexuan Cui.
15) Various aquantia bug fixes, including several statistics handling
cures. From Igor Russkikh et al.16) Fix arithmetic overflow in devmap code, from John Fastabend.
17) Fix busted socket memory accounting when we get a fault in the tcp
zero copy paths. From Willem de Bruijn.18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
stmmac: Don't access tx_q->dirty_tx before netif_tx_lock
ipv6: flowlabel: do not leave opt->tot_len with garbage
of_mdio: Fix broken PHY IRQ in case of probe deferral
textsearch: fix typos in library helpers
rxrpc: Don't release call mutex on error pointer
net: stmmac: Prevent infinite loop in get_rx_timestamp_status()
net: stmmac: Fix stmmac_get_rx_hwtstamp()
net: stmmac: Add missing call to dev_kfree_skb()
mlxsw: spectrum_router: Configure TIGCR on init
mlxsw: reg: Add Tunneling IPinIP General Configuration Register
net: ethtool: remove error check for legacy setting transceiver type
soreuseport: fix initialization race
net: bridge: fix returning of vlan range op errors
sock: correct sk_wmem_queued accounting on efault in tcp zerocopy
bpf: add test cases to bpf selftests to cover all access tests
bpf: fix pattern matches for direct packet access
bpf: fix off by one for range markings with L{T, E} patterns
bpf: devmap fix arithmetic overflow in bitmap_size calculation
net: aquantia: Bad udp rate on default interrupt coalescing
net: aquantia: Enable coalescing management via ethtool interface
... -
Alexander had a test program with direct packet access, where
the access test was in the form of data + X > data_end. In an
unrelated change to the program LLVM decided to swap the branches
and emitted code for the test in form of data + X
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Acked-by: John Fastabend
Signed-off-by: David S. Miller -
During review I noticed that the current logic for direct packet
access marking in check_cond_jmp_op() has an off by one for the
upper right range border when marking in find_good_pkt_pointers()
with BPF_JLT and BPF_JLE. It's not really harmful given access
up to pkt_end is always safe, but we should nevertheless correct
the range marking before it becomes ABI. If pkt_data' denotes a
pkt_data derived pointer (pkt_data + X), then for pkt_data' < pkt_end
in the true branch as well as for pkt_end < pkt_end the verifier simulation cannot
deduce that a byte load of pkt_data' - 1 would succeed in this
branch.Fixes: b4e432f1000a ("bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier")
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Acked-by: John Fastabend
Signed-off-by: David S. Miller -
An integer overflow is possible in dev_map_bitmap_size() when
calculating the BITS_TO_LONG logic which becomes, after macro
replacement,(((n) + (d) - 1)/ (d))
where 'n' is a __u32 and 'd' is (8 * sizeof(long)). To avoid
overflow cast to u64 before arithmetic.Reported-by: Richard Weinberger
Acked-by: Daniel Borkmann
Signed-off-by: John Fastabend
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
21 Oct, 2017
2 commits
-
The recent rework of the cpu hotplug internals changed the usage of the per
cpu state->node field, but missed to clean it up after usage.So subsequent hotplug operations use the stale pointer from a previous
operation and hand it into the callback functions. The callbacks then
dereference a pointer which either belongs to a different facility or
points to freed and potentially reused memory. In either case data
corruption and crashes are the obvious consequence.Reset the node and the last pointers in the per cpu state to NULL after the
operation which set them has completed.Fixes: 96abb968549c ("smp/hotplug: Allow external multi-instance rollback")
Reported-by: Tvrtko Ursulin
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Sebastian Andrzej Siewior
Cc: Boris Ostrovsky
Cc: "Paul E. McKenney"
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710211606130.3213@nanos -
As pointed out by Linus and David, the earlier waitid() fix resulted in
a (currently harmless) unbalanced user_access_end() call. This fixes it
to just directly return EFAULT on access_ok() failure.Fixes: 96ca579a1ecc ("waitid(): Add missing access_ok() checks")
Acked-by: David Daney
Cc: Al Viro
Signed-off-by: Kees Cook
Signed-off-by: Linus Torvalds
20 Oct, 2017
5 commits
-
Devmap is used with XDP which requires CAP_NET_ADMIN so lets also
make CAP_NET_ADMIN required to use the map.Signed-off-by: John Fastabend
Acked-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
Restrict sockmap to CAP_NET_ADMIN.
Signed-off-by: John Fastabend
Acked-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
SK_SKB BPF programs are run from the socket/tcp context but early in
the stack before much of the TCP metadata is needed in tcp_skb_cb. So
we can use some unused fields to place BPF metadata needed for SK_SKB
programs when implementing the redirect function.This allows us to drop the preempt disable logic. It does however
require an API change so sk_redirect_map() has been updated to
additionally provide ctx_ptr to skb. Note, we do however continue to
disable/enable preemption around actual BPF program running to account
for map updates.Signed-off-by: John Fastabend
Acked-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
Only TCP sockets have been tested and at the moment the state change
callback only handles TCP sockets. This adds a check to ensure that
sockets actually being added are TCP sockets.For net-next we can consider UDP support.
Signed-off-by: John Fastabend
Acked-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller -
Because many of RCU's files have not been included into docbook, a
number of errors have accumulated. This commit fixes them.Signed-off-by: Paul E. McKenney
Signed-off-by: Linus Torvalds