01 Oct, 2020

1 commit

  • [ Upstream commit 6914303824bb572278568330d72fc1f8f9814e67 ]

    This changes perf_event_set_clock to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This should be safe, as the credentials are only used for reading.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Bernd Edlinger
     

26 Aug, 2020

1 commit

  • commit c17c3dc9d08b9aad9a55a1e53f205187972f448e upstream.

    syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), when
    called from uprobes __replace_page(). Of the many ways to fix it, settle
    on not calling it when the page is PageCompound (since head and tail are
    equivalent in this context, PageCompound is the usual check in uprobes.c,
    and the prior use of FOLL_SPLIT_PMD will have cleared PageMlocked already).

    Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
    Reported-by: syzbot
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Srikar Dronamraju
    Acked-by: Song Liu
    Acked-by: Oleg Nesterov
    Cc: "Kirill A. Shutemov"
    Cc: [5.4+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008161338360.20413@eggly.anvils
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

11 Aug, 2020

1 commit

  • commit 90c91dfb86d0ff545bd329d3ddd72c147e2ae198 upstream.

    Kan and Andi reported that we fail to kill rotation when the flexible
    events go empty, but the context does not.

    Fixes: fd7d55172d1e ("perf/cgroups: Don't rotate events for cgroups unnecessarily")
    Reported-by: Andi Kleen
    Reported-by: Kan Liang
    Tested-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200305123851.GX2596@hirez.programming.kicks-ass.net
    Cc: Robin Murphy
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

29 Jul, 2020

1 commit

  • commit fe5ed7ab99c656bd2f5b79b49df0e9ebf2cead8a upstream.

    If a tracee is uprobed and it hits int3 inserted by debugger, handle_swbp()
    does send_sig(SIGTRAP, current, 0) which means si_code == SI_USER. This used
    to work when this code was written, but then GDB started to validate si_code
    and now it simply can't use breakpoints if the tracee has an active uprobe:

    # cat test.c
    void unused_func(void)
    {
    }

    int main(void)
    {
            return 0;
    }

    # gcc -g test.c -o test
    # perf probe -x ./test -a unused_func
    # perf record -e probe_test:unused_func gdb ./test -ex run
    GNU gdb (GDB) 10.0.50.20200714-git
    ...
    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x00007ffff7ddf909 in dl_main () from /lib64/ld-linux-x86-64.so.2
    (gdb)

    The tracee hits the internal breakpoint inserted by GDB to monitor shared
    library events but GDB misinterprets this SIGTRAP and reports a signal.

    Change handle_swbp() to use force_sig(SIGTRAP); this matches do_int3_user()
    and fixes the problem.

    This is the minimal fix for -stable. arch/x86/kernel/uprobes.c is equally
    wrong; it should use send_sigtrap(TRAP_TRACE) instead of send_sig(SIGTRAP),
    but that case doesn't confuse GDB and needs a separate x86-specific patch.

    Reported-by: Aaron Merey
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Ingo Molnar
    Reviewed-by: Srikar Dronamraju
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200723154420.GA32043@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

17 Jun, 2020

1 commit

  • commit 2ed6edd33a214bca02bd2b45e3fc3038a059436b upstream.

    Under rare circumstances, task_function_call() can repeatedly fail and
    cause a soft lockup.

    There is a slight race where the process is no longer running on the cpu
    we targeted by the time remote_function() runs. The code will simply
    try again. If we are very unlucky, this will continue to fail, until a
    watchdog fires. This can happen in a heavily loaded, multi-core virtual
    machine.

    Reported-by: syzbot+bb4935a5c09b5ff79940@syzkaller.appspotmail.com
    Signed-off-by: Barret Rhoden
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200414222920.121401-1-brho@google.com
    Signed-off-by: Greg Kroah-Hartman

    Barret Rhoden
     

11 Jun, 2020

1 commit

  • commit 013b2deba9a6b80ca02f4fafd7dedf875e9b4450 upstream.

    uprobe_write_opcode() must not cross a page boundary; prepare_uprobe()
    relies on arch_uprobe_analyze_insn(), which should validate "vaddr", but
    some architectures (csky, s390, and sparc) don't do this.

    We can remove the BUG_ON() check in prepare_uprobe() and validate the
    offset early in __uprobe_register(). The new IS_ALIGNED() check matches
    the alignment check in arch_prepare_kprobe() on supported architectures,
    so all insns should be aligned to UPROBE_SWBP_INSN_SIZE.

    Another problem is __update_ref_ctr(), which was wrong from the very
    beginning: it can read/write outside of the kmap'ed page unless "vaddr"
    is aligned to sizeof(short), so __uprobe_register() should check this too.

    Reported-by: Linus Torvalds
    Suggested-by: Linus Torvalds
    Signed-off-by: Oleg Nesterov
    Reviewed-by: Srikar Dronamraju
    Acked-by: Christian Borntraeger
    Tested-by: Sven Schnelle
    Cc: Steven Rostedt
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

02 May, 2020

1 commit

  • commit f3bed55e850926614b9898fe982f66d2541a36a5 upstream.

    Current logic yields the child task as the parent.

    Before:
    $ perf record bash -c "perf list > /dev/null"
    $ perf script -D |grep 'FORK\|EXIT'
    4387036190981094 0x5a70 [0x30]: PERF_RECORD_FORK(10472:10472):(10470:10470)
    4387036606207580 0xf050 [0x30]: PERF_RECORD_EXIT(10472:10472):(10472:10472)
    4387036607103839 0x17150 [0x30]: PERF_RECORD_EXIT(10470:10470):(10470:10470)

    Note the repeated values in the second line: (10472:10472):(10472:10472)

    After:
    383281514043 0x9d8 [0x30]: PERF_RECORD_FORK(2268:2268):(2266:2266)
    383442003996 0x2180 [0x30]: PERF_RECORD_EXIT(2268:2268):(2266:2266)
    383451297778 0xb70 [0x30]: PERF_RECORD_EXIT(2266:2266):(2265:2265)

    Fixes: 94d5d1b2d891 ("perf_counter: Report the cloning task as parent on perf_counter_fork()")
    Reported-by: KP Singh
    Signed-off-by: Ian Rogers
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200417182842.12522-1-irogers@google.com
    Signed-off-by: Greg Kroah-Hartman

    Ian Rogers
     

29 Apr, 2020

1 commit

  • [ Upstream commit d3296fb372bf7497b0e5d0478c4e7a677ec6f6e9 ]

    We hit the following warning when running tests on a kernel
    compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y:

    WARNING: CPU: 19 PID: 4472 at mm/gup.c:2381 __get_user_pages_fast+0x1a4/0x200
    CPU: 19 PID: 4472 Comm: dummy Not tainted 5.6.0-rc6+ #3
    RIP: 0010:__get_user_pages_fast+0x1a4/0x200
    ...
    Call Trace:
    perf_prepare_sample+0xff1/0x1d90
    perf_event_output_forward+0xe8/0x210
    __perf_event_overflow+0x11a/0x310
    __intel_pmu_pebs_event+0x657/0x850
    intel_pmu_drain_pebs_nhm+0x7de/0x11d0
    handle_pmi_common+0x1b2/0x650
    intel_pmu_handle_irq+0x17b/0x370
    perf_event_nmi_handler+0x40/0x60
    nmi_handle+0x192/0x590
    default_do_nmi+0x6d/0x150
    do_nmi+0x2f9/0x3c0
    nmi+0x8e/0xd7

    While __get_user_pages_fast() is IRQ-safe, it calls access_ok(),
    which warns on:

    WARN_ON_ONCE(!in_task() && !pagefault_disabled())

    Peter suggested disabling page faults around __get_user_pages_fast(),
    which gets rid of the warning in the access_ok() call.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200407141427.3184722-1-jolsa@kernel.org
    Signed-off-by: Sasha Levin

    Jiri Olsa
     

11 Feb, 2020

1 commit

  • commit 003461559ef7a9bd0239bae35a22ad8924d6e9ad upstream.

    Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
    a perf ring buffer may lead to an integer underflow in locked memory
    accounting. This may lead to undesired behaviors, such as failures in
    BPF map creation.

    Address this by adjusting the accounting logic to take into account the
    possibility that the amount of already locked memory may exceed the
    current limit.

    Fixes: c4b75479741c ("perf/core: Make the mlock accounting simple again")
    Suggested-by: Alexander Shishkin
    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc:
    Acked-by: Alexander Shishkin
    Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
    Signed-off-by: Greg Kroah-Hartman

    Song Liu
     

23 Jan, 2020

1 commit

  • commit da9ec3d3dd0f1240a48920be063448a2242dbd90 upstream.

    Vince reports a worrying issue:

    | so I was tracking down some odd behavior in the perf_fuzzer which turns
    | out to be because perf_event_open() sometimes returns 0 (indicating a file
    | descriptor of 0) even though as far as I can tell stdin is still open.

    ... and further the cause:

    | error is triggered if aux_sample_size has non-zero value.
    |
    | seems to be this line in kernel/events/core.c:
    |
    | if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader))
    | goto err_locked;
    |
    | (note, err is never set)

    This seems to be a thinko in commit:

    ab43762ef010967e ("perf: Allow normal events to output AUX data")

    ... and we should probably return -EINVAL here, as this should only
    happen when the new event is mis-configured or does not have a
    compatible aux_event group leader.

    Fixes: ab43762ef010967e ("perf: Allow normal events to output AUX data")
    Reported-by: Vince Weaver
    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Alexander Shishkin
    Tested-by: Vince Weaver
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     

31 Dec, 2019

1 commit

  • [ Upstream commit 36b3db03b4741b8935b68fffc7e69951d8d70a89 ]

    Commit:

    5e6c3c7b1ec2 ("perf/aux: Fix tracking of auxiliary trace buffer allocation")

    tried to guess the correct combination of arithmetic operations that would
    undo the AUX buffer's mlock accounting, and failed, leaking the bottom part
    when an allocation needs to be charged partially to both user->locked_vm
    and mm->pinned_vm, eventually leaving the user with no locked bonus:

    $ perf record -e intel_pt//u -m1,128 uname
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.061 MB perf.data ]

    $ perf record -e intel_pt//u -m1,128 uname
    Permission error mapping pages.
    Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
    or try again with a smaller value of -m/--mmap_pages.
    (current value: 1,128)

    Fix this by subtracting both locked and pinned counts when AUX buffer is
    unmapped.

    Reported-by: Thomas Richter
    Tested-by: Thomas Richter
    Signed-off-by: Alexander Shishkin
    Acked-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Alexander Shishkin
     

13 Nov, 2019

6 commits

  • It looks like a "static inline" has been missed in front
    of the empty definition of perf_cgroup_switch() under
    certain configurations.

    Fixes the following sparse warning:

    kernel/events/core.c:1035:1: warning: symbol 'perf_cgroup_switch' was not declared. Should it be static?

    Signed-off-by: Ben Dooks (Codethink)
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: https://lkml.kernel.org/r/20191106132527.19977-1-ben.dooks@codethink.co.uk
    Signed-off-by: Ingo Molnar

    Ben Dooks (Codethink)
     
  • Commit:

    313ccb9615948 ("perf: Allocate context task_ctx_data for child event")

    makes the inherit path skip over the current event in case of task_ctx_data
    allocation failure. This, however, is inconsistent with allocation failures
    in perf_event_alloc(), which would abort the fork.

    Correct this by returning an error code on task_ctx_data allocation
    failure and failing the fork in that case.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: https://lkml.kernel.org/r/20191105075702.60319-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Commit

    ab43762ef0109 ("perf: Allow normal events to output AUX data")

    added the 'aux_output' bit to the attribute structure, which relies on
    AUX events and grouping, neither of which is supported for kernel
    events. This notwithstanding, attempts have been made to use it in
    kernel code, suggesting the necessity of an explicit hard -EINVAL.

    Fix this by rejecting attributes with aux_output set for kernel events.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: https://lkml.kernel.org/r/20191030134731.5437-3-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • A comment is in a wrong place in perf_event_create_kernel_counter().
    Fix that.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: https://lkml.kernel.org/r/20191030134731.5437-2-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Commit

    f733c6b508bc ("perf/core: Fix inheritance of aux_output groups")

    adds a NULL pointer dereference in case inherit_group() races with
    perf_release(), which causes the below crash:

    > BUG: kernel NULL pointer dereference, address: 000000000000010b
    > #PF: supervisor read access in kernel mode
    > #PF: error_code(0x0000) - not-present page
    > PGD 3b203b067 P4D 3b203b067 PUD 3b2040067 PMD 0
    > Oops: 0000 [#1] SMP KASAN
    > CPU: 0 PID: 315 Comm: exclusive-group Tainted: G B 5.4.0-rc3-00181-g72e1839403cb-dirty #878
    > RIP: 0010:perf_get_aux_event+0x86/0x270
    > Call Trace:
    > ? __perf_read_group_add+0x3b0/0x3b0
    > ? __kasan_check_write+0x14/0x20
    > ? __perf_event_init_context+0x154/0x170
    > inherit_task_group.isra.0.part.0+0x14b/0x170
    > perf_event_init_task+0x296/0x4b0

    Fix this by skipping over events that are getting closed, in the
    inheritance path.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: f733c6b508bc ("perf/core: Fix inheritance of aux_output groups")
    Link: https://lkml.kernel.org/r/20191101151248.47327-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • While discussing uncore event scheduling, I noticed we do not in fact
    seem to disallow making uncore-cgroup events. Such events make no
    sense whatsoever, because a cgroup is CPU-local state while an uncore
    PMU counts across a number of CPUs.

    Disallow them.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 Oct, 2019

1 commit

  • Commit:

    1a5941312414c ("perf: Add wakeup watermark control to the AUX area")

    added attr.__reserved_2 padding, but forgot to add an ABI check to reject
    attributes with this field set. Fix that.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: adrian.hunter@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: https://lkml.kernel.org/r/20191025121636.75182-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

27 Oct, 2019

1 commit

  • Pull perf fixes from Thomas Gleixner:
    "A set of perf fixes:

    kernel:

    - Unbreak the tracking of auxiliary buffer allocations which got
    imbalanced, causing resource limit failures.

    - Fix the fallout of splitting ToPA entries, which failed to shift
    the base entry PA correctly.

    - Use the correct context to lookup the AUX event when unmapping the
    associated AUX buffer so the event can be stopped and the buffer
    reference dropped.

    tools:

    - Fix buildid-cache mode setting in copyfile_mode_ns() when copying
    /proc/kcore

    - Fix freeing id arrays in the event list so the correct event is
    closed.

    - Sync sched.h and kvm.h headers with the kernel sources.

    - Link jvmti against tools/lib/ctype.o to have weak strlcpy().

    - Fix multiple memory and file descriptor leaks, found by coverity in
    perf annotate.

    - Fix leaks in error handling paths in 'perf c2c', 'perf kmem', found
    by a static analysis tool"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/aux: Fix AUX output stopping
    perf/aux: Fix tracking of auxiliary trace buffer allocation
    perf/x86/intel/pt: Fix base for single entry topa
    perf kmem: Fix memory leak in compact_gfp_flags()
    tools headers UAPI: Sync sched.h with the kernel
    tools headers kvm: Sync kvm.h headers with the kernel sources
    tools headers kvm: Sync kvm headers with the kernel sources
    tools headers kvm: Sync kvm headers with the kernel sources
    perf c2c: Fix memory leak in build_cl_output()
    perf tools: Fix mode setting in copyfile_mode_ns()
    perf annotate: Fix multiple memory and file descriptor leaks
    perf tools: Fix resource leak of closedir() on the error paths
    perf evlist: Fix fix for freed id arrays
    perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()

    Linus Torvalds
     

22 Oct, 2019

1 commit

  • Commit:

    8a58ddae2379 ("perf/core: Fix exclusive events' grouping")

    allows CAP_EXCLUSIVE events to be grouped with other events. Since all
    of those also happen to be AUX events (which is not the case the other
    way around, because of arch/s390), this changes the rules for stopping
    the output: the AUX event may no longer be on its PMU's context if it's
    grouped with a HW event, in which case it will be on that HW event's
    context instead. If that's the case, munmap() of the AUX buffer can't
    find and stop the AUX event, potentially leaving the last reference with
    the atomic context, which will then end up freeing the AUX buffer. This
    will then trip warnings.

    Fix this by using the context's PMU context when looking for events
    to stop, instead of the event's PMU context.

    Signed-off-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20191022073940.61814-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

21 Oct, 2019

1 commit

  • The following commit from the v5.4 merge window:

    d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()")

    ... breaks auxiliary trace buffer tracking.

    If I run the command 'perf record -e rbd000' to record samples, saving
    them in the **auxiliary** trace buffer, then the value of 'locked_vm'
    goes negative after all trace buffers have been allocated and released:

    During allocation the values increase:

    [52.250027] perf_mmap user->locked_vm:0x87 pinned_vm:0x0 ret:0
    [52.250115] perf_mmap user->locked_vm:0x107 pinned_vm:0x0 ret:0
    [52.250251] perf_mmap user->locked_vm:0x188 pinned_vm:0x0 ret:0
    [52.250326] perf_mmap user->locked_vm:0x208 pinned_vm:0x0 ret:0
    [52.250441] perf_mmap user->locked_vm:0x289 pinned_vm:0x0 ret:0
    [52.250498] perf_mmap user->locked_vm:0x309 pinned_vm:0x0 ret:0
    [52.250613] perf_mmap user->locked_vm:0x38a pinned_vm:0x0 ret:0
    [52.250715] perf_mmap user->locked_vm:0x408 pinned_vm:0x2 ret:0
    [52.250834] perf_mmap user->locked_vm:0x408 pinned_vm:0x83 ret:0
    [52.250915] perf_mmap user->locked_vm:0x408 pinned_vm:0x103 ret:0
    [52.251061] perf_mmap user->locked_vm:0x408 pinned_vm:0x184 ret:0
    [52.251146] perf_mmap user->locked_vm:0x408 pinned_vm:0x204 ret:0
    [52.251299] perf_mmap user->locked_vm:0x408 pinned_vm:0x285 ret:0
    [52.251383] perf_mmap user->locked_vm:0x408 pinned_vm:0x305 ret:0
    [52.251544] perf_mmap user->locked_vm:0x408 pinned_vm:0x386 ret:0
    [52.251634] perf_mmap user->locked_vm:0x408 pinned_vm:0x406 ret:0
    [52.253018] perf_mmap user->locked_vm:0x408 pinned_vm:0x487 ret:0
    [52.253197] perf_mmap user->locked_vm:0x408 pinned_vm:0x508 ret:0
    [52.253374] perf_mmap user->locked_vm:0x408 pinned_vm:0x589 ret:0
    [52.253550] perf_mmap user->locked_vm:0x408 pinned_vm:0x60a ret:0
    [52.253726] perf_mmap user->locked_vm:0x408 pinned_vm:0x68b ret:0
    [52.253903] perf_mmap user->locked_vm:0x408 pinned_vm:0x70c ret:0
    [52.254084] perf_mmap user->locked_vm:0x408 pinned_vm:0x78d ret:0
    [52.254263] perf_mmap user->locked_vm:0x408 pinned_vm:0x80e ret:0

    The value of user->locked_vm increases up to a limit; beyond that, the
    memory is tracked by pinned_vm.

    During deallocation the size is subtracted from pinned_vm until it hits
    a limit. Then a larger value is subtracted from locked_vm, leading to a
    huge number (because the type is unsigned):

    [64.267797] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x78d
    [64.267826] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x70c
    [64.267848] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x68b
    [64.267869] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x60a
    [64.267891] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x589
    [64.267911] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x508
    [64.267933] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x487
    [64.267952] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x406
    [64.268883] perf_mmap_close mmap_user->locked_vm:0x307 pinned_vm:0x406
    [64.269117] perf_mmap_close mmap_user->locked_vm:0x206 pinned_vm:0x406
    [64.269433] perf_mmap_close mmap_user->locked_vm:0x105 pinned_vm:0x406
    [64.269536] perf_mmap_close mmap_user->locked_vm:0x4 pinned_vm:0x404
    [64.269797] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff84 pinned_vm:0x303
    [64.270105] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff04 pinned_vm:0x202
    [64.270374] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe84 pinned_vm:0x101
    [64.270628] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe04 pinned_vm:0x0

    This value sticks for the user until the system is rebooted, causing
    follow-on system calls that use the locked_vm resource limit to fail.

    Note: there is no issue using the normal trace buffer.

    In fact, the issue is in perf_mmap_close(). During allocation, auxiliary
    trace buffer memory is either tracked as 'extra' and added to
    'pinned_vm', or tracked as 'user_extra' and added to 'locked_vm'. This
    applies to normal and auxiliary trace buffers alike.

    However, in perf_mmap_close() all auxiliary trace buffer memory is
    subtracted from 'locked_vm' and never from 'pinned_vm'. This breaks the
    balance.

    Signed-off-by: Thomas Richter
    Acked-by: Peter Zijlstra
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: gor@linux.ibm.com
    Cc: hechaol@fb.com
    Cc: heiko.carstens@de.ibm.com
    Cc: linux-perf-users@vger.kernel.org
    Cc: songliubraving@fb.com
    Fixes: d44248a41337 ("perf/core: Rework memory accounting in perf_mmap()")
    Link: https://lkml.kernel.org/r/20191021083354.67868-1-tmricht@linux.ibm.com
    [ Minor readability edits. ]
    Signed-off-by: Ingo Molnar

    Thomas Richter
     

19 Oct, 2019

1 commit

  • Attaching uprobe to text section in THP splits the PMD mapped page table
    into PTE mapped entries. On uprobe detach, we would like to regroup PMD
    mapped page table entry to regain performance benefit of THP.

    However, the regroup is broken for perf_event based trace_uprobe. This
    is because perf_event based trace_uprobe calls uprobe_unregister twice
    on close: first in TRACE_REG_PERF_CLOSE, then in
    TRACE_REG_PERF_UNREGISTER. The second call will split the PMD mapped
    page table entry, which is not the desired behavior.

    Fix this by only using FOLL_SPLIT_PMD for the uprobe register case.

    Add a WARN() to confirm uprobe unregister never works on huge pages,
    and abort the operation when this WARN() triggers.

    Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
    Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
    Signed-off-by: Song Liu
    Reviewed-by: Srikar Dronamraju
    Cc: Kirill A. Shutemov
    Cc: Oleg Nesterov
    Cc: Matthew Wilcox (Oracle)
    Cc: William Kucharski
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

09 Oct, 2019

2 commits

  • In perf_rotate_context(), when the first cpu flexible event fails to
    schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
    NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
    cpuctx->ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
    perf_event_sched_in() will skip all cpu flexible events because of the
    EVENT_FLEXIBLE bit.

    In the next call of perf_rotate_context(), cpu_rotate stays 1, and
    cpu_event stays NULL, so this process repeats. The end result is, flexible
    events on this cpu will not be scheduled (until another event being added
    to the cpuctx).

    Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
    could only use one counter, run one pinned event for ref-cycles, one
    flexible event for ref-cycles, and one flexible event for cycles. The
    flexible ref-cycles is never scheduled, which is expected. However,
    because of this issue, the cycles event is never scheduled either.

    $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

    time counts unit events
    1.000152973 15,412,480 ref-cycles:D
    1.000152973 ref-cycles (0.00%)
    1.000152973 cycles (0.00%)
    2.000486957 18,263,120 ref-cycles:D
    2.000486957 ref-cycles (0.00%)
    2.000486957 cycles (0.00%)

    To fix this, when the flexible_active list is empty, try to rotate the
    first event in the flexible_groups. Also, rename ctx_first_active() to
    ctx_event_to_rotate(), which is more accurate.

    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: Thomas Gleixner
    Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")
    Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
    Signed-off-by: Ingo Molnar

    Song Liu
     
  • perf_mmap() always increases user->locked_vm. As a result, "extra" could
    grow bigger than "user_extra", which doesn't make sense. Here is an
    example case:

    (Note: Assume "user_lock_limit" is very small.)

    | # of perf_mmap calls |vma->vm_mm->pinned_vm|user->locked_vm|
    | 0 | 0 | 0 |
    | 1 | user_extra | user_extra |
    | 2 | 3 * user_extra | 2 * user_extra|
    | 3 | 6 * user_extra | 3 * user_extra|
    | 4 | 10 * user_extra | 4 * user_extra|

    Fix this by maintaining proper user_extra and extra.

    Reviewed-By: Hechao Li
    Reported-by: Hechao Li
    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Jie Meng
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
    Signed-off-by: Ingo Molnar

    Song Liu
     

07 Oct, 2019

1 commit

  • Commit:

    ab43762ef010 ("perf: Allow normal events to output AUX data")

    forgets to configure aux_output relation in the inherited groups, which
    results in child PEBS events forever failing to schedule.

    Fix this by setting up the AUX output link in the inheritance path.

    Signed-off-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20191004125729.32397-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

01 Oct, 2019

1 commit

  • Switch the perf_event_open() syscall from its own copying of
    struct perf_event_attr from userspace to the new dedicated
    copy_struct_from_user() helper.

    The change is very straightforward, and helps unify the syscall
    interface for struct-from-userspace syscalls.

    Signed-off-by: Aleksa Sarai
    Reviewed-by: Kees Cook
    Reviewed-by: Christian Brauner
    [christian.brauner@ubuntu.com: improve commit message]
    Link: https://lore.kernel.org/r/20191001011055.19283-5-cyphar@cyphar.com
    Signed-off-by: Christian Brauner

    Aleksa Sarai
     

28 Sep, 2019

1 commit

  • Pull kernel lockdown mode from James Morris:
    "This is the latest iteration of the kernel lockdown patchset, from
    Matthew Garrett, David Howells and others.

    From the original description:

    This patchset introduces an optional kernel lockdown feature,
    intended to strengthen the boundary between UID 0 and the kernel.
    When enabled, various pieces of kernel functionality are restricted.
    Applications that rely on low-level access to either hardware or the
    kernel may cease working as a result - therefore this should not be
    enabled without appropriate evaluation beforehand.

    The majority of mainstream distributions have been carrying variants
    of this patchset for many years now, so there's value in providing an
    upstream implementation. It doesn't meet every distribution
    requirement, but it gets us much closer to not requiring external
    patches.

    There are two major changes since this was last proposed for mainline:

    - Separating lockdown from EFI secure boot. Background discussion is
    covered here: https://lwn.net/Articles/751061/

    - Implementation as an LSM, with a default stackable lockdown LSM
    module. This allows the lockdown feature to be policy-driven,
    rather than encoding an implicit policy within the mechanism.

    The new locked_down LSM hook is provided to allow LSMs to make a
    policy decision around whether kernel functionality that would allow
    tampering with or examining the runtime state of the kernel should be
    permitted.

    The included lockdown LSM provides an implementation with a simple
    policy intended for general purpose use. This policy provides a coarse
    level of granularity, controllable via the kernel command line:

    lockdown={integrity|confidentiality}

    Enable the kernel lockdown feature. If set to integrity, kernel features
    that allow userland to modify the running kernel are disabled. If set to
    confidentiality, kernel features that allow userland to extract
    confidential information from the kernel are also disabled.

    This may also be controlled via /sys/kernel/security/lockdown and
    overridden by kernel configuration.

    New or existing LSMs may implement finer-grained controls of the
    lockdown features. Refer to the lockdown_reason documentation in
    include/linux/security.h for details.

    The lockdown feature has had significant design feedback and review
    across many subsystems. This code has been in linux-next for some
    weeks, with a few fixes applied along the way.

    Stephen Rothwell noted that commit 9d1f8be5cf42 ("bpf: Restrict bpf
    when kernel lockdown is in confidentiality mode") is missing a
    Signed-off-by from its author. Matthew responded that he is providing
    this under category (c) of the DCO"

    * 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
    kexec: Fix file verification on S390
    security: constify some arrays in lockdown LSM
    lockdown: Print current->comm in restriction messages
    efi: Restrict efivar_ssdt_load when the kernel is locked down
    tracefs: Restrict tracefs when the kernel is locked down
    debugfs: Restrict debugfs when the kernel is locked down
    kexec: Allow kexec_file() with appropriate IMA policy when locked down
    lockdown: Lock down perf when in confidentiality mode
    bpf: Restrict bpf when kernel lockdown is in confidentiality mode
    lockdown: Lock down tracing and perf kprobes when in confidentiality mode
    lockdown: Lock down /proc/kcore
    x86/mmiotrace: Lock down the testmmiotrace module
    lockdown: Lock down module params that specify hardware parameters (eg. ioport)
    lockdown: Lock down TIOCSSERIAL
    lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
    acpi: Disable ACPI table override if the kernel is locked down
    acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
    ACPI: Limit access to custom_method when the kernel is locked down
    x86/msr: Restrict MSR access when the kernel is locked down
    x86: Lock down IO port access when the kernel is locked down
    ...

    Linus Torvalds
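    As a practical sketch of the policy interface described in the pull
    message above (the bracketed value in the securityfs read marks the
    current level, and the level can only be raised at runtime, never
    lowered):

```
# On the kernel command line:
lockdown=integrity          # userland may not modify the running kernel
lockdown=confidentiality    # ...and may not extract confidential data from it

# At runtime, via securityfs (one-way: the level can only be raised):
cat /sys/kernel/security/lockdown     # e.g. "[none] integrity confidentiality"
echo integrity > /sys/kernel/security/lockdown
```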
     

27 Sep, 2019

1 commit

  • Pull more perf updates from Ingo Molnar:
    "The only kernel changes are comment typo fixes.

    The rest is mostly tooling fixes, but also new vendor event additions
    and updates, a bigger libperf/libtraceevent library and a header files
    reorganization that came in a bit late"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (108 commits)
    perf unwind: Fix libunwind build failure on i386 systems
    perf parser: Remove needless include directives
    perf build: Add detection of java-11-openjdk-devel package
    perf jvmti: Include JVMTI support for s390
    perf vendor events: Remove P8 HW events which are not supported
    perf evlist: Fix access of freed id arrays
    perf stat: Fix free memory access / memory leaks in metrics
    perf tools: Replace needless mmap.h with what is needed, event.h
    perf evsel: Move config terms to a separate header
    perf evlist: Remove unused perf_evlist__fprintf() method
    perf evsel: Introduce evsel_fprintf.h
    perf evsel: Remove need for symbol_conf in evsel_fprintf.c
    perf copyfile: Move copyfile routines to separate files
    libperf: Add perf_evlist__poll() function
    libperf: Add perf_evlist__add_pollfd() function
    libperf: Add perf_evlist__alloc_pollfd() function
    libperf: Add libperf_init() call to the tests
    libperf: Merge libperf_set_print() into libperf_init()
    libperf: Add libperf dependency for tests targets
    libperf: Use sys/types.h to get ssize_t, not unistd.h
    ...

    Linus Torvalds
     

25 Sep, 2019

3 commits

  • After all uprobes are removed from the huge page (with PTE pgtable), it is
    possible to collapse the pmd and benefit from THP again. This patch does
    the collapse by calling collapse_pte_mapped_thp().

    Link: http://lkml.kernel.org/r/20190815164525.1848545-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reported-by: kbuild test robot
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Use the newly added FOLL_SPLIT_PMD in uprobe. This preserves the huge
    page when the uprobe is enabled. When the uprobe is disabled, newer
    instances of the same application could still benefit from huge pages.

    For the next step, we will enable khugepaged to regroup the pmd, so that
    existing instances of the application could also benefit from huge pages
    after the uprobe is disabled.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-5-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Srikar Dronamraju
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Currently, uprobe swaps the target page with an anonymous page in both
    install_breakpoint() and remove_breakpoint(). When all uprobes on a page
    are removed, the given mm is still using an anonymous page (not the
    original page).

    This patch allows uprobe to use original page when possible (all uprobes
    on the page are already removed, and the original page is in page cache
    and uptodate).

    As suggested by Oleg, we unmap the old_page and let the original page
    fault in.

    Link: http://lkml.kernel.org/r/20190815164525.1848545-3-songliubraving@fb.com
    Signed-off-by: Song Liu
    Suggested-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

21 Sep, 2019

1 commit

  • Fix typos in a few functions' documentation comments.

    Signed-off-by: Roy Ben Shlomo
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: royb@sentinelone.com
    Link: http://lore.kernel.org/lkml/20190920171254.31373-1-royb@sentinelone.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Roy Ben Shlomo
     

18 Sep, 2019

1 commit

  • Pull core timer updates from Thomas Gleixner:
    "Timers and timekeeping updates:

    - A large overhaul of the posix CPU timer code which is a preparation
    for moving the CPU timer expiry out into task work so it can be
    properly accounted on the task/process.

    An update to the bogus permission checks will come later during the
    merge window as feedback was not complete before heading off for
    travel.

    - Switch the timerqueue code to use cached rbtrees and get rid of the
    homebrew caching of the leftmost node.

    - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
    single function

    - Implement the separation of hrtimers to be forced to expire in hard
    interrupt context even when PREEMPT_RT is enabled and mark the
    affected timers accordingly.

    - Implement a mechanism for hrtimers and the timer wheel to protect
    RT against priority inversion and live lock issues when a (hr)timer
    which should be canceled is currently executing the callback.
    Instead of infinitely spinning, the task which tries to cancel the
    timer blocks on a per cpu base expiry lock which is held and
    released by the (hr)timer expiry code.

    - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
    resulting in faster access to timekeeping functions.

    - Updates to various clocksource/clockevent drivers and their device
    tree bindings.

    - The usual small improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    posix-cpu-timers: Fix permission check regression
    posix-cpu-timers: Always clear head pointer on dequeue
    hrtimer: Add a missing bracket and hide `migration_base' on !SMP
    posix-cpu-timers: Make expiry_active check actually work correctly
    posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
    tick: Mark sched_timer to expire in hard interrupt context
    hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
    x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
    posix-cpu-timers: Utilize timerqueue for storage
    posix-cpu-timers: Move state tracking to struct posix_cputimers
    posix-cpu-timers: Deduplicate rlimit handling
    posix-cpu-timers: Remove pointless comparisons
    posix-cpu-timers: Get rid of 64bit divisions
    posix-cpu-timers: Consolidate timer expiry further
    posix-cpu-timers: Get rid of zero checks
    rlimit: Rewrite non-sensical RLIMIT_CPU comment
    posix-cpu-timers: Respect INFINITY for hard RTTIME limit
    posix-cpu-timers: Switch thread group sampling to array
    posix-cpu-timers: Restructure expiry array
    posix-cpu-timers: Remove cputime_expires
    ...

    Linus Torvalds
     

17 Sep, 2019

2 commits

  • Pull scheduler updates from Ingo Molnar:

    - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
    Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
    Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

    As perf and the scheduler are getting bigger and more complex,
    document the status quo of current responsibilities and interests,
    and spread the review pain^H^H^H^H fun via an increase in the Cc:
    linecount generated by scripts/get_maintainer.pl. :-)

    - Add another series of patches that brings the -rt (PREEMPT_RT) tree
    closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
    into a new CONFIG_PREEMPTION category that will allow the eventual
    introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
    to go though.

    - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
    allow the finer shaping of CPU bandwidth usage.

    - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

    - Improve the behavior of high CPU count, high thread count
    applications running under cpu.cfs_quota_us constraints.

    - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

    - Improve CPU isolation housekeeping CPU allocation NUMA locality.

    - Fix deadline scheduler bandwidth calculations and logic when cpusets
    rebuilds the topology, or when it gets deadline-throttled while it's
    being offlined.

    - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
    setscheduler() system calls without creating global serialization.
    Add new synchronization between cpuset topology-changing events and
    the deadline acceptance tests in setscheduler(), which were broken
    before.

    - Rework the active_mm state machine to be less confusing and more
    optimal.

    - Rework (simplify) the pick_next_task() slowpath.

    - Improve load-balancing on AMD EPYC systems.

    - ... and misc cleanups, smaller fixes and improvements - please see
    the Git log for more details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    sched/psi: Correct overly pessimistic size calculation
    sched/fair: Speed-up energy-aware wake-ups
    sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
    sched/uclamp: Update CPU's refcount on TG's clamp changes
    sched/uclamp: Use TG's clamps to restrict TASK's clamps
    sched/uclamp: Propagate system defaults to the root group
    sched/uclamp: Propagate parent clamps
    sched/uclamp: Extend CPU's cgroup controller
    sched/topology: Improve load balancing on AMD EPYC systems
    arch, ia64: Make NUMA select SMP
    sched, perf: MAINTAINERS update, add submaintainers and reviewers
    sched/fair: Use rq_lock/unlock in online_fair_sched_group
    cpufreq: schedutil: fix equation in comment
    sched: Rework pick_next_task() slow-path
    sched: Allow put_prev_task() to drop rq->lock
    sched/fair: Expose newidle_balance()
    sched: Add task_struct pointer to sched_class::set_curr_task
    sched: Rework CPU hotplug task selection
    sched/{rt,deadline}: Fix set_next_task vs pick_next_task
    sched: Fix kerneldoc comment for ia64_set_curr_task
    ...

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "Kernel side changes:

    - Improved kprobes robustness

    - Intel PEBS support for PT hardware tracing

    - Other Intel PT improvements: high order pages memory footprint
    reduction and various related cleanups

    - Misc cleanups

    The perf tooling side has been very busy in this cycle, with over 300
    commits. This is an incomplete high-level summary of the many
    improvements done by over 30 developers:

    - Lots of updates to the following tools:

    'perf c2c'
    'perf config'
    'perf record'
    'perf report'
    'perf script'
    'perf test'
    'perf top'
    'perf trace'

    - Updates to libperf and libtraceevent, and a consolidation of the
    proliferation of x86 instruction decoder libraries.

    - Vendor event updates for Intel and PowerPC CPUs,

    - Updates to hardware tracing tooling for ARM and Intel CPUs,

    - ... and lots of other changes and cleanups - see the shortlog and
    Git log for details"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (322 commits)
    kprobes: Prohibit probing on BUG() and WARN() address
    perf/x86: Make more stuff static
    x86, perf: Fix the dependency of the x86 insn decoder selftest
    objtool: Ignore intentional differences for the x86 insn decoder
    objtool: Update sync-check.sh from perf's check-headers.sh
    perf build: Ignore intentional differences for the x86 insn decoder
    perf intel-pt: Use shared x86 insn decoder
    perf intel-pt: Remove inat.c from build dependency list
    perf: Update .gitignore file
    objtool: Move x86 insn decoder to a common location
    perf metricgroup: Support multiple events for metricgroup
    perf metricgroup: Scale the metric result
    perf pmu: Change convert_scale from static to global
    perf symbols: Move mem_info and branch_info out of symbol.h
    perf auxtrace: Uninline functions that touch perf_session
    perf tools: Remove needless evlist.h include directives
    perf tools: Remove needless evlist.h include directives
    perf tools: Remove needless thread_map.h include directives
    perf tools: Remove needless thread.h include directives
    perf tools: Remove needless map.h include directives
    ...

    Linus Torvalds
     

16 Sep, 2019

1 commit


06 Sep, 2019

1 commit

  • If the compiler's auto-initialization feature is disabled, i.e. neither
    -fplugin-arg-structleak_plugin-byref nor -ftrivial-auto-var-init=pattern
    is in effect, arch_hw_breakpoint may be used before initialization after:

    9a4903dde2c86 ("perf/hw_breakpoint: Split attribute parse and commit")

    On our ARM platform, the struct step_ctrl in arch_hw_breakpoint, which
    used to be zero-initialized by kzalloc(), may be used in
    arch_install_hw_breakpoint() without initialization.

    Signed-off-by: Mark-PK Tsai
    Cc: Alexander Shishkin
    Cc: Alix Wu
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: YJ Chiang
    Link: https://lkml.kernel.org/r/20190906060115.9460-1-mark-pk.tsai@mediatek.com
    [ Minor edits. ]
    Signed-off-by: Ingo Molnar

    Mark-PK Tsai
     

28 Aug, 2019

1 commit

  • In some cases, ordinary (non-AUX) events can generate data for AUX events.
    For example, PEBS events can come out as records in the Intel PT stream
    instead of their usual DS records, if configured to do so.

    One requirement for such events is to consistently schedule together, to
    ensure that the data from the "AUX output" events isn't lost while their
    corresponding AUX event is not scheduled. We use grouping to provide this
    guarantee: an "AUX output" event can be added to a group where an AUX event
    is a group leader, and provided that the former supports writing to the
    latter.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: kan.liang@linux.intel.com
    Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com

    Alexander Shishkin
     

20 Aug, 2019

1 commit


02 Aug, 2019

1 commit

  • To guarantee that the multiplexing mechanism and the hrtimer driven events
    work on PREEMPT_RT enabled kernels, it's required that the related hrtimers
    expire in hard interrupt context. Mark them so PREEMPT_RT kernels won't
    defer them to soft interrupt context.

    No functional change.

    [ tglx: Split out of larger combo patch. Added changelog ]

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20190726185753.169509224@linutronix.de

    Sebastian Andrzej Siewior