Eric Lee / smarc-fsl-linux-kernel

24 Aug, 2017

1 commit

d2aaa3dc4 bpf, doc: Add arm32 as arch supporting eBPF JIT ... Browse Code »

As eBPF JIT support for arm32 was added recently with
commit 39c13c204bb1150d401e27d41a9d8b332be47c49, it seems appropriate to
add arm32 as arch with support for eBPF JIT in bpf and sysctl docs as well.

Signed-off-by: Shubham Bansal
Acked-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Shubham Bansal
2017-08-24 13:40:12 +0800

21 Aug, 2017

1 commit

d4dd2d75a bpf, doc: also add s390x as arch to sysctl description ... Browse Code »

Looks like this was accidentally missed, so still add s390x
as supported eBPF JIT arch to bpf_jit_enable.

Fixes: 014cd0a368dc ("bpf: Update sysctl documentation to list all supported architectures")
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2017-08-21 10:45:09 +0800

19 Aug, 2017

1 commit

2110ba583 bpf, doc: improve sysctl knob description ... Browse Code »

Current context speaking of tcpdump filters is out of date these
days, so lets improve the sysctl description for the BPF knobs
a bit.

Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Borkmann
2017-08-19 02:00:41 +0800

18 Aug, 2017

1 commit

014cd0a36 bpf: Update sysctl documentation to list all supported architectures ... Browse Code »

The sysctl documentation states that the JIT is only available on
x86_64, which is no longer correct.

Update the list, and break it out to indicate which architectures
support the cBPF JIT (via HAVE_CBPF_JIT) or the eBPF JIT
(HAVE_EBPF_JIT).

Signed-off-by: Michael Ellerman
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Michael Ellerman
2017-08-18 01:09:28 +0800

11 Jul, 2017

1 commit

d09b64688 mm: document highmem_is_dirtyable sysctl ... Browse Code »

It seems that there are still people using 32b kernels which a lot of
memory and the IO tend to suck a lot for them by default. Mostly
because writers are throttled too when the lowmem is used. We have
highmem_is_dirtyable to work around that issue but it seems we never
bothered to document it. Let's do it now, finally.

Link: http://lkml.kernel.org/r/20170626093200.18958-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Johannes Weiner
Cc: Alkis Georgopoulos
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2017-07-11 07:32:32 +0800

22 Apr, 2017

1 commit

7acf8a1e8 Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning ... Browse Code »

Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.

For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.

Version 2: changed jiffies to microseconds for predictable units.

Signed-off-by: Matthew Whitehead
Signed-off-by: David S. Miller

Matthew Whitehead
2017-04-22 01:22:34 +0800

05 Mar, 2017

1 commit

be834aafd Merge tag 'docs-4.11-fixes' of git://git.lwn.net/linux ... Browse Code »

Pull documentation fixes from Jonathan Corbet:
"A few fixes for the docs tree, including one for a 4.11 build
regression"

* tag 'docs-4.11-fixes' of git://git.lwn.net/linux:
Documentation/sphinx: fix primary_domain configuration
docs: Fix htmldocs build failure
doc/ko_KR/memory-barriers: Update control-dependencies section
pcieaer doc: update the link
Documentation: Update path to sysrq.txt

Linus Torvalds
2017-03-05 03:32:18 +0800

04 Mar, 2017

1 commit

d3c1a297b Documentation: Update path to sysrq.txt ... Browse Code »

Commit 9d85025b0418 ("docs-rst: create an user's manual book") moved the
sysrq.txt leaving old paths in the kernel docs.

Signed-off-by: Krzysztof Kozlowski
Reviewed-by: Mauro Carvalho Chehab
Signed-off-by: Jonathan Corbet

Krzysztof Kozlowski
2017-03-04 06:48:38 +0800

25 Feb, 2017

1 commit

def5efe03 mm, madvise: fail with ENOMEM when splitting vma will hit max_map_count ... Browse Code »

If madvise(2) advice will result in the underlying vma being split and
the number of areas mapped by the process will exceed
/proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
is temporarily unavailable. It indicates that userspace should retry
the advice in the near future. This is important for advice such as
MADV_DONTNEED which is often used by malloc implementations to free
memory back to the system: we really do want to free memory back when
madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
or mempolicies) cannot be allocated.

Encountering /proc/sys/vm/max_map_count is not a temporary failure,
however, so return ENOMEM to indicate this is a more serious issue. A
followup patch to the man page will specify this behavior.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Cc: Jonathan Corbet
Cc: Johannes Weiner
Cc: Jerome Marchand
Cc: "Kirill A. Shutemov"
Cc: Michael Kerrisk
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2017-02-25 09:46:55 +0800

18 Feb, 2017

1 commit

74451e66d bpf: make jited programs visible in traces ... Browse Code »

Long standing issue with JITed programs is that stack traces from
function tracing check whether a given address is kernel code
through {__,}kernel_text_address(), which checks for code in core
kernel, modules and dynamically allocated ftrace trampolines. But
what is still missing is BPF JITed programs (interpreted programs
are not an issue as __bpf_prog_run() will be attributed to them),
thus when a stack trace is triggered, the code walking the stack
won't see any of the JITed ones. The same for address correlation
done from user space via reading /proc/kallsyms. This is read by
tools like perf, but the latter is also useful for permanent live
tracing with eBPF itself in combination with stack maps when other
eBPF types are part of the callchain. See offwaketime example on
dumping stack from a map.

This work tries to tackle that issue by making the addresses and
symbols known to the kernel. The lookup from *kernel_text_address()
is implemented through a latched RB tree that can be read under
RCU in fast-path that is also shared for symbol/size/offset lookup
for a specific given address in kallsyms. The slow-path iteration
through all symbols in the seq file done via RCU list, which holds
a tiny fraction of all exported ksyms, usually below 0.1 percent.
Function symbols are exported as bpf_prog_, in order to aide
debugging and attribution. This facility is currently enabled for
root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
is active in any mode. The rationale behind this is that still a lot
of systems ship with world read permissions on kallsyms thus addresses
should not get suddenly exposed for them. If that situation gets
much better in future, we always have the option to change the
default on this. Likewise, unprivileged programs are not allowed
to add entries there either, but that is less of a concern as most
such programs types relevant in this context are for root-only anyway.
If enabled, call graphs and stack traces will then show a correct
attribution; one example is illustrated below, where the trace is
now visible in tooling such as perf script --kallsyms=/proc/kallsyms
and friends.

Before:

7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

After:

7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
[...]
7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller

Daniel Borkmann
2017-02-18 02:40:05 +0800

30 Dec, 2016

1 commit

3d48b53fb net: dev_weight: TX/RX orthogonality ... Browse Code »

Oftenly, introducing side effects on packet processing on the other half
of the stack by adjusting one of TX/RX via sysctl is not desirable.
There are cases of demand for asymmetric, orthogonal configurability.

This holds true especially for nodes where RPS for RFS usage on top is
configured and therefore use the 'old dev_weight'. This is quite a
common base configuration setup nowadays, even with NICs of superior processing
support (e.g. aRFS).

A good example use case are nodes acting as noSQL data bases with a
large number of tiny requests and rather fewer but large packets as responses.
It's affordable to have large budget and rx dev_weights for the
requests. But as a side effect having this large a number on TX
processed in one run can overwhelm drivers.

This patch therefore introduces an independent configurability via sysctl to
userland.

Signed-off-by: Matthias Tafelmeier
Signed-off-by: David S. Miller

Matthias Tafelmeier
2016-12-30 04:38:35 +0800

13 Dec, 2016

1 commit

e7aa8c2eb Merge tag 'docs-4.10' of git://git.lwn.net/linux ... Browse Code »

Pull documentation update from Jonathan Corbet:
"These are the documentation changes for 4.10.

It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:

- Further work on PDF output, which remains a bit of a pain but
should be more solid now.

- Five more DocBook template files converted to Sphinx. Only 27 to
go... Lots of plain-text files have also been converted and
integrated.

- Images in binary formats have been replaced with more
source-friendly versions.

- Various bits of organizational work, including the renaming of
various files discussed at the kernel summit.

- New documentation for the device_link mechanism.

... and, of course, lots of typo fixes and small updates"

* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
dma-buf: Extract dma-buf.rst
Update Documentation/00-INDEX
docs: 00-INDEX: document directories/files with no docs
docs: 00-INDEX: remove non-existing entries
docs: 00-INDEX: add missing entries for documentation files/dirs
docs: 00-INDEX: consolidate process/ and admin-guide/ description
scripts: add a script to check if Documentation/00-INDEX is sane
Docs: change sh -> awk in REPORTING-BUGS
Documentation/core-api/device_link: Add initial documentation
core-api: remove an unexpected unident
ppc/idle: Add documentation for powersave=off
Doc: Correct typo, "Introdution" => "Introduction"
Documentation/atomic_ops.txt: convert to ReST markup
Documentation/local_ops.txt: convert to ReST markup
Documentation/assoc_array.txt: convert to ReST markup
docs-rst: parse-headers.pl: cleanup the documentation
docs-rst: fix media cleandocs target
docs-rst: media/Makefile: reorganize the rules
docs-rst: media: build SVG from graphviz files
docs-rst: replace bayer.png by a SVG image
...

Linus Torvalds
2016-12-13 13:58:13 +0800

26 Oct, 2016

1 commit

0ee1dd9f5 x86/dumpstack: Remove raw stack dump ... Browse Code »

For mostly historical reasons, the x86 oops dump shows the raw stack
values:

...
[registers]
Stack:
ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
0000000000000002 0000000000000000 0000000000000000 0000000000000000
Call Trace:
...

This seems to be an artifact from long ago, and probably isn't needed
anymore. It generally just adds noise to the dump, and it can be
actively harmful because it leaks kernel addresses.

Linus says:

"The stack dump actually goes back to forever, and it used to be
useful back in 1992 or so. But it used to be useful mainly because
stacks were simpler and we didn't have very good call traces anyway. I
definitely remember having used them - I just do not remember having
used them in the last ten+ years.

Of course, it's still true that if you can trigger an oops, you've
likely already lost the security game, but since the stack dump is so
useless, let's aim to just remove it and make games like the above
harder."

This also removes the related 'kstack=' cmdline option and the
'kstack_depth_to_print' sysctl.

Suggested-by: Linus Torvalds
Signed-off-by: Josh Poimboeuf
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar

Josh Poimboeuf
2016-10-26 00:40:37 +0800

24 Oct, 2016

1 commit

8c27ceff3 docs: fix locations of several documents that got moved ... Browse Code »

The previous patch renamed several files that are cross-referenced
along the Kernel documentation. Adjust the links to point to
the right places.

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2016-10-24 18:12:35 +0800

01 Oct, 2016

1 commit

d29216842 mnt: Add a per mount namespace limit on the number of mounts ... Browse Code »

CAI Qian pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

mkdir /tmp/1 /tmp/2
mount --make-rshared /
for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000. This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts. Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-10-01 01:46:48 +0800

23 Sep, 2016

1 commit

9c722e406 userns; Document per user per user namespace limits. ... Browse Code »

Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-09-23 01:52:03 +0800

03 Aug, 2016

1 commit

750afe7ba printk: add kernel parameter to control writes to /dev/kmsg ... Browse Code »

Add a "printk.devkmsg" kernel command line parameter which controls how
userspace writes into /dev/kmsg. It has three options:

* ratelimit - ratelimit logging from userspace.
* on - unlimited logging from userspace
* off - logging from userspace gets ignored

The default setting is to ratelimit the messages written to it.

This changes the kernel default setting of "on" to "ratelimit" and we do
that because we want to keep userspace spamming /dev/kmsg to sane
levels. This is especially moot when a small kernel log buffer wraps
around and messages get lost. So the ratelimiting setting should be a
sane setting where kernel messages should have a bit higher chance of
survival from all the spamming.

It additionally does not limit logging to /dev/kmsg while the system is
booting if we haven't disabled it on the command line.

Furthermore, we can control the logging from a lower priority sysctl
interface - kernel.printk_devkmsg.

That interface will succeed only if printk.devkmsg *hasn't* been
supplied on the command line. If it has, then printk.devkmsg is a
one-time setting which remains for the duration of the system lifetime.
This "locking" of the setting is to prevent userspace from changing the
logging on us through sysctl(2).

This patch is based on previous patches from Linus and Steven.

[bp@suse.de: fixes]
Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
Signed-off-by: Borislav Petkov
Cc: Dave Young
Cc: Franck Bui
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Uwe Kleine-König
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Borislav Petkov
2016-08-03 07:35:06 +0800

27 Jul, 2016

1 commit

0f776dc37 Merge tag 'docs-for-linus' of git://git.lwn.net/linux ... Browse Code »

Pull documentation updates from Jonathan Corbet:
"Some big changes this month, headlined by the addition of a new
formatted documentation mechanism based on the Sphinx system.

The objectives here are to make it easier to create better-integrated
(and more attractive) documents while (eventually) dumping our
one-of-a-kind, cobbled-together system for something that is widely
used and maintained by others. There's a fair amount of information
what's being done, why, and how to use it in:

https://lwn.net/Articles/692704/
https://lwn.net/Articles/692705/

Closer to home, Documentation/kernel-documentation.rst describes how
it works.

For now, the new system exists alongside the old one; you should soon
see the GPU documentation converted over in the DRM pull and some
significant media conversion work as well. Once all the docs have
been moved over and we're convinced that the rough edges (of which are
are a few) have been smoothed over, the DocBook-based stuff should go
away.

Primary credit is to Jani Nikula for doing the heavy lifting to make
this stuff actually work; there has also been notable effort from
Markus Heiser, Daniel Vetter, and Mauro Carvalho Chehab.

Expect a couple of conflicts on the new index.rst file over the course
of the merge window; they are trivially resolvable. That file may be
a bit of a conflict magnet in the short term, but I don't expect that
situation to last for any real length of time.

Beyond that, of course, we have the usual collection of tweaks,
updates, and typo fixes"

* tag 'docs-for-linus' of git://git.lwn.net/linux: (77 commits)
doc-rst: kernel-doc: fix handling of address_space tags
Revert "doc/sphinx: Enable keep_warnings"
doc-rst: kernel-doc directive, fix state machine reporter
docs: deprecate kernel-doc-nano-HOWTO.txt
doc/sphinx: Enable keep_warnings
Documentation: add watermark_scale_factor to the list of vm systcl file
kernel-doc: Fix up warning output
docs: Get rid of some kernel-documentation warnings
doc-rst: add an option to ignore DocBooks when generating docs
workqueue: Fix a typo in workqueue.txt
Doc: ocfs: Fix typo in filesystems/ocfs2-online-filecheck.txt
Documentation/sphinx: skip build if user requested specific DOCBOOKS
Documentation: add cleanmediadocs to the documentation targets
Add .pyc files to .gitignore
Doc: PM: Fix a typo in intel_powerclamp.txt
doc-rst: flat-table directive - initial implementation
Documentation: add meta-documentation for Sphinx and kernel-doc
Documentation: tiny typo fix in usb/gadget_multi.txt
Documentation: fix wrong value in md.txt
bcache: documentation formatting, edited for clarity, stripe alignment notes
...

Linus Torvalds
2016-07-27 04:05:11 +0800

18 Jul, 2016

1 commit

e6507a00f Documentation: add watermark_scale_factor to the list of vm systcl file ... Browse Code »

Commit 795ae7a0de6b ("mm: scale kswapd watermarks in proportion to
memory") properly added the description of the new knob to
Documentation/sysctl/vm.txt, but forgot to add it to the list of files
in /proc/sys/vm. Let's fix that.

Signed-off-by: Jerome Marchand
Acked-by: Johannes Weiner
Signed-off-by: Jonathan Corbet

Jerome Marchand
2016-07-18 22:27:01 +0800

16 Jun, 2016

1 commit

088e9d253 rcu: sysctl: Panic on RCU Stall ... Browse Code »

It is not always easy to determine the cause of an RCU stall just by
analysing the RCU stall messages, mainly when the problem is caused
by the indirect starvation of rcu threads. For example, when preempt_rcu
is not awakened due to the starvation of a timer softirq.

We have been hard coding panic() in the RCU stall functions for
some time while testing the kernel-rt. But this is not possible in
some scenarios, like when supporting customers.

This patch implements the sysctl kernel.panic_on_rcu_stall. If
set to 1, the system will panic() when an RCU stall takes place,
enabling the capture of a vmcore. The vmcore provides a way to analyze
all kernel/tasks states, helping out to point to the culprit and the
solution for the stall.

The kernel.panic_on_rcu_stall sysctl is disabled by default.

Changes from v1:
- Fixed a typo in the git log
- The if(sysctl_panic_on_rcu_stall) panic() is in a static function
- Fixed the CONFIG_TINY_RCU compilation issue
- The var sysctl_panic_on_rcu_stall is now __read_mostly

Cc: Jonathan Corbet
Cc: "Paul E. McKenney"
Cc: Josh Triplett
Cc: Steven Rostedt
Cc: Mathieu Desnoyers
Cc: Lai Jiangshan
Acked-by: Christian Borntraeger
Reviewed-by: Josh Triplett
Reviewed-by: Arnaldo Carvalho de Melo
Tested-by: "Luis Claudio R. Goncalves"
Signed-off-by: Daniel Bristot de Oliveira
Signed-off-by: Paul E. McKenney

Daniel Bristot de Oliveira
2016-06-16 07:00:05 +0800

26 May, 2016

1 commit

bdc6b758e Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf updates from Ingo Molnar:
"Mostly tooling and PMU driver fixes, but also a number of late updates
such as the reworking of the call-chain size limiting logic to make
call-graph recording more robust, plus tooling side changes for the
new 'backwards ring-buffer' extension to the perf ring-buffer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
perf record: Read from backward ring buffer
perf record: Rename variable to make code clear
perf record: Prevent reading invalid data in record__mmap_read
perf evlist: Add API to pause/resume
perf trace: Use the ptr->name beautifier as default for "filename" args
perf trace: Use the fd->name beautifier as default for "fd" args
perf report: Add srcline_from/to branch sort keys
perf evsel: Record fd into perf_mmap
perf evsel: Add overwrite attribute and check write_backward
perf tools: Set buildid dir under symfs when --symfs is provided
perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
perf annotate: Sort list of recognised instructions
perf annotate: Fix identification of ARM blt and bls instructions
perf tools: Fix usage of max_stack sysctl
perf callchain: Stop validating callchains by the max_stack sysctl
perf trace: Fix exit_group() formatting
perf top: Use machine->kptr_restrict_warned
perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
perf machine: Do not bail out if not managing to read ref reloc symbol
perf/x86/intel/p4: Trival indentation fix, remove space
...

Linus Torvalds
2016-05-26 08:05:40 +0800

20 May, 2016

1 commit

52b6f46bc mm: /proc/sys/vm/stat_refresh to force vmstat update ... Browse Code »

Provide /proc/sys/vm/stat_refresh to force an immediate update of
per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
before checking counts when testing. Originally added to work around a
bug which left counts stranded indefinitely on a cpu going idle (an
inaccuracy magnified when small below-batch numbers represent "huge"
amounts of memory), but I believe that bug is now fixed: nonetheless,
this is still a useful knob.

Its schedule_on_each_cpu() is probably too expensive just to fold into
reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
Allow a write or a read to do the same: nothing to read, but "grep -h
Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
since global_page_state() itself is careful to disguise any underflow as
0, hack in an "Invalid argument" and pr_warn() if a counter is negative
after the refresh - this helped to fix a misaccounting of
NR_ISOLATED_FILE in my migration code.

But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
often go negative some of the time. I have not yet worked out why, but
have no evidence that it's actually harmful. Punt for the moment by
just ignoring the anomaly on those.

Signed-off-by: Hugh Dickins
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Andres Lagar-Cavilla
Cc: Yang Shi
Cc: Ning Qu
Cc: Mel Gorman
Cc: Andres Lagar-Cavilla
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2016-05-20 10:12:14 +0800

18 May, 2016

1 commit

a7fd20d1c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:
"Highlights:

1) Support SPI based w5100 devices, from Akinobu Mita.

2) Partial Segmentation Offload, from Alexander Duyck.

3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

4) Allow cls_flower stats offload, from Amir Vadai.

5) Implement bpf blinding, from Daniel Borkmann.

6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.

7) Run TCP more preemptibly, also from Eric Dumazet.

8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.

9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

11) Support tunneling offloads in qed, from Manish Chopra.

12) aRFS offloading in mlx5e, from Maor Gottlieb.

13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.

14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.

15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.

16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.

18) Checksum neutral ILA in ipv6, from Tom Herbert.

19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot

20) Add VF support to qed driver, from Yuval Mintz"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...

Linus Torvalds
2016-05-18 07:26:30 +0800

17 May, 2016

2 commits

c85b03349 perf core: Separate accounting of contexts and real addresses in a stack trace ... Browse Code »

The perf_sample->ip_callchain->nr value includes all the entries in the
ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
while what the user expects is that what is in the kernel.perf_event_max_stack
sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
honoured in terms of IP addresses in the stack trace.

So allocate a bunch of extra entries for contexts, and do the accounting
via perf_callchain_entry_ctx struct members.

A new sysctl, kernel.perf_event_max_contexts_per_stack is also
introduced for investigating possible bugs in the callchain
implementation by some arch.

Cc: Adrian Hunter
Cc: Alexander Shishkin
Cc: Alexei Starovoitov
Cc: Brendan Gregg
Cc: David Ahern
Cc: Frederic Weisbecker
Cc: He Kuang
Cc: Jiri Olsa
Cc: Masami Hiramatsu
Cc: Milian Wolff
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: Wang Nan
Cc: Zefan Li
Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2016-05-17 10:11:53 +0800
4f3446bb8 bpf: add generic constant blinding for use in jits ... Browse Code »

This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.

We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.

For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.

This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.

The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.

Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

echo 0 > /proc/sys/net/core/bpf_jit_harden

ffffffffa034f5e9 + :
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]

echo 1 > /proc/sys/net/core/bpf_jit_harden

ffffffffa034f1e5 + :
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]

As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.

Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).

BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.

[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

Signed-off-by: Daniel Borkmann
Reviewed-by: Elena Reshetova
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2016-05-17 01:49:32 +0800

11 May, 2016

1 commit

d2950158d Merge branch 'perf/urgent' into perf/core, to pick up fixes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-05-11 22:56:38 +0800

10 May, 2016

1 commit

0161028b7 perf/core: Change the default paranoia level to 2 ... Browse Code »

Allowing unprivileged kernel profiling lets any user dump follow kernel
control flow and dump kernel registers. This most likely allows trivial
kASLR bypassing, and it may allow other mischief as well. (Off the top
of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads
could be quite interesting.)

Signed-off-by: Andy Lutomirski
Acked-by: Kees Cook
Signed-off-by: Linus Torvalds

Andy Lutomirski
2016-05-10 08:57:12 +0800

05 May, 2016

1 commit

1a618c2cf Merge branch 'perf/urgent' into perf/core, to pick up fixes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2016-05-05 16:12:37 +0800

29 Apr, 2016

1 commit

7c88a292d Documentation/sysctl/vm.txt: update numa_zonelist_order description ... Browse Code »

Commit 3193913ce62c ("mm: page_alloc: default node-ordering on 64-bit
NUMA, zone-ordering on 32-bit") changes the default value of
numa_zonelist_order. Update the document.

Signed-off-by: Xishi Qiu
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Rik van Riel
Cc: David Rientjes
Cc: Kamezawa Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xishi Qiu
2016-04-29 10:34:04 +0800

27 Apr, 2016

1 commit

c5dfd78eb perf core: Allow setting up max frame stack depth via sysctl ... Browse Code »

The default remains 127, which is good for most cases, and not even hit
most of the time, but then for some cases, as reported by Brendan, 1024+
deep frames are appearing on the radar for things like groovy, ruby.

And in some workloads putting a _lower_ cap on this may make sense. One
that is per event still needs to be put in place tho.

The new file is:

# cat /proc/sys/kernel/perf_event_max_stack
127

Chaging it:

# echo 256 > /proc/sys/kernel/perf_event_max_stack
# cat /proc/sys/kernel/perf_event_max_stack
256

But as soon as there is some event using callchains we get:

# echo 512 > /proc/sys/kernel/perf_event_max_stack
-bash: echo: write error: Device or resource busy
#

Because we only allocate the callchain percpu data structures when there
is a user, which allows for changing the max easily, its just a matter
of having no callchain users at that point.

Reported-and-Tested-by: Brendan Gregg
Reviewed-by: Frederic Weisbecker
Acked-by: Alexei Starovoitov
Acked-by: David Ahern
Cc: Adrian Hunter
Cc: Alexander Shishkin
Cc: He Kuang
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Masami Hiramatsu
Cc: Milian Wolff
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: Wang Nan
Cc: Zefan Li
Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2016-04-27 21:20:39 +0800

19 Mar, 2016

1 commit

814a2bf95 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge second patch-bomb from Andrew Morton:

- a couple of hotfixes

- the rest of MM

- a new timer slack control in procfs

- a couple of procfs fixes

- a few misc things

- some printk tweaks

- lib/ updates, notably to radix-tree.

- add my and Nick Piggin's old userspace radix-tree test harness to
tools/testing/radix-tree/. Matthew said it was a godsend during the
radix-tree work he did.

- a few code-size improvements, switching to __always_inline where gcc
screwed up.

- partially implement character sets in sscanf

* emailed patches from Andrew Morton : (118 commits)
sscanf: implement basic character sets
lib/bug.c: use common WARN helper
param: convert some "on"/"off" users to strtobool
lib: add "on"/"off" support to kstrtobool
lib: update single-char callers of strtobool()
lib: move strtobool() to kstrtobool()
include/linux/unaligned: force inlining of byteswap operations
include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
usb: common: convert to use match_string() helper
ide: hpt366: convert to use match_string() helper
ata: hpt366: convert to use match_string() helper
power: ab8500: convert to use match_string() helper
power: charger_manager: convert to use match_string() helper
drm/edid: convert to use match_string() helper
pinctrl: convert to use match_string() helper
device property: convert to use match_string() helper
lib/string: introduce match_string() helper
radix-tree tests: add test for radix_tree_iter_next
radix-tree tests: add regression3 test
...

Linus Torvalds
2016-03-19 10:26:54 +0800

18 Mar, 2016

2 commits

795ae7a0d mm: scale kswapd watermarks in proportion to memory ... Browse Code »

In machines with 140G of memory and enterprise flash storage, we have
seen read and write bursts routinely exceed the kswapd watermarks and
cause thundering herds in direct reclaim. Unfortunately, the only way
to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
system's emergency reserves - which is entirely unrelated to the
system's latency requirements. In order to get kswapd to maintain a
250M buffer of free memory, the emergency reserves need to be set to 1G.
That is a lot of memory wasted for no good reason.

On the other hand, it's reasonable to assume that allocation bursts and
overall allocation concurrency scale with memory capacity, so it makes
sense to make kswapd aggressiveness a function of that as well.

Change the kswapd watermark scale factor from the currently fixed 25% of
the tunable emergency reserve to a tunable 0.1% of memory.

Beyond 1G of memory, this will produce bigger watermark steps than the
current formula in default settings. Ensure that the new formula never
chooses steps smaller than that, i.e. 25% of the emergency reserve.

On a 140G machine, this raises the default watermark steps - the
distance between min and low, and low and high - from 16M to 143M.

Signed-off-by: Johannes Weiner
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Acked-by: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2016-03-18 06:09:34 +0800
96b9b1c95 Merge tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty ... Browse Code »

Pull tty/serial updates from Greg KH:
"Here's the big tty/serial driver pull request for 4.6-rc1.

Lots of changes in here, Peter has been on a tear again, with lots of
refactoring and bugs fixes, many thanks to the great work he has been
doing. Lots of driver updates and fixes as well, full details in the
shortlog.

All have been in linux-next for a while with no reported issues"

* tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (220 commits)
serial: 8250: describe CONFIG_SERIAL_8250_RSA
serial: samsung: optimize UART rx fifo access routine
serial: pl011: add mark/space parity support
serial: sa1100: make sa1100_register_uart_fns a function
tty: serial: 8250: add MOXA Smartio MUE boards support
serial: 8250: convert drivers to use up_to_u8250p()
serial: 8250/mediatek: fix building with SERIAL_8250=m
serial: 8250/ingenic: fix building with SERIAL_8250=m
serial: 8250/uniphier: fix modular build
Revert "drivers/tty/serial: make 8250/8250_ingenic.c explicitly non-modular"
Revert "drivers/tty/serial: make 8250/8250_mtk.c explicitly non-modular"
serial: mvebu-uart: initial support for Armada-3700 serial port
serial: mctrl_gpio: Add missing module license
serial: ifx6x60: avoid uninitialized variable use
tty/serial: at91: fix bad offset for UART timeout register
tty/serial: at91: restore dynamic driver binding
serial: 8250: Add hardware dependency to RT288X option
TTY, devpts: document pty count limiting
tty: goldfish: support platform_device with id -1
drivers: tty: goldfish: Add device tree bindings
...

Linus Torvalds
2016-03-18 04:53:25 +0800

15 Mar, 2016

1 commit

d4e796152 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:

- Make schedstats a runtime tunable (disabled by default) and
optimize it via static keys.

As most distributions enable CONFIG_SCHEDSTATS=y due to its
instrumentation value, this is a nice performance enhancement.
(Mel Gorman)

- Implement 'simple waitqueues' (swait): these are just pure
waitqueues without any of the more complex features of full-blown
waitqueues (callbacks, wake flags, wake keys, etc.). Simple
waitqueues have less memory overhead and are faster.

Use simple waitqueues in the RCU code (in 4 different places) and
for handling KVM vCPU wakeups.

(Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
Marcelo Tosatti)

- sched/numa enhancements (Rik van Riel)

- NOHZ performance enhancements (Rik van Riel)

- Various sched/deadline enhancements (Steven Rostedt)

- Various fixes (Peter Zijlstra)

- ... and a number of other fixes, cleanups and smaller enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
sched/cputime: Fix steal_account_process_tick() to always return jiffies
sched/deadline: Remove dl_new from struct sched_dl_entity
Revert "kbuild: Add option to turn incompatible pointer check into error"
sched/deadline: Remove superfluous call to switched_to_dl()
sched/debug: Fix preempt_disable_ip recording for preempt_disable()
sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
time, acct: Drop irq save & restore from __acct_update_integrals()
acct, time: Change indentation in __acct_update_integrals()
sched, time: Remove non-power-of-two divides from __acct_update_integrals()
sched/rt: Kick RT bandwidth timer immediately on start up
sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
sched/debug: Move sched_domain_sysctl to debug.c
sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
sched/rt: Fix PI handling vs. sched_setscheduler()
sched/core: Remove duplicated sched_group_set_shares() prototype
sched/fair: Consolidate nohz CPU load update code
sched/fair: Avoid using decay_load_missed() with a negative value
sched/deadline: Always calculate end of period on sched_yield()
sched/cgroup: Fix cgroup entity load tracking tear-down
rcu: Use simple wait queues where possible in rcutree
...

Linus Torvalds
2016-03-15 10:14:06 +0800

08 Mar, 2016

1 commit

8b253b07e TTY, devpts: document pty count limiting ... Browse Code »

Logic has been changed in kernel 3.4 by commit e9aba5158a80
("tty: rework pty count limiting") but still not documented.

Sysctl kernel.pty.max works as global limit, kernel.pty.reserve ptys
are reserved for initial devpts instance (mounted without "newinstance").
Per-instance limit also could be set by mount option "max=%d".

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Greg Kroah-Hartman

Konstantin Khlebnikov
2016-03-08 08:11:14 +0800

09 Feb, 2016

1 commit

cb2517653 sched/debug: Make schedstats a runtime tunable that is disabled by default ... Browse Code »

schedstats is very useful during debugging and performance tuning but it
incurs overhead to calculate the stats. As such, even though it can be
disabled at build time, it is often enabled as the information is useful.

This patch adds a kernel command-line and sysctl tunable to enable or
disable schedstats on demand (when it's built in). It is disabled
by default as someone who knows they need it can also learn to enable
it when necessary.

The benefits are dependent on how scheduler-intensive the workload is.
If it is then the patch reduces the number of cycles spent calculating
the stats with a small benefit from reducing the cache footprint of the
scheduler.

These measurements were taken from a 48-core 2-socket
machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
single socket machine 8-core machine with Intel i7-3770 processors.

netperf-tcp
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Hmean 64 560.45 ( 0.00%) 575.98 ( 2.77%)
Hmean 128 766.66 ( 0.00%) 795.79 ( 3.80%)
Hmean 256 950.51 ( 0.00%) 981.50 ( 3.26%)
Hmean 1024 1433.25 ( 0.00%) 1466.51 ( 2.32%)
Hmean 2048 2810.54 ( 0.00%) 2879.75 ( 2.46%)
Hmean 3312 4618.18 ( 0.00%) 4682.09 ( 1.38%)
Hmean 4096 5306.42 ( 0.00%) 5346.39 ( 0.75%)
Hmean 8192 10581.44 ( 0.00%) 10698.15 ( 1.10%)
Hmean 16384 18857.70 ( 0.00%) 18937.61 ( 0.42%)

Small gains here, UDP_STREAM showed nothing intresting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.

tbench4
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Hmean mb/sec-1 500.85 ( 0.00%) 522.43 ( 4.31%)
Hmean mb/sec-2 984.66 ( 0.00%) 1018.19 ( 3.41%)
Hmean mb/sec-4 1827.91 ( 0.00%) 1847.78 ( 1.09%)
Hmean mb/sec-8 3561.36 ( 0.00%) 3611.28 ( 1.40%)
Hmean mb/sec-16 5824.52 ( 0.00%) 5929.03 ( 1.79%)
Hmean mb/sec-32 10943.10 ( 0.00%) 10802.83 ( -1.28%)
Hmean mb/sec-64 15950.81 ( 0.00%) 16211.31 ( 1.63%)
Hmean mb/sec-128 15302.17 ( 0.00%) 15445.11 ( 0.93%)
Hmean mb/sec-256 14866.18 ( 0.00%) 15088.73 ( 1.50%)
Hmean mb/sec-512 15223.31 ( 0.00%) 15373.69 ( 0.99%)
Hmean mb/sec-1024 14574.25 ( 0.00%) 14598.02 ( 0.16%)
Hmean mb/sec-2048 13569.02 ( 0.00%) 13733.86 ( 1.21%)
Hmean mb/sec-3072 12865.98 ( 0.00%) 13209.23 ( 2.67%)

Small gains of 2-4% at low thread counts and otherwise flat. The
gains on the 8-core machine were slightly different

tbench4 on 8-core i7-3770 single socket machine
Hmean mb/sec-1 442.59 ( 0.00%) 448.73 ( 1.39%)
Hmean mb/sec-2 796.68 ( 0.00%) 794.39 ( -0.29%)
Hmean mb/sec-4 1322.52 ( 0.00%) 1343.66 ( 1.60%)
Hmean mb/sec-8 2611.65 ( 0.00%) 2694.86 ( 3.19%)
Hmean mb/sec-16 2537.07 ( 0.00%) 2609.34 ( 2.85%)
Hmean mb/sec-32 2506.02 ( 0.00%) 2578.18 ( 2.88%)
Hmean mb/sec-64 2511.06 ( 0.00%) 2569.16 ( 2.31%)
Hmean mb/sec-128 2313.38 ( 0.00%) 2395.50 ( 3.55%)
Hmean mb/sec-256 2110.04 ( 0.00%) 2177.45 ( 3.19%)
Hmean mb/sec-512 2072.51 ( 0.00%) 2053.97 ( -0.89%)

In constract, this shows a relatively steady 2-3% gain at higher thread
counts. Due to the nature of the patch and the type of workload, it's
not a surprise that the result will depend on the CPU used.

hackbench-pipes
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v3r1
Amean 1 0.0637 ( 0.00%) 0.0660 ( -3.59%)
Amean 4 0.1229 ( 0.00%) 0.1181 ( 3.84%)
Amean 7 0.1921 ( 0.00%) 0.1911 ( 0.52%)
Amean 12 0.3117 ( 0.00%) 0.2923 ( 6.23%)
Amean 21 0.4050 ( 0.00%) 0.3899 ( 3.74%)
Amean 30 0.4586 ( 0.00%) 0.4433 ( 3.33%)
Amean 48 0.5910 ( 0.00%) 0.5694 ( 3.65%)
Amean 79 0.8663 ( 0.00%) 0.8626 ( 0.43%)
Amean 110 1.1543 ( 0.00%) 1.1517 ( 0.22%)
Amean 141 1.4457 ( 0.00%) 1.4290 ( 1.16%)
Amean 172 1.7090 ( 0.00%) 1.6924 ( 0.97%)
Amean 192 1.9126 ( 0.00%) 1.9089 ( 0.19%)

Some small gains and losses and while the variance data is not included,
it's close to the noise. The UMA machine did not show anything particularly
different

pipetest
4.5.0-rc1 4.5.0-rc1
vanilla nostats-v2r2
Min Time 4.13 ( 0.00%) 3.99 ( 3.39%)
1st-qrtle Time 4.38 ( 0.00%) 4.27 ( 2.51%)
2nd-qrtle Time 4.46 ( 0.00%) 4.39 ( 1.57%)
3rd-qrtle Time 4.56 ( 0.00%) 4.51 ( 1.10%)
Max-90% Time 4.67 ( 0.00%) 4.60 ( 1.50%)
Max-93% Time 4.71 ( 0.00%) 4.65 ( 1.27%)
Max-95% Time 4.74 ( 0.00%) 4.71 ( 0.63%)
Max-99% Time 4.88 ( 0.00%) 4.79 ( 1.84%)
Max Time 4.93 ( 0.00%) 4.83 ( 2.03%)
Mean Time 4.48 ( 0.00%) 4.39 ( 1.91%)
Best99%Mean Time 4.47 ( 0.00%) 4.39 ( 1.91%)
Best95%Mean Time 4.46 ( 0.00%) 4.38 ( 1.93%)
Best90%Mean Time 4.45 ( 0.00%) 4.36 ( 1.98%)
Best50%Mean Time 4.36 ( 0.00%) 4.25 ( 2.49%)
Best10%Mean Time 4.23 ( 0.00%) 4.10 ( 3.13%)
Best5%Mean Time 4.19 ( 0.00%) 4.06 ( 3.20%)
Best1%Mean Time 4.13 ( 0.00%) 4.00 ( 3.39%)

Small improvement and similar gains were seen on the UMA machine.

The gain is small but it stands to reason that doing less work in the
scheduler is a good thing. The downside is that the lack of schedstats and
tracepoints may be surprising to experts doing performance analysis until
they find the existence of the schedstats= parameter or schedstats sysctl.
It will be automatically activated for latencytop and sleep profiling to
alleviate the problem. For tracepoints, there is a simple warning as it's
not safe to activate schedstats in the context when it's known the tracepoint
may be wanted but is unavailable.

Signed-off-by: Mel Gorman
Reviewed-by: Matt Fleming
Reviewed-by: Srikar Dronamraju
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar

Mel Gorman
2016-02-09 18:54:23 +0800

03 Feb, 2016

1 commit

402e8db5c Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git… ... Browse Code »

…/acme/linux into perf/core

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

User visible changes:

- Rename the "colors.code" ~/.perfconfig variable to "colors.jump_arrows",
as it controls just the that UI element in the annotate browser (Taeung Song)

- Avoid trying to read ELF symtabs from device files, noticed while doing
memory profiling work (Jiri Olsa)

- Improve context detection when offering options in the hists browser,
i.e. some options don't make sense when the browser is not working with
a perf.data file ('perf top' mode), only in 'perf report' mode, like
scripting (Namhyung Kim)

Infrastructure changes:

- Elliminate duplication in the hists browser filter functions, getting the
common part into a function that receives callbacks for filtering by
DSO, thread, etc. (Namhyung Kim)

- Fix misleadingly indented assignment, found using
gcc6 -Wmisleading-indentation (Markus Trippelsdorf)

- Handle LLVM relocation oddities in libbpf, introducing a 'perf test' that
detects such problems and then fixing the problem, so that the test now
passes (Wang Nan)

- More improvements to the build infrastructure to allow reusing the
feature detection facilities (Wang Nan)

- Auto initialize the globals needed by cpu__max_{cpu,node}() routines
(Arnaldo Carvalho de Melo)

Documentation changes:

- Document the perf sysctls in Documentation/sysctl/kernel.txt (Ben Hutchings)

- Document a bunch more ~/.perfconfig knobs (Taeung Song)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Ingo Molnar
2016-02-03 17:58:46 +0800

26 Jan, 2016

1 commit

3379e0c3e perf tools: Document the perf sysctls ... Browse Code »

perf_event_paranoid was only documented in source code and a perf error
message. Copy the documentation from the error message to
Documentation/sysctl/kernel.txt.

perf_cpu_time_max_percent was already documented but missing from the
list at the top, so add it there.

Signed-off-by: Ben Hutchings
Cc: Peter Zijlstra
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/20160119213515.GG2637@decadent.org.uk
[ Remove reference to external Documentation file, provide info inline, as before ]
Signed-off-by: Arnaldo Carvalho de Melo

Ben Hutchings
2016-01-26 22:52:45 +0800

23 Jan, 2016

1 commit

eadee0ce6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull more vfs updates from Al Viro:
"Embarrassing braino fix + pipe page accounting + fixing an eyesore in
find_filesystem() (checking that s1 is equal to prefix of s2 of given
length can be done in many ways, but "compare strlen(s1) with length
and then do strncmp()" is not a good one...)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[regression] fix braino in fs/dlm/user.c
pipe: limit the per-user amount of pages allocated in pipes
find_filesystem(): simplify comparison

Linus Torvalds
2016-01-23 02:24:03 +0800

21 Jan, 2016

1 commit

41662f5cc sysctl: enable strict writes ... Browse Code »

SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for
strict write position handling"), and released in v3.16 in August of
2014. Since then I can find only 1 instance of non-zero offset
writing[1], and it was fixed immediately in CRIU[2]. As such, it
appears safe to flip this to the strict state now.

[1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
[2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

Signed-off-by: Kees Cook
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2016-01-21 09:09:18 +0800