Eric Lee / smarc-fsl-linux-kernel

01 Dec, 2018

1 commit

7bcfd8f98 namei: allow restricted O_CREAT of FIFOs and regular files ... Browse Code »

commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5 upstream.

Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.

This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:

CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489

This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.

[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca
Signed-off-by: Kees Cook
Suggested-by: Solar Designer
Suggested-by: Kees Cook
Cc: Al Viro
Cc: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Cc: Loic
Signed-off-by: Greg Kroah-Hartman

Salvatore Mesoraca
2018-12-01 16:42:59 +0800

22 Feb, 2018

1 commit

f369f1486 kmemcheck: rip it out ... Browse Code »

commit 4675ff05de2d76d167336b368bd07f3fef6ed5a6 upstream.

Fix up makefiles, remove references, and git rm kmemcheck.

Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
Signed-off-by: Sasha Levin
Cc: Steven Rostedt
Cc: Vegard Nossum
Cc: Pekka Enberg
Cc: Michal Hocko
Cc: Eric W. Biederman
Cc: Alexander Potapenko
Cc: Tim Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Levin, Alexander (Sasha Levin)
2018-02-22 22:42:24 +0800

14 Dec, 2017

1 commit

30c2f774e pipe: match pipe_max_size data type with procfs ... Browse Code »

[ Upstream commit 98159d977f71c3b3dee898d1c34e56f520b094e7 ]

Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.

While backporting Michael's "pipe: fix limit handling" patchset to a
distro-kernel, Mikulas noticed that current upstream pipe limit handling
contains a few problems:

1 - procfs signed wrap: echo'ing a large number into
/proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
negative value.

2 - round_pipe_size() nr_pages overflow on 32bit: this would
subsequently try roundup_pow_of_two(0), which is undefined.

3 - visible non-rounded pipe-max-size value: there is no mutual
exclusion or protection between the time pipe_max_size is assigned
a raw value from proc_dointvec_minmax() and when it is rounded.

4 - unsigned long -> unsigned int conversion makes for potential odd
return errors from do_proc_douintvec_minmax_conv() and
do_proc_dopipe_max_size_conv().

This version underwent the same testing as v1:
https://marc.info/?l=linux-kernel&m=150643571406022&w=2

This patch (of 4):

pipe_max_size is defined as an unsigned int:

unsigned int pipe_max_size = 1048576;

but its procfs/sysctl representation is an integer:

static struct ctl_table fs_table[] = {
...
{
.procname = "pipe-max-size",
.data = &pipe_max_size,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
...

that is signed:

int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
size_t *lenp, loff_t *ppos)
{
...
ret = proc_dointvec_minmax(table, write, buf, lenp, ppos)

This leads to signed results via procfs for large values of pipe_max_size:

% echo 2147483647 >/proc/sys/fs/pipe-max-size
% cat /proc/sys/fs/pipe-max-size
-2147483648

Use unsigned operations on this variable to avoid such negative values.

Link: http://lkml.kernel.org/r/1507658689-11669-2-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence
Reported-by: Mikulas Patocka
Reviewed-by: Mikulas Patocka
Cc: Michael Kerrisk
Cc: Randy Dunlap
Cc: Al Viro
Cc: Jens Axboe
Cc: Josh Poimboeuf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Joe Lawrence
2017-12-14 16:53:08 +0800

06 Oct, 2017

1 commit

27efed3e8 Merge branch 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull watchddog clean-up and fixes from Thomas Gleixner:
"The watchdog (hard/softlockup detector) code is pretty much broken in
its current state. The patch series addresses this by removing all
duct tape and refactoring it into a workable state.

The reasons why I ask for inclusion that late in the cycle are:

1) The code causes lockdep splats vs. hotplug locking which get
reported over and over. Unfortunately there is no easy fix.

2) The risk of breakage is minimal because it's already broken

3) As 4.14 is a long term stable kernel, I prefer to have working
watchdog code in that and the lockdep issues resolved. I wouldn't
ask you to pull if 4.14 wouldn't be a LTS kernel or if the
solution would be easy to backport.

4) The series was around before the merge window opened, but then got
delayed due to the UP failure caused by the for_each_cpu()
surprise which we discussed recently.

Changes vs. V1:

- Addressed your review points

- Addressed the warning in the powerpc code which was discovered late

- Changed two function names which made sense up to a certain point
in the series. Now they match what they do in the end.

- Fixed a 'unused variable' warning, which got not detected by the
intel robot. I triggered it when trying all possible related config
combinations manually. Randconfig testing seems not random enough.

The changes have been tested by and reviewed by Don Zickus and tested
and acked by Micheal Ellerman for powerpc"

* 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
watchdog/core: Put softlockup_threads_initialized under ifdef guard
watchdog/core: Rename some softlockup_* functions
powerpc/watchdog: Make use of watchdog_nmi_probe()
watchdog/core, powerpc: Lock cpus across reconfiguration
watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
watchdog/hardlockup/perf: Fix spelling mistake: "permanetely" -> "permanently"
watchdog/hardlockup/perf: Cure UP damage
watchdog/hardlockup: Clean up hotplug locking mess
watchdog/hardlockup/perf: Simplify deferred event destroy
watchdog/hardlockup/perf: Use new perf CPU enable mechanism
watchdog/hardlockup/perf: Implement CPU enable replacement
watchdog/hardlockup/perf: Implement init time detection of perf
watchdog/hardlockup/perf: Implement init time perf validation
watchdog/core: Get rid of the racy update loop
watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage
watchdog/sysctl: Clean up sysctl variable name space
watchdog/sysctl: Get rid of the #ifdeffery
watchdog/core: Clean up header mess
watchdog/core: Further simplify sysctl handling
watchdog/core: Get rid of the thread teardown/setup dance
...

Linus Torvalds
2017-10-06 23:36:41 +0800

04 Oct, 2017

1 commit

3181c38e4 kernel/sysctl.c: remove duplicate UINT_MAX check on do_proc_douintvec_conv() ... Browse Code »

do_proc_douintvec_conv() has two UINT_MAX checks, we can remove one.
This has no functional changes other than fixing a compiler warning:

kernel/sysctl.c:2190]: (warning) Identical condition '*lvalp>UINT_MAX', second condition is always false

Fixes: 4f2fec00afa60 ("sysctl: simplify unsigned int support")
Link: http://lkml.kernel.org/r/20170919072918.12066-1-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez
Reported-by: David Binderman
Acked-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis R. Rodriguez
2017-10-04 08:54:25 +0800

29 Sep, 2017

1 commit

5ccba44ba sched/sysctl: Check user input value of sysctl_sched_time_avg ... Browse Code »

System will hang if user set sysctl_sched_time_avg to 0:

[root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

Stack traceback for pid 0
0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
Call Trace:
[] ? update_group_capacity+0x110/0x200
[] ? update_sd_lb_stats+0x109/0x600
[] ? find_busiest_group+0x47/0x530
[] ? load_balance+0x194/0x900
[] ? update_rq_clock.part.83+0x1a/0xe0
[] ? rebalance_domains+0x152/0x290
[] ? run_rebalance_domains+0xdc/0x1d0
[] ? __do_softirq+0xfb/0x320
[] ? irq_exit+0x125/0x130
[] ? scheduler_ipi+0x97/0x160
[] ? smp_reschedule_interrupt+0x29/0x30
[] ? reschedule_interrupt+0x6e/0x80
[] ? cpuidle_enter_state+0xcc/0x230
[] ? cpuidle_enter_state+0x9c/0x230
[] ? cpuidle_enter+0x17/0x20
[] ? cpu_startup_entry+0x38c/0x420
[] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function:

update_group_capacity()
update_cpu_capacity()
scale_rt_capacity()
{
...
total = sched_avg_period() + delta;
used = div_u64(avg, total);
...
}

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hitting invalid input, and set the minimum limit of
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran
Signed-off-by: Ethan Zhao
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: efault@gmx.de
Cc: ethan.kernel@gmail.com
Cc: keescook@chromium.org
Cc: mcgrof@kernel.org
Cc:
Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
Signed-off-by: Ingo Molnar

Ethan Zhao
2017-09-29 19:20:13 +0800

14 Sep, 2017

2 commits

7feeb9cd4 watchdog/sysctl: Clean up sysctl variable name space ... Browse Code »

Reflect that these variables are user interface related and remove the
whitespace damage in the sysctl table while at it.

Signed-off-by: Thomas Gleixner
Reviewed-by: Don Zickus
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: Chris Metcalf
Cc: Linus Torvalds
Cc: Nicholas Piggin
Cc: Peter Zijlstra
Cc: Sebastian Siewior
Cc: Ulrich Obergfell
Link: http://lkml.kernel.org/r/20170912194147.783210221@linutronix.de
Signed-off-by: Ingo Molnar

Thomas Gleixner
2017-09-14 17:41:07 +0800
51d4052b0 watchdog/sysctl: Get rid of the #ifdeffery ... Browse Code »

The sysctl of the nmi_watchdog file prevents writes by setting:

min = max = 0

if none of the users is enabled. That involves ifdeffery and is competely
non obvious.

If none of the facilities is enabeld, then the file can simply be made read
only. Move the ifdeffery into the header and use a constant for file
permissions.

Signed-off-by: Thomas Gleixner
Reviewed-by: Don Zickus
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: Chris Metcalf
Cc: Linus Torvalds
Cc: Nicholas Piggin
Cc: Peter Zijlstra
Cc: Sebastian Siewior
Cc: Ulrich Obergfell
Link: http://lkml.kernel.org/r/20170912194147.706073616@linutronix.de
Signed-off-by: Ingo Molnar

Thomas Gleixner
2017-09-14 17:41:07 +0800

13 Jul, 2017

5 commits

05a4a9527 kernel/watchdog: split up config options ... Browse Code »

Split SOFTLOCKUP_DETECTOR from LOCKUP_DETECTOR, and split
HARDLOCKUP_DETECTOR_PERF from HARDLOCKUP_DETECTOR.

LOCKUP_DETECTOR implies the general boot, sysctl, and programming
interfaces for the lockup detectors.

An architecture that wants to use a hard lockup detector must define
HAVE_HARDLOCKUP_DETECTOR_PERF or HAVE_HARDLOCKUP_DETECTOR_ARCH.

Alternatively an arch can define HAVE_NMI_WATCHDOG, which provides the
minimum arch_touch_nmi_watchdog, and it otherwise does its own thing and
does not implement the LOCKUP_DETECTOR interfaces.

sparc is unusual in that it has started to implement some of the
interfaces, but not fully yet. It should probably be converted to a full
HAVE_HARDLOCKUP_DETECTOR_ARCH.

[npiggin@gmail.com: fix]
Link: http://lkml.kernel.org/r/20170617223522.66c0ad88@roar.ozlabs.ibm.com
Link: http://lkml.kernel.org/r/20170616065715.18390-4-npiggin@gmail.com
Signed-off-by: Nicholas Piggin
Reviewed-by: Don Zickus
Reviewed-by: Babu Moger
Tested-by: Babu Moger [sparc]
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Michael Ellerman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nicholas Piggin
2017-07-13 07:26:02 +0800
61d9b56a8 sysctl: add unsigned int range support ... Browse Code »

To keep parity with regular int interfaces provide the an unsigned int
proc_douintvec_minmax() which allows you to specify a range of allowed
valid numbers.

Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
actual user for that.

Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez
Acked-by: Kees Cook
Cc: Subash Abhinov Kasiviswanathan
Cc: Heinrich Schuchardt
Cc: Kees Cook
Cc: "David S. Miller"
Cc: Ingo Molnar
Cc: Al Viro
Cc: "Eric W. Biederman"
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis R. Rodriguez
2017-07-13 07:26:00 +0800
4f2fec00a sysctl: simplify unsigned int support ... Browse Code »

Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
fields") added proc_douintvec() to start help adding support for
unsigned int, this however was only half the work needed. Two fixes
have come in since then for the following issues:

o Printing the values shows a negative value, this happens since
do_proc_dointvec() and this uses proc_put_long()

This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
flag for proc_douintvec").

o We can easily wrap around the int values: UINT_MAX is 4294967295, if
we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
end up with 1.
o We echo negative values in and they are accepted

This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
is larger than UINT_MAX for proc_douintvec").

It still also failed to be added to sysctl_check_table()... instead of
adding it with the current implementation just provide a proper and
simplified unsigned int support without any array unsigned int support
with no negative support at all.

Historically sysctl proc helpers have supported arrays, due to the
complexity this adds though we've taken a step back to evaluate array
users to determine if its worth upkeeping for unsigned int. An
evaluation using Coccinelle has been done to perform a grammatical
search to ask ourselves:

o How many sysctl proc_dointvec() (int) users exist which likely
should be moved over to proc_douintvec() (unsigned int) ?
Answer: about 8
- Of these how many are array users ?
Answer: Probably only 1
o How many sysctl array users exist ?
Answer: about 12

This last question gives us an idea just how popular arrays: they are not.
Array support should probably just be kept for strings.

The identified uint ports are:

drivers/infiniband/core/ucma.c - max_backlog
drivers/infiniband/core/iwcm.c - default_backlog
net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
net/phonet/sysctl.c proc_local_port_range()

The only possible array users is proc_local_port_range() but it does not
seem worth it to add array support just for this given the range support
works just as well. Unsigned int support should be desirable more for
when you *need* more than INT_MAX or using int min/max support then does
not suffice for your ranges.

If you forget and by mistake happen to register an unsigned int proc
entry with an array, the driver will fail and you will get something as
follows:

sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Call Trace:
dump_stack+0x63/0x81
__register_sysctl_table+0x350/0x650
? kmem_cache_alloc_trace+0x107/0x240
__register_sysctl_paths+0x1b3/0x1e0
? 0xffffffffc005f000
register_sysctl_table+0x1f/0x30
test_sysctl_init+0x10/0x1000 [test_sysctl]
do_one_initcall+0x52/0x1a0
? kmem_cache_alloc_trace+0x107/0x240
do_init_module+0x5f/0x200
load_module+0x1867/0x1bd0
? __symbol_put+0x60/0x60
SYSC_finit_module+0xdf/0x110
SyS_finit_module+0xe/0x10
entry_SYSCALL_64_fastpath+0x1e/0xad
RIP: 0033:0x7f042b22d119

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez
Suggested-by: Alexey Dobriyan
Cc: Subash Abhinov Kasiviswanathan
Cc: Liping Zhang
Cc: Alexey Dobriyan
Cc: Heinrich Schuchardt
Cc: Kees Cook
Cc: "David S. Miller"
Cc: Ingo Molnar
Cc: Al Viro
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis R. Rodriguez
2017-07-13 07:26:00 +0800
d383d4847 sysctl: fold sysctl_writes_strict checks into helper ... Browse Code »

The mode sysctl_writes_strict positional checks keep being copy and pasted
as we add new proc handlers. Just add a helper to avoid code duplication.

Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez
Suggested-by: Kees Cook
Cc: Al Viro
Cc: "Eric W. Biederman"
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis R. Rodriguez
2017-07-13 07:26:00 +0800
a19ac3374 sysctl: kdoc'ify sysctl_writes_strict ... Browse Code »

Document the different sysctl_writes_strict modes in code.

Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez
Cc: Al Viro
Cc: "Eric W. Biederman"
Cc: Alexey Dobriyan
Cc: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis R. Rodriguez
2017-07-13 07:26:00 +0800

09 May, 2017

1 commit

63259457a proc/sysctl: fix the int overflow for jiffies conversion ... Browse Code »

do_proc_dointvec_jiffies_conv() uses LONG_MAX/HZ as the max value to
avoid overflow. But actually the *valp is int type, so it still causes
overflow.

For example,

echo 2147483647 > ./sys/net/ipv4/tcp_keepalive_time

Then,

cat ./sys/net/ipv4/tcp_keepalive_time

The output is "-1", it is not expected.

Now use INT_MAX/HZ as the max value instead LONG_MAX/HZ to fix it.

Link: http://lkml.kernel.org/r/1490109532-9228-1-git-send-email-fgao@ikuai8.com
Signed-off-by: Gao Feng
Cc: Arnaldo Carvalho de Melo
Cc: Ingo Molnar
Cc: Alexey Dobriyan
Cc: Eric Dumazet
Cc: Josh Poimboeuf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gao Feng
2017-05-09 08:15:10 +0800

02 May, 2017

1 commit

174ddfd5d Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull timer updates from Thomas Gleixner:
"The timer departement delivers:

- more year 2038 rework

- a massive rework of the arm achitected timer

- preparatory patches to allow NTP correction of clock event devices
to avoid early expiry

- the usual pile of fixes and enhancements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits)
timer/sysclt: Restrict timer migration sysctl values to 0 and 1
arm64/arch_timer: Mark errata handlers as __maybe_unused
Clocksource/mips-gic: Remove redundant non devicetree init
MIPS/Malta: Probe gic-timer via devicetree
clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK
acpi/arm64: Add SBSA Generic Watchdog support in GTDT driver
clocksource: arm_arch_timer: add GTDT support for memory-mapped timer
acpi/arm64: Add memory-mapped timer support in GTDT driver
clocksource: arm_arch_timer: simplify ACPI support code.
acpi/arm64: Add GTDT table parse driver
clocksource: arm_arch_timer: split MMIO timer probing.
clocksource: arm_arch_timer: add structs to describe MMIO timer
clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init call
clocksource: arm_arch_timer: refactor arch_timer_needs_probing
clocksource: arm_arch_timer: split dt-only rate handling
x86/uv/time: Set ->min_delta_ticks and ->max_delta_ticks
unicore32/time: Set ->min_delta_ticks and ->max_delta_ticks
um/time: Set ->min_delta_ticks and ->max_delta_ticks
tile/time: Set ->min_delta_ticks and ->max_delta_ticks
score/time: Set ->min_delta_ticks and ->max_delta_ticks
...

Linus Torvalds
2017-05-02 07:15:18 +0800

20 Apr, 2017

1 commit

b94bf594c timer/sysclt: Restrict timer migration sysctl values to 0 and 1 ... Browse Code »

timer_migration sysctl acts as a boolean switch, so the allowed values
should be restricted to 0 and 1.

Add the necessary extra fields to the sysctl table entry to enforce that.

[ tglx: Rewrote changelog ]

Signed-off-by: Myungho Jung
Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
Signed-off-by: Thomas Gleixner

Myungho Jung
2017-04-20 20:56:59 +0800

09 Apr, 2017

1 commit

425fffd88 sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec ... Browse Code »

Currently, inputting the following command will succeed but actually the
value will be truncated:

# echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat

This is not friendly to the user, so instead, we should report error
when the value is larger than UINT_MAX.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang
Cc: Subash Abhinov Kasiviswanathan
Cc: Andrew Morton
Cc: Eric W. Biederman
Signed-off-by: Linus Torvalds

Liping Zhang
2017-04-09 01:27:40 +0800

08 Apr, 2017

1 commit

5380e5644 sysctl: don't print negative flag for proc_douintvec ... Browse Code »

I saw some very confusing sysctl output on my system:
# cat /proc/sys/net/core/xfrm_aevent_rseqth
-2
# cat /proc/sys/net/core/xfrm_aevent_etime
-10
# cat /proc/sys/net/ipv4/tcp_notsent_lowat
-4294967295

Because we forget to set the *negp flag in proc_douintvec, so it will
become a garbage value.

Since the value related to proc_douintvec is always an unsigned integer,
so we can set *negp to false explictily to fix this issue.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang
Cc: Subash Abhinov Kasiviswanathan
Cc: Andrew Morton
Signed-off-by: Linus Torvalds

Liping Zhang
2017-04-08 00:46:44 +0800

02 Mar, 2017

1 commit

f7ccbae45 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h> ... Browse Code »

We are going to split out of , which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder file that just
maps to to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2017-03-02 15:42:28 +0800

01 Feb, 2017

1 commit

975e155ed sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds ... Browse Code »

We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:

ce0dbbbb30ae ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")

... which name suggests to users that it's in milliseconds, while in reality
it's being set in milliseconds but the result is shown in jiffies.

This is obviously confusing when HZ is not 1000, it makes it appear like the
value set failed, such as HZ=100:

root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
root# cat /proc/sys/kernel/sched_rr_timeslice_ms
10

Fix this to be milliseconds all around.

Signed-off-by: Shile Zhang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
Signed-off-by: Ingo Molnar

Shile Zhang
2017-02-01 18:01:30 +0800

27 Jan, 2017

1 commit

ff9f8a7cf sysctl: fix proc_doulongvec_ms_jiffies_minmax() ... Browse Code »

We perform the conversion between kernel jiffies and ms only when
exporting kernel value to user space.

We need to do the opposite operation when value is written by user.

Only matters when HZ != 1000

Signed-off-by: Eric Dumazet
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Eric Dumazet
2017-01-27 01:21:24 +0800

25 Dec, 2016

1 commit

7c0f6ba68 Replace <asm/uaccess.h> with <linux/uaccess.h> globally ... Browse Code »

This was entirely automated, using the script by Al:

PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*'
sed -i -e "s!$PATT!#include !" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2016-12-25 03:46:01 +0800

16 Dec, 2016

1 commit

179a7ba68 Merge tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing updates from Steven Rostedt:
"This release has a few updates:

- STM can hook into the function tracer
- Function filtering now supports more advance glob matching
- Ftrace selftests updates and added tests
- Softirq tag in traces now show only softirqs
- ARM nop added to non traced locations at compile time
- New trace_marker_raw file that allows for binary input
- Optimizations to the ring buffer
- Removal of kmap in trace_marker
- Wakeup and irqsoff tracers now adhere to the set_graph_notrace file
- Other various fixes and clean ups"

* tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (42 commits)
selftests: ftrace: Shift down default message verbosity
kprobes/trace: Fix kprobe selftest for newer gcc
tracing/kprobes: Add a helper method to return number of probe hits
tracing/rb: Init the CPU mask on allocation
tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too
fgraph: Handle a case where a tracer ignores set_graph_notrace
tracing: Replace kmap with copy_from_user() in trace_marker writing
ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it
tracing: Allow benchmark to be enabled at early_initcall()
tracing: Have system enable return error if one of the events fail
tracing: Do not start benchmark on boot up
tracing: Have the reg function allow to fail
ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline
ring-buffer: Froce rb_update_write_stamp() to be inlined
ring-buffer: Force inline of hotpath helper functions
tracing: Make __buffer_unlock_commit() always_inline
tracing: Make tracepoint_printk a static_key
ring-buffer: Always inline rb_event_data()
ring-buffer: Make rb_reserve_next_event() always inlined
...

Linus Torvalds
2016-12-16 05:49:34 +0800

15 Dec, 2016

1 commit

760c6a913 coredump: clarify "unsafe core_pattern" warning ... Browse Code »

I was amused to find "unsafe core_pattern" warning having these lines in
/etc/sysctl.conf:

fs.suid_dumpable=2
kernel.core_pattern=/core/core-%e-%p-%E
kernel.core_uses_pid=0

Turns out kernel is formally right. Default core_pattern is just "core",
which doesn't qualify for secure path while setting suid.dumpable.

Hint admins about solution, clarify sysctl names, delete unnecessary '\'
characters (string literals are concatenated regardless) and reformat for
easier grepping.

Link: http://lkml.kernel.org/r/20161029152124.GA1258@avx2
Signed-off-by: Alexey Dobriyan
Acked-by: Kees Cook
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2016-12-15 08:04:07 +0800

13 Dec, 2016

1 commit

5645688f9 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 asm updates from Ingo Molnar:
"The main changes in this development cycle were:

- a large number of call stack dumping/printing improvements: higher
robustness, better cross-context dumping, improved output, etc.
(Josh Poimboeuf)

- vDSO getcpu() performance improvement for future Intel CPUs with
the RDPID instruction (Andy Lutomirski)

- add two new Intel AVX512 features and the CPUID support
infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
He Chen)

- more copy-user unification (Borislav Petkov)

- entry code assembly macro simplifications (Alexander Kuleshov)

- vDSO C/R support improvements (Dmitry Safonov)

- misc fixes and cleanups (Borislav Petkov, Paul Bolle)"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
scripts/decode_stacktrace.sh: Fix address line detection on x86
x86/boot/64: Use defines for page size
x86/dumpstack: Make stack name tags more comprehensible
selftests/x86: Add test_vdso to test getcpu()
x86/vdso: Use RDPID in preference to LSL when available
x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
x86/cpufeatures: Enable new AVX512 cpu features
x86/cpuid: Provide get_scattered_cpuid_leaf()
x86/cpuid: Cleanup cpuid_regs definitions
x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
x86/unwind: Ensure stack grows down
x86/vdso: Set vDSO pointer only after success
x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
x86/unwind: Detect bad stack return address
x86/dumpstack: Warn on stack recursion
x86/unwind: Warn on bad frame pointer
x86/decoder: Use stderr if insn sanity test fails
x86/decoder: Use stdout if insn decoder test is successful
mm/page_alloc: Remove kernel address exposure in free_reserved_area()
x86/dumpstack: Remove raw stack dump
...

Linus Torvalds
2016-12-13 05:49:57 +0800

24 Nov, 2016

1 commit

423917457 tracing: Make tracepoint_printk a static_key ... Browse Code »

Currently, when tracepoint_printk is set (enabled by the "tp_printk" kernel
command line), it causes trace events to print via printk(). This is a very
dangerous operation, but is useful for debugging.

The issue is, it's seldom used, but it is always checked even if it's not
enabled by the kernel command line. Instead of having this feature called by
a branch against a variable, turn that variable into a static key, and this
will remove the test and jump.

To simplify things, the functions output_printk() and
trace_event_buffer_commit() were moved from trace_events.c to trace.c.

Signed-off-by: Steven Rostedt

Steven Rostedt (Red Hat)
2016-11-24 04:52:45 +0800

26 Oct, 2016

1 commit

0ee1dd9f5 x86/dumpstack: Remove raw stack dump ... Browse Code »

For mostly historical reasons, the x86 oops dump shows the raw stack
values:

...
[registers]
Stack:
ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
0000000000000002 0000000000000000 0000000000000000 0000000000000000
Call Trace:
...

This seems to be an artifact from long ago, and probably isn't needed
anymore. It generally just adds noise to the dump, and it can be
actively harmful because it leaks kernel addresses.

Linus says:

"The stack dump actually goes back to forever, and it used to be
useful back in 1992 or so. But it used to be useful mainly because
stacks were simpler and we didn't have very good call traces anyway. I
definitely remember having used them - I just do not remember having
used them in the last ten+ years.

Of course, it's still true that if you can trigger an oops, you've
likely already lost the security game, but since the stack dump is so
useless, let's aim to just remove it and make games like the above
harder."

This also removes the related 'kstack=' cmdline option and the
'kstack_depth_to_print' sysctl.

Suggested-by: Linus Torvalds
Signed-off-by: Josh Poimboeuf
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar

Josh Poimboeuf
2016-10-26 00:40:37 +0800

20 Oct, 2016

1 commit

3c3fcb45d sched/fair: Kill the unused 'sched_shares_window_ns' tunable ... Browse Code »

The last user of this tunable was removed in 2012 in commit:

82958366cfea ("sched: Replace update_shares weight distribution with per-entity computation")

Delete it since its very existence confuses people.

Signed-off-by: Matt Fleming
Cc: Dietmar Eggemann
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20161019141059.26408-1-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar

Matt Fleming
2016-10-20 14:44:57 +0800

11 Oct, 2016

1 commit

abb5a14fa Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.

There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...

Linus Torvalds
2016-10-11 04:04:49 +0800

07 Oct, 2016

1 commit

14986a34e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.

The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.

To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.

Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.

There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.

These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...

Linus Torvalds
2016-10-07 00:52:23 +0800

01 Oct, 2016

1 commit

d29216842 mnt: Add a per mount namespace limit on the number of mounts ... Browse Code »

CAI Qian pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

mkdir /tmp/1 /tmp/2
mount --make-rshared /
for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000. This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts. Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2016-10-01 01:46:48 +0800

28 Sep, 2016

2 commits

9dcfcda57 compat: remove compat_printk() ... Browse Code »

After 7e8e385aaf6e ("x86/compat: Remove sys32_vm86_warning"), this
function has become unused, so we can remove it as well.

Link: http://lkml.kernel.org/r/20160617142903.3070388-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Cc: Alexander Viro
Cc: "Theodore Ts'o"
Cc: Arnaldo Carvalho de Melo
Signed-off-by: Andrew Morton

Arnd Bergmann
2016-09-28 09:20:53 +0800
9b80a184e fs/file: more unsigned file descriptors ... Browse Code »

Propagate unsignedness for grand total of 149 bytes:

$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
function old new delta
set_close_on_exec 99 98 -1
put_files_struct 201 200 -1
get_close_on_exec 59 58 -1
do_prlimit 498 497 -1
do_execveat_common.isra 1662 1661 -1
__close_fd 178 173 -5
do_dup2 219 204 -15
seq_show 685 660 -25
__alloc_fd 384 357 -27
dup_fd 718 646 -72

It mostly comes from converting "unsigned int" to "long" for bit operations.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Al Viro

Alexey Dobriyan
2016-09-28 06:47:38 +0800

27 Aug, 2016

1 commit

e7d316a02 sysctl: handle error writing UINT_MAX to u32 fields ... Browse Code »

We have scripts which write to certain fields on 3.18 kernels but this
seems to be failing on 4.4 kernels. An entry which we write to here is
xfrm_aevent_rseqth which is u32.

echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
to int") prevented writing to sysctl entries when integer overflow
occurs. However, this does not apply to unsigned integers.

Heinrich suggested that we introduce a new option to handle 64 bit
limits and set min as 0 and max as UINT_MAX. This might not work as it
leads to issues similar to __do_proc_doulongvec_minmax. Alternatively,
we would need to change the datatype of the entry to 64 bit.

static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
{
i = (unsigned long *) data; //This cast is causing to read beyond the size of data (u32)
vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32) which is lesser than sizeof(unsigned long) on x86_64.

Introduce a new proc handler proc_douintvec. Individual proc entries
will need to be updated to use the new handler.

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan
Cc: Heinrich Schuchardt
Cc: Kees Cook
Cc: "David S. Miller"
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Subash Abhinov Kasiviswanathan
2016-08-27 08:39:35 +0800

03 Aug, 2016

1 commit

750afe7ba printk: add kernel parameter to control writes to /dev/kmsg ... Browse Code »

Add a "printk.devkmsg" kernel command line parameter which controls how
userspace writes into /dev/kmsg. It has three options:

* ratelimit - ratelimit logging from userspace.
* on - unlimited logging from userspace
* off - logging from userspace gets ignored

The default setting is to ratelimit the messages written to it.

This changes the kernel default setting of "on" to "ratelimit" and we do
that because we want to keep userspace spamming /dev/kmsg to sane
levels. This is especially moot when a small kernel log buffer wraps
around and messages get lost. So the ratelimiting setting should be a
sane setting where kernel messages should have a bit higher chance of
survival from all the spamming.

It additionally does not limit logging to /dev/kmsg while the system is
booting if we haven't disabled it on the command line.

Furthermore, we can control the logging from a lower priority sysctl
interface - kernel.printk_devkmsg.

That interface will succeed only if printk.devkmsg *hasn't* been
supplied on the command line. If it has, then printk.devkmsg is a
one-time setting which remains for the duration of the system lifetime.
This "locking" of the setting is to prevent userspace from changing the
logging on us through sysctl(2).

This patch is based on previous patches from Linus and Steven.

[bp@suse.de: fixes]
Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
Signed-off-by: Borislav Petkov
Cc: Dave Young
Cc: Franck Bui
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Uwe Kleine-König
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Borislav Petkov
2016-08-03 07:35:06 +0800

29 Jul, 2016

1 commit

a5f5f91da mm: convert zone_reclaim to node_reclaim ... Browse Code »

As reclaim is now per-node based, convert zone_reclaim to be
node_reclaim. It is possible that a node will be reclaimed multiple
times if it has multiple zones but this is unavoidable without caching
all nodes traversed so far. The documentation and interface to
userspace is the same from a configuration perspective and will will be
similar in behaviour unless the node-local allocation requests were also
limited to lower zones.

Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Hillf Danton
Acked-by: Johannes Weiner
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2016-07-29 07:07:41 +0800

16 Jun, 2016

1 commit

088e9d253 rcu: sysctl: Panic on RCU Stall ... Browse Code »

It is not always easy to determine the cause of an RCU stall just by
analysing the RCU stall messages, mainly when the problem is caused
by the indirect starvation of rcu threads. For example, when preempt_rcu
is not awakened due to the starvation of a timer softirq.

We have been hard coding panic() in the RCU stall functions for
some time while testing the kernel-rt. But this is not possible in
some scenarios, like when supporting customers.

This patch implements the sysctl kernel.panic_on_rcu_stall. If
set to 1, the system will panic() when an RCU stall takes place,
enabling the capture of a vmcore. The vmcore provides a way to analyze
all kernel/tasks states, helping out to point to the culprit and the
solution for the stall.

The kernel.panic_on_rcu_stall sysctl is disabled by default.

Changes from v1:
- Fixed a typo in the git log
- The if(sysctl_panic_on_rcu_stall) panic() is in a static function
- Fixed the CONFIG_TINY_RCU compilation issue
- The var sysctl_panic_on_rcu_stall is now __read_mostly

Cc: Jonathan Corbet
Cc: "Paul E. McKenney"
Cc: Josh Triplett
Cc: Steven Rostedt
Cc: Mathieu Desnoyers
Cc: Lai Jiangshan
Acked-by: Christian Borntraeger
Reviewed-by: Josh Triplett
Reviewed-by: Arnaldo Carvalho de Melo
Tested-by: "Luis Claudio R. Goncalves"
Signed-off-by: Daniel Bristot de Oliveira
Signed-off-by: Paul E. McKenney

Daniel Bristot de Oliveira
2016-06-16 07:00:05 +0800

26 May, 2016

1 commit

bdc6b758e Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf updates from Ingo Molnar:
"Mostly tooling and PMU driver fixes, but also a number of late updates
such as the reworking of the call-chain size limiting logic to make
call-graph recording more robust, plus tooling side changes for the
new 'backwards ring-buffer' extension to the perf ring-buffer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
perf record: Read from backward ring buffer
perf record: Rename variable to make code clear
perf record: Prevent reading invalid data in record__mmap_read
perf evlist: Add API to pause/resume
perf trace: Use the ptr->name beautifier as default for "filename" args
perf trace: Use the fd->name beautifier as default for "fd" args
perf report: Add srcline_from/to branch sort keys
perf evsel: Record fd into perf_mmap
perf evsel: Add overwrite attribute and check write_backward
perf tools: Set buildid dir under symfs when --symfs is provided
perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
perf annotate: Sort list of recognised instructions
perf annotate: Fix identification of ARM blt and bls instructions
perf tools: Fix usage of max_stack sysctl
perf callchain: Stop validating callchains by the max_stack sysctl
perf trace: Fix exit_group() formatting
perf top: Use machine->kptr_restrict_warned
perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
perf machine: Do not bail out if not managing to read ref reloc symbol
perf/x86/intel/p4: Trival indentation fix, remove space
...

Linus Torvalds
2016-05-26 08:05:40 +0800

20 May, 2016

1 commit

52b6f46bc mm: /proc/sys/vm/stat_refresh to force vmstat update ... Browse Code »

Provide /proc/sys/vm/stat_refresh to force an immediate update of
per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
before checking counts when testing. Originally added to work around a
bug which left counts stranded indefinitely on a cpu going idle (an
inaccuracy magnified when small below-batch numbers represent "huge"
amounts of memory), but I believe that bug is now fixed: nonetheless,
this is still a useful knob.

Its schedule_on_each_cpu() is probably too expensive just to fold into
reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
Allow a write or a read to do the same: nothing to read, but "grep -h
Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
since global_page_state() itself is careful to disguise any underflow as
0, hack in an "Invalid argument" and pr_warn() if a counter is negative
after the refresh - this helped to fix a misaccounting of
NR_ISOLATED_FILE in my migration code.

But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
often go negative some of the time. I have not yet worked out why, but
have no evidence that it's actually harmful. Punt for the moment by
just ignoring the anomaly on those.

Signed-off-by: Hugh Dickins
Cc: "Kirill A. Shutemov"
Cc: Andrea Arcangeli
Cc: Andres Lagar-Cavilla
Cc: Yang Shi
Cc: Ning Qu
Cc: Mel Gorman
Cc: Andres Lagar-Cavilla
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2016-05-20 10:12:14 +0800

17 May, 2016

1 commit

c85b03349 perf core: Separate accounting of contexts and real addresses in a stack trace ... Browse Code »

The perf_sample->ip_callchain->nr value includes all the entries in the
ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
while what the user expects is that what is in the kernel.perf_event_max_stack
sysctl or in the upcoming per event perf_event_attr.sample_max_stack knob be
honoured in terms of IP addresses in the stack trace.

So allocate a bunch of extra entries for contexts, and do the accounting
via perf_callchain_entry_ctx struct members.

A new sysctl, kernel.perf_event_max_contexts_per_stack is also
introduced for investigating possible bugs in the callchain
implementation by some arch.

Cc: Adrian Hunter
Cc: Alexander Shishkin
Cc: Alexei Starovoitov
Cc: Brendan Gregg
Cc: David Ahern
Cc: Frederic Weisbecker
Cc: He Kuang
Cc: Jiri Olsa
Cc: Masami Hiramatsu
Cc: Milian Wolff
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: Wang Nan
Cc: Zefan Li
Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

Arnaldo Carvalho de Melo
2016-05-17 10:11:53 +0800