Eric Lee / smarc-fsl-linux-kernel

18 Sep, 2020

1 commit

5ef64cc89 mm: allow a controlled amount of unfairness in the page lock ... Browse Code »

Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.

That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.

It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.

Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.

There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.

The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.

This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.

Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel
Tested-by: Matthieu Baerts
Cc: Dave Chinner
Cc: Matthew Wilcox
Cc: Chris Mason
Cc: Jan Kara
Cc: Amir Goldstein
Signed-off-by: Linus Torvalds

Linus Torvalds
2020-09-18 01:26:41 +0800

25 Aug, 2020

1 commit

7787b6fc9 bpf, sysctl: Let bpf_stats_handler take a kernel pointer buffer ... Browse Code »

Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer. Adjust the
signature of bpf_stats_handler to match ctl_table.proc_handler which
fixes the following sparse warning:

kernel/sysctl.c:226:49: warning: incorrect type in argument 3 (different address spaces)
kernel/sysctl.c:226:49: expected void *
kernel/sysctl.c:226:49: got void [noderef] __user *buffer
kernel/sysctl.c:2640:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/sysctl.c:2640:35: expected int ( [usertype] *proc_handler )( ... )
kernel/sysctl.c:2640:35: got int ( * )( ... )

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser
Signed-off-by: Alexei Starovoitov
Cc: Christoph Hellwig
Link: https://lore.kernel.org/bpf/20200824142047.22043-1-tklauser@distanz.ch

Tobias Klauser
2020-08-25 12:11:40 +0800

13 Aug, 2020

2 commits

d34c0a759 mm: use unsigned types for fragmentation score ... Browse Code »

Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.

Signed-off-by: Nitin Gupta
Signed-off-by: Andrew Morton
Reviewed-by: Baoquan He
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: Vlastimil Babka
Cc: Joonsoo Kim
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds

Nitin Gupta
2020-08-13 01:57:56 +0800
facdaa917 mm: proactive compaction ... Browse Code »

For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta
Signed-off-by: Andrew Morton
Tested-by: Oleksandr Natalenko
Reviewed-by: Vlastimil Babka
Reviewed-by: Khalid Aziz
Reviewed-by: Oleksandr Natalenko
Cc: Vlastimil Babka
Cc: Khalid Aziz
Cc: Michal Hocko
Cc: Mel Gorman
Cc: Matthew Wilcox
Cc: Mike Kravetz
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Nitin Gupta
Cc: Oleksandr Natalenko
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds

Nitin Gupta
2020-08-13 01:57:56 +0800

08 Aug, 2020

1 commit

56f3547bf mm: adjust vm_committed_as_batch according to vm overcommit policy ... Browse Code »

When checking a performance change for will-it-scale scalability mmap test
[1], we found very high lock contention for spinlock of percpu counter
'vm_committed_as':

94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
for the percpu counter.

So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also
add a sysctl handler to adjust it when the policy is reconfigured.

Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T
desktop, and 2097%(20X) on a 4S/72C/144T server. We tested with test
platforms in 0day (server, desktop and laptop), and 80%+ platforms shows
improvements with that test. And whether it shows improvements depends on
if the test mmap size is bigger than the batch number computed.

And if the lift is 16X, 1/3 of the platforms will show improvements,
though it should help the mmap/unmap usage generally, as Michal Hocko
mentioned:

: I believe that there are non-synthetic worklaods which would benefit from
: a larger batch. E.g. large in memory databases which do large mmaps
: during startups from multiple threads.

[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

Signed-off-by: Feng Tang
Signed-off-by: Andrew Morton
Acked-by: Michal Hocko
Cc: Matthew Wilcox (Oracle)
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Qian Cai
Cc: Kees Cook
Cc: Andi Kleen
Cc: Tim Chen
Cc: Dave Hansen
Cc: Huang Ying
Cc: Christoph Lameter
Cc: Dennis Zhou
Cc: Haiyang Zhang
Cc: kernel test robot
Cc: "K. Y. Srinivasan"
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds

Feng Tang
2020-08-08 02:33:26 +0800

29 Jul, 2020

1 commit

13685c4a0 sched/uclamp: Add a new sysctl to control RT default boost value ... Browse Code »

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.

To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com

Qais Yousef
2020-07-29 19:51:47 +0800

15 Jun, 2020

1 commit

b4098bfc5 sched/deadline: Impose global limits on sched_attr::sched_period ... Browse Code »

Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20190726161357.397880775@infradead.org

Peter Zijlstra
2020-06-15 20:10:04 +0800

09 Jun, 2020

4 commits

e77132e75 kernel/sysctl.c: ignore out-of-range taint bits introduced via kernel.tainted ... Browse Code »

Users with SYS_ADMIN capability can add arbitrary taint flags to the
running kernel by writing to /proc/sys/kernel/tainted or issuing the
command 'sysctl -w kernel.tainted=...'. This interface, however, is
open for any integer value and this might cause an invalid set of flags
being committed to the tainted_mask bitset.

This patch introduces a simple way for proc_taint() to ignore any
eventual invalid bit coming from the user input before committing those
bits to the kernel tainted_mask.

Signed-off-by: Rafael Aquini
Signed-off-by: Andrew Morton
Reviewed-by: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: "Theodore Ts'o"
Link: http://lkml.kernel.org/r/20200512223946.888020-1-aquini@redhat.com
Signed-off-by: Linus Torvalds

Rafael Aquini
2020-06-09 02:05:56 +0800
60c958d8d panic: add sysctl to dump all CPUs backtraces on oops event ... Browse Code »

Usually when the kernel reaches an oops condition, it's a point of no
return; in case not enough debug information is available in the kernel
splat, one of the last resorts would be to collect a kernel crash dump
and analyze it. The problem with this approach is that in order to
collect the dump, a panic is required (to kexec-load the crash kernel).
When in an environment of multiple virtual machines, users may prefer to
try living with the oops, at least until being able to properly shutdown
their VMs / finish their important tasks.

This patch implements a way to collect a bit more debug details when an
oops event is reached, by printing all the CPUs backtraces through the
usage of NMIs (on architectures that support that). The sysctl added
(and documented) here was called "oops_all_cpu_backtrace", and when set
will (as the name suggests) dump all CPUs backtraces.

Far from ideal, this may be the last option though for users that for
some reason cannot panic on oops. Most of times oopses are clear enough
to indicate the kernel portion that must be investigated, but in virtual
environments it's possible to observe hypervisor/KVM issues that could
lead to oopses shown in other guests CPUs (like virtual APIC crashes).
This patch hence aims to help debug such complex issues without
resorting to kdump.

Signed-off-by: Guilherme G. Piccoli
Signed-off-by: Andrew Morton
Reviewed-by: Kees Cook
Cc: Luis Chamberlain
Cc: Iurii Zaikin
Cc: Thomas Gleixner
Cc: Vlastimil Babka
Cc: Randy Dunlap
Cc: Matthew Wilcox
Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds

Guilherme G. Piccoli
2020-06-09 02:05:56 +0800
0ec9dc9bc kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected ... Browse Code »

Commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before
panic") introduced a change in that we started to show all CPUs
backtraces when a hung task is detected _and_ the sysctl/kernel
parameter "hung_task_panic" is set. The idea is good, because usually
when observing deadlocks (that may lead to hung tasks), the culprit is
another task holding a lock and not necessarily the task detected as
hung.

The problem with this approach is that dumping backtraces is a slightly
expensive task, specially printing that on console (and specially in
many CPU machines, as servers commonly found nowadays). So, users that
plan to collect a kdump to investigate the hung tasks and narrow down
the deadlock definitely don't need the CPUs backtrace on dmesg/console,
which will delay the panic and pollute the log (crash tool would easily
grab all CPUs traces with 'bt -a' command).

Also, there's the reciprocal scenario: some users may be interested in
seeing the CPUs backtraces but not have the system panic when a hung
task is detected. The current approach hence is almost as embedding a
policy in the kernel, by forcing the CPUs backtraces' dump (only) on
hung_task_panic.

This patch decouples the panic event on hung task from the CPUs
backtraces dump, by creating (and documenting) a new sysctl called
"hung_task_all_cpu_backtrace", analog to the approach taken on soft/hard
lockups, that have both a panic and an "all_cpu_backtrace" sysctl to
allow individual control. The new mechanism for dumping the CPUs
backtraces on hung task detection respects "hung_task_warnings" by not
dumping the traces in case there's no warnings left.

Signed-off-by: Guilherme G. Piccoli
Signed-off-by: Andrew Morton
Reviewed-by: Kees Cook
Cc: Tetsuo Handa
Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds

Guilherme G. Piccoli
2020-06-09 02:05:56 +0800
db38d5c10 kernel: add panic_on_taint ... Browse Code »

Analogously to the introduction of panic_on_warn, this patch introduces
a kernel option named panic_on_taint in order to provide a simple and
generic way to stop execution and catch a coredump when the kernel gets
tainted by any given flag.

This is useful for debugging sessions as it avoids having to rebuild the
kernel to explicitly add calls to panic() into the code sites that
introduce the taint flags of interest.

For instance, if one is interested in proceeding with a post-mortem
analysis at the point a given code path is hitting a bad page (i.e.
unaccount_page_cache_page(), or slab_bug()), a coredump can be collected
by rebooting the kernel with 'panic_on_taint=0x20' amended to the
command line.

Another, perhaps less frequent, use for this option would be as a means
for assuring a security policy case where only a subset of taints, or no
single taint (in paranoid mode), is allowed for the running system. The
optional switch 'nousertaint' is handy in this particular scenario, as
it will avoid userspace induced crashes by writes to sysctl interface
/proc/sys/kernel/tainted causing false positive hits for such policies.

[akpm@linux-foundation.org: tweak kernel-parameters.txt wording]

Suggested-by: Qian Cai
Signed-off-by: Rafael Aquini
Signed-off-by: Andrew Morton
Reviewed-by: Luis Chamberlain
Cc: Dave Young
Cc: Baoquan He
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: Randy Dunlap
Cc: "Theodore Ts'o"
Cc: Adrian Bunk
Cc: Greg Kroah-Hartman
Cc: Laura Abbott
Cc: Jeff Mahoney
Cc: Jiri Kosina
Cc: Takashi Iwai
Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.com
Signed-off-by: Linus Torvalds

Rafael Aquini
2020-06-09 02:05:56 +0800

04 Jun, 2020

3 commits

ee01c4d72 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge more updates from Andrew Morton:
"More mm/ work, plenty more to come

Subsystems affected by this patch series: slub, memcg, gup, kasan,
pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
thp, mmap, kconfig"

* akpm: (131 commits)
arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
riscv: support DEBUG_WX
mm: add DEBUG_WX support
drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
powerpc/mm: drop platform defined pmd_mknotpresent()
mm: thp: don't need to drain lru cache when splitting and mlocking THP
hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
sparc32: register memory occupied by kernel as memblock.memory
include/linux/memblock.h: fix minor typo and unclear comment
mm, mempolicy: fix up gup usage in lookup_node
tools/vm/page_owner_sort.c: filter out unneeded line
mm: swap: memcg: fix memcg stats for huge pages
mm: swap: fix vmstats for huge pages
mm: vmscan: limit the range of LRU type balancing
mm: vmscan: reclaim writepage is IO cost
mm: vmscan: determine anon/file pressure balance at the reclaim root
mm: balance LRU lists based on relative thrashing
mm: only count actual rotations as LRU reclaim cost
...

Linus Torvalds
2020-06-04 11:24:15 +0800
c843966c5 mm: allow swappiness that prefers reclaiming anon over the file workingset ... Browse Code »

With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
devices such as zswap, it's possible for swap to be much faster than
filesystems, and for swapping to be preferable over thrashing filesystem
caches.

Allow setting swappiness - which defines the rough relative IO cost of
cache misses between page cache and swap-backed pages - to reflect such
situations by making the swap-preferred range configurable.

Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds

Johannes Weiner
2020-06-04 11:09:48 +0800
cb8e59cc8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next ... Browse Code »

Pull networking updates from David Miller:

1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.

2) Add GSO partial support to igc, from Sasha Neftin.

3) Several cleanups and improvements to r8169 from Heiner Kallweit.

4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.

5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.

6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

8) Add sriov and vf support to hinic, from Luo bin.

9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...

Linus Torvalds
2020-06-04 07:27:18 +0800

11 May, 2020

1 commit

b6522fa40 parisc: add sysctl file interface panic_on_stackoverflow ... Browse Code »

The variable sysctl_panic_on_stackoverflow is used in
arch/parisc/kernel/irq.c and arch/x86/kernel/irq_32.c, but the sysctl file
interface panic_on_stackoverflow only exists on x86.

Add sysctl file interface panic_on_stackoverflow for parisc

Signed-off-by: Xiaoming Ni
Reviewed-by: Luis Chamberlain
Signed-off-by: Helge Deller

Xiaoming Ni
2020-05-11 04:47:35 +0800

06 May, 2020

1 commit

5447e8e01 sysctl: Fix unused function warning ... Browse Code »

The newly added bpf_stats_handler function has the wrong #ifdef
check around it, leading to an unused-function warning when
CONFIG_SYSCTL is disabled:

kernel/sysctl.c:205:12: error: unused function 'bpf_stats_handler' [-Werror,-Wunused-function]
static int bpf_stats_handler(struct ctl_table *table, int write,

Fix the check to match the reference.

Fixes: d46edd671a14 ("bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS")
Signed-off-by: Arnd Bergmann
Signed-off-by: Alexei Starovoitov
Reviewed-by: Luis Chamberlain
Acked-by: Martin KaFai Lau
Acked-by: Song Liu
Link: https://lore.kernel.org/bpf/20200505140734.503701-1-arnd@arndb.de

Arnd Bergmann
2020-05-06 02:59:32 +0800

02 May, 2020

1 commit

d46edd671 bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS ... Browse Code »

Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
Typical userspace tools use kernel.bpf_stats_enabled as follows:

1. Enable kernel.bpf_stats_enabled;
2. Check program run_time_ns;
3. Sleep for the monitoring period;
4. Check program run_time_ns again, calculate the difference;
5. Disable kernel.bpf_stats_enabled.

The problem with this approach is that only one userspace tool can toggle
this sysctl. If multiple tools toggle the sysctl at the same time, the
measurement may be inaccurate.

To fix this problem while keep backward compatibility, introduce a new
bpf command BPF_ENABLE_STATS. On success, this command enables stats and
returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
command to support other types of stats in the future.

With BPF_ENABLE_STATS, user space tool would have the following flow:

1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
2. Check program run_time_ns;
3. Sleep for the monitoring period;
4. Check program run_time_ns again, calculate the difference;
5. Close the fd.

Signed-off-by: Song Liu
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com

Song Liu
2020-05-02 01:36:32 +0800

27 Apr, 2020

4 commits

32927393d sysctl: pass kernel pointers to ->proc_handler ... Browse Code »

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig
Acked-by: Andrey Ignatov
Signed-off-by: Al Viro

Christoph Hellwig
2020-04-27 14:07:40 +0800
f461d2dcd sysctl: avoid forward declarations ... Browse Code »

Move the sysctl tables to the end of the file to avoid lots of pointless
forward declarations.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2020-04-27 14:07:26 +0800
2374c09b1 sysctl: remove all extern declaration from sysctl.c ... Browse Code »

Extern declarations in .c files are a bad style and can lead to
mismatches. Use existing definitions in headers where they exist,
and otherwise move the external declarations to suitable header
files.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro

Christoph Hellwig
2020-04-27 14:06:53 +0800
26363af56 mm: remove watermark_boost_factor_sysctl_handler ... Browse Code »

watermark_boost_factor_sysctl_handler is just a pointless wrapper for
proc_dointvec_minmax, so remove it and use proc_dointvec_minmax
directly.

Signed-off-by: Christoph Hellwig
Acked-by: David Rientjes
Signed-off-by: Al Viro

Christoph Hellwig
2020-04-27 14:06:52 +0800

03 Apr, 2020

2 commits

6923aa0d8 mm/compaction: Disable compact_unevictable_allowed on RT ... Browse Code »

Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages")
it is allowed to examine mlocked pages and compact them by default. On
-RT even minor pagefaults are problematic because it may take a few 100us
to resolve them and until then the task is blocked.

Make compact_unevictable_allowed = 0 default and issue a warning on RT if
it is changed.

[bigeasy@linutronix.de: v5]
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Thomas Gleixner
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: Vlastimil Babka
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
Signed-off-by: Linus Torvalds

Sebastian Andrzej Siewior
2020-04-03 00:35:31 +0800
964b692da mm/compaction: really limit compact_unevictable_allowed to 0 and 1 ... Browse Code »

The proc file `compact_unevictable_allowed' should allow 0 and 1 only, the
`extra*' attribues have been set properly but without
proc_dointvec_minmax() as the `proc_handler' the limit will not be
enforced.

Use proc_dointvec_minmax() as the `proc_handler' to enfoce the valid
specified range.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Acked-by: Vlastimil Babka
Cc: Thomas Gleixner
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: Mel Gorman
Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de
Signed-off-by: Linus Torvalds

Sebastian Andrzej Siewior
2020-04-03 00:35:31 +0800

07 Mar, 2020

1 commit

eaee41727 sysctl/sysrq: Remove __sysrq_enabled copy ... Browse Code »

Many embedded boards have a disconnected TTL level serial which can
generate some garbage that can lead to spurious false sysrq detects.

Currently, sysrq can be either completely disabled for serial console
or always disabled (with CONFIG_MAGIC_SYSRQ_SERIAL), since
commit 732dbf3a6104 ("serial: do not accept sysrq characters via serial port")

At Arista, we have such boards that can generate BREAK and random
garbage. While disabling sysrq for serial console would solve
the problem with spurious false sysrq triggers, it's also desirable
to have a way to enable sysrq back.

Having the way to enable sysrq was beneficial to debug lockups with
a manual investigation in field and on the other side preventing false
sysrq detections.

As a preparation to add sysrq_toggle_support() call into uart,
remove a private copy of sysrq_enabled from sysctl - it should reflect
the actual status of sysrq.

Furthermore, the private copy isn't correct already in case
sysrq_always_enabled is true. So, remove __sysrq_enabled and use a
getter-helper sysrq_mask() to check sysrq_key_op enabled status.

Cc: Iurii Zaikin
Cc: Jiri Slaby
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dmitry Safonov
Link: https://lore.kernel.org/r/20200302175135.269397-2-dima@arista.com
Signed-off-by: Greg Kroah-Hartman

Dmitry Safonov
2020-03-07 16:52:02 +0800

19 Feb, 2020

1 commit

140588bfe s390: remove obsolete ieee_emulation_warnings ... Browse Code »

s390 math emulation was removed with commit 5a79859ae0f3 ("s390:
remove 31 bit support"), rendering ieee_emulation_warnings useless.
The code still built because it was protected by CONFIG_MATHEMU, which
was no longer selectable.

This patch removes the sysctl_ieee_emulation_warnings declaration and
the sysctl entry declaration.

Link: https://lkml.kernel.org/r/20200214172628.3598516-1-steve@sk2.org
Reviewed-by: Vasily Gorbik
Signed-off-by: Stephen Kitt
Signed-off-by: Vasily Gorbik

Stephen Kitt
2020-02-19 20:51:46 +0800

10 Dec, 2019

1 commit

b3e627d3d rcu: Make PREEMPT_RCU be a modifier to TREE_RCU ... Browse Code »

Currently PREEMPT_RCU and TREE_RCU are mutually exclusive Kconfig
options. But PREEMPT_RCU actually specifies a kind of TREE_RCU,
namely a preemptible TREE_RCU. This commit therefore makes PREEMPT_RCU
be a modifer to the TREE_RCU Kconfig option. This has the benefit of
simplifying several of the #if expressions that formerly needed to
check both, but now need only check one or the other.

Signed-off-by: Lai Jiangshan
Signed-off-by: Lai Jiangshan
Reviewed-by: Joel Fernandes (Google)
Signed-off-by: Paul E. McKenney

Lai Jiangshan
2019-12-10 04:37:51 +0800

02 Dec, 2019

1 commit

204cb79ad kernel: sysctl: make drop_caches write-only ... Browse Code »

Currently, the drop_caches proc file and sysctl read back the last value
written, suggesting this is somehow a stateful setting instead of a
one-time command. Make it write-only, like e.g. compact_memory.

While mitigating a VM problem at scale in our fleet, there was confusion
about whether writing to this file will permanently switch the kernel into
a non-caching mode. This influences the decision making in a tense
situation, where tens of people are trying to fix tens of thousands of
affected machines: Do we need a rollback strategy? What are the
performance implications of operating in a non-caching state for several
days? It also caused confusion when the kernel team said we may need to
write the file several times to make sure it's effective ("But it already
reads back 3?").

Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Chris Down
Acked-by: Vlastimil Babka
Acked-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-12-02 04:59:07 +0800

15 Oct, 2019

1 commit

b67114db6 parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define ... Browse Code »

Signed-off-by: Helge Deller

Helge Deller
2019-10-15 03:43:54 +0800

25 Sep, 2019

1 commit

67f3977f8 arm64, mm: move generic mmap layout functions to mm ... Browse Code »

arm64 handles top-down mmap layout in a way that can be easily reused by
other architectures, so make it available in mm. It then introduces a new
config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
architectures to benefit from those functions. Note that this new config
depends on MMU being enabled, if selected without MMU support, a warning
will be thrown.

Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti
Suggested-by: Christoph Hellwig
Acked-by: Catalin Marinas
Acked-by: Kees Cook
Reviewed-by: Christoph Hellwig
Reviewed-by: Luis Chamberlain
Cc: Albert Ou
Cc: Alexander Viro
Cc: James Hogan
Cc: Palmer Dabbelt
Cc: Paul Burton
Cc: Ralf Baechle
Cc: Russell King
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexandre Ghiti
2019-09-25 06:54:11 +0800

19 Jul, 2019

1 commit

eec4844fa proc/sysctl: add shared variables for range check ... Browse Code »

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce
Signed-off-by: Arnd Bergmann
Acked-by: Kees Cook
Reviewed-by: Aaron Tomlin
Cc: Matthew Wilcox
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matteo Croce
2019-07-19 08:08:07 +0800

17 Jul, 2019

1 commit

65f50f255 kernel: fix typos and some coding style in comments ... Browse Code »

fix lenght to length

Link: http://lkml.kernel.org/r/20190521050937.4370-1-houweitaoo@gmail.com
Signed-off-by: Weitao Hou
Acked-by: Kees Cook
Cc: Colin Ian King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Weitao Hou
2019-07-17 10:23:21 +0800

25 Jun, 2019

1 commit

e8f14172c sched/uclamp: Add system default clamps ... Browse Code »

Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

/proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.

Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.

The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.

An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.

Signed-off-by: Patrick Bellasi
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alessio Balsini
Cc: Dietmar Eggemann
Cc: Joel Fernandes
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Morten Rasmussen
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Quentin Perret
Cc: Rafael J . Wysocki
Cc: Steve Muckle
Cc: Suren Baghdasaryan
Cc: Tejun Heo
Cc: Thomas Gleixner
Cc: Todd Kjos
Cc: Vincent Guittot
Cc: Viresh Kumar
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar

Patrick Bellasi
2019-06-25 01:23:45 +0800

15 Jun, 2019

1 commit

a8e11e5c5 sysctl: define proc_do_static_key() ... Browse Code »

Convert proc_dointvec_minmax_bpf_stats() into a more generic
helper, since we are going to use jump labels more often.

Note that sysctl_bpf_stats_enabled is removed, since
it is no longer needed/used.

Signed-off-by: Eric Dumazet
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Eric Dumazet
2019-06-15 11:18:27 +0800

21 May, 2019

1 commit

457c89965 treewide: Add SPDX license identifier for missed files ... Browse Code »

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-21 16:50:45 +0800

15 May, 2019

4 commits

3116ad38f kernel/sysctl.c: fix proc_do_large_bitmap for large input buffers ... Browse Code »

Today, proc_do_large_bitmap() truncates a large write input buffer to
PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
end of the buffer. Further, it fails to notify the caller that the
buffer was truncated, so it doesn't get called iteratively to finish the
entire input buffer.

Tell the caller if there's more work to do by adding the skipped amount
back to left/*lenp before returning.

To fix the misparsing, reset the position if we have completely consumed
a truncated buffer (or if just one char is left, which may be a "-" in a
range), and ask the caller to come back for more.

Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
Signed-off-by: Eric Sandeen
Signed-off-by: Luis Chamberlain
Acked-by: Kees Cook
Cc: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Sandeen
2019-05-15 10:52:51 +0800
e260ad01f sysctl: return -EINVAL if val violates minmax ... Browse Code »

Currently when userspace gives us a values that overflow e.g. file-max
and other callers of __do_proc_doulongvec_minmax() we simply ignore the
new value and leave the current value untouched.

This can be problematic as it gives the illusion that the limit has
indeed be bumped when in fact it failed. This commit makes sure to
return EINVAL when an overflow is detected. Please note that this is a
userspace facing change.

Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
Signed-off-by: Christian Brauner
Acked-by: Luis Chamberlain
Cc: Kees Cook
Cc: Alexey Dobriyan
Cc: Al Viro
Cc: Dominik Brodowski
Cc: "Eric W. Biederman"
Cc: Joe Lawrence
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christian Brauner
2019-05-15 10:52:51 +0800
475dae385 kernel/sysctl.c: switch to bitmap_zalloc() ... Browse Code »

Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.

Link: http://lkml.kernel.org/r/20190304094037.57756-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko
Acked-by: Kees Cook
Reviewed-by: Andrew Morton
Cc: Luis Chamberlain
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andy Shevchenko
2019-05-15 10:52:51 +0800
cefdca0a8 userfaultfd/sysctl: add vm.unprivileged_userfaultfd ... Browse Code »

Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.

We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.

Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.

Andrea said:

: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.

[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu
Suggested-by: Andrea Arcangeli
Suggested-by: Mike Rapoport
Reviewed-by: Mike Rapoport
Reviewed-by: Andrea Arcangeli
Cc: Paolo Bonzini
Cc: Hugh Dickins
Cc: Luis Chamberlain
Cc: Maxime Coquelin
Cc: Maya Gokhale
Cc: Jerome Glisse
Cc: Pavel Emelyanov
Cc: Johannes Weiner
Cc: Martin Cracauer
Cc: Denis Plotnikov
Cc: Marty McFadden
Cc: Mike Kravetz
Cc: Kees Cook
Cc: Mel Gorman
Cc: "Kirill A . Shutemov"
Cc: "Dr . David Alan Gilbert"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Xu
2019-05-15 00:47:45 +0800

19 Apr, 2019

1 commit

0bc199854 ipv6: Add rate limit mask for ICMPv6 messages ... Browse Code »

To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP
message types use larger numeric values, a simple bitmask doesn't fit.
I use large bitmap. The input and output are the in form of list of
ranges. Set the default to rate limit all error messages but Packet Too
Big. For Packet Too Big, use ratemask instead of hard-coded.

There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().

Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.

v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
isn't defined, expand the description in ip-sysctl.txt and remove
unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocated. Still is a
pointer to it is needed because of the way proc_do_large_bitmap work.

Signed-off-by: Stephen Suryaputra
Signed-off-by: David S. Miller

Stephen Suryaputra
2019-04-19 07:58:37 +0800

06 Apr, 2019

1 commit

9002b2146 kernel/sysctl.c: fix out-of-bounds access when setting file-max ... Browse Code »

Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up
min/max values for the file-max sysctl parameter via the .extra1 and
.extra2 fields in the corresponding struct ctl_table entry.

Unfortunately, the minimum value points at the global 'zero' variable,
which is an int. This results in a KASAN splat when accessed as a long
by proc_doulongvec_minmax on 64-bit architectures:

| BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
| Read of size 8 at addr ffff2000133d1c20 by task systemd/1
|
| CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
| Hardware name: linux,dummy-virt (DT)
| Call trace:
| dump_backtrace+0x0/0x228
| show_stack+0x14/0x20
| dump_stack+0xe8/0x124
| print_address_description+0x60/0x258
| kasan_report+0x140/0x1a0
| __asan_report_load8_noabort+0x18/0x20
| __do_proc_doulongvec_minmax+0x5d8/0x6a0
| proc_doulongvec_minmax+0x4c/0x78
| proc_sys_call_handler.isra.19+0x144/0x1d8
| proc_sys_write+0x34/0x58
| __vfs_write+0x54/0xe8
| vfs_write+0x124/0x3c0
| ksys_write+0xbc/0x168
| __arm64_sys_write+0x68/0x98
| el0_svc_common+0x100/0x258
| el0_svc_handler+0x48/0xc0
| el0_svc+0x8/0xc
|
| The buggy address belongs to the variable:
| zero+0x0/0x40
|
| Memory state around the buggy address:
| ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
| ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
| >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
| ^
| ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
| ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Fix the splat by introducing a unsigned long 'zero_ul' and using that
instead.

Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max")
Signed-off-by: Will Deacon
Acked-by: Christian Brauner
Cc: Kees Cook
Cc: Alexey Dobriyan
Cc: Matteo Croce
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Will Deacon
2019-04-06 10:02:31 +0800