18 Sep, 2020
1 commit
-
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain locks. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel
Tested-by: Matthieu Baerts
Cc: Dave Chinner
Cc: Matthew Wilcox
Cc: Chris Mason
Cc: Jan Kara
Cc: Amir Goldstein
Signed-off-by: Linus Torvalds
25 Aug, 2020
1 commit
-
Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer. Adjust the
signature of bpf_stats_handler to match ctl_table.proc_handler which
fixes the following sparse warning:kernel/sysctl.c:226:49: warning: incorrect type in argument 3 (different address spaces)
kernel/sysctl.c:226:49: expected void *
kernel/sysctl.c:226:49: got void [noderef] __user *buffer
kernel/sysctl.c:2640:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/sysctl.c:2640:35: expected int ( [usertype] *proc_handler )( ... )
kernel/sysctl.c:2640:35: got int ( * )( ... )Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser
Signed-off-by: Alexei Starovoitov
Cc: Christoph Hellwig
Link: https://lore.kernel.org/bpf/20200824142047.22043-1-tklauser@distanz.ch
13 Aug, 2020
2 commits
-
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.Signed-off-by: Nitin Gupta
Signed-off-by: Andrew Morton
Reviewed-by: Baoquan He
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: Vlastimil Babka
Cc: Joonsoo Kim
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds -
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within 98% of free memory could be allocated as hugepages)- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap./usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouchThe above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.Backoff behavior
================Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/Signed-off-by: Nitin Gupta
Signed-off-by: Andrew Morton
Tested-by: Oleksandr Natalenko
Reviewed-by: Vlastimil Babka
Reviewed-by: Khalid Aziz
Reviewed-by: Oleksandr Natalenko
Cc: Vlastimil Babka
Cc: Khalid Aziz
Cc: Michal Hocko
Cc: Mel Gorman
Cc: Matthew Wilcox
Cc: Mike Kravetz
Cc: Joonsoo Kim
Cc: David Rientjes
Cc: Nitin Gupta
Cc: Oleksandr Natalenko
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds
08 Aug, 2020
1 commit
-
When checking a performance change for will-it-scale scalability mmap test
[1], we found very high lock contention for spinlock of percpu counter
'vm_committed_as':94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
for the percpu counter.So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also
add a sysctl handler to adjust it when the policy is reconfigured.Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T
desktop, and 2097%(20X) on a 4S/72C/144T server. We tested with test
platforms in 0day (server, desktop and laptop), and 80%+ platforms shows
improvements with that test. And whether it shows improvements depends on
if the test mmap size is bigger than the batch number computed.And if the lift is 16X, 1/3 of the platforms will show improvements,
though it should help the mmap/unmap usage generally, as Michal Hocko
mentioned:: I believe that there are non-synthetic worklaods which would benefit from
: a larger batch. E.g. large in memory databases which do large mmaps
: during startups from multiple threads.[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
Signed-off-by: Feng Tang
Signed-off-by: Andrew Morton
Acked-by: Michal Hocko
Cc: Matthew Wilcox (Oracle)
Cc: Johannes Weiner
Cc: Mel Gorman
Cc: Qian Cai
Cc: Kees Cook
Cc: Andi Kleen
Cc: Tim Chen
Cc: Dave Hansen
Cc: Huang Ying
Cc: Christoph Lameter
Cc: Dennis Zhou
Cc: Haiyang Zhang
Cc: kernel test robot
Cc: "K. Y. Srinivasan"
Cc: Tejun Heo
Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds
29 Jul, 2020
1 commit
-
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.This is also referred to as 'the default boost value of RT tasks'.
See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.I anticipate this to be mostly in the form of modifying the init script
of a particular device.To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
15 Jun, 2020
1 commit
-
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20190726161357.397880775@infradead.org
09 Jun, 2020
4 commits
-
Users with SYS_ADMIN capability can add arbitrary taint flags to the
running kernel by writing to /proc/sys/kernel/tainted or issuing the
command 'sysctl -w kernel.tainted=...'. This interface, however, is
open for any integer value and this might cause an invalid set of flags
being committed to the tainted_mask bitset.This patch introduces a simple way for proc_taint() to ignore any
eventual invalid bit coming from the user input before committing those
bits to the kernel tainted_mask.Signed-off-by: Rafael Aquini
Signed-off-by: Andrew Morton
Reviewed-by: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: "Theodore Ts'o"
Link: http://lkml.kernel.org/r/20200512223946.888020-1-aquini@redhat.com
Signed-off-by: Linus Torvalds -
Usually when the kernel reaches an oops condition, it's a point of no
return; in case not enough debug information is available in the kernel
splat, one of the last resorts would be to collect a kernel crash dump
and analyze it. The problem with this approach is that in order to
collect the dump, a panic is required (to kexec-load the crash kernel).
When in an environment of multiple virtual machines, users may prefer to
try living with the oops, at least until being able to properly shutdown
their VMs / finish their important tasks.This patch implements a way to collect a bit more debug details when an
oops event is reached, by printing all the CPUs backtraces through the
usage of NMIs (on architectures that support that). The sysctl added
(and documented) here was called "oops_all_cpu_backtrace", and when set
will (as the name suggests) dump all CPUs backtraces.Far from ideal, this may be the last option though for users that for
some reason cannot panic on oops. Most of times oopses are clear enough
to indicate the kernel portion that must be investigated, but in virtual
environments it's possible to observe hypervisor/KVM issues that could
lead to oopses shown in other guests CPUs (like virtual APIC crashes).
This patch hence aims to help debug such complex issues without
resorting to kdump.Signed-off-by: Guilherme G. Piccoli
Signed-off-by: Andrew Morton
Reviewed-by: Kees Cook
Cc: Luis Chamberlain
Cc: Iurii Zaikin
Cc: Thomas Gleixner
Cc: Vlastimil Babka
Cc: Randy Dunlap
Cc: Matthew Wilcox
Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds -
Commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before
panic") introduced a change in that we started to show all CPUs
backtraces when a hung task is detected _and_ the sysctl/kernel
parameter "hung_task_panic" is set. The idea is good, because usually
when observing deadlocks (that may lead to hung tasks), the culprit is
another task holding a lock and not necessarily the task detected as
hung.The problem with this approach is that dumping backtraces is a slightly
expensive task, specially printing that on console (and specially in
many CPU machines, as servers commonly found nowadays). So, users that
plan to collect a kdump to investigate the hung tasks and narrow down
the deadlock definitely don't need the CPUs backtrace on dmesg/console,
which will delay the panic and pollute the log (crash tool would easily
grab all CPUs traces with 'bt -a' command).Also, there's the reciprocal scenario: some users may be interested in
seeing the CPUs backtraces but not have the system panic when a hung
task is detected. The current approach hence is almost as embedding a
policy in the kernel, by forcing the CPUs backtraces' dump (only) on
hung_task_panic.This patch decouples the panic event on hung task from the CPUs
backtraces dump, by creating (and documenting) a new sysctl called
"hung_task_all_cpu_backtrace", analog to the approach taken on soft/hard
lockups, that have both a panic and an "all_cpu_backtrace" sysctl to
allow individual control. The new mechanism for dumping the CPUs
backtraces on hung task detection respects "hung_task_warnings" by not
dumping the traces in case there's no warnings left.Signed-off-by: Guilherme G. Piccoli
Signed-off-by: Andrew Morton
Reviewed-by: Kees Cook
Cc: Tetsuo Handa
Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds -
Analogously to the introduction of panic_on_warn, this patch introduces
a kernel option named panic_on_taint in order to provide a simple and
generic way to stop execution and catch a coredump when the kernel gets
tainted by any given flag.This is useful for debugging sessions as it avoids having to rebuild the
kernel to explicitly add calls to panic() into the code sites that
introduce the taint flags of interest.For instance, if one is interested in proceeding with a post-mortem
analysis at the point a given code path is hitting a bad page (i.e.
unaccount_page_cache_page(), or slab_bug()), a coredump can be collected
by rebooting the kernel with 'panic_on_taint=0x20' amended to the
command line.Another, perhaps less frequent, use for this option would be as a means
for assuring a security policy case where only a subset of taints, or no
single taint (in paranoid mode), is allowed for the running system. The
optional switch 'nousertaint' is handy in this particular scenario, as
it will avoid userspace induced crashes by writes to sysctl interface
/proc/sys/kernel/tainted causing false positive hits for such policies.[akpm@linux-foundation.org: tweak kernel-parameters.txt wording]
Suggested-by: Qian Cai
Signed-off-by: Rafael Aquini
Signed-off-by: Andrew Morton
Reviewed-by: Luis Chamberlain
Cc: Dave Young
Cc: Baoquan He
Cc: Jonathan Corbet
Cc: Kees Cook
Cc: Randy Dunlap
Cc: "Theodore Ts'o"
Cc: Adrian Bunk
Cc: Greg Kroah-Hartman
Cc: Laura Abbott
Cc: Jeff Mahoney
Cc: Jiri Kosina
Cc: Takashi Iwai
Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.com
Signed-off-by: Linus Torvalds
04 Jun, 2020
3 commits
-
Merge more updates from Andrew Morton:
"More mm/ work, plenty more to comeSubsystems affected by this patch series: slub, memcg, gup, kasan,
pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
thp, mmap, kconfig"* akpm: (131 commits)
arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
riscv: support DEBUG_WX
mm: add DEBUG_WX support
drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
powerpc/mm: drop platform defined pmd_mknotpresent()
mm: thp: don't need to drain lru cache when splitting and mlocking THP
hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
sparc32: register memory occupied by kernel as memblock.memory
include/linux/memblock.h: fix minor typo and unclear comment
mm, mempolicy: fix up gup usage in lookup_node
tools/vm/page_owner_sort.c: filter out unneeded line
mm: swap: memcg: fix memcg stats for huge pages
mm: swap: fix vmstats for huge pages
mm: vmscan: limit the range of LRU type balancing
mm: vmscan: reclaim writepage is IO cost
mm: vmscan: determine anon/file pressure balance at the reclaim root
mm: balance LRU lists based on relative thrashing
mm: only count actual rotations as LRU reclaim cost
... -
With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
devices such as zswap, it's possible for swap to be much faster than
filesystems, and for swapping to be preferable over thrashing filesystem
caches.Allow setting swappiness - which defines the rough relative IO cost of
cache misses between page cache and swap-backed pages - to reflect such
situations by making the swap-preferred range configurable.Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds -
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
11 May, 2020
1 commit
-
The variable sysctl_panic_on_stackoverflow is used in
arch/parisc/kernel/irq.c and arch/x86/kernel/irq_32.c, but the sysctl file
interface panic_on_stackoverflow only exists on x86.Add sysctl file interface panic_on_stackoverflow for parisc
Signed-off-by: Xiaoming Ni
Reviewed-by: Luis Chamberlain
Signed-off-by: Helge Deller
06 May, 2020
1 commit
-
The newly added bpf_stats_handler function has the wrong #ifdef
check around it, leading to an unused-function warning when
CONFIG_SYSCTL is disabled:kernel/sysctl.c:205:12: error: unused function 'bpf_stats_handler' [-Werror,-Wunused-function]
static int bpf_stats_handler(struct ctl_table *table, int write,Fix the check to match the reference.
Fixes: d46edd671a14 ("bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS")
Signed-off-by: Arnd Bergmann
Signed-off-by: Alexei Starovoitov
Reviewed-by: Luis Chamberlain
Acked-by: Martin KaFai Lau
Acked-by: Song Liu
Link: https://lore.kernel.org/bpf/20200505140734.503701-1-arnd@arndb.de
02 May, 2020
1 commit
-
Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
Typical userspace tools use kernel.bpf_stats_enabled as follows:1. Enable kernel.bpf_stats_enabled;
2. Check program run_time_ns;
3. Sleep for the monitoring period;
4. Check program run_time_ns again, calculate the difference;
5. Disable kernel.bpf_stats_enabled.The problem with this approach is that only one userspace tool can toggle
this sysctl. If multiple tools toggle the sysctl at the same time, the
measurement may be inaccurate.To fix this problem while keep backward compatibility, introduce a new
bpf command BPF_ENABLE_STATS. On success, this command enables stats and
returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
command to support other types of stats in the future.With BPF_ENABLE_STATS, user space tool would have the following flow:
1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
2. Check program run_time_ns;
3. Sleep for the monitoring period;
4. Check program run_time_ns again, calculate the difference;
5. Close the fd.Signed-off-by: Song Liu
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
27 Apr, 2020
4 commits
-
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.Signed-off-by: Christoph Hellwig
Acked-by: Andrey Ignatov
Signed-off-by: Al Viro -
Move the sysctl tables to the end of the file to avoid lots of pointless
forward declarations.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
Extern declarations in .c files are a bad style and can lead to
mismatches. Use existing definitions in headers where they exist,
and otherwise move the external declarations to suitable header
files.Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
watermark_boost_factor_sysctl_handler is just a pointless wrapper for
proc_dointvec_minmax, so remove it and use proc_dointvec_minmax
directly.Signed-off-by: Christoph Hellwig
Acked-by: David Rientjes
Signed-off-by: Al Viro
03 Apr, 2020
2 commits
-
Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages")
it is allowed to examine mlocked pages and compact them by default. On
-RT even minor pagefaults are problematic because it may take a few 100us
to resolve them and until then the task is blocked.Make compact_unevictable_allowed = 0 default and issue a warning on RT if
it is changed.[bigeasy@linutronix.de: v5]
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Thomas Gleixner
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: Vlastimil Babka
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
Signed-off-by: Linus Torvalds -
The proc file `compact_unevictable_allowed' should allow 0 and 1 only, the
`extra*' attribues have been set properly but without
proc_dointvec_minmax() as the `proc_handler' the limit will not be
enforced.Use proc_dointvec_minmax() as the `proc_handler' to enfoce the valid
specified range.Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Acked-by: Vlastimil Babka
Cc: Thomas Gleixner
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: Iurii Zaikin
Cc: Mel Gorman
Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de
Signed-off-by: Linus Torvalds
07 Mar, 2020
1 commit
-
Many embedded boards have a disconnected TTL level serial which can
generate some garbage that can lead to spurious false sysrq detects.Currently, sysrq can be either completely disabled for serial console
or always disabled (with CONFIG_MAGIC_SYSRQ_SERIAL), since
commit 732dbf3a6104 ("serial: do not accept sysrq characters via serial port")At Arista, we have such boards that can generate BREAK and random
garbage. While disabling sysrq for serial console would solve
the problem with spurious false sysrq triggers, it's also desirable
to have a way to enable sysrq back.Having the way to enable sysrq was beneficial to debug lockups with
a manual investigation in field and on the other side preventing false
sysrq detections.As a preparation to add sysrq_toggle_support() call into uart,
remove a private copy of sysrq_enabled from sysctl - it should reflect
the actual status of sysrq.Furthermore, the private copy isn't correct already in case
sysrq_always_enabled is true. So, remove __sysrq_enabled and use a
getter-helper sysrq_mask() to check sysrq_key_op enabled status.Cc: Iurii Zaikin
Cc: Jiri Slaby
Cc: Luis Chamberlain
Cc: Kees Cook
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dmitry Safonov
Link: https://lore.kernel.org/r/20200302175135.269397-2-dima@arista.com
Signed-off-by: Greg Kroah-Hartman
19 Feb, 2020
1 commit
-
s390 math emulation was removed with commit 5a79859ae0f3 ("s390:
remove 31 bit support"), rendering ieee_emulation_warnings useless.
The code still built because it was protected by CONFIG_MATHEMU, which
was no longer selectable.This patch removes the sysctl_ieee_emulation_warnings declaration and
the sysctl entry declaration.Link: https://lkml.kernel.org/r/20200214172628.3598516-1-steve@sk2.org
Reviewed-by: Vasily Gorbik
Signed-off-by: Stephen Kitt
Signed-off-by: Vasily Gorbik
10 Dec, 2019
1 commit
-
Currently PREEMPT_RCU and TREE_RCU are mutually exclusive Kconfig
options. But PREEMPT_RCU actually specifies a kind of TREE_RCU,
namely a preemptible TREE_RCU. This commit therefore makes PREEMPT_RCU
be a modifer to the TREE_RCU Kconfig option. This has the benefit of
simplifying several of the #if expressions that formerly needed to
check both, but now need only check one or the other.Signed-off-by: Lai Jiangshan
Signed-off-by: Lai Jiangshan
Reviewed-by: Joel Fernandes (Google)
Signed-off-by: Paul E. McKenney
02 Dec, 2019
1 commit
-
Currently, the drop_caches proc file and sysctl read back the last value
written, suggesting this is somehow a stateful setting instead of a
one-time command. Make it write-only, like e.g. compact_memory.While mitigating a VM problem at scale in our fleet, there was confusion
about whether writing to this file will permanently switch the kernel into
a non-caching mode. This influences the decision making in a tense
situation, where tens of people are trying to fix tens of thousands of
affected machines: Do we need a rollback strategy? What are the
performance implications of operating in a non-caching state for several
days? It also caused confusion when the kernel team said we may need to
write the file several times to make sure it's effective ("But it already
reads back 3?").Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Acked-by: Chris Down
Acked-by: Vlastimil Babka
Acked-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Oct, 2019
1 commit
-
Signed-off-by: Helge Deller
25 Sep, 2019
1 commit
-
arm64 handles top-down mmap layout in a way that can be easily reused by
other architectures, so make it available in mm. It then introduces a new
config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
architectures to benefit from those functions. Note that this new config
depends on MMU being enabled, if selected without MMU support, a warning
will be thrown.Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti
Suggested-by: Christoph Hellwig
Acked-by: Catalin Marinas
Acked-by: Kees Cook
Reviewed-by: Christoph Hellwig
Reviewed-by: Luis Chamberlain
Cc: Albert Ou
Cc: Alexander Viro
Cc: James Hogan
Cc: Palmer Dabbelt
Cc: Paul Burton
Cc: Ralf Baechle
Cc: Russell King
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Jul, 2019
1 commit
-
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce
Signed-off-by: Arnd Bergmann
Acked-by: Kees Cook
Reviewed-by: Aaron Tomlin
Cc: Matthew Wilcox
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jul, 2019
1 commit
-
fix lenght to length
Link: http://lkml.kernel.org/r/20190521050937.4370-1-houweitaoo@gmail.com
Signed-off-by: Weitao Hou
Acked-by: Kees Cook
Cc: Colin Ian King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 Jun, 2019
1 commit
-
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.Add a privileged interface to define a system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.Signed-off-by: Patrick Bellasi
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alessio Balsini
Cc: Dietmar Eggemann
Cc: Joel Fernandes
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Morten Rasmussen
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Quentin Perret
Cc: Rafael J . Wysocki
Cc: Steve Muckle
Cc: Suren Baghdasaryan
Cc: Tejun Heo
Cc: Thomas Gleixner
Cc: Todd Kjos
Cc: Vincent Guittot
Cc: Viresh Kumar
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar
15 Jun, 2019
1 commit
-
Convert proc_dointvec_minmax_bpf_stats() into a more generic
helper, since we are going to use jump labels more often.Note that sysctl_bpf_stats_enabled is removed, since
it is no longer needed/used.Signed-off-by: Eric Dumazet
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
21 May, 2019
1 commit
-
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the fileThese files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:GPL-2.0-only
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
15 May, 2019
4 commits
-
Today, proc_do_large_bitmap() truncates a large write input buffer to
PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
end of the buffer. Further, it fails to notify the caller that the
buffer was truncated, so it doesn't get called iteratively to finish the
entire input buffer.Tell the caller if there's more work to do by adding the skipped amount
back to left/*lenp before returning.To fix the misparsing, reset the position if we have completely consumed
a truncated buffer (or if just one char is left, which may be a "-" in a
range), and ask the caller to come back for more.Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
Signed-off-by: Eric Sandeen
Signed-off-by: Luis Chamberlain
Acked-by: Kees Cook
Cc: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently when userspace gives us a values that overflow e.g. file-max
and other callers of __do_proc_doulongvec_minmax() we simply ignore the
new value and leave the current value untouched.This can be problematic as it gives the illusion that the limit has
indeed be bumped when in fact it failed. This commit makes sure to
return EINVAL when an overflow is detected. Please note that this is a
userspace facing change.Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
Signed-off-by: Christian Brauner
Acked-by: Luis Chamberlain
Cc: Kees Cook
Cc: Alexey Dobriyan
Cc: Al Viro
Cc: Dominik Brodowski
Cc: "Eric W. Biederman"
Cc: Joe Lawrence
Cc: Waiman Long
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.Link: http://lkml.kernel.org/r/20190304094037.57756-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko
Acked-by: Kees Cook
Reviewed-by: Andrew Morton
Cc: Luis Chamberlain
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu
Suggested-by: Andrea Arcangeli
Suggested-by: Mike Rapoport
Reviewed-by: Mike Rapoport
Reviewed-by: Andrea Arcangeli
Cc: Paolo Bonzini
Cc: Hugh Dickins
Cc: Luis Chamberlain
Cc: Maxime Coquelin
Cc: Maya Gokhale
Cc: Jerome Glisse
Cc: Pavel Emelyanov
Cc: Johannes Weiner
Cc: Martin Cracauer
Cc: Denis Plotnikov
Cc: Marty McFadden
Cc: Mike Kravetz
Cc: Kees Cook
Cc: Mel Gorman
Cc: "Kirill A . Shutemov"
Cc: "Dr . David Alan Gilbert"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Apr, 2019
1 commit
-
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP
message types use larger numeric values, a simple bitmask doesn't fit.
I use large bitmap. The input and output are the in form of list of
ranges. Set the default to rate limit all error messages but Packet Too
Big. For Packet Too Big, use ratemask instead of hard-coded.There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
isn't defined, expand the description in ip-sysctl.txt and remove
unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocated. Still is a
pointer to it is needed because of the way proc_do_large_bitmap work.Signed-off-by: Stephen Suryaputra
Signed-off-by: David S. Miller
06 Apr, 2019
1 commit
-
Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up
min/max values for the file-max sysctl parameter via the .extra1 and
.extra2 fields in the corresponding struct ctl_table entry.Unfortunately, the minimum value points at the global 'zero' variable,
which is an int. This results in a KASAN splat when accessed as a long
by proc_doulongvec_minmax on 64-bit architectures:| BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
| Read of size 8 at addr ffff2000133d1c20 by task systemd/1
|
| CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
| Hardware name: linux,dummy-virt (DT)
| Call trace:
| dump_backtrace+0x0/0x228
| show_stack+0x14/0x20
| dump_stack+0xe8/0x124
| print_address_description+0x60/0x258
| kasan_report+0x140/0x1a0
| __asan_report_load8_noabort+0x18/0x20
| __do_proc_doulongvec_minmax+0x5d8/0x6a0
| proc_doulongvec_minmax+0x4c/0x78
| proc_sys_call_handler.isra.19+0x144/0x1d8
| proc_sys_write+0x34/0x58
| __vfs_write+0x54/0xe8
| vfs_write+0x124/0x3c0
| ksys_write+0xbc/0x168
| __arm64_sys_write+0x68/0x98
| el0_svc_common+0x100/0x258
| el0_svc_handler+0x48/0xc0
| el0_svc+0x8/0xc
|
| The buggy address belongs to the variable:
| zero+0x0/0x40
|
| Memory state around the buggy address:
| ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
| ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
| >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
| ^
| ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
| ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00Fix the splat by introducing a unsigned long 'zero_ul' and using that
instead.Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max")
Signed-off-by: Will Deacon
Acked-by: Christian Brauner
Cc: Kees Cook
Cc: Alexey Dobriyan
Cc: Matteo Croce
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds