14 Dec, 2014
1 commit
-
SysV can be abused to allocate locked kernel memory. For most systems, a
small limit doesn't make sense, see the discussion with regards to SHMMAX.Therefore: increase MSGMNI to the maximum supported.
And: If we ignore the risk of locking too much memory, then an automatic
scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.The code preserves auto_msgmni to avoid breaking any user space applications
that expect that the value exists.Notes:
1) If an administrator must limit the memory allocations, then he can set
MSGMNI as necessary.Or he can disable sysv entirely (as e.g. done by Android).
2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
to control latency vs. throughput:
If MSGMNB is large, then msgsnd() just returns and more messages can be queued
before a task switch to a task that calls msgrcv() is forced.[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul
Cc: Davidlohr Bueso
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Dec, 2014
1 commit
-
Pull networking updates from David Miller:
1) New offloading infrastructure and example 'rocker' driver for
offloading of switching and routing to hardware.This work was done by a large group of dedicated individuals, not
limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu2) Start making the networking operate on IOV iterators instead of
modifying iov objects in-situ during transfers. Thanks to Al Viro
and Herbert Xu.3) A set of new netlink interfaces for the TIPC stack, from Richard
Alpe.4) Remove unnecessary looping during ipv6 routing lookups, from Martin
KaFai Lau.5) Add PAUSE frame generation support to gianfar driver, from Matei
Pavaluca.6) Allow for larger reordering levels in TCP, which are easily
achievable in the real world right now, from Eric Dumazet.7) Add a variable of napi_schedule that doesn't need to disable cpu
interrupts, from Eric Dumazet.8) Use a doubly linked list to optimize neigh_parms_release(), from
Nicolas Dichtel.9) Various enhancements to the kernel BPF verifier, and allow eBPF
programs to actually be attached to sockets. From Alexei
Starovoitov.10) Support TSO/LSO in sunvnet driver, from David L Stevens.
11) Allow controlling ECN usage via routing metrics, from Florian
Westphal.12) Remote checksum offload, from Tom Herbert.
13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
driver, from Thomas Lendacky.14) Add MPLS support to openvswitch, from Simon Horman.
15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
Klassert.16) Do gro flushes on a per-device basis using a timer, from Eric
Dumazet. This tries to resolve the conflicting goals between the
desired handling of bulk vs. RPC-like traffic.17) Allow userspace to ask for the CPU upon what a packet was
received/steered, via SO_INCOMING_CPU. From Eric Dumazet.18) Limit GSO packets to half the current congestion window, from Eric
Dumazet.19) Add a generic helper so that all drivers set their RSS keys in a
consistent way, from Eric Dumazet.20) Add xmit_more support to enic driver, from Govindarajulu
Varadarajan.21) Add VLAN packet scheduler action, from Jiri Pirko.
22) Support configurable RSS hash functions via ethtool, from Eyal
Perry.* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
Fix race condition between vxlan_sock_add and vxlan_sock_release
net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
net/mlx4: Add support for A0 steering
net/mlx4: Refactor QUERY_PORT
net/mlx4_core: Add explicit error message when rule doesn't meet configuration
net/mlx4: Add A0 hybrid steering
net/mlx4: Add mlx4_bitmap zone allocator
net/mlx4: Add a check if there are too many reserved QPs
net/mlx4: Change QP allocation scheme
net/mlx4_core: Use tasklet for user-space CQ completion events
net/mlx4_core: Mask out host side virtualization features for guests
net/mlx4_en: Set csum level for encapsulated packets
be2net: Export tunnel offloads only when a VxLAN tunnel is created
gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
net: fec: only enable mdio interrupt before phy device link up
net: fec: clear all interrupt events to support i.MX6SX
net: fec: reset fep link status in suspend function
net: sock: fix access via invalid file descriptor
net: introduce helper macro for_each_cmsghdr
...
11 Dec, 2014
1 commit
-
There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system. Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to
the user.A much easier method would be a switch to change the WARN() over to a
panic. This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.This patch adds a panic_on_warn kernel parameter and
/proc/sys/kernel/panic_on_warn calls panic() in the
warn_slowpath_common() path. The function will still print out the
location of the warning.An example of the panic_on_warn output:
The first line below is from the WARN_ON() to output the WARN_ON()'s
location. After that the panic() output is displayed.WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
Kernel panic - not syncing: panic_on_warn set ...CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
Call Trace:
[] dump_stack+0x46/0x58
[] panic+0xd0/0x204
[] ? init_dummy+0x1f/0x30 [dummy_module]
[] warn_slowpath_common+0xd0/0xd0
[] ? dummy_greetings+0x40/0x40 [dummy_module]
[] warn_slowpath_null+0x1a/0x20
[] init_dummy+0x1f/0x30 [dummy_module]
[] do_one_initcall+0xd4/0x210
[] ? __vunmap+0xc2/0x110
[] load_module+0x16a9/0x1b30
[] ? store_uevent+0x70/0x70
[] ? copy_module_from_fd.isra.44+0x129/0x180
[] SyS_finit_module+0xa6/0xd0
[] system_call_fastpath+0x12/0x17Successfully tested by me.
hpa said: There is another very valid use for this: many operators would
rather a machine shuts down than being potentially compromised either
functionally or security-wise.Signed-off-by: Prarit Bhargava
Cc: Jonathan Corbet
Cc: Rusty Russell
Cc: "H. Peter Anvin"
Cc: Andi Kleen
Cc: Masami Hiramatsu
Acked-by: Yasuaki Ishimatsu
Cc: Fabian Frederick
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Nov, 2014
1 commit
-
RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
RSS key.Some drivers use a constant (and well known key), some drivers use a random
key per port, making bonding setups hard to tune. Well known keys increase
attack surface, considering that number of queues is usually a power of two.This patch provides infrastructure to help drivers doing the right thing.
netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
even if they provide ethtool -X support to let user redefine the key later.A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
RSS key even for drivers not providing ethtool -x support, in case some
applications want to precisely setup flows to match some RX queues.Tested:
myhost:~# cat /proc/sys/net/core/netdev_rss_key
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72myhost:~# ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
0: 0 1 2 3 4 5 6 7
RSS hash key:
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
12 Nov, 2014
1 commit
-
Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.All messages are still ratelimited.
Some KERN_ uses are changed to KERN_DEBUG.
This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled. Even so,
these messages are now _not_ emitted by default.This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings". For backward compatibility,
the sysctl is not removed, but it has no function. The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.cMiscellanea:
o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign argumentsSigned-off-by: Joe Perches
Signed-off-by: David S. Miller
14 Oct, 2014
1 commit
-
format_corename() can only pass the leader's pid to the core handler,
but there is no simple way to figure out which thread originated the
coredump.As Jan explains, this also means that there is no simple way to create
the backtrace of the crashed process:As programs are mostly compiled with implicit gcc -fomit-frame-pointer
one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
segment) or .debug_frame section. .debug_frame usually is present only
in separate debug info files usually not even installed on the system.
While .eh_frame is a part of the executable/library (and it is even
always mapped for C++ exceptions unwinding) it no longer has to be
present anywhere on the disk as the program could be upgraded in the
meantime and the running instance has its executable file already
unlinked from disk.One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
the file-backed memory including the executable's .eh_frame section.
But that can create huge core files, for example even due to mmapped
data files.Other possibility would be to read .eh_frame from /proc/PID/mem at the
core_pattern handler time of the core dump. For the backtrace one needs
to read the register state first which can be done from core_pattern
handler:ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
close(0); // close pipe fd to resume the sleeping dumper
waitpid(); // should report EXIT
PTRACE_GETREGS or other requestsThe remaining problem is how to get the 'tid' value of the crashed
thread. It could be read from the first NT_PRSTATUS note of the core
file but that makes the core_pattern handler complicated.Unfortunately %t is already used so this patch uses %i/%I.
Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
is experimenting with this. It is using the elfutils
(https://fedorahosted.org/elfutils/) unwinder for generating the
backtraces. Apart from not needing matching executables as mentioned
above, another advantage is that we can get the backtrace without saving
the core (which might be quite large) to disk.[mmilata@redhat.com: final paragraph of changelog]
Signed-off-by: Jan Kratochvil
Signed-off-by: Oleg Nesterov
Cc: Alexander Viro
Cc: Denys Vlasenko
Cc: Jan Kratochvil
Cc: Mark Wielaard
Cc: Martin Milata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Sep, 2014
1 commit
-
TIPC name table updates are distributed asynchronously in a cluster,
entailing a risk of certain race conditions. E.g., if two nodes
simultaneously issue conflicting (overlapping) publications, this may
not be detected until both publications have reached a third node, in
which case one of the publications will be silently dropped on that
node. Hence, we end up with an inconsistent name table.In most cases this conflict is just a temporary race, e.g., one
node is issuing a publication under the assumption that a previous,
conflicting, publication has already been withdrawn by the other node.
However, because of the (rtt related) distributed update delay, this
may not yet hold true on all nodes. The symptom of this failure is a
syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".In this commit we add a resiliency queue at the receiving end of
the name table distributor. When insertion of an arriving publication
fails, we retain it in this queue for a short amount of time, assuming
that another update will arrive very soon and clear the conflict. If so
happens, we insert the publication, otherwise we drop it.The (configurable) retention value defaults to 2000 ms. Knowing from
experience that the situation described above is extremely rare, there
is no risk that the queue will accumulate any large number of items.Signed-off-by: Erik Hugne
Signed-off-by: Jon Maloy
Acked-by: Ying Xue
Signed-off-by: David S. Miller
09 Aug, 2014
1 commit
-
This taint flag will be set if the system has ever entered a softlockup
state. Similar to TAINT_WARN it is useful to know whether or not the
system has been in a softlockup state when debugging.[akpm@linux-foundation.org: apply the taint before calling panic()]
Signed-off-by: Josh Hunt
Cc: Jason Baron
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jun, 2014
2 commits
-
A 'softlockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than a predefined period to time, without giving
other tasks a chance to run.Currently, upon detection of this condition by the per-cpu watchdog
task, debug information (including a stack trace) is sent to the system
log.On some occasions, we have observed that the "victim" rather than the
actual "culprit" (i.e. the owner/holder of the contended resource) is
reported to the user. Often this information has proven to be
insufficient to assist debugging efforts.To avoid loss of useful debug information, for architectures which
support NMI, this patch makes it possible to improve soft lockup
reporting. This is accomplished by issuing an NMI to each cpu to obtain
a stack trace.If NMI is not supported we just revert back to the old method. A sysctl
and boot-time parameter is available to toggle this feature.[dzickus@redhat.com: add CONFIG_SMP in certain areas]
[akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
[mq@suse.cz: fix warning]
Signed-off-by: Aaron Tomlin
Signed-off-by: Don Zickus
Cc: David S. Miller
Cc: Mateusz Guzik
Cc: Oleg Nesterov
Signed-off-by: Jan Moskyto Matejka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.This successfully solves the divide by zero issue at the same time.
Signed-off-by: David Rientjes
Reported-by: Oleg Drokin
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Jun, 2014
1 commit
-
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0$ cat /proc/sys/kernel/modprobe
/bin/trueExpected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jun, 2014
2 commits
-
Existing description is worded in a way which almost encourages setting of
vfs_cache_pressure above 100, possibly way above it.Users are left in a dark what this numeric value is - an int? a
percentage? what the scale is?As a result, we are getting reports about noticeable performance
degradation from users who have set vfs_cache_pressure to ridiculously
high values - because they thought there is no downside to it.Via code inspection it's obvious that this value is treated as a
percentage. This patch changes text to reflect this fact, and adds a
cautionary paragraph advising against setting vfs_cache_pressure sky high.Signed-off-by: Denys Vlasenko
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common but few of the workloads are
NUMA-aware and it's routine to see major performance degradation due to
zone_reclaim_mode being enabled but relatively few can identify the
problem.Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately so lets have a
sensible default for the bulk of users.This patch (of 2):
zone_reclaim_mode causes processes to prefer reclaiming memory from
local node instead of spilling over to other nodes. This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes. The NUMA penalties were
sufficiently high to justify reclaiming the memory. On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this. Favour the
common case and disable it by default. Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Reviewed-by: Zhang Yanfei
Acked-by: Michal Hocko
Reviewed-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Apr, 2014
1 commit
-
As sysctl_hung_task_timeout_sec is unsigned long, when this value is
larger then LONG_MAX/HZ, the function schedule_timeout_interruptible in
watchdog will return immediately without sleep and with print :schedule_timeout: wrong timeout value ffffffffffffff83
and then the funtion watchdog will call schedule_timeout_interruptible
again and again. The screen will be filled with"schedule_timeout: wrong timeout value ffffffffffffff83"
This patch does some check and correction in sysctl, to let the function
schedule_timeout_interruptible allways get the valid parameter.Signed-off-by: Liu Hua
Tested-by: Satoru Takeuchi
Cc: [3.4+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Apr, 2014
1 commit
-
Pull module updates from Rusty Russell:
"Nothing major: the stricter permissions checking for sysfs broke a
staging driver; fix included. Greg KH said he'd take the patch but
hadn't as the merge window opened, so it's included here to avoid
breaking build"* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
staging: fix up speakup kobject mode
Use 'E' instead of 'X' for unsigned module taint flag.
VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
kallsyms: fix percpu vars on x86-64 with relocation.
kallsyms: generalize address range checking
module: LLVMLinux: Remove unused function warning from __param_check macro
Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
module: remove MODULE_GENERIC_TABLE
module: allow multiple calls to MODULE_DEVICE_TABLE() per module
module: use pr_cont
04 Apr, 2014
1 commit
-
There is plenty of anecdotal evidence and a load of blog posts
suggesting that using "drop_caches" periodically keeps your system
running in "tip top shape". Perhaps adding some kernel documentation
will increase the amount of accurate data on its use.If we are not shrinking caches effectively, then we have real bugs.
Using drop_caches will simply mask the bugs and make them harder to
find, but certainly does not fix them, nor is it an appropriate
"workaround" to limit the size of the caches. On the contrary, there
have been bug reports on issues that turned out to be misguided use of
cache dropping.Dropping caches is a very drastic and disruptive operation that is good
for debugging and running tests, but if it creates bug reports from
production use, kernel developers should be aware of its use.Add a bit more documentation about it, a syslog message to track down
abusers, and vmstat drop counters to help analyze problem reports.[akpm@linux-foundation.org: checkpatch fixes]
[hannes@cmpxchg.org: add runtime suppression control]
Signed-off-by: Dave Hansen
Signed-off-by: Michal Hocko
Acked-by: KOSAKI Motohiro
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Apr, 2014
1 commit
-
Pull scheduler changes from Ingo Molnar:
"Bigger changes:- sched/idle restructuring: they are WIP preparation for deeper
integration between the scheduler and idle state selection, by
Nicolas Pitre.- add NUMA scheduling pseudo-interleaving, by Rik van Riel.
- optimize cgroup context switches, by Peter Zijlstra.
- RT scheduling enhancements, by Thomas Gleixner.
The rest is smaller changes, non-urgnt fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
sched: Clean up the task_hot() function
sched: Remove double calculation in fix_small_imbalance()
sched: Fix broken setscheduler()
sparc64, sched: Remove unused sparc64_multi_core
sched: Remove unused mc_capable() and smt_capable()
sched/numa: Move task_numa_free() to __put_task_struct()
sched/fair: Fix endless loop in idle_balance()
sched/core: Fix endless loop in pick_next_task()
sched/fair: Push down check for high priority class task into idle_balance()
sched/rt: Fix picking RT and DL tasks from empty queue
trace: Replace hardcoding of 19 with MAX_NICE
sched: Guarantee task priority in pick_next_task()
sched/idle: Remove stale old file
sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
cpuidle/arm64: Remove redundant cpuidle_idle_call()
cpuidle/powernv: Remove redundant cpuidle_idle_call()
sched, nohz: Exclude isolated cores from load balancing
sched: Fix select_task_rq_fair() description comments
workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
...
13 Mar, 2014
1 commit
-
Users have reported being unable to trace non-signed modules loaded
within a kernel supporting module signature.This is caused by tracepoint.c:tracepoint_module_coming() refusing to
take into account tracepoints sitting within force-loaded modules
(TAINT_FORCED_MODULE). The reason for this check, in the first place, is
that a force-loaded module may have a struct module incompatible with
the layout expected by the kernel, and can thus cause a kernel crash
upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y.Tracepoints, however, specifically accept TAINT_OOT_MODULE and
TAINT_CRAP, since those modules do not lead to the "very likely system
crash" issue cited above for force-loaded modules.With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed
module is tainted re-using the TAINT_FORCED_MODULE taint flag.
Unfortunately, this means that Tracepoints treat that module as a
force-loaded module, and thus silently refuse to consider any tracepoint
within this module.Since an unsigned module does not fit within the "very likely system
crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag
to specifically address this taint behavior, and accept those modules
within Tracepoints. We use the letter 'X' as a taint flag character for
a module being loaded that doesn't know how to sign its name (proposed
by Steven Rostedt).Also add the missing 'O' entry to trace event show_module_flags() list
for the sake of completeness.Signed-off-by: Mathieu Desnoyers
Acked-by: Steven Rostedt
NAKed-by: Ingo Molnar
CC: Thomas Gleixner
CC: David Howells
CC: Greg Kroah-Hartman
Signed-off-by: Rusty Russell
27 Feb, 2014
1 commit
-
It's not really a regression fix, so move it to the v3.15 queue.
Signed-off-by: Ingo Molnar
02 Feb, 2014
1 commit
-
Conflicts:
kernel/sysctl.cSigned-off-by: Ingo Molnar
01 Feb, 2014
1 commit
-
Pull core debug changes from Ingo Molnar:
"This contains mostly kernel debugging related updates:- make hung_task detection more configurable to distros
- add final bits for x86 UV NMI debugging, with related KGDB changes
- update the mailing-list of MAINTAINERS entries I'm involved with"* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hung_task: Display every hung task warning
sysctl: Add neg_one as a standard constraint
x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
x86/uv/nmi: Fix Sparse warnings
kgdb/kdb: Fix no KDB config problem
MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries
31 Jan, 2014
1 commit
-
Improve the documentation on hung_task_warnings.
Signed-off-by: Aaron Tomlin
Link: http://lkml.kernel.org/n/tip-xepjnxzummfDlg9lvhh7Rlzc@git.kernel.org
Signed-off-by: Ingo Molnar
30 Jan, 2014
1 commit
-
Prior to commit fe35004fbf9e ("mm: avoid swapping out with
swappiness==0") setting swappiness to 0, reclaim code could still evict
recently used user anonymous memory to swap even though there is a
significant amount of RAM used for page cache.The behaviour of setting swappiness to 0 has since changed. When set,
the reclaim code does not initiate swap until the amount of free pages
and file-backed pages, is less than the high water mark in a zone.Let's update the documentation to reflect this.
[akpm@linux-foundation.org: remove comma, per Randy]
Signed-off-by: Aaron Tomlin
Acked-by: Rik van Riel
Acked-by: Bryn M. Reeves
Cc: Satoru Moriya
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Jan, 2014
1 commit
-
Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.Signed-off-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Peter Zijlstra
Cc: Chegu Vinod
Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar
25 Jan, 2014
1 commit
-
When khungtaskd detects hung tasks, it prints out
backtraces from a number of those tasks.Limiting the number of backtraces being printed
out can result in the user not seeing the information
necessary to debug the issue. The hung_task_warnings
sysctl controls this feature.This patch makes it possible for hung_task_warnings
to accept a special value to print an unlimited
number of backtraces when khungtaskd detects hung
tasks.The special value is -1. To use this value it is
necessary to change types from ulong to int.Signed-off-by: Aaron Tomlin
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
[ Build warning fix. ]
Signed-off-by: Ingo Molnar
24 Jan, 2014
1 commit
-
For general-purpose (i.e. distro) kernel builds it makes sense to build
with CONFIG_KEXEC to allow end users to choose what kind of things they
want to do with kexec. However, in the face of trying to lock down a
system with such a kernel, there needs to be a way to disable kexec_load
(much like module loading can be disabled). Without this, it is too easy
for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
and modules_disabled are set. With this change, it is still possible to
load an image for use later, then disable kexec_load so the image (or lack
of image) can't be altered.The intention is for using this in environments where "perfect"
enforcement is hard. Without a verified boot, along with verified
modules, and along with verified kexec, this is trying to give a system a
better chance to defend itself (or at least grow the window of
discoverability) against attack in the face of a privilege escalation.In my mind, I consider several boot scenarios:
1) Verified boot of read-only verified root fs loading fd-based
verification of kexec images.
2) Secure boot of writable root fs loading signed kexec images.
3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
4) Regular boot with no control of kexec image at all.1 and 2 don't exist yet, but will soon once the verified kexec series has
landed. 4 is the state of things now. The gap between 2 and 4 is too
large, so this change creates scenario 3, a middle-ground above 4 when 2
and 1 are not possible for a system.Signed-off-by: Kees Cook
Acked-by: Rik van Riel
Cc: Vivek Goyal
Cc: Eric Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Jan, 2014
1 commit
-
Some applications that run on HPC clusters are designed around the
availability of RAM and the overcommit ratio is fine tuned to get the
maximum usage of memory without swapping. With growing memory, the
1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
for these workload (on a 2TB machine it represents no less than 20GB).This patch adds the new overcommit_kbytes sysctl variable that allow a
much finer grain.[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Jerome Marchand
Cc: Dave Hansen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Dec, 2013
1 commit
-
commit 887c290e (sched/numa: Decide whether to favour task or group weights
based on swap candidate relationships) drop the check against
sysctl_numa_balancing_settle_count, this patch remove the sysctl.Signed-off-by: Wanpeng Li
Acked-by: Mel Gorman
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Signed-off-by: Peter Zijlstra
Cc: Andrew Morton
Cc: Naoya Horiguchi
Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar
13 Nov, 2013
2 commits
-
Some setuid binaries will allow reading of files which have read
permission by the real user id. This is problematic with files which
use %pK because the file access permission is checked at open() time,
but the kptr_restrict setting is checked at read() time. If a setuid
binary opens a %pK file as an unprivileged user, and then elevates
permissions before reading the file, then kernel pointer values may be
leaked.This happens for example with the setuid pppd application on Ubuntu 12.04:
$ head -1 /proc/kallsyms
00000000 T startup_32$ pppd file /proc/kallsyms
pppd: In file /proc/kallsyms: unrecognized option 'c1000000'This will only leak the pointer value from the first line, but other
setuid binaries may leak more information.Fix this by adding a check that in addition to the current process having
CAP_SYSLOG, that effective user and group ids are equal to the real ids.
If a setuid binary reads the contents of a file which uses %pK then the
pointer values will be printed as NULL if the real user is unprivileged.Update the sysctl documentation to reflect the changes, and also correct
the documentation to state the kptr_restrict=0 is the default.This is a only temporary solution to the issue. The correct solution is
to do the permission check at open() time on files, and to replace %pK
with a function which checks the open() time permission. %pK uses in
printk should be removed since no sane permission check can be done, and
instead protected by using dmesg_restrict.Signed-off-by: Ryan Mallon
Cc: Kees Cook
Cc: Alexander Viro
Cc: Joe Perches
Cc: "Eric W. Biederman"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now dirty_background_ratio/dirty_ratio contains a percentage of total
avaiable memory, which contains free pages and reclaimable pages. The
number of these pages is not equal to the number of total system memory.
But they are described as a percentage of total system memory in
Documentation/sysctl/vm.txt. So we need to fix them to avoid
misunderstanding.Signed-off-by: Zheng Liu
Cc: Rob Landley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Oct, 2013
5 commits
-
Shared faults can lead to lots of unnecessary page migrations,
slowing down the system, and causing private faults to hit the
per-pgdat migration ratelimit.This patch adds sysctl numa_balancing_migrate_deferred, which specifies
how many shared page migrations to skip unconditionally, after each page
migration that is skipped because it is a shared fault.This reduces the number of page migrations back and forth in
shared fault situations. It also gives a strong preference to
the tasks that are already running where most of the memory is,
and to moving the other tasks to near the memory.Testing this with a much higher scan rate than the default
still seems to result in fewer page migrations than before.Memory seems to be somewhat better consolidated than previously,
with multi-instance specjbb runs on a 4 node system.Signed-off-by: Rik van Riel
Signed-off-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc: Srikar Dronamraju
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar -
With scan rate adaptions based on whether the workload has properly
converged or not there should be no need for the scan period reset
hammer. Get rid of it.Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc: Srikar Dronamraju
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar -
This patch favours moving tasks towards NUMA node that recorded a higher
number of NUMA faults during active load balancing. Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the nodes CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a PTE scans which is controlled by the numa_balancing_settle_count
sysctl. Once the settle_count number of scans has complete the schedule
is free to place the task on an alternative node if the load is imbalanced.[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc: Srikar Dronamraju
[ Tunable and use higher faults instead of preferred. ]
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar -
The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is is partially a function of the tasks memory
size. This patch alters the semantic of the min and max tunables to be
about tuning the length time it takes to complete a scan of a tasks occupied
virtual address space. Conceptually this is a lot easier to understand. There
is a "sanity" check to ensure the scan rate is never extremely fast based on
the amount of virtual memory that should be scanned in a second. The default
of 2.5G seems arbitrary but it is to have the maximum scan rate after the
patch roughly match the maximum scan rate before the patch was applied.On a similar note, numa_scan_period is in milliseconds and not
jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies
to numa_scan_period means that the rate scanning slows depends on HZ which
is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms.Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc: Srikar Dronamraju
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar -
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Johannes Weiner
Cc: Srikar Dronamraju
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar
12 Sep, 2013
2 commits
-
Add a new %P variable to be used in core_pattern. This variable contains
the global PID (PID in the init namespace) as %p contains the PID in the
current namespace which isn't always what we want.The main use for this is to make it easier to handle crashes that happened
within a container. With that new variables it's possible to have the
crashes dumped into the container or forwarded to the host with the right
PID (from the host's point of view).Signed-off-by: Stéphane Graber
Reported-by: Hans Feldt
Cc: Alexander Viro
Cc: Eric W. Biederman
Cc: Andy Whitcroft
Acked-by: Serge E. Hallyn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now hugepage migration is enabled, although restricted on pmd-based
hugepages for now (due to lack of testing.) So we should allocate
migratable hugepages from ZONE_MOVABLE if possible.This patch makes GFP flags in hugepage allocation dependent on migration
support, not only the value of hugepages_treat_as_movable. It provides no
change on the behavior for architectures which do not support hugepage
migration,Signed-off-by: Naoya Horiguchi
Acked-by: Andi Kleen
Reviewed-by: Wanpeng Li
Cc: Hillf Danton
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Cc: Michal Hocko
Cc: Rik van Riel
Cc: "Aneesh Kumar K.V"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Aug, 2013
1 commit
-
By default, the pfifo_fast queue discipline has been used by default
for all devices. But we have better choices now.This patch allow setting the default queueing discipline with sysctl.
This allows easy use of better queueing disciplines on all devices
without having to use tc qdisc scripts. It is intended to allow
an easy path for distributions to make fq_codel or sfq the default
qdisc.This patch also makes pfifo_fast more of a first class qdisc, since
it is now possible to manually override the default and explicitly
use pfifo_fast. The behavior for systems who do not use the sysctl
is unchanged, they still get pfifo_fastAlso removes leftover random # in sysctl net core.
Signed-off-by: Stephen Hemminger
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
02 Aug, 2013
1 commit
-
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.Cc: Eliezer Tamir
Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
11 Jul, 2013
1 commit
-
Rename LL_SO to BUSY_POLL_SO
Rename sysctl_net_ll_{read,poll} to sysctl_busy_{read,poll}
Fix up users of these variables.
Fix documentation for sysctl.a patch for the socket.7 man page will follow separately,
because of limitations of my mail setup.Signed-off-by: Eliezer Tamir
Signed-off-by: David S. Miller