Eric Lee / smarc-fsl-linux-kernel

17 Apr, 2015

1 commit

0ec62afeb Doc/sysctl/kernel.txt: document threads-max ... Browse Code »

File /proc/sys/kernel/threads-max controls the maximum number of threads
that can be created using fork().

[akpm@linux-foundation.org: fix typo, per Guenter]
Signed-off-by: Heinrich Schuchardt
Cc: Oleg Nesterov
Cc: Ingo Molnar
Cc: Guenter Roeck
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Heinrich Schuchardt
2015-04-17 21:04:07 +0800

16 Apr, 2015

1 commit

5bbe3547a mm: allow compaction of unevictable pages ... Browse Code »

Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration. The POSIX real time
extension explicitly states that mlock() will prevent a major page
fault, but the spirit of this is that mlock() should give a process the
ability to control sources of latency, including minor page faults.
However, the mlock manpage only explicitly says that a locked page will
not be written to swap and this can cause some confusion. The
compaction code today does not give a developer who wants to avoid swap
but wants to have large contiguous areas available any method to achieve
this state. This patch introduces a sysctl for controlling compaction
behavior with respect to the unevictable lru. Users who demand no page
faults after a page is present can set compact_unevictable_allowed to 0
and users who need the large contiguous areas can enable compaction on
locked memory by leaving the default value of 1.

To illustrate this problem I wrote a quick test program that mmaps a
large number of 1MB files filled with random data. These maps are
created locked and read only. Then every other mmap is unmapped and I
attempt to allocate huge pages to the static huge page pool. When the
compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
after fragmenting memory. When the value is set to 1, allocations
succeed.

Signed-off-by: Eric B Munson
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Acked-by: Christoph Lameter
Acked-by: David Rientjes
Acked-by: Rik van Riel
Cc: Vlastimil Babka
Cc: Thomas Gleixner
Cc: Christoph Lameter
Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: David Rientjes
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric B Munson
2015-04-16 07:35:17 +0800

15 Apr, 2015

1 commit

195daf665 watchdog: enable the new user interface of the watchdog mechanism ... Browse Code »

With the current user interface of the watchdog mechanism it is only
possible to disable or enable both lockup detectors at the same time.
This series introduces new kernel parameters and changes the semantics of
some existing kernel parameters, so that the hard lockup detector and the
soft lockup detector can be disabled or enabled individually. With this
series applied, the user interface is as follows.

- parameters in /proc/sys/kernel

. soft_watchdog
This is a new parameter to control and examine the run state of
the soft lockup detector.

. nmi_watchdog
The semantics of this parameter have changed. It can now be used
to control and examine the run state of the hard lockup detector.

. watchdog
This parameter is still available to control the run state of both
lockup detectors at the same time. If this parameter is examined,
it shows the logical OR of soft_watchdog and nmi_watchdog.

. watchdog_thresh
The semantics of this parameter are not affected by the patch.

- kernel command line parameters

. nosoftlockup
The semantics of this parameter have changed. It can now be used
to disable the soft lockup detector at boot time.

. nmi_watchdog=0 or nmi_watchdog=1
Disable or enable the hard lockup detector at boot time. The patch
introduces '=1' as a new option.

. nowatchdog
The semantics of this parameter are not affected by the patch. It
is still available to disable both lockup detectors at boot time.

Also, remove the proc_dowatchdog() function which is no longer needed.

[dzickus@redhat.com: wrote changelog]
[dzickus@redhat.com: update documentation for kernel params and sysctl]
Signed-off-by: Ulrich Obergfell
Signed-off-by: Don Zickus
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ulrich Obergfell
2015-04-15 07:48:59 +0800

12 Feb, 2015

3 commits

59d53737a Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge second set of updates from Andrew Morton:
"More of MM"

* emailed patches from Andrew Morton : (83 commits)
mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
vmstat: Reduce time interval to stat update on idle cpu
mm/page_owner.c: remove unnecessary stack_trace field
Documentation/filesystems/proc.txt: describe /proc//map_files
mm: incorporate read-only pages into transparent huge pages
vmstat: do not use deferrable delayed work for vmstat_update
mm: more aggressive page stealing for UNMOVABLE allocations
mm: always steal split buddies in fallback allocations
mm: when stealing freepages, also take pages created by splitting buddy page
mincore: apply page table walker on do_mincore()
mm: /proc/pid/clear_refs: avoid split_huge_page()
mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
mempolicy: apply page table walker on queue_pages_range()
arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
memcg: cleanup preparation for page table walk
numa_maps: remove numa_maps->vma
numa_maps: fix typo in gather_hugetbl_stats
pagemap: use walk->vma instead of calling find_vma()
clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
...

Linus Torvalds
2015-02-12 10:23:28 +0800
dc6c9a35b mm: account pmd page tables to the process ... Browse Code »

Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup. The trick is to allocate a lot of PMD page tables. Linux
kernel doesn't account PMD tables to the process, only PTE.

The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low. oom_score for the process will be 0.

#include
#include
#include
#include
#include
#include

#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)

#define NR_PUD 130000

int main(void)
{
char *addr = NULL;
unsigned long i;

prctl(PR_SET_THP_DISABLE);
for (i = 0; i < NR_PUD ; i++) {
addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
break;
}
*addr = 'x';
munmap(addr, PMD_SIZE);
mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
if (addr == MAP_FAILED)
perror("re-mmap"), exit(1);
}
printf("PID %d consumed %lu KiB in PMD page tables\n",
getpid(), i * 4096 >> 10);
return pause();
}

The patch addresses the issue by account PMD tables to the process the
same way we account PTE.

The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:

- HugeTLB can share PMD page tables. The patch handles by accounting
the table to all processes who share it.

- x86 PAE pre-allocates few PMD tables on fork.

- Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
check on exit(2).

Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded). As with nr_ptes we use per-mm counter. The
counter value is used to calculate baseline for badness score by
oom-killer.

Signed-off-by: Kirill A. Shutemov
Reported-by: Dave Hansen
Cc: Hugh Dickins
Reviewed-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: David Rientjes
Tested-by: Sedat Dilek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2015-02-12 09:06:04 +0800
73b4f63ae Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6 ... Browse Code »

Pull documentation updates from Jonathan Corbet:
"Highlights this time around include:

- A thrashing of SubmittingPatches to bring it out of the "send
everything to Linus" era of kernel development.

- A new document on completions from Nicholas McGuire

- Lots of typo fixes, formatting improvements, corrections, build
fixes, and more"

* tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (35 commits)
Documentation: Fix the wrong command `echo -1 > set_ftrace_pid` for cleaning the filter.
can-doc: Fixed a wrong filepath in can.txt
Documentation: Fix trivial typo in comment.
kgdb,docs: Fix typo and minor style issues
Documentation: add description for FTRACE probe status
doc: brief user documentation for completion
Documentation/misc-devices/mei: Fix indentation of embedded code.
Documentation/misc-devices/mei: Fix indentation of enumeration.
Documentation/misc-devices/mei: Fix spacing around parentheses.
Documentation/misc-devices/mei: Fix formatting of headings.
Documentation: devicetree: Fix double words in Doumentation/devicetree
Documentation: mm: Fix typo in vm.txt
lockstat: Add documentation on contention and contenting points
Documentation: fix blackfin gptimers-example build errors
Fixes column alignment in table of contents entry 1.9 in Documentation/filesystems/proc.txt
CodingStyle: enable emacs display of trailing whitespace
DocBook: Do not exceed argument list limit
gpio: board.txt: Fix the gpio name example
Documentation/SubmittingPatches: unify whitespace/tabs for the DCO
MAINTAINERS: Add the docs-next git tree to the maintainer entry
...

Linus Torvalds
2015-02-12 05:03:11 +0800

11 Feb, 2015

1 commit

c5ce28df0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) More iov_iter conversion work from Al Viro.

[ The "crypto: switch af_alg_make_sg() to iov_iter" commit was
wrong, and this pull actually adds an extra commit on top of the
branch I'm pulling to fix that up, so that the pre-merge state is
ok. - Linus ]

2) Various optimizations to the ipv4 forwarding information base trie
lookup implementation. From Alexander Duyck.

3) Remove sock_iocb altogether, from CHristoph Hellwig.

4) Allow congestion control algorithm selection via routing metrics.
From Daniel Borkmann.

5) Make ipv4 uncached route list per-cpu, from Eric Dumazet.

6) Handle rfs hash collisions more gracefully, also from Eric Dumazet.

7) Add xmit_more support to r8169, e1000, and e1000e drivers. From
Florian Westphal.

8) Transparent Ethernet Bridging support for GRO, from Jesse Gross.

9) Add BPF packet actions to packet scheduler, from Jiri Pirko.

10) Add support for uniqu flow IDs to openvswitch, from Joe Stringer.

11) New NetCP ethernet driver, from Muralidharan Karicheri and Wingman
Kwok.

12) More sanely handle out-of-window dupacks, which can result in
serious ACK storms. From Neal Cardwell.

13) Various rhashtable bug fixes and enhancements, from Herbert Xu,
Patrick McHardy, and Thomas Graf.

14) Support xmit_more in be2net, from Sathya Perla.

15) Group Policy extensions for vxlan, from Thomas Graf.

16) Remove Checksum Offload support for vxlan, from Tom Herbert.

17) Like ipv4, support lockless transmit over ipv6 UDP sockets. From
Vlad Yasevich.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1494+1 commits)
crypto: fix af_alg_make_sg() conversion to iov_iter
ipv4: Namespecify TCP PMTU mechanism
i40e: Fix for stats init function call in Rx setup
tcp: don't include Fast Open option in SYN-ACK on pure SYN-data
openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is set
ipv6: Make __ipv6_select_ident static
ipv6: Fix fragment id assignment on LE arches.
bridge: Fix inability to add non-vlan fdb entry
net: Mellanox: Delete unnecessary checks before the function call "vunmap"
cxgb4: Add support in cxgb4 to get expansion rom version via ethtool
ethtool: rename reserved1 memeber in ethtool_drvinfo for expansion ROM version
net: dsa: Remove redundant phy_attach()
IB/mlx4: Reset flow support for IB kernel ULPs
IB/mlx4: Always use the correct port for mirrored multicast attachments
net/bonding: Fix potential bad memory access during bonding events
tipc: remove tipc_snprintf
tipc: nl compat add noop and remove legacy nl framework
tipc: convert legacy nl stats show to nl compat
tipc: convert legacy nl net id get to nl compat
tipc: convert legacy nl net id set to nl compat
...

Linus Torvalds
2015-02-11 12:01:30 +0800

03 Feb, 2015

1 commit

b245be1f4 net-timestamp: no-payload only sysctl ... Browse Code »

Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.

Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.

The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.

No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.

Signed-off-by: Willem de Bruijn

----

Changes
(v1 -> v2)
- test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
(rfc -> v1)
- document the sysctl in Documentation/sysctl/net.txt
- fix access control race: read .._OPT_TSONLY only once,
use same value for permission check and skb generation.
Signed-off-by: David S. Miller

Willem de Bruijn
2015-02-03 10:46:51 +0800

29 Jan, 2015

1 commit

633708a4a Documentation: mm: Fix typo in vm.txt ... Browse Code »

This patch fix a spelling typo in Documentation/sysctl/vm.txt

Signed-off-by: Masanari Iida
Signed-off-by: Jonathan Corbet

Masanari Iida
2015-01-29 06:13:11 +0800

22 Dec, 2014

1 commit

c5f454659 livepatch: kernel: add TAINT_LIVEPATCH ... Browse Code »

This adds a new taint flag to indicate when the kernel or a kernel
module has been live patched. This will provide a clean indication in
bug reports that live patching was used.

Additionally, if the crash occurs in a live patched function, the live
patch module will appear beside the patched function in the backtrace.

Signed-off-by: Seth Jennings
Acked-by: Josh Poimboeuf
Reviewed-by: Miroslav Benes
Reviewed-by: Petr Mladek
Reviewed-by: Masami Hiramatsu
Signed-off-by: Jiri Kosina

Seth Jennings
2014-12-22 22:40:48 +0800

14 Dec, 2014

1 commit

0050ee059 ipc/msg: increase MSGMNI, remove scaling ... Browse Code »

SysV can be abused to allocate locked kernel memory. For most systems, a
small limit doesn't make sense, see the discussion with regards to SHMMAX.

Therefore: increase MSGMNI to the maximum supported.

And: If we ignore the risk of locking too much memory, then an automatic
scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

The code preserves auto_msgmni to avoid breaking any user space applications
that expect that the value exists.

Notes:
1) If an administrator must limit the memory allocations, then he can set
MSGMNI as necessary.

Or he can disable sysv entirely (as e.g. done by Android).

2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
to control latency vs. throughput:
If MSGMNB is large, then msgsnd() just returns and more messages can be queued
before a task switch to a task that calls msgrcv() is forced.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul
Cc: Davidlohr Bueso
Cc: Rafael Aquini
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Manfred Spraul
2014-12-14 04:42:52 +0800

12 Dec, 2014

1 commit

70e71ca0a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) New offloading infrastructure and example 'rocker' driver for
offloading of switching and routing to hardware.

This work was done by a large group of dedicated individuals, not
limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

2) Start making the networking operate on IOV iterators instead of
modifying iov objects in-situ during transfers. Thanks to Al Viro
and Herbert Xu.

3) A set of new netlink interfaces for the TIPC stack, from Richard
Alpe.

4) Remove unnecessary looping during ipv6 routing lookups, from Martin
KaFai Lau.

5) Add PAUSE frame generation support to gianfar driver, from Matei
Pavaluca.

6) Allow for larger reordering levels in TCP, which are easily
achievable in the real world right now, from Eric Dumazet.

7) Add a variable of napi_schedule that doesn't need to disable cpu
interrupts, from Eric Dumazet.

8) Use a doubly linked list to optimize neigh_parms_release(), from
Nicolas Dichtel.

9) Various enhancements to the kernel BPF verifier, and allow eBPF
programs to actually be attached to sockets. From Alexei
Starovoitov.

10) Support TSO/LSO in sunvnet driver, from David L Stevens.

11) Allow controlling ECN usage via routing metrics, from Florian
Westphal.

12) Remote checksum offload, from Tom Herbert.

13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
driver, from Thomas Lendacky.

14) Add MPLS support to openvswitch, from Simon Horman.

15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
Klassert.

16) Do gro flushes on a per-device basis using a timer, from Eric
Dumazet. This tries to resolve the conflicting goals between the
desired handling of bulk vs. RPC-like traffic.

17) Allow userspace to ask for the CPU upon what a packet was
received/steered, via SO_INCOMING_CPU. From Eric Dumazet.

18) Limit GSO packets to half the current congestion window, from Eric
Dumazet.

19) Add a generic helper so that all drivers set their RSS keys in a
consistent way, from Eric Dumazet.

20) Add xmit_more support to enic driver, from Govindarajulu
Varadarajan.

21) Add VLAN packet scheduler action, from Jiri Pirko.

22) Support configurable RSS hash functions via ethtool, from Eyal
Perry.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
Fix race condition between vxlan_sock_add and vxlan_sock_release
net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
net/mlx4: Add support for A0 steering
net/mlx4: Refactor QUERY_PORT
net/mlx4_core: Add explicit error message when rule doesn't meet configuration
net/mlx4: Add A0 hybrid steering
net/mlx4: Add mlx4_bitmap zone allocator
net/mlx4: Add a check if there are too many reserved QPs
net/mlx4: Change QP allocation scheme
net/mlx4_core: Use tasklet for user-space CQ completion events
net/mlx4_core: Mask out host side virtualization features for guests
net/mlx4_en: Set csum level for encapsulated packets
be2net: Export tunnel offloads only when a VxLAN tunnel is created
gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
net: fec: only enable mdio interrupt before phy device link up
net: fec: clear all interrupt events to support i.MX6SX
net: fec: reset fep link status in suspend function
net: sock: fix access via invalid file descriptor
net: introduce helper macro for_each_cmsghdr
...

Linus Torvalds
2014-12-12 06:27:06 +0800

11 Dec, 2014

1 commit

9e3961a09 kernel: add panic_on_warn ... Browse Code »

There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system. Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to
the user.

A much easier method would be a switch to change the WARN() over to a
panic. This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.

This patch adds a panic_on_warn kernel parameter and
/proc/sys/kernel/panic_on_warn calls panic() in the
warn_slowpath_common() path. The function will still print out the
location of the warning.

An example of the panic_on_warn output:

The first line below is from the WARN_ON() to output the WARN_ON()'s
location. After that the panic() output is displayed.

WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
Kernel panic - not syncing: panic_on_warn set ...

CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
Call Trace:
[] dump_stack+0x46/0x58
[] panic+0xd0/0x204
[] ? init_dummy+0x1f/0x30 [dummy_module]
[] warn_slowpath_common+0xd0/0xd0
[] ? dummy_greetings+0x40/0x40 [dummy_module]
[] warn_slowpath_null+0x1a/0x20
[] init_dummy+0x1f/0x30 [dummy_module]
[] do_one_initcall+0xd4/0x210
[] ? __vunmap+0xc2/0x110
[] load_module+0x16a9/0x1b30
[] ? store_uevent+0x70/0x70
[] ? copy_module_from_fd.isra.44+0x129/0x180
[] SyS_finit_module+0xa6/0xd0
[] system_call_fastpath+0x12/0x17

Successfully tested by me.

hpa said: There is another very valid use for this: many operators would
rather a machine shuts down than being potentially compromised either
functionally or security-wise.

Signed-off-by: Prarit Bhargava
Cc: Jonathan Corbet
Cc: Rusty Russell
Cc: "H. Peter Anvin"
Cc: Andi Kleen
Cc: Masami Hiramatsu
Acked-by: Yasuaki Ishimatsu
Cc: Fabian Frederick
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Prarit Bhargava
2014-12-11 09:41:10 +0800

17 Nov, 2014

1 commit

960fb622f net: provide a per host RSS key generic infrastructure ... Browse Code »

RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
RSS key.

Some drivers use a constant (and well known key), some drivers use a random
key per port, making bonding setups hard to tune. Well known keys increase
attack surface, considering that number of queues is usually a power of two.

This patch provides infrastructure to help drivers doing the right thing.

netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
even if they provide ethtool -X support to let user redefine the key later.

A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
RSS key even for drivers not providing ethtool -x support, in case some
applications want to precisely setup flows to match some RX queues.

Tested:

myhost:~# cat /proc/sys/net/core/netdev_rss_key
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72

myhost:~# ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
0: 0 1 2 3 4 5 6 7
RSS hash key:
11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-11-17 04:59:11 +0800

12 Nov, 2014

1 commit

ba7a46f16 net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited ... Browse Code »

Use the more common dynamic_debug capable net_dbg_ratelimited
and remove the LIMIT_NETDEBUG macro.

All messages are still ratelimited.

Some KERN_ uses are changed to KERN_DEBUG.

This may have some negative impact on messages that were
emitted at KERN_INFO that are not not enabled at all unless
DEBUG is defined or dynamic_debug is enabled. Even so,
these messages are now _not_ emitted by default.

This also eliminates the use of the net_msg_warn sysctl
"/proc/sys/net/core/warnings". For backward compatibility,
the sysctl is not removed, but it has no function. The extern
declaration of net_msg_warn is removed from sock.h and made
static in net/core/sysctl_net_core.c

Miscellanea:

o Update the sysctl documentation
o Remove the embedded uses of pr_fmt
o Coalesce format fragments
o Realign arguments

Signed-off-by: Joe Perches
Signed-off-by: David S. Miller

Joe Perches
2014-11-12 03:10:31 +0800

14 Oct, 2014

1 commit

b03023ecb coredump: add %i/%I in core_pattern to report the tid of the crashed thread ... Browse Code »

format_corename() can only pass the leader's pid to the core handler,
but there is no simple way to figure out which thread originated the
coredump.

As Jan explains, this also means that there is no simple way to create
the backtrace of the crashed process:

As programs are mostly compiled with implicit gcc -fomit-frame-pointer
one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
segment) or .debug_frame section. .debug_frame usually is present only
in separate debug info files usually not even installed on the system.
While .eh_frame is a part of the executable/library (and it is even
always mapped for C++ exceptions unwinding) it no longer has to be
present anywhere on the disk as the program could be upgraded in the
meantime and the running instance has its executable file already
unlinked from disk.

One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
the file-backed memory including the executable's .eh_frame section.
But that can create huge core files, for example even due to mmapped
data files.

Other possibility would be to read .eh_frame from /proc/PID/mem at the
core_pattern handler time of the core dump. For the backtrace one needs
to read the register state first which can be done from core_pattern
handler:

ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
close(0); // close pipe fd to resume the sleeping dumper
waitpid(); // should report EXIT
PTRACE_GETREGS or other requests

The remaining problem is how to get the 'tid' value of the crashed
thread. It could be read from the first NT_PRSTATUS note of the core
file but that makes the core_pattern handler complicated.

Unfortunately %t is already used so this patch uses %i/%I.

Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
is experimenting with this. It is using the elfutils
(https://fedorahosted.org/elfutils/) unwinder for generating the
backtraces. Apart from not needing matching executables as mentioned
above, another advantage is that we can get the backtrace without saving
the core (which might be quite large) to disk.

[mmilata@redhat.com: final paragraph of changelog]
Signed-off-by: Jan Kratochvil
Signed-off-by: Oleg Nesterov
Cc: Alexander Viro
Cc: Denys Vlasenko
Cc: Jan Kratochvil
Cc: Mark Wielaard
Cc: Martin Milata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2014-10-14 08:18:21 +0800

02 Sep, 2014

1 commit

a5325ae5b tipc: add name distributor resiliency queue ... Browse Code »

TIPC name table updates are distributed asynchronously in a cluster,
entailing a risk of certain race conditions. E.g., if two nodes
simultaneously issue conflicting (overlapping) publications, this may
not be detected until both publications have reached a third node, in
which case one of the publications will be silently dropped on that
node. Hence, we end up with an inconsistent name table.

In most cases this conflict is just a temporary race, e.g., one
node is issuing a publication under the assumption that a previous,
conflicting, publication has already been withdrawn by the other node.
However, because of the (rtt related) distributed update delay, this
may not yet hold true on all nodes. The symptom of this failure is a
syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".

In this commit we add a resiliency queue at the receiving end of
the name table distributor. When insertion of an arriving publication
fails, we retain it in this queue for a short amount of time, assuming
that another update will arrive very soon and clear the conflict. If so
happens, we insert the publication, otherwise we drop it.

The (configurable) retention value defaults to 2000 ms. Knowing from
experience that the situation described above is extremely rare, there
is no risk that the queue will accumulate any large number of items.

Signed-off-by: Erik Hugne
Signed-off-by: Jon Maloy
Acked-by: Ying Xue
Signed-off-by: David S. Miller

Erik Hugne
2014-09-02 08:51:48 +0800

09 Aug, 2014

1 commit

69361eef9 panic: add TAINT_SOFTLOCKUP ... Browse Code »

This taint flag will be set if the system has ever entered a softlockup
state. Similar to TAINT_WARN it is useful to know whether or not the
system has been in a softlockup state when debugging.

[akpm@linux-foundation.org: apply the taint before calling panic()]
Signed-off-by: Josh Hunt
Cc: Jason Baron
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Josh Hunt
2014-08-09 06:57:24 +0800

24 Jun, 2014

2 commits

ed235875e kernel/watchdog.c: print traces for all cpus on lockup detection ... Browse Code »

A 'softlockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than a predefined period to time, without giving
other tasks a chance to run.

Currently, upon detection of this condition by the per-cpu watchdog
task, debug information (including a stack trace) is sent to the system
log.

On some occasions, we have observed that the "victim" rather than the
actual "culprit" (i.e. the owner/holder of the contended resource) is
reported to the user. Often this information has proven to be
insufficient to assist debugging efforts.

To avoid loss of useful debug information, for architectures which
support NMI, this patch makes it possible to improve soft lockup
reporting. This is accomplished by issuing an NMI to each cpu to obtain
a stack trace.

If NMI is not supported we just revert back to the old method. A sysctl
and boot-time parameter is available to toggle this feature.

[dzickus@redhat.com: add CONFIG_SMP in certain areas]
[akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
[mq@suse.cz: fix warning]
Signed-off-by: Aaron Tomlin
Signed-off-by: Don Zickus
Cc: David S. Miller
Cc: Mateusz Guzik
Cc: Oleg Nesterov
Signed-off-by: Jan Moskyto Matejka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aaron Tomlin
2014-06-24 07:47:44 +0800
7cd2b0a34 mm, pcp: allow restoring percpu_pagelist_fraction default ... Browse Code »

Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:

divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6

However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.

This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.

This successfully solves the divide by zero issue at the same time.

Signed-off-by: David Rientjes
Reported-by: Oleg Drokin
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2014-06-24 07:47:43 +0800

07 Jun, 2014

1 commit

f4aacea2f sysctl: allow for strict write position handling ... Browse Code »

When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:

open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
write(1, "/bin/true", 9) = 9
close(1) = 0

$ cat /proc/sys/kernel/modprobe
/bin/true

Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.

The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.

This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2014-06-07 07:08:13 +0800

05 Jun, 2014

2 commits

4a0da71b9 Documentation/sysctl/vm.txt: clarify vfs_cache_pressure description ... Browse Code »

Existing description is worded in a way which almost encourages setting of
vfs_cache_pressure above 100, possibly way above it.

Users are left in a dark what this numeric value is - an int? a
percentage? what the scale is?

As a result, we are getting reports about noticeable performance
degradation from users who have set vfs_cache_pressure to ridiculously
high values - because they thought there is no downside to it.

Via code inspection it's obvious that this value is treated as a
percentage. This patch changes text to reflect this fact, and adds a
cautionary paragraph advising against setting vfs_cache_pressure sky high.

Signed-off-by: Denys Vlasenko
Cc: Alexander Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Denys Vlasenko
2014-06-05 07:54:13 +0800
4f9b16a64 mm: disable zone_reclaim_mode by default ... Browse Code »

When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common but few of the workloads are
NUMA-aware and it's routine to see major performance degradation due to
zone_reclaim_mode being enabled but relatively few can identify the
problem.

Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately so lets have a
sensible default for the bulk of users.

This patch (of 2):

zone_reclaim_mode causes processes to prefer reclaiming memory from
local node instead of spilling over to other nodes. This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes. The NUMA penalties were
sufficiently high to justify reclaiming the memory. On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this. Favour the
common case and disable it by default. Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.

Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Reviewed-by: Zhang Yanfei
Acked-by: Michal Hocko
Reviewed-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2014-06-05 07:53:59 +0800

08 Apr, 2014

1 commit

80df28476 hung_task: check the value of "sysctl_hung_task_timeout_sec" ... Browse Code »

As sysctl_hung_task_timeout_sec is unsigned long, when this value is
larger then LONG_MAX/HZ, the function schedule_timeout_interruptible in
watchdog will return immediately without sleep and with print :

schedule_timeout: wrong timeout value ffffffffffffff83

and then the funtion watchdog will call schedule_timeout_interruptible
again and again. The screen will be filled with

"schedule_timeout: wrong timeout value ffffffffffffff83"

This patch does some check and correction in sysctl, to let the function
schedule_timeout_interruptible allways get the valid parameter.

Signed-off-by: Liu Hua
Tested-by: Satoru Takeuchi
Cc: [3.4+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Liu Hua
2014-04-08 07:36:07 +0800

07 Apr, 2014

1 commit

6f4c98e1c Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux ... Browse Code »

Pull module updates from Rusty Russell:
"Nothing major: the stricter permissions checking for sysfs broke a
staging driver; fix included. Greg KH said he'd take the patch but
hadn't as the merge window opened, so it's included here to avoid
breaking build"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
staging: fix up speakup kobject mode
Use 'E' instead of 'X' for unsigned module taint flag.
VERIFY_OCTAL_PERMISSIONS: stricter checking for sysfs perms.
kallsyms: fix percpu vars on x86-64 with relocation.
kallsyms: generalize address range checking
module: LLVMLinux: Remove unused function warning from __param_check macro
Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE
module: remove MODULE_GENERIC_TABLE
module: allow multiple calls to MODULE_DEVICE_TABLE() per module
module: use pr_cont

Linus Torvalds
2014-04-07 00:38:07 +0800

04 Apr, 2014

1 commit

5509a5d27 drop_caches: add some documentation and info message ... Browse Code »

There is plenty of anecdotal evidence and a load of blog posts
suggesting that using "drop_caches" periodically keeps your system
running in "tip top shape". Perhaps adding some kernel documentation
will increase the amount of accurate data on its use.

If we are not shrinking caches effectively, then we have real bugs.
Using drop_caches will simply mask the bugs and make them harder to
find, but certainly does not fix them, nor is it an appropriate
"workaround" to limit the size of the caches. On the contrary, there
have been bug reports on issues that turned out to be misguided use of
cache dropping.

Dropping caches is a very drastic and disruptive operation that is good
for debugging and running tests, but if it creates bug reports from
production use, kernel developers should be aware of its use.

Add a bit more documentation about it, a syslog message to track down
abusers, and vmstat drop counters to help analyze problem reports.

[akpm@linux-foundation.org: checkpatch fixes]
[hannes@cmpxchg.org: add runtime suppression control]
Signed-off-by: Dave Hansen
Signed-off-by: Michal Hocko
Acked-by: KOSAKI Motohiro
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Hansen
2014-04-04 07:21:04 +0800

01 Apr, 2014

1 commit

971eae7c9 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler changes from Ingo Molnar:
"Bigger changes:

- sched/idle restructuring: they are WIP preparation for deeper
integration between the scheduler and idle state selection, by
Nicolas Pitre.

- add NUMA scheduling pseudo-interleaving, by Rik van Riel.

- optimize cgroup context switches, by Peter Zijlstra.

- RT scheduling enhancements, by Thomas Gleixner.

The rest is smaller changes, non-urgnt fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
sched: Clean up the task_hot() function
sched: Remove double calculation in fix_small_imbalance()
sched: Fix broken setscheduler()
sparc64, sched: Remove unused sparc64_multi_core
sched: Remove unused mc_capable() and smt_capable()
sched/numa: Move task_numa_free() to __put_task_struct()
sched/fair: Fix endless loop in idle_balance()
sched/core: Fix endless loop in pick_next_task()
sched/fair: Push down check for high priority class task into idle_balance()
sched/rt: Fix picking RT and DL tasks from empty queue
trace: Replace hardcoding of 19 with MAX_NICE
sched: Guarantee task priority in pick_next_task()
sched/idle: Remove stale old file
sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
cpuidle/arm64: Remove redundant cpuidle_idle_call()
cpuidle/powernv: Remove redundant cpuidle_idle_call()
sched, nohz: Exclude isolated cores from load balancing
sched: Fix select_task_rq_fair() description comments
workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
...

Linus Torvalds
2014-04-01 02:21:19 +0800

13 Mar, 2014

1 commit

66cc69e34 Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE ... Browse Code »

Users have reported being unable to trace non-signed modules loaded
within a kernel supporting module signature.

This is caused by tracepoint.c:tracepoint_module_coming() refusing to
take into account tracepoints sitting within force-loaded modules
(TAINT_FORCED_MODULE). The reason for this check, in the first place, is
that a force-loaded module may have a struct module incompatible with
the layout expected by the kernel, and can thus cause a kernel crash
upon forced load of that module on a kernel with CONFIG_TRACEPOINTS=y.

Tracepoints, however, specifically accept TAINT_OOT_MODULE and
TAINT_CRAP, since those modules do not lead to the "very likely system
crash" issue cited above for force-loaded modules.

With kernels having CONFIG_MODULE_SIG=y (signed modules), a non-signed
module is tainted re-using the TAINT_FORCED_MODULE taint flag.
Unfortunately, this means that Tracepoints treat that module as a
force-loaded module, and thus silently refuse to consider any tracepoint
within this module.

Since an unsigned module does not fit within the "very likely system
crash" category of tainting, add a new TAINT_UNSIGNED_MODULE taint flag
to specifically address this taint behavior, and accept those modules
within Tracepoints. We use the letter 'X' as a taint flag character for
a module being loaded that doesn't know how to sign its name (proposed
by Steven Rostedt).

Also add the missing 'O' entry to trace event show_module_flags() list
for the sake of completeness.

Signed-off-by: Mathieu Desnoyers
Acked-by: Steven Rostedt
NAKed-by: Ingo Molnar
CC: Thomas Gleixner
CC: David Howells
CC: Greg Kroah-Hartman
Signed-off-by: Rusty Russell

Mathieu Desnoyers
2014-03-13 09:41:51 +0800

27 Feb, 2014

1 commit

d2ae2e525 Merge branch 'core/urgent' into core/locking ... Browse Code »

It's not really a regression fix, so move it to the v3.15 queue.

Signed-off-by: Ingo Molnar

Ingo Molnar
2014-02-27 20:06:55 +0800

02 Feb, 2014

1 commit

eaa4e4fcf Merge branch 'linus' into sched/core, to resolve conflicts ... Browse Code »

Conflicts:
kernel/sysctl.c

Signed-off-by: Ingo Molnar

Ingo Molnar
2014-02-02 16:45:39 +0800

01 Feb, 2014

1 commit

ab5318788 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull core debug changes from Ingo Molnar:
"This contains mostly kernel debugging related updates:

- make hung_task detection more configurable to distros
- add final bits for x86 UV NMI debugging, with related KGDB changes
- update the mailing-list of MAINTAINERS entries I'm involved with"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hung_task: Display every hung task warning
sysctl: Add neg_one as a standard constraint
x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
x86/uv/nmi: Fix Sparse warnings
kgdb/kdb: Fix no KDB config problem
MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries

Linus Torvalds
2014-02-01 00:59:46 +0800

31 Jan, 2014

1 commit

70e0ac5f3 hung_task/Documentation: Fix hung_task_warnings description ... Browse Code »

Improve the documentation on hung_task_warnings.

Signed-off-by: Aaron Tomlin
Link: http://lkml.kernel.org/n/tip-xepjnxzummfDlg9lvhh7Rlzc@git.kernel.org
Signed-off-by: Ingo Molnar

Aaron Tomlin
2014-01-31 16:24:03 +0800

30 Jan, 2014

1 commit

8582cb96b mm: document improved handling of swappiness==0 ... Browse Code »

Prior to commit fe35004fbf9e ("mm: avoid swapping out with
swappiness==0") setting swappiness to 0, reclaim code could still evict
recently used user anonymous memory to swap even though there is a
significant amount of RAM used for page cache.

The behaviour of setting swappiness to 0 has since changed. When set,
the reclaim code does not initiate swap until the amount of free pages
and file-backed pages, is less than the high water mark in a zone.

Let's update the documentation to reflect this.

[akpm@linux-foundation.org: remove comma, per Randy]
Signed-off-by: Aaron Tomlin
Acked-by: Rik van Riel
Acked-by: Bryn M. Reeves
Cc: Satoru Moriya
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Aaron Tomlin
2014-01-30 08:22:39 +0800

28 Jan, 2014

1 commit

52bf84aa2 sched/numa, mm: Remove p->numa_migrate_deferred ... Browse Code »

Excessive migration of pages can hurt the performance of workloads
that span multiple NUMA nodes. However, it turns out that the
p->numa_migrate_deferred knob is a really big hammer, which does
reduce migration rates, but does not actually help performance.

Now that the second stage of the automatic numa balancing code
has stabilized, it is time to replace the simplistic migration
deferral code with something smarter.

Signed-off-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Peter Zijlstra
Cc: Chegu Vinod
Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar

Rik van Riel
2014-01-28 20:17:04 +0800

25 Jan, 2014

1 commit

270750dbc hung_task: Display every hung task warning ... Browse Code »

When khungtaskd detects hung tasks, it prints out
backtraces from a number of those tasks.

Limiting the number of backtraces being printed
out can result in the user not seeing the information
necessary to debug the issue. The hung_task_warnings
sysctl controls this feature.

This patch makes it possible for hung_task_warnings
to accept a special value to print an unlimited
number of backtraces when khungtaskd detects hung
tasks.

The special value is -1. To use this value it is
necessary to change types from ulong to int.

Signed-off-by: Aaron Tomlin
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
[ Build warning fix. ]
Signed-off-by: Ingo Molnar

Aaron Tomlin
2014-01-25 19:13:33 +0800

24 Jan, 2014

1 commit

7984754b9 kexec: add sysctl to disable kexec_load ... Browse Code »

For general-purpose (i.e. distro) kernel builds it makes sense to build
with CONFIG_KEXEC to allow end users to choose what kind of things they
want to do with kexec. However, in the face of trying to lock down a
system with such a kernel, there needs to be a way to disable kexec_load
(much like module loading can be disabled). Without this, it is too easy
for the root user to modify kernel memory even when CONFIG_STRICT_DEVMEM
and modules_disabled are set. With this change, it is still possible to
load an image for use later, then disable kexec_load so the image (or lack
of image) can't be altered.

The intention is for using this in environments where "perfect"
enforcement is hard. Without a verified boot, along with verified
modules, and along with verified kexec, this is trying to give a system a
better chance to defend itself (or at least grow the window of
discoverability) against attack in the face of a privilege escalation.

In my mind, I consider several boot scenarios:

1) Verified boot of read-only verified root fs loading fd-based
verification of kexec images.
2) Secure boot of writable root fs loading signed kexec images.
3) Regular boot loading kexec (e.g. kcrash) image early and locking it.
4) Regular boot with no control of kexec image at all.

1 and 2 don't exist yet, but will soon once the verified kexec series has
landed. 4 is the state of things now. The gap between 2 and 4 is too
large, so this change creates scenario 3, a middle-ground above 4 when 2
and 1 are not possible for a system.

Signed-off-by: Kees Cook
Acked-by: Rik van Riel
Cc: Vivek Goyal
Cc: Eric Biederman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2014-01-24 08:37:03 +0800

22 Jan, 2014

1 commit

49f0ce5f9 mm: add overcommit_kbytes sysctl variable ... Browse Code »

Some applications that run on HPC clusters are designed around the
availability of RAM and the overcommit ratio is fine tuned to get the
maximum usage of memory without swapping. With growing memory, the
1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
for these workload (on a 2TB machine it represents no less than 20GB).

This patch adds the new overcommit_kbytes sysctl variable that allow a
much finer grain.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Jerome Marchand
Cc: Dave Hansen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jerome Marchand
2014-01-22 08:19:44 +0800

17 Dec, 2013

1 commit

1bd53a7ef sched/numa: Drop sysctl_numa_balancing_settle_count sysctl ... Browse Code »

commit 887c290e (sched/numa: Decide whether to favour task or group weights
based on swap candidate relationships) drop the check against
sysctl_numa_balancing_settle_count, this patch remove the sysctl.

Signed-off-by: Wanpeng Li
Acked-by: Mel Gorman
Reviewed-by: Rik van Riel
Acked-by: David Rientjes
Signed-off-by: Peter Zijlstra
Cc: Andrew Morton
Cc: Naoya Horiguchi
Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar

Wanpeng Li
2013-12-17 22:24:38 +0800

13 Nov, 2013

2 commits

312b4e226 vsprintf: check real user/group id for %pK ... Browse Code »

Some setuid binaries will allow reading of files which have read
permission by the real user id. This is problematic with files which
use %pK because the file access permission is checked at open() time,
but the kptr_restrict setting is checked at read() time. If a setuid
binary opens a %pK file as an unprivileged user, and then elevates
permissions before reading the file, then kernel pointer values may be
leaked.

This happens for example with the setuid pppd application on Ubuntu 12.04:

$ head -1 /proc/kallsyms
00000000 T startup_32

$ pppd file /proc/kallsyms
pppd: In file /proc/kallsyms: unrecognized option 'c1000000'

This will only leak the pointer value from the first line, but other
setuid binaries may leak more information.

Fix this by adding a check that in addition to the current process having
CAP_SYSLOG, that effective user and group ids are equal to the real ids.
If a setuid binary reads the contents of a file which uses %pK then the
pointer values will be printed as NULL if the real user is unprivileged.

Update the sysctl documentation to reflect the changes, and also correct
the documentation to state the kptr_restrict=0 is the default.

This is a only temporary solution to the issue. The correct solution is
to do the permission check at open() time on files, and to replace %pK
with a function which checks the open() time permission. %pK uses in
printk should be removed since no sane permission check can be done, and
instead protected by using dmesg_restrict.

Signed-off-by: Ryan Mallon
Cc: Kees Cook
Cc: Alexander Viro
Cc: Joe Perches
Cc: "Eric W. Biederman"
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ryan Mallon
2013-11-13 11:09:14 +0800
715ea41e6 mm: improve the description for dirty_background_ratio/dirty_ratio sysctl ... Browse Code »

Now dirty_background_ratio/dirty_ratio contains a percentage of total
avaiable memory, which contains free pages and reclaimable pages. The
number of these pages is not equal to the number of total system memory.
But they are described as a percentage of total system memory in
Documentation/sysctl/vm.txt. So we need to fix them to avoid
misunderstanding.

Signed-off-by: Zheng Liu
Cc: Rob Landley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Zheng Liu
2013-11-13 11:09:11 +0800