Eric Lee / smarc-fsl-linux-kernel

06 Mar, 2010

1 commit

660f6a360 Merge branch 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/ker… ... Browse Code »

…nel/git/tip/linux-2.6-tip

* 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Issue at least one memory barrier in stop_machine_text_poke()
perf probe: Correct probe syntax on command line help
perf probe: Add lazy line matching support
perf probe: Show more lines after last line
perf probe: Check function address range strictly in line finder
perf probe: Use libdw callback routines
perf probe: Use elfutils-libdw for analyzing debuginfo
perf probe: Rename probe finder functions
perf probe: Fix bugs in line range finder
perf probe: Update perf probe document
perf probe: Do not show --line option without dwarf support
kprobes: Add documents of jump optimization
kprobes/x86: Support kprobes jump optimization on x86
x86: Add text_poke_smp for SMP cross modifying code
kprobes/x86: Cleanup save/restore registers
kprobes/x86: Boost probes when reentering
kprobes: Jump optimization sysctl interface
kprobes: Introduce kprobes jump optimization
kprobes: Introduce generic insn_slot framework
kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE

Linus Torvalds
2010-03-06 02:50:22 +0800

01 Mar, 2010

1 commit

4b1776473 sparc: Support show_unhandled_signals. ... Browse Code »

Just faults right now, will add other traps later.

Signed-off-by: David S. Miller

David S. Miller
2010-03-01 16:02:23 +0800

26 Feb, 2010

1 commit

b2be84df9 kprobes: Jump optimization sysctl interface ... Browse Code »

Add /proc/sys/debug/kprobes-optimization sysctl which enables
and disables kprobes jump optimization on the fly for debugging.

Changes in v7:
- Remove ctl_name = CTL_UNNUMBERED for upstream compatibility.

Changes in v6:
- Update comments and coding style.

Signed-off-by: Masami Hiramatsu
Cc: systemtap
Cc: DLE
Cc: Ananth N Mavinakayanahalli
Cc: Jim Keniston
Cc: Srikar Dronamraju
Cc: Christoph Hellwig
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Anders Kaseorg
Cc: Tim Abbott
Cc: Andi Kleen
Cc: Jason Baron
Cc: Mathieu Desnoyers
Cc: Frederic Weisbecker
Cc: Ananth N Mavinakayanahalli
LKML-Reference:
Signed-off-by: Ingo Molnar

Masami Hiramatsu
2010-02-26 00:49:25 +0800

18 Dec, 2009

2 commits

efc8e7f4c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorri… ... Browse Code »

…s/security-testing-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support
NOMMU: Optimise away the {dac_,}mmap_min_addr tests
security/min_addr.c: make init_mmap_min_addr() static
keys: PTR_ERR return of wrong pointer in keyctl_get_security()

Linus Torvalds
2009-12-18 08:58:26 +0800
3e26120cc kernel/sysctl.c: fix the incomplete part of sysctl_max_map_count-should-be-non-negative.patch ... Browse Code »

It is a mistake that we used 'proc_dointvec', it should be
'proc_dointvec_minmax', as in the original patch.

Signed-off-by: WANG Cong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

WANG Cong
2009-12-18 07:45:30 +0800

17 Dec, 2009

1 commit

6e1415467 NOMMU: Optimise away the {dac_,}mmap_min_addr tests ... Browse Code »

In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be
skipped by the compiler. We do this as the minimum mmap address doesn't make
any sense in NOMMU mode.

mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode.

Signed-off-by: David Howells
Acked-by: Eric Paris
Signed-off-by: James Morris

David Howells
2009-12-17 06:25:19 +0800

16 Dec, 2009

2 commits

70da2340f 'sysctl_max_map_count' should be non-negative ... Browse Code »

Jan Engelhardt reported we have this problem:

setting max_map_count to a value large enough results in programs dying at
first try. This is on 2.6.31.6:

15:59 borg:/proc/sys/vm # echo $[1<max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
1073741824
15:59 borg:/proc/sys/vm # echo $[1<max_map_count
15:59 borg:/proc/sys/vm # cat max_map_count
Killed

This is because we have a chance to make 'max_map_count' negative. but
it's meaningless. Make it only accept non-negative values.

Reported-by: Jan Engelhardt
Signed-off-by: WANG Cong
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: James Morris
Cc: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Amerigo Wang
2009-12-16 00:53:23 +0800
06808b082 hugetlb: derive huge pages nodes allowed from task mempolicy ... Browse Code »

This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

sysctl vm.nr_hugepages[_mempolicy]=32

yields:

Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8

Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:

numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8

The 8 huge pages freed were balanced over nodes 0 and 1.

[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes
Signed-off-by: Lee Schermerhorn
Acked-by: Mel Gorman
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800

13 Dec, 2009

1 commit

702a7c760 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kerne… ... Browse Code »

…l/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (21 commits)
sched: Remove forced2_migrations stats
sched: Fix memory leak in two error corner cases
sched: Fix build warning in get_update_sysctl_factor()
sched: Update normalized values on user updates via proc
sched: Make tunable scaling style configurable
sched: Fix missing sched tunable recalculation on cpu add/remove
sched: Fix task priority bug
sched: cgroup: Implement different treatment for idle shares
sched: Remove unnecessary RCU exclusion
sched: Discard some old bits
sched: Clean up check_preempt_wakeup()
sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call
sched: Sanitize fork() handling
sched: Clean up ttwu() rq locking
sched: Remove rq->clock coupling from set_task_cpu()
sched: Consolidate select_task_rq() callers
sched: Remove sysctl.sched_features
sched: Protect sched_rr_get_param() access to task->sched_class
sched: Protect task->cpus_allowed access in sched_getaffinity()
sched: Fix balance vs hotplug race
...

Fixed up conflicts in kernel/sysctl.c (due to sysctl cleanup)

Linus Torvalds
2009-12-13 03:34:10 +0800

09 Dec, 2009

3 commits

acb4a848d sched: Update normalized values on user updates via proc ... Browse Code »

The normalized values are also recalculated in case the scaling factor
changes.

This patch updates the internally used scheduler tuning values that are
normalized to one cpu in case a user sets new values via sysfs.

Together with patch 2 of this series this allows to let user configured
values scale (or not) to cpu add/remove events taking place later.

Signed-off-by: Christian Ehrhardt
Signed-off-by: Peter Zijlstra
LKML-Reference:
[ v2: fix warning ]
Signed-off-by: Ingo Molnar

Christian Ehrhardt
2009-12-09 17:04:02 +0800
1983a922a sched: Make tunable scaling style configurable ... Browse Code »

As scaling now takes place on all kind of cpu add/remove events a user
that configures values via proc should be able to configure if his set
values are still rescaled or kept whatever happens.

As the comments state that log2 was just a second guess that worked the
interface is not just designed for on/off, but to choose a scaling type.
Currently this allows none, log and linear, but more important it allwos
us to keep the interface even if someone has an even better idea how to
scale the values.

Signed-off-by: Christian Ehrhardt
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Christian Ehrhardt
2009-12-09 17:04:01 +0800
6b314d0e1 sched: Remove sysctl.sched_features ... Browse Code »

Since we've had a much saner debugfs interface to this, remove the
sysctl one.

Signed-off-by: Peter Zijlstra
LKML-Reference:
[ v2: build fix ]
Signed-off-by: Ingo Molnar

Peter Zijlstra
2009-12-09 17:03:01 +0800

08 Dec, 2009

1 commit

1557d3300 Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 ... Browse Code »

* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...

Linus Torvalds
2009-12-08 23:38:50 +0800

06 Dec, 2009

1 commit

d0b093a8b Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kerne… ... Browse Code »

…l/git/tip/linux-2.6-tip

* 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ratelimit: Make suppressed output messages more useful
printk: Remove ratelimit.h from kernel.h
ratelimit: Fix/allow use in atomic contexts
ratelimit: Use per ratelimit context locking

Linus Torvalds
2009-12-06 01:50:22 +0800

19 Nov, 2009

1 commit

6d4561110 sysctl: Drop & in front of every proc_handler. ... Browse Code »

For consistency drop & in front of every proc_handler. Explicity
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.

Cc: Alexey Dobriyan
Cc: Ingo Molnar
Cc: Joe Perches
Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-19 00:37:40 +0800

12 Nov, 2009

1 commit

4739a9748 sysctl: Remove the last of the generic binary sysctl support ... Browse Code »

Now that all of the users stopped using ctl_name and strategy it
is safe to remove the fields from struct ctl_table, and it is safe
to remove the stub strategy routines as well.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-12 18:05:06 +0800

11 Nov, 2009

3 commits

2315ffa0a sysctl: Don't look at ctl_name and strategy in the generic code ... Browse Code »

The ctl_name and strategy fields are unused, now that sys_sysctl
is a compatibility wrapper around /proc/sys. No longer looking
at them in the generic code is effectively what we are doing
now and provides the guarantee that during further cleanups
we can just remove references to those fields and everything
will work ok.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-11 16:53:43 +0800
6fce56ec9 sysctl: Remove references to ctl_name and strategy from the generic sysctl table ... Browse Code »

Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name
and .strategy members of sysctl tables are dead code. Remove them.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-11 16:42:56 +0800
a965cf946 sysctl: Neuter the generic sysctl strategy routines. ... Browse Code »

Now that sys_sysctl is a compatibility layer on top of /proc/sys
these routines are never called but are still put in sysctl
tables so I have reduced them to stubs until they can be
removed entirely.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-11 16:42:51 +0800

06 Nov, 2009

1 commit

afa588b26 sysctl: Separate the binary sysctl logic into it's own file. ... Browse Code »

In preparation for more invasive cleanups separate the core
binary sysctl logic into it's own file.

Signed-off-by: Eric W. Biederman

Eric W. Biederman
2009-11-06 19:20:07 +0800

24 Sep, 2009

4 commits

db1682636 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 ... Browse Code »

* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...

Linus Torvalds
2009-09-24 22:53:22 +0800
8d65af789 sysctl: remove "struct file *" argument of ->proc_handler ... Browse Code »

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan
Cc: David Howells
Cc: "Eric W. Biederman"
Cc: Al Viro
Cc: Ralf Baechle
Cc: Martin Schwidefsky
Cc: Ingo Molnar
Cc: "David S. Miller"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-24 22:21:04 +0800
a293980c2 exec: let do_coredump() limit the number of concurrent dumps to pipes ... Browse Code »

Introduce core pipe limiting sysctl.

Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications.

If the pipe reader specified in core_pattern is poorly written, we can
have lots of ourstandig resources and processes in the system.

This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel,
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes unlimited core dumps may
be handled (this is the default value).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman
Reported-by: Earl Chew
Cc: Oleg Nesterov
Cc: Andi Kleen
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Neil Horman
2009-09-24 22:21:00 +0800
2bcd57ab6 headers: utsname.h redux ... Browse Code »

* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it

NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2009-09-24 09:13:10 +0800

23 Sep, 2009

1 commit

af91322ef printk: add printk_delay to make messages readable for some scenarios ... Browse Code »

When syslog is not possible, at the same time there's no serial/net
console available, it will be hard to read the printk messages. For
example oops/panic/warning messages in shutdown phase.

Add a printk delay feature, we can make each printk message delay some
milliseconds.

Setting the delay by proc/sysctl interface: /proc/sys/kernel/printk_delay

The value range from 0 - 10000, default value is 0

[akpm@linux-foundation.org: fix a few things]
Signed-off-by: Dave Young
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Young
2009-09-23 22:39:28 +0800

22 Sep, 2009

1 commit

3fff4c42b printk: Remove ratelimit.h from kernel.h ... Browse Code »

Decouple kernel.h from ratelimit.h: the global declaration of
printk's ratelimit_state is not needed, and it leads to messy
circular dependencies due to ratelimit.h's (new) adding of a
spinlock_types.h include.

Cc: Peter Zijlstra
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: David S. Miller
LKML-Reference:
Signed-off-by: Ingo Molnar

Ingo Molnar
2009-09-22 22:18:09 +0800

21 Sep, 2009

1 commit

cdd6c482c perf: Do the big rename: Performance Counters -> Performance Events ... Browse Code »

Bye-bye Performance Counters, welcome Performance Events!

In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.

Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.

All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)

The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.

Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.

User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)

This patch has been generated via the following script:

FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES

for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done

FILES=$(find . -name perf_event.*)

sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES

... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.

Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.

( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )

Suggested-by: Stephane Eranian
Acked-by: Peter Zijlstra
Acked-by: Paul Mackerras
Reviewed-by: Arjan van de Ven
Cc: Mike Galbraith
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Cc: Steven Rostedt
Cc: Benjamin Herrenschmidt
Cc: David Howells
Cc: Kyle McMartin
Cc: Martin Schwidefsky
Cc: "David S. Miller"
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc:
LKML-Reference:
Signed-off-by: Ingo Molnar

Ingo Molnar
2009-09-21 20:28:04 +0800

16 Sep, 2009

2 commits

6a46079cf HWPOISON: The high level memory error handler in the VM v7 ... Browse Code »

Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.

This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.

Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.

Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways

Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.

Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen
Acked-by: Rik van Riel
Reviewed-by: Hidehiro Kawai

Andi Kleen
2009-09-16 17:50:15 +0800
cb684b5bc block: fix linkage problem with blk_iopoll and !CONFIG_BLOCK ... Browse Code »

kernel/built-in.o:(.data+0x17b0): undefined reference to `blk_iopoll_enabled'

Since the extern declaration makes the compile work, but the actual
symbol is missing when block/blk-iopoll.o isn't linked in.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-16 03:53:11 +0800

15 Sep, 2009

1 commit

355bbd8cb Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block ... Browse Code »

* 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
block: use blkdev_issue_discard in blk_ioctl_discard
Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
block: don't assume device has a request list backing in nr_requests store
block: Optimal I/O limit wrapper
cfq: choose a new next_req when a request is dispatched
Seperate read and write statistics of in_flight requests
aoe: end barrier bios with EOPNOTSUPP
block: trace bio queueing trial only when it occurs
block: enable rq CPU completion affinity by default
cfq: fix the log message after dispatched a request
block: use printk_once
cciss: memory leak in cciss_init_one()
splice: update mtime and atime on files
block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
cfq-iosched: get rid of must_alloc flag
block: use interrupts disabled version of raise_softirq_irqoff()
block: fix comment in blk-iopoll.c
block: adjust default budget for blk-iopoll
block: fix long lines in block/blk-iopoll.c
block: add blk-iopoll, a NAPI like approach for block devices
...

Linus Torvalds
2009-09-15 08:55:15 +0800

12 Sep, 2009

1 commit

774a694f8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel… ... Browse Code »

…/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (64 commits)
sched: Fix sched::sched_stat_wait tracepoint field
sched: Disable NEW_FAIR_SLEEPERS for now
sched: Keep kthreads at default priority
sched: Re-tune the scheduler latency defaults to decrease worst-case latencies
sched: Turn off child_runs_first
sched: Ensure that a child can't gain time over it's parent after fork()
sched: enable SD_WAKE_IDLE
sched: Deal with low-load in wake_affine()
sched: Remove short cut from select_task_rq_fair()
sched: Turn on SD_BALANCE_NEWIDLE
sched: Clean up topology.h
sched: Fix dynamic power-balancing crash
sched: Remove reciprocal for cpu_power
sched: Try to deal with low capacity, fix update_sd_power_savings_stats()
sched: Try to deal with low capacity
sched: Scale down cpu_power due to RT tasks
sched: Implement dynamic cpu_power
sched: Add smt_gain
sched: Update the cpu_power sum during load-balance
sched: Add SD_PREFER_SIBLING
...

Linus Torvalds
2009-09-12 04:23:18 +0800

11 Sep, 2009

1 commit

5e605b64a block: add blk-iopoll, a NAPI like approach for block devices ... Browse Code »

This borrows some code from NAPI and implements a polled completion
mode for block devices. The idea is the same as NAPI - instead of
doing the command completion when the irq occurs, schedule a dedicated
softirq in the hopes that we will complete more IO when the iopoll
handler is invoked. Devices have a budget of commands assigned, and will
stay in polled mode as long as they continue to consume their budget
from the iopoll softirq handler. If they do not, the device is set back
to interrupt completion mode.

This patch holds the core bits for blk-iopoll, device driver support
sold separately.

Signed-off-by: Jens Axboe

Jens Axboe
2009-09-11 20:33:31 +0800

09 Sep, 2009

1 commit

2bba22c50 sched: Turn off child_runs_first ... Browse Code »

Set child_runs_first default to off.

It hurts 'optimal' make -j workloads as make jobs
get preempted by child tasks, reducing parallelism.

Note, this patch might make existing races in user
applications more prominent than before - so breakages
might be bisected to this commit.

Child-runs-first is broken on SMP to begin with, and we
already had it off briefly in v2.6.23 so most of the
offenders ought to be fixed. Would be nice not to revert
this commit but fix those apps finally ...

Signed-off-by: Mike Galbraith
Acked-by: Peter Zijlstra
LKML-Reference:
[ made the sysctl independent of CONFIG_SCHED_DEBUG, in case
people want to work around broken apps. ]
Signed-off-by: Ingo Molnar

Mike Galbraith
2009-09-09 23:30:05 +0800

07 Sep, 2009

1 commit

b6d9c2563 Security/SELinux: includecheck fix kernel/sysctl.c ... Browse Code »

fix the following 'make includecheck' warning:

kernel/sysctl.c: linux/security.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput
Signed-off-by: James Morris

Jaswinder Singh Rajput
2009-09-07 09:41:03 +0800

04 Sep, 2009

1 commit

e9e9250bc sched: Scale down cpu_power due to RT tasks ... Browse Code »

Keep an average on the amount of time spend on RT tasks and use
that fraction to scale down the cpu_power for regular tasks.

Signed-off-by: Peter Zijlstra
Tested-by: Andreas Herrmann
Acked-by: Andreas Herrmann
Acked-by: Gautham R Shenoy
Cc: Balbir Singh
LKML-Reference:
Signed-off-by: Ingo Molnar

Peter Zijlstra
2009-09-04 16:09:55 +0800

17 Aug, 2009

1 commit

788084aba Security/SELinux: seperate lsm specific mmap_min_addr ... Browse Code »

Currently SELinux enforcement of controls on the ability to map low memory
is determined by the mmap_min_addr tunable. This patch causes SELinux to
ignore the tunable and instead use a seperate Kconfig option specific to how
much space the LSM should protect.

The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
permissions will always protect the amount of low memory designated by
CONFIG_LSM_MMAP_MIN_ADDR.

This allows users who need to disable the mmap_min_addr controls (usual reason
being they run WINE as a non-root user) to do so and still have SELinux
controls preventing confined domains (like a web server) from being able to
map some area of low memory.

Signed-off-by: Eric Paris
Signed-off-by: James Morris

Eric Paris
2009-08-17 13:09:11 +0800

29 Jun, 2009

1 commit

8326e284f Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/… ... Browse Code »

…git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, delay: tsc based udelay should have rdtsc_barrier
x86, setup: correct include file in <asm/boot.h>
x86, setup: Fix typo "CONFIG_x86_64" in <asm/boot.h>
x86, mce: percpu mcheck_timer should be pinned
x86: Add sysctl to allow panic on IOCK NMI error
x86: Fix uv bau sending buffer initialization
x86, mce: Fix mce resume on 32bit
x86: Move init_gbpages() to setup_arch()
x86: ensure percpu lpage doesn't consume too much vmalloc space
x86: implement percpu_alloc kernel parameter
x86: fix pageattr handling for lpage percpu allocator and re-enable it
x86: reorganize cpa_process_alias()
x86: prepare setup_pcpu_lpage() for pageattr fix
x86: rename remap percpu first chunk allocator to lpage
x86: fix duplicate free in setup_pcpu_remap() failure path
percpu: fix too lazy vunmap cache flushing
x86: Set cpu_llc_id on AMD CPUs

Linus Torvalds
2009-06-29 02:05:28 +0800

26 Jun, 2009

1 commit

5211a242d x86: Add sysctl to allow panic on IOCK NMI error ... Browse Code »

This patch introduces a new sysctl:

/proc/sys/kernel/panic_on_io_nmi

which defaults to 0 (off).

When enabled, the kernel panics when the kernel receives an NMI
caused by an IO error.

The IO error triggered NMI indicates a serious system
condition, which could result in IO data corruption. Rather
than contiuing, panicing and dumping might be a better choice,
so one can figure out what's causing the IO error.

This could be especially important to companies running IO
intensive applications where corruption must be avoided, e.g. a
bank's databases.

[ SuSE has been shipping it for a while, it was done at the
request of a large database vendor, for their users. ]

Signed-off-by: Kurt Garloff
Signed-off-by: Roberto Angelino
Signed-off-by: Greg Kroah-Hartman
Cc: "Eric W. Biederman"
LKML-Reference:
Signed-off-by: Ingo Molnar

Kurt Garloff
2009-06-26 04:06:11 +0800

23 Jun, 2009

1 commit

bfdb4d9f0 timers: Fix timer_migration interface which accepts any number as input ... Browse Code »

Poornima Nayek reported:

| Timer migration interface /proc/sys/kernel/timer_migration in
| 2.6.30-git9 accepts any numerical value as input.
|
| Steps to reproduce:
| 1. echo -6666666 > /proc/sys/kernel/timer_migration
| 2. cat /proc/sys/kernel/timer_migration
| -6666666
|
| 1. echo 44444444444444444444444444444444444444444444444444444444444 > /proc/sys/kernel/timer_migration
| 2. cat /proc/sys/kernel/timer_migration
| -1357789412
|
| Expected behavior: Should 'echo: write error: Invalid argument' while
| setting any value other then 0 & 1

Restrict valid values to 0 and 1.

Reported-by: Poornima Nayak
Tested-by: Poornima Nayak
Signed-off-by: Arun R Bharadwaj
Cc: poornima nayak
Cc: Arun Bharadwaj
LKML-Reference:
Signed-off-by: Ingo Molnar

Arun R Bharadwaj
2009-06-23 16:49:33 +0800

19 Jun, 2009

1 commit

7338f2998 sysctl.c: remove unused variable ... Browse Code »

Remoce the unused variable 'val' from __do_proc_dointvec()

The integer has been declared and used as 'val = -val' and there is no
reference to it anywhere.

Signed-off-by: Sukanto Ghosh
Cc: Jaswinder Singh Rajput
Cc: Sukanto Ghosh
Cc: Jiri Kosina
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sukanto Ghosh
2009-06-19 04:03:54 +0800