22 Jan, 2011

1 commit


21 Jan, 2011

6 commits

  • All architectures are finally converted. Remove the cruft.

    Signed-off-by: Thomas Gleixner
    Cc: Richard Henderson
    Cc: Mike Frysinger
    Cc: David Howells
    Cc: Tony Luck
    Cc: Greg Ungerer
    Cc: Michal Simek
    Acked-by: David Howells
    Cc: Kyle McMartin
    Acked-by: Benjamin Herrenschmidt
    Cc: Chen Liqin
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Jeff Dike

    Thomas Gleixner
     
  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled
    lockdep: Move early boot local IRQ enable/disable status to init/main.c

    Linus Torvalds
     
  • * akpm:
    kernel/smp.c: consolidate writes in smp_call_function_interrupt()
    kernel/smp.c: fix smp_call_function_many() SMP race
    memcg: correctly order reading PCG_USED and pc->mem_cgroup
    backlight: fix 88pm860x_bl macro collision
    drivers/leds/ledtrig-gpio.c: make output match input, tighten input checking
    MAINTAINERS: update Atmel AT91 entry
    mm: fix truncate_setsize() comment
    memcg: fix rmdir, force_empty with THP
    memcg: fix LRU accounting with THP
    memcg: fix USED bit handling at uncharge in THP
    memcg: modify accounting function for supporting THP better
    fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio
    mm: compaction: prevent division-by-zero during user-requested compaction
    mm/vmscan.c: remove duplicate include of compaction.h
    memblock: fix memblock_is_region_memory()
    thp: keep highpte mapped until it is no longer needed
    kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT

    Linus Torvalds
     
  • We have to test the cpu mask in the interrupt handler before checking the
    refs, otherwise we can start to follow an entry before it is deleted and
    find it partially initialized for the next trip. Presently we also clear
    the cpumask bit before executing the called function, which implies
    getting write access to the cache line. After the function is called we
    then decrement refs, and if they go to zero we then unlock the structure.

    However, this implies getting write access to the call function data both
    before and after the function is called. If we can assert that no function
    executed via smp_call_function is allowed to enable interrupts, then we
    can move both writes to after the function is called, hopefully allowing
    both writes with one cache line bounce.
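
    As a rough sketch of the reordering described above (not the literal
    upstream diff), the interrupt handler stops doing a test-and-clear up
    front and instead defers both writes until after the callback has run:

    	/* before: write (clear) the cpumask bit, call the function, then
    	 * write refs -- two separate exclusive acquisitions of the line */
    	if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
    		continue;
    	data->csd.func(data->csd.info);
    	refs = atomic_dec_return(&data->refs);

    	/* after: read-only test, call the function, then do both writes
    	 * back to back so the line ideally bounces here only once */
    	if (!cpumask_test_cpu(cpu, data->cpumask))
    		continue;
    	data->csd.func(data->csd.info);
    	cpumask_clear_cpu(cpu, data->cpumask);
    	refs = atomic_dec_return(&data->refs);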

    On a 256 thread system with a kernel compiled for 1024 threads, the time
    to execute the testcase from the "smp_call_function_many race" changelog
    was reduced by about 30-40 ms out of about 545 ms.

    I decided to keep this as a WARN because hitting it means the function is
    buggy, even though the stack trace is of no value -- a simple printk would
    give us the information needed.

    Raw data:

    Without patch:
    ipi_test startup took 1219366ns complete 539819014ns total 541038380ns
    ipi_test startup took 1695754ns complete 543439872ns total 545135626ns
    ipi_test startup took 7513568ns complete 539606362ns total 547119930ns
    ipi_test startup took 13304064ns complete 533898562ns total 547202626ns
    ipi_test startup took 8668192ns complete 544264074ns total 552932266ns
    ipi_test startup took 4977626ns complete 548862684ns total 553840310ns
    ipi_test startup took 2144486ns complete 541292318ns total 543436804ns
    ipi_test startup took 21245824ns complete 530280180ns total 551526004ns

    With patch:
    ipi_test startup took 5961748ns complete 500859628ns total 506821376ns
    ipi_test startup took 8975996ns complete 495098924ns total 504074920ns
    ipi_test startup took 19797750ns complete 492204740ns total 512002490ns
    ipi_test startup took 14824796ns complete 487495878ns total 502320674ns
    ipi_test startup took 11514882ns complete 494439372ns total 505954254ns
    ipi_test startup took 8288084ns complete 502570774ns total 510858858ns
    ipi_test startup took 6789954ns complete 493388112ns total 500178066ns

    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/workqueue.h>
    #include <linux/smp.h>
    #include <linux/sched.h> /* sched clock */

    #define ITERATIONS 100

    static void do_nothing_ipi(void *dummy)
    {
    }

    static void do_ipis(struct work_struct *dummy)
    {
    	int i;

    	for (i = 0; i < ITERATIONS; i++)
    		smp_call_function(do_nothing_ipi, NULL, 1);

    	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
    }

    static struct work_struct work[NR_CPUS];

    static int __init testcase_init(void)
    {
    	int cpu;
    	u64 start, started, done;

    	start = local_clock();
    	for_each_online_cpu(cpu) {
    		INIT_WORK(&work[cpu], do_ipis);
    		schedule_work_on(cpu, &work[cpu]);
    	}
    	started = local_clock();
    	for_each_online_cpu(cpu)
    		flush_work(&work[cpu]);
    	done = local_clock();
    	pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n",
    		started - start, done - started, done - start);

    	return 0;
    }

    static void __exit testcase_exit(void)
    {
    }

    module_init(testcase_init)
    module_exit(testcase_exit)
    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Anton Blanchard");

    Signed-off-by: Milton Miller
    Cc: Anton Blanchard
    Cc: Ingo Molnar
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Milton Miller
     
  • I noticed a failure where we hit the following WARN_ON in
    generic_smp_call_function_interrupt:

    	if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
    		continue;

    	data->csd.func(data->csd.info);

    	refs = atomic_dec_return(&data->refs);
    	WARN_ON(refs < 0);

    The race, roughly (the owning cpu re-initializing the per-cpu data while
    the interrupt handler on another cpu is still walking the queue):

    owning cpu                       interrupt handler on another cpu
                                     sees and clears bit in cpumask
                                     might be using old or new fn!
                                     decrements refs below 0
    set data->refs (too late!)

    The important thing to note is that, since the interrupt handler walks a
    potentially stale call_function.queue without any locking, another cpu can
    view the percpu *data structure at any time, even when the owner is in the
    process of initialising it.

    The following test case hits the WARN_ON 100% of the time on my PowerPC
    box (having 128 threads does help :)

    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/workqueue.h>
    #include <linux/smp.h>

    #define ITERATIONS 100

    static void do_nothing_ipi(void *dummy)
    {
    }

    static void do_ipis(struct work_struct *dummy)
    {
    	int i;

    	for (i = 0; i < ITERATIONS; i++)
    		smp_call_function(do_nothing_ipi, NULL, 1);

    	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
    }

    static struct work_struct work[NR_CPUS];

    static int __init testcase_init(void)
    {
    	int cpu;

    	for_each_online_cpu(cpu) {
    		INIT_WORK(&work[cpu], do_ipis);
    		schedule_work_on(cpu, &work[cpu]);
    	}

    	return 0;
    }

    static void __exit testcase_exit(void)
    {
    }

    module_init(testcase_init)
    module_exit(testcase_exit)
    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Anton Blanchard");

    I tried to fix it by ordering the read and the write of ->cpumask and
    ->refs. In doing so I missed a critical case, but Paul McKenney was able
    to spot my bug thankfully :) To ensure we aren't viewing previous
    iterations, the interrupt handler needs to read ->refs then ->cpumask then
    ->refs _again_.

    Thanks to Milton Miller and Paul McKenney for helping to debug this issue.

    [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ]
    [miltonm@bga.com: remove excess tests]
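
    A minimal sketch of the resulting check order in
    generic_smp_call_function_interrupt() (taking the bracketed adjustments
    above into account; illustrative rather than the verbatim hunk): test the
    cpumask first, then, after a read barrier, read ->refs, so entries that
    are still being set up or have already been recycled are skipped:

    	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
    		/* the entry may belong to a previous iteration or be
    		 * in the middle of being re-initialized by its owner */
    		if (!cpumask_test_cpu(cpu, data->cpumask))
    			continue;

    		smp_rmb();	/* order cpumask read before refs read */

    		if (atomic_read(&data->refs) == 0)
    			continue;

    		/* both observed set: safe to run data->csd.func() here */
    	}
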
    Signed-off-by: Anton Blanchard
    Signed-off-by: Milton Miller
    Cc: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched, cgroup: Use exit hook to avoid use-after-free crash
    sched: Fix signed unsigned comparison in check_preempt_tick()
    sched: Replace rq->bkl_count with rq->rq_sched_info.bkl_count
    sched, autogroup: Fix CONFIG_RT_GROUP_SCHED sched_setscheduler() failure
    sched: Display autogroup names in /proc/sched_debug
    sched: Reinstate group names in /proc/sched_debug
    sched: Update effective_load() to use global share weights

    Linus Torvalds
     

20 Jan, 2011

2 commits

  • percpu may end up calling vfree() during early boot which in
    turn may call on_each_cpu() for TLB flushes. The work of
    on_each_cpu() can be done safely with IRQs disabled during
    early boot, but it assumed that it is always called with local
    IRQs enabled, which ended up enabling local IRQs prematurely
    during boot and triggering a couple of warnings.

    This patch updates on_each_cpu() and smp_call_function_many()
    such that on_each_cpu() can be used safely while
    early_boot_irqs_disabled is set.
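
    A minimal sketch of the idea, assuming the usual shape of on_each_cpu()
    (not necessarily the exact upstream diff): use local_irq_save()/
    local_irq_restore() around the local invocation so that, if IRQs were
    already disabled during early boot, they stay disabled:

    int on_each_cpu(void (*func)(void *info), void *info, int wait)
    {
    	unsigned long flags;
    	int ret;

    	preempt_disable();
    	ret = smp_call_function(func, info, wait);
    	/* save/restore instead of disable/enable: don't turn IRQs on
    	 * behind the back of early boot code that still has them off */
    	local_irq_save(flags);
    	func(info);
    	local_irq_restore(flags);
    	preempt_enable();
    	return ret;
    }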

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Acked-by: Pekka Enberg
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar
    Reported-by: Ingo Molnar

    Tejun Heo
     
  • During early boot, local IRQs are disabled until the IRQ subsystem
    is properly initialized. During this time, no one should enable
    local IRQs, and some operations which usually are not allowed with
    IRQs disabled, e.g. operations which might sleep or require
    communications with other processors, are allowed.

    lockdep tracked this with early_boot_irqs_off/on() callbacks.
    As other subsystems need this information too, move it to
    init/main.c and make it generally available. While at it,
    invert the sense of the boolean to early_boot_irqs_disabled
    instead of enabled, so that it can be initialized to false and
    true indicates the exceptional condition.
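
    Roughly, the result is a plain global flag defined in init/main.c that
    any subsystem can test (a sketch, not the verbatim patch):

    /* init/main.c */
    bool early_boot_irqs_disabled __read_mostly;

    asmlinkage void __init start_kernel(void)
    {
    	/* ... */
    	early_boot_irqs_disabled = true;	/* IRQ subsystem not ready */
    	/* ... interrupts, timers, etc. are initialized ... */
    	early_boot_irqs_disabled = false;
    	local_irq_enable();
    	/* ... */
    }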

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra
    Acked-by: Pekka Enberg
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

19 Jan, 2011

4 commits

  • By not notifying the controller of the on-exit move back to
    init_css_set, we fail to move the task out of the previous
    cgroup's cfs_rq. This leads to an opportunity for a
    cgroup-destroy to come in and free the cgroup (there are no
    active tasks left in it after all) to which the not-quite dead
    task is still enqueued.
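
    A hedged sketch of such an exit hook (the exact callback signature varies
    by kernel version; treat the details as assumptions rather than the
    literal patch): the cpu cgroup subsystem gains an .exit callback that
    requeues the exiting task on its new (root) group before the old cgroup
    can be destroyed:

    static void cpu_cgroup_exit(struct cgroup_subsys *ss,
    			    struct task_struct *task)
    {
    	/* move the not-quite-dead task off the dying group's cfs_rq */
    	sched_move_task(task);
    }

    struct cgroup_subsys cpu_cgroup_subsys = {
    	.name	= "cpu",
    	/* ... existing callbacks ... */
    	.exit	= cpu_cgroup_exit,
    };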

    Reported-by: Miklos Vajna
    Fixed-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Mike Galbraith
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Peter Zijlstra
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: Validate cpu early in perf_event_alloc()
    perf: Find_get_context: fix the per-cpu-counter check
    perf: Fix contexted inheritance

    Linus Torvalds
     
  • Starting from perf_event_alloc()->perf_init_event(), the kernel
    assumes that event->cpu is either -1 or the valid CPU number.

    Change perf_event_alloc() to validate this argument early. This
    also means we can remove the similar check in
    find_get_context().
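
    A minimal sketch of that kind of early validation (illustrative only, not
    the exact upstream hunk), near the top of perf_event_alloc():

    	/* event->cpu must be either -1 or a valid, possible CPU number */
    	if (cpu != -1 && (unsigned int)cpu >= nr_cpu_ids)
    		return ERR_PTR(-EINVAL);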

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra
    Cc: Alan Stern
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Prasad
    Cc: Roland McGrath
    Cc: gregkh@suse.de
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • If task == NULL, find_get_context() should always check that cpu
    is correct.

    Afaics, the bug was introduced by 38a81da2 "perf events: Clean
    up pid passing", but even before that commit "&& cpu != -1" was
    not exactly right; -ESRCH from find_task_by_vpid() is not
    accurate.

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra
    Cc: Alan Stern
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Prasad
    Cc: Roland McGrath
    Cc: gregkh@suse.de
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

18 Jan, 2011

7 commits

  • Linus reported that the RCU lockdep annotation bits triggered for this
    rcu_dereference() because we're not holding rcu_read_lock().

    Going over the code I cannot convince myself it's correct:

    - holding a ref on the parent_ctx doesn't prevent it from being uncloned
    concurrently (as the comment says), so we can race with a free.

    - holding parent_ctx->mutex doesn't prevent the above free from taking
    place either; it would at best prevent parent_ctx from being freed.

    I.e. the warning is correct. To fix the bug, serialize against the
    unclone_ctx() call by extending the reach of the parent_ctx->lock.

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Paul E. McKenney
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • A signed/unsigned comparison may lead to a superfluous resched if the
    leftmost task is to the right of the current task, wasting a few cycles,
    and inadvertently _lengthening_ the current task's slice.
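
    Sketched, the problem and the guard look like this in check_preempt_tick()
    (an illustration of the pattern, not the verbatim hunk): delta is signed,
    ideal_runtime is unsigned, so a negative delta gets promoted to a huge
    unsigned value and wrongly triggers the resched unless it is filtered out
    first:

    	/* se: leftmost entity in the rbtree; ideal_runtime: unsigned slice */
    	s64 delta = curr->vruntime - se->vruntime;

    	/* leftmost entity is to the right of current: nothing to do */
    	if (delta < 0)
    		return;

    	/* the comparison against the unsigned ideal_runtime is now safe */
    	if (delta > ideal_runtime)
    		resched_task(rq_of(cfs_rq)->curr);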

    Reported-by: Venkatesh Pallipadi
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Now that rq->rq_sched_info.bkl_count is not used for rq, fold
    rq->bkl_count into it. This saves some space in rq.

    Signed-off-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yong Zhang
     
  • If CONFIG_RT_GROUP_SCHED is set, __sched_setscheduler() fails due to autogroup
    not allocating rt_runtime. Free unused/unusable rt_se and rt_rq, redirect RT
    tasks to the root task group, and tell __sched_setscheduler() that it's ok.

    Reported-and-tested-by: Bharata B Rao
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Add autogroup name to cfs_rq and tasks information to /proc/sched_debug.

    Signed-off-by: Bharata B Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Bharata B Rao
     
  • Displaying of group names in /proc/sched_debug was dropped in autogroup
    patches. Add group names while displaying cfs_rq and tasks information.

    Signed-off-by: Bharata B Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Bharata B Rao
     
  • Previously effective_load would approximate the global load weight present on
    a group taking advantage of:

    entity_weight = tg->shares * (lw / global_lw), where entity_weight was
    provided by tg_shares_up.

    This worked (approximately) for an 'empty' (at tg level) cpu since we would
    place boost load representative of what a newly woken task would receive.

    However, now that load is instantaneously updated this assumption is no longer
    true and the load calculation is rather incorrect in this case.

    Fix this (and improve the general case) by re-writing effective_load to take
    advantage of the new shares distribution code.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     

16 Jan, 2011

1 commit

  • …linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    rcu: avoid pointless blocked-task warnings
    rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status
    rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi()

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, olpc: Add missing Kconfig dependencies
    x86, mrst: Set correct APB timer IRQ affinity for secondary cpu
    x86: tsc: Fix calibration refinement conditionals to avoid divide by zero
    x86, ia64, acpi: Clean up x86-ism in drivers/acpi/numa.c

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    timekeeping: Make local variables static
    time: Rename misnamed minsec argument of clocks_calc_mult_shift()

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: Remove syscall_exit_fields
    tracing: Only process module tracepoints once
    perf record: Add "nodelay" mode, disabled by default
    perf sched: Fix list of events, dropping unsupported ':r' modifier
    Revert "perf tools: Emit clearer message for sys_perf_event_open ENOENT return"
    perf top: Fix annotate segv
    perf evsel: Fix order of event list deletion

    Linus Torvalds
     

15 Jan, 2011

3 commits

  • There is no need for syscall_exit_fields as the syscall
    exit event class can already host the fields in its structure,
    like most other trace events do by default. Use that
    default behavior instead.

    Following this scheme, we no longer need to override the
    get_fields() callback of the syscall exit event class either.

    Hence both syscall_exit_fields and syscall_get_exit_fields() can
    be removed.

    Also changed some indentation to keep the following under 80
    characters:

    ".fields = LIST_HEAD_INIT(event_class_syscall_exit.fields),"

    Acked-by: Frederic Weisbecker
    Signed-off-by: Lai Jiangshan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Lai Jiangshan
     
  • …t/npiggin/linux-npiggin

    * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
    kernel: fix hlist_bl again
    cgroups: Fix a lockdep warning at cgroup removal
    fs: namei fix ->put_link on wrong inode in do_filp_open

    Linus Torvalds
     
  • cgroup can't use simple_lookup(), since that'd override its desired ->d_op.

    Tested-by: Li Zefan
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

14 Jan, 2011

16 commits

  • …/linux-2.6-rcu into core/urgent

    Ingo Molnar
     
  • If the RCU callback-processing kthread has nothing to do, it parks in
    a wait_event(). If RCU remains idle for more than two minutes, the
    kernel complains about this. This commit changes from wait_event()
    to wait_event_interruptible() to prevent the kernel from complaining
    just because RCU is idle.
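
    Sketched (with illustrative names for the wait queue and condition,
    assuming a TINY_RCU-style kthread), the change amounts to parking in an
    interruptible sleep, which the blocked-task check ignores:

    	/* before: counted as blocked and reported after two minutes */
    	wait_event(rcu_kthread_wq, have_rcu_kthread_work != 0);

    	/* after: interruptible sleep, not flagged when RCU is simply idle */
    	wait_event_interruptible(rcu_kthread_wq, have_rcu_kthread_work != 0);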

    Reported-by: Russell King
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Thomas Weber
    Tested-by: Russell King

    Paul E. McKenney
     
  • Because the adaptive synchronize_srcu_expedited() approach has
    worked very well in testing, remove the kernel parameter and
    replace it by a C-preprocessor macro. If someone finds problems
    with this approach, a more complex and aggressively adaptive
    approach might be required.

    Longer term, SRCU will be merged with the other RCU implementations,
    at which point synchronize_srcu_expedited() will be event driven,
    just as synchronize_sched_expedited() currently is. At that point,
    there will be no need for this adaptive approach.

    Reported-by: Linus Torvalds
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry
    lock, which caused a lockdep warning.

    Reported-by: Valdis Kletnieks
    Signed-off-by: Li Zefan

    Li Zefan
     
  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
    ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
    ACPI: fix resource check message
    ACPI / Battery: Update information on info notification and resume
    ACPI: Drop device flag wake_capable
    ACPI: Always check if _PRW is present before trying to evaluate it
    ACPI / PM: Check status of power resources under mutexes
    ACPI / PM: Rename acpi_power_off_device()
    ACPI / PM: Drop acpi_power_nocheck
    ACPI / PM: Drop acpi_bus_get_power()
    Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
    ACPI / Fan: Rework the handling of power resources
    ACPI / PM: Register power resource devices as soon as they are needed
    ACPI / PM: Register acpi_power_driver early
    ACPI / PM: Add function for updating device power state consistently
    ACPI / PM: Add function for device power state initialization
    ACPI / PM: Introduce __acpi_bus_get_power()
    ACPI / PM: Introduce function for refcounting device power resources
    ACPI / PM: Add functions for manipulating lists of power resources
    ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
    ACPICA: Update version to 20101209
    ...

    Linus Torvalds
     
  • Add khugepaged to relocate fragmented pages into hugepages if new
    hugepages become available. (this is independent of the defrag logic that
    will have to make new hugepages available)

    The fundamental reason why khugepaged is unavoidable is that some memory
    can be fragmented and not everything can be relocated. So when a virtual
    machine quits and releases gigabytes of hugepages, we want to use those
    freely available hugepages to create huge-pmd in the other virtual
    machines that may be running on fragmented memory, to maximize the CPU
    efficiency at all times. The scan is slow; it takes nearly zero cpu time,
    except when it copies data (in which case we definitely want to pay for
    that cpu time), so it seems a good tradeoff.

    In addition to the hugepages released by other processes freeing memory,
    we have the strong suspicion that the performance impact of potentially
    defragmenting hugepages during or before each page fault could lead to
    more performance inconsistency than allocating small pages at first and
    having them collapsed into large pages later... if they prove themselves
    to be long-lived mappings (the khugepaged scan is slow, so short-lived
    mappings have a low probability of running into khugepaged compared to
    long-lived mappings).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This increases the size of the mm struct a bit, but it is needed to
    preallocate one pte for each hugepage so that split_huge_page will not
    require a fail path. Guarantee of success is a fundamental property of
    split_huge_page, to avoid decreasing swapping reliability and to avoid
    adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
    VM code to learn rolling back in the middle of its pte mangling operations
    (if anything, we need it to learn to handle pmd_trans_huge natively rather
    than to be capable of rollback). When split_huge_page runs, a pte is
    needed for the split to succeed, to map the newly split regular pages
    with regular ptes. This way all existing VM code remains backwards
    compatible by just adding a split_huge_page* one-liner. The memory waste
    of those preallocated ptes is negligible and so it is worth it.
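
    Concretely, the preallocation amounts to one extra per-mm pointer holding
    the reserved pte page (a sketch; field placement and comments are
    approximate):

    struct mm_struct {
    	/* ... */
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
    	/* pte page preallocated for split_huge_page,
    	 * protected by page_table_lock */
    	pgtable_t pmd_huge_pte;
    #endif
    	/* ... */
    };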

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Futex code is smarter than most other gup_fast O_DIRECT code and knows
    about the compound internals. However now doing a put_page(head_page)
    will not release the pin on the tail page taken by gup-fast, leading to
    all sorts of refcounting bugchecks. Getting a stable head_page is a
    little tricky.

    page_head = page is there because if this is not a tail page it's also the
    page_head. Only in case this is a tail page is compound_head called;
    otherwise it's guaranteed unnecessary. And if it's a tail page,
    compound_head has to run atomically inside the irq-disabled section around
    __get_user_pages_fast, before irqs are re-enabled. Otherwise ->first_page
    won't be a stable pointer.

    Disabling irqs before __get_user_pages_fast and re-enabling them after
    running compound_head is needed because if __get_user_pages_fast returns
    == 1, it means the huge pmd is established and cannot go away from under
    us. pmdp_splitting_flush_notify in __split_huge_page_splitting will have
    to wait for local_irq_enable before the IPI delivery can return. This
    means __split_huge_page_refcount can't be running from under us, and in
    turn when we run compound_head(page) we're not reading a dangling pointer
    from tailpage->first_page. Then after we get to a stable head page, we are
    always safe to call compound_lock, and after taking the compound lock on
    the head page we can finally re-check whether the page returned by
    gup-fast is still a tail page, in which case we're set and we didn't need
    to split the hugepage in order to take a futex on it.
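
    The sequence described above, as a sketch (not the verbatim get_futex_key()
    hunk):

    	local_irq_disable();
    	if (__get_user_pages_fast(address, 1, 1, &page) == 1) {
    		/* a tail page's ->first_page is stable here: the splitting
    		 * IPI cannot complete while local irqs are disabled */
    		page_head = compound_head(page);
    		local_irq_enable();

    		compound_lock(page_head);
    		if (PageTail(page)) {
    			/* gup-fast pinned an intact tail page: take the
    			 * futex on it without splitting the hugepage */
    		}
    		/* ... */
    		compound_unlock(page_head);
    	} else {
    		local_irq_enable();
    		/* fall back to the slow path */
    	}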

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We'd like to be able to oom_score_adj a process up/down as it
    enters/leaves the foreground. Currently, it is not possible to oom_adj
    down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
    oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
    or its inherited value at fork. Assuming the thread that has forked it
    has oom_score_adj of 0, each process could decrease it back from 0 upon
    activation unless a CAP_SYS_RESOURCE thread elevated it to something
    higher.

    Alternatives considered:

    * a setuid binary
    * a daemon with CAP_SYS_RESOURCE

    Since you don't want all processes to be able to reduce their oom_adj, a
    setuid or daemon implementation would be complex. The alternatives also
    have much higher overhead.

    This patch was updated from the original patch based on feedback from
    David Rientjes.

    Signed-off-by: Mandeep Singh Baines
    Acked-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • This warning was added in commit bdff746a3915 ("clone: prepare to recycle
    CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know,
    no-one is actually using CLONE_STOPPED.

    Signed-off-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Use the modern per_cpu API to increment {soft|hard}irq counters, and use
    per_cpu allocation for (struct irq_desc)->kstat_irqs instead of an array.

    This gives better SMP/NUMA locality and saves a few instructions per irq.

    With small nr_cpu_ids values (8 for example), kstat_irqs was a small array
    (less than L1_CACHE_BYTES), a potential source of false sharing.

    In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache-unfriendly
    kstat_irqs_all[NR_IRQS][NR_CPUS] array.

    Note: we still populate kstat_irqs for all possible irqs in
    early_irq_init(). We probably could use on-demand allocations (code
    included in alloc_descs()). The problem is that not all IRQs are used
    with a prior alloc_descs() call.

    kstat_irqs_this_cpu() is not used anymore, remove it.
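
    In rough outline (a sketch of the direction, not the full diff), the
    per-IRQ statistics become a per-cpu allocation hanging off irq_desc and
    are bumped through the per_cpu API:

    struct irq_desc {
    	/* ... */
    	unsigned int __percpu	*kstat_irqs;	/* one counter per cpu */
    	/* ... */
    };

    /* on the interrupt path, increment only the local cpu's counter */
    __this_cpu_inc(*desc->kstat_irqs);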

    Signed-off-by: Eric Dumazet
    Reviewed-by: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Andi Kleen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
    fs: add documentation on fallocate hole punching
    Gfs2: fail if we try to use hole punch
    Btrfs: fail if we try to use hole punch
    Ext4: fail if we try to use hole punch
    Ocfs2: handle hole punching via fallocate properly
    XFS: handle hole punching via fallocate properly
    fs: add hole punching to fallocate
    vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
    fix signedness mess in rw_verify_area() on 64bit architectures
    fs: fix kernel-doc for dcache::prepend_path
    fs: fix kernel-doc for dcache::d_validate
    sanitize ecryptfs ->mount()
    switch afs
    move internal-only parts of ncpfs headers to fs/ncpfs
    switch ncpfs
    switch 9p
    pass default dentry_operations to mount_pseudo()
    switch hostfs
    switch affs
    switch configfs
    ...

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     
  • MONOTONIC_RAW clock timestamps are ideally suited for frequency
    calculation and also fit well into the original NTP hardpps design. Now
    phase and frequency can be adjusted separately: the former based on
    REALTIME clock and the latter based on MONOTONIC_RAW clock.

    A new function getnstime_raw_and_real is added to the timekeeping
    subsystem to capture both timestamps at the same time and atomically.

    Signed-off-by: Alexander Gordeev
    Acked-by: John Stultz
    Cc: Rodolfo Giometti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Gordeev
     
  • This commit adds a hardpps() implementation based upon the original one
    from the NTPv4 reference kernel code from David Mills. However, it is
    highly optimized towards very fast synchronization and maximum stickiness
    to the PPS signal. The typical error is less than a microsecond.

    To make it sync faster I had to throw away the exponential phase filter
    so that the full phase offset is corrected immediately. Then I also had
    to throw away the median phase filter because it gives a bigger error
    itself if used without the exponential filter.

    Maybe we will find an appropriate filtering scheme in the future but it's
    not necessary if the signal quality is ok.

    Signed-off-by: Alexander Gordeev
    Acked-by: John Stultz
    Cc: Rodolfo Giometti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Gordeev