Eric Lee / linux-smarc-t335x-v3.2

18 Oct, 2011

1 commit

bcd5cff72 cputimer: Cure lock inversion

There's a lock inversion between the cputimer->lock and rq->lock;
notably the two callchains involved are:

  update_rlimit_cpu()
    sighand->siglock
    set_process_cpu_timer()
      cpu_timer_sample_group()
        thread_group_cputimer()
          cputimer->lock
          thread_group_cputime()
            task_sched_runtime()
              ->pi_lock
              rq->lock

  scheduler_tick()
    rq->lock
    task_tick_fair()
      update_curr()
        account_group_exec()
          cputimer->lock

The first one enables a CLOCK_PROCESS_CPUTIME_ID timer, and the second
one keeps it up to date.

This problem was introduced by e8abccb7193 ("posix-cpu-timers: Cure
SMP accounting oddities").

Cure the problem by removing the cputimer->lock and rq->lock nesting.
This leaves concurrent enablers doing duplicate work, but the time
wasted should be on the same order as the time otherwise wasted
spinning on the lock, and the greater-than assignment filter should
ensure we preserve monotonicity.

Reported-by: Dave Jones
Reported-by: Simon Kirby
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org
Cc: Linus Torvalds
Cc: Martin Schwidefsky
Link: http://lkml.kernel.org/r/1318928713.21167.4.camel@twins
Signed-off-by: Thomas Gleixner

Peter Zijlstra
2011-10-18 17:36:59 +0800

17 Oct, 2011

1 commit

a84a79e4d Avoid using variable-length arrays in kernel/sys.c

The size is always valid, but variable-length arrays generate worse code
for no good reason (unless the function happens to be inlined and the
compiler sees the length for the simple constant it is).
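
As a rough illustration of the pattern (not the actual kernel/sys.c change), the sketch below formats a string through a scratch buffer whose size is a compile-time constant rather than `char buf[len]`. The function name, the RELEASE_BUF_LEN bound and the output format are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical upper bound, chosen only for this illustration. */
#define RELEASE_BUF_LEN 65

/*
 * Write "<prefix>.<minor>" into out, truncating to len bytes.
 * The scratch buffer has a fixed size, so the compiler does not have to
 * emit dynamic stack adjustment code as it would for a VLA.
 */
void format_release(char *out, size_t len, const char *prefix,
		    unsigned int minor)
{
	char buf[RELEASE_BUF_LEN];
	size_t copy;

	assert(out != NULL && prefix != NULL && len > 0);

	snprintf(buf, sizeof(buf), "%s.%u", prefix, minor);
	copy = strlen(buf) + 1;
	if (copy > len)
		copy = len;
	memcpy(out, buf, copy);
	out[copy - 1] = '\0';	/* always terminate, even when truncated */
}
```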

Also, there seems to be some code generation problem on POWER, where
Henrik Bakken reports that register r28 can get corrupted under some
subtle circumstances (interrupt happening at the wrong time?). That all
indicates some seriously broken compiler issues, but since variable
length arrays are bad regardless, there's little point in trying to
chase it down.

"Just don't do that, then".

Reported-by: Henrik Grindal Bakken
Cc: Benjamin Herrenschmidt
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-10-17 23:24:24 +0800

01 Oct, 2011

1 commit

f72a209a3 Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip

* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
irq: Fix check for already initialized irq_domain in irq_domain_add
irq: Add declaration of irq_domain_simple_ops to irqdomain.h

* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
x86/rtc: Don't recursively acquire rtc_lock

* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
posix-cpu-timers: Cure SMP wobbles
sched: Fix up wchan borkage
sched/rt: Migrate equal priority tasks to available CPUs

Linus Torvalds
2011-10-01 23:37:25 +0800

30 Sep, 2011

2 commits

d670ec131 posix-cpu-timers: Cure SMP wobbles

David reported:

Attached below is a watered-down version of rt/tst-cpuclock2.c from
GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
similar.

Run it several times, and you will see cases where the main thread
will measure a process clock difference before and after the nanosleep
which is smaller than the cpu-burner thread's individual thread clock
difference. This doesn't make any sense since the cpu-burner thread
is part of the top-level process's thread group.

I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
64-bit binaries).

For example:

[davem@boricha build-x86_64-linux]$ ./test
process: before(0.001221967) after(0.498624371) diff(497402404)
thread: before(0.000081692) after(0.498316431) diff(498234739)
self: before(0.001223521) after(0.001240219) diff(16698)
[davem@boricha build-x86_64-linux]$

The diff of 'process' should always be >= the diff of 'thread'.

I make sure to wrap the 'thread' clock measurements the most tightly
around the nanosleep() call, and that the 'process' clock measurements
are the outer-most ones.

---
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

static pthread_barrier_t barrier;

static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1)
		__asm__ __volatile__("" : : : "memory");
	return NULL;
}

int main(void)
{
	clockid_t process_clock, my_thread_clock, th_clock;
	struct timespec process_before, process_after;
	struct timespec me_before, me_after;
	struct timespec th_before, th_after;
	struct timespec sleeptime;
	unsigned long diff;
	pthread_t th;
	int err;

	err = clock_getcpuclockid(0, &process_clock);
	if (err)
		return 1;

	err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
	if (err)
		return 1;

	pthread_barrier_init(&barrier, NULL, 2);
	err = pthread_create(&th, NULL, chew_cpu, NULL);
	if (err)
		return 1;

	err = pthread_getcpuclockid(th, &th_clock);
	if (err)
		return 1;

	pthread_barrier_wait(&barrier);

	err = clock_gettime(process_clock, &process_before);
	if (err)
		return 1;

	err = clock_gettime(my_thread_clock, &me_before);
	if (err)
		return 1;

	err = clock_gettime(th_clock, &th_before);
	if (err)
		return 1;

	sleeptime.tv_sec = 0;
	sleeptime.tv_nsec = 500000000;
	nanosleep(&sleeptime, NULL);

	err = clock_gettime(th_clock, &th_after);
	if (err)
		return 1;

	err = clock_gettime(my_thread_clock, &me_after);
	if (err)
		return 1;

	err = clock_gettime(process_clock, &process_after);
	if (err)
		return 1;

	diff = process_after.tv_nsec - process_before.tv_nsec;
	printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       process_before.tv_sec, process_before.tv_nsec,
	       process_after.tv_sec, process_after.tv_nsec, diff);
	diff = th_after.tv_nsec - th_before.tv_nsec;
	printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       th_before.tv_sec, th_before.tv_nsec,
	       th_after.tv_sec, th_after.tv_nsec, diff);
	diff = me_after.tv_nsec - me_before.tv_nsec;
	printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       me_before.tv_sec, me_before.tv_nsec,
	       me_after.tv_sec, me_after.tv_nsec, diff);

	return 0;
}
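
A side note on the reproducer: its diff values subtract only tv_nsec, which is adequate for a 0.5 s sleep that starts close to zero CPU time but would wrap if an interval crossed a second boundary. A helper along the following lines (the name timespec_diff_ns is an assumption, not part of the original test) accounts for tv_sec as well.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from *before to *after, including the tv_sec part. */
int64_t timespec_diff_ns(const struct timespec *before,
			 const struct timespec *after)
{
	assert(before != NULL && after != NULL);

	return (int64_t)(after->tv_sec - before->tv_sec) * 1000000000LL +
	       (after->tv_nsec - before->tv_nsec);
}
```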

This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().

Aside from that, it makes the function safe on 32-bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64-bit value and could be changed on another CPU at the same time.

Reported-by: David Miller
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller
Signed-off-by: Thomas Gleixner

Peter Zijlstra
2011-09-30 20:07:06 +0800
47ea91b40 Resource: fix wrong resource window calculation

__find_resource() incorrectly returns a resource window which overlaps
an existing allocated window. This happens when the parent's
resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
to all its children resource-windows.

__find_resource() looks for gaps in resource allocation among the
children resource windows. When it encounters the last child window it
blindly tries the range next to one allocated to the last child. Since
the last child's window ends at 0xffffffff the calculation overflows,
leading the algorithm to believe that any window in the range 0x00000000
to 0xffffffff is available for allocation. This leads to a conflicting
window allocation.

Michal Ludvig reported this issue seen on his platform. The following
patch fixes the problem and has been verified by Michal. I believe this
bug has been there for ages. It got exposed by git commit 2bbc6942273b
("PCI : ability to relocate assigned pci-resources")

Signed-off-by: Ram Pai
Tested-by: Michal Ludvig
Signed-off-by: Linus Torvalds

Ram Pai
2011-09-30 11:04:34 +0800

26 Sep, 2011

2 commits

6ebbe7a07 sched: Fix up wchan borkage

Commit c259e01a1ec ("sched: Separate the scheduler entry for
preemption") contained a boo-boo wrecking wchan output. It forgot to
put the new schedule() function in the __sched section, so the function
doesn't get properly ignored for things like wchan.

Tested-by: Simon Kirby
Cc: stable@kernel.org # 2.6.39+
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca
Signed-off-by: Ingo Molnar

Simon Kirby
2011-09-26 18:51:08 +0800
f9d81f61c ptrace: PTRACE_LISTEN forgets to unlock ->siglock

If PTRACE_LISTEN fails after lock_task_sighand() it doesn't drop ->siglock.

Reported-by: Matt Fleming
Signed-off-by: Oleg Nesterov
Signed-off-by: Linus Torvalds

Oleg Nesterov
2011-09-26 02:02:00 +0800

20 Sep, 2011

4 commits

eef24afb2 irq: Fix check for already initialized irq_domain in irq_domain_add

The sanity check in irq_domain_add() tests desc->irq_data != NULL or
irq_data->domain != NULL. This prevents adding an irq_domain to an irq
descriptor when irq_data exists, which is true whenever the irq
descriptor exists.

This went unnoticed so far as the simple domain code did not enter
this code path because domain->nr_irqs is always 0 for the simple domains.

Split the check for irq_data == NULL out and have a separate warning
for it.

[ tglx: Made the check for irq_data == NULL separate ]

Signed-off-by: Rob Herring
Cc: Grant Likely
Cc: marc.zyngier@arm.com
Cc: thomas.abraham@linaro.org
Cc: jamie@jamieiles.com
Cc: b-cousson@ti.com
Cc: shawn.guo@linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: devicetree-discuss@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1316017900-19918-3-git-send-email-robherring2@gmail.com
Signed-off-by: Thomas Gleixner

Rob Herring
2011-09-20 18:16:22 +0800
9d037a777 Merge branch 'irq-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip

* 'irq-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
x86, iommu: Mark DMAR IRQ as non-threaded
genirq: Make irq_shutdown() symmetric vs. irq_startup again

Linus Torvalds
2011-09-20 08:23:41 +0800
58c3c3aa0 Make taskstats round statistics down to nearest 1k bytes/events

Even with the interface limited to admin, there really is little
reason to give byte-per-byte counts for taskstats. So round it down to
something less intrusive.
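
Rounding down to a 1k granularity is a single mask operation. The sketch below shows the arithmetic only; the macro and function names are assumptions for illustration and are not taken from the taskstats code.

```c
#include <assert.h>
#include <stdint.h>

#define ROUND_UNIT 1024ULL	/* must be a power of two for the mask to work */

/* Clear the low 10 bits so the result is a multiple of 1024. */
uint64_t round_down_1k(uint64_t value)
{
	assert((ROUND_UNIT & (ROUND_UNIT - 1)) == 0);

	return value & ~(ROUND_UNIT - 1);
}
```

With this, any two counters that differ by less than 1024 within the same 1k bucket report the same value, which is what makes the statistic less precise for an observer.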

Acked-by: Balbir Singh
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-09-20 08:10:57 +0800
1a51410ab Make TASKSTATS require root access

Ok, this isn't optimal, since it means that 'iotop' needs admin
capabilities, and we may have to work on this some more. But at the
same time it is very much not acceptable to let anybody just read
anybody else's IO statistics quite at this level.

Use of the GENL_ADMIN_PERM suggested by Johannes Berg as an alternative
to checking the capabilities by hand.

Reported-by: Vasiliy Kulikov
Cc: Johannes Berg
Acked-by: Balbir Singh
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-09-20 08:04:37 +0800

18 Sep, 2011

1 commit

3be209a8e sched/rt: Migrate equal priority tasks to available CPUs

Commit 43fa5460fe60dea5c610490a1d263415419c60f6 ("sched: Try not to
migrate higher priority RT tasks") also introduced a change in behavior
which keeps RT tasks on the same CPU if there is an equal priority RT
task currently running even if there are empty CPUs available.

This can cause unnecessary wakeup latencies, and can prevent the
scheduler from balancing all RT tasks across available CPUs.

This change causes an RT task to search for a new CPU if an equal
priority RT task is already running on wakeup. Lower priority tasks
will still have to wait on higher priority tasks, but the system should
still balance out: if both a high and a low priority RT task are on a
given CPU, the high priority task could wake up while the low priority
task is running and force it to search for a better runqueue.

Signed-off-by: Shawn Bohrer
Acked-by: Steven Rostedt
Tested-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org # 37+
Link: http://lkml.kernel.org/r/1315837684-18733-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Ingo Molnar

Shawn Bohrer
2011-09-18 19:48:56 +0800

15 Sep, 2011

1 commit

fa2563e41 workqueue: lock cwq access in drain_workqueue

Take cwq->gcwq->lock to avoid racing between drain_workqueue checking to
make sure the workqueues are empty and cwq_dec_nr_in_flight decrementing
and then incrementing nr_active when it activates a delayed work.

We discovered this when a corner case in one of our drivers resulted in
us trying to destroy a workqueue in which the remaining work would
always requeue itself again in the same workqueue. We would hit this
race condition and trip the BUG_ON on workqueue.c:3080.

Signed-off-by: Thomas Tuttle
Acked-by: Tejun Heo
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Thomas Tuttle
2011-09-15 09:09:38 +0800

12 Sep, 2011

1 commit

ed585a651 genirq: Make irq_shutdown() symmetric vs. irq_startup again

If an irq_chip provides .irq_shutdown(), but neither .irq_disable() nor
.irq_mask(), free_irq() crashes when jumping to NULL.
Fix this by only trying .irq_disable() and .irq_mask() if there's no
.irq_shutdown() provided.
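
The fallback order can be sketched with plain function pointers. Everything below (struct chip_ops, chip_shutdown(), the example chip) is a hypothetical user-space model, and the final "no callback at all" branch is an extra guard added for the sketch rather than something the commit message describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for an interrupt chip's callbacks. */
struct chip_ops {
	void (*shutdown)(int irq);
	void (*disable)(int irq);
	void (*mask)(int irq);
};

/*
 * Prefer .shutdown; fall back to .disable and then .mask only when the
 * more specific callback is missing. Returns 1 if a callback ran, 0 if
 * the chip provided none.
 */
int chip_shutdown(const struct chip_ops *ops, int irq)
{
	assert(ops != NULL);

	if (ops->shutdown)
		ops->shutdown(irq);
	else if (ops->disable)
		ops->disable(irq);
	else if (ops->mask)
		ops->mask(irq);
	else
		return 0;
	return 1;
}

/* Example chip that provides only .shutdown, the case that used to crash. */
static int shutdown_calls;

static void example_shutdown(int irq)
{
	(void)irq;
	shutdown_calls++;
}

const struct chip_ops shutdown_only_chip = { .shutdown = example_shutdown };
```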

This revives the symmetry with irq_startup(), which tries .irq_startup(),
.irq_enable(), and .irq_unmask(), and makes it consistent with the
kernel-doc comment for irq_chip.irq_shutdown(), which says:

* @irq_shutdown: shut down the interrupt (defaults to ->disable if NULL)

This is also how __free_irq() behaved before the big overhaul, cfr. e.g.
3b56f0585fd4c02d047dc406668cb40159b2d340 ("genirq: Remove bogus conditional"),
where the core interrupt code always overrode .irq_shutdown() to
.irq_disable() if .irq_shutdown() was NULL.

Signed-off-by: Geert Uytterhoeven
Cc: linux-m68k@lists.linux-m68k.org
Link: http://lkml.kernel.org/r/1315742394-16036-2-git-send-email-geert@linux-m68k.org
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner

Geert Uytterhoeven
2011-09-12 15:38:53 +0800

08 Sep, 2011

2 commits

79016f648 Merge branch 'timers-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip

* 'timers-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
rtc: twl: Fix registration vs. init order
rtc: Initialized rtc_time->tm_isdst
rtc: Fix RTC PIE frequency limit
rtc: rtc-twl: Remove lockdep related local_irq_enable()
rtc: rtc-twl: Switch to using threaded irq
rtc: ep93xx: Fix 'rtc' may be used uninitialized warning
alarmtimers: Avoid possible denial of service with high freq periodic timers
alarmtimers: Memset itimerspec passed into alarm_timer_get
alarmtimers: Avoid possible null pointer traversal

Linus Torvalds
2011-09-08 04:03:48 +0800
e81b693c0 Merge branch 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip

* 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
sched: Fix a memory leak in __sdt_free()
sched: Move blk_schedule_flush_plug() out of __schedule()
sched: Separate the scheduler entry for preemption

Linus Torvalds
2011-09-08 04:01:34 +0800

31 Aug, 2011

1 commit

7f310a5d4 perf_event: Fix broken calc_timer_values()

We detected a serious issue with PERF_SAMPLE_READ and
timing information when events were being multiplexed.

Samples would have time_running > time_enabled. That
was easy to reproduce with a libpfm4 example (ran 3
times to cause multiplexing on Core 2):

$ syst_smpl -e uops_retired:freq=1 &
$ syst_smpl -e uops_retired:freq=1 &
$ syst_smpl -e uops_retired:freq=1 &
IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
syst_smpl: WARNING: time_running > time_enabled
63277537998 uops_retired:freq=1 , scaled

The bug was not present in kernel up to (and including) 3.0. It turns
out the bug was introduced by the following commit:

commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa

events: Move lockless timer calculation into helper function

The parameters of the function got reversed yet the call sites
were not updated to reflect the change. That led to time_running
and time_enabled being swapped. That had no effect when there was
no multiplexing because in that case time_running = time_enabled
but it would show up in any other scenario.

Signed-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
Signed-off-by: Ingo Molnar

Eric B Munson
2011-08-31 21:56:29 +0800

29 Aug, 2011

4 commits

a8d757ef0 perf events: Fix slow and broken cgroup context switch code

The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:

$ ./pong
Both processes pinned to CPU1, running for 10s
10684.51 ctxsw/s

Now start a cgroup perf stat:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

$ ./pong
Both processes pinned to CPU1, running for 10s
6674.61 ctxsw/s

That's a 37% penalty.

Note that pong is not even in the monitored cgroup.

The results shown by perf stat are bogus:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

Performance counter stats for 'sleep 100':

CPU1 cycles test
CPU1 16,984,189,138 cycles # 0.000 GHz

The second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.

The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.

With this patch the same test now yields:
$ ./pong
Both processes pinned to CPU1, running for 10s
10775.30 ctxsw/s

Start perf stat with cgroup:

$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Run pong outside the cgroup:
$ ./pong
Both processes pinned to CPU1, running for 10s
10687.80 ctxsw/s

The penalty is now less than 2%.

And the results for perf stat are correct:

$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Performance counter stats for 'sleep 10':

CPU1 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz

Now perf stat reports the correct counts for
the non cgroup event.

If we run pong inside the cgroup, then we also get the
correct counts:

$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Performance counter stats for 'sleep 10':

CPU1 22,297,726,205 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz

10.001457237 seconds time elapsed

Signed-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar

Stephane Eranian
2011-08-29 18:28:33 +0800
feff8fa00 sched: Fix a memory leak in __sdt_free()

This patch fixes the following memory leak:

unreferenced object 0xffff880107266800 (size 512):
comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[] create_object+0x187/0x28b
[] kmemleak_alloc+0x73/0x98
[] __kmalloc_node+0x104/0x159
[] kzalloc_node.clone.97+0x15/0x17
[] build_sched_domains+0xb7/0x7f3
[] partition_sched_domains+0x1db/0x24a
[] do_rebuild_sched_domains+0x3b/0x47
[] rebuild_sched_domains+0x10/0x12
[] sched_power_savings_store+0x6c/0x7b
[] sched_mc_power_savings_store+0x16/0x18
[] sysdev_class_store+0x20/0x22
[] sysfs_write_file+0x108/0x144
[] vfs_write+0xaf/0x102
[] sys_write+0x4d/0x74
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

Signed-off-by: WANG Cong
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org # 3.0
Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
Signed-off-by: Ingo Molnar

WANG Cong
2011-08-29 18:27:01 +0800
9c40cef2b sched: Move blk_schedule_flush_plug() out of __schedule()

There is no real reason to run blk_schedule_flush_plug() with
interrupts and preemption disabled.

Move it into schedule() and call it when the task is going voluntarily
to sleep. There might be false positives when the task is woken
between that call and actually scheduling, but that's not really
different from being woken immediately after switching away.

This fixes a deadlock in the scheduler where the
blk_schedule_flush_plug() callchain enables interrupts and thereby
allows a wakeup to happen of the task that's going to sleep.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Linus Torvalds
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
Signed-off-by: Ingo Molnar

Thomas Gleixner
2011-08-29 18:26:59 +0800
c259e01a1 sched: Separate the scheduler entry for preemption

Block-IO and workqueues call into notifier functions from the
scheduler core code with interrupts and preemption disabled. These
calls should be made before entering the scheduler core.

To simplify this, separate the scheduler core code into
__schedule(). __schedule() is directly called from the places which
set PREEMPT_ACTIVE and from schedule(). This allows us to add the work
checks into schedule(), so they are only called when a task voluntarily
goes to sleep.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Jens Axboe
Cc: Linus Torvalds
Cc: stable@kernel.org # 2.6.39+
Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
Signed-off-by: Ingo Molnar

Thomas Gleixner
2011-08-29 18:26:57 +0800

27 Aug, 2011

1 commit

f5b940997 All Arch: remove linkage for sys_nfsservctl system call

The nfsservctl system call is now gone, so we should remove all
linkage for it.

Signed-off-by: NeilBrown
Signed-off-by: J. Bruce Fields
Signed-off-by: Linus Torvalds

NeilBrown
2011-08-27 06:09:58 +0800

26 Aug, 2011

2 commits

4c30c6f56 kernel/printk: do not turn off bootconsole in printk_late_init() if keep_bootcon

It seems that 7bf693951a8e ("console: allow to retain boot console via
boot option keep_bootcon") doesn't always achieve what it aims for, as
printk_late_init() unconditionally turns off all boot consoles when it runs.
With this patch, I am able to see more messages on the boot console in
KVM guests than I can without, when keep_bootcon is specified.

I think it is appropriate for the relevant -stable trees. However, it's
more of an annoyance than a serious bug (ideally you don't need to keep
the boot console around as console handover should be working -- I was
encountering a situation where the console handover wasn't working and
not having the boot console available meant I couldn't see why).

Signed-off-by: Nishanth Aravamudan
Cc: David S. Miller
Cc: Alan Cox
Cc: Greg KH
Acked-by: Fabio M. Di Nitto
Cc: [2.6.39.x, 3.0.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nishanth Aravamudan
2011-08-26 07:25:34 +0800
be27425dc Add a personality to report 2.6.x version numbers

I ran into a couple of programs which broke with the new Linux 3.0
version. Some of those were binary only. I tried to use LD_PRELOAD to
work around it, but it was quite difficult and in one case impossible
because of a mix of 32bit and 64bit executables.

For example, all kinds of management software from HP doesn't work, unless
we pretend to run a 2.6 kernel.

$ uname -a
Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux

$ hpacucli ctrl all show

Error: No controllers detected.

$ rpm -qf /usr/sbin/hpacucli
hpacucli-8.75-12.0

Another notable case is that Python now reports "linux3" from
sys.platform, which in turn can break things that were checking
sys.platform == "linux2":

https://bugzilla.mozilla.org/show_bug.cgi?id=664564

It seems pretty clear to me though it's a bug in the apps that are using
'==' instead of .startswith(), but this allows us to unbreak broken
programs.

This patch adds a UNAME26 personality that makes the kernel report a
2.6.40+x version number instead. The x is the x in 3.x.
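
The version mapping can be written out as a small user-space sketch. This is not the kernel's implementation; uname26_release() is a hypothetical helper that only demonstrates the arithmetic described above, where a "3.x" release is reported as "2.6.(40+x)" and the rest of the string is carried over unchanged.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite a "3.x<rest>" release string as "2.6.(40+x)<rest>".
 * Returns 0 on success, -1 if the input is not a 3.x release or if
 * out is too small to hold the result.
 */
int uname26_release(const char *release, char *out, size_t len)
{
	unsigned int minor;
	int consumed = 0;
	int n;

	assert(release != NULL && out != NULL);

	if (strncmp(release, "3.", 2) != 0)
		return -1;
	/* %n records how many characters "3.x" occupied. */
	if (sscanf(release, "3.%u%n", &minor, &consumed) != 1)
		return -1;

	n = snprintf(out, len, "2.6.%u%s", 40 + minor, release + consumed);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```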

I know this is somewhat ugly, but I didn't find a better workaround, and
compatibility with existing programs is important.

Some programs also read /proc/sys/kernel/osrelease. This can be worked
around in user space with mount --bind (and a mount namespace).

To use:

wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
gcc -o uname26 uname26.c
./uname26 program

Signed-off-by: Andi Kleen
Signed-off-by: Linus Torvalds

Andi Kleen
2011-08-26 01:17:28 +0800

24 Aug, 2011

2 commits

35a177a08 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs

* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix tracing builds inside the source tree
xfs: remove subdirectories
xfs: don't expect xfs headers to be in subdirectories

Linus Torvalds
2011-08-24 02:41:44 +0800
69dd3d8e2 Revert "irq: Always set IRQF_ONESHOT if no primary handler is specified"

This reverts commit f3637a5f2e2eb391ff5757bc83fb5de8f9726464.

It turns out that this breaks several drivers, one example being OMAP
boards which use the on-board OMAP UARTs and the omap-serial driver that
will not boot to userspace after the commit.

Paul Walmsley reports that enabling CONFIG_DEBUG_SHIRQ reveals 'IRQ
handler type mismatch' errors:

IRQ handler type mismatch for IRQ 74
current handler: serial idle
...

and the reason is that setting IRQF_ONESHOT will now result in those
interrupt handlers having different IRQF flags, and thus being
unsharable. So the commit log in the reverted commit:

"Since it is required for those users and
there is no difference for others it makes sense to add this flag
unconditionally."

is simply not true: there may not be any difference from an "actions at
irq time" standpoint, but there is a *big* difference wrt the flag testing
done in irq management (see __setup_irq() in kernel/irq/manage.c).

One solution may be to stop verifying IRQF_ONESHOT in __setup_irq(), but
right now the safe course of action is to revert the change. Let's
revisit this in a later merge window.

Reported-by: Paul Walmsley
Cc: Sebastian Andrzej Siewior
Requested-by: Alan Cox
Acked-by: Thomas Gleixner
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-08-24 01:36:51 +0800

20 Aug, 2011

1 commit

5ccc38740 Merge branch 'for-linus' of git://git.kernel.dk/linux-block

* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
Revert "cfq: Remove special treatment for metadata rqs."
block: fix flush machinery for stacking drivers with differring flush flags
block: improve rq_affinity placement
blktrace: add FLUSH/FUA support
Move some REQ flags to the common bio/request area
allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
xen/blkback: Make description more obvious.
cfq-iosched: Add documentation about idling
block: Make rq_affinity = 1 work as expected
block: swim3: fix unterminated of_device_id table
block/genhd.c: remove useless cast in diskstats_show()
drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
bsg-lib: add module.h include
cfq-iosched: Reduce linked group count upon group destruction
blk-throttle: correctly determine sync bio
loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
loop: add management interface for on-demand device allocation
loop: replace linked list of allocated devices with an idr index
...

Linus Torvalds
2011-08-20 01:47:07 +0800

19 Aug, 2011

1 commit

d522a0d17 irqdesc: fix new kernel-doc warning

Fix kernel-doc warning in irqdesc.c:

Warning(kernel/irq/irqdesc.c:353): No description found for parameter 'owner'

Signed-off-by: Randy Dunlap
Signed-off-by: Linus Torvalds

Randy Dunlap
2011-08-19 05:12:48 +0800

18 Aug, 2011

3 commits

b4fd4ae6c Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / Domains: Fix build for CONFIG_PM_RUNTIME unset

Linus Torvalds
2011-08-18 04:15:25 +0800
2da9f365f Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Fix wrong assumption in match_held_lock

Linus Torvalds
2011-08-18 01:25:08 +0800
950d0a10d Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq: Track the owner of irq descriptor
irq: Always set IRQF_ONESHOT if no primary handler is specified
genirq: Fix wrong bit operation

Linus Torvalds
2011-08-18 01:23:50 +0800

14 Aug, 2011

1 commit

17f2ae7f6 PM / Domains: Fix build for CONFIG_PM_RUNTIME unset

Function genpd_queue_power_off_work() is not defined for
CONFIG_PM_RUNTIME, so pm_genpd_poweroff_unused() causes a build
error to happen in that case. Fix the problem by making
pm_genpd_poweroff_unused() depend on CONFIG_PM_RUNTIME too.

Signed-off-by: Rafael J. Wysocki

Rafael J. Wysocki
2011-08-14 19:34:31 +0800

13 Aug, 2011

1 commit

c59d87c46 xfs: remove subdirectories

Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
annoying subdirectories in the XFS source code. Besides the large
number of file renames, the only changes are to the Makefile, a few
files including headers with the subdirectory prefix, and the binary
sysctl compat code that includes a header under fs/xfs/ from
kernel/.

Signed-off-by: Christoph Hellwig
Signed-off-by: Alex Elder

Christoph Hellwig
2011-08-13 05:21:35 +0800

12 Aug, 2011

2 commits

72fa59970 move RLIMIT_NPROC check from set_user() to do_execve_common()

The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
check in set_user() to catch attempts to exceed NPROC via setuid() and
similar functions.

Before the check, an unprivileged user could greatly exceed the allowed
number of processes if the program relied on rlimit only. But the check
created a new security threat: many poorly written programs simply don't
check the setuid() return code and believe it cannot fail if executed
with root privileges. So this patch removes the check, because privilege
escalations caused by such buggy programs are too frequent.

The NPROC can still be enforced in the common code flow of daemons
spawning user processes. Most daemons do fork()+setuid()+execve().
The check introduced in execve() (1) enforces the same limit as in
setuid() and (2) doesn't create similar security issues.

Neil Brown suggested tracking which specific process has exceeded the
limit by setting the PF_NPROC_EXCEEDED process flag. With the change only
this process would fail on execve(), and other processes' execve()
behaviour is not changed.

Solar Designer suggested re-checking whether the NPROC limit is still
exceeded at the moment of execve(). If the process was sleeping for
days between set*uid() and execve(), and the NPROC counter stepped down
under the limit, a deferred execve() failure because the NPROC limit was
exceeded days ago would be unexpected. If the limit is not exceeded
anymore, we clear the flag on successful calls to execve() and fork().

The flag is also cleared on successful calls to set_user() as the limit
was exceeded for the previous user, not the current one.

A similar check was introduced in the -ow patches (without the process flag).

v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

Reviewed-by: James Morris
Signed-off-by: Vasiliy Kulikov
Acked-by: NeilBrown
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2011-08-12 02:24:42 +0800
1d229d54d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf symbols: Check '/tmp/perf-' symbol file ownership
perf sched: Usage leftover from trace -> script rename
perf sched: Do not delete session object prematurely
perf tools: Check $HOME/.perfconfig ownership
perf, x86: Add model 45 SandyBridge support
perf tools: Add support to install perf python extension
perf tools: do not look at ./config for configuration
perf tools: Make clean leaves some files
perf lock: Dropping unsupported ':r' modifier
perf probe: Fix coredump introduced by probe module option
jump label: Reduce the cycle count by changing the link order
perf report: Use ui__warning in some more places
perf python: Add PERF_RECORD_{LOST,READ,SAMPLE} routine tables
perf evlist: Introduce 'disable' method
trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED
perf buildid-cache: Zero out buffer of filenames when adding/removing buildid

Linus Torvalds
2011-08-12 00:03:48 +0800

11 Aug, 2011

2 commits

c09c47cae blktrace: add FLUSH/FUA support

Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
FUA follows WRITE, use the same 'F' flag for both cases and
distinguish them by their (relative) position. The end results
look like (other flags might be shown also):

- WRITE: W
- WRITE_FLUSH: FW
- WRITE_FUA: WF
- WRITE_FLUSH_FUA: FWF
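
The positional encoding listed above can be sketched as a small helper.
The name fill_write_flags() and its signature are hypothetical and do
not come from the blktrace sources; the sketch covers only the write
cases shown in the list.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: build the flag string for a write request.
 * `buf` must have room for at least 4 bytes ("FWF" plus the NUL).
 */
static void fill_write_flags(char *buf, bool flush, bool fua)
{
	size_t i = 0;

	if (flush)
		buf[i++] = 'F';	/* FLUSH precedes the write */
	buf[i++] = 'W';
	if (fua)
		buf[i++] = 'F';	/* FUA follows the write */
	buf[i] = '\0';
}
```

Because the same letter is used for both, a reader of the output tells
FLUSH from FUA only by whether the 'F' comes before or after the 'W'.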

Note that we reuse TC_BARRIER for lack of bit space in act_mask,
so older versions of the blktrace tools will report flush
requests as barriers from now on.

Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Ingo Molnar
Signed-off-by: Namhyung Kim
Reviewed-by: Jeff Moyer
Signed-off-by: Jens Axboe

Namhyung Kim
2011-08-11 16:36:05 +0800
6af7e471e alarmtimers: Avoid possible denial of service with high freq periodic timers

It's possible to jam up the alarm timers by setting very small interval
timers, which will cause the alarmtimer subsystem to spend all of its time
firing and restarting timers. This can effectively lock up a box.

A deeper fix is needed, closely mimicking the hrtimer code, but for now
just cap the interval to 100us to avoid userland hanging the system.
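
As a rough illustration of the interim fix, the cap amounts to a floor
on non-zero periodic intervals. The helper clamp_alarm_interval() and
the constant name below are hypothetical; leaving a zero interval
(a one-shot timer) untouched is an assumption of this sketch.

```c
#include <stdint.h>

#define MIN_ALARM_INTERVAL_NS 100000LL	/* 100us, the cap named above */

/* Hypothetical helper: raise too-small periodic intervals to the floor. */
static int64_t clamp_alarm_interval(int64_t interval_ns)
{
	/* Zero means "not periodic" in this sketch, so leave it alone. */
	if (interval_ns > 0 && interval_ns < MIN_ALARM_INTERVAL_NS)
		return MIN_ALARM_INTERVAL_NS;

	return interval_ns;
}
```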

CC: Thomas Gleixner
CC: stable@kernel.org
Signed-off-by: John Stultz

John Stultz
2011-08-11 01:26:09 +0800

10 Aug, 2011

3 commits

ea7802f63 alarmtimers: Memset itimerspec passed into alarm_timer_get

Following common_timer_get, zero out the itimerspec passed in.

CC: Thomas Gleixner
CC: stable@kernel.org
Signed-off-by: John Stultz

John Stultz
2011-08-10 22:10:09 +0800
971c90bfa alarmtimers: Avoid possible null pointer traversal

We don't check that old_setting is non-NULL before writing to it, so
add the check.
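
The pattern being fixed is an optional output parameter that must be
tested before it is written. A minimal sketch, using a hypothetical
struct timer_setting and swap_setting() rather than the alarmtimer code:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a timer setting. */
struct timer_setting {
	long interval_ns;
	long value_ns;
};

/*
 * Install `new_setting` into `cur`. `old_setting` is optional: the
 * previous value is written out only when the caller passed a buffer.
 */
static void swap_setting(struct timer_setting *cur,
			 const struct timer_setting *new_setting,
			 struct timer_setting *old_setting)
{
	if (old_setting)
		*old_setting = *cur;	/* previous value, on request only */

	*cur = *new_setting;
}
```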

CC: Thomas Gleixner
CC: stable@kernel.org
Signed-off-by: John Stultz

John Stultz
2011-08-10 22:09:53 +0800
f2c0d0266 cap_syslog: don't use WARN_ONCE for CAP_SYS_ADMIN deprecation warning

syslog-ng versions before 3.3.0beta1 (2011-05-12) assume that
CAP_SYS_ADMIN is sufficient to access syslog, so ever since CAP_SYSLOG
was introduced (2010-11-25) they have triggered a warning.

Commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
improved matters a little by making syslog-ng work again, just keeping
the WARN_ONCE(). But still, this is a warning that writes a stack trace
we don't care about to syslog, sets a taint flag, and alarms sysadmins
when nothing worse has happened than use of an old userspace with a
recent kernel.

Convert the WARN_ONCE to a printk_once to avoid that while continuing to
give userspace developers a hint that this is an unwanted
backward-compatibility feature and won't be around forever.
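
The "print once" behaviour kept by this change can be approximated in
userspace as below. The function name, the counter, and the message
text are all hypothetical; the sketch shows only the one-shot, plain
message idea, with no stack trace and no taint flag.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how often the notice was actually emitted (for illustration). */
static int deprecation_notices;

/* Hypothetical userspace analogue of a one-shot deprecation hint. */
static void note_deprecated_capability(void)
{
	static bool warned;	/* persists across calls */

	if (warned)
		return;
	warned = true;

	deprecation_notices++;
	fprintf(stderr, "deprecated capability used for syslog access\n");
}
```

However many times the deprecated path is hit, the message appears once.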

Reported-by: Ralf Hildebrandt
Reported-by: Niels
Reported-by: Paweł Sikora
Signed-off-by: Jonathan Nieder
Liked-by: Gergely Nagy
Acked-by: Serge Hallyn
Acked-by: James Morris
Signed-off-by: Linus Torvalds

Jonathan Nieder
2011-08-10 09:22:22 +0800