12 Mar, 2014

1 commit

  • Pull audit namespace fixes from Eric Biederman:
    "Starting with 3.14-rc1 the audit code is faulty (think oopses and
    races) with respect to how it computes the network namespace of which
    socket to reply to, and I happened to notice by chance when reading
    through the code.

    My testing and the automated build bots don't find any problems with
    these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    audit: Update kdoc for audit_send_reply and audit_list_rules_send
    audit: Send replies in the proper network namespace.
    audit: Use struct net not pid_t to remember the network namespace to reply in

    Linus Torvalds
     

11 Mar, 2014

1 commit

  • GFP_THISNODE is for callers that implement their own clever fallback to
    remote nodes. It restricts the allocation to the specified node and
    does not invoke reclaim, assuming that the caller will take care of it
    when the fallback fails, e.g. through a subsequent allocation request
    without GFP_THISNODE set.

    However, many current GFP_THISNODE users only want the node exclusive
    aspect of the flag, without actually implementing their own fallback or
    triggering reclaim if necessary. This results in things like page
    migration failing prematurely even when there is easily reclaimable
    memory available, unless kswapd happens to be running already or a
    concurrent allocation attempt triggers the necessary reclaim.

    Convert all callsites that don't implement their own fallback strategy
    to __GFP_THISNODE. This restricts the allocation to a single node as well, but
    at the same time allows the allocator to enter the slowpath, wake
    kswapd, and invoke direct reclaim if necessary, to make the allocation
    happen when memory is full.
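
    For illustration, a node-exclusive allocation that still allows reclaim
    might look like the following minimal sketch (the wrapper function is
    hypothetical; alloc_pages_node() and the gfp flags are the real APIs):

    /* Restrict the allocation to node 'nid', but let the allocator wake
     * kswapd and enter direct reclaim if that node is short on memory. */
    static struct page *alloc_page_on_node(int nid)
    {
            return alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
    }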

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Jan Stancek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Mar, 2014

2 commits

  • The kbuild test robot reported:
    > tree: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next
    > head: 6f285b19d09f72e801525f5eea1bdad22e559bf0
    > commit: 6f285b19d09f72e801525f5eea1bdad22e559bf0 [2/2] audit: Send replies in the proper network namespace.
    > reproduce: make htmldocs
    >
    > >> Warning(kernel/audit.c:575): No description found for parameter 'request_skb'
    > >> Warning(kernel/audit.c:575): Excess function parameter 'portid' description in 'audit_send_reply'
    > >> Warning(kernel/auditfilter.c:1074): No description found for parameter 'request_skb'
    > >> Warning(kernel/auditfilter.c:1074): Excess function parameter 'portid' description in 'audit_list_rules_send'

    Which was caused by my failure to update the kdoc annotations when I
    updated the functions. Fix that small oversight now.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Pull cgroup fixes from Tejun Heo:
    "Two cpuset locking fixes from Li. Both tagged for -stable"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: fix a race condition in __cpuset_node_allowed_softwall()
    cpuset: fix a locking issue in cpuset_migrate_mm()

    Linus Torvalds
     

08 Mar, 2014

2 commits

  • …it/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "In the past, I've had lots of reports about trace events not working.
    Developers would say they put a trace_printk() before and after the
    trace event but when they enable it (and the trace event said it was
    enabled) they would see the trace_printks but not the trace event.

    I was not able to reproduce this, but that's because I wasn't looking
    at the right location. Recently, another bug came up that showed the
    issue.

    If your kernel supports signed modules but allows for non-signed
    modules to be loaded, then when one is, the kernel will silently set
    the MODULE_FORCED taint on the module. Although this taint happens
    without the need for insmod --force or anything of the kind, it labels
    the module with that taint anyway.

    If this tainted module has tracepoints, the tracepoints will be
    ignored because of the MODULE_FORCED taint. But no error message will
    be displayed. Worse yet, the event infrastructure will still be
    created letting users enable the trace event represented by the
    tracepoint, although that event will never actually be enabled. This
    is because the tracepoint infrastructure allows for non-existing
    tracepoints to be enabled for new modules to arrive and have their
    tracepoints set.

    Although there are several things wrong with the above, this change
    only addresses the creation of the trace event files for tracepoints
    that are not created when a tainted module is loaded. With this
    change, an error message is printed about the module being tainted,
    and the trace event files are not created, because the trace event
    infrastructure is not set up for that module"

    * tag 'trace-fixes-v3.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Do not add event files for modules that fail tracepoints

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    - a bugfix for a long standing waitqueue race
    - a trivial fix for a missing include

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Include missing header file in irqdomain.c
    genirq: Remove racy waitqueue_active check

    Linus Torvalds
     

04 Mar, 2014

2 commits

  • If a module fails to add its tracepoints due to module tainting, do not
    create the module's event infrastructure in the debugfs directory. The events
    would not work and, worse yet, would silently fail, making the user wonder
    why the events they enable do not display anything.

    Having a warning on module load and the events not visible to the users
    will make the cause of the problem much clearer.
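
    A rough sketch of the check this implies when a module's events are
    registered (the helper name and exact taint handling are assumptions,
    not the verbatim patch):

    static void trace_module_add_events(struct module *mod)
    {
            if (!mod->num_trace_events)
                    return;

            /* The module's tracepoints were never registered because of
             * the taint; warn instead of creating dead event files. */
            if (trace_module_has_bad_taint(mod)) {
                    pr_err("%s: module has bad taint, not creating trace events\n",
                           mod->name);
                    return;
            }

            /* ... otherwise create the event files as before ... */
    }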

    Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org

    Fixes: 6d723736e472 "tracing/events: add support for modules to TRACE_EVENT"
    Acked-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org # 2.6.31+
    Cc: Rusty Russell
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes, most of them SCHED_DEADLINE fallout"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/deadline: Prevent rt_time growth to infinity
    sched/deadline: Switch CPU's presence test order
    sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration
    sched: Fix double normalization of vruntime

    Linus Torvalds
     

28 Feb, 2014

2 commits

  • In struct audit_netlink_list and audit_reply add a reference to the
    network namespace of the caller and remove the userspace pid of the
    caller. This cleanly remembers the caller's network namespace, and
    removes a huge class of races and nasty failure modes that can occur
    when attempting to re-look up the caller's network namespace from a
    pid_t (including the caller's network namespace changing, pid
    wraparound, and the pid simply not being present).
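
    For illustration, the reshaped reply bookkeeping looks roughly like this
    (field layout abridged; treat it as a sketch rather than the exact diff):

    struct audit_reply {
            __u32           portid;
            struct net      *net;   /* replaces the remembered pid_t of the caller */
            struct sk_buff  *skb;
    };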

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Pull filesystem fixes from Jan Kara:
    "Notification, writeback, udf, quota fixes

    The notification patches are (with one exception) fallout of my
    fsnotify rework which went into -rc1 (I've extended LTP to cover these
    corner cases to avoid similar breakage in the future).

    The UDF patch fixes a nasty data corruption Al has recently reported,
    the revert of the writeback patch is due to the possibility of violating
    sync(2) guarantees, and the quota fix addresses a bug that can lead to
    corruption of quota files in ocfs2"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    fsnotify: Allocate overflow events with proper type
    fanotify: Handle overflow in case of permission events
    fsnotify: Fix detection whether overflow event is queued
    Revert "writeback: do not sync data dirtied after sync start"
    quota: Fix race between dqput() and dquot_scan_active()
    udf: Fix data corruption on file type conversion
    inotify: Fix reporting of cookies for inotify events

    Linus Torvalds
     

27 Feb, 2014

9 commits

  • It's not safe to access a task's cpuset after releasing task_lock().
    Holding callback_mutex won't help.

    Cc:
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • I can trigger a lockdep warning:

    # mount -t cgroup -o cpuset xxx /cgroup
    # mkdir /cgroup/cpuset
    # mkdir /cgroup/tmp
    # echo 0 > /cgroup/tmp/cpuset.cpus
    # echo 0 > /cgroup/tmp/cpuset.mems
    # echo 1 > /cgroup/tmp/cpuset.memory_migrate
    # echo $$ > /cgroup/tmp/tasks
    # echo 1 > /cgroup/tmp/cpuset.mems

    ===============================
    [ INFO: suspicious RCU usage. ]
    3.14.0-rc1-0.1-default+ #32 Not tainted
    -------------------------------
    include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage!
    ...
    [] dump_stack+0x72/0x86
    [] lockdep_rcu_suspicious+0x101/0x140
    [] cpuset_migrate_mm+0xb1/0xe0
    ...

    We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now
    we hold cpuset_mutex, which causes task_css() to complain.

    This is not a false-positive but a real issue.

    Holding cpuset_mutex won't prevent a task from migrating to another
    cpuset, and it won't prevent the original task->cgroup from being
    destroyed during this change.

    Fixes: 5d21cc2db040 (cpuset: replace cgroup_mutex locking with cpuset internal locking)
    Cc: # 3.9+
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Include the appropriate header file include/linux/of_irq.h in
    kernel/irq/irqdomain.c because it contains the prototype of a
    function defined in kernel/irq/irqdomain.c.

    This eliminates the following warning in kernel/irq/irqdomain.c:
    kernel/irq/irqdomain.c:468:14: warning: no previous prototype for ‘irq_create_of_mapping’ [-Wmissing-prototypes]
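
    A minimal illustration of the change (near the top of
    kernel/irq/irqdomain.c):

    #include <linux/of_irq.h>       /* declares irq_create_of_mapping() */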

    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Cc: Benjamin Herrenschmidt
    Link: http://lkml.kernel.org/r/eb89aebea7ff1a46122918ac389ebecf8248be9a.1393493276.git.rashika.kheria@gmail.com
    Signed-off-by: Thomas Gleixner

    Rashika Kheria
     
  • Drew Richardson reported that he could make the kernel go *boom* when hotplugging
    while having perf events active.

    It turned out that when you have a group event, the code in
    __perf_event_exit_context() fails to remove the group siblings from
    the context.

    We then proceed with destroying and freeing the event, and when you
    re-plug the CPU and try and add another event to that CPU, things go
    *boom* because you've still got dead entries there.

    Reported-by: Drew Richardson
    Signed-off-by: Peter Zijlstra
    Cc: Will Deacon
    Cc:
    Link: http://lkml.kernel.org/n/tip-k6v5wundvusvcseqj1si0oz0@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Kirill Tkhai noted:

    Since deadline tasks share rt bandwidth, we must care about
    bandwidth timer set. Otherwise rt_time may grow up to infinity
    in update_curr_dl(), if there are no other available RT tasks
    on top level bandwidth.

    RT tasks were in fact throttled right after they got enqueued,
    and never executed again (rt_time never again went below rt_runtime).

    Peter then proposed to accrue DL execution on rt_time only when
    rt timer is active, and proposed a patch (this patch is a slight
    modification of that) to implement that behavior. While this
    solves Kirill's problem, it has a drawback.

    Indeed, Kirill noted again:

    It looks we may get into a situation, when all CPU time is shared
    between RT and DL tasks:

    rt_runtime = n
    rt_period = 2n

    | RT working, DL sleeping  | DL working, RT sleeping      |
    |--------------------------|------------------------------|
    | (1) duration = n         | (2) duration = n             | (repeat)
    | (rt_bw timer is running) | (rt_bw timer is not running) |

    No time for fair tasks at all.

    While this can happen during the first period, if rq is always backlogged,
    RT tasks won't have the opportunity to execute anymore: rt_time reached
    rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
    throttled since rt timer didn't fire, replenishment is from now on eaten up
    by DL tasks that accrue their execution on rt_time (while rt timer is
    active - we have an RT task waiting for replenishment). FAIR tasks are
    not touched after this first period. Ok, this is not ideal, and the situation
    is even worse!

    What's described above (the nice case) practically never happens in
    reality, where your rt timer is not aligned to task periods, tasks are
    in general not periodic, etc. Long story short, you always risk
    overloading your system.

    This patch is based on Peter's idea, but exploits an additional fact:
    if you don't have RT tasks enqueued, it makes little sense to continue
    incrementing rt_time once you reached the upper limit (DL tasks have their
    own mechanism for throttling).

    This cures both problems:

    - no matter how many DL instances in the past, you'll have an rt_time
    slightly above rt_runtime when an RT task is enqueued, and from that
    point on (after the first replenishment), the task will normally execute;

    - you can still eat up all bandwidth during the first period, but not
    anymore after that, remember that DL execution will increment rt_time
    till the upper limit is reached.

    The situation is still not perfect! But, we have a simple solution for now,
    that limits how much you can jeopardize your system, as we keep working
    towards the right answer: RT groups scheduled using deadline servers.
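
    A sketch of the resulting accounting condition (helper name and exact
    placement are assumptions based on the description above): DL execution
    is charged to rt_time only while the rt period timer is running or
    rt_time is still below rt_runtime.

    /* in update_curr_dl(), roughly: */
    if (sched_rt_bandwidth_account(rt_rq))
            rt_rq->rt_time += delta_exec;

    /* where the helper looks something like: */
    static bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
    {
            struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

            return hrtimer_active(&rt_b->rt_period_timer) ||
                   rt_rq->rt_time < rt_b->rt_runtime;
    }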

    Reported-by: Kirill Tkhai
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Commit 82b9580 ("sched/deadline: Test for CPU's presence explicitly")
    changed how we check if a CPU returned by cpudeadline machinery is
    valid. But, we don't want to call cpu_present() if best_cpu is
    equal to -1. So, switch the order of tests inside WARN_ON().
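
    A sketch of the reordered test (illustrative):

    /* Evaluate the cheap -1 test first so cpu_present() is never
     * called with an invalid index. */
    WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));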

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: boris.ostrovsky@oracle.com
    Cc: konrad.wilk@oracle.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1393238832-9100-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deadline class we do not have group scheduling.

    So, let's remove the unnecessary 'X = X;' self-assignments left over
    from the RT code.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dequeue_entity() is called when p->on_rq and sets se->on_rq = 0,
    which appears to guarantee that the !se->on_rq condition is met.
    If the task has done set_current_state(TASK_INTERRUPTIBLE) without
    calling schedule(), the second condition will be met and vruntime will
    be incorrectly adjusted twice.

    In certain cases this can result in the task's vruntime never increasing
    past the vruntime of other tasks on the CFS' run queue, starving them of
    CPU time.

    This patch changes switched_from_fair() to use !p->on_rq instead of
    !se->on_rq.

    I'm able to cause a task with a priority of 120 to starve all other
    tasks with the same priority on an ARM platform running 3.2.51-rt72
    PREEMPT RT by writing one character at time to a serial tty (16550 UART)
    in a tight loop. I'm also able to verify making this change corrects the
    problem on that platform and kernel version.
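
    A sketch of the change (the guard around the vruntime normalization in
    switched_from_fair(); surrounding code abridged):

    /* was: if (!se->on_rq && p->state != TASK_RUNNING) */
    if (!p->on_rq && p->state != TASK_RUNNING) {
            /* keying off p->on_rq avoids adjusting vruntime a second time
             * for a task that set TASK_INTERRUPTIBLE but never scheduled */
            se->vruntime -= cfs_rq_of(se)->min_vruntime;
    }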

    Signed-off-by: George McCollister
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
    Signed-off-by: Ingo Molnar

    George McCollister
     
  • We hit one rare case below:

    T1 calls disable_irq() but hangs at synchronize_irq() forever;
    the corresponding irq thread is sleeping;
    and all CPUs are idle.

    After analysis, we found one possible scenario which causes T1 to
    wait there forever:
    CPU0                                     CPU1
    synchronize_irq()
      wait_event()
        spin_lock()
                                             atomic_dec_and_test(&threads_active)
        insert the __wait into queue
        spin_unlock()
                                             if (waitqueue_active())
        atomic_read(&threads_active)
                                             wake_up()

    Here, after the __wait entry has been inserted into the queue on CPU0,
    and before CPU1 tests whether the queue is empty, there is no barrier,
    so the update may not yet be visible to CPU1 even though CPU0 has
    already modified the queue list.
    The same applies to CPU0's atomic_read() of threads_active.

    So we would need an smp_mb() before waitqueue_active(), but removing
    the waitqueue_active() check solves it as well and makes things simple
    and clear.
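
    With the check removed, the wakeup side reduces to something like this
    (a sketch of the resulting helper in kernel/irq/manage.c):

    static void wake_threads_waitq(struct irq_desc *desc)
    {
            if (atomic_dec_and_test(&desc->threads_active))
                    wake_up(&desc->wait_for_threads);
    }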

    Signed-off-by: Chuansheng Liu
    Cc: Xiaoming Wang
    Link: http://lkml.kernel.org/r/1393212590-32543-1-git-send-email-chuansheng.liu@intel.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Chuansheng Liu
     

22 Feb, 2014

9 commits

  • In deadline class we do not have group scheduling like in RT.

    dl_nr_total is the same as dl_nr_running. So, one of them should
    be removed.

    Cc: Ingo Molnar
    Cc: Juri Lelli
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/368631392675853@web20h.yandex.ru
    Signed-off-by: Thomas Gleixner

    Kirill Tkhai
     
  • A hot-removed CPU may have an ID that is numerically larger than the number of
    existing CPUs in the system (e.g. we can unplug CPU 4 from a system that
    has CPUs 0, 1 and 4).

    Thus the WARN_ONs should check whether the CPU in question is currently
    present, not whether its ID value is less than num_present_cpus().
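
    A sketch of the corrected test (illustrative):

    /* A hot-removed CPU can keep an id >= num_present_cpus(), so test
     * presence directly instead of comparing against the count. */
    WARN_ON(!cpu_present(cpu));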

    Cc: Ingo Molnar
    Cc: Juri Lelli
    Cc: Steven Rostedt
    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Boris Ostrovsky
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392646353-1874-1-git-send-email-boris.ostrovsky@oracle.com
    Signed-off-by: Thomas Gleixner

    Boris Ostrovsky
     
  • Because of a recent syscall design debate, it's deemed appropriate for
    each syscall to have a flags argument for future extension, without
    immediately requiring new syscalls.
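
    Since this entry sits among the scheduler fixes, the syscalls in
    question are presumably the newly added sched_setattr()/sched_getattr();
    an illustrative sketch of the flags-extended userspace prototypes:

    /* The extra 'flags' argument is reserved for future extension and is
     * currently expected to be zero. */
    int sched_setattr(pid_t pid, struct sched_attr *attr, unsigned int flags);
    int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size,
                      unsigned int flags);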

    Cc: juri.lelli@gmail.com
    Cc: Ingo Molnar
    Suggested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We're copying the on-stack structure to userspace, but forgot to give
    the right number of bytes to copy. This allows the calling process to
    obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent
    kernel memory).

    This fix copies only as much as we actually have on the stack
    (attr->size defaults to the size of the struct) and leaves the rest of
    the userspace-provided buffer untouched.

    Found using kmemcheck + trinity.
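
    A sketch of the fix described (in the helper that copies struct
    sched_attr back to userspace; context abridged):

    /* Copy back only what was actually filled in on the kernel stack
     * (attr->size, which defaults to sizeof(*attr)), rather than the
     * possibly much larger, userspace-supplied length. */
    if (copy_to_user(uattr, attr, attr->size))
            return -EFAULT;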

    Fixes: d50dde5a10f30 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI")
    Cc: Dario Faggioli
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Signed-off-by: Vegard Nossum
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com
    Signed-off-by: Thomas Gleixner

    Vegard Nossum
     
  • Normally task_numa_work scans over a fairly small amount of memory,
    but it is possible to run into a large unpopulated part of virtual
    memory, with no pages mapped. In that case, task_numa_work can run
    for a while, and it may make sense to reschedule as required.
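
    A minimal sketch of the change described, assuming the per-VMA scan loop
    in task_numa_work() simply gains a cond_resched():

    /* inside the scan loop of task_numa_work(), context abridged */
    cond_resched();         /* yield if needed while walking large, unpopulated ranges */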

    Cc: akpm@linux-foundation.org
    Cc: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392761566-24834-2-git-send-email-riel@redhat.com
    Signed-off-by: Thomas Gleixner

    Rik van Riel
     
  • Fix this lockdep warning:

    [ 44.804600] =========================================================
    [ 44.805746] [ INFO: possible irq lock inversion dependency detected ]
    [ 44.805746] 3.14.0-rc2-test+ #14 Not tainted
    [ 44.805746] ---------------------------------------------------------
    [ 44.805746] bash/3674 just changed the state of lock:
    [ 44.805746] (&dl_b->lock){+.....}, at: [] sched_rt_handler+0x132/0x248
    [ 44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [ 44.805746] (&rq->lock){-.-.-.}

    and interrupts could create inverse lock ordering between them.

    [ 44.805746]
    [ 44.805746] other info that might help us debug this:
    [ 44.805746] Possible interrupt unsafe locking scenario:
    [ 44.805746]
    [ 44.805746]        CPU0                    CPU1
    [ 44.805746]        ----                    ----
    [ 44.805746]   lock(&dl_b->lock);
    [ 44.805746]                                local_irq_disable();
    [ 44.805746]                                lock(&rq->lock);
    [ 44.805746]                                lock(&dl_b->lock);
    [ 44.805746]   <Interrupt>
    [ 44.805746]     lock(&rq->lock);

    by making dl_b->lock acquisition always IRQ safe.
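
    A sketch of the pattern this implies wherever dl_b->lock is taken
    (illustrative, not an exact diff):

    unsigned long flags;

    raw_spin_lock_irqsave(&dl_b->lock, flags);
    /* ... update the deadline bandwidth accounting ... */
    raw_spin_unlock_irqrestore(&dl_b->lock, flags);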

    Cc: Ingo Molnar
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
     
  • Don't compare sysctl_sched_rt_runtime against sysctl_sched_rt_period if
    the former is equal to RUNTIME_INF, otherwise disabling -rt bandwidth
    management (with CONFIG_RT_GROUP_SCHED=n) fails.
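
    A sketch of the guard described (in the sysctl validation path; exact
    context assumed):

    if (sysctl_sched_rt_runtime != RUNTIME_INF &&
        sysctl_sched_rt_runtime > sysctl_sched_rt_period)
            return -EINVAL;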

    Cc: Ingo Molnar
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392107067-19907-2-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
     
  • While debugging the crash with the bad nr_running accounting, I hit
    another bug where, after running my sched deadline test, I was getting
    failures to take a CPU offline. It was giving me a -EBUSY error.

    Adding a bunch of trace_printk()s around, I found that the cpu
    notifier that called sched_cpu_inactive() was returning a failure. The
    overflow value was coming up negative?

    Talking this over with Juri, the problem is that the total_bw update was
    supposed to be made by dl_overflow() which, during my tests, seemed to
    not be called. Adding more trace_printk()s, it wasn't that it wasn't
    called, but it exited out right away with the check of new_bw being
    equal to p->dl.dl_bw. The new_bw calculates the ratio between period and
    runtime. The bug is that if you set a deadline, you do not need to set
    a period if you plan on the period being equal to the deadline. That
    is, if period is zero and deadline is not, then the system call should
    set the period to be equal to the deadline. This is done elsewhere in
    the code.

    The fix is easy: check if the period is set, and if it is not, then use
    the deadline.
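
    A sketch of that fallback (where the bandwidth ratio is computed from
    the attr; exact location assumed):

    /* If no period was given, fall back to the deadline. */
    u64 period = attr->sched_period ?: attr->sched_deadline;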

    Cc: Juri Lelli
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home
    Signed-off-by: Thomas Gleixner

    Steven Rostedt
     
  • Rostedt writes:

    My test suite was locking up hard when enabling mmiotracer. This was due
    to the mmiotracer placing all but one CPU offline. I found this out
    when I was able to reproduce the bug with just my stress-cpu-hotplug
    test. This bug baffled me because it would not always trigger, and
    would only trigger on the first run after boot up. The
    stress-cpu-hotplug test would crash hard the first run, or never crash
    at all. But a new reboot may cause it to crash on the first run again.

    I spent all week bisecting this, as I couldn't find a consistent
    reproducer. I finally narrowed it down to the sched deadline patches,
    and even more peculiar, to the commit that added the sched
    deadline boot up self test to the latency tracer. Then it dawned on me
    to what the bug was.

    All it took was to run a task under sched deadline to screw up the CPU
    hot plugging. This explained why it would lock up only on the first run
    of the stress-cpu-hotplug test. The bug happened when the boot up self
    test of the schedule latency tracer would test a deadline task. The
    deadline task would corrupt something that would cause CPU hotplug to
    fail. If it didn't corrupt it, the stress test would always work
    (there's no other sched deadline tasks that would run to cause
    problems). If it did corrupt on boot up, the first test would lockup
    hard.

    I proved this theory by running my deadline test program on another box,
    and then run the stress-cpu-hotplug test, and it would now consistently
    lock up. I could run stress-cpu-hotplug over and over with no problem,
    but once I ran the deadline test, the next run of the
    stress-cpu-hotplug would lock hard.

    After adding lots of tracing to the code, I found the cause. The
    function tracer showed that migrate_tasks() was stuck in an infinite
    loop, where rq->nr_running never equaled 1 to break out of it. When I
    added a trace_printk() to see what that number was, it was 335 and
    never decrementing!

    Looking at the deadline code I found:

    static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
            dequeue_dl_entity(&p->dl);
            dequeue_pushable_dl_task(rq, p);
    }

    static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
            update_curr_dl(rq);
            __dequeue_task_dl(rq, p, flags);

            dec_nr_running(rq);
    }

    And this:

    if (dl_runtime_exceeded(rq, dl_se)) {
            __dequeue_task_dl(rq, curr, 0);
            if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
                    dl_se->dl_throttled = 1;
            else
                    enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);

            if (!is_leftmost(curr, &rq->dl))
                    resched_task(curr);
    }

    Notice how we call __dequeue_task_dl() and in the else case we
    call enqueue_task_dl()? Also notice that the dequeue call has
    underscores (__dequeue_task_dl()) where the enqueue call does not.
    enqueue_task_dl() calls inc_nr_running(rq), but __dequeue_task_dl()
    does not. This is where we get nr_running out of sync.

    [snip]

    Another point where nr_running can get out of sync is when the dl_timer
    fires:

    dl_se->dl_throttled = 0;
    if (p->on_rq) {
            enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
            if (task_has_dl_policy(rq->curr))
                    check_preempt_curr_dl(rq, p, 0);
            else
                    resched_task(rq->curr);

    This patch does two things:

    - correctly accounts for throttled tasks (that are now considered
    !running);

    - fixes the bug, updating nr_running from {inc,dec}_dl_tasks(),
    since we risk updating it twice in some situations (e.g., a
    task is dequeued while it has exceeded its budget); see the sketch
    below.
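
    A rough sketch of that second point, moving the nr_running accounting
    into the dl task counters (context abridged and assumed):

    static inline void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
    {
            dl_rq->dl_nr_running++;
            inc_nr_running(rq_of_dl_rq(dl_rq));     /* nr_running now follows throttling */
            /* ... migration bookkeeping elided ... */
    }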

    Cc: mingo@redhat.com
    Cc: torvalds@linux-foundation.org
    Cc: akpm@linux-foundation.org
    Reported-by: Steven Rostedt
    Reviewed-by: Steven Rostedt
    Tested-by: Steven Rostedt
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392884379-13744-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
     

21 Feb, 2014

3 commits

  • Pull cgroup fixes from Tejun Heo:
    "Quite a few fixes this time.

    Three locking fixes, all marked for -stable. A couple error path
    fixes and some misc fixes. Hugh found a bug in memcg offlining
    sequence and we thought we could fix that from cgroup core side but
    that turned out to be insufficient and got reverted. A different fix
    has been applied to -mm"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: update cgroup_enable_task_cg_lists() to grab siglock
    Revert "cgroup: use an ordered workqueue for cgroup destruction"
    cgroup: protect modifications to cgroup_idr with cgroup_mutex
    cgroup: fix locking in cgroup_cfts_commit()
    cgroup: fix error return from cgroup_create()
    cgroup: fix error return value in cgroup_mount()
    cgroup: use an ordered workqueue for cgroup destruction
    nfs: include xattr.h from fs/nfs/nfs3proc.c
    cpuset: update MAINTAINERS entry
    arm, pm, vmpressure: add missing slab.h includes

    Linus Torvalds
     
  • Pull workqueue fixes from Tejun Heo:
    "Two workqueue fixes. One for an unlikely but possible critical bug
    during kworker shutdown and the other to make lockdep names a bit more
    descriptive"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: ensure @task is valid across kthread_stop()
    workqueue: add args to workqueue lockdep name

    Linus Torvalds
     
  • Signed-off-by: Brian Campbell
    Signed-off-by: Linus Torvalds

    Brian Campbell
     

20 Feb, 2014

1 commit

  • The generic sched_clock registration function was previously
    done lockless, due to the fact that it was expected to be called
    only once. However, now there are systems that may register
    multiple sched_clock sources, for which the lack of locking has
    caused problems:

    If two sched_clock sources are registered we may end up in a
    situation where a call to sched_clock() may be accessing the
    epoch cycle count for the old counter and the cycle count for the
    new counter. This can lead to confusing results where
    sched_clock() values jump and then are reset to 0 (due to the way
    the registration function forces the epoch_ns to be 0).

    Fix this by reorganizing the registration function to hold the
    seqlock for as short a time as possible while we update the
    clock_data structure for a new counter. We also put any
    accumulated time into epoch_ns instead of resetting the time to
    0 so that the clock doesn't reset after each successful
    registration.

    [jstultz: Added extra context to the commit message]

    Reported-by: Will Deacon
    Signed-off-by: Stephen Boyd
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Josh Cartwright
    Link: http://lkml.kernel.org/r/1392662736-7803-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: John Stultz
    Signed-off-by: Thomas Gleixner

    Stephen Boyd
     

19 Feb, 2014

2 commits

  • Currently, there's nothing preventing cgroup_enable_task_cg_lists()
    from missing a freshly set PF_EXITING and racing against cgroup_exit().
    Depending on the timing, cgroup_exit() may finish with the task still
    linked on
    css_set leading to list corruption. Fix it by grabbing siglock in
    cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be
    visible.

    This whole on-demand cg_list optimization is extremely fragile and has
    ample possibility to lead to bugs which can cause things like
    once-a-year oops during boot. I'm wondering whether the better
    approach would be just adding "cgroup_disable=all" handling which
    disables the whole cgroup rather than tempting fate with this
    on-demand craziness.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: stable@vger.kernel.org

    Tejun Heo
     
  • When a kworker should die, the kworker is notified through the WORKER_DIE
    flag instead of kthread_should_stop(). This, IIRC, is primarily to
    keep the test synchronized inside the worker_pool lock. WORKER_DIE is
    first set while holding pool->lock, the lock is dropped and
    kthread_stop() is called.

    Unfortunately, this means that there's a slight chance that the target
    kworker may see WORKER_DIE before kthread_stop() finishes and exits
    and frees the target task before or during kthread_stop().

    Fix it by pinning the target task before setting WORKER_DIE and
    putting it after kthread_stop() is done.

    tj: Improved patch description and comment. Moved pinning above
    WORKER_DIE to better signify what it's protecting.
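
    A sketch of the destroy-side sequence described (context and locking
    abridged; illustrative only):

    struct task_struct *task = worker->task;

    get_task_struct(task);          /* pin before WORKER_DIE becomes visible */
    /* set WORKER_DIE under pool->lock, drop the lock, then: */
    kthread_stop(task);
    put_task_struct(task);          /* safe even if the worker has already exited */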

    CC: stable@vger.kernel.org
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

18 Feb, 2014

2 commits

  • My rework of handling of notification events (namely commit 7053aee26a35
    "fsnotify: do not share events between notification groups") broke
    sending of cookies with inotify events. We didn't propagate the value
    passed to fsnotify() properly and passed 4 uninitialized bytes to
    userspace instead (so it is also an information leak). Sadly I didn't
    notice this during my testing because inotify cookies aren't used very
    much and LTP inotify tests ignore them.

    Fix the problem by passing the cookie value properly.

    Fixes: 7053aee26a3548ebaba046ae2e52396ccf56ac6c
    Reported-by: Vegard Nossum
    Signed-off-by: Jan Kara

    Jan Kara
     
  • This is not a buffer overflow in the traditional sense: we don't
    overflow any *kernel* buffers, but we do mis-count the amount of data we
    copy back to user space for the SYSLOG_ACTION_READ_ALL case.

    In particular, if the user buffer is too small to hold everything, and
    *if* there is a continuation line at just the right place, we can end up
    giving the user more data than he asked for.

    The reason is that we first count up the number of bytes all the log
    records contain, then we walk the records again until we've skipped the
    records at the beginning that won't fit, and then we walk the rest of
    the records and copy them to the user space buffer.

    And in between that "skip the initial records that won't fit" and the
    "copy the records that *will* fit to user space", we reset the 'prev'
    variable that contained the record information for the last record not
    copied. That meant that when we started copying to user space, we now
    had a different character count than what we had originally calculated
    in the first record walk-through.

    The fix is to simply not clear the 'prev' flags value (in both places
    where we had the same logic: syslog_print_all and kmsg_dump_get_buffer;
    the latter is used for pstore-like dumping).

    Reported-and-tested-by: Debabrata Banerjee
    Acked-by: Kay Sievers
    Cc: Greg Kroah-Hartman
    Cc: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Feb, 2014

1 commit

  • …el.org/pub/scm/linux/kernel/git/tip/tip

    Pull irq update from Thomas Gleixner:
    "Fix from the urgent branch: a trivial oneliner adding the missing
    Kconfig dependency curing build failures which have been discovered by
    several build robots.

    The update in the irq-core branch provides a new function in the
    irq/devres code, which is a prerequisite for driver developers to get
    rid of boilerplate code all over the place.

    Not a bugfix, but it has zero impact on the current kernel due to the
    lack of users. It's simpler to provide the infrastructure to
    interested parties via your tree than fulfilling the wishlist of
    driver maintainers on which particular commit or tag this should be
    based on"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Add missing irq_to_desc export for CONFIG_SPARSE_IRQ=n

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Add devm_request_any_context_irq()

    Linus Torvalds