03 Aug, 2018

1 commit

  • commit 3e536e222f2930534c252c1cc7ae799c725c5ff9 upstream.

    There is a window for racing when printing directly to task->comm,
    allowing other threads to see a non-terminated string. The vsnprintf
    function fills the buffer, counts the truncated chars, then finally
    writes the \0 at the end.

    creator                          other
    vsnprintf:
      fill (not terminated)
      count the rest                 trace_sched_waking(p):
      ...                              memcpy(comm, p->comm, TASK_COMM_LEN)
      write \0

    The consequences depend on how 'other' uses the string. In our case,
    it was copied into the tracing system's saved cmdlines, a buffer of
    adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be):

    crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk'
    0xffffffd5b3818640: "irq/497-pwr_evenkworker/u16:12"

    ...and a strcpy out of there would cause stack corruption:

    [224761.522292] Kernel panic - not syncing: stack-protector:
    Kernel stack is corrupted in: ffffff9bf9783c78

    crash-arm64> kbt | grep 'comm\|trace_print_context'
    #6 0xffffff9bf9783c78 in trace_print_context+0x18c(+396)
    comm (char [16]) = "irq/497-pwr_even"

    crash-arm64> rd 0xffffffd4d0e17d14 8
    ffffffd4d0e17d14: 2f71726900000000 5f7277702d373934 ....irq/497-pwr_
    ffffffd4d0e17d24: 726f776b6e657665 3a3631752f72656b evenkworker/u16:
    ffffffd4d0e17d34: f9780248ff003231 cede60e0ffffff9b 12..H.x......`..
    ffffffd4d0e17d44: cede60c8ffffffd4 00000fffffffffd4 .....`..........

    The workaround in e09e28671 (use strlcpy in __trace_find_cmdline) was
    likely needed because of this same bug.

    Solved by using vsnprintf() to format into a local buffer, then
    publishing it with set_task_comm(). This way, there won't be a window
    where comm is not terminated.
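
    A minimal sketch of that pattern (illustrative variable names; this is
    the shape the description above implies, not a verbatim quote of the
    patch):

        char buf[TASK_COMM_LEN];
        va_list args;

        va_start(args, namefmt);
        /* Format into a private buffer: a half-written string is never
         * visible to other threads. */
        vsnprintf(buf, sizeof(buf), namefmt, args);
        va_end(args);
        /* set_task_comm() publishes the fully terminated string under
         * the task lock. */
        set_task_comm(task, buf);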

    Link: http://lkml.kernel.org/r/20180726071539.188015-1-snild@sony.com

    Cc: stable@vger.kernel.org
    Fixes: bc0c38d139ec7 ("ftrace: latency tracer infrastructure")
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Snild Dolkow
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Snild Dolkow
     

21 Jun, 2018

1 commit

  • [ Upstream commit 741a76b350897604c48fb12beff1c9b77724dc96 ]

    Gaurav reported a problem with __kthread_parkme() where a concurrent
    try_to_wake_up() could result in competing stores to ->state; when the
    TASK_PARKED store got lost, bad things would happen.

    The comment near set_current_state() actually mentions this competing
    store, but only mentions the case against TASK_RUNNING. This same
    store, with different timing, can happen against a subsequent !RUNNING
    store.

    This normally is not a problem, because as per that same comment, the
    !RUNNING state store is inside a condition based wait-loop:

    for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!need_sleep)
                    break;
            schedule();
    }
    __set_current_state(TASK_RUNNING);

    If we lose the (first) TASK_UNINTERRUPTIBLE store to a previous
    (concurrent) wakeup, the schedule() will NO-OP and we'll go around the
    loop once more.

    The problem here is that the TASK_PARKED store is not inside the
    KTHREAD_SHOULD_PARK condition wait-loop.

    There is a genuine issue with sleeps that do not have a condition;
    this is addressed in a subsequent patch.
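
    A sketch of the corrected shape, with the TASK_PARKED store inside the
    condition wait-loop (simplified; the completion handshake is elided):

        for (;;) {
                set_current_state(TASK_PARKED);
                if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
                        break;
                /* A lost TASK_PARKED store is now recovered: schedule()
                 * no-ops and we go around the loop once more. */
                schedule();
        }
        __set_current_state(TASK_RUNNING);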

    Reported-by: Gaurav Kohli
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

01 Sep, 2017

1 commit

  • If the worker thread continues getting work, it will hog the CPU and
    trigger RCU stall complaints. Make it a good citizen. This is triggered in a loop
    block device test.
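
    A sketch of the fix, assuming the main loop of kthread_worker_fn()
    (simplified):

        repeat:
                /* ... pick the next work item under worker->lock ... */
                if (work) {
                        __set_current_state(TASK_RUNNING);
                        work->func(work);
                } else if (!freezing(current))
                        schedule();

                try_to_freeze();
                cond_resched();   /* be a good citizen between work items */
                goto repeat;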

    Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Cc: Petr Mladek
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

17 Mar, 2017

1 commit

  • Creation of a kthread goes through a couple interlocked stages between
    the kthread itself and its creator. Once the new kthread starts
    running, it initializes itself and wakes up the creator. The creator
    then can further configure the kthread and then let it start doing its
    job by waking it up.

    In this configuration-by-creator stage, the creator is the only one
    that can wake it up but the kthread is visible to userland. When
    altering the kthread's attributes from userland is allowed, this is
    fine; however, for cases where CPU affinity is critical,
    kthread_bind() is used to first disable affinity changes from userland
    and then set the affinity. This also prevents the kthread from being
    migrated into non-root cgroups as that can affect the CPU affinity and
    many other things.

    Unfortunately, the cgroup side of protection is racy. While the
    PF_NO_SETAFFINITY flag prevents further migrations, userland can win
    the race before the creator sets the flag with kthread_bind() and put
    the kthread in a non-root cgroup, which can lead to all sorts of
    problems including incorrect CPU affinity and starvation.

    This bug got triggered by userland which periodically tries to migrate
    all processes in the root cpuset cgroup to a non-root one. Per-cpu
    workqueue workers got caught while being created and ended up with
    incorrect CPU affinity, breaking concurrency management and sometimes
    stalling workqueue execution.

    This patch adds task->no_cgroup_migration, which disallows the task
    from being migrated by userland. kthreadd starts with the flag set, making
    every child kthread start in the root cgroup with migration
    disallowed. The flag is cleared after the kthread finishes
    initialization by which time PF_NO_SETAFFINITY is set if the kthread
    should stay in the root cgroup.

    It'd be better to wait for the initialization instead of failing, but I
    couldn't think of a way of implementing that without adding either a
    new PF flag or sleeping and retrying from the waiting side. Even if
    userland depends on changing cgroup membership of a kthread, it either
    has to be synchronized with kthread_create() or periodically repeat,
    so it's unlikely that this would break anything.

    v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.
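
    Roughly, per the description above, the new bit and its use look like
    this (a sketch; the exact field placement and the check site in the
    cgroup code are assumptions):

        struct task_struct {
                /* ... */
                /* disallow userland-initiated cgroup migration */
                unsigned                no_cgroup_migration:1;
                /* ... */
        };

        /* In the cgroup attach path, a userland-requested migration is
         * refused while the kthread is still initializing: */
        if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY))
                return -EINVAL;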

    Signed-off-by: Tejun Heo
    Suggested-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra (Intel)
    Cc: Thomas Gleixner
    Reported-and-debugged-by: Chris Mason
    Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
    Signed-off-by: Tejun Heo

    Tejun Heo
     

10 Feb, 2017

1 commit

  • Currently CONFIG_TIMER_STATS exposes process information across namespaces:

    kernel/time/timer_list.c print_timer():

    SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);

    /proc/timer_list:

    #11: , hrtimer_wakeup, S:01, do_nanosleep, cron/2570

    Given that the tracer can give the same information, this patch entirely
    removes CONFIG_TIMER_STATS.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Kees Cook
    Acked-by: John Stultz
    Cc: Nicolas Pitre
    Cc: linux-doc@vger.kernel.org
    Cc: Lai Jiangshan
    Cc: Shuah Khan
    Cc: Xing Gao
    Cc: Jonathan Corbet
    Cc: Jessica Frazelle
    Cc: kernel-hardening@lists.openwall.com
    Cc: Nicolas Iooss
    Cc: "Paul E. McKenney"
    Cc: Petr Mladek
    Cc: Richard Cochran
    Cc: Tejun Heo
    Cc: Michal Marek
    Cc: Josh Poimboeuf
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Olof Johansson
    Cc: Andrew Morton
    Cc: linux-api@vger.kernel.org
    Cc: Arjan van de Ven
    Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
    Signed-off-by: Thomas Gleixner

    Kees Cook
     

13 Dec, 2016

1 commit

  • When commit fbae2d44aa1d ("kthread: add kthread_create_worker*()")
    introduced some kthread_create_...() functions taking printf-like
    parameters, it added __printf attributes to some of them
    (e.g. kthread_create_worker()). Nevertheless, some new functions were
    forgotten (they have been detected thanks to the
    -Wmissing-format-attribute warning flag).

    Add the missing __printf attributes to the newly-introduced functions in
    order to detect formatting issues at build-time with -Wformat flag.
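
    For illustration, the annotation on one of the affected declarations
    looks roughly like this (the exact prototypes live in
    include/linux/kthread.h):

        /* __printf(3, 4): namefmt is the 3rd argument and the matching
         * variadic arguments start at position 4. */
        __printf(3, 4)
        struct kthread_worker *
        kthread_create_worker_on_cpu(int cpu, unsigned int flags,
                                     const char namefmt[], ...);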

    Link: http://lkml.kernel.org/r/20161126193543.22672-1-nicolas.iooss_linux@m4x.org
    Signed-off-by: Nicolas Iooss
    Reviewed-by: Petr Mladek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     

08 Dec, 2016

5 commits

  • kthread_create_on_cpu() sets KTHREAD_IS_PER_CPU and kthread->cpu, this
    only makes sense if this kthread can be parked/unparked by cpuhp code.
    kthread workers never call kthread_parkme() so this has no effect.

    Change __kthread_create_worker() to simply call kthread_bind(task, cpu).
    The very fact that kthread_create_on_cpu() doesn't accept a generic fmt
    shows that it should not be used outside of smpboot.c.

    Now, the only reason we can not unexport this helper and move it into
    smpboot.c is that it sets kthread->cpu and struct kthread is not exported.
    And the only reason we can not kill kthread->cpu is that kthread_unpark()
    is used by drivers/gpu/drm/amd/scheduler/gpu_scheduler.c and thus we can
    not turn _unpark into kthread_unpark(struct smp_hotplug_thread *, cpu).

    Signed-off-by: Oleg Nesterov
    Tested-by: Petr Mladek
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Petr Mladek
    Cc: Chunming Zhou
    Cc: Roman Pen
    Cc: Andy Lutomirski
    Cc: Tejun Heo
    Cc: Andy Lutomirski
    Cc: Alex Deucher
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20161129175110.GA5342@redhat.com
    Signed-off-by: Thomas Gleixner

    Oleg Nesterov
     
  • Now that to_kthread() is always valid, change kthread_park() and
    kthread_unpark() to use it and kill to_live_kthread().

    The conversion of kthread_unpark() is trivial. If KTHREAD_IS_PARKED is set
    then the task has called complete(&self->parked), and therefore the
    function cannot race against a concurrent kthread_stop() and exit.

    kthread_park() is more tricky, because its semantics are not well
    defined. It returns -ENOSYS if the thread exited, but this can never
    happen; and, as Roman pointed out, kthread_park() can obviously block
    forever if it races with the exiting kthread.

    The usage of kthread_park() in cpuhp code (cpu.c, smpboot.c, stop_machine.c)
    is fine. It can never see an exiting/exited kthread, smpboot_destroy_threads()
    clears *ht->store, smpboot_park_thread() checks it is not NULL under the same
    smpboot_threads_lock. cpuhp_threads and cpu_stop_threads never exit, so other
    callers are fine too.

    But it has two more users:

    - watchdog_park_threads():

    The code is actually correct, get_online_cpus() ensures that
    kthread_park() can't race with itself (note that kthread_park() can't
    handle this race correctly), but it should not use kthread_park()
    directly.

    - drivers/gpu/drm/amd/scheduler/gpu_scheduler.c should not use
    kthread_park() either.

    kthread_park() must not be called after amd_sched_fini() which does
    kthread_stop(), otherwise even to_live_kthread() is not safe because
    task_struct can be already freed and sched->thread can point to nowhere.

    The usage of kthread_park/unpark should either be restricted to core code
    which is properly protected against the exit race or made more robust so it
    is safe to use it in drivers.

    To catch eventual exit issues, add a WARN_ON(PF_EXITING) for now.
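
    A sketch of the converted kthread_park() after this change (simplified;
    the already-parked check is elided):

        int kthread_park(struct task_struct *k)
        {
                struct kthread *kthread = to_kthread(k);

                /* Catch callers racing with the exiting kthread. */
                if (WARN_ON(k->flags & PF_EXITING))
                        return -ENOSYS;

                set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
                if (k != current) {
                        wake_up_process(k);
                        wait_for_completion(&kthread->parked);
                }
                return 0;
        }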

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Chunming Zhou
    Cc: Roman Pen
    Cc: Petr Mladek
    Cc: Andy Lutomirski
    Cc: Tejun Heo
    Cc: Andy Lutomirski
    Cc: Alex Deucher
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20161129175107.GA5339@redhat.com
    Signed-off-by: Thomas Gleixner

    Oleg Nesterov
     
  • kthread_stop() had to use to_live_kthread() simply because it was not
    possible to access kthread->exited after the exiting task clears
    task_struct->vfork_done. Now that to_kthread() is always valid,
    wake_up_process() + wait_for_completion() can be done
    unconditionally. It's not an issue anymore if the task has already issued
    complete_vfork_done() or died.

    The exiting task can get the spurious wakeup after mm_release() but this is
    possible without this change too and is fine; do_task_dead() ensures that
    this can't do any harm.

    As a further enhancement this could be converted to task_work_add() later,
    so ->vfork_done can be avoided completely.
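
    A sketch of the resulting kthread_stop() flow (simplified):

        int kthread_stop(struct task_struct *k)
        {
                struct kthread *kthread = to_kthread(k);
                int ret;

                get_task_struct(k);
                set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
                kthread_unpark(k);
                /* Unconditional now: struct kthread is kmalloc'ed, so
                 * kthread->exited stays valid even after the task dies. */
                wake_up_process(k);
                wait_for_completion(&kthread->exited);
                ret = k->exit_code;
                put_task_struct(k);
                return ret;
        }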

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Chunming Zhou
    Cc: Roman Pen
    Cc: Petr Mladek
    Cc: Andy Lutomirski
    Cc: Tejun Heo
    Cc: Andy Lutomirski
    Cc: Alex Deucher
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20161129175103.GA5336@redhat.com
    Signed-off-by: Thomas Gleixner

    Oleg Nesterov
     
  • Revert "kthread: Pin the stack via try_get_task_stack() /
    put_task_stack() in to_live_kthread() function"

    This reverts commit 23196f2e5f5d810578a772785807dcdc2b9fdce9.

    Now that struct kthread is kmalloc'ed and no longer on the task stack
    there is no need anymore to pin the stack.

    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Chunming Zhou <David1.Zhou@amd.com>
    Cc: Roman Pen <roman.penyaev@profitbricks.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Andy Lutomirski <luto@amacapital.net>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Link: http://lkml.kernel.org/r/20161129175100.GA5333@redhat.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

    Oleg Nesterov
     
  • commit 23196f2e5f5d "kthread: Pin the stack via try_get_task_stack() /
    put_task_stack() in to_live_kthread() function" is a workaround for the
    fragile design of struct kthread being allocated on the task stack.

    struct kthread in its current form should be removed, but this needs
    cleanups outside of kthread.c.

    As a first step move struct kthread away from the task stack by making it
    kmalloc'ed. This allows accessing kthread.exited without the magic of
    trying to pin the task stack and the try logic in to_live_kthread().

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Chunming Zhou
    Cc: Roman Pen
    Cc: Petr Mladek
    Cc: Andy Lutomirski
    Cc: Tejun Heo
    Cc: Andy Lutomirski
    Cc: Alex Deucher
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20161129175057.GA5330@redhat.com
    Signed-off-by: Thomas Gleixner

    Oleg Nesterov
     

12 Oct, 2016

11 commits

  • This patch allows making a kthread worker freezable via a new @flags
    parameter. It will allow avoiding an init work in some kthreads.

    It currently does not affect the function of kthread_worker_fn(),
    but it might help with some optimizations or fixes eventually.

    I currently do not know about any other use for the @flags
    parameter but I believe that we will want more flags
    in the future.

    Finally, I hope that it will not cause confusion with @flags member
    in struct kthread. Well, I guess that we will want to rework the
    basic kthreads implementation once all kthreads are converted into
    kthread workers or workqueues. It is possible that we will merge
    the two structures.
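
    A hypothetical use of the new parameter:

        /* The worker's kthread is freezable from the start; no init
         * work calling set_freezable() is needed. */
        worker = kthread_create_worker(KTW_FREEZABLE, "my_worker");
        if (IS_ERR(worker))
                return PTR_ERR(worker);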

    Link: http://lkml.kernel.org/r/1470754545-17632-12-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • There are situations when we need to modify the delay of a delayed kthread
    work. For example, when the work depends on an event and the initial delay
    means a timeout. Then we want to queue the work immediately when the event
    happens.

    This patch implements kthread_mod_delayed_work(), inspired by workqueues.
    It cancels the timer, removes the work from any worker list and queues it
    again with the given timeout.

    A very special case is when the work is being canceled at the same time.
    It might happen because of the regular kthread_cancel_delayed_work_sync()
    or by another kthread_mod_delayed_work(). In this case, we do nothing and
    let the other operation win. This should not normally happen as the caller
    is supposed to synchronize these operations in a reasonable way.
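
    A hypothetical usage of the timeout pattern described above (the ctx
    structure and timeout_fn are illustrative):

        /* Arm the work as a timeout... */
        kthread_init_delayed_work(&ctx->timeout_work, timeout_fn);
        kthread_queue_delayed_work(worker, &ctx->timeout_work,
                                   msecs_to_jiffies(5000));

        /* ...and when the awaited event arrives, run it immediately. */
        kthread_mod_delayed_work(worker, &ctx->timeout_work, 0);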

    Link: http://lkml.kernel.org/r/1470754545-17632-11-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • We are going to use kthread workers more widely and sometimes we will need
    to make sure that the work is neither pending nor running.

    This patch implements cancel_*_sync() operations as inspired by
    workqueues. Well, we are synchronized against the other operations via
    the worker lock, and we use del_timer_sync() and a counter of parallel
    cancel operations. Therefore the implementation might be simpler.

    First, we check if a worker is assigned. If not, the work has never been
    queued after it was initialized.

    Second, we take the worker lock. It must be the right one. The work must
    not be assigned to another worker unless it is initialized in between.

    Third, we try to cancel the timer when it exists. The timer is deleted
    synchronously to make sure that the timer callback is not running. We
    need to temporarily release the worker->lock to avoid a possible deadlock
    with the callback. In the meantime, we set the work->canceling counter to
    avoid any queuing.

    Fourth, we try to remove the work from a worker list. It might be
    the list of either normal or delayed works.

    Fifth, if the work is running, we call kthread_flush_work(). It might
    take an arbitrary time. We need to release the worker->lock again. In the
    meantime, we again block any queuing via the canceling counter.

    As already mentioned, the check for a pending kthread work is done under a
    lock. Compared with workqueues, we do not need to fight for a single
    PENDING bit to block other operations. Therefore we do not suffer from
    the thundering herd problem, and all parallel canceling jobs might use
    kthread_flush_work(). Any queuing is blocked until the counter drops to zero.
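
    A sketch of the canceling-counter dance at the heart of the third and
    fifth steps (simplified; the timer and list handling are elided):

        spin_lock_irqsave(&worker->lock, flags);
        work->canceling++;        /* any queuing is refused from now on */
        spin_unlock_irqrestore(&worker->lock, flags);

        kthread_flush_work(work); /* wait out a running instance */

        spin_lock_irqsave(&worker->lock, flags);
        work->canceling--;
        spin_unlock_irqrestore(&worker->lock, flags);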

    Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • We are going to use kthread_worker more widely and delayed works
    will be pretty useful.

    The implementation is inspired by workqueues. It uses a timer to queue
    the work after the requested delay. If the delay is zero, the work is
    queued immediately.

    Compared with workqueues, each work is associated with a single worker
    (kthread). Therefore the implementation could be much simpler. In
    particular, we use the worker->lock to synchronize all the operations with
    the work. We do not need any atomic operation with a flags variable.

    In fact, we do not need any state variable at all. Instead, we add a list
    of delayed works into the worker. Then the pending work is listed either
    in the list of queued or delayed works. And the existing check of pending
    works is the same even for the delayed ones.

    A work must not be assigned to another worker unless reinitialized.
    Therefore the timer handler might expect that dwork->work.worker is valid
    and it could simply take the lock. We just add some sanity checks to help
    with debugging a potential misuse.
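
    The resulting data structures look roughly like this (a sketch of the
    shape described above):

        struct kthread_delayed_work {
                struct kthread_work     work;
                struct timer_list       timer;  /* queues work after the delay */
        };

        struct kthread_worker {
                spinlock_t              lock;   /* serializes all operations */
                struct list_head        work_list;
                struct list_head        delayed_work_list;
                /* ... */
        };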

    Link: http://lkml.kernel.org/r/1470754545-17632-9-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • Nothing currently prevents a work from being queued for a kthread worker
    while it is already running on another one. This means that the work
    might run in parallel on more than one worker. Also, some operations,
    e.g. flush, are not reliable.

    This problem will be even more visible after we add kthread_cancel_work()
    function. It will only have "work" as the parameter and will use
    worker->lock to synchronize with others.

    Well, normally this is not a problem because the API users are sane.
    But bugs might happen and users also might be crazy.

    This patch adds a warning when we try to insert the work for another
    worker. It does not fully prevent the misuse because it would make the
    code much more complicated without a big benefit.

    It adds the same warning also into kthread_flush_work() instead of the
    repeated attempts to get the right lock.

    A side effect is that one needs to explicitly reinitialize the work if it
    must be queued into another worker. This is needed, for example, when the
    worker is stopped and started again. It is a bit inconvenient. But it
    looks like a good compromise between the stability and complexity.

    I have double checked all existing users of the kthread worker API and
    they all seem to initialize the work after the worker gets started.

    Just for completeness, the patch adds a check that the work is not already
    in a queue.

    The patch also puts all the checks into a separate function. It will be
    reused when implementing delayed works.

    Link: http://lkml.kernel.org/r/1470754545-17632-8-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • The current kthread worker users call flush() and stop() explicitly.
    This function does the same plus it frees the kthread_worker struct
    in one call.

    It is supposed to be used together with kthread_create_worker*() that
    allocates struct kthread_worker.
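
    A hypothetical pairing of the helpers (using the final form of
    kthread_create_worker(), including the @flags parameter added elsewhere
    in this series):

        struct kthread_worker *worker;

        worker = kthread_create_worker(0, "my_worker");
        if (IS_ERR(worker))
                return PTR_ERR(worker);

        kthread_queue_work(worker, &my_work);
        /* ... */
        kthread_destroy_worker(worker);  /* flush, stop and free in one call */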

    Link: http://lkml.kernel.org/r/1470754545-17632-7-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • Kthread workers are currently created using the classic kthread API,
    namely kthread_run(). kthread_worker_fn() is passed as the @threadfn
    parameter.

    This patch defines kthread_create_worker() and
    kthread_create_worker_on_cpu() functions that hide implementation details.

    They enforce using kthread_worker_fn() for the main thread. But I doubt
    that there are any plans to create any alternative. In fact, I think that
    we do not want any alternative main thread because it would be hard to
    support consistency with the rest of the kthread worker API.

    The naming and function of kthread_create_worker() is inspired by the
    workqueues API like the rest of the kthread worker API.

    The kthread_create_worker_on_cpu() variant is motivated by the original
    kthread_create_on_cpu(). Note that we need to bind per-CPU kthread
    workers already when they are created. It makes life easier;
    kthread_bind() cannot be used later on an already running worker.

    This patch does _not_ convert existing kthread workers. The kthread
    worker API needs more improvements first, e.g. a function to destroy the
    worker.

    IMPORTANT:

    kthread_create_worker_on_cpu() allows any format of the worker
    name, unlike kthread_create_on_cpu(). The good thing is that it
    is more generic. The bad thing is that most users will need to pass the
    cpu number in two parameters, e.g. kthread_create_worker_on_cpu(cpu,
    "helper/%d", cpu).

    To be honest, the main motivation was to avoid the need for an empty
    va_list. The only legal way was to create a helper function that would be
    called with an empty list. Other attempts caused compilation warnings or
    even errors on different architectures.

    There were also other alternatives, for example, using a #define or
    splitting __kthread_create_worker(). The solution used looked like the
    least ugly one.

    Link: http://lkml.kernel.org/r/1470754545-17632-6-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • kthread_create_on_node() implements a bunch of logic to create the
    kthread. It is already called by kthread_create_on_cpu().

    We are going to extend the kthread worker API and will need to call
    kthread_create_on_node() with va_list args there.

    This patch does only a refactoring and does not modify the existing
    behavior.

    Link: http://lkml.kernel.org/r/1470754545-17632-5-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • kthread_create_on_cpu() was added by the commit 2a1d446019f9a5983e
    ("kthread: Implement park/unpark facility"). It is currently used only
    when enabling a new CPU. For this purpose, the newly created kthread has to
    be parked.

    The CPU binding is a bit tricky. The kthread is parked when the CPU has
    not been allowed yet. And the CPU is bound when the kthread is unparked.

    The function would be useful for more per-CPU kthreads, e.g.
    bnx2fc_thread, fcoethread. For this purpose, the newly created kthread
    should stay in the uninterruptible state.

    This patch moves the parking into smpboot. It binds the thread already
    when created. Then the function might be used universally. Also the
    behavior is consistent with kthread_create() and kthread_create_on_node().

    Link: http://lkml.kernel.org/r/1470754545-17632-4-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Reviewed-by: Thomas Gleixner
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • A good practice is to prefix the names of functions by the name
    of the subsystem.

    The kthread worker API is a mix of classic kthreads and workqueues. Each
    worker has a dedicated kthread. It runs a generic function that processes
    queued works. It is implemented as part of the kthread subsystem.

    This patch renames the existing kthread worker API to use
    the corresponding name from the workqueues API prefixed by
    kthread_:

    __init_kthread_worker() -> __kthread_init_worker()
    init_kthread_worker() -> kthread_init_worker()
    init_kthread_work() -> kthread_init_work()
    insert_kthread_work() -> kthread_insert_work()
    queue_kthread_work() -> kthread_queue_work()
    flush_kthread_work() -> kthread_flush_work()
    flush_kthread_worker() -> kthread_flush_worker()

    Note that the names of DEFINE_KTHREAD_WORK*() macros stay
    as they are. It is common that the "DEFINE_" prefix has
    precedence over the subsystem names.

    Note that the INIT() macros and init() functions use different
    naming schemes. There is no good solution. There are several
    reasons for this choice:

    + "init" in the function names stands for the verb "initialize"
    aka "initialize worker". While "INIT" in the macro names
    stands for the noun "INITIALIZER" aka "worker initializer".

    + INIT() macros are used only in DEFINE() macros

    + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

    + There will be also kthread_destroy_worker() that will
    be used close to kthread_cancel_work(). It is related
    to the init() function. Again it looks better if all
    functions use the same naming scheme.

    + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(), regmap_init_mmio_clk().

    + It is not an argument but it was inconsistent even before.

    [arnd@arndb.de: fix linux-next merge conflict]
    Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
    Suggested-by: Andrew Morton
    Signed-off-by: Petr Mladek
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • Patch series "kthread: Kthread worker API improvements"

    The intention of this patchset is to make it easier to manipulate and
    maintain kthreads. Especially, I want to replace all the custom main
    cycles with a generic one. Also I want to make the kthreads sleep in a
    consistent state in a common place when there is no work.

    This patch (of 11):

    A good practice is to prefix the names of functions by the name of the
    subsystem.

    This patch fixes the name of probe_kthread_data(). The other wrong
    function names are part of the kthread worker API and will be fixed
    separately.

    Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Suggested-by: Andrew Morton
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

16 Sep, 2016

1 commit

  • get_task_struct(tsk) no longer pins tsk->stack so all users of
    to_live_kthread() should do try_get_task_stack/put_task_stack to protect
    "struct kthread" which lives on kthread's stack.

    TODO: Kill to_live_kthread(), perhaps we can even kill "struct kthread" too,
    and rework kthread_stop(); it can use task_work_add() to sync with the exiting
    kernel thread.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/cb9b16bbc19d4aea4507ab0552e4644c1211d130.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

05 Sep, 2015

1 commit

  • - Make it clear that the `node' arg refers to memory allocations only:
    kthread_create_on_node() does not pin the new thread to that node's
    CPUs.

    - Encourage the use of NUMA_NO_NODE.

    [nzimmer@sgi.com: use NUMA_NO_NODE in kthread_create() also]
    Cc: Nathan Zimmer
    Cc: Tejun Heo
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

01 Sep, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change in this cycle is the rewrite of the main SMP load
    balancing metric: the CPU load/utilization. The main goal was to make
    the metric more precise and more representative - see the changelog of
    this commit for the gory details:

    9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    It is done in a way that significantly reduces complexity of the code:

    5 files changed, 249 insertions(+), 494 deletions(-)

    and the performance testing results are encouraging. Nevertheless we
    need to keep an eye on potential regressions, since this potentially
    affects every SMP workload in existence.

    This work comes from Yuyang Du.

    Other changes:

    - SCHED_DL updates. (Andrea Parri)

    - Simplify architecture callbacks by removing finish_arch_switch().
    (Peter Zijlstra et al)

    - cputime accounting: guarantee stime + utime == rtime. (Peter
    Zijlstra)

    - optimize idle CPU wakeups some more - inspired by Facebook server
    loads. (Mike Galbraith)

    - stop_machine fixes and updates. (Oleg Nesterov)

    - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)

    - sched/numa tweaks. (Srikar Dronamraju)

    - misc fixes and small cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    sched/deadline: Fix comment in enqueue_task_dl()
    sched/deadline: Fix comment in push_dl_tasks()
    sched: Change the sched_class::set_cpus_allowed() calling context
    sched: Make sched_class::set_cpus_allowed() unconditional
    sched: Fix a race between __kthread_bind() and sched_setaffinity()
    sched: Ensure a task has a non-normalized vruntime when returning back to CFS
    sched/numa: Fix NUMA_DIRECT topology identification
    tile: Reorganize _switch_to()
    sched, sparc32: Update scheduler comments in copy_thread()
    sched: Remove finish_arch_switch()
    sched, tile: Remove finish_arch_switch
    sched, sh: Fold finish_arch_switch() into switch_to()
    sched, score: Remove finish_arch_switch()
    sched, avr32: Remove finish_arch_switch()
    sched, MIPS: Get rid of finish_arch_switch()
    sched, arm: Remove finish_arch_switch()
    sched/fair: Clean up load average references
    sched/fair: Provide runnable_load_avg back to cfs_rq
    sched/fair: Remove task and group entity load when they are dead
    sched/fair: Init cfs_rq's sched_entity load average
    ...

    Linus Torvalds
     

12 Aug, 2015

1 commit

  • Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
    without locks, a caller might observe an old value and race with the
    set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
    it:

    __kthread_bind()
      do_set_cpus_allowed()
                                      sched_setaffinity()
                                        if (p->flags & PF_NO_SETAFFINITY)
                                        set_cpus_allowed_ptr()
      p->flags |= PF_NO_SETAFFINITY

    Fix the bug by putting everything under the regular scheduler locks.

    This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dedekind1@gmail.com
    Cc: juri.lelli@arm.com
    Cc: mgorman@suse.de
    Cc: riel@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Aug, 2015

1 commit

  • The s-Par visornic driver, currently in staging, processes a queue being
    serviced by an s-Par service partition. We can get a message that
    something has happened with the service partition; when that happens, we
    must not access the channel until we get a message that the service
    partition is back again.

    The visornic driver has a thread for processing the channel; when we get
    the message, we need to be able to park the thread and then resume it
    when the problem clears.

    We can do this with kthread_park and unpark, but they are not exported
    from the kernel; this patch exports the needed functions.

    Signed-off-by: David Kershner
    Acked-by: Ingo Molnar
    Acked-by: Neil Horman
    Acked-by: Thomas Gleixner
    Cc: Richard Weinberger
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Kershner
     

10 Oct, 2014

1 commit

  • Partial revert of 81c98869faa5 ("kthread: ensure locality of
    task_struct allocations")

    After discussions with Tejun, we don't want to spread the use of
    cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
    callers, but would rather ensure the callees correctly handle memoryless
    nodes. With the previous patches ("topology: add support for
    node_to_mem_node() to determine the fallback node" and "slub: fallback to
    node_to_mem_node() node if allocating on memoryless node") adding and
    using node_to_mem_node(), we can safely undo part of the change to the
    kthread logic from 81c98869faa5.

    Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Anton Blanchard <anton@samba.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Nishanth Aravamudan
     

29 Jul, 2014

1 commit

  • If the worker is already executing a work item when another is queued,
    we can safely skip the wakeup without worrying about stalling the queue,
    thus avoiding waking up the busy worker spuriously. Spurious wakeups
    should be fine, but they still aren't nice, and avoiding them is
    trivial here.

    tj: Updated description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

05 Jun, 2014

1 commit

  • Commit 786235eeba0e ("kthread: make kthread_create() killable") was meant
    to allow kthread_create() to abort as soon as killed by the OOM-killer.
    OOM-killer. But returning -ENOMEM is wrong if killed by SIGKILL from
    userspace. Change kthread_create() to return -EINTR upon SIGKILL.

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [3.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Apr, 2014

1 commit

  • In the presence of memoryless nodes, numa_node_id() will return the
    current CPU's NUMA node, but that may not be where we expect to allocate
    memory from. Instead, we should rely on the fallback code in the
    memory allocator itself, by using NUMA_NO_NODE. Also, when calling
    kthread_create_on_node(), use the nearest node with memory to the cpu in
    question, rather than the node it is running on.

    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Anton Blanchard
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Jan Kara
    Cc: Thomas Gleixner
    Cc: Tetsuo Handa
    Cc: Wanpeng Li
    Cc: Joonsoo Kim
    Cc: Ben Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

13 Nov, 2013

1 commit

  • Any user-process caller of wait_for_completion(), except the global init
    process, might be chosen by the OOM killer while waiting for a
    completion() call by some other process which does memory allocation;
    see CVE-2012-4398 ("kernel: request_module() OOM local DoS").

    When such users are chosen by the OOM killer while waiting for
    completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed
    due to memory starvation because the OOM killer cannot kill such users.

    kthread_create() is one of such users and this patch fixes the problem
    for kthreadd by making kthread_create() killable - the same approach
    used for fixing CVE-2012-4398.

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

01 May, 2013

1 commit

  • One of the problems that arise when converting a dedicated custom
    threadpool to workqueue is that the shared worker pool used by workqueue
    anonymizes each worker, making it more difficult to identify what the
    worker was doing on which target from the output of sysrq-t or a debug
    dump from oops, BUG() and friends.

    For example, after writeback is converted to use workqueue instead of a
    private thread pool, there's no easy way to tell which backing device a
    writeback work item was working on at the time of task dump, which,
    according to our writeback brethren, is important in tracking down issues
    with a lot of mounted file systems on a lot of different devices.

    This patchset implements a way for a work function to mark its execution
    instance so that task dump of the worker task includes information to
    indicate what the work item was doing.

    An example WARN dump would look like the following.

    WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
    Modules linked in:
    CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
    Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
    Workqueue: writeback bdi_writeback_workfn (flush-8:16)
    ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
    ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
    ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x70/0xa0
    ...

    This patch:

    Implement probe_kthread_data() which returns kthread_data if accessible.
    The function is equivalent to kthread_data() except that the specified
    @task may not be a kthread or its vfork_done may already be cleared,
    rendering struct kthread inaccessible. In the former case, probe_kthread_data() may
    return any value. In the latter, NULL.

    This will be used to safely print debug information without affecting
    synchronization in the normal paths. Workqueue debug info printing on
    dump_stack() and friends will make use of it.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Acked-by: Jan Kara
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

30 Apr, 2013

3 commits

  • Pull workqueue updates from Tejun Heo:
    "A lot of activities on workqueue side this time. The changes achieve
    the followings.

    - WQ_UNBOUND workqueues - the workqueues which are not per-cpu - are
    updated to be able to interface with multiple backend worker pools.
    This involved a lot of churning but the end result seems actually
    neater as unbound workqueues are now a lot closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools are
    used to implement unbound workqueues with custom attributes.
    Currently the supported attributes are the nice level and CPU
    affinity. It may be expanded to include cgroup association in
    future. The attributes can be specified either by calling
    apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
    the workqueue in question is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes. When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools
    alone.

    This allows converting custom worker pool implementations which
    want worker attribute tuning to use workqueues. The writeback pool
    is already converted in block tree and there are a couple others
    are likely to follow including btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
    to make it NUMA-aware. Because there's no association between work
    item issuer and the specific worker assigned to execute it, before
    this change, using unbound workqueue led to unnecessary cross-node
    bouncing and it couldn't be helped by autonuma as it requires tasks
    to have implicit node affinity and workers are assigned randomly.

    After these changes, an unbound workqueue now binds to multiple
    NUMA-affine worker pools so that queued work items are executed in
    the same node. This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity as encrypting data across
    different nodes can contribute noticeable overhead and doing it
    per-cpu was too limiting for certain cases and IO throughput could
    be bottlenecked by one CPU being fully occupied while others have
    idle cycles.

    While the new features required a lot of changes including
    restructuring locking, it didn't complicate the execution paths much.
    The unbound workqueue handling is now closer to per-cpu ones and the
    new features are implemented by simply associating a workqueue with
    different sets of backend worker pools without changing queue,
    execution or flush paths.

    As such, even though the amount of change is very high, I feel
    relatively safe in that it isn't likely to cause subtle issues with
    basic correctness of work item execution and handling. If something
    is wrong, it's likely to show up as being associated with worker pools
    with the wrong attributes or OOPS while workqueue attributes are being
    changed or during CPU hotplug.

    While this creates more backend worker pools, it doesn't add too many
    more workers unless, of course, there are many workqueues with unique
    combinations of attributes. Assuming everything else is the same,
    NUMA awareness costs an extra worker pool per NUMA node with online
    CPUs.

    There are also a couple things which are being routed outside the
    workqueue tree.

    - block tree pulled in workqueue for-3.10 so that writeback worker
    pool can be converted to unbound workqueue with sysfs control
    exposed. This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

    - The conversion to workqueue means that there's no longer a 1:1
    association between a specific worker and a backing device, which makes
    writeback folks unhappy as they want to be able to tell which filesystem
    caused a problem from a backtrace on systems with many filesystems
    mounted. This is
    resolved by allowing work items to set debug info string which is
    printed when the task is dumped. As this change involves unifying
    implementations of dump_stack() and friends in arch codes, it's
    being routed through Andrew's -mm tree."

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
    workqueue: use kmem_cache_free() instead of kfree()
    workqueue: avoid false negative WARN_ON() in destroy_workqueue()
    workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
    workqueue: implement NUMA affinity for unbound workqueues
    workqueue: introduce put_pwq_unlocked()
    workqueue: introduce numa_pwq_tbl_install()
    workqueue: use NUMA-aware allocation for pool_workqueues
    workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
    workqueue: map an unbound workqueues to multiple per-node pool_workqueues
    workqueue: move hot fields of workqueue_struct to the end
    workqueue: make workqueue->name[] fixed len
    workqueue: add workqueue->unbound_attrs
    workqueue: determine NUMA node of workers according to the allowed cpumask
    workqueue: drop 'H' from kworker names of unbound worker pools
    workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
    workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
    workqueue: fix memory leak in apply_workqueue_attrs()
    workqueue: fix unbound workqueue attrs hashing / comparison
    workqueue: fix race condition in unbound workqueue free path
    workqueue: remove pwq_lock which is no longer used
    ...

    Linus Torvalds
     
  • task_get_live_kthread() looks confusing and unneeded. It does
    get_task_struct(), but only kthread_stop() needs this; it can be called
    even if the caller doesn't have a reference, when we know that this
    kthread can't exit until we do kthread_stop().

    kthread_park() and kthread_unpark() do not need get_task_struct(), the
    callers already have the reference. And it can not help if we can race
    with the exiting kthread anyway, kthread_park() can hang forever in this
    case.

    Change kthread_park() and kthread_unpark() to use to_live_kthread(),
    change kthread_stop() to do get_task_struct() by hand and remove
    task_get_live_kthread().

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: "Srivatsa S. Bhat"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "k->vfork_done != NULL" with a barrier() after to_kthread(k) in
    task_get_live_kthread(k) looks unclear, and sub-optimal because we load
    ->vfork_done twice.

    All we need is to ensure that we do not return to_kthread(NULL). Add a
    new trivial helper which loads/checks ->vfork_done once, this also looks
    more understandable.

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: "Srivatsa S. Bhat"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 Apr, 2013

1 commit

  • The smpboot threads rely on the park/unpark mechanism which binds
    per-cpu threads to a particular core. Though the functionality is racy:

    CPU0                   CPU1                  CPU2
    unpark(T)              wake_up_process(T)
    clear(SHOULD_PARK)
                                                 T runs
                                                 leave parkme() due to !SHOULD_PARK
    bind_to(CPU2)
    BUG_ON(wrong CPU)

    We cannot let the tasks move themselves to the target CPU, as one of
    those tasks is actually the migration thread itself, which requires
    that it starts running on the target cpu right away.

    The solution to this problem is to prevent wakeups in park mode which
    are not from unpark(). That way we can guarantee that the association
    of the task to the target cpu is working correctly.

    Add a new task state (TASK_PARKED) which prevents other wakeups and
    use this state explicitly for the unpark wakeup.

    Peter noticed: Also, since the task state is visible to userspace and
    all the parked tasks are still in the PID space, it's a good hint in ps
    and friends that these tasks aren't really there for the moment.

    The migration thread has another related issue.

    CPU0                        CPU1
    Bring up CPU2
    create_thread(T)
    park(T)
      wait_for_completion()
                                parkme()
                                  complete()
    sched_set_stop_task()
                                  schedule(TASK_PARKED)

    The sched_set_stop_task() call is issued while the task is on the
    runqueue of CPU1 and that confuses the hell out of the stop_task class
    on that cpu. So we need the same synchronization before
    sched_set_stop_task().

    Reported-by: Dave Jones
    Reported-and-tested-by: Dave Hansen
    Reported-and-tested-by: Borislav Petkov
    Acked-by: Peter Ziljstra
    Cc: Srivatsa S. Bhat
    Cc: dhillf@gmail.com
    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

20 Mar, 2013

1 commit

  • PF_THREAD_BOUND was originally used to mark kernel threads which were
    bound to a specific CPU using kthread_bind(), and a task with the flag
    set allows cpus_allowed modifications only to itself. Workqueue is
    currently abusing it to prevent userland from meddling with the
    cpus_allowed of workqueue workers.

    What we need is a flag to prevent userland from messing with the
    cpus_allowed of certain kernel tasks. In the kernel, anyone can
    (incorrectly) squash the flag, and, for worker-type usages,
    restricting cpus_allowed modification to the task itself doesn't
    provide meaningful extra protection as other tasks can inject work
    items to the task anyway.

    This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
    sched_setaffinity() checks the flag and returns -EINVAL if set.
    set_cpus_allowed_ptr() is no longer affected by the flag.

    This will allow simplifying workqueue worker CPU affinity management.
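
    The corresponding check, roughly (a sketch of the sched_setaffinity()
    side):

        /* Userland may not touch the affinity of such tasks. */
        if (p->flags & PF_NO_SETAFFINITY) {
                retval = -EINVAL;
                goto out_put_task;
        }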

    Signed-off-by: Tejun Heo
    Acked-by: Ingo Molnar
    Reviewed-by: Lai Jiangshan
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner

    Tejun Heo