12 Oct, 2016

11 commits

  • This patch makes it possible to create a freezable kthread worker via a new
    @flags parameter. It will allow some kthreads to avoid an init work.

    It currently does not affect the behavior of kthread_worker_fn(),
    but it might help with some optimizations or fixes eventually.

    I currently do not know of any other use for the @flags
    parameter, but I believe that we will want more flags
    in the future.

    Finally, I hope that it will not cause confusion with the @flags member
    in struct kthread. I guess that we will want to rework the basic
    kthread implementation once all kthreads are converted into
    kthread workers or workqueues. It is possible that we will merge
    the two structures then.
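
    A minimal sketch of how the new @flags parameter is meant to be used,
    assuming the KTW_FREEZABLE flag added by this series (the worker name is
    hypothetical, error handling trimmed):

        struct kthread_worker *worker;

        /* Create a dedicated worker whose kthread may be frozen on suspend. */
        worker = kthread_create_worker(KTW_FREEZABLE, "my_worker");
        if (IS_ERR(worker))
            return PTR_ERR(worker);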

    Link: http://lkml.kernel.org/r/1470754545-17632-12-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • There are situations when we need to modify the delay of a delayed kthread
    work. For example, when the work depends on an event and the initial delay
    acts as a timeout; we then want to queue the work immediately when the
    event happens.

    This patch implements kthread_mod_delayed_work() as inspired by workqueues.
    It cancels the timer, removes the work from any worker list and queues it
    again with the given delay.

    A very special case is when the work is being canceled at the same time.
    It might happen because of the regular kthread_cancel_delayed_work_sync()
    or by another kthread_mod_delayed_work(). In this case, we do nothing and
    let the other operation win. This should not normally happen, as the caller
    is supposed to synchronize these operations in a reasonable way.
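
    A sketch of the intended usage, with hypothetical worker and dwork names
    (the delayed work acts as a timeout that is expedited when the event
    arrives):

        /* Arm the timeout: run the work in ~5 seconds unless the event arrives. */
        kthread_queue_delayed_work(worker, &dwork, 5 * HZ);

        /* The event happened: run the work right away instead of waiting. */
        kthread_mod_delayed_work(worker, &dwork, 0);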

    Link: http://lkml.kernel.org/r/1470754545-17632-11-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • We are going to use kthread workers more widely and sometimes we will need
    to make sure that the work is neither pending nor running.

    This patch implements cancel_*_sync() operations as inspired by
    workqueues. We are synchronized against the other operations via the
    worker lock, and we use del_timer_sync() and a counter of parallel
    cancel operations. Therefore the implementation can be simpler.

    First, we check if a worker is assigned. If not, the work has never been
    queued since it was initialized.

    Second, we take the worker lock. It must be the right one. The work must
    not be assigned to another worker unless it is reinitialized in between.

    Third, we try to cancel the timer if it exists. The timer is deleted
    synchronously to make sure that the timer callback is not running. We
    need to temporarily release the worker->lock to avoid a possible deadlock
    with the callback. In the meantime, we set the work->canceling counter to
    block any queuing.

    Fourth, we try to remove the work from a worker list. It might be
    the list of either normal or delayed works.

    Fifth, if the work is running, we call kthread_flush_work(). It might
    take an arbitrary time. We need to release the worker->lock again. In the
    meantime, we again block any queuing via the canceling counter.

    As already mentioned, the check for a pending kthread work is done under a
    lock. Compared with workqueues, we do not need to fight for a single
    PENDING bit to block other operations. Therefore we do not suffer from
    the thundering herd problem, and all parallel canceling jobs may use
    kthread_flush_work(). Any queuing is blocked until the counter reaches zero.
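
    A sketch of how the new calls are expected to be used on shutdown, with
    hypothetical work items:

        /* Make sure the items are neither pending nor running before the
         * resources they touch are freed. */
        kthread_cancel_work_sync(&my_work);
        kthread_cancel_delayed_work_sync(&my_delayed_work);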

    Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • We are going to use kthread_worker more widely and delayed works
    will be pretty useful.

    The implementation is inspired by workqueues. It uses a timer to queue
    the work after the requested delay. If the delay is zero, the work is
    queued immediately.

    Compared with workqueues, each work is associated with a single worker
    (kthread). Therefore the implementation can be much simpler. In
    particular, we use the worker->lock to synchronize all the operations on
    the work. We do not need any atomic operations on a flags variable.

    In fact, we do not need any state variable at all. Instead, we add a list
    of delayed works to the worker. A pending work is then listed either in
    the list of queued works or in the list of delayed works, and the existing
    check for pending works stays the same even for the delayed ones.

    A work must not be assigned to another worker unless it is reinitialized.
    Therefore the timer handler can expect that dwork->work.worker is valid
    and it can simply take the lock. We just add some sanity checks to help
    with debugging potential misuse.
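
    A sketch of the delayed-work part of the API described above, with
    hypothetical names (the init helper follows the kthread_ naming used in
    this series):

        static void my_timeout_fn(struct kthread_work *work);
        static struct kthread_delayed_work dwork;

        kthread_init_delayed_work(&dwork, my_timeout_fn);
        /* The worker's kthread will run my_timeout_fn() after ~2 seconds. */
        kthread_queue_delayed_work(worker, &dwork, 2 * HZ);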

    Link: http://lkml.kernel.org/r/1470754545-17632-9-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • Nothing currently prevents a work from being queued to a kthread worker
    while it is already running on another one. This means that the work might
    run in parallel on more than one worker. Also, some operations are not
    reliable, e.g. flush.

    This problem will become even more visible after we add the
    kthread_cancel_work() function. It will take only "work" as a parameter
    and will use worker->lock to synchronize with the other operations.

    Well, normally this is not a problem because the API users are sane.
    But bugs might happen and users also might be crazy.

    This patch adds a warning when we try to insert the work for another
    worker. It does not fully prevent the misuse because doing so would make
    the code much more complicated without a big benefit.

    It also adds the same warning to kthread_flush_work(), replacing the
    repeated attempts to get the right lock.

    A side effect is that one needs to explicitly reinitialize the work if it
    must be queued into another worker. This is needed, for example, when the
    worker is stopped and started again. It is a bit inconvenient, but it
    looks like a good compromise between stability and complexity.

    I have double-checked all existing users of the kthread worker API and
    they all seem to initialize the work after the worker is started.

    Just for completeness, the patch adds a check that the work is not already
    in a queue.

    The patch also puts all the checks into a separate function. It will be
    reused when implementing delayed works.
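
    A sketch of the reinitialization mentioned above, with hypothetical names:

        /* The previous worker was stopped; the work has to be initialized
         * again before it may be queued to the new worker. */
        kthread_init_work(&my_work, my_work_fn);
        kthread_queue_work(new_worker, &my_work);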

    Link: http://lkml.kernel.org/r/1470754545-17632-8-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • The current kthread worker users call flush() and stop() explicitly.
    The new kthread_destroy_worker() does the same, plus it frees the
    kthread_worker struct, in one call.

    It is supposed to be used together with kthread_create_worker*(), which
    allocates the struct kthread_worker.
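
    A sketch of the intended pairing, with hypothetical names (the flags
    argument of kthread_create_worker() comes from a later patch in this
    series):

        struct kthread_worker *worker;

        worker = kthread_create_worker(0, "my_worker");
        if (IS_ERR(worker))
            return PTR_ERR(worker);

        /* ... queue and flush work as needed ... */

        /* Flushes remaining work, stops the kthread and frees the struct. */
        kthread_destroy_worker(worker);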

    Link: http://lkml.kernel.org/r/1470754545-17632-7-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • Kthread workers are currently created using the classic kthread API,
    namely kthread_run(). kthread_worker_fn() is passed as the @threadfn
    parameter.

    This patch defines kthread_create_worker() and
    kthread_create_worker_on_cpu() functions that hide implementation details.

    They enforce using kthread_worker_fn() for the main thread, but I doubt
    that there are any plans to create an alternative. In fact, I think that
    we do not want any alternative main thread function because it would be
    hard to keep it consistent with the rest of the kthread worker API.

    The naming and function of kthread_create_worker() is inspired by the
    workqueues API like the rest of the kthread worker API.

    The kthread_create_worker_on_cpu() variant is motivated by the original
    kthread_create_on_cpu(). Note that per-CPU kthread workers need to be
    bound already when they are created; it makes life easier, and
    kthread_bind() cannot be used later for an already running worker.

    This patch does _not_ convert existing kthread workers. The kthread
    worker API needs more improvements first, e.g. a function to destroy the
    worker.

    IMPORTANT:

    kthread_create_worker_on_cpu() allows any format of the worker name, in
    contrast with kthread_create_on_cpu(). The good thing is that it is more
    generic. The bad thing is that most users will need to pass the CPU
    number in two parameters, e.g. kthread_create_worker_on_cpu(cpu,
    "helper/%d", cpu).

    To be honest, the main motivation was to avoid the need for an empty
    va_list. The only legal way to get one was to create a helper function
    that is called with an empty list. Other attempts caused compiler
    warnings or even errors on different architectures.

    There were also other alternatives, for example, using a #define or
    splitting __kthread_create_worker(). The chosen solution looked the
    least ugly.
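
    A sketch of the per-CPU creation pattern described above, with
    hypothetical names (a flags argument is added to these helpers by a later
    patch in the series):

        int cpu;
        struct kthread_worker *worker;

        for_each_online_cpu(cpu) {
            /* The CPU number is passed twice: once for the binding and once
             * for the printf-like worker name. */
            worker = kthread_create_worker_on_cpu(cpu, "helper/%d", cpu);
            if (IS_ERR(worker))
                break;
        }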

    Link: http://lkml.kernel.org/r/1470754545-17632-6-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • kthread_create_on_node() implements a bunch of logic to create the
    kthread. It is already called by kthread_create_on_cpu().

    We are going to extend the kthread worker API and will need to call
    kthread_create_on_node() with va_list args there.

    This patch is only a refactoring and does not modify the existing
    behavior.

    Link: http://lkml.kernel.org/r/1470754545-17632-5-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • kthread_create_on_cpu() was added by commit 2a1d446019f9a5983e
    ("kthread: Implement park/unpark facility"). It is currently used only
    when enabling a new CPU. For this purpose, the newly created kthread
    has to be parked.

    The CPU binding is a bit tricky. The kthread is parked while the CPU has
    not been brought up yet, and the binding to the CPU is done when the
    kthread is unparked.

    The function would be useful for more per-CPU kthreads, e.g.
    bnx2fc_thread, fcoethread. For this purpose, the newly created kthread
    should stay in the uninterruptible state.

    This patch moves the parking into smpboot. The thread is now bound to the
    CPU already when it is created, so the function can be used universally,
    and the behavior is consistent with kthread_create() and
    kthread_create_on_node().
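
    A sketch of how a per-CPU kthread could then be created with this helper
    (hypothetical thread function and data; the single %u in the name format
    receives the CPU number):

        struct task_struct *t;

        t = kthread_create_on_cpu(my_thread_fn, my_data, cpu, "my_thread/%u");
        if (!IS_ERR(t))
            wake_up_process(t);     /* already bound to 'cpu' at creation */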

    Link: http://lkml.kernel.org/r/1470754545-17632-4-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Reviewed-by: Thomas Gleixner
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • A good practice is to prefix the names of functions by the name
    of the subsystem.

    The kthread worker API is a mix of classic kthreads and workqueues. Each
    worker has a dedicated kthread. It runs a generic function that processes
    queued works. It is implemented as part of the kthread subsystem.

    This patch renames the existing kthread worker API to use
    the corresponding name from the workqueues API prefixed by
    kthread_:

    __init_kthread_worker() -> __kthread_init_worker()
    init_kthread_worker() -> kthread_init_worker()
    init_kthread_work() -> kthread_init_work()
    insert_kthread_work() -> kthread_insert_work()
    queue_kthread_work() -> kthread_queue_work()
    flush_kthread_work() -> kthread_flush_work()
    flush_kthread_worker() -> kthread_flush_worker()

    Note that the names of DEFINE_KTHREAD_WORK*() macros stay
    as they are. It is common that the "DEFINE_" prefix has
    precedence over the subsystem names.

    Note that the INIT() macros and the init() functions use different naming
    schemes. There is no good solution; there are several reasons for the
    chosen one:

    + "init" in the function names stands for the verb "initialize",
    as in "initialize worker", while "INIT" in the macro names
    stands for the noun "INITIALIZER", as in "worker initializer".

    + INIT() macros are used only in DEFINE() macros

    + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

    + There will also be kthread_destroy_worker(), which will
    be used close to kthread_cancel_work(). It is related
    to the init() functions. Again, it looks better if all
    functions use the same naming scheme.

    + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(), regmap_init_mmio_clk().

    + It is not an argument but it was inconsistent even before.

    [arnd@arndb.de: fix linux-next merge conflict]
    Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
    Suggested-by: Andrew Morton
    Signed-off-by: Petr Mladek
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     
  • Patch series "kthread: Kthread worker API improvements"

    The intention of this patchset is to make it easier to manipulate and
    maintain kthreads. In particular, I want to replace all the custom main
    loops with a generic one. Also, I want to make the kthreads sleep in a
    consistent state in a common place when there is no work.

    This patch (of 11):

    A good practice is to prefix the names of functions by the name of the
    subsystem.

    This patch fixes the name of probe_kthread_data(). The other wrong
    function names are part of the kthread worker API and will be fixed
    separately.

    Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
    Signed-off-by: Petr Mladek
    Suggested-by: Andrew Morton
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: "Paul E. McKenney"
    Cc: Josh Triplett
    Cc: Thomas Gleixner
    Cc: Jiri Kosina
    Cc: Borislav Petkov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek
     

16 Sep, 2016

1 commit

  • get_task_struct(tsk) no longer pins tsk->stack, so all users of
    to_live_kthread() should do try_get_task_stack/put_task_stack to protect
    "struct kthread", which lives on the kthread's stack.

    TODO: kill to_live_kthread(); perhaps we can even kill "struct kthread"
    too, and rework kthread_stop() to use task_work_add() to sync with the
    exiting kernel thread.

    Message-Id:
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jann Horn
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/cb9b16bbc19d4aea4507ab0552e4644c1211d130.1474003868.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

05 Sep, 2015

1 commit

  • - Make it clear that the `node' arg refers to memory allocations only:
    kthread_create_on_node() does not pin the new thread to that node's
    CPUs.

    - Encourage the use of NUMA_NO_NODE.
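
    A sketch of the documented usage, with hypothetical thread function and
    data (NUMA_NO_NODE lets the allocator pick a suitable node):

        struct task_struct *t;

        /* 'node' only affects where the stack and task_struct are allocated;
         * it does not bind the new thread to that node's CPUs. */
        t = kthread_create_on_node(my_thread_fn, my_data, NUMA_NO_NODE,
                                   "my_thread");
        if (!IS_ERR(t))
            wake_up_process(t);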

    [nzimmer@sgi.com: use NUMA_NO_NODE in kthread_create() also]
    Cc: Nathan Zimmer
    Cc: Tejun Heo
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

01 Sep, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change in this cycle is the rewrite of the main SMP load
    balancing metric: the CPU load/utilization. The main goal was to make
    the metric more precise and more representative - see the changelog of
    this commit for the gory details:

    9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    It is done in a way that significantly reduces complexity of the code:

    5 files changed, 249 insertions(+), 494 deletions(-)

    and the performance testing results are encouraging. Nevertheless we
    need to keep an eye on potential regressions, since this potentially
    affects every SMP workload in existence.

    This work comes from Yuyang Du.

    Other changes:

    - SCHED_DL updates. (Andrea Parri)

    - Simplify architecture callbacks by removing finish_arch_switch().
    (Peter Zijlstra et al)

    - cputime accounting: guarantee stime + utime == rtime. (Peter
    Zijlstra)

    - optimize idle CPU wakeups some more - inspired by Facebook server
    loads. (Mike Galbraith)

    - stop_machine fixes and updates. (Oleg Nesterov)

    - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)

    - sched/numa tweaks. (Srikar Dronamraju)

    - misc fixes and small cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    sched/deadline: Fix comment in enqueue_task_dl()
    sched/deadline: Fix comment in push_dl_tasks()
    sched: Change the sched_class::set_cpus_allowed() calling context
    sched: Make sched_class::set_cpus_allowed() unconditional
    sched: Fix a race between __kthread_bind() and sched_setaffinity()
    sched: Ensure a task has a non-normalized vruntime when returning back to CFS
    sched/numa: Fix NUMA_DIRECT topology identification
    tile: Reorganize _switch_to()
    sched, sparc32: Update scheduler comments in copy_thread()
    sched: Remove finish_arch_switch()
    sched, tile: Remove finish_arch_switch
    sched, sh: Fold finish_arch_switch() into switch_to()
    sched, score: Remove finish_arch_switch()
    sched, avr32: Remove finish_arch_switch()
    sched, MIPS: Get rid of finish_arch_switch()
    sched, arm: Remove finish_arch_switch()
    sched/fair: Clean up load average references
    sched/fair: Provide runnable_load_avg back to cfs_rq
    sched/fair: Remove task and group entity load when they are dead
    sched/fair: Init cfs_rq's sched_entity load average
    ...

    Linus Torvalds
     

12 Aug, 2015

1 commit

  • Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
    without locks, a caller might observe an old value and race with the
    set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
    it:

    __kthread_bind()
      do_set_cpus_allowed()
                                        sched_setaffinity()
                                          if (p->flags & PF_NO_SETAFFINITY)
                                            set_cpus_allowed_ptr()
      p->flags |= PF_NO_SETAFFINITY

    Fix the bug by putting everything under the regular scheduler locks.

    This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dedekind1@gmail.com
    Cc: juri.lelli@arm.com
    Cc: mgorman@suse.de
    Cc: riel@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Aug, 2015

1 commit

  • The s-Par visornic driver, currently in staging, processes a queue
    serviced by an s-Par service partition. We can get a message that
    something has happened with the service partition; when that happens, we
    must not access the channel until we get a message that the service
    partition is back again.

    The visornic driver has a thread for processing the channel; when we get
    the message, we need to be able to park the thread and then resume it
    when the problem clears.

    We can do this with kthread_park() and kthread_unpark(), but they are not
    exported from the kernel; this patch exports the needed functions.
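
    A sketch of the driver-side usage this export enables, with a
    hypothetical task pointer (the thread function itself must check
    kthread_should_park() and call kthread_parkme()):

        /* Service partition went away: park the channel-processing thread. */
        kthread_park(devdata->thread);

        /* Service partition is back: let the thread run again. */
        kthread_unpark(devdata->thread);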

    Signed-off-by: David Kershner
    Acked-by: Ingo Molnar
    Acked-by: Neil Horman
    Acked-by: Thomas Gleixner
    Cc: Richard Weinberger
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Kershner
     

10 Oct, 2014

1 commit

  • …ask_struct allocations")

    After discussions with Tejun, we don't want to spread the use of
    cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
    callers, but would rather ensure the callees correctly handle memoryless
    nodes. With the previous patches ("topology: add support for
    node_to_mem_node() to determine the fallback node" and "slub: fallback to
    node_to_mem_node() node if allocating on memoryless node") adding and
    using node_to_mem_node(), we can safely undo part of the change to the
    kthread logic from 81c98869faa5.

    Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Anton Blanchard <anton@samba.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Nishanth Aravamudan
     

29 Jul, 2014

1 commit

  • If the worker is already executing a work item when another is queued, we
    can safely skip the wakeup without worrying about stalling the queue, thus
    avoiding waking up the busy worker spuriously. Spurious wakeups should be
    fine, but they still aren't nice, and avoiding them is trivial here.

    tj: Updated description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

05 Jun, 2014

1 commit

  • Commit 786235eeba0e ("kthread: make kthread_create() killable") was meant
    to allow kthread_create() to abort as soon as it is killed by the
    OOM killer. But returning -ENOMEM is wrong if it is killed by SIGKILL
    from userspace. Change kthread_create() to return -EINTR upon SIGKILL.
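
    A sketch of what callers see after this change, with a hypothetical
    thread function:

        struct task_struct *t;

        t = kthread_run(my_thread_fn, my_data, "my_thread");
        if (IS_ERR(t)) {
            /* -EINTR now means the creating task was killed (e.g. SIGKILL)
             * while waiting; -ENOMEM still means an allocation failure. */
            return PTR_ERR(t);
        }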

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [3.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Apr, 2014

1 commit

  • In the presence of memoryless nodes, numa_node_id() will return the
    current CPU's NUMA node, but that may not be where we expect to allocate
    memory from. Instead, we should rely on the fallback code in the memory
    allocator itself, by using NUMA_NO_NODE. Also, when calling
    kthread_create_on_node(), use the node with memory nearest to the cpu in
    question, rather than the node it is running on.

    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Anton Blanchard
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Jan Kara
    Cc: Thomas Gleixner
    Cc: Tetsuo Handa
    Cc: Wanpeng Li
    Cc: Joonsoo Kim
    Cc: Ben Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

13 Nov, 2013

1 commit

  • Any user-process caller of wait_for_completion(), except the global init
    process, might be chosen by the OOM killer while waiting for a complete()
    call from some other process which is allocating memory; see
    CVE-2012-4398 "kernel: request_module() OOM local DoS" for how this can
    happen.

    When such callers are chosen by the OOM killer while they are waiting for
    completion in TASK_UNINTERRUPTIBLE, the system stays stressed due to
    memory starvation because the OOM killer cannot kill them.

    kthread_create() is one such caller, and this patch fixes the problem for
    kthreadd by making kthread_create() killable - the same approach used for
    fixing CVE-2012-4398.

    Signed-off-by: Tetsuo Handa
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

01 May, 2013

1 commit

  • One of the problems that arises when converting a dedicated custom
    threadpool to workqueue is that the shared worker pool used by workqueue
    anonymizes each worker, making it more difficult to identify what the
    worker was doing on which target from the output of sysrq-t or a debug
    dump from oops, BUG() and friends.

    For example, after writeback is converted to use workqueue instead of a
    private thread pool, there's no easy way to tell which backing device a
    writeback work item was working on at the time of a task dump, which,
    according to our writeback brethren, is important in tracking down issues
    with a lot of mounted file systems on a lot of different devices.

    This patchset implements a way for a work function to mark its execution
    instance so that a task dump of the worker task includes information
    indicating what the work item was doing.

    An example WARN dump would look like the following.

    WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
    Modules linked in:
    CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
    Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
    Workqueue: writeback bdi_writeback_workfn (flush-8:16)
    ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
    ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
    ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x70/0xa0
    ...

    This patch:

    Implement probe_kthread_data(), which returns the kthread data if
    accessible. The function is equivalent to kthread_data() except that the
    specified @task may not be a kthread, or its vfork_done may already be
    cleared, rendering struct kthread inaccessible. In the former case,
    probe_kthread_data() may return any value; in the latter, NULL.

    This will be used to safely print debug information without affecting
    synchronization in the normal paths. Workqueue debug info printing on
    dump_stack() and friends will make use of it.
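
    A sketch of the intended debug-only use, with a hypothetical task pointer
    (the result is only reliable enough for printing, as explained above):

        void *data = probe_kthread_data(task);

        pr_info("worker data: %p\n", data);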

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Acked-by: Jan Kara
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

30 Apr, 2013

3 commits

  • Pull workqueue updates from Tejun Heo:
    "A lot of activity on the workqueue side this time. The changes achieve
    the following.

    - WQ_UNBOUND workqueues - the workqueues which are not bound to specific
    CPUs - are updated to be able to interface with multiple backend worker
    pools. This involved a lot of churning but the end result seems actually
    neater as unbound workqueues are now a lot closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools are
    used to implement unbound workqueues with custom attributes.
    Currently the supported attributes are the nice level and CPU
    affinity. It may be expanded to include cgroup association in
    future. The attributes can be specified either by calling
    apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
    the workqueue in question is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes. When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools
    alone.

    This allows converting custom worker pool implementations which
    want worker attribute tuning to use workqueues. The writeback pool
    is already converted in block tree and there are a couple others
    are likely to follow including btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
    to make it NUMA-aware. Because there's no association between work
    item issuer and the specific worker assigned to execute it, before
    this change, using unbound workqueue led to unnecessary cross-node
    bouncing and it couldn't be helped by autonuma as it requires tasks
    to have implicit node affinity and workers are assigned randomly.

    After these changes, an unbound workqueue now binds to multiple
    NUMA-affine worker pools so that queued work items are executed in
    the same node. This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity as encrypting data across
    different nodes can contribute noticeable overhead and doing it
    per-cpu was too limiting for certain cases and IO throughput could
    be bottlenecked by one CPU being fully occupied while others have
    idle cycles.

    While the new features required a lot of changes including
    restructuring locking, it didn't complicate the execution paths much.
    The unbound workqueue handling is now closer to per-cpu ones and the
    new features are implemented by simply associating a workqueue with
    different sets of backend worker pools without changing queue,
    execution or flush paths.

    As such, even though the amount of change is very high, I feel
    relatively safe in that it isn't likely to cause subtle issues with
    basic correctness of work item execution and handling. If something
    is wrong, it's likely to show up as being associated with worker pools
    with the wrong attributes or OOPS while workqueue attributes are being
    changed or during CPU hotplug.

    While this creates more backend worker pools, it doesn't add too many
    more workers unless, of course, there are many workqueues with unique
    combinations of attributes. Assuming everything else is the same,
    NUMA awareness costs an extra worker pool per NUMA node with online
    CPUs.

    There are also a couple things which are being routed outside the
    workqueue tree.

    - block tree pulled in workqueue for-3.10 so that writeback worker
    pool can be converted to unbound workqueue with sysfs control
    exposed. This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

    - The conversion to workqueue means that there's no longer a 1:1
    association between a specific worker and what it is working on, which
    makes writeback folks unhappy as they want to be able to tell which
    filesystem caused a problem from a backtrace on systems with many
    filesystems mounted. This is resolved by allowing work items to set a
    debug info string which is printed when the task is dumped. As this
    change involves unifying implementations of dump_stack() and friends in
    arch codes, it's being routed through Andrew's -mm tree."

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
    workqueue: use kmem_cache_free() instead of kfree()
    workqueue: avoid false negative WARN_ON() in destroy_workqueue()
    workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
    workqueue: implement NUMA affinity for unbound workqueues
    workqueue: introduce put_pwq_unlocked()
    workqueue: introduce numa_pwq_tbl_install()
    workqueue: use NUMA-aware allocation for pool_workqueues
    workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
    workqueue: map an unbound workqueues to multiple per-node pool_workqueues
    workqueue: move hot fields of workqueue_struct to the end
    workqueue: make workqueue->name[] fixed len
    workqueue: add workqueue->unbound_attrs
    workqueue: determine NUMA node of workers accourding to the allowed cpumask
    workqueue: drop 'H' from kworker names of unbound worker pools
    workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
    workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
    workqueue: fix memory leak in apply_workqueue_attrs()
    workqueue: fix unbound workqueue attrs hashing / comparison
    workqueue: fix race condition in unbound workqueue free path
    workqueue: remove pwq_lock which is no longer used
    ...

    Linus Torvalds
     
  • task_get_live_kthread() looks confusing and unneeded. It does
    get_task_struct(), but only kthread_stop() needs this, and it can be
    called even if the caller doesn't have a reference, when we know that
    this kthread can't exit until we do kthread_stop().

    kthread_park() and kthread_unpark() do not need get_task_struct(); the
    callers already have the reference. And it cannot help if we can race
    with the exiting kthread anyway; kthread_park() can hang forever in this
    case.

    Change kthread_park() and kthread_unpark() to use to_live_kthread(),
    change kthread_stop() to do get_task_struct() by hand and remove
    task_get_live_kthread().

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: "Srivatsa S. Bhat"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "k->vfork_done != NULL" with a barrier() after to_kthread(k) in
    task_get_live_kthread(k) looks unclear, and sub-optimal because we load
    ->vfork_done twice.

    All we need is to ensure that we do not return to_kthread(NULL). Add a
    new trivial helper which loads/checks ->vfork_done once, this also looks
    more understandable.

    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Namhyung Kim
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: "Srivatsa S. Bhat"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 Apr, 2013

1 commit

  • The smpboot threads rely on the park/unpark mechanism which binds per-cpu
    threads to a particular core. However, the functionality is racy:

    CPU0                     CPU1                                CPU2
    unpark(T)                                                    wake_up_process(T)
      clear(SHOULD_PARK)     T runs
                             leave parkme() due to !SHOULD_PARK
      bind_to(CPU2)          BUG_ON(wrong CPU)

    We cannot let the tasks move themselves to the target CPU, as one of
    those tasks is actually the migration thread itself, which requires
    that it starts running on the target cpu right away.

    The solution to this problem is to prevent wakeups in park mode which
    are not from unpark(). That way we can guarantee that the association
    of the task to the target cpu is working correctly.

    Add a new task state (TASK_PARKED) which prevents other wakeups and
    use this state explicitly for the unpark wakeup.

    Peter noticed: Also, since the task state is visible to userspace and
    all the parked tasks are still in the PID space, it's a good hint in ps
    and friends that these tasks aren't really there for the moment.

    The migration thread has another related issue.

    CPU0                         CPU1
    Bring up CPU2
      create_thread(T)
      park(T)
        wait_for_completion()
                                 parkme()
                                   complete()
      sched_set_stop_task()
                                 schedule(TASK_PARKED)

    The sched_set_stop_task() call is issued while the task is on the
    runqueue of CPU1, and that confuses the hell out of the stop_task class
    on that cpu. So we need the same synchronization before
    sched_set_stop_task().

    Reported-by: Dave Jones
    Reported-and-tested-by: Dave Hansen
    Reported-and-tested-by: Borislav Petkov
    Acked-by: Peter Ziljstra
    Cc: Srivatsa S. Bhat
    Cc: dhillf@gmail.com
    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

20 Mar, 2013

1 commit

  • PF_THREAD_BOUND was originally used to mark kernel threads which were
    bound to a specific CPU using kthread_bind(), and a task with the flag
    set allows cpus_allowed modifications only by itself. Workqueue is
    currently abusing it to prevent userland from meddling with the
    cpus_allowed of workqueue workers.

    What we need is a flag to prevent userland from messing with the
    cpus_allowed of certain kernel tasks. In the kernel, anyone can
    (incorrectly) squash the flag, and, for worker-type usages, restricting
    cpus_allowed modification to the task itself doesn't provide meaningful
    extra protection as other tasks can inject work items into the task
    anyway.

    This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
    sched_setaffinity() checks the flag and returns -EINVAL if it is set.
    set_cpus_allowed_ptr() is no longer affected by the flag.

    This will allow simplifying workqueue worker CPU affinity management.

    Signed-off-by: Tejun Heo
    Acked-by: Ingo Molnar
    Reviewed-by: Lai Jiangshan
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner

    Tejun Heo
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory; we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

13 Oct, 2012

2 commits

  • Pull third pile of kernel_execve() patches from Al Viro:
    "The last bits of infrastructure for kernel_thread() et.al., with
    alpha/arm/x86 use of those. Plus sanitizing the asm glue and
    do_notify_resume() on alpha, fixing the "disabled irq while running
    task_work stuff" breakage there.

    At that point the rest of kernel_thread/kernel_execve/sys_execve work
    can be done independently for different architectures. The only
    pending bits that do depend on having all architectures converted are
    restrictred to fs/* and kernel/* - that'll obviously have to wait for
    the next cycle.

    I thought we'd have to wait for all of them done before we start
    eliminating the longjump-style insanity in kernel_execve(), but it
    turned out there's a very simple way to do that without flagday-style
    changes."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to saner kernel_execve() semantics
    arm: switch to saner kernel_execve() semantics
    x86, um: convert to saner kernel_execve() semantics
    infrastructure for saner ret_from_kernel_thread semantics
    make sure that kernel_thread() callbacks call do_exit() themselves
    make sure that we always have a return path from kernel_execve()
    ppc: eeh_event should just use kthread_run()
    don't bother with kernel_thread/kernel_execve for launching linuxrc
    alpha: get rid of switch_stack argument of do_work_pending()
    alpha: don't bother passing switch_stack separately from regs
    alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
    alpha: simplify TIF_NEED_RESCHED handling

    Linus Torvalds
     
  • * Allow kernel_execve() to leave the actual return to userland to the
    caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers are updated
    accordingly.
    * An architecture that selects GENERIC_KERNEL_EXECVE in its
    Kconfig should have its ret_from_kernel_thread() do this:
    - call schedule_tail
    - call the callback left for it by copy_thread(); if it ever
    returns, that's because it has just done a successful kernel_execve()
    - jump to the return from syscall
    IOW, its only difference from ret_from_fork() is that it does call the
    callback.
    * Such an architecture should also get rid of ret_from_kernel_execve()
    and __ARCH_WANT_KERNEL_EXECVE.

    This is the last part of infrastructure patches in that area - from
    that point on work on different architectures can live independently.

    Signed-off-by: Al Viro

    Al Viro
     

13 Aug, 2012

1 commit

  • To avoid the full teardown/setup of per-cpu kthreads in the case of cpu
    hot(un)plug, provide a facility which allows putting the kthread into a
    parked position and unparking it when the cpu comes online again.
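
    A sketch of a thread function cooperating with the new facility
    (hypothetical per-cpu work; the generic smpboot code is the intended
    user):

        static int my_percpu_thread(void *data)
        {
            while (!kthread_should_stop()) {
                if (kthread_should_park())
                    kthread_parkme();   /* sleep here while the cpu is down */

                /* ... do the per-cpu work ... */
                schedule_timeout_interruptible(HZ);
            }
            return 0;
        }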

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Namhyung Kim
    Cc: Peter Zijlstra
    Reviewed-by: Srivatsa S. Bhat
    Cc: Rusty Russell
    Reviewed-by: Paul E. McKenney
    Link: http://lkml.kernel.org/r/20120716103948.236618824@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

23 Jul, 2012

2 commits

  • kthread_worker provides a minimalistic workqueue-like interface for
    users which need a dedicated worker thread (e.g. for realtime
    priority). It has basic queue, flush_work and flush_worker operations
    which mostly match the workqueue counterparts; however, due to the way
    flush_work() is implemented, it has the noticeable difference of not
    allowing work items to be freed while they are being executed.

    While the current users of kthread_worker are okay with the current
    behavior, the restriction does impede some valid use cases. Also,
    removing this difference isn't difficult and actually makes the code
    easier to understand.

    This patch reimplements flush_kthread_work() such that it uses a
    flush_work item instead of queue/done sequence numbers.
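
    For context, a sketch of the basic kthread_worker usage pattern at this
    point in time (pre-rename function names; the work function and any
    priority setup are hypothetical):

        static struct kthread_worker worker;
        static struct kthread_work work;
        static struct task_struct *task;

        init_kthread_worker(&worker);
        task = kthread_run(kthread_worker_fn, &worker, "my_worker");

        init_kthread_work(&work, my_work_fn);
        queue_kthread_work(&worker, &work);
        flush_kthread_work(&work);      /* wait until my_work_fn() finishes */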

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Make the following two non-functional changes.

    * Separate out insert_kthread_work() from queue_kthread_work().

    * Relocate struct kthread_flush_work and kthread_flush_work_fn()
    definitions above flush_kthread_work().

    v2: Added lockdep_assert_held() in insert_kthread_work() as suggested
    by Andy Walls.

    Signed-off-by: Tejun Heo
    Acked-by: Andy Walls

    Tejun Heo
     

24 Nov, 2011

1 commit

  • There's no in-kernel user of set_freezable_with_signal() left. Mixing
    TIF_SIGPENDING with kernel threads can lead to nasty corner cases, as
    kernel threads never travel the signal delivery path on their own.

    E.g. the current implementation is buggy in the cancelation path of
    __thaw_task(). It calls recalc_sigpending_and_wake() in an attempt to
    clear TIF_SIGPENDING, but the function never clears it regardless of the
    sigpending state. This means that signallable freezable kthreads may
    continue executing with !freezing() && stuck TIF_SIGPENDING, which can
    be troublesome.

    This patch removes set_freezable_with_signal() along with
    PF_FREEZER_NOSIG and recalc_sigpending*() calls in freezer. User
    tasks get TIF_SIGPENDING, kernel tasks get woken up and the spurious
    sigpending is dealt with in the usual signal delivery path.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     

22 Nov, 2011

1 commit

  • Writeback and thinkpad_acpi have been using thaw_process() to prevent
    deadlock between the freezer and kthread_stop(); unfortunately, this
    is inherently racy - nothing prevents freezing from happening between
    thaw_process() and kthread_stop().

    This patch implements kthread_freezable_should_stop(), which enters the
    refrigerator if necessary but is guaranteed to return if kthread_stop()
    is invoked. Both thaw_process() users are converted to use the new
    function.

    Note that this deadlock condition exists for many freezable kthreads.
    They need to be converted to use the new should_stop helper or a
    freezable workqueue.

    Tested with synthetic test case.
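
    A sketch of the loop a freezable kthread is expected to use with the new
    helper (hypothetical work):

        static int my_kthread(void *data)
        {
            set_freezable();
            while (!kthread_freezable_should_stop(NULL)) {
                /* ... do the work ... */
                schedule_timeout_interruptible(HZ);
            }
            return 0;
        }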

    Signed-off-by: Tejun Heo
    Acked-by: Henrique de Moraes Holschuh
    Cc: Jens Axboe
    Cc: Oleg Nesterov

    Tejun Heo
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

28 May, 2011

1 commit


31 Mar, 2011

1 commit


23 Mar, 2011

1 commit

  • Since all kthreads are created by a single helper task, they all use
    memory from a single node for their kernel stack and task struct.

    This patch series creates kthread_create_on_node(), adding a 'node'
    parameter to the parameters already used by kthread_create().

    This parameter is used to allocate memory for the new kthread on its
    memory node, if possible.

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

07 Jan, 2011

1 commit