16 Apr, 2019

1 commit

  • The worker accounting for CPU bound workers is plugged into the core
    scheduler code and the wakeup code. This is not a hard requirement and
    can be avoided by keeping track of the state in the workqueue code
    itself.

    Keep track of the sleeping state in the worker itself and call the
    notifier before entering the core scheduler. There might be false
    positives when the task is woken between that call and actually
    scheduling, but that's not really different from scheduling and being
    woken immediately after switching away. nr_running is now updated when
    the task returns from schedule(), which is later than when it was done
    from ttwu().

    [ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]
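
    As a sketch of the resulting pattern (condensed from the upstream
    hooks; wq_worker_sleeping()/wq_worker_running() are the workqueue-side
    notifiers), schedule() now notifies the workqueue code directly:

        static void sched_submit_work(struct task_struct *tsk)
        {
                /*
                 * If a worker goes to sleep, notify the workqueue code so
                 * it can wake another task to maintain concurrency.
                 * Preemption is disabled because the notifier may wake a
                 * kworker and must not re-enter schedule() here.
                 */
                if (tsk->flags & PF_WQ_WORKER) {
                        preempt_disable();
                        wq_worker_sleeping(tsk);
                        preempt_enable_no_resched();
                }
        }

        static void sched_update_worker(struct task_struct *tsk)
        {
                /* on return from schedule(); this is where nr_running
                 * gets updated now */
                if (tsk->flags & PF_WQ_WORKER)
                        wq_worker_running(tsk);
        }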

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Daniel Bristot de Oliveira
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

02 Feb, 2019

1 commit

  • psi has provisions to shut off the periodic aggregation worker when
    there is a period of no task activity - and thus no data that needs
    aggregating. However, while developing psi monitoring, Suren noticed
    that the aggregation clock currently won't stay shut off for good.

    Debugging this revealed a flaw in the idle design: an aggregation run
    will see no task activity and decide to go to sleep; shortly thereafter,
    the kworker thread that executed the aggregation will go idle and cause
    a scheduling change, during which the psi callback will kick the
    !pending worker again. This will ping-pong forever, and is equivalent
    to having no shut-off logic at all (but with more code!)

    Fix this by exempting aggregation workers from psi's clock waking logic
    when the state change is them going to sleep. To do this, tag workers
    with the last work function they executed, and if in psi we see a worker
    going to sleep after aggregating psi data, we will not reschedule the
    aggregation work item.

    What if the worker is also executing other items before or after?

    Any psi state times that were incurred by work items preceding the
    aggregation work will have been collected from the per-cpu buckets
    during the aggregation itself. If there are work items following the
    aggregation work, the worker's last_func tag will be overwritten and the
    aggregator will be kept alive to process this genuine new activity.

    If the aggregation work is the last thing the worker does, and we decide
    to go idle, the brief period of non-idle time incurred between the
    aggregation run and the kworker's dequeue will be stranded in the
    per-cpu buckets until the clock is woken by later activity. But that
    should not be a problem. The buckets can hold 4s worth of time, and
    future activity will wake the clock with a 2s delay, giving us 2s worth
    of data we can leave behind when disabling aggregation. If it takes a
    worker more than two seconds to go idle after it finishes its last work
    item, we likely have bigger problems in the system, and won't notice one
    sample that was averaged with a bogus per-CPU weight.
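
    A sketch of the resulting check, assuming the upstream names
    wq_worker_last_func() (reads the worker's last_func tag) and
    psi_update_work (the aggregation work function):

        /* in psi_task_change(): don't kick the clock if the state
         * change is the aggregation worker itself going to sleep */
        if (unlikely((clear & TSK_RUNNING) &&
                     (task->flags & PF_WQ_WORKER) &&
                     wq_worker_last_func(task) == psi_update_work))
                wake_clock = false;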

    Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
    Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
    Signed-off-by: Johannes Weiner
    Reported-by: Suren Baghdasaryan
    Acked-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 May, 2018

2 commits

  • Work functions can use set_worker_desc() to improve the visibility of
    what the worker task is doing. Currently, the desc field is cleared at
    the beginning of each execution, and a separate field tracks whether
    desc has been set during the current execution.

    Instead of leaving worker->desc empty until desc is set, use it by
    default to remember the last workqueue the worker worked on; users of
    set_worker_desc() can override it with something more informative as
    necessary.

    This simplifies desc handling and helps track the last workqueue the
    worker executed on, improving visibility.
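
    A minimal sketch of the default, assuming the update lands in
    process_one_work():

        /* default desc to the workqueue name; set_worker_desc() may
         * overwrite it during this execution */
        strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN);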

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • For historical reasons, the worker attach/detach functions don't
    currently manage worker->pool and the callers are manually and
    inconsistently updating it.

    This patch moves worker->pool updates into the worker attach/detach
    functions. This makes worker->pool consistent and clearly defines how
    worker->pool updates are synchronized.

    This will help later workqueue visibility improvements by allowing
    safe access to workqueue information from worker->task.
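
    A hedged sketch of the attach side (lock naming follows the
    attach_mutex rename further down this log; the detach helper clears
    worker->pool symmetrically):

        static void worker_attach_to_pool(struct worker *worker,
                                          struct worker_pool *pool)
        {
                mutex_lock(&pool->attach_mutex);

                /* ... apply pool cpumask / UNBOUND state to the worker ... */

                list_add_tail(&worker->node, &pool->workers);
                worker->pool = pool;    /* updated here, consistently */

                mutex_unlock(&pool->attach_mutex);
        }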

    Signed-off-by: Tejun Heo

    Tejun Heo
     

07 Nov, 2017

1 commit

  • Pull workqueue fix from Tejun Heo:
    "Another fix for a really old bug.

    It only affects drain_workqueue() which isn't used often and even then
    triggers only during a pretty small race window, so it isn't too
    surprising that it stayed hidden for so long.

    The fix is straightforward and low-risk. Kudos to Li Bin for
    reporting and fixing the bug"

    * 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Fix NULL pointer dereference

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
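
    For a C source file the identifier goes on the first line, e.g.:

        // SPDX-License-Identifier: GPL-2.0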

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    should be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
      lines of source.
    - File already had some variant of a license header in it (even if <5
      lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

30 Oct, 2017

1 commit

  • When queue_work() is used in irq context (not in task context), there
    is a potential case that triggers a NULL pointer dereference.
    ----------------------------------------------------------------
    worker_thread()
    |-spin_lock_irq()
    |-process_one_work()
    |       |-worker->current_pwq = pwq
    |       |-spin_unlock_irq()
    |       |-worker->current_func(work)
    |       |-spin_lock_irq()
    |       |-worker->current_pwq = NULL
    |-spin_unlock_irq()
    |
    |               //interrupt here
    |               |-irq_handler
    |                       |-__queue_work()
    |                               //assuming that the wq is draining
    |                               |-is_chained_work(wq)
    |                                       |-current_wq_worker()
    |                                       //Here, 'current' is the interrupted worker!
    |                                       |-current->current_pwq is NULL here!
    |-schedule()
    ----------------------------------------------------------------

    Avoid this by checking for task context in current_wq_worker(): when
    not in task context, 'current' must not be used to check the
    condition.
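
    A sketch of the fix (the helper lives in workqueue_internal.h
    upstream):

        static inline struct worker *current_wq_worker(void)
        {
                if (in_task() && (current->flags & PF_WQ_WORKER))
                        return kthread_data(current);
                return NULL;
        }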

    Reported-by: Xiaofei Tan
    Signed-off-by: Li Bin
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
    Cc: stable@vger.kernel.org # v3.9+

    Li Bin
     

20 May, 2014

2 commits

  • manager_mutex is only used to protect attaching to the pool and the
    pool->workers list. It protects pool->workers and the operations based
    on this list, such as:

    - CPU binding for the workers in pool->workers
    - setting/clearing WORKER_UNBOUND

    So let's rename manager_mutex to attach_mutex to better reflect its
    role. This patch is a pure rename.

    tj: Minor comment and description updates.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     
  • worker_idr has two duties: iterating over attached workers and
    allocating worker IDs. These duties don't have to be tied together;
    we can separate them and use a list for tracking attached workers and
    iterating over them.

    Before this separation, rescuer workers couldn't be added to
    worker_idr because they can't allocate IDs dynamically: ID allocation
    depends on memory allocation, which a rescuer must not depend on.

    After the separation, rescuer workers can easily be added to the list
    for iteration without any memory allocation. This is required when we
    attach the rescuer worker to the pool in a later patch.

    tj: Minor description updates.
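
    A minimal sketch of the list-based tracking (field names as merged;
    the lockdep annotation in the real iteration macro is omitted):

        struct worker_pool {
                /* ... */
                struct list_head        workers;  /* all attached workers */
        };

        struct worker {
                /* ... */
                struct list_head        node;     /* anchored at pool->workers */
        };

        #define for_each_pool_worker(worker, pool) \
                list_for_each_entry((worker), &(pool)->workers, node)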

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

19 Jun, 2013

1 commit

  • Most of the code from kernel/sched.c was moved to kernel/sched/core.c a
    long time back and the comments/Documentation never got updated.

    I noticed this while going through sched-domains.txt, and so thought of
    fixing it globally.

    I haven't cross-checked whether everything these files reference in
    sched/core.c is still present and unchanged, as that wasn't the motive
    behind this patch.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

01 May, 2013

1 commit

  • One of the problems that arises when converting a dedicated custom
    threadpool to workqueue is that the shared worker pool used by workqueue
    anonymizes each worker, making it more difficult to identify what the
    worker was doing on which target from the output of sysrq-t or a debug
    dump from oops, BUG() and friends.

    This patch implements set_worker_desc() which can be called from any
    workqueue work function to set its description. When the worker task is
    dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
    and so on - the description will be printed out together with the
    workqueue name and the worker function pointer.

    The printing side is implemented by print_worker_info() which is called
    from functions in task dump paths - sched_show_task() and
    dump_stack_print_info(). print_worker_info() can be safely called on
    any task in any state as long as the task struct itself is accessible.
    It uses probe_*() functions to access worker fields. It may print
    garbage if something went very wrong, but it wouldn't cause (another)
    oops.

    The description is currently limited to 24 bytes including the
    terminating \0. worker->desc_valid and worker->desc[] are added, and
    the 64-byte marker, which was already incorrect before adding the new
    fields, is moved to the correct position.
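
    As a usage sketch (modeled on the writeback conversion that produces
    the dump below; exact field names are illustrative):

        static void bdi_writeback_workfn(struct work_struct *work)
        {
                struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                                        struct bdi_writeback,
                                                        dwork);

                /* label this worker for sysrq-t/WARN/oops dumps */
                set_worker_desc("flush-%s", dev_name(wb->bdi->dev));

                /* ... perform the actual writeback ... */
        }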

    Here's an example dump with writeback updated to set the bdi name as
    worker desc.

    Hardware name: Bochs
    Modules linked in:
    Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
    Workqueue: writeback bdi_writeback_workfn (flush-8:0)
    ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
    ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
    ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x7f/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] bdi_writeback_workfn+0x2a0/0x3b0
    ...

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Acked-by: Jan Kara
    Cc: Oleg Nesterov
    Cc: Jens Axboe
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

20 Mar, 2013

1 commit

  • Rebinding workers of a per-cpu pool after a CPU comes online involves
    a lot of back-and-forth mostly because only the task itself could
    adjust CPU affinity if PF_THREAD_BOUND was set.

    As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
    coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
    to the various states a worker can be in, this led to three different
    paths through which a worker may be rebound: worker->rebind_work is
    queued to busy workers; idle ones are signaled by unlinking
    worker->entry and then call idle_worker_rebind() themselves; the
    manager isn't covered by either and implements its own mechanism.

    PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
    itself can now manipulate the CPU affinity of workers. This patch
    replaces the existing rebind mechanism with direct one where
    CPU_ONLINE iterates over all workers using for_each_pool_worker(),
    restores CPU affinity, and clears WORKER_UNBOUND.

    There are a couple subtleties. All bound idle workers should have
    their runqueues set to that of the bound CPU; however, if the target
    task isn't running, set_cpus_allowed_ptr() just updates the
    cpus_allowed mask deferring the actual migration to when the task
    wakes up. This is worked around by waking up idle workers after
    restoring CPU affinity before any workers can become bound.

    Another subtlety stems from matching @pool->nr_running with the
    number of running unbound workers. While DISASSOCIATED, all workers
    are unbound and nr_running is zero. As workers become bound again,
    nr_running needs to be adjusted accordingly; however, there is no good
    way to tell whether a given worker is running without poking into
    scheduler internals. Instead of clearing UNBOUND directly,
    rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
    REBOUND, which will later be cleared by the workers themselves while
    preparing for the next round of work item execution. The only change
    needed for the workers is clearing REBOUND along with PREP.

    * This patch leaves for_each_busy_worker() without any user. Removed.

    * idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
    and rebind logic in manage_workers() removed.

    * worker_thread() now looks at WORKER_DIE instead of testing whether
    @worker->entry is empty to determine whether it needs to do
    something special as dying is the only special thing now.
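
    A condensed sketch of the new direct mechanism (close to the merged
    rebind_workers(); locking assertions and sanity checks trimmed):

        static void rebind_workers(struct worker_pool *pool)
        {
                struct worker *worker;

                /* restore CPU affinity of all workers first */
                for_each_pool_worker(worker, pool)
                        WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
                                                pool->attrs->cpumask) < 0);

                spin_lock_irq(&pool->lock);
                for_each_pool_worker(worker, pool) {
                        unsigned int worker_flags = worker->flags;

                        /* wake idle workers so the restored affinity is
                         * applied before they can become bound */
                        if (worker_flags & WORKER_IDLE)
                                wake_up_process(worker->task);

                        /* swap UNBOUND for REBOUND; the worker clears
                         * REBOUND itself along with PREP */
                        worker_flags |= WORKER_REBOUND;
                        worker_flags &= ~WORKER_UNBOUND;
                        ACCESS_ONCE(worker->flags) = worker_flags;
                }
                spin_unlock_irq(&pool->lock);
        }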

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     

13 Mar, 2013

1 commit

  • Workqueue is mixing unsigned int and int for @cpu variables. There's
    no point in using unsigned int for cpus - many cpu-related APIs
    take int anyway. Consistently use int for @cpu variables so that we
    can use negative values to mark special ones.

    This patch doesn't introduce any visible behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     

05 Mar, 2013

1 commit

  • Rescuers visit different worker_pools to process work items from pools
    under pressure. Currently, rescuer->pool is updated outside any
    locking and when an outsider looks at a rescuer, there's no way to
    tell when and whether rescuer->pool is going to change. While this
    doesn't currently cause any problem, it is nasty.

    With recent worker_maybe_bind_and_lock() changes, we can move
    rescuer->pool updates inside pool locks such that if rescuer->pool
    equals a locked pool, it's guaranteed to stay that way until the pool
    is unlocked.

    Move rescuer->pool updates inside pool->lock.

    This patch doesn't introduce any visible behavior difference.

    tj: Updated the description.
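
    A sketch of the resulting pattern in rescuer_thread() (condensed;
    worker_maybe_bind_and_lock() returns with pool->lock held):

        /* migrate to the pool's CPU if possible; returns with
         * pool->lock held, so the update is covered by the lock */
        worker_maybe_bind_and_lock(pool);
        rescuer->pool = pool;

        /* ... process the pool's mayday'd work items ... */

        rescuer->pool = NULL;
        spin_unlock_irq(&pool->lock);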

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
     

14 Feb, 2013

1 commit

  • workqueue has moved away from global_cwqs to worker_pools, and with the
    scheduled custom worker pools, workqueues will be associated with
    pools which don't have anything to do with CPUs. The workqueue code
    went through a significant amount of change recently and mass renaming
    isn't likely to hurt much additionally. Let's replace 'cpu' with
    'pool' so that the naming reflects the current design.

    * s/struct cpu_workqueue_struct/struct pool_workqueue/
    * s/cpu_wq/pool_wq/
    * s/cwq/pwq/

    This patch is purely cosmetic.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

25 Jan, 2013

1 commit

  • global_cwq is now nothing but a container for per-cpu standard
    worker_pools. Declare the worker pools directly as
    cpu/unbound_std_worker_pools[] and remove global_cwq.

    * ____cacheline_aligned_in_smp moved from global_cwq to worker_pool.
    This probably would have made sense even before this change as we
    want each pool to be aligned.

    * get_gcwq() is replaced with std_worker_pools() which returns the
    pointer to the standard pool array for a given CPU.

    * __alloc_workqueue_key() updated to use get_std_worker_pool() instead
    of open-coding pool determination.

    This is part of an effort to remove global_cwq and make worker_pool
    the top level abstraction, which in turn will help implementing worker
    pools with user-specified attributes.

    v2: Joonsoo pointed out that it'd be better to align struct worker_pool
    rather than the array so that every pool is aligned.
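
    A sketch of the replacement accessor (close to the merged version):

        static struct worker_pool *std_worker_pools(int cpu)
        {
                if (cpu != WORK_CPU_UNBOUND)
                        return per_cpu(cpu_std_worker_pools, cpu);
                else
                        return unbound_std_worker_pools;
        }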

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan
    Cc: Joonsoo Kim

    Tejun Heo
     

19 Jan, 2013

3 commits