19 Feb, 2014

1 commit

  • When a kworker should die, the kworker is notified through the
    WORKER_DIE flag instead of kthread_should_stop(). This, IIRC, is
    primarily to keep the test synchronized under the worker_pool lock.
    WORKER_DIE is first set while holding pool->lock, the lock is
    dropped, and kthread_stop() is called.

    Unfortunately, this means that there's a slight chance that the
    target kworker may see WORKER_DIE before kthread_stop() finishes,
    exit on its own, and have its task struct freed before or during
    kthread_stop().

    Fix it by pinning the target task before setting WORKER_DIE and
    putting it after kthread_stop() is done (see the sketch below).

    tj: Improved patch description and comment. Moved pinning above the
    WORKER_DIE setting to better signify what it's protecting.

    CC: stable@vger.kernel.org
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo

    Lai Jiangshan
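
    A minimal sketch of the idea (simplified from the actual patch;
    names follow kernel/workqueue.c but details are approximate):

        spin_lock_irq(&pool->lock);
        /* Pin the task first so its task_struct can't be freed while
         * kthread_stop() still references it, even if the worker sees
         * WORKER_DIE and exits on its own. */
        get_task_struct(worker->task);
        worker->flags |= WORKER_DIE;
        spin_unlock_irq(&pool->lock);

        kthread_stop(worker->task);
        put_task_struct(worker->task);  /* release the pin */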
     

16 Dec, 2013

1 commit

  • Pull PCI updates from Bjorn Helgaas:
    "PCI device hotplug
    - Move device_del() from pci_stop_dev() to pci_destroy_dev() (Rafael
    Wysocki)

    Host bridge drivers
    - Update maintainers for DesignWare, i.MX6, Armada, R-Car (Bjorn
    Helgaas)
    - mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
    (Jason Gunthorpe)

    Miscellaneous
    - Avoid unnecessary CPU switch when calling .probe() (Alexander
    Duyck)
    - Revert "workqueue: allow work_on_cpu() to be called recursively"
    (Bjorn Helgaas)
    - Disable Bus Master only on kexec reboot (Khalid Aziz)
    - Omit PCI ID macro strings to shorten quirk names for LTO (Michal
    Marek)"

    * tag 'pci-v3.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
    MAINTAINERS: Add DesignWare, i.MX6, Armada, R-Car PCI host maintainers
    PCI: Disable Bus Master only on kexec reboot
    PCI: mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
    PCI: Omit PCI ID macro strings to shorten quirk names
    PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()
    Revert "workqueue: allow work_on_cpu() to be called recursively"
    PCI: Avoid unnecessary CPU switch when calling driver .probe() method

    Linus Torvalds
     

26 Nov, 2013

1 commit

  • This reverts commit c2fda509667b0fda4372a237f5a59ea4570b1627.

    c2fda509667b removed lockdep annotation from work_on_cpu() to work around
    the PCI path that calls work_on_cpu() from within a work_on_cpu() work item
    (PF driver .probe() method -> pci_enable_sriov() -> add VFs -> VF driver
    .probe method).

    961da7fb6b22 ("PCI: Avoid unnecessary CPU switch when calling driver
    .probe() method) avoids that recursive work_on_cpu() use in a different
    way, so this revert restores the work_on_cpu() lockdep annotation.

    Signed-off-by: Bjorn Helgaas
    Acked-by: Tejun Heo

    Bjorn Helgaas
     

23 Nov, 2013

4 commits

  • When a work item starts execution, the high bits of its data field
    contain the pool ID; the largest value the field can represent is
    WORK_OFFQ_POOL_NONE. WORK_OFFQ_POOL_NONE is assigned as the pool ID
    when a work item is initialized, indicating that no pool is
    associated, and get_work_pool() uses it to check for the associated
    pool. So if worker_pool_assign_id() assigns an ID greater than or
    equal to WORK_OFFQ_POOL_NONE to a pool, the encoding leaks into the
    "no pool" value and the non-reentrance guarantee may be broken.

    This patch fixes the issue by making worker_pool_assign_id() call
    idr_alloc() with WORK_OFFQ_POOL_NONE as the @end parameter (see the
    sketch below).

    Furthermore, the BUILD_BUG_ON() in init_workqueues() makes no sense
    in the current implementation: the number of worker pools needed
    cannot be determined at compile time because the number of backing
    pools for UNBOUND workqueues varies with the assigned custom
    attributes. So remove it.

    tj: Minor comment and indentation updates.

    Signed-off-by: Li Bin
    Signed-off-by: Tejun Heo

    Li Bin
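
    A minimal sketch of the fix (hypothetical simplification of
    worker_pool_assign_id()):

        lockdep_assert_held(&wq_pool_mutex);

        /* cap @end at WORK_OFFQ_POOL_NONE so a valid pool ID can never
         * collide with the "no pool associated" encoding */
        ret = idr_alloc(&worker_pool_idr, pool, 0, WORK_OFFQ_POOL_NONE,
                        GFP_KERNEL);
        if (ret >= 0) {
                pool->id = ret;
                return 0;
        }
        return ret;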
     
  • It seems the "dying" should be "draining" here.

    Signed-off-by: Li Bin
    Signed-off-by: Tejun Heo

    Li Bin
     
  • An ordered workqueue implements execution ordering by using a single
    pool_workqueue with max_active == 1. On a given pool_workqueue, work
    items are processed in FIFO order, and limiting max_active to 1
    forces the queued work items to be processed one by one.

    Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
    unbound workqueues") accidentally broke this guarantee by applying
    NUMA affinity to ordered workqueues too. On NUMA setups, an ordered
    workqueue would end up with separate pool_workqueues for different
    nodes. Each pool_workqueue still limits max_active to 1 but multiple
    work items may be executed concurrently and out of order depending on
    which node they are queued to.

    Fix it by using dedicated ordered_wq_attrs[] when creating ordered
    workqueues. The new attrs match the unbound ones except that no_numa
    is always set, forcing all NUMA nodes to share the default
    pool_workqueue (see the sketch below).

    While at it, add a sanity check to the workqueue creation path which
    verifies that an ordered workqueue has only the default
    pool_workqueue.

    Signed-off-by: Tejun Heo
    Reported-by: Libin
    Cc: stable@vger.kernel.org
    Cc: Lai Jiangshan

    Tejun Heo
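
    A rough sketch of the idea (simplified; error handling omitted; the
    real code lives in init_workqueues() and the workqueue creation
    path):

        int i;

        /* one attrs per std nice level, mirroring unbound_std_wq_attrs */
        for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
                struct workqueue_attrs *attrs;

                attrs = alloc_workqueue_attrs(GFP_KERNEL);
                attrs->nice = std_nice[i];
                attrs->no_numa = true;  /* all nodes share the dfl pwq */
                ordered_wq_attrs[i] = attrs;
        }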
     
  • Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed()
    in create_worker(); otherwise userland can change ->cpus_allowed
    in between (see the sketch below).

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
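
    Roughly, the reordering in create_worker() (hypothetical
    simplification):

        /* Set PF_NO_SETAFFINITY before binding the worker so userland
         * has no window in which to change ->cpus_allowed. */
        worker->task->flags |= PF_NO_SETAFFINITY;
        set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);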
     

07 Sep, 2013

1 commit

  • Pull trivial tree from Jiri Kosina:
    "The usual trivial updates all over the tree -- mostly typo fixes and
    documentation updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
    doc: Documentation/cputopology.txt fix typo
    treewide: Convert retrun typos to return
    Fix comment typo for init_cma_reserved_pageblock
    Documentation/trace: Correcting and extending tracepoint documentation
    mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
    power: Documentation: Update s2ram link
    doc: fix a typo in Documentation/00-INDEX
    Documentation/printk-formats.txt: No casts needed for u64/s64
    doc: Fix typo "is is" in Documentations
    treewide: Fix printks with 0x%#
    zram: doc fixes
    Documentation/kmemcheck: update kmemcheck documentation
    doc: documentation/hwspinlock.txt fix typo
    PM / Hibernate: add section for resume options
    doc: filesystems : Fix typo in Documentations/filesystems
    scsi/megaraid fixed several typos in comments
    ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
    treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
    page_isolation: Fix a comment typo in test_pages_isolated()
    doc: fix a typo about irq affinity
    ...

    Linus Torvalds
     

04 Sep, 2013

2 commits

  • Pull workqueue updates from Tejun Heo:
    "Nothing interesting. All are doc / comment updates"

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: Correct/Drop references to gcwq in Documentation
    workqueue: Fix manage_workers() RETURNS description
    workqueue: Comment correction in file header
    workqueue: mark WQ_NON_REENTRANT deprecated

    Linus Torvalds
     
  • Pull driver core patches from Greg KH:
    "Here's the big driver core pull request for 3.12-rc1.

    Lots of tiny changes here fixing up the way sysfs attributes are
    created, to try to make drivers simpler, and to fix a whole class of
    race conditions with creation of device attributes after the device
    was announced to userspace.

    All the various pieces are acked by the different subsystem
    maintainers"

    * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
    firmware loader: fix pending_fw_head list corruption
    drivers/base/memory.c: introduce help macro to_memory_block
    dynamic debug: line queries failing due to uninitialized local variable
    sysfs: sysfs_create_groups returns a value.
    debugfs: provide debugfs_create_x64() when disabled
    rbd: convert bus code to use bus_groups
    firmware: dcdbas: use binary attribute groups
    sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
    driver core: add #include to core files.
    HID: convert bus code to use dev_groups
    Input: serio: convert bus code to use drv_groups
    Input: gameport: convert bus code to use drv_groups
    driver core: firmware: use __ATTR_RW()
    driver core: core: use DEVICE_ATTR_RO
    driver core: bus: use DRIVER_ATTR_WO()
    driver core: create write-only attribute macros for devices and drivers
    sysfs: create __ATTR_WO()
    driver-core: platform: convert bus code to use dev_groups
    workqueue: convert bus code to use dev_groups
    MEI: convert bus code to use dev_groups
    ...

    Linus Torvalds
     

29 Aug, 2013

1 commit

  • If !PREEMPT, a kworker running work items back to back can hog the
    CPU. This becomes dangerous when a self-requeueing work item which is
    waiting for something to happen races against stop_machine. Such a
    self-requeueing work item would requeue itself indefinitely, hogging
    the kworker and the CPU it's running on, while stop_machine would
    wait for that CPU to enter stop_machine while preventing anything
    else from happening on all other CPUs. The two would deadlock.

    Jamie Liu reports that this deadlock scenario exists around
    scsi_requeue_run_queue() and libata port multiplier support, where one
    port may exclude command processing from other ports. With the right
    timing, scsi_requeue_run_queue() can end up requeueing itself trying
    to execute an IO which is asked to be retried while another device has
    an exclusive access, which in turn can't make forward progress due to
    stop_machine.

    Fix it by invoking cond_resched() after executing each work item (see
    the sketch below).

    Signed-off-by: Tejun Heo
    Reported-by: Jamie Liu
    References: http://thread.gmane.org/gmane.linux.kernel/1552567
    Cc: stable@vger.kernel.org
    --
    kernel/workqueue.c | 9 +++++++++
    1 file changed, 9 insertions(+)

    Tejun Heo
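
    A minimal sketch of the fix (hypothetical simplification of the
    worker loop in kernel/workqueue.c):

        process_one_work(worker, work);
        /* Under !PREEMPT a kworker running items back to back never
         * yields; let other tasks (e.g. the stop_machine stopper) run. */
        cond_resched();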
     

20 Aug, 2013

1 commit

  • When building the htmldocs (in verbose mode), scripts/kernel-doc reports the
    following type of warnings:

    Warning(kernel/workqueue.c:653): No description found for return value of
    'get_work_pool'

    Fix them by:
    - Using "Return:" sections to introduce descriptions of return values
    - Adding some missing descriptions

    An illustrative "Return:" section is sketched at the end of this
    entry.

    Signed-off-by: Yacine Belkadi
    Signed-off-by: Jiri Kosina

    Yacine Belkadi
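
    For reference, a kernel-doc comment with a "Return:" section looks
    like this (illustrative example, not taken verbatim from the patch):

        /**
         * get_work_pool - return the worker_pool a given work was associated with
         * @work: the work item of interest
         *
         * Return: the worker_pool @work was last associated with, %NULL if none.
         */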
     

07 Aug, 2013

1 commit

  • Pull two workqueue fixes from Tejun Heo:
    "A lockdep notation update so that nested work_on_cpu() invocations
    don't lead to spurious lockdep warnings and fix for an unbound attr
    bug which made what's shown in sysfs deviate from the actual ones.
    Both patches have pretty limited scope"

    * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: copy workqueue_attrs with all fields
    workqueue: allow work_on_cpu() to be called recursively

    Linus Torvalds
     

01 Aug, 2013

1 commit

  • $ echo '0' > /sys/bus/workqueue/devices/xxx/numa
    $ cat /sys/bus/workqueue/devices/xxx/numa

    I got 1; it should be 0. The reason is that copy_workqueue_attrs(),
    called from apply_workqueue_attrs(), doesn't copy the no_numa field.

    Fix it by making copy_workqueue_attrs() copy ->no_numa too (see the
    sketch below). This would also make get_unbound_pool() set a pool's
    ->no_numa attribute according to the workqueue attributes used when
    the pool was created. While harmless, as ->no_numa isn't a pool
    attribute, this is a bit confusing. Clear it explicitly.

    tj: Updated description and comments a bit.

    Signed-off-by: Shaohua Li
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Shaohua Li
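
    A sketch of the fixed helper (simplified; the comment is mine):

        static void copy_workqueue_attrs(struct workqueue_attrs *to,
                                         const struct workqueue_attrs *from)
        {
                to->nice = from->nice;
                cpumask_copy(to->cpumask, from->cpumask);
                /* previously left uncopied, so writes to the sysfs
                 * "numa" file never stuck */
                to->no_numa = from->no_numa;
        }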
     

25 Jul, 2013

1 commit

  • If @fn calls work_on_cpu() again, lockdep will complain:

    > [ INFO: possible recursive locking detected ]
    > 3.11.0-rc1-lockdep-fix-a #6 Not tainted
    > ---------------------------------------------
    > kworker/0:1/142 is trying to acquire lock:
    > ((&wfc.work)){+.+.+.}, at: [] flush_work+0x0/0xb0
    >
    > but task is already holding lock:
    > ((&wfc.work)){+.+.+.}, at: [] process_one_work+0x169/0x610
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock((&wfc.work));
    > lock((&wfc.work));
    >
    > *** DEADLOCK ***

    This is a false-positive lockdep report. In this situation, the two
    "wfc"s of the two work_on_cpu() invocations are different; they are
    both on stack, so flush_work() can't deadlock.

    To fix this, avoid the lockdep checking in this case by introducing
    an internal __flush_work() which skips the lockdep annotation (see
    the sketch below).

    tj: Minor comment adjustment.

    Signed-off-by: Lai Jiangshan
    Reported-by: "Srivatsa S. Bhat"
    Reported-by: Alexander Duyck
    Signed-off-by: Tejun Heo

    Lai Jiangshan
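
    Roughly, the split looks like this (simplified; wq_barrier setup
    details omitted):

        static bool __flush_work(struct work_struct *work)
        {
                struct wq_barrier barr;

                if (start_flush_work(work, &barr)) {
                        wait_for_completion(&barr.done);
                        destroy_work_on_stack(&barr.work);
                        return true;
                }
                return false;
        }

        bool flush_work(struct work_struct *work)
        {
                /* the lockdep annotation stays here; work_on_cpu()
                 * calls __flush_work() directly and skips it */
                lock_map_acquire(&work->lockdep_map);
                lock_map_release(&work->lockdep_map);
                return __flush_work(work);
        }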
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

03 Jul, 2013

1 commit

  • Pull workqueue changes from Tejun Heo:
    "Surprisingly, Lai and I didn't break too many things implementing
    custom pools and stuff last time around and there aren't any follow-up
    changes necessary at this point.

    The only change in this pull request is Viresh's patches to make some
    per-cpu workqueues to behave as unbound workqueues dependent on a boot
    param whose default can be configured via a config option. This leads
    to higher processing overhead / lower bandwidth as more work items are
    bounced across CPUs; however, it can lead to noticeable powersave in
    certain configurations - ~10% w/ idlish constant workload on a
    big.LITTLE configuration according to Viresh.

    This is because per-cpu workqueues interfere with how the scheduler
    perceives whether or not each CPU is idle by forcing pinned tasks on
    them, which makes the scheduler's power-aware scheduling decisions
    less effective.

    Its effectiveness is likely less pronounced on homogenous
    configurations and this type of optimization can probably be made
    automatic; however, the changes are pretty minimal and the affected
    workqueues are clearly marked, so it's an easy gain for some
    configurations for the time being with pretty unintrusive changes."

    * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    fbcon: queue work on power efficient wq
    block: queue work on power efficient wq
    PHYLIB: queue work on system_power_efficient_wq
    workqueue: Add system wide power_efficient workqueues
    workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues

    Linus Torvalds
     

16 May, 2013

1 commit

  • wq_numa_init() builds per-node cpumasks which are later used to make
    unbound workqueues NUMA-aware. The cpumasks are allocated using
    alloc_cpumask_var_node() for all possible nodes. Unfortunately, on
    machines with off-line nodes, this leads to NUMA-aware allocations on
    the offline nodes, which in turn trigger a BUG in the memory
    allocation code.

    Fix it by using NUMA_NO_NODE for cpumask allocations on offline
    nodes (sketched below the oops).

    kernel BUG at include/linux/gfp.h:323!
    invalid opcode: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1
    Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011
    task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000
    RIP: 0010:[] [] new_slab+0x2ad/0x340
    RSP: 0000:ffff880234603bf8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0
    RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0
    RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001
    R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0
    FS: 0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Stack:
    ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0
    ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1
    ffff880237816140 0000000000000000 ffff88023740e1c0
    Call Trace:
    [] __slab_alloc+0x330/0x4f2
    [] kmem_cache_alloc_node_trace+0xa5/0x200
    [] alloc_cpumask_var_node+0x28/0x90
    [] wq_numa_init+0x10d/0x1be
    [] init_workqueues+0x64/0x341
    [] do_one_initcall+0xea/0x1a0
    [] kernel_init_freeable+0xb7/0x1ec
    [] kernel_init+0xe/0xf0
    [] ret_from_fork+0x7c/0xb0
    Code: 45 84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee

    Signed-off-by: Tejun Heo
    Reported-and-Tested-by: Lingzhu Xiang

    Tejun Heo
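
    A minimal sketch of the fix in wq_numa_init() (simplified):

        for_each_node(node)
                BUG_ON(!alloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                                node_online(node) ? node : NUMA_NO_NODE));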
     

15 May, 2013

4 commits

  • Commit 8425e3d5bdbe ("workqueue: inline trivial wrappers") changed
    schedule_work() and schedule_delayed_work() into inline wrappers,
    but these rely on some symbols that are EXPORT_SYMBOL_GPL, while
    the original functions were EXPORT_SYMBOL. This has the effect of
    changing the licensing requirement for these functions and making
    them unavailable to non-GPL modules.

    Make them available again by removing the GPL restriction on the
    required symbols (see the sketch below).

    Signed-off-by: Marc Dionne
    Signed-off-by: Tejun Heo

    Marc Dionne
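
    The wrappers in question look roughly like this; the fix relaxes the
    export on the symbols they rely on (a sketch under the assumption
    that system_wq and the queue_*_on() helpers are the affected
    symbols):

        static inline bool schedule_work(struct work_struct *work)
        {
                return queue_work(system_wq, work);  /* needs system_wq */
        }

        /* was EXPORT_SYMBOL_GPL(system_wq); after the fix: */
        EXPORT_SYMBOL(system_wq);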
     
  • When we fail to mutex_trylock(), we release the pool spin_lock and do
    mutex_lock(). After that, we should regrab the pool spin_lock, but
    the regrab is missing in the current code. Correct it as sketched
    below.

    Cc: Lai Jiangshan
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Tejun Heo

    Joonsoo Kim
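
    A sketch of the corrected sequence (hypothetical simplification of
    the manage_workers() locking dance):

        if (!mutex_trylock(&pool->manager_mutex)) {
                spin_unlock_irq(&pool->lock);
                mutex_lock(&pool->manager_mutex);
                spin_lock_irq(&pool->lock);     /* this regrab was missing */
        }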
     
  • This patch adds system-wide workqueues aligned towards power saving.
    This is done by allocating them with the WQ_UNBOUND flag if
    'wq_power_efficient' is set to 'true'.

    tj: updated comments a bit.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Tejun Heo

    Viresh Kumar
     
  • Workqueues can be performance- or power-oriented. Currently, most
    workqueues are bound to the CPU they were created on. This gives good
    performance (due to cache effects) at the cost of potentially waking
    up otherwise idle cores (idle from the scheduler's perspective, which
    may or may not be physically idle) just to process some work. To save
    power, we can allow the work to be rescheduled on a core that is
    already awake.

    Workqueues created with the WQ_UNBOUND flag will allow some power
    savings. However, we don't change the default behaviour of the
    system. To enable power-saving behaviour, a new config option
    CONFIG_WQ_POWER_EFFICIENT needs to be turned on. This option can also
    be overridden by the workqueue.power_efficient boot parameter (a
    usage sketch follows).

    tj: Updated config description and comments. Renamed
    CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.

    Signed-off-by: Viresh Kumar
    Reviewed-by: Amit Kucheria
    Signed-off-by: Tejun Heo

    Viresh Kumar
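
    A usage sketch (illustrative; "my_wq" and my_work are hypothetical):

        /* per-cpu by default; becomes WQ_UNBOUND, and thus schedulable
         * on any awake CPU, when workqueue.power_efficient is enabled */
        struct workqueue_struct *wq =
                alloc_workqueue("my_wq", WQ_POWER_EFFICIENT, 0);

        queue_work(wq, &my_work);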
     

11 May, 2013

1 commit

  • df2d5ae499 ("workqueue: map an unbound workqueues to multiple per-node
    pool_workqueues") made unbound workqueues to map to multiple per-node
    pool_workqueues and accordingly updated workqueue_contested() so that,
    for unbound workqueues, it maps the specified @cpu to the NUMA node
    number to obtain the matching pool_workqueue to query the congested
    state.

    Before this change, workqueue_congested() ignored @cpu for unbound
    workqueues as there was only one pool_workqueue and some users
    (fscache) called it with WORK_CPU_UNBOUND. After the commit, this
    causes the following oops as WORK_CPU_UNBOUND gets translated to
    garbage by cpu_to_node().

    BUG: unable to handle kernel paging request at ffff8803598d98b8
    IP: [] unbound_pwq_by_node+0xa1/0xfa
    PGD 2421067 PUD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 2689 Comm: cat Tainted: GF 3.9.0-fsdevel+ #4
    task: ffff88003d801040 ti: ffff880025806000 task.ti: ffff880025806000
    RIP: 0010:[] [] unbound_pwq_by_node+0xa1/0xfa
    RSP: 0018:ffff880025807ad8 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff8800388a2400 RCX: 0000000000000003
    RDX: ffff880025807fd8 RSI: ffffffff81a31420 RDI: ffff88003d8016e0
    RBP: ffff880025807ae8 R08: ffff88003d801730 R09: ffffffffa00b4898
    R10: ffffffff81044217 R11: ffff88003d801040 R12: 0000000064206e97
    R13: ffff880036059d98 R14: ffff880038cc8080 R15: ffff880038cc82d0
    FS: 00007f21afd9c740(0000) GS:ffff88003d100000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffff8803598d98b8 CR3: 000000003df49000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Stack:
    ffff8800388a2400 0000000000000002 ffff880025807b18 ffffffff810442ce
    ffffffff81044217 ffff880000000002 ffff8800371b4080 ffff88003d112ec0
    ffff880025807b38 ffffffffa00810b0 ffff880036059d88 ffff880036059be8
    Call Trace:
    [] workqueue_congested+0xb7/0x12c
    [] fscache_enqueue_object+0xb2/0xe8 [fscache]
    [] __fscache_acquire_cookie+0x3b9/0x56c [fscache]
    [] nfs_fscache_set_inode_cookie+0xee/0x132 [nfs]
    [] do_open+0x9/0xd [nfs]
    [] do_dentry_open+0x175/0x24b
    [] finish_open+0x41/0x51

    Fix it by using smp_processor_id() if @cpu is WORK_CPU_UNBOUND
    (sketch below).

    Signed-off-by: Tejun Heo
    Reported-by: David Howells
    Tested-and-Acked-by: David Howells

    Tejun Heo
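
    The fix amounts to a tiny mapping at the top of workqueue_congested()
    (simplified):

        if (cpu == WORK_CPU_UNBOUND)
                cpu = smp_processor_id();  /* map "don't care" to a real CPU */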
     

01 May, 2013

1 commit

  • One of the problems that arise when converting a dedicated custom
    threadpool to workqueue is that the shared worker pool used by
    workqueue anonymizes each worker, making it more difficult to identify
    what the worker was doing on which target from the output of sysrq-t
    or a debug dump from oops, BUG() and friends.

    This patch implements set_worker_desc() which can be called from any
    workqueue work function to set its description. When the worker task
    is dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep
    assertion and so on - the description will be printed out together
    with the workqueue name and the worker function pointer (see the
    usage sketch at the end of this entry).

    The printing side is implemented by print_worker_info() which is called
    from functions in task dump paths - sched_show_task() and
    dump_stack_print_info(). print_worker_info() can be safely called on
    any task in any state as long as the task struct itself is accessible.
    It uses probe_*() functions to access worker fields. It may print
    garbage if something went very wrong, but it wouldn't cause (another)
    oops.

    The description is currently limited to 24 bytes including the
    terminating \0. worker->desc_valid and worker->desc[] are added, and
    the 64-byte marker, which was already incorrect before adding the new
    fields, is moved to the correct position.

    Here's an example dump with writeback updated to set the bdi name as
    worker desc.

    Hardware name: Bochs
    Modules linked in:
    Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
    Workqueue: writeback bdi_writeback_workfn (flush-8:0)
    ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
    ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
    ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] warn_slowpath_common+0x7f/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] bdi_writeback_workfn+0x2a0/0x3b0
    ...

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Acked-by: Jan Kara
    Cc: Oleg Nesterov
    Cc: Jens Axboe
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
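
    A usage sketch based on the writeback example above (simplified from
    bdi_writeback_workfn()):

        static void bdi_writeback_workfn(struct work_struct *work)
        {
                struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                        struct bdi_writeback, dwork);

                /* shows up as "(flush-8:0)" next to the workqueue name */
                set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
                /* ... actual writeback work ... */
        }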
     

04 Apr, 2013

1 commit

  • destroy_workqueue() performs several sanity checks before proceeding
    with destruction of a workqueue. One of the checks verifies that
    refcnt of each pwq (pool_workqueue) is over 1 as at that point there
    should be no in-flight work items and the only holder of pwq refs is
    the workqueue itself.

    This worked fine as a workqueue used to hold only one reference to its
    pwqs; however, since 4c16bd327c ("workqueue: implement NUMA affinity
    for unbound workqueues"), a workqueue may hold multiple references to
    its default pwq triggering this sanity check spuriously.

    Fix it by not triggering the pwq->refcnt assertion on default pwqs
    (sketch below).

    An example spurious WARN trigger follows.

    WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e()
    Hardware name: 4286C12
    Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video
    Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29
    Call Trace:
    [] warn_slowpath_common+0x7c/0x93
    [] warn_slowpath_null+0x22/0x24
    [] destroy_workqueue+0x6a/0x13e
    [] ext4_put_super+0x43/0x2c4
    [] generic_shutdown_super+0x4b/0xb9
    [] kill_block_super+0x22/0x60
    [] deactivate_locked_super+0x2f/0x56
    [] deactivate_super+0x2e/0x31
    [] mntput_no_expire+0x103/0x108
    [] sys_umount+0x2a2/0x2c4
    [] sys_oldumount+0x1e/0x20
    [] sysenter_do_call+0x12/0x38

    tj: Rewrote description.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu

    Lai Jiangshan
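
    Roughly, the refcnt part of the per-pwq sanity check becomes
    (hypothetical simplification):

        /* the default pwq may legitimately hold extra references */
        if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)))
                return;     /* bail out instead of destroying a live wq */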
     

02 Apr, 2013

9 commits

  • Writeback conversion to workqueue will be based on top of wq/for-3.10
    branch to take advantage of custom attrs and NUMA support for unbound
    workqueues. Mainline currently contains two commits which result in
    non-trivial merge conflicts with wq/for-3.10 and because
    block/for-3.10/core is based on v3.9-rc3 which contains one of the
    conflicting commits, we need a pre-merge-window merge anyway. Let's
    pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
    from workqueue merge conflicts.

    The two conflicts and their resolutions:

    * e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes
    worker_pool_assign_id() to use idr_alloc() instead of the old idr
    interface. worker_pool_assign_id() goes through multiple locking
    changes in wq/for-3.10 causing the following conflict.

    static int worker_pool_assign_id(struct worker_pool *pool)
    {
            int ret;

    <<<<<<< HEAD
            lockdep_assert_held(&wq_pool_mutex);

            do {
                    if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
                            return -ENOMEM;
                    ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
            } while (ret == -EAGAIN);
    =======
            mutex_lock(&worker_pool_idr_mutex);
            ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
            if (ret >= 0)
                    pool->id = ret;
            mutex_unlock(&worker_pool_idr_mutex);
    >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89

            return ret < 0 ? ret : 0;
    }

    We want locking from the former and idr_alloc() usage from the
    latter, which can be combined to the following.

    static int worker_pool_assign_id(struct worker_pool *pool)
    {
            int ret;

            lockdep_assert_held(&wq_pool_mutex);

            ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
            if (ret >= 0) {
                    pool->id = ret;
                    return 0;
            }
            return ret;
    }

    * eb2834285c ("workqueue: fix possible pool stall bug in
    wq_unbind_fn()") updated wq_unbind_fn() such that it has single
    larger for_each_std_worker_pool() loop instead of two separate loops
    with a schedule() call inbetween. wq/for-3.10 renamed
    pool->assoc_mutex to pool->manager_mutex causing the following
    conflict (earlier function body and comments omitted for brevity).

    static void wq_unbind_fn(struct work_struct *work)
    {
            ...
                    spin_unlock_irq(&pool->lock);
    <<<<<<< HEAD
                    mutex_unlock(&pool->manager_mutex);
            }
    =======
                    mutex_unlock(&pool->assoc_mutex);
    >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89

                    schedule();

    <<<<<<< HEAD
            for_each_cpu_worker_pool(pool, cpu)
    =======
    >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
                    atomic_set(&pool->nr_running, 0);

                    spin_lock_irq(&pool->lock);
                    wake_up_worker(pool);
                    spin_unlock_irq(&pool->lock);
            }
    }

    The resolution is mostly trivial. We want the control flow of the
    latter with the rename of the former.

    static void wq_unbind_fn(struct work_struct *work)
    {
            ...
                    spin_unlock_irq(&pool->lock);
                    mutex_unlock(&pool->manager_mutex);

                    schedule();

                    atomic_set(&pool->nr_running, 0);

                    spin_lock_irq(&pool->lock);
                    wake_up_worker(pool);
                    spin_unlock_irq(&pool->lock);
            }
    }

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • …o disable NUMA affinity

    Unbound workqueues are now NUMA aware. Let's add some control knobs
    and update sysfs interface accordingly.

    * Add kernel param workqueue.disable_numa which disables NUMA
    affinity globally.

    * Replace sysfs file "pool_id" with "pool_ids" which contain
    node:pool_id pairs. This change is userland-visible but "pool_id"
    hasn't seen a release yet, so this is okay.

    * Add a new sysfs file "numa" which can toggle NUMA affinity on
    individual workqueues. This is implemented as attrs->no_numa which
    is special in that it isn't part of a pool's attributes. It only
    affects how apply_workqueue_attrs() picks which pools to use.

    After "pool_ids" change, first_pwq() doesn't have any user left.
    Removed.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

    Tejun Heo
     
  • Currently, an unbound workqueue has a single current, or first, pwq
    (pool_workqueue) to which all new work items are queued. This often
    isn't optimal on NUMA machines as workers may jump around across node
    boundaries and work items get assigned to workers without any regard
    to NUMA affinity.

    This patch implements NUMA affinity for unbound workqueues. Instead
    of mapping all entries of numa_pwq_tbl[] to the same pwq,
    apply_workqueue_attrs() now creates a separate pwq covering the
    intersecting CPUs for each NUMA node which has online CPUs in
    @attrs->cpumask. Nodes which don't have intersecting possible CPUs
    are mapped to pwqs covering whole @attrs->cpumask.

    As CPUs come up and go down, the pool association is changed
    accordingly. Changing pool association may involve allocating new
    pools which may fail. To avoid failing CPU_DOWN, each workqueue
    always keeps a default pwq which covers whole attrs->cpumask which is
    used as fallback if pool creation fails during a CPU hotplug
    operation.

    This ensures that all work items issued on a NUMA node are executed
    on the same node as long as the workqueue allows execution on the
    CPUs of the node.

    As this maps a workqueue to multiple pwqs and max_active is per-pwq,
    this changes the behavior of max_active. The limit is now per NUMA
    node instead of global. While this is an actual change, max_active is
    already per-cpu for per-cpu workqueues and is primarily used as a
    safety mechanism rather than for active concurrency control.
    Concurrency is usually limited by workqueue users via the number of
    concurrently active work items, so this change shouldn't matter much.

    v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
    by Lai.

    v3: The previous version incorrectly made a workqueue spanning
    multiple nodes spread work items over all online CPUs when some of
    its nodes don't have any desired cpus. Reimplemented so that NUMA
    affinity is properly updated as CPUs go up and down. This problem
    was spotted by Lai Jiangshan.

    v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
    however, wq may be freed at any time after dfl_pwq is put making
    the clearing use-after-free. Clear wq->dfl_pwq before putting it.

    v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
    @pwq_tbl after success. Fixed.

    Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
    application of new attrs is excluded via CPU hotplug. Removed.

    Documentation on CPU affinity guarantee on CPU_DOWN added.

    All changes are suggested by Lai Jiangshan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Factor out lock pool, put_pwq(), unlock sequence into
    put_pwq_unlocked(). The two existing places are converted and there
    will be more with NUMA affinity support.

    This is to prepare for NUMA affinity support for unbound workqueues
    and doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
    from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq()
    is made safe to call multiple times. numa_pwq_tbl_install() links the
    pwq, installs it into numa_pwq_tbl[] at the specified node and returns
    the old entry.

    @last_pwq is removed from link_pwq() as the return value of the new
    function can be used instead.

    This is to prepare for NUMA affinity support for unbound workqueues.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Use kmem_cache_alloc_node() with @pool->node instead of
    kmem_cache_zalloc() when allocating a pool_workqueue so that it's
    allocated on the same node as the associated worker_pool. As there's
    no kmem_cache_zalloc_node(), move zeroing to init_pwq() (sketch
    below).

    This was suggested by Lai Jiangshan.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
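
    A sketch of the allocation change (simplified):

        /* was: pwq = kmem_cache_zalloc(pwq_cache, GFP_KERNEL); */
        pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);

        /* zeroing now happens in init_pwq() */
        memset(pwq, 0, sizeof(*pwq));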
     
  • Break init_and_link_pwq() into init_pwq() and link_pwq() and move
    unbound-workqueue specific handling into apply_workqueue_attrs().
    Also, factor out unbound pool and pool_workqueue allocation into
    alloc_unbound_pwq().

    This reorganization is to prepare for NUMA affinity and doesn't
    introduce any functional changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
     
  • Currently, an unbound workqueue has only one "current" pool_workqueue
    associated with it. It may have multiple pool_workqueues, but only
    the first pool_workqueue serves new work items. For NUMA affinity, we
    want to change this so that there are multiple current pool_workqueues
    serving different NUMA nodes.

    Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
    points to the pool_workqueue to use for each possible node (the
    lookup helper is sketched below). This replaces first_pwq() in
    __queue_work() and workqueue_congested().

    numa_pwq_tbl[] is currently initialized to point to the same
    pool_workqueue as first_pwq() so this patch doesn't make any behavior
    changes.

    v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
    may be called only with wq->mutex held.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo
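
    The lookup helper is tiny (roughly as merged; the comment is mine):

        static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
                                                          int node)
        {
                /* raw deref: callers hold the RCU read lock or wq->mutex */
                return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
        }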
     
  • Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and
    align them to the cacheline. These two fields are used in the work
    item issue path and thus hot. The scheduled NUMA affinity support
    will add a dispatch table at the end of workqueue_struct, and
    relocating these two fields will allow us to hit only a single
    cacheline on hot paths.

    Note that wq->pwqs isn't moved although it currently is being used in
    the work item issue path for unbound workqueues. The dispatch table
    mentioned above will replace its use in the issue path, so it will
    become cold once NUMA support is implemented.

    Signed-off-by: Tejun Heo
    Reviewed-by: Lai Jiangshan

    Tejun Heo