15 May, 2012

2 commits

  • Under memory load, on x86_64, with lockdep enabled, the workqueue's
    process_one_work() has been seen to oops in __lock_acquire(), barfing
    on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].

    Because it's permissible to free a work_struct from its callout function,
    the map used is an onstack copy of the map given in the work_struct: and
    that copy is made without any locking.

    Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
    "rep movsq" for that structure copy: which might race with a workqueue
    user's wait_on_work() doing lock_map_acquire() on the source of the
    copy, putting a pointer into the class_cache[], but only in time for
    the top half of that pointer to be copied to the destination map.

    Boom when process_one_work() subsequently does lock_map_acquire()
    on its onstack copy of the lockdep_map.

    Fix this, and a similar instance in call_timer_fn(), with a
    lockdep_copy_map() function which additionally NULLs the class_cache[].

    Note: this oops was actually seen on 3.4-next, where flush_work() newly
    does the racing lock_map_acquire(); but Tejun points out that 3.4 and
    earlier are already vulnerable to the same through wait_on_work().

    * Patch originally from Peter. Hugh modified it a bit and wrote the
    description.

    Signed-off-by: Peter Zijlstra
    Reported-by: Hugh Dickins
    LKML-Reference:
    Signed-off-by: Tejun Heo

    Peter Zijlstra
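
    A minimal sketch of the helper, based on the description above (field
    names from lockdep.h; not necessarily the verbatim patch):

    static inline void lockdep_copy_map(struct lockdep_map *d,
                                        struct lockdep_map *s)
    {
            int i;

            *d = *s;
            /*
             * The class cache may be written concurrently, so a
             * 32-bits-at-a-time copy can leave half a pointer behind;
             * start the copy with a clean cache instead.
             */
            for (i = 0; i < NR_LOCKDEP_CACHING_CLASSES; i++)
                    d->class_cache[i] = NULL;
    }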
     
  • worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running
    isn't zero when every worker is idle. This can trigger spuriously
    while a cpu is going down due to the way trustee sets %WORKER_ROGUE
    and zaps nr_running.

    It first sets %WORKER_ROGUE on all workers without updating
    nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and
    then zaps nr_running. If the last running worker enters idle
    in between, it would see stale nr_running which hasn't been zapped yet
    and trigger the WARN_ON_ONCE().

    Fix it by performing the sanity check iff the trustee is idle.

    Signed-off-by: Tejun Heo
    Reported-by: "Paul E. McKenney"
    Cc: stable@vger.kernel.org

    Tejun Heo
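
    A sketch of the adjusted sanity check; the gcwq fields and trustee state
    names below are assumptions from the description, not the exact hunk:

    /* worker_enter_idle(): only check nr_running once no trustee is active */
    WARN_ON_ONCE(gcwq->trustee_state == TRUSTEE_DONE &&
                 gcwq->nr_workers == gcwq->nr_idle &&
                 atomic_read(get_gcwq_nr_running(gcwq->cpu)));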
     

24 Apr, 2012

1 commit

  • If a workqueue is flushed with flush_work() lockdep checking can
    be circumvented. For example:

    static DEFINE_MUTEX(mutex);

    static void my_work(struct work_struct *w)
    {
            mutex_lock(&mutex);
            mutex_unlock(&mutex);
    }

    static DECLARE_WORK(work, my_work);

    static int __init start_test_module(void)
    {
            schedule_work(&work);
            return 0;
    }
    module_init(start_test_module);

    static void __exit stop_test_module(void)
    {
            mutex_lock(&mutex);
            flush_work(&work);
            mutex_unlock(&mutex);
    }
    module_exit(stop_test_module);

    would not always print a warning when flush_work() was called.
    In this trivial example nothing could go wrong since we are
    guaranteed module_init() and module_exit() don't run concurrently,
    but if the work item is scheduled asynchronously we could have a
    scenario where the work item is running just at the time flush_work()
    is called, resulting in a classic ABBA locking problem.

    Add a lockdep hint by acquiring and releasing the work item
    lockdep_map in flush_work() so that we always catch this
    potential deadlock scenario.

    Signed-off-by: Stephen Boyd
    Reviewed-by: Yong Zhang
    Signed-off-by: Tejun Heo

    Stephen Boyd
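
    The hint amounts to touching the work item's lockdep_map at the top of
    flush_work(), roughly:

    /* early in flush_work(), before waiting for the work item */
    lock_map_acquire(&work->lockdep_map);
    lock_map_release(&work->lockdep_map);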
     

17 Apr, 2012

1 commit

  • This BUG_ON() can be triggered if you call schedule_work() before
    calling INIT_WORK(). It is definitely a bug, but it's nicer to just
    print a stack trace and return.

    Reported-by: Matt Renzelmann
    Signed-off-by: Dan Carpenter
    Signed-off-by: Tejun Heo

    Dan Carpenter
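
    The shape of the change is an assertion turned into a warning plus an
    early return; the placement below is hypothetical:

    /* where the old BUG_ON() sat in the queueing path */
    if (WARN_ON_ONCE(!list_empty(&work->entry)))
            return;         /* print a stack trace once, then bail out */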
     

21 Mar, 2012

1 commit


13 Mar, 2012

1 commit


02 Mar, 2012

1 commit

  • This patch (as1519) fixes a bug in the block layer's disk-events
    polling. The polling is done by a work routine queued on the
    system_nrt_wq workqueue. Since that workqueue isn't freezable, the
    polling continues even in the middle of a system sleep transition.

    Obviously, polling a suspended drive for media changes and such isn't
    a good thing to do; in the case of USB mass-storage devices it can
    lead to real problems requiring device resets and even re-enumeration.

    The patch fixes things by creating a new system-wide, non-reentrant,
    freezable workqueue and using it for disk-events polling.

    Signed-off-by: Alan Stern
    CC:
    Acked-by: Tejun Heo
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Jens Axboe

    Alan Stern
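
    A sketch of the idea; the workqueue identifier and the polling variables
    below are guesses from the description, not the actual patch:

    /* a system-wide, non-reentrant, freezable workqueue */
    system_nrt_freezable_wq = alloc_workqueue("events_nrt_freezable",
                                              WQ_NON_REENTRANT | WQ_FREEZABLE, 0);

    /* disk-events polling then queues here instead of on system_nrt_wq */
    queue_delayed_work(system_nrt_freezable_wq, &ev->dwork, intv);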
     

11 Jan, 2012

1 commit

  • alloc_workqueue() currently expects the passed in @name pointer to remain
    accessible. This is inconvenient and a bit silly given that the whole wq
    is being dynamically allocated. This patch updates alloc_workqueue() and
    friends to take a printf format string and matching varargs instead of an
    opaque string. The name is allocated together with the wq and formatted.

    alloc_ordered_workqueue() is converted to a macro to unify varargs
    handling with alloc_workqueue(), and, while at it, add comment to
    alloc_workqueue().

    None of the current in-kernel users pass in a string containing '%' as a
    constant name, so this change shouldn't cause any problems.

    [akpm@linux-foundation.org: use __printf]
    Signed-off-by: Tejun Heo
    Suggested-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
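
    With the change a caller can format the name in place, for example
    (illustrative caller; dev_id is a made-up argument):

    struct workqueue_struct *wq;

    /* the formatted name is allocated and stored together with the wq */
    wq = alloc_workqueue("mydev_wq/%d", WQ_UNBOUND, 1, dev_id);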
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

15 Sep, 2011

1 commit

  • Take cwq->gcwq->lock to avoid racing between drain_workqueue checking to
    make sure the workqueues are empty and cwq_dec_nr_in_flight decrementing
    and then incrementing nr_active when it activates a delayed work.

    We discovered this when a corner case in one of our drivers resulted in
    us trying to destroy a workqueue in which the remaining work would
    always requeue itself again in the same workqueue. We would hit this
    race condition and trip the BUG_ON on workqueue.c:3080.

    Signed-off-by: Thomas Tuttle
    Acked-by: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Tuttle
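
    The fix amounts to doing the per-cwq emptiness check under the gcwq lock
    in drain_workqueue(), along these lines (a sketch, not the exact hunk):

    spin_lock_irq(&cwq->gcwq->lock);
    drained = !cwq->nr_active && list_empty(&cwq->delayed_works);
    spin_unlock_irq(&cwq->gcwq->lock);

    if (drained)
            continue;       /* this cwq is empty, check the next one */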
     

23 Jul, 2011

1 commit


25 May, 2011

1 commit


20 May, 2011

1 commit

  • There are users which want to drain workqueues without destroying them.
    Separate out drain functionality from destroy_workqueue() into
    drain_workqueue() and make it accessible to workqueue users.

    To guarantee forward-progress, only chain queueing is allowed while
    drain is in progress. If a new work item which isn't chained from the
    running or pending work items is queued while draining is in progress,
    WARN_ON_ONCE() is triggered.

    Signed-off-by: Tejun Heo
    Cc: James Bottomley

    Tejun Heo
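
    Typical use, per the description above (illustrative):

    /* wait until nothing, including chained work, remains; wq stays usable */
    drain_workqueue(wq);

    /* destroy_workqueue() continues to drain internally before tearing down */
    destroy_workqueue(wq);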
     

30 Apr, 2011

1 commit

  • If a rescuer and stop_machine() bringing down a CPU race with each
    other, they may deadlock on a non-preemptive kernel. The CPU won't
    accept a new task, so the rescuer can't migrate to the target CPU,
    while stop_machine() can't proceed because the rescuer is holding one
    of the CPUs while retrying migration. GCWQ_DISASSOCIATED is never
    cleared and worker_maybe_bind_and_lock() retries indefinitely.

    This problem can be reproduced semi-reliably while the system is
    entering suspend.

    http://thread.gmane.org/gmane.linux.kernel/1122051

    A lot of kudos to Thilo-Alexander for reporting this tricky issue and
    painstaking testing.

    stable: This affects all kernels with cmwq, so all kernels since and
    including v2.6.36 need this fix.

    Signed-off-by: Tejun Heo
    Reported-by: Thilo-Alexander Ginkel
    Tested-by: Thilo-Alexander Ginkel
    Cc: stable@kernel.org

    Tejun Heo
     

25 Mar, 2011

1 commit

  • The percpu allocator honors alignment requests up to PAGE_SIZE, and both
    the percpu addresses in the percpu address space and the translated
    kernel addresses should be aligned accordingly. The calculation of the
    former depends on the alignment of the percpu output section in the
    kernel image.

    The linker script macros PERCPU_VADDR() and PERCPU() are used to
    define this output section, and the latter takes an @align parameter.
    Several architectures use an @align smaller than PAGE_SIZE, breaking
    percpu memory alignment.

    This patch removes @align parameter from PERCPU(), renames it to
    PERCPU_SECTION() and makes it always align to PAGE_SIZE. While at it,
    add PCPU_SETUP_BUG_ON() checks such that alignment problems are
    reliably detected and remove percpu alignment comment recently added
    in workqueue.c as the condition would trigger BUG way before reaching
    there.

    For um, this patch raises the alignment of percpu area. As the area
    is in .init, there shouldn't be any noticeable difference.

    This problem was discovered by David Howells while debugging boot
    failure on mn10300.

    Signed-off-by: Tejun Heo
    Acked-by: Mike Frysinger
    Cc: uclinux-dist-devel@blackfin.uclinux.org
    Cc: David Howells
    Cc: Jeff Dike
    Cc: user-mode-linux-devel@lists.sourceforge.net

    Tejun Heo
     

23 Mar, 2011

1 commit

  • ksoftirqd, kworker, migration, and pktgend kthreads can be created with
    kthread_create_on_node(), to get proper NUMA affinities for their stack and
    task_struct.

    Signed-off-by: Eric Dumazet
    Acked-by: David S. Miller
    Reviewed-by: Andi Kleen
    Acked-by: Rusty Russell
    Acked-by: Tejun Heo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: David Howells
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
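
    For example, workqueue's worker creation can pass the node of the CPU the
    worker will be bound to; a sketch assuming a bound gcwq, where id is the
    worker's numeric id (the exact call site may differ):

    worker->task = kthread_create_on_node(worker_thread, worker,
                                          cpu_to_node(gcwq->cpu),
                                          "kworker/%u:%d", gcwq->cpu, id);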
     

16 Mar, 2011

1 commit

  • * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix build failure introduced by s/freezeable/freezable/
    workqueue: add system_freezeable_wq
    rds/ib: use system_wq instead of rds_ib_fmr_wq
    net/9p: replace p9_poll_task with a work
    net/9p: use system_wq instead of p9_mux_wq
    xfs: convert to alloc_workqueue()
    reiserfs: make commit_wq use the default concurrency level
    ocfs2: use system_wq instead of ocfs2_quota_wq
    ext4: convert to alloc_workqueue()
    scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
    scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
    misc/iwmc3200top: use system_wq instead of dedicated workqueues
    i2o: use alloc_workqueue() instead of create_workqueue()
    acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
    fs/aio: aio_wq isn't used in memory reclaim path
    input/tps6507x-ts: use system_wq instead of dedicated workqueue
    cpufreq: use system_wq instead of dedicated workqueues
    wireless/ipw2x00: use system_wq instead of dedicated workqueues
    arm/omap: use system_wq in mailbox
    workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER

    Linus Torvalds
     

08 Mar, 2011

1 commit

  • In complex subsystems like mac80211 structures can contain several
    timers and work structs, so identifying a specific instance from the
    call trace and object type output of debugobjects can be hard.

    Allow the subsystems which support debugobjects to provide a hint
    function. This function returns a pointer to a kernel address
    (preferably the object's callback function) which is printed along
    with the debugobjects type.

    Add hint methods for timer_list, work_struct and hrtimer.

    [ tglx: Massaged changelog, made it compile ]

    Signed-off-by: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Stanislaw Gruszka
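
    For work_struct the hint can simply return the callback; a sketch assuming
    the debug_obj_descr hook is named .debug_hint:

    static void *work_debug_hint(void *addr)
    {
            return ((struct work_struct *)addr)->func;
    }

    static struct debug_obj_descr work_debug_descr = {
            .name           = "work_struct",
            .debug_hint     = work_debug_hint,
            /* ... existing fixup callbacks ... */
    };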
     

21 Feb, 2011

2 commits


17 Feb, 2011

2 commits

  • MAYDAY_INITIAL_TIMEOUT is defined as HZ / 100 and depending on
    configuration may end up 0 or 1. Even when it's 1, depending on when
    the mayday timer is added in the current jiffy interval, it may expire
    way before a jiffy has passed.

    Make sure MAYDAY_INITIAL_TIMEOUT is at least two to guarantee that at
    least a full jiffy has passed before calling rescuers.

    Signed-off-by: Tejun Heo
    Reported-by: Ray Jui
    Cc: stable@kernel.org

    Tejun Heo
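
    The fix is a clamp on the constant in workqueue.c, roughly:

    MAYDAY_INITIAL_TIMEOUT  = HZ / 100 >= 2 ? HZ / 100 : 2,
                            /* call for help after 10ms (at least two ticks) */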
     
  • There are two spellings in use for 'freeze' + 'able' - 'freezable' and
    'freezeable'. The former is the more prominent one. The latter is
    mostly used by workqueue and in a few other odd places. Unify the
    spelling to 'freezable'.

    Signed-off-by: Tejun Heo
    Reported-by: Alan Stern
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Greg Kroah-Hartman
    Acked-by: Dmitry Torokhov
    Cc: David Woodhouse
    Cc: Alex Dubov
    Cc: "David S. Miller"
    Cc: Steven Whitehouse

    Tejun Heo
     

14 Feb, 2011

1 commit

  • After executing the matching works, a rescuer leaves the gcwq whether
    there are more pending works or not. This may decrease the
    concurrency level to zero and stall execution until a new work item is
    queued on the gcwq.

    Make rescuer wake up a regular worker when it leaves a gcwq if there
    are more works to execute, so that execution isn't stalled.

    Signed-off-by: Tejun Heo
    Reported-by: Ray Jui
    Cc: stable@kernel.org

    Tejun Heo
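
    A sketch of the change at the point where the rescuer leaves the gcwq
    (helper names as used elsewhere in workqueue.c; not the exact hunk):

    /*
     * Leave this gcwq.  If more work is pending, kick a regular worker so
     * the concurrency level doesn't drop to zero and stall execution.
     */
    if (keep_working(gcwq))
            wake_up_worker(gcwq);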
     

09 Feb, 2011

1 commit


11 Jan, 2011

2 commits

  • The nested NOT_RUNNING test in worker_clr_flags() is slightly
    misleading in that if NOT_RUNNING were a single flag the nested test
    would always be %true and thus a noop. Add a comment noting that the
    test isn't a noop.

    Signed-off-by: Tejun Heo
    Cc: Hillf Danton
    Cc: Andrew Morton

    Tejun Heo
     
  • Currently, the lockdep annotation in flush_work() requires exclusive
    access on the workqueue the target work is queued on and triggers a
    warning if a work is trying to flush another work on the same
    workqueue; however, this is no longer true as workqueues can now
    execute multiple works concurrently.

    This patch adds lock_map_acquire_read() and makes process_one_work()
    hold read access to the workqueue while executing a work item, and makes
    start_flush_work() check for write access if the concurrency level is one
    or the workqueue has a rescuer (as only one execution resource - the
    rescuer - is guaranteed to be available under memory pressure), and for
    read access otherwise.

    This better represents what's going on and removes spurious lockdep
    warnings which are triggered by fake dependency chain created through
    flush_work().

    * Peter pointed out that flushing another work from a WQ_MEM_RECLAIM
    wq breaks the forward-progress guarantee under memory pressure.
    The condition check was updated accordingly.

    Signed-off-by: Tejun Heo
    Reported-by: "Rafael J. Wysocki"
    Tested-by: "Rafael J. Wysocki"
    Cc: Peter Zijlstra
    Cc: stable@kernel.org

    Tejun Heo
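
    In start_flush_work() the acquire then looks roughly like this (a sketch;
    the exact condition may differ):

    if (cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER)
            lock_map_acquire(&cwq->wq->lockdep_map);        /* exclusive */
    else
            lock_map_acquire_read(&cwq->wq->lockdep_map);   /* shared */
    lock_map_release(&cwq->wq->lockdep_map);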
     

21 Dec, 2010

1 commit

  • Currently, destroy_workqueue() makes the workqueue deny all new
    queueing by setting WQ_DYING and flushes the workqueue once before
    proceeding with destruction; however, there are cases where work items
    queue more related work items. Currently, such users need to
    explicitly flush the workqueue multiple times depending on the
    possible depth of such chained queueing.

    This patch updates the queueing path such that a work item can queue
    further work items on the same workqueue even when WQ_DYING is set.
    The flush on destruction is automatically retried until the workqueue
    is empty. This guarantees that the workqueue is empty on destruction
    while allowing chained queueing.

    The flush retry logic whines if it takes too many retries to drain the
    workqueue.

    Signed-off-by: Tejun Heo
    Cc: James Bottomley

    Tejun Heo
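
    On the queueing side the rule "only chained queueing while dying" becomes,
    roughly (a sketch; the helper name is taken from the description of the
    change):

    /* in __queue_work() */
    if (unlikely(wq->flags & WQ_DYING) &&
        WARN_ON_ONCE(!is_chained_work(wq)))
            return;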
     

14 Dec, 2010

1 commit

  • Running the annotate branch profiler on three boxes, including my
    main box that runs firefox, evolution, xchat, and is part of the distcc farm,
    showed this with the likelys in the workqueue code:

    correct  incorrect   %  Function             File         Line
    -------  ---------  --  --------             ----         ----
         96     996253  99  wq_worker_sleeping   workqueue.c   703
         96     996247  99  wq_worker_waking_up  workqueue.c   677

    The likely()s in this case were assuming that WORKER_NOT_RUNNING will
    most likely be false. But this is not the case. The reason is
    (and shown by adding trace_printks and testing it) that most of the time
    WORKER_PREP is set.

    In worker_thread() we have:

    worker_clr_flags(worker, WORKER_PREP);

    [ do work stuff ]

    worker_set_flags(worker, WORKER_PREP, false);

    (that 'false' means not to wake up an idle worker)

    The wq_worker_sleeping() is called from schedule when a worker thread
    is putting itself to sleep. Which happens most of the time outside
    of that [ do work stuff ].

    The wq_worker_waking_up is called by the wakeup worker code, which
    is also called outside that [ do work stuff ].

    Thus, the likely and unlikely used by those two functions are actually
    backwards.

    Remove the annotation and let gcc figure it out.

    Acked-by: Tejun Heo
    Signed-off-by: Steven Rostedt
    Signed-off-by: Tejun Heo

    Steven Rostedt
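
    The change simply drops the annotations, e.g. in wq_worker_waking_up()
    (illustrative, not the exact hunk):

    /* before */
    if (likely(!(worker->flags & WORKER_NOT_RUNNING)))
            atomic_inc(get_gcwq_nr_running(cpu));

    /* after: no annotation, let the compiler and branch predictor decide */
    if (!(worker->flags & WORKER_NOT_RUNNING))
            atomic_inc(get_gcwq_nr_running(cpu));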
     

26 Nov, 2010

1 commit


27 Oct, 2010

1 commit

  • Silly though it is, completions and wait_queue_heads use foo_ONSTACK
    (COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK,
    __WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK) so I
    guess workqueues should do the same thing.

    s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/
    s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/

    Cc: Peter Zijlstra
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

26 Oct, 2010

1 commit

  • In the MN10300 arch, we occasionally see an assertion being tripped in
    alloc_cwqs() at the following line:

    /* just in case, make sure it's actually aligned */
    ---> BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
    return wq->cpu_wq.v ? 0 : -ENOMEM;

    The values are:

    wq->cpu_wq.v => 0x902776e0
    align => 0x100

    and align is calculated by the following:

    const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS,
                               __alignof__(unsigned long long));

    This is because the pointer in question (wq->cpu_wq.v) loses some of its
    lower bits to control flags, and so the object it points to must be
    sufficiently aligned to avoid the need to use those bits for pointing to
    things.

    Currently, 4 control bits and 4 colour bits are used in normal
    circumstances, plus a debugging bit if debugging is set. This requires
    the cpu_workqueue_struct struct to be at least 256 bytes aligned (or 512
    bytes aligned with debugging).

    PERCPU() alignment on MN10300, however, is only 32 bytes as set in
    vmlinux.lds.S. So we set this to PAGE_SIZE (4096) to match most other
    arches and stick a comment in alloc_cwqs() for anyone else who triggers
    the assertion.

    Reported-by: Akira Takeuchi
    Signed-off-by: David Howells
    Acked-by: Mark Salter
    Cc: Tejun Heo
    Signed-off-by: Linus Torvalds

    David Howells
     

19 Oct, 2010

2 commits

  • Commit a25909a4 (lockdep: Add an in_workqueue_context() lockdep-based
    test function) added in_workqueue_context() but there hasn't been any
    in-kernel user and the lockdep annotation in workqueue is scheduled to
    change. Remove the unused function.

    Signed-off-by: Tejun Heo
    Cc: Paul E. McKenney

    Tejun Heo
     
  • The documentation for schedule_on_each_cpu() states that it calls a
    function on each online CPU from keventd. This can easily be
    interpreted as an asynchronous call because the description does not
    mention that flush_work is called. Clarify that it is synchronous.

    tj: rephrased a bit

    Signed-off-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Tejun Heo

    Tejun Heo
     

11 Oct, 2010

2 commits

  • Add WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark
    WQ_RESCUER as internal and replace all external WQ_RESCUER usages with
    WQ_MEM_RECLAIM.

    This makes the API users express the intent of the workqueue instead
    of indicating the internal mechanism used to guarantee forward
    progress. This is also to make it cleaner to add more semantics to
    WQ_MEM_RECLAIM. For example, if deemed necessary, memory reclaim
    workqueues can be made highpri.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Cc: Jeff Garzik
    Cc: Dave Chinner
    Cc: Steven Whitehouse

    Tejun Heo
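
    Callers now state the intent directly, for example (the name below is
    illustrative):

    /* a workqueue that may have to make progress during memory reclaim */
    wq = alloc_workqueue("my_reclaim_wq", WQ_MEM_RECLAIM, 1);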
     
  • The policy function keep_working() didn't check GCWQ_HIGHPRI_PENDING
    and could return %false with highpri work pending. This could lead to
    late execution of a highpri work which was delayed due to @max_active
    throttling if other works are actively consuming CPU cycles.

    For example, the following could happen.

    1. Work W0 which burns CPU cycles.

    2. Two works W1 and W2 are queued to a highpri wq w/ @max_active of 1.

    3. W1 starts executing and W2 is put to delayed queue. W0 and W1 are
    both runnable.

    4. W1 finishes which puts W2 to pending queue but keep_working()
    incorrectly returns %false and the worker goes to sleep.

    5. W0 finishes and W2 starts execution.

    With this patch applied, W2 starts execution as soon as W1 finishes.

    Signed-off-by: Tejun Heo

    Tejun Heo
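
    With the fix, keep_working() also reports %true when highpri work is
    pending, roughly (a sketch of the resulting policy function):

    static bool keep_working(struct global_cwq *gcwq)
    {
            atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);

            return !list_empty(&gcwq->worklist) &&
                    (atomic_read(nr_running) <= 1 ||
                     gcwq->flags & GCWQ_HIGHPRI_PENDING);
    }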
     

05 Oct, 2010

2 commits


19 Sep, 2010

3 commits

  • Implement flush[_delayed]_work_sync(). These are flush functions
    which also make sure no CPU is still executing the target work from
    earlier queueing instances. These are similar to
    cancel[_delayed]_work_sync() except that the target work item is
    flushed instead of cancelled.

    Signed-off-by: Tejun Heo

    Tejun Heo
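
    Usage mirrors the cancel variants, for example (illustrative):

    flush_work_sync(&work);             /* wait for pending and running instances */
    flush_delayed_work_sync(&dwork);    /* same, for a delayed_work */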
     
  • Factor out start_flush_work() from flush_work(). start_flush_work()
    has @wait_executing argument which controls whether the barrier is
    queued only if the work is pending or also if executing. As
    flush_work() needs to wait for execution too, it uses %true.

    This commit doesn't cause any behavior difference. start_flush_work()
    will be used to implement flush_work_sync().

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Make the following cleanup changes.

    * Relocate flush/cancel function prototypes and definitions.

    * Relocate wait_on_cpu_work() and wait_on_work() before
    try_to_grab_pending(). These will be used to implement
    flush_work_sync().

    * Make all flush/cancel functions return bool instead of int.

    * Update wait_on_cpu_work() and wait_on_work() to return %true if they
    actually waited.

    * Add / update comments.

    This patch doesn't cause any functional changes.

    Signed-off-by: Tejun Heo

    Tejun Heo