24 Mar, 2020

1 commit

  • During unmount we can have a job from the delayed inode items work queue
    still running, which can lead to at least two bad things:

    1) A crash, because the worker can try to create a transaction just
    after the fs roots were freed;

    2) A transaction leak, because the worker can create a transaction
    before the fs roots are freed and just after we committed the last
    transaction and after we stopped the transaction kthread.

    A stack trace example of the crash:

    [79011.691214] kernel BUG at lib/radix-tree.c:982!
    [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G W 5.6.0-rc2-btrfs-next-54 #2
    (...)
    [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
    [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
    (...)
    [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
    [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
    [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
    [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
    [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
    [79011.709270] FS: 0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
    [79011.710699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
    [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [79011.715448] Call Trace:
    [79011.715925] record_root_in_trans+0x72/0xf0 [btrfs]
    [79011.716819] btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
    [79011.717925] start_transaction+0xdd/0x5c0 [btrfs]
    [79011.718829] btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
    [79011.719915] btrfs_work_helper+0xaa/0x720 [btrfs]
    [79011.720773] process_one_work+0x26d/0x6a0
    [79011.721497] worker_thread+0x4f/0x3e0
    [79011.722153] ? process_one_work+0x6a0/0x6a0
    [79011.722901] kthread+0x103/0x140
    [79011.723481] ? kthread_create_worker_on_cpu+0x70/0x70
    [79011.724379] ret_from_fork+0x3a/0x50
    (...)

    The following diagram shows a sequence of steps that lead to the crash
    during unmount of the filesystem:

    CPU 1:

      btrfs_punch_hole()
        btrfs_btree_balance_dirty()
          btrfs_balance_delayed_items()
            --> sees fs_info->delayed_root->items with value 200, which is
                greater than BTRFS_DELAYED_BACKGROUND (128) and smaller than
                BTRFS_DELAYED_WRITEBACK (512)
            btrfs_wq_run_delayed_node()
              --> queues a job for fs_info->delayed_workers to run
                  btrfs_async_run_delayed_root()

    CPU 2:

      btrfs_async_run_delayed_root()
        --> job queued by CPU 1
        --> starts picking and running delayed nodes from the prepare_list
            list

    CPU 3:

      close_ctree()
        btrfs_delete_unused_bgs()
        btrfs_commit_super()
          btrfs_join_transaction()
            --> gets transaction N
          btrfs_commit_transaction(N)
            --> sets transaction state to TRANS_STATE_COMMIT_START

    CPU 2:

      btrfs_first_prepared_delayed_node()
        --> picks delayed node X through the prepared_list list

    CPU 3 (still committing transaction N):

          btrfs_run_delayed_items()
            btrfs_first_delayed_node()
              --> also picks delayed node X, but through the node_list list
            __btrfs_commit_inode_delayed_items()
              --> runs all delayed items from this node and drops the node's
                  item count to 0 through the call to
                  btrfs_release_delayed_inode()
            --> finishes running any remaining delayed nodes
          --> finishes the transaction commit
        --> stops the cleaner and transaction threads
        btrfs_free_fs_roots()
          --> frees all roots and removes them from the radix tree
              fs_info->fs_roots_radix

    CPU 2:

      btrfs_join_transaction()
        start_transaction()
          btrfs_record_root_in_trans()
            record_root_in_trans()
              radix_tree_tag_set()
                --> crashes because the root is not in the radix tree anymore

    If the worker is able to call btrfs_join_transaction() before the unmount
    task frees the fs roots, we end up leaking a transaction and all its
    resources, since after the call to btrfs_commit_super() and stopping the
    transaction kthread, we don't expect to have any transaction open anymore.

    When this situation happens the worker has a delayed node that has no
    more items to run, since the task calling btrfs_run_delayed_items(),
    which is doing a transaction commit, picks the same node and runs all
    its items first.

    We cannot wait for the worker to complete when running delayed items
    through btrfs_run_delayed_items(), because we call that function in
    several phases of a transaction commit, and that could cause a deadlock
    because the worker calls btrfs_join_transaction() and the task doing the
    transaction commit may have already set the transaction state to
    TRANS_STATE_COMMIT_DOING.

    Also it's not possible to get into a situation where only some of the
    items of a delayed node are added to the fs/subvolume tree in the current
    transaction and the remaining ones in the next transaction, because when
    running the items of a delayed inode we lock its mutex, effectively
    waiting for the worker if the worker is running the items of the delayed
    node already.

    Since this can only cause issues when unmounting a filesystem, fix it in
    a simple way by waiting for any jobs on the delayed workers queue before
    calling btrfs_commit_super() at close_ctree(). This works because at this
    point no one can call btrfs_btree_balance_dirty() or
    btrfs_balance_delayed_items(), and if we end up waiting for any worker to
    complete, btrfs_commit_super() will commit the transaction created by the
    worker.
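
    A rough sketch of the idea (the helper name and exact placement are
    assumptions here, not a copy of the patch):

        /* fs/btrfs/async-thread.c: flush the underlying kernel workqueue */
        void btrfs_flush_workqueue(struct btrfs_workqueue *wq)
        {
                flush_workqueue(wq->normal_wq);
        }

        /* fs/btrfs/disk-io.c, in close_ctree(), before btrfs_commit_super() */
        /*
         * Make sure any job queued for fs_info->delayed_workers has finished,
         * so a late btrfs_async_run_delayed_root() cannot join a transaction
         * after the last commit or after the fs roots are freed.
         */
        btrfs_flush_workqueue(fs_info->delayed_workers);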

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Filipe Manana
     

18 Nov, 2019

4 commits

  • The pure attribute is more relaxed than const: the functions may
    dereference pointers, as long as the observable state is not changed. We
    do have such functions, as found by -Wsuggest-attribute=pure.

    The visible effects of this patch are negligible, there are differences
    in the assembly but hard to summarize.
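
    For reference, the difference from const in a minimal illustrative
    example (not taken from the patch itself):

        /* Can be 'pure': it reads memory through its argument but changes no
         * observable state. It could not be 'const', which forbids reading
         * memory reached through pointer arguments. */
        static int first_byte(const char *s) __attribute__((pure));

        static int first_byte(const char *s)
        {
                return s[0];
        }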

    Reviewed-by: Nikolay Borisov
    Signed-off-by: David Sterba

    David Sterba
     
  • Commit ac0c7cf8be00 ("btrfs: fix crash when tracepoint arguments are
    freed by wq callbacks") added a void pointer, wtag, which is passed into
    trace_btrfs_all_work_done() instead of the freed work item. This is
    silly for a few reasons:

    1. The freed work item still has the same address.
    2. work is still in scope after it's freed, so assigning wtag doesn't
    stop anyone from using it.
    3. The tracepoint has always taken a void * argument, so assigning wtag
    doesn't actually make things any more type-safe. (Note that the
    original bug in commit bc074524e123 ("btrfs: prefix fsid to all trace
    events") was that the void * was implicitly cast when it was passed
    to btrfs_work_owner() in the tracepoint itself).

    Instead, let's add some clearer warnings as comments.
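
    To make the point concrete, a simplified sketch (not the real callback;
    the signatures are abbreviated):

        static void work_done(struct btrfs_fs_info *fs_info,
                              struct btrfs_work *work)
        {
                void *wtag = work;         /* 1. same address as 'work'      */

                work->ordered_free(work);  /* 'work' may be freed here       */
                /* 2. 'work' is still in scope; nothing prevents using it.   */
                /* 3. the tracepoint takes a void *, so passing 'wtag' is no */
                /*    safer than passing 'work' directly.                    */
                trace_btrfs_all_work_done(fs_info, wtag);
        }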

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Omar Sandoval
     
  • Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed
    write") worked around the issue that a recycled work item could get a
    false dependency on the original work item due to how the workqueue code
    guarantees non-reentrancy. It did so by giving different work functions
    to different types of work.

    However, the fixes in the previous few patches are more complete, as
    they prevent a work item from being recycled at all (except for a tiny
    window that the kernel workqueue code handles for us). This obsoletes
    the previous fix, so we don't need the unique helpers for correctness.
    The only other reason to keep them would be so they show up in stack
    traces, but they always seem to be optimized to a tail call, so they
    don't show up anyways. So, let's just get rid of the extra indirection.

    While we're here, rename normal_work_helper() to the more informative
    btrfs_work_helper().

    Reviewed-by: Nikolay Borisov
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Omar Sandoval
     
  • We hit the following very strange deadlock on a system with Btrfs on a
    loop device backed by another Btrfs filesystem:

    1. The top (loop device) filesystem queues an async_cow work item from
    cow_file_range_async(). We'll call this work X.
    2. Worker thread A starts work X (normal_work_helper()).
    3. Worker thread A executes the ordered work for the top filesystem
    (run_ordered_work()).
    4. Worker thread A finishes the ordered work for work X and frees X
    (work->ordered_free()).
    5. Worker thread A executes another ordered work and gets blocked on I/O
    to the bottom filesystem (still in run_ordered_work()).
    6. Meanwhile, the bottom filesystem allocates and queues an async_cow
    work item which happens to be the recently-freed X.
    7. The workqueue code sees that X is already being executed by worker
    thread A, so it schedules X to be executed _after_ worker thread A
    finishes (see the find_worker_executing_work() call in
    process_one_work()).

    Now, the top filesystem is waiting for I/O on the bottom filesystem, but
    the bottom filesystem is waiting for the top filesystem to finish, so we
    deadlock.

    This happens because we are breaking the workqueue assumption that a
    work item cannot be recycled while it still depends on other work. Fix
    it by waiting to free the work item until we are done with all of the
    related ordered work.
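
    A simplified sketch of that idea (the list walking, locking and the
    pick_next_done_ordered_work() helper are placeholders, not the actual
    patch):

        static void run_ordered_work(struct btrfs_workqueue *wq,
                                     struct btrfs_work *self)
        {
                struct btrfs_work *work;
                bool free_self = false;

                while ((work = pick_next_done_ordered_work(wq))) {
                        work->ordered_func(work);
                        if (work == self) {
                                /*
                                 * We are still executing this item's normal
                                 * work; freeing it now would allow it to be
                                 * recycled while the workqueue code still
                                 * considers it running.
                                 */
                                free_self = true;
                        } else {
                                work->ordered_free(work);
                        }
                }

                if (free_self)
                        self->ordered_free(self);
        }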

    P.S.:

    One might ask why the workqueue code doesn't try to detect a recycled
    work item. It actually does try by checking whether the work item has
    the same work function (find_worker_executing_work()), but in our case
    the function is the same. This is the only key that the workqueue code
    has available to compare, short of adding an additional, layer-violating
    "custom key". Considering that we're the only ones that have ever hit
    this, we should just play by the rules.

    Unfortunately, we haven't been able to create a minimal reproducer other
    than our full container setup using a compress-force=zstd filesystem on
    top of another compress-force=zstd filesystem.

    Suggested-by: Tejun Heo
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Omar Sandoval
     

09 Sep, 2019

1 commit


25 Feb, 2019

1 commit


12 Apr, 2018

1 commit


30 Oct, 2017

1 commit

  • We've seen the following backtrace in an ftrace or dmesg log:

    kworker/u16:10-4244 [000] 241942.480955: function: btrfs_put_ordered_extent
    kworker/u16:10-4244 [000] 241942.480956: kernel_stack:
    => finish_ordered_fn (ffffffffa0384475)
    => btrfs_scrubparity_helper (ffffffffa03ca577)
    => btrfs_freespace_write_helper (ffffffffa03ca98e)
    => process_one_work (ffffffff81117b2f)
    => worker_thread (ffffffff81118c2a)
    => kthread (ffffffff81121de0)
    => ret_from_fork (ffffffff81d7087a)

    btrfs_freespace_write_helper actually calls normal_work_helper, not
    btrfs_scrubparity_helper, so somehow the kernel parsed an incorrect
    function address while unwinding the stack; btrfs_scrubparity_helper
    really shouldn't show up there.

    It's caused by the compiler inlining our helper function; adding a
    noinline tag fixes that.
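
    Roughly, the change amounts to something like this (illustrative; the
    real code generates these wrappers with a macro):

        /* keep the wrapper out of line so stack traces name it correctly */
        noinline_for_stack void btrfs_scrubparity_helper(struct work_struct *arg)
        {
                struct btrfs_work *work = container_of(arg, struct btrfs_work,
                                                       normal_work);
                normal_work_helper(work);
        }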

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    [ use noinline_for_stack ]
    Signed-off-by: David Sterba

    Liu Bo
     

16 Aug, 2017

1 commit


09 Jan, 2017

1 commit

  • Enabling btrfs tracepoints leads to instant crash, as reported. The wq
    callbacks could free the memory and the tracepoints started to
    dereference the members to get to fs_info.

    The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
    removed the tracepoints but we could preserve them by passing only the
    required data in a safe way.

    Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events")
    CC: stable@vger.kernel.org # 4.8+
    Reported-by: Sebastian Andrzej Siewior
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    David Sterba
     

14 Dec, 2016

1 commit

  • Problem statement: an unprivileged user who has read-write access to more
    than one btrfs subvolume may easily consume all kernel memory (eventually
    triggering the oom-killer).

    Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

    [root@kteam1 ~]# cat prep.sh

    DEV=/dev/sdb
    mkfs.btrfs -f $DEV
    mount $DEV /mnt
    for i in `seq 1 16`
    do
            mkdir /mnt/$i
            btrfs subvolume create /mnt/SV_$i
            ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
            mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
            chmod a+rwx /mnt/$i
    done

    [root@kteam1 ~]# sh prep.sh

    [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

    [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
    kmalloc-128 10144 10144 128 32 1 : tunables 0 0 0 : slabdata 317 317 0
    kmalloc-128 9992352 9992352 128 32 1 : tunables 0 0 0 : slabdata 312261 312261 0
    kmalloc-128 24226752 24226752 128 32 1 : tunables 0 0 0 : slabdata 757086 757086 0
    kmalloc-128 42754240 42754240 128 32 1 : tunables 0 0 0 : slabdata 1336070 1336070 0

    The huge numbers above come from insane number of async_work-s allocated
    and queued by btrfs_wq_run_delayed_node.

    The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
    works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
    worker func (btrfs_async_run_delayed_root) processes at least
    BTRFS_DELAYED_BATCH items (if they are present in the list). So the
    machinery works as expected while the list is almost empty. As soon as it
    gets bigger, the worker func starts to process more than one item at a
    time, which takes longer, and the chance of having more async_works
    queued than needed gets higher.

    The problem above is worsened by another flaw of the delayed-inode
    implementation: if an async_work was queued in the throttling branch
    (number of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker
    func won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2.
    So it is possible for the func to occupy the CPU for a very long time (up
    to 30 sec in my experiments): while the func is trying to drain the list,
    user activity may add more and more items to it.

    The patch fixes both problems in a straightforward way: refuse to queue
    too many works in btrfs_wq_run_delayed_node and bail out of the worker
    func once at least BTRFS_DELAYED_WRITEBACK items have been processed.
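
    A sketch of the worker-side bail-out only (the helpers named here are
    placeholders, not the real delayed-inode code):

        int processed = 0;

        while ((node = pick_prepared_delayed_node(delayed_root))) {
                run_delayed_node(node);
                if (++processed >= BTRFS_DELAYED_WRITEBACK)
                        break;  /* re-queue later instead of hogging the CPU */
        }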

    Changed in v2: remove support of thresh == NO_THRESHOLD.

    Signed-off-by: Maxim Patlasov
    Signed-off-by: Chris Mason
    Cc: stable@vger.kernel.org # v3.15+

    Maxim Patlasov
     

26 Jul, 2016

1 commit

  • In order to provide an fsid for trace events, we'll need a btrfs_fs_info
    pointer. The most lightweight way to do that for btrfs_work structures
    is to associate it with the __btrfs_workqueue structure. Each queued
    btrfs_work structure has a workqueue associated with it, so that's
    a natural fit. It's a privately defined structure, so we add accessors
    to retrieve the fs_info pointer.
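
    Roughly, with the field layout simplified, this is the shape of it
    (btrfs_work_owner() being one of those accessors):

        struct __btrfs_workqueue {
                struct workqueue_struct *normal_wq;
                struct btrfs_fs_info *fs_info;   /* new: owner for tracepoints */
                /* ... */
        };

        struct btrfs_fs_info *btrfs_work_owner(const struct btrfs_work *work)
        {
                return work->wq->fs_info;
        }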

    Signed-off-by: Jeff Mahoney
    Signed-off-by: David Sterba

    Jeff Mahoney
     

26 Jan, 2016

1 commit


03 Dec, 2015

1 commit


01 Sep, 2015

1 commit

  • At initialization time, for a threshold-able workqueue, the max_active
    of the kernel workqueue should be 1 and grow if it hits the threshold.

    But due to the confusing naming, there is a 'max_active' for both the
    kernel workqueue and the btrfs workqueue, so the wrong value was given at
    workqueue initialization.

    This patch fixes that, and to avoid further misunderstanding, renames the
    btrfs_workqueue members to 'current_active' and 'limit_active'.

    A corresponding comment is also added for readability.

    Reported-by: Alex Lyakas
    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     

10 Jun, 2015

1 commit

  • lockdep report following warning in test:
    [25176.843958] =================================
    [25176.844519] [ INFO: inconsistent lock state ]
    [25176.845047] 4.1.0-rc3 #22 Tainted: G W
    [25176.845591] ---------------------------------
    [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
    [25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [] scrub_free_ctx+0x2d/0xf0 [btrfs]
    [25176.847838] {SOFTIRQ-ON-W} state was registered at:
    [25176.848396] [] __lock_acquire+0x6a0/0xe10
    [25176.848955] [] lock_acquire+0xce/0x2c0
    [25176.849491] [] mutex_lock_nested+0x7f/0x410
    [25176.850029] [] scrub_stripe+0x4df/0x1080 [btrfs]
    [25176.850575] [] scrub_chunk.isra.19+0x111/0x130 [btrfs]
    [25176.851110] [] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
    [25176.851660] [] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
    [25176.852189] [] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
    [25176.852771] [] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
    [25176.853315] [] do_vfs_ioctl+0x318/0x570
    [25176.853868] [] SyS_ioctl+0x41/0x80
    [25176.854406] [] system_call_fastpath+0x12/0x6f
    [25176.854935] irq event stamp: 51506
    [25176.855511] hardirqs last enabled at (51506): [] vprintk_emit+0x225/0x5e0
    [25176.856059] hardirqs last disabled at (51505): [] vprintk_emit+0xb7/0x5e0
    [25176.856642] softirqs last enabled at (50886): [] __do_softirq+0x363/0x640
    [25176.857184] softirqs last disabled at (50949): [] irq_exit+0x10d/0x120
    [25176.857746]
    other info that might help us debug this:
    [25176.858845] Possible unsafe locking scenario:
    [25176.859981]        CPU0
    [25176.860537]        ----
    [25176.861059]   lock(&wr_ctx->wr_lock);
    [25176.861705]   <Interrupt>
    [25176.862272]     lock(&wr_ctx->wr_lock);
    [25176.862881]
    *** DEADLOCK ***

    Reason:
    The above warning is caused by:
    Interrupt
    -> bio_endio()
    -> ...
    -> scrub_put_ctx()
    -> scrub_free_ctx() *1
    -> ...
    -> mutex_lock(&wr_ctx->wr_lock);

    scrub_put_ctx() is allowed to be called in end_bio interrupt context, but
    by design it should never end up calling scrub_free_ctx(sctx) in
    interrupt context (above *1), because btrfs_scrub_dev() holds one
    additional reference on sctx->refs, which makes scrub_free_ctx() only get
    called within btrfs_scrub_dev().

    But the code does not behave as we wish, because the free sequence in
    scrub_pending_bio_dec() has a gap.

    Current code:
    -----------------------------------+-----------------------------------
    scrub_pending_bio_dec()            |  btrfs_scrub_dev
    -----------------------------------+-----------------------------------
    atomic_dec(&sctx->bios_in_flight); |
    wake_up(&sctx->list_wait);         |
                                       |  scrub_put_ctx()
                                       |  -> atomic_dec_and_test(&sctx->refs)
    scrub_put_ctx(sctx);               |
    -> atomic_dec_and_test(&sctx->refs)|
    -> scrub_free_ctx()                |
    -----------------------------------+-----------------------------------

    We expected:
    -----------------------------------+-----------------------------------
    scrub_pending_bio_dec()            |  btrfs_scrub_dev
    -----------------------------------+-----------------------------------
    atomic_dec(&sctx->bios_in_flight); |
    wake_up(&sctx->list_wait);         |
    scrub_put_ctx(sctx);               |
    -> atomic_dec_and_test(&sctx->refs)|
                                       |  scrub_put_ctx()
                                       |  -> atomic_dec_and_test(&sctx->refs)
                                       |  -> scrub_free_ctx()
    -----------------------------------+-----------------------------------

    Fix:
    Move scrub_pending_bio_dec() to a workqueue, to avoid running this
    function in interrupt context.
    Tested by checking the trace log with debugging enabled.
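
    A minimal sketch of the shape of that fix (all names here are
    placeholders, not the actual scrub code):

        static void sketch_bio_end_io(struct bio *bio)   /* interrupt context */
        {
                struct sketch_bio *sbio = bio->bi_private;

                /* don't take wr_lock here; hand off to process context */
                queue_work(scrub_wq, &sbio->work);
        }

        static void sketch_bio_end_io_worker(struct work_struct *w)
        {
                struct sketch_bio *sbio = container_of(w, struct sketch_bio,
                                                       work);

                /* process context: dropping the last reference, and thus
                 * possibly taking wr_ctx->wr_lock in scrub_free_ctx(), is
                 * now safe */
                scrub_pending_bio_dec(sbio->sctx);
        }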

    Changelog v1->v2:
    Use a workqueue instead of adjusting the function call sequence as in v1,
    because v1 would introduce a bug pointed out by:
    Filipe David Manana

    Reported-by: Qu Wenruo
    Signed-off-by: Zhao Lei
    Reviewed-by: Filipe Manana
    Signed-off-by: Chris Mason

    Zhao Lei
     

17 Feb, 2015

1 commit


02 Oct, 2014

1 commit


18 Sep, 2014

1 commit

  • This patch implements the data repair function for when a direct read
    fails.

    The details of the implementation are:
    - When we find the data is not right, we try to read the data from the
    other mirror.
    - When the io on the mirror ends, we insert the endio work into the
    dedicated btrfs workqueue, not the common read endio workqueue, because
    the original endio work is still blocked in the btrfs endio workqueue;
    if we inserted the endio work of the io on the mirror into that
    workqueue, a deadlock would happen.
    - After we get the right data, we write it back to the corrupted mirror.
    - And if the data on the new mirror is still corrupted, we try the next
    mirror until we read the right data or all the mirrors have been
    traversed.
    - After the above work, we set the uptodate flag according to the result.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
     

24 Aug, 2014

1 commit

  • This has been reported and discussed for a long time, and this hang occurs in
    both 3.15 and 3.16.

    Btrfs has now migrated to the kernel workqueue, but this introduces the
    hang problem.

    Btrfs has a kind of work that is queued in an ordered way, which means
    that its ordered_func() must be processed FIFO, so it usually looks
    like --

    normal_work_helper(arg)
        work = container_of(arg, struct btrfs_work, normal_work);

        work->func()
        for each ordered work on wq->ordered_list:
            ordered_work->ordered_func()
            ordered_work->ordered_free()

    The hang is a rare case, first when we find free space, we get an uncached block
    group, then we go to read its free space cache inode for free space information,
    so it will

    file a readahead request
        btrfs_readpages()
            for page that is not in page cache
                __do_readpage()
                    submit_extent_page()
                        btrfs_submit_bio_hook()
                            btrfs_bio_wq_end_io()
                            submit_bio()
                                end_workqueue_bio()
                                    --> queues a work for the real endio
                                        processing (this is work C below)

    and the readahead is issued by a worker thread that is itself inside
    normal_work_helper():

    current_work = arg;                 <-- arg is work A's normal_work
    worker->current_func(arg)
        normal_work_helper(arg)
            A = container_of(arg, struct btrfs_work, normal_work);

            A->func()
            A->ordered_func()
            A->ordered_free()
            ordered_func()              <-- a later work on wq->ordered_list
                submit_compressed_extents()
                    find_free_extent()
                        load_free_space_inode()
                            ...         <-- files the readahead request above
            ordered_free()

    So if work A sits early in wq->ordered_list and there are more ordered
    works queued after it, such as B->ordered_func(), its memory could have
    been freed before normal_work_helper() returns, which means that the
    kernel workqueue code in worker_thread() still has its
    worker->current_work pointer set to work A->normal_work's, i.e. arg's,
    address.

    Meanwhile, work C is allocated after work A is freed, and
    work C->normal_work and work A->normal_work are likely to share the same
    address (I confirmed this with ftrace output, so I'm not just guessing;
    it's rare though).

    When another kthread picks up work C->normal_work to process and finds
    our kthread is processing it (see find_worker_executing_work()), it
    treats work C as a collision and skips it, which ends up with nobody
    processing work C.

    So the situation is that our kthread is waiting forever on work C.

    Besides, there are other cases that can lead to a deadlock, but the real
    problem is that all btrfs workqueues share one work->func,
    normal_work_helper, so this patch makes each workqueue have its own
    helper function, which is only a wrapper of normal_work_helper.

    With this patch, I no longer hit the above hang.
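
    Roughly, each wrapper looks like this (illustrative; the patch generates
    one such trivial wrapper per btrfs workqueue with a macro):

        void btrfs_endio_helper(struct work_struct *arg)
        {
                struct btrfs_work *work = container_of(arg, struct btrfs_work,
                                                       normal_work);
                normal_work_helper(work);
        }

    so the workqueue core no longer sees every btrfs work item as having the
    same function.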

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     

08 Apr, 2014

1 commit

  • Reproducer:
    mount /dev/ubda /mnt
    mount -oremount,thread_pool=42 /mnt

    Gives a crash:
    ? btrfs_workqueue_set_max+0x0/0x70
    btrfs_resize_thread_pool+0xe3/0xf0
    ? sync_filesystem+0x0/0xc0
    ? btrfs_resize_thread_pool+0x0/0xf0
    btrfs_remount+0x1d2/0x570
    ? kern_path+0x0/0x80
    do_remount_sb+0xd9/0x1c0
    do_mount+0x26a/0xbf0
    ? kfree+0x0/0x1b0
    SyS_mount+0xc4/0x110

    It's a call
    btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
    with
    fs_info->scrub_wr_completion_workers = NULL;

    as scrub wqs get created only on user's demand.

    Patch skips not-created-yet workqueues.
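
    The shape of the fix, roughly (whether the NULL check lives in the helper
    or in its caller is an implementation detail):

        void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max)
        {
                if (!wq)        /* scrub workqueues are created on demand */
                        return;
                /* ... adjust the limit as before ... */
        }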

    Signed-off-by: Sergei Trofimovich
    CC: Qu Wenruo
    CC: Chris Mason
    CC: Josef Bacik
    CC: linux-btrfs@vger.kernel.org
    Signed-off-by: Chris Mason

    Sergei Trofimovich
     

21 Mar, 2014

2 commits


11 Mar, 2014

8 commits

  • Add ftrace tracepoints for btrfs_workqueue for further workqueue tuning.
    This patch needs to be applied after the workqueue replacement patchset.

    Signed-off-by: Qu Wenruo
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • The new btrfs_workqueue still uses open-coded function definitions; this
    patch changes them to the btrfs_func_t type, which is much the same as in
    the kernel workqueue.
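
    The type in question is a plain function pointer typedef, along these
    lines:

        typedef void (*btrfs_func_t)(struct btrfs_work *arg);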

    Signed-off-by: Qu Wenruo
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Since the "_struct" suffix is mainly used for distinguish the differnt
    btrfs_work between the original and the newly created one,
    there is no need using the suffix since all btrfs_workers are changed
    into btrfs_workqueue.

    Also this patch fixed some codes whose code style is changed due to the
    too long "_struct" suffix.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Since all the btrfs_workers have been replaced with the newly created
    btrfs_workqueue, the old code can be easily removed.

    Signed-off-by: Quwenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • The original btrfs_workers has thresholding functions to dynamically
    create or destroy kthreads.

    Though there is no such facility in the kernel workqueue, because workers
    are not created manually, we can still use workqueue_set_max_active() to
    simulate the behavior, mainly to achieve better HDD performance by
    setting a high threshold on submit_workers.
    (Sadly, no resources can be saved.)

    So in this patch, extra workqueue pending counters are introduced to
    dynamically change the max_active of each btrfs_workqueue_struct, hoping
    to restore the behavior of the original thresholding function.

    Also, workqueue_set_max_active() uses a mutex to protect the
    workqueue_struct and is not meant to be called too frequently, so a new
    interval mechanism is applied that only calls workqueue_set_max_active()
    after a certain count of work has been queued, hoping to balance both
    random and sequential performance on HDD.
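
    A rough sketch of the thresholding idea (the struct and field names are
    assumptions, not the real async-thread.c code):

        struct sketch_wq {
                struct workqueue_struct *kernel_wq;
                atomic_t pending;     /* queued but not yet finished works  */
                atomic_t queued;      /* total queued, for the interval     */
                int thresh;           /* pending works per active worker    */
                int interval;         /* how often to touch max_active      */
                int limit_active;     /* upper bound requested by btrfs     */
        };

        static void thresh_queue_hook(struct sketch_wq *wq)
        {
                atomic_inc(&wq->pending);

                /* workqueue_set_max_active() takes a mutex internally, so
                 * only call it once every 'interval' queued works */
                if (atomic_inc_return(&wq->queued) % wq->interval)
                        return;

                workqueue_set_max_active(wq->kernel_wq,
                        clamp(atomic_read(&wq->pending) / wq->thresh + 1,
                              1, wq->limit_active));
        }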

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Add a high-priority function to btrfs_workqueue.

    This is implemented by embedding a btrfs_workqueue into a btrfs_workqueue
    and using some helper functions to distinguish the normal-priority wq
    from the high-priority wq, so the high-priority wq is completely
    independent from the normal workqueue.

    Signed-off-by: Qu Wenruo
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • Use the kernel workqueue to implement a new btrfs_workqueue_struct, which
    has the ordered execution feature like the btrfs_worker.

    The func is executed concurrently, and the ordered_func/ordered_free are
    executed in the sequence they were queued, after the corresponding func
    is done.

    The new btrfs_workqueue works much like the original one: one workqueue
    for normal work and a list for ordered work.
    When a work is queued, the ordered work is added to the list and a helper
    function is queued into the workqueue.
    The helper function executes a normal work and then checks and executes
    as many ordered works as possible in the sequence they were queued.

    In this patch, the high-priority work queue and thresholding are not
    added yet; they will be added in the following patches.

    Signed-off-by: Qu Wenruo
    Signed-off-by: Lai Jiangshan
    Tested-by: David Sterba
    Signed-off-by: Josef Bacik

    Qu Wenruo
     
  • In case we do not refill, we can overwrite the cur pointer taken from
    prio_head with one from the non-prioritized head, which looks like
    something that was not intended.

    This change makes us always take works from prio_head first, as long as
    it is not empty.
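
    The intended behaviour, roughly (using the list names from the
    changelog):

        /* always serve prio_head first; fall back to the non-prioritized
         * head only when the prioritized list is empty */
        if (!list_empty(prio_head))
                cur = prio_head->next;
        else if (!list_empty(head))
                cur = head->next;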

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Josef Bacik

    Stanislaw Gruszka
     

21 Nov, 2013

1 commit

  • __btrfs_start_workers returns 0 in case it raced with btrfs_stop_workers
    and lost the race. This is wrong because the worker in this case is not
    allowed to start and is in fact destroyed. Return -EINVAL instead.

    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Ilya Dryomov
     

12 Nov, 2013

1 commit


05 Oct, 2013

1 commit

  • The current implementation of worker threads in Btrfs has races in
    worker stopping code, which cause all kinds of panics and lockups when
    running btrfs/011 xfstest in a loop. The problem is that
    btrfs_stop_workers is unsynchronized with respect to check_idle_worker,
    check_busy_worker and __btrfs_start_workers.

    E.g., check_idle_worker race flow:

    btrfs_stop_workers():                 check_idle_worker(aworker):

    - grabs the lock
    - splices the idle list into the
      working list
    - removes the first worker from the
      working list
    - releases the lock to wait for
      its kthread's completion
                                          - grabs the lock
                                          - if aworker is on the working list,
                                            moves aworker from the working
                                            list to the idle list
                                          - releases the lock
    - grabs the lock
    - puts the worker
    - removes the second worker from the
      working list
    ......
    btrfs_stop_workers returns, aworker is on the idle list
    FS is umounted, memory is freed
    ......
    aworker is woken up, fireworks ensue

    With this applied, I wasn't able to trigger the problem in 48 hours,
    whereas previously I could reliably reproduce at least one of these
    races within an hour.

    Reported-by: David Sterba
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Josef Bacik

    Ilya Dryomov
     

26 Jul, 2012

1 commit

  • Each ordered operation has a free callback, and this was called with the
    worker spinlock held. Josef made the free callback also call iput,
    which we can't do with the spinlock.

    This drops the spinlock for the free operation and grabs it again before
    moving through the rest of the list. We'll circle back around to this
    and find a cleaner way that doesn't bounce the lock around so much.
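
    The shape of the change, roughly (the lock and field names are assumed):

        spin_unlock_irqrestore(&workers->order_lock, flags);
        /* the free callback may call iput(), so it must not run under the
         * spinlock */
        work->ordered_free(work);
        spin_lock_irqsave(&workers->order_lock, flags);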

    Signed-off-by: Chris Mason
    cc: stable@kernel.org

    Chris Mason
     

22 Mar, 2012

1 commit


26 Dec, 2011

1 commit

  • * pm-sleep: (51 commits)
    PM: Drop generic_subsys_pm_ops
    PM / Sleep: Remove forward-only callbacks from AMBA bus type
    PM / Sleep: Remove forward-only callbacks from platform bus type
    PM: Run the driver callback directly if the subsystem one is not there
    PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
    PM / Sleep: Merge internal functions in generic_ops.c
    PM / Sleep: Simplify generic system suspend callbacks
    PM / Hibernate: Remove deprecated hibernation snapshot ioctls
    PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
    PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
    PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
    PM / Sleep: Make [un]lock_system_sleep() generic
    PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
    PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
    PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
    Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
    PM / Sleep: Unify diagnostic messages from device suspend/resume
    ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
    PM / Hibernate: Remove deprecated hibernation test modes
    PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
    ...

    Conflicts:
    kernel/kmod.c

    Rafael J. Wysocki
     

23 Dec, 2011

1 commit


22 Dec, 2011

1 commit

  • * master: (848 commits)
    SELinux: Fix RCU deref check warning in sel_netport_insert()
    binary_sysctl(): fix memory leak
    mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
    ipmi_watchdog: restore settings when BMC reset
    oom: fix integer overflow of points in oom_badness
    memcg: keep root group unchanged if creation fails
    nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
    nilfs2: unbreak compat ioctl
    cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
    evm: prevent racing during tfm allocation
    evm: key must be set once during initialization
    mmc: vub300: fix type of firmware_rom_wait_states module parameter
    Revert "mmc: enable runtime PM by default"
    mmc: sdhci: remove "state" argument from sdhci_suspend_host
    x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
    IB/qib: Correct sense on freectxts increment and decrement
    RDMA/cma: Verify private data length
    cgroups: fix a css_set not found bug in cgroup_attach_proc
    oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
    Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
    ...

    Conflicts:
    kernel/cgroup_freezer.c

    Rafael J. Wysocki