12 Sep, 2018

1 commit

  • After merging the iolatency policy, we potentially now have 4 policies
    being registered, but only support 3. This causes one of them to fail
    loading. Takashi reports that BFQ no longer works for him, because it
    fails to load due to policy registration failure.

    Bump to 5 policies, and also add a warning for when we have exceeded
    the global amount. If we have to touch this again, we should switch
    to a dynamic scheme instead.
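
    As a rough illustration of the scheme described above, a fixed-size
    registration table that warns when it fills up might look like the
    sketch below. The names (MAX_POLS, struct policy, policy_register) are
    placeholders, not the kernel's actual identifiers.

    /* Hedged sketch: fixed-size policy table that warns when exhausted. */
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_POLS 5                      /* bumped from 3 to 5 */

    struct policy { const char *name; };

    static struct policy *policies[MAX_POLS];

    static int policy_register(struct policy *pol)
    {
        size_t i;

        for (i = 0; i < MAX_POLS; i++) {
            if (!policies[i]) {
                policies[i] = pol;
                return 0;
            }
        }
        /* all slots taken: warn loudly so the next bump (or a switch to a
         * dynamic scheme) is not forgotten */
        fprintf(stderr, "policy table full, increase MAX_POLS\n");
        return -1;
    }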

    Reported-by: Takashi Iwai
    Reviewed-by: Jeff Moyer
    Tested-by: Takashi Iwai
    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Sep, 2018

2 commits

  • Currently, blkcg destruction relies on a sequence of events:
    1. Destruction starts. blkcg_css_offline() is called and blkgs
    release their reference to the blkcg. This immediately destroys
    the cgwbs (writeback).
    2. With blkgs giving up their reference, the blkcg ref count should
    become zero and eventually call blkcg_css_free() which finally
    frees the blkcg.

    Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
    and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
    on the completion of all writeback associated with the blkcg. A count of
    the number of cgwbs is maintained and once that goes to zero, blkg
    destruction can follow. This should prevent premature blkg destruction
    related to writeback.

    The new process for blkcg cleanup is as follows:
    1. Destruction starts. blkcg_css_offline() is called which offlines
    writeback. Blkg destruction is delayed on the cgwb_refcnt count to
    avoid punting potentially large amounts of outstanding writeback
    to root while maintaining any ongoing policies. Here, the base
    cgwb_refcnt is put back.
    2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
    and handles destruction of blkgs. This is where the css reference
    held by each blkg is released.
    3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
    This finally frees the blkcg.
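
    As a simplified illustration of the ordering above: the "base" writeback
    reference is dropped at offline time, and blkg teardown only runs once
    the outstanding-writeback count reaches zero. The names below mirror the
    description (cgwb_refcnt, destroy_blkgs) but this is a sketch, not the
    kernel implementation, and a real version would use an atomic refcount.

    struct blkcg_sketch {
        int cgwb_refcnt;                /* outstanding cgwbs + one base ref */
    };

    static void destroy_blkgs(struct blkcg_sketch *blkcg)
    {
        /* step 2: release the css reference held by each blkg here */
    }

    static void cgwb_put(struct blkcg_sketch *blkcg)
    {
        if (--blkcg->cgwb_refcnt == 0)
            destroy_blkgs(blkcg);       /* blkg destruction only happens now */
    }

    static void css_offline(struct blkcg_sketch *blkcg)
    {
        /* step 1: offline writeback, then drop the base reference; blkg
         * destruction is deferred until all writeback has finished */
        cgwb_put(blkcg);
    }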

    It seems that in the past blk-throttle did some hard-to-follow things,
    taking data from a blkg while associating with current, and the
    simplification and unification of what blk-throttle does exposed this.

    Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Josef Bacik
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • This reverts commit 4c6994806f708559c2812b73501406e21ae5dcd0.

    Destroying blkgs is tricky because of the nature of the relationship. A
    blkg should go away when either a blkcg or a request_queue goes away.
    However, blkgs pin the blkcg to ensure they remain valid. To break this
    cycle, when a blkcg is offlined, blkgs put back their css ref. This
    eventually lets css_free() get called which frees the blkcg.

    The above commit (4c6994806f70) breaks this order of events by trying to
    destroy blkgs in css_free(). As the blkgs still hold references to the
    blkcg, css_free() is never called.

    The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
    addressed in the following patch by delaying destruction of a blkg until
    all writeback associated with the blkcg has been finished.

    Fixes: 4c6994806f70 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

09 Jul, 2018

3 commits

  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the io in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt stuff. There are
    a few differences from wbt:

    - It's bio based, so the latency covers the whole block layer in addition
      to the actual IO.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window. This is because writes
      can be particularly fast, which could give us a false sense of the
      impact of other workloads on our protected workload.
    - By default there's no throttling: we set the queue_depth to INT_MAX so
      that we can have as many outstanding bios as we're allowed to. Only at
      throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root cg issued IO and induce artificial
      delays in order to deal with cases like metadata-only or swap-heavy
      workloads.
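
    To make the control loop concrete, a toy sketch of the windowed
    mean-latency check is below. All names, the window bookkeeping, and the
    depth-scaling steps are illustrative assumptions, not blk-iolatency's
    actual implementation.

    #include <limits.h>

    struct iolat_sketch {
        unsigned long long win_total_ns;    /* latency sum in this window */
        unsigned long win_samples;          /* completions in this window */
        unsigned long long target_ns;       /* configured latency target */
        int peer_depth;                     /* depth granted to competing
                                               groups; INT_MAX = unthrottled */
    };

    static void window_end(struct iolat_sketch *s)
    {
        unsigned long long mean;

        if (!s->win_samples)
            return;
        mean = s->win_total_ns / s->win_samples;
        if (mean > s->target_ns) {
            /* protected group is missing its target: start or tighten
             * throttling of the competing groups */
            if (s->peer_depth == INT_MAX)
                s->peer_depth = 64;         /* arbitrary starting point */
            else if (s->peer_depth > 1)
                s->peer_depth /= 2;
        } else if (s->peer_depth != INT_MAX) {
            /* target met: relax, eventually back to "unthrottled" */
            s->peer_depth = s->peer_depth > INT_MAX / 2 ? INT_MAX
                                                        : s->peer_depth * 2;
        }
        s->win_total_ns = 0;
        s->win_samples = 0;
    }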

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 io at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Since IO can be issued from literally anywhere it's almost impossible to
    do throttling without having some sort of adverse effect somewhere else
    in the system because of locking or other dependencies. The best way to
    solve this is to do the throttling when we know we aren't holding any
    other kernel resources. Do this by tracking throttling on a per-blkg
    basis, and if we require throttling, flag the task so that it checks
    before it returns to user space and possibly sleeps there.

    This is to address the case where a process is doing work that is
    generating IO that can't be throttled, whether that is directly with a
    lot of REQ_META IO, or indirectly by allocating so much memory that it
    is swamping the disk with REQ_SWAP. We can't use task_work_add() as we
    don't want to induce a memory allocation in the IO path, so simply
    saving the request queue in the task and flagging it to do the
    notify_resume thing achieves the same result without the overhead of a
    memory allocation.
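
    A rough userspace analogue of that mechanism: the submission path only
    records that throttling is owed (no allocation, no sleeping), and the
    debt is paid at a safe point standing in for the return to user space.
    The names and the thread-local storage are assumptions for illustration.

    #include <stdbool.h>
    #include <unistd.h>

    struct queue_sketch { unsigned int delay_us; };

    static __thread struct queue_sketch *throttle_queue;
    static __thread bool need_throttle;

    static void note_throttle(struct queue_sketch *q)
    {
        /* IO submission path: just remember the queue and set a flag */
        throttle_queue = q;
        need_throttle = true;
    }

    static void maybe_throttle_on_return(void)
    {
        /* stand-in for the notify_resume-style hook run before returning
         * to user space, where sleeping is safe */
        if (need_throttle) {
            usleep(throttle_queue->delay_us);
            need_throttle = false;
            throttle_queue = NULL;
        }
    }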

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blk-iolatency has a few stats that it would like to print out, and
    instead of adding a bunch of crap to the generic code just provide a
    helper so that controllers can add stuff to the stat line if they want
    to.

    Hide it behind a boot option since it changes the output of io.stat from
    normal, and these stats are only interesting to developers.
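
    In effect this amounts to an optional per-policy callback that appends
    extra fields to the io.stat line only when a debug switch is set; a
    small sketch with placeholder names (debug_stats, extra_stat_fn) is
    below, not the kernel's actual interface.

    #include <stdbool.h>
    #include <stdio.h>

    static bool debug_stats;        /* imagine this set by a boot option */

    typedef int (*extra_stat_fn)(char *buf, size_t size);

    static int print_stat_line(char *buf, size_t size, extra_stat_fn extra)
    {
        int len = snprintf(buf, size, "rbytes=0 wbytes=0");

        /* keep io.stat stable by default; only append the developer-only
         * fields when the debug switch was given */
        if (debug_stats && extra && len > 0 && (size_t)len < size)
            len += extra(buf + len, size - len);
        return len;
    }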

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

19 Apr, 2018

2 commits

  • The initialization of q->root_blkg is currently done outside of the
    queue lock and RCU, so the blkg may be destroyed before the
    initialization completes, which may leave dangling/NULL references.
    Blkg destruction, on the other hand, is protected by the queue lock or
    RCU. Move the initialization inside the queue lock and an RCU read
    section to make it safe.
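
    The intended shape is roughly the following (a sketch of the intent,
    not the exact diff applied): the root blkg creation and the
    q->root_blkg assignment sit inside the same RCU read section and
    queue_lock region that blkg destruction takes.

    rcu_read_lock();
    spin_lock_irq(q->queue_lock);
    blkg = blkg_create(&blkcg_root, q, new_blkg);
    if (!IS_ERR(blkg)) {
        /* published only while destruction is excluded */
        q->root_blkg = blkg;
        q->root_rl.blkg = blkg;
    }
    spin_unlock_irq(q->queue_lock);
    rcu_read_unlock();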

    Signed-off-by: Jiang Biao
    Signed-off-by: Wen Yang
    CC: Tejun Heo
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Jiang Biao
     
  • The comment before blkg_create() in blkcg_init_queue() was moved
    from blkcg_activate_policy() by commit ec13b1d6f0a0457312e615, but
    it does not suit the new context.

    Signed-off-by: Jiang Biao
    Signed-off-by: Wen Yang
    CC: Tejun Heo
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Jiang Biao
     

18 Apr, 2018

1 commit

  • As described in the comment of blkcg_activate_policy(),
    *Update of each blkg is protected by both queue and blkcg locks so
    that holding either lock and testing blkcg_policy_enabled() is
    always enough for dereferencing policy data.*
    With the queue lock held, there is therefore no need to hold the blkcg
    lock in blkcg_deactivate_policy(). A similar case is
    blkcg_activate_policy(), which stopped holding the blkcg lock in
    commit 4c55f4f9ad3001ac1fefdd8d8ca7641d18558e23.

    Signed-off-by: Jiang Biao
    Signed-off-by: Wen Yang
    CC: Tejun Heo
    Signed-off-by: Jens Axboe

    Jiang Biao
     

17 Mar, 2018

1 commit

  • We've triggered a WARNING in blk_throtl_bio() when throttling writeback
    IO: it complains that blkg->refcnt is already 0 when blkg_get() is
    called, and then the kernel crashes with an invalid page request.
    After investigating this issue, we found it is caused by a race
    between blkcg_bio_issue_check() and cgroup_rmdir(), which is described
    below:

    writeback kworker               cgroup_rmdir
                                      cgroup_destroy_locked
                                        kill_css
                                          css_killed_ref_fn
                                            css_killed_work_fn
                                              offline_css
                                                blkcg_css_offline
      blkcg_bio_issue_check
        rcu_read_lock
        blkg_lookup
        spin_trylock(q->queue_lock)
                                              blkg_destroy
                                              spin_unlock(q->queue_lock)
        blk_throtl_bio
        spin_lock_irq(q->queue_lock)
        ...
        spin_unlock_irq(q->queue_lock)
        rcu_read_unlock

    Since RCU can only prevent the blkg from being released while it is in
    use, blkg->refcnt can be decreased to 0 during blkg_destroy(), which
    schedules the blkg release.
    Trying to blkg_get() in blk_throtl_bio() then triggers the WARNING,
    and the corresponding blkg_put() schedules the blkg release again,
    which results in a double free.
    This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
    creation in blkcg_bio_issue_check()"). Before this commit, the code would
    look up the blkg first and then try the lookup/create again with the
    queue_lock held. Since reviving that logic would be a bit drastic, fix it
    by only offlining the pd during blkcg_css_offline() and moving the rest
    of the destruction (especially blkg_put()) into blkcg_css_free(), which
    should be the right way as discussed.

    Fixes: ae1188963611 ("blkcg: consolidate blkg creation in blkcg_bio_issue_check()")
    Reported-by: Jiufei Xue
    Signed-off-by: Joseph Qi
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Joseph Qi
     

02 Jun, 2017

1 commit

  • Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
    essential that the memory allocated for struct request_queue
    stays around until all blk_exit_rl() calls have finished. Hence
    make blk_init_rl() take a reference on struct request_queue.
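
    The pattern is the generic one sketched below: the child object pins its
    parent at init time so the child's exit path can still dereference it.
    This is an illustrative stand-in, not the actual blk_init_rl() and
    blk_exit_rl() bodies.

    #include <stdlib.h>

    struct queue_sketch { int refcnt; };
    struct rl_sketch    { struct queue_sketch *q; };

    static void queue_get(struct queue_sketch *q) { q->refcnt++; }

    static void queue_put(struct queue_sketch *q)
    {
        if (--q->refcnt == 0)
            free(q);                /* assumes a heap-allocated queue */
    }

    static void rl_init(struct rl_sketch *rl, struct queue_sketch *q)
    {
        rl->q = q;
        queue_get(q);               /* keep the queue alive for rl_exit() */
    }

    static void rl_exit(struct rl_sketch *rl)
    {
        /* may run from an RCU callback long after the queue's other
         * references are gone; the reference taken in rl_init() keeps
         * rl->q valid here */
        queue_put(rl->q);
    }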

    This patch fixes the following crash:

    general protection fault: 0000 [#2] SMP
    CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    task: ffff88013a108040 task.stack: ffffc9000071c000
    RIP: 0010:free_request_size+0x1a/0x30
    RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
    RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
    RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
    R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
    R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
    FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
    Call Trace:
    mempool_destroy.part.10+0x21/0x40
    mempool_destroy+0xe/0x10
    blk_exit_rl+0x12/0x20
    blkg_free+0x4d/0xa0
    __blkg_release_rcu+0x59/0x170
    rcu_process_callbacks+0x260/0x4e0
    __do_softirq+0x116/0x250
    smpboot_thread_fn+0x123/0x1e0
    kthread+0x109/0x140
    ret_from_fork+0x31/0x40

    Fixes: commit e9c787e65c0c ("scsi: allocate scsi_cmnd structures as part of struct request")
    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: # v4.11+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

30 Mar, 2017

2 commits

  • blkg_conf_prep() currently calls blkg_lookup_create() while holding
    request queue spinlock. This means allocating memory for struct
    blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
    failures in call paths like below:

    pcpu_alloc+0x68f/0x710
    __alloc_percpu_gfp+0xd/0x10
    __percpu_counter_init+0x55/0xc0
    cfq_pd_alloc+0x3b2/0x4e0
    blkg_alloc+0x187/0x230
    blkg_create+0x489/0x670
    blkg_lookup_create+0x9a/0x230
    blkg_conf_prep+0x1fb/0x240
    __cfqg_set_weight_device.isra.105+0x5c/0x180
    cfq_set_weight_on_dfl+0x69/0xc0
    cgroup_file_write+0x39/0x1c0
    kernfs_fop_write+0x13f/0x1d0
    __vfs_write+0x23/0x120
    vfs_write+0xc2/0x1f0
    SyS_write+0x44/0xb0
    entry_SYSCALL_64_fastpath+0x18/0xad

    In the code path above, percpu allocator cannot call vmalloc() due to
    queue spinlock.

    A failure in this call path gives grief to tools which are trying to
    configure io weights. We see occasional failures happen shortly after
    reboots even when system is not under any memory pressure. Machines
    with a lot of cpus are more vulnerable to this condition.

    Do struct blkcg_gq allocations outside the queue spinlock to allow
    blocking during memory allocations.
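
    The usual shape of such a fix is sketched below: do the blocking
    allocation with no lock held, retake the lock, and either install the
    new object or discard it if the race was lost. This is illustrative
    code, not the actual blkg_lookup_create() changes.

    #include <pthread.h>
    #include <stdlib.h>

    struct obj_sketch { int dummy; };

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static struct obj_sketch *cached;

    static struct obj_sketch *lookup_create(void)
    {
        struct obj_sketch *new_obj, *obj;

        /* blocking allocation is fine here: no spinlock held yet */
        new_obj = calloc(1, sizeof(*new_obj));

        pthread_mutex_lock(&lock);
        obj = cached;
        if (!obj && new_obj) {
            cached = new_obj;       /* we won the race; install it */
            obj = new_obj;
            new_obj = NULL;
        }
        pthread_mutex_unlock(&lock);

        free(new_obj);              /* lost the race or it already existed */
        return obj;
    }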

    Suggested-by: Tejun Heo
    Signed-off-by: Tahsin Erdogan
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     
  • I inadvertently applied the wrong version of this patch rather than
    the agreed upon one. Revert this one so we can apply the right one.

    This reverts commit 7fc6b87a9ff537e7df32b1278118ce9c5bcd6788.

    Jens Axboe
     

29 Mar, 2017

1 commit

  • blkg_conf_prep() currently calls blkg_lookup_create() while holding
    request queue spinlock. This means allocating memory for struct
    blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
    failures in call paths like below:

    pcpu_alloc+0x68f/0x710
    __alloc_percpu_gfp+0xd/0x10
    __percpu_counter_init+0x55/0xc0
    cfq_pd_alloc+0x3b2/0x4e0
    blkg_alloc+0x187/0x230
    blkg_create+0x489/0x670
    blkg_lookup_create+0x9a/0x230
    blkg_conf_prep+0x1fb/0x240
    __cfqg_set_weight_device.isra.105+0x5c/0x180
    cfq_set_weight_on_dfl+0x69/0xc0
    cgroup_file_write+0x39/0x1c0
    kernfs_fop_write+0x13f/0x1d0
    __vfs_write+0x23/0x120
    vfs_write+0xc2/0x1f0
    SyS_write+0x44/0xb0
    entry_SYSCALL_64_fastpath+0x18/0xad

    In the code path above, percpu allocator cannot call vmalloc() due to
    queue spinlock.

    A failure in this call path gives grief to tools which are trying to
    configure io weights. We see occasional failures happen shortly after
    reboots even when system is not under any memory pressure. Machines
    with a lot of cpus are more vulnerable to this condition.

    Update blkg_create() function to temporarily drop the rcu and queue
    locks when it is allowed by gfp mask.

    Suggested-by: Tejun Heo
    Signed-off-by: Tahsin Erdogan
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

02 Feb, 2017

1 commit

  • We will want to have struct backing_dev_info allocated separately from
    struct request_queue. As the first step add pointer to backing_dev_info
    to request_queue and convert all users touching it. No functional
    changes in this patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

19 Jan, 2017

1 commit

  • There's no potential harm in quiescing the queue, but it also doesn't
    buy us anything. And we can't run the queue async for policy
    deactivate, since we could be in the path of tearing the queue down.
    If we schedule an async run of the queue at that time, we're racing
    with queue teardown AFTER we've already torn most of it down.

    Reported-by: Omar Sandoval
    Fixes: 4d199c6f1c84 ("blk-cgroup: ensure that we clear the stop bit on quiesced queues")
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Jan, 2017

2 commits

  • If we call blk_mq_quiesce_queue() on a queue, we must remember to
    pair that with something that clears the stopped bit on the
    queues later on.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

22 Nov, 2016

1 commit

  • blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
    when that fails falls back to operations which aren't specific to the
    cgroup. Occasional failures are expected under pressure and falling
    back to non-cgroup operation is the right thing to do.

    Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
    these expected failures end up creating a lot of noise. Add
    __GFP_NOWARN.

    Signed-off-by: Tejun Heo
    Reported-by: Marc MERLIN
    Reported-by: Vlastimil Babka
    Signed-off-by: Jens Axboe

    Tejun Heo
     

30 Sep, 2016

1 commit

  • Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
    such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
    avoids that smatch reports the following error:

    block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex'
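
    The bug class is the familiar error-path double unlock; a minimal sketch
    of the corrected shape, with placeholder names rather than the actual
    blkcg_policy_register() code, is:

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t pol_mutex = PTHREAD_MUTEX_INITIALIZER;
    static void *registered_cpd;

    static int register_thing(void)
    {
        void *cpd;
        int ret = 0;

        pthread_mutex_lock(&pol_mutex);
        cpd = malloc(16);
        if (!cpd) {
            ret = -1;
            goto out;               /* fall through to the single unlock */
        }
        registered_cpd = cpd;       /* the registry now owns cpd */
    out:
        pthread_mutex_unlock(&pol_mutex);
        return ret;
    }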

    Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug")
    Signed-off-by: Bart Van Assche
    Cc: Tejun Heo
    Cc: # v4.2+
    Signed-off-by: Tejun Heo

    Bart Van Assche
     

10 Feb, 2016

1 commit

  • get_disk() and get_gendisk() calls have a non-obvious side effect: they
    increase the reference count on the disk owner module.

    The following is the correct sequence how to get a disk reference and
    to put it:

    disk = get_gendisk(...);

    /* use disk */

    owner = disk->fops->owner;
    put_disk(disk);
    module_put(owner);

    fs/block_dev.c is aware of this required module_put() call, but e.g.
    blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
    a module reference. To see the leak in action, the cgroup throttle config
    can be used. In the following script I'm removing the throttle for
    /dev/ram0 (actually this is a NOP, because a throttle was never set for
    this device):

    # lsmod | grep brd
    brd 5175 0
    # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
    /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
    done
    # lsmod | grep brd
    brd 5175 100

    Now brd module has 100 references.

    The issue is fixed by calling module_put() right after put_disk().

    Signed-off-by: Roman Pen
    Cc: Gi-Oh Kim
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Jens Axboe

    Roman Pen
     

03 Dec, 2015

1 commit

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B and A's processes should be moved to
    the former and B's processes to the latter. IOW, enabling controllers
    can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. The pids controller depends on the migration methods to
    move charges, and this made the controller attribute charges to the
    wrong csses, often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the three
    migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
    by updating the cgroup_taskset iteration helpers to also return the
    destination css in addition to the task being migrated (see the sketch
    after the list below). All controllers are updated accordingly.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's a single source
    and destination and thus doesn't support the v2 hierarchy already. The
    only change made by this patchset is how that single destination css
    is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.
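
    A sketch of what the updated migration methods look like, modeled on the
    description above (per-task destination css coming from the taskset
    iteration); treat the exact signatures as an approximation rather than a
    quote of the kernel sources.

    static int example_can_attach(struct cgroup_taskset *tset)
    {
        struct task_struct *task;
        struct cgroup_subsys_state *dst_css;

        /* the destination css comes from the iteration itself and may
         * differ from task to task in a multi-destination migration */
        cgroup_taskset_for_each(task, dst_css, tset) {
            /* charge or check limits against dst_css here */
        }
        return 0;
    }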

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem which is scheduled to be merged in this
    merge window resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

22 Oct, 2015

1 commit

  • The stat files on the root cgroup show stats for the whole system and
    usually don't contain any information which isn't available through
    the usual system monitoring mechanisms. Some controllers skip
    collecting these duplicate stats to optimize cases where cgroup isn't
    used and later try to emulate the result on demand.

    This leads to complexities and subtle differences in the information
    shown through different channels. This is entirely unnecessary, and
    cgroup v2 is dropping such duplicate stat files from all
    controllers. This patch removes "io.stat" from the root hierarchy.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Vivek Goyal

    Tejun Heo
     

20 Sep, 2015

1 commit

  • Pull block updates from Jens Axboe:
    "This is a bit bigger than it should be, but I could (did) not want to
    send it off last week due to both wanting extra testing, and expecting
    a fix for the bounce regression as well. In any case, this contains:

    - Fix for the blk-merge.c compilation warning on gcc 5.x from me.

    - A set of back/front SG gap merge fixes, from me and from Sagi.
    This ensures that we honor SG gapping for integrity payloads as
    well.

    - Two small fixes for null_blk from Matias, fixing a leak and a
    capacity propagation issue.

    - A blkcg fix from Tejun, fixing a NULL dereference.

    - A fast clone optimization from Ming, fixing a performance
    regression since the arbitrarily sized bio's were introduced.

    - Also from Ming, a regression fix for bouncing IOs"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: fix bounce_end_io
    block: blk-merge: fast-clone bio when splitting rw bios
    block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
    block: Copy a user iovec if it includes gaps
    block: Refuse adding appending a gapped integrity page to a bio
    block: Refuse request/bio merges with gaps in the integrity payload
    block: Check for gaps on front and back merges
    null_blk: fix wrong capacity when bs is not 512 bytes
    null_blk: fix memory leak on cleanup
    block: fix bogus compiler warnings in blk-merge.c

    Linus Torvalds
     

11 Sep, 2015

1 commit

  • While making the root blkg unconditional, ec13b1d6f0a0 ("blkcg: always
    create the blkcg_gq for the root blkcg") removed the part which clears
    q->root_blkg and ->root_rl.blkg during q exit. This leaves the two
    pointers dangling after blkg_destroy_all(). The blk-throttle exit path
    performs blkg traversals, dereferences ->root_blkg, and can lead to
    the following oops.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
    IP: [] __blkg_lookup+0x26/0x70
    ...
    task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
    RIP: 0010:[] [] __blkg_lookup+0x26/0x70
    ...
    Call Trace:
    [] blk_throtl_drain+0x5a/0x110
    [] blkcg_drain_queue+0x18/0x20
    [] __blk_drain_queue+0xc0/0x170
    [] blk_queue_bypass_start+0x61/0x80
    [] blkcg_deactivate_policy+0x39/0x100
    [] blk_throtl_exit+0x38/0x50
    [] blkcg_exit_queue+0x3e/0x50
    [] blk_release_queue+0x1e/0xc0
    ...

    While the bug is a straightforward use-after-free bug, it is tricky to
    reproduce because blkg release is RCU protected and the rest of the exit
    path usually finishes before the RCU grace period.

    This patch fixes the bug by updating blkg_destroy_all() to clear
    q->root_blkg and ->root_rl.blkg.

    Signed-off-by: Tejun Heo
    Reported-by: "Richard W.M. Jones"
    Reported-by: Josh Boyer
    Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
    Fixes: ec13b1d6f0a0 ("blkcg: always create the blkcg_gq for the root blkcg")
    Cc: stable@vger.kernel.org # v4.2+
    Tested-by: Richard W.M. Jones
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 Aug, 2015

5 commits

  • cgroup is trying to make interface consistent across different
    controllers. For weight based resource control, the knob should have
    the range [1, 10000] and default to 100. This patch updates
    cfq-iosched so that the weight range conforms. The internal
    calculations have enough range and the widening of the weight range
    shouldn't cause any problem.

    * blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
    when blkcg is attached to a hierarchy.

    * cfq_cpd_init() is updated to use the new default value on the
    unified hierarchy.

    * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
    apply the default config matching the hierarchy type.

    * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
    is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
    now responsible for initializing the initial weights when blkcg is
    enabled.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg interface grew to be the biggest of all controllers and
    unfortunately the most inconsistent too. The interface files are
    inconsistent, with a number of close duplicates. Some files have
    recursive variants while others don't. There's a distinction between
    normal and leaf weights which isn't intuitive, and there are a lot of
    stat knobs which don't make much sense outside of debugging and expose
    too many implementation details to userland.

    In the unified hierarchy, everything is always hierarchical and
    internal nodes can't have tasks, rendering the two structural issues
    twisting the current interface moot. The interface has to be updated in
    a significant way anyway, and this is a good chance to revamp it as a
    whole. This patch implements the blkcg interface for the unified
    hierarchy.

    * (from a previous patch) blkcg is identified by "io" instead of
    "blkio" on the unified hierarchy. Given that the whole interface is
    updated anyway, the rename shouldn't carry noticeable conversion
    overhead.

    * The original interface, which consisted of 27 files, is replaced with
    the following three files.

    blkio.stat : per-blkcg stats
    blkio.weight : per-cgroup and per-cgroup-queue weight settings
    blkio.max : per-cgroup-queue bps and iops max limits

    Documentation/cgroups/unified-hierarchy.txt updated accordingly.

    v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list. Fixed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Export blkg_dev_name()

    * Drop unnecessary @cft from __cfq_set_weight().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_conf_prep() expects input to be of the following form

    MAJ:MIN NUM

    and reads the NUM part into blkg_conf_ctx->v. This is quite
    restrictive and gets in the way in implementing blkcg interface for
    the unified hierarchy. This patch updates blkg_conf_prep() so that it
    expects

    MAJ:MIN BODY_STR

    where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
    with ->body which is a char pointer pointing to the start of BODY_STR.
    Parsing of the body is moved to blkg_conf_prep()'s callers.

    To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
    non-const pointer, and to accommodate that, const is dropped from @input
    too.

    This doesn't cause any behavior changes.
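
    A small userspace illustration of the "MAJ:MIN BODY_STR" split described
    above; this is a sketch of the parsing contract, not the
    blkg_conf_prep() implementation.

    #define _DEFAULT_SOURCE         /* for strsep() on glibc */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char input[] = "8:16 riops=1000 wiops=500";
        unsigned int major, minor;
        int consumed = 0;
        char *body, *tok;

        /* split off MAJ:MIN; everything after it is the opaque body */
        if (sscanf(input, "%u:%u %n", &major, &minor, &consumed) != 2)
            return 1;
        body = input + consumed;    /* mutable, caller-owned body string */

        /* the caller is now free to parse the body, e.g. with strsep() */
        while ((tok = strsep(&body, " ")))
            printf("%u:%u token '%s'\n", major, minor, tok);
        return 0;
    }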

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is about to grow an interface for the unified hierarchy. Mark
    the existing cftypes as legacy.

    * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
    * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
    * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
    * blk-throttle.c:throtl_files -> throtl_legacy_files

    Pure renames. No functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo