08 Sep, 2014

1 commit

  • blkcg->id is a unique id given to each blkcg; however, the
    cgroup_subsys_state which each blkcg embeds already has ->serial_nr
    which can be used for the same purpose. Drop blkcg->id and replace
    its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to
    ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
    consistency.
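
    As a rough userspace sketch of the pattern this relies on: the core
    stamps each object with a monotonically increasing serial number, and
    a cached consumer (cfq's cfq_cgroup here) detects that its group
    changed by comparing serials. All names below are illustrative
    stand-ins, not the kernel structures.

```c
#include <stdint.h>

/* Core-assigned counter, playing the role of css->serial_nr. */
static uint64_t next_serial;

struct group {                  /* stands in for blkcg */
    uint64_t serial_nr;
};

struct consumer {               /* stands in for cfq_cgroup */
    uint64_t cached_serial_nr;  /* like the renamed blkcg_serial_nr */
};

static void group_init(struct group *g)
{
    g->serial_nr = ++next_serial;   /* unique for the object's lifetime */
}

/* Returns 1 and re-associates if the cached group association is stale,
 * mirroring what check_blkcg_changed() does with @serial_nr. */
static int group_changed(struct consumer *c, const struct group *g)
{
    if (c->cached_serial_nr == g->serial_nr)
        return 0;
    c->cached_serial_nr = g->serial_nr;
    return 1;
}
```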

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

05 Aug, 2014

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for the hierarchical behavior which is in
    the similar style with other controllers so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

15 Jul, 2014

2 commits

  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a much clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a much clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

12 Jul, 2014

1 commit

  • While a queue is being destroyed, all the blkgs are destroyed and its
    ->root_blkg pointer is set to NULL. If someone else starts to drain
    while the queue is in this state, the following oops happens.

    NULL pointer dereference at 0000000000000028
    IP: [] blk_throtl_drain+0x84/0x230
    PGD e4a1067 PUD b773067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
    CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
    RIP: 0010:[] [] blk_throtl_drain+0x84/0x230
    RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
    R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
    R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
    FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
    Stack:
    ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
    ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
    ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
    Call Trace:
    [] blkcg_drain_queue+0x1f/0x60
    [] __blk_drain_queue+0x71/0x180
    [] blk_queue_bypass_start+0x6e/0xb0
    [] blkcg_deactivate_policy+0x38/0x120
    [] blk_throtl_exit+0x34/0x50
    [] blkcg_exit_queue+0x35/0x40
    [] blk_release_queue+0x26/0xd0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] blk_put_queue+0x15/0x20
    [] scsi_device_dev_release_usercontext+0x16b/0x1c0
    [] execute_in_process_context+0x89/0xa0
    [] scsi_device_dev_release+0x1c/0x20
    [] device_release+0x32/0xa0
    [] kobject_cleanup+0x38/0x70
    [] kobject_put+0x28/0x60
    [] put_device+0x17/0x20
    [] __scsi_remove_device+0xa9/0xe0
    [] scsi_remove_device+0x2b/0x40
    [] sdev_store_delete+0x27/0x30
    [] dev_attr_store+0x18/0x30
    [] sysfs_kf_write+0x3e/0x50
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xaf/0x1d0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    776687bce42b ("block, blk-mq: draining can't be skipped even if
    bypass_depth was non-zero") made it easier to trigger this bug by
    making blk_queue_bypass_start() drain even when it loses the first
    bypass test to blk_cleanup_queue(); however, the bug has always been
    there even before the commit as blk_queue_bypass_start() could race
    against queue destruction, win the initial bypass test but perform the
    actual draining after blk_cleanup_queue() already destroyed all blkgs.

    Fix it by skipping the call into policy draining if all the blkgs
    are already gone.
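
    The shape of the fix can be sketched as a guard in the drain entry
    point: bail out once the queue's root_blkg has been cleared by
    destruction. The names below are illustrative stand-ins for
    blkcg_drain_queue() and the policy drain callbacks, not the kernel
    functions themselves.

```c
#include <stddef.h>

struct queue {
    void *root_blkg;        /* NULL once all blkgs are destroyed */
};

static int drains_performed;

/* Would walk the blkgs; in the kernel this oopses if they're gone. */
static void policy_drain(struct queue *q)
{
    (void)q;
    drains_performed++;
}

static void drain_queue(struct queue *q)
{
    /* The fix: skip calling into policy draining once blkgs are gone. */
    if (!q->root_blkg)
        return;
    policy_drain(q);
}
```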

    Signed-off-by: Tejun Heo
    Reported-by: Shirish Pargaonkar
    Reported-by: Sasha Levin
    Reported-by: Jet Chen
    Cc: stable@vger.kernel.org
    Tested-by: Shirish Pargaonkar
    Signed-off-by: Jens Axboe

    Tejun Heo
     

09 Jul, 2014

1 commit

  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    cgroup now has a mechanism to express such dependency -
    cgroup_subsys->depends_on. This patch declares that blkcg depends on
    memcg so that memcg is enabled automatically on the default hierarchy
    when available. Future changes will make blkcg map the memcg tag to
    find out the cgroup to blame for writeback IOs.

    As this means that a memcg may be made invisible, this patch also
    implements css_reset() for memcg which resets its basic
    configurations. This implementation will probably need to be expanded
    to cover other states which are used in the default hierarchy.

    v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
    build failure. Reported by kbuild test robot.
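
    The dependency mechanism can be pictured as a mask resolved to a
    fixed point: enabling a controller implicitly enables everything in
    the transitive closure of its ->depends_on masks. The subsystems,
    bit assignments and resolve helper below are hypothetical; only the
    blkcg-depends-on-memcg edge comes from this commit.

```c
#include <stdint.h>

enum {
    MEMCG = 1u << 0,
    BLKCG = 1u << 1,
    FOO   = 1u << 2,    /* hypothetical controller depending on blkcg */
};

/* depends_on[i] is the dependency mask of the controller using bit i. */
static const uint32_t depends_on[] = {
    0,          /* memcg: no dependencies */
    MEMCG,      /* blkcg depends on memcg, as declared in this commit */
    BLKCG,      /* foo: transitively pulls in memcg too */
};

/* Expand an enabled-mask until it contains all transitive dependencies. */
static uint32_t resolve_dependencies(uint32_t enabled)
{
    uint32_t prev;

    do {
        prev = enabled;
        for (int i = 0; i < 3; i++)
            if (enabled & (1u << i))
                enabled |= depends_on[i];
    } while (enabled != prev);
    return enabled;
}
```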

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     

23 Jun, 2014

2 commits

  • This reverts commit a2d445d440003f2d70ee4cd4970ea82ace616fee.

    The original commit is buggy, we do use the registration functions
    at runtime for modular builds.

    Jens Axboe
     
  • Hello,

    So, this patch should do. Joe, Vivek, can one of you guys please
    verify that the oops goes away with this patch?

    Jens, the original thread can be read at

    http://thread.gmane.org/gmane.linux.kernel/1720729

    The fix converts blkg->refcnt from int to atomic_t. It adds some
    overhead, but it should be minute compared to everything else which
    is going on and the involved cacheline bouncing, so I think it's
    highly unlikely to cause any noticeable difference. Also, the refcnt
    in question should be converted to a percpu_ref for blk-mq anyway,
    so the atomic_t is likely to go away pretty soon.

    Thanks.

    ------- 8< -------
    __blkg_release_rcu() may be invoked after the associated request_queue
    is released, with an RCU grace period in between. As such, the function
    and callbacks invoked from it must not dereference the associated
    request_queue. This is clearly indicated in the comment above the
    function.

    Unfortunately, while trying to fix a different issue, 2a4fd070ee85
    ("blkcg: move bulk of blkcg_gq release operations to the RCU
    callback") ignored this and added [un]locking of @blkg->q->queue_lock
    to __blkg_release_rcu(). This of course can cause oops as the
    request_queue may be long gone by the time this code gets executed.

    general protection fault: 0000 [#1] SMP
    CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
    Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
    task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
    RIP: 0010:[] [] _raw_spin_lock_irq+0x15/0x60
    RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086
    RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
    RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
    RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
    R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
    R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
    Stack:
    ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
    ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
    ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
    Call Trace:
    [] __blkg_release_rcu+0x72/0x150
    [] rcu_nocb_kthread+0x1e8/0x300
    [] kthread+0xe1/0x100
    [] ret_from_fork+0x7c/0xb0
    Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
    +fa 66 66 90 66 66 90 b8 00 00 02 00 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
    +b7
    RIP [] _raw_spin_lock_irq+0x15/0x60
    RSP

    The request_queue locking was added because blkcg_gq->refcnt is an int
    protected with the queue lock and __blkg_release_rcu() needs to put
    the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and
    dropping queue locking in the function.

    Given the general heavy weight of the current request_queue and blkcg
    operations, this is unlikely to cause any noticeable overhead.
    Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
    the near future, so whatever (most likely negligible) overhead it may
    add is temporary.
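
    A userspace analogue of the int-to-atomic_t conversion, using C11
    atomics: once the decrement itself is atomic, put() no longer needs
    the queue lock, and only the thread that drops the last reference
    runs the release path. Structure and function names are illustrative,
    not the kernel's.

```c
#include <stdatomic.h>

struct blkg_like {
    atomic_int refcnt;
    int released;
};

static void blkg_like_get(struct blkg_like *g)
{
    atomic_fetch_add(&g->refcnt, 1);
}

static void blkg_like_put(struct blkg_like *g)
{
    /* fetch_sub returns the old value: 1 means this was the last ref,
     * so run the release path (the kernel would go through RCU here).
     * No external lock is needed for the count to stay correct. */
    if (atomic_fetch_sub(&g->refcnt, 1) == 1)
        g->released = 1;
}
```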

    Signed-off-by: Tejun Heo
    Reported-by: Joe Lawrence
    Acked-by: Vivek Goyal
    Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Jun, 2014

2 commits

  • Pull block layer fixes from Jens Axboe:
    "Final small batch of fixes to be included before -rc1. Some general
    cleanups in here as well, but some of the blk-mq fixes we need for the
    NVMe conversion and/or scsi-mq. The pull request contains:

    - Support for not merging across a specified "chunk size", if set by
    the driver. Some NVMe devices perform poorly for IO that crosses
    such a chunk, so we need to support it generically as part of
    request merging to avoid having to do complicated split logic. From
    me.

    - Bump max tag depth to 10Ki tags. Some scsi devices have a huge
    shared tag space. Previously we failed with EINVAL if too large a
    tag depth was specified; now we truncate it and pass back the actual
    value. From me.

    - Various blk-mq rq init fixes from me and others.

    - A fix for enter on a dying queue for blk-mq from Keith. This is
    needed to prevent oopsing on hot device removal.

    - Fixup for blk-mq timer addition from Ming Lei.

    - Small round of performance fixes for mtip32xx from Sam Bradshaw.

    - Minor stack leak fix from Rickard Strandqvist.

    - Two __init annotations from Fabian Frederick"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add __init to blkcg_policy_register
    block: add __init to elv_register
    block: ensure that bio_add_page() always accepts a page for an empty bio
    blk-mq: add timer in blk_mq_start_request
    blk-mq: always initialize request->start_time
    block: blk-exec.c: Cleaning up local variable address returnd
    mtip32xx: minor performance enhancements
    blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
    blk-mq: don't allow queue entering for a dying queue
    blk-mq: bump max tag depth to 10K tags
    block: add blk_rq_set_block_pc()
    block: add notion of a chunk size for request merging

    Linus Torvalds
     
  • blkcg_policy_register is only called by
    __init functions:

    __init cfq_init
    __init throtl_init

    Cc: Andrew Morton
    Signed-off-by: Fabian Frederick
    Signed-off-by: Jens Axboe

    Fabian Frederick
     

14 May, 2014

1 commit

  • Unlike the more usual refcnting, what css_tryget() provides is the
    distinction between online and offline csses instead of protection
    against upping a refcnt which already reached zero. cgroup is
    planning to provide actual tryget which fails if the refcnt already
    reached zero. Let's rename the existing trygets so that they clearly
    indicate that they're about onliness.

    I thought about keeping the existing names as-are and introducing new
    names for the planned actual tryget; however, given that each
    controller participates in the synchronization of the online state, it
    seems worthwhile to make it explicit that these functions are about
    on/offline state.

    Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
    to css_tryget_online_from_dir(). This is pure rename.

    v2: cgroup_freezer grew new usages of css_tryget(). Update
    accordingly.
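
    The distinction the rename makes explicit can be sketched with C11
    atomics: a "real" tryget fails only once the refcount has reached
    zero, while the renamed css_tryget_online() additionally fails once
    the object has been taken offline. This is a userspace analogue with
    illustrative names, not the kernel's percpu_ref machinery.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct css_like {
    atomic_int refcnt;
    bool online;
};

/* Increment-unless-zero: the planned "actual" tryget. */
static bool tryget(struct css_like *css)
{
    int old = atomic_load(&css->refcnt);

    do {
        if (old == 0)
            return false;       /* already dead; can't resurrect */
    } while (!atomic_compare_exchange_weak(&css->refcnt, &old, old + 1));
    return true;
}

/* What css_tryget_online() provides: also fail for offline objects. */
static bool tryget_online(struct css_like *css)
{
    if (!css->online)
        return false;
    return tryget(css);
}
```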

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

06 May, 2014

1 commit

  • During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
    which nests above both the kernfs s_active protection and cgroup_mutex
    is added to synchronize cgroup file type operations as cgroup_mutex
    needed to be grabbed from some file operations and thus can't be put
    above s_active protection.

    While this arrangement mostly worked for cgroup, this triggered the
    following lockdep warning.

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G W
    -------------------------------------------------------
    trinity-c173/9024 is trying to acquire lock:
    (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)

    but task is already holding lock:
    (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (s_active#89){++++.+}:
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
    kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
    cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
    cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
    rebind_subsystems (kernel/cgroup.c:1144)
    cgroup_setup_root (kernel/cgroup.c:1568)
    cgroup_mount (kernel/cgroup.c:1716)
    mount_fs (fs/super.c:1094)
    vfs_kern_mount (fs/namespace.c:899)
    do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
    SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
    tracesys (arch/x86/kernel/entry_64.S:746)

    -> #1 (cgroup_tree_mutex){+.+.+.}:
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
    blkcg_policy_register (block/blk-cgroup.c:1106)
    throtl_init (block/blk-throttle.c:1694)
    do_one_initcall (init/main.c:789)
    kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
    kernel_init (init/main.c:935)
    ret_from_fork (arch/x86/kernel/entry_64.S:552)

    -> #0 (blkcg_pol_mutex){+.+.+.}:
    __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
    cgroup_file_write (kernel/cgroup.c:2714)
    kernfs_fop_write (fs/kernfs/file.c:295)
    vfs_write (fs/read_write.c:532)
    SyS_write (fs/read_write.c:584 fs/read_write.c:576)
    tracesys (arch/x86/kernel/entry_64.S:746)

    other info that might help us debug this:

    Chain exists of:
    blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(s_active#89);
                                      lock(cgroup_tree_mutex);
                                      lock(s_active#89);
    lock(blkcg_pol_mutex);

    *** DEADLOCK ***

    4 locks held by trinity-c173/9024:
    #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
    #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
    #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
    #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)

    stack backtrace:
    CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G W 3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
    ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
    ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
    ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    print_circular_bug (kernel/locking/lockdep.c:1216)
    __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
    lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
    mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
    blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
    cgroup_file_write (kernel/cgroup.c:2714)
    kernfs_fop_write (fs/kernfs/file.c:295)
    vfs_write (fs/read_write.c:532)
    SyS_write (fs/read_write.c:584 fs/read_write.c:576)

    This is a highly unlikely but valid circular dependency between "echo
    1 > blkcg.reset_stats" and cfq module [un]loading. cgroup is going
    through further locking update which will remove this complication but
    for now let's use trylock on blkcg_pol_mutex and retry the file
    operation if the trylock fails.
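
    The trylock-and-retry workaround has this shape: attempt the outer
    mutex without blocking and, on contention, back out and restart the
    whole file operation rather than sleep and complete the circular
    chain. Below is a pthreads sketch; -EAGAIN stands in for the
    kernel's restart convention and the names are illustrative.

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t pol_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Plays the role of blkcg_reset_stats(): must not block on pol_mutex. */
static int reset_stats(void)
{
    if (pthread_mutex_trylock(&pol_mutex))
        return -EAGAIN;         /* contended: caller should retry */

    /* ... reset per-policy stats under the lock ... */

    pthread_mutex_unlock(&pol_mutex);
    return 0;
}

/* The file-operation caller simply retries on trylock failure. */
static int file_write(void)
{
    int ret;

    do {
        ret = reset_stats();
    } while (ret == -EAGAIN);
    return ret;
}
```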

    Signed-off-by: Tejun Heo
    Reported-by: Sasha Levin
    References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com

    Tejun Heo
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups,
    which weren't possible before as they would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h includes to various subsystems. This
    was triggered by the removal of the xattr.h include from cgroup.h.
    cgroup.h indirectly got included into a lot of files, which brought
    in xattr.h, which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

19 Feb, 2014

1 commit

  • (Trivial patch.)

    If the code is looking at the RCU-protected pointer itself, but not
    dereferencing it, the rcu_dereference() functions can be downgraded to
    rcu_access_pointer(). This commit makes this downgrade in blkg_destroy()
    and ioc_destroy_icq(), both of which simply compare the RCU-protected
    pointer against another pointer with no dereferencing.
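
    The difference can be modeled in userspace with C11 memory orders:
    rcu_dereference() implies the ordering needed to safely dereference
    the result, while rcu_access_pointer() is a bare load that is valid
    only for comparing the pointer value. The globals and helpers below
    are illustrative analogues, not the kernel macros.

```c
#include <stdatomic.h>
#include <stddef.h>

static _Atomic(int *) protected_ptr;    /* stands in for an __rcu pointer */

/* rcu_dereference()-like: ordered load; result may be dereferenced. */
static int *deref(void)
{
    return atomic_load_explicit(&protected_ptr, memory_order_acquire);
}

/* rcu_access_pointer()-like: relaxed load; compare only, never deref. */
static int *access_ptr(void)
{
    return atomic_load_explicit(&protected_ptr, memory_order_relaxed);
}

/* blkg_destroy()-style check: compare against another pointer. */
static int is_current(int *candidate)
{
    return access_ptr() == candidate;
}
```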

    Signed-off-by: Paul E. McKenney
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Paul E. McKenney
     

13 Feb, 2014

1 commit

  • If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
    css. The intention of the interface is to make it easy to skip css's
    (cgroup_subsys_states) which already match the migration target;
    however, this is entirely unnecessary as migration taskset doesn't
    include tasks which are already in the target cgroup. Drop @skip_css
    from cgroup_taskset_for_each().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann

    Tejun Heo
     

08 Feb, 2014

2 commits

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems; it doesn't and shouldn't
    have claim on such generic names which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the following.

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.

    This patch doesn't introduce any behavior changes.
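
    The "generated in cgroup core" part works like an X-macro list, the
    same trick cgroup_subsys.h plays with its SUBSYS() entries: one list
    expands into both the _cgrp_id enum and the name table, so the two
    can never drift apart. The subsystem list and macro names below are
    a reduced illustration, not the actual kernel header.

```c
/* One list of subsystems, expanded twice. */
#define SUBSYS_LIST(x) x(cpu) x(memory) x(perf_event)

/* Expansion 1: cpu_cgrp_id, memory_cgrp_id, ... in list order. */
#define GEN_ID(name) name##_cgrp_id,
enum cgroup_subsys_id { SUBSYS_LIST(GEN_ID) CGROUP_SUBSYS_COUNT };
#undef GEN_ID

/* Expansion 2: the userland-visible name table, in the same order. */
#define GEN_NAME(name) #name,
static const char *cgroup_subsys_name[] = { SUBSYS_LIST(GEN_NAME) };
#undef GEN_NAME
```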

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo
     
  • With module support dropped from net_prio, no controller is using
    cgroup module support. None of the actual resource controllers can be
    built as a module and we aren't gonna add new controllers which don't
    control resources. This patch drops module support from cgroup.

    * cgroup_[un]load_subsys() and cgroup_subsys->module removed.

    * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
    cgroup_subsys.h now uses IS_ENABLED() directly.

    * enum cgroup_subsys_id now exactly matches the list of enabled
    controllers as ordered in cgroup_subsys.h.

    * cgroup_subsys[] is now a contiguously occupied array. Size
    specification is no longer necessary and dropped.

    * for_each_builtin_subsys() is removed and for_each_subsys() is
    updated to not require any locking.

    * module ref handling is removed from rebind_subsystems().

    * Module related comments dropped.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    v3: Added {} around the if (need_forkexit_callback) block in
    cgroup_post_fork() for readability as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

23 Sep, 2013

1 commit

  • Pull block IO fixes from Jens Axboe:
    "After merge window, no new stuff this time only a collection of neatly
    confined and simple fixes"

    * 'for-3.12/core' of git://git.kernel.dk/linux-block:
    cfq: explicitly use 64bit divide operation for 64bit arguments
    block: Add nr_bios to block_rq_remap tracepoint
    If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
    bio-integrity: Fix use of bs->bio_integrity_pool after free
    blkcg: relocate root_blkg setting and clearing
    block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
    block: trace all devices plug operation

    Linus Torvalds
     

12 Sep, 2013

1 commit

  • Hello, Jens.

    The original thread can be read from

    http://thread.gmane.org/gmane.linux.kernel.cgroups/8937

    While it leads to an oops, it only triggers under specific
    configurations which aren't common, so I don't think it's necessary
    to backport it through -stable; merging it during the coming merge
    window should be enough.

    Thanks!

    ----- 8< -----
    Currently, q->root_blkg and q->root_rl.blkg are set from
    blkcg_activate_policy() and cleared from blkg_destroy_all(). This
    doesn't necessarily coincide with the lifetime of the root blkcg_gq,
    leading to the following oops when blkcg is enabled but no policy is
    activated, because __blk_queue_next_rl() malfunctions while expecting
    the root_blkg pointers to be set.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] __wake_up_common+0x2b/0x90
    PGD 60f7a9067 PUD 60f4c9067 PMD 0
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    gsmi: Log Shutdown Reason 0x03
    Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
    CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
    Hardware name: Intel XXX
    task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
    RIP: 0010:[] [] __wake_up_common+0x2b/0x90
    RSP: 0000:ffff88060d171818 EFLAGS: 00010096
    RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
    RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
    FS: 00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
    Stack:
    0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
    0000000000000082 0000000000000003 0000000000000000 0000000000000000
    ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
    Call Trace:
    [] __wake_up+0x48/0x70
    [] __blk_drain_queue+0x123/0x190
    [] blk_cleanup_queue+0xf5/0x210
    [] __scsi_remove_device+0x5a/0xd0
    [] scsi_remove_device+0x34/0x50
    [] scsi_remove_target+0x16b/0x220
    [] __iscsi_unbind_session+0xd1/0x1b0
    [] iscsi_remove_session+0xe2/0x1c0
    [] iscsi_destroy_session+0x16/0x60
    [] iscsi_session_teardown+0xd9/0x100
    [] iscsi_sw_tcp_session_destroy+0x5a/0xb0
    [] iscsi_if_rx+0x10e8/0x1560
    [] netlink_unicast+0x145/0x200
    [] netlink_sendmsg+0x303/0x410
    [] sock_sendmsg+0xa6/0xd0
    [] ___sys_sendmsg+0x38c/0x3a0
    [] ? fget_light+0x40/0x160
    [] ? fget_light+0x99/0x160
    [] ? fget_light+0x40/0x160
    [] __sys_sendmsg+0x49/0x90
    [] SyS_sendmsg+0x12/0x20
    [] system_call_fastpath+0x16/0x1b
    Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 8b 2a 48 8d 42 e8 49

    Fix it by moving q->root_blkg and q->root_rl.blkg setting to
    blkg_create() and clearing to blkg_destroy() so that they are
    initialized when a root blkg is created and cleared when destroyed.

    Reported-and-tested-by: Anatol Pomozov
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

09 Aug, 2013

6 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, the
    cgroup API, including the iterators, has recently gone through major
    restructuring, and no out-of-tree changes will be applicable without
    adjustments, making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.

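    The new convention and the documented conversion can be illustrated
    with a small userspace pre-order walk; types and names here are
    invented for the sketch:

```c
/* Sketch of the behavior change: a pre-order descendant walk that now
 * yields the origin (subtree root) first.  Callers wanting the old
 * semantics skip it explicitly, analogous to the documented
 * "if (pos == origin) continue;" conversion. */

#define MAX_CHILDREN 4

struct node {
    int id;
    int nr_children;
    struct node *children[MAX_CHILDREN];
};

/* Pre-order walk that includes @origin itself, recording visited ids. */
static int walk_pre(struct node *origin, int *out, int n)
{
    out[n++] = origin->id;              /* origin is visited first */
    for (int i = 0; i < origin->nr_children; i++)
        n = walk_pre(origin->children[i], out, n);
    return n;
}

/* Old-semantics caller: same walk, origin skipped in the loop body. */
static int sum_descendants_only(struct node *origin)
{
    int ids[16];
    int n = walk_pre(origin, ids, 0);
    int sum = 0;

    for (int i = 0; i < n; i++) {
        if (ids[i] == origin->id)
            continue;                   /* skip origin explicitly */
        sum += ids[i];
    }
    return sum;
}

static int demo_origin_included(void)
{
    struct node c1 = { .id = 2 }, c2 = { .id = 3 };
    struct node root = { .id = 1, .nr_children = 2,
                         .children = { &c1, &c2 } };
    int ids[16];
    int n = walk_pre(&root, ids, 0);

    return n == 3 && ids[0] == 1;       /* origin comes first */
}

static int demo_skip_origin(void)
{
    struct node c1 = { .id = 2 }, c2 = { .id = 3 };
    struct node root = { .id = 1, .nr_children = 2,
                         .children = { &c1, &c2 } };

    return sum_descendants_only(&root); /* only the two children */
}
```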
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question, as a css is the intersection between a cgroup
    and a subsystem.

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically independent from cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward, but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is transitioning to using css (cgroup_subsys_state) instead of
    cgroup as the primary subsystem handle. The cgroupfs file interface
    will be converted to use css's which requires finding out the
    subsystem from cftype so that the matching css can be determined from
    the cgroup.

    This patch adds cftype->ss which points to the subsystem the file
    belongs to. The field is initialized while a cftype is being
    registered. This makes it unnecessary to explicitly specify the
    subsystem for other cftype handling functions. @ss argument dropped
    from various cftype handling functions.

    This patch shouldn't introduce any behavior differences.

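    A minimal sketch of this kind of back-pointer pattern, with
    illustrative types rather than the kernel's actual struct cftype and
    registration code:

```c
/* Sketch: each file type carries a back-pointer to its owning
 * subsystem, stamped in at registration time, so later handlers need
 * no explicit @ss argument.  All names are made up for illustration. */

struct subsys;

struct cftype {
    const char *name;
    struct subsys *ss;    /* owning subsystem, set at registration */
};

struct subsys {
    const char *name;
};

/* Registration stamps every cftype in the array with its subsystem. */
static void register_cftypes(struct subsys *ss, struct cftype *cfts, int n)
{
    for (int i = 0; i < n; i++)
        cfts[i].ss = ss;
}

/* A handler can now recover the subsystem from the cftype alone;
 * core files with no subsystem would fall back to a dummy. */
static const char *cft_subsys_name(const struct cftype *cft)
{
    return cft->ss ? cft->ss->name : "dummy";
}

static int demo(void)
{
    static struct subsys blkio_ss = { "blkio" };
    static struct cftype files[2] = { { "weight", 0 }, { "stat", 0 } };

    register_cftypes(&blkio_ss, files, 2);
    return files[0].ss == &blkio_ss
        && files[1].ss == &blkio_ss
        && cft_subsys_name(&files[0])[0] == 'b';
}
```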
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, a subsystem only cares about its part in
    the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     

15 May, 2013

5 commits

  • With the recent updates, blk-throttle is finally ready for proper
    hierarchy support. Dispatching now honors service_queue->parent_sq
    and propagates correctly. The only thing missing is setting
    ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
    hierarchy.

    This patch updates throtl_pd_init() such that service_queues form the
    same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
    As this concludes proper hierarchy support for blkcg, the shameful
    .broken_hierarchy tag is removed from blkio_subsys.

    v2: Updated blkio-controller.txt as suggested by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Li Zefan

    Tejun Heo
     
    Currently, when the last reference of a blkcg_gq is put, all the
    release operations sans the actual freeing happen directly in
    blkg_put(). As blkg_put() may be called under queue_lock, all
    pd_exit_fn()s may be too. This makes it impossible for pd_exit_fn()s
    to use del_timer_sync() on timers which grab the queue_lock, which is
    an irq-safe lock, due to the deadlock possibility described in the
    comment on top of del_timer_sync().

    This can be easily avoided by performing the release operations in the
    RCU callback instead of directly from blkg_put(). This patch moves
    the blkcg_gq release operations to the RCU callback.

    As this leaves __blkg_release() with only call_rcu() invocation,
    blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
    call_rcu() invocation is now done directly from blkg_put() instead of
    going through __blkg_release() which is removed.

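    The shape of the fix can be sketched in userspace with a hand-rolled
    deferral queue standing in for call_rcu() and the grace period; all
    names here are illustrative:

```c
/* Sketch: the final put no longer runs release work inline (where the
 * caller may hold an irq-safe lock); it only queues a callback, and
 * the heavy teardown runs later from callback context. */

#include <stddef.h>

struct deferred {
    void (*fn)(struct deferred *);
    struct deferred *next;
};

static struct deferred *pending;        /* stand-in for the RCU queue */

static void defer(struct deferred *d, void (*fn)(struct deferred *))
{
    d->fn = fn;
    d->next = pending;
    pending = d;
}

/* Stand-in for the grace period elapsing: run the queued callbacks. */
static void run_deferred(void)
{
    while (pending) {
        struct deferred *d = pending;

        pending = d->next;
        d->fn(d);
    }
}

struct obj {
    int refcnt;
    int released;
    struct deferred rcu;
};

static void obj_release(struct deferred *d)
{
    /* Heavy teardown (e.g. del_timer_sync()) is safe here: the caller
     * of the final put and its locks are long gone. */
    struct obj *o = (struct obj *)((char *)d - offsetof(struct obj, rcu));

    o->released = 1;
}

static void obj_put(struct obj *o)
{
    if (--o->refcnt == 0)
        defer(&o->rcu, obj_release);    /* nothing heavy runs inline */
}

static int demo(void)
{
    struct obj o = { .refcnt = 1 };
    int released_inline;

    obj_put(&o);                 /* last ref: only queues the release */
    released_inline = o.released;
    run_deferred();
    return released_inline == 0 && o.released == 1;
}
```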
    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
    invoked in blkg_alloc() before the parent is linked. This makes it
    difficult for policies to perform initializations which are dependent
    on the parent.

    This patch moves pd_init_fn() invocations to blkg_create() after the
    parent blkg is linked where the new blkg is fully initialized. As
    this means that blkg_free() can't assume that pd's are initialized,
    pd_exit_fn() invocations are moved to __blkg_release(). This
    guarantees that pd_exit_fn() is also invoked with fully initialized
    blkgs with valid parent pointers.

    This will help implementing hierarchy support in blk-throttle.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • blk-throttle hierarchy support will make use of it. Move
    blkg_for_each_descendant_pre() from block/blk-cgroup.c to
    block/blk-cgroup.h.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
    In blkg_create(), after the lookup of the parent fails, control jumps
    to the error path with the error code encoded into @blkg. The error
    path doesn't use @blkg for the return value; it returns ERR_PTR(ret).
    Make the lookup failure path set @ret instead of @blkg.

    Note that the parent lookup is guaranteed to succeed at that point and
    the condition check is purely for sanity; it triggers a WARN when it
    fails. As such, I don't think it's necessary to mark it for -stable.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     

09 Apr, 2013

1 commit

  • Since 749fefe677 in v3.7 ("block: lift the initial queue bypass mode
    on blk_register_queue() instead of blk_init_allocated_queue()"),
    the following warning appears when multipath is used with CONFIG_PREEMPT=y.

    This patch moves blk_queue_bypass_start() before radix_tree_preload()
    to avoid the sleeping call while preemption is disabled.

    BUG: scheduling while atomic: multipath/2460/0x00000002
    1 lock held by multipath/2460:
    #0: (&md->type_lock){......}, at: [] dm_lock_md_type+0x17/0x19 [dm_mod]
    Modules linked in: ...
    Pid: 2460, comm: multipath Tainted: G W 3.7.0-rc2 #1
    Call Trace:
    [] __schedule_bug+0x6a/0x78
    [] __schedule+0xb4/0x5e0
    [] schedule+0x64/0x66
    [] schedule_timeout+0x39/0xf8
    [] ? put_lock_stats+0xe/0x29
    [] ? lock_release_holdtime+0xb6/0xbb
    [] wait_for_common+0x9d/0xee
    [] ? try_to_wake_up+0x206/0x206
    [] ? kfree_call_rcu+0x1c/0x1c
    [] wait_for_completion+0x1d/0x1f
    [] wait_rcu_gp+0x5d/0x7a
    [] ? wait_rcu_gp+0x7a/0x7a
    [] ? complete+0x21/0x53
    [] synchronize_rcu+0x1e/0x20
    [] blk_queue_bypass_start+0x5d/0x62
    [] blkcg_activate_policy+0x73/0x270
    [] ? kmem_cache_alloc_node_trace+0xc7/0x108
    [] cfq_init_queue+0x80/0x28e
    [] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
    [] elevator_init+0xe1/0x115
    [] ? blk_queue_make_request+0x54/0x59
    [] blk_init_allocated_queue+0x8c/0x9e
    [] dm_setup_md_queue+0x36/0xaa [dm_mod]
    [] table_load+0x1bd/0x2c8 [dm_mod]
    [] ctl_ioctl+0x1d6/0x236 [dm_mod]
    [] ? table_clear+0xaa/0xaa [dm_mod]
    [] dm_ctl_ioctl+0x13/0x17 [dm_mod]
    [] do_vfs_ioctl+0x3fb/0x441
    [] ? file_has_perm+0x8a/0x99
    [] sys_ioctl+0x5e/0x82
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Vivek Goyal
    Acked-by: Tejun Heo
    Cc: Alasdair G Kergon
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jun'ichi Nomura
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but it turns out it is a bug in the new pstate code
    (divide by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

1 commit

    I'm not sure why, but the hlist for-each-entry iterators were
    conceived differently from the list one:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not
    only do they not really need it, it also prevents the iterator from
    looking exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

10 Jan, 2013

7 commits

  • Instead of holding blkcg->lock while walking ->blkg_list and executing
    prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
    executing prfill(). This makes prfill() implementations easier as
    stats are mostly protected by queue lock.

    This will be used to implement hierarchical stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
    The former two collect the [rw]stats designated by the target policy
    data and offset from the pd's subtree. The latter two add one
    [rw]stat to another.

    Note that the recursive sum functions require the queue lock to be
    held on entry to make blkg online test reliable. This is necessary to
    properly handle stats of a dying blkg.

    These will be used to implement hierarchical stats.

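    A userspace sketch of what a recursive stat sum over a policy-data
    subtree looks like; simplified and illustrative, not the kernel
    implementation:

```c
/* Sketch: sum a counter over a subtree pre-order, folding in only
 * nodes that are online, mirroring the online test mentioned above.
 * Names and structure are made up for illustration. */

#define MAX_KIDS 4

struct pd {
    unsigned long stat;       /* this node's own counter */
    int online;               /* folded in only while online */
    int nr_kids;
    struct pd *kids[MAX_KIDS];
};

/* Sum @stat over @pd's subtree, @pd itself included. */
static unsigned long stat_recursive_sum(struct pd *pd)
{
    unsigned long sum = pd->online ? pd->stat : 0;

    for (int i = 0; i < pd->nr_kids; i++)
        sum += stat_recursive_sum(pd->kids[i]);
    return sum;
}

/* "Merge" one stat into another, as for stats of a dying node. */
static void stat_merge(unsigned long *to, unsigned long from)
{
    *to += from;
}

static unsigned long demo(void)
{
    struct pd leaf = { .stat = 5, .online = 1 };
    struct pd mid  = { .stat = 3, .online = 1,
                       .nr_kids = 1, .kids = { &leaf } };
    struct pd root = { .stat = 2, .online = 1,
                       .nr_kids = 1, .kids = { &mid } };

    stat_merge(&mid.stat, 10);          /* fold a dying node's stat */
    return stat_recursive_sum(&root);   /* 2 + (3+10) + 5 */
}
```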
    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
    Export it.

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu

    Tejun Heo
     
  • Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
    which are invoked as the policy_data gets activated and deactivated
    while holding both blkcg and q locks.

    Also, add blkcg_gq->online bool, which is set and cleared as the
    blkcg_gq gets activated and deactivated. This flag also is toggled
    while holding both blkcg and q locks.

    These will be used to implement hierarchical stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Add pd->plid so that the policy a pd belongs to can be identified
    easily. This will be used to implement hierarchical blkg_[rw]stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
    cfq blkcg is about to grow proper hierarchy handling, where a child
    blkg's weight would nest inside the parent's. This makes tasks in a
    blkg compete against both the tasks in sibling blkgs and the tasks of
    child blkgs.

    We're gonna use the existing weight as the group weight which decides
    the blkg's weight against its siblings. This patch introduces a new
    weight - leaf_weight - which decides the weight of a blkg against the
    child blkgs.

    It's named leaf_weight because another way to look at it is that each
    internal blkg node has a hidden child leaf node which contains all
    its tasks, and leaf_weight is the weight of that leaf node, handled
    the same as the weight of the child blkgs.

    This patch only adds leaf_weight fields and exposes it to userland.
    The new weight isn't actually used anywhere yet. Note that
    cfq-iosched currently officially supports only a single level of
    hierarchy and root blkgs compete with the first level blkgs - ie. the
    root weight is basically being used as leaf_weight. For root blkgs,
    the two weights are kept in sync for backward compatibility.

    v2: cfqd->root_group->leaf_weight initialization was missing from
    cfq_init_queue() causing divide by zero when
    !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang.

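    As a worked example of the hidden-leaf model above, the share of
    service the internal node's own tasks receive against its child
    groups can be computed from leaf_weight alone; purely illustrative
    integer arithmetic, not cfq's actual scheduling math:

```c
/* Fraction (in per-mille) of a parent's service that the hidden leaf
 * (i.e. the parent's own tasks, weighted by leaf_weight) receives when
 * competing against @nr_children child groups of weight @child_weight
 * each.  Illustrative only. */
static int leaf_share_permille(int leaf_weight, int child_weight,
                               int nr_children)
{
    int total = leaf_weight + child_weight * nr_children;

    return 1000 * leaf_weight / total;
}
```

    With equal weights and one child group, the parent's tasks get half
    the service; lowering leaf_weight shifts service to the children.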
    Signed-off-by: Tejun Heo
    Cc: Fengguang Wu

    Tejun Heo
     
  • Currently a child blkg (blkcg_gq) can be created even if its parent
    doesn't exist. ie. Given a blkg, it's not guaranteed that its
    ancestors will exist. This makes it difficult to implement proper
    hierarchy support for blkcg policies.

    Always create blkgs recursively and make a child blkg hold a reference
    to its parent. blkg->parent is added so that finding the parent is
    easy. blkcg_parent() is also added in the process.

    This change can be visible to userland. e.g. while issuing IO in a
    nested cgroup previously didn't affect the ancestors at all, it will
    now initialize all ancestor blkgs, and zeroed stats for the
    request_queue will always appear on them. While this is userland
    visible, it shouldn't cause any functional difference.

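    The recursive-creation rule can be sketched in userspace: looking up
    a node creates all missing ancestors first, and each child pins its
    parent with a reference. Names and structure here are illustrative,
    not the kernel's:

```c
/* Sketch: blkg_lookup_create() always creates ancestors first, and a
 * child holds a reference on its parent, so given any node its whole
 * ancestor chain is guaranteed to exist.  Illustrative names only. */

#include <stdlib.h>

struct cg {                   /* stand-in for a blkcg */
    struct cg *parent;
};

struct blkg {                 /* stand-in for a blkcg_gq */
    struct cg *cg;
    struct blkg *parent;      /* pinned: holds a ref on the parent */
    int refcnt;
    struct blkg *next;        /* all blkgs, for lookup */
};

static struct blkg *all_blkgs;

static struct blkg *blkg_lookup(struct cg *cg)
{
    for (struct blkg *b = all_blkgs; b; b = b->next)
        if (b->cg == cg)
            return b;
    return NULL;
}

/* Create the blkg for @cg, recursively creating ancestors first. */
static struct blkg *blkg_lookup_create(struct cg *cg)
{
    struct blkg *blkg = blkg_lookup(cg);
    struct blkg *parent = NULL;

    if (blkg)
        return blkg;
    if (cg->parent) {
        parent = blkg_lookup_create(cg->parent);  /* ancestors first */
        if (!parent)
            return NULL;
        parent->refcnt++;                         /* child pins parent */
    }
    blkg = calloc(1, sizeof(*blkg));
    if (!blkg)
        return NULL;
    blkg->cg = cg;
    blkg->parent = parent;
    blkg->refcnt = 1;
    blkg->next = all_blkgs;
    all_blkgs = blkg;
    return blkg;
}

static int demo(void)
{
    struct cg root = { 0 };
    struct cg mid  = { .parent = &root };
    struct cg leaf = { .parent = &mid };
    struct blkg *b = blkg_lookup_create(&leaf);

    /* All ancestors exist and are pinned by their children. */
    return b && b->parent && b->parent->parent
             && b->parent->refcnt == 2          /* own ref + child's */
             && b->parent->parent->refcnt == 2;
}
```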
    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo