18 Nov, 2014

6 commits

  • Implement cgroup_get_e_css() which finds and gets the effective css
    for the specified cgroup and subsystem combination. This function
    always returns a valid pinned css. This will be used by cgroup
    writeback support.

    While at it, add comment to cgroup_e_css() to explain why that
    function is different from cgroup_get_e_css() and has to test
    cgrp->child_subsys_mask instead of cgroup_css(cgrp, ss).

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
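
    The lookup described above can be pictured with a small userspace
    model. This is only a sketch under assumptions: the demo_* types and
    functions are hypothetical stand-ins, not the kernel implementation,
    and locking/RCU is left out. It shows why the result is never NULL:
    the walk falls back to a root css that is assumed to always exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_SUBSYS 2

/* Hypothetical userspace stand-ins for the kernel's css and cgroup. */
struct demo_css {
    int refcnt;  /* 0 means the css is being torn down */
    bool online;
};

struct demo_cgroup {
    struct demo_cgroup *parent;              /* NULL for the root */
    struct demo_css *subsys[DEMO_NR_SUBSYS]; /* NULL if not enabled here */
};

/* Take a reference only if the css is still online and alive. */
static bool demo_css_tryget_online(struct demo_css *css)
{
    if (!css->online || css->refcnt == 0)
        return false;
    css->refcnt++;
    return true;
}

/*
 * Walk from cgrp toward the root and pin the first usable css for
 * subsystem ssid. root_css is the fallback and is assumed to always be
 * valid, so the function never returns NULL.
 */
static struct demo_css *demo_get_e_css(struct demo_cgroup *cgrp, int ssid,
                                       struct demo_css *root_css)
{
    assert(root_css && ssid >= 0 && ssid < DEMO_NR_SUBSYS);
    for (; cgrp; cgrp = cgrp->parent) {
        struct demo_css *css = cgrp->subsys[ssid];

        if (css && demo_css_tryget_online(css))
            return css;
    }
    root_css->refcnt++;
    return root_css;
}
```

    The caller owns the returned reference and has to drop it when done.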
     
  • Add a new cgroup_subsys operation ->css_e_css_changed(). This is
    invoked if any of the effective csses seen from the css's cgroup may
    have changed. This will be used to implement cgroup writeback
    support.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     
  • Add a new cgroup subsys callback css_released(). This is called when
    the reference count of the css (cgroup_subsys_state) reaches zero
    before RCU scheduling free.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     
  • When a subsystem is offlined, its entry on @cgrp->subsys[] is cleared
    asynchronously. If cgroup_subtree_control_write() is requested to
    enable the subsystem again before the entry is cleared, it has to wait
    for the previous offlining to finish and clear the @cgrp->subsys[]
    entry before trying to enable the subsystem again.

    This is currently done while verifying the input enable / disable
    parameters. This used to be correct but f63070d350e3 ("cgroup: make
    interface files visible iff enabled on cgroup->subtree_control")
    breaks it. The commit is one of the commits implementing subsystem
    dependency.

    Through subsystem dependency, some subsystems may be enabled and
    disabled implicitly in addition to the explicitly requested ones. The
    actual subsystems to be enabled and disabled are determined during
    @css_enable/disable calculation. The current offline wait logic skips
    the ones which are already implicitly enabled and then waits for
    subsystems in @enable; however, this misses the subsystems which may
    be implicitly enabled through dependency from @enable. If such an
    implicitly enabled subsystem hasn't finished offlining yet, the
    function ends up trying to create a css when its @cgrp->subsys[] slot
    is already occupied, triggering BUG_ON() in init_and_link_css().

    Fix it by moving the wait logic after @css_enable is calculated and
    waiting for all the subsystems in @css_enable. This fixes the above
    bug as the mask contains all subsystems which are to be enabled
    including the ones enabled through dependencies.

    Signed-off-by: Tejun Heo
    Fixes: f63070d350e3 ("cgroup: make interface files visible iff enabled on cgroup->subtree_control")
    Acked-by: Zefan Li

    Tejun Heo
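
    A minimal sketch of why the wait has to cover @css_enable rather than
    @enable, using plain bitmasks. The demo_* names and the two subsystem
    bits are hypothetical; the assert() merely stands in for the BUG_ON()
    mentioned above.

```c
#include <assert.h>

/* Hypothetical subsystem bits, for illustration only. */
#define DEMO_SS_MEMORY (1u << 0)
#define DEMO_SS_IO     (1u << 1)

/* Subsystems that need a new css: newly effective, explicit or implied. */
static unsigned int demo_css_enable(unsigned int old_ss, unsigned int new_ss)
{
    return new_ss & ~old_ss;
}

/* Subsystems in wait_mask whose subsys[] slot is still occupied. */
static unsigned int demo_must_wait(unsigned int wait_mask, unsigned int occupied)
{
    return wait_mask & occupied;
}

/* Stand-in for the BUG_ON() in init_and_link_css(): the slot must be free. */
static unsigned int demo_link_css(unsigned int occupied, unsigned int ss_bit)
{
    assert(!(occupied & ss_bit));
    return occupied | ss_bit;
}
```

    If only DEMO_SS_IO is requested but it pulls in DEMO_SS_MEMORY, and
    the memory slot is still occupied by an unfinished offline, waiting
    on the explicit mask reports nothing to wait for, while waiting on
    demo_css_enable() catches the pending subsystem.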
     
  • Make cgroup_subtree_control_write() first calculate new
    subtree_control (new_sc), child_subsys_mask (new_ss) and
    css_enable/disable masks before applying them to the cgroup. Also,
    store the original subtree_control (old_sc) and child_subsys_mask
    (old_ss) and use them to restore the original state after failure.

    This patch shouldn't cause any behavior changes. This prepares for a
    fix for a bug in the async css offline wait logic.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
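
    The calculate-first, apply-later shape with rollback might look
    roughly like this. It is a sketch, not the kernel function:
    demo_update_masks() and its apply_ok parameter are hypothetical, and
    the assert() encodes an assumption of this model only.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_cgroup {
    unsigned int subtree_control;   /* explicitly enabled subsystems */
    unsigned int child_subsys_mask; /* effective mask incl. dependencies */
};

/*
 * Apply new_sc/new_ss to cgrp. apply_ok stands in for the css creation
 * step succeeding or failing; on failure the saved masks are restored.
 */
static bool demo_update_masks(struct demo_cgroup *cgrp, unsigned int new_sc,
                              unsigned int new_ss, bool apply_ok)
{
    unsigned int old_sc = cgrp->subtree_control;
    unsigned int old_ss = cgrp->child_subsys_mask;

    /* Model assumption: the effective mask covers the explicit requests. */
    assert((new_ss & new_sc) == new_sc);

    cgrp->subtree_control = new_sc;
    cgrp->child_subsys_mask = new_ss;

    if (!apply_ok) {
        cgrp->subtree_control = old_sc;
        cgrp->child_subsys_mask = old_ss;
        return false;
    }
    return true;
}
```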
     
  • cgroup_refresh_child_subsys_mask() calculates and updates the
    effective @cgrp->child_subsys_mask according to the current
    @cgrp->subtree_control. Separate out the calculation part into
    cgroup_calc_child_subsys_mask(). This will be used to fix a bug in
    the async css offline wait logic.

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li

    Tejun Heo
     

10 Oct, 2014

2 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator and a work item tries to keep the reserve around
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "Nothing too interesting. Just a handful of cleanup patches"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Revert "cgroup: remove redundant variable in cgroup_mount()"
    cgroup: remove redundant variable in cgroup_mount()
    cgroup: fix missing unlock in cgroup_release_agent()
    cgroup: remove CGRP_RELEASABLE flag
    perf/cgroup: Remove perf_put_cgroup()
    cgroup: remove redundant check in cgroup_ino()
    cpuset: simplify proc_cpuset_show()
    cgroup: simplify proc_cgroup_show()
    cgroup: use a per-cgroup work for release agent
    cgroup: remove bogus comments
    cgroup: remove redundant code in cgroup_rmdir()
    cgroup: remove some useless forward declarations
    cgroup: fix a typo in comment.

    Linus Torvalds
     

26 Sep, 2014

1 commit

  • This reverts commit 0c7bf3e8cab7900e17ce7f97104c39927d835469.

    If there are child cgroups in the cgroupfs and then we umount it,
    the superblock will be destroyed but the cgroup_root will be kept
    around. When we mount it again, cgroup_mount() will find this
    cgroup_root and allocate a new sb for it.

    So with this commit we will be trapped in a dead loop in the case
    described above, because kernfs_pin_sb() keeps returning NULL.

    Currently I don't see how we can avoid using both pinned_sb and
    new_sb, so just revert it.

    Cc: Al Viro
    Reported-by: Andrey Wagin
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

25 Sep, 2014

2 commits

  • With the recent addition of percpu_ref_reinit(), percpu_ref now can be
    used as a persistent switch which can be turned on and off repeatedly
    where turning off maps to killing the ref and waiting for it to drain;
    however, there currently isn't a way to initialize a percpu_ref in its
    off (killed and drained) state, which can be inconvenient for certain
    persistent switch use cases.

    Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
    selection of operation mode; however, currently a newly initialized
    percpu_ref is always in percpu mode making it impossible to avoid the
    latency overhead of switching to atomic mode.

    This patch adds @flags to percpu_ref_init() and implements the
    following flags.

    * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
    * PERCPU_REF_INIT_DEAD : start ref killed and drained

    These flags should be able to serve the above two use cases.

    v2: target_core_tpg.c conversion was missing. Fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Johannes Weiner

    Tejun Heo
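
    To make the two initial states concrete, here is a toy userspace
    model. Everything in it (demo_ref, the DEMO_REF_INIT_* names, the
    single plain counter) is a hypothetical simplification and not the
    percpu_ref implementation; it only mirrors the states named above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names mirroring the two described in the commit. */
enum {
    DEMO_REF_INIT_ATOMIC = 1 << 0, /* start in atomic mode */
    DEMO_REF_INIT_DEAD   = 1 << 1, /* start killed and drained */
};

struct demo_ref {
    long count;  /* references held, including the base reference */
    bool percpu; /* true: per-CPU mode; false: atomic mode */
    bool dead;   /* true: killed, live gets fail */
};

static void demo_ref_init(struct demo_ref *ref, unsigned int flags)
{
    ref->dead = (flags & DEMO_REF_INIT_DEAD) != 0;
    /* Model assumption: a dead ref does not start in per-CPU mode. */
    ref->percpu = !(flags & (DEMO_REF_INIT_ATOMIC | DEMO_REF_INIT_DEAD));
    /* A dead ref starts drained; a live one holds the base reference. */
    ref->count = ref->dead ? 0 : 1;
}

/* Get a reference only while the switch is on. */
static bool demo_ref_tryget_live(struct demo_ref *ref)
{
    if (ref->dead)
        return false;
    ref->count++;
    return true;
}

/* Turn a killed and drained ref back on with a fresh base reference. */
static void demo_ref_reinit(struct demo_ref *ref)
{
    assert(ref->dead && ref->count == 0);
    ref->dead = false;
    ref->count = 1;
}
```

    In this model a ref created with DEMO_REF_INIT_DEAD behaves like a
    switch that starts in the off position and is turned on by
    demo_ref_reinit().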
     
  • …linux-block into for-3.18

    This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
    kludge for SCSI blk-mq stall during probe") which implements
    __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
    commit will be reverted and patches implementing a proper fix will be
    added.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>

    Tejun Heo
     

21 Sep, 2014

2 commits

  • Both pinned_sb and new_sb indicate if a new superblock is needed,
    so we can just remove new_sb.

    Note now we must check if kernfs_tryget_sb() returns NULL, because
    when it returns NULL, kernfs_mount() may still re-use an existing
    superblock, which is just allocated by another concurrent mount.

    Suggested-by: Tejun Heo
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     
  • The patch 971ff4935538: "cgroup: use a per-cgroup work for release
    agent" from Sep 18, 2014, leads to the following static checker
    warning:

    kernel/cgroup.c:5310 cgroup_release_agent()
    warn: 'mutex:&cgroup_mutex' is sometimes locked here and sometimes unlocked.

    Reported-by: Dan Carpenter
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

19 Sep, 2014

4 commits

  • We call put_css_set() after setting CGRP_RELEASABLE flag in
    cgroup_task_migrate(), but in other places we call it without setting
    the flag. I don't see the necessity of this flag.

    Moreover once the flag is set, it will never be cleared, unless writing
    to the notify_on_release control file, so it can be quite confusing
    if we look at the output of debug.releasable.

    # mount -t cgroup -o debug xxx /cgroup
    # mkdir /cgroup/child
    # cat /cgroup/child/debug.releasable
    0
    # echo $$ > /cgroup/child/tasks
    # cat /cgroup/child/debug.releasable
    0
    # echo $$ > /cgroup/tasks && echo $$ > /cgroup/child/tasks
    # cat /cgroup/child/debug.releasable
    1

    Signed-off-by: Tejun Heo

    Zefan Li
     
  • Use the ONE macro instead of REG, and we can simplify proc_cgroup_show().

    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     
  • Instead of using a global work to schedule release agent on removable
    cgroups, we change to use a per-cgroup work to do this, which makes
    the code much simpler.

    v2: use a dedicated work instead of reusing css->destroy_work. (Tejun)

    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     
  • cgroup_pidlist_start() holds cgrp->pidlist_mutex and then calls
    pidlist_array_load(), and cgroup_pidlist_stop() releases the mutex.

    It is wrong that we release the mutex in the failure path in
    pidlist_array_load(), because cgroup_pidlist_stop() will be called
    no matter if cgroup_pidlist_start() returns errno or not.

    Fixes: 4bac00d16a8760eae7205e41d2c246477d42a210
    Cc: # 3.14+
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo
    Acked-by: Cong Wang

    Zefan Li
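
    The start/stop contract behind this fix can be sketched with a
    counting lock. The demo_* functions are hypothetical and only model
    the rule stated above: stop() always runs, so the failure path of the
    loader must leave the lock held.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lock that only tracks its state; stands in for the mutex. */
struct demo_mutex { int held; };

static void demo_lock(struct demo_mutex *m)
{
    assert(m->held == 0);
    m->held = 1;
}

static void demo_unlock(struct demo_mutex *m)
{
    assert(m->held == 1); /* a double unlock would trip here */
    m->held = 0;
}

/* Loader that may fail. It must NOT unlock: stop() owns the unlock. */
static int demo_array_load(int fail)
{
    return fail ? -ENOMEM : 0;
}

/* start() takes the lock and returns 0 or a negative errno. */
static int demo_start(struct demo_mutex *m, int fail)
{
    demo_lock(m);
    return demo_array_load(fail);
}

/* stop() runs after start() whether or not start() failed. */
static void demo_stop(struct demo_mutex *m)
{
    demo_unlock(m);
}
```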
     

18 Sep, 2014

4 commits


08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_refs too.

    This patch doesn't make any functional difference.

    v2: blk-mq conversion was missing. Updated.

    Signed-off-by: Tejun Heo
    Cc: Kent Overstreet
    Cc: Benjamin LaHaise
    Cc: Li Zefan
    Cc: Nicholas A. Bellinger
    Cc: Jens Axboe

    Tejun Heo
     

05 Sep, 2014

2 commits

  • When cgroup_kn_lock_live() is called through some kernfs operation and
    another thread is calling cgroup_rmdir(), we'll trigger the warning in
    cgroup_get().

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 1228 at kernel/cgroup.c:1034 cgroup_get+0x89/0xa0()
    ...
    Call Trace:
    dump_stack+0x41/0x52
    warn_slowpath_common+0x7f/0xa0
    warn_slowpath_null+0x1d/0x20
    cgroup_get+0x89/0xa0
    cgroup_kn_lock_live+0x28/0x70
    __cgroup_procs_write.isra.26+0x51/0x230
    cgroup_tasks_write+0x12/0x20
    cgroup_file_write+0x40/0x130
    kernfs_fop_write+0xd1/0x160
    vfs_write+0x98/0x1e0
    SyS_write+0x4d/0xa0
    sysenter_do_call+0x12/0x12
    ---[ end trace 6f2e0c38c2108a74 ]---

    Fix this by calling css_tryget() instead of cgroup_get().

    v2:
    - move cgroup_tryget() right below cgroup_get() definition. (Tejun)

    Cc: # 3.15+
    Reported-by: Toralf Förster
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Li Zefan
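
    The difference between an unconditional get and a tryget can be shown
    with C11 atomics. This is a generic increment-if-not-zero sketch with
    hypothetical demo_* names, not the kernel's css reference code; the
    assert() only loosely models the warning quoted above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_obj {
    atomic_long refcnt; /* 0 means release has already started */
};

/* Unconditional get: only legal when the caller already holds a ref. */
static void demo_get(struct demo_obj *obj)
{
    long old = atomic_fetch_add(&obj->refcnt, 1);

    assert(old > 0); /* getting a dying object is a bug */
}

/* Conditional get: succeeds only while the count is still nonzero. */
static bool demo_tryget(struct demo_obj *obj)
{
    long old = atomic_load(&obj->refcnt);

    while (old > 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&obj->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

    A caller racing with removal uses demo_tryget() and backs off when it
    returns false instead of resurrecting a dying object.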
     
  • Run these two scripts concurrently:

    for ((; ;))
    {
    mkdir /cgroup/sub
    rmdir /cgroup/sub
    }

    for ((; ;))
    {
    echo $$ > /cgroup/sub/cgroup.procs
    echo $$ > /cgroup/cgroup.procs
    }

    A kernel bug will be triggered:

    BUG: unable to handle kernel NULL pointer dereference at 00000038
    IP: cgroup_put+0x9/0x80
    ...
    Call Trace:
    cgroup_kn_unlock+0x39/0x50
    cgroup_kn_lock_live+0x61/0x70
    __cgroup_procs_write.isra.26+0x51/0x230
    cgroup_tasks_write+0x12/0x20
    cgroup_file_write+0x40/0x130
    kernfs_fop_write+0xd1/0x160
    vfs_write+0x98/0x1e0
    SyS_write+0x4d/0xa0
    sysenter_do_call+0x12/0x12

    We clear cgrp->kn->priv at the end of cgroup_rmdir(), but another
    concurrent thread can access kn->priv after the clearing.

    We should move the clearing to css_release_work_fn(). At that time
    no one is holding reference to the cgroup and no one can gain a new
    reference to access it.

    v2:
    - move RCU_INIT_POINTER() into the else block. (Tejun)
    - remove the cgroup_parent() check. (Tejun)
    - update the comment in css_tryget_online_from_dir().

    Cc: # 3.15+
    Reported-by: Toralf Förster
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Li Zefan
     

25 Aug, 2014

1 commit


23 Aug, 2014

1 commit

  • Kernel command line parameter cgroup__DEVEL__legacy_files_on_dfl forces
    legacy cgroup files to show up on default hierarchy if a subsystem does
    not have any files defined for default hierarchy.

    But this seems to be working only if legacy files are defined in
    ss->legacy_cftypes. If one adds some cftypes later using
    cgroup_add_legacy_cftypes(), these files don't show up on default
    hierarchy. Update the function accordingly so that the dynamically
    added legacy files also show up in the default hierarchy if the target
    subsystem is also using the base legacy files for the default
    hierarchy.

    tj: Patch description and comment updates.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Tejun Heo

    Vivek Goyal
     

18 Aug, 2014

1 commit


05 Aug, 2014

2 commits

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for the hierarchical behavior which is in
    the similar style with other controllers so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     
  • Pull percpu updates from Tejun Heo:

    - Major reorganization of percpu header files which I think makes
    things a lot more readable and logical than before.

    - percpu-refcount is updated so that it requires explicit destruction
    and can be reinitialized if necessary. This was pulled into the
    block tree to replace the custom percpu refcnting implemented in
    blk-mq.

    - In the process, percpu and percpu-refcount got cleaned up a bit

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
    percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
    percpu-refcount: require percpu_ref to be exited explicitly
    percpu-refcount: use unsigned long for pcpu_count pointer
    percpu-refcount: add helpers for ->percpu_count accesses
    percpu-refcount: one bit is enough for REF_STATUS
    percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
    workqueue: stronger test in process_one_work()
    workqueue: clear POOL_DISASSOCIATED in rebind_workers()
    percpu: Use ALIGN macro instead of hand coding alignment calculation
    percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
    percpu: preffity percpu header files
    percpu: use raw_cpu_*() to define __this_cpu_*()
    percpu: reorder macros in percpu header files
    percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
    percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
    percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
    percpu: reorganize include/linux/percpu-defs.h
    percpu: move accessors from include/linux/percpu.h to percpu-defs.h
    percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
    percpu: introduce arch_raw_cpu_ptr()
    ...

    Linus Torvalds
     

15 Jul, 2014

6 commits

  • cgrp_dfl_root_inhibit_ss_mask determines which subsystems are not
    supported on the default hierarchy and is currently initialized
    statically and just includes the debug subsystem. Now that there's
    cgroup_subsys->dfl_files, we can easily tell which subsystems support
    the default hierarchy or not.

    Let's initialize cgrp_dfl_root_inhibit_ss_mask by testing whether
    cgroup_subsys->dfl_files is NULL. After all, subsystems with NULL
    ->dfl_files aren't useable on the default hierarchy anyway.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
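
    As an illustration of deriving the mask from a NULL test, consider
    this sketch. The demo_* structures are hypothetical and far smaller
    than the real ones; only the "NULL table means inhibited" rule is
    taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

struct demo_cftype { const char *name; };

struct demo_subsys {
    const char *name;
    const struct demo_cftype *dfl_files; /* NULL: no default-hierarchy files */
};

/* Build a mask with one bit per subsystem that lacks default-hierarchy files. */
static unsigned int demo_inhibit_mask(const struct demo_subsys *ss, size_t n)
{
    unsigned int mask = 0;

    assert(n <= 8 * sizeof(mask));
    for (size_t i = 0; i < n; i++)
        if (!ss[i].dfl_files)
            mask |= 1u << i;
    return mask;
}
```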
     
  • cgroup now distinguishes cftypes for the default and legacy
    hierarchies more explicitly by using separate arrays and
    CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE should be and are used only
    inside cgroup core proper. Let's make it clear that the flags are
    internal by prefixing them with double underscores.

    CFTYPE_INSANE is renamed to __CFTYPE_NOT_ON_DFL for consistency. The
    two flags are also collected and assigned bits >= 16 so that they
    aren't mixed with the published flags.

    v2: Convert the extra ones in cgroup_exit_cftypes() which are added by
    revision to the previous patch.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Until now, cftype arrays carried files for both the default and legacy
    hierarchies and the files which needed to be used on only one of them
    were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
    gets confusing very quickly and we may end up exposing interface files
    to the default hierarchy without thinking it through.

    This patch makes cgroup core provide separate sets of interfaces for
    cftype handling so that the cftypes for the default and legacy
    hierarchies are clearly distinguished. The previous two patches
    renamed the existing ones so that they clearly indicate that they're
    for the legacy hierarchies. This patch adds the interface for the
    default hierarchy and applies them selectively depending on the
    hierarchy type.

    * cftypes added through cgroup_subsys->dfl_cftypes and
    cgroup_add_dfl_cftypes() only show up on the default hierarchy.

    * cftypes added through cgroup_subsys->legacy_cftypes and
    cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.

    * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
    same array for the cases where the interface files are identical on
    both types of hierarchies.

    * This makes all the existing subsystem interface files legacy-only by
    default and all subsystems will have no interface file created when
    enabled on the default hierarchy. Each subsystem should explicitly
    review and compose the interface for the default hierarchy.

    * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
    makes subsystems which haven't decided the interface files for the
    default hierarchy to present the legacy files on the default
    hierarchy so that its behavior on the default hierarchy can be
    tested. As the awkward name suggests, this is for development only.

    * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
    array isn't used on the default hierarchy. The flag is removed.

    v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.

    v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
    as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
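
    The selection rules in the list above can be condensed into one
    function. This is a sketch with hypothetical demo_* names; the
    legacy_files_on_dfl parameter models the development-only boot
    parameter and is not how the kernel passes it around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_cftype { const char *name; };

struct demo_subsys {
    const struct demo_cftype *dfl_cftypes;    /* default hierarchy files */
    const struct demo_cftype *legacy_cftypes; /* legacy hierarchy files */
};

/*
 * Pick the file table for a hierarchy. The fallback to the legacy table
 * only applies when the subsystem has no default-hierarchy files and the
 * development switch is set; otherwise no files are created (NULL).
 */
static const struct demo_cftype *
demo_pick_cftypes(const struct demo_subsys *ss, bool on_dfl,
                  bool legacy_files_on_dfl)
{
    assert(ss);
    if (!on_dfl)
        return ss->legacy_cftypes;
    if (ss->dfl_cftypes)
        return ss->dfl_cftypes;
    return legacy_files_on_dfl ? ss->legacy_cftypes : NULL;
}
```

    Both members may point to the same array when the interface files are
    identical on both hierarchy types.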
     
  • Currently, cftypes added by cgroup_add_cftypes() are used for both the
    unified default hierarchy and legacy ones and subsystems can mark each
    file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
    appear only on one of them. This is quite hairy and error-prone.
    Also, we may end up exposing interface files to the default hierarchy
    without thinking it through.

    cgroup_subsys will grow two separate cftype addition functions and
    apply each only on the hierarchies of the matching type. This will
    allow organizing cftypes in a lot clearer way and encourage subsystems
    to scrutinize the interface which is being exposed in the new default
    hierarchy.

    In preparation, this patch adds cgroup_add_legacy_cftypes() which
    currently is a simple wrapper around cgroup_add_cftypes() and replaces
    all cgroup_add_cftypes() usages with it.

    While at it, this patch drops a completely spurious return from
    __hugetlb_cgroup_file_init().

    This patch doesn't introduce any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a lot clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • Currently cgroup_base_files[] contains the cgroup core interface files
    for both legacy and default hierarchies with each file tagged with
    CFTYPE_INSANE and CFTYPE_ONLY_ON_DFL. This is difficult to read.

    Let's separate it out to two separate tables, cgroup_dfl_base_files[]
    and cgroup_legacy_base_files[], and use the appropriate one in
    cgroup_mkdir() depending on the hierarchy type. This makes tagging
    each file unnecessary.

    This patch doesn't introduce any behavior changes.

    v2: cgroup_dfl_base_files[] was missing the termination entry
    triggering WARN in cgroup_init_cftypes() for 0day kernel testing
    robot. Fixed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Jet Chen

    Tejun Heo
     

09 Jul, 2014

5 commits

  • After the previous patch to remove sane_behavior support from
    non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to
    indicate the default hierarchy while parsing mount options. This
    patch makes the following cleanups around it.

    * Don't show it in the mount option. Eventually the default hierarchy
    will be assigned a different filesystem type.

    * As sane_behavior is no longer effective on non-default hierarchies
    and the default hierarchy doesn't accept any mount options,
    parse_cgroupfs_options() can consider sane_behavior mount option as
    indicating the default hierarchy and fail if any other options are
    specified with it. While at it, remove one of the double blank
    lines in the function.

    * cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell
    whether to mount the default hierarchy or not.

    * As CGRP_ROOT_SANE_BEHAVIOR's only role now is indicating whether
    to select the default hierarchy or not during mount, it doesn't need
    to be set in the default hierarchy itself. cgroup_init_early()
    updated accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
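
    The "sane_behavior must stand alone" rule can be sketched as a tiny
    option parser. The function name, return convention and the option
    spelling in DEMO_OPT_SANE are assumptions made for illustration; the
    real mount option string and parser may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Placeholder spelling; the real option string may differ. */
#define DEMO_OPT_SANE "sane_behavior"

/*
 * Parse a comma-separated option string. Returns 1 when the default
 * hierarchy is selected, 0 for a legacy hierarchy, and -EINVAL when
 * the option is combined with any other option. The buffer is modified
 * in place by strtok().
 */
static int demo_parse_options(char *opts)
{
    bool sane = false;
    int nr_opts = 0;

    assert(opts);
    for (char *tok = strtok(opts, ","); tok; tok = strtok(NULL, ",")) {
        nr_opts++;
        if (strcmp(tok, DEMO_OPT_SANE) == 0)
            sane = true;
    }
    if (!sane)
        return 0;
    return nr_opts == 1 ? 1 : -EINVAL;
}
```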
     
  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. There are gonna be either the default hierarchy or
    legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     
  • "cgroup.sane_behavior" is added to help distinguish whether
    sane_behavior is in effect or not. We now have the default hierarchy
    where the flag is always in effect and are planning to remove
    supporting sane behavior on the legacy hierarchies making this file on
    the default hierarchy rather pointless. Let's make it legacy only and
    thus always zero.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup_root->flags only contains CGRP_ROOT_* flags and there's no
    reason to mask the flags. Remove CGRP_ROOT_OPTION_MASK.

    This doesn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, the blkio subsystem attributes all of writeback IOs to the
    root. One of the issues is that there's no way to tell who originated
    a writeback IO from block layer. Those IOs are usually issued
    asynchronously from a task which didn't have anything to do with
    actually generating the dirty pages. The memory subsystem, when
    enabled, already keeps track of the ownership of each dirty page and
    it's desirable for blkio to piggyback instead of adding its own
    per-page tag.

    blkio piggybacking on memory is an implementation detail which
    preferably should be handled automatically without requiring explicit
    userland action. To achieve that, this patch implements
    cgroup_subsys->depends_on which contains the mask of subsystems which
    should be enabled together when the subsystem is enabled.

    The previous patches already implemented the support for enabled but
    invisible subsystems and cgroup_subsys->depends_on can be easily
    implemented by updating cgroup_refresh_child_subsys_mask() so that it
    calculates cgroup->child_subsys_mask considering
    cgroup_subsys->depends_on of the explicitly enabled subsystems.

    Documentation/cgroups/unified-hierarchy.txt is updated to explain that
    subsystems may not become immediately available after being unused
    from userland and that dependency could be a factor in it. As
    subsystems may already keep residual references, this doesn't
    significantly change how subsystem rebinding can be used.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Johannes Weiner

    Tejun Heo
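
    One way to picture the dependency-aware mask calculation is a
    fixed-point loop over bitmasks. This is a sketch under assumptions:
    the demo_* names and the avail parameter are hypothetical, and the
    kernel function may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

struct demo_subsys {
    unsigned int depends_on; /* mask of subsystems this one needs */
};

/*
 * Expand subtree_control into the effective mask by repeatedly adding
 * the dependencies of every enabled subsystem until nothing changes.
 * avail limits the result to subsystems that can be enabled here.
 */
static unsigned int demo_calc_child_subsys_mask(const struct demo_subsys *ss,
                                                size_t n,
                                                unsigned int subtree_control,
                                                unsigned int avail)
{
    unsigned int cur = subtree_control;

    assert(n <= 8 * sizeof(cur));
    for (;;) {
        unsigned int next = cur;

        for (size_t i = 0; i < n; i++)
            if (cur & (1u << i))
                next |= ss[i].depends_on;
        next &= avail;
        if (next == cur)
            return cur;
        cur = next;
    }
}
```

    The loop is needed because dependencies can chain: enabling one
    subsystem may pull in a second one that in turn depends on a third.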