04 Sep, 2013

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which consists bulk of patches, and css destruction path is
    completely decoupled from cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains prepatory patches for such change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

21 Aug, 2013

1 commit

  • It's not allowed to clear masks of a cpuset if there're tasks in it,
    but it's broken:

    # mkdir /cgroup/sub
    # echo 0 > /cgroup/sub/cpuset.cpus
    # echo 0 > /cgroup/sub/cpuset.mems
    # echo $$ > /cgroup/sub/tasks
    # echo > /cgroup/sub/cpuset.cpus
    (should fail)

    This bug was introduced by commit 88fa523bff295f1d60244a54833480b02f775152
    ("cpuset: allow to move tasks to empty cpusets").

    tj: Dropped temp bool variables and nestes the conditionals directly.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

14 Aug, 2013

1 commit


13 Aug, 2013

1 commit


09 Aug, 2013

11 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, cgroup
    API including the iterators has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    This patch converts task iterators to deal with css instead of cgroup.
    Note that under unified hierarchy, different sets of tasks will be
    considered belonging to a given cgroup depending on the subsystem in
    question and making the iterators deal with css instead cgroup
    provides them with enough information about the iteration.

    While at it, fix several function comment formats in cpuset.c.

    This patch doesn't introduce any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley

    Tejun Heo
     
  • cgroup_scan_tasks() takes a pointer to struct cgroup_scanner as its
    sole argument and the only function of that struct is packing the
    arguments of the function call which are consisted of five fields.
    It's not too unusual to pack parameters into a struct when the number
    of arguments gets excessive or the whole set needs to be passed around
    a lot, but neither holds here making it just weird.

    Drop struct cgroup_scanner and pass the params directly to
    cgroup_scan_tasks(). Note that struct cpuset_change_nodemask_arg was
    added to cpuset.c to pass both ->cs and ->newmems pointer to
    cpuset_change_nodemask() using single data pointer.

    This doesn't make any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question as css is intersection between cgroup and
    subsystem

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically independent from cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsytem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straight forwards but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In majority of the cases, subsystems only care about its part in the
    cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guarnateed to valid non-NULL parent css as
    long as the target css is not at the top of the hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced with
    parent_ca(). The only difference between the two was NULL test on
    cgroup->parent which is now embedded in css_parent() making the
    distinction moot. Note that eventually a css->parent field will be
    added to css and the NULL check in css_parent() will go away.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to cast
    from css to such data structure or has an accessor function wrapping
    such cast. As cgroup as whole is moving towards using css as the main
    interface handle, add and update such accessors to ease dealing with
    css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controllers specific data structures have css as the first field, the
    casting doesn't involve any offsetting and the compiler can trivially
    optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset uses "const" qualifiers on struct cpuset in some functions;
    however, it doesn't work well when a value derived from returned const
    pointer has to be passed to an accessor. It's C after all.

    Drop the "const" qualifiers except for the trivially leaf ones. This
    patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

31 Jul, 2013

1 commit


30 Jul, 2013

2 commits


03 Jul, 2013

1 commit

  • Pull cpuset changes from Tejun Heo:
    "cpuset has always been rather odd about its configurations - a cgroup
    right after creation didn't allow any task executions before
    configuration, changing configuration in the parent modifies the
    descendants irreversibly and so on. These behaviors are inherently
    nasty and almost hostile against sharing the hierarchy with other
    controllers making it very difficult to use in unified hierarchy.

    Li is currently in the process of updating the behaviors for
    __DEVEL__sane_behavior which is the bulk of changes in this pull
    request. It isn't complete yet and the behaviors will change further
    but all changes are gated behind sane_behavior. In the process, the
    rather hairy work-item punting which was used to work around the
    limitations of cgroup descendant iterator was simplified."

    * 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: rename @cont to @cgrp
    cpuset: fix to migrate mm correctly in a corner case
    cpuset: allow to move tasks to empty cpusets
    cpuset: allow to keep tasks in empty cpusets
    cpuset: introduce effective_{cpumask|nodemask}_cpuset()
    cpuset: record old_mems_allowed in struct cpuset
    cpuset: remove async hotplug propagation work
    cpuset: let hotplug propagation work wait for task attaching
    cpuset: re-structure update_cpumask() a bit
    cpuset: remove cpuset_test_cpumask()
    cpuset: remove unnecessary variable in cpuset_attach()
    cpuset: cleanup guarantee_online_{cpus|mems}()
    cpuset: remove redundant check in cpuset_cpus_allowed_fallback()

    Linus Torvalds
     

19 Jun, 2013

1 commit

  • Most of the stuff from kernel/sched.c was moved to kernel/sched/core.c long time
    back and the comments/Documentation never got updated.

    I figured it out when I was going through sched-domains.txt and so thought of
    fixing it globally.

    I haven't crossed check if the stuff that is referenced in sched/core.c by all
    these files is still present and hasn't changed as that wasn't the motive behind
    this patch.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

14 Jun, 2013

6 commits

  • Cont is short for container. control group was named process container
    at first, but then people found container already has a meaning in
    linux kernel.

    Clean up the leftover variable name @cont.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Before moving tasks out of empty cpusets, update_tasks_nodemask()
    is called, which calls do_migrate_pages(xx, from, to). Then those
    tasks are moved to an ancestor, and do_migrate_pages() is called
    again.

    The first time: from = node_to_be_offlined, to = empty.
    The second time: from = empty, to = ancestor's nodemask.

    so looks like no pages will be migrated.

    Fix this by:

    - Don't call update_tasks_nodemask() on empty cpusets.
    - Pass cs->old_mems_allowed to do_migrate_pages().

    v4: added comment in cpuset_hotplug_update_tasks() and rephased comment
    in cpuset_attach().

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Currently some cpuset behaviors are not friendly when cpuset is co-mounted
    with other cgroup controllers.

    Now with this patchset if cpuset is mounted with sane_behavior option,
    it behaves differently:

    - Tasks will be kept in empty cpusets when hotplug happens and take
    masks of ancestors with non-empty cpus/mems, instead of being moved to
    an ancestor.

    - A task can be moved into an empty cpuset, and again it takes masks of
    ancestors, so the user can drop a task into a newly created cgroup without
    having to do anything for it.

    As tasks can reside in empy cpusets, here're some rules:

    - They can be moved to another cpuset, regardless it's empty or not.

    - Though it takes masks from ancestors, it takes other configs from the
    empty cpuset.

    - If the ancestors' masks are changed, those tasks will also be updated
    to take new masks.

    v2: add documentation in include/linux/cgroup.h

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • To achieve this:

    - We call update_tasks_cpumask/nodemask() for empty cpusets when
    hotplug happens, instead of moving tasks out of them.

    - When a cpuset's masks are changed by writing cpuset.cpus/mems,
    we also update tasks in child cpusets which are empty.

    v3:
    - do propagation work in one place for both hotplug and unplug

    v2:
    - drop rcu_read_lock before calling update_task_nodemask() and
    update_task_cpumask(), instead of using workqueue.
    - add documentation in include/linux/cgroup.h

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • effective_cpumask_cpuset() returns an ancestor cpuset which has
    non-empty cpumask.

    If a cpuset is empty and the tasks in it need to update their
    cpus_allowed, they take on the ancestor cpuset's cpumask.

    This currently won't change any behavior, but it will later allow us
    to keep tasks in empty cpusets.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • When we update a cpuset's mems_allowed and thus update tasks'
    mems_allowed, it's required to pass the old mems_allowed and new
    mems_allowed to cpuset_migrate_mm().

    Currently we save old mems_allowed in a temp local variable before
    changing cpuset->mems_allowed. This patch changes it by saving
    old mems_allowed in cpuset->old_mems_allowed.

    This currently won't change any behavior, but it will later allow
    us to keep tasks in empty cpusets.

    v3: restored "cpuset_attach_nodemask_to = cs->mems_allowed"

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

09 Jun, 2013

2 commits

  • As we can drop rcu read lock while iterating cgroup hierarchy,
    we don't have to do propagation asynchronously via workqueue.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Instead of triggering propagation work in cpuset_attach(), we make
    hotplug propagation work wait until there's no task attaching in
    progress.

    IMO this is more robust. We won't see empty masks in cpuset_attach().

    Also it's a preparation for removing propagation work. Without asynchronous
    propagation we can't call move_tasks_in_empty_cpuset() in cpuset_attach(),
    because otherwise we'll deadlock on cgroup_mutex.

    tj: typo fixes.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

06 Jun, 2013

5 commits


02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     

30 Apr, 2013

3 commits

  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controlelrs which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit including long standing ones like racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
    cgroup interface currently has too many brekages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of broken behaviors in cgroup core
    and also .use_hierarchy switch in memcg (will be routed through -mm),
    which can be used to make very unusual hierarchy where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by header the file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     
  • Pull workqueue updates from Tejun Heo:
    "A lot of activities on workqueue side this time. The changes achieve
    the followings.

    - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are
    updated to be able to interface with multiple backend worker pools.
    This involved a lot of churning but the end result seems actually
    neater as unbound workqueues are now a lot closer to per-cpu ones.

    - The ability to interface with multiple backend worker pools are
    used to implement unbound workqueues with custom attributes.
    Currently the supported attributes are the nice level and CPU
    affinity. It may be expanded to include cgroup association in
    future. The attributes can be specified either by calling
    apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
    the workqueue in question is exported through sysfs.

    The backend worker pools are keyed by the actual attributes and
    shared by any workqueues which share the same attributes. When
    attributes of a workqueue are changed, the workqueue binds to the
    worker pool with the specified attributes while leaving the work
    items which are already executing in its previous worker pools
    alone.

    This allows converting custom worker pool implementations which
    want worker attribute tuning to use workqueues. The writeback pool
    is already converted in block tree and there are a couple others
    are likely to follow including btrfs io workers.

    - WQ_UNBOUND's ability to bind to multiple worker pools is also used
    to make it NUMA-aware. Because there's no association between work
    item issuer and the specific worker assigned to execute it, before
    this change, using unbound workqueue led to unnecessary cross-node
    bouncing and it couldn't be helped by autonuma as it requires tasks
    to have implicit node affinity and workers are assigned randomly.

    After these changes, an unbound workqueue now binds to multiple
    NUMA-affine worker pools so that queued work items are executed in
    the same node. This is turned on by default but can be disabled
    system-wide or for individual workqueues.

    Crypto was requesting NUMA affinity as encrypting data across
    different nodes can contribute noticeable overhead and doing it
    per-cpu was too limiting for certain cases and IO throughput could
    be bottlenecked by one CPU being fully occupied while others have
    idle cycles.

    While the new features required a lot of changes including
    restructuring locking, it didn't complicate the execution paths much.
    The unbound workqueue handling is now closer to per-cpu ones and the
    new features are implemented by simply associating a workqueue with
    different sets of backend worker pools without changing queue,
    execution or flush paths.

    As such, even though the amount of change is very high, I feel
    relatively safe in that it isn't likely to cause subtle issues with
    basic correctness of work item execution and handling. If something
    is wrong, it's likely to show up as being associated with worker pools
    with the wrong attributes or OOPS while workqueue attributes are being
    changed or during CPU hotplug.

    While this creates more backend worker pools, it doesn't add too many
    more workers unless, of course, there are many workqueues with unique
    combinations of attributes. Assuming everything else is the same,
    NUMA awareness costs an extra worker pool per NUMA node with online
    CPUs.

    There are also a couple things which are being routed outside the
    workqueue tree.

    - block tree pulled in workqueue for-3.10 so that writeback worker
    pool can be converted to unbound workqueue with sysfs control
    exposed. This simplifies the code, makes writeback workers
    NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

    - The conversion to workqueue means that there's no 1:1 association
    between a specific worker, which makes writeback folks unhappy as
    they want to be able to tell which filesystem caused a problem from
    backtrace on systems with many filesystems mounted. This is
    resolved by allowing work items to set debug info string which is
    printed when the task is dumped. As this change involves unifying
    implementations of dump_stack() and friends in arch codes, it's
    being routed through Andrew's -mm tree."

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
    workqueue: use kmem_cache_free() instead of kfree()
    workqueue: avoid false negative WARN_ON() in destroy_workqueue()
    workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
    workqueue: implement NUMA affinity for unbound workqueues
    workqueue: introduce put_pwq_unlocked()
    workqueue: introduce numa_pwq_tbl_install()
    workqueue: use NUMA-aware allocation for pool_workqueues
    workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
    workqueue: map an unbound workqueues to multiple per-node pool_workqueues
    workqueue: move hot fields of workqueue_struct to the end
    workqueue: make workqueue->name[] fixed len
    workqueue: add workqueue->unbound_attrs
    workqueue: determine NUMA node of workers accourding to the allowed cpumask
    workqueue: drop 'H' from kworker names of unbound worker pools
    workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
    workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
    workqueue: fix memory leak in apply_workqueue_attrs()
    workqueue: fix unbound workqueue attrs hashing / comparison
    workqueue: fix race condition in unbound workqueue free path
    workqueue: remove pwq_lock which is no longer used
    ...

    Linus Torvalds
     
  • Use the new interface, remove one ifdef. No code size changes.

    We could/should have been using __meminit/__meminitdata here but there's
    now no point in doing that because all this code is elided at compile time.

    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

28 Apr, 2013

1 commit

  • Reported by Fengguang's kbuild test robot:

    kernel/cpuset.c:787: warning: 'generate_sched_domains' defined but not used

    Introduced by commit e0e80a02e5701c8790bd348ab59edb154fbda60b
    ("cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()),
    which removed generate_sched_domains() from cpuset_hotplug_workfn().

    Reported-by: Fengguang Wu
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

27 Apr, 2013

1 commit

  • rebuild_sched_domains() might pass doms with offlined cpu to
    partition_sched_domains(), which results in an oops:

    general protection fault: 0000 [#1] SMP
    ...
    RIP: 0010:[] [] get_group+0x6e/0x90
    ...
    Call Trace:
    [] build_sched_domains+0x70c/0xcb0
    [] ? build_sched_domains+0x937/0xcb0
    [] ? kfree+0xe4/0x1b0
    [] ? partition_sched_domains+0xc0/0x470
    [] partition_sched_domains+0x2e5/0x470
    [] ? partition_sched_domains+0xc0/0x470
    [] ? generate_sched_domains+0xc7/0x530
    [] rebuild_sched_domains_locked+0x38/0x70
    [] cpuset_write_resmask+0x1a4/0x500
    [] ? cpuset_mount+0xe0/0xe0
    [] ? cpuset_read_u64+0x100/0x100
    [] ? cgroup_iter_next+0x90/0x90
    [] ? cpuset_css_offline+0x70/0x70
    [] cgroup_file_write+0x133/0x2e0
    [] vfs_write+0xcb/0x130
    [] sys_write+0x64/0xa0

    Reported-by: Li Zhong
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan