12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

28 Oct, 2014

2 commits

  • How we deal with updates to exclusive cpusets is currently broken.
    As an example, suppose we have an exclusive cpuset composed of
    two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE tasks to it
    up to the allowed bandwidth. If we now want to modify cpusetA's
    cpumask, we have to check that removing a cpu's worth of
    bandwidth doesn't break admission-control (AC) guarantees. This isn't
    checked by the current code.

    This patch fixes the problem above, denying an update if the
    new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
    that are currently active.
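
    A minimal sketch of the kind of admission-control check described above;
    the bandwidth-accounting helpers named here are illustrative placeholders,
    not necessarily the exact upstream functions:

    /*
     * Sketch only: deny shrinking an exclusive cpuset's cpumask if the
     * SCHED_DEADLINE bandwidth already admitted on the current mask would
     * no longer fit on the trial (new) mask.
     */
    static bool dl_cpumask_can_shrink(const struct cpumask *cur,
                                      const struct cpumask *trial)
    {
            u64 admitted = dl_bw_reserved_on(cur);   /* hypothetical: admitted dl bandwidth on @cur */
            u64 capacity = dl_bw_capacity_of(trial); /* hypothetical: bandwidth the new mask can host */

            return admitted <= capacity;             /* otherwise the cpumask update is denied */
    }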

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: cgroups@vger.kernel.org
    Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks'
    affinity (performing what is commonly called clustered scheduling).
    Unfortunately, this is currently broken for two reasons:

    - No check is performed when the user tries to attach a task to
    an exclusive cpuset (recall that exclusive cpusets have an
    associated maximum allowed bandwidth).

    - Bandwidths of source and destination cpusets are not correctly
    updated after a task is migrated between them.

    This patch fixes both things at once, as they are opposite faces
    of the same coin.

    The check is performed in cpuset_can_attach(), as there aren't any
    points of failure after that function. The update is split in two
    halves: we first reserve bandwidth in the destination cpuset, after
    we pass the check in cpuset_can_attach(), and we then release
    bandwidth from the source cpuset when the task's affinity is
    actually changed. Even if there can be time windows in which
    sched_setattr() may erroneously fail in the source cpuset, we are
    fine with that, as we can't perform an atomic update of both cpusets
    at once (see the sketch below).
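
    A rough sketch of the two halves; the reserve/release helpers are
    hypothetical names standing in for the deadline bandwidth accounting:

    /* Phase 1: cpuset_can_attach() is the last point allowed to fail, so the
     * destination cpuset's bandwidth is checked and reserved here. */
    static int dl_reserve_in_destination(struct cpuset *dest, struct task_struct *p)
    {
            if (dl_task(p) && !reserve_dl_bw(dest, p))      /* reserve_dl_bw() is hypothetical */
                    return -EBUSY;
            return 0;
    }

    /* Phase 2: runs once the task's affinity has actually been switched to
     * the destination; only then is the source cpuset's bandwidth released. */
    static void dl_release_from_source(struct cpuset *src, struct task_struct *p)
    {
            if (dl_task(p))
                    release_dl_bw(src, p);                  /* release_dl_bw() is hypothetical */
    }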

    Reported-by: Daniel Wagner
    Reported-by: Vincent Legout
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: michael@amarulasolutions.com
    Cc: luca.abeni@unitn.it
    Cc: Li Zefan
    Cc: Linus Torvalds
    Cc: cgroups@vger.kernel.org
    Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

27 Oct, 2014

3 commits

  • This will deadlock instead of unlocking.

    Fixes: f73eae8d8384 ('cpuset: simplify cpuset_node_allowed API')
    Signed-off-by: Dan Carpenter
    Acked-by: Vladimir Davydov
    Signed-off-by: Tejun Heo

    Dan Carpenter
     
  • The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall versions
    of cpuset_node_allowed, with the softwall version doing literally the same
    as the hardwall version if __GFP_HARDWALL is passed to it in the gfp flags.
    If it isn't, the softwall version may check the given node against the
    enclosing hardwall cpuset, which it needs to take the callback lock to
    do.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before, we had the only version with
    the __GFP_HARDWALL flag determining its behavior. The purpose of the
    commit was to avoid sleep-in-atomic bugs when someone would mistakenly
    call the function without the __GFP_HARDWALL flag for an atomic
    allocation. The suffixes introduced were intended to make the callers
    think before using the function.

    However, since the callback lock was converted from a mutex to a spinlock
    by the previous patch, the softwall check function no longer sleeps, and
    these precautions are no longer necessary.

    So let's simplify the API back to the single check.
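
    A caller-side sketch of the simplified API; the wrapper function is
    illustrative, while cpuset_node_allowed() and the retired
    *_hardwall/*_softwall names are the interfaces discussed above:

    static bool zonelist_node_allowed(int nid, gfp_t gfp_mask)
    {
            /*
             * Previously callers had to choose between
             * cpuset_node_allowed_hardwall() (never slept) and
             * cpuset_node_allowed_softwall() (could sleep on callback_mutex).
             * With the callback lock now a spinlock, one check is enough;
             * __GFP_HARDWALL in @gfp_mask selects the stricter semantics.
             */
            return cpuset_node_allowed(nid, gfp_mask);
    }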

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     
  • The callback_mutex is only used to synchronize reads/updates of cpusets'
    flags and cpu/node masks. These operations should always proceed fast so
    there's no reason why we can't use a spinlock instead of the mutex.

    Converting the callback_mutex into a spinlock will let us call
    cpuset_zone_allowed_softwall from atomic context. This, in turn, makes
    it possible to simplify the code by merging the hardwall and softwall
    checks into the same function, which is the business of the next patch.
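
    A small sketch of the resulting pattern (the helper is illustrative and
    assumes the converted lock is named callback_lock; cpumask_copy() and the
    irqsave variants are the real primitives). Mask reads are short, so a
    spinlock taken with IRQs disabled is safe even from allocation/atomic
    context:

    static void snapshot_effective_cpus(struct cpuset *cs, struct cpumask *out)
    {
            unsigned long flags;

            spin_lock_irqsave(&callback_lock, flags);   /* was mutex_lock(&callback_mutex) */
            cpumask_copy(out, cs->effective_cpus);
            spin_unlock_irqrestore(&callback_lock, flags);
    }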

    Suggested-by: Zefan Li
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
     

10 Oct, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Nothing too interesting. Just a handful of cleanup patches"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Revert "cgroup: remove redundant variable in cgroup_mount()"
    cgroup: remove redundant variable in cgroup_mount()
    cgroup: fix missing unlock in cgroup_release_agent()
    cgroup: remove CGRP_RELEASABLE flag
    perf/cgroup: Remove perf_put_cgroup()
    cgroup: remove redundant check in cgroup_ino()
    cpuset: simplify proc_cpuset_show()
    cgroup: simplify proc_cgroup_show()
    cgroup: use a per-cgroup work for release agent
    cgroup: remove bogus comments
    cgroup: remove redundant code in cgroup_rmdir()
    cgroup: remove some useless forward declarations
    cgroup: fix a typo in comment.

    Linus Torvalds
     

25 Sep, 2014

1 commit

  • When we change cpuset.memory_spread_{page,slab}, cpuset will flip the
    PF_SPREAD_{PAGE,SLAB} bits of tsk->flags for each task in that cpuset.
    This should be done using atomic bitops, but currently we don't,
    which is broken.

    Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
    when one thread tried to clear PF_USED_MATH while at the same time another
    thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
    the same task.

    Here's the full report:
    https://lkml.org/lkml/2014/9/19/230

    To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

    v4:
    - updated mm/slab.c. (Fengguang Wu)
    - updated Documentation.
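
    A sketch of the approach described above: the spread bits move out of the
    plain tsk->flags word into a field updated with atomic bitops, so a
    concurrent non-atomic update of tsk->flags (e.g. clearing PF_USED_MATH)
    can no longer race with them. The flag and field names below follow the
    description and may differ from the final patch:

    static inline void cpuset_task_set_spread_page(struct task_struct *tsk, bool spread)
    {
            if (spread)
                    set_bit(PFA_SPREAD_PAGE, &tsk->atomic_flags);   /* atomic RMW */
            else
                    clear_bit(PFA_SPREAD_PAGE, &tsk->atomic_flags);
    }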

    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Miao Xie
    Cc: Kees Cook
    Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
    Cc: # 2.6.31+
    Reported-by: Tetsuo Handa
    Signed-off-by: Zefan Li
    Signed-off-by: Tejun Heo

    Zefan Li
     

19 Sep, 2014

1 commit


05 Aug, 2014

1 commit

  • Pull cgroup changes from Tejun Heo:
    "Mostly changes to get the v2 interface ready. The core features are
    mostly ready now and I think it's reasonable to expect to drop the
    devel mask in one or two devel cycles at least for a subset of
    controllers.

    - cgroup added a controller dependency mechanism so that block cgroup
    can depend on memory cgroup. This will be used to finally support
    IO provisioning on the writeback traffic, which is currently being
    implemented.

    - The v2 interface now uses a separate table so that the interface
    files for the new interface are explicitly declared in one place.
    Each controller will explicitly review and add the files for the
    new interface.

    - cpuset is getting ready for hierarchical behavior in a style
    similar to the other controllers, so that an ancestor's
    configuration change doesn't change the descendants' configurations
    irreversibly and processes aren't silently migrated when a CPU or
    node goes down.

    All the changes are to the new interface and no behavior changed for
    the multiple hierarchies"

    * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits)
    cpuset: fix the WARN_ON() in update_nodemasks_hier()
    cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test
    cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core
    cgroup: distinguish the default and legacy hierarchies when handling cftypes
    cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()
    cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes
    cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]
    cpuset: export effective masks to userspace
    cpuset: allow writing offlined masks to cpuset.cpus/mems
    cpuset: enable onlined cpu/node in effective masks
    cpuset: refactor cpuset_hotplug_update_tasks()
    cpuset: make cs->{cpus, mems}_allowed as user-configured masks
    cpuset: apply cs->effective_{cpus,mems}
    cpuset: initialize top_cpuset's configured masks at mount
    cpuset: use effective cpumask to build sched domains
    cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty
    cpuset: update cs->effective_{cpus, mems} when config changes
    cpuset: update cpuset->effective_{cpus,mems} at hotplug
    cpuset: add cs->effective_cpus and cs->effective_mems
    cgroup: clean up sane_behavior handling
    ...

    Linus Torvalds
     

30 Jul, 2014

1 commit


15 Jul, 2014

1 commit

  • Currently, cgroup_subsys->base_cftypes is used for both the unified
    default hierarchy and legacy ones and subsystems can mark each file
    with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
    only on one of them. This is quite hairy and error-prone. Also, we
    may end up exposing interface files to the default hierarchy without
    thinking it through.

    cgroup_subsys will grow two separate cftype arrays and apply each only
    on the hierarchies of the matching type. This will allow organizing
    cftypes in a much clearer way and encourage subsystems to scrutinize
    the interface which is being exposed in the new default hierarchy.

    In preparation, this patch renames cgroup_subsys->base_cftypes to
    cgroup_subsys->legacy_cftypes. This patch is a pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vivek Goyal
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Aristeu Rozanski
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

10 Jul, 2014

12 commits

  • cpuset.cpus and cpuset.mems are the configured masks, and we need
    to export effective masks to userspace, so users know the real
    cpus_allowed and mems_allowed that apply to the tasks in a cpuset.

    v2:
    - export those masks unconditionally, suggested by Tejun.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • As the configured masks won't be limited by their parent, and the top
    cpuset's masks won't change when hotplug happens, it's natural to
    allow writing offlined masks to the configured masks.

    If on default hierarchy:

    # echo 0 > /sys/devices/system/cpu/cpu1/online
    # mkdir /cpuset/sub
    # echo 1 > /cpuset/sub/cpuset.cpus
    # cat /cpuset/sub/cpuset.cpus
    1

    If on legacy hierarchy:

    # echo 0 > /sys/devices/system/cpu/cpu1/online
    # mkdir /cpuset/sub
    # echo 1 > /cpuset/sub/cpuset.cpus
    -bash: echo: write error: Invalid argument

    Note the checks don't need to be gated by cgroup_on_dfl, because we've
    initialized top_cpuset.{cpus,mems}_allowed accordingly in cpuset_bind().

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • First, offline cpu1:

    # echo 0-1 > cpuset.cpus
    # echo 0 > /sys/devices/system/cpu/cpu1/online
    # cat cpuset.cpus
    0-1
    # cat cpuset.effective_cpus
    0

    Then online it:

    # echo 1 > /sys/devices/system/cpu/cpu1/online
    # cat cpuset.cpus
    0-1
    # cat cpuset.effective_cpus
    0-1

    And cpuset will bring it back to the effective mask.

    The implementation is quite straightforward. Instead of calculating the
    offlined cpus/mems and doing updates, we just set the new effective_mask
    to online_mask & configured_mask.

    This is a behavior change for default hierarchy, so legacy hierarchy
    won't be affected.

    v2:
    - make the refactoring of cpuset_hotplug_update_tasks() a separate patch,
    as suggested by Tejun.
    - make hotplug_update_tasks_insane() use @new_cpus and @new_mems as
    hotplug_update_tasks_sane() does.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • We mix the handling for the default hierarchy and the legacy hierarchy in
    the same function, and it's quite messy, so split it into two functions.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Now that we've used effective cpumasks to enforce hierarchical behavior,
    we can use cs->{cpus,mems}_allowed as the configured masks.

    Configured masks can be changed by writing cpuset.cpus and cpuset.mems
    only. The new behaviors are:

    - They won't be changed by hotplug anymore.
    - They won't be limited by its parent's masks.

    This is a behavior change, but it won't take effect unless mounted with
    sane_behavior.

    v2:
    - Add comments to explain the differences between configured masks and
    effective masks.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • Now we can use cs->effective_{cpus,mems} as the effective masks. They are
    used whenever:

    - we update tasks' cpus_allowed/mems_allowed,
    - we want to retrieve tasks_cs(tsk)'s cpus_allowed/mems_allowed.

    They actually replace effective_{cpu,node}mask_cpuset().

    effective_mask == configured_mask & parent effective_mask, except when
    the result is empty, in which case it inherits the parent effective_mask.
    The result equals the mask computed from effective_{cpu,node}mask_cpuset().

    This won't affect the original legacy hierarchy, because in that case we
    make sure the effective masks are always the same as the user-configured
    masks.
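
    The rule above in code form, as a sketch (the helper name is illustrative;
    the cpumask calls are the standard API):

    static void compute_effective_cpumask(struct cpumask *new_cpus,
                                          struct cpuset *cs,
                                          struct cpuset *parent)
    {
            /* effective = configured & parent's effective ... */
            cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
            /* ... unless that is empty, in which case inherit the parent's */
            if (cpumask_empty(new_cpus))
                    cpumask_copy(new_cpus, parent->effective_cpus);
    }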

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • We now have to support different behaviors for the default hierarchy and
    the legacy hierarchy, so top_cpuset's configured masks need to be
    initialized accordingly.

    Suppose we've offlined cpu1.

    On default hierarchy:

    # mount -t cgroup -o __DEVEL__sane_behavior xxx /cpuset
    # cat /cpuset/cpuset.cpus
    0-15

    On legacy hierarchy:

    # mount -t cgroup xxx /cpuset
    # cat /cpuset/cpuset.cpus
    0,2-15

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • We're going to have separate user-configured masks and effective ones.

    Eventually configured masks can only be changed by writing cpuset.cpus
    and cpuset.mems, and they won't be restricted by the parent cpuset.
    Effective masks, in contrast, reflect cpu/memory hotplug and hierarchical
    restriction, and they are the real masks that apply to the tasks in the
    cpuset.

    We calculate effective mask this way:
    - top cpuset's effective_mask == online_mask, otherwise
    - cpuset's effective_mask == configured_mask & parent effective_mask,
    if the result is empty, it inherits parent effective mask.

    Those behavior changes are for default hierarchy only. For legacy
    hierarchy, effective_mask and configured_mask are the same, so we won't
    break old interfaces.

    We should partition sched domains according to effective_cpus, which
    is the real cpulist that takes effect on the tasks in the cpuset.

    This won't introduce behavior change.

    v2:
    - Add a comment for the call of rebuild_sched_domains(), suggested
    by Tejun.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • We're going to have separate user-configured masks and effective ones.

    Eventually configured masks can only be changed by writing cpuset.cpus
    and cpuset.mems, and they won't be restricted by the parent cpuset.
    Effective masks, in contrast, reflect cpu/memory hotplug and hierarchical
    restriction, and they are the real masks that apply to the tasks in the
    cpuset.

    We calculate effective mask this way:
    - top cpuset's effective_mask == online_mask, otherwise
    - cpuset's effective_mask == configured_mask & parent effective_mask,
    if the result is empty, it inherits parent effective mask.

    Those behavior changes are for default hierarchy only. For legacy
    hierarchy, effective_mask and configured_mask are the same, so we won't
    break old interfaces.

    To make cs->effective_{cpus,mems} the effective masks, we need to:
    - update the effective masks at hotplug
    - update the effective masks at config change
    - take on ancestor's mask when the effective mask is empty

    The last item is done here.

    This won't introduce behavior change.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • We're going to have separate user-configured masks and effective ones.

    Eventually configured masks can only be changed by writing cpuset.cpus
    and cpuset.mems, and they won't be restricted by the parent cpuset.
    Effective masks, in contrast, reflect cpu/memory hotplug and hierarchical
    restriction, and they are the real masks that apply to the tasks in the
    cpuset.

    We calculate effective mask this way:
    - top cpuset's effective_mask == online_mask, otherwise
    - cpuset's effective_mask == configured_mask & parent effective_mask,
    if the result is empty, it inherits parent effective mask.

    Those behavior changes are for default hierarchy only. For legacy
    hierarchy, effective_mask and configured_mask are the same, so we won't
    break old interfaces.

    To make cs->effective_{cpus,mems} the effective masks, we need to:
    - update the effective masks at hotplug
    - update the effective masks at config change
    - take on ancestor's mask when the effective mask is empty

    The second item is done here. We don't need to treat root_cs specially
    in update_cpumasks_hier().

    This won't introduce behavior change.

    v3:
    - add a WARN_ON() to check that the effective masks are the same as the
    configured masks on the legacy hierarchy.
    - pass trialcs->cpus_allowed to update_cpumasks_hier() and add a comment for
    it. Similar change for update_nodemasks_hier(). Suggested by Tejun.

    v2:
    - revise the comment in update_{cpu,node}masks_hier(), suggested by Tejun.
    - fix to use @cp instead of @cs in these two functions.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • We're going to have separate user-configured masks and effective ones.

    Eventually configured masks can only be changed by writing cpuset.cpus
    and cpuset.mems, and they won't be restricted by the parent cpuset.
    Effective masks, in contrast, reflect cpu/memory hotplug and hierarchical
    restriction, and they are the real masks that apply to the tasks in the
    cpuset.

    We calculate effective mask this way:
    - top cpuset's effective_mask == online_mask, otherwise
    - cpuset's effective_mask == configured_mask & parent effective_mask,
    if the result is empty, it inherits parent effective mask.

    Those behavior changes are for default hierarchy only. For legacy
    hierarchy, effective_mask and configured_mask are the same, so we won't
    break old interfaces.

    To make cs->effective_{cpus,mems} the effective masks, we need to:
    - update the effective masks at hotplug
    - update the effective masks at config change
    - take on ancestor's mask when the effective mask is empty

    The first item is done here.

    This won't introduce behavior change.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • We're going to have separate user-configured masks and effective ones.

    Eventually configured masks can only be changed by writing cpuset.cpus
    and cpuset.mems, and they won't be restricted by the parent cpuset.
    Effective masks, in contrast, reflect cpu/memory hotplug and hierarchical
    restriction, and they are the real masks that apply to the tasks in the
    cpuset.

    We calculate effective mask this way:
    - top cpuset's effective_mask == online_mask, otherwise
    - cpuset's effective_mask == configured_mask & parent effective_mask,
    if the result is empty, it inherits parent effective mask.

    Those behavior changes are for default hierarchy only. For legacy
    hierarchy, effective_mask and configured_mask are the same, so we won't
    break old interfaces.

    This patch adds the effective masks to struct cpuset and initializes
    them. The effective masks of the top cpuset are the same as its configured
    masks, and a child cpuset inherits its parent's effective masks.

    This won't introduce behavior change.

    v2:
    - s/real_{mems,cpus}_allowed/effective_{mems,cpus}, suggested by Tejun.
    - don't init effective masks in cpuset_css_online() if !cgroup_on_dfl.
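
    A sketch of the added fields alongside the existing configured ones; other
    members of struct cpuset are omitted and the surrounding layout is
    illustrative:

    struct cpuset {
            /* ... existing members ... */

            cpumask_var_t cpus_allowed;     /* user-configured CPUs */
            nodemask_t mems_allowed;        /* user-configured memory nodes */

            cpumask_var_t effective_cpus;   /* what actually applies to tasks */
            nodemask_t effective_mems;
    };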

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

09 Jul, 2014

1 commit

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. A hierarchy is going to be either the default one or a
    legacy one. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

02 Jul, 2014

1 commit

  • Writing to either the "cpuset.cpus" or "cpuset.mems" file flushes
    cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
    migrating tasks off a cpuset after new resources are added to it.

    As cpuset_hotplug_work calls into cgroup core via
    cgroup_transfer_tasks(), this flushing adds a dependency on cgroup
    core locking from cpuset_write_resmask(). This used to be okay because
    cgroup interface files were protected by a different mutex; however,
    8353da1f91f1 ("cgroup: remove cgroup_tree_mutex") simplified the
    cgroup core locking and this dependency became a deadlock hazard -
    cgroup file removal performed under the cgroup core lock tries to drain
    an on-going file operation, which is trying to flush cpuset_hotplug_work,
    which in turn is blocked on the same cgroup core lock.

    The locking simplification was done because kernfs added a much
    easier way to deal with circular dependencies involving kernfs active
    protection. Let's use the same strategy in cpuset and break active
    protection in cpuset_write_resmask(). While it isn't the prettiest,
    this is a very rare, likely unique, situation which also goes away on
    the unified hierarchy.
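
    Roughly the sequence this introduces inside cpuset_write_resmask(), as a
    sketch (the kernfs_break/unbreak_active_protection() calls and
    css_get()/css_put() are real interfaces; error handling and the actual
    mask update are elided):

    css_get(&cs->css);                              /* keep the cpuset alive while unprotected */
    kernfs_break_active_protection(of->kn);         /* step outside kernfs active protection */

    flush_work(&cpuset_hotplug_work);               /* safe now: file removal can proceed in parallel */

    /* ... update cpuset.cpus / cpuset.mems under cpuset_mutex ... */

    kernfs_unbreak_active_protection(of->kn);
    css_put(&cs->css);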

    The commands to trigger the deadlock warning without the patch and the
    lockdep output follow.

    localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
    localhost:/ # mkdir /cpuset/tmp
    localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
    localhost:/ # echo 0 > /cpuset/tmp/cpuset.mems
    localhost:/ # echo $$ > /cpuset/tmp/tasks
    localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.16.0-rc1-0.1-default+ #7 Not tainted
    -------------------------------------------------------
    kworker/1:0/32649 is trying to acquire lock:
    (cgroup_mutex){+.+.+.}, at: [] cgroup_transfer_tasks+0x37/0x150

    but task is already holding lock:
    (cpuset_hotplug_work){+.+...}, at: [] process_one_work+0x192/0x520

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (cpuset_hotplug_work){+.+...}:
    ...
    -> #1 (s_active#175){++++.+}:
    ...
    -> #0 (cgroup_mutex){+.+.+.}:
    ...

    other info that might help us debug this:

    Chain exists of:
    cgroup_mutex --> s_active#175 --> cpuset_hotplug_work

    Possible unsafe locking scenario:

    CPU0                          CPU1
    ----                          ----
    lock(cpuset_hotplug_work);
                                  lock(s_active#175);
                                  lock(cpuset_hotplug_work);
    lock(cgroup_mutex);

    *** DEADLOCK ***

    2 locks held by kworker/1:0/32649:
    #0: ("events"){.+.+.+}, at: [] process_one_work+0x192/0x520
    #1: (cpuset_hotplug_work){+.+...}, at: [] process_one_work+0x192/0x520

    stack backtrace:
    CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
    ...
    Call Trace:
    [] dump_stack+0x72/0x8a
    [] print_circular_bug+0x10f/0x120
    [] check_prev_add+0x43e/0x4b0
    [] validate_chain+0x656/0x7c0
    [] __lock_acquire+0x382/0x660
    [] lock_acquire+0xf9/0x170
    [] mutex_lock_nested+0x6f/0x380
    [] cgroup_transfer_tasks+0x37/0x150
    [] hotplug_update_tasks_insane+0x110/0x1d0
    [] cpuset_hotplug_update_tasks+0x13d/0x180
    [] cpuset_hotplug_workfn+0x18c/0x630
    [] process_one_work+0x254/0x520
    [] worker_thread+0x13d/0x3d0
    [] kthread+0xf8/0x100
    [] ret_from_fork+0x7c/0xb0

    Signed-off-by: Tejun Heo
    Reported-by: Li Zefan
    Tested-by: Li Zefan

    Tejun Heo
     

25 Jun, 2014

1 commit

  • When running with the kernel (3.15-rc7+), the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock() (which is held in
    __mpol_dup()). Since the access to the cpuset in cpuset_mems_allowed() is
    already under rcu_read_lock, we can, in __mpol_dup(), reduce the
    rcu_read_lock protection region so that it only protects the cpuset
    access in current_cpuset_is_being_rebound(), and thereby avoid this bug
    (see the sketch below).

    This patch is a temporary solution that just addresses the bug
    mentioned above; it cannot fix the long-standing issue of cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."
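
    A sketch of the narrowed section: the RCU read-side critical section
    shrinks to just the cpuset dereference inside
    current_cpuset_is_being_rebound(), so the sleeping cpuset_mems_allowed()
    call in __mpol_dup() no longer runs under rcu_read_lock() (shape
    approximate, not the exact upstream diff):

    int current_cpuset_is_being_rebound(void)
    {
            int ret;

            rcu_read_lock();        /* protects only the task_cs(current) access */
            ret = task_cs(current) == cpuset_being_rebound;
            rcu_read_unlock();

            return ret;
    }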

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable

    Gu Zheng
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

05 Jun, 2014

1 commit

  • If cpusets are not in use, then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
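
    A sketch of the jump-label gate (static_key_false() is the real static-key
    API of that era; the key name follows the change described). Until a
    non-root cpuset is created, allocator fast paths that test
    cpusets_enabled() execute a patched-out branch instead of loading a
    global:

    extern struct static_key cpusets_enabled_key;

    static inline bool cpusets_enabled(void)
    {
            return static_key_false(&cpusets_enabled_key);  /* no-op branch until the key is flipped */
    }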

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 May, 2014

1 commit

  • cgroup in general is moving towards using cgroup_subsys_state as the
    fundamental structural component and css_parent() was introduced to
    convert from using cgroup->parent to css->parent. It was quite some
    time ago and we're moving forward with making css more prominent.

    This patch drops the trivial wrapper css_parent() and lets users
    dereference css->parent. While at it, explicitly mark the fields of css
    which are public and immutable.

    v2: New usage from device_cgroup.c converted.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Johannes Weiner

    Tejun Heo
     

14 May, 2014

2 commits

  • Convert all cftype->write_string() users to the new cftype->write()
    which maps directly to the kernfs write operation and has full access to
    kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.

    cftype->write_string() has no users left after the conversions and is
    removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.
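
    A sketch of a converted handler following the rules listed above; the
    handler body and apply_setting() are illustrative, while the ->write
    signature, of_css(), of_cft() and strstrip() are the real interfaces
    involved:

    static ssize_t example_write(struct kernfs_open_file *of, char *buf,
                                 size_t nbytes, loff_t off)
    {
            struct cgroup_subsys_state *css = of_css(of);   /* no longer passed as an argument */
            struct cftype *cft = of_cft(of);
            int ret;

            buf = strstrip(buf);                    /* no automatic trimming any more */
            ret = apply_setting(css, cft, buf);     /* hypothetical parser/setter */

            return ret ?: nbytes;                   /* success returns @nbytes, not 0 */
    }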

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     
  • Unlike the more usual refcnting, what css_tryget() provides is the
    distinction between online and offline csses instead of protection
    against upping a refcnt which already reached zero. cgroup is
    planning to provide actual tryget which fails if the refcnt already
    reached zero. Let's rename the existing trygets so that they clearly
    indicate that they test onliness.

    I thought about keeping the existing names as they are and introducing new
    names for the planned actual tryget; however, given that each
    controller participates in the synchronization of the online state, it
    seems worthwhile to make it explicit that these functions are about
    on/offline state.

    Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
    to css_tryget_online_from_dir(). This is a pure rename.

    v2: cgroup_freezer grew new usages of css_tryget(). Update
    accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

06 May, 2014

2 commits


04 Apr, 2014

3 commits

  • Merge first patch-bomb from Andrew Morton:
    - Various misc bits
    - kmemleak fixes
    - small befs, codafs, cifs, efs, freexxfs, hfsplus, minixfs, reiserfs things
    - fanotify
    - I appear to have become SuperH maintainer
    - ocfs2 updates
    - direct-io tweaks
    - a bit of the MM queue
    - printk updates
    - MAINTAINERS maintenance
    - some backlight things
    - lib/ updates
    - checkpatch updates
    - the rtc queue
    - nilfs2 updates
    - Small Documentation/ updates

    * emailed patches from Andrew Morton : (237 commits)
    Documentation/SubmittingPatches: remove references to patch-scripts
    Documentation/SubmittingPatches: update some dead URLs
    Documentation/filesystems/ntfs.txt: remove changelog reference
    Documentation/kmemleak.txt: updates
    fs/reiserfs/super.c: add __init to init_inodecache
    fs/reiserfs: move prototype declaration to header file
    fs/hfsplus/attributes.c: add __init to hfsplus_create_attr_tree_cache()
    fs/hfsplus/extents.c: fix concurrent acess of alloc_blocks
    fs/hfsplus/extents.c: remove unused variable in hfsplus_get_block
    nilfs2: update project's web site in nilfs2.txt
    nilfs2: update MAINTAINERS file entries fix
    nilfs2: verify metadata sizes read from disk
    nilfs2: add FITRIM ioctl support for nilfs2
    nilfs2: add nilfs_sufile_trim_fs to trim clean segs
    nilfs2: implementation of NILFS_IOCTL_SET_SUINFO ioctl
    nilfs2: add nilfs_sufile_set_suinfo to update segment usage
    nilfs2: add struct nilfs_suinfo_update and flags
    nilfs2: update MAINTAINERS file entries
    fs/coda/inode.c: add __init to init_inodecache()
    BEFS: logging cleanup
    ...

    Linus Torvalds
     
  • Since put_mems_allowed() is strictly optional (it is a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
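
    A usage sketch of the renamed interface (the two read_mems_allowed_*()
    calls are the interface described; the wrapper and allocation step are
    placeholders). Note the inverted sense: the retry call returns true when
    another pass is needed, and it is only evaluated after a failed
    allocation:

    static struct page *alloc_with_mems_cookie(gfp_t gfp_mask, unsigned int order)
    {
            struct page *page;
            unsigned int cookie;

            do {
                    cookie = read_mems_allowed_begin();
                    page = try_allocate(gfp_mask, order);   /* hypothetical allocation step */
                    /* the retry check is only reached when the allocation failed */
            } while (unlikely(!page && read_mems_allowed_retry(cookie)));

            return page;
    }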

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add a const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

04 Mar, 2014

1 commit


27 Feb, 2014

1 commit