21 Feb, 2013

1 commit

  • Pull cpuset changes from Tejun Heo:

    - Synchronization has seen a lot of changes, with a focus on
    decoupling cpuset synchronization from cgroup internal locking.

    After this change, there only remain a couple of mostly trivial
    dependencies on cgroup_lock outside cgroup core proper. cgroup_lock
    is scheduled to be unexported in this devel cycle.

    This will finally remove the fragile locking order around cgroup
    (cgroup locking should be one of the outermost locks, yet it has been
    acquired from deep inside individual controllers).

    - At this point, Li is the most knowledgeable about cpuset and is
    taking over its maintainership.

    * 'for-3.9-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: drop spurious retval assignment in proc_cpuset_show()
    cpuset: fix RCU lockdep splat
    cpuset: update MAINTAINERS
    cpuset: remove cpuset->parent
    cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()
    cpuset: replace cgroup_mutex locking with cpuset internal locking
    cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is empty
    cpuset: pin down cpus and mems while a task is being attached
    cpuset: make CPU / memory hotplug propagation asynchronous
    cpuset: drop async_rebuild_sched_domains()
    cpuset: don't nest cgroup_mutex inside get_online_cpus()
    cpuset: reorganize CPU / memory hotplug handling
    cpuset: cleanup cpuset[_can]_attach()
    cpuset: introduce cpuset_for_each_child()
    cpuset: introduce CS_ONLINE
    cpuset: introduce ->css_on/offline()
    cpuset: remove fast exit path from remove_tasks_in_empty_cpuset()
    cpuset: remove unused cpuset_unlock()

    Linus Torvalds
     

19 Feb, 2013

1 commit

  • rename() will change dentry->d_name. The result of this race can
    be worse than seeing a partially rewritten name; we might access
    a stale pointer, because rename() will re-allocate memory to hold
    a longer name.

    It's safe under the protection of dentry->d_lock.

    v2: check NULL dentry before acquiring dentry lock.
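
    As a rough sketch of the pattern the fix relies on (the helper name
    here is hypothetical, not taken from the patch):

      /* Copy a cpuset dentry's name safely against a concurrent rename(). */
      static void cpuset_copy_name(struct dentry *dentry, char *buf, size_t len)
      {
              if (!dentry) {                  /* v2: check NULL first */
                      strlcpy(buf, "/", len);
                      return;
              }
              spin_lock(&dentry->d_lock);     /* rename() also takes d_lock */
              strlcpy(buf, (const char *)dentry->d_name.name, len);
              spin_unlock(&dentry->d_lock);
      }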

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Li Zefan
     

16 Jan, 2013

2 commits

  • proc_cpuset_show() has a spurious -EINVAL assignment which does
    nothing. Remove it.

    This patch doesn't make any functional difference.

    tj: Rewrote patch description.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • 5d21cc2db040d01f8c19b8602f6987813e1176b4 ("cpuset: replace
    cgroup_mutex locking with cpuset internal locking") incorrectly
    converted proc_cpuset_show() from cgroup_lock() to cpuset_mutex.
    proc_cpuset_show() is accessing cgroup hierarchy proper to determine
    cgroup path which can't be protected by cpuset_mutex. This triggered
    the following RCU warning.

    ===============================
    [ INFO: suspicious RCU usage. ]
    3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262 Tainted: G W
    -------------------------------
    include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    2 locks held by trinity/7514:
    #0: (&p->lock){+.+.+.}, at: [] seq_read+0x3a/0x3e0
    #1: (cpuset_mutex){+.+...}, at: [] proc_cpuset_show+0x84/0x190

    stack backtrace:
    Pid: 7514, comm: trinity Tainted: G W
    3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262
    Call Trace:
    [] lockdep_rcu_suspicious+0x10b/0x120
    [] proc_cpuset_show+0x111/0x190
    [] seq_read+0x1b7/0x3e0
    [] ? seq_lseek+0x110/0x110
    [] do_loop_readv_writev+0x4b/0x90
    [] do_readv_writev+0xf6/0x1d0
    [] vfs_readv+0x3e/0x60
    [] sys_readv+0x50/0xd0
    [] tracesys+0xe1/0xe6

    The operation can be performed under RCU read lock. Replace
    cpuset_mutex locking with RCU read locking.
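
    As a sketch, the resulting locking shape (assuming the 3.8-era
    helpers task_subsys_state() and cgroup_path()):

      rcu_read_lock();
      css = task_subsys_state(tsk, cpuset_subsys_id);
      retval = cgroup_path(css->cgroup, buf, PAGE_SIZE);
      rcu_read_unlock();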

    tj: Rewrote patch description.

    Reported-by: Sasha Levin
    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

08 Jan, 2013

15 commits

  • cgroup already tracks the hierarchy. Follow cgroup->parent to find
    the parent and drop cpuset->parent.
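
    A minimal sketch of what the replacement helper could look like
    (name and details assumed):

      /* derive the parent cpuset from the cgroup hierarchy instead */
      static struct cpuset *parent_cs(const struct cpuset *cs)
      {
              struct cgroup *pcgrp = cs->css.cgroup->parent;

              return pcgrp ? cgroup_cs(pcgrp) : NULL;
      }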

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: Li Zefan

    Tejun Heo
     
  • Implement cpuset_for_each_descendant_pre() and replace the
    cpuset-specific tree walking using cpuset->stack_list with it.
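
    A sketch of what such a wrapper looks like when layered on the
    generic cgroup iterator (macro body assumed):

      #define cpuset_for_each_descendant_pre(des_cs, pos_cgrp, root_cs)        \
              cgroup_for_each_descendant_pre((pos_cgrp), (root_cs)->css.cgroup) \
                      if (is_cpuset_online(((des_cs) = cgroup_cs((pos_cgrp)))))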

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Acked-by: Li Zefan

    Tejun Heo
     
  • Supposedly for historical reasons, cpuset depends on cgroup core for
    locking. It depends on cgroup_mutex in cgroup callbacks and grabs
    cgroup_mutex from other places where it wants to be synchronized.
    This is majorly messy and highly prone to introducing circular locking
    dependency especially because cgroup_mutex is supposed to be one of
    the outermost locks.

    As previous patches already plugged possible races which may happen by
    decoupling from cgroup_mutex, replacing cgroup_mutex with the
    cpuset-specific cpuset_mutex is mostly straightforward. Introduce
    cpuset_mutex, replace all occurrences of cgroup_mutex with it, and add
    cpuset_mutex locking to places which inherited cgroup_mutex from
    cgroup core.

    The only complication is from cpuset wanting to initiate task
    migration when a cpuset loses all cpus or memory nodes. Task
    migration may go through full cgroup and all subsystem locking and
    should be initiated without holding any cpuset specific lock; however,
    a previous patch already made hotplug handling asynchronous, and
    moving the task migration part outside other locks is easy.
    cpuset_propagate_hotplug_workfn() now invokes
    remove_tasks_in_empty_cpuset() without holding any lock.
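
    The shape of the conversion, sketched on one callback (details
    elided):

      static DEFINE_MUTEX(cpuset_mutex);      /* cpuset-internal lock */

      static void cpuset_css_offline(struct cgroup *cgrp)
      {
              mutex_lock(&cpuset_mutex);      /* was: cgroup_lock() */
              /* ... cpuset-internal teardown ... */
              mutex_unlock(&cpuset_mutex);    /* was: cgroup_unlock() */
      }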

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset is scheduled to be decoupled from cgroup_lock which will make
    hotplug handling race with task migration. cpus or mems will be
    allowed to go offline between ->can_attach() and ->attach(). If
    hotplug takes down all cpus or mems of a cpuset while attach is in
    progress, ->attach() may end up putting tasks into an empty cpuset.

    This patch makes ->attach() schedule hotplug propagation if the
    cpuset is empty after attaching is complete. This will move the tasks
    to the nearest ancestor which can execute them, and the end result
    would be as if hotplug handling happened after the tasks finished
    attaching.

    cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to
    wait for propagations scheduled directly by cpuset_attach().

    This currently doesn't make any functional difference as everything is
    protected by cgroup_mutex but enables decoupling the locking.
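
    A sketch of the check at the end of ->attach() (the exact condition
    is assumed from the description above):

      /* at the end of cpuset_attach() */
      if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
              schedule_cpuset_propagate_hotplug(cs);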

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset is scheduled to be decoupled from cgroup_lock which will make
    configuration updates race with task migration. Any config update
    will be allowed to happen between ->can_attach() and ->attach(). If
    such config update removes either all cpus or mems, by the time
    ->attach() is called, the condition verified by ->can_attach(), that
    the cpuset is capable of hosting the tasks, is no longer true.

    This patch adds cpuset->attach_in_progress which is incremented from
    ->can_attach() and decremented when the attach operation finishes
    either successfully or not. validate_change() treats cpusets w/
    non-zero ->attach_in_progress like cpusets w/ tasks and refuses to
    remove all cpus or mems from it.

    This currently doesn't make any functional difference as everything is
    protected by cgroup_mutex but enables decoupling the locking.
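
    A sketch of both halves (variable names assumed):

      /* cpuset_can_attach(): record the pending attach */
      cs->attach_in_progress++;

      /* validate_change(): in-progress attaches count like tasks */
      if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
          (cpumask_empty(trial->cpus_allowed) ||
           nodes_empty(trial->mems_allowed)))
              return -ENOSPC;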

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug()
    directly to propagate hotplug updates to !root cpusets; however, this
    has the following problems.

    * cpuset locking is scheduled to be decoupled from cgroup_mutex,
    cgroup_mutex will be unexported, and cgroup_attach_task() will do
    cgroup locking internally, so propagation can't synchronously move
    tasks to a parent cgroup while walking the hierarchy.

    * We can't use the cgroup generic tree iterator because propagation to
    each cpuset may sleep. With propagation done asynchronously, we can
    drop the rather ugly cpuset-specific iteration.

    Convert cpuset_propagate_hotplug() to
    cpuset_propagate_hotplug_workfn() and execute it from newly added
    cpuset->hotplug_work. The work items are run on an ordered workqueue,
    so the propagation order is preserved. cpuset_hotplug_workfn()
    schedules all propagations while holding cgroup_mutex and waits for
    completion without cgroup_mutex. Each in-flight propagation holds a
    reference to the cpuset->css.

    This patch doesn't cause any functional difference.
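
    A sketch of the scheduling side, holding the css reference across
    the async work (helper name from the changelog):

      static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
      {
              /* pin the css for the duration of the work item */
              if (!css_tryget(&cs->css))
                      return;                 /* cpuset is going away */

              if (!queue_work(cpuset_propagate_hotplug_wq, &cs->hotplug_work))
                      css_put(&cs->css);      /* already queued */
      }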

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • In general, we want to make cgroup_mutex one of the outermost locks
    and be able to use get_online_cpus() and friends from cgroup methods.
    With cpuset hotplug made async, get_online_cpus() can now be nested
    inside cgroup_mutex.

    Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex
    by bouncing sched_domain rebuilding to a work item. As such nesting
    is allowed now, remove the workqueue bouncing code and always rebuild
    sched_domains synchronously. This also nests sched_domains_mutex
    inside cgroup_mutex, which is intended and should be okay.
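
    Before/after, as a sketch:

      /* before: bounce to a work item to avoid get_online_cpus() nesting */
      schedule_work(&rebuild_sched_domains_work);

      /* after: the nesting is allowed, rebuild synchronously */
      rebuild_sched_domains_locked();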

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • CPU / memory hotplug path currently grabs cgroup_mutex from hotplug
    event notifications. We want to separate cpuset locking from cgroup
    core and make cgroup_mutex outer to hotplug synchronization so that,
    among other things, mechanisms which depend on get_online_cpus() can
    be used from cgroup callbacks. In general, we want to keep
    cgroup_mutex the outermost lock to minimize locking interactions among
    different controllers.

    Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and
    schedule it from the hotplug notifications. As the function can
    already handle multiple mixed events without any input, converting it
    to a work function is mostly trivial; however, one complication is
    that cpuset_update_active_cpus() needs to update sched domains
    synchronously to reflect an offlined cpu to avoid confusing the
    scheduler. This is worked around by falling back to the default
    single sched domain synchronously before scheduling the actual hotplug
    work. This makes the sched domains get rebuilt twice per CPU hotplug
    event, but the operation isn't that heavy, and much of the second
    rebuild would be a noop for systems w/ a single sched domain, which is
    the common case.

    This decouples cpuset hotplug handling from the notification callbacks
    and there can be an arbitrary delay between the actual event and
    updates to cpusets. Scheduler and mm can handle it fine, but moving
    tasks out of an empty cpuset may race against writes to the cpuset
    restoring execution resources, which can lead to confusing behavior.
    Flush the hotplug work item from cpuset_write_resmask() to avoid such
    confusion.

    v2: Synchronous sched domain rebuilding using the fallback sched
    domain added. This fixes various issues caused by confused
    scheduler putting tasks on a dead CPU, including the one reported
    by Li Zefan.
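
    A sketch of the resulting notification path (assuming the era's
    cpuset_update_active_cpus() signature):

      void cpuset_update_active_cpus(bool cpu_online)
      {
              /*
               * Reflect the change synchronously with the fallback
               * single sched domain so the scheduler never sees a dead
               * CPU, then let the work item do the full update.
               */
              partition_sched_domains(1, NULL, NULL);
              schedule_work(&cpuset_hotplug_work);
      }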

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Reorganize hotplug path to prepare for async hotplug handling.

    * Both CPU and memory hotplug handling is collected into a single
    function - cpuset_handle_hotplug(). It doesn't take any argument
    but compares the current settings of top_cpuset against what's
    actually available to determine what happened. This function
    directly updates top_cpuset. If there are CPUs or memory nodes
    which are taken down, cpuset_propagate_hotplug() is invoked on all
    !root cpusets.

    * cpuset_propagate_hotplug() is responsible for updating the specified
    cpuset so that it doesn't include any resource which isn't available
    to top_cpuset. If no CPU or memory is left after update, all tasks
    are moved to the nearest ancestor with both resources.

    * update_tasks_cpumask() and update_tasks_nodemask() are now always
    called after cpus or mems masks are updated even if the cpuset
    doesn't have any task. This is for brevity and not expected to have
    any measurable effect.

    * cpu_active_mask and N_HIGH_MEMORY are read exactly once per
    cpuset_handle_hotplug() invocation, all cpusets share the same view
    of what resources are available, and cpuset_handle_hotplug() can
    handle multiple resources going up and down. These properties will
    allow async operation.

    The reorganization, while drastic, is equivalent and shouldn't cause
    any behavior difference. This will enable making hotplug handling
    async and remove get_online_cpus() -> cgroup_mutex nesting.
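
    A sketch of the diff-against-reality approach (variable names
    assumed):

      /* in cpuset_handle_hotplug(); statics are fine under cgroup_mutex */
      static cpumask_t new_cpus, off_cpus;
      static nodemask_t new_mems, off_mems;

      cpumask_copy(&new_cpus, cpu_active_mask);       /* read exactly once */
      new_mems = node_states[N_HIGH_MEMORY];          /* read exactly once */

      /* whatever top_cpuset has beyond reality was just taken down */
      cpumask_andnot(&off_cpus, top_cpuset.cpus_allowed, &new_cpus);
      nodes_andnot(off_mems, top_cpuset.mems_allowed, new_mems);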

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cpuset_can_attach() prepares the global variables cpus_attach and
    cpuset_attach_nodemask_{to|from}, which are used by cpuset_attach().
    There is no reason to do the preparation in cpuset_can_attach(); the
    same information can be accessed from cpuset_attach().

    Move the preparation logic from cpuset_can_attach() to cpuset_attach()
    and make the global variables static ones inside cpuset_attach().

    With this change, there's no reason to keep
    cpuset_attach_nodemask_{from|to} global. Move them inside
    cpuset_attach(). Unfortunately, we need to keep cpus_attach global as
    it can't be allocated from cpuset_attach().

    v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty
    Russell.
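
    A sketch of where the variables end up:

      /* can't be allocated from cpuset_attach(), so this one stays global */
      static cpumask_var_t cpus_attach;

      static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
      {
              /* static: too large for the stack; serialized by cgroup_mutex */
              static nodemask_t cpuset_attach_nodemask_from;
              static nodemask_t cpuset_attach_nodemask_to;

              /* ... prepare both here instead of in cpuset_can_attach() ... */
      }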

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Rusty Russell

    Tejun Heo
     
  • Instead of iterating cgroup->children directly, introduce and use
    cpuset_for_each_child(), which wraps cgroup_for_each_child() and
    performs an online check. As it uses the generic iterator, it
    requires RCU read locking too.

    As cpuset is currently protected by cgroup_mutex, non-online cpusets
    aren't visible to all the iterations and this patch currently doesn't
    make any functional difference. This will be used to de-couple cpuset
    locking from cgroup core.
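
    A usage sketch; note the RCU read lock the generic iterator
    requires:

      struct cpuset *child;
      struct cgroup *pos_cgrp;

      rcu_read_lock();
      cpuset_for_each_child(child, pos_cgrp, cs) {
              /* only online (CS_ONLINE) children are visited */
      }
      rcu_read_unlock();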

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Add CS_ONLINE which is set from css_online() and cleared from
    css_offline(). This will enable using generic cgroup iterator while
    allowing decoupling cpuset from cgroup internal locking.
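
    The flag and its test, sketched:

      /* CS_ONLINE: set from css_online(), cleared from css_offline() */
      static inline bool is_cpuset_online(const struct cpuset *cs)
      {
              return test_bit(CS_ONLINE, &cs->flags);
      }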

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Add cpuset_css_on/offline() and rearrange css init/exit such that,

    * Allocation and clearing to the default values happen in css_alloc().
    Allocation now uses kzalloc().

    * Config inheritance and registration happen in css_online().

    * css_offline() undoes what css_online() did.

    * css_free() frees.

    This doesn't introduce any visible behavior changes. This will help
    cleaning up locking.
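
    A skeleton of the resulting split (bodies elided, shapes assumed):

      static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cgrp)
      {
              struct cpuset *cs = kzalloc(sizeof(*cs), GFP_KERNEL);

              return cs ? &cs->css : ERR_PTR(-ENOMEM);
      }

      static int cpuset_css_online(struct cgroup *cgrp)
      {
              /* set CS_ONLINE, inherit config from the parent */
              return 0;
      }

      static void cpuset_css_offline(struct cgroup *cgrp)
      {
              /* clear CS_ONLINE, undo what css_online() did */
      }

      static void cpuset_css_free(struct cgroup *cgrp)
      {
              kfree(cgroup_cs(cgrp));
      }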

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • The function isn't that hot, the overhead of missing the fast exit is
    low, the test itself depends heavily on cgroup internals, and it's
    gonna be a hindrance when trying to decouple cpuset locking from
    cgroup core. Remove the fast exit path.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we
    should use N_MEMORY instead.
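
    The conversion, sketched:

      /* was: for_each_node_state(node, N_HIGH_MEMORY) */
      for_each_node_state(node, N_MEMORY) {
              /* visits every node that has any memory at all */
      }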

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

20 Nov, 2012

2 commits

  • Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone(). Now
    that clone_children is cpuset specific, there's no reason to have this
    rather odd option activation mechanism in cgroup core. cpuset can
    check the flag from its ->css_alloc() and take the necessary
    action.

    Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and
    remove cgroup_subsys->post_clone().

    Loosely based on Glauber's "generalize post_clone into post_create"
    patch.

    Signed-off-by: Tejun Heo
    Original-patch-by: Glauber Costa
    Original-patch:
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan
    Cc: Glauber Costa

    Tejun Heo
     
  • Rename cgroup_subsys css lifetime related callbacks to better describe
    what their roles are. Also, update documentation.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

24 Jul, 2012

4 commits

  • cpuset_track_online_cpus() is no longer present. So remove the
    outdated comment and replace it with a reference to
    cpuset_update_active_cpus(), which is its equivalent.

    Also, we no longer lack memory hot-unplug, and David Rientjes pointed
    out how it is dealt with. So update that comment as well.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141700.3692.98192.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • Separate out the cpuset related handling for CPU/Memory online/offline.
    This also helps us exploit the most obvious and basic level of optimization
    that any notification mechanism (CPU/Mem online/offline) has to offer us:
    "We *know* why we have been invoked. So stop pretending that we are lost,
    and do only the necessary amount of processing!".

    And while at it, rename scan_for_empty_cpusets() to
    scan_cpusets_upon_hotplug(), which is more appropriate considering how
    it is restructured.
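
    A sketch of the resulting interface, letting callers state why the
    scan was invoked:

      enum hotplug_event {
              CPUSET_CPU_OFFLINE,
              CPUSET_MEM_OFFLINE,
      };

      /* was scan_for_empty_cpusets(); now does only event-relevant work */
      static void scan_cpusets_upon_hotplug(struct cpuset *root,
                                            enum hotplug_event event);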

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • At present, the functions that deal with cpusets during CPU/Mem hotplug
    are quite messy, since a lot of the functionality is mixed up without clear
    separation. And this takes a toll on optimization as well. For example,
    the function cpuset_update_active_cpus() is called on both CPU offline and CPU
    online events; and it invokes scan_for_empty_cpusets(), which makes sense
    only for CPU offline events. And hence, the current code ends up unnecessarily
    traversing the cpuset tree during CPU online also.

    As a first step towards cleaning up those functions, encapsulate the cpuset
    tree traversal in a helper function, so as to facilitate upcoming changes.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141635.3692.893.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
    masks as and when necessary to ensure that the tasks belonging to the cpusets
    have some place (online CPUs) to run on. And regular CPU hotplug is
    destructive in the sense that the kernel doesn't remember the original cpuset
    configurations set by the user, across hotplug operations.

    However, suspend/resume (which uses CPU hotplug) is a special case in which
    the kernel has the responsibility to restore the system (during resume), to
    exactly the same state it was in before suspend.

    In order to achieve that, do the following:

    1. Don't modify cpusets during suspend/resume. At all.
    In particular, don't move the tasks from one cpuset to another, and
    don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
    during the CPU hotplug operations that are carried out in the
    suspend/resume path.

    2. However, cpusets and sched domains are related. We just want to avoid
    altering cpusets alone. So, to keep the sched domains updated, build
    a single sched domain (containing all active cpus) during each of the
    CPU hotplug operations carried out in s/r path, effectively ignoring
    the cpusets' cpus_allowed masks.

    (Since userspace is frozen while doing all this, it will go unnoticed.)

    3. During the last CPU online operation during resume, build the sched
    domains by looking up the (unaltered) cpusets' cpus_allowed masks.
    That will bring back the system to the same original state as it was in
    before suspend.

    Ultimately, this will not only solve the cpuset problem related to
    suspend/resume (i.e., restore the cpusets to exactly what they were
    before suspend, by not touching them at all) but also speed up
    suspend/resume, because we avoid running the cpuset update code for
    every CPU being offlined/onlined.
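
    A sketch of the special-casing in the sched CPU notifier (the
    _FROZEN variants identify the suspend/resume path; the counter name
    is assumed):

      static int cpuset_cpu_active(struct notifier_block *nfb,
                                   unsigned long action, void *hcpu)
      {
              switch (action) {
              case CPU_ONLINE_FROZEN:                 /* resume path */
                      num_cpus_frozen--;
                      if (likely(num_cpus_frozen)) {
                              /* not the last online: keep single domain */
                              partition_sched_domains(1, NULL, NULL);
                              break;
                      }
                      /* fall through: last online of resume */
              case CPU_ONLINE:
                      cpuset_update_active_cpus();    /* rebuild from cpusets */
                      break;
              default:
                      return NOTIFY_DONE;
              }
              return NOTIFY_OK;
      }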

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     

23 May, 2012

1 commit

  • Pull cgroup updates from Tejun Heo:
    "cgroup file type addition / removal is updated so that file types are
    added and removed instead of individual files so that dynamic file
    type addition / removal can be implemented by cgroup and used by
    controllers. blkio controller changes which will come through block
    tree are dependent on this. Other changes include res_counter cleanup
    and disallowing kthread / PF_THREAD_BOUND threads to be attached to
    non-root cgroups.

    There's a reported bug with the file type addition / removal handling
    which can lead to oops on cgroup umount. The issue is being looked
    into. It shouldn't cause problems for most setups and isn't a
    security concern."

    Fix up trivial conflict in Documentation/feature-removal-schedule.txt

    * 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    res_counter: Account max_usage when calling res_counter_charge_nofail()
    res_counter: Merge res_counter_charge and res_counter_charge_nofail
    cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
    cgroup: remove cgroup_subsys->populate()
    cgroup: get rid of populate for memcg
    cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
    cgroup: make css->refcnt clearing on cgroup removal optional
    cgroup: use negative bias on css->refcnt to block css_tryget()
    cgroup: implement cgroup_rm_cftypes()
    cgroup: introduce struct cfent
    cgroup: relocate __d_cgrp() and __d_cft()
    cgroup: remove cgroup_add_file[s]()
    cgroup: convert memcg controller to the new cftype interface
    memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    cgroup: convert all non-memcg controllers to the new cftype interface
    cgroup: relocate cftype and cgroup_subsys definitions in controllers
    cgroup: merge cft_release_agent cftype array into the base files array
    cgroup: implement cgroup_add_cftypes() and friends
    cgroup: build list of all cgroups under a given cgroupfs_root
    cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
    ...

    Linus Torvalds
     

02 Apr, 2012

2 commits

  • Pull cpumask cleanups from Rusty Russell:
    "(Somehow forgot to send this out; it's been sitting in linux-next, and
    if you don't want it, it can sit there another cycle)"

    I'm a sucker for things that actually delete lines of code.

    Fix up trivial conflict in arch/arm/kernel/kprobes.c, where Rusty fixed
    a user of &cpu_online_map to be cpu_online_mask, but that code got
    deleted by commit b21d55e98ac2 ("ARM: 7332/1: extract out code patch
    function from kprobes").

    * tag 'for-linus' of git://github.com/rustyrussell/linux:
    cpumask: remove old cpu_*_map.
    documentation: remove references to cpu_*_map.
    drivers/cpufreq/db8500-cpufreq: remove references to cpu_*_map.
    remove references to cpu_*_map in arch/

    Linus Torvalds
     
  • Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
    net_cls and device controllers to use the new cftype based interface.
    Termination entry is added to cftype arrays and populate callbacks are
    replaced with cgroup_subsys->base_cftypes initializations.

    This is a functionally identical transformation. There shouldn't be
    any visible behavior change.

    memcg is rather special and will be converted separately.
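
    The shape of the conversion, sketched for one controller (handlers
    elided):

      static struct cftype files[] = {
              {
                      .name = "cpus",
                      /* ... read/write handlers ... */
              },
              { }     /* terminating entry */
      };

      struct cgroup_subsys cpuset_subsys = {
              .name = "cpuset",
              /* .populate is gone; the core registers base_cftypes itself */
              .base_cftypes = files,
              /* ... */
      };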

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "David S. Miller"
    Cc: Vivek Goyal

    Tejun Heo
     

28 Mar, 2012

1 commit

  • We don't use "cpu" any more after 2baab4e904 "sched: Fix
    select_fallback_rq() vs cpu_active/cpu_online".

    Signed-off-by: Dan Carpenter
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120328104608.GD29022@elgon.mountain
    Signed-off-by: Ingo Molnar

    Dan Carpenter
     

27 Mar, 2012

1 commit

  • Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
    supposed to finally sort the cpu_active mess, instead uncovered more.

    Since CPU_STARTING is run before setting the cpu online, there's a
    (small) window where the cpu is active but !online.

    If during this time there's a wakeup of a task that used to reside on
    that cpu, select_task_rq() will use select_fallback_rq() to compute an
    alternative cpu to run on, since we find !online.

    select_fallback_rq() however will compute the new cpu against
    cpu_active, this means that it can return the same cpu it started out
    with, the !online one, since that cpu is in fact marked active.

    This results in us trying to schedule a task on an offline cpu and
    triggering a WARN in the IPI code.

    The solution proposed by Chuansheng Liu of setting cpu_active in
    set_cpu_online() is buggy: firstly, not all archs actually use
    set_cpu_online(); secondly, not all archs call set_cpu_online() with
    IRQs disabled. This means we would introduce either the same race or
    the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
    wrong CPU") -- albeit much narrower.

    [ By setting online first and active later we have a window of
    online,!active, fresh and bound kthreads have task_cpu() of 0 and
    since cpu0 isn't in tsk_cpus_allowed() we end up in
    select_fallback_rq() which excludes !active, resulting in a reset
    of ->cpus_allowed and the thread running all over the place. ]

    The solution is to re-work select_fallback_rq() to require active
    _and_ online. This makes the active,!online case work as expected,
    OTOH archs running CPU_STARTING after setting online are now
    vulnerable to the issue from fd8a7de17 -- these are alpha and
    blackfin.
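
    A sketch of the reworked loop in select_fallback_rq():

      /* a fallback cpu must be allowed, online _and_ active */
      for_each_cpu(dest_cpu, nodemask) {
              if (!cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
                      continue;
              if (!cpu_online(dest_cpu))
                      continue;
              if (!cpu_active(dest_cpu))
                      continue;
              return dest_cpu;
      }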

    Reported-by: Chuansheng Liu
    Signed-off-by: Peter Zijlstra
    Cc: Mike Frysinger
    Cc: linux-alpha@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Mar, 2012

1 commit

  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter with just read barriers on the fast path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic if the nodemask changes in a
    manner that can cause a false failure.

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.
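
    A sketch of the read-side pattern after the change (seqcount retry
    idiom; the field name is assumed):

      unsigned int seq;
      nodemask_t nodes;

      do {
              seq = read_seqcount_begin(&current->mems_allowed_seq);
              nodes = current->mems_allowed;
              /* ... attempt the allocation against @nodes ... */
      } while (read_seqcount_retry(&current->mems_allowed_seq, seq));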

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                                  3.3.0-rc3           3.3.0-rc3
                                  rc3-vanilla         nobarrier-v2r1
    Clients 1 UserTime                 0.07 (  0.00%)       0.08 (-14.19%)
    Clients 2 UserTime                 0.07 (  0.00%)       0.07 (  2.72%)
    Clients 4 UserTime                 0.08 (  0.00%)       0.07 (  3.29%)
    Clients 1 SysTime                  0.70 (  0.00%)       0.65 (  6.65%)
    Clients 2 SysTime                  0.85 (  0.00%)       0.82 (  3.65%)
    Clients 4 SysTime                  1.41 (  0.00%)       1.41 (  0.32%)
    Clients 1 WallTime                 0.77 (  0.00%)       0.74 (  4.19%)
    Clients 2 WallTime                 0.47 (  0.00%)       0.45 (  3.73%)
    Clients 4 WallTime                 0.38 (  0.00%)       0.37 (  1.58%)
    Clients 1 Flt/sec/cpu         497620.28 (  0.00%)  520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu         414639.05 (  0.00%)  429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu         257959.16 (  0.00%)  258761.48 (  0.31%)
    Clients 1 Flt/sec             495161.39 (  0.00%)  517292.87 (  4.47%)
    Clients 2 Flt/sec             820325.95 (  0.00%)  850289.77 (  3.65%)
    Clients 4 Flt/sec            1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)              135.68      132.17
    User+Sys Time Running Test (seconds)         164.2       160.13
    Total Elapsed Time (seconds)                 123.46      120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Feb, 2012

1 commit

  • The argument is not used at all, and it's not necessary, because
    a specific callback handler of course knows which subsys it
    belongs to.

    Now only ->populate() takes this argument, because the handlers of
    this callback always call cgroup_add_file()/cgroup_add_files().

    So we reduce a few lines of code, though the shrinking of object size
    is minimal.
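
    The change, sketched on one callback (before/after):

      /* before */
      static struct cgroup_subsys_state *
      cpuset_create(struct cgroup_subsys *ss, struct cgroup *cgrp);

      /* after: the handler already knows which subsys it belongs to */
      static struct cgroup_subsys_state *
      cpuset_create(struct cgroup *cgrp);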

    16 files changed, 113 insertions(+), 162 deletions(-)

    text data bss dec hex filename
    5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
    5486170 656987 7039960 13183117 c9288d vmlinux.o

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

10 Jan, 2012

1 commit

  • * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
    cgroup: fix to allow mounting a hierarchy by name
    cgroup: move assignment out of condition in cgroup_attach_proc()
    cgroup: Remove task_lock() from cgroup_post_fork()
    cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
    cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
    cgroup: only need to check oldcgrp==newgrp once
    cgroup: remove redundant get/put of task struct
    cgroup: remove redundant get/put of old css_set from migrate
    cgroup: Remove unnecessary task_lock before fetching css_set on migration
    cgroup: Drop task_lock(parent) on cgroup_fork()
    cgroups: remove redundant get/put of css_set from css_set_check_fetched()
    resource cgroups: remove bogus cast
    cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
    cgroup, cpuset: don't use ss->pre_attach()
    cgroup: don't use subsys->can_attach_task() or ->attach_task()
    cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
    cgroup: improve old cgroup handling in cgroup_attach_proc()
    cgroup: always lock threadgroup during migration
    threadgroup: extend threadgroup_lock() to cover exit and exec
    threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
    ...

    Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
    fix a css_set not found bug in cgroup_attach_proc" that already
    mentioned that the bug is fixed (differently) in Tejun's cgroup
    patchset. This one, in other words.

    Linus Torvalds
     

21 Dec, 2011

1 commit

  • Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
    nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
    new set of allowed cpuset nodes where the two nodemasks, as a result of
    the remap, are now disjoint.

    c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
    cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
    nodes from changing for a thread. This causes any update to a set of
    allowed nodes to stall until put_mems_allowed() is called.

    This stall is unnecessary, however, if at least one node remains unchanged
    in the update to the set of allowed nodes. This was addressed by
    89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
    node remains set"), but it's still possible that an empty nodemask may be
    read from a mempolicy because the old nodemask may be remapped to the new
    nodemask during rebind. To prevent this, only avoid the stall if there is
    no mempolicy for the thread being changed.

    This is a temporary solution until all reads from mempolicy nodemasks can
    be guaranteed to not be empty without the get_mems_allowed()
    synchronization.

    Also moves the check for nodemask intersection inside task_lock() so that
    tsk->mems_allowed cannot change. This ensures that nothing can set this
    tsk's mems_allowed out from under us and also protects tsk->mempolicy.
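
    A sketch of the check, now inside task_lock() (condition per the
    description; the helper name is assumed):

      task_lock(tsk);
      /*
       * Stall only when it can matter: a mempolicy exists, or the new
       * nodemask shares no node with the current one.
       */
      need_loop = task_has_mempolicy(tsk) ||
                  !nodes_intersects(*newmems, tsk->mems_allowed);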

    Reported-by: Miao Xie
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Dec, 2011

3 commits

  • ->pre_attach() is supposed to be called before migration, which is
    observed during process migration, but task migration does it the
    other way around. The only ->pre_attach() user is cpuset, which can
    do the same operations in ->can_attach(). Collapse cpuset_pre_attach()
    into cpuset_can_attach().

    -v2: Patch contamination from later patch removed. Spotted by Paul
    Menage.

    Signed-off-by: Tejun Heo
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Cc: Li Zefan

    Tejun Heo
     
  • Now that subsys->can_attach() and attach() take @tset instead of
    @task, they can handle per-task operations. Convert
    ->can_attach_task() and ->attach_task() users to use ->can_attach()
    and attach() instead. Most conversions are straightforward.
    Noteworthy changes are:

    * In cgroup_freezer, remove unnecessary NULL assignments to unused
    methods. It's useless and very prone to getting out of sync, which
    has already happened.

    * In cpuset, the PF_THREAD_BOUND test is now applied to each task.
    This doesn't make any practical difference but is conceptually
    cleaner.

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: James Morris
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Tejun Heo
     
  • Currently, there's no way to pass multiple tasks to cgroup_subsys
    methods, necessitating separate per-process and per-task methods.
    This patch introduces cgroup_taskset, which can be used to
    pass multiple tasks and their associated cgroups to cgroup_subsys
    methods.

    Three methods - can_attach(), cancel_attach() and attach() - are
    converted to use cgroup_taskset. This unifies passed parameters so
    that all methods have access to all information. Conversions in this
    patchset are identical and don't introduce any behavior change.

    -v2: documentation updated as per Paul Menage's suggestion.
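
    A usage sketch of the new interface (era-appropriate callback
    signature assumed):

      static int cpuset_can_attach(struct cgroup_subsys *ss,
                                   struct cgroup *cgrp,
                                   struct cgroup_taskset *tset)
      {
              struct task_struct *task;

              cgroup_taskset_for_each(task, cgrp, tset) {
                      /* per-task checks, all in one callback */
                      if (task->flags & PF_THREAD_BOUND)
                              return -EINVAL;
              }
              return 0;
      }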

    Signed-off-by: Tejun Heo
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Frederic Weisbecker
    Acked-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris

    Tejun Heo