03 Oct, 2012

1 commit

  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     

15 Sep, 2012

5 commits

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.
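
The check described above can be sketched in userspace C. This is only an illustration: the struct layouts, the `warned` field, the function name and the message text are hypothetical stand-ins, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * needed for the illustration are modeled. */
struct subsys {
    const char *name;
    bool broken_hierarchy;   /* set for controllers lacking hierarchy support */
    bool warned;             /* assumed helper flag: warn once per subsystem */
};

struct cgroup {
    struct cgroup *parent;   /* NULL for the root cgroup */
};

/* Returns true if a warning was emitted. Intended to run after the
 * cgroup has been fully created, so that a create callback could
 * still change broken_hierarchy beforehand. */
static bool check_broken_hierarchy(struct subsys *ss, const struct cgroup *cgrp)
{
    assert(ss && cgrp);
    /* "Nested" means the parent is itself not the root. */
    bool nested = cgrp->parent && cgrp->parent->parent;

    if (!ss->broken_hierarchy || !nested || ss->warned)
        return false;
    fprintf(stderr, "cgroup: %s has broken hierarchy support, nesting is discouraged\n",
            ss->name);
    ss->warned = true;
    return true;
}
```

A cgroup directly under the root does not trigger the message in this sketch; only a grandchild of the root does, and only once per subsystem.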

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • WARNING: With this change it is no longer possible to load externally
    built controllers.

    In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m are
    set, the corresponding subsys_id should also be a constant. Up to now,
    net_prio_subsys_id and net_cls_subsys_id were of type int and their
    values were assigned at runtime.

    By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
    to IS_ENABLED, all *_subsys_id will have constant value. That means we
    need to remove all the code which assumes a value can be assigned to
    net_prio_subsys_id and net_cls_subsys_id.

    A close look is necessary on the RCU part which was introduced by the
    following patch:

    commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
    Author: Herbert Xu Mon May 24 09:12:34 2010
    Committer: David S. Miller Mon May 24 09:12:34 2010

    cls_cgroup: Store classid in struct sock

    This code was added to init_cgroup_cls()

    /* We can't use rcu_assign_pointer because this is an int. */
    smp_wmb();
    net_cls_subsys_id = net_cls_subsys.subsys_id;

    respectively to exit_cgroup_cls()

    net_cls_subsys_id = -1;
    synchronize_rcu();

    and in module version of task_cls_classid()

    rcu_read_lock();
    id = rcu_dereference(net_cls_subsys_id);
    if (id >= 0)
            classid = container_of(task_subsys_state(p, id),
                                   struct cgroup_cls_state, css)->classid;
    rcu_read_unlock();

    There is no explicit explanation of why the RCU part is needed. (The
    rcu_dereference() was fixed by exchanging it for
    rcu_dereference_index_check() in a later commit, but that is a minor
    detail.)

    So here is my pondering on why it was introduced and why it is safe to
    remove it now. Note that this code was copied over to net_prio, so the
    reasoning holds for that subsystem too.

    The idea behind the RCU use for net_cls_subsys_id is to make sure we
    get a valid pointer back from task_subsys_state(). task_subsys_state()
    is just blindly accessing the subsys array and returning the
    pointer. Obviously, passing in -1 as id into task_subsys_state()
    returns an invalid value (out of lower bound).

    So this code makes sure that the id is assigned only after the module
    is loaded and the subsystem is registered.

    Before unregistering the module all old readers must have left the
    critical section. This is done by assigning -1 to the id and issuing a
    synchronize_rcu(). Any new readers won't call task_subsys_state()
    anymore and therefore it is safe to unregister the subsystem.

    The new code relies on the same trick, but it looks at the subsys
    pointer returned by task_subsys_state() (remember the id is constant
    and therefore we always have a valid index into the subsys
    array).

    No precautions need to be taken during module loading.
    Eventually, all CPUs will get a valid pointer back from
    task_subsys_state() because rebind_subsystem(), which is called after
    the module init() function, will assign subsys[net_cls_subsys_id] the
    newly loaded module subsystem pointer.

    When the subsystem is about to be removed, rebind_subsystem() will
    be called before the module exit() function. In this case,
    rebind_subsystem() will assign subsys[net_cls_subsys_id] a NULL pointer
    and then call synchronize_rcu(). By then all old readers have left
    the critical section. Any new reader won't access the subsystem
    anymore. At this point we are safe to unregister the subsystem. No
    synchronize_rcu() call is needed.
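
As a rough userspace model of the new scheme, the sketch below keeps the id constant and lets readers check the slot pointer instead. It uses C11 atomics in place of RCU and does not model the grace period; `subsys_slot`, `bind_subsys()` and `task_classid()` are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Ids are compile-time constants, so they are always valid indices. */
enum { NET_CLS_SUBSYS_ID, SUBSYS_COUNT };

struct subsys_state { unsigned int classid; };

/* One slot per subsystem; a module's slot stays NULL until it is bound. */
static _Atomic(struct subsys_state *) subsys_slot[SUBSYS_COUNT];

/* Models the rebinding step from the text: publish the pointer on load,
 * clear it on unload. Waiting for old readers is not modeled here. */
static void bind_subsys(int id, struct subsys_state *st)
{
    assert(id >= 0 && id < SUBSYS_COUNT);
    atomic_store(&subsys_slot[id], st);
}

/* Reader side: no id check is needed, only the pointer is tested. */
static unsigned int task_classid(void)
{
    struct subsys_state *st = atomic_load(&subsys_slot[NET_CLS_SUBSYS_ID]);

    return st ? st->classid : 0;
}
```

Before the module is bound and after it is unbound, the reader in this sketch simply falls back to a class id of 0.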

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Cc: Gao feng
    Cc: Glauber Costa
    Cc: Herbert Xu
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kamezawa Hiroyuki
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     
  • The *_subsys_id will be used as index to access the subsys. Therefore
    we need to take care to populate each subsystem at the correct position
    by using designated initialization.

    With this change we are able to interleave builtin and modules in the subsys
    array.
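
A minimal sketch of designated initialization for such an array follows. The ids and subsystem objects are hypothetical; the point is only that each built-in entry lands at its own id while slots for modular subsystems stay NULL.

```c
#include <assert.h>
#include <stddef.h>

struct subsys { const char *name; };

/* Hypothetical ids; built-in and modular subsystems are interleaved. */
enum { CPUSET_ID, NET_CLS_ID, FREEZER_ID, NET_PRIO_ID, SUBSYS_COUNT };

static struct subsys cpuset_subsys  = { "cpuset" };
static struct subsys freezer_subsys = { "freezer" };

/* Built-in subsystems are placed at their own id. Slots that are not
 * named (NET_CLS_ID and NET_PRIO_ID here) are zero-initialized, i.e. NULL,
 * and could be filled in later when a module is loaded. */
static struct subsys *subsys[SUBSYS_COUNT] = {
    [CPUSET_ID]  = &cpuset_subsys,
    [FREEZER_ID] = &freezer_subsys,
};

/* Lookup by id; returns NULL for an empty slot. */
static struct subsys *subsys_by_id(int id)
{
    assert(id >= 0 && id < SUBSYS_COUNT);
    return subsys[id];
}
```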

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: Gao feng
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     
  • Before we are able to define all subsystem ids at compile time we need
    finer-grained control over what gets defined when we include
    cgroup_subsys.h. For example, we define the enums for the subsystems,
    or declare the struct cgroup_subsys instances (builtin subsystems), by
    including cgroup_subsys.h and defining SUBSYS accordingly.

    Currently, whether a subsys is used is decided inside the header by
    testing whether CONFIG_*=y is true. By moving this test outside of
    cgroup_subsys.h we are able to control it on the include level.

    This is done by introducing IS_SUBSYS_ENABLED which then is defined
    according to the task, e.g. CONFIG_*=y or CONFIG_*=m.
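
The pattern can be sketched in a single file as follows. The CONFIG_* values and the definition of IS_SUBSYS_ENABLED are assumptions for the illustration, and the subsystem list is written out twice because one file cannot include a shared header the way the real code does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical configuration: 1 stands for enabled, 0 for not set. */
#define CONFIG_CPUSETS 1
#define CONFIG_FREEZER 0
#define CONFIG_NET_CLS 1

/* The includer chooses the policy; here "enabled" simply means non-zero. */
#define IS_SUBSYS_ENABLED(option) (option)

/* First pass: SUBSYS() expands to an enumerator. */
#define SUBSYS(x) x##_subsys_id,
enum subsys_id {
#if IS_SUBSYS_ENABLED(CONFIG_CPUSETS)
    SUBSYS(cpuset)
#endif
#if IS_SUBSYS_ENABLED(CONFIG_FREEZER)
    SUBSYS(freezer)
#endif
#if IS_SUBSYS_ENABLED(CONFIG_NET_CLS)
    SUBSYS(net_cls)
#endif
    SUBSYS_COUNT
};
#undef SUBSYS

/* Second pass: the same list, but SUBSYS() now yields a name string
 * stored at the matching id. */
#define SUBSYS(x) [x##_subsys_id] = #x,
static const char *const subsys_names[SUBSYS_COUNT] = {
#if IS_SUBSYS_ENABLED(CONFIG_CPUSETS)
    SUBSYS(cpuset)
#endif
#if IS_SUBSYS_ENABLED(CONFIG_FREEZER)
    SUBSYS(freezer)
#endif
#if IS_SUBSYS_ENABLED(CONFIG_NET_CLS)
    SUBSYS(net_cls)
#endif
};
#undef SUBSYS

/* Name lookup by id. */
static const char *subsys_name(int id)
{
    assert(id >= 0 && id < SUBSYS_COUNT);
    return subsys_names[id];
}
```

Changing the definition of IS_SUBSYS_ENABLED before each pass is what lets one list serve different purposes, which is the control "on the include level" the message refers to.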

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: Gao feng
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     
  • CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
    looping over the subsys array looking either at the builtin or the
    module subsystems. Since all the builtin subsystems have an id which
    is lower than CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
    have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
    are sorted.

    We are about to change id assignment to happen only at compile time
    later in this series. That means we can't rely on the above trick
    since all ids will always be defined at compile time. Furthermore,
    ordering the builtin subsystems and the module subsystems is not
    really necessary.

    So we need a different way to know which subsystem is a builtin or a
    module one. We can use the subsys[]->module pointer for this. Any
    place where we need to know if a subsys is a module we just check for
    the pointer. If it is NULL then the subsystem is a builtin one.

    With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
    enum. Though we need to introduce a temporary placeholder so that we
    don't get a compilation error when only CONFIG_CGROUP is selected and
    not a single controller. An empty enum definition is not valid. Later in
    this series we are able to remove the placeholder again.
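
A small sketch of the pointer-based test, assuming simplified stand-in structs (the real ones carry many more fields): the array is walked in full, empty slots are skipped, and the module pointer alone decides whether an entry is built in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct module { const char *name; };

struct subsys {
    const char *name;
    struct module *module;   /* NULL means the subsystem is built in */
};

static bool subsys_is_builtin(const struct subsys *ss)
{
    assert(ss);
    return ss->module == NULL;
}

/* Counts loadable subsystems without relying on any id ordering. */
static int count_modular(struct subsys *const *subsys, size_t n)
{
    int count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!subsys[i])                      /* empty slot */
            continue;
        if (!subsys_is_builtin(subsys[i]))   /* has an owning module */
            count++;
    }
    return count;
}
```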

    And with this change we get a fix for this:

    kernel/cgroup.c: In function ‘cgroup_load_subsys’:
    kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]

    when CONFIG_CGROUP=y and no built in controller was enabled.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: Gao feng
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     

25 Aug, 2012

3 commits

  • In a previous discussion, Tejun Heo suggested renaming references to
    subsys_bits (added_bits, removed_bits, etc) to something more meaningful.

    Cc: Li Zefan
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Lennart Poettering
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     
  • This is one of the items in the plumber's wish list.

    For use cases:

    >> What would the use case be for this?
    >
    > Attaching meta information to services, in an easily discoverable
    > way. For example, in systemd we create one cgroup for each service, and
    > could then store data like the main pid of the specific service as an
    > xattr on the cgroup itself. That way we'd have almost all service state
    > in the cgroupfs, which would make it possible to terminate systemd and
    > later restart it without losing any state information. But there's more:
    > for example, some very peculiar services cannot be terminated on
    > shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
    > services in question could just mark that on their cgroup, by setting an
    > xattr. On the more desktopy side of things there are other
    > possibilities: for example there are plans defining what an application
    > is along the lines of a cgroup (i.e. an app being a collection of
    > processes). With xattrs one could then attach an icon or human readable
    > program name on the cgroup.
    >
    > The key idea is that this would allow attaching runtime meta information
    > to cgroups and everything they model (services, apps, vms), that doesn't
    > need any complex userspace infrastructure, has good access control
    > (i.e. because the file system enforces that anyway, and there's the
    > "trusted." xattr namespace), notifications (inotify), and can easily be
    > shared among applications.
    >
    > Lennart

    v7:
    - no changes
    v6:
    - remove user xattr namespace, only allow trusted and security
    v5:
    - check for capabilities before setting/removing xattrs
    v4:
    - no changes
    v3:
    - instead of config option, use mount option to enable xattr support

    Original-patch-by: Li Zefan
    Cc: Li Zefan
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Lennart Poettering
    Signed-off-by: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     
  • When remounting cgroupfs with some subsystems added to it and some
    removed, cgroup will remove all the files in the root directory and then
    re-populate it.

    What I'm doing here is to remove only files which belong to subsystems
    that are to be unbound, and to create files only for newly-added
    subsystems. The purpose is to have all other files untouched.

    This is a preparation for cgroup xattr support.
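
The selection of subsystems to unbind and to add can be expressed with two bit masks. The sketch below assumes one bit per subsystem id; the ids, the struct and the function name are hypothetical.

```c
#include <assert.h>

/* Hypothetical subsystem ids, one mask bit each. */
enum { CPU_ID, MEMORY_ID, FREEZER_ID };

struct mask_delta {
    unsigned long added;     /* subsystems that need files created */
    unsigned long removed;   /* subsystems whose files get removed */
};

static struct mask_delta subsys_mask_delta(unsigned long old_mask,
                                           unsigned long new_mask)
{
    struct mask_delta d;

    d.added = new_mask & ~old_mask;
    d.removed = old_mask & ~new_mask;
    /* A subsystem cannot be both added and removed by one remount. */
    assert((d.added & d.removed) == 0);
    return d;
}
```

Subsystems present in both masks appear in neither result, so their files are left untouched.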

    v7:
    - checkpatch warnings fixed
    v6:
    - no changes
    v5:
    - no changes
    v4:
    - refactored cgroup_clear_directory() to not use cgroup_rm_file()
    - instead of going through the list of files, get the file list using the
    subsystems
    - use 'subsys_mask' instead of {added,removed}_bits and made
    cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
    v3:
    - refresh patches after recent refactoring

    Original-patch-by: Li Zefan
    Cc: Li Zefan
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Lennart Poettering
    Signed-off-by: Li Zefan
    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tejun Heo

    Aristeu Rozanski
     

25 Jul, 2012

1 commit


14 Jul, 2012

2 commits

  • Pass mount flags to sget() so that it can use them in initialising a new
    superblock before the set function is called. They could also be passed to the
    compare function.

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • Just the flags; only NFS cares even about that, but there are
    legitimate uses for such argument. And getting rid of that
    completely would require splitting ->lookup() into a couple
    of methods (at least), so let's leave that alone for now...

    Signed-off-by: Al Viro

    Al Viro
     

10 Jul, 2012

1 commit

  • While refactoring cgroup file removal path, 05ef1d7c4a "cgroup:
    introduce struct cfent" incorrectly changed the @dir argument of
    simple_unlink() to the inode of the file being deleted instead of that
    of the containing directory.

    The effect of this bug is minor - ctime and mtime of the parent
    weren't properly updated on file deletion.

    Fix it by using @cgrp->dentry->d_inode instead.

    Signed-off-by: Tejun Heo
    Reported-by: Al Viro
    Acked-by: Li Zefan
    Cc: stable@vger.kernel.org

    Tejun Heo
     

08 Jul, 2012

2 commits

  • 48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
    optional" allowed a css to linger after the associated cgroup is
    removed. As a css holds a reference on the cgroup's dentry, it means
    that cgroup dentries may linger for a while.

    Destroying a superblock which has dentries with positive refcnts is a
    critical bug and triggers BUG() in vfs code. As each cgroup dentry
    holds an s_active reference, any lingering cgroup has both its dentry
    and the superblock pinned, thus preventing premature release of the
    superblock.

    Unfortunately, after 48ddbe1946, there's a small window while
    releasing a cgroup which is directly under the root of the hierarchy.
    When a cgroup directory is released, vfs layer first deletes the
    corresponding dentry and then invokes dput() on the parent, which may
    recurse further, so when a cgroup directly below root cgroup is
    released, the cgroup is first destroyed - which releases the s_active
    it was holding - and then the dentry for the root cgroup is dput().

    This creates a window where the root dentry's refcnt isn't zero but
    superblock's s_active is. If umount happens before or during this
    window, vfs will see the root dentry with non-zero refcnt and trigger
    BUG().

    Before 48ddbe1946, this problem didn't exist because the last dentry
    reference was guaranteed to be put synchronously from rmdir(2)
    invocation which holds s_active around the whole process.

    Fix it by holding an extra superblock->s_active reference across
    dput() from css release, which is the dput() path added by 48ddbe1946
    and the only one which doesn't hold an extra s_active ref across the
    final cgroup dput().

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Reported-by: shyju pv
    Tested-by: shyju pv
    Cc: Sasha Levin
    Acked-by: Li Zefan

    Tejun Heo
     
  • This reverts commit fa980ca87d15bb8a1317853f257a505990f3ffde. The
    commit was an attempt to fix a race condition where a cgroup hierarchy
    may be unmounted with positive dentry reference on root cgroup. While
    the commit made the race condition slightly more difficult to trigger,
    the race was still there and could be reliably triggered using a
    different test case.

    Revert the incorrect fix. The next commit will describe the race and
    fix it correctly.

    Signed-off-by: Tejun Heo
    LKML-Reference:
    Reported-by: shyju pv
    Cc: Sasha Levin
    Acked-by: Li Zefan

    Tejun Heo
     

19 Jun, 2012

1 commit

  • When we fixed the race between atomic_dec and css_refcnt, we missed
    the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
    the actual reference count. This can potentially cause a refcount leak
    if __css_put races with cgroup_clear_css_refs.

    Signed-off-by: Salman Qazi
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo

    Salman Qazi
     

07 Jun, 2012

2 commits

  • It was introduced for memcg to iterate cgroup hierarchy without
    holding cgroup_mutex, but soon after that it was replaced with
    a lockless way in memcg.

    No one has used hierarchy_mutex since then, so remove it.

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     
  • __css_put is using atomic_dec on the ref count, and then
    looking at the ref count to make decisions. This is prone
    to races, as someone else may decrement ref count between
    our decrement and our decision. Instead, we should base our
    decisions on the value that we decremented the ref count to.

    (This results in an actual race on Google's kernel which I
    haven't been able to reproduce on the upstream kernel. Having
    said that, it's still incorrect by inspection).
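
The difference between the two shapes can be sketched with C11 atomics. This is a generic reference-count illustration, not the kernel's __css_put; the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Racy shape described in the text: decrement, then re-read the counter.
 * Another thread may decrement between the two steps, so the value read
 * is not necessarily the one this call produced. */
static bool ref_put_racy(atomic_int *ref)
{
    atomic_fetch_sub(ref, 1);
    return atomic_load(ref) == 0;
}

/* Decide on the value this particular decrement produced.
 * atomic_fetch_sub() returns the old value, so subtract one more. */
static bool ref_put(atomic_int *ref)
{
    int newval = atomic_fetch_sub(ref, 1) - 1;

    assert(newval >= 0);
    return newval == 0;
}
```

With `ref_put()`, exactly one caller observes the transition to zero, because each caller tests a value that only its own decrement could have produced.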

    Signed-off-by: Salman Qazi
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Salman Qazi
     

06 Jun, 2012

1 commit

  • Pull cgroup fix from Tejun Heo:
    "This fixes the possible premature superblock release on umount bug
    mentioned during v3.5-rc1 pull request.

    Originally, cgroup dentry destruction path assumed that cgroup dentry
    didn't have any reference left after cgroup removal thus put super
    during dentry removal. Now that there can be lingering dentry
    references, this led to super being put with live dentries. This
    patch fixes the problem by putting super ref on dentry release instead
    of removal."

    * 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: superblock can't be released with active dentries

    Linus Torvalds
     

30 May, 2012

1 commit

  • Library functions should not grab locks when the callsites can do it,
    even if the lock nests like the rcu read-side lock does.

    Push the rcu_read_lock() from css_is_ancestor() to its single user,
    mem_cgroup_same_or_subtree() in preparation for another user that may
    already hold the rcu read-side lock.

    Signed-off-by: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

28 May, 2012

1 commit

  • 48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
    optional" allowed a css to linger after the associated cgroup is
    removed. As a css holds a reference on the cgroup's dentry, it means
    that cgroup dentries may linger for a while.

    cgroup_create() does grab an active reference on the superblock to
    prevent it from going away while there are !root cgroups; however, the
    reference is put from cgroup_diput() which is invoked on cgroup
    removal, so cgroup dentries which are removed but persisting due to
    lingering csses already have released their superblock active refs
    allowing superblock to be killed while those dentries are around.

    Given the right condition, this makes cgroup_kill_sb() call
    kill_litter_super() with dentries with non-zero d_count leading to
    BUG() in shrink_dcache_for_umount_subtree().

    Fix it by adding cgroup_dops->d_release() operation and moving
    deactivate_super() to it. cgroup_diput() now marks dentry->d_fsdata
    with itself if superblock should be deactivated and cgroup_d_release()
    deactivates the superblock on dentry release.

    Signed-off-by: Tejun Heo
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    LKML-Reference:
    Acked-by: Li Zefan

    Tejun Heo
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes, performance-wise or
    operationally, with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these values specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."
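
The handling of unmappable ids described in the highlights can be sketched as follows. This is a simplified model with a single mapping extent; the struct, the function names and the overflow id of 65534 are assumptions for the illustration rather than the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_ID  ((uint32_t)-1)   /* reserved as the internal error value */
#define OVERFLOW_ID 65534u           /* assumed default overflow uid */

/* One mapping extent: `count` ids starting at `lower_first` on the
 * kernel side appear as ids starting at `first` inside the namespace. */
struct id_extent { uint32_t first, lower_first, count; };

/* Translate a kernel-side id into the id seen inside the namespace. */
static uint32_t map_id_up(const struct id_extent *ext, uint32_t kid)
{
    assert(ext);
    if (kid >= ext->lower_first && kid - ext->lower_first < ext->count)
        return ext->first + (kid - ext->lower_first);
    return INVALID_ID;
}

/* stat-like reporting: an unmappable id is shown as the overflow id. */
static uint32_t id_for_stat(const struct id_extent *ext, uint32_t kid)
{
    uint32_t id = map_id_up(ext, kid);

    return id == INVALID_ID ? OVERFLOW_ID : id;
}
```

A setuid-like path would instead treat INVALID_ID as a failure, which matches the distinction drawn between setuid and stat in the message.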

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

16 May, 2012

1 commit


24 Apr, 2012

1 commit

  • Allowing kthreadd to be moved to a non-root group makes no sense, it being
    a global resource, and needlessly leads unsuspecting users toward trouble.

    1. An RT workqueue worker thread spawned in a task group with no rt_runtime
    allocated is not schedulable. Simple user error, but harmful to the box.

    2. A worker thread which acquires PF_THREAD_BOUND can never leave a cpuset,
    rendering the cpuset immortal.

    Save the user some unexpected trouble, just say no.

    Signed-off-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo

    Mike Galbraith
     

12 Apr, 2012

1 commit


02 Apr, 2012

12 commits

  • Currently, cgroup removal tries to drain all css references. If there
    are active css references, the removal logic waits and retries
    ->pre_destroy() until either all refs drop to zero or removal is
    cancelled.

    This semantics is unusual and adds non-trivial complexity to cgroup
    core and IMHO is fundamentally misguided in that it couples internal
    implementation details (references to internal data structure) with
    externally visible operation (rmdir). To userland, this is a behavior
    peculiarity which is unnecessary and difficult to expect (css refs are
    otherwise invisible from userland), and, to policy implementations,
    this is an unnecessary restriction (e.g. blkcg wants to hold css refs
    for caching purposes but can't as that becomes visible as rmdir hang).

    Unfortunately, memcg currently depends on ->pre_destroy() retrials and
    cgroup removal vetoing and can't be immediately switched to the new
    behavior. This patch introduces the new behavior of not waiting for
    css refs to drain and maintains the old behavior for subsystems which
    have __DEPRECATED_clear_css_refs set.

    Once memcg is updated, we can drop the code paths for the old
    behavior as proposed in the following patch. Note that the following
    patch is incorrect in that dput work item is in cgroup and may lose
    some of dputs when multiple css's are released back-to-back, and
    __css_put() triggers check_for_release() when refcnt reaches 0 instead
    of 1; however, it shows what part can be removed.

    http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251

    Note that, in not-too-distant future, cgroup core will start emitting
    warning messages for subsys which require the old behavior, so please
    get moving.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     
  • When a cgroup is about to be removed, cgroup_clear_css_refs() is
    called to check and ensure that there are no active css references.

    This is currently achieved by dropping the refcnt to zero iff it has
    only the base ref. If all css refs could be dropped to zero, ref
    clearing is successful and CSS_REMOVED is set on all css. If not, the
    base ref is restored. While css ref is zero w/o CSS_REMOVED set, any
    css_tryget() attempt on it busy loops so that they are atomic
    w.r.t. the whole css ref clearing.

    This does work but dropping and re-instating the base ref is somewhat
    hairy and makes it difficult to add more logic to the put path as
    there are two of them - the regular css_put() and the reversible base
    ref clearing.

    This patch updates css ref clearing such that blocking new
    css_tryget() and putting the base ref are separate operations.
    CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
    css_tryget() busy loops while refcnt is negative. After all css refs
    are deactivated, if they were all one, ref clearing succeeded and
    CSS_REMOVED is set and the base ref is put using the regular
    css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
    and the original positive values are restored.

    css_refcnt() accessor which always returns the unbiased positive
    reference counts is added and used to simplify refcnt usages. While
    at it, relocate and reformat comments in cgroup_has_css_refs().

    This separates css->refcnt deactivation and putting the base ref,
    which enables the next patch to make ref clearing optional.
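
The bias scheme can be modeled with C11 atomics as below. This is a single-shot sketch under simplifying assumptions: the helper names other than css_refcnt() and CSS_DEACT_BIAS are hypothetical, CSS_REMOVED is not modeled, and the tryget reports failure where the text describes a busy loop.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CSS_DEACT_BIAS INT_MIN

/* Unbiased count, whether or not the bias is currently applied. */
static int css_refcnt(atomic_int *refcnt)
{
    int v = atomic_load(refcnt);

    return v >= 0 ? v : v - CSS_DEACT_BIAS;
}

/* Adding the bias makes the raw value negative and blocks new gets. */
static void css_deactivate(atomic_int *refcnt)
{
    assert(atomic_load(refcnt) >= 0);
    atomic_fetch_add(refcnt, CSS_DEACT_BIAS);
}

/* Subtracting the bias restores the original positive value. */
static void css_reactivate(atomic_int *refcnt)
{
    assert(atomic_load(refcnt) < 0);
    atomic_fetch_sub(refcnt, CSS_DEACT_BIAS);
}

/* One attempt at taking a reference; fails while deactivated. */
static bool css_tryget_once(atomic_int *refcnt)
{
    int v = atomic_load(refcnt);

    if (v <= 0)
        return false;
    return atomic_compare_exchange_strong(refcnt, &v, v + 1);
}
```

Because the bias only shifts the raw value, deactivation can be undone without losing references taken before it, which is what makes the ref clearing reversible.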

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Implement cgroup_rm_cftypes() which removes an array of cftypes from a
    subsystem. It can be called whether the target subsys is attached or
    not. cgroup core will remove the specified file from all existing
    cgroups.

    This will be used to improve sub-subsys modularity and will be helpful
    for unified hierarchy.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • This patch adds cfent (cgroup file entry) which is the association
    between a cgroup and a file. This is the in-cgroup representation of
    files under a cgroup directory. This simplifies walking
    cgroup files and thus cgroup_clear_directory(), which is now
    implemented in two parts - cgroup_rm_file() and a loop around it.

    cgroup_rm_file() will be used to implement cftype removal and cfent is
    scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
    "sever" semantics).

    v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
    if the file had openers at the time of removal. Moved to
    cgroup_diput().

    - cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
    wasn't empty after removing all files. This triggered
    spuriously if some files were open during directory clearing.
    Removed.

    v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
    spuriously triggered for root cgroups because they don't go
    through cgroup_clear_directory() on unmount. Don't trigger WARN
    for root cgroups.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Glauber Costa

    Tejun Heo
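
    The "remove one file, then loop" structure from the commit above
    can be sketched in plain C as follows. This is an illustration
    only: struct file_entry, struct dir and the dir_*() helpers are
    hypothetical stand-ins, and the sketch frees an entry immediately
    on removal, so it deliberately ignores the open-file lifetime issue
    that the v2 note addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-directory file entry, loosely modelled on the cfent idea. */
struct file_entry {
	const char *name;
	struct file_entry *next;
};

struct dir {
	struct file_entry *files;	/* head of this directory's entry list */
};

/* Record a file under the directory; returns false on allocation failure. */
static bool dir_add_file(struct dir *d, const char *name)
{
	struct file_entry *fe = malloc(sizeof(*fe));

	if (!fe)
		return false;
	fe->name = name;
	fe->next = d->files;
	d->files = fe;
	return true;
}

/* Remove a single named entry; returns false if no such entry exists. */
static bool dir_rm_file(struct dir *d, const char *name)
{
	struct file_entry **pp;

	for (pp = &d->files; *pp; pp = &(*pp)->next) {
		if (strcmp((*pp)->name, name) == 0) {
			struct file_entry *fe = *pp;

			*pp = fe->next;
			free(fe);
			return true;
		}
	}
	return false;
}

/* Clearing the directory is just a loop around single-file removal. */
static void dir_clear(struct dir *d)
{
	while (d->files)
		dir_rm_file(d, d->files->name);
}
```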
     
  • Move the two macros upwards as they'll be used earlier in the file.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • No controller is using cgroup_add_file[s](). Unexport them, and
    convert cgroup_add_files() to handle a NULL entry terminated array
    instead of taking an explicit count, and to continue creation on
    failure, for internal use.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
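
    A small sketch of the calling convention described in the commit
    above: walk an array until a terminating entry instead of taking a
    count, and keep going when one entry fails. All names here
    (struct file_type, create_one(), add_files()) are hypothetical, the
    NULL name as terminator is an assumption of the sketch, and
    returning the first error is one possible policy rather than a
    statement about what the kernel function returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical file type; an entry with a NULL name terminates the array. */
struct file_type {
	const char *name;
};

/* Stand-in for file creation: names starting with '!' fail, for demonstration. */
static int create_one(const struct file_type *ft)
{
	return ft->name[0] == '!' ? -1 : 0;
}

/*
 * Create every entry up to the terminator. A failure does not stop the
 * walk; the first error seen is remembered and returned at the end.
 */
static int add_files(const struct file_type *fts, int *created)
{
	int ret = 0;

	for (const struct file_type *ft = fts; ft->name != NULL; ft++) {
		int err = create_one(ft);

		if (err) {
			if (!ret)
				ret = err;
			continue;
		}
		(*created)++;
	}
	return ret;
}
```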
     
  • Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
    net_cls and device controllers to use the new cftype based interface.
    Termination entry is added to cftype arrays and populate callbacks are
    replaced with cgroup_subsys->base_cftypes initializations.

    This is a functionally identical transformation. There shouldn't be any
    visible behavior change.

    memcg is rather special and will be converted separately.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "David S. Miller"
    Cc: Vivek Goyal

    Tejun Heo
     
  • Now that cftype can express whether a file should only be on root,
    cft_release_agent can be merged into the base files cftypes array.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Currently, cgroup directories are populated by subsys->populate()
    callback explicitly creating files on each cgroup creation. This
    level of flexibility isn't needed or desirable. It provides largely
    unused flexibility which invites abuse while severely limiting what
    the core layer can do through the lack of structure and conventions.

    For each cgroup file type, the only distinction that cgroup users
    make is whether a cgroup is root or not, which can easily be
    expressed with flags.

    This patch introduces cgroup_add_cftypes(). It deals with cftypes
    instead of individual files - controllers indicate that certain types
    of files exist for a certain subsystem. Newly added CFTYPE_*_ON_ROOT
    flags indicate whether a cftype should be excluded or created only on
    the root cgroup.

    cgroup_add_cftypes() can be called any time whether the target
    subsystem is currently attached or not. cgroup core will create files
    on the existing cgroups as necessary.

    Also, cgroup_subsys->base_cftypes is added to ease registration of the
    base files for the subsystem. If non-NULL on subsys init, the cftypes
    pointed to by ->base_cftypes are automatically registered on subsys
    init / load.

    Further patches will convert the existing users and remove the file
    based interface. Note that this interface allows dynamic addition of
    files to an active controller. This will be used for sub-controller
    modularity and unified hierarchy in the longer term.

    This patch implements the new mechanism but doesn't apply it to any
    user.

    v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
    cgroup_subsys->base_cftypes, which works better for a cgroup_subsys
    that is loaded as a module.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
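
    To make the flag-based approach in the commit above concrete, here
    is a minimal sketch of how a declarative file-type table with
    root-only and non-root flags could be filtered per cgroup. The
    names FT_ONLY_ON_ROOT, FT_NOT_ON_ROOT, struct ftype and the helper
    functions are hypothetical, as is the "limit" file; only the idea
    of a terminated array plus per-type root flags comes from the
    commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags mirroring the "only on root" / "not on root" idea. */
#define FT_ONLY_ON_ROOT	(1U << 0)
#define FT_NOT_ON_ROOT	(1U << 1)

struct ftype {
	const char *name;	/* NULL name terminates the array */
	unsigned int flags;
};

/* Decide whether a file type should exist in a given cgroup. */
static bool ftype_applies(const struct ftype *ft, bool is_root)
{
	if ((ft->flags & FT_ONLY_ON_ROOT) && !is_root)
		return false;
	if ((ft->flags & FT_NOT_ON_ROOT) && is_root)
		return false;
	return true;
}

/* Count the files that would be created for a cgroup from a type table. */
static size_t count_files(const struct ftype *fts, bool is_root)
{
	size_t n = 0;

	for (; fts->name != NULL; fts++)
		if (ftype_applies(fts, is_root))
			n++;
	return n;
}

/* Example table: a controller describes its files once, declaratively. */
static const struct ftype example_files[] = {
	{ "tasks", 0 },
	{ "release_agent", FT_ONLY_ON_ROOT },
	{ "limit", FT_NOT_ON_ROOT },	/* hypothetical file name */
	{ NULL, 0 },
};
```

    Because the table is data rather than a callback, a core layer can
    apply it to cgroups that already exist as well as to new ones, which
    is the property the commit message relies on for dynamic addition.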
     
  • Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
    going through cgroup->allcg_node. The list is protected by
    cgroup_mutex and will be used to improve cgroup file handling.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • cgroup_populate_dir() currently clears all files and then repopulates
    the directory; however, the clearing part is only useful when it's
    called from cgroup_remount(). Relocate the invocation to
    cgroup_remount().

    This is to prepare for further cgroup file handling updates.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • This patch marks the following features for deprecation.

    * Rebinding subsys by remount: Never reached a useful state - only works
    on empty hierarchies.

    * release_agent update by remount: release_agent itself will be
    replaced with conventional fsnotify notification.

    v2: Lennart pointed out that "name=" is necessary for mounts w/o any
    controller attached. Drop "name=" deprecation.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Lennart Poettering

    Tejun Heo
     

30 Mar, 2012

1 commit

  • 61d1d219c4 "cgroup: remove extra calls to find_existing_css_set" made
    cgroup_task_migrate() return void. An unfortunate side effect was
    that cgroup_attach_task() was depending on that function's return
    value to clear its @retval on the success path. On cgroup mounts
    without any subsystem with ->can_attach() callback,
    cgroup_attach_task() ended up returning @retval without initializing
    it on success.

    For some reason, gcc failed to warn about it and it didn't cause
    cgroup_attach_task() to return a non-zero value in many cases, probably
    due to differences in register allocation. When the problem
    materializes, systemd fails to populate /systemd cgroup mount and
    fails to boot.

    Fix it by initializing @retval to zero on declaration.

    Signed-off-by: Tejun Heo
    Reported-by: Jiri Kosina
    LKML-Reference:
    Reviewed-by: Mandeep Singh Baines
    Acked-by: Li Zefan

    Tejun Heo
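
    The bug pattern fixed by the commit above can be shown with a small
    standalone sketch. This is not the kernel function: attach(),
    can_attach_fn, allow() and deny() are hypothetical names. The point
    is that when every callback slot is empty the loop never assigns
    retval, so only the initializer gives the success path a defined
    result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical veto callback: returns 0 to allow, non-zero to refuse. */
typedef int (*can_attach_fn)(void);

static int allow(void) { return 0; }
static int deny(void)  { return -1; }

/*
 * Run the optional checks, then "migrate". The migration step is
 * modelled as returning nothing, so it cannot reset retval for us.
 */
static int attach(const can_attach_fn *checks, size_t n)
{
	int retval = 0;	/* the fix: initialise at declaration */

	for (size_t i = 0; i < n; i++) {
		if (!checks[i])
			continue;	/* no callback: retval is left untouched */
		retval = checks[i]();
		if (retval)
			return retval;
	}

	/* ... migration would happen here, with no return value ... */

	return retval;
}
```

    Without the "= 0" initializer, calling attach() with only empty
    callback slots would read an indeterminate value, which matches the
    intermittent failures described in the commit message.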
     

23 Mar, 2012

1 commit

  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

1 commit

  • Remove lock and unlock around css_get_next()'s call to idr_get_next().
    memcg iterators (only users of css_get_next) already did rcu_read_lock(),
    and its comment demands that; but add a WARN_ON_ONCE to make sure of it.

    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Li Zefan
    Cc: Eric Dumazet
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins