09 Oct, 2013

1 commit


09 Aug, 2013

6 commits

  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
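
The file-operation conversion described in the commit above can be sketched in userspace C. This is a minimal mock under stated assumptions, not the kernel code: the struct layouts are simplified stand-ins, and the handler names read_classid() and write_classid() are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock types: simplified stand-ins for the kernel structures. */
struct cgroup_subsys_state { int refcnt; };
struct cftype { const char *name; };

struct cgroup_cls_state {
	struct cgroup_subsys_state css;	/* embedded handle */
	uint32_t classid;
};

/* Map a css back to the controller state that embeds it. */
struct cgroup_cls_state *css_cls_state(struct cgroup_subsys_state *css)
{
	return css ? (struct cgroup_cls_state *)
		((char *)css - offsetof(struct cgroup_cls_state, css)) : NULL;
}

/* Handlers receive the css directly; no cgroup lookup is needed. */
uint64_t read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
{
	(void)cft;
	assert(css != NULL);
	return css_cls_state(css)->classid;
}

int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
		  uint64_t value)
{
	(void)cft;
	assert(css != NULL);
	css_cls_state(css)->classid = (uint32_t)value;
	return 0;
}
```

Because the handle passed in is already the css, each handler goes straight to its own state without touching the cgroup.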
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In the majority of cases, a subsystem only cares about its part in the
    cgroup hierarchy - i.e. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • Currently, controllers have to explicitly follow the cgroup hierarchy
    to find the parent of a given css. cgroup is moving towards using
    cgroup_subsys_state as the main controller interface construct, so
    let's provide a way to climb the hierarchy using just csses.

    This patch implements css_parent() which, given a css, returns its
    parent. The function is guaranteed to return a valid non-NULL parent css as
    long as the target css is not at the top of the hierarchy.

    freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
    are converted to use css_parent() instead of accessing cgroup->parent
    directly.

    * __parent_ca() is dropped from cpuacct and its usage is replaced with
    parent_ca(). The only difference between the two was NULL test on
    cgroup->parent which is now embedded in css_parent() making the
    distinction moot. Note that eventually a css->parent field will be
    added to css and the NULL check in css_parent() will go away.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
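
The css_parent() lookup described in the commit above can be sketched in userspace C. This is a minimal mock, not the kernel implementation: the struct layouts, the subsys_id field and MOCK_SUBSYS_COUNT are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_SUBSYS_COUNT 2	/* hypothetical number of subsystems */

struct cgroup;

struct cgroup_subsys_state {
	struct cgroup *cgroup;	/* cgroup this css belongs to */
	int subsys_id;		/* index into cgroup->subsys[] */
};

struct cgroup {
	struct cgroup *parent;	/* NULL for the root cgroup */
	struct cgroup_subsys_state *subsys[MOCK_SUBSYS_COUNT];
};

/* Return the parent css, or NULL when @css sits at the top of the hierarchy. */
struct cgroup_subsys_state *css_parent(struct cgroup_subsys_state *css)
{
	struct cgroup *parent_cgrp;

	assert(css != NULL);
	parent_cgrp = css->cgroup->parent;
	return parent_cgrp ? parent_cgrp->subsys[css->subsys_id] : NULL;
}
```

A controller can then climb the hierarchy with css pointers alone, and the NULL test that used to live in helpers such as __parent_ca() sits in one place.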
     
  • css (cgroup_subsys_state) is usually embedded in a subsys specific
    data structure. Subsystems either use container_of() directly to cast
    from css to such data structure or have an accessor function wrapping
    such cast. As cgroup as a whole is moving towards using css as the main
    interface handle, add and update such accessors to ease dealing with
    css's.

    All accessors explicitly handle NULL input and return NULL in those
    cases. While this looks like an extra branch in the code, as all
    controller-specific data structures have css as the first field, the
    casting doesn't involve any offsetting and the compiler can trivially
    optimize out the branch.

    * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
    accessor. Added.

    * memory, hugetlb and devices already had one but didn't explicitly
    handle NULL input. Updated.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
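
A NULL-tolerant accessor of the kind described in the commit above might look like the following userspace sketch. The container_of() macro here is a simplified userspace equivalent, and the struct layouts are mock stand-ins rather than the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cgroup_subsys_state { int refcnt; };	/* mock */

/* Simplified userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cgroup_cls_state {
	struct cgroup_subsys_state css;	/* first field, so the offset is 0 */
	uint32_t classid;
};

/* NULL in, NULL out; otherwise the enclosing controller state. */
struct cgroup_cls_state *css_cls_state(struct cgroup_subsys_state *css)
{
	/* The NULL branch is free only while css stays the first field. */
	assert(offsetof(struct cgroup_cls_state, css) == 0);
	return css ? container_of(css, struct cgroup_cls_state, css) : NULL;
}
```

With a zero offset, both arms of the conditional yield the same address as the input, which is why the compiler can drop the branch.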
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp a large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

15 Jan, 2013

1 commit

  • Eric Dumazet pointed out that act_mirred needs to find the current net_ns,
    and struct net pointer is not provided in the call chain. His original
    patch made use of current->nsproxy->net_ns to find the network namespace,
    but this fails to work correctly for userspace code that makes use of
    netlink sockets in different network namespaces. Instead, pass the
    "struct net *" down along the call chain to where it is needed.

    This version removes the ifb changes as Eric has submitted that patch
    separately, but is otherwise identical to the previous version.

    Signed-off-by: Benjamin LaHaise
    Tested-by: Eric Dumazet
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Benjamin LaHaise
     

13 Dec, 2012

1 commit

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heuristics were chosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     

22 Nov, 2012

1 commit

  • It turns out that we'll have to live with attributes which are
    inherited at cgroup creation time but not affected by further updates
    to the parent afterwards - such attributes are already in wide use
    e.g. for cpuset.

    So, there's nothing to do for netcls_cgroup for hierarchy support.
    Its current behavior - inherit only during creation - is good enough.

    Move config inheriting from ->css_alloc() to ->css_online() for
    consistency, which doesn't change behavior at all, and remove
    .broken_hierarchy marking.

    Signed-off-by: Tejun Heo
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
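
The "inherit only during creation" behavior described in the commit above can be illustrated with a small userspace sketch. The struct cls_state type and the cls_online() name are hypothetical, and the parent pointer stands in for whatever hierarchy lookup the controller really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock controller state; the parent pointer stands in for a hierarchy lookup. */
struct cls_state {
	struct cls_state *parent;	/* NULL for the root */
	uint32_t classid;
};

/* Called once when the new cgroup comes online: copy the parent's value. */
int cls_online(struct cls_state *cs)
{
	assert(cs != NULL);
	if (cs->parent)
		cs->classid = cs->parent->classid;
	return 0;
}
```

Since the copy happens only in the online step, later writes to the parent's classid leave existing children unchanged.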
     

20 Nov, 2012

1 commit


26 Oct, 2012

1 commit

  • The cgroup logic part of net_cls is very similar to the one in
    net_prio. Let's streamline the net_cls logic with the net_prio one.

    The net_prio update logic was changed by the following commit (note there
    were some changes necessary later on)

    commit 406a3c638ce8b17d9704052c07955490f732c2b8
    Author: John Fastabend
    Date: Fri Jul 20 10:39:25 2012 +0000

    net: netprio_cgroup: rework update socket logic

    Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set but as soon as data
    is sent the kernel worker thread issues a send and sk_cgrp_prioidx
    is updated with the kernel worker thread's value which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

    Since classid is now updated when the task is moved between the
    cgroups, we don't have to call sock_update_classid() from various
    places to ensure we are always using the latest classid value.

    [v2: Use iterate_fd() instead of open coding]

    Signed-off-by: Daniel Wagner
    Cc: Li Zefan
    Cc: "David S. Miller"
    Cc: "Michael S. Tsirkin"
    Cc: Jamal Hadi Salim
    Cc: Joe Perches
    Cc: John Fastabend
    Cc: Neil Horman
    Cc: Stanislav Kinsbursky
    Cc: Tejun Heo
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Daniel Wagner
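
The idea in the commit above, updating socket classids once when a task is moved instead of on every send, can be sketched in userspace C. Everything here is a mock: mock_iterate_fd() only imitates the shape of an fd-table walk, and the mock_sock, mock_file and cls_attach() names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock file and socket types; all names here are illustrative. */
struct mock_sock { uint32_t classid; };
struct mock_file { struct mock_sock *sk; };	/* sk is NULL for non-sockets */

/* Mock fd-table walk: call @fn for each open file until it returns non-zero. */
int mock_iterate_fd(struct mock_file **files, unsigned int nfiles,
		    int (*fn)(const void *arg, struct mock_file *file,
			      unsigned int fd),
		    const void *arg)
{
	assert(files != NULL || nfiles == 0);
	for (unsigned int fd = 0; fd < nfiles; fd++) {
		int ret;

		if (!files[fd])
			continue;	/* unused descriptor slot */
		ret = fn(arg, files[fd], fd);
		if (ret)
			return ret;
	}
	return 0;
}

/* Per-file callback: stamp the new classid on sockets, skip everything else. */
int update_classid(const void *arg, struct mock_file *file, unsigned int fd)
{
	(void)fd;
	if (file->sk)
		file->sk->classid = *(const uint32_t *)arg;
	return 0;
}

/* Attach path: run once when a task is moved into a cgroup. */
void cls_attach(struct mock_file **files, unsigned int nfiles, uint32_t classid)
{
	mock_iterate_fd(files, nfiles, update_classid, &classid);
}
```

In this model the send path never touches the classid, so a socket later used by a different thread keeps the value set at attach time.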
     

03 Oct, 2012

2 commits

  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as reasonable, leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    I let the type-incompatibility compile errors (present when user
    namespaces are enabled) guide me to the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
    root in a user namespace to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
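
The kuid_t strategy described in the pull message above, a distinct type inside the kernel with conversion only at the edges, can be sketched in userspace C. This is a simplified model: struct mock_user_ns with a single mapping extent is an assumption for illustration, and the function bodies are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wrapping the integer in a struct makes kuid_t incompatible with plain ints. */
typedef struct { unsigned int val; } kuid_t;

#define INVALID_KUID ((kuid_t){ (unsigned int)-1 })

/* Mock namespace with a single mapping extent (hypothetical layout). */
struct mock_user_ns {
	unsigned int first;		/* first uid inside the namespace */
	unsigned int lower_first;	/* matching kernel-global uid */
	unsigned int count;		/* length of the mapped range */
};

bool uid_eq(kuid_t a, kuid_t b)
{
	return a.val == b.val;
}

/* Inbound edge: namespace-relative uid -> kuid_t, or INVALID_KUID. */
kuid_t make_kuid(const struct mock_user_ns *ns, unsigned int uid)
{
	assert(ns != NULL);
	if (uid < ns->first || uid - ns->first >= ns->count)
		return INVALID_KUID;
	return (kuid_t){ ns->lower_first + (uid - ns->first) };
}

/* Outbound edge: kuid_t -> namespace-relative uid, or (unsigned int)-1. */
unsigned int from_kuid(const struct mock_user_ns *ns, kuid_t kuid)
{
	assert(ns != NULL);
	if (kuid.val < ns->lower_first || kuid.val - ns->lower_first >= ns->count)
		return (unsigned int)-1;
	return ns->first + (kuid.val - ns->lower_first);
}
```

Assigning a bare integer to a kuid_t, or comparing two with ==, fails to compile. That is the kind of type error the message describes using to find unconverted code.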
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     

15 Sep, 2012

2 commits

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
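
The warn-once check described in the commit above can be sketched in userspace C. This is a mock under stated assumptions: the struct layouts, the check_broken_hierarchy() name and the message text are illustrative, and the warned_broken_hierarchy field is assumed as the once-only latch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct mock_cgroup { struct mock_cgroup *parent; };	/* NULL parent = root */

struct mock_subsys {
	const char *name;
	bool broken_hierarchy;		/* set by controllers lacking hierarchy support */
	bool warned_broken_hierarchy;	/* ensures the message is printed once */
};

/* Run after a cgroup is created under @parent; returns true if it warned. */
bool check_broken_hierarchy(struct mock_subsys *ss,
			    const struct mock_cgroup *parent)
{
	assert(ss != NULL && parent != NULL);
	/* A direct child of the root is not nested; only deeper levels warn. */
	if (!ss->broken_hierarchy || ss->warned_broken_hierarchy ||
	    !parent->parent)
		return false;

	fprintf(stderr, "%s: nested cgroups are not properly supported\n",
		ss->name);
	ss->warned_broken_hierarchy = true;
	return true;
}
```

Running the check after creation completes matches the v4 note: the creation callback gets a chance to influence the flag before it is read.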
     
  • WARNING: With this change it is impossible to load externally built
    controllers anymore.

    In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m are
    set, the corresponding subsys_id should also be a constant. Up to now,
    net_prio_subsys_id and net_cls_subsys_id would be of the type int and
    the value would be assigned during runtime.

    By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
    to IS_ENABLED, all *_subsys_id will have constant value. That means we
    need to remove all the code which assumes a value can be assigned to
    net_prio_subsys_id and net_cls_subsys_id.

    A close look is necessary on the RCU part which was introduced by the
    following patch:

    commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
    Author: Herbert Xu Mon May 24 09:12:34 2010
    Committer: David S. Miller Mon May 24 09:12:34 2010

    cls_cgroup: Store classid in struct sock

    This code was added to init_cgroup_cls()

        /* We can't use rcu_assign_pointer because this is an int. */
        smp_wmb();
        net_cls_subsys_id = net_cls_subsys.subsys_id;

    respectively to exit_cgroup_cls()

        net_cls_subsys_id = -1;
        synchronize_rcu();

    and in the module version of task_cls_classid()

        rcu_read_lock();
        id = rcu_dereference(net_cls_subsys_id);
        if (id >= 0)
                classid = container_of(task_subsys_state(p, id),
                                       struct cgroup_cls_state, css)->classid;
        rcu_read_unlock();

    All of this was added without an explicit explanation of why the RCU
    part is needed. (The rcu_dereference() was fixed by replacing it with
    rcu_dereference_index_check() in a later commit, but that is a minor
    detail.)

    So here is my pondering why it was introduced and why it is safe to
    remove it now. Note that this code was copied over to net_prio, so the
    reasoning holds for that subsystem too.

    The idea behind the RCU use for net_cls_subsys_id is to make sure we
    get a valid pointer back from task_subsys_state(). task_subsys_state()
    is just blindly accessing the subsys array and returning the
    pointer. Obviously, passing in -1 as id into task_subsys_state()
    returns an invalid value (out of lower bound).

    So this code makes sure that only after module is loaded and the
    subsystem registered, the id is assigned.

    Before unregistering the module all old readers must have left the
    critical section. This is done by assigning -1 to the id and issuing a
    synchronize_rcu(). Any new readers won't call task_subsys_state()
    anymore and therefore it is safe to unregister the subsystem.

    The new code relies on the same trick, but it looks at the subsys
    pointer returned by task_subsys_state() (remember the id is constant
    and therefore we always have a valid index into the subsys
    array).

    No precautions need to be taken during module loading.
    Eventually, all CPUs will get a valid pointer back from
    task_subsys_state() because rebind_subsystem(), which is called after
    the module init() function, will assign subsys[net_cls_subsys_id] the
    newly loaded module subsystem pointer.

    When the subsystem is about to be removed, rebind_subsystem() will be
    called before the module exit() function. In this case,
    rebind_subsystem() will assign subsys[net_cls_subsys_id] a NULL pointer
    and then call synchronize_rcu(). All old readers will have left the
    critical section by then. Any new reader won't access the subsystem
    anymore. At this point we are safe to unregister the subsystem. No
    synchronize_rcu() call is needed.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Cc: Gao feng
    Cc: Glauber Costa
    Cc: Herbert Xu
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kamezawa Hiroyuki
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
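
The reader side of the new scheme described in the commit above, a constant index plus a NULL check on the returned pointer, can be sketched in userspace C. This mock deliberately leaves out the RCU read-side locking that the commit message discusses; struct mock_task and the two constants are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NET_CLS_SUBSYS_ID 1	/* compile-time constant; the value is arbitrary here */
#define MOCK_SUBSYS_COUNT 2

struct cgroup_subsys_state { int refcnt; };	/* mock */

struct cgroup_cls_state {
	struct cgroup_subsys_state css;	/* first field */
	uint32_t classid;
};

/* Mock task: a slot stays NULL while the module's subsystem is not bound. */
struct mock_task {
	struct cgroup_subsys_state *subsys[MOCK_SUBSYS_COUNT];
};

/* The index is always valid, so only the returned pointer needs checking. */
uint32_t task_cls_classid(const struct mock_task *p)
{
	struct cgroup_subsys_state *css;

	assert(p != NULL);
	css = p->subsys[NET_CLS_SUBSYS_ID];
	/* css is the first member, so the cast recovers the enclosing state. */
	return css ? ((struct cgroup_cls_state *)css)->classid : 0;
}
```

The old code guarded the index with -1. Here the guard moves to the slot contents, which the cgroup core sets and clears around module load and unload.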
     

15 Aug, 2012

1 commit

  • cls_flow.c plays with uids and gids. Unless I misread that
    code it is possible for classifiers to depend on the specific uid and
    gid values. Therefore I need to know the user namespace of the
    netlink socket that is installing the packet classifiers. Pass
    in the rtnetlink skb so I can access the NETLINK_CB of the passed
    packet. In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).

    Pass in not the user namespace but the incoming rtnetlink skb into
    the classifier change routines as that is generally the more useful
    parameter.

    Cc: Jamal Hadi Salim
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

02 Apr, 2012

2 commits

  • Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
    net_cls and device controllers to use the new cftype based interface.
    Termination entry is added to cftype arrays and populate callbacks are
    replaced with cgroup_subsys->base_cftypes initializations.

    This is functionally identical transformation. There shouldn't be any
    visible behavior change.

    memcg is rather special and will be converted separately.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "David S. Miller"
    Cc: Vivek Goyal

    Tejun Heo
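
The "termination entry" convention mentioned in the commit above can be sketched in userspace C. The struct cftype here is a mock with only a name field, and the file names and count_cftypes() helper are hypothetical. The sketch assumes an entry with an empty name marks the end of the array.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CFTYPE_NAME 64

struct cftype { char name[MAX_CFTYPE_NAME]; };	/* mock: name only */

/* File table ending in an empty-name entry, so no separate length is needed. */
struct cftype ss_files[] = {
	{ .name = "classid" },
	{ .name = "example" },	/* hypothetical second file */
	{ .name = "" },		/* terminate */
};

/* Count entries the way a table walker would: stop at the empty name. */
size_t count_cftypes(const struct cftype *cfts)
{
	size_t n = 0;

	assert(cfts != NULL);
	for (const struct cftype *cft = cfts; cft->name[0] != '\0'; cft++)
		n++;
	return n;
}
```

A subsystem can then hand the core a bare pointer to its table instead of providing a populate callback that adds the files itself.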
     
  • blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol
    unnecessarily define cftype array and cgroup_subsys structures at the
    top of the file, which is unconventional and necessitates forward
    declaration of methods.

    This patch relocates those below the definitions of the methods and
    removes the forward declarations. Note that forward declaration of
    tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup(). This
    will be removed soon by another patch.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

03 Feb, 2012

1 commit

  • The argument is not used at all, and it's not necessary, because
    a specific callback handler of course knows which subsys it
    belongs to.

    Now only ->populate() takes this argument, because the handlers of
    this callback always call cgroup_add_file()/cgroup_add_files().

    So we reduce a few lines of code, though the shrinking of object size
    is minimal.

    16 files changed, 113 insertions(+), 162 deletions(-)

       text   data     bss      dec    hex filename
    5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
    5486170 656987 7039960 13183117 c9288d vmlinux.o

    Signed-off-by: Li Zefan
    Signed-off-by: Tejun Heo

    Li Zefan
     

06 Jul, 2011

1 commit


20 Jan, 2011

1 commit


04 Nov, 2010

1 commit

  • Somewhere along the lines net_cls_subsys_id became a macro when
    cls_cgroup is built as a module. Not only did it make cls_cgroup
    completely useless, it also causes it to crash on module unload.

    This patch fixes this by removing that macro.

    Thanks to Eric Dumazet for diagnosing this problem.

    Reported-by: Randy Dunlap
    Signed-off-by: Herbert Xu
    Reviewed-by: Li Zefan
    Signed-off-by: David S. Miller

    Herbert Xu
     

19 Oct, 2010

1 commit

  • Peter Zijlstra found a bug in the way softirq time is accounted in
    VIRT_CPU_ACCOUNTING on this thread:

    http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

    The problem is, softirq processing uses local_bh_disable internally. There
    is no way, later in the flow, to tell whether softirq is being processed or
    bh has merely been disabled. So, a hardirq when bh is disabled results in
    time being wrongly accounted as softirq.

    Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
    as well, because account_system_time() in normal tick-based accounting also
    uses softirq_count, which will be set even when not in softirq with bh disabled.

    Peter also suggested a solution: use 2*SOFTIRQ_OFFSET as the irq count
    for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while processing
    softirqs. The patch below does that and adds the API in_serving_softirq(),
    which returns whether we are currently processing softirq or not.

    Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
    to in_serving_softirq.

    Looks like many usages of in_softirq really want in_serving_softirq. Those
    changes can be made individually on a case by case basis.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
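
The counting trick described in the commit above can be modeled in userspace C. This is a sketch under assumptions: the shift value and mask, the global mock_preempt_count, and the softirq_enter()/softirq_exit() helpers are illustrative stand-ins for the kernel's per-task counter and softirq entry code.

```c
#include <assert.h>
#include <stdbool.h>

#define SOFTIRQ_SHIFT		8	/* assumed bit position of the softirq field */
#define SOFTIRQ_OFFSET		(1U << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK		(0xffU << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)

unsigned int mock_preempt_count;	/* stand-in for the per-task counter */

unsigned int softirq_count(void)
{
	return mock_preempt_count & SOFTIRQ_MASK;
}

/* Disabling bh adds 2 * SOFTIRQ_OFFSET, so the lowest softirq bit stays clear. */
void local_bh_disable(void)
{
	mock_preempt_count += SOFTIRQ_DISABLE_OFFSET;
}

void local_bh_enable(void)
{
	assert(softirq_count() >= SOFTIRQ_DISABLE_OFFSET);	/* must be balanced */
	mock_preempt_count -= SOFTIRQ_DISABLE_OFFSET;
}

/* Actually serving a softirq adds SOFTIRQ_OFFSET once, setting the lowest bit. */
void softirq_enter(void) { mock_preempt_count += SOFTIRQ_OFFSET; }
void softirq_exit(void)  { mock_preempt_count -= SOFTIRQ_OFFSET; }

bool in_softirq(void)
{
	return softirq_count() != 0;
}

bool in_serving_softirq(void)
{
	return (softirq_count() & SOFTIRQ_OFFSET) != 0;
}
```

With bh disabled but no softirq running, in_softirq() is true while in_serving_softirq() is false. The accounting code and cls_cgroup need exactly this distinction.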
     

24 May, 2010

1 commit

  • Up until now cls_cgroup has relied on fetching the classid out of
    the current executing thread. This runs into trouble when a packet
    processing is delayed in which case it may execute out of another
    thread's context.

    Furthermore, even when a packet is not delayed we may fail to
    classify it if soft IRQs have been disabled, because this scenario
    is indistinguishable from one where a packet unrelated to the
    current thread is processed by a real soft IRQ.

    In fact, the current semantics is inherently broken, as a single
    skb may be constructed out of the writes of two different tasks.
    A different manifestation of this problem is when the TCP stack
    transmits in response to an incoming ACK. This is currently
    unclassified.

    As we already have a concept of packet ownership for accounting
    purposes in the skb->sk pointer, this is a natural place to store
    the classid in a persistent manner.

    This patch adds the cls_cgroup classid in struct sock, filling up
    an existing hole on 64-bit :)

    The value is set at socket creation time. So all sockets created
    via socket(2) automatically gain the ID of the thread creating them.
    Whenever another process touches the socket by either reading or
    writing to it, we will change the socket classid to that of the
    process if it has a valid (non-zero) classid.

    For sockets created on inbound connections through accept(2), we
    inherit the classid of the original listening socket through
    sk_clone, possibly preceding the actual accept(2) call.

    In order to minimise risks, I have not made this the authoritative
    classid. For now it is only used as a backup when we execute
    with soft IRQs disabled. Once we're completely happy with its
    semantics we can use it as the sole classid.

    Footnote: I have rearranged the error path on cls_cgroup module
    creation. If we didn't do this, then there is a window where
    someone could create a tc rule using cls_cgroup before the cgroup
    subsystem has been registered.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
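
    The first rule above (gfp.h if only gfp is used, slab.h if slab is
    used) can be sketched as follows. This is a hedged illustration in C,
    not a port of slabh-sweep.py: the token lists, the enum and the
    function names are assumptions made for the example, and plain
    substring matching is far cruder than what a real sweep would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum needed_header { NEED_NONE, NEED_GFP_H, NEED_SLAB_H };

/* Return 1 if any token occurs as a substring of src. */
static int contains_any(const char *src, const char *const tokens[], size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (strstr(src, tokens[i]) != NULL)
			return 1;
	return 0;
}

/*
 * Decide which single header a source text needs. slab.h wins over
 * gfp.h because, as the message notes, slab.h includes gfp.h.
 */
static enum needed_header needed_header_for(const char *src)
{
	/* Assumed, incomplete token lists; chosen only for illustration. */
	static const char *const slab_tokens[] = { "kmalloc(", "kzalloc(", "kfree(" };
	static const char *const gfp_tokens[] = { "GFP_KERNEL", "GFP_ATOMIC" };

	assert(src != NULL);

	if (contains_any(src, slab_tokens,
			 sizeof(slab_tokens) / sizeof(slab_tokens[0])))
		return NEED_SLAB_H;
	if (contains_any(src, gfp_tokens,
			 sizeof(gfp_tokens) / sizeof(gfp_tokens[0])))
		return NEED_GFP_H;
	return NEED_NONE;
}
```

    Placement of the new include within an existing include block, the
    second rule, is deliberately left out of this sketch.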

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion
    and some needed manual addition, while for others adding it to the
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

24 Mar, 2010

1 commit

  • Allows the net_cls cgroup subsystem to be compiled as a module

    This patch modifies net/sched/cls_cgroup.c to allow the net_cls subsystem
    to be optionally compiled as a module instead of builtin. The
    cgroup_subsys struct is moved around a bit to allow the subsys_id to be
    either declared as a compile-time constant by the cgroup_subsys.h include
    in cgroup.h, or, if it's a module, initialized within the struct by
    cgroup_load_subsys.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Ben Blum
     

15 Jun, 2009

1 commit


09 Jun, 2009

1 commit

  • I found a bug in cls_cgroup_change() in cls_cgroup.c.
    cls_cgroup_change() expected tca[TCA_OPTIONS] to be set properly from
    user space, but tc in iproute2-2.6.29-1 (which I used) didn't set it.

    The current source code of tc in git sets tca[TCA_OPTIONS].

    git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git

    If we always use the newest iproute2 from git with cls_cgroup,
    we probably won't hit this oops.
    But I think the kernel shouldn't panic regardless of the user
    program's behaviour.
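
    The defensive check this argues for can be sketched in userspace C.
    The names below (MODEL_TCA_*, struct model_attr, model_change) are
    hypothetical stand-ins for the netlink attribute table, and returning
    -EINVAL for a missing attribute is an assumption of the sketch rather
    than a statement of what the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute indices, modelled on a netlink attribute table. */
enum { MODEL_TCA_UNSPEC, MODEL_TCA_KIND, MODEL_TCA_OPTIONS, MODEL_TCA_MAX };

/* Hypothetical stand-in for a parsed attribute. */
struct model_attr { int len; };

/*
 * Validate user-supplied attributes before touching them. A missing
 * options attribute is rejected instead of being dereferenced.
 */
static int model_change(struct model_attr *const tca[], size_t n)
{
	assert(tca != NULL);

	if (n <= MODEL_TCA_OPTIONS || tca[MODEL_TCA_OPTIONS] == NULL)
		return -EINVAL;

	/* From here on tca[MODEL_TCA_OPTIONS] is safe to parse. */
	return 0;
}
```

    The point of the design is that the kernel side treats every
    attribute as optional until proven present, whatever version of the
    user-space tool sent the request.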

    Signed-off-by: Minoru Usui
    Signed-off-by: David S. Miller

    Minoru Usui
     

03 Jun, 2009

1 commit


27 May, 2009

1 commit

  • Avoid reading the unsynchronized value cs->classid multiple times,
    since it could change concurrently from non-zero to zero; this would
    result in the classifier returning a positive result with a bogus
    (zero) classid.
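
    A minimal sketch of the single-read pattern, assuming a userspace
    model: the value is loaded once into a local and every later use
    works on that snapshot. C11 atomics are swapped in here to make the
    single load explicit; struct model_cls_state and
    model_classify_once() are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-cgroup classifier state. */
struct model_cls_state { _Atomic uint32_t classid; };

/*
 * Returns 0 and stores the classid in *out on a match, -1 otherwise.
 * The shared field is read exactly once, so the zero test and the
 * returned value can never disagree even if a writer races with us.
 */
static int model_classify_once(struct model_cls_state *cs, uint32_t *out)
{
	uint32_t classid;

	assert(cs != NULL && out != NULL);

	classid = atomic_load_explicit(&cs->classid, memory_order_relaxed);
	if (classid == 0)
		return -1;

	*out = classid;
	return 0;
}
```

    Had the field been read a second time for the assignment to *out, a
    concurrent reset to zero between the two reads would reproduce the
    bogus positive result described above.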

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Signed-off-by: David S. Miller

    Paul Menage
     

18 May, 2009

1 commit

  • We can remove this lock here, since we are in the cgroup write handler
    and thus the cgrp is guaranteed to be valid, and no lock is needed when
    writing a u32 variable.

    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Li Zefan
     

30 Dec, 2008

2 commits


20 Nov, 2008

1 commit

  • The use of xchg() hasn't been necessary since 2.2.something when proper
    locking was added to packet schedulers. In the case of classifiers they
    mostly weren't even necessary before that since they're mainly used
    to assign a NULL pointer to the filter root in the ->destroy path;
    the root is destroyed immediately after that.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

08 Nov, 2008

1 commit

  • The classifier should cover the most common use case and will work
    without any special configuration.

    The principle of the classifier is to directly access the
    task_struct via get_current(). In order for this to work,
    classification requests from softirqs must be ignored. This is
    not a problem because the vast majority of packets in softirq
    context are not assigned to a task anyway. For this to work, a
    mechanism is needed to trace softirq context.

    This repost goes back to the method of relying on the number of
    nested bh disable calls for the sake of not adding too much
    complexity and the option to come up with something more reliable
    if actually needed.
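
    The idea of ignoring softirq requests by looking at the bh-disable
    nesting depth can be sketched as a userspace model. Everything here
    is an assumption for illustration: the names are hypothetical, the
    depth is passed in as a plain argument, and the model assumes that a
    transmit from process context disables bh exactly once, so any other
    depth is treated as softirq context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: nesting depth produced by a single bh-disable call. */
#define MODEL_BH_DISABLE_ONCE 1u

/* Hypothetical stand-in for the task owning the classid. */
struct model_task { uint32_t classid; };

/*
 * Returns 0 and stores the classid in *classid on a match, -1 when the
 * request must be ignored (softirq context or no classid configured).
 */
static int model_classify_current(const struct model_task *cur,
				  unsigned int bh_disable_depth,
				  uint32_t *classid)
{
	assert(cur != NULL && classid != NULL);

	/* Any depth other than one is taken to mean softirq context. */
	if (bh_disable_depth != MODEL_BH_DISABLE_ONCE)
		return -1;

	/* Assumed convention: a zero classid means "not classified". */
	if (cur->classid == 0)
		return -1;

	*classid = cur->classid;
	return 0;
}
```

    The check is cheap but depends entirely on the stated assumption
    about callers, which is why the message leaves room for something
    more reliable later.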

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf