01 Oct, 2013

1 commit

  • In commit 8ed781668dd49 ("flow_keys: include thoff into flow_keys for
    later usage"), we missed that existing code was using nhoff as a
    temporary variable that could not always contain transport header
    offset.

    This is not a problem for TCP/UDP because port offset (@poff)
    is 0 for these protocols.

    Signed-off-by: Eric Dumazet
    Cc: Daniel Borkmann
    Cc: Nikolay Aleksandrov
    Acked-by: Nikolay Aleksandrov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Sep, 2013

2 commits

  • A host might need net_secret[] and never open a single socket.

    Problem added in commit aebda156a570782
    ("net: defer net_secret[] initialization")

    Based on prior patch from Hannes Frederic Sowa.

    Reported-by: Hannes Frederic Sowa
    Signed-off-by: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There is currently serialization network namespaces exiting and
    network devices exiting as the final part of netdev_run_todo does not
    happen under the rtnl_lock. This is compounded by the fact that the
    only list of devices unregistering in netdev_run_todo is local to the
    netdev_run_todo.

    This lack of serialization in extreme cases results in network devices
    unregistering in netdev_run_todo after the loopback device of their
    network namespace has been freed (making dst_ifdown unsafe), and after
    the their network namespace has exited (making the NETDEV_UNREGISTER,
    and NETDEV_UNREGISTER_FINAL callbacks unsafe).

    Add the missing serialization by a per network namespace count of how
    many network devices are unregistering and having a wait queue that is
    woken up whenever the count is decreased. The count and wait queue
    allow default_device_exit_batch to wait until all of the unregistration
    activity for a network namespace has finished before proceeding to
    unregister the loopback device and then allowing the network namespace
    to exit.

    Only a single global wait queue is used because there is a single global
    lock, and there is a single waiter, per network namespace wait queues
    would be a waste of resources.

    The per network namespace count of unregistering devices gives a
    progress guarantee because the number of network devices unregistering
    in an exiting network namespace must ultimately drop to zero (assuming
    network device unregistration completes).

    The basic logic remains the same as in v1. This patch is now half
    comment and half rtnl_lock_unregistering an expanded version of
    wait_event performs no extra work in the common case where no network
    devices are unregistering when we get to default_device_exit_batch.

    Reported-by: Francesco Ruggeri
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

20 Sep, 2013

1 commit

  • I've been hitting a NULL ptr deref while using netconsole because the
    np->dev check and the pointer manipulation in netpoll_cleanup are done
    without rtnl and the following sequence happens when having a netconsole
    over a vlan and we remove the vlan while disabling the netconsole:
    CPU 1 CPU2
    removes vlan and calls the notifier
    enters store_enabled(), calls
    netdev_cleanup which checks np->dev
    and then waits for rtnl
    executes the netconsole netdev
    release notifier making np->dev
    == NULL and releases rtnl
    continues to dereference a member of
    np->dev which at this point is == NULL

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

13 Sep, 2013

1 commit


12 Sep, 2013

1 commit

  • commit 416186fbf8c5b4e4465 ("net: Split core bits of netdev_pick_tx
    into __netdev_pick_tx") added a bug that disables caching of queue
    index in the socket.

    This is the source of packet reorders for TCP flows, and
    again this is happening more often when using FQ pacing.

    Old code was doing

    if (queue_index != old_index)
    sk_tx_queue_set(sk, queue_index);

    Alexander renamed the variables but forgot to change sk_tx_queue_set()
    2nd parameter.

    if (queue_index != new_index)
    sk_tx_queue_set(sk, queue_index);

    This means we store -1 over and over in sk->sk_tx_queue_mapping

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Sep, 2013

1 commit

  • Pull namespace changes from Eric Biederman:
    "This is an assorted mishmash of small cleanups, enhancements and bug
    fixes.

    The major theme is user namespace mount restrictions. nsown_capable
    is killed as it encourages not thinking about details that need to be
    considered. A very hard to hit pid namespace exiting bug was finally
    tracked and fixed. A couple of cleanups to the basic namespace
    infrastructure.

    Finally there is an enhancement that makes per user namespace
    capabilities usable as capabilities, and an enhancement that allows
    the per userns root to nice other processes in the user namespace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Kill nsown_capable it makes the wrong thing easy
    capabilities: allow nice if we are privileged
    pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
    userns: Allow PR_CAPBSET_DROP in a user namespace.
    namespaces: Simplify copy_namespaces so it is clear what is going on.
    pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
    sysfs: Restrict mounting sysfs
    userns: Better restrictions on when proc and sysfs can be mounted
    vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
    kernel/nsproxy.c: Improving a snippet of code.
    proc: Restrict mounting the proc filesystem
    vfs: Lock in place mounts from more privileged users

    Linus Torvalds
     

06 Sep, 2013

2 commits

  • Pull networking changes from David Miller:
    "Noteworthy changes this time around:

    1) Multicast rejoin support for team driver, from Jiri Pirko.

    2) Centralize and simplify TCP RTT measurement handling in order to
    reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
    both timestamps and local RTT measurements are available prefer
    the later because there are broken middleware devices which
    scramble the timestamp.

    From Yuchung Cheng.

    3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
    memory consumed to queue up unsend user data. From Eric Dumazet.

    4) Add a "physical port ID" abstraction for network devices, from
    Jiri Pirko.

    5) Add a "suppress" operation to influence fib_rules lookups, from
    Stefan Tomanek.

    6) Add a networking development FAQ, from Paul Gortmaker.

    7) Extend the information provided by tcp_probe and add ipv6 support,
    from Daniel Borkmann.

    8) Use RCU locking more extensively in openvswitch data paths, from
    Pravin B Shelar.

    9) Add SCTP support to openvswitch, from Joe Stringer.

    10) Add EF10 chip support to SFC driver, from Ben Hutchings.

    11) Add new SYNPROXY netfilter target, from Patrick McHardy.

    12) Compute a rate approximation for sending in TCP sockets, and use
    this to more intelligently coalesce TSO frames. Furthermore, add
    a new packet scheduler which takes advantage of this estimate when
    available. From Eric Dumazet.

    13) Allow AF_PACKET fanouts with random selection, from Daniel
    Borkmann.

    14) Add ipv6 support to vxlan driver, from Cong Wang"

    Resolved conflicts as per discussion.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
    openvswitch: Fix alignment of struct sw_flow_key.
    netfilter: Fix build errors with xt_socket.c
    tcp: Add missing braces to do_tcp_setsockopt
    caif: Add missing braces to multiline if in cfctrl_linkup_request
    bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
    vxlan: Fix kernel panic on device delete.
    net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
    net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
    icplus: Use netif_running to determine device state
    ethernet/arc/arc_emac: Fix huge delays in large file copies
    tuntap: orphan frags before trying to set tx timestamp
    tuntap: purge socket error queue on detach
    qlcnic: use standard NAPI weights
    ipv6:introduce function to find route for redirect
    bnx2x: VF RSS support - VF side
    bnx2x: VF RSS support - PF side
    vxlan: Notify drivers for listening UDP port changes
    net: usbnet: update addr_assign_type if appropriate
    driver/net: enic: update enic maintainers and driver
    driver/net: enic: Exposing symbols for Cisco's low latency driver
    ...

    Linus Torvalds
     
  • Conflicts:
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    net/bridge/br_multicast.c
    net/ipv6/sit.c

    The conflicts were minor:

    1) sit.c changes overlap with change to ip_tunnel_xmit() signature.

    2) br_multicast.c had an overlap between computing max_delay using
    msecs_to_jiffies and turning MLDV2_MRC() into an inline function
    with a name using lowercase instead of uppercase letters.

    3) stmmac had two overlapping changes, one which conditionally allocated
    and hooked up a dma_cfg based upon the presence of the pbl OF property,
    and another one handling store-and-forward DMA made. The latter of
    which should not go into the new of_find_property() basic block.

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Sep, 2013

5 commits

  • Currently we're linking upper devices to lower ones, which results in
    upside-down relationship: upper devices seeing lower devices via its upper
    lists.

    Fix this by correctly linking lower devices to the upper ones.

    CC: "David S. Miller"
    CC: Eric Dumazet
    CC: Jiri Pirko
    CC: Alexander Duyck
    CC: Cong Wang
    Signed-off-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    Veaceslav Falico
     
  • This function was only used when a packet was sent to another netns. Now, it can
    also be used after tunnel encapsulation or decapsulation.

    Only skb_orphan() should not be done when a packet is not crossing netns.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • This config option is superfluous in that it only guards a call
    to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
    change in behavior. There will now be call to __neigh_notify()
    for each ARP resolution, which has no impact unless there is a
    user space daemon waiting to receive the notification, i.e.,
    the case for which CONFIG_ARPD was designed anyways.

    Suggested-by: Eric W. Biederman
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Cc: Joe Perches
    Cc: Veaceslav Falico
    Signed-off-by: Tim Gardner
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Tim Gardner
     
  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which consists bulk of patches, and css destruction path is
    completely decoupled from cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains prepatory patches for such change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     
  • Pull driver core patches from Greg KH:
    "Here's the big driver core pull request for 3.12-rc1.

    Lots of tiny changes here fixing up the way sysfs attributes are
    created, to try to make drivers simpler, and fix a whole class race
    conditions with creations of device attributes after the device was
    announced to userspace.

    All the various pieces are acked by the different subsystem
    maintainers"

    * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
    firmware loader: fix pending_fw_head list corruption
    drivers/base/memory.c: introduce help macro to_memory_block
    dynamic debug: line queries failing due to uninitialized local variable
    sysfs: sysfs_create_groups returns a value.
    debugfs: provide debugfs_create_x64() when disabled
    rbd: convert bus code to use bus_groups
    firmware: dcdbas: use binary attribute groups
    sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
    driver core: add #include to core files.
    HID: convert bus code to use dev_groups
    Input: serio: convert bus code to use drv_groups
    Input: gameport: convert bus code to use drv_groups
    driver core: firmware: use __ATTR_RW()
    driver core: core: use DEVICE_ATTR_RO
    driver core: bus: use DRIVER_ATTR_WO()
    driver core: create write-only attribute macros for devices and drivers
    sysfs: create __ATTR_WO()
    driver-core: platform: convert bus code to use dev_groups
    workqueue: convert bus code to use dev_groups
    MEI: convert bus code to use dev_groups
    ...

    Linus Torvalds
     

31 Aug, 2013

3 commits

  • nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
    CAP_SETGID. For the existing users it doesn't noticably simplify things and
    from the suggested patches I have seen it encourages people to do the wrong
    thing. So remove nsown_capable.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • By default, the pfifo_fast queue discipline has been used by default
    for all devices. But we have better choices now.

    This patch allow setting the default queueing discipline with sysctl.
    This allows easy use of better queueing disciplines on all devices
    without having to use tc qdisc scripts. It is intended to allow
    an easy path for distributions to make fq_codel or sfq the default
    qdisc.

    This patch also makes pfifo_fast more of a first class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems who do not use the sysctl
    is unchanged, they still get pfifo_fast

    Also removes leftover random # in sysctl net core.

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     
  • commit 8728c544a9cbdc ("net: dev_pick_tx() fix") and commit
    b6fe83e9525a ("bonding: refine IFF_XMIT_DST_RELEASE capability")
    are quite incompatible : Queue selection is disabled because skb
    dst was dropped before entering bonding device.

    This causes major performance regression, mainly because TCP packets
    for a given flow can be sent to multiple queues.

    This is particularly visible when using the new FQ packet scheduler
    with MQ + FQ setup on the slaves.

    We can safely revert the first commit now that 416186fbf8c5b
    ("net: Split core bits of netdev_pick_tx into __netdev_pick_tx")
    properly caps the queue_index.

    Reported-by: Xi Wang
    Diagnosed-by: Xi Wang
    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Alexander Duyck
    Cc: Denys Fedorysychenko
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Aug, 2013

4 commits

  • This function returns the next dev in the dev->upper_dev_list after the
    struct list_head **iter position, and updates *iter accordingly. Returns
    NULL if there are no devices left.

    Caller must hold RCU read lock.

    CC: "David S. Miller"
    CC: Eric Dumazet
    CC: Jiri Pirko
    CC: Alexander Duyck
    CC: Cong Wang
    Signed-off-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    Veaceslav Falico
     
  • We already don't need it cause we see every upper/lower device in the list
    already.

    CC: "David S. Miller"
    CC: Eric Dumazet
    CC: Jiri Pirko
    CC: Alexander Duyck
    CC: Cong Wang
    Signed-off-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    Veaceslav Falico
     
  • This patch adds lower_dev_list list_head to net_device, which is the same
    as upper_dev_list, only for lower devices, and begins to use it in the same
    way as the upper list.

    It also changes the way the whole adjacent device lists work - now they
    contain *all* of upper/lower devices, not only the first level. The first
    level devices are distinguished by the bool neighbour field in
    netdev_adjacent, also added by this patch.

    There are cases when a device can be added several times to the adjacent
    list, the simplest would be:

    /---- eth0.10 ---\
    eth0- --- bond0
    \---- eth0.20 ---/

    where both bond0 and eth0 'see' each other in the adjacent lists two times.
    To avoid duplication of netdev_adjacent structures ref_nr is being kept as
    the number of times the device was added to the list.

    The 'full view' is achieved by adding, on link creation, all of the
    upper_dev's upper_dev_list devices as upper devices to all of the
    lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
    versa. On unlink they are removed using the same logic.

    I've tested it with thousands vlans/bonds/bridges, everything works ok and
    no observable lags even on a huge number of interfaces.

    Memory footprint for 128 devices interconnected with each other via both
    upper and lower (which is impossible, but for the comparison) lists would be:

    128*128*2*sizeof(netdev_adjacent) = 1.5MB

    but in the real world we usualy have at most several devices with slaves
    and a lot of vlans, so the footprint will be much lower.

    CC: "David S. Miller"
    CC: Eric Dumazet
    CC: Jiri Pirko
    CC: Alexander Duyck
    CC: Cong Wang
    Signed-off-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    Veaceslav Falico
     
  • Rename the structure to reflect the upcoming addition of lower_dev_list.

    CC: "David S. Miller"
    CC: Eric Dumazet
    CC: Jiri Pirko
    CC: Alexander Duyck
    CC: Cong Wang
    Signed-off-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    Veaceslav Falico
     

29 Aug, 2013

1 commit

  • Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
    over the net namespace. The principle here is if you create or have
    capabilities over it you can mount it, otherwise you get to live with
    what other people have mounted.

    Instead of testing this with a straight forward ns_capable call,
    perform this check the long and torturous way with kobject helpers,
    this keeps direct knowledge of namespaces out of sysfs, and preserves
    the existing sysfs abstractions.

    Acked-by: Greg Kroah-Hartman
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

28 Aug, 2013

1 commit


17 Aug, 2013

1 commit


15 Aug, 2013

2 commits


14 Aug, 2013

1 commit

  • Fix the iproute2 command `bridge vlan show`, after switching from
    rtgenmsg to ifinfomsg.

    Let's start with a little history:

    Feb 20: Vlad Yasevich got his VLAN-aware bridge patchset included in
    the 3.9 merge window.
    In the kernel commit 6cbdceeb, he added attribute support to
    bridge GETLINK requests sent with rtgenmsg.

    Mar 6th: Vlad got this iproute2 reference implementation of the bridge
    vlan netlink interface accepted (iproute2 9eff0e5c)

    Apr 25th: iproute2 switched from using rtgenmsg to ifinfomsg (63338dca)
    http://patchwork.ozlabs.org/patch/239602/
    http://marc.info/?t=136680900700007

    Apr 28th: Linus released 3.9

    Apr 30th: Stephen released iproute2 3.9.0

    The `bridge vlan show` command haven't been working since the switch to
    ifinfomsg, or in a released version of iproute2. Since the kernel side
    only supports rtgenmsg, which iproute2 switched away from just prior to
    the iproute2 3.9.0 release.

    I haven't been able to find any documentation, about neither rtgenmsg
    nor ifinfomsg, and in which situation to use which, but kernel commit
    88c5b5ce seams to suggest that ifinfomsg should be used.

    Fixing this in kernel will break compatibility, but I doubt that anybody
    have been using it due to this bug in the user space reference
    implementation, at least not without noticing this bug. That said the
    functionality is still fully functional in 3.9, when reversing iproute2
    commit 63338dca.

    This could also be fixed in iproute2, but thats an ugly patch that would
    reintroduce rtgenmsg in iproute2, and from searching in netdev it seams
    like rtgenmsg usage is discouraged. I'm assuming that the only reason
    that Vlad implemented the kernel side to use rtgenmsg, was because
    iproute2 was using it at the time.

    Signed-off-by: Asbjoern Sloth Toennesen
    Reviewed-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Asbjoern Sloth Toennesen
     

10 Aug, 2013

3 commits

  • Fix inverted check when deleting an fdb entry.

    Signed-off-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Sridhar Samudrala
     
  • Adding paged frags skbs to af_unix sockets introduced a performance
    regression on large sends because of additional page allocations, even
    if each skb could carry at least 100% more payload than before.

    We can instruct sock_alloc_send_pskb() to attempt high order
    allocations.

    Most of the time, it does a single page allocation instead of 8.

    I added an additional parameter to sock_alloc_send_pskb() to
    let other users to opt-in for this new feature on followup patches.

    Tested:

    Before patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    2304 212992 212992 10.00 46861.15

    After patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    2304 212992 212992 10.00 57981.11

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Same behavior than 802.1q : finds the encapsulated protocol and
    skip 32bit header.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Aug, 2013

5 commits

  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsytem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straight forwards but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In majority of the cases, subsystems only care about its part in the
    cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup controller API will be converted to primarily use struct
    cgroup_subsys_state instead of struct cgroup. In preparation, make
    the internal functions of netprio_cgroup pass around @css instead of
    @cgrp.

    While at it, kill struct cgroup_netprio_state which only contained
    struct cgroup_subsys_state without serving any purpose. All functions
    are converted to deal with @css directly.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Acked-by: David S. Miller

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

08 Aug, 2013

5 commits