17 May, 2014

1 commit

  • cgroup in general is moving towards using cgroup_subsys_state as the
    fundamental structural component and css_parent() was introduced to
    convert from using cgroup->parent to css->parent. It was quite some
    time ago and we're moving forward with making css more prominent.

    This patch drops the trivial wrapper css_parent() and let the users
    dereference css->parent. While at it, explicitly mark fields of css
    which are public and immutable.

    v2: New usage from device_cgroup.c converted.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Peter Zijlstra
    Cc: Johannes Weiner

    Tejun Heo
     

14 May, 2014

1 commit

  • Convert all cftype->write_string() users to the new cftype->write()
    which maps directly to kernfs write operation and has full access to
    kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.

    cftype->write_string() has no user left after the conversions and
    removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     

19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

13 Feb, 2014

1 commit

  • If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
    css. The intention of the interface is to make it easy to skip css's
    (cgroup_subsys_states) which already match the migration target;
    however, this is entirely unnecessary as migration taskset doesn't
    include tasks which are already in the target cgroup. Drop @skip_css
    from cgroup_taskset_for_each().

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann

    Tejun Heo
     

08 Feb, 2014

3 commits

  • cgroup_subsys is a bit messier than it needs to be.

    * The name of a subsys can be different from its internal identifier
    defined in cgroup_subsys.h. Most subsystems use the matching name
    but three - cpu, memory and perf_event - use different ones.

    * cgroup_subsys_id enums are postfixed with _subsys_id and each
    cgroup_subsys is postfixed with _subsys. cgroup.h is widely
    included throughout various subsystems, it doesn't and shouldn't
    have claim on such generic names which don't have any qualifier
    indicating that they belong to cgroup.

    * cgroup_subsys->subsys_id should always equal the matching
    cgroup_subsys_id enum; however, we require each controller to
    initialize it and then BUG if they don't match, which is a bit
    silly.

    This patch cleans up cgroup_subsys names and initialization by doing
    the followings.

    * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
    cgroup_subsys with _cgrp_subsys.

    * With the above, renaming subsys identifiers to match the userland
    visible names doesn't cause any naming conflicts. All non-matching
    identifiers are renamed to match the official names.

    cpu_cgroup -> cpu
    mem_cgroup -> memory
    perf -> perf_event

    * controllers no longer need to initialize ->subsys_id and ->name.
    They're generated in cgroup core and set automatically during boot.

    * Redundant cgroup_subsys declarations removed.

    * While updating BUG_ON()s in cgroup_init_early(), convert them to
    WARN()s. BUGging that early during boot is stupid - the kernel
    can't print anything, even through serial console and the trap
    handler doesn't even link stack frame properly for back-tracing.

    This patch doesn't introduce any behavior changes.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra
    Acked-by: Aristeu Rozanski
    Acked-by: Ingo Molnar
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Serge E. Hallyn
    Cc: Vivek Goyal
    Cc: Thomas Graf

    Tejun Heo
     
  • With module supported dropped from net_prio, no controller is using
    cgroup module support. None of actual resource controllers can be
    built as a module and we aren't gonna add new controllers which don't
    control resources. This patch drops module support from cgroup.

    * cgroup_[un]load_subsys() and cgroup_subsys->module removed.

    * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
    cgroup_subsys.h now uses IS_ENABLED() directly.

    * enum cgroup_subsys_id now exactly matches the list of enabled
    controllers as ordered in cgroup_subsys.h.

    * cgroup_subsys[] is now a contiguously occupied array. Size
    specification is no longer necessary and dropped.

    * for_each_builtin_subsys() is removed and for_each_subsys() is
    updated to not require any locking.

    * module ref handling is removed from rebind_subsystems().

    * Module related comments dropped.

    v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core").

    v3: Added {} around the if (need_forkexit_callback) block in
    cgroup_post_fork() for readability as suggested by Li.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • net_prio is the only cgroup which is allowed to be built as a module.
    The savings from allowing one controller to be built as a module are
    tiny especially given that cgroup module support itself adds quite a
    bit of complexity.

    Given that none of other controllers has much chance of being made a
    module and that we're unlikely to add new modular controllers, the
    added complexity is simply not justifiable.

    As a first step to drop cgroup module support, this patch changes the
    config option to bool from tristate and drops module related code from
    it.

    Also, while an earlier commit fe1217c4f3f7 ("net: net_cls: move
    cgroupfs classid handling into core") dropped module support from
    net_cls cgroup, it retained a call to cgroup_load_subsys(), which is
    noop for built-in controllers. Drop it along with
    init_netclassid_cgroup().

    v2: Removed modular version of task_netprioidx() in
    include/net/netprio_cgroup.h as suggested by Li Zefan.

    v3: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core"). net_cls cgroup part is mostly
    dropped except for removal of init_netclassid_cgroup().

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Thomas Graf

    Tejun Heo
     

26 Jan, 2014

1 commit

  • Pull networking updates from David Miller:

    1) BPF debugger and asm tool by Daniel Borkmann.

    2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

    3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

    4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match. From Ben Hutchings.

    5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

    6) Implement auto corking in TCP, from Eric Dumazet. Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

    7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

    8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

    9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

    10) Fix ipv6 router reachability probing, from Jiri Benc.

    11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

    12) Support tunneling in GRO layer, from Jerry Chu.

    13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

    14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI. From Atzm Watanabe.

    15) New "Heavy Hitter" qdisc, from Terry Lam.

    16) Significantly improve the IPSEC support in pktgen, from Fan Du.

    17) Allow ipv4 tunnels to cache routes, just like sockets. From Tom
    Herbert.

    18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

    19) Allow openvswitch to mmap'd netlink, from Thomas Graf.

    20) Key TCP metrics blobs also by source address, not just destination
    address. From Christoph Paasch.

    21) Support 10G in generic phylib. From Andy Fleming.

    22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided. From Tom Herbert.

    The wireless and netfilter folks have been busy little bees too.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
    net/cxgb4: Fix referencing freed adapter
    ipv6: reallocate addrconf router for ipv6 address when lo device up
    fib_frontend: fix possible NULL pointer dereference
    rtnetlink: remove IFLA_BOND_SLAVE definition
    rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
    qlcnic: update version to 5.3.55
    qlcnic: Enhance logic to calculate msix vectors.
    qlcnic: Refactor interrupt coalescing code for all adapters.
    qlcnic: Update poll controller code path
    qlcnic: Interrupt code cleanup
    qlcnic: Enhance Tx timeout debugging.
    qlcnic: Use bool for rx_mac_learn.
    bonding: fix u64 division
    rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
    sfc: Use the correct maximum TX DMA ring size for SFC9100
    Add Shradha Shah as the sfc driver maintainer.
    net/vxlan: Share RX skb de-marking and checksum checks with ovs
    tulip: cleanup by using ARRAY_SIZE()
    ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
    net/cxgb4: Don't retrieve stats during recovery
    ...

    Linus Torvalds
     

11 Dec, 2013

1 commit


06 Dec, 2013

2 commits

  • In preparation of conversion to kernfs, cgroup file handling is
    updated so that it can be easily mapped to kernfs. This patch
    replaces cftype->read_seq_string() with cftype->seq_show() which is
    not limited to single_open() operation and will map directcly to
    kernfs seq_file interface.

    The conversions are mechanical. As ->seq_show() doesn't have @css and
    @cft, the functions which make use of them are converted to use
    seq_css() and seq_cft() respectively. In several occassions, e.f. if
    it has seq_string in its name, the function name is updated to fit the
    new method better.

    This patch does not introduce any behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Michal Hocko
    Acked-by: Daniel Wagner
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Neil Horman

    Tejun Heo
     
  • In preparation of conversion to kernfs, cgroup file handling is being
    consolidated so that it can be easily mapped to the seq_file based
    interface of kernfs.

    cftype->read_map() doesn't add any value and being replaced with
    ->read_seq_string(). Update read_priomap() to use ->read_seq_string()
    instead.

    This patch doesn't make any visible behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: Daniel Wagner
    Acked-by: Li Zefan

    Tejun Heo
     

09 Oct, 2013

1 commit


09 Aug, 2013

5 commits

  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cgroup_taskset which is used by the subsystem attach methods is the
    last cgroup subsystem API which isn't using css as the handle. Update
    cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
    cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

    The conversions are pretty mechanical. One exception is
    cpuset::cgroup_cs(), which lost its last user and got removed.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Daniel Wagner
    Cc: Ingo Molnar
    Cc: Matt Helsley
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsytem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straight forwards but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup *
    in subsystem implementations for the following reasons.

    * With unified hierarchy, subsystems will be dynamically bound and
    unbound from cgroups and thus css's (cgroup_subsys_state) may be
    created and destroyed dynamically over the lifetime of a cgroup,
    which is different from the current state where all css's are
    allocated and destroyed together with the associated cgroup. This
    in turn means that cgroup_css() should be synchronized and may
    return NULL, making it more cumbersome to use.

    * Differing levels of per-subsystem granularity in the unified
    hierarchy means that the task and descendant iterators should behave
    differently depending on the specific subsystem the iteration is
    being performed for.

    * In majority of the cases, subsystems only care about its part in the
    cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
    often obtain the matching css pointer from the cgroup and don't
    bother with the cgroup pointer itself. Passing around css fits
    much better.

    This patch converts all cgroup_subsys methods to take @css instead of
    @cgroup. The conversions are mostly straight-forward. A few
    noteworthy changes are

    * ->css_alloc() now takes css of the parent cgroup rather than the
    pointer to the new cgroup as the css for the new cgroup doesn't
    exist yet. Knowing the parent css is enough for all the existing
    subsystems.

    * In kernel/cgroup.c::offline_css(), unnecessary open coded css
    dereference is replaced with local variable access.

    This patch shouldn't cause any behavior differences.

    v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
    with local variable @css as suggested by Li Zefan.

    Rebased on top of new for-3.12 which includes for-3.11-fixes so
    that ->css_free() invocation added by da0a12caff ("cgroup: fix a
    leak when percpu_ref_init() fails") is converted too. Suggested
    by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     
  • cgroup controller API will be converted to primarily use struct
    cgroup_subsys_state instead of struct cgroup. In preparation, make
    the internal functions of netprio_cgroup pass around @css instead of
    @cgrp.

    While at it, kill struct cgroup_netprio_state which only contained
    struct cgroup_subsys_state without serving any purpose. All functions
    are converted to deal with @css directly.

    This patch shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Acked-by: David S. Miller

    Tejun Heo
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp large portion of cgroup API, so, let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

29 May, 2013

1 commit

  • So far, only net_device * could be passed along with netdevice notifier
    event. This patch provides a possibility to pass custom structure
    able to provide info that event listener needs to know.

    Signed-off-by: Jiri Pirko

    v2->v3: fix typo on simeth
    shortened dev_getter
    shortened notifier_info struct name
    v1->v2: fix notifier_call parameter in call_netdevice_notifier()
    Signed-off-by: David S. Miller

    Jiri Pirko
     

07 Feb, 2013

1 commit


13 Dec, 2012

1 commit

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heurstics were choosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     

22 Nov, 2012

6 commits

  • Inherit netprio configuration from ->css_online(), allow nesting and
    remove .broken_hierarchy marking. This makes netprio_cgroup's
    behavior match netcls_cgroup's.

    Note that this patch changes userland-visible behavior. Nesting is
    allowed and the first level cgroups below the root cgroup behave
    differently - they inherit priorities from the root cgroup on creation
    instead of starting with 0. This is unfortunate but not doing so is
    much crazier.

    Signed-off-by: Tejun Heo
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
     
  • Introduce two helpers - netprio_prio() and netprio_set_prio() - which
    hide the details of priomap access and expansion. This will help
    implementing hierarchy support.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
     
  • With priomap expansion no longer depending on knowing max id
    allocated, netprio_cgroup can use cgroup->id insted of cs->prioidx.
    Drop prioidx alloc/free logic and convert all uses to cgroup->id.

    * In cgrp_css_alloc(), parent->id test is moved above @cs allocation
    to simplify error path.

    * In cgrp_css_free(), @cs assignment is made initialization.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
     
  • netprio kept track of the highest prioidx allocated and resized
    priomaps accordingly when necessary. This makes it necessary to keep
    track of prioidx allocation and may end up resizing on every new
    prioidx.

    Update extend_netdev_table() such that it takes @target_idx which the
    priomap should be able to accomodate. If the priomap is large enough,
    nothing happens; otherwise, the size is doubled until @target_idx can
    be accomodated.

    This makes max_prioidx and write_update_netdev_table() unnecessary.
    write_priomap() now calls extend_netdev_table() directly.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
     
  • The function is about to go through a rewrite. In preparation,
    shorten the variable names so that we don't repeat "priomap" so often.

    This patch is cosmetic.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
     
  • sscanf() doesn't bite.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
     

20 Nov, 2012

1 commit


26 Oct, 2012

1 commit

  • net_prio_attach() is only access via cgroup_subsys callbacks,
    therefore we can reduce the visibility of this function.

    Signed-off-by: Daniel Wagner
    Cc: "David S. Miller"
    Cc: John Fastabend
    Cc: Li Zefan
    Cc: Neil Horman
    Cc: Tejun Heo
    Cc:
    Cc:
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Daniel Wagner
     

03 Oct, 2012

3 commits

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c spiltoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     
  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increate the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarhcy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     

27 Sep, 2012

1 commit

  • iterates through the opened files in given descriptor table,
    calling a supplied function; we stop once non-zero is returned.
    Callback gets struct file *, descriptor number and const void *
    argument passed to iterator. It is called with files->file_lock
    held, so it is not allowed to block.

    tty_io, netprio_cgroup and selinux flush_unauthorized_files()
    converted to its use.

    Signed-off-by: Al Viro

    Al Viro
     

15 Sep, 2012

2 commits

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies
    expecting completely different behaviors depending on the mounted
    subsystem is deterimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • WARNING: With this change it is impossible to load external built
    controllers anymore.

    In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
    set, corresponding subsys_id should also be a constant. Up to now,
    net_prio_subsys_id and net_cls_subsys_id would be of the type int and
    the value would be assigned during runtime.

    By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
    to IS_ENABLED, all *_subsys_id will have constant value. That means we
    need to remove all the code which assumes a value can be assigned to
    net_prio_subsys_id and net_cls_subsys_id.

    A close look is necessary on the RCU part which was introduces by
    following patch:

    commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
    Author: Herbert Xu Mon May 24 09:12:34 2010
    Committer: David S. Miller Mon May 24 09:12:34 2010

    cls_cgroup: Store classid in struct sock

    Tis code was added to init_cgroup_cls()

    /* We can't use rcu_assign_pointer because this is an int. */
    smp_wmb();
    net_cls_subsys_id = net_cls_subsys.subsys_id;

    respectively to exit_cgroup_cls()

    net_cls_subsys_id = -1;
    synchronize_rcu();

    and in module version of task_cls_classid()

    rcu_read_lock();
    id = rcu_dereference(net_cls_subsys_id);
    if (id >= 0)
    classid = container_of(task_subsys_state(p, id),
    struct cgroup_cls_state, css)->classid;
    rcu_read_unlock();

    Without an explicit explaination why the RCU part is needed. (The
    rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
    in a later commit, but that is a minor detail.)

    So here is my pondering why it was introduced and why it safe to
    remove it now. Note that this code was copied over to net_prio the
    reasoning holds for that subsystem too.

    The idea behind the RCU use for net_cls_subsys_id is to make sure we
    get a valid pointer back from task_subsys_state(). task_subsys_state()
    is just blindly accessing the subsys array and returning the
    pointer. Obviously, passing in -1 as id into task_subsys_state()
    returns an invalid value (out of lower bound).

    So this code makes sure that only after module is loaded and the
    subsystem registered, the id is assigned.

    Before unregistering the module all old readers must have left the
    critical section. This is done by assigning -1 to the id and issuing a
    synchronized_rcu(). Any new readers wont call task_subsys_state()
    anymore and therefore it is safe to unregister the subsystem.

    The new code relies on the same trick, but it looks at the subsys
    pointer return by task_subsys_state() (remember the id is constant
    and therefore we allways have a valid index into the subsys
    array).

    No precautions need to be taken during module loading
    module. Eventually, all CPUs will get a valid pointer back from
    task_subsys_state() because rebind_subsystem() which is called after
    the module init() function will assigned subsys[net_cls_subsys_id] the
    newly loaded module subsystem pointer.

    When the subsystem is about to be removed, rebind_subsystem() will
    called before the module exit() function. In this case,
    rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
    and then it calls synchronize_rcu(). All old readers have left by then
    the critical section. Any new reader wont access the subsystem
    anymore. At this point we are safe to unregister the subsystem. No
    synchronize_rcu() call is needed.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Cc: Gao feng
    Cc: Glauber Costa
    Cc: Herbert Xu
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kamezawa Hiroyuki
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     

14 Sep, 2012

2 commits


17 Aug, 2012

2 commits

  • A race exists where creating cgroups and also updating the priomap
    may result in losing a priomap update. This is because priomap
    writers are not protected by rtnl_lock.

    Move priority writer into rtnl_lock()/rtnl_unlock().

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Add lock to prevent a race with a file closing and also remove
    useless and ugly sscanf code. The extra code was never needed
    and the case it supposedly protected against is in fact handled
    correctly by sock_from_file as pointed out by Al Viro.

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     

25 Jul, 2012

1 commit

  • Pull trivial tree from Jiri Kosina:
    "Trivial updates all over the place as usual."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
    Fix typo in include/linux/clk.h .
    pci: hotplug: Fix typo in pci
    iommu: Fix typo in iommu
    video: Fix typo in drivers/video
    Documentation: Add newline at end-of-file to files lacking one
    arm,unicore32: Remove obsolete "select MISC_DEVICES"
    module.c: spelling s/postition/position/g
    cpufreq: Fix typo in cpufreq driver
    trivial: typo in comment in mksysmap
    mach-omap2: Fix typo in debug message and comment
    scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
    Change email address for Steve Glendinning
    Btrfs: fix typo in convert_extent_bit
    via: Remove bogus if check
    netprio_cgroup.c: fix comment typo
    backlight: fix memory leak on obscure error path
    Documentation: asus-laptop.txt references an obsolete Kconfig item
    Documentation: ManagementStyle: fixed typo
    mm/vmscan: cleanup comment error in balance_pgdat
    mm: cleanup on the comments of zone_reclaim_stat
    ...

    Linus Torvalds
     

23 Jul, 2012

1 commit

  • Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set but as soon as data
    is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
    is updated with the kernel worker threads value which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend