30 Jan, 2013

1 commit

  • The delay calculation with the rate extension introduced in v3.3 does
    not work properly if other packets are still queued for transmission.
    For the delay calculation to work, the two delay types (latency and
    delay introduced by rate limitation) have to be handled differently.
    The latency delay for a packet can overlap with the delay of other
    packets. The delay introduced by the rate, however, is separate, and
    can only start once all other rate-introduced delays have finished.

    The latency delay is drawn from the same distribution for each packet;
    the rate delay depends on the packet size.

    .: latency delay
    -: rate delay
    x: additional delay we have to wait since another packet is currently
    transmitted

    .....---- Packet 1
    .....xx------ Packet 2
    .....------ Packet 3
    ^^^^^
    latency stacks
    ^^
    rate delay doesn't stack
    ^^
    latency stacks

    -----> time

    When a packet is enqueued, we first consider the latency delay. If other
    packets are already queued, we can reduce the latency delay until the
    last packet in the queue is sent; the latency delay, however, cannot
    become negative.
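
    The scheme above can be sketched in userspace C (illustrative names
    and units, not the kernel code):

```c
#include <stdint.h>

/* Time (ns) at which the previous packet finishes its rate-limited
 * transmission; 0 when the queue is empty. */
static uint64_t last_tx_ns;

/* Latency delays overlap freely between packets; the rate delay of a
 * packet can only start once the previous packet's rate delay has
 * finished. Returns the time at which the packet leaves the qdisc. */
static uint64_t packet_tx_time(uint64_t now_ns, uint64_t latency_ns,
                               uint64_t len_bytes, uint64_t rate_bps)
{
    uint64_t t = now_ns + latency_ns;   /* latency part stacks */
    if (t < last_tx_ns)
        t = last_tx_ns;                 /* extra wait: link still busy */
    t += len_bytes * 8ULL * 1000000000ULL / rate_bps;  /* rate part */
    last_tx_ns = t;
    return t;
}
```

    With these numbers, a 1250-byte packet occupies a 1 Gbit/s link for
    10 us, so the second packet's rate delay starts only once the first
    packet's rate delay has finished.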

    Acked-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Johannes Naab
     

22 Dec, 2012

1 commit


13 Dec, 2012

2 commits

  • Pull networking changes from David Miller:

    1) Allow dumping, monitoring, and changing the bridge multicast
    database using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heuristics were chosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many driver updates, cleanups, and improvements, too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "A lot of activities on cgroup side. The big changes are focused on
    making cgroup hierarchy handling saner.

    - cgroup_rmdir() had peculiar semantics - it allowed cgroup
    destruction to be vetoed by individual controllers and tried to
    drain refcnt synchronously. The vetoing never worked properly and
    caused a good deal of contortions in cgroup. memcg was the last
    remaining user. Michal Hocko removed the usage and the cgroup_rmdir()
    path has been simplified significantly. This was done in a
    separate branch so that the memcg people can base further memcg
    changes on top.

    - The above allowed cleaning up cgroup lifecycle management and
    implementation of generic cgroup iterators which are used to
    improve hierarchy support.

    - cgroup_freezer updated to allow migration in and out of a frozen
    cgroup and handle hierarchy. If a cgroup is frozen, all descendant
    cgroups are frozen.

    - netcls_cgroup and netprio_cgroup updated to handle hierarchy
    properly.

    - Various fixes and cleanups.

    - Two merge commits. One to pull in memcg and rmdir cleanups (needed
    to build iterators). The other pulled in cgroup/for-3.7-fixes for
    device_cgroup fixes so that further device_cgroup patches can be
    stacked on top."

    Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
    commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
    touching code close to commit 2ef37d3fe4 ("memcg: Simplify
    mem_cgroup_force_empty_list error handling") in for-3.8)

    * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
    cgroup: update Documentation/cgroups/00-INDEX
    cgroup_rm_file: don't delete the uncreated files
    cgroup: remove subsystem files when remounting cgroup
    cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
    cgroup: warn about broken hierarchies only after css_online
    cgroup: list_del_init() on removed events
    cgroup: fix lockdep warning for event_control
    cgroup: move list add after list head initilization
    netprio_cgroup: allow nesting and inherit config on cgroup creation
    netprio_cgroup: implement netprio[_set]_prio() helpers
    netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
    netprio_cgroup: reimplement priomap expansion
    netprio_cgroup: shorten variable names in extend_netdev_table()
    netprio_cgroup: simplify write_priomap()
    netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
    cgroup: remove obsolete guarantee from cgroup_task_migrate.
    cgroup: add cgroup->id
    cgroup, cpuset: remove cgroup_subsys->post_clone()
    cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
    cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
    ...

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • With BQL being deployed, we are more likely to see the following
    behavior:

    We dequeue a packet from qdisc in dequeue_skb(), then we realize target
    tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
    skb into gso_skb for later.

    This shows in stats (tc -s qdisc dev eth0) as requeues.

    The problem with these requeues is that high priority packets cannot
    be dequeued as long as this (possibly low-prio and big TSO) packet is
    not removed from gso_skb.

    At 1Gbps speed, a full size TSO packet amounts to 500 us of extra
    latency.

    In some cases, we know that all packets dequeued from a qdisc are
    destined for a particular, known txq:

    - If device is non multi queue
    - For all MQ/MQPRIO slave qdiscs

    This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
    this capability, so that dequeue_skb() is allowed to dequeue a packet
    only if the associated txq is not stopped.

    This indeed reduces latencies for high prio packets (or improves
    fairness with sfq/fq_codel), and almost removes qdisc 'requeues'.
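
    The dequeue-side check this flag enables can be sketched with toy
    structures (the flag name is from the commit; the types and the flag
    value here are illustrative, not the kernel's):

```c
#include <stdbool.h>
#include <stddef.h>

#define TCQ_F_ONETXQUEUE 0x10        /* illustrative flag value */

struct txq { bool stopped; };
struct sk_buff { int len; };
struct toy_qdisc {
    unsigned int flags;
    struct txq *dev_queue;           /* the one known txq, if any */
    struct sk_buff *head;            /* toy single-packet queue */
};

/* Dequeue only if the single, known target txq is not stopped;
 * otherwise leave the packet queued instead of parking it in gso_skb,
 * where it would block later, possibly higher-priority, dequeues. */
static struct sk_buff *dequeue_skb(struct toy_qdisc *q)
{
    if ((q->flags & TCQ_F_ONETXQUEUE) && q->dev_queue->stopped)
        return NULL;
    struct sk_buff *skb = q->head;
    q->head = NULL;
    return skb;
}
```

    Without the flag, the qdisc cannot know the target txq before the
    dequeue, so the check is only possible in the two cases listed above.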

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Nov, 2012

1 commit

  • This patch turns QFQ into QFQ+, a variant of QFQ that provides the
    following two benefits: 1) QFQ+ is faster than QFQ, 2) unlike QFQ,
    QFQ+ also correctly schedules non-leaf classes in a hierarchical
    setting. A detailed description of QFQ+, plus a performance
    comparison with DRR and QFQ, can be found in [1].

    [1] P. Valente, "Reducing the Execution Time of Fair-Queueing Schedulers"
    http://algo.ing.unimo.it/people/paolo/agg-sched/agg-sched.pdf

    Signed-off-by: Paolo Valente
    Signed-off-by: David S. Miller

    Paolo Valente
     

26 Nov, 2012

1 commit


22 Nov, 2012

1 commit

  • It turns out that we'll have to live with attributes which are
    inherited at cgroup creation time but not affected by further updates
    to the parent afterwards - such attributes are already in wide use
    e.g. for cpuset.

    So, there's nothing to do for netcls_cgroup for hierarchy support.
    Its current behavior - inherit only during creation - is good enough.

    Move config inheriting from ->css_alloc() to ->css_online() for
    consistency, which doesn't change behavior at all, and remove
    .broken_hierarchy marking.

    Signed-off-by: Tejun Heo
    Tested-and-Acked-by: Daniel Wagner
    Acked-by: David S. Miller

    Tejun Heo
     

20 Nov, 2012

1 commit


19 Nov, 2012

1 commit

  • - In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
    to ns_capable(net->user_ns, CAP_NET_ADMIN), allowing unprivileged
    users to make netlink calls to modify their local network
    namespace.

    - In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
    that calls that are not safe for unprivileged users are still
    protected.

    Later patches will remove the extra capable calls from methods
    that are safe for unprivileged users.
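
    The resulting privilege model can be sketched as a toy (ancestor
    chains and the real cred/capability machinery are elided; all names
    except ns_capable are illustrative):

```c
#include <stdbool.h>

struct user_ns { int id; };
struct net { struct user_ns *user_ns; };
struct task { struct user_ns *ns; bool net_admin; };

static struct user_ns init_user_ns = { 0 };

/* Toy model: a task is capable over a user namespace if it holds
 * CAP_NET_ADMIN in that namespace (ancestor namespaces elided). */
static bool ns_capable(const struct task *t, const struct user_ns *ns)
{
    return t->net_admin && t->ns == ns;
}

/* Dispatcher check after the patch: privilege is judged against the
 * network namespace's owning user namespace, not the initial one, so
 * a user who is root only inside his own namespace may modify it. */
static bool rtnl_may_modify(const struct task *t, const struct net *net)
{
    return ns_capable(t, net->user_ns);
}
```

    Handlers that are not yet safe for unprivileged users keep their own
    check against the initial user namespace on top of this.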

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

11 Nov, 2012

1 commit


08 Nov, 2012

1 commit

  • If the max packet size for some class (configured through tc) is
    violated by the actual size of the packets of that class, then QFQ
    would not schedule classes correctly, and the data structures
    implementing the bucket lists may get corrupted. This problem occurs
    with TSO/GSO even if the max packet size is set to the MTU, and is,
    e.g., the cause of the failure reported in [1]. Two patches have been
    proposed to solve this problem in [2], one of them is a preliminary
    version of this patch.

    This patch addresses the above issues by: 1) setting QFQ parameters to
    proper values for supporting TSO/GSO (in particular, setting the
    maximum possible packet size to 64KB), 2) automatically increasing the
    max packet size for a class, lmax, when a packet with a larger size
    than the current value of lmax arrives.

    The drawback of the first point is that the maximum weight for a class
    is now limited to 4096, which is equal to 1/16 of the maximum weight
    sum.

    Finally, this patch also forcibly caps the timestamps of a class if
    they are too high to be stored in the bucket list. This capping, taken
    from QFQ+ [3], handles the infrequent case described in the comment to
    the function slot_insert.
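
    The second point can be sketched as follows (toy type; in the real
    patch the class is also moved to the group matching the new lmax):

```c
#include <stdint.h>

struct qfq_class_toy { uint32_t lmax; };

/* Grow the class's max packet length on demand instead of letting an
 * oversized (e.g. TSO/GSO) packet violate the bucket-list invariants
 * that depend on lmax. Returns 1 if lmax was updated. */
static int qfq_update_lmax(struct qfq_class_toy *cl, uint32_t len)
{
    if (len <= cl->lmax)
        return 0;
    cl->lmax = len;
    return 1;
}
```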

    [1] http://marc.info/?l=linux-netdev&m=134968777902077&w=2
    [2] http://marc.info/?l=linux-netdev&m=135096573507936&w=2
    [3] http://marc.info/?l=linux-netdev&m=134902691421670&w=2

    Signed-off-by: Paolo Valente
    Tested-by: Cong Wang
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Paolo Valente
     

07 Nov, 2012

1 commit

  • Commit 56b765b79e9 (htb: improved accuracy at high rates)
    introduced two bugs :

    1) one bstats_update() was inadvertently removed from
    htb_dequeue_tree(), breaking statistics/rate estimation.

    2) Missing qdisc_put_rtab() calls in htb_change_class(),
    leaking kernel memory, now that struct htb_class no longer
    retains pointers to qdisc_rate_table structs.

    Since only the rate is used, don't use qdisc_get_rtab() calls
    copying data we ignore anyway.

    Signed-off-by: Eric Dumazet
    Cc: Vimalkumar
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Nov, 2012

1 commit

  • Current HTB (and TBF) uses a rate table computed by the "tc"
    userspace program, which has the following issue:

    The rate table has 256 entries to map packet lengths
    to tokens (time units). With TSO sized packets, the
    256 entry granularity leads to loss/gain of rate,
    making the token bucket inaccurate.

    Thus, instead of relying on the rate table, this patch
    explicitly computes the time and accounts for packet
    transmission times with nanosecond granularity.

    This greatly improves accuracy of HTB with a wide
    range of packet sizes.

    Example:

    tc qdisc add dev $dev root handle 1: \
    htb default 1

    tc class add dev $dev classid 1:1 parent 1: \
    rate 5Gbit mtu 64k

    Here is an example of inaccuracy:

    $ iperf -c host -t 10 -i 1

    With old htb:
    eth4: 34.76 Mb/s In 5827.98 Mb/s Out - 65836.0 p/s In 481273.0 p/s Out
    [SUM] 9.0-10.0 sec 669 MBytes 5.61 Gbits/sec
    [SUM] 0.0-10.0 sec 6.50 GBytes 5.58 Gbits/sec

    With new htb:
    eth4: 28.36 Mb/s In 5208.06 Mb/s Out - 53704.0 p/s In 430076.0 p/s Out
    [SUM] 9.0-10.0 sec 594 MBytes 4.98 Gbits/sec
    [SUM] 0.0-10.0 sec 5.80 GBytes 4.98 Gbits/sec

    The bits per second on the wire is still 5200Mb/s with new HTB
    because qdisc accounts for packet length using skb->len, which
    is smaller than total bytes on the wire if GSO is used. But
    that is for another patch regardless of how time is accounted.
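
    The quantization error of the table approach can be sketched as
    follows (an illustrative model of a tc-built rate table, not the
    kernel code; rate is in bytes per second):

```c
#include <stdint.h>

/* Direct computation, as the patch does: ns to transmit len bytes. */
static uint64_t tx_time_exact_ns(uint32_t len, uint64_t rate_Bps)
{
    return (uint64_t)len * 1000000000ULL / rate_Bps;
}

/* Table-lookup model: 256 entries, each covering a cell of
 * (1 << cell_log) bytes, charging for the top of the cell. */
static uint64_t tx_time_table_ns(uint32_t len, uint64_t rate_Bps,
                                 int cell_log)
{
    uint32_t cell_top = (((len - 1) >> cell_log) + 1) << cell_log;
    return (uint64_t)cell_top * 1000000000ULL / rate_Bps;
}
```

    With a 64KB mtu spread over 256 entries (cell_log = 8), a 1500-byte
    packet is charged as 1536 bytes: 2457 ns instead of 2400 ns at
    5 Gbit/s (625,000,000 bytes/s), an error of roughly 2.4% per packet.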

    Many thanks to Eric Dumazet for review and feedback.

    Signed-off-by: Vimalkumar
    Signed-off-by: David S. Miller

    Vimalkumar
     

26 Oct, 2012

1 commit

  • The cgroup logic part of net_cls is very similar to the one in
    net_prio. Let's streamline the net_cls logic with the net_prio one.

    The net_prio update logic was changed by the following commit (note
    that some further changes were necessary later on):

    commit 406a3c638ce8b17d9704052c07955490f732c2b8
    Author: John Fastabend
    Date: Fri Jul 20 10:39:25 2012 +0000

    net: netprio_cgroup: rework update socket logic

    Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set, but as soon as data
    is sent the kernel worker thread issues a send and sk_cgrp_prioidx
    is updated with the kernel worker thread's value, which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

    Since classid is now updated when the task is moved between the
    cgroups, we don't have to call sock_update_classid() from various
    places to ensure we are always using the latest classid value.

    [v2: Use iterate_fd() instead of open coding]

    Signed-off-by: Daniel Wagner
    Cc: Li Zefan
    Cc: "David S. Miller"
    Cc: "Michael S. Tsirkin"
    Cc: Jamal Hadi Salim
    Cc: Joe Perches
    Cc: John Fastabend
    Cc: Neil Horman
    Cc: Stanislav Kinsbursky
    Cc: Tejun Heo
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Daniel Wagner
     

22 Oct, 2012

1 commit


03 Oct, 2012

4 commits

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. The benefits
    are: a) fewer segments for the driver to process, b) fewer calls to
    the page allocator, c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as is reasonable, leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    I let the type-incompatibility compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe, allowing
    root in a user namespace to do those things that today we only forbid
    to non-root users because it would confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid wherwe appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is to be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - xattr support added. The implementation is shared with tmpfs. The
    usage is restricted and intended to be used to manage per-cgroup
    metadata by system software. tmpfs changes are routed through this
    branch with Hugh's permission.

    - cgroup subsystem ID handling simplified.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
    cgroup: Assign subsystem IDs during compile time
    cgroup: Do not depend on a given order when populating the subsys array
    cgroup: Wrap subsystem selection macro
    cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
    cgroup: net_prio: Do not define task_netpioidx() when not selected
    cgroup: net_cls: Do not define task_cls_classid() when not selected
    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
    cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
    xattr: mark variable as uninitialized to make both gcc and smatch happy
    fs: add missing documentation to simple_xattr functions
    cgroup: add documentation on extended attributes usage
    cgroup: rename subsys_bits to subsys_mask
    cgroup: add xattr support
    cgroup: revise how we re-populate root directory
    xattr: extract simple_xattr code from tmpfs

    Linus Torvalds
     

29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Sep, 2012

1 commit

  • GCC refuses to recognize that all error control flows do in fact
    set err to something.

    Add an explicit initialization to shut it up.

    net/sched/sch_drr.c: In function ‘drr_enqueue’:
    net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
    net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    This is done to increase the probability of coalescing small write()s
    into single segments in skbs still in the write queue (not yet sent).
    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient for building TSO 64KB packets, because we
    need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we
    hit the page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag; that's order-3 pages on x86)

    This increases TCP stream performance by 20% on loopback device,
    but also benefits on other network devices, since 8x less frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG enabled hardware can't cope with bigger
    fragments, but their ndo_start_xmit() should already handle this,
    splitting a fragment into sub-fragments, since some arches have
    PAGE_SIZE=65536.
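
    The fallback can be sketched as follows (malloc() standing in for the
    kernel's order-N page allocation; function name is illustrative):

```c
#include <stdlib.h>

/* Try a high-order (32KB, i.e. order-3 on 4KB pages) allocation first
 * and fall back to smaller orders under memory pressure, down to a
 * single order-0 page. Stores the obtained size in *got. */
static void *alloc_frag(size_t *got, int max_order)
{
    for (int order = max_order; order >= 0; order--) {
        size_t size = (size_t)4096 << order;
        void *p = malloc(size);      /* stand-in for alloc_pages(order) */
        if (p) {
            *got = size;
            return p;
        }
    }
    *got = 0;
    return NULL;
}
```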

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Sep, 2012

1 commit

  • If the old timestamps of a class, say cl, are stale when the class
    becomes active, then QFQ may assign to cl a much higher start time
    than the maximum value allowed. This may happen when QFQ assigns to
    the start time of cl the finish time of a group whose classes are
    characterized by a higher value of the ratio
    max_class_pkt/weight_of_the_class with respect to that of
    cl. Inserting a class with too high a start time into the bucket list
    corrupts the data structure and may eventually lead to crashes.
    This patch limits the maximum start time assigned to a class.
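
    In essence (illustrative names and units; the kernel works in virtual
    time tied to the bucket-list size):

```c
#include <stdint.h>

/* Cap a (possibly stale) start time so that inserting the class can
 * never index past the end of the bucket list: the highest admissible
 * start is the current virtual time plus the span the list covers. */
static uint64_t qfq_cap_start(uint64_t start, uint64_t vtime,
                              uint64_t nr_slots, uint64_t slot_len)
{
    uint64_t limit = vtime + (nr_slots - 1) * slot_len;
    return start > limit ? limit : start;
}
```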

    Signed-off-by: Paolo Valente
    Signed-off-by: David S. Miller

    Paolo Valente
     

15 Sep, 2012

3 commits

  • Conflicts:
    net/netfilter/nfnetlink_log.c
    net/netfilter/xt_LOG.c

    Rather easy conflict resolution, the 'net' tree had bug fixes to make
    sure we checked if a socket is a time-wait one or not and elide the
    logging code if so.

    Whereas on the 'net-next' side we are calculating the UID and GID from
    the creds using different interfaces due to the user namespace changes
    from Eric Biederman.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Users using separate hierarchies and
    expecting completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
     
  • WARNING: With this change it is impossible to load externally built
    controllers anymore.

    In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m
    are set, the corresponding subsys_ids should also be constants. Up to
    now, net_prio_subsys_id and net_cls_subsys_id were of type int and
    their values were assigned during runtime.
    By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
    to IS_ENABLED, all *_subsys_id will have constant values. That means we
    need to remove all the code which assumes a value can be assigned to
    net_prio_subsys_id and net_cls_subsys_id.

    A close look is necessary at the RCU part, which was introduced by
    the following patch:

    commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
    Author: Herbert Xu Mon May 24 09:12:34 2010
    Committer: David S. Miller Mon May 24 09:12:34 2010

    cls_cgroup: Store classid in struct sock

    This code was added to init_cgroup_cls():

    /* We can't use rcu_assign_pointer because this is an int. */
    smp_wmb();
    net_cls_subsys_id = net_cls_subsys.subsys_id;

    and, correspondingly, to exit_cgroup_cls():

    net_cls_subsys_id = -1;
    synchronize_rcu();

    and in the module version of task_cls_classid():

    rcu_read_lock();
    id = rcu_dereference(net_cls_subsys_id);
    if (id >= 0)
            classid = container_of(task_subsys_state(p, id),
                                   struct cgroup_cls_state, css)->classid;
    rcu_read_unlock();

    Without an explicit explanation of why the RCU part is needed. (The
    rcu_dereference() was changed to rcu_dereference_index_check() in a
    later commit, but that is a minor detail.)

    So here is my reasoning as to why it was introduced and why it is
    safe to remove it now. Note that this code was copied over to
    net_prio, so the reasoning holds for that subsystem too.

    The idea behind the RCU use for net_cls_subsys_id is to make sure we
    get a valid pointer back from task_subsys_state(). task_subsys_state()
    just blindly accesses the subsys array and returns the pointer.
    Obviously, passing -1 as the id into task_subsys_state() returns an
    invalid value (an access below the array's lower bound).

    So this code makes sure that the id is assigned only after the
    module is loaded and the subsystem registered.

    Before unregistering the module, all old readers must have left the
    critical section. This is done by assigning -1 to the id and issuing
    a synchronize_rcu(). Any new readers won't call task_subsys_state()
    anymore, and therefore it is safe to unregister the subsystem.

    The new code relies on the same trick, but it looks at the subsys
    pointer returned by task_subsys_state() (remember the id is constant,
    and therefore we always have a valid index into the subsys array).

    No precautions need to be taken during module loading. Eventually,
    all CPUs will get a valid pointer back from task_subsys_state(),
    because rebind_subsystems(), which is called after the module's
    init() function, will assign the newly loaded module's subsystem
    pointer to subsys[net_cls_subsys_id].

    When the subsystem is about to be removed, rebind_subsystems() will
    be called before the module's exit() function. In this case,
    rebind_subsystems() will assign a NULL pointer to
    subsys[net_cls_subsys_id] and then call synchronize_rcu(). By then,
    all old readers will have left the critical section, and any new
    reader won't access the subsystem anymore. At this point it is safe
    to unregister the subsystem. No additional synchronize_rcu() call is
    needed.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Cc: Gao feng
    Cc: Glauber Costa
    Cc: Herbert Xu
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kamezawa Hiroyuki
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     

14 Sep, 2012

4 commits

  • gred_dequeue() and gred_drop() do not seem to get called when the
    queue is empty, meaning that we never start idling while in WRED
    mode. And since qidlestart is not stored by gred_store_wred_set(),
    we would never stop idling while in WRED mode if we ever started.
    This messes up the average queue size calculation that influences
    packet marking/dropping behavior.

    Now, we start WRED mode idling as we are removing the last packet
    from the queue. Also we now actually stop WRED mode idling when we
    are enqueuing a packet.

    Cc: Bruce Osler
    Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
     
  • q->vars.qavg is a Wlog scaled value, but q->backlog is not. In order
    to pass q->vars.qavg as the backlog value, we need to un-scale it.
    Additionally, the qave value returned via netlink should not be Wlog
    scaled, so we need to un-scale the result of red_calc_qavg().

    This caused artificially high values for "Average Queue" to be shown
    by 'tc -s -d qdisc', but did not affect the actual operation of GRED.

    Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
     
  • Each pair of DPs only needs to be compared once when searching for
    a non-unique prio value.

    Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
     
  • Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
     

12 Sep, 2012

1 commit

  • It's possible to set up a bad cbq configuration leading to
    an infinite loop in cbq_classify():

    DEV_OUT=eth0
    ICMP="match ip protocol 1 0xff"
    U32="protocol ip u32"
    DST="match ip dst"
    tc qdisc add dev $DEV_OUT root handle 1: cbq avpkt 1000 \
    bandwidth 100mbit
    tc class add dev $DEV_OUT parent 1: classid 1:1 cbq \
    rate 512kbit allot 1500 prio 5 bounded isolated
    tc filter add dev $DEV_OUT parent 1: prio 3 $U32 \
    $ICMP $DST 192.168.3.234 flowid 1:

    Reported-by: Denys Fedoryschenko
    Tested-by: Denys Fedoryschenko
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with
    a process identifier. Try to reduce this confusion by renaming the
    fields that hold port identifiers from pid to portid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

06 Sep, 2012

1 commit

  • It seems we need to provide the ability for stacked devices
    to use a specific lock_class_key for sch->busylock.

    We could instead default the l2tpeth tx_queue_len to 0 (no qdisc),
    but a user might use a qdisc anyway.

    (So the same fixes are probably needed on non-LLTX stacked drivers.)

    Noticed while stressing an L2TPv3 setup:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc3+ #788 Not tainted
    -------------------------------------------------------
    netperf/4660 is trying to acquire lock:
    (l2tpsock){+.-...}, at: [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]

    but task is already holding lock:
    (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&(&sch->busylock)->rlock){+.-...}:
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock_irqsave+0x4c/0x60
    [] __wake_up+0x32/0x70
    [] tty_wakeup+0x3e/0x80
    [] pty_write+0x73/0x80
    [] tty_put_char+0x3c/0x40
    [] process_echoes+0x142/0x330
    [] n_tty_receive_buf+0x8fb/0x1230
    [] flush_to_ldisc+0x142/0x1c0
    [] process_one_work+0x198/0x760
    [] worker_thread+0x186/0x4b0
    [] kthread+0x93/0xa0
    [] kernel_thread_helper+0x4/0x10

    -> #0 (l2tpsock){+.-...}:
    [] __lock_acquire+0x1628/0x1b10
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock+0x41/0x50
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ip_finish_output+0x3d0/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] tcp_transmit_skb+0x402/0xa60
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] sock_sendmsg+0xdc/0xf0
    [] sys_sendto+0xfe/0x130
    [] system_call_fastpath+0x16/0x1b
    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);
    lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);

    *** DEADLOCK ***

    5 locks held by netperf/4660:
    #0: (sk_lock-AF_INET){+.+.+.}, at: [] tcp_sendmsg+0x2c/0x1040
    #1: (rcu_read_lock){.+.+..}, at: [] ip_queue_xmit+0x0/0x680
    #2: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x135/0x890
    #3: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0xe00
    #4: (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    stack backtrace:
    Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
    Call Trace:
    [] print_circular_bug+0x1fb/0x20c
    [] __lock_acquire+0x1628/0x1b10
    [] ? check_usage+0x9b/0x4d0
    [] ? __lock_acquire+0x2e4/0x1b10
    [] lock_acquire+0x90/0x200
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] _raw_spin_lock+0x41/0x50
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] ? dev_hard_start_xmit+0x5e/0xa70
    [] ? dev_queue_xmit+0x141/0xe00
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ? dev_hard_start_xmit+0xa70/0xa70
    [] ip_finish_output+0x3d0/0x890
    [] ? ip_finish_output+0x135/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] ? ip_local_out+0xa0/0xa0
    [] tcp_transmit_skb+0x402/0xa60
    [] ? tcp_md5_do_lookup+0x18e/0x1a0
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] ? inet_create+0x6b0/0x6b0
    [] ? sock_update_classid+0xc2/0x3b0
    [] ? sock_update_classid+0x130/0x3b0
    [] sock_sendmsg+0xdc/0xf0
    [] ? fget_light+0x3f9/0x4f0
    [] sys_sendto+0xfe/0x130
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? _raw_spin_unlock_irq+0x30/0x50
    [] ? finish_task_switch+0x83/0xf0
    [] ? finish_task_switch+0x46/0xf0
    [] ? sysret_check+0x1b/0x56
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Sep, 2012

1 commit

  • When fq_codel builds a new flow, it should not reset the codel state.

    The CoDel algorithm needs the previous values (lastcount, drop_next)
    to behave properly.

    Signed-off-by: Dave Taht
    Signed-off-by: Eric Dumazet
    Acked-by: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Aug, 2012

1 commit


23 Aug, 2012

1 commit


17 Aug, 2012

1 commit

  • We drop the packet unconditionally when we fail to mirror it. This
    is not intended in some cases. Consider a KVM guest: we may mirror
    the traffic of the bridge to a tap device used by a VM. When the
    kernel fails to mirror the packet, in conditions such as when qemu
    crashes or stops polling the tap, it's hard for the management
    software to detect the condition and clean up the mirroring
    beforehand. This would lead to all packets to the bridge being
    dropped and break the network of the other virtual machines.

    To solve the issue, the patch does not drop packets when the kernel
    fails to mirror them, and only drops redirected packets.

    Signed-off-by: Jason Wang
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jason Wang
     

15 Aug, 2012

2 commits

  • The flow classifier can use the uids and gids of the sockets that
    are transmitting packets, and does insert those uids and gids into
    the packet classification calculation. I don't fully understand the
    details, but it appears that we can depend on specific uids and
    gids when making traffic classification decisions.

    To work with user namespaces enabled, map from kuids and kgids into
    uids and gids in the initial user namespace, giving raw integer
    values the code can play with and depend on.

    To avoid issues with userspace depending on uids and gids in packet
    classifiers installed from other user namespaces and getting
    confused, deny all packet classifiers that use uids or gids and do
    not come from a netlink socket in the initial user namespace.

    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: Changli Gao
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • cls_flow.c plays with uids and gids. Unless I misread that code, it
    is possible for classifiers to depend on the specific uid and gid
    values. Therefore I need to know the user namespace of the netlink
    socket that is installing the packet classifiers. Pass in the
    rtnetlink skb so I can access the NETLINK_CB of the passed packet.
    In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).

    Pass in not the user namespace but the incoming rtnetlink skb to
    the classifier change routines, as that is generally the more
    useful parameter.

    Cc: Jamal Hadi Salim
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman