17 Feb, 2016

1 commit

  • This patch adds a generic, lockless dst cache implementation.
    The need for a lock is avoided by updating the dst cache fields
    only in per-cpu scope, and by requiring that the cache manipulation
    functions are invoked with the local bh disabled.

    The refresh_ts and reset_ts fields are used to ensure cache
    consistency in case of a concurrent cache update (dst_cache_set*) and
    reset operation (dst_cache_reset).

    Consider the following scenario:

    CPU1:                           CPU2:

                                    dst_cache_reset()
    dst_cache_set()

    The dst entry passed to dst_cache_set() should not be used
    for later dst cache lookups, because it was obtained using old
    configuration values.

    Since the refresh_ts is updated only on dst_cache lookup, the
    cached value in the above scenario will be discarded on the next
    lookup.
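
    As an illustration of the timestamp check described above, here is a
    minimal user-space sketch in Python. It is not the kernel code: the
    class and method names are hypothetical, and a logical counter stands
    in for jiffies. Only refresh_ts, reset_ts and the roles of lookup, set
    and reset are taken from the text.

```python
import itertools

# Logical clock standing in for jiffies (assumption: strictly increasing).
_clock = itertools.count(1)


def now():
    return next(_clock)


class PerCpuEntry:
    """One per-cpu slot: the cached dst plus its refresh timestamp."""

    def __init__(self):
        self.dst = None
        self.refresh_ts = 0


class DstCache:
    """Hypothetical model of the cache: one slot per cpu, one shared reset_ts."""

    def __init__(self, ncpus):
        self.reset_ts = now()
        self.percpu = [PerCpuEntry() for _ in range(ncpus)]

    def lookup(self, cpu):
        entry = self.percpu[cpu]
        # A hit requires that this slot was refreshed after the last reset.
        if entry.dst is not None and entry.refresh_ts > self.reset_ts:
            return entry.dst
        # Miss: drop whatever is stored and refresh the timestamp.
        entry.dst = None
        entry.refresh_ts = now()
        return None

    def set(self, cpu, dst):
        # refresh_ts is deliberately left untouched: only lookups update it.
        self.percpu[cpu].dst = dst

    def reset(self):
        self.reset_ts = now()


# Replay of the scenario: CPU1 is index 0, the reset comes from elsewhere.
cache = DstCache(ncpus=2)
first = cache.lookup(0)        # empty cache, lookup fails
cache.reset()                  # configuration changed in the meantime
cache.set(0, "stale-dst")      # dst obtained with the old configuration
second = cache.lookup(0)       # stale value is discarded here
cache.set(0, "fresh-dst")
third = cache.lookup(0)        # now a hit
```

    In this model the stale entry never reaches a caller because set() does
    not advance refresh_ts, so the slot still looks older than the reset.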

    Signed-off-by: Paolo Abeni
    Suggested-and-acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Paolo Abeni
     

11 Jan, 2016

1 commit

  • This work adds a generalization of the ingress qdisc as a qdisc holding
    only classifiers. The clsact qdisc works on ingress, but also on egress.
    In both cases, its execution happens without taking the qdisc lock, and
    the main difference for the egress part compared to the prior version in [1]
    is that this can be applied with _any_ underlying real egress qdisc (also
    classless ones).

    Besides solving the use-case of [1], that is, allowing for more programmability
    on assigning skb->priority for the mqprio case that is supported by most
    popular 10G+ NICs, it also opens up a lot more flexibility for other tc
    applications. The main work on classification can already be done at clsact
    egress time if the use-case allows and state stored for later retrieval
    f.e. again in skb->priority with major/minors (which is checked by most
    classful qdiscs before consulting tc_classify()) and/or in other skb fields
    like skb->tc_index for some light-weight post-processing to get to the
    eventual classid in case of a classful qdisc. Another use case is that
    the clsact egress part allows to have a central egress counterpart to
    the ingress classifiers, so that classifiers can easily share state (e.g.
    in cls_bpf via eBPF maps) for ingress and egress.

    Currently, default setups like mq + pfifo_fast would require for this to
    use, for example, prio qdisc instead (to get a tc_classify() run) and to
    duplicate the egress classifier for each queue. With clsact, it allows
    for leaving the setup as is, it can additionally assign skb->priority to
    put the skb in one of pfifo_fast's bands and it can share state with maps.
    Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
    w/o the need to perform a skb_dst_force() to hold on to it any longer. In
    lwt case, we can also use this facility to setup dst metadata via cls_bpf
    (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
    that (case of IFF_NO_QUEUE devices, for example).

    The realization can be done without any changes to the scheduler core
    framework. All it takes is that we have two a-priori defined minors/child
    classes, where we can mux between ingress and egress classifier list
    (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
    dev->_tx to avoid an extra cacheline miss for moderate loads). The egress
    part is modelled similarly to handle_ing() and patched to a noop in
    case the functionality is not used. Both handlers are now called
    sch_handle_ingress() and sch_handle_egress(), code sharing among the two
    doesn't seem practical as there are various minor differences in both
    paths, so that making them conditional in a single handler would rather
    slow things down.

    Full compatibility to ingress qdisc is provided as well. Since both
    piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
    per netdevice, and thus ingress qdisc specific behaviour can be retained
    for user space. This means, either a user does 'tc qdisc add dev foo ingress'
    and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
    alternative, where both, ingress and egress classifier can be configured
    as in the below example. ingress qdisc supports attaching classifier to any
    minor number whereas clsact has two fixed minors for muxing between the
    lists, therefore to not break user space setups, they are better done as
    two separate qdiscs.

    I decided to extend the sch_ingress module with clsact functionality so
    that commonly used code can be reused, the module is being aliased with
    sch_clsact so that it can be auto-loaded properly. The alternative would have
    been to add a flag when initializing ingress to alter its behaviour plus aliasing
    to a different name (as it's more than just ingress). However, the first would
    end up, based on the flag, choosing the new/old behaviour by calling different
    function implementations to handle each anyway, and the latter would require
    registering the ingress qdisc once again under a different alias. So, this really
    calls for a minimal, cleaner approach where clsact has Qdisc_ops and
    Qdisc_class_ops of its own that share the callbacks used by both.

    Example, adding qdisc:

    # tc qdisc add dev foo clsact
    # tc qdisc show dev foo
    qdisc mq 0: root
    qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc clsact ffff: parent ffff:fff1
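
    To make the priomap column above easier to read, here is a small
    Python sketch of how a priority value could be mapped to one of the
    three pfifo_fast bands. The function name is hypothetical, and the
    assumption is that the priority is masked to the range 0..15 before it
    indexes the map, with band 0 being served first.

```python
TC_PRIO_MAX = 15  # assumption: priorities are masked to 0..15 before lookup

# priomap exactly as printed by 'tc qdisc show dev foo' above
PRIOMAP = [1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]


def band_for_priority(priority, priomap=PRIOMAP):
    """Return the band index selected for a given skb->priority value."""
    return priomap[priority & TC_PRIO_MAX]


# Priorities 6 and 7 land in band 0, priority 0 in band 1, priority 1 in band 2.
bands = {prio: band_for_priority(prio) for prio in (0, 1, 6, 7)}
```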

    Adding filters (deleting, etc. works analogously by specifying ingress/egress):

    # tc filter add dev foo ingress bpf da obj bar.o sec ingress
    # tc filter add dev foo egress bpf da obj bar.o sec egress
    # tc filter show dev foo ingress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
    # tc filter show dev foo egress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

    A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
    show an empty list for clsact. Either using the parent names (ingress/egress)
    or specifying the full major/minor will then show the related filter lists.
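
    The relation between the parent names and the full major/minor can be
    sketched as follows. This is an illustrative Python model, not tc or
    kernel code: the helper names are hypothetical, and the numeric values
    are assumed to match the TC_H_* definitions in the uapi pkt_sched.h
    header (clsact handle ffff:fff1, fixed minors fff2 and fff3).

```python
TC_H_MAJ_MASK = 0xFFFF0000
TC_H_MIN_MASK = 0x0000FFFF
TC_H_CLSACT = 0xFFFFFFF1       # assumed value, shown as ffff:fff1 above
TC_H_MIN_INGRESS = 0xFFF2      # assumed fixed minor for the ingress list
TC_H_MIN_EGRESS = 0xFFF3       # assumed fixed minor for the egress list


def tc_h_make(major, minor):
    """Combine a major and a minor into one 32-bit handle."""
    return (major & TC_H_MAJ_MASK) | (minor & TC_H_MIN_MASK)


def parent_handle(name):
    """Map the parent name 'ingress' or 'egress' to its full handle."""
    minors = {"ingress": TC_H_MIN_INGRESS, "egress": TC_H_MIN_EGRESS}
    return tc_h_make(TC_H_CLSACT, minors[name])


def format_handle(handle):
    """Render a handle the way tc prints it, as major:minor in hex."""
    return "%x:%x" % (handle >> 16, handle & TC_H_MIN_MASK)


ingress_parent = format_handle(parent_handle("ingress"))
egress_parent = format_handle(parent_handle("egress"))
```

    Under these assumptions, 'tc filter show dev foo parent ffff:fff2' would
    be the long form of 'tc filter show dev foo ingress'.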

    Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

    [1] http://patchwork.ozlabs.org/patch/512949/

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Dec, 2015

1 commit

  • Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
    ->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct
    and its accessors are defined in cgroup-defs.h. This is to prepare
    for overloading the fields with a cgroup pointer.

    This patch mostly performs equivalent conversions but the following
    are noteworthy.

    * Equality test before updating classid is removed from
    sock_update_classid(). This shouldn't make any noticeable
    difference and a similar test will be implemented on the helper side
    later.

    * sock_update_netprioidx() now takes struct sock_cgroup_data and can
    be moved to netprio_cgroup.h without causing include dependency
    loop. Moved.

    * The dummy version of sock_update_netprioidx() is converted to a static
    inline function while at it.

    Signed-off-by: Tejun Heo
    Signed-off-by: David S. Miller

    Tejun Heo
     

30 Sep, 2015

1 commit

  • L3 master devices allow users of the abstraction to influence FIB lookups
    for enslaved devices. Current API provides a means for the master device
    to return a specific FIB table for an enslaved device, to return an
    rtable/custom dst and influence the OIF used for fib lookups.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

22 Jul, 2015

1 commit


14 May, 2015

1 commit

  • This new config switch enables the ingress filtering infrastructure that is
    controlled through the ingress_needed static key. This prepares the
    introduction of the Netfilter ingress hook that resides under this unique
    static key.

    Note that CONFIG_SCH_INGRESS automatically selects this; that should be no
    problem since this also depends on CONFIG_NET_CLS_ACT.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     

07 Jan, 2015

1 commit


03 Dec, 2014

1 commit

  • The goal of this is to provide a possibility to support various switch
    chips. Drivers should implement the relevant ndos to do so. For now, only
    one ndo is defined:
    - one for getting the physical switch id.

    Note that the user can use any port netdevice to access the switch.

    Signed-off-by: Jiri Pirko
    Reviewed-by: Thomas Graf
    Acked-by: Andy Gospodarek
    Signed-off-by: David S. Miller

    Jiri Pirko
     

28 Oct, 2014

1 commit

  • introduce two configs:
    - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
    depend on
    - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

    that solves several problems:
    - tracing and others that wish to use eBPF don't need to depend on NET.
    They can use BPF_SYSCALL to allow loading from userspace or select BPF
    to use it directly from kernel in NET-less configs.
    - in 3.18 programs cannot be attached to events yet, so don't force it on
    - when the rest of eBPF infra is there in 3.19+, it's still useful to
    switch it off to minimize kernel size

    bloat-o-meter on x64 shows:
    add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

    tested with many different config combinations. Hopefully didn't miss anything.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

11 Oct, 2014

1 commit

  • minimal configurations where EPOLL, PERF_EVENTS, etc are disabled,
    but NET is enabled, are failing to build with link error:
    kernel/built-in.o: In function `bpf_prog_load':
    syscall.c:(.text+0x3b728): undefined reference to `anon_inode_getfd'

    fix it by selecting ANON_INODES when NET is enabled

    Reported-by: Michal Sojka
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

01 Oct, 2014

1 commit

  • Eric reports build failure with
    CONFIG_BRIDGE_NETFILTER=n

    We insist on building br_nf_core.o unconditionally, but we must only do so
    if br_netfilter was enabled, else it fails to build due to
    functions being defined to empty stubs (and some structure members
    being defined out).

    Also, BRIDGE_NETFILTER=y|m makes no sense when BRIDGE=n.

    Fixes: 34666d467 (netfilter: bridge: move br_netfilter out of the core)
    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Florian Westphal
     

27 Sep, 2014

1 commit

  • Jesper reported that br_netfilter always registers the hooks since
    this is part of the bridge core. This harms performance for people that
    don't need this.

    This patch modularizes br_netfilter so it can be rmmod'ed, and thus
    the hooks can be unregistered. I think the bridge netfilter should have
    been a separate module from the beginning; Patrick agreed on that.

    Note that this is breaking compatibility for users that expect that
    bridge netfilter is going to be available after explicitly 'modprobe
    bridge' or via automatic load through brctl.

    However, the damage can be easily undone by modprobing br_netfilter.
    The bridge core also prints a message to provide a clue to people that
    didn't notice that this has been deprecated.

    On top of that, the plan is that nftables will not rely on this software
    layer, but integrate the connection tracking into the bridge layer to
    enable stateful filtering and NAT, which is what bridge netfilter users
    seem to require.

    This patch still keeps the fake_dst_ops in the bridge core, since this
    is required when the bridge port is initialized. So we can safely
    modprobe/rmmod br_netfilter anytime.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Florian Westphal

    Pablo Neira Ayuso
     

12 Jul, 2014

1 commit

  • This patch moves generic code which is used by bluetooth and ieee802154
    6lowpan to a new net/6lowpan directory. This directory contains generic
    6LoWPAN code which is shared between bluetooth and ieee802154 MAC-Layer.

    At the moment this is the IPHC ("IPv6 Header Compression") format, which
    is described by RFC 6282 [0]. The BTLE 6LoWPAN draft states that its
    IPHC format is the same as the one used for IEEE 802.15.4, see [1].

    Furthermore, we can put more code into this directory which is shared
    between BTLE and IEEE 802.15.4 6LoWPAN, like RFC 6775 [2] or the routing
    protocol RPL, RFC 6550 [3].

    To avoid naming conflicts I renamed 6lowpan-y to ieee802154_6lowpan-y
    in net/ieee802154/Makefile.

    [0] http://tools.ietf.org/html/rfc6282
    [1] http://tools.ietf.org/html/draft-ietf-6lowpan-btle-12#section-3.2
    [2] http://tools.ietf.org/html/rfc6775
    [3] http://tools.ietf.org/html/rfc6550

    Signed-off-by: Alexander Aring
    Acked-by: Jukka Rissanen
    Signed-off-by: Marcel Holtmann

    Alexander Aring
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    got indirectly included in a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

02 Apr, 2014

1 commit

  • This commit fixes a build error reported by Fengguang, that is
    triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:

    ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!

    The fix is to introduce its own file for the PTP BPF classifier,
    so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
    it independently from each other. IXP4xx driver on ARM needs to
    select it as well since it does not seem to select PTP_1588_CLOCK
    or similar that would pull it in automatically.

    This also allows for hiding all of the internals of the BPF PTP
    program inside that file, and only exporting relevant API bits
    to drivers.

    This patch also adds a kdoc documentation of ptp_classify_raw()
    API to make it clear that it can return PTP_CLASS_* defines. Also,
    the BPF program has been translated into bpf_asm code, so that it
    can be more easily read and altered (extensively documented in [1]).

    In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
    tools, so the commented program can simply be translated via
    `./bpf_asm -c prog` where prog is a file that contains the
    commented code. This makes it easily readable/verifiable and when
    there's a need to change something, jump offsets etc do not need
    to be replaced manually which can be very error prone. Instead,
    a newly translated version via bpf_asm can simply replace the old
    code. I have checked opcode diffs before/after and it's the very
    same filter.

    [1] Documentation/networking/filter.txt

    Fixes: 164d8c666521 ("net: ptp: do not reimplement PTP/BPF classifier")
    Reported-by: Fengguang Wu
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Richard Cochran
    Cc: Jiri Benc
    Acked-by: Richard Cochran
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

08 Feb, 2014

1 commit

  • net_prio is the only cgroup which is allowed to be built as a module.
    The savings from allowing one controller to be built as a module are
    tiny especially given that cgroup module support itself adds quite a
    bit of complexity.

    Given that none of the other controllers has much chance of being made a
    module and that we're unlikely to add new modular controllers, the
    added complexity is simply not justifiable.

    As a first step to drop cgroup module support, this patch changes the
    config option to bool from tristate and drops module related code from
    it.

    Also, while an earlier commit fe1217c4f3f7 ("net: net_cls: move
    cgroupfs classid handling into core") dropped module support from
    net_cls cgroup, it retained a call to cgroup_load_subsys(), which is
    noop for built-in controllers. Drop it along with
    init_netclassid_cgroup().

    v2: Removed modular version of task_netprioidx() in
    include/net/netprio_cgroup.h as suggested by Li Zefan.

    v3: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
    classid handling into core"). net_cls cgroup part is mostly
    dropped except for removal of init_netclassid_cgroup().

    Signed-off-by: Tejun Heo
    Acked-by: Neil Horman
    Acked-by: "David S. Miller"
    Acked-by: Li Zefan
    Cc: Thomas Graf

    Tejun Heo
     

04 Jan, 2014

2 commits

  • While we're at it and have introduced CGROUP_NET_CLASSID, let's also make
    NETPRIO_CGROUP more consistent with the rest of cgroups and rename it
    into CONFIG_CGROUP_NET_PRIO so that for networking, we now have
    CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG
    option consistent among networking cgroups, but also among cgroups
    CONFIG conventions in general as the vast majority has a prefix of
    CONFIG_CGROUP_.

    Signed-off-by: Daniel Borkmann
    Cc: Zefan Li
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     
  • Zefan Li requested [1] to perform the following cleanup/refactoring:

    - Split cgroupfs classid handling into net core to better express a
    possible more generic use.

    - Disable module support for cgroupfs bits as the majority of other
    cgroupfs subsystems do not have that, and it does not seem to be wanted
    from the cgroup side. Zefan might want to follow up for netprio
    later on.

    - By this, code can be further reduced which previously took care of
    functionality built when compiled as module.

    cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
    that we are consistent with {netclassid,netprio}_cgroup naming that is
    under net/core/ as suggested by Zefan.

    No change in functionality, but only code refactoring that is being
    done here.

    [1] http://patchwork.ozlabs.org/patch/304825/

    Suggested-by: Li Zefan
    Signed-off-by: Daniel Borkmann
    Cc: Zefan Li
    Cc: Thomas Graf
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

22 Nov, 2013

1 commit


04 Nov, 2013

1 commit

  • High-availability Seamless Redundancy ("HSR") provides instant failover
    redundancy for Ethernet networks. It requires a special network topology where
    all nodes are connected in a ring (each node having two physical network
    interfaces). It is suited for applications that demand high availability and
    very short reaction time.

    HSR acts on the Ethernet layer, using a registered Ethernet protocol type to
    send special HSR frames in both directions over the ring. The driver creates
    virtual network interfaces that can be used just like any ordinary Linux
    network interface, for IP/TCP/UDP traffic etc. All nodes in the network ring
    must be HSR capable.

    This code is a "best effort" to comply with the HSR standard as described in
    IEC 62439-3:2010 (HSRv0).

    Signed-off-by: Arvid Brodin
    Signed-off-by: David S. Miller

    Arvid Brodin
     

13 Sep, 2013

1 commit


04 Aug, 2013

1 commit


02 Aug, 2013

1 commit


31 Jul, 2013

1 commit


10 Jul, 2013

1 commit

  • Pull networking updates from David Miller:
    "This is a re-do of the net-next pull request for the current merge
    window. The only difference from the one I made the other day is that
    this has Eliezer's interface renames and the timeout handling changes
    made based upon your feedback, as well as a few bug fixes that have
    trickeled in.

    Highlights:

    1) Low latency device polling, eliminating the cost of interrupt
    handling and context switches. Allows direct polling of a network
    device from socket operations, such as recvmsg() and poll().

    Currently ixgbe, mlx4, and bnx2x support this feature.

    Full high level description, performance numbers, and design in
    commit 0a4db187a999 ("Merge branch 'll_poll'")

    From Eliezer Tamir.

    2) With the routing cache removed, ip_check_mc_rcu() gets exercised
    more than ever before in the case where we have lots of multicast
    addresses. Use a hash table instead of a simple linked list, from
    Eric Dumazet.

    3) Add driver for Atheros QCA98xx 802.11ac wireless devices, from
    Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
    Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

    4) Support reporting the TUN device persist flag to userspace, from
    Pavel Emelyanov.

    5) Allow controlling network device VF link state using netlink, from
    Rony Efraim.

    6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

    7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
    Daniel Borkmann and Eric Dumazet.

    8) Allow controlling of TCP quickack behavior on a per-route basis,
    from Cong Wang.

    9) Several bug fixes and improvements to vxlan from Stephen
    Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
    support receiving on multiple UDP ports.

    10) Major cleanups, particularly in the area of debugging and cookie
    lifetime handling, to the SCTP protocol code. From Daniel
    Borkmann.

    11) Allow packets to cross network namespaces when traversing tunnel
    devices. From Nicolas Dichtel.

    12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
    manner akin to how we monitor real network traffic via ptype_all.
    From Daniel Borkmann.

    13) Several bug fixes and improvements for the new alx device driver,
    from Johannes Berg.

    14) Fix scalability issues in the netem packet scheduler's time queue,
    by using an rbtree. From Eric Dumazet.

    15) Several bug fixes in TCP loss recovery handling, from Yuchung
    Cheng.

    16) Add support for GSO segmentation of MPLS packets, from Simon
    Horman.

    17) Make network notifiers have a real data type for the opaque
    pointer that's passed into them. Use this to properly handle
    network device flag changes in arp_netdev_event(). From Jiri
    Pirko and Timo Teräs.

    18) Convert several drivers over to module_pci_driver(), from Peter
    Huewe.

    19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use an
    O(1) calculation instead. From Eric Dumazet.

    20) Support setting of explicit tunnel peer addresses in ipv6, just
    like ipv4. From Nicolas Dichtel.

    21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

    22) Prevent a single high rate flow from overrunning an individual cpu
    during RX packet processing via selective flow shedding. From
    Willem de Bruijn.

    23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
    Dumazet.

    24) Don't just drop GSO packets which are above the TBF scheduler's
    burst limit, chop them up so they are in-bounds instead. Also
    from Eric Dumazet.

    25) VLAN offloads are missed when configured on top of a bridge, fix
    from Vlad Yasevich.

    26) Support IPV6 in ping sockets. From Lorenzo Colitti.

    27) Receive flow steering targets should be updated at poll() time
    too, from David Majnemer.

    28) Fix several corner case regressions in PMTU/redirect handling due
    to the routing cache removal, from Timo Teräs.

    29) We have to be mindful of ipv4 mapped ipv6 sockets in
    udp_v6_push_pending_frames(). From Hannes Frederic Sowa.

    30) Fix L2TP sequence number handling bugs, from James Chapman."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
    drivers/net: caif: fix wrong rtnl_is_locked() usage
    drivers/net: enic: release rtnl_lock on error-path
    vhost-net: fix use-after-free in vhost_net_flush
    net: mv643xx_eth: do not use port number as platform device id
    net: sctp: confirm route during forward progress
    virtio_net: fix race in RX VQ processing
    virtio: support unlocked queue poll
    net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
    Documentation: Fix references to defunct linux-net@vger.kernel.org
    net/fs: change busy poll time accounting
    net: rename low latency sockets functions to busy poll
    bridge: fix some kernel warning in multicast timer
    sfc: Fix memory leak when discarding scattered packets
    sit: fix tunnel update via netlink
    dt:net:stmmac: Add dt specific phy reset callback support.
    dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
    dt:net:stmmac: Allocate platform data only if its NULL.
    net:stmmac: fix memleak in the open method
    ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
    net: ipv6: fix wrong ping_v6_sendmsg return value
    ...

    Linus Torvalds
     

18 Jun, 2013

2 commits


11 Jun, 2013

1 commit

  • Adds an ndo_ll_poll method and the code that supports it.
    This method can be used by low latency applications to busy-poll
    Ethernet device queues directly from the socket code.
    sysctl_net_ll_poll controls how many microseconds to poll.
    Default is zero (disabled).
    Individual protocol support will be added by subsequent patches.
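
    The polling behaviour described above can be pictured with the
    following Python sketch. It is a user-space illustration only: the
    function busy_poll and its poll_once callback are hypothetical
    stand-ins for the socket code and the driver's ndo_ll_poll method, and
    budget_us plays the role of sysctl_net_ll_poll.

```python
import time


def busy_poll(poll_once, budget_us, clock=time.monotonic):
    """Spin on poll_once() until it yields an item or the budget expires.

    poll_once: callable returning a packet, or None if the queue is empty.
    budget_us: how many microseconds to keep polling; 0 disables polling.
    """
    if budget_us <= 0:
        # Default configuration: busy polling is off, fall back to sleeping.
        return None
    deadline = clock() + budget_us / 1_000_000
    while True:
        item = poll_once()
        if item is not None:
            return item
        if clock() >= deadline:
            return None


# A queue that becomes ready on the third poll; the clock is frozen so the
# example does not depend on real timing.
pending = [None, None, "pkt"]
received = busy_poll(lambda: pending.pop(0), budget_us=50, clock=lambda: 0.0)
```

    With a budget of zero the callback is never invoked, which mirrors the
    "default is zero (disabled)" behaviour stated above.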

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jesse Brandeburg
    Signed-off-by: Eliezer Tamir
    Acked-by: Eric Dumazet
    Tested-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eliezer Tamir
     

06 Jun, 2013

1 commit

  • Since we have at least one user of this function outside of CONFIG_NET
    scope, we have to provide this function independently. The proposed
    solution is to move it under lib/net_utils.c with corresponding
    configuration variable and select wherever it is needed.

    Signed-off-by: Andy Shevchenko
    Reported-by: Arnd Bergmann
    Acked-by: David S. Miller
    Acked-by: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Andy Shevchenko
     

28 May, 2013

1 commit

  • In the case where a non-MPLS packet is received and an MPLS stack is
    added it may well be the case that the original skb is GSO but the
    NIC used for transmit does not support GSO of MPLS packets.

    The aim of this code is to provide GSO in software for MPLS packets
    whose skbs are GSO.

    SKB Usage:

    When an implementation adds an MPLS stack to a non-MPLS packet it should do
    the following to skb metadata:

    * Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
    skb->inner_protocol is added by this patch.

    * Set skb->protocol to the new MPLS ethertype of the packet.

    * Set skb->network_header to correspond to the
    end of the L3 header, including the MPLS label stack.
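
    The first two metadata steps in the list above can be sketched in
    Python as follows. This is an illustration, not the datapath code: a
    plain dict stands in for the skb, the function names and the mpls_hlen
    key are hypothetical, and the 32-bit label stack entry layout (label,
    traffic class, bottom-of-stack bit, TTL) is assumed from RFC 3032. The
    network_header adjustment is not modelled.

```python
import struct

ETH_P_IP = 0x0800        # IPv4 ethertype
ETH_P_MPLS_UC = 0x8847   # MPLS unicast ethertype


def encode_lse(label, tc=0, bottom_of_stack=True, ttl=64):
    """Pack one label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    if not 0 <= label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    word = (label << 12) | ((tc & 0x7) << 9) | (int(bottom_of_stack) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def push_mpls(skb, label, ttl=64):
    """Return a copy of the dict 'skb' with a single MPLS label pushed.

    'data' holds the packet starting at the L3 header (no Ethernet header).
    """
    lse = encode_lse(label, ttl=ttl)
    out = dict(skb)
    out["inner_protocol"] = skb["protocol"]   # remember the old ethertype
    out["protocol"] = ETH_P_MPLS_UC           # packet is now MPLS
    out["mpls_hlen"] = len(lse)               # hypothetical bookkeeping field
    out["data"] = lse + skb["data"]
    return out


plain = {"protocol": ETH_P_IP, "data": b"ip-payload"}
labelled = push_mpls(plain, label=16)
```

    Keeping the old ethertype in inner_protocol is what would later let a
    software segmentation path find out which protocol sits underneath the
    label stack.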

    I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
    kernel" which adds MPLS support to the kernel datapath of Open vSwitch.
    That patch sets the above requirements in datapath/actions.c:push_mpls()
    and was used to exercise this code. The datapath patch is against the Open
    vSwitch tree but it is intended that it be added to the Open vSwitch code
    present in the mainline Linux kernel at some point.

    Features:

    I believe that the approach that I have taken is at least partially
    consistent with the handling of other protocols. Jesse, I understand that
    you have some ideas here. I am more than happy to change my implementation.

    This patch adds dev->mpls_features which may be used by devices
    to advertise features supported for MPLS packets.

    A new NETIF_F_MPLS_GSO feature is added for devices which support
    hardware MPLS GSO offload. Currently no devices support this
    and MPLS GSO always falls back to software.

    Alternate Implementation:

    One possible alternate implementation is to teach netif_skb_features()
    and skb_network_protocol() about MPLS, in a similar way to their
    understanding of VLANs. I believe this would avoid the need
    for net/mpls/mpls_gso.c and in particular the calls to
    __skb_pull() and __skb_push() in mpls_gso_segment().

    I have decided on the implementation in this patch as it should
    not introduce any overhead in the case where mpls_gso is not compiled
    into the kernel or inserted as a module.

    MPLS GSO suggested by Jesse Gross.
    Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
    by Pravin B Shelar.

    Cc: Jesse Gross
    Cc: Pravin B Shelar
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Simon Horman
     

21 May, 2013

1 commit

  • A cpu executing the network receive path sheds packets when its input
    queue grows to netdev_max_backlog. A single high rate flow (such as a
    spoofed source DoS) can exceed a single cpu processing rate and will
    degrade throughput of other flows hashed onto the same cpu.

    This patch adds a more fine grained hashtable. If the netdev backlog
    is above a threshold, IRQ cpus track the ratio of total traffic of
    each flow (using 4096 buckets, configurable). The ratio is measured
    by counting the number of packets per flow over the last 256 packets
    from the source cpu. Any flow that occupies a large fraction of this
    (set at 50%) will see packet drop while above the threshold.
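
    The accounting described above can be sketched in user space as follows.
    This is a toy model, not the kernel implementation: the FlowLimit class,
    its method name and the threshold argument are assumptions made for
    illustration, while the 256-packet window, the 4096 buckets and the 50%
    limit come from the text.

```python
from collections import deque

HISTORY = 256        # packets remembered per cpu (from the commit text)
NUM_BUCKETS = 4096   # flow hash buckets (configurable per the commit text)


class FlowLimit:
    """Toy model of the per-cpu flow accounting; not the kernel code."""

    def __init__(self, history=HISTORY, num_buckets=NUM_BUCKETS):
        self.history = deque()           # buckets of the most recent packets
        self.history_len = history
        self.num_buckets = num_buckets
        self.buckets = [0] * num_buckets  # packets per bucket inside the window

    def should_drop(self, flow_hash, qlen, threshold):
        # Accounting only runs while the backlog is at or above the threshold
        # (the threshold value itself is an assumption of this sketch).
        if qlen < threshold:
            return False
        bucket = flow_hash % self.num_buckets
        # Forget the oldest packet once the window is full.
        if len(self.history) == self.history_len:
            self.buckets[self.history.popleft()] -= 1
        self.history.append(bucket)
        self.buckets[bucket] += 1
        # Drop when this flow holds more than half of the window.
        return self.buckets[bucket] > self.history_len // 2
```

    A flow that fills more than half of the window is dropped only while the
    backlog stays above the threshold; other flows hashed to different
    buckets keep being accepted.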

    Tested:
    Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0,
    kernel receive (RPS) on cpu0 and application threads on cpus 2--7
    each handling 20k req/s. Throughput halves when hit with a 400 kpps
    antagonist storm. With this patch applied, antagonist overload is
    dropped and the server processes its complete load.

    The patch is effective when kernel receive processing is the
    bottleneck. The above RPS scenario is an extreme case, but the same
    condition is reached with RFS and sufficient kernel processing
    (iptables, packet socket tap, ...).

    Signed-off-by: Willem de Bruijn
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

02 May, 2013

1 commit

  • Currently, in menuconfig, Netlink's new mmaped IO is the very first
    entry under the ``Networking support'' item and comes even before
    ``Networking options'':

    [ ] Netlink: mmaped IO
    Networking options --->
    ...

    Let's move this into ``Networking options'' under netlink's Kconfig,
    since this might be more appropriate. Introduced by commit ccdfcc398
    (``netlink: mmaped netlink: ring setup'').

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

20 Apr, 2013

1 commit

  • Add support for mmap'ed RX and TX ring setup and teardown based on the
    af_packet.c code. The following patches will use this to add the real
    mmap'ed receive and transmit functionality.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

22 Mar, 2013

1 commit

    The netlink_diag can be built as a module, just as is done for
    unix sockets.

    The core dumping message carries the basic info about netlink sockets:
    family, type and protocol, portid, dst_group, dst_portid, state.

    Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.

    Netlink sockets can be filtered by protocols.

    The socket inode number and cookie are reserved for future per-socket
    info retrieval. The per-protocol filtering is also reserved for future
    use by requiring the sdiag_protocol to be zero.

    The file /proc/net/netlink doesn't provide enough information for
    dumping netlink sockets. It doesn't provide dst_group, dst_portid,
    or groups above 32.

    v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.

    Acked-by: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Pablo Neira Ayuso
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Cc: Thomas Graf
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

22 Feb, 2013

1 commit

  • Pull driver core patches from Greg Kroah-Hartman:
    "Here is the big driver core merge for 3.9-rc1

    There are two major series here, both of which touch lots of drivers
    all over the kernel, and will cause you some merge conflicts:

    - add a new function called devm_ioremap_resource() to properly be
    able to check return values.

    - remove CONFIG_EXPERIMENTAL

    Other than those patches, there's not much here, some minor fixes and
    updates"

    Fix up trivial conflicts

    * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
    base: memory: fix soft/hard_offline_page permissions
    drivercore: Fix ordering between deferred_probe and exiting initcalls
    backlight: fix class_find_device() arguments
    TTY: mark tty_get_device call with the proper const values
    driver-core: constify data for class_find_device()
    firmware: Ignore abort check when no user-helper is used
    firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
    firmware: Make user-mode helper optional
    firmware: Refactoring for splitting user-mode helper code
    Driver core: treat unregistered bus_types as having no devices
    watchdog: Convert to devm_ioremap_resource()
    thermal: Convert to devm_ioremap_resource()
    spi: Convert to devm_ioremap_resource()
    power: Convert to devm_ioremap_resource()
    mtd: Convert to devm_ioremap_resource()
    mmc: Convert to devm_ioremap_resource()
    mfd: Convert to devm_ioremap_resource()
    media: Convert to devm_ioremap_resource()
    iommu: Convert to devm_ioremap_resource()
    drm: Convert to devm_ioremap_resource()
    ...

    Linus Torvalds
     

11 Feb, 2013

1 commit

  • VM Sockets allows communication between virtual machines and the hypervisor.
    User level applications both in a virtual machine and on the host can use the
    VM Sockets API, which facilitates fast and efficient communication between
    guest virtual machines and their host. A socket address family, designed to be
    compatible with UDP and TCP at the interface level, is provided.

    Today, VM Sockets is used by various VMware Tools components inside the guest
    for zero-config, network-less access to VMware host services. In addition to
    this, VMware's users are using VM Sockets for various applications, where
    network access of the virtual machine is restricted or non-existent. Examples
    of this are VMs communicating with device proxies for proprietary hardware
    running as host applications and automated testing of applications running
    within virtual machines.

    The VMware VM Sockets are similar to other socket types, such as the
    Berkeley UNIX socket interface. The VM Sockets module supports both
    connection-oriented stream sockets like TCP, and connectionless datagram
    sockets like UDP. The VM Sockets protocol family is defined as "AF_VSOCK"
    and the socket operations are split for SOCK_DGRAM and SOCK_STREAM.
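
    As a rough illustration of the stream case, a user-level client might
    look like the sketch below. It assumes a Linux build of Python 3.7 or
    later, where the socket module exposes AF_VSOCK; the helper names are
    hypothetical, and the port number is left to the caller.

```python
import socket

VMADDR_CID_HOST = 2  # well-known context ID (CID) of the host


def vsock_address(cid: int, port: int) -> tuple:
    """Build the (CID, port) pair that AF_VSOCK sockets use as an address."""
    if not (0 <= cid < 2**32 and 0 <= port < 2**32):
        raise ValueError("CID and port are unsigned 32-bit values")
    return (cid, port)


def connect_stream(cid: int, port: int) -> socket.socket:
    """Open a connection-oriented (SOCK_STREAM) VM socket to cid:port."""
    # socket.AF_VSOCK is only defined where the platform supports it.
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    try:
        sock.connect(vsock_address(cid, port))
    except OSError:
        sock.close()
        raise
    return sock
```

    A guest could then call connect_stream(VMADDR_CID_HOST, port) to reach a
    service listening on the host, without any network configuration.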

    For additional information about the use of VM Sockets, please refer to the
    VM Sockets Programming Guide available at:

    https://www.vmware.com/support/developer/vmci-sdk/

    Signed-off-by: George Zhang
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Andy King
    Signed-off-by: David S. Miller

    Andy King
     

01 Feb, 2013

1 commit

  • The original suggestion to delete wanrouter started earlier
    with the mainline commit f0d1b3c2bcc5de8a17af5f2274f7fcde8292b5fc
    ("net/wanrouter: Deprecate and schedule for removal") in May 2012.

    More importantly, Dan Carpenter found[1] that the driver had a
    fundamental breakage introduced back in 2008, with commit
    7be6065b39c3 ("netdevice wanrouter: Convert directly reference of
    netdev->priv"). So we know with certainty that the code hasn't been
    used by anyone willing to at least take the effort to send an e-mail
    report of breakage for at least 4 years.

    This commit decouples the wanrouter subsystem by going after the
    Makefile/Kconfig and similar files, so that the mainline files we
    are keeping do not have the big wanrouter file/driver deletion
    commit tied into their history.

    Once this commit is in place, we then can remove the obsolete cyclomx
    drivers and similar that have a dependency on CONFIG_WAN_ROUTER_DRIVERS.

    [1] http://www.spinics.net/lists/netdev/msg218670.html

    Originally-by: Joe Perches
    Cc: Dan Carpenter
    Cc: Arnaldo Carvalho de Melo
    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

12 Jan, 2013

1 commit

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: "David S. Miller"
    Signed-off-by: Kees Cook
    Acked-by: David S. Miller

    Kees Cook
     

11 Jan, 2013

1 commit

    This patch allows transmit packet steering (XPS) to be supported without
    sysfs being enabled, so that a driver can make use of XPS even when the
    sysfs portion of the interface is not present.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

05 Sep, 2012

1 commit

  • Stephen Rothwell says:

    ====================
    After merging the final tree, today's linux-next build (powerpc
    ppc44x_defconfig) failed like this:

    net/built-in.o: In function `tcp_fastopen_ctx_free':
    tcp_fastopen.c:(.text+0x5cc5c): undefined reference to `crypto_destroy_tfm'
    net/built-in.o: In function `tcp_fastopen_reset_cipher':
    (.text+0x5cccc): undefined reference to `crypto_alloc_base'
    net/built-in.o: In function `tcp_fastopen_reset_cipher':
    (.text+0x5cd6c): undefined reference to `crypto_destroy_tfm'

    Presumably caused by commit 104671636897 ("tcp: TCP Fast Open Server -
    header & support functions") from the net-next tree. I assume that some
    dependency on the CRYPTO infrastructure is missing.

    I have reverted commit 1bed966cc3bd ("Merge branch
    'tcp_fastopen_server'") for today.
    ====================

    Reported-by: Stephen Rothwell
    Signed-off-by: David S. Miller

    David S. Miller