26 Sep, 2018

2 commits

  • Currently, Qdisc API functions assume that users have rtnl lock taken. To
    implement rtnl unlocked classifiers update interface, Qdisc API must be
    extended with functions that do not require rtnl lock.

    Extend Qdisc structure with rcu. Implement special version of put function
    qdisc_put_unlocked() that is called without rtnl lock taken. This function
    only takes rtnl lock if Qdisc reference counter reached zero and is
    intended to be used as optimization.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • Rtnl lock is encapsulated in netlink and cannot be accessed by other
    modules directly. This means that reference counted objects that rely on
    rtnl lock cannot use it with the refcount helper function that atomically
    decrements the reference and obtains the mutex.

    This patch implements simple wrapper function around refcount_dec_and_lock
    that obtains rtnl lock if reference counter value reached 0.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
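
    The "decrement, and take the lock only when the count hits zero" pattern
    described above can be sketched in userspace roughly as follows. This is
    a minimal model, not the kernel code: big_lock, dec_not_one(),
    dec_and_big_lock() and big_lock_acquisitions are hypothetical names, and
    C11 atomics plus a pthread mutex stand in for refcount_t and the rtnl
    mutex.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a subsystem-private "big" mutex. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
int big_lock_acquisitions; /* instrumentation for the example only */

/* Drop one reference unless it is the last one; true if it was dropped. */
static bool dec_not_one(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 1) {
        /* On failure 'old' is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(ref, &old, old - 1))
            return true;
    }
    return false;
}

/*
 * Returns true with big_lock held if the count reached zero,
 * otherwise returns false without holding the lock.
 */
bool dec_and_big_lock(atomic_int *ref)
{
    if (dec_not_one(ref))
        return false;           /* fast path: lock never touched */

    pthread_mutex_lock(&big_lock);
    big_lock_acquisitions++;
    if (atomic_fetch_sub(ref, 1) != 1) {
        /* Someone took a new reference in the meantime. */
        pthread_mutex_unlock(&big_lock);
        return false;
    }
    return true;
}

void big_unlock(void)
{
    pthread_mutex_unlock(&big_lock);
}
```

    A caller would destroy the object and call big_unlock() only when
    dec_and_big_lock() returns true; every other put stays lock-free.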
     

30 Mar, 2018

1 commit

  • rtnl_lock() is used everywhere, and contention is very high.
    When someone wants to iterate over alive net namespaces,
    there is no way to do that without an exclusive lock.
    But the exclusive rtnl_lock() in such places is overkill,
    and it just increases the contention. Yes, there is already
    for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
    so the iteration cannot sleep. Also, sometimes it may be necessary
    to really prevent net_namespace_list growth, so for_each_net_rcu()
    does not fit there.

    This patch introduces a new rw_semaphore, which will be used
    instead of rtnl_mutex to protect net_namespace_list. It is
    sleepable and allows non-exclusive iteration over the net
    namespaces list. It allows us to stop using rtnl_lock()
    in several places (which is done in the next patches) and reduces
    the time we hold rtnl_mutex. Here we just add the new lock,
    while the explanation of why we can remove rtnl_lock() there is
    in the next patches.

    Fine-grained locks are generally better than one big lock,
    so let's do that with net_namespace_list, while the situation
    allows that.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

28 Mar, 2018

1 commit


17 Mar, 2018

1 commit

  • rtnl_lock() is a widely used mutex in the kernel. Some kernel code
    does memory allocations under it. In case of memory shortage this
    may invoke the OOM killer, but the problem is that a killed task can't
    exit if it's waiting for the mutex. This may cause a deadlock
    and panic.

    This patch adds a new primitive, which responds to SIGKILL, so
    it can be used in places where we don't want to sleep
    forever.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
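
    The behaviour of a killable lock can be modelled in userspace as below.
    This is only a sketch under loud assumptions: a busy-waiting atomic flag
    replaces the sleeping mutex, a plain boolean replaces a pending SIGKILL,
    and big_lock_killable(), deliver_kill() and big_unlock() are
    hypothetical names. The point it illustrates is the calling convention:
    the acquire can fail, and the caller must check for that.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag big_lock = ATOMIC_FLAG_INIT;
static atomic_bool kill_pending;    /* models a pending fatal signal */

/* Hypothetical helper: mark the (single) waiter as killed. */
void deliver_kill(void)
{
    atomic_store(&kill_pending, true);
}

/*
 * Try to take the lock; give up with -EINTR if a kill arrives while
 * waiting. Returns 0 with the lock held on success.
 * Assumption: spinning replaces sleeping to keep the sketch short.
 */
int big_lock_killable(void)
{
    while (atomic_flag_test_and_set(&big_lock)) {
        if (atomic_load(&kill_pending))
            return -EINTR;
    }
    return 0;
}

void big_unlock(void)
{
    atomic_flag_clear(&big_lock);
}
```

    A caller would typically propagate the error instead of proceeding:
    if (big_lock_killable()) return -EINTR;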
     

21 Feb, 2018

1 commit

  • We take net_mutex when there are !async pernet_operations
    registered, and read locking of net_sem is not enough. But
    we may get rid of taking the mutex, and just change the logic
    to write lock net_sem in such cases. This obviously reduces
    the number of lock operations we do.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

13 Feb, 2018

1 commit

  • Currently, the mutex is mostly used to protect pernet operations
    list. It orders setup_net() and cleanup_net() with parallel
    {un,}register_pernet_operations() calls, so ->exit{,batch} methods
    of the same pernet operations are executed for a dying net, as
    were used to call ->init methods, even after the net namespace
    is unlinked from net_namespace_list in cleanup_net().

    But there are several problems with scalability. The first one
    is that more than one net can't be created or destroyed
    at the same moment on the node. For big machines with many cpus
    running many containers this is a significant limitation.

    The second one is that synchronize_rcu() is needed after a net
    is removed from net_namespace_list:

    Destroy net_ns:
    cleanup_net()
    mutex_lock(&net_mutex)
    list_del_rcu(&net->list)
    synchronize_rcu()
    Acked-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

01 Feb, 2018

1 commit

  • Pull networking updates from David Miller:

    1) Significantly shrink the core networking routing structures. Result
    of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

    2) Add netdevsim driver for testing various offloads, from Jakub
    Kicinski.

    3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

    4) Add a 2nd listener hash table for TCP, similar to what was done for
    UDP. From Martin KaFai Lau.

    5) Add eBPF based queue selection to tun, from Jason Wang.

    6) Lockless qdisc support, from John Fastabend.

    7) SCTP stream interleave support, from Xin Long.

    8) Smoother TCP receive autotuning, from Eric Dumazet.

    9) Lots of erspan tunneling enhancements, from William Tu.

    10) Add true function call support to BPF, from Alexei Starovoitov.

    11) Add explicit support for GRO HW offloading, from Michael Chan.

    12) Support extack generation in more netlink subsystems. From Alexander
    Aring, Quentin Monnet, and Jakub Kicinski.

    13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
    Russell King.

    14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

    15) Many improvements and simplifications to the NFP driver bpf JIT,
    from Jakub Kicinski.

    16) Support for ipv6 non-equal cost multipath routing, from Ido
    Schimmel.

    17) Add resource abstraction to devlink, from Arkadi Sharshevsky.

    18) Packet scheduler classifier shared filter block support, from Jiri
    Pirko.

    19) Avoid locking in act_csum, from Davide Caratti.

    20) devinet_ioctl() simplifications from Al Viro.

    21) More TCP bpf improvements from Lawrence Brakmo.

    22) Add support for onlink ipv6 route flag, similar to ipv4, from David
    Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
    tls: Add support for encryption using async offload accelerator
    ip6mr: fix stale iterator
    net/sched: kconfig: Remove blank help texts
    openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
    tcp_nv: fix potential integer overflow in tcpnv_acked
    r8169: fix RTL8168EP take too long to complete driver initialization.
    qmi_wwan: Add support for Quectel EP06
    rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
    ipmr: Fix ptrdiff_t print formatting
    ibmvnic: Wait for device response when changing MAC
    qlcnic: fix deadlock bug
    tcp: release sk_frag.page in tcp_disconnect
    ipv4: Get the address of interface correctly.
    net_sched: gen_estimator: fix lockdep splat
    net: macb: Handle HRESP error
    net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
    ipv6: addrconf: break critical section in addrconf_verify_rtnl()
    ipv6: change route cache aging logic
    i40e/i40evf: Update DESC_NEEDED value to reflect larger value
    bnxt_en: cleanup DIM work on device shutdown
    ...

    Linus Torvalds
     

30 Jan, 2018

1 commit


27 Dec, 2017

1 commit

  • ASSERT_RTNL() macro is actually an open-coded variant of WARN_ONCE() with
    two exceptions. First, it prints the stack for multiple hits and not only
    once as WARN_ONCE() does. Second, the user can disable prints of
    WARN_ONCE by setting CONFIG_BUG to N.

    The multiple prints of dump stack are actually not needed, because calls
    without rtnl lock are programming errors and user can't do anything
    about them except to complain to the mailing list after first occurrence
    of such failure.

    The user who disabled BUG/WARN prints did it explicitly because by default
    in upstream kernel and distributions this option is enabled. It means
    that user doesn't want to see prints about missing locks too.

    This patch replaces the open-coded variant with the already existing
    macro and changes error prints to be once only.

    Reviewed-by: Mark Bloch
    Signed-off-by: Leon Romanovsky
    Signed-off-by: David S. Miller

    Leon Romanovsky
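
    The "warn once" behaviour discussed above can be sketched in userspace
    as follows. The macro names WARN_ONCE_SKETCH and ASSERT_LOCK_SKETCH, the
    big_lock_held flag and the warnings_printed counter are all hypothetical;
    the kernel's real WARN_ONCE() does considerably more (stack dump,
    tainting, CONFIG_BUG handling).

```c
#include <stdbool.h>
#include <stdio.h>

int warnings_printed;   /* instrumentation for the example only */
bool big_lock_held;     /* hypothetical stand-in for "is the lock held?" */

/* Print a warning the first time 'cond' is true at this call site. */
#define WARN_ONCE_SKETCH(cond, msg)                         \
    do {                                                    \
        static bool warned;                                 \
        if ((cond) && !warned) {                            \
            warned = true;                                  \
            warnings_printed++;                             \
            fprintf(stderr, "WARNING: %s at %s (%d)\n",     \
                    (msg), __FILE__, __LINE__);             \
        }                                                   \
    } while (0)

/* The lock assertion becomes a one-line wrapper around the macro. */
#define ASSERT_LOCK_SKETCH() \
    WARN_ONCE_SKETCH(!big_lock_held, "lock assertion failed")

/* A function that must only be called with the lock held. */
void needs_lock(void)
{
    ASSERT_LOCK_SKETCH();
}
```

    Calling needs_lock() repeatedly without the lock prints a single
    warning, because the static flag is per call site.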
     

05 Dec, 2017

1 commit


16 Nov, 2017

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Maintain the TCP retransmit queue using an rbtree, with 1GB
    windows at 100Gb this really has become necessary. From Eric
    Dumazet.

    2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

    3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
    Lunn.

    4) Add meter action support to openvswitch, from Andy Zhou.

    5) Add a data meta pointer for BPF accessible packets, from Daniel
    Borkmann.

    6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

    7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

    8) More work to move the RTNL mutex down, from Florian Westphal.

    9) Add 'bpftool' utility, to help with bpf program introspection.
    From Jakub Kicinski.

    10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
    Dangaard Brouer.

    11) Support 'blocks' of transformations in the packet scheduler which
    can span multiple network devices, from Jiri Pirko.

    12) TC flower offload support in cxgb4, from Kumar Sanghvi.

    13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
    Leitner.

    14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

    15) Add RED qdisc offloadability, and use it in mlxsw driver. From
    Nogah Frankel.

    16) eBPF based device controller for cgroup v2, from Roman Gushchin.

    17) Add some fundamental tracepoints for TCP, from Song Liu.

    18) Remove garbage collection from ipv6 route layer, this is a
    significant accomplishment. From Wei Wang.

    19) Add multicast route offload support to mlxsw, from Yotam Gigi"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
    tcp: highest_sack fix
    geneve: fix fill_info when link down
    bpf: fix lockdep splat
    net: cdc_ncm: GetNtbFormat endian fix
    openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
    netem: remove unnecessary 64 bit modulus
    netem: use 64 bit divide by rate
    tcp: Namespace-ify sysctl_tcp_default_congestion_control
    net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
    ipv6: set all.accept_dad to 0 by default
    uapi: fix linux/tls.h userspace compilation error
    usbnet: ipheth: prevent TX queue timeouts when device not ready
    vhost_net: conditionally enable tx polling
    uapi: fix linux/rxrpc.h userspace compilation errors
    net: stmmac: fix LPI transitioning for dwmac4
    atm: horizon: Fix irq release error
    net-sysfs: trigger netlink notification on ifalias change via sysfs
    openvswitch: Using kfree_rcu() to simplify the code
    openvswitch: Make local function ovs_nsh_key_attr_size() static
    openvswitch: Fix return value check in ovs_meter_cmd_features()
    ...

    Linus Torvalds
     

07 Nov, 2017

1 commit


04 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

25 Oct, 2017

1 commit

  • For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't currently harmful.

    However, for some features it is necessary to instrument reads and
    writes separately, which is not possible with ACCESS_ONCE(). This
    distinction is critical to correct operation.

    It's possible to transform the bulk of kernel code using the Coccinelle
    script below. However, this doesn't handle comments, leaving references
    to ACCESS_ONCE() instances which have been removed. As a preparatory
    step, this patch converts netlink and netfilter code and comments to use
    {READ,WRITE}_ONCE() consistently.

    ----
    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

    Signed-off-by: Mark Rutland
    Signed-off-by: Paul E. McKenney
    Cc: David S. Miller
    Cc: Florian Westphal
    Cc: Jozsef Kadlecsik
    Cc: Linus Torvalds
    Cc: Pablo Neira Ayuso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-7-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
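
    To show what the conversion looks like at a call site, here is a
    simplified userspace sketch. READ_ONCE_SKETCH and WRITE_ONCE_SKETCH are
    hypothetical, cut-down versions built on a volatile cast and the GNU
    __typeof__ extension; the kernel's real macros are more elaborate.
    shared_flag, publish() and observe() are illustrative names.

```c
/* Simplified stand-ins: force exactly one load or store of 'x'. */
#define READ_ONCE_SKETCH(x) \
    (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, val) \
    (*(volatile __typeof__(x) *)&(x) = (val))

int shared_flag;

/* Before the conversion this would read: ACCESS_ONCE(shared_flag) = v; */
void publish(int v)
{
    WRITE_ONCE_SKETCH(shared_flag, v);
}

/* Before the conversion this would read: return ACCESS_ONCE(shared_flag); */
int observe(void)
{
    return READ_ONCE_SKETCH(shared_flag);
}
```

    Separate read and write macros are what allow tooling to instrument the
    two kinds of access independently.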
     

05 Oct, 2017

1 commit

  • x-netns interfaces are bound to two netns: the link netns and the upper
    netns. Usually, this kind of interface is created in the link netns and
    then moved to the upper netns. At the end, the interface is visible only
    in the upper netns. The link nsid is advertised via netlink in the upper
    netns, thus the user always knows where the link part is.

    There is no such mechanism in the link netns. When the interface is moved
    to another netns, the user cannot "follow" it.
    This patch adds a new netlink attribute which helps to follow an interface
    which moves to another netns. When the interface is unregistered, the new
    nsid is advertised. If the interface is an x-netns interface (i.e.
    rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.

    CC: Jason A. Donenfeld
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

28 May, 2017

1 commit

  • When netdev events happen, a rtnetlink_event() handler will send
    messages for every event in its white list. These messages contain
    current information about a particular device, but they do not include
    the information about which event just happened. So, it is impossible
    to tell what just happened for these events.

    This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
    that would have an encoding of the event that triggered this
    message. This would allow the message consumer to easily determine
    if it needs to perform certain actions.

    Signed-off-by: Vladislav Yasevich
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

02 Sep, 2016

1 commit

  • fdb dumps spanning multiple skb's currently restart from the first
    interface again for every skb. This results in unnecessary
    iterations on the already visited interfaces and their fdb
    entries. In large scale setups, we have seen this to slow
    down fdb dumps considerably. On a system with 30k macs we
    see fdb dumps spanning across more than 300 skbs.

    To fix the problem, this patch replaces the existing single fdb
    marker with three markers: netdev hash entries, netdevs and fdb
    index to continue where we left off instead of restarting from the
    first netdev. This is consistent with link dumps.

    In the process of fixing the performance issue, this patch also
    re-implements fix done by
    commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
    (with an internal fix from Wilson Kok) in the following ways:
    - change ndo_fdb_dump handlers to return error code instead
    of the last fdb index
    - use cb->args strictly for dump frag markers and not error codes.
    This is consistent with other dump functions.

    Below results were taken on a system with 1000 netdevs
    and 35085 fdb entries:
    before patch:
    $time bridge fdb show | wc -l
    15065

    real 1m11.791s
    user 0m0.070s
    sys 1m8.395s

    (existing code does not return all macs)

    after patch:
    $time bridge fdb show | wc -l
    35085

    real 0m2.017s
    user 0m0.113s
    sys 0m1.942s

    Signed-off-by: Roopa Prabhu
    Signed-off-by: Wilson Kok
    Signed-off-by: David S. Miller

    Roopa Prabhu
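
    The resume-marker idea can be sketched in a few lines of userspace C.
    This is a reduced model with two markers (device index and entry index)
    instead of the three described above, and everything in it is
    hypothetical: NDEVS, entries_per_dev, struct dump_cursor and dump_some()
    are illustrative names, and an int array stands in for an skb.

```c
#define NDEVS 3

/* Hypothetical per-device entry counts standing in for fdb tables. */
static const int entries_per_dev[NDEVS] = { 4, 0, 5 };

/* Resume markers kept between calls (cb->args plays this role in a dump). */
struct dump_cursor {
    int dev;    /* device to resume at */
    int entry;  /* entry within that device to resume at */
};

/*
 * Write up to 'room' entries into out[] and return how many were
 * written. Each entry is encoded as dev * 100 + index. When the
 * buffer fills up, the cursor is left pointing at the first entry
 * that was not written, so the next call continues from there
 * instead of restarting at device 0.
 */
int dump_some(struct dump_cursor *cur, int *out, int room)
{
    int written = 0;

    for (; cur->dev < NDEVS; cur->dev++, cur->entry = 0) {
        for (; cur->entry < entries_per_dev[cur->dev]; cur->entry++) {
            if (written == room)
                return written;
            out[written++] = cur->dev * 100 + cur->entry;
        }
    }
    return written;
}
```

    Calling dump_some() in a loop until it returns 0 visits every entry
    exactly once, regardless of the buffer size.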
     

16 Jun, 2016

1 commit

  • qdisc are changed under RTNL protection and often
    while blocking BH and root qdisc spinlock.

    When lots of skbs need to be dropped, we free
    them under these locks causing TX/RX freezes,
    and more generally latency spikes.

    This commit adds rtnl_kfree_skbs(), used to queue
    skbs for deferred freeing.

    Actual freeing happens right after RTNL is released,
    with appropriate scheduling points.

    rtnl_qdisc_drop() can also be used in place
    of qdisc_drop() when RTNL is held.

    qdisc_reset_queue() and __qdisc_reset_queue() get
    the new behavior, so standard qdiscs like pfifo, pfifo_fast...
    have their ->reset() method automatically handled.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
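
    The deferred-free scheme can be modelled in userspace as below. This is
    a single-threaded sketch under stated assumptions: lock_held is a plain
    flag rather than a real mutex, struct buf stands in for an skb, and
    big_lock(), defer_free(), big_unlock() and freed_count are hypothetical
    names.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a packet buffer. */
struct buf {
    struct buf *next;
};

static struct buf *defer_list;  /* buffers waiting to be freed */
int freed_count;                /* instrumentation for the example only */
int lock_held;                  /* models the big lock as a flag */

void big_lock(void)
{
    lock_held = 1;
}

/* Called with the lock held: queue the buffer instead of freeing it. */
void defer_free(struct buf *b)
{
    b->next = defer_list;
    defer_list = b;
}

/*
 * Detach the deferred list, release the lock, and only then do the
 * potentially long freeing work outside the critical section.
 */
void big_unlock(void)
{
    struct buf *head = defer_list;

    defer_list = NULL;
    lock_held = 0;

    while (head) {
        struct buf *next = head->next;

        free(head);
        freed_count++;
        head = next;
    }
}
```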
     

11 Jan, 2016

1 commit

  • This work adds a generalization of the ingress qdisc as a qdisc holding
    only classifiers. The clsact qdisc works on ingress, but also on egress.
    In both cases, its execution happens without taking the qdisc lock, and
    the main difference for the egress part compared to prior version of [1]
    is that this can be applied with _any_ underlying real egress qdisc (also
    classless ones).

    Besides solving the use-case of [1], that is, allowing for more programmability
    on assigning skb->priority for the mqprio case that is supported by most
    popular 10G+ NICs, it also opens up a lot more flexibility for other tc
    applications. The main work on classification can already be done at clsact
    egress time if the use-case allows and state stored for later retrieval
    f.e. again in skb->priority with major/minors (which is checked by most
    classful qdiscs before consulting tc_classify()) and/or in other skb fields
    like skb->tc_index for some light-weight post-processing to get to the
    eventual classid in case of a classful qdisc. Another use case is that
    the clsact egress part allows to have a central egress counterpart to
    the ingress classifiers, so that classifiers can easily share state (e.g.
    in cls_bpf via eBPF maps) for ingress and egress.

    Currently, default setups like mq + pfifo_fast would require for this to
    use, for example, prio qdisc instead (to get a tc_classify() run) and to
    duplicate the egress classifier for each queue. With clsact, it allows
    for leaving the setup as is, it can additionally assign skb->priority to
    put the skb in one of pfifo_fast's bands and it can share state with maps.
    Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
    w/o the need to perform a skb_dst_force() to hold on to it any longer. In
    lwt case, we can also use this facility to setup dst metadata via cls_bpf
    (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
    that (case of IFF_NO_QUEUE devices, for example).

    The realization can be done without any changes to the scheduler core
    framework. All it takes is that we have two a-priori defined minors/child
    classes, where we can mux between ingress and egress classifier list
    (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
    dev->_tx to avoid extra cacheline miss for moderate loads). The egress
    part is a bit similar modelled to handle_ing() and patched to a noop in
    case the functionality is not used. Both handlers are now called
    sch_handle_ingress() and sch_handle_egress(), code sharing among the two
    doesn't seem practical as there are various minor differences in both
    paths, so that making them conditional in a single handler would rather
    slow things down.

    Full compatibility to ingress qdisc is provided as well. Since both
    piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
    per netdevice, and thus ingress qdisc specific behaviour can be retained
    for user space. This means, either a user does 'tc qdisc add dev foo ingress'
    and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
    alternative, where both, ingress and egress classifier can be configured
    as in the below example. ingress qdisc supports attaching classifier to any
    minor number whereas clsact has two fixed minors for muxing between the
    lists, therefore to not break user space setups, they are better done as
    two separate qdiscs.

    I decided to extend the sch_ingress module with clsact functionality so
    that commonly used code can be reused, the module is being aliased with
    sch_clsact so that it can be auto-loaded properly. Alternative would have been
    to add a flag when initializing ingress to alter its behaviour plus aliasing
    to a different name (as it's more than just ingress). However, the first would
    end up, based on the flag, choosing the new/old behaviour by calling different
    function implementations to handle each anyway, the latter would require to
    register ingress qdisc once again under different alias. So, this really begs
    to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
    by its own that share callbacks used by both.

    Example, adding qdisc:

    # tc qdisc add dev foo clsact
    # tc qdisc show dev foo
    qdisc mq 0: root
    qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc clsact ffff: parent ffff:fff1

    Adding filters (deleting, etc works analogous by specifying ingress/egress):

    # tc filter add dev foo ingress bpf da obj bar.o sec ingress
    # tc filter add dev foo egress bpf da obj bar.o sec egress
    # tc filter show dev foo ingress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
    # tc filter show dev foo egress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

    A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
    show an empty list for clsact. Either using the parent names (ingress/egress)
    or specifying the full major/minor will then show the related filter lists.

    Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

    [1] http://patchwork.ozlabs.org/patch/512949/

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Oct, 2015

1 commit

  • This patch makes lockdep_rtnl_is_held return bool due to this
    particular function only using either one or zero as its return
    value.

    In another patch lockdep_is_held is also made return bool.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: David S. Miller

    Yaowei Bai
     

23 Jun, 2015

1 commit

  • One more missing piece of the puzzle. Add vlan dump support to switchdev
    port's bridge_getlink. iproute2 "bridge vlan show" cmd already knows how
    to show the vlans installed on the bridge and the device, but (until now)
    no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK
    msg. Before this patch, "bridge vlan show":

    $ bridge -c vlan show
    port vlan ids
    sw1p1 30-34 << bridge side vlans
    57

    sw1p1 << device side vlans (missing)

    sw1p2 57

    sw1p2

    sw1p3

    sw1p4

    br0 None

    (When the port is bridged, the output repeats the vlan list for the vlans
    on the bridge side of the port and the vlans on the device side of the
    port. The listing above shows no vlans for the device side even though they
    are installed).

    After this patch:

    $ bridge -c vlan show
    port vlan ids
    sw1p1 30-34 << bridge side vlan
    57

    sw1p1 30-34 << device side vlans
    57
    3840 PVID

    sw1p2 57

    sw1p2 57
    3840 PVID

    sw1p3 3842 PVID

    sw1p4 3843 PVID

    br0 None

    I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func.
    switchdev support adds an obj dump for VLAN objects, using the same
    call-back scheme as FDB dump. Support included for both compressed and
    un-compressed vlan dumps.

    Signed-off-by: Scott Feldman
    Signed-off-by: David S. Miller

    Scott Feldman
     

14 May, 2015

2 commits

  • This new config switch enables the ingress filtering infrastructure that is
    controlled through the ingress_needed static key. This prepares the
    introduction of the Netfilter ingress hook that resides under this unique
    static key.

    Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
    problem since this also depends on CONFIG_NET_CLS_ACT.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • This fixes 4577139b2dabf589 ("net: use jump label patching for ingress qdisc in
    __netif_receive_skb_core").

    The only client of this is sch_ingress and it depends on NET_CLS_ACT. So
    there is no way these definitions can be of any help.

    Cc: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     

30 Apr, 2015

1 commit

  • NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
    it is sent only at the end of a dump.

    Libraries like libnl will wait forever for NLMSG_DONE.

    Fixes: e5a55a898720 ("net: create generic bridge ops")
    Fixes: 815cccbf10b2 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
    CC: John Fastabend
    CC: Sathya Perla
    CC: Subbu Seetharaman
    CC: Ajit Khaparde
    CC: Jeff Kirsher
    CC: intel-wired-lan@lists.osuosl.org
    CC: Jiri Pirko
    CC: Scott Feldman
    CC: Stephen Hemminger
    CC: bridge@lists.linux-foundation.org
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

14 Apr, 2015

1 commit

  • Even if we make use of classifier and actions from the egress
    path, we're going into handle_ing() executing additional code
    on a per-packet cost for ingress qdisc, just to realize that
    nothing is attached on ingress.

    Instead, this can just be blinded out as a no-op entirely with
    the use of a static key. On input fast-path, we already make
    use of static keys in various places, e.g. skb time stamping,
    in RPS, etc. It makes sense to not waste time when we're assured
    that no ingress qdisc is attached anywhere.

    Enabling/disabling of that code path is being done via two
    helpers, namely net_{inc,dec}_ingress_queue(), that are being
    invoked under RTNL mutex when an ingress qdisc is being either
    initialized or destructed.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
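
    The enable/disable accounting can be illustrated with a small userspace
    model. It deliberately does not model code patching: a real static key
    rewrites the branch itself, whereas this sketch tests an ordinary
    counter. inc_ingress_queue(), dec_ingress_queue(), receive_packet() and
    slow_path_runs are hypothetical names.

```c
/* Number of attached users; a plain counter standing in for a static key. */
static int ingress_users;
int slow_path_runs;     /* instrumentation for the example only */

/* Called when an ingress user is attached. */
void inc_ingress_queue(void)
{
    ingress_users++;
}

/* Called when an ingress user is detached. */
void dec_ingress_queue(void)
{
    ingress_users--;
}

/* The extra per-packet work we want to skip when nobody needs it. */
static void handle_ingress(void)
{
    slow_path_runs++;
}

/* Fast path: the ingress handler runs only while at least one user exists. */
void receive_packet(void)
{
    if (ingress_users > 0)
        handle_ingress();
}
```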
     

10 Dec, 2014

1 commit

  • The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
    until after ndo_uninit") tried to do this earlier but while doing so
    it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
    delayed the call to fill_info(). So this translated into asking the driver
    to remove private state and then querying its private state. This
    could have catastrophic consequences.

    This change breaks the rtmsg_ifinfo() into two parts - one takes the
    precise snapshot of the device by calling fill_info() before calling
    the ndo_uninit() and the second part sends the notification using
    the collected snapshot.

    It was brought to notice when the last link is deleted from an ipvlan device
    after it has freed the port and the subsequent .fill_info() call is
    trying to get the info from the port.

    kernel: [ 255.139429] ------------[ cut here ]------------
    kernel: [ 255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
    kernel: [ 255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
    kernel: [ 255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
    kernel: [ 255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
    kernel: [ 255.139515] 0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
    kernel: [ 255.139516] 0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
    kernel: [ 255.139518] 00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
    kernel: [ 255.139520] Call Trace:
    kernel: [ 255.139527] dump_stack+0x46/0x58
    kernel: [ 255.139531] warn_slowpath_common+0x8c/0xc0
    kernel: [ 255.139540] warn_slowpath_null+0x1a/0x20
    kernel: [ 255.139544] rtmsg_ifinfo+0x100/0x110
    kernel: [ 255.139547] rollback_registered_many+0x1d5/0x2d0
    kernel: [ 255.139549] unregister_netdevice_many+0x1f/0xb0
    kernel: [ 255.139551] rtnl_dellink+0xbb/0x110
    kernel: [ 255.139553] rtnetlink_rcv_msg+0xa0/0x240
    kernel: [ 255.139557] ? rhashtable_lookup_compare+0x43/0x80
    kernel: [ 255.139558] ? __rtnl_unlock+0x20/0x20
    kernel: [ 255.139562] netlink_rcv_skb+0xb1/0xc0
    kernel: [ 255.139563] rtnetlink_rcv+0x25/0x40
    kernel: [ 255.139565] netlink_unicast+0x178/0x230
    kernel: [ 255.139567] netlink_sendmsg+0x30f/0x420
    kernel: [ 255.139571] sock_sendmsg+0x9c/0xd0
    kernel: [ 255.139575] ? rw_copy_check_uvector+0x6f/0x130
    kernel: [ 255.139577] ? copy_msghdr_from_user+0x139/0x1b0
    kernel: [ 255.139578] ___sys_sendmsg+0x304/0x310
    kernel: [ 255.139581] ? handle_mm_fault+0xca3/0xde0
    kernel: [ 255.139585] ? destroy_inode+0x3c/0x70
    kernel: [ 255.139589] ? __do_page_fault+0x20c/0x500
    kernel: [ 255.139597] ? dput+0xb6/0x190
    kernel: [ 255.139606] ? mntput+0x26/0x40
    kernel: [ 255.139611] ? __fput+0x174/0x1e0
    kernel: [ 255.139613] __sys_sendmsg+0x49/0x90
    kernel: [ 255.139615] SyS_sendmsg+0x12/0x20
    kernel: [ 255.139617] system_call_fastpath+0x12/0x17
    kernel: [ 255.139619] ---[ end trace 5e6703e87d984f6b ]---

    Signed-off-by: Mahesh Bandewar
    Reported-by: Toshiaki Makita
    Cc: Eric Dumazet
    Cc: Roopa Prabhu
    Cc: David S. Miller
    Acked-by: Eric Dumazet
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Mahesh Bandewar
     

03 Dec, 2014

2 commits

  • To allow a brport device to return the current brport flags set on the
    port, add the returned flags to the nested IFLA_PROTINFO netlink msg built
    in dflt getlink. With this change, the netlink msg returned for
    bridge_getlink contains the port's offloaded flag settings (the port's
    SELF settings).

    Signed-off-by: Scott Feldman
    Signed-off-by: Jiri Pirko
    Acked-by: Andy Gospodarek
    Acked-by: Thomas Graf
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Scott Feldman
     
  • Do the work of parsing NDA_VLAN directly in rtnetlink code and pass a
    simple u16 vid to drivers from there.
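
    A rough userspace sketch of what "parse once, hand drivers a u16"
    could look like. The sketch_attr type, parse_vid() and the
    SKETCH_VLAN_N_VID constant are hypothetical and not the kernel's
    netlink API. The accepted range of 1 to 4094 is an assumption of this
    sketch, based on VLAN IDs 0 and 4095 being reserved in 802.1Q.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_VLAN_N_VID 4096  /* size of the 12-bit VLAN ID space */

/* Hypothetical, simplified attribute: a payload length plus a pointer. */
struct sketch_attr {
    size_t len;
    const void *data;
};

/*
 * Validate the optional VLAN attribute in one place and return a plain
 * 16-bit VLAN ID. A missing attribute yields vid 0, meaning "no VLAN".
 * On error, *vid is left untouched.
 */
static int parse_vid(const struct sketch_attr *attr, uint16_t *vid)
{
    uint16_t v = 0;

    if (attr) {
        if (attr->len != sizeof(uint16_t))
            return -EINVAL;     /* wrong payload size */
        memcpy(&v, attr->data, sizeof(v));
        if (v == 0 || v >= SKETCH_VLAN_N_VID - 1)
            return -EINVAL;     /* 0 and 4095 are reserved */
    }
    *vid = v;
    return 0;
}
```

    With the check centralized like this, a driver callback only needs a
    uint16_t parameter and does not have to repeat the length and range
    validation.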

    Signed-off-by: Jiri Pirko
    Acked-by: Andy Gospodarek
    Acked-by: Jamal Hadi Salim
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jiri Pirko
     

14 Sep, 2014

1 commit

  • Make cls_tcindex RCU safe.

    This patch adds a new RCU routine, rcu_dereference_bh_rtnl(), to check
    that the caller holds either the rcu read lock or RTNL. This is needed
    to handle the case where tcindex_lookup() is being called in both cases.
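
    The "either lock is enough" check can be modelled in userspace as
    follows. This is a sketch under stated assumptions: the two boolean
    flags stand in for lockdep's knowledge of which locks are held, and
    every sketch_ name and the tcindex_like struct are hypothetical. A
    counter replaces the warning splat the kernel would print.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock-state flags standing in for lockdep's bookkeeping. */
static bool sketch_rcu_bh_held;
static bool sketch_rtnl_held;
static int sketch_violations;   /* number of unprotected dereferences seen */

/* Yield p, recording a violation when neither protection is held. */
#define sketch_dereference_bh_rtnl(p) \
    ((void)((sketch_rcu_bh_held || sketch_rtnl_held) ? 0 : ++sketch_violations), (p))

struct tcindex_like {
    int hash;
};

static struct tcindex_like *sketch_root;

/* May be called from a reader (RCU side) or from a writer (RTNL side). */
static int lookup_hash(void)
{
    struct tcindex_like *t = sketch_dereference_bh_rtnl(sketch_root);

    return t ? t->hash : -1;
}
```

    The same lookup_hash() passes the check from either context, and is
    flagged only when called with neither protection.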

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     

11 Jul, 2014

1 commit


16 May, 2014

1 commit

  • From: Cong Wang

    commit 50624c934db18ab90 (net: Delay default_device_exit_batch until no
    devices are unregistering) introduced rtnl_lock_unregistering() for
    default_device_exit_batch(). The same race could happen when we rmmod a
    driver which calls rtnl_link_unregister(), as we call dev->destructor
    without the rtnl lock.

    For the long term, I think we should clean up the mess of
    netdev_run_todo() and the net namespace exit code.

    Cc: Eric W. Biederman
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

18 Dec, 2013

1 commit

  • It is useful to be able to walk all upper devices when bringing
    a device online where the RTNL lock is held. In this case it
    is safe to walk the all_adj_list because the RTNL lock is used
    to protect the write side as well.

    This patch adds a check to see if the rtnl lock is held before
    throwing a warning in netdev_all_upper_get_next_dev_rcu().

    Also, because we now have a call site for lockdep_rtnl_is_held()
    outside CONFIG_PROVE_LOCKING, an inline definition returning 1 is
    needed, similar to rcu_read_lock_is_held().
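
    The stub pattern mentioned above looks roughly like this in userspace
    C. It is a sketch only: SKETCH_PROVE_LOCKING and the sketch_ functions
    are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

#ifdef SKETCH_PROVE_LOCKING
/* Checking enabled: report the tracked lock state. */
static bool sketch_rtnl_tracked;
static inline int sketch_rtnl_is_held(void)
{
    return sketch_rtnl_tracked;
}
#else
/* Checking disabled: always claim the lock is held, so callers never warn. */
static inline int sketch_rtnl_is_held(void)
{
    return 1;
}
#endif

/* A walk is accepted if the RCU read side or the rtnl lock protects it. */
static bool sketch_walk_is_safe(bool rcu_read_held)
{
    return rcu_read_held || sketch_rtnl_is_held();
}
```

    Returning 1 from the stub keeps call sites free of #ifdef blocks: with
    checking compiled out, the condition is always satisfied and no
    warning can fire.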

    Fixes: 2a47fa45d4df ("ixgbe: enable l2 forwarding acceleration for macvlans")
    CC: Veaceslav Falico
    Reported-by: Yuanhan Liu
    Signed-off-by: John Fastabend
    Tested-by: Phil Schmitt
    Signed-off-by: Jeff Kirsher

    John Fastabend
     

26 Oct, 2013

1 commit

  • commit 991fb3f74c "dev: always advertise rx_flags changes via netlink"
    introduced rtnl notification from __dev_set_promiscuity(),
    which can be called in atomic context.

    Steps to reproduce:
    ip tuntap add dev tap1 mode tap
    ifconfig tap1 up
    tcpdump -nei tap1 &
    ip tuntap del dev tap1 mode tap

    [ 271.627994] device tap1 left promiscuous mode
    [ 271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940
    [ 271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip
    [ 271.677525] INFO: lockdep is turned off.
    [ 271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G W 3.12.0-rc3+ #73
    [ 271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
    [ 271.731254] ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428
    [ 271.760261] ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8
    [ 271.790683] 0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8
    [ 271.822332] Call Trace:
    [ 271.838234] dump_stack+0x55/0x76
    [ 271.854446] __might_sleep+0x181/0x240
    [ 271.870836] ? rcu_irq_exit+0x68/0xb0
    [ 271.887076] kmem_cache_alloc_node+0x4e/0x2a0
    [ 271.903368] ? vprintk_emit+0x1dc/0x5a0
    [ 271.919716] ? __alloc_skb+0x57/0x2a0
    [ 271.936088] ? vprintk_emit+0x1e0/0x5a0
    [ 271.952504] __alloc_skb+0x57/0x2a0
    [ 271.968902] rtmsg_ifinfo+0x52/0x100
    [ 271.985302] __dev_notify_flags+0xad/0xc0
    [ 272.001642] __dev_set_promiscuity+0x8c/0x1c0
    [ 272.017917] ? packet_notifier+0x5/0x380
    [ 272.033961] dev_set_promiscuity+0x29/0x50
    [ 272.049855] packet_dev_mc+0x87/0xc0
    [ 272.065494] packet_notifier+0x1b2/0x380
    [ 272.080915] ? packet_notifier+0x5/0x380
    [ 272.096009] notifier_call_chain+0x66/0x150
    [ 272.110803] __raw_notifier_call_chain+0xe/0x10
    [ 272.125468] raw_notifier_call_chain+0x16/0x20
    [ 272.139984] call_netdevice_notifiers_info+0x40/0x70
    [ 272.154523] call_netdevice_notifiers+0x16/0x20
    [ 272.168552] rollback_registered_many+0x145/0x240
    [ 272.182263] rollback_registered+0x31/0x40
    [ 272.195369] unregister_netdevice_queue+0x58/0x90
    [ 272.208230] __tun_detach+0x140/0x340
    [ 272.220686] tun_chr_close+0x36/0x60

    Signed-off-by: Alexei Starovoitov
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

08 Mar, 2013

1 commit

  • If the driver does not support the ndo_op, use the generic
    handler for it. This should work in the majority of cases.
    Eventually the fdb_dflt_add call gets translated into a
    __dev_set_rx_mode() call, which should handle hardware
    support for filtering via the IFF_UNICAST_FLT flag.

    Namely, IFF_UNICAST_FLT indicates whether the hardware can do
    unicast address filtering. If no support is available,
    the device is put into promisc mode.
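
    The fallback described above can be sketched as a small userspace
    model. The sketch_dev struct, the SKETCH_IFF_UNICAST_FLT value and
    sketch_set_rx_mode() are hypothetical simplifications. They illustrate
    the decision only and are not the kernel's __dev_set_rx_mode().

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_IFF_UNICAST_FLT 0x1u  /* hypothetical flag value */

struct sketch_dev {
    unsigned int priv_flags;
    int uc_count;       /* extra unicast addresses requested */
    int promiscuity;    /* promiscuous-mode reference count */
    bool uc_promisc;    /* promisc was enabled only because of uc addresses */
};

static void sketch_set_rx_mode(struct sketch_dev *dev)
{
    /* Hardware can filter unicast addresses itself: nothing to fake. */
    if (dev->priv_flags & SKETCH_IFF_UNICAST_FLT)
        return;

    if (dev->uc_count > 0 && !dev->uc_promisc) {
        /* No filter support: fall back to receiving everything. */
        dev->promiscuity++;
        dev->uc_promisc = true;
    } else if (dev->uc_count == 0 && dev->uc_promisc) {
        /* Last extra address removed: drop the fallback again. */
        dev->promiscuity--;
        dev->uc_promisc = false;
    }
}
```

    Tracking uc_promisc separately means the fallback takes exactly one
    promiscuity reference, however many addresses are added.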

    Signed-off-by: Vlad Yasevich
    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

01 Nov, 2012

1 commit

  • This adds support for the net device ops to manage the embedded
    hardware bridge on ixgbe devices. With this patch the bridge
    mode can be toggled between VEB and VEPA to support stacking
    macvlan devices or using the embedded switch without any SW
    component in 802.1Qbg/br environments.

    Additionally, this adds source address pruning to the ixgbevf
    driver to prune any frames sent back from a reflective relay on
    the switch. This is required because the existing hardware does
    not support this. Without it, frames get pushed into the stack
    with the device's own source MAC, which is invalid per the
    802.1Qbg VEPA definition.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

13 Oct, 2012

1 commit


11 Jul, 2012

1 commit


28 Jun, 2012

1 commit