Eric Lee / smarc-fsl-linux-kernel

26 Sep, 2018

2 commits

3a7d0d07a net: sched: extend Qdisc with rcu

Currently, Qdisc API functions assume that users have rtnl lock taken. To
implement rtnl unlocked classifiers update interface, Qdisc API must be
extended with functions that do not require rtnl lock.

Extend Qdisc structure with rcu. Implement special version of put function
qdisc_put_unlocked() that is called without rtnl lock taken. This function
only takes rtnl lock if Qdisc reference counter reached zero and is
intended to be used as optimization.

Signed-off-by: Vlad Buslov
Acked-by: Jiri Pirko
Signed-off-by: David S. Miller

Vlad Buslov
2018-09-26 11:17:35 +0800
6f99528e9 net: core: netlink: add helper refcount dec and lock function

Rtnl lock is encapsulated in netlink and cannot be accessed by other
modules directly. This means that reference counted objects that rely on
rtnl lock cannot use it with the refcount helper function that atomically
decrements the reference count and obtains the mutex.

This patch implements a simple wrapper function around refcount_dec_and_lock
that obtains rtnl lock if the reference counter value reached 0.

Signed-off-by: Vlad Buslov
Acked-by: Jiri Pirko
Signed-off-by: David S. Miller

Vlad Buslov
2018-09-26 11:17:35 +0800
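
The "decrement and lock only on zero" idea from commit 6f99528e9 can be sketched in user space. This is a minimal illustration, not the kernel implementation: `big_lock`, `ref_dec_and_big_lock()` and `big_unlock()` are hypothetical names, a pthread mutex stands in for the rtnl lock, and C11 atomics stand in for `refcount_t`.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the rtnl mutex; not the kernel's lock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Drop one reference. Returns true with big_lock held only if the count
 * reached zero; otherwise returns false and the lock is not held.
 */
static bool ref_dec_and_big_lock(atomic_int *refs)
{
    int old = atomic_load(refs);

    /* Fast path: while the count is above 1, decrement without locking. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(refs, &old, old - 1))
            return false;
        /* On failure, old now holds the current value; re-check it. */
    }

    /* Slow path: the count may reach zero, so take the lock first. */
    pthread_mutex_lock(&big_lock);
    if (atomic_fetch_sub(refs, 1) != 1) {
        pthread_mutex_unlock(&big_lock);
        return false;
    }
    return true; /* caller tears the object down, then unlocks */
}

/* Matching unlock for callers that got true back. */
static void big_unlock(void)
{
    pthread_mutex_unlock(&big_lock);
}
```

A put function built on this, in the spirit of qdisc_put_unlocked(), would pay for the lock only on the final put and skip it for every earlier one.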

30 Mar, 2018

1 commit

f0b07bb15 net: Introduce net_rwsem to protect net_namespace_list

rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
there is no way to do that without the exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
so the iteration can't sleep. Also, sometimes it may be necessary
to really prevent net_namespace_list growth, so for_each_net_rcu()
does not fit there.

This patch introduces a new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows non-exclusive iterations over the net
namespaces list. It allows us to stop using rtnl_lock()
in several places (which is done in the next patches) and reduces
the time we hold rtnl_mutex. Here we just add the new lock,
while the explanations of why we can remove rtnl_lock() are
in the next patches.

Fine-grained locks are generally better than one big lock,
so let's do that with net_namespace_list, while the situation
allows that.

Signed-off-by: Kirill Tkhai
Signed-off-by: David S. Miller

Kirill Tkhai
2018-03-30 01:47:53 +0800

28 Mar, 2018

1 commit

4420bf21f net: Rename net_sem to pernet_ops_rwsem

net_sem is a vague name that does not say which area it covers, so it
is better to make the name more specific.

Rename it to pernet_ops_rwsem for better readability and
intelligibility.

Signed-off-by: Kirill Tkhai
Signed-off-by: David S. Miller

Kirill Tkhai
2018-03-28 01:18:09 +0800

17 Mar, 2018

1 commit

79ffdfc65 net: Add rtnl_lock_killable()

rtnl_lock() is a widely used mutex in the kernel. Some kernel code
does memory allocations under it. In case of memory deficit this
may invoke the OOM killer, but the problem is that a killed task can't
exit if it's waiting for the mutex. This may be a cause of deadlock
and panic.

This patch adds a new primitive, which responds to SIGKILL, and
can be used in places where we don't want to sleep
forever.

Signed-off-by: Kirill Tkhai
Signed-off-by: David S. Miller

Kirill Tkhai
2018-03-17 00:31:19 +0800

21 Feb, 2018

1 commit

19efbd93e net: Kill net_mutex

We take net_mutex when there are !async pernet_operations
registered, and read locking of net_sem is not enough. But
we may get rid of taking the mutex, and just change the logic
to write lock net_sem in such cases. This obviously reduces
the number of lock operations we do.

Signed-off-by: Kirill Tkhai
Signed-off-by: David S. Miller

Kirill Tkhai
2018-02-21 02:23:13 +0800

13 Feb, 2018

1 commit

1a57feb84 net: Introduce net_sem for protection of pernet_list

Currently, the mutex is mostly used to protect pernet operations
list. It orders setup_net() and cleanup_net() with parallel
{un,}register_pernet_operations() calls, so ->exit{,batch} methods
of the same pernet operations are executed for a dying net, as
were used to call ->init methods, even after the net namespace
is unlinked from net_namespace_list in cleanup_net().

But there are several problems with scalability. The first one
is that more than one net can't be created or destroyed
at the same moment on the node. For big machines with many cpus
running many containers this is very noticeable.

The second one is that synchronize_rcu() is needed after a net
is removed from net_namespace_list:

Destroy net_ns:
cleanup_net()
mutex_lock(&net_mutex)
list_del_rcu(&net->list)
synchronize_rcu()
Acked-by: Andrei Vagin
Signed-off-by: David S. Miller

Kirill Tkhai
2018-02-13 23:36:04 +0800

01 Feb, 2018

1 commit

b2fe5fa68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:

1) Significantly shrink the core networking routing structures. Result
of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf

2) Add netdevsim driver for testing various offloads, from Jakub
Kicinski.

3) Support cross-chip FDB operations in DSA, from Vivien Didelot.

4) Add a 2nd listener hash table for TCP, similar to what was done for
UDP. From Martin KaFai Lau.

5) Add eBPF based queue selection to tun, from Jason Wang.

6) Lockless qdisc support, from John Fastabend.

7) SCTP stream interleave support, from Xin Long.

8) Smoother TCP receive autotuning, from Eric Dumazet.

9) Lots of erspan tunneling enhancements, from William Tu.

10) Add true function call support to BPF, from Alexei Starovoitov.

11) Add explicit support for GRO HW offloading, from Michael Chan.

12) Support extack generation in more netlink subsystems. From Alexander
Aring, Quentin Monnet, and Jakub Kicinski.

13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
Russell King.

14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.

15) Many improvements and simplifications to the NFP driver bpf JIT,
from Jakub Kicinski.

16) Support for ipv6 non-equal cost multipath routing, from Ido
Schimmel.

17) Add resource abstraction to devlink, from Arkadi Sharshevsky.

18) Packet scheduler classifier shared filter block support, from Jiri
Pirko.

19) Avoid locking in act_csum, from Davide Caratti.

20) devinet_ioctl() simplifications from Al Viro.

21) More TCP bpf improvements from Lawrence Brakmo.

22) Add support for onlink ipv6 route flag, similar to ipv4, from David
Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
tls: Add support for encryption using async offload accelerator
ip6mr: fix stale iterator
net/sched: kconfig: Remove blank help texts
openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
tcp_nv: fix potential integer overflow in tcpnv_acked
r8169: fix RTL8168EP take too long to complete driver initialization.
qmi_wwan: Add support for Quectel EP06
rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
ipmr: Fix ptrdiff_t print formatting
ibmvnic: Wait for device response when changing MAC
qlcnic: fix deadlock bug
tcp: release sk_frag.page in tcp_disconnect
ipv4: Get the address of interface correctly.
net_sched: gen_estimator: fix lockdep splat
net: macb: Handle HRESP error
net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
ipv6: addrconf: break critical section in addrconf_verify_rtnl()
ipv6: change route cache aging logic
i40e/i40evf: Update DESC_NEEDED value to reflect larger value
bnxt_en: cleanup DIM work on device shutdown
...

Linus Torvalds
2018-02-01 06:31:10 +0800

30 Jan, 2018

1 commit

38e01b305 dev: advertise the new ifindex when the netns iface changes

The goal is to let the user follow an interface that moves to another
netns.

CC: Jiri Benc
CC: Christian Brauner
Signed-off-by: Nicolas Dichtel
Reviewed-by: Jiri Benc
Signed-off-by: David S. Miller

Nicolas Dichtel
2018-01-30 01:23:52 +0800

27 Dec, 2017

1 commit

66364bdf3 rtnetlink: Replace implementation of ASSERT_RTNL() macro with WARN_ONCE()

ASSERT_RTNL() macro is actually an open-coded variant of WARN_ONCE() with
two exceptions. First, it prints stack for multiple hits and not only
once as WARN_ONCE() does. Second, the user can disable prints of
WARN_ONCE by setting CONFIG_BUG to N.

The multiple prints of dump stack are actually not needed, because calls
without rtnl lock are programming errors and user can't do anything
about them except to complain to the mailing list after first occurrence
of such failure.

The user who disabled BUG/WARN prints did it explicitly because by default
in upstream kernel and distributions this option is enabled. It means
that user doesn't want to see prints about missing locks too.

This patch replaces the open-coded variant with the already existing
macro and changes the error prints to be once only.

Reviewed-by: Mark Bloch
Signed-off-by: Leon Romanovsky
Signed-off-by: David S. Miller

Leon Romanovsky
2017-12-27 01:30:02 +0800
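
The "warn only once per call site" behaviour that commit 66364bdf3 relies on can be sketched in user space as follows. This is an illustration and not the kernel's WARN_ONCE(): `WARN_ONCE_SKETCH`, `warnings_printed`, `big_lock_held` and `needs_big_lock()` are hypothetical names, and the macro uses a GNU C statement expression, so it assumes GCC or Clang.

```c
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed; /* kept only so the sketch is testable */

/*
 * Evaluates to the truth value of cond and prints msg the first time
 * cond is true at this call site; later hits stay silent.
 */
#define WARN_ONCE_SKETCH(cond, msg) ({               \
    static bool warned_;                             \
    bool hit_ = (cond);                              \
    if (hit_ && !warned_) {                          \
        warned_ = true;                              \
        warnings_printed++;                          \
        fprintf(stderr, "WARNING: %s\n", (msg));     \
    }                                                \
    hit_;                                            \
})

static bool big_lock_held; /* hypothetical stand-in for "is rtnl locked?" */

/* Hypothetical function that must be called with the lock held. */
static void needs_big_lock(void)
{
    (void)WARN_ONCE_SKETCH(!big_lock_held, "called without the lock");
}
```

Calling needs_big_lock() repeatedly without the lock prints a single warning, which matches the commit's argument that repeated stack dumps add nothing once the programming error has been reported.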

05 Dec, 2017

1 commit

1ba9c5e6c rtnetlink: Update now-misleading smp_read_barrier_depends() comment

Now that READ_ONCE() implies smp_read_barrier_depends(), update the
rtnl_dereference() header comment accordingly.

Signed-off-by: Paul E. McKenney
Cc: "David S. Miller"
Cc: Vladislav Yasevich
Cc: Mark Rutland
Cc: David Ahern
Cc: Vlad Yasevich

Paul E. McKenney
2017-12-05 02:52:54 +0800

16 Nov, 2017

1 commit

5bbcc0f59 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights:

1) Maintain the TCP retransmit queue using an rbtree, with 1GB
windows at 100Gb this really has become necessary. From Eric
Dumazet.

2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
Lunn.

4) Add meter action support to openvswitch, from Andy Zhou.

5) Add a data meta pointer for BPF accessible packets, from Daniel
Borkmann.

6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

8) More work to move the RTNL mutex down, from Florian Westphal.

9) Add 'bpftool' utility, to help with bpf program introspection.
From Jakub Kicinski.

10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
Dangaard Brouer.

11) Support 'blocks' of transformations in the packet scheduler which
can span multiple network devices, from Jiri Pirko.

12) TC flower offload support in cxgb4, from Kumar Sanghvi.

13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
Leitner.

14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

15) Add RED qdisc offloadability, and use it in mlxsw driver. From
Nogah Frankel.

16) eBPF based device controller for cgroup v2, from Roman Gushchin.

17) Add some fundamental tracepoints for TCP, from Song Liu.

18) Remove garbage collection from ipv6 route layer, this is a
significant accomplishment. From Wei Wang.

19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
tcp: highest_sack fix
geneve: fix fill_info when link down
bpf: fix lockdep splat
net: cdc_ncm: GetNtbFormat endian fix
openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
netem: remove unnecessary 64 bit modulus
netem: use 64 bit divide by rate
tcp: Namespace-ify sysctl_tcp_default_congestion_control
net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
ipv6: set all.accept_dad to 0 by default
uapi: fix linux/tls.h userspace compilation error
usbnet: ipheth: prevent TX queue timeouts when device not ready
vhost_net: conditionally enable tx polling
uapi: fix linux/rxrpc.h userspace compilation errors
net: stmmac: fix LPI transitioning for dwmac4
atm: horizon: Fix irq release error
net-sysfs: trigger netlink notification on ifalias change via sysfs
openvswitch: Using kfree_rcu() to simplify the code
openvswitch: Make local function ovs_nsh_key_attr_size() static
openvswitch: Fix return value check in ovs_meter_cmd_features()
...

Linus Torvalds
2017-11-16 03:56:19 +0800

07 Nov, 2017

1 commit

8c5db92a7 Merge branch 'linus' into locking/core, to resolve conflicts

Conflicts:
include/linux/compiler-clang.h
include/linux/compiler-gcc.h
include/linux/compiler-intel.h
include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar

Ingo Molnar
2017-11-07 17:32:44 +0800

04 Nov, 2017

1 commit

2a171788b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Files removed in 'net-next' had their license header updated
in 'net'. We take the remove from 'net-next'.

Signed-off-by: David S. Miller

David S. Miller
2017-11-04 08:26:51 +0800

02 Nov, 2017

1 commit

b24413180 License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2017-11-02 18:10:55 +0800

25 Oct, 2017

1 commit

14cd5d4a0 locking/atomics, net/netlink/netfilter: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.

However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.

It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts netlink and netfilter code and comments to use
{READ,WRITE}_ONCE() consistently.

----
virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland
Signed-off-by: Paul E. McKenney
Cc: David S. Miller
Cc: Florian Westphal
Cc: Jozsef Kadlecsik
Cc: Linus Torvalds
Cc: Pablo Neira Ayuso
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-7-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar

Mark Rutland
2017-10-25 17:00:59 +0800
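
To make the transformation in commit 14cd5d4a0 concrete, here is a simplified user-space sketch of the two accessor forms. The `_SKETCH` macros, `shared_flag`, `publish()` and `observe()` are hypothetical; the kernel's real READ_ONCE()/WRITE_ONCE() handle more cases than these volatile casts do, and `__typeof__` assumes GCC or Clang.

```c
/* Simplified stand-ins: one macro for loads, a separate one for stores. */
#define READ_ONCE_SKETCH(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int shared_flag;

static void publish(int v)
{
    /* Old style: ACCESS_ONCE(shared_flag) = v; */
    WRITE_ONCE_SKETCH(shared_flag, v);
}

static int observe(void)
{
    /* Old style: return ACCESS_ONCE(shared_flag); */
    return READ_ONCE_SKETCH(shared_flag);
}
```

Because loads and stores now go through different macros, each can be instrumented on its own, which is the separation the commit message says a single ACCESS_ONCE() cannot provide.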

05 Oct, 2017

1 commit

6621dd29e dev: advertise the new nsid when the netns iface changes

x-netns interfaces are bound to two netns: the link netns and the upper
netns. Usually, this kind of interface is created in the link netns and
then moved to the upper netns. At the end, the interface is visible only
in the upper netns. The link nsid is advertised via netlink in the upper
netns, thus the user always knows where the link part is.

There is no such mechanism in the link netns. When the interface is moved
to another netns, the user cannot "follow" it.
This patch adds a new netlink attribute which helps to follow an interface
which moves to another netns. When the interface is unregistered, the new
nsid is advertised. If the interface is an x-netns interface (i.e.
rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.

CC: Jason A. Donenfeld
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2017-10-05 09:04:41 +0800

28 May, 2017

1 commit

3d3ea5af5 rtnl: Add support for netdev event to link messages

When netdev events happen, a rtnetlink_event() handler will send
messages for every event in its white list. These messages contain
current information about a particular device, but they do not include
the information about which event just happened. So, it is impossible
to tell what just happened for these events.

This patch adds a new extension to the RTM_NEWLINK message called IFLA_EVENT
that carries an encoding of the event that triggered this
message. This allows the message consumer to easily determine
whether it needs to perform certain actions.

Signed-off-by: Vladislav Yasevich
Acked-by: David Ahern
Signed-off-by: David S. Miller

Vlad Yasevich
2017-05-28 06:51:41 +0800

02 Sep, 2016

1 commit

d297653dd rtnetlink: fdb dump: optimize by saving last interface markers

fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.

In the process of fixing the performance issue, this patch also
re-implements the fix done by
commit 472681d57a5d ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.

Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065

real 1m11.791s
user 0m0.070s
sys 1m8.395s

(existing code does not return all macs)

after patch:
$time bridge fdb show | wc -l
35085

real 0m2.017s
user 0m0.113s
sys 0m1.942s

Signed-off-by: Roopa Prabhu
Signed-off-by: Wilson Kok
Signed-off-by: David S. Miller

Roopa Prabhu
2016-09-02 07:56:15 +0800
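
The resume-marker idea in commit d297653dd can be sketched as a chunked dump that records where it stopped. This is a simplified model with assumptions: it keeps two markers (device and entry) rather than the three the commit describes, and `entries_per_dev`, `struct dump_markers` and `dump_chunk()` are hypothetical names, not kernel interfaces.

```c
#include <stddef.h>

#define NDEVS 3

/* Hypothetical table: device d holds entries_per_dev[d] fdb entries. */
static const int entries_per_dev[NDEVS] = { 4, 0, 3 };

/* Resume markers, playing the role the commit gives to cb->args. */
struct dump_markers {
    int dev_idx;   /* device to continue with */
    int entry_idx; /* next entry within that device */
};

/*
 * Write up to cap entries into out, encoded as dev * 100 + entry.
 * Returns how many were written; 0 means the dump is complete.
 */
static size_t dump_chunk(struct dump_markers *m, int *out, size_t cap)
{
    size_t n = 0;

    while (m->dev_idx < NDEVS) {
        while (m->entry_idx < entries_per_dev[m->dev_idx]) {
            if (n == cap)
                return n; /* buffer full: markers say where to resume */
            out[n++] = m->dev_idx * 100 + m->entry_idx;
            m->entry_idx++;
        }
        m->dev_idx++;     /* device finished: move on */
        m->entry_idx = 0; /* and restart its entry marker */
    }
    return n;
}
```

Each call continues from the saved markers, so no device or entry that was already emitted is visited again. The per-buffer restart from the first interface is the cost the commit removes.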

16 Jun, 2016

1 commit

1b5c5493e net_sched: add the ability to defer skb freeing

Qdiscs are changed under RTNL protection, often
while blocking BH and holding the root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.

Actual freeing happens right after RTNL is released,
with appropriate scheduling points.

rtnl_qdisc_drop() can also be used in place
of qdisc_drop() when RTNL is held.

qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-06-16 05:08:34 +0800
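
The deferred-freeing pattern from commit 1b5c5493e can be sketched in user space: buffers are queued while the lock is held and released only after it is dropped. The names `struct buf`, `cfg_lock`, `defer_free()` and `cfg_unlock_and_free()` are hypothetical, a pthread mutex stands in for RTNL, and the scheduling points the commit mentions are left out.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical buffer type standing in for an skb. */
struct buf {
    struct buf *next;
};

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for RTNL */
static struct buf *defer_list; /* protected by cfg_lock */
static int freed_count;        /* kept only so the sketch is testable */

/* Call with cfg_lock held: queue the buffer instead of freeing it now. */
static void defer_free(struct buf *b)
{
    b->next = defer_list;
    defer_list = b;
}

/* Detach the queue, unlock, then free what was queued under the lock. */
static void cfg_unlock_and_free(void)
{
    struct buf *head = defer_list;

    defer_list = NULL;
    pthread_mutex_unlock(&cfg_lock);

    while (head != NULL) {
        struct buf *next = head->next;

        free(head);
        freed_count++;
        head = next;
    }
}
```

The potentially long free loop runs with the lock already released, so other lock users are not stalled behind it.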

11 Jan, 2016

1 commit

1f211a1b9 net, sched: add clsact qdisc

This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, its execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is modelled somewhat similarly to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting etc. works analogously by specifying ingress/egress):

# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

[1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann
Acked-by: John Fastabend
Signed-off-by: David S. Miller

Daniel Borkmann
2016-01-11 11:13:15 +0800

09 Oct, 2015

1 commit

0cbf33437 net/core: lockdep_rtnl_is_held can be boolean

This patch makes lockdep_rtnl_is_held return bool due to this
particular function only using either one or zero as its return
value.

In another patch lockdep_is_held is also made to return bool.

No functional change.

Signed-off-by: Yaowei Bai
Signed-off-by: David S. Miller

Yaowei Bai
2015-10-09 22:49:06 +0800

23 Jun, 2015

1 commit

7d4f8d871 switchdev; add VLAN support for port's bridge_getlink

One more missing piece of the puzzle. Add vlan dump support to switchdev
port's bridge_getlink. iproute2 "bridge vlan show" cmd already knows how
to show the vlans installed on the bridge and the device, but (until now)
no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK
msg. Before this patch, "bridge vlan show":

$ bridge -c vlan show
port vlan ids
sw1p1 30-34 << bridge side vlans
57

sw1p1 << device side vlans (missing)

sw1p2 57

sw1p2

sw1p3

sw1p4

br0 None

(When the port is bridged, the output repeats the vlan list for the vlans
on the bridge side of the port and the vlans on the device side of the
port. The listing above shows no vlans for the device side even though they
are installed).

After this patch:

$ bridge -c vlan show
port vlan ids
sw1p1 30-34 << bridge side vlan
57

sw1p1 30-34 << device side vlans
57
3840 PVID

sw1p2 57

sw1p2 57
3840 PVID

sw1p3 3842 PVID

sw1p4 3843 PVID

br0 None

I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func.
switchdev support adds an obj dump for VLAN objects, using the same
call-back scheme as FDB dump. Support included for both compressed and
un-compressed vlan dumps.

Signed-off-by: Scott Feldman
Signed-off-by: David S. Miller

Scott Feldman
2015-06-23 21:56:18 +0800

14 May, 2015

2 commits

1cf51900f net: add CONFIG_NET_INGRESS to enable ingress filtering

This new config switch enables the ingress filtering infrastructure that is
controlled through the ingress_needed static key. This prepares the
introduction of the Netfilter ingress hook that resides under this unique
static key.

Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
problem since this also depends on CONFIG_NET_CLS_ACT.

Signed-off-by: Pablo Neira Ayuso
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Pablo Neira
2015-05-14 13:10:05 +0800
f0b5e8a42 net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset

This fixes 4577139b2dabf589 ("net: use jump label patching for ingress qdisc in
__netif_receive_skb_core").

The only client of this is sch_ingress and it depends on NET_CLS_ACT. So
there is no way these definitions can be of any help.

Cc: Daniel Borkmann
Signed-off-by: Pablo Neira Ayuso
Acked-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Pablo Neira
2015-05-14 03:44:28 +0800

30 Apr, 2015

1 commit

46c264daa bridge/nl: remove wrong use of NLM_F_MULTI

NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.

Libraries like libnl will wait forever for NLMSG_DONE.

Fixes: e5a55a898720 ("net: create generic bridge ops")
Fixes: 815cccbf10b2 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
CC: John Fastabend
CC: Sathya Perla
CC: Subbu Seetharaman
CC: Ajit Khaparde
CC: Jeff Kirsher
CC: intel-wired-lan@lists.osuosl.org
CC: Jiri Pirko
CC: Scott Feldman
CC: Stephen Hemminger
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2015-04-30 02:59:16 +0800

14 Apr, 2015

1 commit

4577139b2 net: use jump label patching for ingress qdisc in __netif_receive_skb_core

Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.

Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.

Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when an ingress qdisc is being either
initialized or destructed.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-04-14 01:34:40 +0800
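
The enable/disable pairing described in commit 4577139b2 can be approximated in user space with a use counter that gates the per-packet path. This is only a rough analogy: a real static key patches the branch out of the instruction stream, while this sketch still performs a load and compare on every packet. The names `ingress_users`, `ingress_queue_inc()`, `ingress_queue_dec()`, `ingress_needed()` and `receive_packet()` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Plain counter standing in for the static key. */
static atomic_int ingress_users;
static int slow_path_runs; /* kept only so the sketch is testable */

/* Called when an ingress qdisc is initialized or destroyed. */
static void ingress_queue_inc(void) { atomic_fetch_add(&ingress_users, 1); }
static void ingress_queue_dec(void) { atomic_fetch_sub(&ingress_users, 1); }

static bool ingress_needed(void)
{
    return atomic_load(&ingress_users) > 0;
}

/* Per-packet receive path: skip ingress work unless something is attached. */
static void receive_packet(void)
{
    if (!ingress_needed())
        return;
    slow_path_runs++; /* stands in for the handle_ing() work */
}
```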

10 Dec, 2014

1 commit

395eea6cc rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()

The commit 56bfa7ee7c ("unregister_netdevice : move RTM_DELLINK to
until after ndo_uninit") tried to do this ealier but while doing so
it created a problem. Unfortunately the delayed rtmsg_ifinfo() also
delayed call to fill_info(). So this translated into asking driver
to remove private state and then query it's private state. This
could have catastropic consequences.

This change breaks rtmsg_ifinfo() into two parts: the first takes a
precise snapshot of the device by calling fill_info() before
ndo_uninit() is called, and the second sends the notification using
the collected snapshot.
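
The two-part split can be illustrated with a small userspace sketch. All names here (`struct device`, `struct link_msg`, `build_link_msg()`, `device_uninit()`, `send_link_msg()`) are hypothetical stand-ins, not the kernel's functions; the point is only the ordering: build the snapshot, tear down driver state, then send.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device with driver-private state that uninit frees. */
struct device {
    char name[16];
    int *priv;          /* driver-private state, NULL once torn down */
};

/* Snapshot of what the notification will carry. */
struct link_msg {
    char name[16];
    int priv_value;
};

/* Part one: capture device state while the private data is still valid. */
struct link_msg *build_link_msg(const struct device *dev)
{
    struct link_msg *msg;

    assert(dev->priv != NULL);      /* must run before uninit */
    msg = malloc(sizeof(*msg));
    if (!msg)
        return NULL;
    memcpy(msg->name, dev->name, sizeof(msg->name));
    msg->priv_value = *dev->priv;
    return msg;
}

/* Stand-in for ndo_uninit(): the driver releases its private state. */
void device_uninit(struct device *dev)
{
    free(dev->priv);
    dev->priv = NULL;
}

/* Part two: format and "send" the prebuilt snapshot into out.
 * It never touches the device, so it is safe after uninit.
 * Frees the snapshot and returns the snprintf() result. */
int send_link_msg(struct link_msg *msg, char *out, size_t len)
{
    int n = snprintf(out, len, "DELLINK %s priv=%d",
                     msg->name, msg->priv_value);

    free(msg);
    return n;
}
```

Because send_link_msg() only reads the snapshot, the notification can be delayed until after the driver has released its state without querying freed memory.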

The problem was noticed when the last link was deleted from an ipvlan
device: the port had already been freed, and the subsequent
.fill_info() call tried to get its info from the port.

kernel: [ 255.139429] ------------[ cut here ]------------
kernel: [ 255.139439] WARNING: CPU: 12 PID: 11173 at net/core/rtnetlink.c:2238 rtmsg_ifinfo+0x100/0x110()
kernel: [ 255.139493] Modules linked in: ipvlan bonding w1_therm ds2482 wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid bnx2x ptp pps_core mdio libcrc32c
kernel: [ 255.139513] CPU: 12 PID: 11173 Comm: ip Not tainted 3.18.0-smp-DEV #167
kernel: [ 255.139514] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 1.0.10 05/15/2012
kernel: [ 255.139515] 0000000000000009 ffff880851b6b828 ffffffff815d87f4 00000000000000e0
kernel: [ 255.139516] 0000000000000000 ffff880851b6b868 ffffffff8109c29c 0000000000000000
kernel: [ 255.139518] 00000000ffffffa6 00000000000000d0 ffffffff81aaf580 0000000000000011
kernel: [ 255.139520] Call Trace:
kernel: [ 255.139527] [] dump_stack+0x46/0x58
kernel: [ 255.139531] [] warn_slowpath_common+0x8c/0xc0
kernel: [ 255.139540] [] warn_slowpath_null+0x1a/0x20
kernel: [ 255.139544] [] rtmsg_ifinfo+0x100/0x110
kernel: [ 255.139547] [] rollback_registered_many+0x1d5/0x2d0
kernel: [ 255.139549] [] unregister_netdevice_many+0x1f/0xb0
kernel: [ 255.139551] [] rtnl_dellink+0xbb/0x110
kernel: [ 255.139553] [] rtnetlink_rcv_msg+0xa0/0x240
kernel: [ 255.139557] [] ? rhashtable_lookup_compare+0x43/0x80
kernel: [ 255.139558] [] ? __rtnl_unlock+0x20/0x20
kernel: [ 255.139562] [] netlink_rcv_skb+0xb1/0xc0
kernel: [ 255.139563] [] rtnetlink_rcv+0x25/0x40
kernel: [ 255.139565] [] netlink_unicast+0x178/0x230
kernel: [ 255.139567] [] netlink_sendmsg+0x30f/0x420
kernel: [ 255.139571] [] sock_sendmsg+0x9c/0xd0
kernel: [ 255.139575] [] ? rw_copy_check_uvector+0x6f/0x130
kernel: [ 255.139577] [] ? copy_msghdr_from_user+0x139/0x1b0
kernel: [ 255.139578] [] ___sys_sendmsg+0x304/0x310
kernel: [ 255.139581] [] ? handle_mm_fault+0xca3/0xde0
kernel: [ 255.139585] [] ? destroy_inode+0x3c/0x70
kernel: [ 255.139589] [] ? __do_page_fault+0x20c/0x500
kernel: [ 255.139597] [] ? dput+0xb6/0x190
kernel: [ 255.139606] [] ? mntput+0x26/0x40
kernel: [ 255.139611] [] ? __fput+0x174/0x1e0
kernel: [ 255.139613] [] __sys_sendmsg+0x49/0x90
kernel: [ 255.139615] [] SyS_sendmsg+0x12/0x20
kernel: [ 255.139617] [] system_call_fastpath+0x12/0x17
kernel: [ 255.139619] ---[ end trace 5e6703e87d984f6b ]---

Signed-off-by: Mahesh Bandewar
Reported-by: Toshiaki Makita
Cc: Eric Dumazet
Cc: Roopa Prabhu
Cc: David S. Miller
Acked-by: Eric Dumazet
Acked-by: Thomas Graf
Signed-off-by: David S. Miller

Mahesh Bandewar
2014-12-10 02:36:57 +0800

03 Dec, 2014

2 commits

2c3c031c8 bridge: add brport flags to dflt bridge_getlink

Allow a brport device to return the current brport flags set on the port. Add
the returned flags to the nested IFLA_PROTINFO netlink msg built in dflt getlink.
With this change, the netlink msg returned for bridge_getlink contains the port's
offloaded flag settings (the port's SELF settings).

Signed-off-by: Scott Feldman
Signed-off-by: Jiri Pirko
Acked-by: Andy Gospodarek
Acked-by: Thomas Graf
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Scott Feldman
2014-12-03 12:01:24 +0800
f6f6424ba net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del

Do the work of parsing NDA_VLAN directly in rtnetlink code, pass simple
u16 vid to drivers from there.
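
A minimal userspace sketch of the idea, assuming the attribute payload arrives as a raw byte buffer. `parse_vid()`, `core_fdb_add()` and `driver_fdb_add()` are hypothetical names, and the validation rules (payload must be exactly two bytes, vid in 1..4094) are assumptions made for illustration rather than a statement of what the kernel checks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VID_MAX 4094u   /* assumed upper bound of usable VLAN IDs */

/* Core-side parsing: turn an optional raw attribute payload into a vid.
 * A NULL payload means "attribute absent" and yields vid 0. */
int parse_vid(const void *payload, size_t len, uint16_t *vid)
{
    uint16_t v;

    assert(vid != NULL);
    *vid = 0;
    if (payload == NULL)
        return 0;
    if (len != sizeof(v))
        return -EINVAL;
    memcpy(&v, payload, sizeof(v));
    if (v == 0 || v > VID_MAX)
        return -EINVAL;
    *vid = v;
    return 0;
}

/* Hypothetical driver hook: receives a ready-to-use vid and does no
 * attribute parsing. It echoes the vid so the caller can observe it. */
int driver_fdb_add(const unsigned char *addr, uint16_t vid)
{
    (void)addr;
    return vid;
}

/* Core entry point: parse once here, then pass a simple value down. */
int core_fdb_add(const unsigned char *addr,
                 const void *vlan_payload, size_t vlan_len)
{
    uint16_t vid;
    int err = parse_vid(vlan_payload, vlan_len, &vid);

    if (err)
        return err;
    return driver_fdb_add(addr, vid);
}
```

Keeping the parsing in one place means every driver sees the same validated value instead of repeating the attribute handling.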

Signed-off-by: Jiri Pirko
Acked-by: Andy Gospodarek
Acked-by: Jamal Hadi Salim
Acked-by: John Fastabend
Signed-off-by: David S. Miller

Jiri Pirko
2014-12-03 12:01:18 +0800

14 Sep, 2014

1 commit

331b72922 net: sched: RCU cls_tcindex

Make cls_tcindex RCU safe.

This patch adds a new RCU routine, rcu_dereference_bh_rtnl(), to check
that the caller holds either the RCU read lock or RTNL. This is needed
because tcindex_lookup() is called in both cases.

Signed-off-by: John Fastabend
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

John Fastabend
2014-09-14 00:30:26 +0800

11 Jul, 2014

1 commit

5d5eacb34 bridge: fdb dumping takes a filter device

Dumping a bridge fdb dumps every fdb entry
held. With this change we can filter
on a selected bridge port.
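
A filtered dump can be sketched as follows. `struct fdb_entry` and `fdb_dump()` are hypothetical, and treating a filter value of 0 as "no filter, dump everything" is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fdb entry: which port learned which address. */
struct fdb_entry {
    int port_ifindex;
    unsigned char addr[6];
};

/* Copy entries whose port matches filter_ifindex into out.
 * filter_ifindex == 0 means "no filter" (assumption of this sketch).
 * Returns the number of entries written, at most max. */
size_t fdb_dump(const struct fdb_entry *tbl, size_t n, int filter_ifindex,
                struct fdb_entry *out, size_t max)
{
    size_t i, written = 0;

    assert(tbl != NULL && out != NULL);
    for (i = 0; i < n && written < max; i++) {
        if (filter_ifindex && tbl[i].port_ifindex != filter_ifindex)
            continue;               /* entry belongs to another port */
        out[written++] = tbl[i];
    }
    return written;
}
```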

Signed-off-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Jamal Hadi Salim
2014-07-11 03:37:33 +0800

16 May, 2014

1 commit

200b916f3 rtnetlink: wait for unregistering devices in rtnl_link_unregister()

From: Cong Wang

commit 50624c934db18ab90 (net: Delay default_device_exit_batch until no
devices are unregistering) introduced rtnl_lock_unregistering() for
default_device_exit_batch(). The same race could happen when we rmmod a
driver which calls rtnl_link_unregister(), as we call dev->destructor
without the rtnl lock.

In the long term, I think we should clean up the mess of netdev_run_todo()
and the net namespace exit code.

Cc: Eric W. Biederman
Cc: David S. Miller
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2014-05-16 03:30:33 +0800

18 Dec, 2013

1 commit

85328240c net: allow netdev_all_upper_get_next_dev_rcu with rtnl lock held

It is useful to be able to walk all upper devices when bringing
a device online where the RTNL lock is held. In this case it
is safe to walk the all_adj_list because the RTNL lock is used
to protect the write side as well.

This patch adds a check to see if the rtnl lock is held before
throwing a warning in netdev_all_upper_get_next_dev_rcu().

Also, because we now have a call site for lockdep_rtnl_is_held()
outside CONFIG_PROVE_LOCKING, an inline definition returning 1 is
needed, similar to rcu_read_lock_is_held().
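
The shape of such a fallback definition can be sketched in userspace C. `PROVE_LOCKING_SKETCH`, `rtnl_held`, `rcu_read_locked`, `upper_walk_allowed()` and `walk_upper_devices()` are hypothetical stand-ins; only the name lockdep_rtnl_is_held() and the "returns 1 when lock tracking is compiled out" behaviour come from the commit message.

```c
#include <assert.h>

/* PROVE_LOCKING_SKETCH is a hypothetical stand-in for the kernel option. */
#ifdef PROVE_LOCKING_SKETCH
int rtnl_held;                       /* toy lock-tracking state */
static inline int lockdep_rtnl_is_held(void) { return rtnl_held; }
#else
/* No lock tracking compiled in: we cannot know, so report "held"
 * and let callers' checks pass instead of warning spuriously. */
static inline int lockdep_rtnl_is_held(void) { return 1; }
#endif

int rcu_read_locked;    /* toy flag: inside an RCU read-side section */

/* The walk is safe under either protection. */
int upper_walk_allowed(void)
{
    return rcu_read_locked || lockdep_rtnl_is_held();
}

/* assert() stands in for the kernel's warning on an unprotected walk. */
void walk_upper_devices(void)
{
    assert(upper_walk_allowed());
    /* ... iterate the adjacency list here ... */
}
```

Without the inline fallback, the call site would fail to build whenever the lock-proving option is disabled.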

Fixes: 2a47fa45d4df ("ixgbe: enable l2 forwarding acceleration for macvlans")
CC: Veaceslav Falico
Reported-by: Yuanhan Liu
Signed-off-by: John Fastabend
Tested-by: Phil Schmitt
Signed-off-by: Jeff Kirsher

John Fastabend
2013-12-18 13:19:08 +0800

26 Oct, 2013

1 commit

7f2940540 net: fix rtnl notification in atomic context

commit 991fb3f74c "dev: always advertise rx_flags changes via netlink"
introduced rtnl notification from __dev_set_promiscuity(),
which can be called in atomic context.

Steps to reproduce:
ip tuntap add dev tap1 mode tap
ifconfig tap1 up
tcpdump -nei tap1 &
ip tuntap del dev tap1 mode tap

[ 271.627994] device tap1 left promiscuous mode
[ 271.639897] BUG: sleeping function called from invalid context at mm/slub.c:940
[ 271.664491] in_atomic(): 1, irqs_disabled(): 0, pid: 3394, name: ip
[ 271.677525] INFO: lockdep is turned off.
[ 271.690503] CPU: 0 PID: 3394 Comm: ip Tainted: G W 3.12.0-rc3+ #73
[ 271.703996] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 271.731254] ffffffff81a58506 ffff8807f0d57a58 ffffffff817544e5 ffff88082fa0f428
[ 271.760261] ffff8808071f5f40 ffff8807f0d57a88 ffffffff8108bad1 ffffffff81110ff8
[ 271.790683] 0000000000000010 00000000000000d0 00000000000000d0 ffff8807f0d57af8
[ 271.822332] Call Trace:
[ 271.838234] [] dump_stack+0x55/0x76
[ 271.854446] [] __might_sleep+0x181/0x240
[ 271.870836] [] ? rcu_irq_exit+0x68/0xb0
[ 271.887076] [] kmem_cache_alloc_node+0x4e/0x2a0
[ 271.903368] [] ? vprintk_emit+0x1dc/0x5a0
[ 271.919716] [] ? __alloc_skb+0x57/0x2a0
[ 271.936088] [] ? vprintk_emit+0x1e0/0x5a0
[ 271.952504] [] __alloc_skb+0x57/0x2a0
[ 271.968902] [] rtmsg_ifinfo+0x52/0x100
[ 271.985302] [] __dev_notify_flags+0xad/0xc0
[ 272.001642] [] __dev_set_promiscuity+0x8c/0x1c0
[ 272.017917] [] ? packet_notifier+0x5/0x380
[ 272.033961] [] dev_set_promiscuity+0x29/0x50
[ 272.049855] [] packet_dev_mc+0x87/0xc0
[ 272.065494] [] packet_notifier+0x1b2/0x380
[ 272.080915] [] ? packet_notifier+0x5/0x380
[ 272.096009] [] notifier_call_chain+0x66/0x150
[ 272.110803] [] __raw_notifier_call_chain+0xe/0x10
[ 272.125468] [] raw_notifier_call_chain+0x16/0x20
[ 272.139984] [] call_netdevice_notifiers_info+0x40/0x70
[ 272.154523] [] call_netdevice_notifiers+0x16/0x20
[ 272.168552] [] rollback_registered_many+0x145/0x240
[ 272.182263] [] rollback_registered+0x31/0x40
[ 272.195369] [] unregister_netdevice_queue+0x58/0x90
[ 272.208230] [] __tun_detach+0x140/0x340
[ 272.220686] [] tun_chr_close+0x36/0x60

Signed-off-by: Alexei Starovoitov
Acked-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Alexei Starovoitov
2013-10-26 07:03:45 +0800

08 Mar, 2013

1 commit

090096bf3 net: generic fdb support for drivers without ndo_fdb_<op>

If the driver does not support the ndo_op, use the generic
handler for it. This should work in the majority of cases.
Eventually the fdb_dflt_add call gets translated into a
__dev_set_rx_mode() call, which should handle hardware
support for filtering via the IFF_UNICAST_FLT flag.

Namely, IFF_UNICAST_FLT indicates whether the hardware can do
unicast address filtering. If no support is available,
the device is put into promisc mode.
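
The fallback described above can be sketched as a small dispatch function. Everything here is a hypothetical userspace model: `struct device_ops`, `fdb_default_add()` and `fdb_add()` are illustrative names, and the `unicast_filter` field merely stands in for the IFF_UNICAST_FLT flag.

```c
#include <assert.h>
#include <stddef.h>

struct device;

/* Hypothetical, trimmed-down ops table: only the fdb add hook. */
struct device_ops {
    int (*fdb_add)(struct device *dev, const unsigned char *addr);
};

struct device {
    const struct device_ops *ops;
    int unicast_filter;   /* stand-in for IFF_UNICAST_FLT */
    int promisc;          /* set when the device must accept all frames */
    int uc_count;         /* number of unicast addresses tracked */
};

/* Generic handler: track the address and fall back to promiscuous
 * mode when the hardware cannot filter unicast addresses itself. */
int fdb_default_add(struct device *dev, const unsigned char *addr)
{
    (void)addr;           /* a real handler would store the address */
    dev->uc_count++;
    if (!dev->unicast_filter)
        dev->promisc = 1;
    return 0;
}

/* Dispatch: prefer the driver's hook, otherwise use the generic one. */
int fdb_add(struct device *dev, const unsigned char *addr)
{
    assert(dev != NULL && dev->ops != NULL);
    if (dev->ops->fdb_add)
        return dev->ops->fdb_add(dev, addr);
    return fdb_default_add(dev, addr);
}
```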

Signed-off-by: Vlad Yasevich
Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

Vlad Yasevich
2013-03-08 04:29:45 +0800

01 Nov, 2012

1 commit

815cccbf1 ixgbe: add setlink, getlink support to ixgbe and ixgbevf

This adds support for the net device ops to manage the embedded
hardware bridge on ixgbe devices. With this patch the bridge
mode can be toggled between VEB and VEPA to support stacking
macvlan devices or using the embedded switch without any SW
component in 802.1Qbg/br environments.

Additionally, this adds source address pruning to the ixgbevf
driver to prune any frames sent back from a reflective relay on
the switch. This is required because the existing hardware does
not support this. Without it, frames get pushed into the stack
with the device's own source MAC, which is invalid per the
802.1Qbg VEPA definition.

Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2012-11-01 01:18:29 +0800

13 Oct, 2012

1 commit

607ca46e9 UAPI: (Scripted) Disintegrate include/linux

Signed-off-by: David Howells
Acked-by: Arnd Bergmann
Acked-by: Thomas Gleixner
Acked-by: Michael Kerrisk
Acked-by: Paul E. McKenney
Acked-by: Dave Jones

David Howells
2012-10-13 17:46:48 +0800

11 Jul, 2012

1 commit

87a50699c rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo().

Nobody provides non-zero values any longer.

Signed-off-by: David S. Miller

David S. Miller
2012-07-11 13:40:13 +0800

28 Jun, 2012

1 commit

4c3af034f netlink: Get rid of obsolete rtnetlink macros

Removes all RTA_GET*() and RTA_PUT*() variations, as well as
the unused rtattr_strcmp(). Get rid of rtm_get_table() by moving
it to its only user, decnet.

Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller

Thomas Graf
2012-06-28 06:36:44 +0800