29 Jan, 2016

1 commit

  • When switchdev drivers process FDB notifications from the underlying
    device they resolve the netdev to which the entry points and notify
    the bridge using the switchdev notifier.

    However, since the RTNL mutex is not held there is nothing preventing
    the netdev from disappearing in the middle, which will cause
    br_switchdev_event() to dereference a non-existing netdev.

    Make switchdev drivers hold the lock at the beginning of the
    notification processing session and release it once it ends, after
    notifying the bridge.

    Also, remove switchdev_mutex and fdb_lock, as they are no longer needed
    when RTNL mutex is held.

    Fixes: 03bf0c281234 ("switchdev: introduce switchdev notifier")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     

16 Jan, 2016

1 commit

  • After promisc mode management was introduced a bridge device could do
    dev_set_promiscuity from its ndo_change_rx_flags() callback which in
    turn can be called after the bridge's addr_list_lock has been taken
    (e.g. by dev_uc_add). This causes a false positive lockdep splat because
    the port interfaces' addr_list_lock is taken when br_manage_promisc()
    runs after the bridge's addr list lock was already taken.
    To remove the false positive introduce a custom bridge addr_list_lock
    class and set it on bridge init.
    A simple way to reproduce this is with the following:
    $ brctl addbr br0
    $ ip l add l br0 br0.100 type vlan id 100
    $ ip l set br0 up
    $ ip l set br0.100 up
    $ echo 1 > /sys/class/net/br0/bridge/vlan_filtering
    $ brctl addif br0 eth0
    Splat:
    [ 43.684325] =============================================
    [ 43.684485] [ INFO: possible recursive locking detected ]
    [ 43.684636] 4.4.0-rc8+ #54 Not tainted
    [ 43.684755] ---------------------------------------------
    [ 43.684906] brctl/1187 is trying to acquire lock:
    [ 43.685047] (_xmit_ETHER){+.....}, at: [] dev_set_rx_mode+0x1e/0x40
    [ 43.685460] but task is already holding lock:
    [ 43.685618] (_xmit_ETHER){+.....}, at: [] dev_uc_add+0x27/0x80
    [ 43.686015] other info that might help us debug this:
    [ 43.686316] Possible unsafe locking scenario:

    [ 43.686743] CPU0
    [ 43.686967] ----
    [ 43.687197] lock(_xmit_ETHER);
    [ 43.687544] lock(_xmit_ETHER);
    [ 43.687886] *** DEADLOCK ***

    [ 43.688438] May be due to missing lock nesting notation

    [ 43.688882] 2 locks held by brctl/1187:
    [ 43.689134] #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
    [ 43.689852] #1: (_xmit_ETHER){+.....}, at: [] dev_uc_add+0x27/0x80
    [ 43.690575] stack backtrace:
    [ 43.690970] CPU: 0 PID: 1187 Comm: brctl Not tainted 4.4.0-rc8+ #54
    [ 43.691270] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    [ 43.691770] ffffffff826a25c0 ffff8800369fb8e0 ffffffff81360ceb ffffffff826a25c0
    [ 43.692425] ffff8800369fb9b8 ffffffff810d0466 ffff8800369fb968 ffffffff81537139
    [ 43.693071] ffff88003a08c880 0000000000000000 00000000ffffffff 0000000002080020
    [ 43.693709] Call Trace:
    [ 43.693931] [] dump_stack+0x4b/0x70
    [ 43.694199] [] __lock_acquire+0x1e46/0x1e90
    [ 43.694483] [] ? netlink_broadcast_filtered+0x139/0x3e0
    [ 43.694789] [] ? nlmsg_notify+0x5a/0xc0
    [ 43.695064] [] lock_acquire+0xe5/0x1f0
    [ 43.695340] [] ? dev_set_rx_mode+0x1e/0x40
    [ 43.695623] [] _raw_spin_lock_bh+0x45/0x80
    [ 43.695901] [] ? dev_set_rx_mode+0x1e/0x40
    [ 43.696180] [] dev_set_rx_mode+0x1e/0x40
    [ 43.696460] [] dev_set_promiscuity+0x3c/0x50
    [ 43.696750] [] br_port_set_promisc+0x25/0x50 [bridge]
    [ 43.697052] [] br_manage_promisc+0x8a/0xe0 [bridge]
    [ 43.697348] [] br_dev_change_rx_flags+0x1e/0x20 [bridge]
    [ 43.697655] [] __dev_set_promiscuity+0x132/0x1f0
    [ 43.697943] [] __dev_set_rx_mode+0x82/0x90
    [ 43.698223] [] dev_uc_add+0x5e/0x80
    [ 43.698498] [] vlan_device_event+0x542/0x650 [8021q]
    [ 43.698798] [] notifier_call_chain+0x5d/0x80
    [ 43.699083] [] raw_notifier_call_chain+0x16/0x20
    [ 43.699374] [] call_netdevice_notifiers_info+0x6e/0x80
    [ 43.699678] [] call_netdevice_notifiers+0x16/0x20
    [ 43.699973] [] br_add_if+0x47e/0x4c0 [bridge]
    [ 43.700259] [] add_del_if+0x6e/0x80 [bridge]
    [ 43.700548] [] br_dev_ioctl+0xaf/0xc0 [bridge]
    [ 43.700836] [] dev_ifsioc+0x30c/0x3c0
    [ 43.701106] [] dev_ioctl+0xf9/0x6f0
    [ 43.701379] [] ? mntput_no_expire+0x5/0x450
    [ 43.701665] [] ? mntput_no_expire+0xae/0x450
    [ 43.701947] [] sock_do_ioctl+0x42/0x50
    [ 43.702219] [] sock_ioctl+0x1e5/0x290
    [ 43.702500] [] do_vfs_ioctl+0x2cb/0x5c0
    [ 43.702771] [] SyS_ioctl+0x79/0x90
    [ 43.703033] [] entry_SYSCALL_64_fastpath+0x16/0x7a

    CC: Vlad Yasevich
    CC: Stephen Hemminger
    CC: Bridge list
    CC: Andy Gospodarek
    CC: Roopa Prabhu
    Fixes: 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode.")
    Reported-by: Andy Gospodarek
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

11 Jan, 2016

1 commit


09 Jan, 2016

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next, they are:

    1) Release nf_tables objects on netns destructions via
    nft_release_afinfo().

    2) Destroy basechain and rules on netdevice removal in the new netdev
    family.

    3) Get rid of defensive check against removal of inactive objects in
    nf_tables.

    4) Pass down netns pointer to our existing nfnetlink callbacks, as well
    as commit() and abort() nfnetlink callbacks.

    5) Allow to invert limit expression in nf_tables, so we can throttle
    overlimit traffic.

    6) Add packet duplication for the netdev family.

    7) Add forward expression for the netdev family.

    8) Define pr_fmt() in conntrack helpers.

    9) Don't leave nfqueue configuration in an inconsistent state in case of
    errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
    him.

    10) Skip queue option handling after unbind.

    11) Return an error on unknown commands in both nfqueue and nflog.

    12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.

    13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
    from Carlos Falgueras.

    14) Add support for 64 bit byteordering changes to nf_tables, from Florian
    Westphal.

    15) Add conntrack byte/packet counter matching support to nf_tables,
    also from Florian.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

07 Jan, 2016

4 commits


06 Jan, 2016

1 commit

  • [I stole this patch from Eric Biederman. He wrote:]

    > There is no defined mechanism to pass network namespace information
    > into /sbin/bridge-stp therefore don't even try to invoke it except
    > for bridge devices in the initial network namespace.
    >
    > It is possible for unprivileged users to cause /sbin/bridge-stp to be
    > invoked for any network device name which if /sbin/bridge-stp does not
    > guard against unreasonable arguments or being invoked twice on the
    > same network device could cause problems.

    [Hannes: changed patch using netns_eq]

    Cc: Eric W. Biederman
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
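
The namespace guard described in this commit can be pictured with a small userspace sketch. It is not the kernel code: `struct sketch_net`, `sketch_net_eq()` and `sketch_may_invoke_stp_helper()` are hypothetical names invented for illustration, and the comparison is assumed to be a plain pointer-equality check between namespaces.

```c
#include <stdbool.h>

/* Stand-in for a network namespace; only its identity (address) matters here. */
struct sketch_net {
    int unused;
};

/* Assumption: two namespaces are "equal" when they are the same object. */
static bool sketch_net_eq(const struct sketch_net *a, const struct sketch_net *b)
{
    return a == b;
}

/*
 * The userspace STP helper is only attempted for bridges that live in the
 * initial namespace, since nothing tells the helper which namespace to use.
 */
static bool sketch_may_invoke_stp_helper(const struct sketch_net *bridge_ns,
                                         const struct sketch_net *init_ns)
{
    return sketch_net_eq(bridge_ns, init_ns);
}
```

A bridge in any other namespace simply skips the helper in this sketch; what the kernel falls back to in that case is not stated in the message above.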
     

01 Jan, 2016

1 commit


29 Dec, 2015

1 commit

  • We have to release the existing objects on netns removal, otherwise we
    leak them. Chains are unregistered first to make sure no packets are
    walking on our rules and sets anymore.

    The object release happens when we unregister the family via
    nft_release_afinfo(), which is called from nft_unregister_afinfo() in
    the corresponding __net_exit path of every family.

    Reported-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

24 Dec, 2015

1 commit


23 Dec, 2015

1 commit

  • The bridge's ageing time is offloaded to hardware when:
    1) A port joins a bridge
    2) The ageing time of the bridge is changed

    In the first case the ageing time is offloaded as jiffies, but in the
    second case it's offloaded as clock_t, which is what existing switchdev
    drivers expect to receive.

    Fixes: 6ac311ae8bfb ("Adding switchdev ageing notification on port bridged")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
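
The unit mismatch described in this commit can be illustrated with a minimal userspace sketch. The tick rates below are assumptions chosen for round numbers, and the `sketch_` helpers are hypothetical; they only loosely model converting an ageing time kept in jiffies into the clock_t value that drivers expect.

```c
#include <assert.h>

/* Assumed tick rates for illustration; a real kernel takes these from its configuration. */
#define SKETCH_HZ      1000UL /* jiffies per second (assumption) */
#define SKETCH_USER_HZ 100UL  /* clock_t ticks per second (assumption) */

static_assert(SKETCH_HZ % SKETCH_USER_HZ == 0,
              "sketch assumes HZ is a multiple of USER_HZ");

/* Convert a jiffies count to clock_t ticks under the assumed rates. */
static unsigned long sketch_jiffies_to_clock_t(unsigned long j)
{
    return j / (SKETCH_HZ / SKETCH_USER_HZ);
}

/* Value to hand to a driver that expects clock_t, given an ageing time in jiffies. */
static unsigned long sketch_ageing_time_to_offload(unsigned long ageing_jiffies)
{
    return sketch_jiffies_to_clock_t(ageing_jiffies);
}
```

With these assumed rates, a 300 second ageing time is 300000 jiffies but 30000 clock_t ticks, so passing the raw jiffies value would make the hardware age entries ten times more slowly than intended.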
     

19 Dec, 2015

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains the first batch of Netfilter updates for
    the upcoming 4.5 kernel. This batch contains userspace netfilter header
    compilation fixes, support for packet mangling in nf_tables, the new
    tracing infrastructure for nf_tables and cgroup2 support for iptables.
    More specifically, they are:

    1) Two patches to include dependencies in our netfilter userspace
    headers to resolve compilation problems, from Mikko Rapeli.

    2) Four cosmetic cleanup patches for the ebtables codebase, from Ian Morris.

    3) Remove duplicate include in the netfilter reject infrastructure,
    from Stephen Hemminger.

    4) Two patches to simplify the netfilter defragmentation code for IPv6,
    from Florian Westphal.

    5) Fix root ownership of /proc/net netfilter for unprivileged net
    namespaces, from Philip Whineray.

    6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal.

    7) Add mangling support to our nf_tables payload expression, from
    Patrick McHardy.

    8) Introduce a new netlink-based tracing infrastructure for nf_tables,
    from Florian Westphal.

    9) Change setter functions in nfnetlink_log to be void, from
    Rami Rosen.

    10) Add netns support to the cttimeout infrastructure.

    11) Add cgroup2 support to iptables, from Tejun Heo.

    12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian.

    13) Add support for mangling pkttype in the nf_tables meta expression,
    also from Florian.

    BTW, I need that you pull net into net-next, I have another batch that
    requires changes that I don't yet see in net.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Dec, 2015

1 commit

  • switchdev drivers need to know the netdev on which the switchdev op was
    invoked. For example, the STP state of a VLAN interface configured on top
    of a port can change while it is a member of a bridge. In this case, the
    underlying driver should only change the STP state of that particular
    VLAN and not of all the VLANs configured on the port.

    However, current switchdev infrastructure only passes the port netdev down
    to the driver. Solve that by passing the original device down to the
    driver as part of the required switchdev object / attribute.

    This doesn't entail any change in current switchdev drivers. It simply
    enables those supporting stacked devices to know the originating device
    and act accordingly.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     

15 Dec, 2015

1 commit


09 Dec, 2015

1 commit

  • Only needed when meta nftrace rule(s) were added.
    The assumption is that no such rules are active, so the call to
    nft_trace_init is "never" needed.

    When nftrace rules are active, we always call the nft_trace_* functions,
    but will only send netlink messages when all of the following are true:

    - traceinfo structure was initialised
    - skb->nf_trace == 1
    - at least one subscriber to trace group.

    Adding an extra conditional
    (static_branch ... && skb->nf_trace)
    nft_trace_init( ..)

    Is possible but results in a larger nft_do_chain footprint.

    Signed-off-by: Florian Westphal
    Acked-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

04 Dec, 2015

2 commits


24 Nov, 2015

4 commits


17 Nov, 2015

1 commit

  • When NET_SWITCHDEV=n, switchdev_port_attr_set simply returns EOPNOTSUPP.
    In this case we should not emit errors and warnings to the kernel log.

    Reported-by: Sander Eikelenboom
    Tested-by: Christian Borntraeger
    Fixes: 0bc05d585d38 ("switchdev: allow caller to explicitly request
    attr_set as deferred")
    Fixes: 6ac311ae8bfb ("Adding switchdev ageing notification on port
    bridged")
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
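
The intent of this fix can be sketched in a few lines of userspace C. The helper name is hypothetical and not a kernel function; it only shows the decision of treating EOPNOTSUPP as "no offload available" rather than as a failure worth logging.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether the result of an attr_set style call
 * deserves an error or warning in the log.
 */
static bool sketch_should_log_attr_set_error(int err)
{
    /* -EOPNOTSUPP only means that no switchdev offload exists, which is not a fault. */
    return err != 0 && err != -EOPNOTSUPP;
}
```

A caller would check this predicate before emitting a message, so that builds without switchdev support stay quiet while genuine errors such as -EINVAL are still reported.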
     

11 Nov, 2015

1 commit

  • This reverts commit 34c2d9fb0498c066afbe610b15e18995fd8be792.

    There are 2 reasons for this revert:
    1) The commit in question doesn't do what it says it does. The
    description reads: "Allow bridge forward delay to be configured
    when Spanning Tree is enabled." This was already the case before
    the commit was made. What the commit actually did was disallow
    invalid values of 'forward_delay' when STP was turned off.

    2) The above change was actually a change in the user observed
    behavior and broke things like libvirt and other network configs
    that set 'forward_delay' to 0 without enabling STP. The value
    of 0 is actually used when STP is turned off to immediately mark
    the bridge as forwarding.

    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
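
The behaviour restored by this revert can be modelled with a short userspace sketch. The struct, the function name and the 2 to 30 second bounds are assumptions made for illustration; the point is only that range checking applies when STP is enabled, so a value of 0 remains acceptable when STP is off.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MIN_FORWARD_DELAY 2  /* seconds, assumed lower bound */
#define SKETCH_MAX_FORWARD_DELAY 30 /* seconds, assumed upper bound */

/* Hypothetical, heavily reduced stand-in for bridge state. */
struct sketch_bridge {
    bool stp_enabled;
    unsigned int forward_delay; /* seconds */
};

/* Reject out-of-range values only while STP is enabled. */
static int sketch_set_forward_delay(struct sketch_bridge *br, unsigned int val)
{
    if (br->stp_enabled &&
        (val < SKETCH_MIN_FORWARD_DELAY || val > SKETCH_MAX_FORWARD_DELAY))
        return -ERANGE;

    br->forward_delay = val;
    return 0;
}
```

In this model a configuration tool that sets the delay to 0 on a bridge without STP succeeds, which matches the libvirt use case mentioned above, while the same value is refused once STP is turned on.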
     

03 Nov, 2015

3 commits

  • br_should_learn() is protected by RCU and not by RTNL, so use correct
    flavor of nbp_vlan_group().

    Fixes: 907b1e6e83ed ("bridge: vlan: use proper rcu for the vlgrp
    member")
    Signed-off-by: Ido Schimmel
    Acked-by: Nikolay Aleksandrov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • The flag used to indicate if a VLAN should be used for filtering - as
    opposed to context only - on the bridge itself (e.g. br0) is called
    'brentry' and not 'brvlan'.

    Signed-off-by: Ido Schimmel
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • When adding a port to a bridge we initialize VLAN filtering on it. We do
    not bail out in case an error occurred in nbp_vlan_init, as it can be
    used as a non VLAN filtering bridge.

    However, if VLAN filtering is required and an error occurred in
    nbp_vlan_init, we should set vlgrp to NULL, so that VLAN filtering
    functions (e.g. br_vlan_find, br_get_pvid) will know the struct is
    invalid and will not try to access it.

    Signed-off-by: Ido Schimmel
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Ido Schimmel
     

30 Oct, 2015

1 commit

  • Problem Description:
    We can add fdbs pointing to the bridge with NULL ->dst, but that has a
    few race conditions. br_fdb_insert() first creates the fdb and only
    sets "is_local" to 1 after the fdb has been published/linked. If a
    packet arrives for that fdb in this time frame, it may see the entry as
    non-local and either do a NULL ptr dereference in br_forward() or
    attach the fdb to the port where it arrived; br_fdb_insert() will later
    make it local, leaving a wrong fdb entry.

    Call chain br_handle_frame_finish() -> br_forward():
    In br_handle_frame_finish(), br_forward() is only called when the dst
    is not local, i.e. skb != NULL; whenever the dst is found to be local,
    skb is set to NULL so we can't forward it. The problem is that packet
    forwarding runs only under RCU, so it can see the entry before
    "is_local" is set to 1 and actually try to dereference NULL.

    The main issue is that if someone sends a packet to the switch while
    it's adding the entry which points to the bridge device, it may
    dereference a NULL ptr. This fix is needed now that we can add fdbs
    pointing to the bridge. This poses a problem for br_fdb_update() as
    well: while someone's adding a bridge fdb, but before it has
    is_local == 1, the entry might get moved to a port if it comes as a
    source mac, and then it may get its "is_local" set to 1.

    This patch changes fdb_create to take is_local and is_static as
    arguments to set these values in the fdb entry before it is added to the
    hash. Also adds null check for port in br_forward.

    Fixes: 3741873b4f73 ("bridge: allow adding of fdb entries pointing to the bridge device")
    Reported-by: Nikolay Aleksandrov
    Signed-off-by: Roopa Prabhu
    Reviewed-by: Nikolay Aleksandrov
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Roopa Prabhu
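
The ordering this patch enforces, filling in every field before the entry becomes visible to readers, can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions and not the bridge code: the single `sketch_slot` stands in for the fdb hash, the release/acquire pair stands in for RCU publication, and all `sketch_` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced fdb entry. */
struct sketch_fdb {
    const void *dst; /* NULL when the entry points at the bridge itself */
    bool is_local;
    bool is_static;
};

/* One-slot stand-in for the hash table that lockless readers walk. */
static _Atomic(struct sketch_fdb *) sketch_slot;

/* Initialise all fields first, then publish the entry with release semantics. */
static void sketch_fdb_create(struct sketch_fdb *e, const void *dst,
                              bool is_local, bool is_static)
{
    e->dst = dst;
    e->is_local = is_local;
    e->is_static = is_static;
    atomic_store_explicit(&sketch_slot, e, memory_order_release);
}

/* Reader side: forward only to non-local entries that have a real port. */
static bool sketch_can_forward(void)
{
    struct sketch_fdb *e =
        atomic_load_explicit(&sketch_slot, memory_order_acquire);

    return e != NULL && !e->is_local && e->dst != NULL;
}
```

Because the flags are set before the store that publishes the pointer, a reader can never observe a bridge entry with a NULL `dst` that still looks non-local. The extra `dst != NULL` test corresponds to the defensive port check the message mentions for br_forward.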
     

22 Oct, 2015

1 commit

  • if_nlmsg_size() overestimates the minimum allocation size of netlink
    dump request (when called from rtnl_calcit()) or the size of the
    message (when called from rtnl_getlink()). This is because
    ext_filter_mask is not supported by rtnl_link_get_af_size() and
    rtnl_link_get_size().

    The over-estimation is significant when at least one netdev has many
    VLANs configured (8 bytes for each configured VLAN).

    This patch-set "rightsizes" the protocol specific attribute size
    calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
    and adding it as an argument to the get_link_af_size op in rtnl_af_ops.

    Bridge module already used filtering aware sizing for notifications.
    br_get_link_af_size_filtered() is consistent with the modified
    get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
    br_get_link_af_size() becomes unused and thus removed.

    Signed-off-by: Ronen Arad
    Acked-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Arad, Ronen
     

21 Oct, 2015

1 commit


17 Oct, 2015

2 commits


15 Oct, 2015

3 commits

  • Since a spinlock is held here, defer the switchdev operation. Also, ensure
    that deferred switchdev ops are processed before the port master device
    is unlinked.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • When an object is used in deferred work, we cannot use pointers in
    switchdev object structures because the memory they point at may already
    be used by someone else. So make a local copy of the value instead.

    Signed-off-by: Jiri Pirko
    Acked-by: Scott Feldman
    Reviewed-by: John Fastabend
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Caller should know if he can call attr_set directly (when holding RTNL)
    or if he has to defer the attr_set processing for later.

    This also allows drivers to sleep inside attr_set and report operation
    status back to switchdev core. Switchdev core then warns if status is
    not ok, instead of silent errors happening in drivers.

    Benefit from newly introduced switchdev deferred ops infrastructure.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
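
The second commit in this group, keeping a local copy of a value inside an object that deferred work will use later, can be sketched as follows. The struct layout and names are hypothetical and only illustrate embedding a copy of a MAC address instead of storing a pointer to the caller's buffer.

```c
#include <string.h>

#define SKETCH_ETH_ALEN 6 /* length of an Ethernet address in bytes */

/* Hypothetical object handed to deferred work. */
struct sketch_fdb_obj {
    unsigned char addr[SKETCH_ETH_ALEN]; /* embedded copy, not a pointer */
    unsigned short vid;
};

/* Copy the address so the object stays valid after the caller's buffer is reused. */
static void sketch_fdb_obj_init(struct sketch_fdb_obj *obj,
                                const unsigned char *addr, unsigned short vid)
{
    memcpy(obj->addr, addr, SKETCH_ETH_ALEN);
    obj->vid = vid;
}
```

If the caller overwrites or frees its buffer before the deferred work runs, the object in this sketch still holds the address that was current when the operation was queued.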
     

13 Oct, 2015

4 commits

  • Ido Schimmel reported a problem with switchdev devices because of the
    order change of del_nbp operations, more specifically the move of
    nbp_vlan_flush() which deletes all vlans and frees vlgrp after the
    rx_handler has been unregistered. So in order to fix this, move
    vlan_flush back where it was and make it destroy the rhtable after
    NULLing vlgrp and waiting a grace period to make sure no one can see it.

    Reported-by: Ido Schimmel
    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • As Ido Schimmel pointed out the vlan_vid_del() code in nbp_vlan_flush is
    unnecessary (and is actually a remnant of the old vlan code) so we can
    remove it.

    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • br_fill_ifinfo is called by br_ifinfo_notify, which can be called from
    many contexts with different locks held. Sometimes it relies upon the
    bridge's spinlock only, which is a problem for the vlan code, so use
    RCU explicitly for that to avoid problems.

    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • The bridge and port's vlgrp member is already used in RCU way, currently
    we rely on the fact that it cannot disappear while the port exists but
    that is error-prone and we might miss places with improper locking
    (either RCU or RTNL must be held to walk the vlan_list). So make it
    official and use RCU for vlgrp to catch offenders. Introduce proper vlgrp
    accessors and use them consistently throughout the code.

    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov