24 May, 2015

1 commit


23 May, 2015

7 commits

  • The following lockdep splat was reported:

    [ 29.382286] ===============================
    [ 29.382315] [ INFO: suspicious RCU usage. ]
    [ 29.382344] 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 Not tainted
    [ 29.382380] -------------------------------
    [ 29.382409] net/bridge/br_private.h:626 suspicious
    rcu_dereference_check() usage!
    [ 29.382455]
    other info that might help us debug this:

    [ 29.382507]
    rcu_scheduler_active = 1, debug_locks = 0
    [ 29.382549] 2 locks held by swapper/0/0:
    [ 29.382576] #0: (((&p->forward_delay_timer))){+.-...}, at:
    [] call_timer_fn+0x5/0x4f0
    [ 29.382660] #1: (&(&br->lock)->rlock){+.-...}, at:
    [] br_forward_delay_timer_expired+0x31/0x140
    [bridge]
    [ 29.382754]
    stack backtrace:
    [ 29.382787] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
    4.1.0-0.rc0.git11.1.fc23.x86_64 #1
    [ 29.382838] Hardware name: LENOVO 422916G/LENOVO, BIOS A1KT53AUS 04/07/2015
    [ 29.382882] 0000000000000000 3ebfc20364115825 ffff880666603c48
    ffffffff81892d4b
    [ 29.382943] 0000000000000000 ffffffff81e124e0 ffff880666603c78
    ffffffff8110bcd7
    [ 29.383004] ffff8800785c9d00 ffff88065485ac58 ffff880c62002800
    ffff880c5fc88ac0
    [ 29.383065] Call Trace:
    [ 29.383084] [] dump_stack+0x4c/0x65
    [ 29.383130] [] lockdep_rcu_suspicious+0xe7/0x120
    [ 29.383178] [] br_fill_ifinfo+0x4a9/0x6a0 [bridge]
    [ 29.383225] [] br_ifinfo_notify+0x11b/0x4b0 [bridge]
    [ 29.383271] [] ? br_hold_timer_expired+0x70/0x70 [bridge]
    [ 29.383320] []
    br_forward_delay_timer_expired+0x58/0x140 [bridge]
    [ 29.383371] [] ? br_hold_timer_expired+0x70/0x70 [bridge]
    [ 29.383416] [] call_timer_fn+0xc3/0x4f0
    [ 29.383454] [] ? call_timer_fn+0x5/0x4f0
    [ 29.383493] [] ? lock_release_holdtime.part.29+0xf/0x200
    [ 29.383541] [] ? br_hold_timer_expired+0x70/0x70 [bridge]
    [ 29.383587] [] run_timer_softirq+0x244/0x490
    [ 29.383629] [] __do_softirq+0xec/0x670
    [ 29.383666] [] irq_exit+0x145/0x150
    [ 29.383703] [] smp_apic_timer_interrupt+0x46/0x60
    [ 29.383744] [] apic_timer_interrupt+0x73/0x80
    [ 29.383782] [] ? cpuidle_enter_state+0x5f/0x2f0
    [ 29.383832] [] ? cpuidle_enter_state+0x5b/0x2f0

    The problem here is that br_forward_delay_timer_expired() is a timer
    handler that calls br_ifinfo_notify(), which assumes that either
    rcu_read_lock() or RTNL is held.

    The simplest fix seems to be adding an RCU read lock section.

    Signed-off-by: Eric Dumazet
    Reported-by: Josh Boyer
    Reported-by: Dominick Grift
    Cc: Vlad Yasevich
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When trying to configure the settings for PHY1, using commands
    like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
    modify other settings apart from the speed of the PHY1, in the
    above case.

    The ethtool seems to query the settings for PHY0, and use this
    as the base to apply the new settings to the PHY1. This is
    causing the other settings of the PHY 1 to be wrongly
    configured.

    The issue is that the '_ethtool_get_settings()' API, which gets called
    for the 'ETHTOOL_GSET' command, clears the 'cmd' structure (of type
    'struct ethtool_cmd') by calling memset. This clears all the
    parameters (if any) passed for the 'ETHTOOL_GSET' cmd. So the
    driver's callback is always invoked with 'cmd->phy_address' as '0'.

    The '_ethtool_get_settings()' is called from other files in the
    'net/core'. So the fix is applied to the 'ethtool_get_settings()'
    which is only called in the context of the 'ethtool'.
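
    The effect described above can be modelled in a few lines of Python.
    This is only a toy sketch, not the kernel code: the dictionary layout,
    the sample PHY values and the names driver_get_settings(),
    get_settings_clearing() and get_settings_preserving() are all
    hypothetical.

```python
# Toy model: the "driver" reports settings for the PHY it is asked about.
# The values below are made up for illustration.
PHY_SETTINGS = {
    0: {"speed": 1000, "duplex": "full"},
    1: {"speed": 10, "duplex": "half"},
}


def driver_get_settings(cmd):
    """Hypothetical driver callback: fill cmd from the PHY in cmd['phy_address']."""
    cmd.update(PHY_SETTINGS[cmd["phy_address"]])
    return cmd


def get_settings_clearing(cmd):
    """Models the problem: the request is zeroed, so phy_address is always 0."""
    cleared = {"phy_address": 0}
    return driver_get_settings(cleared)


def get_settings_preserving(cmd):
    """Models the fix: the caller-supplied phy_address reaches the driver."""
    return driver_get_settings({"phy_address": cmd["phy_address"]})


request = {"phy_address": 1}
print(get_settings_clearing(request)["speed"])    # speed of PHY 0
print(get_settings_preserving(request)["speed"])  # speed of PHY 1
```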

    Signed-off-by: Arun Parameswaran
    Reviewed-by: Ray Jui
    Reviewed-by: Scott Branden
    Signed-off-by: David S. Miller

    Arun Parameswaran
     
  • When more than one multicast address is present in an MLDv2 report, all
    but the first address are ignored, because the code breaks out of the
    loop if there has not been an error adding that address.

    This has caused failures when two guests connected through the bridge
    tried to communicate using IPv6. Neighbor discoveries would not be
    transmitted to the other guest when both used a link-local address and a
    static address.

    This only happens when there is an MLDv2 querier in the network.

    The fix will only break out of the loop when there is a failure adding a
    multicast address.
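
    The inverted loop condition can be illustrated with a small Python
    sketch. It is not the bridge code: add_group() and both
    process_report_*() functions are hypothetical stand-ins that only
    mirror the control flow described above.

```python
def add_group(mdb, port, group):
    """Hypothetical helper: return 0 on success, a negative value on failure."""
    if group is None:  # stand-in for a record that cannot be added
        return -1
    mdb.append((port, group))
    return 0


def process_report_buggy(port, groups):
    mdb = []
    for group in groups:
        err = add_group(mdb, port, group)
        if not err:  # bug: leaves the loop after the first success
            break
    return mdb


def process_report_fixed(port, groups):
    mdb = []
    for group in groups:
        err = add_group(mdb, port, group)
        if err:  # only a failure ends the loop
            break
    return mdb


groups = ["ff02::1:ff7d:6603", "ff02::1:ff00:76"]
print(process_report_buggy("vnet0", groups))
print(process_report_fixed("vnet0", groups))
```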

    The mdb before the patch:

    dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
    dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
    dev ovirtmgmt port bond0.86 grp ff02::2 temp

    After the patch:

    dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
    dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
    dev ovirtmgmt port bond0.86 grp ff02::fb temp
    dev ovirtmgmt port bond0.86 grp ff02::2 temp
    dev ovirtmgmt port bond0.86 grp ff02::d temp
    dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
    dev ovirtmgmt port bond0.86 grp ff02::16 temp
    dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
    dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
    dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp

    Fixes: 08b202b67264 ("bridge br_multicast: IPv6 MLD support.")
    Reported-by: Rik Theys
    Signed-off-by: Thadeu Lima de Souza Cascardo
    Tested-by: Rik Theys
    Signed-off-by: David S. Miller

    Thadeu Lima de Souza Cascardo
     
  • When replacing an IPv4 route, the tb_id member of the new fib_alias
    structure is not set in the replace code path, so the new route is
    ignored.

    Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
    Signed-off-by: Michal Kubecek
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for your net tree, they are:

    1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
    This problem is due to wrong order in the per-net registration and netlink
    socket events. Patch from Francesco Ruggeri.

    2) Make sure that counters that userspace passes us are higher than 0 in all
    the x_tables frontends. Discovered via Trinity, patch from Dave Jones.

    3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
    breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • ip_error does not check if in_dev is NULL before dereferencing it.

    The following sequence of calls is possible:

    CPU A                          CPU B
    ip_rcv_finish
      ip_route_input_noref()
        ip_route_input_slow()
                                   inetdev_destroy()
      dst_input()

    With the result that a network device can be destroyed while processing
    an input packet.

    A crash was triggered with only unicast packets in flight, and
    forwarding enabled on the only network device. The error condition
    was created by the removal of the network device.

    As such it is likely that the error code was -EHOSTUNREACH, and the
    action taken by ip_error (if in_dev had been accessible) would have
    been to not increment any counters and to have tried and likely failed
    to send an icmp error as the network device is going away.

    Therefore handle this weird case by just dropping the packet if
    !in_dev. It will result in dropping the packet sooner, and will not
    result in an actual change of behavior.

    Fixes: 251da4130115b ("ipv4: Cache ip_error() routes even when not forwarding.")
    Reported-by: Vittorio Gambaletta
    Tested-by: Vittorio Gambaletta
    Signed-off-by: Vittorio Gambaletta
    Signed-off-by: "Eric W. Biederman"
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Taking socket spinlock in tcp_get_info() can deadlock, as
    inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
    while packet processing can use the reverse locking order.

    We could avoid this locking for TCP_LISTEN states, but lockdep would
    certainly get confused as all TCP sockets share same lockdep classes.

    [ 523.722504] ======================================================
    [ 523.728706] [ INFO: possible circular locking dependency detected ]
    [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
    [ 523.739202] -------------------------------------------------------
    [ 523.745474] ss/18032 is trying to acquire lock:
    [ 523.750002] (slock-AF_INET){+.-...}, at: [] tcp_get_info+0x2c4/0x360
    [ 523.758129]
    [ 523.758129] but task is already holding lock:
    [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [] inet_diag_dump_icsk+0x1d5/0x6c0
    [ 523.774661]
    [ 523.774661] which lock already depends on the new lock.
    [ 523.774661]
    [ 523.782850]
    [ 523.782850] the existing dependency chain (in reverse order) is:
    [ 523.790326]
    -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
    [ 523.796599] [] lock_acquire+0xbb/0x270
    [ 523.802565] [] _raw_spin_lock+0x38/0x50
    [ 523.808628] [] __inet_hash_nolisten+0x78/0x110
    [ 523.815273] [] tcp_v4_syn_recv_sock+0x24b/0x350
    [ 523.822067] [] tcp_check_req+0x3c1/0x500
    [ 523.828199] [] tcp_v4_do_rcv+0x239/0x3d0
    [ 523.834331] [] tcp_v4_rcv+0xa8e/0xc10
    [ 523.840202] [] ip_local_deliver_finish+0x133/0x3e0
    [ 523.847214] [] ip_local_deliver+0xaa/0xc0
    [ 523.853440] [] ip_rcv_finish+0x168/0x5c0
    [ 523.859624] [] ip_rcv+0x307/0x420

    Let's use the u64_sync infrastructure instead. As a bonus, 64-bit
    arches get optimized, as these are no-ops for them.

    Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 May, 2015

1 commit

  • Vijay reported that a loop as simple as ...

    while true; do
    tc qdisc add dev foo root handle 1: prio
    tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1
    tc qdisc del dev foo root
    rmmod cls_u32
    done

    ... will panic the kernel. Moreover, he bisected the change
    apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
    locking to netlink_lookup() and seq walker").

    The removal of synchronize_net() from the netlink socket
    triggering the qdisc to be removed, seems to have uncovered
    an RCU resp. module reference count race from the tc API.
    Given that RCU conversion was done after e341694e3eb5 ("netlink:
    Convert netlink_lookup() to use RCU protected hash table")
    which added the synchronize_net() originally, occasion of
    hitting the bug was less likely (not impossible though):

    When qdiscs that i) support attaching classifiers and,
    ii) have at least one of them attached, get deleted, they
    invoke tcf_destroy_chain(), and thus call into ->destroy()
    handler from a classifier module.

    After the RCU conversion, all classifiers that have an internal
    prio list unlink them and initiate freeing via call_rcu()
    deferral.

    Meanwhile, tcf_destroy() already releases the reference to the
    tp->ops->owner module before the queued RCU callback handler
    has been invoked.

    A subsequent rmmod on the classifier module is then not prevented,
    since all module references are already dropped.

    By the time the kernel invokes the RCU callback handler from
    the module, that function address is invalid.

    One way to fix it would be to add an rcu_barrier() to
    unregister_tcf_proto_ops() to wait for all pending call_rcu()s
    to complete.

    synchronize_rcu() is not appropriate as under heavy RCU
    callback load, registered call_rcu()s could be deferred
    longer than a grace period. In case we don't have any pending
    call_rcu()s, the barrier is allowed to return immediately.
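
    The ordering problem can be mimicked with a toy Python model. Nothing
    here is kernel API: ToyRCU, ToyModule and unregister() are hypothetical
    classes and functions that only borrow the names call_rcu and
    rcu_barrier to show why pending callbacks must run before the module
    text goes away.

```python
class ToyRCU:
    """Toy callback queue; real RCU grace periods are not modelled."""

    def __init__(self):
        self.pending = []

    def call_rcu(self, callback):
        self.pending.append(callback)

    def rcu_barrier(self):
        # Run every queued callback before returning.
        while self.pending:
            self.pending.pop(0)()


class ToyModule:
    def __init__(self):
        self.loaded = True
        self.freed = 0

    def free_filter(self):
        # Stand-in for a callback whose code lives in the module.
        if not self.loaded:
            raise RuntimeError("callback ran after module unload")
        self.freed += 1


def unregister(rcu, module, use_barrier):
    """Toy unregister step followed by the module unload."""
    if use_barrier:
        rcu.rcu_barrier()
    module.loaded = False


rcu = ToyRCU()
module = ToyModule()
rcu.call_rcu(module.free_filter)
unregister(rcu, module, use_barrier=True)
print(module.freed, len(rcu.pending))
```

    With use_barrier=False the queued callback is still pending after the
    toy unload, and running it later raises the RuntimeError that stands in
    for the invalid function address.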

    Since we came here via unregister_tcf_proto_ops(), there
    are no users of a given classifier anymore. Further nested
    call_rcu()s pointing into the module space are not being
    done anywhere.

    Only cls_bpf_delete_prog() may schedule a work item, to
    unlock pages eventually, but that is not in the range/context
    of cls_bpf anymore.

    Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
    Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
    Reported-by: Vijay Subramanian
    Signed-off-by: Daniel Borkmann
    Cc: John Fastabend
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Jamal Hadi Salim
    Cc: Alexei Starovoitov
    Tested-by: Vijay Subramanian
    Acked-by: Alexei Starovoitov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

21 May, 2015

4 commits

  • This reverts commit ba9d114ec5578e6e99a4dfa37ff8ae688040fd64.

    .. which introduced a regression that prevented all lingering requests
    requeued in kick_requests() from ever being sent to the OSDs, resulting
    in a lot of missed notifies. In retrospect it's pretty obvious that
    r_req_lru_item item in the case of lingering requests can be used not
    only for notarget, but also for unsent linkage due to how tightly
    actual map and enqueue operations are coupled in __map_request().

    The assertion that was being silenced is taken care of in the previous
    ("libceph: request a new osdmap if lingering request maps to no osd")
    commit: by always kicking homeless lingering requests we ensure that
    none of them ends up on the notarget list outside of the critical
    section guarded by request_mutex.

    Cc: stable@vger.kernel.org # 3.18+, needs b0494532214b "libceph: request a new osdmap if lingering request maps to no osd"
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • This commit does two things. First, if there are any homeless
    lingering requests, we now request a new osdmap even if the osdmap that
    is being processed brought no changes, i.e. if a given lingering
    request turned homeless in one of the previous epochs and remained
    homeless in the current epoch. Not doing so leaves us with a stale
    osdmap and as a result we may miss our window for reestablishing the
    watch and lose notifies.

    MON=1 OSD=1:

    # cat linger-needmap.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    rbd map dne/dne # obtain a new osdmap as a side effect (!)
    sleep 1
    ceph osd in 0
    rbd resize --size 2 test
    # rbd info test | grep size -> 2M
    # blockdev --getsize $DEV -> 1M

    N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
    above is enough to make it miss that resize notify, but that is a
    bug^Wlimitation of ceph watch/notify v1.

    Second, homeless lingering requests are now kicked just like those
    lingering requests whose mapping has changed. This is mainly to
    recognize that a homeless lingering request makes no sense and to
    preserve the invariant that a registered lingering request is not
    sitting on any of r_req_lru_item lists. This spares us a WARN_ON,
    which commit ba9d114ec557 ("libceph: clear r_req_lru_item in
    __unregister_linger_request()") tried to fix the _wrong_ way.

    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Sage Weil

    Ilya Dryomov
     
  • When replacing an IPv6 multipath route with "ip route replace", i.e.
    NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
    matching route without fixing its siblings, resulting in corrupted
    siblings linked list; removing one of the siblings can then end in an
    infinite loop.

    IPv6 ECMP implementation is a bit different from IPv4 so that route
    replacement cannot work in exactly the same way. This should be a
    reasonable approximation:

    1. If the new route is ECMP-able and there is a matching ECMP-able one
    already, replace it and all its siblings (if any).

    2. If the new route is ECMP-able and no matching ECMP-able route exists,
    replace first matching non-ECMP-able (if any) or just add the new one.

    3. If the new route is not ECMP-able, replace first matching
    non-ECMP-able route (if any) or add the new route.

    We also need to remove the NLM_F_REPLACE flag after replacing old
    route(s) by first nexthop of an ECMP route so that each subsequent
    nexthop does not replace previous one.
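
    The three rules above can be sketched as a small decision function.
    This is a simplified Python model, not fib6_add_rt2node(): routes are
    plain dictionaries, "existing" holds only routes that already match the
    new one, and all ECMP-able entries in it are assumed to be siblings of
    each other. The name replace_route() and the "ecmp"/"via" keys are
    hypothetical.

```python
def replace_route(existing, new):
    """Return the route list after replacing according to rules 1-3."""
    has_ecmp_match = any(route["ecmp"] for route in existing)

    if new["ecmp"] and has_ecmp_match:
        # Rule 1: replace the matching ECMP-able route and all its siblings.
        return [route for route in existing if not route["ecmp"]] + [new]

    # Rules 2 and 3: replace the first non-ECMP-able route, if any ...
    for index, route in enumerate(existing):
        if not route["ecmp"]:
            return existing[:index] + [new] + existing[index + 1:]

    # ... otherwise just add the new route.
    return existing + [new]


existing = [
    {"via": "fe80::aa", "ecmp": True},
    {"via": "fe80::bb", "ecmp": True},
    {"via": "fe80::cc", "ecmp": False},
]
print(replace_route(existing, {"via": "fe80::dd", "ecmp": True}))
print(replace_route(existing, {"via": "fe80::ee", "ecmp": False}))
```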

    Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
    Signed-off-by: Michal Kubecek
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • If adding a nexthop of an IPv6 multipath route fails, comment in
    ip6_route_multipath() says we are going to delete all nexthops already
    added. However, current implementation deletes even the routes it
    hasn't even tried to add yet. For example, running

    ip route add 1234:5678::/64 \
    nexthop via fe80::aa dev dummy1 \
    nexthop via fe80::bb dev dummy1 \
    nexthop via fe80::cc dev dummy1

    twice results in removing all routes the first command added.

    Limit the second (delete) run to nexthops that succeeded in the first
    (add) run.
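
    A minimal Python sketch of the intended rollback behaviour follows. It
    is an illustration only, not ip6_route_multipath(): add_multipath() and
    add_one() are hypothetical, and the routing table is modelled as a
    plain list in which duplicates are refused.

```python
def add_one(table, nexthop):
    """Hypothetical single-route add that refuses duplicates."""
    if nexthop in table:
        return False
    table.append(nexthop)
    return True


def add_multipath(table, nexthops):
    """Add each nexthop; on failure remove only what this call added."""
    added = []
    for nexthop in nexthops:
        if not add_one(table, nexthop):
            for done in added:  # roll back this call's additions only
                table.remove(done)
            return False
        added.append(nexthop)
    return True


table = []
nexthops = ["fe80::aa", "fe80::bb", "fe80::cc"]
first = add_multipath(table, nexthops)
second = add_multipath(table, nexthops)
print(first, second, table)
```

    The second call fails on its first nexthop, so its "added" list is
    empty and the routes from the first call stay in place.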

    Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
    Signed-off-by: Michal Kubecek
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Michal Kubeček
     

20 May, 2015

7 commits

  • This reverts commit c055d5b03bb4cb69d349d787c9787c0383abd8b2.

    There are two issues:
    'dnat_took_place' made me think that this is related to
    -j DNAT/MASQUERADE.

    But that's only one part of the story. This is also relevant for SNAT
    when we undo snat translation in reverse/reply direction.

    Furthermore, I originally wanted to do this mainly to avoid
    storing ipv6 addresses once we make DNAT/REDIRECT work
    for ipv6 on bridges.

    However, I forgot about SNPT/DNPT which is stateless.

    So we can't escape storing address for ipv6 anyway. Might as
    well do it for ipv4 too.

    Reported-and-tested-by: Bernhard Thaler
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • After improving setsockopt() coverage in trinity, I started triggering
    vmalloc failures pretty reliably from this code path:

    warn_alloc_failed+0xe9/0x140
    __vmalloc_node_range+0x1be/0x270
    vzalloc+0x4b/0x50
    __do_replace+0x52/0x260 [ip_tables]
    do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
    nf_setsockopt+0x65/0x90
    ip_setsockopt+0x61/0xa0
    raw_setsockopt+0x16/0x60
    sock_common_setsockopt+0x14/0x20
    SyS_setsockopt+0x71/0xd0

    It turns out we don't validate that the num_counters field in the
    struct we pass in from userspace is initialized.

    The same problem also exists in ebtables, arptables, ipv6, and the
    compat variants.
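
    The missing validation can be sketched as follows. This is a Python
    illustration of the idea only (reject a zero counter count before
    allocating), not the x_tables code; do_replace_checked() and the
    request dictionary are hypothetical.

```python
def do_replace_checked(request):
    """Reject a replace request whose counter count is zero before allocating."""
    num_counters = request.get("num_counters", 0)
    if num_counters == 0:
        raise ValueError("num_counters must be non-zero")
    # Stand-in for the counter buffer allocation.
    return [0] * num_counters


print(len(do_replace_checked({"num_counters": 4})))
```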

    Signed-off-by: Dave Jones
    Signed-off-by: Pablo Neira Ayuso

    Dave Jones
     
  • nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event
    before registering the pernet_subsys, but the callback relies on data
    structures allocated by pernet init functions.

    When nfnetlink_{log,queue} is loaded, if a netlink message is received after
    the netlink callback is registered but before the pernet_subsys is registered,
    the kernel will panic in the sequence

    nfulnl_rcv_nl_event
    nfnl_log_pernet
    net_generic
    BUG_ON(id == 0) where id is nfnl_log_net_id.

    The panic can be easily reproduced in 4.0.3 by:

    while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done &
    while true ;do ip netns add dummy ; ip netns del dummy ; done &

    This patch moves register_pernet_subsys to earlier in nfnetlink_log_init.
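
    The registration-order race can be modelled in Python. This is a toy
    sketch, not the nfnetlink code: the Subsystem class and init() are
    hypothetical, and "delivering an event after each step" stands in for
    a netlink message arriving between the two registrations.

```python
class Subsystem:
    def __init__(self):
        self.pernet_ready = False
        self.notifier = None

    def register_pernet(self):
        self.pernet_ready = True

    def register_notifier(self):
        self.notifier = self.on_event

    def on_event(self):
        # The callback relies on data set up by the per-net init.
        if not self.pernet_ready:
            raise RuntimeError("per-net data not initialised")
        return "handled"


def init(subsys, steps):
    """Run the steps in order, delivering one event after each step."""
    results = []
    for step in steps:
        step()
        if subsys.notifier is not None:
            try:
                results.append(subsys.notifier())
            except RuntimeError as exc:
                results.append(str(exc))
    return results


old = Subsystem()
print(init(old, [old.register_notifier, old.register_pernet]))
new = Subsystem()
print(init(new, [new.register_pernet, new.register_notifier]))
```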

    Notice that the BUG_ON hit in 4.0.3 was recently removed in 2591ffd308
    ["netns: remove BUG_ONs from net_generic()"].

    Signed-off-by: Francesco Ruggeri
    Signed-off-by: Pablo Neira Ayuso

    Francesco Ruggeri
     
  • …kernel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    This has just a single fix, for a WEP tailroom check
    problem that leads to dropped frames.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • After sending the new data packets to probe (step 2), F-RTO may
    incorrectly send more probes if the next ACK advances SND_UNA and
    does not SACK a new packet. However, F-RTO as specified in RFC 5682
    probes at most once. This bug may cause the sender to always send new
    data instead of repairing holes, inducing longer HoL blocking on the
    receiver for the application.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Undo based on TCP timestamps should only happen on ACKs that advance
    SND_UNA, according to the Eifel algorithm in RFC 3522:

    Section 3.2:

    (4) If the value of the Timestamp Echo Reply field of the
    acceptable ACK's Timestamps option is smaller than the
    value of RetransmitTS, then proceed to step (5),

    Section Terminology:
    We use the term 'acceptable ACK' as defined in [RFC793]. That is an
    ACK that acknowledges previously unacknowledged data.

    This is because upon receiving an out-of-order packet, the receiver
    returns the last timestamp that advances RCV_NXT, not the current
    timestamp of the packet in the DUPACK. Without checking the flag,
    the DUPACK will cause tcp_packet_delayed() to return true and
    tcp_try_undo_loss() will revert cwnd reduction.

    Note that we check the condition in CA_Recovery already by only
    calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
    tcp_try_undo_recovery() if snd_una crosses high_seq.
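
    The condition can be written down as a tiny predicate. The Python
    below is a hedged sketch of the rule quoted from RFC 3522, not the
    kernel's tcp_packet_delayed(): the flag value, the function name
    may_undo_by_timestamp() and the plain integer comparison (which
    ignores timestamp wraparound) are assumptions made for illustration.

```python
FLAG_SND_UNA_ADVANCED = 0x1  # toy flag value, not the kernel's definition


def may_undo_by_timestamp(flag, echoed_ts, retransmit_ts):
    """Allow a timestamp-based undo only for ACKs that advance SND_UNA."""
    if not (flag & FLAG_SND_UNA_ADVANCED):
        return False  # a DUPACK echoes an old timestamp, so ignore it
    return echoed_ts < retransmit_ts


print(may_undo_by_timestamp(0, 100, 200))
print(may_undo_by_timestamp(FLAG_SND_UNA_ADVANCED, 100, 200))
```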

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • Commit 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver")
    simplified the filter for incoming IPv6 multicast but removed
    the check of the local socket address and the UDP destination
    address.

    This patch restores the filter to prevent sockets bound to an IPv6
    multicast address from receiving other UDP traffic, like unicast.

    Signed-off-by: Henning Rogge
    Fixes: 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver")
    Cc: "David S. Miller"
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Henning Rogge
     

19 May, 2015

1 commit


18 May, 2015

2 commits

  • commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages
    send from TIME_WAIT") added the flow label in the last TCP packets.
    Unfortunately, it was not cast properly.

    This patch replaces the buggy shift with be32_to_cpu/cpu_to_be32.

    Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages")
    Reported-by: Eric Dumazet
    Signed-off-by: Florent Fourcot
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • Before the patch, the command 'ip link add bond2 type bond mode 802.3ad'
    causes the kernel to send a rtnl message for the bond2 interface, with an
    ifindex 0.

    'ip monitor' shows:
    0: bond2: mtu 1500 state DOWN group default
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    9: bond2@NONE: mtu 1500 qdisc noop state DOWN group default
    link/ether ea:3e:1f:53:92:7b brd ff:ff:ff:ff:ff:ff
    [snip]

    The patch fixes the spotted bug by checking in bond driver if the interface
    is registered before calling the notifier chain.
    It also adds a check in rtmsg_ifinfo() to prevent this kind of bug in the
    future.

    Fixes: d4261e565000 ("bonding: create netlink event when bonding option is changed")
    CC: Jiri Pirko
    Reported-by: Julien Meunier
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

17 May, 2015

2 commits

  • The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
    eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
    because it doesn't reset portid after a failed netlink_insert.

    This means that should autobind fail the first time around, then
    the socket will be stuck in limbo as it can never be bound again
    since it already has a non-zero portid.
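
    The stuck state can be reproduced in a toy Python model. This is not
    the netlink code: Sock, insert() and autobind() are hypothetical, the
    table is a plain dictionary, and the reset_on_failure switch exists
    only to contrast the two behaviours.

```python
class Sock:
    def __init__(self):
        self.portid = 0


def insert(table, sk, portid, reset_on_failure):
    """Hypothetical insert: fails if the port id is already taken."""
    sk.portid = portid
    if portid in table:
        if reset_on_failure:
            sk.portid = 0  # the reset described above as missing
        return False
    table[portid] = sk
    return True


def autobind(table, sk, candidates, reset_on_failure=True):
    for portid in candidates:
        if sk.portid != 0:
            return False  # looks bound already, so no retry is possible
        if insert(table, sk, portid, reset_on_failure):
            return True
    return False


table = {100: Sock()}  # port id 100 is taken by another socket
print(autobind(table, Sock(), [100, 101], reset_on_failure=False))
print(autobind(table, Sock(), [100, 101], reset_on_failure=True))
```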

    Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Pablo Neira Ayuso says:

    ====================
    The following patchset contains Netfilter fixes for your net tree, they are:

    1) Fix a leak in IPVS, the sysctl table is not released accordingly when
    destroying a netns, patch from Tommi Rantala.

    2) Fix a build error when TPROXY and socket are built-in but IPv6 defrag is
    compiled as module, from Florian Westphal.

    3) Fix TCP tracker wrt. RFC5961 challenge ACK when in LAST_ACK state, patch
    from Jesper Dangaard Brouer.

    4) Fix a bogus WARN_ON() in nf_tables when deleting a set element that stores
    a map, from Mirek Kratochvil.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

16 May, 2015

3 commits

  • The values 0x00000000-0xfffffeff are reserved for userspace datatypes.
    When deleting set elements with maps, a bogus warning is triggered.

    WARNING: CPU: 0 PID: 11133 at net/netfilter/nf_tables_api.c:4481 nft_data_uninit+0x35/0x40 [nf_tables]()

    This fixes the check according to the enum definition in
    include/linux/netfilter/nf_tables.h

    Fixes: https://bugzilla.netfilter.org/show_bug.cgi?id=1013
    Signed-off-by: Mirek Kratochvil
    Signed-off-by: Pablo Neira Ayuso

    Mirek Kratochvil
     
  • In compliance with RFC5961, the network stack sends a challenge ACK in
    response to spurious SYN packets, since commit 0c228e833c88 ("tcp:
    Restore RFC5961-compliant behavior for SYN packets").

    This poses a problem for netfilter conntrack in state LAST_ACK, because
    this challenge ACK is (falsely) seen as ACKing the last FIN, causing a
    false state transition (into TIME_WAIT).

    The challenge ACK is hard to distinguish from a real last ACK. Thus,
    the solution introduces a flag that tracks the potential for seeing a
    challenge ACK, in case a SYN packet is let through and the current
    state is LAST_ACK.

    When the conntrack transition from LAST_ACK to TIME_WAIT happens, this
    flag is used for determining if we are expecting a challenge ACK.

    A Scapy-based reproducer script is available here:
    https://github.com/netoptimizer/network-testing/blob/master/scapy/tcp_hacks_3WHS_LAST_ACK.py

    Fixes: 0c228e833c88 ("tcp: Restore RFC5961-compliant behavior for SYN packets")
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Jozsef Kadlecsik
    Signed-off-by: Pablo Neira Ayuso

    Jesper Dangaard Brouer
     
  • With TPROXY=y but DEFRAG_IPV6=m we get build failure:

    net/built-in.o: In function `tproxy_tg_init':
    net/netfilter/xt_TPROXY.c:588: undefined reference to `nf_defrag_ipv6_enable'

    If DEFRAG_IPV6 is modular, TPROXY must be too.
    (or both must be builtin).

    This enforces =m for both.

    Reported-and-tested-by: Liu Hua
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

15 May, 2015

3 commits

  • RTNH_F_EXTERNAL today is printed as "offload" in iproute2 output.

    This patch renames the flag to be consistent with what the user sees.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • It was reported that traceroute6 would cause
    a kernel to crash when trying to compute checksums
    on raw UDP packets. The cause was the check in
    __ip6_append_data that would attempt to use
    partial checksums on the packet. However,
    raw sockets do not initialize partial checksum
    fields so partial checksums can't be used.

    Solve this the same way IPv4 does it. raw sockets
    pass transhdrlen value of 0 to ip_append_data which
    causes the checksum to be computed in software. Use
    the same check in ip6_append_data (check transhdrlen).

    Reported-by: Wolfgang Walter
    CC: Wolfgang Walter
    CC: Eric Dumazet
    Signed-off-by: Vladislav Yasevich
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • netlink sockets creation and deletion heavily modify nl_table_users
    and nl_table_lock.

    If nl_table is sharing one cache line with one of them, netlink
    performance is really bad on SMP.

    ffffffff81ff5f00 B nl_table
    ffffffff81ff5f0c b nl_table_users

    Putting nl_table in read_mostly section increased performance
    of my open/delete netlink sockets test by about 80 %
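
    The two symbol addresses listed above are 12 bytes apart. A short
    Python check shows that they fall into the same cache line, assuming a
    64-byte line size (the line size is an assumption, it is not stated in
    the commit message; same_cache_line() is a hypothetical helper).

```python
CACHE_LINE = 64  # assumed cache line size in bytes

nl_table = 0xFFFFFFFF81FF5F00
nl_table_users = 0xFFFFFFFF81FF5F0C


def same_cache_line(a, b, line=CACHE_LINE):
    """True if both addresses fall into the same line-sized, aligned block."""
    return a // line == b // line


print(nl_table_users - nl_table)
print(same_cache_line(nl_table, nl_table_users))
```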

    This came up while diagnosing a getaddrinfo() problem.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 May, 2015

2 commits

  • This patch fixes hci_remote_name_evt not resolving names while the
    discovery status is RESOLVING. Before simultaneous dual mode scan is
    enabled, hci_check_pending_name will set the discovery status to
    STOPPED eventually.

    Signed-off-by: Wesley Kuo
    Signed-off-by: Marcel Holtmann

    Wesley Kuo
     
  • Currently vlan notifier handler will try to update all vlans
    for a device when that device comes up. A problem occurs,
    however, when the vlan device was set to promiscuous, but not
    by the user (ex: a bridge). In that case, dev->gflags are
    not updated. What results is that the lower device ends
    up with an extra promiscuity count. Here are the
    backtraces that prove this:
    [62852.052179] [] __dev_set_promiscuity+0x38/0x1e0
    [62852.052186] [] ? _raw_spin_unlock_bh+0x1b/0x40
    [62852.052188] [] ? dev_set_rx_mode+0x2e/0x40
    [62852.052190] [] dev_set_promiscuity+0x24/0x50
    [62852.052194] [] vlan_dev_open+0xd5/0x1f0 [8021q]
    [62852.052196] [] __dev_open+0xbf/0x140
    [62852.052198] [] __dev_change_flags+0x9d/0x170
    [62852.052200] [] dev_change_flags+0x29/0x60

    The above comes from setting the vlan device to IFF_UP state.

    [62852.053569] [] __dev_set_promiscuity+0x38/0x1e0
    [62852.053571] [] ? vlan_dev_set_rx_mode+0x2b/0x30
    [8021q]
    [62852.053573] [] __dev_change_flags+0xe5/0x170
    [62852.053645] [] dev_change_flags+0x29/0x60
    [62852.053647] [] vlan_device_event+0x18a/0x690
    [8021q]
    [62852.053649] [] notifier_call_chain+0x4c/0x70
    [62852.053651] [] raw_notifier_call_chain+0x16/0x20
    [62852.053653] [] call_netdevice_notifiers+0x2d/0x60
    [62852.053654] [] __dev_notify_flags+0x33/0xa0
    [62852.053656] [] dev_change_flags+0x52/0x60
    [62852.053657] [] do_setlink+0x397/0xa40

    And this one comes from the notification code. What we end
    up with is a vlan with a promiscuity count of 1 and a physical
    device with a promiscuity count of 2. They should both have
    a count of 1.

    To resolve this issue, vlan code can use dev_get_flags() api
    which correctly masks promiscuity and allmulti flags.

    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

13 May, 2015

2 commits

  • Pull networking fixes from David Miller:

    1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from
    Avri Altman.

    2) Use the correct FW API for scan completions in iwlwifi, from Avraham
    Stern.

    3) FW monitor in iwlwifi accidentally uses unmapped memory, fix from Liad
    Kaufman.

    4) rhashtable conversion of mac80211 station table was buggy, the
    virtual interface was not taken into account. Fix from Johannes
    Berg.

    5) Fix deadlock in rtlwifi by not using a zero timeout for
    usb_control_msg(), from Larry Finger.

    6) Update reordering state before calculating loss detection, from
    Yuchung Cheng.

    7) Fix off by one in bluetooth firmware parsing, from Dan Carpenter.

    8) Fix extended frame handling in xilinx_can driver, from Jeppe
    Ledet-Pedersen.

    9) Fix CODEL packet scheduler behavior in the presence of TSO packets,
    from Eric Dumazet.

    10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck.

    11) macvlan needs to propagate promisc settings down to the lower
    device, from Vlad Yasevich.

    12) igb driver can oops when changing number of rings, from Toshiaki
    Makita.

    13) Source specific default routes not handled properly in ipv6, from
    Markus Stenberg.

    14) Use after free in tc_ctl_tfilter(), from WANG Cong.

    15) Use softirq spinlocking in netxen driver, from Tony Camuso.

    16) Two ARM bpf JIT fixes from Nicolas Schichan.

    17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from
    Mathias Kretschmer.

    18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei
    Starovoitov.

    19) ll_temac driver DMA maps TX packet header with incorrect length, fix
    from Michal Simek.

    20) We removed pm_qos bits from netdevice.h, but some indirect
    references remained. Kill them. From David Ahern.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
    net: Remove remaining remnants of pm_qos from netdevice.h
    e1000e: Add pm_qos header
    net: phy: micrel: Fix regression in kszphy_probe
    net: ll_temac: Fix DMA map size bug
    x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
    netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
    Update be2net maintainers' email addresses
    net_sched: gred: use correct backlog value in WRED mode
    pppoe: drop pppoe device in pppoe_unbind_sock_work
    net: qca_spi: Fix possible race during probe
    net: mdio-gpio: Allow for unspecified bus id
    af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
    bnx2x: limit fw delay in kdump to 5s after boot
    ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
    ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction.
    mpls: Change reserved label names to be consistent with netbsd
    usbnet: avoid integer overflow in start_xmit
    netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
    net: xgene_enet: Set hardware dependency
    net: amd-xgbe: Add hardware dependency
    ...

    Linus Torvalds
     
  • Usually, RTM_NEWxxx is returned on a get (same as a dump).

    Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

12 May, 2015

1 commit

  • In WRED mode, the backlog for a single virtual queue (VQ) should not be
    used to determine queue behavior; instead the backlog is summed across
    all VQs. This sum is currently used when calculating the average queue
    lengths. It also needs to be used when determining if the queue's hard
    limit has been reached, or when reporting each VQ's backlog via netlink.
    q->backlog will only be used if the queue switches out of WRED mode.

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    David Ward
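
    The backlog selection described in the gred commit above can be sketched as follows. This is a simplified userspace model, not the qdisc code: the toy_* names are hypothetical, and a single shared limit stands in for the queue's hard limit.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_VQS 4

/* Hypothetical, simplified stand-in for a GRED table. */
struct toy_gred {
    unsigned int vq_backlog[TOY_MAX_VQS]; /* bytes queued per virtual queue */
    size_t nr_vqs;                        /* number of VQs in use */
    int wred_mode;                        /* non-zero when in WRED mode */
    unsigned int limit;                   /* hard limit in bytes (assumed shared) */
};

/*
 * Backlog that should drive decisions for one VQ: the sum across all VQs
 * in WRED mode, otherwise that VQ's own backlog.
 */
static unsigned int toy_gred_backlog(const struct toy_gred *t, size_t vq)
{
    unsigned int sum = 0;
    size_t i;

    assert(t->nr_vqs <= TOY_MAX_VQS && vq < t->nr_vqs);
    if (!t->wred_mode)
        return t->vq_backlog[vq];
    for (i = 0; i < t->nr_vqs; i++)
        sum += t->vq_backlog[i];
    return sum;
}

/* Hard-limit check that uses the same backlog value as the averaging code. */
static int toy_gred_may_enqueue(const struct toy_gred *t, size_t vq,
                                unsigned int pkt_len)
{
    return toy_gred_backlog(t, vq) + pkt_len <= t->limit;
}
```

    With two VQs holding 400 and 500 bytes and a limit of 1000, a 200-byte packet is refused in WRED mode (900 + 200 exceeds the limit) but accepted for the first VQ once WRED mode is off.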
     

11 May, 2015

2 commits

  • Remove checking tailroom when adding IV as it uses only
    headroom, and move the check to the ICV generation that
    actually needs the tailroom.

    Otherwise I hit the following warning and the datapath doesn't work
    when testing:
    - IBSS + WEP
    - ath9k with hw crypt enabled
    - IPv6 data (ping6)

    WARNING: CPU: 3 PID: 13301 at net/mac80211/wep.c:102 ieee80211_wep_add_iv+0x129/0x190 [mac80211]()
    [...]
    Call Trace:
    [] dump_stack+0x45/0x57
    [] warn_slowpath_common+0x8a/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] ieee80211_wep_add_iv+0x129/0x190 [mac80211]
    [] ieee80211_crypto_wep_encrypt+0x6b/0xd0 [mac80211]
    [] invoke_tx_handlers+0xc51/0xf30 [mac80211]
    [...]

    Cc: stable@vger.kernel.org
    Signed-off-by: Janusz Dziedzic
    Signed-off-by: Johannes Berg

    Janusz Dziedzic
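
    The split between the two checks in the WEP commit above can be illustrated with a toy buffer model. This is a sketch under assumptions: the toy_* names are hypothetical, the buffer only tracks sizes, and the 4-byte IV and ICV lengths are the standard WEP values.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_WEP_IV_LEN  4 /* IV is prepended, so it consumes headroom */
#define TOY_WEP_ICV_LEN 4 /* ICV is appended, so it consumes tailroom */

/* Hypothetical size-only model of a packet buffer. */
struct toy_buf {
    size_t headroom; /* free bytes before the data */
    size_t len;      /* bytes of data */
    size_t tailroom; /* free bytes after the data */
};

/* Adding the IV checks headroom only. Returns 0 on success, -1 otherwise. */
static int toy_wep_add_iv(struct toy_buf *b)
{
    assert(b != NULL);
    if (b->headroom < TOY_WEP_IV_LEN)
        return -1;
    b->headroom -= TOY_WEP_IV_LEN;
    b->len += TOY_WEP_IV_LEN;
    return 0;
}

/* Generating the ICV in software is the step that needs tailroom. */
static int toy_wep_add_icv(struct toy_buf *b)
{
    assert(b != NULL);
    if (b->tailroom < TOY_WEP_ICV_LEN)
        return -1;
    b->tailroom -= TOY_WEP_ICV_LEN;
    b->len += TOY_WEP_ICV_LEN;
    return 0;
}
```

    A buffer with headroom but no tailroom passes toy_wep_add_iv() in this model, which corresponds to the hardware-crypto case where the ICV is not generated in software.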
     
    This patch fixes an issue where the send(MSG_DONTWAIT) call
    on a TX_RING is not fully non-blocking in cases where the device's sndBuf is
    full. We pass nonblock=true to sock_alloc_send_skb() and return any error
    code that occurs (most likely EAGAIN) to the caller. As the fast-path stays
    as it is, we keep the unlikely() around skb == NULL.

    Signed-off-by: Mathias Kretschmer
    Signed-off-by: David S. Miller

    Kretschmer, Mathias
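
    From the caller's side, the behavior described in the af_packet commit above means a non-blocking send has to be prepared for EAGAIN. The helper below is a minimal userspace sketch; toy_send_nonblock() is a hypothetical name, and it works on any socket rather than being specific to TX_RING (with a TX_RING the frames live in the mapped ring, so send() is typically called with a NULL buffer and zero length).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Try to send without blocking.
 * Returns 1 if the kernel accepted the request, 0 if it would have
 * blocked (EAGAIN/EWOULDBLOCK), and -1 on any other error.
 */
static int toy_send_nonblock(int fd, const void *buf, size_t len)
{
    ssize_t n;

    assert(fd >= 0);
    n = send(fd, buf, len, MSG_DONTWAIT);
    if (n >= 0)
        return 1;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0; /* send buffer full: retry later, e.g. after poll() */
    return -1;
}
```

    A return value of 0 is the case the commit is concerned with: the caller gets control back immediately and can wait for the socket to become writable instead of stalling inside send().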
     

10 May, 2015

2 commits