19 Apr, 2014

3 commits

  • Currently, it is possible to create an SCTP socket, then switch
    auth_enable via sysctl setting to 1 and crash the system on connect:

    Oops[#1]:
    CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
    task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
    [...]
    Call Trace:
    [] sctp_auth_asoc_set_default_hmac+0x68/0x80
    [] sctp_process_init+0x5e0/0x8a4
    [] sctp_sf_do_5_1B_init+0x234/0x34c
    [] sctp_do_sm+0xb4/0x1e8
    [] sctp_endpoint_bh_rcv+0x1c4/0x214
    [] sctp_rcv+0x588/0x630
    [] sctp6_rcv+0x10/0x24
    [] ip6_input+0x2c0/0x440
    [] __netif_receive_skb_core+0x4a8/0x564
    [] process_backlog+0xb4/0x18c
    [] net_rx_action+0x12c/0x210
    [] __do_softirq+0x17c/0x2ac
    [] irq_exit+0x54/0xb0
    [] ret_from_irq+0x0/0x4
    [] rm7k_wait_irqoff+0x24/0x48
    [] cpu_startup_entry+0xc0/0x148
    [] start_kernel+0x37c/0x398
    Code: dd0900b8 000330f8 0126302d 50c0fff1 0047182a a48306a0
    03e00008 00000000
    ---[ end trace b530b0551467f2fd ]---
    Kernel panic - not syncing: Fatal exception in interrupt

    While auth_enable=0, ep->auth_hmacs is initialized to NULL in
    sctp_auth_init_hmacs() when the endpoint is created.

    After that point, if an admin switches over to auth_enable=1,
    the machine can crash due to NULL pointer dereference during
    reception of an INIT chunk. When we enter sctp_process_init()
    via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
    the INIT verification succeeds and while we walk and process
    all INIT params via sctp_process_param() we find that
    net->sctp.auth_enable is set, therefore do not fall through,
    but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
    dereference what we have set to NULL during endpoint
    initialization phase.

    The fix is to make auth_enable immutable by caching its value
    during endpoint initialization, so that its original value is
    being carried along until destruction. The bug seems to originate
    from the very first days.
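
    A minimal user-space sketch of the caching idea, not the kernel patch
    itself. struct endpoint, sysctl_auth_enable, endpoint_init() and
    process_auth_param() are hypothetical stand-ins, and the HMAC ids are
    placeholder values. The endpoint snapshots the flag once at creation and
    later code consults only that snapshot, so flipping the sysctl afterwards
    cannot lead to a NULL auth_hmacs being dereferenced.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical stand-in for the net->sctp.auth_enable sysctl. */
    static bool sysctl_auth_enable;

    /* Placeholder HMAC identifiers, for illustration only. */
    static const int default_hmacs[] = { 1, 3 };

    /* Hypothetical, heavily reduced endpoint. */
    struct endpoint {
        bool auth_enable;       /* sysctl value cached at creation time */
        const int *auth_hmacs;  /* NULL when auth was disabled at creation */
    };

    void endpoint_init(struct endpoint *ep)
    {
        ep->auth_enable = sysctl_auth_enable;
        ep->auth_hmacs = ep->auth_enable ? default_hmacs : NULL;
    }

    /* Returns the first HMAC id, or -1 when the AUTH parameter is skipped. */
    int process_auth_param(const struct endpoint *ep)
    {
        if (!ep->auth_enable)   /* cached flag, not the live sysctl */
            return -1;
        assert(ep->auth_hmacs != NULL);
        return ep->auth_hmacs[0];
    }
    ```

    An endpoint created while the flag is off keeps skipping the AUTH
    parameter even if the flag is switched on later; only endpoints created
    afterwards pick up the new value.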

    Fix in joint work with Daniel Borkmann.

    Reported-by: Joshua Kinard
    Signed-off-by: Vlad Yasevich
    Signed-off-by: Daniel Borkmann
    Acked-by: Neil Horman
    Tested-by: Joshua Kinard
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • John W. Linville says:

    ====================
    pull request: wireless 2014-04-17

    Please pull this batch of fixes intended for the 3.15 stream...

    For the mac80211 bits, Johannes says:

    "We have a fix from Chun-Yeow to not look at management frame bitrates
    that are typically really low, two fixes from Felix for AP_VLAN
    interfaces, a fix from Ido to disable SMPS settings when a monitor
    interface is enabled, a radar detection fix from Michał and a fix from
    myself for a very old remain-on-channel bug."

    For the iwlwifi bits, Emmanuel says:

    "I have new device IDs and a new firmware API. These are the trivial
    ones. The less trivial ones are Johannes's fix that delays the
    enablement of an interrupt coalescing hardware until after association
    - this fixes a few connection problems seen in the field. Eyal has a
    bunch of rate control fixes. I decided to add these for 3.15 because
    they fix some disconnection and packet loss scenarios which were
    reported by the field. I also have a fix for a memory leak that
    happens only with a very new NIC."

    Along with those...

    Amitkumar Karwar fixes a couple of problems relating to driver/firmware
    interactions in mwifiex.

    Christian Engelmayer avoids a couple of potential memory leaks in
    the new rsi driver.

    Eliad Peller provides a wl18xx mailbox alignment fix for problems
    when using new firmware.

    Frederic Danis adds a couple of missing debugging strings to the
    cw1200 driver.

    Geert Uytterhoeven adds a variable initialization inside of the
    rsi driver.

    Luciano Coelho patches the wlcore code to ignore dummy packet events
    in PLT mode in order to work around a firmware bug.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • With LOCKDEP enabled in the kernel config, running these steps:

    modprobe 8021q
    vconfig add eth2 20
    vconfig add eth2.20 30
    ifconfig eth2 xx.xx.xx.xx

    produces the following call trace:

    [32524.386288] =============================================
    [32524.386293] [ INFO: possible recursive locking detected ]
    [32524.386298] 3.14.0-rc2-0.7-default+ #35 Tainted: G O
    [32524.386302] ---------------------------------------------
    [32524.386306] ifconfig/3103 is trying to acquire lock:
    [32524.386310] (&vlan_netdev_addr_lock_key/1){+.....}, at: [] dev_mc_sync+0x64/0xb0
    [32524.386326]
    [32524.386326] but task is already holding lock:
    [32524.386330] (&vlan_netdev_addr_lock_key/1){+.....}, at: [] dev_set_rx_mode+0x23/0x40
    [32524.386341]
    [32524.386341] other info that might help us debug this:
    [32524.386345] Possible unsafe locking scenario:
    [32524.386345]
    [32524.386350] CPU0
    [32524.386352] ----
    [32524.386354] lock(&vlan_netdev_addr_lock_key/1);
    [32524.386359] lock(&vlan_netdev_addr_lock_key/1);
    [32524.386364]
    [32524.386364] *** DEADLOCK ***
    [32524.386364]
    [32524.386368] May be due to missing lock nesting notation
    [32524.386368]
    [32524.386373] 2 locks held by ifconfig/3103:
    [32524.386376] #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x12/0x20
    [32524.386387] #1: (&vlan_netdev_addr_lock_key/1){+.....}, at: [] dev_set_rx_mode+0x23/0x40
    [32524.386398]
    [32524.386398] stack backtrace:
    [32524.386403] CPU: 1 PID: 3103 Comm: ifconfig Tainted: G O 3.14.0-rc2-0.7-default+ #35
    [32524.386409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
    [32524.386414] ffffffff81ffae40 ffff8800d9625ae8 ffffffff814f68a2 ffff8800d9625bc8
    [32524.386421] ffffffff810a35fb ffff8800d8a8d9d0 00000000d9625b28 ffff8800d8a8e5d0
    [32524.386428] 000003cc00000000 0000000000000002 ffff8800d8a8e5f8 0000000000000000
    [32524.386435] Call Trace:
    [32524.386441] [] dump_stack+0x6a/0x78
    [32524.386448] [] __lock_acquire+0x7ab/0x1940
    [32524.386454] [] ? __lock_acquire+0x3ea/0x1940
    [32524.386459] [] lock_acquire+0xe4/0x110
    [32524.386464] [] ? dev_mc_sync+0x64/0xb0
    [32524.386471] [] _raw_spin_lock_nested+0x2a/0x40
    [32524.386476] [] ? dev_mc_sync+0x64/0xb0
    [32524.386481] [] dev_mc_sync+0x64/0xb0
    [32524.386489] [] vlan_dev_set_rx_mode+0x2b/0x50 [8021q]
    [32524.386495] [] __dev_set_rx_mode+0x5f/0xb0
    [32524.386500] [] dev_set_rx_mode+0x2b/0x40
    [32524.386506] [] __dev_open+0xef/0x150
    [32524.386511] [] __dev_change_flags+0xa7/0x190
    [32524.386516] [] dev_change_flags+0x32/0x80
    [32524.386524] [] devinet_ioctl+0x7d6/0x830
    [32524.386532] [] ? dev_ioctl+0x34b/0x660
    [32524.386540] [] inet_ioctl+0x80/0xa0
    [32524.386550] [] sock_do_ioctl+0x2d/0x60
    [32524.386558] [] sock_ioctl+0x82/0x2a0
    [32524.386568] [] do_vfs_ioctl+0x93/0x590
    [32524.386578] [] ? rcu_read_lock_held+0x45/0x50
    [32524.386586] [] ? __fget_light+0x105/0x110
    [32524.386594] [] SyS_ioctl+0x91/0xb0
    [32524.386604] [] system_call_fastpath+0x16/0x1b

    ========================================================================

    The reason is that the addr_lock_key of every vlan dev belongs to the same
    lock class, so when the state of a vlan dev changes, the vlan dev and its
    real dev hold locks of the same class at the same time, which triggers the
    warning.

    We should distinguish the lock depth of a vlan dev from that of its real dev.

    v1->v2: Convert vlan_netdev_addr_lock_key to an array of eight elements, which
    supports stacking up to 8 vlan ids on the same vlan dev. I think that is enough for
    current use, because a netdev's name is limited to IFNAMSIZ, which cannot hold 8 vlan
    ids, and a vlan dev then never shares a class key with its real dev.

    The new function vlan_dev_get_lockdep_subkey() returns the subkey so that the vlan
    dev gets a suitable class key.

    v2->v3: Following David's suggestion, I used the subclass to distinguish the lock key of a vlan dev
    from that of its real dev, but that makes no sense, because a different subclass in the
    lock_class_key does not mean a different class for the lock key. So I use lock_depth
    to distinguish the depth of every vlan dev; vlan devs at the same depth
    can share the same lock_class_key. I import MAX_LOCK_DEPTH from include/linux/sched.h;
    I think it is enough here, since the lock depth should never exceed that value.

    v3->v4: Adding a huge array of locking keys wastes static kernel memory and is not an appropriate method.
    We can use the _nested() variants to fix the problem: calculate the depth of every vlan dev
    and use that depth as the subclass for addr_lock_key.
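
    The v4 approach can be sketched in user space as follows. struct netdev
    and vlan_lock_subclass() are hypothetical, and the exact numbering used
    in the kernel may differ. The point is only that each level of a VLAN
    stack gets its own subclass value, which could then be handed to a
    _nested() lock variant so that lockdep can tell the levels apart.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical, reduced device: real_dev is NULL for a non-VLAN device. */
    struct netdev {
        const char *name;
        const struct netdev *real_dev;
    };

    /* Count how many VLAN layers sit between dev and the physical device. */
    int vlan_lock_subclass(const struct netdev *dev)
    {
        int depth = 0;

        assert(dev != NULL);
        while (dev->real_dev != NULL) {
            depth++;
            dev = dev->real_dev;
        }
        return depth;
    }
    ```

    For the stack from the reproduction steps (eth2, eth2.20, eth2.20.30)
    this yields three distinct values, so taking the address lock of a vlan
    dev and then of its real dev no longer looks like recursive locking.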

    Signed-off-by: Ding Tianhong
    Signed-off-by: David S. Miller

    dingtianhong
     

17 Apr, 2014

7 commits

  • …wireless into for-davem

    John W. Linville
     
  • Because the netdevice may be in another netns than the i/o netns, we should
    use the i/o netns instead of dev_net(dev).

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Because the netdevice may be in another netns than the i/o netns, we should
    use the i/o netns instead of dev_net(dev).

    Note that netdev_priv(dev) cannot be NULL, hence we can remove these useless
    checks.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Because the netdevice may be in another netns than the i/o netns, we should
    use the i/o netns instead of dev_net(dev).

    The variable 'tunnel' was used only to get 'itn', hence to simplify code I
    remove it and use 't' instead.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Make sys_recv a first class citizen by using the SYSCALL_DEFINEx
    macro. Besides being cleaner this will also generate meta data
    for the system call so tracing tools like ftrace or LTTng can
    resolve this system call.

    Signed-off-by: Jan Glauber
    Signed-off-by: David S. Miller

    Jan Glauber
     
  • In my special case, when a packet is redirected from veth0 to lo,
    its skb->dev->ifindex would be LOOPBACK_IFINDEX. Meanwhile we
    pass the hard-coded LOOPBACK_IFINDEX to fib_validate_source()
    in ip_route_input_slow(). This causes the following check
    in fib_validate_source() to fail:

    (dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev))

    when rp_filter is disabled on loopback. As suggested by Julian,
    the caller should pass 0 here so that we do not end up
    calling __fib_validate_source().

    Cc: Eric Biederman
    Cc: Julian Anastasov
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • As suggested by Julian:

    Simply, flowi4_iif must not contain 0, it does not
    look logical to ignore all ip rules with specified iif.

    because in fib_rule_match() we do:

    if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
    goto out;

    flowi4_iif should be LOOPBACK_IFINDEX by default.
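
    Why a zero flowi4_iif is a problem can be seen in a reduced sketch of
    the check quoted above. struct rule, struct flow and rule_iif_matches()
    are simplified stand-ins for illustration, not the kernel definitions:

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define LOOPBACK_IFINDEX 1

    struct rule { int iifindex; };  /* 0 means the rule has no iif selector */
    struct flow { int flowi_iif; };

    /* Mirrors the quoted test: a rule with an iif only matches that iif. */
    bool rule_iif_matches(const struct rule *rule, const struct flow *fl)
    {
        assert(rule != NULL && fl != NULL);
        if (rule->iifindex && rule->iifindex != fl->flowi_iif)
            return false;
        return true;
    }
    ```

    With flowi_iif left at 0, every rule that specifies an iif is skipped,
    including one for the loopback device; defaulting the field to
    LOOPBACK_IFINDEX lets such a rule match locally generated traffic.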

    We need to move LOOPBACK_IFINDEX to include/net/flow.h:

    1) It is mostly used by flowi_iif

    2) It fixes the following compile error, which appears once later
    patches use it in flow.h:

    In file included from include/linux/netfilter.h:277:0,
    from include/net/netns/netfilter.h:5,
    from include/net/net_namespace.h:21,
    from include/linux/netdevice.h:43,
    from include/linux/icmpv6.h:12,
    from include/linux/ipv6.h:61,
    from include/net/ipv6.h:16,
    from include/linux/sunrpc/clnt.h:27,
    from include/linux/nfs_fs.h:30,
    from init/do_mounts.c:32:
    include/net/flow.h: In function ‘flowi4_init_output’:
    include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)

    Cc: Eric Biederman
    Cc: Julian Anastasov
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

16 Apr, 2014

3 commits

  • It's possible to remove the FB tunnel with the command 'ip link del ip6gre0', but
    this is unsafe: the module always assumes that this device exists. For example,
    ip6gre_tunnel_lookup() may use it unconditionally.

    Let's add a rtnl handler for dellink, which will never remove the FB tunnel (we
    let ip6gre_destroy_tunnels() do the job).

    Introduced by commit c12b395a4664 ("gre: Support GRE over IPv6").

    CC: Dmitry Kozlov
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • In the dst->output() path for ipv4, the code assumes the skb it has to
    transmit is attached to an inet socket, specifically via
    ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
    provider of the packet is an AF_PACKET socket.

    The dst->output() method gets an additional 'struct sock *sk'
    parameter. This needs a cascade of changes so that this parameter can
    be propagated from vxlan to final consumer.

    Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
    Reported-by: lucien xin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ip_queue_xmit() assumes the skb it has to transmit is attached to an
    inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
    changed l2tp to not change skb ownership and thus broke this assumption.

    One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
    so that we do not assume skb->sk points to the socket used by l2tp
    tunnel.

    Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
    Reported-by: Zhan Jianyu
    Tested-by: Zhan Jianyu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Apr, 2014

6 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains three Netfilter fixes for your net tree,
    they are:

    * Fix missing generation sequence initialization which results in a splat
    if lockdep is enabled, it was introduced in the recent works to improve
    nf_conntrack scalability, from Andrey Vagin.

    * Don't flush the GRE keymap list in nf_conntrack when the pptp helper is
    disabled otherwise this crashes due to a double release, from Andrey
    Vagin.

    * Fix nf_tables cmp fast in big endian, from Patrick McHardy.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Sometimes, when the packet arrives at skb_mac_gso_segment()
    its skb->mac_len already accounts for some of the mac level
    headers in the packet. This seems to happen when forwarding
    through an OpenSSL tunnel.

    When we start looking for any vlan headers in skb_network_protocol()
    we seem to ignore any of the already known mac headers and start
    with an ETH_HLEN. This results in an incorrect offset, dropped
    TSO frames and general slowness of the connection.

    We can start counting from the known skb->mac_len
    and return at least that much if all mac level headers
    are known and accounted for.

    Fixes: 53d6471cef17262d3ad1c7ce8982a234244f68ec (net: Account for all vlan headers in skb_mac_gso_segment)
    CC: Eric Dumazet
    CC: Daniel Borkmann
    Tested-by: Martin Filip
    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • This reverts commit ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management
    to reflect real state of the receiver's buffer") as it introduced a
    serious performance regression on SCTP over IPv4 and IPv6, though not
    as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

    Current state:

    [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
    iperf version 3.0.1 (10 January 2014)
    Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
    Time: Fri, 11 Apr 2014 17:56:21 GMT
    Connecting to host 192.168.241.3, port 5201
    Cookie: Lab200slot2.1397238981.812898.548918
    [ 4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
    Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.00-1.09 sec 20.8 MBytes 161 Mbits/sec
    [ 4] 1.09-2.13 sec 10.8 MBytes 86.8 Mbits/sec
    [ 4] 2.13-3.15 sec 3.57 MBytes 29.5 Mbits/sec
    [ 4] 3.15-4.16 sec 4.33 MBytes 35.7 Mbits/sec
    [ 4] 4.16-6.21 sec 10.4 MBytes 42.7 Mbits/sec
    [ 4] 6.21-6.21 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 6.21-7.35 sec 34.6 MBytes 253 Mbits/sec
    [ 4] 7.35-11.45 sec 22.0 MBytes 45.0 Mbits/sec
    [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 11.45-11.45 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 11.45-12.51 sec 16.0 MBytes 126 Mbits/sec
    [ 4] 12.51-13.59 sec 20.3 MBytes 158 Mbits/sec
    [ 4] 13.59-14.65 sec 13.4 MBytes 107 Mbits/sec
    [ 4] 14.65-16.79 sec 33.3 MBytes 130 Mbits/sec
    [ 4] 16.79-16.79 sec 0.00 Bytes 0.00 bits/sec
    [ 4] 16.79-17.82 sec 5.94 MBytes 48.7 Mbits/sec
    (etc)

    [root@Lab200slot2 ~]# iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
    iperf version 3.0.1 (10 January 2014)
    Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
    Time: Fri, 11 Apr 2014 19:08:41 GMT
    Connecting to host 2001:db8:0:f101::1, port 5201
    Cookie: Lab200slot2.1397243321.714295.2b3f7c
    [ 4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
    Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.00-1.00 sec 169 MBytes 1.42 Gbits/sec
    [ 4] 1.00-2.00 sec 201 MBytes 1.69 Gbits/sec
    [ 4] 2.00-3.00 sec 188 MBytes 1.58 Gbits/sec
    [ 4] 3.00-4.00 sec 174 MBytes 1.46 Gbits/sec
    [ 4] 4.00-5.00 sec 165 MBytes 1.39 Gbits/sec
    [ 4] 5.00-6.00 sec 199 MBytes 1.67 Gbits/sec
    [ 4] 6.00-7.00 sec 163 MBytes 1.36 Gbits/sec
    [ 4] 7.00-8.00 sec 174 MBytes 1.46 Gbits/sec
    [ 4] 8.00-9.00 sec 193 MBytes 1.62 Gbits/sec
    [ 4] 9.00-10.00 sec 196 MBytes 1.65 Gbits/sec
    [ 4] 10.00-11.00 sec 157 MBytes 1.31 Gbits/sec
    [ 4] 11.00-12.00 sec 175 MBytes 1.47 Gbits/sec
    [ 4] 12.00-13.00 sec 192 MBytes 1.61 Gbits/sec
    [ 4] 13.00-14.00 sec 199 MBytes 1.67 Gbits/sec
    (etc)

    After patch:

    [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
    iperf version 3.0.1 (10 January 2014)
    Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
    Time: Mon, 14 Apr 2014 16:40:48 GMT
    Connecting to host 192.168.240.3, port 5201
    Cookie: Lab200slot2.1397493648.413274.65e131
    [ 4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
    Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.00-1.00 sec 240 MBytes 2.02 Gbits/sec
    [ 4] 1.00-2.00 sec 239 MBytes 2.01 Gbits/sec
    [ 4] 2.00-3.00 sec 240 MBytes 2.01 Gbits/sec
    [ 4] 3.00-4.00 sec 239 MBytes 2.00 Gbits/sec
    [ 4] 4.00-5.00 sec 245 MBytes 2.05 Gbits/sec
    [ 4] 5.00-6.00 sec 240 MBytes 2.01 Gbits/sec
    [ 4] 6.00-7.00 sec 240 MBytes 2.02 Gbits/sec
    [ 4] 7.00-8.00 sec 239 MBytes 2.01 Gbits/sec

    With the revert applied, SCTP performance is back to normal on latest
    upstream for both IPv4 and IPv6 and has the same throughput as the
    3.4.2 test kernel; it is steady and the interval reports are smooth again.

    Fixes: ef2820a735f7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
    Reported-by: Peter Butler
    Reported-by: Dongsheng Song
    Reported-by: Fengguang Wu
    Tested-by: Peter Butler
    Signed-off-by: Daniel Borkmann
    Cc: Matija Glavinic Pecotic
    Cc: Alexander Sverdlin
    Cc: Vlad Yasevich
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
    been wrongly decoded by commit a8fc927780 ("sk-filter: Add ability to
    get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
    although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.

    In practice, this should not have much side-effect though, as such
    conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
    operation PR_GET_SECCOMP will only return the current seccomp mode, but
    not the filter itself. Since the transition to the new BPF infrastructure,
    it's also not used anymore, so we can simply remove this as it's
    unreachable.

    Fixes: a8fc927780 ("sk-filter: Add ability to get socket filter program (v2)")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Cc: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • John W. Linville
     
  • Francois reported that setting a big MTU on the loopback device could prevent
    TCP sessions from making progress.

    We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets.

    We must limit the IPv6 MTU to (65535 + 40) bytes in theory.
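
    As a rough illustration of the limit (the macro and function names are
    made up for this sketch and are not taken from the patch): the IPv6
    payload length field is 16 bits wide, so without Jumbograms a packet can
    carry at most 65535 payload bytes after the 40-byte fixed header.

    ```c
    #include <assert.h>

    #define IPV6_HDR_LEN        40u     /* fixed IPv6 header */
    #define IPV6_MAX_PAYLOAD    65535u  /* 16-bit payload length field */
    #define IP6_MAX_MTU_SKETCH  (IPV6_MAX_PAYLOAD + IPV6_HDR_LEN)

    /* Cap a device MTU at the largest non-Jumbogram IPv6 packet size. */
    unsigned int clamp_ipv6_mtu(unsigned int dev_mtu)
    {
        return dev_mtu > IP6_MAX_MTU_SKETCH ? IP6_MAX_MTU_SKETCH : dev_mtu;
    }

    /* Largest payload that fits in a packet of an already clamped MTU. */
    unsigned int ipv6_max_payload(unsigned int mtu)
    {
        assert(mtu >= IPV6_HDR_LEN && mtu <= IP6_MAX_MTU_SKETCH);
        return mtu - IPV6_HDR_LEN;
    }
    ```

    With the loopback MTU of 70000 from the test below, the clamped value
    still leaves a payload size that fits the 16-bit length field.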

    Tested:

    ifconfig lo mtu 70000
    netperf -H ::1

    Before patch : Throughput : 0.05 Mbits

    After patch : Throughput : 35484 Mbits

    Reported-by: Francois WELLENREITER
    Signed-off-by: Eric Dumazet
    Acked-by: YOSHIFUJI Hideaki
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Apr, 2014

5 commits

  • nft_cmp_fast is used for equality comparisons of size <= 4. For
    comparisons of size < 4 bytes, a mask is calculated that is applied to
    both the data from userspace (during initialization) and the register
    value (during runtime). Both values are stored using (in effect) memcpy
    to a memory area that is then interpreted as u32 by nft_cmp_fast.

    This works fine on little endian since smaller types have the same base
    address, however on big endian this is not true and the smaller types
    are interpreted as a big number with trailing zero bytes.

    The mask therefore must not include the lower bytes, but the higher bytes
    on big endian. Add a helper function that does a cpu_to_le32 to switch
    the bytes on big endian. Since we're dealing with a mask of just consecutive
    bits, this works out fine.
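
    A user-space sketch of the masking idea, assuming 32-bit registers.
    cmp_fast_mask() and cmp_fast_equal() are illustrative names, and
    to_le32() is a hand-rolled stand-in for the kernel's cpu_to_le32(). The
    mask is built as a low-bits value and then converted to little-endian
    byte order, so in memory it always covers the first len bytes.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    static int host_is_big_endian(void)
    {
        const uint16_t probe = 1;
        unsigned char first;

        memcpy(&first, &probe, 1);
        return first == 0;
    }

    static uint32_t swap32(uint32_t v)
    {
        return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
               ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
    }

    /* Stand-in for cpu_to_le32(): a no-op on little endian hosts. */
    static uint32_t to_le32(uint32_t v)
    {
        return host_is_big_endian() ? swap32(v) : v;
    }

    /* Mask whose in-memory bytes cover the first len_bits / 8 bytes. */
    uint32_t cmp_fast_mask(unsigned int len_bits)
    {
        assert(len_bits >= 1 && len_bits <= 32);
        return to_le32(UINT32_MAX >> (32 - len_bits));
    }

    /* Compare the first len_bytes of two 4-byte areas, ignoring the rest. */
    int cmp_fast_equal(const void *reg, const void *data, unsigned int len_bytes)
    {
        uint32_t a, b, mask = cmp_fast_mask(len_bytes * 8);

        memcpy(&a, reg, sizeof(a));
        memcpy(&b, data, sizeof(b));
        return (a & mask) == (b & mask);
    }
    ```

    Without the byte-order conversion, a 16-bit mask on a big endian host
    would select the last two bytes in memory, which are not the ones that
    were copied in.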

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     
  • [ 251.920788] INFO: trying to register non-static key.
    [ 251.921386] the code is fine but needs lockdep annotation.
    [ 251.921386] turning off the locking correctness validator.
    [ 251.921386] CPU: 2 PID: 15715 Comm: socket_listen Not tainted 3.14.0+ #294
    [ 251.921386] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 251.921386] 0000000000000000 000000009d18c210 ffff880075f039b8 ffffffff816b7ecd
    [ 251.921386] ffffffff822c3b10 ffff880075f039c8 ffffffff816b36f4 ffff880075f03aa0
    [ 251.921386] ffffffff810c65ff ffffffff810c4a85 00000000fffffe01 ffffffffa0075172
    [ 251.921386] Call Trace:
    [ 251.921386] [] dump_stack+0x45/0x56
    [ 251.921386] [] register_lock_class.part.24+0x38/0x3c
    [ 251.921386] [] __lock_acquire+0x168f/0x1b40
    [ 251.921386] [] ? trace_hardirqs_on_caller+0x105/0x1d0
    [ 251.921386] [] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
    [ 251.921386] [] ? _raw_spin_unlock_bh+0x35/0x40
    [ 251.921386] [] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
    [ 251.921386] [] lock_acquire+0xa2/0x120
    [ 251.921386] [] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
    [ 251.921386] [] __nf_conntrack_confirm+0x129/0x410 [nf_conntrack]
    [ 251.921386] [] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
    [ 251.921386] [] ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
    [ 251.921386] [] ? ip_fragment+0x9f0/0x9f0
    [ 251.921386] [] nf_iterate+0xaa/0xc0
    [ 251.921386] [] ? ip_fragment+0x9f0/0x9f0
    [ 251.921386] [] nf_hook_slow+0xa4/0x190
    [ 251.921386] [] ? ip_fragment+0x9f0/0x9f0
    [ 251.921386] [] ip_output+0x92/0x100
    [ 251.921386] [] ip_local_out+0x29/0x90
    [ 251.921386] [] ip_queue_xmit+0x170/0x4c0
    [ 251.921386] [] ? ip_queue_xmit+0x5/0x4c0
    [ 251.921386] [] tcp_transmit_skb+0x498/0x960
    [ 251.921386] [] tcp_connect+0x812/0x960
    [ 251.921386] [] ? ktime_get_real+0x25/0x70
    [ 251.921386] [] ? secure_tcp_sequence_number+0x6a/0xc0
    [ 251.921386] [] tcp_v4_connect+0x317/0x470
    [ 251.921386] [] __inet_stream_connect+0xb5/0x330
    [ 251.921386] [] ? lock_sock_nested+0x33/0xa0
    [ 251.921386] [] ? trace_hardirqs_on+0xd/0x10
    [ 251.921386] [] ? __local_bh_enable_ip+0x75/0xe0
    [ 251.921386] [] inet_stream_connect+0x38/0x50
    [ 251.921386] [] SYSC_connect+0xe7/0x120
    [ 251.921386] [] ? current_kernel_time+0x69/0xd0
    [ 251.921386] [] ? trace_hardirqs_on_caller+0x105/0x1d0
    [ 251.921386] [] ? trace_hardirqs_on+0xd/0x10
    [ 251.921386] [] SyS_connect+0xe/0x10
    [ 251.921386] [] system_call_fastpath+0x16/0x1b
    [ 312.014104] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=60003 jiffies, g=42359, c=42358, q=333)
    [ 312.015097] INFO: Stall ended before state dump start

    Fixes: 93bb0ceb75be ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
    Cc: Jesper Dangaard Brouer
    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Jozsef Kadlecsik
    Cc: "David S. Miller"
    Signed-off-by: Andrey Vagin
    Signed-off-by: Pablo Neira Ayuso

    Andrey Vagin
     
  • The BPF_S_ANC_NLATTR and BPF_S_ANC_NLATTR_NEST extensions fail to check
    for a minimal message length before testing the supplied offset to be
    within the bounds of the message. This allows the subtraction of the nla
    header to underflow and therefore -- as the data type is unsigned --
    allows far too big offset and length values for the search of the
    netlink attribute.

    The remainder calculation for the BPF_S_ANC_NLATTR_NEST extension is
    also wrong. It has the minuend and subtrahend mixed up, therefore
    calculates a huge length value, allowing to overrun the end of the
    message while looking for the netlink attribute.

    The following three BPF snippets will trigger the bugs when attached to
    a UNIX datagram socket and parsing a message with length 1, 2 or 3.

    ,-[ PoC for missing size check in BPF_S_ANC_NLATTR ]--
    | ld #0x87654321
    | ldx #42
    | ld #nla
    | ret a
    `---

    ,-[ PoC for the same bug in BPF_S_ANC_NLATTR_NEST ]--
    | ld #0x87654321
    | ldx #42
    | ld #nlan
    | ret a
    `---

    ,-[ PoC for wrong remainder calculation in BPF_S_ANC_NLATTR_NEST ]--
    | ; (needs a fake netlink header at offset 0)
    | ld #0
    | ldx #42
    | ld #nlan
    | ret a
    `---

    Fix the first issue by ensuring the message length fulfills the minimal
    size constraints of a nla header. Fix the second bug by getting the math
    for the remainder calculation right.
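
    Both checks can be illustrated with plain unsigned arithmetic. This is
    a sketch under assumptions, not the patched filter code: the function
    names are invented, and the only fact taken for granted is that a
    netlink attribute header is 4 bytes long.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    #define NLA_HDRLEN_SKETCH 4u  /* two 16-bit fields: nla_len, nla_type */

    /* Buggy form: msg_len - 4 wraps around when msg_len < 4. */
    bool nla_offset_unchecked(unsigned int msg_len, unsigned int off)
    {
        return off <= msg_len - NLA_HDRLEN_SKETCH;
    }

    /* Checked form: reject messages too short to hold a header first. */
    bool nla_offset_ok(unsigned int msg_len, unsigned int off)
    {
        if (msg_len < NLA_HDRLEN_SKETCH)
            return false;
        return off <= msg_len - NLA_HDRLEN_SKETCH;
    }

    /* Remainder is msg_len - off, not off - msg_len. */
    bool nla_len_fits(unsigned int msg_len, unsigned int off, unsigned int nla_len)
    {
        assert(off <= msg_len);  /* caller has validated the offset */
        return nla_len <= msg_len - off;
    }
    ```

    For a 1-byte message and offset 42, as in the first PoC, the unchecked
    form accepts the offset because 1 - 4 wraps to a huge unsigned value,
    while the checked form rejects it.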

    Fixes: 4738c1db15 ("[SKFILTER]: Add SKF_ADF_NLATTR instruction")
    Fixes: d214c7537b ("filter: add SKF_AD_NLATTR_NEST to look for nested..")
    Cc: Patrick McHardy
    Cc: Pablo Neira Ayuso
    Signed-off-by: Mathias Krause
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Mathias Krause
     
  • Extend commit 13378cad02afc2adc6c0e07fca03903c7ada0b37
    ("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid
    RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0
    which is displayed as 'iif *'.

    inet_iif is not appropriate to use because skb_iif is not set.
    Use the skb->dev->ifindex instead.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Plug a group_info refcount leak in ping_init.
    group_info is only needed during initialization and
    the code failed to release the reference on exit.
    While here move grabbing the reference to a place
    where it is actually needed.

    Signed-off-by: Chuansheng Liu
    Signed-off-by: Zhang Dongxing
    Signed-off-by: xiaoming wang
    Signed-off-by: David S. Miller

    Wang, Xiaoming
     

13 Apr, 2014

3 commits

  • Pull yet more networking updates from David Miller:

    1) Various fixes to the new Redpine Signals wireless driver, from
    Fariya Fatima.

    2) L2TP PPP connect code takes PMTU from the wrong socket, fix from
    Dmitry Petukhov.

    3) UFO and TSO packets differ in whether they include the protocol
    header in gso_size, account for that in skb_gso_transport_seglen().
    From Florian Westphal.

    4) If VLAN untagging fails, we double free the SKB in the bridging
    output path. From Toshiaki Makita.

    5) Several call sites of sk->sk_data_ready() were referencing an SKB
    just added to the socket receive queue in order to calculate the
    second argument via skb->len. This is dangerous because the moment
    the skb is added to the receive queue it can be consumed in another
    context and freed up.

    It turns out also that none of the sk->sk_data_ready()
    implementations even care about this second argument.

    So just kill it off and thus fix all these use-after-free bugs as a
    side effect.

    6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti.

    7) pktgen needs to do locking properly for LLTX devices, from Daniel
    Borkmann.

    8) xen-netfront driver initializes TX array entries in RX loop :-) From
    Vincenzo Maffione.

    9) After refactoring, some tunnel drivers allow a tunnel to be
    configured on top of itself. Fix from Nicolas Dichtel.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
    vti: don't allow to add the same tunnel twice
    gre: don't allow to add the same tunnel twice
    drivers: net: xen-netfront: fix array initialization bug
    pktgen: be friendly to LLTX devices
    r8152: check RTL8152_UNPLUG
    net: sun4i-emac: add promiscuous support
    net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
    net: ipv6: Fix oif in TCP SYN+ACK route lookup.
    drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts
    drivers: net: cpsw: discard all packets received when interface is down
    net: Fix use after free by removing length arg from sk_data_ready callbacks.
    Drivers: net: hyperv: Address UDP checksum issues
    Drivers: net: hyperv: Negotiate suitable ndis version for offload support
    Drivers: net: hyperv: Allocate memory for all possible per-pecket information
    bridge: Fix double free and memory leak around br_allowed_ingress
    bonding: Remove debug_fs files when module init fails
    i40evf: program RSS LUT correctly
    i40evf: remove open-coded skb_cow_head
    ixgb: remove open-coded skb_cow_head
    igbvf: remove open-coded skb_cow_head
    ...

    Linus Torvalds
     
  • Before the patch, it was possible to add the same tunnel twice:
    ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
    ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41

    It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
    argument dev->type, which was set only later (when calling ndo_init handler
    in register_netdevice()). Let's set this type in the setup handler, which is
    called before newlink handler.

    Introduced by commit b9959fd3b0fa ("vti: switch to new ip tunnel code").

    CC: Cong Wang
    CC: Steffen Klassert
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Before the patch, it was possible to add the same tunnel twice:
    ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
    ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249

    This was possible because ip_tunnel_newlink() calls ip_tunnel_find() with the
    argument dev->type, which was set only later (by the ndo_init handler called
    from register_netdevice()). Set this type in the setup handler instead, which
    is called before the newlink handler.

    Introduced by commit c54419321455 ("GRE: Refactor GRE tunneling code.").

    CC: Pravin B Shelar
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

12 Apr, 2014

5 commits

  • Similarly to commit 43279500deca ("packet: respect devices with
    LLTX flag in direct xmit"), the same approach applies to pktgen.
    This helps testing against LLTX devices such as the dummy driver
    (or others), which have only a single netdevice txq and would
    otherwise require pktgen to lock that txq even though, as in the
    dummy case, no locking is needed. Fix this by making use of the
    HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX is respected.
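
    The effect of honoring the LLTX flag can be sketched in userspace. This is
    an illustrative model only: struct netdev, struct txq, hard_tx_lock(),
    hard_tx_unlock() and xmit_burst() are hypothetical names, F_LLTX stands in
    for NETIF_F_LLTX, and a counter replaces the real txq lock.

```c
#include <assert.h>

#define F_LLTX 0x1u              /* stand-in for NETIF_F_LLTX */

struct txq {
    int lock_depth;              /* 1 while the modeled lock is held */
    int lock_acquisitions;       /* how often the lock was taken */
};

struct netdev {
    unsigned features;
    struct txq txq;
};

/* Take the txq lock only for devices that do not do their own locking. */
static void hard_tx_lock(struct netdev *dev)
{
    if (dev->features & F_LLTX)
        return;
    assert(dev->txq.lock_depth == 0);
    dev->txq.lock_depth++;
    dev->txq.lock_acquisitions++;
}

static void hard_tx_unlock(struct netdev *dev)
{
    if (dev->features & F_LLTX)
        return;
    assert(dev->txq.lock_depth == 1);
    dev->txq.lock_depth--;
}

/* Transmit count packets, locking around each one as a generator would. */
static int xmit_burst(struct netdev *dev, int count)
{
    int sent = 0;

    for (int i = 0; i < count; i++) {
        hard_tx_lock(dev);
        sent++;                  /* the driver's xmit routine would run here */
        hard_tx_unlock(dev);
    }
    return sent;
}
```

    In this model an LLTX device sends the whole burst without a single lock
    acquisition, while a regular device takes the lock once per packet.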

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jesper Dangaard Brouer
    Cc: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull 9p changes from Eric Van Hensbergen:
    "A bunch of updates and cleanup within the transport layer,
    particularly with a focus on RDMA"

    * tag 'for-linus-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    9pnet_rdma: check token type before int conversion
    9pnet: trans_fd : allocate struct p9_trans_fd and struct p9_conn together.
    9pnet: p9_client->conn field is unused. Remove it.
    9P: Get rid of REQ_STATUS_FLSH
    9pnet_rdma: add cancelled()
    9pnet_rdma: update request status during send
    9P: Add cancelled() to the transport functions.
    net: Mark function as static in 9p/client.c
    9P: Add memory barriers to protect request fields over cb/rpc threads handoff

    Linus Torvalds
     
  • net-next commit 9c76a11, ipv6: tcp_ipv6 policy route issue, had
    a boolean logic error that caused incorrect behaviour for TCP
    SYN+ACK when oif-based rules are in use. Specifically:

    1. If a SYN comes in from a global address, and sk_bound_dev_if
    is not set, the routing lookup has oif set to the interface
    the SYN came in on. Instead, it should have oif unset,
    because for global addresses, the incoming interface doesn't
    necessarily have any bearing on the interface the SYN+ACK is
    sent out on.
    2. If a SYN comes in from a link-local address, and
    sk_bound_dev_if is set, the routing lookup has oif set to the
    interface the SYN came in on. Instead, it should have oif set
    to sk_bound_dev_if, because that's what the application
    requested.
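
    The two cases above can be checked with a small userspace sketch. The
    function names and parameters are hypothetical, and synack_oif_buggy() shows
    only one way such a boolean error can arise (an || where && is intended); it
    is an assumption that happens to reproduce both symptoms, not a copy of the
    kernel diff. A return value of 0 means "oif unset".

```c
#include <assert.h>
#include <stdbool.h>

/* Intended selection: a bound device wins, link-local falls back to the
 * incoming interface, everything else leaves oif unset. */
static int synack_oif(int bound_dev_if, bool src_is_link_local, int incoming_if)
{
    if (bound_dev_if)
        return bound_dev_if;
    if (src_is_link_local)
        return incoming_if;
    return 0;
}

/* Assumed shape of the flawed logic: the incoming interface is used
 * whenever the socket is unbound OR the source is link-local. */
static int synack_oif_buggy(int bound_dev_if, bool src_is_link_local,
                            int incoming_if)
{
    int oif = bound_dev_if;

    if (!bound_dev_if || src_is_link_local)
        oif = incoming_if;
    return oif;
}
```

    For a global source on an unbound socket, the flawed version returns the
    incoming interface where the intended one returns 0 (case 1). For a
    link-local source on a bound socket, the flawed version ignores the bound
    device (case 2).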

    Signed-off-by: Lorenzo Colitti
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     
  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But as soon as we place the SKB onto the socket receive queue, it
    can be consumed and freed. So this skb->len access potentially
    reads freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no implementation of this callback actually uses
    the length argument. And since nobody cared about its value,
    lots of call sites pass arbitrary values such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.
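
    The resulting pattern can be sketched in userspace. This is a simplified
    model with hypothetical types (struct pkt, struct sock here is not the
    kernel's) and helper names; it only illustrates that the notification
    callback takes the socket alone, so the producer has no reason to touch the
    packet after queueing it.

```c
#include <assert.h>
#include <stddef.h>

struct sock;
typedef void (*data_ready_fn)(struct sock *sk);   /* no length argument */

struct pkt {
    size_t len;
    struct pkt *next;
};

struct sock {
    struct pkt *head, *tail;     /* receive queue */
    size_t qlen;
    int wakeups;                 /* how often data_ready fired */
    data_ready_fn data_ready;
};

static void queue_tail(struct sock *sk, struct pkt *p)
{
    p->next = NULL;
    if (sk->tail)
        sk->tail->next = p;
    else
        sk->head = p;
    sk->tail = p;
    sk->qlen++;
}

static struct pkt *dequeue(struct sock *sk)
{
    struct pkt *p = sk->head;

    if (!p)
        return NULL;
    sk->head = p->next;
    if (!sk->head)
        sk->tail = NULL;
    sk->qlen--;
    return p;
}

static void default_data_ready(struct sock *sk)
{
    sk->wakeups++;               /* a real callback would wake up readers */
}

/* Producer side: queue the packet, then notify without touching p again. */
static void deliver(struct sock *sk, struct pkt *p)
{
    queue_tail(sk, p);
    /* From here on a consumer may already have dequeued and freed p. */
    sk->data_ready(sk);
}
```

    A consumer that needs a length reads it from the packet it dequeues, which
    it owns at that point.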

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • br_allowed_ingress() has two problems.

    1. If br_allowed_ingress() is called by br_handle_frame_finish() and
    vlan_untag() in br_allowed_ingress() fails, skb will be freed by both
    vlan_untag() and br_handle_frame_finish().

    2. If br_allowed_ingress() is called by br_dev_xmit() and
    br_allowed_ingress() fails, the skb will not be freed.

    Fix these two problems by freeing the skb in br_allowed_ingress()
    if it fails.
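
    The ownership rule that the fix establishes ("the check frees the buffer
    when it fails") can be sketched in userspace. All names below are
    hypothetical, struct buf stands in for an skb, and a counter records frees
    so that double frees and leaks become visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct buf { bool vlan_ok; };

static int frees;                /* number of buffers released so far */

static struct buf *make_buf(bool vlan_ok)
{
    struct buf *b = calloc(1, sizeof(*b));

    assert(b != NULL);
    b->vlan_ok = vlan_ok;
    return b;
}

static void buf_free(struct buf *b)
{
    free(b);
    frees++;
}

/* Returns true if the buffer may pass. On failure the buffer has already
 * been freed here, so no caller may touch or free it again. */
static bool allowed_ingress(struct buf *b)
{
    if (!b->vlan_ok) {
        buf_free(b);
        return false;
    }
    return true;
}

/* Receive-side caller: must not free again on failure (no double free). */
static int rx_path(struct buf *b)
{
    if (!allowed_ingress(b))
        return -1;
    buf_free(b);                 /* normal consumption of the buffer */
    return 0;
}

/* Transmit-side caller: relies on the callee's free (no leak). */
static int tx_path(struct buf *b)
{
    if (!allowed_ingress(b))
        return -1;
    buf_free(b);
    return 0;
}
```

    Each buffer is freed exactly once on every path, whichever caller the
    failure occurs under.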

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     

11 Apr, 2014

3 commits


10 Apr, 2014

2 commits

  • When the l2tp driver tries to get the PMTU for the tunnel destination, it
    uses the pointer to the struct sock that represents the PPPoX socket, while
    it should use the pointer that represents the UDP socket of the tunnel.

    Signed-off-by: Dmitry Petukhov
    Signed-off-by: David S. Miller

    Dmitry Petukhov
     
  • In function sctp_wake_up_waiters(), we need to test whether the
    association is declared dead. If so, we no longer have a
    reference to a possible sibling association and need to invoke
    sctp_write_space() instead, which walks the socket's associations
    normally and notifies them of new wmem space. The reason for
    special-casing is that otherwise we could run into the following
    issue when a sctp_primitive_SEND() call from sctp_sendmsg() fails
    and tries to flush an association's outq:

    sctp_association_free()
    `-> list_del(&asoc->asocs)
        asoc->base.dead = true
        sctp_outq_free(&asoc->outqueue)
        `-> __sctp_outq_teardown()
         `-> sctp_chunk_free()
          `-> consume_skb()
           `-> sctp_wfree()
            `-> sctp_wake_up_waiters() (with ep->sndbuf_policy=0)

    Therefore, only walk the list in an 'optimized' way if we find
    that the current association is still active. We could also use
    list_del_init() in addition when we call sctp_association_free(),
    but as Vlad suggests, we want to trap such bugs and thus leave
    it poisoned as is.

    Why is it safe to resolve the issue by testing for asoc->base.dead?
    Parallel calls to sctp_sendmsg() are protected by the socket lock,
    that is lock_sock()/release_sock(). Only within that path, with the
    lock held, do we set the skb/chunk owner via sctp_set_owner_w().
    Eventually, chunks are freed directly by an association still
    under that lock. So when traversing the association list at
    destruction time from sctp_wake_up_waiters() via sctp_wfree(), a
    different CPU can't be running sctp_wfree() while another one calls
    sctp_association_free(), as both happen under the same lock.
    Therefore, this also cannot race with setting/testing
    asoc->base.dead, as we are guaranteed for this to happen in order,
    under lock. Further, Vlad says: the times we check asoc->base.dead
    is when we've cached an association pointer for later processing.
    In between cache and processing, the association may have been
    freed and is simply still around due to reference counts. We check
    asoc->base.dead under a lock, so it should always be safe to check
    and not race against sctp_association_free(). Stress-testing seems
    fine now, too.
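
    The special case can be illustrated with a userspace model. This is a
    sketch under assumptions, not the kernel implementation: it uses a plain
    singly linked list instead of the kernel's list API, models only the
    per-socket sndbuf policy, and all names (struct assoc, struct sock_model,
    POISON, the helper functions) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct assoc {
    struct assoc *next;
    bool dead;
    int woken;                   /* number of write-space notifications */
};

struct sock_model { struct assoc *head; };

static struct assoc poison;      /* marks the link of an unlinked association */
#define POISON (&poison)

/* Plain walk over everything still linked on the socket. */
static void write_space_all(struct sock_model *sk)
{
    for (struct assoc *a = sk->head; a; a = a->next)
        a->woken++;
}

/* Unlink the association, poison its link and mark it dead. */
static void assoc_free(struct sock_model *sk, struct assoc *asoc)
{
    struct assoc **pp = &sk->head;

    while (*pp && *pp != asoc)
        pp = &(*pp)->next;
    if (*pp)
        *pp = asoc->next;
    asoc->next = POISON;
    asoc->dead = true;
}

static void wake_up_waiters(struct sock_model *sk, struct assoc *asoc)
{
    struct assoc *a;

    if (asoc->dead) {            /* link is poisoned, do not follow it */
        write_space_all(sk);
        return;
    }
    /* 'Optimized' walk: start at the sibling after asoc, wrap around at the
     * end of the list, and notify asoc itself last. */
    a = asoc->next ? asoc->next : sk->head;
    for (;;) {
        assert(a != POISON);
        a->woken++;
        if (a == asoc)
            break;
        a = a->next ? a->next : sk->head;
    }
}
```

    Without the dead check, the call made after assoc_free() would follow the
    poisoned link, which is the failure mode the call chain above describes.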

    Fixes: cd253f9f357d ("net: sctp: wake up all assocs if sndbuf policy is per socket")
    Signed-off-by: Daniel Borkmann
    Cc: Vlad Yasevich
    Acked-by: Neil Horman
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Apr, 2014

3 commits

  • The rate controller in firmware may also return the Tx rate
    used for management frames, which are usually sent at the lowest
    Tx rate (1 Mbps in 2.4 GHz). So update last_tx_rate only
    if the frame is a data frame.

    This patch is tested with ath9k_htc.

    Signed-off-by: Chun-Yeow Yeoh
    Signed-off-by: Johannes Berg

    Chun-Yeow Yeoh
     
  • If chandef had non-HT width it was possible for
    radar_enabled update to not be propagated properly
    through drv_config(). This happened because
    ieee80211_hw_conf_chan() would never see different
    local->hw.conf.chandef and local->_oper_chandef.

    This wasn't a problem with HT chandefs because
    _oper_chandef width is reset to non-HT in
    ieee80211_free_chanctx(), which makes
    ieee80211_hw_conf_chan() kick in.

    This problem led (at least) ath10k to not start
    CAC if prior CAC was cancelled and both CACs were
    requested for identical non-HT chandefs.

    Signed-off-by: Michal Kazior
    Signed-off-by: Johannes Berg

    Michal Kazior
     
  • All antennas should be operational when monitoring to maximize
    reception.

    Signed-off-by: Ido Yariv
    Signed-off-by: Johannes Berg

    Ido Yariv