31 Aug, 2013

8 commits

  • Pull networking fixes from David Miller:

    1) There was a simplification in the ipv6 ndisc packet sending
    attempted here, which avoided using memory accounting on the
    per-netns ndisc socket for sending NDISC packets. It did fix some
    important issues, but it causes regressions so it gets reverted here
    too. Specifically, the problem with this change is that the IPV6
    output path really depends upon there being a valid skb->sk
    attached.

    The reason we want to do this change in some form when we figure out
    how to do it right, is that if a device goes down the ndisc_sk
    socket send queue will fill up and block NDISC packets that we want
    to send to other devices too. That's really bad behavior.

    Hopefully Thomas can come up with a better version of this change.

    2) Fix a severe TCP performance regression by reverting a change made
    to dev_pick_tx() quite some time ago. From Eric Dumazet.

    3) TIPC returns wrongly signed error codes, fix from Erik Hugne.

    4) Fix OOPS when doing IPSEC over ipv4 tunnels due to orphaning the
    skb->sk too early. Fix from Li Hongjun.

    5) RAW ipv4 sockets can use the wrong routing key during lookup, from
    Chris Clark.

    6) Similar to #1 revert an older change that tried to use plain
    alloc_skb() for SYN/ACK TCP packets, this broke the netfilter owner
    mark which needs to see the skb->sk for such frames. From Phil
    Oester.

    7) BNX2x driver bug fixes from Ariel Elior and Yuval Mintz,
    specifically in the handling of virtual functions.

    8) IPSEC path error propagations to sockets is not done properly when
    we have v4 in v6, and v6 in v4 type rules. Fix from Hannes Frederic
    Sowa.

    9) Fix missing channel context release in mac80211, from Johannes Berg.

    10) Fix network namespace handling wrt. SCM_RIGHTS, from Andy
    Lutomirski.

    11) Fix usage of bogus NAPI weight in jme, netxen, and ps3_gelic
    drivers. From Michal Schmidt.

    12) Hopefully a complete and correct fix for the genetlink dump locking
    and module reference counting. From Pravin B Shelar.

    13) sk_busy_loop() must do a cpu_relax(), from Eliezer Tamir.

    14) Fix handling of timestamp offset when restoring a snapshotted TCP
    socket. From Andrew Vagin.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
    net: fec: fix time stamping logic after napi conversion
    net: bridge: convert MLDv2 Query MRC into msecs_to_jiffies for max_delay
    mISDN: return -EINVAL on error in dsp_control_req()
    net: revert 8728c544a9c ("net: dev_pick_tx() fix")
    Revert "ipv6: Don't depend on per socket memory for neighbour discovery messages"
    ipv4 tunnels: fix an oops when using ipip/sit with IPsec
    tipc: set sk_err correctly when connection fails
    tcp: tcp_make_synack() should use sock_wmalloc
    bridge: separate querier and query timer into IGMP/IPv4 and MLD/IPv6 ones
    ipv6: Don't depend on per socket memory for neighbour discovery messages
    ipv4: sendto/hdrincl: don't use destination address found in header
    tcp: don't apply tsoffset if rcv_tsecr is zero
    tcp: initialize rcv_tstamp for restored sockets
    net: xilinx: fix memleak
    net: usb: Add HP hs2434 device to ZLP exception table
    net: add cpu_relax to busy poll loop
    net: stmmac: fixed the pbl setting with DT
    genl: Hold reference on correct module while netlink-dump.
    genl: Fix genl dumpit() locking.
    xfrm: Fix potential null pointer dereference in xdst_queue_output
    ...

    Linus Torvalds
     
  • While looking into MLDv1/v2 code, I noticed that bridging code does
    not convert its max delay into jiffies for MLDv2 messages as we do
    in core IPv6 multicast code.

    RFC3810, 5.1.3. Maximum Response Code says:

    The Maximum Response Code field specifies the maximum time allowed
    before sending a responding Report. The actual time allowed, called
    the Maximum Response Delay, is represented in units of milliseconds,
    and is derived from the Maximum Response Code as follows: [...]

    As we update timers that work with jiffies, we need to convert it.

    Signed-off-by: Daniel Borkmann
    Cc: Linus Lüssing
    Cc: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
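
The conversion described above can be sketched in user space as follows. This is a minimal illustration, not the bridge code: the decoding follows RFC 3810 section 5.1.3, while the function names and the `hz` parameter are hypothetical, and the tick conversion only assumes that `msecs_to_jiffies()` rounds up.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the 16-bit MLDv2 Maximum Response Code into milliseconds,
 * following RFC 3810 section 5.1.3. */
uint32_t sketch_mldv2_mrc_to_msecs(uint16_t mrc)
{
	uint32_t exp, mant;

	if (mrc < 32768)
		return mrc;		/* small codes are the delay itself */

	exp = (mrc >> 12) & 0x7;	/* 3-bit exponent */
	mant = mrc & 0x0fff;		/* 12-bit mantissa */

	return (mant | 0x1000) << (exp + 3);
}

/* Hypothetical stand-in for msecs_to_jiffies(): converts milliseconds
 * to timer ticks at hz ticks per second, rounding up. */
uint64_t sketch_msecs_to_ticks(uint32_t msecs, uint32_t hz)
{
	assert(hz > 0);
	return ((uint64_t)msecs * hz + 999) / 1000;
}
```

Feeding the decoded value straight into a jiffies-based timer would make the timeout depend on the tick rate, which is why the second step matters.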
     
  • commit 8728c544a9cbdc ("net: dev_pick_tx() fix") and commit
    b6fe83e9525a ("bonding: refine IFF_XMIT_DST_RELEASE capability")
    are quite incompatible: queue selection is disabled because the skb
    dst was dropped before entering the bonding device.

    This causes a major performance regression, mainly because TCP packets
    for a given flow can be sent to multiple queues.

    This is particularly visible when using the new FQ packet scheduler
    with MQ + FQ setup on the slaves.

    We can safely revert the first commit now that 416186fbf8c5b
    ("net: Split core bits of netdev_pick_tx into __netdev_pick_tx")
    properly caps the queue_index.

    Reported-by: Xi Wang
    Diagnosed-by: Xi Wang
    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Alexander Duyck
    Cc: Denys Fedorysychenko
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This reverts commit 1f324e38870cc09659cf23bc626f1b8869e201f2.

    It seems to cause regressions, and in particular the output path
    really depends upon there being a socket attached to skb->sk for
    checks such as sk_mc_loop(skb->sk) for example. See ip6_output_finish2().

    Reported-by: Stephen Warren
    Reported-by: Fabio Estevam
    Signed-off-by: David S. Miller

    David S. Miller
     
  • Since commit 3d7b46cd20e3 (ip_tunnel: push generic protocol handling to
    ip_tunnel module.), an Oops is triggered when an xfrm policy is configured on
    an IPv4 over IPv4 tunnel.

    xfrm4_policy_check() calls __xfrm_policy_check2(), which uses skb_dst(skb). But
    this field is NULL because iptunnel_pull_header() calls skb_dst_drop(skb).

    Signed-off-by: Li Hongjun
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Li Hongjun
     
  • Should a connect fail, if the publication/server is unavailable or
    due to some other error, a positive value will be returned and errno
    is never set. If the application code checks for an explicit zero
    return from connect (success) or a negative return (failure), it
    will not catch the error and subsequent send() calls will fail as
    shown from the strace snippet below.

    socket(0x1e /* PF_??? */, SOCK_SEQPACKET, 0) = 3
    connect(3, {sa_family=0x1e /* AF_??? */, sa_data="\2\1\322\4\0\0\322\4\0\0\0\0\0\0"}, 16) = 111
    sendto(3, "test", 4, 0, NULL, 0) = -1 EPIPE (Broken pipe)

    The reason for this behaviour is that TIPC wrongly inverts error
    codes set in sk_err.

    Signed-off-by: Erik Hugne
    Signed-off-by: David S. Miller

    Erik Hugne
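
A minimal sketch of the sign problem, assuming the usual convention that a socket stores its pending error as a positive errno value and the connect path reports it negated. The struct and function names are hypothetical and do not mirror the TIPC sources.

```c
#include <assert.h>
#include <errno.h>

struct sketch_sock {
	int sk_err;	/* pending error, expected to be a positive errno */
};

/* Report and clear the pending error: 0 on success, -errno on failure. */
int sketch_sock_error(struct sketch_sock *sk)
{
	int err;

	assert(sk);
	err = sk->sk_err;
	sk->sk_err = 0;
	return -err;
}

/* Buggy variant: the error is stored already negated, so the
 * caller ends up with a positive return value. */
int sketch_connect_refused_buggy(struct sketch_sock *sk)
{
	sk->sk_err = -ECONNREFUSED;
	return sketch_sock_error(sk);
}

/* Fixed variant: store the positive errno, report it negated. */
int sketch_connect_refused_fixed(struct sketch_sock *sk)
{
	sk->sk_err = ECONNREFUSED;
	return sketch_sock_error(sk);
}
```

With the buggy variant, a caller that only checks for a negative result never notices the failed connect, which matches the strace output above.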
     
  • In commit 90ba9b19 (tcp: tcp_make_synack() can use alloc_skb()), Eric changed
    the call to sock_wmalloc in tcp_make_synack to alloc_skb. In doing so,
    the netfilter owner match lost its ability to block the SYNACK packet on
    outbound listening sockets. Revert the change, restoring the owner match
    functionality.

    This closes netfilter bugzilla #847.

    Signed-off-by: Phil Oester
    Signed-off-by: David S. Miller

    Phil Oester
     
  • Currently we would still potentially suffer multicast packet loss if there
    is just either an IGMP or an MLD querier: For the former case, we would
    possibly drop IPv6 multicast packets, for the latter IPv4 ones. This is
    because we are currently assuming that if either an IGMP or MLD querier
    is present that the other one is present, too.

    This patch makes the behaviour and fix added in
    "bridge: disable snooping if there is no querier" (b00589af3b04)
    to also work if there is either just an IGMP or an MLD querier on the
    link: It refines the deactivation of the snooping to be protocol
    specific by using separate timers for the snooped IGMP and MLD queries
    as well as separate timers for our internal IGMP and MLD queriers.

    Signed-off-by: Linus Lüssing
    Signed-off-by: David S. Miller

    Linus Lüssing
     

30 Aug, 2013

5 commits

  • Steffen Klassert says:

    ====================
    This pull request fixes some issues that arise when 6in4 or 4in6 tunnels
    are used in combination with IPsec, all from Hannes Frederic Sowa and a
    null pointer dereference when queueing packets to the policy hold queue.

    1) We might access the local error handler of the wrong address family if
    a 6in4 or 4in6 tunnel is protected by ipsec. Fix this by adding a pointer
    to the correct local_error to xfrm_state_afinet.

    2) Add a helper function to always refer to the correct interpretation
    of skb->sk.

    3) Call skb_reset_inner_headers to record the position of the inner headers
    when adding a new one in various ipv6 tunnels. This is needed to identify
    the addresses where to send back errors in the xfrm layer.

    4) Dereference inner ipv6 header if encapsulated to always call the
    right error handler.

    5) Choose protocol family by skb protocol to not call the wrong
    xfrm{4,6}_local_error handler in case an ipv6 socket is used
    in ipv4 mode.

    6) Partly revert "xfrm: introduce helper for safe determination of mtu"
    because this introduced pmtu discovery problems.

    7) Set skb->protocol on tcp, raw and ip6_append_data generated skbs.
    We need this to get the correct mtu information in xfrm.

    8) Fix null pointer dereference in xdst_queue_output.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Allocating skbs when sending out neighbour discovery messages
    currently uses sock_alloc_send_skb() based on a per net namespace
    socket and thus shares a socket wmem buffer space.

    If a netdevice is temporarily unable to transmit due to carrier
    loss or for other reasons, the queued up ndisc messages will consume
    all of the wmem space and will thus prevent any more skbs from
    being allocated, even for netdevices that are able to transmit packets.

    The number of neighbour discovery messages sent is very limited,
    simply use alloc_skb() and don't depend on any socket wmem space any
    longer.

    This patch has originally been posted by Eric Dumazet in a modified
    form.

    Signed-off-by: Thomas Graf
    Cc: Eric Dumazet
    Acked-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • ipv4: raw_sendmsg: don't use header's destination address

    A sendto() regression was bisected and found to start with commit
    f8126f1d5136be1 (ipv4: Adjust semantics of rt->rt_gateway.)

    The problem is that it tries to ARP-lookup the constructed packet's
    destination address rather than the explicitly provided address.

    Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used.

    cf. commit 2ad5b9e4bd314fc685086b99e90e5de3bc59e26b

    Reported-by: Chris Clark
    Bisected-by: Chris Clark
    Tested-by: Chris Clark
    Suggested-by: Julian Anastasov
    Signed-off-by: Chris Clark
    Signed-off-by: David S. Miller

    Chris Clark
     
  • The zero value means that tsecr is not valid, so it's a special case.

    tsoffset is used to customize tcp_time_stamp for one socket.
    tsoffset is usually zero, it's used when a socket was moved from one
    host to another host.

    Currently this issue affects logic of tcp_rcv_rtt_measure_ts. Due to
    incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
    TCP_RTO_MAX.

    Cc: Pavel Emelyanov
    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Reported-by: Cyrill Gorcunov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrew Vagin
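
The special case described above can be sketched like this. It is an illustration only: the function name is hypothetical, and the subtraction direction assumes that tsoffset is added to outgoing timestamps and therefore has to be removed from the echoed value.

```c
#include <assert.h>
#include <stdint.h>

/* Translate a received timestamp echo reply back into the local clock.
 * Zero means "no valid echo" and is passed through untouched, so that
 * later RTT code can still recognise it as invalid. */
uint32_t sketch_adjust_tsecr(uint32_t rcv_tsecr, uint32_t tsoffset)
{
	if (rcv_tsecr == 0)
		return 0;

	return rcv_tsecr - tsoffset;	/* unsigned wraparound is intended */
}
```

Without the zero check, an invalid echo would turn into an arbitrary large value as soon as tsoffset is non-zero.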
     
  • u32 rcv_tstamp; /* timestamp of last received ACK */

    Its value is used in tcp_retransmit_timer, which closes the socket
    if the last ack was received more than TCP_RTO_MAX ago.

    Currently rcv_tstamp is initialized to zero and if tcp_retransmit_timer
    is called before receiving a first ack, the connection is closed.

    This patch initializes rcv_tstamp to a timestamp, when a socket was
    restored.

    Cc: Pavel Emelyanov
    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Reported-by: Cyrill Gorcunov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrew Vagin
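
To illustrate why a zero rcv_tstamp is dangerous, here is a small sketch of an age check of the kind described above. The function name is hypothetical and the exact condition in tcp_retransmit_timer may differ; the sketch only assumes a 32-bit tick counter compared through a signed difference.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 when the last ACK is older than rto_max ticks. The signed
 * difference keeps the comparison valid across counter wraparound. */
int sketch_last_ack_too_old(uint32_t now, uint32_t rcv_tstamp,
			    uint32_t rto_max)
{
	assert(rto_max <= INT32_MAX);
	return (int32_t)(now - rcv_tstamp) > (int32_t)rto_max;
}
```

A restored socket whose rcv_tstamp is still zero looks as if its last ACK arrived at boot time, so the check fires immediately; setting rcv_tstamp to the current timestamp on restore avoids that.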
     

29 Aug, 2013

3 commits

  • netlink dump operations take module as parameter to hold
    reference for entire netlink dump duration.
    Currently it holds ref only on genl module which is not correct
    when we use ops registered to genl from another module.
    Following patch adds module pointer to genl_ops so that netlink
    can hold ref count on it.

    CC: Jesse Gross
    CC: Johannes Berg
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • In case of genl-family with parallel ops off, the dumpit() callback
    is expected to run under genl_lock. But commit def3117493eafd9df
    (genl: Allow concurrent genl callbacks.) changed this behaviour
    where only first dumpit() op was called under genl-lock.
    For subsequent dump, only nlk->cb_lock was taken.
    Following patch fixes it by defining locked dumpit() and done()
    callback which takes care of genl-locking.

    CC: Jesse Gross
    CC: Johannes Berg
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • Some architectures, such as ARM-32 do not return the same base address
    when you call kmap_atomic() twice on the same page.
    This causes problems for the memmove() call in the XDR helper routine
    "_shift_data_right_pages()", since it defeats the detection of
    overlapping memory ranges, and has been seen to corrupt memory.

    The fix is to distinguish between the case where we're doing an
    inter-page copy or not. In the former case we know that the memory
    ranges cannot possibly overlap, so we can additionally micro-optimise
    by replacing memmove() with memcpy().

    Reported-by: Mark Young
    Reported-by: Matt Craighead
    Cc: Bruce Fields
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust
    Tested-by: Matt Craighead

    Trond Myklebust
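
The idea behind the fix can be sketched in user space as follows. The page type and function name are hypothetical; the point is that the decision is made on page identity rather than on mapped addresses, which is what breaks when two mappings of the same page return different addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_page {
	char data[64];
};

/* Copy len bytes between two page offsets. Overlap is only possible
 * when source and destination are the same page, so page identity
 * decides between memmove() and memcpy(). */
void sketch_shift_copy(struct sketch_page *to, size_t to_off,
		       struct sketch_page *from, size_t from_off,
		       size_t len)
{
	assert(to && from);
	assert(to_off + len <= sizeof(to->data));
	assert(from_off + len <= sizeof(from->data));

	if (to != from)
		/* distinct pages cannot overlap */
		memcpy(to->data + to_off, from->data + from_off, len);
	else
		/* same page: use one base address for both ends */
		memmove(to->data + to_off, to->data + from_off, len);
}
```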
     

28 Aug, 2013

3 commits

  • The net_device might not be set on the skb when we try refcounting.
    This leads to a null pointer dereference in xdst_queue_output().
    It turned out that the refcount to the net_device is not needed
    after all. The dst_entry has a refcount to the net_device before
    we queue the skb, so it can't go away. Therefore we can remove the
    refcount on queueing to fix the null pointer dereference.

    Signed-off-by: Steffen Klassert

    Steffen Klassert
     
  • John W. Linville says:

    ====================
    This is one more set of fixes intended for the 3.11 stream...

    For the mac80211 bits, Johannes says:

    "I have three more patches for the 3.11 stream: Felix's fix for the
    fairly visible brcmsmac crash, a fix from Simon for an IBSS join bug I
    found and a fix for a channel context bug in IBSS I'd introduced."

    Along with those...

    Sujith Manoharan makes a minor change to not use a PLL hang workaround
    for AR9550. This one-liner fixes a couple of bugs reported in the Red Hat
    bugzilla.

    Helmut Schaa addresses an ath9k_htc bug that mangles frame headers
    during Tx. This fix is small, tested by the bug reporter and isolated
    to ath9k_htc.

    Stanislaw Gruszka reverts a recent iwl4965 change that broke rfkill
    notification to user space.

    Please let me know if there are problems!
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This is a security bug.

    The follow-up will fix nsproxy to discourage this type of issue from
    happening again.

    Cc: stable@vger.kernel.org
    Signed-off-by: Andy Lutomirski
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Andy Lutomirski
     

26 Aug, 2013

2 commits

  • Currently we don't initialize skb->protocol when transmitting data via
    tcp, raw(with and without inclhdr) or udp+ufo or appending data directly
    to the socket transmit queue (via ip6_append_data). This needs to be
    done so that we can get the correct mtu in the xfrm layer.

    Setting of skb->protocol happens only in functions where we also have
    a transmitting socket and a new skb, so we don't overwrite old values.

    Cc: Steffen Klassert
    Cc: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Steffen Klassert

    Hannes Frederic Sowa
     
  • In commit 0ea9d5e3e0e03a63b11392f5613378977dae7eca ("xfrm: introduce
    helper for safe determination of mtu") I switched the determination of
    ipv4 mtus from dst_mtu to ip_skb_dst_mtu. This was an error because in
    case of IP_PMTUDISC_PROBE we fall back to the interface mtu, which is
    never correct for ipv4 ipsec.

    This patch partly reverts 0ea9d5e3e0e03a63b11392f5613378977dae7eca
    ("xfrm: introduce helper for safe determination of mtu").

    Cc: Steffen Klassert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Steffen Klassert

    Hannes Frederic Sowa
     

23 Aug, 2013

3 commits


22 Aug, 2013

1 commit


21 Aug, 2013

8 commits

  • My earlier patch "mac80211: change IBSS channel state to chandef"
    created a regression by ignoring the channel parameter in
    __ieee80211_sta_join_ibss, which breaks IBSS channel selection. This
    patch fixes this situation by using the right channel and adopting the
    selected bandwidth mode.

    Cc: stable@vger.kernel.org
    Signed-off-by: Simon Wunderlich
    Signed-off-by: Johannes Berg

    Simon Wunderlich
     
  • brcm80211 cannot handle sending frames with CCK rates as part of an
    A-MPDU session. Other drivers may have issues too. Set the flag in all
    drivers that have been tested with CCK rates.

    This fixes a reported brcmsmac regression introduced in
    commit ef47a5e4f1aaf1d0e2e6875e34b2c9595897bef6
    "mac80211/minstrel_ht: fix cck rate sampling"

    Cc: stable@vger.kernel.org # 3.10
    Reported-by: Tom Gundersen
    Signed-off-by: Felix Fietkau
    Signed-off-by: Johannes Berg

    Felix Fietkau
     
  • IBSS needs to release the channel context when leaving
    but I evidently missed that. Fix it.

    Cc: stable@vger.kernel.org
    Signed-off-by: Johannes Berg

    Johannes Berg
     
  • The VLAN code needs to know the length of the per-port VLAN bitmap to
    perform its most basic operations (retrieving VLAN information, removing
    VLANs, forwarding database manipulation, etc). Unfortunately, in the
    current implementation we are using a macro that indicates the bitmap
    size in longs in places where the size in bits is expected, which in
    some cases can cause what appear to be random failures.
    Use the correct macro.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
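
A small sketch of the longs-versus-bits mix-up, using hypothetical macro and function names rather than the bridge code's own. Passing the bitmap length in longs where a bit count is expected silently limits the search to the first few VLAN IDs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_VLAN_N_VID	4096	/* bits in the per-port bitmap */
#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITMAP_LONGS \
	((SKETCH_VLAN_N_VID + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

void sketch_set_bit(unsigned long *map, size_t bit)
{
	assert(bit < SKETCH_VLAN_N_VID);
	map[bit / SKETCH_BITS_PER_LONG] |= 1UL << (bit % SKETCH_BITS_PER_LONG);
}

/* Return the index of the first set bit below nbits, or nbits if none. */
size_t sketch_find_first_bit(const unsigned long *map, size_t nbits)
{
	size_t i;

	for (i = 0; i < nbits; i++)
		if (map[i / SKETCH_BITS_PER_LONG] &
		    (1UL << (i % SKETCH_BITS_PER_LONG)))
			return i;
	return nbits;
}
```

With VLAN 200 set, a search over SKETCH_VLAN_N_VID bits finds it, while a search over SKETCH_BITMAP_LONGS "bits" (64 or 128, depending on the word size) reports an empty bitmap.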
     
  • John W. Linville says:

    ====================
    Regarding the iwlwifi bits, Johannes says:

    "We revert an rfkill bugfix that unfortunately caused more bugs, shuffle
    some code to avoid touching the PCIe device before it's enabled and
    disconnect if firmware fails to do our bidding. I also have Stanislaw's
    fix to not crash in some channel switch scenarios."

    As for the mac80211 bits, Johannes says:

    "This time, I have one fix from Dan Carpenter for users of
    nl80211hdr_put(), and one fix from myself fixing a regression with the
    libertas driver."

    Along with the above...

    Dan Carpenter fixes some incorrectly placed "address of" operators
    in hostap that caused copying of junk data.

    Jussi Kivilinna corrects zd1201 to use an allocated buffer rather
    than the stack for a URB operation.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • getsockopt PACKET_STATISTICS returns tp_packets + tp_drops. Commit
    ee80fbf301 ("packet: account statistics only in tpacket_stats_u")
    cleaned up the getsockopt PACKET_STATISTICS code.
    This also changed semantics. Historically, tp_packets included
    tp_drops on return. The commit removed the line that adds tp_drops
    into tp_packets.

    This patch reinstates the old semantics.

    Signed-off-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
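
The reinstated semantics can be sketched like this. The struct mirrors only the two field names mentioned above, and the helper names are hypothetical.

```c
#include <assert.h>

struct sketch_tpacket_stats {
	unsigned int tp_packets;
	unsigned int tp_drops;
};

/* Build the snapshot handed to user space: historically tp_packets
 * includes the dropped packets as well as the queued ones. */
struct sketch_tpacket_stats sketch_report_stats(unsigned int queued,
						unsigned int dropped)
{
	struct sketch_tpacket_stats st = {
		.tp_packets = queued,
		.tp_drops = dropped,
	};

	st.tp_packets += st.tp_drops;
	return st;
}

/* What a reader relying on the old semantics would compute. */
unsigned int sketch_delivered(const struct sketch_tpacket_stats *st)
{
	assert(st && st->tp_packets >= st->tp_drops);
	return st->tp_packets - st->tp_drops;
}
```

A tool that subtracts tp_drops from tp_packets would under-report delivered packets if the kernel stopped adding the drops in, which is the semantic change being undone here.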
     
  • Included change:
    - Check if the skb has been correctly prepared before going on

    David S. Miller
     
  • When the repair mode is turned off, the write queue seqs are
    updated so that the whole queue is considered to be 'already sent'.

    The "when" field must be set for such skb. It's used in tcp_rearm_rto
    for example. If the "when" field isn't set, the retransmit timeout can
    be calculated incorrectly and a tcp connection can stop for two minutes
    (TCP_RTO_MAX).

    Acked-by: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

20 Aug, 2013

3 commits

  • It is not allowed for an ipv6 packet to contain multiple fragmentation
    headers. So discard packets which were already reassembled by
    fragmentation logic and send back a parameter problem icmp.

    The updates for RFC 6980 will come in later, I have to do a bit more
    research here.

    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • Because of the max_addresses check, attackers were able to disable privacy
    extensions on an interface by creating enough autoconfigured addresses.

    But the check is not actually needed: max_addresses protects the
    kernel from installing too many ipv6 addresses on an interface and keeps
    addrconf_prefix_rcv from installing further addresses as soon as this limit
    is reached. We only generate temporary addresses in direct response to
    a new address showing up. As soon as we have filled up the maximum number of
    addresses of an interface, we stop installing more addresses and thus
    also stop generating more temp addresses.

    Even if the attacker tries to generate a lot of temporary addresses
    by announcing a prefix and removing it again (lifetime == 0) we won't
    install more temp addresses, because the temporary addresses do count
    to the maximum number of addresses, thus we would stop installing new
    autoconfigured addresses when the limit is reached.

    This patch fixes CVE-2013-0343 (but other layer-2 attacks are still
    possible).

    Thanks to Ding Tianhong for bringing this topic up again.

    Cc: Ding Tianhong
    Cc: George Kargiotakis
    Cc: P J P
    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Acked-by: Ding Tianhong
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • …wireless into for-davem

    John W. Linville
     

19 Aug, 2013

3 commits

  • We need to choose the protocol family by skb->protocol. Otherwise we
    call the wrong xfrm{4,6}_local_error handler in case an ipv6 socket is
    used in ipv4 mode, in which case we should call down to xfrm4_local_error
    (ip6 sockets are a superset of ip4 ones).

    We are called before ip_output functions, so skb->protocol is
    not reset.

    Cc: Steffen Klassert
    Acked-by: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Steffen Klassert

    Hannes Frederic Sowa
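
A minimal sketch of choosing the handler family from the packet rather than from the socket. The enum and function names are hypothetical; only the two EtherType values are standard, and the protocol is taken in host byte order here for simplicity.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_ETH_P_IP		0x0800	/* IPv4 EtherType */
#define SKETCH_ETH_P_IPV6	0x86DD	/* IPv6 EtherType */

enum sketch_family {
	SKETCH_AF_NONE,
	SKETCH_AF_INET,
	SKETCH_AF_INET6,
};

/* Pick the local-error handler family from the packet's protocol.
 * The socket family is deliberately not consulted: an ipv6 socket
 * may be sending ipv4 packets. */
enum sketch_family sketch_family_from_protocol(uint16_t protocol)
{
	switch (protocol) {
	case SKETCH_ETH_P_IP:
		return SKETCH_AF_INET;
	case SKETCH_ETH_P_IPV6:
		return SKETCH_AF_INET6;
	default:
		return SKETCH_AF_NONE;
	}
}
```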
     
  • In xfrm6_local_error use inner_header if the packet was encapsulated.

    Cc: Steffen Klassert
    Acked-by: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Steffen Klassert

    Hannes Frederic Sowa
     
  • When pushing a new header before the current one, call skb_reset_inner_headers
    to record the position of the inner headers in the various ipv6 tunnel
    protocols.

    We later need this to correctly identify the addresses needed to send
    back an error in the xfrm layer.

    This change is safe, because skb->protocol is always checked before
    dereferencing data from the inner protocol.

    Cc: Steffen Klassert
    Cc: YOSHIFUJI Hideaki
    Cc: Nicolas Dichtel
    Acked-by: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Steffen Klassert

    Hannes Frederic Sowa
     

18 Aug, 2013

1 commit