09 Oct, 2015

1 commit


03 Oct, 2015

30 commits

  • Greg Kroah-Hartman
     
  • commit 8a1513b49321e503fd6c8b6793e3b1f9a8a3285b upstream.

    Do not write initialize magic on systems that do not have
    feature query 0xb. Fixes Bug #82451.

    Redefine FEATURE_QUERY to align with 0xb and FEATURE2 with 0xd
    for code clarity.

    Add a new test function, hp_wmi_bios_2008_later() & simplify
    hp_wmi_bios_2009_later(), which fixes a bug in cases where
    an improper value is returned. Probably also fixes Bug #69131.

    Add missing __init tag.

    Signed-off-by: Kyle Evans
    Signed-off-by: Darren Hart
    Signed-off-by: Greg Kroah-Hartman

    Kyle Evans
     
  • commit 3aaf14da807a4e9931a37f21e4251abb8a67021b upstream.

    zcomp_create() verifies the success of zcomp_strm_{multi,single}_create()
    through comp->stream, which can potentially be pointing to memory that
    was freed if these functions returned an error.

    While at it, replace a 'ERR_PTR(-ENOMEM)' by a more generic
    'ERR_PTR(error)' as in the future zcomp_strm_{multi,single}_create()
    could return other error codes. Function documentation updated
    accordingly.

    Fixes: beca3ec71fe5 ("zram: add multi stream functionality")
    Signed-off-by: Luis Henriques
    Acked-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Luis Henriques
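A minimal userspace sketch of the zcomp_create() point above: decide success from the constructor's return code rather than from a pointer that may already have been freed. All `demo_*` names are hypothetical stand-ins, not the zram API, and the error is returned as a plain negative errno instead of through ERR_PTR().

```c
#include <errno.h>
#include <stdlib.h>

struct demo_comp {
    void *stream;
};

/* Hypothetical stream constructor. On failure it frees its allocation and
 * returns a negative errno, leaving comp->stream dangling. */
static int demo_strm_create(struct demo_comp *comp, int fail)
{
    void *s = malloc(16);

    if (!s)
        return -ENOMEM;
    comp->stream = s;
    if (fail) {
        free(s);            /* comp->stream now points at freed memory */
        return -ENOMEM;
    }
    return 0;
}

/* Check the return code, never comp->stream, and propagate whatever
 * error the constructor reported. */
static int demo_comp_create(struct demo_comp *comp, int fail)
{
    int error = demo_strm_create(comp, fail);

    if (error) {
        comp->stream = NULL;
        return error;
    }
    return 0;
}
```

Testing `comp->stream` for NULL after a failed constructor would inspect a stale pointer; the return code is the only reliable signal in this model.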
     
  • [ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ]

    On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
    >
    > store_release and load_acquire are different from the usual memory
    > barriers and can't be paired this way. You have to pair store_release
    > and load_acquire. Besides, it isn't a particularly good idea to

    OK I've decided to drop the acquire/release helpers as they don't
    help us at all and simply pessimise the code by using full memory
    barriers (on some architectures) where only a write or read barrier
    is needed.

    > depend on memory barriers embedded in other data structures like the
    > above. Here, especially, rhashtable_insert() would have write barrier
    > *before* the entry is hashed not necessarily *after*, which means that
    > in the above case, a socket which appears to have set bound to a
    > reader might not be visible when the reader tries to look up the socket
    > on the hashtable.

    But you are right we do need an explicit write barrier here to
    ensure that the hashing is visible.

    > There's no reason to be overly smart here. This isn't a crazy hot
    > path, write barriers tend to be very cheap, store_release more so.
    > Please just do smp_store_release() and note what it's paired with.

    It's not about being overly smart. It's about actually understanding
    what's going on with the code. I've seen too many instances of
    people simply sprinkling synchronisation primitives around without
    any knowledge of what is happening underneath, which is just a recipe
    for creating hard-to-debug races.

    > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
    > > }
    > > }
    > >
    > > - if (!nlk->portid) {
    > > + if (!nlk->bound) {
    >
    > I don't think you can skip load_acquire here just because this is the
    > second deref of the variable. That doesn't change anything. Race
    > condition could still happen between the first and second tests and
    > skipping the second would lead to the same kind of bug.

    The reason this one is OK is because we do not use nlk->portid or
    try to get nlk from the hash table before we return to user-space.

    However, there is a real bug here that none of these acquire/release
    helpers discovered. The two bound tests here used to be a single
    one. Now that they are separate it is entirely possible for another
    thread to come in the middle and bind the socket. So we need to
    repeat the portid check in order to maintain consistency.

    > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
    > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
    > > return -EPERM;
    > >
    > > - if (!nlk->portid)
    > > + if (!nlk->bound)
    >
    > Don't we need load_acquire here too? Is this path holding a lock
    > which makes that unnecessary?

    Ditto.

    ---8<---
    [...] bound once in netlink_bind fixes
    a race where two threads that bind the socket at the same time
    with different port IDs may both succeed.

    Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Nacked-by: Tejun Heo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]

    The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
    Reset portid after netlink_insert failure") introduced a race
    condition where if two threads try to autobind the same socket
    one of them may end up with a zero port ID. This led to kernel
    deadlocks that were observed by multiple people.

    This patch reverts that commit and instead fixes it by introducing
    a separate rhash_portid variable so that the real portid is only set
    after the socket has been successfully hashed.

    Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit f8af8e6eb95093d5ce5ebcc52bd1929b0433e172 in net-next tree,
    will be pushed to Linus very soon. ]

    The commit 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state
    signaling") implemented the link parameters auto-negotiation unconditionally.
    Unfortunately it appears that some HW that implements SGMII protocol,
    doesn't generate the inband status, so it is not possible to auto-negotiate
    anything with such HW.

    This patch enables the auto-negotiation only if explicitly requested with
    the 'managed' DT property.

    This patch fixes the following regression:
    https://lkml.org/lkml/2015/7/8/865

    Signed-off-by: Stas Sergeev

    CC: Thomas Petazzoni
    CC: netdev@vger.kernel.org
    CC: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stas Sergeev
     
  • [ Upstream commit 4cba5c2103657d43d0886e4cff8004d95a3d0def in net-next tree,
    will be pushed to Linus very soon. ]

    Currently the PHY management type is selected by the MAC driver arbitrarily.
    The decision is based on the presence of the "fixed-link" node and on the
    will of the driver's authors.
    This caused a regression recently, when mvneta driver suddenly started
    to use the in-band status for auto-negotiation on fixed links.
    It appears the auto-negotiation may not work when expected by the MAC driver.
    Sebastien Rannou explains:
    << Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
    a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
    inband status) is connected to the switch through QSGMII, and in this context
    we are on the media side of the PHY. >>
    https://lkml.org/lkml/2015/7/10/206

    This patch introduces the new string property 'managed' that allows
    the user to set the management type explicitly.
    The supported values are:
    "auto" - default. Uses either MDIO or nothing, depending on the presence
    of the fixed-link node
    "in-band-status" - use in-band status

    Signed-off-by: Stas Sergeev

    CC: Rob Herring
    CC: Pawel Moll
    CC: Mark Rutland
    CC: Ian Campbell
    CC: Kumar Gala
    CC: Florian Fainelli
    CC: Grant Likely
    CC: devicetree@vger.kernel.org
    CC: linux-kernel@vger.kernel.org
    CC: netdev@vger.kernel.org
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stas Sergeev
     
  • [ Upstream 868a4215be9a6d80548ccb74763b883dc99d32a2 in net-next tree,
    will be pushed to Linus very soon. ]

    fixed_phy_register() currently hardcodes the fixed PHY link to 1, and
    expects to find a "speed" parameter to provide correct information
    towards the fixed PHY consumer.

    In a subsequent change, where we allow "managed" (e.g: (RS)GMII in-band
    status auto-negotiation) fixed PHYs, none of these parameters can be
    provided since they will be auto-negotiated, hence, we just provide a
    zero-initialized fixed_phy_status to fixed_phy_register() which makes it
    fail when we call fixed_phy_update_regs() since status.speed = 0 which
    makes us hit the "default" label and error out.

    Without this change, we would also see potentially inconsistent
    speed/duplex parameters for fixed PHYs when the link is DOWN.

    CC: netdev@vger.kernel.org
    CC: linux-kernel@vger.kernel.org
    Signed-off-by: Stas Sergeev
    [florian: add more background to why this is correct and desirable]
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stas Sergeev
     
  • [ Upstream d2eac98f7d1b950b762a7eca05a9ce0ea1d878d2 in net-next tree,
    will be pushed to Linus very soon. ]

    The SF2 driver currently overrides speed settings for its port
    configured using a fixed PHY, this is both unnecessary and incorrect,
    because we keep feedback to the hardware parameters that we read from
    the PHY device, which in the case of a fixed PHY cannot possibly change
    speed.

    This is a required change to allow the fixed PHY code to allow
    registering a PHY with a link configured as DOWN by default and avoid
    some sort of circular dependency where we require the link_update
    callback to run to program the hardware, and we then utilize the fixed
    PHY parameters to program the hardware with the same settings.

    Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
     
  • [ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ]

    dump_rules returns skb length and not error.
    But when family == AF_UNSPEC, the caller of dump_rules
    assumes that it returns an error. Hence, when family == AF_UNSPEC,
    we continue trying to dump on -EMSGSIZE errors resulting in
    incorrect dump idx carried between skbs belonging to the same dump.
    This results in fib rule dump always only dumping rules that fit
    into the first skb.

    This patch fixes dump_rules to return error so that we exit correctly
    and idx is correctly maintained between skbs that are part of the
    same dump.

    Signed-off-by: Wilson Kok
    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Wilson Kok
     
  • [ Upstream commit d8aecb10115497f6cdf841df8c88ebb3ba25fa28 ]

    fw filter uses tp->root==NULL to check if it is the old method,
    so it doesn't need allocation at all in this case. This patch
    reverts the offending commit and adds some comments for old
    method to make it obvious.

    Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
    Reported-by: Akshat Kakkar
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    WANG Cong
     
  • [ Upstream commit 675ee231d960af2af3606b4480324e26797eb010 ]

    RST packets sent on behalf of TCP connections with TS option (RFC 7323
    TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

    A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
    ecr 0], length 0
    B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
    1460,nop,nop,TS val 7264344 ecr 100], length 0
    A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
    7264344], length 0

    B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
    ecr 110], length 0

    We need to call skb_mstamp_get() to get proper TS val,
    derived from skb->skb_mstamp

    Note that RFC 1323 was advocating to not send TS option in RST segment,
    but RFC 7323 recommends the opposite :

    Once TSopt has been successfully negotiated, that is both <SYN> and
    <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
    segment for the duration of the connection, and SHOULD be sent in an
    <RST> segment (see Section 5.2 for details)

    Note this RFC recommends to send TS val = 0, but we believe it is
    premature : We do not know if all TCP stacks are properly
    handling the receive side :

    When an <RST> segment is
    received, it MUST NOT be subjected to the PAWS check by verifying an
    acceptable value in SEG.TSval, and information from the Timestamps
    option MUST NOT be used to update connection state information.
    SEG.TSecr MAY be used to provide stricter acceptance checks.

    In 5 years, if/when all TCP stacks are RFC 7323 ready, we might consider
    sending TS val = 0, if it buys something.

    Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
    Signed-off-by: Eric Dumazet
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43 ]

    When support for megaflows was introduced, OVS needed to start
    installing flows with a mask applied to them. Since masking is an
    expensive operation, OVS also had an optimization that would only
    take the parts of the flow keys that were covered by a non-zero
    mask. The values stored in the remaining pieces should not matter
    because they are masked out.

    While this works fine for the purposes of matching (which must always
    look at the mask), serialization to netlink can be problematic. Since
    the flow and the mask are serialized separately, the uninitialized
    portions of the flow can be encoded with whatever values happen to be
    present.

    In terms of functionality, this has little effect since these fields
    will be masked out by definition. However, it leaks kernel memory to
    userspace, which is a potential security vulnerability. It is also
    possible that other code paths could look at the masked key and get
    uninitialized data, although this does not currently appear to be an
    issue in practice.

    This removes the mask optimization for flows that are being installed.
    This was always intended to be the case as the mask optimizations were
    really targeting per-packet flow operations.

    Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
    Signed-off-by: Jesse Gross
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jesse Gross
     
  • [ Upstream commit 3ea79249e81e5ed051f2e6480cbde896d99046e8 ]

    Upon TUNSETSNDBUF, macvtap reads the requested sndbuf size into
    a local variable u.
    commit 39ec7de7092b ("macvtap: fix uninitialized access on
    TUNSETIFF") changed its type to u16 (which is the right thing to
    do for all other macvtap ioctls), breaking all values > 64k.

    The value of TUNSETSNDBUF is actually a signed 32 bit integer, so
    the right thing to do is to read it into an int.

    Cc: David S. Miller
    Fixes: 39ec7de7092b ("macvtap: fix uninitialized access on TUNSETIFF")
    Reported-by: Mark A. Peloquin
    Bisected-by: Matthew Rosato
    Reported-by: Christian Borntraeger
    Signed-off-by: Michael S. Tsirkin
    Tested-by: Matthew Rosato
    Acked-by: Christian Borntraeger
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
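A minimal userspace sketch of the TUNSETSNDBUF truncation described in the macvtap commit above, modelled as a narrowing conversion. The helper names are hypothetical and this is not the macvtap ioctl code.

```c
#include <stdint.h>

/* Hypothetical helper: models holding the requested sndbuf size in a
 * 16-bit variable, as the regressed code did. Values above 65535 wrap
 * modulo 65536. */
static int sndbuf_via_u16(int requested)
{
    uint16_t u = (uint16_t)requested;

    return (int)u;
}

/* Hypothetical helper: models reading the value into an int, which
 * preserves the full signed 32-bit range. */
static int sndbuf_via_int(int requested)
{
    int s = requested;

    return s;
}
```

With a request of 70000 bytes, the 16-bit variant yields 4464 (70000 - 65536), while the int variant keeps 70000.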
     
  • [ Upstream commit 4671fc6d47e0a0108fe24a4d830347d6a6ef4aa7 ]

    When changing rss key, we do not want to overwrite user provided key
    by the one provided by netdev_rss_key_fill(), which is the host random
    key generated at boot time.

    Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
    Signed-off-by: Eric Dumazet
    Cc: Eyal Perry
    CC: Amir Vadai
    Acked-by: Or Gerlitz
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit c2d4fbd2163e607915cc05798ce7fb7f31117cc1 ]

    With the newly introduced helper functions the skb pulling is hidden in
    the checksumming function - and undone before returning to the caller.

    The IGMPv3 and MLDv2 report parsing functions in the bridge still
    assumed that the skb is pointing to the beginning of the IGMP/MLD
    message while it is now kept at the beginning of the IPv4/6 header,
    breaking the message parsing and creating packet loss.

    Fixing this by taking the offset between IP and IGMP/MLD header into
    account, too.

    Fixes: 9afd85c9e455 ("net: Export IGMP/MLD message validation code")
    Reported-by: Tobias Powalowski
    Tested-by: Tobias Powalowski
    Signed-off-by: Linus Lüssing
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Linus Lüssing
     
  • [ Upstream commit 8e2d61e0aed2b7c4ecb35844fe07e0b2b762dee4 ]

    Consider the case where the sctp module is unloaded and is being
    requested because a user is creating a sctp socket.

    During initialization, sctp will add the new protocol type and then
    initialize pernet subsys:

    status = sctp_v4_protosw_init();
    if (status)
    goto err_protosw_init;

    status = sctp_v6_protosw_init();
    if (status)
    goto err_v6_protosw_init;

    status = register_pernet_subsys(&sctp_net_ops);

    The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
    is possible for userspace to create SCTP sockets as if the module were
    already fully loaded. If that happens, one of the possible effects is
    that we will have readers for net->sctp.local_addr_list list earlier
    than expected and sctp_net_init() does not take precautions while
    dealing with that list, leading to a potential panic but not limited to
    that, as sctp_sock_init() will copy a bunch of blank/partially
    initialized values from net->sctp.

    The race happens like this:

    CPU 0 | CPU 1
    socket() |
    __sock_create | socket()
    inet_create | __sock_create
    list_for_each_entry_rcu( |
    answer, &inetsw[sock->type], |
    list) { | inet_create
    /* no hits */ |
    if (unlikely(err)) { |
    ... |
    request_module() |
    /* socket creation is blocked |
    * the module is fully loaded |
    */ |
    sctp_init |
    sctp_v4_protosw_init |
    inet_register_protosw |
    list_add_rcu(&p->list, |
    last_perm); |
    | list_for_each_entry_rcu(
    | answer, &inetsw[sock->type],
    sctp_v6_protosw_init | list) {
    | /* hit, so assumes protocol
    | * is already loaded
    | */
    | /* socket creation continues
    | * before netns is initialized
    | */
    register_pernet_subsys |

    Simply inverting the initialization order between
    register_pernet_subsys() and sctp_v4_protosw_init() is not possible
    because register_pernet_subsys() will create a control sctp socket, so
    the protocol must be already visible by then. Deferring the socket
    creation to a work-queue is not good, especially because we lose the
    ability to handle its errors.

    So, as suggested by Vlad, the fix is to split netns initialization in
    two moments: defaults and control socket, so that the defaults are
    already loaded by when we register the protocol, while control socket
    initialization is kept at the same moment it is today.

    Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
    Signed-off-by: Vlad Yasevich
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Marcelo Ricardo Leitner
     
  • [ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

    Ken-ichirou reported that running netlink in mmap mode for receive in
    combination with nlmon will throw a NULL pointer dereference in
    __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
    to handle kernel paging request". The problem is the skb_clone() in
    __netlink_deliver_tap_skb() for skbs that are mmaped.

    I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
    skb has it pointed to netlink_skb_destructor(), set in the handler
    netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
    that in such cases, __kfree_skb() doesn't perform a skb_release_data()
    via skb_release_all(), where skb->head is possibly being freed through
    kfree(head) into slab allocator, although netlink mmap skb->head points
    to the mmap buffer. Similarly, the same has to be done also for large
    netlink skbs where the data area is vmalloced. Therefore, as discussed,
    make a copy for these rather rare cases for now. This fixes the issue
    on my and Ken-ichirou's test-cases.

    Reference: http://thread.gmane.org/gmane.linux.network/371129
    Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
    Reported-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Daniel Borkmann
    Tested-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 03679a14739a0d4c14b52ba65a69ff553bfba73b ]

    The macro to write 64-bits quantities to the 32-bits register swapped
    the value and offsets arguments, we want to preserve the ordering of the
    arguments with respect to how writel() is implemented for instance:
    value first, offset/base second.

    Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver")
    Signed-off-by: Florian Fainelli
    Reviewed-by: Vivien Didelot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
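A minimal sketch of the "value first, offset second" argument order that the bcm_sf2 commit above aligns with. The register array, the `demo_*` names, and the choice of which half lands at which index are assumptions for illustration, not the driver's actual macro or register layout.

```c
#include <stdint.h>

/* Hypothetical 32-bit register file standing in for device registers. */
static uint32_t demo_regs[4];

/* Mirrors the writel() convention: value first, offset second. */
static void demo_writel(uint32_t val, unsigned int idx)
{
    demo_regs[idx] = val;
}

/* 64-bit write built from two 32-bit writes, keeping the same
 * (value, offset) ordering so call sites read consistently. */
static void demo_writeq(uint64_t val, unsigned int idx)
{
    demo_writel((uint32_t)(val >> 32), idx + 1);  /* upper half (assumed) */
    demo_writel((uint32_t)val, idx);              /* lower half (assumed) */
}
```

Keeping both helpers in the same order means a caller cannot silently swap the value and the offset when moving between the 32-bit and 64-bit forms.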
     
  • [ Upstream commit 6b9ea5a64ed5eeb3f68f2e6fcce0ed1179801d1e ]

    Problem:
    The ECMP route replace support for IPv6 in the kernel deletes the
    existing ECMP route too early, i.e. when it installs the first nexthop.
    If there is an error in installing the subsequent nexthops, it's too late
    to recover the already deleted existing route leaving the fib
    in an inconsistent state.

    This patch reduces the possibility of this by doing the following:
    a) Changes the existing multipath route add code to a two stage process:
    build rt6_infos + insert them
    ip6_route_add rt6_info creation code is moved into
    ip6_route_info_create.
    b) This ensures that most errors are caught during building rt6_infos
    and we fail early
    c) Separates multipath add and del code. Because add needs the special
    two stage mode in a) and delete essentially does not care.
    d) In any event if the code fails during inserting a route again, a
    warning is printed (This should be unlikely)

    Before the patch:
    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    /* Try replacing the route with a duplicate nexthop */
    $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
    fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
    swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
    RTNETLINK answers: File exists

    $ip -6 route show
    /* previously added ecmp route 3000:1000:1000:1000::2 disappears from
    * kernel */

    After the patch:
    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    /* Try replacing the route with a duplicate nexthop */
    $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
    fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
    swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
    RTNETLINK answers: File exists

    $ip -6 route show
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
    3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

    Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
    Signed-off-by: Roopa Prabhu
    Reviewed-by: Nikolay Aleksandrov
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Roopa Prabhu
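The two-stage approach in the ECMP commit above (build everything first, insert only afterwards) can be sketched as a toy model. Nexthops are plain integers here and every `demo_*` name is hypothetical; the kernel works on rt6_info objects, not arrays.

```c
#include <errno.h>
#include <stddef.h>

/* Stage 1 (build): validate the complete nexthop list before touching
 * the installed route. A duplicate nexthop is reported as -EEXIST. */
static int demo_build_nexthops(const int *nh, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (nh[i] == nh[j])
                return -EEXIST;
    return 0;
}

/* Stage 2 (insert): runs only if stage 1 succeeded, so a rejected
 * request leaves the existing route untouched. */
static int demo_replace_route(int *route, size_t *route_len, size_t cap,
                              const int *nh, size_t n)
{
    int err;

    if (n > cap)
        return -ENOSPC;
    err = demo_build_nexthops(nh, n);
    if (err)
        return err;
    for (size_t i = 0; i < n; i++)
        route[i] = nh[i];
    *route_len = n;
    return 0;
}
```

In this model a replace request with a duplicate nexthop fails with -EEXIST and the previously installed entries remain, which mirrors the "After the patch" output above.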
     
  • [ Upstream commit 39797a279d62972cd914ef580fdfacb13e508bf8 ]

    The comparison check between cur_hw_state and hw_state is currently
    invalid because cur_hw_state is right shifted by G_MISTP_SHIFT, while
    hw_state is not, so we end-up comparing bits 2:0 with bits 7:5, which is
    going to cause an additional aging to occur. Fix this by not shifting
    cur_hw_state while reading it, but instead, mask the value with the
    appropriately shifted bitmask.

    The other problem with the fast-ageing process is that we did not set
    the EN_AGE_DYNAMIC bit to request the ageing to occur for dynamically
    learned MAC addresses. Finally, write back 0 to the FAST_AGE_CTRL
    register to avoid leaving spurious bits sets from one operation to the
    other.

    Fixes: 12f460f23423 ("net: dsa: bcm_sf2: add HW bridging support")
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
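A minimal sketch of the comparison bug described in the fast-ageing commit above: one operand shifted down to bits 2:0, the other left in bits 7:5. The macro names and the 3-bit field width are assumptions chosen to match the bit ranges quoted in the message, not the driver's definitions.

```c
#include <stdint.h>

/* Assumed layout for illustration: a 3-bit state field in bits 7:5. */
#define DEMO_STATE_SHIFT 5u
#define DEMO_STATE_MASK  (0x7u << DEMO_STATE_SHIFT)

/* Buggy form: the current state is shifted down to bits 2:0 while
 * hw_state still sits in bits 7:5, so equal states compare unequal. */
static int state_unchanged_buggy(uint32_t reg, uint32_t hw_state)
{
    uint32_t cur = (reg & DEMO_STATE_MASK) >> DEMO_STATE_SHIFT;

    return cur == hw_state;
}

/* Fixed form: mask in place, so both operands are in bits 7:5. */
static int state_unchanged_fixed(uint32_t reg, uint32_t hw_state)
{
    uint32_t cur = reg & DEMO_STATE_MASK;

    return cur == hw_state;
}
```

With a register value of 0x61 and a requested state of 0x60, the buggy form reports a change (and would trigger an extra ageing pass), while the fixed form reports none.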
     
  • [ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]

    In the IPv6 multicast routing code the mrt_lock was not being released
    correctly in the MFC iterator, as a result adding or deleting a MIF would
    cause a hang because the mrt_lock could not be acquired.

    This fix is a copy of the code for the IPv4 case and ensures that the lock
    is released correctly.

    Signed-off-by: Richard Laing
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Richard Laing
     
  • [ Upstream commit 4548a697e4969d695047cebd6d9af5e2f6cc728e ]

    tse_poll() calls __napi_complete() with irq enabled. This leads to napi
    poll_list corruption and may stop all napi drivers working.
    Use napi_complete() instead of __napi_complete().

    Signed-off-by: Atsushi Nemoto
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Atsushi Nemoto
     
  • [ Upstream commit ed63f1dcd5788d36f942fbcce350742385e3e18c ]

    This patch re-submits commit "db3421c114cfa6326", because commit
    "4d494cdc92b3b9a0" removed the change.

    Clear any pending receive interrupt before we process a pending packet.
    This helps to avoid any spurious interrupts being raised after we have
    fully cleaned the receive ring, while still allowing an interrupt to be
    raised if we receive another packet.

    The position of this is critical: we must do this prior to reading the
    next packet status to avoid potentially dropping an interrupt when a
    packet is still pending.

    Acked-by: Fugang Duan
    Signed-off-by: Russell King
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Russell King
     
  • [ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]

    We previously registered IPPROTO_ROUTING offload under inet6_add_offload(),
    but in error path, we try to unregister it with inet_del_offload(). This
    doesn't seem correct, it should actually be inet6_del_offload(), also
    ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
    also uses rthdr_offload twice), but it got removed entirely later on.

    Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit b382c08656000c12a146723a153b85b13a855b49 ]

    diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
    upon request to user space (ss -0 -b). However, native eBPF programs
    attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:

    Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
    unconditionally tries to access its filter length resp. wants to copy
    the filter insns from there. Internal cBPF to eBPF transformations
    attached to sockets don't have this issue, as orig_prog state is kept.

    It's currently only used by packet sockets. If we would want to add
    native eBPF support in the future, this needs to be done through
    a different attribute than PACKET_DIAG_FILTER to not confuse possible
    user space disassemblers that work on diag data.

    Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
    Signed-off-by: Daniel Borkmann
    Acked-by: Nicolas Dichtel
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
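One plausible shape of the guard implied by the sock_diag commit above: report nothing to dump when there is no original classic BPF program. The structures and names below are hypothetical simplifications, not the kernel's sk_filter or bpf_prog definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the filter state discussed above. */
struct demo_orig_prog {
    unsigned int len;       /* number of classic BPF instructions */
};

struct demo_filter {
    const struct demo_orig_prog *orig_prog;  /* NULL for native eBPF */
};

/* Returns how many classic BPF instructions can be dumped, or 0 when
 * there is no filter or no original program to copy from. */
static unsigned int demo_dumpable_len(const struct demo_filter *f)
{
    if (f == NULL || f->orig_prog == NULL)
        return 0;
    return f->orig_prog->len;
}
```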
     
  • [ Upstream commit f50791ac1aca1ac1b0370d62397b43e9f831421a ]

    usbnet_stop() needs to check the EVENT_NO_RUNTIME_PM bit of dev->flags,
    but the bit must be read before it is cleared when dev->flags is set
    to 0.

    The problem was spotted and the fix was provided by
    Oliver Neukum.

    Signed-off-by: Eugene Shatokhin
    Acked-by: Oliver Neukum
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eugene Shatokhin
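A minimal sketch of the ordering issue in the usbnet_stop() commit above: sample the flag bit before the flags word is cleared. The bit position and the `demo_*` names are hypothetical; the real driver uses its own flag definitions and bit helpers.

```c
#include <stdbool.h>

/* Hypothetical bit position, for illustration only. */
#define DEMO_NO_RUNTIME_PM_BIT 3u

struct demo_dev {
    unsigned long flags;
};

/* Buggy order: flags are cleared first, so the test always reads 0. */
static bool demo_stop_buggy(struct demo_dev *dev)
{
    dev->flags = 0;
    return (dev->flags >> DEMO_NO_RUNTIME_PM_BIT) & 1ul;
}

/* Fixed order: sample the bit, then clear the flags. */
static bool demo_stop_fixed(struct demo_dev *dev)
{
    bool no_pm = (dev->flags >> DEMO_NO_RUNTIME_PM_BIT) & 1ul;

    dev->flags = 0;
    return no_pm;
}
```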
     
  • [ Upstream commit a6c1aea044e490da3e59124ec55991fe316818d5 ]

    In commit 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    I added a check in u32_destroy() to see if all real filters are gone
    for each tp, however, that is only done for root_ht, same is needed
    for others.

    This can be reproduced by the following tc commands:

    tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256
    tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32
    ht 15:2: match ip src 10.0.0.2 flowid 1:10
    tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32
    ht 15:2: match ip src 10.0.0.3 flowid 1:10

    Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    Reported-by: Akshat Kakkar
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    WANG Cong
     
  • [ Upstream commit bef0057b7ba881d5ae67eec876df7a26fe672a59 ]

    Before 56ef9c909b40[1] it used to ignore all errors from igmp_join().
    That commit enhanced that and made it error out whatever error happened
    with igmp_join(), but that's not good because when using multicast
    groups vxlan will try to join it multiple times if the socket is reused
    and then the 2nd and further attempts will fail with EADDRINUSE.

    As we don't track to which groups the socket is already subscribed, it's
    okay to just ignore that error.

    Fixes: 56ef9c909b40 ("vxlan: Move socket initialization to within rtnl scope")
    Reported-by: John Nielsen
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Marcelo Ricardo Leitner
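The error handling described in the vxlan commit above can be sketched as a small filter on the join result: "already joined" counts as success, everything else is passed through. The function name is hypothetical and the join call itself is not modelled.

```c
#include <errno.h>

/* Hypothetical wrapper around the result of a multicast join: a reused
 * socket that is already a member reports -EADDRINUSE, which is treated
 * as success. All other errors are returned unchanged. */
static int demo_join_result(int ret)
{
    if (ret == -EADDRINUSE)
        return 0;
    return ret;
}
```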
     
  • [ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ]

    When a tunnel is deleted, the cached dst entry should be released.

    This problem may prevent the removal of a netns (seen with a x-netns IPv6
    gre tunnel):
    unregister_netdevice: waiting for lo to become free. Usage count = 3

    CC: Dmitry Kozlov
    Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
    Signed-off-by: huaibin Wang
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    huaibin Wang
     

30 Sep, 2015

9 commits

  • Greg Kroah-Hartman
     
  • commit 4e1efb403c1c016ae831bd9988a7d2e5e0af41a0 upstream.

    If the driver doesn't participate in EEH, the AFUs will be removed
    by cxl_remove, which will be invoked by EEH.

    If the driver does participate in EEH, the vPHB needs to stick around
    so that it can participate.

    In both cases, we shouldn't remove the AFU/vPHB.

    Reviewed-by: Cyril Bur
    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman
    Reported-by: Guenter Roeck
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Daniel Axtens
     
  • [ Upstream commit 25b97c016b26039982daaa2c11d83979f93b71ab ]

    When generating /proc/net/route we emit a header followed by a line for
    each route. When a short read is performed we will restart this process
    based on the open file descriptor. When calculating the start point we
    fail to take into account that the 0th entry is the header. This leads
    us to skip the first entry when doing a continuation read.

    This can be easily seen with the comparison below:

    while read l; do echo "$l"; done </proc/net/route >A
    cat /proc/net/route >B
    diff -bu A B | grep '^[+-]'

    On my example machine I have approximately 10KB of route output. There we
    see the very first non-title element is lost in the while read case,
    and an entry around the 8K mark in the cat case:

    +wlan0 00000000 02021EAC 0003 0 0 400 00000000 0 0 0
    -tun1 00C0AC0A 00000000 0001 0 0 950 00C0FFFF 0 0 0

    Fix up the off-by-one when reacquiring position on continuation.
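
    The off-by-one can be illustrated with a small userspace model. This is
    a sketch under assumptions, not the fib_trie code: read_from() and
    short_reads() are hypothetical helpers, and the header text is a
    placeholder. Sequence position 0 is the header, so position n (n >= 1)
    corresponds to route n - 1.

```python
HEADER = "Iface Destination Gateway"  # placeholder header line


def read_from(routes, pos, header_aware):
    """Return the lines visible when a read resumes at sequence position pos."""
    if pos == 0:
        return [HEADER] + list(routes)
    # A resume that forgets the header treats pos as a route index,
    # which skips exactly one route.
    start = pos - 1 if header_aware else pos
    return list(routes[start:])


def short_reads(routes, chunk, header_aware):
    """Read the table in chunks of `chunk` lines, resuming after each chunk."""
    out, pos = [], 0
    total = len(routes) + 1  # header plus one line per route
    while pos < total:
        lines = read_from(routes, pos, header_aware)[:chunk]
        if not lines:
            break
        out.extend(lines)
        pos += len(lines)
    return out
```

    With chunk set to 1 (similar to the line-by-line `while read` loop) the
    model without header awareness drops the first route; with larger
    chunks it drops the entry at the first continuation boundary instead.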

    Fixes: 8be33e955cb9 ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf")
    BugLink: http://bugs.launchpad.net/bugs/1483440
    Acked-by: Alexander Duyck
    Signed-off-by: Andy Whitcroft
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andy Whitcroft
     
  • [ Upstream commit 211c504a444710b1d8ce3431ac19f2578602ca27 ]

    In case we need to divert reads/writes using the slave MII bus, we may have
    already fetched a valid PHY interface property from Device Tree, and that
    mode is used by the PHY driver to make configuration decisions.

    If we could not fetch the "phy-mode" property, we will assign p->phy_interface
    to PHY_INTERFACE_MODE_NA, such that we can actually check for that condition as
    to whether or not we should override the interface value.
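
    A minimal sketch of that check, assuming string stand-ins for the
    kernel's interface mode constants; pick_interface() is a hypothetical
    helper and not a function in the DSA code:

```python
PHY_INTERFACE_MODE_NA = "na"        # stand-in: no "phy-mode" property was found
PHY_INTERFACE_MODE_RGMII = "rgmii"  # stand-in for a mode read from Device Tree
PHY_INTERFACE_MODE_GMII = "gmii"    # stand-in for the mode reported by the PHY


def pick_interface(configured, from_phy):
    """Keep a mode already fetched from Device Tree; otherwise use the PHY's."""
    if configured == PHY_INTERFACE_MODE_NA:
        return from_phy
    return configured
```

    Only the "not available" value is overridden, so a valid Device Tree
    setting survives the switch to the slave MII bus.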

    Fixes: 19334920eaf7 ("net: dsa: Set valid phy interface type")
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Fainelli
     
  • [ Upstream commit 2235f2ac75fd2501c251b0b699a9632e80239a6d ]

    reqsk_queue_destroy() and reqsk_queue_unlink() should use
    del_timer_sync() instead of del_timer() before calling reqsk_put(),
    otherwise we could free a req still used by another cpu.

    But before doing so, reqsk_queue_destroy() must release syn_wait_lock
    spinlock or risk a deadlock, as reqsk_timer_handler() might
    need to take this same spinlock from reqsk_queue_unlink() (called from
    inet_csk_reqsk_queue_drop())

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 3257d8b12f954c462d29de6201664a846328a522 ]

    In commit b357a364c57c9 ("inet: fix possible panic in
    reqsk_queue_unlink()"), I missed the fact that tcp_check_req()
    can return the listener socket in one case, and that we must
    release the request socket refcount or we leak it.

    Tested:

    Following packetdrill test template shows the issue

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +0 < S 0:0(0) win 2920
    +0 > S. 0:0(0) ack 1
    +.002 < . 1:1(0) ack 21 win 2920
    +0 > R 21:21(0)

    Fixes: b357a364c57c9 ("inet: fix possible panic in reqsk_queue_unlink()")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 4e7c1330689e27556de407d3fdadc65ffff5eb12 ]

    Linus reports the following deadlock on rtnl_mutex; triggered only
    once so far (extract):

    [12236.694209] NetworkManager D 0000000000013b80 0 1047 1 0x00000000
    [12236.694218] ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
    [12236.694224] ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
    [12236.694230] ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
    [12236.694235] Call Trace:
    [12236.694250] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694257] [] ? schedule+0x2a/0x70
    [12236.694263] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694271] [] ? __mutex_lock_slowpath+0x7f/0xf0
    [12236.694280] [] ? mutex_lock+0x16/0x30
    [12236.694291] [] ? rtnetlink_rcv+0x10/0x30
    [12236.694299] [] ? netlink_unicast+0xfb/0x180
    [12236.694309] [] ? rtnl_getlink+0x113/0x190
    [12236.694319] [] ? rtnetlink_rcv_msg+0x7a/0x210
    [12236.694331] [] ? sock_has_perm+0x5c/0x70
    [12236.694339] [] ? rtnetlink_rcv+0x30/0x30
    [12236.694346] [] ? netlink_rcv_skb+0x9c/0xc0
    [12236.694354] [] ? rtnetlink_rcv+0x1f/0x30
    [12236.694360] [] ? netlink_unicast+0xfb/0x180
    [12236.694367] [] ? netlink_sendmsg+0x484/0x5d0
    [12236.694376] [] ? __wake_up+0x2f/0x50
    [12236.694387] [] ? sock_sendmsg+0x33/0x40
    [12236.694396] [] ? ___sys_sendmsg+0x22e/0x240
    [12236.694405] [] ? ___sys_recvmsg+0x135/0x1a0
    [12236.694415] [] ? eventfd_write+0x82/0x210
    [12236.694423] [] ? fsnotify+0x32e/0x4c0
    [12236.694429] [] ? wake_up_q+0x60/0x60
    [12236.694434] [] ? __sys_sendmsg+0x39/0x70
    [12236.694440] [] ? entry_SYSCALL_64_fastpath+0x12/0x6a

    So far, the recursive call into rtnetlink_rcv() looks suspicious.
    One way this could trigger is that the sender's
    NETLINK_CB(skb).portid was wrongly 0 (which is the rtnetlink socket), so
    the rtnl_getlink() request's answer would be sent to the kernel instead
    of to the actual user process, thus grabbing rtnl_mutex twice.

    One theory is that netlink_autobind(), triggered via netlink_sendmsg(),
    internally overwrites an -EBUSY error with 0 when that error wrongly
    originates from __netlink_insert(). That would reset the
    socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
    later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.")
    also puts it, -EBUSY should not be propagated from netlink_insert().

    It looks like it's very unlikely to reproduce. We need to trigger the
    rhashtable_insert_rehash() handler under a situation where rehashing
    currently occurs (one /rare/ way would be to hit ht->elasticity limits
    while not filled enough to expand the hashtable, but that would rather
    require a specifically crafted bind() sequence with knowledge about
    destination slots, seems unlikely). It probably makes sense to guard
    __netlink_insert() in any case and remap that error. It was suggested
    that EOVERFLOW might be better than an already overloaded ENOMEM.
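
    The remapping idea can be modelled in userspace as below. This is a
    sketch, not the netlink code: both function names are hypothetical, and
    autobind_model() only mirrors the behaviour described above, in which
    -EBUSY is overwritten with 0.

```python
import errno


def netlink_insert_model(raw_result):
    """Remap -EBUSY from the hash-table insert so that callers never see it."""
    if raw_result == -errno.EBUSY:
        return -errno.EOVERFLOW
    return raw_result


def autobind_model(insert_result):
    """Model of an autobind step that overwrites -EBUSY with 0."""
    err = insert_result
    if err == -errno.EBUSY:
        err = 0
    return err
```

    Without the remap, a rehash-induced -EBUSY is silently turned into
    success; with it, the caller sees -EOVERFLOW and the failure is not
    mistaken for a completed bind.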

    Reference: http://thread.gmane.org/gmane.linux.network/372676
    Reported-by: Linus Torvalds
    Signed-off-by: Daniel Borkmann
    Acked-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit ade4dc3e616e33c80d7e62855fe1b6f9895bc7c3 ]

    The commit "e29aa33 bna: Enable Multi Buffer RX" moved the packet counter
    increment from the beginning of the NAPI processing loop to after the
    check for erroneous packets, so they are never accounted. This counter is used
    to inform firmware about number of processed completions (packets).
    As these packets are never acked the firmware fires IRQs for them again
    and again.
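
    The effect of where the counter is incremented can be seen in a small
    model. This is an assumption-laden sketch rather than the bna driver
    code: poll() is a hypothetical function, and each completion is reduced
    to a boolean (True for a good packet, False for an erroneous one).

```python
def poll(completions, count_before_check):
    """Return (delivered, acked) for one pass over the completion queue."""
    delivered = 0
    acked = 0
    for ok in completions:
        if count_before_check:
            acked += 1  # every completion is reported back to the firmware
        if not ok:
            continue    # erroneous packet: dropped, never delivered
        if not count_before_check:
            acked += 1  # only good packets are counted; bad ones stay unacked
        delivered += 1
    return delivered, acked
```

    When the increment sits after the error check, the acked count falls
    short of the number of completions, which matches the repeated IRQs
    described above.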

    Fixes: e29aa33 ("bna: Enable Multi Buffer RX")
    Signed-off-by: Ivan Vecera
    Acked-by: Rasesh Mody
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ivan Vecera
     
  • [ Upstream commit 786c2077ec8e9eab37a88fc14aac4309a8061e18 ]

    The attribute size wasn't accounted for in the get_slave_size() callback
    (br_port_get_slave_size) when it was introduced, so fix it now. Also add
    a policy entry for it in br_port_policy.

    Signed-off-by: Nikolay Aleksandrov
    Fixes: 842a9ae08a25 ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi")
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Aleksandrov