10 Dec, 2013

40 commits

  • The current implementation of IPV6_FLOWINFO only gives a
    result if pktoptions is available (thanks to the
    ip6_datagram_recv_ctl function).
    It gives inconsistent results to user space, sometimes
    there is a result for getsockopt(IPV6_FLOWINFO), sometimes
    not.

    This patch adds rcv_flowinfo to store it, and returns it to
    userspace in the same way as other pkt_options.

    Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
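
For readers unfamiliar with the flowinfo word this commit stores, here is a minimal sketch of how it splits into traffic class and flow label. It assumes the value is the first 32-bit word of the IPv6 header in network byte order; the helper and macro names are hypothetical and are not taken from the patch.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Low 28 bits of the first IPv6 header word: traffic class + flow label. */
#define FLOWINFO_MASK_SKETCH  0x0FFFFFFFu
/* Low 20 bits: the flow label alone. */
#define FLOWLABEL_MASK_SKETCH 0x000FFFFFu

/* Hypothetical helper: extract the 20-bit flow label (host byte order). */
static uint32_t flowinfo_label(uint32_t flowinfo_be)
{
	return ntohl(flowinfo_be) & FLOWLABEL_MASK_SKETCH;
}

/* Hypothetical helper: extract the 8-bit traffic class. */
static uint8_t flowinfo_tclass(uint32_t flowinfo_be)
{
	return (uint8_t)((ntohl(flowinfo_be) & FLOWINFO_MASK_SKETCH) >> 20);
}
```

The version nibble is masked off in both helpers, so a raw header word can be passed in unchanged.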
     
  • We were already registering MDIO bus, but we were not connecting bgmac
    to the PHY. Add proper call and implement adjust link function to switch
    MAC into requested state.
    At the same time it's possible to drop our internal PHY management.
    This is a "standard" PHY, so the "Generic PHY" driver works perfectly
    fine with this. Don't duplicate the code.
    Finally make use of phy_ethtool_[gs]set functions instead of implementing
    them from scratch.

    This change was successfully tested on BCM5357. I was able to
    autonegotiate 1000Mb/s full duplex, as well as force any of the
    10/100/1000 half/full modes.

    Signed-off-by: Rafał Miłecki
    Acked-by: Florian Fainelli
    Acked-by: Hauke Mehrtens
    Signed-off-by: David S. Miller

    Rafał Miłecki
     
  • If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set,
    several is_<foo>_ether_addr functions can be slightly
    improved by using u32 dereferences.

    I believe all current uses of is_zero_ether_addr and
    is_broadcast_ether_addr are u16 aligned, so always use
    u16 references to improve those functions' performance.

    Document the u16 alignment requirements.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
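
A userspace sketch of the u16-based checks described above. It assumes a 6-byte address and uses memcpy for the 16-bit loads so the example stays well-defined in portable C; the kernel helpers dereference u16 pointers directly, which is why they need the documented alignment. Function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Load one 16-bit word; memcpy avoids alignment and aliasing problems. */
static uint16_t load16(const uint8_t *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* All-zero address: OR of the three words is zero. */
static bool zero_ether_addr_sketch(const uint8_t addr[6])
{
	return (load16(addr) | load16(addr + 2) | load16(addr + 4)) == 0;
}

/* Broadcast address: AND of the three words is all ones. */
static bool broadcast_ether_addr_sketch(const uint8_t addr[6])
{
	return (load16(addr) & load16(addr + 2) & load16(addr + 4)) == 0xffff;
}
```

Three 16-bit operations replace six byte comparisons, and byte order does not matter because both tests are symmetric in the bits.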
     
  • Use the newly added generic routine ether_addr_equal_unaligned
    to test if possibly unaligned to u16 Ethernet addresses are equal.

    This slightly improves comparison time for systems with
    CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • Add a generic routine to test if possibly unaligned
    to u16 Ethernet addresses are equal.

    If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set,
    this uses the slightly faster generic routine
    ether_addr_equal, otherwise this uses memcmp.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
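
The difference between the aligned and the possibly-unaligned comparison can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel code: the function names and MAC_LEN are hypothetical, and the unaligned variant shows only the memcmp fallback path mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Aligned variant: XOR three 16-bit words; callers must pass
 * u16-aligned addresses. */
static bool addr_equal_aligned(const uint16_t *a, const uint16_t *b)
{
	return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;
}

/* Unaligned-safe variant: byte-wise compare, valid at any offset. */
static bool addr_equal_unaligned(const uint8_t *a, const uint8_t *b)
{
	return memcmp(a, b, MAC_LEN) == 0;
}
```

A caller that parses addresses out of a packed header at an odd offset would use the second form; a caller holding addresses in u16-aligned storage can use the first.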
     
  • Jiri Pirko says:

    ====================
    neigh: respect default parms values

    This is a long standing regression. But since the patchset is bigger and
    the regression happened in 2007, I'm proposing this to net-next instead.

    Basically the problem is that if user wants to use /etc/sysctl.conf to specify
    default values of neigh related params, he is not able to do that.

    The reason is that the default values are copied to dev instance right after
    netdev is registered. And that is way too early. The original behaviour
    for ipv4 was that this happened after first address was assigned to device.
    For ipv6 this was apparently from the very beginning.

    So this patchset basically reverts the behaviour back to what it was in 2007 for
    ipv4 and changes the behaviour for ipv6 so they are both the same.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
    Make the behaviour similar to ipv4. This will allow the user to set sysctl
    default neigh param values, and these values will be respected even by
    devices registered before (those that do not have an address set yet).

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Previously inet devices were only constructed when addresses are added.
    Therefore the default neigh parms values they get are the ones at the
    time of these operations.

    Now that we're creating inet devices earlier, this changes the behaviour
    of default neigh parms values in an incompatible way (see bug #8519).

    This patch creates a compromise by setting the default values at the
    same point as before but only for those that have not been explicitly
    set by the user since the inet device's creation.

    Introduced by:
    commit 8030f54499925d073a88c09f30d5d844fb1b3190
    Author: Herbert Xu
    Date: Thu Feb 22 01:53:47 2007 +0900

    [IPV4] devinet: Register inetdev earlier.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
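
The compromise described above (apply defaults late, but only to values the user has not touched) can be sketched as an array of parameters plus a per-entry "explicitly set" marker. All names below are hypothetical and the structure is a simplification for illustration, not the layout used by the patch.

```c
#include <stdbool.h>

/* Hypothetical parameter indices. */
enum {
	VAR_RETRANS_TIME,
	VAR_BASE_REACHABLE_TIME,
	VAR_GC_STALETIME,
	VAR_MAX
};

struct parms_sketch {
	int data[VAR_MAX];       /* parameter values, indexed by the enum */
	bool user_set[VAR_MAX];  /* true once the user wrote the value */
};

/* Explicit write by the user: store the value and remember that fact. */
static void parms_set(struct parms_sketch *p, int idx, int val)
{
	p->data[idx] = val;
	p->user_set[idx] = true;
}

/* Late default copy: overwrite only entries the user never set. */
static void parms_copy_defaults(struct parms_sketch *dev,
				const struct parms_sketch *dflt)
{
	for (int i = 0; i < VAR_MAX; i++)
		if (!dev->user_set[i])
			dev->data[i] = dflt->data[i];
}
```

Keeping the values in an array is what makes the loop in parms_copy_defaults() possible; with individually named struct members each parameter would need its own copy statement.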
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • This will be needed later on to provide better management of default values.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • This patch converts the neigh param members to an array. This allows easier
    manipulation which will be needed later on to provide better management of
    default values.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Florian Fainelli says:

    ====================
    net: phy: consolidate PHY reset

    This patchset consolidates the PHY reset through the MII BMCR
    register by using a central place where this is done.

    This patchset resumes the work Kyle Moffett started here:
    https://lkml.org/lkml/2011/10/20/301

    Note that at this point, drivers doing funky things after issuing
    a PHY reset using phy_init_hw() will still suffer from PHY state
    machine problems, this will be taken care of later on.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The sh_eth driver issues an uncontrolled PHY reset through the MII
    register BMCR but fails to wait for the reset to complete, and will also
    implicitly wipe out all possible PHY fixups applied. Use phy_init_hw()
    which remedies both problems.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Instead of open-coding the PHY reset through MII BMCR, use phy_init_hw()
    which does that for us and also makes sure that any PHY specific fixups
    are applied.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Instead of open-coding a PHY reset through the MII BMCR register, use
    phy_init_hw() which does this for us and ensures that PHY device fixups
    are also applied. We also remove a call to ethernet_phy_reset() which is
    now unnecessary since phy_attach() calls phy_attach_direct() which in
    turn calls phy_init_hw().

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Instead of open-coding a PHY reset through the MII BMCR register, use
    phy_init_hw() which does that for us and will also make sure that PHY
    fixups are applied if required. We also remove a call to phy_reset()
    due to the following sequence of calls in the driver:

    phy_scan()
    -> phy_connect()
    -> phy_connect_direct()
    -> phy_attach_direct()
    -> phy_init_hw()

    and we only have a call to phy_init() after phy_scan().

    Signed-off-by: Florian Fainelli
    Tested-by: Sebastian Hesselbarth
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • There are quite a lot of drivers touching a PHY device MII_BMCR
    register to reset the PHY without taking care of:

    1) ensuring that BMCR_RESET is cleared after a given timeout
    2) the PHY state machine resuming to the proper state and re-applying
    potentially changed settings such as auto-negotiation

    Introduce phy_poll_reset() which will take care of polling the MII_BMCR
    for the BMCR_RESET bit to be cleared after a given timeout or return a
    timeout error code.

    In order to make sure the PHY is in a correct state, phy_init_hw() first
    issues a software reset through MII_BMCR and then applies any fixups.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
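
The polling step can be sketched in userspace with a mock register accessor. This is a minimal illustration assuming a caller-supplied read callback and retry count; the function names, the callback type, and the mock PHY are hypothetical, and the sleep between reads that a real driver needs is left out.

```c
#include <errno.h>

#define MII_BMCR   0x00    /* basic mode control register */
#define BMCR_RESET 0x8000  /* self-clearing software reset bit */

/* Hypothetical register accessor: returns the value or a negative errno. */
typedef int (*reg_read_fn)(void *ctx, int reg);

/* Poll BMCR until BMCR_RESET clears, a read fails, or retries run out. */
static int poll_reset_sketch(reg_read_fn read_reg, void *ctx,
			     unsigned int retries)
{
	while (retries--) {
		/* A real driver would sleep between reads here. */
		int val = read_reg(ctx, MII_BMCR);

		if (val < 0)
			return val;   /* bus error */
		if (!(val & BMCR_RESET))
			return 0;     /* reset finished */
	}
	return -ETIMEDOUT;
}

/* Mock PHY: reports BMCR_RESET until the counter in ctx reaches zero. */
static int mock_read(void *ctx, int reg)
{
	int *reads_left = ctx;

	(void)reg;
	if (*reads_left > 0) {
		(*reads_left)--;
		return BMCR_RESET;
	}
	return 0;
}
```

Only after such a poll succeeds is it safe to apply fixups, since register writes issued while the reset is still in progress may be lost.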
     
  • The PHY is already reset during driver probing, and this manual reset
    after calling phy_start() will wipe out board-specific PHY fixups and
    driver specific configuration initialization. Remove that explicit PHY
    reset.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • In case the greth driver is bound to anything but the Generic PHY
    driver or the PHY has a special read_status callback implemented,
    unexpected things will happen. Make sure that we use
    phy_read_status() which does the proper abstraction of calling the
    driver specific read_status() callback for a given PHY.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Use phy_init_hw() instead of open-coding it in phy_mii_ioctl(), this
    improves consistency and makes sure that we will not duplicate the same
    routine somewhere else.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • The PHY library already reads the MII_STAT1000 and MII_LPA registers in
    genphy_read_status(), so extend it to also populate the PHY device link
    partner advertised features such that we can feed this back into ethtool
    when asked for it in phy_ethtool_gset().

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
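
As an illustration of the translation involved, the sketch below maps link partner bits from the MII_LPA and MII_STAT1000 registers onto ethtool-style advertised-mode bits. The LPA_* values follow the standard MII register layout; the ADV_* names and the function name are hypothetical stand-ins, and pause and autoneg bits are omitted for brevity.

```c
#include <stdint.h>

/* Link partner ability bits in MII_LPA. */
#define LPA_10HALF   0x0020
#define LPA_10FULL   0x0040
#define LPA_100HALF  0x0080
#define LPA_100FULL  0x0100
/* Link partner ability bits in MII_STAT1000. */
#define LPA_1000HALF 0x0400
#define LPA_1000FULL 0x0800

/* Hypothetical stand-ins for the ethtool advertised-mode bits. */
#define ADV_10_HALF   (1u << 0)
#define ADV_10_FULL   (1u << 1)
#define ADV_100_HALF  (1u << 2)
#define ADV_100_FULL  (1u << 3)
#define ADV_1000_HALF (1u << 4)
#define ADV_1000_FULL (1u << 5)

/* Build the link partner advertising mask from the two register values. */
static uint32_t lp_advertising_sketch(uint16_t lpa, uint16_t stat1000)
{
	uint32_t adv = 0;

	if (lpa & LPA_10HALF)
		adv |= ADV_10_HALF;
	if (lpa & LPA_10FULL)
		adv |= ADV_10_FULL;
	if (lpa & LPA_100HALF)
		adv |= ADV_100_HALF;
	if (lpa & LPA_100FULL)
		adv |= ADV_100_FULL;
	if (stat1000 & LPA_1000HALF)
		adv |= ADV_1000_HALF;
	if (stat1000 & LPA_1000FULL)
		adv |= ADV_1000_FULL;
	return adv;
}
```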
     
    By checking the related code, it is impossible that ret > len or total_len,
    so we should remove some useless code in both of the above functions.

    Signed-off-by: Zhi Yong Wu
    Signed-off-by: David S. Miller

    Zhi Yong Wu
     
    By checking the related code, it is impossible that ret > len or total_len,
    so we should remove some useless code in both of the above functions.

    Signed-off-by: Zhi Yong Wu
    Signed-off-by: David S. Miller

    Zhi Yong Wu
     
  • The way that flow control works without this patch is that, in start_xmit()
    the code uses xenvif_count_skb_slots() to predict how many slots
    xenvif_gop_skb() will consume and then adds this to a 'req_cons_peek'
    counter which it then uses to determine if the shared ring has that amount
    of space available by checking whether 'req_prod' has passed that value.
    If the ring doesn't have space the tx queue is stopped.
    xenvif_gop_skb() will then consume slots and update 'req_cons' and issue
    responses, updating 'rsp_prod' as it goes. The frontend will consume those
    responses and post new requests, by updating req_prod. So, req_prod chases
    req_cons which chases rsp_prod, and can never exceed that value. Thus if
    xenvif_count_skb_slots() ever returns a number of slots greater than
    xenvif_gop_skb() uses, req_cons_peek will get to a value that req_prod cannot
    possibly achieve (since it's limited by the 'real' req_cons) and, if this
    happens enough times, req_cons_peek gets more than a ring size ahead of
    req_cons and the tx queue then remains stopped forever waiting for an
    unachievable amount of space to become available in the ring.

    Having two routines trying to calculate the same value is always going to be
    fragile, so this patch does away with that. All we essentially need to do is
    make sure that we have 'enough stuff' on our internal queue without letting
    it build up uncontrollably. So start_xmit() makes a cheap optimistic check
    of how much space is needed for an skb and only turns the queue off if that
    is unachievable. net_rx_action() is the place where we could do with an
    accurate prediction but, since that has proven tricky to calculate, a cheap
    worst-case (but not too bad) estimate is all we really need since the only
    thing we *must* prevent is xenvif_gop_skb() consuming more slots than are
    available.

    Without this patch I can trivially stall netback permanently by just doing
    a large guest to guest file copy between two Windows Server 2008R2 VMs on a
    single host.

    Patch tested with frontends in:
    - Windows Server 2008R2
    - CentOS 6.0
    - Debian Squeeze
    - Debian Wheezy
    - SLES11

    Signed-off-by: Paul Durrant
    Cc: Wei Liu
    Cc: Ian Campbell
    Cc: David Vrabel
    Cc: Annie Li
    Cc: Konrad Rzeszutek Wilk
    Acked-by: Wei Liu
    Signed-off-by: David S. Miller

    Paul Durrant
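
A cheap worst-case slot estimate of the kind described above might look like the following. This is a simplified sketch under stated assumptions, not the driver's code: it assumes 4096-byte pages, one slot per page touched by the linear area and by each fragment, and one extra slot for GSO metadata. The function name and parameters are hypothetical.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u
#define DIV_ROUND_UP_SKETCH(n, d) (((n) + (d) - 1) / (d))

/* Upper bound on ring slots for one packet (simplified assumptions). */
static unsigned int max_slots_needed(size_t head_offset, size_t head_len,
				     const size_t *frag_len,
				     unsigned int nr_frags, int is_gso)
{
	/* Linear area: may straddle pages depending on its start offset. */
	unsigned int slots = (unsigned int)
		DIV_ROUND_UP_SKETCH(head_offset + head_len, PAGE_SIZE_SKETCH);

	/* Each fragment: rounded up to whole pages. */
	for (unsigned int i = 0; i < nr_frags; i++)
		slots += (unsigned int)
			DIV_ROUND_UP_SKETCH(frag_len[i], PAGE_SIZE_SKETCH);

	/* Assumed: one extra slot for GSO metadata. */
	if (is_gso)
		slots++;

	return slots;
}
```

An estimate like this only has to be an upper bound: overestimating delays a packet slightly, while underestimating would let the copy step consume more slots than the ring has available.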
     
  • struct 'tipc_bearer' is a generic representation of the underlying
    media type, and exists in a one-to-one relationship to each interface
    TIPC is using. The struct contains a 'blocked' flag that mirrors the
    operational and execution state of the represented interface, and is
    updated through notification calls from the latter. The users of
    tipc_bearer are checking this flag before each attempt to send a
    packet via the interface.

    This state mirroring serves no purpose in the current code base. TIPC
    links will not discover a media failure any faster through this
    mechanism, and in reality the flag only adds overhead at packet
    sending and reception.

    Furthermore, the fact that the flag needs to be protected by a spinlock
    aggregated into tipc_bearer has turned out to cause a serious and
    completely unnecessary deadlock problem.

                 CPU0                          CPU1
                 ----                          ----
    Time 0:  bearer_disable()              link_timeout()
    Time 1:  spin_lock_bh(&b_ptr->lock)    tipc_link_push_queue()
    Time 2:  tipc_link_delete()            tipc_bearer_blocked(b_ptr)
    Time 3:  k_cancel_timer(&req->timer)   spin_lock_bh(&b_ptr->lock)
    Time 4:  del_timer_sync(&req->timer)

    I.e., del_timer_sync() on CPU0 never returns, because the timer handler
    on CPU1 is waiting for the bearer lock.

    We eliminate the 'blocked' flag from struct tipc_bearer, along with all
    tests on this flag. This not only resolves the deadlock, but also
    simplifies and speeds up the data path execution of TIPC. It also fits
    well into our ongoing effort to make the locking policy simpler and
    more manageable.

    An effect of this change is that we can get rid of functions such as
    tipc_bearer_blocked(), tipc_continue() and tipc_block_bearer().
    We replace the latter with a new function, tipc_reset_bearer(), which
    resets all links associated to the bearer immediately after an
    interface goes down.

    A user might notice one slight change in link behaviour after this
    change. When an interface goes down, (e.g. through a NETDEV_DOWN
    event) all attached links will be reset immediately, instead of
    leaving it to each link to detect the failure through a timer-driven
    mechanism. We consider this an improvement, and see no obvious risks
    with the new behavior.

    Signed-off-by: Erik Hugne
    Reviewed-by: Ying Xue
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • Use pr_<level> instead of printk(LEVEL).

    Suggested-by: Joe Perches
    Signed-off-by: Wang Weidong
    Signed-off-by: David S. Miller

    wangweidong
     
  • This patch introduces a PACKET_QDISC_BYPASS socket option, that
    allows for using a similar xmit() function as in pktgen instead
    of taking the dev_queue_xmit() path. This can be very useful when
    PF_PACKET applications are required to be used in a similar
    scenario as pktgen, but with full, flexible packet payload that
    needs to be provided, for example.

    By default, nothing changes in behaviour for normal PF_PACKET
    TX users, so everything stays as is for applications. New users,
    however, can now set PACKET_QDISC_BYPASS if needed to i) prevent
    their own packets from reentering packet_rcv() and ii) push the
    frame directly to the driver.

    In doing so we can increase pps (here 64 byte packets) for
    PF_PACKET a bit:

    # CPUs -- QDISC_BYPASS -- qdisc path -- qdisc path[**]
    1 CPU == 1,509,628 pps -- 1,208,708 -- 1,247,436
    2 CPUs == 3,198,659 pps -- 2,536,012 -- 1,605,779
    3 CPUs == 4,787,992 pps -- 3,788,740 -- 1,735,610
    4 CPUs == 6,173,956 pps -- 4,907,799 -- 1,909,114
    5 CPUs == 7,495,676 pps -- 5,956,499 -- 2,014,422
    6 CPUs == 9,001,496 pps -- 7,145,064 -- 2,155,261
    7 CPUs == 10,229,776 pps -- 8,190,596 -- 2,220,619
    8 CPUs == 11,040,732 pps -- 9,188,544 -- 2,241,879
    9 CPUs == 12,009,076 pps -- 10,275,936 -- 2,068,447
    10 CPUs == 11,380,052 pps -- 11,265,337 -- 1,578,689
    11 CPUs == 11,672,676 pps -- 11,845,344 -- 1,297,412
    [...]
    20 CPUs == 11,363,192 pps -- 11,014,933 -- 1,245,081

    [**]: qdisc path with packet_rcv(), how probably most people
    seem to use it (hopefully not anymore if not needed)

    The test was done using a modified trafgen, sending a simple
    static 64 bytes packet, on all CPUs. The trick in the fast
    "qdisc path" case, is to avoid reentering packet_rcv() by
    setting the RAW socket protocol to zero, like:
    socket(PF_PACKET, SOCK_RAW, 0);

    Tradeoffs are documented in this patch as well: if queues are
    busy, we will drop more packets, tc disciplines are ignored, and
    these packets are no longer visible to taps. For a pktgen-like
    scenario, we argue that this is acceptable.

    The pointer to the xmit function has been placed in packet
    socket structure hole between cached_dev and prot_hook that
    is hot anyway as we're working on cached_dev in each send path.

    Done in joint work together with Jesper Dangaard Brouer.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
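
From userspace, opting in to the bypass could look like the sketch below. It assumes a Linux system whose headers may predate the option, so it defines PACKET_QDISC_BYPASS itself when missing; the helper name is hypothetical. Opening a PF_PACKET socket normally requires CAP_NET_RAW, so the call fails with -1 for unprivileged users.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <unistd.h>

#ifndef PACKET_QDISC_BYPASS
#define PACKET_QDISC_BYPASS 20  /* assumed value for older headers */
#endif

/* Returns a TX-only packet socket with qdisc bypass enabled,
 * or -1 with errno set. */
static int open_bypass_socket(void)
{
	int one = 1;
	/* Protocol 0: the socket does not receive, so sent frames do not
	 * reenter packet_rcv(). */
	int fd = socket(PF_PACKET, SOCK_RAW, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
		       &one, sizeof(one)) < 0) {
		int saved = errno;  /* close() may overwrite errno */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

The socket still has to be bound to an interface with bind() and a sockaddr_ll before sending; that step is unchanged by the option and is omitted here.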
     
  • As we need it elsewhere, move the inline helper function of
    skb_needs_linearize() over to skbuff.h include file. While
    at it, also convert the return to 'bool' instead of 'int'
    and add a proper kernel doc.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
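
The decision this helper makes can be sketched with a mock buffer type. The assumption here is that a buffer must be flattened when it carries a fragment list or page fragments that the device cannot handle; the struct, the feature flags, and the function name are hypothetical stand-ins, not the kernel definitions.

```c
#include <stdbool.h>

#define FEAT_SG       (1u << 0)  /* stand-in: device supports scatter-gather */
#define FEAT_FRAGLIST (1u << 1)  /* stand-in: device supports fragment lists */

/* Mock of the two buffer properties the check depends on. */
struct buf_sketch {
	unsigned int nr_frags;  /* page fragments attached to the buffer */
	bool has_frag_list;     /* chained buffers attached */
};

/* True if the buffer must be copied into one linear area before TX. */
static bool needs_linearize_sketch(const struct buf_sketch *b,
				   unsigned int features)
{
	return (b->has_frag_list && !(features & FEAT_FRAGLIST)) ||
	       (b->nr_frags && !(features & FEAT_SG));
}
```

A purely linear buffer never needs the copy, whatever the device supports, because both conditions are false.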
     
  • Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
    Daniel's direct transmit changes depend upon.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit e40526cb20b5 introduced a cached dev pointer, that gets
    hooked into register_prot_hook(), __unregister_prot_hook() to
    update the device used for the send path.

    We need to fix this up, as otherwise this will not work with
    sockets created with protocol = 0, plus with sll_protocol = 0
    passed via sockaddr_ll when doing the bind.

    So instead, assign the pointer directly. The compiler can inline
    these helper functions automagically.

    While at it, also assume the cached dev fast-path as likely(),
    and document this variant of socket creation as it seems it is
    not widely used (seems not even the author of TX_RING was aware
    of that in his reference example [1]). Tested with reproducer
    from e40526cb20b5.

    [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example

    Fixes: e40526cb20b5 ("packet: fix use after free race in send path when dev is released")
    Signed-off-by: Daniel Borkmann
    Tested-by: Salam Noureddine
    Tested-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
    added the ability to change the default qdisc from pfifo_fast to, say, fq.

    But as most modern ethernet devices are multiqueue, we can't really
    see all the statistics from "tc -s qdisc show", as the default root
    qdisc is mq.

    This patch adds the calls to qdisc_list_add() to mq and mqprio.

    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Jeff Kirsher says:

    ====================
    Intel Wired LAN Driver Updates

    This series contains updates to i40e only.

    Jacob provides an i40e patch to get 1588 working correctly by separating
    the TSYNVALID and TSYNINDX fields in the receive descriptor.

    Jesse provides several i40e patches, first to correct the checking
    of the multi-bit state. The hash is reported correctly in the RSS
    field if and only if the filter status is 3. Other values of the
    filter status mean different things and we should not depend on a
    bitwise result. Then provides a patch to enable a couple of
    workarounds based on revision ID that allow the driver to work
    more fully on early hardware.

    Shannon provides several i40e patches as well. First sets the media
    type in the hardware structure based on the external connection type.
    Then provides a patch to only setup the rings that will be used. Lastly
    provides a fix where the TESTING state was still set when exiting the
    ethtool diagnostics.

    Kevin Scott provides one i40e patch to add a new flag to the i40e_add_veb()
    which allows the driver to request the hardware to filter on layer 2
    parameters.

    Anjali provides four i40e patches, first refactors the reset code in
    order to re-size queues and vectors while the interface is still up.
    Then provides a patch to enable all PCTYPEs except FCoE for RSS. Adds
    a message to notify the user of how many VFs are initialized on each
    port. Lastly adds a new variable to track the number of PF instances,
    this is a global counter on purpose so that each PF loaded has a
    unique ID.

    Catherine bumps the driver version.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han
     
  • The driver core clears the driver data to NULL after device_release
    or on probe failure. Thus, it is not needed to manually clear the
    device driver data to NULL.

    Signed-off-by: Jingoo Han
    Signed-off-by: David S. Miller

    Jingoo Han