12 Dec, 2013

1 commit


11 Dec, 2013

18 commits

  • In early versions of TIPC it was possible to administratively block
    individual links through the use of the member flag 'blocked'. This
    functionality was deemed redundant, and since commit 7368dd ("tipc:
    clean out all instances of #if 0'd unused code"), this flag has been
    unused.

    In the current code, a link only needs to be blocked for sending and
    reception if it is subject to an ongoing link failover. In that case,
    it is sufficient to check if the number of expected failover packets
    is non-zero, something which is done via the funtion 'link_blocked()'.

    This commit finally removes the redundant 'blocked' flag completely.

    Signed-off-by: Ying Xue
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Currently TIPC supports two L2 media types, Ethernet and Infiniband.
    Because both these media are accessed through the common net_device API,
    several functions in the two media adaptation files turn out to be
    fully or almost identical, leading to unnecessary code duplication.

    In this commit we extract this common code from the two media files
    and move them to the generic bearer.c. Additionally, we change
    the function names to reflect their real role: to access L2 media,
    irrespective of type.

    Signed-off-by: Ying Xue
    Cc: Patrick McHardy
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Currently, registering a TIPC stack handler in the network device layer
    is done twice, once for Ethernet (eth_media) and Infiniband (ib_media)
    repectively. But, as this registration is not media specific, we can
    avoid some code duplication by moving the registering function to
    the generic bearer layer, to the file bearer.c, and call it only once.
    The same is true for the network device event notifier.

    As a side effect, the two workqueues we are using for for setting up/
    cleaning up media can now be eliminated. Furthermore, the array for
    storing the specific media type structs, media_array[], can be entirely
    deleted.

    Note that the eth_started and ib_started flags were removed during the
    code relocation. There is now only one call to bearer_setup and
    bearer_cleanup, and these can logically not race against each other.

    Despite its size, this cleanup work incurs no functional changes in TIPC.
    In particular, it should be noted that the sequence ordering of received
    packets is unaffected by this change, since packet reception never was
    subject to any work queue handling in the first place.

    Signed-off-by: Ying Xue
    Cc: Patrick McHardy
    Signed-off-by: Jon Maloy
    Reviewed-by: Paul Gortmaker
    Signed-off-by: David S. Miller

    Ying Xue
     
  • TIPC is currently using the field 'af_packet_priv' in struct net_device
    as a handle to find the bearer instance associated to the given network
    device. But, by doing so it is blocking other networking cleanups, such
    as the one discussed here:

    http://patchwork.ozlabs.org/patch/178044/

    This commit removes this usage from TIPC. Instead, we introduce a new
    field, 'tipc_ptr', to the net_device structure, to serve this purpose.
    When TIPC bearer is enabled, the bearer object is associated to
    'tipc_ptr'. When a TIPC packet arrives in the recv_msg() upcall
    from a networking device, the bearer object can now be obtained from
    'tipc_ptr'. When a bearer is disabled, the bearer object is detached
    from its underlying network device by setting 'tipc_ptr' to NULL.

    Additionally, an RCU lock is used to protect the new pointer.
    Henceforth, the existing tipc_net_lock is used in write mode to
    serialize write accesses to this pointer, while the new RCU lock is
    applied on the read side to ensure that the pointer is 100% valid
    within its wrapped area for all readers.

    Signed-off-by: Ying Xue
    Cc: Patrick McHardy
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • struct 'tipc_media' represents the specific info that the media
    layer adaptors (eth_media and ib_media) expose to the generic
    bearer layer. We clarify this by improved commenting, and by giving
    the 'media_list' array the more appropriate name 'media_info_array'.

    There are no functional changes in this commit.

    Signed-off-by: Ying Xue
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • Communication media types are abstracted through the struct 'tipc_media',
    one per media type. These structs are allocated statically inside their
    respective media file.

    Furthermore, in order to be able to reach all instances from a central
    location, we keep a static array with pointers to these structs. This
    array is currently initialized at runtime, under protection of
    tipc_net_lock. However, since the contents of the array itself never
    changes after initialization, we can just as well initialize it at
    compile time and make it 'const', at the same time making it obvious
    that no lock protection is needed here.

    This commit makes the array constant and removes the redundant lock
    protection.

    Signed-off-by: Ying Xue
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • sk_buff lists are currently relased by looping over the list and
    explicitly releasing each buffer.

    We replace all occurrences of this loop with a call to kfree_skb_list().

    Signed-off-by: Ying Xue
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Macros with multiple statements should be enclosed in a do - while loop

    Signed-off-by: Yang Yingliang
    Signed-off-by: David S. Miller

    Yang Yingliang
     
  • Spaces required around that '>' (ctx:VxV) and
    before the open parenthesis '('.

    Signed-off-by: Yang Yingliang
    Signed-off-by: David S. Miller

    Yang Yingliang
     
  • "foo* bar" or "foo * bar" should be "foo *bar".

    Signed-off-by: Yang Yingliang
    Signed-off-by: David S. Miller

    Yang Yingliang
     
  • Code indent should use tabs where possible

    Signed-off-by: Yang Yingliang
    Signed-off-by: David S. Miller

    Yang Yingliang
     
  • return is not a function, parentheses are not required.

    Signed-off-by: Yang Yingliang
    Signed-off-by: David S. Miller

    Yang Yingliang
     
  • This patch makes socketpair() use error paths which do not
    rely on heavy-weight call to sys_close(): it's better to try
    to push the file descriptor to userspace before installing
    the socket file to the file descriptor, so that errors are
    catched earlier and being easier to handle.

    Using sys_close() seems to be the exception, while writing the
    file descriptor before installing it look like it's more or less
    the norm: eg. except for code used in init/, error handling
    involve fput() and put_unused_fd(), but not sys_close().

    This make socketpair() usage of sys_close() quite unusual.
    So it deserves to be replaced by the common pattern relying on
    fput() and put_unused_fd() just like, for example, the one used
    in pipe(2) or recvmsg(2).

    Three distinct error paths are still needed since calling
    fput() on file structure returned by sock_alloc_file() will
    implicitly call sock_release() on the associated socket
    structure.

    Cc: David S. Miller
    Cc: Al Viro
    Signed-off-by: Yann Droneaud
    Link: http://marc.info/?i=1385979146-13825-1-git-send-email-ydroneaud@opteya.com
    Signed-off-by: David S. Miller

    Yann Droneaud
     
  • Various spelling fixes in networking stack

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • This fixes compile error when CONFIG_NET_NS is not set.

    Introduced by:
    commit 1d4c8c29841b9991cdf3c7cc4ba7f96a94f104ca
    "neigh: restore old behaviour of default parms values"

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Turned out that applications like ifconfig do not handle the change.
    So revert ifa_flag format back to 2-letter hex value.

    Introduced by:
    commit 479840ffdbe4242e8a25349218c8e0859223aa35
    "ipv6 addrconf: extend ifa_flags to u32"

    Reported-by: Alexander Aring
    Signed-off-by: Jiri Pirko
    Tested-by: FLorent Fourcot
    Signed-off-by: David S. Miller

    Jiri Pirko
     

10 Dec, 2013

18 commits

  • Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • And use it if possible.

    Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • tclass information in now already stored in rcv_flowinfo
    We do not need to store the same information twice.

    Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • The current implementation of IPV6_FLOWINFO only gives a
    result if pktoptions is available (thanks to the
    ip6_datagram_recv_ctl function).
    It gives inconsistent results to user space, sometimes
    there is a result for getsockopt(IPV6_FLOWINFO), sometimes
    not.

    This patch add rcv_flowinfo to store it, and return it to
    the userspace in the same way than other pkt_options.

    Signed-off-by: Florent Fourcot
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florent Fourcot
     
  • Use the newly added generic routine ether_addr_equal_unaligned
    to test if possibly unaligned to u16 Ethernet addresses are equal.

    This slightly improves comparison time for systems with
    CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     
  • Make the behaviour similar to ipv4. This will allow user to set sysctl
    default neigh param values and these values will be respected even by
    devices registered before (that ones what do not have address set yet).

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Previously inet devices were only constructed when addresses are added.
    Therefore the default neigh parms values they get are the ones at the
    time of these operations.

    Now that we're creating inet devices earlier, this changes the behaviour
    of default neigh parms values in an incompatible way (see bug #8519).

    This patch creates a compromise by setting the default values at the
    same point as before but only for those that have not been explicitly
    set by the user since the inet device's creation.

    Introduced by:
    commit 8030f54499925d073a88c09f30d5d844fb1b3190
    Author: Herbert Xu
    Date: Thu Feb 22 01:53:47 2007 +0900

    [IPV4] devinet: Register inetdev earlier.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • This will be needed later on to provide better management of default values.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • This patch converts the neigh param members to an array. This allows easier
    manipulation which will be needed later on to provide better management of
    default values.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • struct 'tipc_bearer' is a generic representation of the underlying
    media type, and exists in a one-to-one relationship to each interface
    TIPC is using. The struct contains a 'blocked' flag that mirrors the
    operational and execution state of the represented interface, and is
    updated through notification calls from the latter. The users of
    tipc_bearer are checking this flag before each attempt to send a
    packet via the interface.

    This state mirroring serves no purpose in the current code base. TIPC
    links will not discover a media failure any faster through this
    mechanism, and in reality the flag only adds overhead at packet
    sending and reception.

    Furthermore, the fact that the flag needs to be protected by a spinlock
    aggregated into tipc_bearer has turned out to cause a serious and
    completely unnecessary deadlock problem.

    CPU0 CPU1
    ---- ----
    Time 0: bearer_disable() link_timeout()
    Time 1: spin_lock_bh(&b_ptr->lock) tipc_link_push_queue()
    Time 2: tipc_link_delete() tipc_bearer_blocked(b_ptr)
    Time 3: k_cancel_timer(&req->timer) spin_lock_bh(&b_ptr->lock)
    Time 4: del_timer_sync(&req->timer)

    I.e., del_timer_sync() on CPU0 never returns, because the timer handler
    on CPU1 is waiting for the bearer lock.

    We eliminate the 'blocked' flag from struct tipc_bearer, along with all
    tests on this flag. This not only resolves the deadlock, but also
    simplifies and speeds up the data path execution of TIPC. It also fits
    well into our ongoing effort to make the locking policy simpler and
    more manageable.

    An effect of this change is that we can get rid of functions such as
    tipc_bearer_blocked(), tipc_continue() and tipc_block_bearer().
    We replace the latter with a new function, tipc_reset_bearer(), which
    resets all links associated to the bearer immediately after an
    interface goes down.

    A user might notice one slight change in link behaviour after this
    change. When an interface goes down, (e.g. through a NETDEV_DOWN
    event) all attached links will be reset immediately, instead of
    leaving it to each link to detect the failure through a timer-driven
    mechanism. We consider this an improvement, and see no obvious risks
    with the new behavior.

    Signed-off-by: Erik Hugne
    Reviewed-by: Ying Xue
    Reviewed-by: Paul Gortmaker
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Erik Hugne
     
  • use pr_ instead of printk(LEVEL)

    Suggested-by: Joe Perches
    Signed-off-by: Wang Weidong
    Signed-off-by: David S. Miller

    wangweidong
     
  • This patch introduces a PACKET_QDISC_BYPASS socket option, that
    allows for using a similar xmit() function as in pktgen instead
    of taking the dev_queue_xmit() path. This can be very useful when
    PF_PACKET applications are required to be used in a similar
    scenario as pktgen, but with full, flexible packet payload that
    needs to be provided, for example.

    On default, nothing changes in behaviour for normal PF_PACKET
    TX users, so everything stays as is for applications. New users,
    however, can now set PACKET_QDISC_BYPASS if needed to prevent
    own packets from i) reentering packet_rcv() and ii) to directly
    push the frame to the driver.

    In doing so we can increase pps (here 64 byte packets) for
    PF_PACKET a bit:

    # CPUs -- QDISC_BYPASS -- qdisc path -- qdisc path[**]
    1 CPU == 1,509,628 pps -- 1,208,708 -- 1,247,436
    2 CPUs == 3,198,659 pps -- 2,536,012 -- 1,605,779
    3 CPUs == 4,787,992 pps -- 3,788,740 -- 1,735,610
    4 CPUs == 6,173,956 pps -- 4,907,799 -- 1,909,114
    5 CPUs == 7,495,676 pps -- 5,956,499 -- 2,014,422
    6 CPUs == 9,001,496 pps -- 7,145,064 -- 2,155,261
    7 CPUs == 10,229,776 pps -- 8,190,596 -- 2,220,619
    8 CPUs == 11,040,732 pps -- 9,188,544 -- 2,241,879
    9 CPUs == 12,009,076 pps -- 10,275,936 -- 2,068,447
    10 CPUs == 11,380,052 pps -- 11,265,337 -- 1,578,689
    11 CPUs == 11,672,676 pps -- 11,845,344 -- 1,297,412
    [...]
    20 CPUs == 11,363,192 pps -- 11,014,933 -- 1,245,081

    [**]: qdisc path with packet_rcv(), how probably most people
    seem to use it (hopefully not anymore if not needed)

    The test was done using a modified trafgen, sending a simple
    static 64 bytes packet, on all CPUs. The trick in the fast
    "qdisc path" case, is to avoid reentering packet_rcv() by
    setting the RAW socket protocol to zero, like:
    socket(PF_PACKET, SOCK_RAW, 0);

    Tradeoffs are documented as well in this patch, clearly, if
    queues are busy, we will drop more packets, tc disciplines are
    ignored, and these packets are not visible to taps anymore. For
    a pktgen like scenario, we argue that this is acceptable.

    The pointer to the xmit function has been placed in packet
    socket structure hole between cached_dev and prot_hook that
    is hot anyway as we're working on cached_dev in each send path.

    Done in joint work together with Jesper Dangaard Brouer.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As we need it elsewhere, move the inline helper function of
    skb_needs_linearize() over to skbuff.h include file. While
    at it, also convert the return to 'bool' instead of 'int'
    and add a proper kernel doc.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
    Daniel's direct transmit changes depend upon.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit e40526cb20b5 introduced a cached dev pointer, that gets
    hooked into register_prot_hook(), __unregister_prot_hook() to
    update the device used for the send path.

    We need to fix this up, as otherwise this will not work with
    sockets created with protocol = 0, plus with sll_protocol = 0
    passed via sockaddr_ll when doing the bind.

    So instead, assign the pointer directly. The compiler can inline
    these helper functions automagically.

    While at it, also assume the cached dev fast-path as likely(),
    and document this variant of socket creation as it seems it is
    not widely used (seems not even the author of TX_RING was aware
    of that in his reference example [1]). Tested with reproducer
    from e40526cb20b5.

    [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example

    Fixes: e40526cb20b5 ("packet: fix use after free race in send path when dev is released")
    Signed-off-by: Daniel Borkmann
    Tested-by: Salam Noureddine
    Tested-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
    added the ability to change default qdisc from pfifo_fast to say fq

    But as most modern ethernet devices are multiqueue, we cant really
    see all the statistics from "tc -s qdisc show", as the default root
    qdisc is mq.

    This patch adds the calls to qdisc_list_add() to mq and mqprio

    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Dec, 2013

3 commits