27 Jul, 2015

2 commits

  • Currently, we use the code sequence

    if (msg_reverse())
        tipc_link_xmit_skb()

    at numerous locations in socket.c. The preparation of arguments
    for these calls, as well as the sequence itself, makes the code
    unnecessarily complex.

    In this commit, we introduce a new function, tipc_sk_respond(),
    that performs this call combination. We also replace some, but not
    yet all, of these explicit call sequences with calls to the new
    function. Notably, we let the function tipc_sk_proto_rcv() use
    the new function to directly send out PROBE_REPLY messages,
    instead of deferring this to the calling tipc_sk_rcv() function,
    as we do now.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
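
The call combination described in the commit above can be sketched in user space as a single helper that performs the reverse step and transmits only if that step succeeds. This is a minimal sketch, not the TIPC code: the reduced message layout and every name prefixed with demo_ are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced message: only the fields the sketch needs. */
struct demo_msg {
    uint32_t orig_node;
    uint32_t dest_node;
    int err;
};

static int demo_sent_count;      /* how many messages were "transmitted" */
static uint32_t demo_last_dest;  /* destination of the last transmitted message */

/* Stand-in for the reverse step: swap sender and receiver and record the
 * error code. Returns false when the message cannot be returned, which in
 * this sketch means a missing message or an unknown originator (0). */
static bool demo_msg_reverse(struct demo_msg *m, int err)
{
    uint32_t tmp;

    if (m == NULL || m->orig_node == 0)
        return false;
    tmp = m->orig_node;
    m->orig_node = m->dest_node;
    m->dest_node = tmp;
    m->err = err;
    return true;
}

/* Stand-in for the transmit step: only records what was sent. */
static void demo_link_xmit(const struct demo_msg *m)
{
    assert(m != NULL);
    demo_sent_count++;
    demo_last_dest = m->dest_node;
}

/* The combined helper: callers no longer repeat the two-step sequence. */
static bool demo_respond(struct demo_msg *m, int err)
{
    if (!demo_msg_reverse(m, err))
        return false;
    demo_link_xmit(m);
    return true;
}
```

A caller that previously prepared arguments for both calls would, under these assumptions, only call demo_respond(&msg, err) and check the return value.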
     
  • The shortest TIPC message header, for cluster local CONNECTED messages,
    is 24 bytes long. With this format, the fields "dest_node" and
    "orig_node" are optimized away, since they in reality are redundant
    in this particular case.

    However, the absence of these fields leads to code inconsistencies
    that are difficult to handle in some cases, especially when we need
    to reverse or reject messages at the socket layer.

    In this commit, we concentrate the handling of the absent fields
    to one place, by letting the function tipc_msg_reverse() reallocate
    the buffer and expand the header to 32 bytes when necessary. This
    means that the socket code now can assume that the two previously
    absent fields are present in the header when a message needs to be
    rejected. This opens the way for some further simplifications of the
    socket code.

    Reviewed-by: Ying Xue
    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
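
The header expansion described above can be illustrated with a small user-space sketch: a message with a 24-byte header is copied into a new buffer whose header is 32 bytes, with the two previously absent fields filled in. The byte offsets chosen for the two fields, the host byte order, and all demo_ names are assumptions of this sketch and are not taken from the TIPC sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { DEMO_SHORT_HDR = 24, DEMO_FULL_HDR = 32 };

/* Return a newly allocated copy of the message with a DEMO_FULL_HDR-byte
 * header. A message that already has a full header is copied unchanged.
 * For a short header, 8 bytes are inserted after the header to hold
 * orig_node and dest_node (host byte order, for simplicity).
 * *out_len receives the new total length. Returns NULL on bad input or
 * allocation failure; the caller frees the result. */
static uint8_t *demo_expand_header(const uint8_t *buf, size_t len,
                                   size_t hdr_len, uint32_t orig_node,
                                   uint32_t dest_node, size_t *out_len)
{
    size_t grow;
    uint8_t *out;

    assert(out_len != NULL);
    if (buf == NULL || hdr_len > len)
        return NULL;
    if (hdr_len != DEMO_SHORT_HDR && hdr_len != DEMO_FULL_HDR)
        return NULL;

    grow = (hdr_len == DEMO_SHORT_HDR) ? DEMO_FULL_HDR - DEMO_SHORT_HDR : 0;
    out = malloc(len + grow);
    if (out == NULL)
        return NULL;

    memcpy(out, buf, hdr_len);                      /* existing header */
    if (grow) {
        memcpy(out + DEMO_SHORT_HDR, &orig_node, sizeof(orig_node));
        memcpy(out + DEMO_SHORT_HDR + 4, &dest_node, sizeof(dest_node));
    }
    memcpy(out + hdr_len + grow, buf + hdr_len, len - hdr_len); /* payload */
    *out_len = len + grow;
    return out;
}
```

With the expansion concentrated in one function, code that later rejects a message can read both fields without first checking which header format it was given.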
     

25 Jul, 2015

1 commit

  • This patch fixes the setting of vinfo.flags in br_fill_ifvlaninfo_range(). The
    assignment vinfo.flags &= ~BRIDGE_VLAN_INFO_RANGE_BEGIN has no effect and is
    unneeded, as the value of vinfo.flags is overridden by the immediately following
    assignment vinfo.flags = flags | BRIDGE_VLAN_INFO_RANGE_END.

    Signed-off-by: Rami Rosen
    Acked-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Rosen, Rami
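
Why the removed statement is a dead store can be shown with a short sketch. The two functions below model the flag handling before and after the patch under the assumption that the sequence is exactly the one quoted in the commit message; the flag values and all demo_ names are illustrative and not copied from the bridge headers.

```c
#include <assert.h>

/* Illustrative flag bits; the concrete values are assumptions. */
#define DEMO_RANGE_BEGIN (1u << 3)
#define DEMO_RANGE_END   (1u << 4)

/* Before the patch: the masked value is stored, then overwritten. */
static unsigned int demo_end_flags_before(unsigned int flags)
{
    unsigned int vflags = flags | DEMO_RANGE_BEGIN; /* left over from the range-begin entry */

    vflags &= ~DEMO_RANGE_BEGIN;      /* dead store: never read ... */
    vflags = flags | DEMO_RANGE_END;  /* ... because this overwrites it */
    return vflags;
}

/* After the patch: only the assignment that matters remains. */
static unsigned int demo_end_flags_after(unsigned int flags)
{
    return flags | DEMO_RANGE_END;
}

/* Both variants agree for every combination of the low five flag bits. */
static void demo_check_equivalent(void)
{
    for (unsigned int f = 0; f < 32; f++)
        assert(demo_end_flags_before(f) == demo_end_flags_after(f));
}
```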
     

23 Jul, 2015

5 commits

  • Convert the module_init() to an invocation from inet_init() since
    ip_tunnel_core is part of the INET built-in.

    Fixes: 3093fbe7ff4 ("route: Per route IP tunnel metadata via lightweight tunnel")
    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Conflicts:
    net/bridge/br_mdb.c

    br_mdb.c conflict was a function call being removed to fix a bug in
    'net' but whose signature was changed in 'net-next'.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:

    1) Don't use shared bluetooth antenna in iwlwifi driver for management
    frames, from Emmanuel Grumbach.

    2) Fix device ID check in ath9k driver, from Felix Fietkau.

    3) Off by one in xen-netback BUG checks, from Dan Carpenter.

    4) Fix IFLA_VF_PORT netlink attribute validation, from Daniel Borkmann.

    5) Fix races in setting peeked bit flag in SKBs during datagram
    receive. If it's shared we have to clone it otherwise the value can
    easily be corrupted. Fix from Herbert Xu.

    6) Revert fec clock handling change, causes regressions. From Fabio
    Estevam.

    7) Fix use after free in fq_codel and sfq packet schedulers, from WANG
    Cong.

    8) ipvlan bug fixes (memory leaks, missing rcu_dereference_bh, etc.)
    from WANG Cong and Konstantin Khlebnikov.

    9) Memory leak in act_bpf packet action, from Alexei Starovoitov.

    10) ARM bpf JIT bug fixes from Nicolas Schichan.

    11) Fix backwards compat of ANY_LAYOUT in virtio_net driver, from
    Michael S Tsirkin.

    12) Destruction of bond with different ARP header types not handled
    correctly, fix from Nikolay Aleksandrov.

    13) Revert GRO receive support in ipv6 SIT tunnel driver, causes
    regressions because the GRO packets created cannot be processed
    properly on the GSO side if we forward the frame. From Herbert Xu.

    14) TCCR update race and other fixes to ravb driver from Sergei
    Shtylyov.

    15) Fix SKB leaks in caif_queue_rcv_skb(), from Eric Dumazet.

    16) Fix panics on packet scheduler filter replace, from Daniel Borkmann.

    17) Make sure AF_PACKET sees proper IP headers in defragmented frames
    (via PACKET_FANOUT_FLAG_DEFRAG option), from Edward Hyunkoo Jee.

    18) AF_NETLINK cannot hold mutex in RCU callback, fix from Florian
    Westphal.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits)
    ravb: fix ring memory allocation
    net: phy: dp83867: Fix warning check for setting the internal delay
    openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
    netlink: don't hold mutex in rcu callback when releasing mmapd ring
    ARM: net: fix vlan access instructions in ARM JIT.
    ARM: net: handle negative offsets in BPF JIT.
    ARM: net: fix condition for load_order > 0 when translating load instructions.
    tcp: suppress a division by zero warning
    drivers: net: cpsw: remove tx event processing in rx napi poll
    inet: frags: fix defragmented packet's IP header for af_packet
    net: mvneta: fix refilling for Rx DMA buffers
    stmmac: fix setting of driver data in stmmac_dvr_probe
    sched: cls_flow: fix panic on filter replace
    sched: cls_flower: fix panic on filter replace
    sched: cls_bpf: fix panic on filter replace
    net/mdio: fix mdio_bus_match for c45 PHY
    net: ratelimit warnings about dst entry refcount underflow or overflow
    caif: fix leaks and race in caif_queue_rcv_skb()
    qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355
    ravb: fix race updating TCCR
    ...

    Linus Torvalds
     
  • Per RFC 6724, section 4, "Candidate Source Addresses":

    It is RECOMMENDED that the candidate source addresses be the set
    of unicast addresses assigned to the interface that will be used
    to send to the destination (the "outgoing" interface).

    Add a sysctl to enable this behaviour.

    Signed-off-by: Erik Kline
    Signed-off-by: David S. Miller

    Erik Kline
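
The effect of such a switch on candidate selection can be sketched as a simple filter: when the option is on, only addresses assigned to the outgoing interface are candidates; when it is off, every address is. The oif_only parameter stands in for the sysctl, whose real name is not given above, and the address structure and demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address record: the interface it is assigned to, plus text. */
struct demo_addr {
    int ifindex;
    const char *text;
};

/* Store pointers to the candidate source addresses in out[] (which must
 * have room for n entries) and return how many were selected. With
 * oif_only set, only addresses assigned to the outgoing interface oif
 * qualify; otherwise all addresses qualify. */
static size_t demo_candidates(const struct demo_addr *addrs, size_t n,
                              int oif, int oif_only,
                              const struct demo_addr **out)
{
    size_t count = 0;

    assert(out != NULL);
    for (size_t i = 0; i < n; i++)
        if (!oif_only || addrs[i].ifindex == oif)
            out[count++] = &addrs[i];
    return count;
}
```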
     
  • fix for:
    net/mpls/mpls_iptunnel.c:73:19: sparse: incompatible types in comparison
    expression (different address spaces)

    remove incorrect rcu_dereference possibly left over from
    earlier revisions of the code.

    Reported-by: kbuild test robot
    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     

22 Jul, 2015

27 commits

  • Track success and failure of TCP PMTU probing.

    Signed-off-by: Rick Jones
    Signed-off-by: David S. Miller

    Rick Jones
     
  • If the user did not specify an oif, try to get it from the via address.
    If no device can be found, return -ENODEV.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Some architectures like POWER can have a NUMA node_possible_map that
    contains sparse entries. This causes memory corruption with openvswitch
    since it allocates flow_cache with a multiple of num_possible_nodes() and
    assumes the node variable returned by for_each_node will index into
    flow->stats[node].

    Use nr_node_ids to allocate a maximal sparse array instead of
    num_possible_nodes().

    The crash was noticed after 3af229f2 was applied as it changed the
    node_possible_map to match node_online_map on boot.
    Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861

    Signed-off-by: Chris J Arges
    Acked-by: Pravin B Shelar
    Acked-by: Nishanth Aravamudan
    Signed-off-by: David S. Miller

    Chris J Arges
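
The difference between the two sizes is easy to see with a sparse node map. In the sketch below the map is modelled as a bitmask (bit n set means node n is possible); demo_num_possible_nodes() counts the set bits, while demo_nr_node_ids() returns the highest set bit plus one. The bitmask model and the demo_ names are assumptions for illustration, not the kernel's nodemask API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of possible nodes: the count of set bits in the map. */
static size_t demo_num_possible_nodes(uint64_t map)
{
    size_t n = 0;

    for (; map != 0; map &= map - 1) /* clear the lowest set bit */
        n++;
    return n;
}

/* Highest possible node id plus one: the size an array needs so that
 * every possible node id is a valid index. */
static size_t demo_nr_node_ids(uint64_t map)
{
    size_t ids = 0;

    for (; map != 0; map >>= 1)
        ids++;
    return ids;
}

/* Can an array with `slots` entries be indexed by every possible node? */
static int demo_index_safe(uint64_t map, size_t slots)
{
    assert(map != 0); /* a system has at least one node */
    return demo_nr_node_ids(map) <= slots;
}
```

For a map with only nodes 0 and 16 possible, the first function yields 2 and the second 17, so an array sized by the count cannot be indexed by node 16.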
     
  • Kirill A. Shutemov says:

    This simple test-case triggers a few locking asserts in the kernel:

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    /* Older C libraries do not define SOL_NETLINK. */
    #ifndef SOL_NETLINK
    #define SOL_NETLINK 270
    #endif

    int main(int argc, char **argv)
    {
        unsigned int block_size = 16 * 4096;
        struct nl_mmap_req req = {
            .nm_block_size = block_size,
            .nm_block_nr = 64,
            .nm_frame_size = 16384,
            .nm_frame_nr = 64 * block_size / 16384,
        };
        unsigned int ring_size;
        int fd;

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
            exit(1);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
            exit(1);

        ring_size = req.nm_block_nr * req.nm_block_size;
        mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        return 0;
    }

    +++ exited with 0 +++
    BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
    in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
    3 locks held by init/1:
    #0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
    #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
    #2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
    Preemption disabled at:[] __delay+0xf/0x20

    CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
    ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
    0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
    ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
    Call Trace:
    [] dump_stack+0x4f/0x7b
    [] ___might_sleep+0x16d/0x270
    [] __might_sleep+0x4d/0x90
    [] mutex_lock_nested+0x2f/0x430
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] ? __this_cpu_preempt_check+0x13/0x20
    [] netlink_set_ring+0x1ed/0x350
    [] ? netlink_undo_bind+0x70/0x70
    [] netlink_sock_destruct+0x80/0x150
    [] __sk_free+0x1d/0x160
    [] sk_free+0x19/0x20
    [..]

    Cong Wang says:

    We can't hold mutex lock in a rcu callback, [..]

    Thomas Graf says:

    The socket should be dead at this point. It might be simpler to
    add a netlink_release_ring() function which doesn't require
    locking at all.

    Reported-by: "Kirill A. Shutemov"
    Diagnosed-by: Cong Wang
    Suggested-by: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Andrew Morton reported following warning on one ARM build
    with gcc-4.4 :

    net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
    net/ipv4/inet_hashtables.c:617: warning: division by zero

    Even when guarded by a test on sizeof(spinlock_t), the compiler does
    not like the current construct on a !CONFIG_SMP build.

    Remove the warning by using a temporary variable.

    Fixes: 095dc8e0c368 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
    Reported-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In commit d999297c3dbbe7fdd832f7fa4ec84301e170b3e6
    ("tipc: reduce locking scope during packet reception") we introduced
    a new function tipc_link_proto_rcv(). This function contains a bug,
    so that it sometimes erroneously sends out a non-zero link priority value
    in created protocol messages.

    The bug may lead to an extra link reset at initial link establishing
    with older nodes. This will never happen more than once, whereafter
    the link will work as intended.

    We fix this bug in this commit.

    Signed-off-by: Jon Maloy
    Signed-off-by: David S. Miller

    Jon Paul Maloy
     
  • The sk_classid member is only required when CONFIG_CGROUP_NET_CLASSID is
    enabled. #ifdefify it to reduce the size of struct sock on 32 bit
    systems, at least.

    Signed-off-by: Mathias Krause
    Signed-off-by: David S. Miller

    Mathias Krause
     
  • This gets rid of all OVS specific VXLAN code in the receive and
    transmit path by using a VXLAN net_device to represent the vport.
    Only a small shim layer remains which takes care of handling the
    VXLAN specific OVS Netlink configuration.

    Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
    since they are no longer needed.

    Signed-off-by: Thomas Graf
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • This allows us to get rid of the get_name() vport ops later on.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • This is the first step in representing all OVS vports as regular
    struct net_devices. Move the net_device pointer into the vport
    structure itself to get rid of struct vport_netdev.

    Signed-off-by: Thomas Graf
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Utilize the new metadata dst to attach encapsulation instructions to
    the skb. The existing egress_tun_info via the OVS_CB() is left in
    place until all tunnel vports have been converted to the new method.

    Signed-off-by: Thomas Graf
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • This adds the ability to select a routing table based on the tunnel
    id, which allows maintaining separate routing tables for each virtual
    tunnel network.

    ip rule add from all tunnel-id 100 lookup 100
    ip rule add from all tunnel-id 200 lookup 200

    A new static key controls the collection of metadata at tunnel level
    upon demand.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
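
The two rules shown in the commit message amount to a lookup from tunnel id to routing table. A minimal sketch of that selection follows, assuming a flat rule array and a fallback to a default table; the rule structure, the fallback value, and the demo_ names are assumptions and do not reflect the kernel's fib rules implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: packets carrying tun_id are looked up in table. */
struct demo_rule {
    uint64_t tun_id;
    uint32_t table;
};

/* Table used when no rule matches (an assumption of this sketch). */
#define DEMO_TABLE_DEFAULT 254u

/* Return the routing table selected for a packet with the given tunnel
 * id: the table of the first matching rule, else the default table. */
static uint32_t demo_select_table(const struct demo_rule *rules, size_t n,
                                  uint64_t tun_id)
{
    assert(rules != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (rules[i].tun_id == tun_id)
            return rules[i].table;
    return DEMO_TABLE_DEFAULT;
}
```

With rules { 100, 100 } and { 200, 200 }, traffic from each virtual tunnel network is resolved against its own table, and traffic with any other tunnel id falls through to the default.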
     
  • This introduces a new IP tunnel lightweight tunnel type which allows
    IP tunnel instructions to be specified per route. Only IPv4 is supported
    at this point.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to
    allow routes to match on tunnel metadata. For now, the tunnel id is
    added to flowi_tunnel which allows for routes to be bound to specific
    virtual tunnels.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • If output device wants to see the dst, inherit the dst of the
    original skb and pass it on to generate the ARP request.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Introduces a new dst_metadata which makes it possible to carry per-packet
    metadata between forwarding and processing elements via the skb->dst pointer.

    The structure is set up to be a union. Thus, each separate type of
    metadata requires its own dst instance. If demand arises to carry
    multiple types of metadata concurrently, metadata dst entries can be
    made stackable.

    The metadata dst entry is refcnt'ed as expected for now but a non
    reference counted use is possible if the reference is forced before
    queueing the skb.

    In order to allow allocating dsts with variable length, the existing
    dst_alloc() is split into a dst_alloc() and dst_init() function. The
    existing dst_init() function to initialize the subsystem is being
    renamed to dst_subsys_init() to make it clear what is what.

    The check before ip_route_input() is changed to ignore metadata dsts
    and drop the dst inside the routing function, which allows metadata to
    be interpreted in a later commit.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • ip_route_input() unconditionally overwrites the dst. Hide the original
    dst attached to the skb by calling skb_dst_set(skb, NULL) prior to
    ip_route_input().

    Reported-by: Julian Anastasov
    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Rename the tunnel metadata data structures currently internal to
    OVS and make them generic for use by all IP tunnels.

    Both structures are kernel internal and will stay that way. Their
    members are exposed to user space through individual Netlink
    attributes by OVS. It will therefore be possible to extend/modify
    these structures without affecting user ABI.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • This implementation uses lwtunnel infrastructure to register
    hooks for mpls tunnel encaps.

    It picks cues from iptunnel_encaps infrastructure and previous
    mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • This is similar to ipv4 redirect of dst output to lwtunnel
    output function for encapsulation and xmit.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • For input routes with tunnel encap state this patch redirects
    dst output functions to lwtunnel_output which later resolves to
    the corresponding lwtunnel output function.

    This has been tested to work with mpls ip tunnels.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • This patch introduces the lwtunnel_output function, which calls the
    corresponding lwtunnel's output function to xmit the packet.

    It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and
    ipv6 respectively today. But this is subject to change when lwtstate will
    reside in dst or dst_metadata (as per upstream discussions).

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • This patch adds support in ipv6 fib functions to parse Netlink
    RTA encap attributes and attach encap state data to rt6_info.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • This patch adds support in ipv4 fib functions to parse user
    provided encap attributes and attach encap state data to fib_nh
    and rtable.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Provides infrastructure to parse/dump/store encap information for
    light weight tunnels like mpls. Encap information for such tunnels
    is associated with fib routes.

    This infrastructure is based on previous suggestions from
    Eric Biederman to follow the xfrm infrastructure.

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • When ip_frag_queue() computes positions, it assumes that the passed
    sk_buff does not contain L2 headers.

    However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly
    functions can be called on outgoing packets that contain L2 headers.

    Also, IPv4 checksum is not corrected after reassembly.

    Fixes: 7736d33f4262 ("packet: Add pre-defragmentation support for ipv4 fanouts.")
    Signed-off-by: Edward Hyunkoo Jee
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Edward Hyunkoo Jee
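
Correcting the IPv4 header after reassembly means storing the new total length and recomputing the header checksum. The sketch below does this on a raw byte buffer using the standard Internet checksum (RFC 1071); it only illustrates the arithmetic, is not the kernel code, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071) over an IPv4 header of hdr_len bytes.
 * For computing a fresh checksum, bytes 10 and 11 must be zero on entry.
 * Over a header that already contains a valid checksum the result is 0. */
static uint16_t demo_ipv4_checksum(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;

    assert(hdr_len >= 20 && hdr_len % 4 == 0);
    for (size_t i = 0; i < hdr_len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1]; /* 16-bit big-endian words */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);          /* fold the carries back in */
    return (uint16_t)~sum;
}

/* After reassembly: store the new total length, then recompute the
 * checksum over the updated header. */
static void demo_fix_header(uint8_t *hdr, size_t hdr_len, uint16_t tot_len)
{
    uint16_t csum;

    hdr[2] = (uint8_t)(tot_len >> 8);
    hdr[3] = (uint8_t)(tot_len & 0xff);
    hdr[10] = 0;
    hdr[11] = 0;
    csum = demo_ipv4_checksum(hdr, hdr_len);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}
```

If a layer 2 header precedes the IP header in the buffer, the pointer passed in has to be advanced past it first, which is the position assumption the commit message refers to.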
     

21 Jul, 2015

5 commits

  • Signed-off-by: Jakub Wilk
    Signed-off-by: David S. Miller

    Jakub Wilk
     
  • The following test case causes a NULL pointer dereference in cls_flow:

    tc filter add dev foo parent 1: handle 0x1 flow hash keys dst action ok
    tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
    flow hash keys mark action drop

    More precisely, two different panics are fixed. The first occurs
    because tcf_exts_init() is not called on the newly allocated filter
    when we do a replace. The second, uncovered after that, happens
    because the arguments of list_replace_rcu() are swapped: the old
    element needs to be the first argument and the new element the second.

    Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The following test case causes a NULL pointer dereference in cls_flower:

    tc filter add dev foo parent 1: flower eth_type ipv4 action ok flowid 1:1
    tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
    flower eth_type ipv6 action ok flowid 1:1

    The problem is that commit 77b9900ef53a ("tc: introduce Flower classifier")
    accidentally swapped the arguments of list_replace_rcu(); the old
    element needs to be the first argument and the new element the second.

    Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
    Signed-off-by: Daniel Borkmann
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The following test case causes a NULL pointer dereference in cls_bpf:

    FOO="1,6 0 0 4294967295,"
    tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
    tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
    bpf bytecode "$FOO" flowid 1:1 action drop

    The problem is that commit 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
    accidentally swapped the arguments of list_replace_rcu(); the old
    element needs to be the first argument and the new element the second.

    Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
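
The argument order that the three classifier fixes above depend on can be illustrated with a plain, non-RCU doubly linked list. The sketch below mirrors only the pointer manipulation of a list replace; it has no RCU publication or poisoning, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
    struct demo_node *next, *prev;
};

/* Initialise an empty circular list. */
static void demo_list_init(struct demo_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Append n at the tail of the list. */
static void demo_list_add_tail(struct demo_node *n, struct demo_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Replace old_node, which is on a list, by new_node, which is not.
 * The neighbours are read from old_node, so old_node must come first. */
static void demo_list_replace(struct demo_node *old_node,
                              struct demo_node *new_node)
{
    assert(old_node->next != NULL && old_node->prev != NULL); /* must be linked */
    new_node->next = old_node->next;
    new_node->prev = old_node->prev;
    new_node->prev->next = new_node;
    new_node->next->prev = new_node;
    old_node->next = NULL;
    old_node->prev = NULL;
}
```

With the arguments swapped, the neighbours would be read from the freshly allocated element, whose link pointers are not yet valid; in this sketch the assertion catches that case.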
     
  • Cookie ACK is always received by the association initiator, so fix the
    comment to avoid confusion.

    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner