02 Feb, 2016

1 commit

  • Pull networking fixes from David Miller:
    "This looks like a lot but it's a mixture of regression fixes as well
    as fixes for longer standing issues.

    1) Fix on-channel cancellation in mac80211, from Johannes Berg.

    2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables
    module, from Eric Dumazet.

    3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric
    Dumazet.

    4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is
    bound, from Craig Gallek.

    5) GRO key comparisons don't take lightweight tunnels into account,
    from Jesse Gross.

    6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric
    Dumazet.

    7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we
    register them, otherwise the NEWLINK netlink message is missing
    the proper attributes. From Thadeu Lima de Souza Cascardo.

    8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido
    Schimmel.

    9) Handle fragments properly in ipv4 early socket demux, from Eric
    Dumazet.

    10) Don't ignore the ifindex key specifier on ipv6 output route
    lookups, from Paolo Abeni"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
    tcp: avoid cwnd undo after receiving ECN
    irda: fix a potential use-after-free in ircomm_param_request
    net: tg3: avoid uninitialized variable warning
    net: nb8800: avoid uninitialized variable warning
    net: vxge: avoid unused function warnings
    net: bgmac: clarify CONFIG_BCMA dependency
    net: hp100: remove unnecessary #ifdefs
    net: davinci_cpdma: use dma_addr_t for DMA address
    ipv6/udp: use sticky pktinfo egress ifindex on connect()
    ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
    netlink: not trim skb for mmaped socket when dump
    vxlan: fix a out of bounds access in __vxlan_find_mac
    net: dsa: mv88e6xxx: fix port VLAN maps
    fib_trie: Fix shift by 32 in fib_table_lookup
    net: moxart: use correct accessors for DMA memory
    ipv4: ipconfig: avoid unused ic_proto_used symbol
    bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
    bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
    bnxt_en: Ring free response from close path should use completion ring
    net_sched: drr: check for NULL pointer in drr_dequeue
    ...

    Linus Torvalds
     

31 Jan, 2016

1 commit


30 Jan, 2016

2 commits

  • The current implementation of ip6_dst_lookup_tail() basically
    ignores the egress ifindex match. If the saddr is set,
    ip6_route_output() purposefully ignores flowi6_oif, due to
    commit d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
    flag if saddr set"). If the saddr is 'any', the first route lookup
    in ip6_dst_lookup_tail() fails, but upon failure a second lookup is
    performed with saddr set, thus ignoring the ifindex constraint.

    This commit adds an output route lookup function variant, which
    allows the caller to specify lookup flags, and modifies
    ip6_dst_lookup_tail() to enforce the ifindex match on the second
    lookup via said helper.

    ip6_route_output() now becomes a static inline function built on
    top of ip6_route_output_flags(); as a side effect, out-of-tree
    modules now need a GPL license to access the output route lookup
    functionality.

    Signed-off-by: Paolo Abeni
    Acked-by: Hannes Frederic Sowa
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Paolo Abeni
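
    The wrapper pattern described in this commit can be sketched in
    userspace C. This is only an illustration of the shape of the change:
    struct flow, struct route, LOOKUP_F_IFACE, route_output() and
    route_output_flags() are hypothetical stand-ins, not the kernel's
    types or functions, and the lookup is a toy linear scan.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; only the shape matters here. */
struct flow { int oif; };        /* requested egress interface, 0 = any */
struct route { int ifindex; };

#define LOOKUP_F_IFACE 0x1       /* illustrative flag: result must use flow->oif */

static struct route routes[] = { { 1 }, { 2 } };

/* Flag-taking variant: the caller decides whether the ifindex is enforced. */
static struct route *route_output_flags(const struct flow *fl, int flags)
{
    for (size_t i = 0; i < sizeof(routes) / sizeof(routes[0]); i++) {
        if ((flags & LOOKUP_F_IFACE) && fl->oif &&
            routes[i].ifindex != fl->oif)
            continue;            /* wrong interface, keep looking */
        return &routes[i];
    }
    return NULL;
}

/* The original entry point becomes a thin inline wrapper with no flags. */
static inline struct route *route_output(const struct flow *fl)
{
    return route_output_flags(fl, 0);
}
```

    In this sketch, a caller that must honour the egress interface passes
    LOOKUP_F_IFACE explicitly, while existing callers of the wrapper keep
    their old behaviour.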
     
  • Signed-off-by: Jörg Thalheim
    Signed-off-by: David S. Miller

    Jörg Thalheim
     

29 Jan, 2016

3 commits

  • Having proper defines makes the code a bit more readable. It also
    avoids duplicating hard-coded values, since these are also needed
    when auto-allocating PSM values (in a subsequent patch).

    Signed-off-by: Johan Hedberg
    Signed-off-by: Marcel Holtmann

    Johan Hedberg
     
  • Now that refcnt is used to check whether a transport is alive, the
    dead field can be removed from sctp_transport.

    The traversal of transport_addr_list in the procfs dump uses
    list_for_each_entry_rcu, so there is no need to check whether the
    transport has been freed.

    sctp_generate_t3_rtx_event and sctp_generate_heartbeat_event are
    protected by the sock lock, so it is not necessary to check dead
    there either. Also, the timers are cancelled when
    sctp_transport_free() is called; it does not wait for refcnt to
    reach 0 to cancel them.

    Signed-off-by: Xin Long
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
     
  • Now when __sctp_lookup_association is running in BH, it will try to
    check if t->dead is set, but meanwhile other CPUs may be freeing this
    transport and this assoc and if it happens that
    __sctp_lookup_association checked t->dead a bit too early, it may think
    that the association is still good while it was already freed.

    So we fix this race by using atomic_add_unless in sctp_transport_hold.
    After we get one transport from hashtable, we will hold it only when
    this transport's refcnt is not 0, so that we can make sure t->asoc
    cannot be freed before we hold the asoc again.

    Note that sctp association is not freed using RCU so we can't use
    atomic_add_unless() with it as it may just be too late for that either.

    Fixes: 4f0087812648 ("sctp: apply rhashtable api to send/recv path")
    Reported-by: Vlad Yasevich
    Signed-off-by: Xin Long
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
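
    The "hold only when the refcnt is not 0" rule from this commit can be
    sketched in userspace C. This is a minimal illustration using C11
    atomics, not the kernel code: struct transport, transport_hold() and
    transport_put() are hypothetical names, and the compare-and-swap loop
    stands in for the kernel's atomic_add_unless().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a transport. */
struct transport {
    atomic_int refcnt;
};

/*
 * Take a reference only if the count is still non-zero.
 * Same semantics as "add 1 unless the value is 0", written as a
 * C11 compare-and-swap loop.
 */
static bool transport_hold(struct transport *t)
{
    int old = atomic_load(&t->refcnt);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
            return true;
    }
    return false; /* the last reference is already gone */
}

/* Drop a reference; returns true when the caller released the last one. */
static bool transport_put(struct transport *t)
{
    return atomic_fetch_sub(&t->refcnt, 1) == 1;
}
```

    A lookup path would call transport_hold() on the object it found and
    treat a false return as "not found", instead of reading a separate
    dead flag that can change right after the check.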
     

22 Jan, 2016

1 commit

  • The cgroup methods are no longer used after baac50bbc3cd ("net:
    tcp_memcontrol: simplify linkage between socket and page counter").
    The hunk to delete them was included in the original patch but must
    have gotten lost during conflict resolution on the way upstream.

    Fixes: baac50bbc3cd ("net: tcp_memcontrol: simplify linkage between socket and page counter")
    Signed-off-by: Johannes Weiner
    Signed-off-by: David S. Miller

    Johannes Weiner
     

21 Jan, 2016

4 commits

  • David S. Miller
     
  • GRO is currently not aware of tunnel metadata generated by lightweight
    tunnels and stored in the dst. This leads to two possible problems:
    * Incorrectly merging two frames that have different metadata.
    * Leaking of allocated metadata from merged frames.

    This avoids those problems by comparing the tunnel information before
    merging, similar to how we handle other metadata (such as vlan tags),
    and releasing any state when we are done.

    Reported-by: John
    Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
    Signed-off-by: Jesse Gross
    Acked-by: Eric Dumazet
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Jesse Gross
     
  • tcp_memcontrol.c only contains legacy memory.tcp.kmem.* file definitions
    and mem_cgroup->tcp_mem init/destroy stuff. This doesn't belong to
    network subsys. Let's move it to memcontrol.c. This also allows us to
    reuse generic code for handling legacy memcg files.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: "David S. Miller"
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This series adds accounting of the historical "kmem" memory consumers to
    the cgroup2 memory controller.

    These consumers include the dentry cache, the inode cache, kernel stack
    pages, and a few others that are pointed out in patch 7/8. The
    footprint of these consumers is directly tied to userspace activity in
    common workloads, and so they have to be part of the minimally viable
    configuration in order to present a complete feature to our users.

    The cgroup2 interface of the memory controller is far from complete, but
    this series, along with the socket memory accounting series, provides
    the final semantic changes for the existing memory knobs in the cgroup2
    interface, which is scheduled for initial release in the next merge
    window.

    This patch (of 8):

    Remove unused css argument from memcg_init_kmem().

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

20 Jan, 2016

2 commits

  • When we need to lock all buckets in the connection hashtable we'd attempt to
    lock 1024 spinlocks, which is way more preemption levels than supported by
    the kernel. Furthermore, this behavior was hidden by checking if lockdep is
    enabled and, if it was, using only 8 buckets(!).

    Fix this by using a global lock and synchronize all buckets on it when we
    need to lock them all. This is pretty heavyweight, but is only done when we
    need to resize the hashtable, and that doesn't happen often enough (or at all).

    Signed-off-by: Sasha Levin
    Acked-by: Jesper Dangaard Brouer
    Reviewed-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Sasha Levin
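
    One way to picture "a global lock that all buckets synchronize on" is
    the userspace sketch below. It is an assumption-laden illustration,
    not the patch itself: the test-and-set spinlock, the names sl_lock(),
    bucket_lock(), all_lock_acquire() and all_lock_release(), and the use
    of sequentially consistent C11 atomics are all choices made for this
    example.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_BUCKET_LOCKS 1024

/* Minimal test-and-set spinlock; a stand-in for a kernel spinlock. */
typedef struct { atomic_bool locked; } spinlock;

static void sl_lock(spinlock *l)
{
    while (atomic_exchange(&l->locked, true))
        ; /* busy-wait until the previous holder releases */
}

static void sl_unlock(spinlock *l)
{
    atomic_store(&l->locked, false);
}

static spinlock bucket_locks[NR_BUCKET_LOCKS];
static spinlock all_lock;      /* the single global lock */
static atomic_bool locks_all;  /* true while a global locker is active */

/* Take one bucket lock, backing off while a global locker is active. */
static void bucket_lock(spinlock *l)
{
    sl_lock(l);
    while (atomic_load(&locks_all)) {
        sl_unlock(l);
        /* Wait for the global locker to finish, then retry. */
        sl_lock(&all_lock);
        sl_unlock(&all_lock);
        sl_lock(l);
    }
}

/* Exclude all bucket users while holding only one lock at a time. */
static void all_lock_acquire(void)
{
    sl_lock(&all_lock);
    atomic_store(&locks_all, true);
    /* Pass through every bucket once so current holders drain out. */
    for (int i = 0; i < NR_BUCKET_LOCKS; i++) {
        sl_lock(&bucket_locks[i]);
        sl_unlock(&bucket_locks[i]);
    }
}

static void all_lock_release(void)
{
    atomic_store(&locks_all, false);
    sl_unlock(&all_lock);
}
```

    The global path never holds more than two locks at once, which is the
    point of the change; the cost is a full sweep over the buckets, which
    is acceptable only because it is rare.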
     
  • Marc Dionne discovered a NULL pointer dereference when setting
    SO_REUSEPORT on a socket after it is bound.
    This patch removes the assumption that at least one socket in the
    reuseport group is bound with the SO_REUSEPORT option before other
    bind calls occur.

    Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
    Reported-by: Marc Dionne
    Signed-off-by: Craig Gallek
    Tested-by: Marc Dionne
    Signed-off-by: David S. Miller

    Craig Gallek
     

18 Jan, 2016

1 commit

  • In file included from net/ipv4/tcp_ipv4.c:77 (and many more):
    include/net/tcp_memcontrol.h:5: warning: ‘struct cgroup_subsys’ declared inside parameter list
    include/net/tcp_memcontrol.h:5: warning: its scope is only this definition or declaration, which is probably not what you want

    Add forward declarations for all used structures to fix this.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: David S. Miller

    Geert Uytterhoeven
     

16 Jan, 2016

2 commits

  • Pull networking fixes from David Miller:
    "A quick set of bug fixes after the initial networking merge:

    1) Netlink multicast group storage allocator only was tested with
    nr_groups equal to 1, make it work for other values too. From
    Matti Vaittinen.

    2) Check build_skb() return value in macb and hip04_eth drivers, from
    Weidong Wang.

    3) Don't leak x25_asy on x25_asy_open() failure.

    4) More DMA map/unmap fixes in 3c59x from Neil Horman.

    5) Don't clobber IP skb control block during GSO segmentation, from
    Konstantin Khlebnikov.

    6) ECN helpers for ipv6 don't fixup the checksum, from Eric Dumazet.

    7) Fix SKB segment utilization estimation in xen-netback, from David
    Vrabel.

    8) Fix lockdep splat in bridge addrlist handling, from Nikolay
    Aleksandrov"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
    bgmac: Fix reversed test of build_skb() return value.
    bridge: fix lockdep addr_list_lock false positive splat
    net: smsc: Add support h8300
    xen-netback: free queues after freeing the net device
    xen-netback: delete NAPI instance when queue fails to initialize
    xen-netback: use skb to determine number of required guest Rx requests
    net: sctp: Move sequence start handling into sctp_transport_get_idx()
    ipv6: update skb->csum when CE mark is propagated
    net: phy: turn carrier off on phy attach
    net: macb: clear interrupts when disabling them
    sctp: support to lookup with ep+paddr in transport rhashtable
    net: hns: fixes no syscon error when init mdio
    dts: hisi: fixes no syscon fault when init mdio
    net: preserve IP control block during GSO segmentation
    fsl/fman: Delete one function call "put_device" in dtsec_config()
    hip04_eth: fix missing error handle for build_skb failed
    3c59x: fix another page map/single unmap imbalance
    3c59x: balance page maps and unmaps
    x25_asy: Free x25_asy on x25_asy_open() failure.
    mlxsw: fix SWITCHDEV_OBJ_ID_PORT_MDB
    ...

    Linus Torvalds
     
  • When a tunnel decapsulates the outer header, it has to comply
    with RFC 6040 and eventually propagate CE mark into inner header.

    It turns out IP6_ECN_set_ce() does not correctly update skb->csum
    for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure"
    messages and stack traces.

    Signed-off-by: Eric Dumazet
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
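
    Why a CE mark needs a checksum fixup can be shown with a small
    userspace sketch. With a CHECKSUM_COMPLETE style sum that covers the
    header, changing a header word without adjusting the sum leaves the
    two out of step. The helpers below are simplified stand-ins written
    for this example (32-bit ones' complement arithmetic on host-order
    words); set_ce_and_fix_csum() and csum_words() are hypothetical names
    and byte-order handling is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

#define ECN_CE 3u /* both ECN bits set */

/* Ones' complement addition with end-around carry. */
static uint32_t csum_add(uint32_t csum, uint32_t addend)
{
    uint32_t res = csum + addend;
    return res + (res < addend);
}

/* Subtracting a value is adding its complement. */
static uint32_t csum_sub(uint32_t csum, uint32_t addend)
{
    return csum_add(csum, ~addend);
}

/* Fold a 32-bit sum down to 16 bits (without inverting it). */
static uint16_t csum_fold(uint32_t csum)
{
    csum = (csum & 0xffff) + (csum >> 16);
    csum = (csum & 0xffff) + (csum >> 16);
    return (uint16_t)csum;
}

/* Sum over a buffer of 32-bit words, as a full recomputation would. */
static uint32_t csum_words(const uint32_t *w, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = csum_add(sum, w[i]);
    return sum;
}

/*
 * Set the CE bits in the first header word (treated as a host-order
 * value here) and return the sum adjusted for that change: remove the
 * old word, add the new one.
 */
static uint32_t set_ce_and_fix_csum(uint32_t *word, uint32_t csum)
{
    uint32_t from = *word;
    uint32_t to = from | (ECN_CE << 20);

    *word = to;
    return csum_add(csum_sub(csum, from), to);
}
```

    The incremental update avoids walking the packet again; it should
    give the same folded sum as a full recomputation over the modified
    data.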
     

15 Jan, 2016

7 commits

  • The unified hierarchy memory controller is going to use this jump label
    as well to control the networking callbacks. Move it to the memory
    controller code and give it a more generic name.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be any separate counters for socket memory consumed by
    protocols other than TCP in the future. Remove the indirection and link
    sockets directly to their owning memory cgroup.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There won't be a tcp control soft limit, so integrating the memcg code
    into the global skmem limiting scheme complicates things unnecessarily.
    Replace this with simple and clear charge and uncharge calls--hidden
    behind a jump label--to account skb memory.

    Note that this is not purely aesthetic: as a result of shoehorning the
    per-memcg code into the same memory accounting functions that handle the
    global level, the old code would compare the per-memcg consumption
    against the smaller of the per-memcg limit and the global limit. This
    allowed the total consumption of multiple sockets to exceed the global
    limit, as long as the individual sockets stayed within bounds. After
    this change, the code will always compare the per-memcg consumption to
    the per-memcg limit, and the global consumption to the global limit, and
    thus close this loophole.

    Without a soft limit, the per-memcg memory pressure state in sockets is
    generally questionable. However, we did it until now, so we continue to
    enter it when the hard limit is hit, and packets are dropped, to let
    other sockets in the cgroup know that they shouldn't grow their transmit
    windows, either. However, keep it simple in the new callback model and
    leave memory pressure lazily when the next packet is accepted (as
    opposed to doing it synchronously when packets are processed). When
    packets are dropped, network performance will already be in the toilet,
    so that should be a reasonable trade-off.

    As described above, consumption is now checked on the per-memcg level
    and the global level separately. Likewise, memory pressure states are
    maintained on both the per-memcg level and the global level, and a
    socket is considered under pressure when either level asserts as much.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
    but it only ever sets these entries to the value of the memory_allocated
    page_counter limit. Use the latter directly.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The number of allocated sockets is used for calculations in the soft
    limit phase, where packets are accepted but the socket is under memory
    pressure.
    Since there is no soft limit phase in tcp_memcontrol, and memory
    pressure is only entered when packets are already dropped, this is
    actually dead code. Remove it.

    As this is the last user of parent_cg_proto(), remove that too.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When a cgroup currently breaches its socket memory limit, it enters
    memory pressure mode for itself and its *ancestors*. This throttles
    transmission in unrelated sibling and cousin subtrees that have nothing
    to do with the breached limit.

    On the contrary, breaching a limit should make that group and its
    *children* enter memory pressure mode. But this happens already, albeit
    lazily: if an ancestor limit is breached, siblings will enter memory
    pressure on their own once the next packet arrives for them.

    So no additional hierarchy code is needed. Remove the bogus stuff.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When charging socket memory, the code currently checks only the local
    page counter for excess to determine whether the memcg is under socket
    pressure. But even if the local counter is fine, one of the ancestors
    could have breached its limit, which should also force this child to
    enter socket pressure. This currently doesn't happen.

    Fix this by using page_counter_try_charge() first. If that fails, it
    means that either the local counter or one of the ancestors is in
    excess of its limit, and the child should enter socket pressure.

    Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
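
    The idea of a hierarchical try-charge can be sketched as follows. This
    is a simplified, single-threaded illustration and not the kernel's
    page_counter implementation: struct counter and counter_try_charge()
    are hypothetical, and plain longs are used where the kernel uses
    atomics.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified counter; not the kernel's struct page_counter. */
struct counter {
    long usage;
    long limit;
    struct counter *parent; /* NULL at the root */
};

/*
 * Charge nr pages to c and to every ancestor. If any level would
 * exceed its limit, undo the charges made so far and report the
 * failing level through *fail.
 */
static bool counter_try_charge(struct counter *c, long nr,
                               struct counter **fail)
{
    struct counter *pos, *undo;

    for (pos = c; pos; pos = pos->parent) {
        if (pos->usage + nr > pos->limit) {
            /* Roll back the levels below the one that failed. */
            for (undo = c; undo != pos; undo = undo->parent)
                undo->usage -= nr;
            if (fail)
                *fail = pos;
            return false;
        }
        pos->usage += nr;
    }
    return true;
}
```

    A false return covers both cases named in the commit message: the
    local counter is over its limit, or some ancestor is. Either way the
    caller would put the child into socket pressure.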
     

12 Jan, 2016

1 commit


11 Jan, 2016

6 commits

  • Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
    can hide the ugly ifdef.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This is the final part required to namespaceify the tcp
    keep alive mechanism.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David S. Miller

    Nikolay Borisov
     
  • This is required to have full tcp keepalive mechanism namespace
    support.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David S. Miller

    Nikolay Borisov
     
  • Different net namespaces might have different requirements as to
    the keepalive time of tcp sockets. This might be required in cases
    where different firewall rules are in place which require tcp
    socket timeouts to be increased/decreased independently of the host.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: David S. Miller

    Nikolay Borisov
     
  • udp tunnel offloads tend to aggregate datagrams based on inner
    headers. gro engine gets notified by tunnel implementations about
    possible offloads. The match is solely based on the port number.

    Imagine a tunnel bound to port 53: the offloading will look into all
    DNS packets and try to aggregate them based on the inner data found
    within. This could lead to data corruption and malformed DNS packets.

    While this patch minimizes the problem and helps an administrator to find
    the issue by querying ip tunnel/fou, a better way would be to match on
    the specific destination ip address so if a user space socket is bound
    to the same address it will conflict.

    Cc: Tom Herbert
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • Define HW multicast entry: MAC and VID.
    Using a MAC address simplifies support for both IPv4 and IPv6.

    Signed-off-by: Elad Raz
    Signed-off-by: Ido Schimmel
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Elad Raz
     

09 Jan, 2016

3 commits

  • fib_multipath_hash() computes a hash using __be32 values, force
    cast these to u32 to pacify sparse.

    Signed-off-by: Lance Richardson
    Signed-off-by: David S. Miller

    Lance Richardson
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for net-next, they are:

    1) Release nf_tables objects on netns destructions via
    nft_release_afinfo().

    2) Destroy basechain and rules on netdevice removal in the new netdev
    family.

    3) Get rid of defensive check against removal of inactive objects in
    nf_tables.

    4) Pass down netns pointer to our existing nfnetlink callbacks, as well
    as commit() and abort() nfnetlink callbacks.

    5) Allow to invert limit expression in nf_tables, so we can throttle
    overlimit traffic.

    6) Add packet duplication for the netdev family.

    7) Add forward expression for the netdev family.

    8) Define pr_fmt() in conntrack helpers.

    9) Don't leave nfqueue configuration on inconsistent state in case of
    errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
    him.

    10) Skip queue option handling after unbind.

    11) Return error on unknown commands in both nfqueue and nflog.

    12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.

    13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
    from Carlos Falgueras.

    14) Add support for 64 bit byteordering changes in nf_tables, from
    Florian Westphal.

    15) Add conntrack byte/packet counter matching support to nf_tables,
    also from Florian.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • …etooth/bluetooth-next

    Johan Hedberg says:

    ====================
    pull request: bluetooth-next 2016-01-08

    Here's one more bluetooth-next pull request for the 4.5 kernel:

    - Support for CRC check and promiscuous mode for CC2520
    - Fixes to btmrvl driver
    - New ACPI IDs for hci_bcm driver
    - Limited Discovery support for the Bluetooth mgmt interface
    - Minor other cleanups here and there

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     

08 Jan, 2016

1 commit


07 Jan, 2016

2 commits


06 Jan, 2016

3 commits

  • The only user was removed in commit
    029f7f3b8701cc7a ("netfilter: ipv6: nf_defrag: avoid/free clone operations").

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • The transport hashtable will replace the association hashtable,
    so the association hashtable is no longer used in sctp. Drop the
    code for it.

    Signed-off-by: Xin Long
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
     
  • The transport hashtable will replace the association hashtable for
    transport lookups; the association is then obtained via t->asoc. The
    rhashtable APIs are used because they are resizable, scalable and
    use RCU.

    lport + rport + paddr will be the base hashkey to locate the chain,
    with net to protect one netns from another, then plus the laddr to
    compare to get the target.

    This patch provides the lookup functions:
    - sctp_epaddr_lookup_transport
    - sctp_addrs_lookup_transport

    hash/unhash functions:
    - sctp_hash_transport
    - sctp_unhash_transport

    init/destroy functions:
    - sctp_transport_hashtable_init
    - sctp_transport_hashtable_destroy

    Signed-off-by: Xin Long
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Xin Long
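
    The key layout described in this commit (hash on lport + rport + paddr
    with the netns mixed in, then compare including laddr) can be sketched
    in userspace C. Everything here is an assumption made for the
    example: the structs are hypothetical and IPv4 only, net_id stands in
    for the network namespace, and FNV-1a is an arbitrary stand-in hash,
    not what the kernel uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup key, IPv4 only. */
struct transport_key {
    uint32_t net_id; /* stands in for the network namespace */
    uint16_t lport;  /* local port */
    uint16_t rport;  /* remote port */
    uint32_t paddr;  /* peer address */
};

/* Hypothetical table entry: key plus the local address. */
struct transport_entry {
    struct transport_key key;
    uint32_t laddr;
};

/* Mix one 32-bit value into an FNV-1a hash, byte by byte. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xff;
        h *= 16777619u;
    }
    return h;
}

/* Hash on net + ports + peer address; laddr is not part of the hash. */
static uint32_t transport_hash(const struct transport_key *k)
{
    uint32_t h = 2166136261u;

    h = fnv1a_u32(h, k->net_id);
    h = fnv1a_u32(h, ((uint32_t)k->lport << 16) | k->rport);
    h = fnv1a_u32(h, k->paddr);
    return h;
}

/* Within a chain, the local address picks the actual target. */
static bool transport_match(const struct transport_entry *e,
                            const struct transport_key *k, uint32_t laddr)
{
    return e->key.net_id == k->net_id &&
           e->key.lport == k->lport &&
           e->key.rport == k->rport &&
           e->key.paddr == k->paddr &&
           e->laddr == laddr;
}
```

    Two transports that differ only in laddr land in the same chain and
    are told apart by the compare step, which matches the "hash to locate
    the chain, compare to get the target" split in the commit message.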