16 Sep, 2014

1 commit

  • Currently we generate a blackhole route whenever we have
    matching policies but cannot resolve the states. Here we assume
    that dst_output() is called to kill the blackholed packets.
    Unfortunately this assumption is not true in all cases, so
    it is possible that these packets leave the system unwanted.

    We fix this by generating blackhole routes only from the
    route lookup functions, where we can guarantee a call to
    dst_output() afterwards.

    Fixes: 2774c131b1d ("xfrm: Handle blackhole route creation via afinfo.")
    Reported-by: Konstantinos Kolelis
    Signed-off-by: Steffen Klassert

    Steffen Klassert
     

13 Sep, 2014

1 commit

  • If we try to rmmod the driver for an interface while sockets with
    setsockopt(JOIN_ANYCAST) are alive, some refcounts aren't cleaned up
    and we get stuck on:

    unregister_netdevice: waiting for ens3 to become free. Usage count = 1

    If we LEAVE_ANYCAST/close everything before rmmod'ing, there is no
    problem.

    We need to perform a cleanup similar to the one for multicast in
    addrconf_ifdown(how == 1).

    Signed-off-by: Sabrina Dubroca
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     

08 Sep, 2014

1 commit

  • It is possible that the interface is already gone after joining
    the list of anycast addresses on this interface, as we don't hold a
    refcount for the device. In this case we are safe to ignore the error.

    What's more important, for API compatibility we should not
    change this behavior for applications even if it were correct.

    Fixes: commit a9ed4a2986e13011 ("ipv6: fix rtnl locking in setsockopt for anycast and multicast")
    Cc: Sabrina Dubroca
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    WANG Cong
     

06 Sep, 2014

3 commits

  • addrconf_get_prefix_route() ensures that we get the right route from the right table.

    Signed-off-by: Nicolas Dichtel
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • There is no reason to take a refcnt before deleting the peer address route.
    It's done some lines below for the local prefix route because
    inet6_ifa_finish_destroy() will release it at the end.
    For the peer address route, we want to free it right now.

    This bug has been introduced by commit
    caeaba79009c ("ipv6: add support of peer address").

    Signed-off-by: Nicolas Dichtel
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Calling setsockopt with IPV6_JOIN_ANYCAST or IPV6_LEAVE_ANYCAST
    triggers the assertion in addrconf_join_solict()/addrconf_leave_solict().

    ipv6_sock_ac_join(), ipv6_sock_ac_drop(), ipv6_sock_ac_close() need to
    take RTNL before calling ipv6_dev_ac_inc/dec. Same thing with
    ipv6_sock_mc_join(), ipv6_sock_mc_drop(), ipv6_sock_mc_close() before
    calling ipv6_dev_mc_inc/dec.

    This patch moves ASSERT_RTNL() up a level in the call stack.

    Signed-off-by: Cong Wang
    Signed-off-by: Sabrina Dubroca
    Reported-by: Tommi Rantala
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Sabrina Dubroca
     

03 Sep, 2014

2 commits

  • make defconfig reports:

    warning: (NETFILTER_XT_TARGET_LOG) selects NF_LOG_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NETFILTER_ADVANCED)

    Fixes: d79a61d netfilter: NETFILTER_XT_TARGET_LOG selects NF_LOG_*
    Reported-by: kbuild test robot
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • Pablo Neira Ayuso says:

    ====================
    pull request: Netfilter/IPVS fixes for net

    The following patchset contains seven Netfilter fixes for your net
    tree, they are:

    1) Make the NAT infrastructure independent of x_tables, some users are
    already starting to test nf_tables with NAT without enabling x_tables.
    Without this patch for Kconfig, there's a superfluous dependency
    between NAT and x_tables.

    2) Allow to use 0 in the cgroup match, the kernel rejects with -EINVAL
    with no good reason. From Daniel Borkmann.

    3) Select CONFIG_NF_NAT from the nf_tables NAT expression, this also
    resolves another NAT dependency with x_tables.

    4) Use HAVE_JUMP_LABEL instead of CONFIG_JUMP_LABEL in the Netfilter hook
    code as elsewhere in the kernel to resolve toolchain problems, from
    Zhouyi Zhou.

    5) Use iptunnel_handle_offloads() to set up tunnel encapsulation
    depending on the offload capabilities, reported by Alex Gartrell,
    patch from Julian Anastasov.

    6) Fix wrong family when registering the ip_vs_local_reply6() hook,
    also from Julian.

    7) Select the NF_LOG_* symbols from NETFILTER_XT_TARGET_LOG. Rafał
    Miłecki reported that when jumping from 3.16 to 3.17-rc, his log
    target is not selected anymore due to changes in the previous
    development cycle to accommodate the full logging support for
    nf_tables.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

23 Aug, 2014

1 commit

  • The function fib6_commit_metrics() allocates a piece of memory in mode
    GFP_KERNEL while holding an atomic lock from higher up in the stack, in
    the function __ip6_ins_rt(). This produces the following BUG:

    > BUG: sleeping function called from invalid context at mm/slub.c:1250
    > in_atomic(): 1, irqs_disabled(): 0, pid: 2909, name: dhcpcd
    > 2 locks held by dhcpcd/2909:
    > #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
    > #1: (&tb->tb6_lock){++--+.}, at: [] ip6_route_add+0x65a/0x800
    > CPU: 1 PID: 2909 Comm: dhcpcd Not tainted 3.17.0-rc1 #1
    > Hardware name: ASUS All Series/Q87T, BIOS 0216 10/16/2013
    > 0000000000000008 ffff8800c8f13858 ffffffff81af135a 0000000000000000
    > ffff880212202430 ffff8800c8f13878 ffffffff810f8d3a ffff880212202c98
    > 0000000000000010 ffff8800c8f138c8 ffffffff8121ad0e 0000000000000001
    > Call Trace:
    > [] dump_stack+0x4e/0x68
    > [] __might_sleep+0x10a/0x120
    > [] kmem_cache_alloc_trace+0x4e/0x190
    > [] ? fib6_commit_metrics+0x66/0x110
    > [] fib6_commit_metrics+0x66/0x110
    > [] fib6_add+0x883/0xa80
    > [] ? ip6_route_add+0x65a/0x800
    > [] ip6_route_add+0x675/0x800
    > [] ? ip6_route_add+0x6a/0x800
    > [] inet6_rtm_newroute+0x5c/0x80
    > [] rtnetlink_rcv_msg+0x211/0x260
    > [] ? rtnl_lock+0x17/0x20
    > [] ? lock_release_holdtime+0x28/0x180
    > [] ? rtnl_lock+0x17/0x20
    > [] ? __rtnl_unlock+0x20/0x20
    > [] netlink_rcv_skb+0x6e/0xd0
    > [] rtnetlink_rcv+0x25/0x40
    > [] netlink_unicast+0xd9/0x180
    > [] netlink_sendmsg+0x700/0x770
    > [] ? local_clock+0x25/0x30
    > [] sock_sendmsg+0x6c/0x90
    > [] ? might_fault+0xa3/0xb0
    > [] ? verify_iovec+0x7d/0xf0
    > [] ___sys_sendmsg+0x37e/0x3b0
    > [] ? trace_hardirqs_on_caller+0x185/0x220
    > [] ? mutex_unlock+0xe/0x10
    > [] ? netlink_insert+0xbc/0xe0
    > [] ? netlink_autobind.isra.30+0x125/0x150
    > [] ? netlink_autobind.isra.30+0x60/0x150
    > [] ? netlink_bind+0x159/0x230
    > [] ? might_fault+0x5a/0xb0
    > [] ? SYSC_bind+0x7e/0xd0
    > [] __sys_sendmsg+0x4d/0x80
    > [] SyS_sendmsg+0x12/0x20
    > [] system_call_fastpath+0x16/0x1b

    Fix this by replacing the GFP_KERNEL mode with GFP_ATOMIC.

    Signed-off-by: Benjamin Block
    Acked-by: David Rientjes
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Benjamin Block
     

19 Aug, 2014

1 commit

  • Currently, the NAT configs depend on iptables and ip6tables. However,
    users should be capable of enabling NAT for nft without having to
    switch on iptables.

    Fix this by adding new specific IP_NF_NAT and IP6_NF_NAT config
    switches for iptables and ip6tables NAT support. I have also moved
    the original NF_NAT_IPV4 and NF_NAT_IPV6 configs out of the scope
    of iptables to make them independent of it.

    This patch also adds NETFILTER_XT_NAT which selects the xt_nat
    combo that provides snat/dnat for iptables. We cannot use NF_NAT
    anymore since nf_tables can select this.

    Reported-by: Matteo Croce
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

15 Aug, 2014

2 commits

  • Make sure we use the correct address-family-specific function for
    handling MTU reductions from within tcp_release_cb().

    Previously AF_INET6 sockets were incorrectly always using the IPv6
    code path when sometimes they were handling IPv4 traffic and thus had
    an IPv4 dst.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Diagnosed-by: Willem de Bruijn
    Fixes: 563d34d057862 ("tcp: dont drop MTU reduction indications")
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • As of 4fddbf5d78 ("sit: strictly restrict incoming traffic to tunnel link device"),
    when looking up a tunnel, tunnel's underlying interface (t->parms.link)
    is verified to match incoming traffic's ingress device.

    However the comparison was incorrectly based on skb->dev->iflink.

    Instead, dev->ifindex should be used, which correctly represents the
    interface from which the IP stack hands the ipip6 packets.

    This allows setting up sit tunnels bound to vlan interfaces (otherwise
    incoming ipip6 traffic on the vlan interface was dropped due to
    ipip6_tunnel_lookup match failure).
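
    A minimal Python model of the comparison, not the kernel code. `NetDev`,
    `Tunnel`, and both helper functions are hypothetical names. It assumes,
    as is typical for a VLAN device, that `iflink` holds the ifindex of the
    underlying device while `ifindex` identifies the VLAN device itself, and
    that a link value of 0 means the tunnel is not bound to any device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetDev:
    name: str
    ifindex: int  # index of this device
    iflink: int   # index of the lower device (equals ifindex when there is none)


@dataclass(frozen=True)
class Tunnel:
    link: int  # models t->parms.link; 0 is assumed to mean "not bound"


def matches_old(t: Tunnel, ingress: NetDev) -> bool:
    # Pre-fix comparison as described above: based on iflink.
    return t.link == 0 or t.link == ingress.iflink


def matches_new(t: Tunnel, ingress: NetDev) -> bool:
    # Post-fix comparison: based on the ingress device's own ifindex.
    return t.link == 0 or t.link == ingress.ifindex


eth0 = NetDev("eth0", ifindex=2, iflink=2)
vlan10 = NetDev("eth0.10", ifindex=7, iflink=2)  # stacked on eth0
sit_on_vlan = Tunnel(link=vlan10.ifindex)

print(matches_old(sit_on_vlan, vlan10))  # False: the lookup misses
print(matches_new(sit_on_vlan, vlan10))  # True
```

    For a tunnel bound to a plain device such as eth0 both comparisons agree,
    which is consistent with the problem only showing up on stacked devices.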

    Signed-off-by: Shmulik Ladkani
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Shmulik Ladkani
     

07 Aug, 2014

3 commits

  • Merge incoming from Andrew Morton:
    - Various misc things.
    - arch/sh updates.
    - Part of ocfs2. Review is slow.
    - Slab updates.
    - Most of -mm.
    - printk updates.
    - lib/ updates.
    - checkpatch updates.

    * emailed patches from Andrew Morton: (226 commits)
    checkpatch: update $declaration_macros, add uninitialized_var
    checkpatch: warn on missing spaces in broken up quoted
    checkpatch: fix false positives for --strict "space after cast" test
    checkpatch: fix false positive MISSING_BREAK warnings with --file
    checkpatch: add test for native c90 types in unusual order
    checkpatch: add signed generic types
    checkpatch: add short int to c variable types
    checkpatch: add for_each tests to indentation and brace tests
    checkpatch: fix brace style misuses of else and while
    checkpatch: add --fix option for a couple OPEN_BRACE misuses
    checkpatch: use the correct indentation for which()
    checkpatch: add fix_insert_line and fix_delete_line helpers
    checkpatch: add ability to insert and delete lines to patch/file
    checkpatch: add an index variable for fixed lines
    checkpatch: warn on break after goto or return with same tab indentation
    checkpatch: emit a warning on file add/move/delete
    checkpatch: add test for commit id formatting style in commit log
    checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
    checkpatch: improve "no space after cast" test
    checkpatch: allow multiple const * types
    ...

    Linus Torvalds
     
  • All other add functions for lists have the new item as first argument
    and the position where it is added as second argument. This was changed
    for no good reason in this function and makes using it unnecessarily
    confusing.

    The name was changed to hlist_add_behind() to cause unconverted code to
    generate a compile error instead of using the wrong parameter order.
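
    The argument convention can be illustrated with a small Python model of
    a singly linked list. Only the ordering (new item first, existing
    position second) is taken from the text above; `Node`, `add_behind`, and
    `values` are hypothetical names and do not reproduce the kernel's hlist
    layout.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


def add_behind(new, prev):
    """Insert `new` directly after `prev` (new item first, position second)."""
    new.next = prev.next
    prev.next = new


def values(head):
    """Collect the values of the list starting at `head`."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


a, c = Node("a"), Node("c")
a.next = c
add_behind(Node("b"), a)
print(values(a))  # ['a', 'b', 'c']
```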

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Acked-by: Jeff Kirsher [intel driver bits]
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup
    doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is
    no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these
    calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock:
    from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv.

    Signed-off-by: Dmitry Popov
    Signed-off-by: David S. Miller

    Dmitry Popov
     

06 Aug, 2014

2 commits

  • Conflicts:
    drivers/net/Makefile
    net/ipv6/sysctl_net_ipv6.c

    Two ipv6_table_template[] additions overlap, so the index
    of the ipv6_table[x] assignments needed to be adjusted.

    In the drivers/net/Makefile case, we've gotten rid of the
    garbage whereby we had to list every single USB networking
    driver in the top-level Makefile, there is just one
    "USB_NETWORKING" that guards everything.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Datagrams timestamped on transmission can coexist in the kernel stack
    and be reordered in packet scheduling. When reading looped datagrams
    from the socket error queue it is not always possible to uniquely
    correlate looped data with the original send() call (for application
    level retransmits). Even if possible, it may be expensive and complex,
    requiring packet inspection.

    Introduce a data-independent ID mechanism to associate timestamps with
    send calls. Pass an ID alongside the timestamp in field ee_data of
    sock_extended_err.

    The ID is a simple 32 bit unsigned int that is associated with the
    socket and incremented on each send() call for which software tx
    timestamp generation is enabled.

    The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
    avoid changing ee_data for existing applications that expect it to be 0.
    The counter is reset each time the flag is reenabled. Reenabling
    does not change the ID of already submitted data. It is possible
    to receive out of order IDs if the timestamp stream is not quiesced
    first.
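
    The counter semantics described above can be modeled in a few lines of
    Python. This is a toy model, not a real socket: `TsSocketModel` and its
    methods are hypothetical, and starting the sequence at 0 is an
    assumption that the text does not state.

```python
U32_MASK = 0xFFFFFFFF  # the ID is a 32-bit unsigned value


class TsSocketModel:
    """Toy model of the per-socket ID counter; not a real socket."""

    def __init__(self):
        self.opt_id = False
        self.counter = 0

    def set_opt_id(self, enabled):
        # Per the description, the counter restarts when the flag is (re)enabled.
        if enabled and not self.opt_id:
            self.counter = 0
        self.opt_id = enabled

    def send(self, tx_timestamping=True):
        """Return the value that would be reported in ee_data for this send."""
        if not (self.opt_id and tx_timestamping):
            return 0  # existing applications keep seeing 0
        ident = self.counter
        self.counter = (self.counter + 1) & U32_MASK
        return ident


s = TsSocketModel()
s.set_opt_id(True)
first = [s.send() for _ in range(3)]
s.set_opt_id(False)
s.set_opt_id(True)  # re-enabling restarts the sequence
second = s.send()
print(first, second)  # [0, 1, 2] 0
```

    An application would keep its own map from ID to send() call and look
    up the ID delivered with each timestamp, instead of inspecting the
    looped packet data.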

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

03 Aug, 2014

5 commits


01 Aug, 2014

2 commits

  • Signed-off-by: Duan Jiong
    Signed-off-by: David S. Miller

    Duan Jiong
     
  • When dealing with ICMPv[46] error messages, the functions icmp_socket_deliver()
    and icmpv6_notify() do some validity checks on the packet's length, but then some
    protocols check the packet's length redundantly. So remove those duplicated
    statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in
    function icmp_socket_deliver() and icmpv6_notify() respectively.

    In addition, add the missing counter in udp6/udplite6 when the socket is NULL.

    Signed-off-by: Duan Jiong
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Duan Jiong
     

31 Jul, 2014

2 commits


30 Jul, 2014

1 commit

  • We create a proc dir for each network device, which causes
    conflicts when a device is named "all" or "default".

    Rather than emitting an ugly kernel warning, we could just
    fail earlier by checking the device name.
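
    A minimal sketch of such an early check, written in Python for
    illustration. `RESERVED_NAMES` and `check_dev_name` are hypothetical
    names, and the error returned by the kernel is not stated above, so a
    plain ValueError stands in for it.

```python
RESERVED_NAMES = frozenset({"all", "default"})


def check_dev_name(name):
    """Raise ValueError for names that would collide with the fixed proc entries."""
    if name in RESERVED_NAMES:
        raise ValueError(f"device name {name!r} is reserved")
    return name


print(check_dev_name("eth0"))  # eth0
```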

    Reported-by: Stephane Chazelas
    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

29 Jul, 2014

1 commit

  • In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
    Jedidiah describe ways of exploiting Linux IP identifier generation to
    infer whether two machines are exchanging packets.

    With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
    changed IP id generation, but this does not really prevent this
    side-channel technique.

    This patch adds a random amount of perturbation so that IP identifiers
    for a given destination [1] are no longer monotonically increasing after
    an idle period.

    Note that prandom_u32_max(1) returns 0, so if the generator is used at
    most once per jiffy, this patch inserts no hole in the ID sequence and
    does not increase collision probability.

    This is jiffies based, so in the worst case (HZ=1000), the id can
    rollover after ~65 seconds of idle time, which should be fine.

    We also change the hash used in __ip_select_ident() to not only hash
    on daddr, but also saddr and protocol, so that ICMP probes can not be
    used to infer information for other protocols.

    For IPv6, adds saddr into the hash as well, but not nexthdr.
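
    The perturbation can be sketched as a small Python model. It is an
    illustration of the idea described above rather than the kernel
    implementation: `IdentBucket`, `reserve`, and `bucket_key` are
    hypothetical names, the tick counter stands in for jiffies, and
    Python's built-in hash() stands in for the kernel hash.

```python
import random

ID_MASK = 0xFFFF  # the IPv4 identification field is 16 bits wide


class IdentBucket:
    """Toy model of one ID generator slot."""

    def __init__(self, rng=None):
        self.next_id = 0
        self.stamp = 0  # tick of the last use
        self.rng = rng or random.Random()

    def reserve(self, now, segs=1):
        """Return the first ID of a run of `segs` IDs reserved at tick `now`."""
        elapsed = now - self.stamp
        delta = 0
        if elapsed > 0:
            # randrange(1) is always 0, mirroring the prandom_u32_max(1)
            # remark: one use per tick leaves no hole in the sequence.
            delta = self.rng.randrange(elapsed)
            self.stamp = now
        first = (self.next_id + delta) & ID_MASK
        self.next_id = (first + segs) & ID_MASK
        return first


def bucket_key(daddr, saddr, protocol):
    # The IPv4 hash covers all three fields instead of daddr alone.
    return hash((daddr, saddr, protocol))


b = IdentBucket(random.Random(0))
busy = [b.reserve(now=t) for t in (1, 2, 3)]  # one use per tick
print(busy)  # [0, 1, 2]
after_idle = b.reserve(now=3 + 5000)  # 5000 idle ticks
print(3 <= after_idle < 3 + 5000)  # True
```

    The longer the idle period, the wider the range the next ID is drawn
    from, so an observer can no longer count the packets sent in between by
    sampling IDs before and after.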

    If I ping the patched target, we can see that IDs are now hard to predict.

    21:57:11.008086 IP (...)
    A > target: ICMP echo request, seq 1, length 64
    21:57:11.010752 IP (... id 2081 ...)
    target > A: ICMP echo reply, seq 1, length 64

    21:57:12.013133 IP (...)
    A > target: ICMP echo request, seq 2, length 64
    21:57:12.015737 IP (... id 3039 ...)
    target > A: ICMP echo reply, seq 2, length 64

    21:57:13.016580 IP (...)
    A > target: ICMP echo request, seq 3, length 64
    21:57:13.019251 IP (... id 3437 ...)
    target > A: ICMP echo reply, seq 3, length 64

    [1] TCP sessions use a per-flow ID generator not changed by this patch.

    Signed-off-by: Eric Dumazet
    Reported-by: Jeffrey Knockel
    Reported-by: Jedidiah R. Crandall
    Cc: Willy Tarreau
    Cc: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Jul, 2014

9 commits

  • This patch makes init_net's high_thresh limit the maximum for all
    namespaces, thus introducing a global memory limit threshold equal to the
    sum of the individual high_thresh limits which are capped.
    It also introduces some sane minimums for low_thresh as it shouldn't be
    able to drop below 0 (or > high_thresh in the unsigned case), and
    overall low_thresh should not ever be above high_thresh, so we make the
    following relations for a namespace:
    init_net:
    high_thresh = max(not capped), min(init_net low_thresh)
    low_thresh = max(init_net high_thresh), min(0)

    all other namespaces:
    high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
    low_thresh = max(namespace's high_thresh), min(0)

    The major issue with having low_thresh > high_thresh is that we'll
    schedule eviction but never evict anything and thus rely only on the
    timers.
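
    The bounds listed above can be written as two small predicates. This is
    a Python sketch with hypothetical function names; it only reports
    whether a proposed value lies inside the stated bounds and makes no
    claim about how the kernel reacts to an out-of-range write. The 4 MiB
    figure in the usage lines is an arbitrary example value.

```python
def high_thresh_ok(value, ns_low, init_high=None):
    """Check a proposed high_thresh; init_high is None for init_net itself."""
    if value < ns_low:
        return False  # must not drop below the namespace's low_thresh
    return init_high is None or value <= init_high


def low_thresh_ok(value, ns_high):
    """Check a proposed low_thresh against the namespace's own high_thresh."""
    return 0 <= value <= ns_high


INIT_HIGH = 4 * 1024 * 1024  # arbitrary example value
print(high_thresh_ok(8 * 1024 * 1024, ns_low=0, init_high=INIT_HIGH))  # False
print(high_thresh_ok(8 * 1024 * 1024, ns_low=0))  # True (init_net, not capped)
print(low_thresh_ok(5, ns_high=4))  # False
```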

    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Rehash is a rare operation, so don't force readers to take
    the read-side rwlock.

    Instead, we only have to detect the (rare) case where
    the secret was altered while we are trying to insert
    a new inetfrag queue into the table.

    If it was changed, drop the bucket lock and recompute
    the hash to get the 'new' chain bucket that we have to
    insert into.

    Joint work with Nikolay Aleksandrov.

    Signed-off-by: Florian Westphal
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • merge functionality into the eviction workqueue.

    Instead of rebuilding every n seconds, take advantage of the upper
    hash chain length limit.

    If we hit it, mark table for rebuild and schedule workqueue.
    To prevent frequent rebuilds when we're completely overloaded,
    don't rebuild more than once every 5 seconds.
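
    A rough Python model of the scheduling rule follows. `FragTable` and
    its methods are hypothetical names, `MAX_DEPTH` is an illustrative
    value, and only the 5 second minimum gap is taken from the text above.

```python
MAX_DEPTH = 128        # hypothetical chain length limit
MIN_REBUILD_GAP = 5.0  # seconds, from the description above


class FragTable:
    def __init__(self):
        self.rebuild_pending = False
        self.last_rebuild = None  # time of the last completed rebuild

    def note_chain_length(self, depth):
        """Called on lookup; returns True if the worker should be scheduled."""
        if depth > MAX_DEPTH:
            self.rebuild_pending = True
        return self.rebuild_pending

    def worker(self, now):
        """Run from the work queue; returns True if a rebuild happened."""
        if not self.rebuild_pending:
            return False
        if self.last_rebuild is not None and now - self.last_rebuild < MIN_REBUILD_GAP:
            return False  # too soon: stay pending and try again later
        self.rebuild_pending = False
        self.last_rebuild = now
        return True


t = FragTable()
t.note_chain_length(MAX_DEPTH + 1)
print(t.worker(now=100.0))  # True: first rebuild
t.note_chain_length(MAX_DEPTH + 1)
print(t.worker(now=102.0))  # False: within 5 seconds of the last rebuild
print(t.worker(now=105.0))  # True
```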

    ipfrag_secret_interval sysctl is now obsolete and has been marked as
    deprecated, it still can be changed so scripts won't be broken but it
    won't have any effect. A comment is left above each unused secret_timer
    variable to avoid confusion.

    Joint work with Nikolay Aleksandrov.

    Signed-off-by: Florian Westphal
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • no longer used.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • The 'nqueues' counter is protected by the lru list lock,
    once that's removed this needs to be converted to an atomic
    counter. Given this isn't used for anything except for
    reporting it to userspace via /proc, just remove it.

    We still report the memory currently used by fragment
    reassembly queues.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • When the high_thresh limit is reached we try to toss the 'oldest'
    incomplete fragment queues until memory limits are below the low_thresh
    value. This happens in softirq/packet processing context.

    This has two drawbacks:

    1) processors might evict a queue that was about to be completed
    by another cpu, because they will compete wrt. resource usage and
    resource reclaim.

    2) LRU list maintenance is expensive.

    But when constantly overloaded, even the 'least recently used' element is
    recent, so removing 'lru' queue first is not 'fairer' than removing any
    other fragment queue.

    This moves eviction out of the fast path:

    When the low threshold is reached, a work queue is scheduled
    which then iterates over the table and removes the queues that exceed
    the memory limits of the namespace. It sets a new flag called
    INET_FRAG_EVICTED on the evicted queues so the proper counters will get
    incremented when the queue is forcefully expired.

    When the high threshold is reached, no more fragment queues are
    created until we're below the limit again.
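
    The split between fast path and deferred eviction can be sketched as a
    Python model. All names are hypothetical, memory is counted in
    arbitrary units, and the exact comparisons (strictly above low_thresh,
    at or above high_thresh) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FragQueue:
    mem: int
    evicted: bool = False  # models the INET_FRAG_EVICTED flag


@dataclass
class NetnsFrags:
    low_thresh: int
    high_thresh: int
    queues: list = field(default_factory=list)

    def mem(self):
        return sum(q.mem for q in self.queues)

    def add_queue(self, size):
        """Fast path: returns (queue or None, whether to schedule the worker)."""
        if self.mem() >= self.high_thresh:
            return None, True  # high threshold reached: refuse new queues
        q = FragQueue(size)
        self.queues.append(q)
        return q, self.mem() > self.low_thresh

    def evict_worker(self):
        """Deferred work: drop queues until usage is back at low_thresh."""
        expired = []
        while self.queues and self.mem() > self.low_thresh:
            q = self.queues.pop(0)  # table order, no LRU bookkeeping
            q.evicted = True
            expired.append(q)
        return expired


ns = NetnsFrags(low_thresh=2, high_thresh=4)
results = [ns.add_queue(1) for _ in range(5)]
print(results[4][0])  # None: the fifth queue is refused
expired = ns.evict_worker()
print(len(expired), ns.mem())  # 2 2
```

    The fast path never evicts anything itself. It either creates a queue
    or refuses, and leaves reclaim to the worker.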

    The LRU list is now unused and will be removed in a followup patch.

    Joint work with Nikolay Aleksandrov.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • First step to move eviction handling into a work queue.

    We lose two spots that accounted evicted fragments in MIB counters.

    Accounting will be restored since the upcoming work-queue evictor
    invokes the frag queue timer callbacks instead.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • hide actual hash size from individual users: The _find
    function will now fold the given hash value into the required range.
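
    One common way to fold a hash into a table range is to mask it with the
    table size minus one. The Python sketch below assumes a power-of-two
    table size. `HASH_SIZE` and `fold_hash` are hypothetical, and the text
    above does not say which folding method the patch uses.

```python
HASH_SIZE = 64  # hypothetical table size; must be a power of two for masking


def fold_hash(h):
    """Map an arbitrary 32-bit hash onto a bucket index in [0, HASH_SIZE)."""
    return h & (HASH_SIZE - 1)


print(fold_hash(0xDEADBEEF))  # 47
```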

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

25 Jul, 2014

2 commits

  • After 11878b40e [net-timestamp: SOCK_RAW and PING timestamping], this comment
    became obsolete, since the code checks not only UDP sockets but also RAW sockets;
    the code is clear enough and does not need the comment.

    Signed-off-by: Li RongQing
    Signed-off-by: David S. Miller

    Li RongQing
     
  • In this file, function names are otherwise used as pointers without &.

    A simplified version of the Coccinelle semantic patch that makes this
    change is as follows:

    //
    @r@
    identifier f;
    @@

    f(...) { ... }

    @@
    identifier r.f;
    @@

    - &f
    + f
    //

    Signed-off-by: Himangi Saraogi
    Acked-by: Julia Lawall
    Signed-off-by: David S. Miller

    Himangi Saraogi
     

24 Jul, 2014

1 commit