17 Apr, 2014

3 commits

  • Because the netdevice may be in another netns than the i/o netns, we should
    use the i/o netns instead of dev_net(dev).

    The variable 'tunnel' was used only to get 'itn', hence to simplify code I
    remove it and use 't' instead.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • In my special case, when a packet is redirected from veth0 to lo,
    its skb->dev->ifindex would be LOOPBACK_IFINDEX. Meanwhile we
    pass the hard-coded LOOPBACK_IFINDEX to fib_validate_source()
    in ip_route_input_slow(). This would cause the following check
    in fib_validate_source() to fail:

    (dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev))

    when rp_filter is disabled on loopback. As suggested by Julian,
    the caller should pass 0 here so that we will not end up by
    calling __fib_validate_source().

    Cc: Eric Biederman
    Cc: Julian Anastasov
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     
  • As suggested by Julian:

    Simply, flowi4_iif must not contain 0, it does not
    look logical to ignore all ip rules with specified iif.

    because in fib_rule_match() we do:

    if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
    goto out;

    flowi4_iif should be LOOPBACK_IFINDEX by default.

    We need to move LOOPBACK_IFINDEX to include/net/flow.h:

    1) It is mostly used by flowi_iif

    2) It fixes the following compile error, which appears once later
    patches use it in flow.h:

    In file included from include/linux/netfilter.h:277:0,
    from include/net/netns/netfilter.h:5,
    from include/net/net_namespace.h:21,
    from include/linux/netdevice.h:43,
    from include/linux/icmpv6.h:12,
    from include/linux/ipv6.h:61,
    from include/net/ipv6.h:16,
    from include/linux/sunrpc/clnt.h:27,
    from include/linux/nfs_fs.h:30,
    from init/do_mounts.c:32:
    include/net/flow.h: In function ‘flowi4_init_output’:
    include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)

    Cc: Eric Biederman
    Cc: Julian Anastasov
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
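
The rule-matching argument in the entry above can be illustrated with a small user-space model. This is only a sketch: `rule_sketch`, `flow_sketch` and `rule_iif_matches()` are hypothetical stand-ins for the kernel structures and for the check quoted from fib_rule_match(), not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* The kernel defines LOOPBACK_IFINDEX as 1; repeated here for the sketch. */
#define LOOPBACK_IFINDEX 1

/* Hypothetical, reduced stand-ins for struct fib_rule and struct flowi. */
struct rule_sketch { int iifindex; };
struct flow_sketch { int flowi_iif; };

/* Mirrors the quoted check: a rule with an iif only matches that iif. */
bool rule_iif_matches(const struct rule_sketch *rule,
                      const struct flow_sketch *fl)
{
    assert(rule && fl);
    if (rule->iifindex && rule->iifindex != fl->flowi_iif)
        return false;
    return true;
}
```

With flowi_iif left at 0, a rule bound to the loopback interface is skipped; with flowi_iif set to LOOPBACK_IFINDEX it matches, which is the default the commit message argues for.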
     

16 Apr, 2014

2 commits

  • In the dst->output() path for ipv4, the code assumes the skb it has to
    transmit is attached to an inet socket, specifically via
    ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
    provider of the packet is an AF_PACKET socket.

    The dst->output() method gets an additional 'struct sock *sk'
    parameter. This needs a cascade of changes so that this parameter can
    be propagated from vxlan to final consumer.

    Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
    Reported-by: lucien xin
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ip_queue_xmit() assumes the skb it has to transmit is attached to an
    inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
    changed l2tp to not change skb ownership and thus broke this assumption.

    One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
    so that we do not assume skb->sk points to the socket used by l2tp
    tunnel.

    Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
    Reported-by: Zhan Jianyu
    Tested-by: Zhan Jianyu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
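
Both entries for this date replace an implicit "the skb's socket is the inet socket" assumption with an explicit socket parameter. The following user-space sketch shows the idea only; all names (`sk_sketch`, `skb_sketch`, `xmit_implicit_sketch()`, `xmit_explicit_sketch()`) are hypothetical and the real kernel functions do far more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for struct sock and struct sk_buff. */
struct sk_sketch { int is_inet; int tos; };
struct skb_sketch { struct sk_sketch *sk; };

/* Old style: trust skb->sk. Returns -1 when it is not an inet socket. */
int xmit_implicit_sketch(const struct skb_sketch *skb)
{
    if (!skb->sk || !skb->sk->is_inet)
        return -1;
    return skb->sk->tos;
}

/* New style: the caller passes the inet socket it actually transmits on. */
int xmit_explicit_sketch(const struct sk_sketch *sk,
                         const struct skb_sketch *skb)
{
    assert(sk && sk->is_inet);
    (void)skb;              /* skb->sk is no longer consulted */
    return sk->tos;
}
```

When a tunnel keeps the original skb ownership, skb->sk may point at a non-inet socket (for example an AF_PACKET sender), so only the explicit variant can safely read inet-specific fields.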
     

14 Apr, 2014

2 commits

  • Extend commit 13378cad02afc2adc6c0e07fca03903c7ada0b37
    ("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid
    RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0
    which is displayed as 'iif *'.

    inet_iif is not appropriate to use because skb_iif is not set.
    Use the skb->dev->ifindex instead.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     
  • Plug a group_info refcount leak in ping_init.
    group_info is only needed during initialization and
    the code failed to release the reference on exit.
    While here move grabbing the reference to a place
    where it is actually needed.

    Signed-off-by: Chuansheng Liu
    Signed-off-by: Zhang Dongxing
    Signed-off-by: xiaoming wang
    Signed-off-by: David S. Miller

    Wang, Xiaoming
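
The reference-count pattern described in the entry above (take the reference only where it is needed, release it on every exit path) might look like this in a user-space model. The structure and function names are hypothetical, and the range check is a simplification assumed for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for struct group_info with one supplementary gid. */
struct group_info_sketch { int usage; int gid; };

void get_group_info_sketch(struct group_info_sketch *gi)
{
    gi->usage++;
}

void put_group_info_sketch(struct group_info_sketch *gi)
{
    assert(gi->usage > 0);
    gi->usage--;
}

/* Returns 0 if gid, or the supplementary gid, lies within [lo, hi]. */
int init_sock_sketch(struct group_info_sketch *gi, int gid, int lo, int hi)
{
    int ret = -1;

    if (gid >= lo && gid <= hi)
        return 0;                   /* fast path: no reference taken */

    get_group_info_sketch(gi);      /* grabbed only where it is needed */
    if (gi->gid >= lo && gi->gid <= hi)
        ret = 0;
    put_group_info_sketch(gi);      /* released on success and on failure */
    return ret;
}
```

Whatever path is taken, the usage count after the call equals the count before it.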
     

13 Apr, 2014

2 commits

  • Before the patch, it was possible to add the same tunnel twice:
    ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
    ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41

    It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
    argument dev->type, which was set only later (when calling ndo_init handler
    in register_netdevice()). Let's set this type in the setup handler, which is
    called before newlink handler.

    Introduced by commit b9959fd3b0fa ("vti: switch to new ip tunnel code").

    CC: Cong Wang
    CC: Steffen Klassert
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Before the patch, it was possible to add the same tunnel twice:
    ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
    ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249

    It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
    argument dev->type, which was set only later (when calling ndo_init handler
    in register_netdevice()). Let's set this type in the setup handler, which is
    called before newlink handler.

    Introduced by commit c54419321455 ("GRE: Refactor GRE tunneling code.").

    CC: Pravin B Shelar
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
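
Why an unset dev->type lets a duplicate through can be shown with a toy lookup table. This is a sketch under stated assumptions: the table, `tunnel_find()` and `tunnel_newlink()` are hypothetical models of ip_tunnel_find() and ip_tunnel_newlink(), and `SKETCH_TYPE_TUNNEL` is an illustrative value rather than a real ARPHRD constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TUNNELS 8
#define SKETCH_TYPE_TUNNEL 1    /* illustrative device type */

struct tunnel_sketch { unsigned remote, local, key; int type; };
struct tunnel_table { struct tunnel_sketch slots[MAX_TUNNELS]; size_t count; };

/* The lookup compares the device type along with the endpoints and key. */
const struct tunnel_sketch *tunnel_find(const struct tunnel_table *tbl,
                                        const struct tunnel_sketch *want)
{
    for (size_t i = 0; i < tbl->count; i++) {
        const struct tunnel_sketch *t = &tbl->slots[i];
        if (t->remote == want->remote && t->local == want->local &&
            t->key == want->key && t->type == want->type)
            return t;
    }
    return NULL;
}

/* type_at_lookup models the value of dev->type when the lookup runs. */
bool tunnel_newlink(struct tunnel_table *tbl, struct tunnel_sketch cfg,
                    int type_at_lookup)
{
    struct tunnel_sketch probe = cfg;

    assert(tbl);
    probe.type = type_at_lookup;
    if (tunnel_find(tbl, &probe) || tbl->count == MAX_TUNNELS)
        return false;
    tbl->slots[tbl->count++] = cfg;   /* stored with its final type */
    return true;
}
```

Passing 0 as type_at_lookup (type not yet set) never matches the stored entry, so the second identical tunnel is accepted; passing the real type rejects it, which is the effect of setting the type in the setup handler.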
     

12 Apr, 2014

1 commit

  • Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

    But at the moment we place the SKB onto the socket receive queue it
    can be consumed and freed up. So this skb->len access is potentially
    a read of freed memory.

    Furthermore, the skb->len can be modified by the consumer so it is
    possible that the value isn't accurate.

    And finally, no actual implementation of this callback actually uses
    the length argument. And since nobody actually cared about its
    value, lots of call sites pass arbitrary values in such as '0' and
    even '1'.

    So just remove the length argument from the callback, that way there
    is no confusion whatsoever and all of these use-after-free cases get
    fixed as a side effect.

    Based upon a patch by Eric Dumazet and his suggestion to audit this
    issue tree-wide.

    Signed-off-by: David S. Miller

    David S. Miller
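
The shape of the callback after the change can be sketched in user space as follows. Everything here is a hypothetical model (`sock_sketch`, `count_wakeup()`, `queue_and_notify()`); it only shows that the notification needs the socket and nothing read from the buffer after queueing.

```c
#include <assert.h>
#include <stddef.h>

struct sock_sketch;
typedef void (*data_ready_fn)(struct sock_sketch *sk);

/* Hypothetical, reduced stand-in for struct sock. */
struct sock_sketch {
    size_t queued;              /* buffers placed on the receive queue */
    unsigned int wakeups;       /* how often the callback ran */
    data_ready_fn data_ready;   /* note: no length argument */
};

/* Example callback: it only needs to know which socket has data. */
void count_wakeup(struct sock_sketch *sk)
{
    sk->wakeups++;
}

void queue_and_notify(struct sock_sketch *sk)
{
    assert(sk && sk->data_ready);
    sk->queued++;           /* from here on a consumer may free the buffer */
    sk->data_ready(sk);     /* nothing is read from the buffer any more */
}
```

Because the callback signature no longer carries a length, there is no reason for a call site to touch the queued buffer after handing it over.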
     

09 Apr, 2014

1 commit

  • Pull more networking updates from David Miller:

    1) If a VXLAN interface is created with no groups, we can crash on
    reception of packets. Fix from Mike Rapoport.

    2) Missing includes in CPTS driver, from Alexei Starovoitov.

    3) Fix string validations in isdnloop driver, from YOSHIFUJI Hideaki
    and Dan Carpenter.

    4) Missing irq.h include in bnx2x, enic, and qlcnic drivers. From
    Josh Boyer.

    5) AF_PACKET transmit doesn't statistically count TX drops, from Daniel
    Borkmann.

    6) Byte-Queue-Limit enabled drivers aren't handled properly in
    AF_PACKET transmit path, also from Daniel Borkmann.

    Same problem exists in pktgen, and Daniel fixed it there too.

    7) Fix resource leaks in driver probe error paths of new sxgbe driver,
    from Francois Romieu.

    8) Truesize of SKBs can gradually get more and more corrupted in NAPI
    packet recycling path, fix from Eric Dumazet.

    9) Fix uniprocessor netfilter build, from Florian Westphal. In the
    longer term we should perhaps try to find a way for ARRAY_SIZE() to
    work even with zero sized array elements.

    10) Fix crash in netfilter conntrack extensions due to mis-estimation of
    required extension space. From Andrey Vagin.

    11) Since we commit table rule updates before trying to copy the
    counters back to userspace (it's the last action we perform), we
    really can't signal the user copy with an error as we are beyond the
    point from which we can unwind everything. This causes all kinds of
    use after free crashes and other mysterious behavior.

    From Thomas Graf.

    12) Restore previous behavior of div/mod by zero in BPF filter
    processing. From Daniel Borkmann.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
    net: sctp: wake up all assocs if sndbuf policy is per socket
    isdnloop: several buffer overflows
    netdev: remove potentially harmful checks
    pktgen: fix xmit test for BQL enabled devices
    net/at91_ether: avoid NULL pointer dereference
    tipc: Let tipc_release() return 0
    at86rf230: fix MAX_CSMA_RETRIES parameter
    mac802154: fix duplicate #include headers
    sxgbe: fix duplicate #include headers
    net: filter: be more defensive on div/mod by X==0
    netfilter: Can't fail and free after table replacement
    xen-netback: Trivial format string fix
    net: bcmgenet: Remove unnecessary version.h inclusion
    net: smc911x: Remove unused local variable
    bonding: Inactive slaves should keep inactive flag's value
    netfilter: nf_tables: fix wrong format in request_module()
    netfilter: nf_tables: set names cannot be larger than 15 bytes
    netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len
    netfilter: Add {ipt,ip6t}_osf aliases for xt_osf
    netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks
    ...

    Linus Torvalds
     

08 Apr, 2014

1 commit

  • The RT_CACHE_STAT_INC macro triggers the new preemption checks
    for __this_cpu ops.

    I do not see any other synchronization that would allow the use of a
    __this_cpu operation here. However, in commit dbd2915ce87e ("[IPV4]:
    RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of
    raw_smp_processor_id() here because "we do not care" about races. In
    the past we agreed that the price of disabling interrupts here to get
    consistent counters would be too high. These counters may be inaccurate
    due to race conditions.

    The use of __this_cpu op improves the situation already from what commit
    dbd2915ce87e did since the single instruction emitted on x86 does not
    allow the race to occur anymore. However, non x86 platforms could still
    experience a race here.

    Trace:

    __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
    caller is __this_cpu_preempt_check+0x38/0x60
    CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187
    Call Trace:
    check_preemption_disabled+0xec/0x110
    __this_cpu_preempt_check+0x38/0x60
    __ip_route_output_key+0x575/0x8c0
    ip_route_output_flow+0x27/0x70
    udp_sendmsg+0x825/0xa20
    inet_sendmsg+0x85/0xc0
    sock_sendmsg+0x9c/0xd0
    ___sys_sendmsg+0x37c/0x390
    __sys_sendmsg+0x49/0x90
    SyS_sendmsg+0x12/0x20
    tracesys+0xe1/0xe6

    Signed-off-by: Christoph Lameter
    Acked-by: David S. Miller
    Acked-by: Ingo Molnar
    Cc: Eric Dumazet
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

05 Apr, 2014

1 commit

  • All xtables variants suffer from the defect that the copy_to_user()
    to copy the counters to user memory may fail after the table has
    already been exchanged and thus exposed. Returning an error at this
    point will result in freeing the already exposed table. Any
    subsequent packet processing will result in a kernel panic.

    We can't copy the counters before exposing the new tables as we
    want to provide the counter state after the old table has been
    unhooked. Therefore convert this into a silent error.

    Cc: Florian Westphal
    Signed-off-by: Thomas Graf
    Signed-off-by: Pablo Neira Ayuso

    Thomas Graf
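
The "point of no return" reasoning in the entry above can be modelled as below. This is a sketch only: `table_sketch`, `hook_sketch` and `replace_table_sketch()` are hypothetical, and `copy_ok` stands in for the result of the copy_to_user() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a table and the hook that exposes it. */
struct table_sketch { int id; bool freed; };
struct hook_sketch { struct table_sketch *active; bool counters_lost; };

/* copy_ok models whether copying the old counters to user space worked. */
int replace_table_sketch(struct hook_sketch *hook, struct table_sketch *newt,
                         bool copy_ok)
{
    struct table_sketch *old;

    assert(hook && newt);
    old = hook->active;
    hook->active = newt;        /* point of no return: newt is now live */
    if (old)
        old->freed = true;      /* old table is unhooked and released */

    /* A failed copy is only recorded. Freeing newt or returning an error
     * here would leave packet processing pointing at freed memory. */
    hook->counters_lost = !copy_ok;
    return 0;
}
```

The design choice is that user space may lose the counter snapshot, but the kernel never unwinds a table that packets can already see.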
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included in a lot of files which brought in xattr.h
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

30 Mar, 2014

1 commit


29 Mar, 2014

1 commit

  • It seems I missed one change in get_timewait4_sock() to compute
    the remaining time before deletion of IPV4 timewait socket.

    This could result in wrong output in /proc/net/tcp for tm->when field.

    Fixes: 96f817fedec4 ("tcp: shrink tcp6_timewait_sock by one cache line")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Mar, 2014

1 commit

  • There is no need to allocate 15 bytes in excess for a SYNACK packet,
    as it contains no data, only headers.

    SYNACKs are always generated in softirq context and contain a single
    segment, so we can use TCP_INC_STATS_BH().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Mar, 2014

2 commits

  • After commit d4589926d7a9 (tcp: refine TSO splits), tcp_nagle_check() does
    not use parameter mss_now anymore.

    Signed-off-by: Weiping Pan
    Signed-off-by: David S. Miller

    Peter Pan(潘卫平)
     
  • Commit 10ddceb22ba (ip_tunnel:multicast process cause panic due
    to skb->_skb_refdst NULL pointer) removed dst-drop call from
    ip-tunnel-recv.

    The following commit reintroduces dst-drop and fixes the original bug by
    checking for a loopback packet before releasing dst.
    Original bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681

    CC: Xin Long
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

26 Mar, 2014

1 commit


25 Mar, 2014

1 commit


24 Mar, 2014

1 commit


21 Mar, 2014

2 commits


19 Mar, 2014

2 commits

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     
  • Steffen Klassert says:

    ====================
    One patch to rename a newly introduced struct. The rest is
    the rework of the IPsec virtual tunnel interface for ipv6 to
    support inter address family tunneling and namespace crossing.

    1) Rename the newly introduced struct xfrm_filter to avoid a
    conflict with iproute2. From Nicolas Dichtel.

    2) Introduce xfrm_input_afinfo to access the address family
    dependent tunnel callback functions properly.

    3) Add and use an IPsec protocol multiplexer for ipv6.

    4) Remove dst_entry caching. vti can look up multiple different
    dst entries, depending on the configured xfrm states. Therefore
    it does not make sense to cache a dst_entry.

    5) Remove caching of flow information. vti6 does not use the
    tunnel endpoint addresses to do route and xfrm lookups.

    6) Update the vti6 to use its own receive hook.

    7) Remove the now unused xfrm_tunnel_notifier. This was used from vti
    and is replaced by the IPsec protocol multiplexer hooks.

    8) Support inter address family tunneling for vti6.

    9) Check if the tunnel endpoints of the xfrm state and the vti interface
    are matching and return an error otherwise.

    10) Enable namespace crossing for vti devices.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Mar, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for net-next,
    most relevantly they are:

    * cleanup to remove double semicolon from Stephen Hemminger.

    * calm down sparse warning in xt_ipcomp, from Fan Du.

    * nf_ct_labels support for nf_tables, from Florian Westphal.

    * new macros to simplify rcu dereferences in the scope of nfnetlink
    and nf_tables, from Patrick McHardy.

    * Accept queue and drop (including reason for drop) to verdict
    parsing in nf_tables, also from Patrick.

    * Remove unused random seed initialization in nfnetlink_log, from
    Florian Westphal.

    * Allow to attach user-specific information to nf_tables rules, useful
    to attach user comments to rule, from me.

    * Return errors in ipset according to the manpage documentation, from
    Jozsef Kadlecsik.

    * Fix coccinelle warnings related to incorrect bool type usage for ipset,
    from Fengguang Wu.

    * Add hash:ip,mark set type to ipset, from Vytas Dauksa.

    * Fix the message emitted by ipset for each netns that is created,
    from Ilia Mirkin.

    * Add forceadd option to ipset, which evicts a random entry from the set
    if it becomes full, from Josh Hunt.

    * Minor IPVS cleanups and fixes from Andi Kleen and Tingwei Liu.

    * Improve conntrack scalability by removing a central spinlock, original
    work from Eric Dumazet. Jesper Dangaard Brouer took them over to address
    remaining issues. Several patches to prepare this change come in first
    place.

    * Rework nft_hash to resolve bugs (leaking chain, missing rcu synchronization
    on element removal, etc.), from Patrick McHardy.

    * Restore context in the rule deletion path, as we now release rule objects
    synchronously, from Patrick McHardy. This gets back event notification for
    anonymous sets.

    * Fix NAT family validation in nft_nat, also from Patrick.

    * Improve scalability of xt_connlimit by using an array of spinlocks and
    by introducing a rb-tree of hashtables for faster lookup of accounted
    objects per network. This patch was preceded by several patches and
    refactorizations to accommodate this change including the use of kmem_cache,
    from Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

15 Mar, 2014

2 commits


14 Mar, 2014

1 commit


12 Mar, 2014

1 commit


11 Mar, 2014

1 commit

  • All skb in socket write queue should be properly timestamped.

    In case of FastOpen, we special case the SYN+DATA 'message' as we
    queue in the socket write queue the two fallback skbs:

    1) SYN message by itself.
    2) DATA segment by itself.

    We should make sure these skbs have proper timestamps.

    Add a WARN_ON_ONCE() to eventually catch future violations.

    Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Mar, 2014

1 commit

  • Usage of skb->tstamp should remain private to TCP stack
    (only set on packets on write queue, not on cloned ones)

    Otherwise, packets given to loopback interface with a non null tstamp
    can confuse netif_rx() / net_timestamp_check()

    Other possibility would be to clear tstamp in loopback_xmit(),
    as done in skb_scrub_packet()

    Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Mar, 2014

2 commits

  • Quoting Alexander Aring:
    While fragmentation and unloading of 6lowpan module I got this kernel Oops
    after few seconds:

    BUG: unable to handle kernel paging request at f88bbc30
    [..]
    Modules linked in: ipv6 [last unloaded: 6lowpan]
    Call Trace:
    [] ? call_timer_fn+0x54/0xb3
    [] ? process_timeout+0xa/0xa
    [] run_timer_softirq+0x140/0x15f

    Problem is that incomplete frags are still around after unload; when
    their frag expire timer fires, we get crash.

    When a netns is removed (also done when unloading module), inet_frag
    calls the evictor with 'force' argument to purge remaining frags.

    The evictor loop terminates when accounted memory ('work') drops to 0
    or the lru-list becomes empty. However, the mem accounting is done
    via percpu counters and may not be accurate, i.e. loop may terminate
    prematurely.

    Alter evictor to only stop once the lru list is empty when force is
    requested.

    Reported-by: Phoebe Buckheister
    Reported-by: Alexander Aring
    Tested-by: Alexander Aring
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florian Westphal
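
The termination condition described in the entry above can be sketched as a small loop. The model is hypothetical: `frag_state_sketch` and `evictor_sketch()` are illustrative names, and each queue is assumed to account for exactly one unit of 'work'.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the fragment state seen by the evictor. */
struct frag_state_sketch {
    int lru_len;     /* queues really present on the LRU list */
    int accounted;   /* approximate memory accounting, may under-report */
};

/* Returns the number of queues evicted. */
int evictor_sketch(struct frag_state_sketch *st, bool force)
{
    int work = st->accounted;
    int evicted = 0;

    assert(st->lru_len >= 0);
    while (st->lru_len > 0) {
        /* Without force, stop once the accounted memory is used up.
         * With force, ignore 'work' and run until the list is empty. */
        if (!force && work <= 0)
            break;
        st->lru_len--;
        work--;      /* assumption: one unit of accounting per queue */
        evicted++;
    }
    return evicted;
}
```

With an under-reporting counter, a non-forced run can stop early, while a forced run (netns removal or module unload) drains every queue so no expire timer is left behind.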
     
  • Can be invoked from non-BH context.

    Based upon a patch by Eric Dumazet.

    Fixes: f19c29e3e391 ("tcp: snmp stats for Fast Open, SYN rtx, and data pkts")
    Reported-by: Sergey Senozhatsky
    Signed-off-by: David S. Miller

    David S. Miller
     

06 Mar, 2014

2 commits

  • Conflicts:
    drivers/net/wireless/ath/ath9k/recv.c
    drivers/net/wireless/mwifiex/pcie.c
    net/ipv6/sit.c

    The SIT driver conflict consists of a bug fix being done by hand
    in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
    was created (netdev_alloc_pcpu_stats()) which takes care of this.

    The two wireless conflicts were overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • I stumbled upon this very serious bug while hunting for another one,
    it's a very subtle race condition between inet_frag_evictor,
    inet_frag_intern and the IPv4/6 frag_queue and expire functions
    (basically the users of inet_frag_kill/inet_frag_put).

    What happens is that after a fragment has been added to the hash chain
    but before it's been added to the lru_list (inet_frag_lru_add) in
    inet_frag_intern, it may get deleted (either by an expired timer if
    the system load is high or the timer sufficiently low, or by the
    frag_queue function for different reasons) before it's added to the
    lru_list, then after it gets added it's a matter of time for the
    evictor to get to a piece of memory which has been freed leading to a
    number of different bugs depending on what's left there.

    I've been able to trigger this on both IPv4 and IPv6 (which is normal
    as the frag code is the same), but it's been much more difficult to
    trigger on IPv4 due to the protocol differences about how fragments
    are treated.

    The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
    in a RR bond, so the same flow can be seen on multiple cards at the
    same time. Then I used multiple instances of ping/ping6 to generate
    fragmented packets and flood the machines with them while running
    other processes to load the attacked machine.

    *It is very important to have the _same flow_ coming in on multiple CPUs
    concurrently. Usually the attacked machine would die in less than 30
    minutes, if configured properly to have many evictor calls and timeouts
    it could happen in 10 minutes or so.

    An important point to make is that any caller (frag_queue or timer) of
    inet_frag_kill will remove both the timer refcount and the
    original/guarding refcount thus removing everything that's keeping the
    frag from being freed at the next inet_frag_put. All of this could
    happen before the frag was ever added to the LRU list, then it gets
    added and the evictor uses a freed fragment.

    An example for IPv6 would be if a fragment is being added and is at
    the stage of being inserted in the hash after the hash lock is
    released, but before inet_frag_lru_add executes (or is able to obtain
    the lru lock) another overlapping fragment for the same flow arrives
    at a different CPU which finds it in the hash, but since it's
    overlapping it drops it invoking inet_frag_kill and thus removing all
    guarding refcounts, and afterwards freeing it by invoking
    inet_frag_put which removes the last refcount added previously by
    inet_frag_find, then inet_frag_lru_add gets executed by
    inet_frag_intern and we have a freed fragment in the lru_list.

    The fix is simple, just move the lru_add under the hash chain locked
    region so when a removing function is called it'll have to wait for
    the fragment to be added to the lru_list, and then it'll remove it (it
    works because the hash chain removal is done before the lru_list one
    and there's no window between the two list adds when the frag can get
    dropped). With this fix applied I couldn't kill the same machine in 24
    hours with the same setup.

    Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
    rwlock")

    CC: Florian Westphal
    CC: Jesper Dangaard Brouer
    CC: David S. Miller

    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

04 Mar, 2014

3 commits

  • Add the following snmp stats:

    TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
    the remote does not accept it or the attempts timed out.

    TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
    retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

    TCPOrigDataSent: number of outgoing packets with original data (excluding
    retransmission but including data-in-SYN). This counter is different from
    TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
    more useful to track the TCP retransmission rate.

    Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
    TCPFastOpenPassive.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Yuchung Cheng
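
One plausible way to use TCPOrigDataSent for the retransmission rate mentioned in the entry above is to divide a retransmit count by it. The sketch below assumes the two counter values have already been read by some other means; `tcp_counters_sketch` and `retrans_rate()` are hypothetical names, and pairing the counter with a retransmitted-segments count is an assumption, not something the commit prescribes.

```c
#include <assert.h>

/* Hypothetical snapshot of two counters taken at the same moment. */
struct tcp_counters_sketch {
    unsigned long long orig_data_sent;   /* e.g. TCPOrigDataSent */
    unsigned long long retrans_segs;     /* retransmitted segments */
};

/* Retransmitted segments per original data segment sent. */
double retrans_rate(const struct tcp_counters_sketch *c)
{
    assert(c);
    if (c->orig_data_sent == 0)
        return 0.0;                      /* nothing sent, nothing to rate */
    return (double)c->retrans_segs / (double)c->orig_data_sent;
}
```

Using original data packets as the denominator keeps pure ACKs out of the ratio, which is the distinction from TcpOutSegs drawn in the commit message.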
     
  • When ip_tunnel processes multicast packets, it may check whether the packet
    is a looped-back packet through 'rt_is_output_route(skb_rtable(skb))' in
    ip_tunnel_rcv(). But before that, skb->_skb_refdst has already been dropped
    in iptunnel_pull_header(), which leads to a panic.

    Fixes the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • RTT may be bogus with tail loss probe (TLP) when a packet
    is retransmitted and later (s)acked without the TCPCB_SACKED_RETRANS flag.

    For example, TLP calls __tcp_retransmit_skb() instead of
    tcp_retransmit_skb(). The skb timestamps are updated but the sacked
    flag is not marked with TCPCB_SACKED_RETRANS. As a result we'll
    get bogus RTT in tcp_clean_rtx_queue() or in tcp_sacktag_one() on
    spurious retransmission.

    The fix is to apply the sticky flag TCPCB_EVER_RETRANS to enforce Karn's
    check on RTT sampling. However, this will disable F-RTO if a timeout occurs
    after TLP, by resetting undo_marker in tcp_enter_loss(). We relax this
    check so that it applies only if any pending retransmits are still in flight.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Yuchung Cheng
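
Karn's check with a sticky "ever retransmitted" bit, as described in the entry above, can be sketched like this. The flag values and all names (`skb_sketch`, `retransmit_sketch()`, `rtt_sample_sketch()`) are illustrative assumptions and are not copied from kernel headers.

```c
#include <assert.h>

/* Illustrative bit values for the sketch only. */
#define SKETCH_SACKED_RETRANS 0x02
#define SKETCH_EVER_RETRANS   0x80

/* Hypothetical, reduced stand-in for the per-skb TCP control block. */
struct skb_sketch { unsigned char sacked; long sent_us; };

/* Any retransmit path refreshes the timestamp and sets the sticky bit.
 * A TLP-style path is modelled by mark_sacked_retrans == 0. */
void retransmit_sketch(struct skb_sketch *skb, long now_us,
                       int mark_sacked_retrans)
{
    skb->sent_us = now_us;
    skb->sacked |= SKETCH_EVER_RETRANS;
    if (mark_sacked_retrans)
        skb->sacked |= SKETCH_SACKED_RETRANS;
}

/* Returns an RTT sample in microseconds, or -1 when it is ambiguous. */
long rtt_sample_sketch(const struct skb_sketch *skb, long ack_us)
{
    assert(skb);
    if (skb->sacked & SKETCH_EVER_RETRANS)
        return -1;      /* Karn: never sample a retransmitted segment */
    return ack_us - skb->sent_us;
}
```

Keying the check on the sticky bit rather than on SACKED_RETRANS means a probe retransmission that skips the SACKED_RETRANS marking still suppresses the RTT sample.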