07 Feb, 2019

1 commit

  • ade446403bfb ("net: ipv4: do not handle duplicate fragments as
    overlapping") was backported to many stable trees, but it had a problem
    that was "accidentally" fixed by the upstream commit 0ff89efb5246 ("ip:
    fail fast on IP defrag errors")

    This is the fixup for that problem as we do not want the larger patch in
    the older stable trees.

    Fixes: ade446403bfb ("net: ipv4: do not handle duplicate fragments as overlapping")
    Reported-by: Ivan Babrou
    Reported-by: Eric Dumazet
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

10 Jan, 2019

1 commit

  • [ Upstream commit ade446403bfb79d3528d56071a84b15351a139ad ]

    Since commit 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping
    segments.") IPv4 reassembly code drops the whole queue whenever an
    overlapping fragment is received. However, the test is written in a way
    which detects duplicate fragments as overlapping so that in environments
    with many duplicate packets, fragmented packets may be undeliverable.

Add an extra test and, for a (potentially) duplicate fragment, only drop the
new fragment rather than the whole queue. Only the starting offset and length
are checked, not the contents of the fragments, as that would be too
expensive. For the same reason, the linear list ("run") of an rbtree node is
not iterated; we only check whether the new fragment is a subset of the
interval covered by existing consecutive fragments.

v2: instead of an exact check iterating through the linear list of an rbtree
node, only check whether the new fragment is a subset of the "run" (suggested
by Eric Dumazet)
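
    A minimal, self-contained sketch of the classification described above;
    the identifiers are illustrative rather than the exact upstream ones, with
    offset/end delimiting the new fragment and run_start/run_end the interval
    covered by the consecutive fragments of one rbtree node:

    enum frag_verdict { FRAG_GO_LEFT, FRAG_GO_RIGHT, FRAG_DUP, FRAG_OVERLAP };

    static enum frag_verdict classify_fragment(int offset, int end,
                                               int run_start, int run_end)
    {
            if (end <= run_start)
                    return FRAG_GO_LEFT;    /* keep walking the rbtree left  */
            if (offset >= run_end)
                    return FRAG_GO_RIGHT;   /* keep walking the rbtree right */
            if (offset >= run_start && end <= run_end)
                    return FRAG_DUP;        /* subset of data already queued */
            return FRAG_OVERLAP;            /* genuine overlap               */
    }

    Only FRAG_OVERLAP leads to dropping the whole queue; FRAG_DUP frees just
    the newly arrived skb and leaves the queue intact.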

    Fixes: 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping segments.")
    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Kubecek
     

17 Dec, 2018

1 commit

  • [ Upstream commit ebaf39e6032faf77218220707fc3fa22487784e0 ]

    The *_frag_reasm() functions are susceptible to miscalculating the byte
    count of packet fragments in case the truesize of a head buffer changes.
    The truesize member may be changed by the call to skb_unclone(), leaving
    the fragment memory limit counter unbalanced even if all fragments are
    processed. This miscalculation goes unnoticed as long as the network
    namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter, the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in the running state with a stacktrace similar to:

    PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
    #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
    #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
    #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
    #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
    #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
    #10 [ffff880621563e38] process_one_work at ffffffff81096f14

    It is not possible to create new network namespaces, and processes
    that call unshare() end up being stuck in uninterruptible sleep state
    waiting to acquire the net_mutex.

    The bug was observed in the IPv6 netfilter code by Per Sundstrom.
    I thank him for his analysis of the problem. The parts of this patch
    that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
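
    A hedged sketch of the accounting pattern this implies, assuming the usual
    add_frag_mem_limit() helper; the function name below is illustrative and
    this is not the literal upstream diff:

    static int reasm_prepare_head(struct netns_frags *nf, struct sk_buff *head)
    {
            int delta = -head->truesize;

            /* skb_unclone() may reallocate head and change head->truesize */
            if (skb_unclone(head, GFP_ATOMIC))
                    return -ENOMEM;

            delta += head->truesize;        /* 0 if truesize did not change */
            if (delta)
                    add_frag_mem_limit(nf, delta);
            return 0;
    }

    Charging the delta keeps the per-netns counter balanced, so
    inet_frags_exit_net() can reach zero and namespace cleanup can finish.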

    Signed-off-by: Jiri Wiesner
    Reported-by: Per Sundstrom
    Acked-by: Peter Oskolkov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jiri Wiesner
     

04 Nov, 2018

1 commit

  • [ Upstream commit 7de414a9dd91426318df7b63da024b2b07e53df5 ]

Most callers of pskb_trim_rcsum() simply drop the skb when
it fails; however, ip_check_defrag() still continues to pass
the skb up the stack. This is suspicious.

In ip_check_defrag(), after we learn the skb is an IP fragment,
passing the skb to callers makes no sense, because callers expect
fragments to be defragmented on success. So, dropping the skb when we
can't defrag it is reasonable.

Note that prior to commit 88078d98d1bb this was not a big problem, as
the checksum would be fixed up anyway. After it, the checksum is not
correct on failure.

    Found this during code review.
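
    A hedged, simplified sketch of the resulting pattern in ip_check_defrag();
    netoff and len stand for the network-header offset and IP total length
    computed earlier in the function, and the surrounding error handling is
    trimmed:

            /* Callers of ip_check_defrag() expect either a reassembled packet
             * or NULL; a fragment we cannot trim may carry a stale checksum,
             * so drop it rather than pass it up.
             */
            if (pskb_trim_rcsum(skb, netoff + len)) {
                    kfree_skb(skb);
                    return NULL;
            }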

    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     

20 Sep, 2018

18 commits

  • commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

A kernel crash occurs when a defragmented packet is fragmented again
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set, but skb->sk and
skb->ip_defrag_offset are the same union member, so that
frag->sk is not NULL.
Hence the crash occurs in the skb->sk check in ip_do_fragment() when
the defragmented packet is fragmented again.

    test commands:
    %iptables -t nat -I POSTROUTING -j MASQUERADE
    %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

    splat looks like:
    [ 261.069429] kernel BUG at net/ipv4/ip_output.c:636!
    [ 261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
    [ 261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
    [ 261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
    [ 261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
    [ 261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
    [ 261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
    [ 261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
    [ 261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
    [ 261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
    [ 261.174169] FS: 00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
    [ 261.183012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
    [ 261.198158] Call Trace:
    [ 261.199018] ? dst_output+0x180/0x180
    [ 261.205011] ? save_trace+0x300/0x300
    [ 261.209018] ? ip_copy_metadata+0xb00/0xb00
    [ 261.213034] ? sched_clock_local+0xd4/0x140
    [ 261.218158] ? kill_l4proto+0x120/0x120 [nf_conntrack]
    [ 261.223014] ? rt_cpu_seq_stop+0x10/0x10
    [ 261.227014] ? find_held_lock+0x39/0x1c0
    [ 261.233008] ip_finish_output+0x51d/0xb50
    [ 261.237006] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.243011] ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
    [ 261.250152] ? rcu_is_watching+0x77/0x120
    [ 261.255010] ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
    [ 261.261033] ? nf_hook_slow+0xb1/0x160
    [ 261.265007] ip_output+0x1c7/0x710
    [ 261.269005] ? ip_mc_output+0x13f0/0x13f0
    [ 261.273002] ? __local_bh_enable_ip+0xe9/0x1b0
    [ 261.278152] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.282996] ? nf_hook_slow+0xb1/0x160
    [ 261.287007] raw_sendmsg+0x21f9/0x4420
    [ 261.291008] ? dst_output+0x180/0x180
    [ 261.297003] ? sched_clock_cpu+0x126/0x170
    [ 261.301003] ? find_held_lock+0x39/0x1c0
    [ 261.306155] ? stop_critical_timings+0x420/0x420
    [ 261.311004] ? check_flags.part.36+0x450/0x450
    [ 261.315005] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.320995] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.326142] ? cyc2ns_read_end+0x10/0x10
    [ 261.330139] ? raw_bind+0x280/0x280
    [ 261.334138] ? sched_clock_cpu+0x126/0x170
    [ 261.338995] ? check_flags.part.36+0x450/0x450
    [ 261.342991] ? __lock_acquire+0x4500/0x4500
    [ 261.348994] ? inet_sendmsg+0x11c/0x500
    [ 261.352989] ? dst_output+0x180/0x180
    [ 261.357012] inet_sendmsg+0x11c/0x500
    [ ... ]

v2:
- clear skb->sk in the reassembly routine (Eric Dumazet)
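
    For context, the aliasing the message refers to looks roughly like this in
    struct sk_buff (paraphrased; the exact layout varies between kernel
    versions), and the v2 fix amounts to clearing the aliased field while the
    fragments are walked at reassembly time. frag_clear_sk() below is an
    illustrative name, not an upstream function:

    struct sk_buff {
            /* ... */
            union {
                    struct sock     *sk;
                    int             ip_defrag_offset;
            };
            /* ... */
    };

    static void frag_clear_sk(struct sk_buff *head)
    {
            struct sk_buff *fp;

            /* A leftover ip_defrag_offset must not be mistaken for a socket
             * pointer by later code such as the skb->sk check (BUG_ON) in
             * ip_do_fragment().
             */
            for (fp = head; fp; fp = fp->next)
                    fp->sk = NULL;
    }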

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Suggested-by: Eric Dumazet
    Signed-off-by: Taehee Yoo
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • This patch changes the runtime behavior of IP defrag queue:
    incoming in-order fragments are added to the end of the current
    list/"run" of in-order fragments at the tail.

    On some workloads, UDP stream performance is substantially improved:

    RX: ./udp_stream -F 10 -T 2 -l 60
    TX: ./udp_stream -c -H -F 10 -T 5 -l 60

    with this patchset applied on a 10Gbps receiver:

    throughput=9524.18
    throughput_units=Mbit/s

    upstream (net-next):

    throughput=4608.93
    throughput_units=Mbit/s

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • This patch introduces several helper functions/macros that will be
    used in the follow-up patch. No runtime changes yet.

    The new logic (fully implemented in the second patch) is as follows:

    * Nodes in the rb-tree will now contain not single fragments, but lists
    of consecutive fragments ("runs").

    * At each point in time, the current "active" run at the tail is
    maintained/tracked. Fragments that arrive in-order, adjacent
    to the previous tail fragment, are added to this tail run without
    triggering the re-balancing of the rb-tree.

    * If a fragment arrives out of order with the offset _before_ the tail run,
    it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
it starts a new "tail" run and is inserted into the rb-tree
at the end as the head of the new run.

    skb->cb is used to store additional information
    needed here (suggested by Eric Dumazet).
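
    A hedged sketch of the skb->cb layout and one of the helpers; the names
    follow the upstream IPv4 code as I recall them and may not match every
    kernel version exactly:

    struct ipfrag_skb_cb {
            struct inet_skb_parm    h;              /* existing IPv4 cb state */
            struct sk_buff          *next_frag;     /* next skb in this run   */
            int                     frag_run_len;   /* bytes covered by run   */
    };

    #define FRAG_CB(skb)    ((struct ipfrag_skb_cb *)((skb)->cb))

    /* Start a new run headed by skb (one rb-tree node per run). */
    static void ip4_frag_init_run(struct sk_buff *skb)
    {
            BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));

            FRAG_CB(skb)->next_frag = NULL;
            FRAG_CB(skb)->frag_run_len = skb->len;
    }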

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • We accidentally removed the parentheses here, but they are required
    because '!' has higher precedence than '&'.
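
    A tiny standalone demonstration of the precedence issue, with illustrative
    values rather than the kernel code:

    #include <stdio.h>

    int main(void)
    {
            unsigned int flags = 0x2;       /* some bit other than 0x1 is set */

            /* '!' binds tighter than '&', so this computes (!flags) & 0x1 */
            printf("%d\n", !flags & 0x1);   /* prints 0 */
            /* the intended test: "is bit 0x1 clear?" */
            printf("%d\n", !(flags & 0x1)); /* prints 1 */
            return 0;
    }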

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller
    (cherry picked from commit 70837ffe3085c9a91488b52ca13ac84424da1042)
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

    skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

    Current uses (TCP receive ofo queue and netem) need to save/restore
    tstamp, while skb->dev is either NULL (TCP) or a constant for a given
    queue (netem).

Since we plan to use an RB tree for the TCP retransmit queue to speed up
SACK processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

    This saves some overhead in both TCP and netem.

    v2: removes the swtstamp field from struct tcp_skb_cb
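
    For reference, a paraphrased view of the relevant sk_buff layout after
    this change (not an exact copy of include/linux/skbuff.h): rbnode now
    overlays next/prev/dev, all of which are dead or constant while an skb
    sits in an rb-tree queue, so nothing has to be saved and restored around
    tree insertion.

    struct sk_buff {
            union {
                    struct {
                            struct sk_buff          *next;
                            struct sk_buff          *prev;
                            struct net_device       *dev;
                    };
                    struct rb_node  rbnode;         /* netem, TCP ofo queue */
            };
            /* ... tstamp now lives outside this union ... */
    };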

    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Wei Wang
    Cc: Willem de Bruijn
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • This behavior is required in IPv6, and there is little need
    to tolerate overlapping fragments in IPv4. This change
    simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran the ip_defrag selftest (not yet available upstream).

    Suggested-by: David S. Miller
    Signed-off-by: Peter Oskolkov
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller
    (cherry picked from commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
since the linker might place a non-zero value next to it, preventing a change
to ip6frag_low_thresh.

ip6frag_low_thresh is not used anymore in the kernel, but we do not
want to prematurely break user scripts wanting to change it.

    Since specifying a minimal value of 0 for proc_doulongvec_minmax()
    is moot, let's remove these zero values in all defrag units.
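
    A hedged illustration of the hazard (identifiers are illustrative, and the
    ip6frag_low_thresh variable is assumed to be declared elsewhere as an
    unsigned long): proc_doulongvec_minmax() treats extra1/extra2 as pointers
    to unsigned long, so pointing extra1 at a 4-byte int makes it read 4 extra
    bytes of whatever the linker placed next to "zero".

    static int zero;

    static struct ctl_table ip6_frags_ctl_table[] = {
            {
                    .procname       = "ip6frag_low_thresh",
                    .data           = &ip6frag_low_thresh, /* unsigned long */
                    .maxlen         = sizeof(unsigned long),
                    .mode           = 0644,
                    .proc_handler   = proc_doulongvec_minmax,
                    .extra1         = &zero,  /* int read as unsigned long */
            },
            { }
    };

    The fix simply drops .extra1: a lower bound of 0 is meaningless for an
    unsigned value anyway.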

    Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller
    (cherry picked from commit 3d23401283e80ceb03f765842787e0e79ff598b7)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
    this integer is currently in a different cache line than skb->next,
    meaning that we use two cache lines per skb when finding the insertion point.

    By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
    in a single cache line and save precious memory bandwidth.

Note that after the fast path added by Changli Gao in commit
d6bebca92c66 ("fragment: add fast path for in-order fragments")
this change won't help the fast path, since we still need
to access prev->len (2nd cache line), but it will show great
benefits when the slow path is entered, since we perform
a linear scan of a potentially long list.

Also, note that this potentially long list is an attack vector;
we might consider also using an rb-tree there eventually.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit bf66337140c64c27fa37222b7abca7e49d63fb57)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
    spinlock before calling icmp_send()")

While it fixed the bug at that time, it also added a very high cost
for DDoS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, a copy, and freeing).

We can use skb_get(head) here; all we want is to make sure the skb won't
be freed by another CPU.
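
    A hedged sketch of the pattern, simplified from ip_expire(); it is not the
    literal diff:

            /* Hold a reference instead of cloning: another CPU freeing the
             * queue can no longer free head under us while the lock is
             * dropped for the (rate-limited) ICMP error.
             */
            skb_get(head);
            spin_unlock(&qp->q.lock);

            icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
            kfree_skb(head);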

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonably well under pressure.

Current memory tracking is using one atomic_t and integers.

Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.

Note that this patch avoids an overflow error if high_thresh was set
to ~2GB, since this test in inet_frag_alloc() was never true:

    if (... || frag_mem_limit(nf) > nf->high_thresh)

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0
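
    A minimal sketch of what the accounting helpers look like after the
    switch; this is close to, but not guaranteed to be byte-for-byte, the
    upstream include/net/inet_frag.h:

    static inline long frag_mem_limit(const struct netns_frags *nf)
    {
            return atomic_long_read(&nf->mem);
    }

    static inline void sub_frag_mem_limit(struct netns_frags *nf, long val)
    {
            atomic_long_sub(val, &nf->mem);
    }

    static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
    {
            atomic_long_add(val, &nf->mem);
    }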

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
This function is obsolete after the rhashtable addition to inet defrag.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
This refactors ip_expire(), removing one indentation level.

Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDoS, the ICMP message won't be sent because of rate limits.

The fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 has the same
issue as the one we fixed in commit ec4fbd64751d
("inet: frag: release spinlock before calling icmp_send()")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

Also, since we use rhashtable, we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405ab ("inet: frag: don't account number
of fragment queues").

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 6befe4a78b1553edb6eed3a78b4bcd9748526672)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
Some applications still rely on IP fragmentation, and to be fair the Linux
reassembly unit does not work under any serious load.

It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

A work queue is supposed to garbage-collect items when the host is under memory
pressure, and to do a hash rebuild, changing the seed used in hash computations.

This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if the host is under fire.

Then there is the problem of sharing this hash table for all netns.

It is time to switch to rhashtables, and allocate one of them per netns
to speed up netns dismantle, since this is a critical metric these days.

Lookup is now using RCU. A followup patch will even remove
the refcount hold/release left from the prior implementation and save
a couple of atomic operations.

Before this patch, 16 CPUs (16-RX-queue NIC) could not handle more
than 1 Mpps of frag DDoS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (exact number depends on frags being evicted
    after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.
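
    A hedged sketch of the shape of the change: each netns gets its own table,
    initialised from parameters supplied by the protocol, and lookups run
    under RCU. Field and helper names below are illustrative, not the exact
    upstream ones.

    static int inet_frags_init_net(struct netns_frags *nf)
    {
            /* one rhashtable per namespace, per protocol */
            return rhashtable_init(&nf->rhashtable, &nf->f->rhash_params);
    }

    static struct inet_frag_queue *frag_find(struct netns_frags *nf,
                                             const void *key)
    {
            struct inet_frag_queue *q;

            rcu_read_lock();
            q = rhashtable_lookup(&nf->rhashtable, key, nf->f->rhash_params);
            if (q && !refcount_inc_not_zero(&q->refcnt))
                    q = NULL;
            rcu_read_unlock();
            return q;
    }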

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller
    (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.
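
    A hedged before/after sketch of the conversion pattern for the frag-queue
    timer; frag_expire() and frag_arm_timer() are illustrative names and the
    real callbacks do more work:

    static void frag_expire(struct timer_list *t)
    {
            /* recover the queue that embeds this timer_list */
            struct inet_frag_queue *q = from_timer(q, t, timer);

            pr_debug("frag queue %p expired\n", q);
            /* ... kill and free the expired queue ... */
    }

    static void frag_arm_timer(struct inet_frag_queue *q, unsigned long timeout)
    {
            /* old: setup_timer(&q->timer, frag_expire, (unsigned long)q); */
            timer_setup(&q->timer, frag_expire, 0);
            mod_timer(&q->timer, jiffies + timeout);
    }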

    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: Pablo Neira Ayuso
    Cc: Jozsef Kadlecsik
    Cc: Florian Westphal
    Cc: linux-wpan@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: netfilter-devel@vger.kernel.org
    Cc: coreteam@netfilter.org
    Signed-off-by: Kees Cook
    Acked-by: Stefan Schmidt # for ieee802154
    Signed-off-by: David S. Miller
    (cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
We need to call inet_frags_init() before register_pernet_subsys(),
as a prerequisite for the following patch ("inet: frags: use rhashtables for reassembly units").
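
    A hedged sketch of the new ordering in ipfrag_init() (simplified, with
    error handling trimmed; the real function also registers sysctls):

    static int __init ipfrag_init(void)
    {
            int ret;

            ret = inet_frags_init(&ip4_frags);      /* shared state first */
            if (ret)
                    return ret;
            return register_pernet_subsys(&ip4_frags_ops);
    }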

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 483a6e4fa055123142d8956866fe2aa9c98d546d)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.

    These functions no longer have a struct inet_frags parameter :

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • We will soon initialize one rhashtable per struct netns_frags
    in inet_frags_init_net().

    This patch changes the return value to eventually propagate an
    error.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 787bea7748a76130566f881c2342a0be4127d182)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

06 Aug, 2018

1 commit

  • [ Upstream commit 4672694bd4f1aebdab0ad763ae4716e89cb15221 ]

    ip_frag_queue() might call pskb_pull() on one skb that
    is already in the fragment queue.

    We need to take care of possible truesize change, or we
    might have an imbalance of the netns frags memory usage.

IPv6 is immune to this bug because RFC 5722, Section 4,
amended by Errata ID 3089, states:

    When reassembling an IPv6 datagram, if
    one or more its constituent fragments is determined to be an
    overlapping fragment, the entire datagram (and any constituent
    fragments) MUST be silently discarded.

    Fixes: 158f323b9868 ("net: adjust skb->truesize in pskb_expand_head()")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
- file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Sep, 2017

1 commit

  • This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb.

After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API
for fragmentation mem accounting") there is no need for this
fix-up patch. As percpu_counter is no longer used, it cannot
leak memory any longer.

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

01 Jul, 2017

1 commit

The refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

23 Mar, 2017

1 commit

Dmitry reported a lockdep splat [1] (a false positive) that we can fix
by releasing the spinlock before calling icmp_send() from ip_expire().

This is a false positive because sending an ICMP message cannot
possibly re-enter the IP frag engine.

    [1]
    [ INFO: possible circular locking dependency detected ]
    4.10.0+ #29 Not tainted
    -------------------------------------------------------
    modprobe/12392 is trying to acquire lock:
    (_xmit_ETHER#2){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    (_xmit_ETHER#2){+.-...}, at: [] __netif_tx_lock
    include/linux/netdevice.h:3486 [inline]
    (_xmit_ETHER#2){+.-...}, at: []
    sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180

    but task is already holding lock:
    (&(&q->lock)->rlock){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    (&(&q->lock)->rlock){+.-...}, at: []
    ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&(&q->lock)->rlock){+.-...}:
    validate_chain kernel/locking/lockdep.c:2267 [inline]
    __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
    lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669
    ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713
    packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459
    deliver_skb net/core/dev.c:1834 [inline]
    dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890
    xmit_one net/core/dev.c:2903 [inline]
    dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923
    sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182
    __dev_xmit_skb net/core/dev.c:3092 [inline]
    __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
    neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308
    neigh_output include/net/neighbour.h:478 [inline]
    ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228
    ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672
    ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545
    ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
    dst_output include/net/dst.h:486 [inline]
    ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
    ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
    ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
    raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985
    __sys_sendmmsg+0x25c/0x750 net/socket.c:2075
    SYSC_sendmmsg net/socket.c:2106 [inline]
    SyS_sendmmsg+0x35/0x60 net/socket.c:2101
    do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
    return_from_SYSCALL_64+0x0/0x7a

    -> #0 (_xmit_ETHER#2){+.-...}:
    check_prev_add kernel/locking/lockdep.c:1830 [inline]
    check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
    validate_chain kernel/locking/lockdep.c:2267 [inline]
    __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
    lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    __netif_tx_lock include/linux/netdevice.h:3486 [inline]
    sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
    __dev_xmit_skb net/core/dev.c:3092 [inline]
    __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
    neigh_hh_output include/net/neighbour.h:468 [inline]
    neigh_output include/net/neighbour.h:476 [inline]
    ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
    ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
    dst_output include/net/dst.h:486 [inline]
    ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
    ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
    ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
    icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
    icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
    ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
    call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
    expire_timers kernel/time/timer.c:1307 [inline]
    __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
    run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
    __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
    invoke_softirq kernel/softirq.c:364 [inline]
    irq_exit+0x1cc/0x200 kernel/softirq.c:405
    exiting_irq arch/x86/include/asm/apic.h:657 [inline]
    smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
    apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
    __read_once_size include/linux/compiler.h:254 [inline]
    atomic_read arch/x86/include/asm/atomic.h:26 [inline]
    rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
    __rcu_is_watching kernel/rcu/tree.c:1133 [inline]
    rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
    rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
    radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
    filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
    do_fault_around mm/memory.c:3231 [inline]
    do_read_fault mm/memory.c:3265 [inline]
    do_fault+0xbd5/0x2080 mm/memory.c:3370
    handle_pte_fault mm/memory.c:3600 [inline]
    __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
    handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
    __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
    do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
    page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011

    other info that might help us debug this:

    Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&q->lock)->rlock);
                               lock(_xmit_ETHER#2);
                               lock(&(&q->lock)->rlock);
  lock(_xmit_ETHER#2);

    *** DEADLOCK ***

    10 locks held by modprobe/12392:
    #0: (&mm->mmap_sem){++++++}, at: []
    __do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336
    #1: (rcu_read_lock){......}, at: []
    filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324
    #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: []
    spin_lock include/linux/spinlock.h:299 [inline]
    #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: []
    pte_alloc_one_map mm/memory.c:2944 [inline]
    #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: []
    alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072
    #3: (((&q->timer))){+.-...}, at: []
    lockdep_copy_map include/linux/lockdep.h:175 [inline]
    #3: (((&q->timer))){+.-...}, at: []
    call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258
    #4: (&(&q->lock)->rlock){+.-...}, at: [] spin_lock
    include/linux/spinlock.h:299 [inline]
    #4: (&(&q->lock)->rlock){+.-...}, at: []
    ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201
    #5: (rcu_read_lock){......}, at: []
    ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216
    #6: (slock-AF_INET){+.-...}, at: [] spin_trylock
    include/linux/spinlock.h:309 [inline]
    #6: (slock-AF_INET){+.-...}, at: [] icmp_xmit_lock
    net/ipv4/icmp.c:219 [inline]
    #6: (slock-AF_INET){+.-...}, at: []
    icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681
    #7: (rcu_read_lock_bh){......}, at: []
    ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198
    #8: (rcu_read_lock_bh){......}, at: []
    __dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324
    #9: (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at:
    [] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423

    stack backtrace:
    CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29
    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x2ee/0x3ef lib/dump_stack.c:52
    print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204
    check_prev_add kernel/locking/lockdep.c:1830 [inline]
    check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
    validate_chain kernel/locking/lockdep.c:2267 [inline]
    __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
    lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:299 [inline]
    __netif_tx_lock include/linux/netdevice.h:3486 [inline]
    sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
    __dev_xmit_skb net/core/dev.c:3092 [inline]
    __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
    dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
    neigh_hh_output include/net/neighbour.h:468 [inline]
    neigh_output include/net/neighbour.h:476 [inline]
    ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
    ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
    NF_HOOK_COND include/linux/netfilter.h:246 [inline]
    ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
    dst_output include/net/dst.h:486 [inline]
    ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
    ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
    ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
    icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
    icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
    ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
    call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
    expire_timers kernel/time/timer.c:1307 [inline]
    __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
    run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
    __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
    invoke_softirq kernel/softirq.c:364 [inline]
    irq_exit+0x1cc/0x200 kernel/softirq.c:405
    exiting_irq arch/x86/include/asm/apic.h:657 [inline]
    smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
    apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
    RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline]
    RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
    RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
    RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline]
    RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
    RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10
    RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c
    RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000
    R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25
    R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000

    rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
    radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
    filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
    do_fault_around mm/memory.c:3231 [inline]
    do_read_fault mm/memory.c:3265 [inline]
    do_fault+0xbd5/0x2080 mm/memory.c:3370
    handle_pte_fault mm/memory.c:3600 [inline]
    __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
    handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
    __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
    do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
    page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011
    RIP: 0033:0x7f83172f2786
    RSP: 002b:00007fffe859ae80 EFLAGS: 00010293
    RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970
    RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000
    R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040
    R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Jan, 2016

1 commit

Later parts of the stack (including fragmentation) expect that there is
never a socket attached to a frag in a frag_list; however, this invariant
was not enforced on all defrag paths. This could lead to the
BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
end of this commit message.

While the skb_orphan() call could be added to openvswitch to fix this
particular error, the head and tail of the frags list are already orphaned
indirectly inside ip_defrag(), so it seems like the remaining fragments
should all be orphaned in all circumstances.
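
    A hedged sketch of the resulting invariant enforcement at the top of
    ip_defrag() (heavily simplified; the real function goes on to find or
    create the queue and returns its result):

    int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
    {
            /* Never let a socket ride along into the frag queue: later code,
             * e.g. the BUG_ON(skb->sk) in ip_do_fragment(), relies on this.
             */
            skb_orphan(skb);

            /* ... find or create the queue, lock it, enqueue the fragment ... */
            return 0;       /* placeholder for the real queueing result */
    }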

    kernel BUG at net/ipv4/ip_output.c:586!
    [...]
    Call Trace:

    [] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
    [] ovs_fragment+0xcc/0x214 [openvswitch]
    [] ? dst_discard_out+0x20/0x20
    [] ? dst_ifdown+0x80/0x80
    [] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
    [] ? mod_timer_pending+0x65/0x210
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
    [] ? __lock_is_held+0x54/0x70
    [] do_output.isra.29+0xe3/0x1b0 [openvswitch]
    [] do_execute_actions+0xe11/0x11f0 [openvswitch]
    [] ? __lock_is_held+0x54/0x70
    [] ovs_execute_actions+0x32/0xd0 [openvswitch]
    [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
    [] ? __lock_is_held+0x54/0x70
    [] ovs_execute_actions+0xb2/0xd0 [openvswitch]
    [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
    [] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
    [] ovs_vport_receive+0x5d/0xa0 [openvswitch]
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
    [] internal_dev_xmit+0x6c/0x140 [openvswitch]
    [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
    [] dev_hard_start_xmit+0x2b9/0x5e0
    [] ? netif_skb_features+0xd1/0x1f0
    [] __dev_queue_xmit+0x800/0x930
    [] ? __dev_queue_xmit+0x50/0x930
    [] ? mark_held_locks+0x71/0x90
    [] ? neigh_resolve_output+0x106/0x220
    [] dev_queue_xmit+0x10/0x20
    [] neigh_resolve_output+0x178/0x220
    [] ? ip_finish_output2+0x1ff/0x590
    [] ip_finish_output2+0x1ff/0x590
    [] ? ip_finish_output2+0x7e/0x590
    [] ip_do_fragment+0x831/0x8a0
    [] ? ip_copy_metadata+0x1b0/0x1b0
    [] ip_fragment.constprop.49+0x43/0x80
    [] ip_finish_output+0x17c/0x340
    [] ? nf_hook_slow+0xe4/0x190
    [] ip_output+0x70/0x110
    [] ? ip_fragment.constprop.49+0x80/0x80
    [] ip_local_out+0x39/0x70
    [] ip_send_skb+0x19/0x40
    [] ip_push_pending_frames+0x33/0x40
    [] icmp_push_reply+0xea/0x120
    [] icmp_reply.constprop.23+0x1ed/0x230
    [] icmp_echo.part.21+0x4e/0x50
    [] ? __lock_is_held+0x54/0x70
    [] ? rcu_read_lock_held+0x5e/0x70
    [] icmp_echo+0x36/0x70
    [] icmp_rcv+0x271/0x450
    [] ip_local_deliver_finish+0x127/0x3a0
    [] ? ip_local_deliver_finish+0x41/0x3a0
    [] ip_local_deliver+0x60/0xd0
    [] ? ip_rcv_finish+0x560/0x560
    [] ip_rcv_finish+0xdd/0x560
    [] ip_rcv+0x283/0x3e0
    [] ? match_held_lock+0x192/0x200
    [] ? inet_del_offload+0x40/0x40
    [] __netif_receive_skb_core+0x392/0xae0
    [] ? process_backlog+0x8e/0x230
    [] ? mark_held_locks+0x71/0x90
    [] __netif_receive_skb+0x18/0x60
    [] process_backlog+0x78/0x230
    [] ? process_backlog+0xdd/0x230
    [] net_rx_action+0x155/0x400
    [] __do_softirq+0xcc/0x420
    [] ? ip_finish_output2+0x217/0x590
    [] do_softirq_own_stack+0x1c/0x30

    [] do_softirq+0x4e/0x60
    [] __local_bh_enable_ip+0xa8/0xb0
    [] ip_finish_output2+0x240/0x590
    [] ? ip_do_fragment+0x831/0x8a0
    [] ip_do_fragment+0x831/0x8a0
    [] ? ip_copy_metadata+0x1b0/0x1b0
    [] ip_fragment.constprop.49+0x43/0x80
    [] ip_finish_output+0x17c/0x340
    [] ? nf_hook_slow+0xe4/0x190
    [] ip_output+0x70/0x110
    [] ? ip_fragment.constprop.49+0x80/0x80
    [] ip_local_out+0x39/0x70
    [] ip_send_skb+0x19/0x40
    [] ip_push_pending_frames+0x33/0x40
    [] raw_sendmsg+0x7d3/0xc30
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? inet_sendmsg+0xc7/0x1d0
    [] ? __lock_is_held+0x54/0x70
    [] inet_sendmsg+0x10a/0x1d0
    [] ? inet_sendmsg+0x5/0x1d0
    [] sock_sendmsg+0x38/0x50
    [] ___sys_sendmsg+0x25f/0x270
    [] ? handle_mm_fault+0x8dd/0x1320
    [] ? _raw_spin_unlock+0x27/0x40
    [] ? __do_page_fault+0x1e2/0x460
    [] ? __fget_light+0x66/0x90
    [] __sys_sendmsg+0x42/0x80
    [] SyS_sendmsg+0x12/0x20
    [] entry_SYSCALL_64_fastpath+0x12/0x6f
    Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
    66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
    RIP [] ip_do_fragment+0x892/0x8a0
    RSP

    Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
    Signed-off-by: Joe Stringer
    Signed-off-by: David S. Miller

    Joe Stringer
     

03 Nov, 2015

1 commit

This patch fixes the following problems:

1) percpu_counter_init() can return an error, therefore
init_frag_mem_limit() must propagate this error so that
inet_frags_init_net() can do the same up to its callers.

2) If ip[46]_frags_ns_ctl_register() fails, we must unwind
properly and free the percpu_counter.

Without this fix, we leave a freed object on the percpu_counters
global list (if CONFIG_HOTPLUG_CPU), leading to crashes.

    This bug was detected by KASAN and syzkaller tool
    (http://github.com/google/syzkaller)
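
    A hedged sketch of the two fixes described above (simplified; the exact
    function layout differs between IPv4 and IPv6):

    static int inet_frags_init_net(struct netns_frags *nf)
    {
            /* 1) propagate percpu_counter_init() failure instead of ignoring it */
            return percpu_counter_init(&nf->mem, 0, GFP_KERNEL);
    }

    static int __net_init ipv4_frags_init_net(struct net *net)
    {
            int res;

            res = inet_frags_init_net(&net->ipv4.frags);
            if (res)
                    return res;
            res = ip4_frags_ns_ctl_register(net);
            if (res)
                    /* 2) unwind, so no freed object is left on the global
                     *    percpu_counters list
                     */
                    percpu_counter_destroy(&net->ipv4.frags.mem);
            return res;
    }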

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Hannes Frederic Sowa
    Cc: Jesper Dangaard Brouer
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2015

1 commit

The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular, conntrack calls ip_defrag
when it is tracking outbound packets from the local machine.

    So add a struct net parameter and stop making ip_defrag guess which
    network namespace it needs to defragment packets in.
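
    The interface change, in a hedged sketch (prototypes as I recall them):

    /* before: int ip_defrag(struct sk_buff *skb, u32 user);
     * (the namespace was derived inside ip_defrag() from the skb)
     */
    int ip_defrag(struct net *net, struct sk_buff *skb, u32 user);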

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

29 Aug, 2015

1 commit

  • inetpeer caches based on address only, so duplicate IP addresses within
    a namespace return the same cached entry. Enhance the ipv4 address key
    to contain both the IPv4 address and VRF device index.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

14 Aug, 2015

1 commit

  • Fragmentation cache uses information from the IP header to reassemble
    packets. That information can be duplicated across VRFs -- same source
    and destination addresses, protocol and id. Handle fragmentation with
    VRFs by adding the VRF device index to entries in the cache and the
    lookup arg.
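
    A hedged sketch of the idea; struct and field names are illustrative, not
    the exact upstream ones. The lookup key for a reassembly queue grows a VRF
    device index so identical tuples on different VRFs no longer collide:

    struct frag_lookup_key {
            __be32  saddr;
            __be32  daddr;
            __be16  id;
            u8      protocol;
            u32     user;
            int     vif;    /* L3 master (VRF) device index, 0 if none */
    };

    static bool frag_key_match(const struct frag_lookup_key *a,
                               const struct frag_lookup_key *b)
    {
            return  a->saddr == b->saddr && a->daddr == b->daddr &&
                    a->id == b->id && a->protocol == b->protocol &&
                    a->user == b->user && a->vif == b->vif;
    }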

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

27 Jul, 2015

2 commits

We can simply remove the INET_FRAG_EVICTED flag to avoid all the flag
race conditions with the evictor and use a participation test for the
evictor list. When we're at that point (after inet_frag_kill) in the
timer, there are 2 possible cases:

1. The evictor added the entry to its evictor list while the timer was
waiting for the chainlock
or
2. The timer unchained the entry and the evictor won't see it

In both cases we should be able to see list_evictor correctly due
to the sync on the chainlock.

    Joint work with Florian Westphal.
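
    A hedged sketch of the participation test; I believe the helper looked
    essentially like this, but treat it as illustrative:

    static inline bool inet_frag_evicting(struct inet_frag_queue *q)
    {
            /* instead of a racy flag, ask whether the entry already sits on
             * the evictor's private list
             */
            return !hlist_unhashed(&q->list_evictor);
    }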

    Tested-by: Frank Schreuder
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
A followup patch will call it after the inet_frag_queue has been freed, so q->net
doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).

    Tested-by: Frank Schreuder
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal