20 Sep, 2018

11 commits

  • This patch introduces several helper functions/macros that will be
    used in the follow-up patch. No runtime changes yet.

    The new logic (fully implemented in the second patch) is as follows:

    * Nodes in the rb-tree will now contain not single fragments, but lists
    of consecutive fragments ("runs").

    * At each point in time, the current "active" run at the tail is
    maintained/tracked. Fragments that arrive in-order, adjacent
    to the previous tail fragment, are added to this tail run without
    triggering the re-balancing of the rb-tree.

    * If a fragment arrives out of order with the offset _before_ the tail run,
    it is inserted into the rb-tree as a single fragment.

    * If a fragment arrives after the current tail fragment (with a gap),
    it starts a new "tail" run and is inserted into the rb-tree
    at the end as the head of the new run.

    skb->cb is used to store additional information
    needed here (suggested by Eric Dumazet).
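
    As a rough illustration of the per-fragment state that fits into skb->cb
    (the struct and helper names below are a sketch of what the follow-up
    patch describes, not necessarily its exact code):

    struct ipfrag_skb_cb {
        struct inet_skb_parm    h;            /* keep first so IPCB() still works */
        struct sk_buff          *next_frag;   /* next fragment in this run */
        int                     frag_run_len; /* total length of the run headed by this skb */
    };

    #define FRAG_CB(skb)    ((struct ipfrag_skb_cb *)((skb)->cb))

    /* Start a new run headed by skb (sketch). */
    static void ip4_frag_init_run(struct sk_buff *skb)
    {
        BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
        FRAG_CB(skb)->next_frag = NULL;
        FRAG_CB(skb)->frag_run_len = skb->len;
    }

    /* Append an in-order fragment to the tail run; the rb-tree is untouched
     * (q->fragments_tail / q->last_run_head are illustrative field names). */
    static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
                                            struct sk_buff *skb)
    {
        FRAG_CB(q->fragments_tail)->next_frag = skb;
        FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
        q->fragments_tail = skb;
    }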

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

    skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

    Current uses (TCP receive ofo queue and netem) need to save/restore
    tstamp, while skb->dev is either NULL (TCP) or a constant for a given
    queue (netem).

    Since we plan to use an RB tree for the TCP retransmit queue to speed up
    SACK processing with large BDP, this patch exchanges skb->dev and
    skb->tstamp.

    This saves some overhead in both TCP and netem.
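
    Conceptually, the overlay looks like this after the change (a simplified
    sketch of the idea, not the actual struct sk_buff definition):

    /* Simplified sketch: rbnode now overlays next/prev/dev, while tstamp
     * lives outside the union and survives rb-tree usage. */
    struct skb_layout_sketch {
        union {
            struct {
                struct sk_buff      *next;
                struct sk_buff      *prev;
                struct net_device   *dev;   /* was: ktime_t tstamp */
            };
            struct rb_node rbnode;          /* netem, TCP ofo queue, soon retransmit queue */
        };
        ktime_t tstamp;                     /* was: struct net_device *dev */
        /* ... */
    };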

    v2: removes the swtstamp field from struct tcp_skb_cb

    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Wei Wang
    Cc: Willem de Bruijn
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Put the read-mostly fields in a separate cache line
    at the beginning of struct netns_frags, to reduce
    false sharing noticed in inet_frag_kill()
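
    A sketch of the layout this describes (field names follow struct
    netns_frags as used elsewhere in this series; the exact grouping is
    illustrative):

    struct netns_frags_sketch {
        /* Read-mostly sysctl knobs grouped in the first cache line. */
        long                high_thresh;
        long                low_thresh;
        int                 timeout;
        struct inet_frags   *f;

        /* Frequently written counter on its own cache line, so that
         * inet_frag_kill() and friends don't false-share with the knobs. */
        atomic_long_t       mem ____cacheline_aligned_in_smp;
    };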

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit c2615cf5a761b32bf74e85bddc223dfff3d9b9f0)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    Some users are willing to provision huge amounts of memory to be able
    to perform reassembly reasonably well under pressure.

    Current memory tracking is using one atomic_t and integers.

    Switch to atomic_long_t so that 64bit arches can use more than 2GB,
    without any cost for 32bit arches.

    Note that this patch avoids an overflow error if high_thresh was set
    to ~2GB, since this test in inet_frag_alloc() was never true:

    if (... || frag_mem_limit(nf) > nf->high_thresh)
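
    With the switch, the accounting helpers become plain long operations,
    roughly (a sketch following the description; the real helpers live in
    include/net/inet_frag.h):

    static inline long frag_mem_limit(const struct netns_frags *nf)
    {
        return atomic_long_read(&nf->mem);
    }

    static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
    {
        atomic_long_add(val, &nf->mem);
    }

    static inline void sub_frag_mem_limit(struct netns_frags *nf, long val)
    {
        atomic_long_sub(val, &nf->mem);
    }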

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • This function is obsolete, after rhashtable addition to inet defrag.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    This refactors ip_expire(), removing one indentation level.

    Note: in the future, we should try hard to avoid the skb_clone(),
    since it is a serious performance cost.
    Under DDOS, the ICMP message won't be sent because of rate limits.

    The fact that ip6_expire_frag_queue() does not use skb_clone() is
    disturbing too. Presumably IPv6 has the same issue as the one we fixed
    in commit ec4fbd64751d
    ("inet: frag: release spinlock before calling icmp_send()").

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()

    Also since we use rhashtable we can bring back the number of fragments
    in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
    removed in commit 434d305405ab ("inet: frag: don't account number
    of fragment queues")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 6befe4a78b1553edb6eed3a78b4bcd9748526672)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    Some applications still rely on IP fragmentation, and to be fair the
    Linux reassembly unit does not hold up under any serious load.

    It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

    A work queue is supposed to garbage-collect items when the host is under
    memory pressure and to perform hash rebuilds, changing the seed used in
    hash computations.

    This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
    occurring every 5 seconds if the host is under fire.

    Then there is the problem of sharing this hash table for all netns.

    It is time to switch to rhashtables, and to allocate one of them per netns
    to speed up netns dismantle, since this is a critical metric these days.

    Lookup is now using RCU. A followup patch will even remove
    the refcount hold/release left from prior implementation and save
    a couple of atomic operations.
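
    A hedged sketch of what a per-netns rhashtable with RCU lookup looks
    like (the params values, the key type and the field names are
    illustrative, not the exact ones in the patch):

    /* illustrative lookup key */
    struct frag_key_sketch {
        __be32  saddr, daddr;
        __be16  id;
        u8      protocol;
    };

    static const struct rhashtable_params frag_rhash_params_sketch = {
        .head_offset         = offsetof(struct inet_frag_queue, node),
        .key_offset          = offsetof(struct inet_frag_queue, key),
        .key_len             = sizeof(struct frag_key_sketch),
        .automatic_shrinking = true,
    };

    static int frags_init_net_sketch(struct netns_frags *nf)
    {
        /* one table per namespace: netns dismantle only tears down its own */
        return rhashtable_init(&nf->rhashtable, &frag_rhash_params_sketch);
    }

    static struct inet_frag_queue *frag_find_sketch(struct netns_frags *nf,
                                                    const struct frag_key_sketch *key)
    {
        struct inet_frag_queue *q;

        rcu_read_lock();
        q = rhashtable_lookup(&nf->rhashtable, key, frag_rhash_params_sketch);
        /* real code takes a reference on q here, before dropping the RCU lock */
        rcu_read_unlock();
        return q;
    }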

    Before this patch, 16 cpus (16 RX queue NIC) could not handle more
    than 1 Mpps frags DDOS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (exact number depends on frags being evicted
    after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller
    (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.
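
    The conversion pattern, using the frag expiry timer as an example
    (function names here are illustrative):

    /* New-style callback: gets the timer itself ... */
    static void frag_expire_sketch(struct timer_list *t)
    {
        /* ... and recovers the containing object via from_timer(). */
        struct inet_frag_queue *q = from_timer(q, t, timer);

        /* expire q ... */
    }

    static void frag_timer_init_sketch(struct inet_frag_queue *q)
    {
        /* was: setup_timer(&q->timer, callback, (unsigned long)q); */
        timer_setup(&q->timer, frag_expire_sketch, 0);
    }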

    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: Pablo Neira Ayuso
    Cc: Jozsef Kadlecsik
    Cc: Florian Westphal
    Cc: linux-wpan@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: netfilter-devel@vger.kernel.org
    Cc: coreteam@netfilter.org
    Signed-off-by: Kees Cook
    Acked-by: Stefan Schmidt # for ieee802154
    Signed-off-by: David S. Miller
    (cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.

    These functions no longer have a struct inet_frags parameter :

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
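
    The mechanism is simply a back-pointer: struct netns_frags now stores
    its struct inet_frags, so the helpers can fetch it from the queue. A
    sketch of the resulting call pattern (assuming the refcount_t conversion
    from earlier in this history):

    /* struct netns_frags gains:  struct inet_frags *f; */

    static void inet_frag_put_sketch(struct inet_frag_queue *q)
    {
        if (refcount_dec_and_test(&q->refcnt))
            inet_frag_destroy(q);   /* looks up q->net->f internally now */
    }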

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • We will soon initialize one rhashtable per struct netns_frags
    in inet_frags_init_net().

    This patch changes the return value to eventually propagate an
    error.
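
    In other words, the initializer goes from void to int, roughly:

    /* Sketch: still trivially successful here, but callers now check it. */
    static int inet_frags_init_net_sketch(struct netns_frags *nf)
    {
        atomic_long_set(&nf->mem, 0);   /* counter type varies across this series */
        return 0;                       /* later: return rhashtable_init(...); */
    }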

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 787bea7748a76130566f881c2342a0be4127d182)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

04 Sep, 2017

2 commits

  • This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb.

    After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API
    for fragmentation mem accounting") there is no need for this
    fix-up patch. As percpu_counter is no longer used, it can no longer
    leak memory.

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2.

    There is a bug in the fragmentation code's use of the percpu_counter API,
    which can cause issues on systems with many CPUs.

    frag_mem_limit() just reads the global counter (fbc->count), without
    considering that other CPUs can each hold deltas of up to the batch size
    (130K) that haven't been subtracted yet. Due to the 3MBytes lower thresh
    limit, this becomes dangerous at >=24 CPUs (3*1024*1024/130000=24).

    The correct API usage would be __percpu_counter_compare(), which does the
    right thing: it takes into account the number of (online) CPUs and the
    batch size, and calls __percpu_counter_sum() when needed.
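
    I.e. the threshold check would have needed to look roughly like this
    (a sketch) instead of reading fbc->count directly:

    /* Returns true when the counter exceeds thresh, letting the library
     * fall back to the precise __percpu_counter_sum() when the fast
     * comparison is within batch * num_online_cpus() of the threshold. */
    static bool frag_mem_over_limit_sketch(struct percpu_counter *mem,
                                           s64 thresh, s32 batch)
    {
        return __percpu_counter_compare(mem, thresh, batch) > 0;
    }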

    We choose to revert the use of the lib/percpu_counter API for frag
    memory accounting for several reasons:

    1) On systems with CPUs > 24, the heavier fully locked
    __percpu_counter_sum() is always invoked, which will be more
    expensive than the atomic_t that is reverted to.

    Given systems with more than 24 CPUs are becoming common this doesn't
    seem like a good option. To mitigate this, the batch size could be
    decreased and thresh be increased.

    2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
    CPU, before SKBs are pushed into sockets on remote CPUs. Given
    NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
    likely be limited. Thus, a fair chance that atomic add+dec happen
    on the same CPU.

    Note for this revert: commit 1d6119baf061 ("net: fix percpu memory leaks")
    removed init_frag_mem_limit() and used inet_frags_init_net() instead.
    After this revert, inet_frags_uninit_net() becomes empty.

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

06 Jul, 2017

1 commit

  • Pull percpu updates from Tejun Heo:
    "These are the percpu changes for the v4.13-rc1 merge window. There are
    a couple visibility related changes - tracepoints and allocator stats
    through debugfs, along with __ro_after_init markings and a cosmetic
    rename in percpu_counter.

    Please note that the simple O(#elements_in_the_chunk) area allocator
    used by percpu allocator is again showing scalability issues,
    primarily with bpf allocating and freeing large number of counters.
    Dennis is working on the replacement allocator and the percpu
    allocator will be seeing increased churns in the coming cycles"

    * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: fix static checker warnings in pcpu_destroy_chunk
    percpu: fix early calls for spinlock in pcpu_stats
    percpu: resolve err may not be initialized in pcpu_alloc
    percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
    percpu: add tracepoint support for percpu memory
    percpu: expose statistics about percpu memory via debugfs
    percpu: migrate percpu data structures to internal header
    percpu: add missing lockdep_assert_held to func pcpu_free_area
    mark most percpu globals as __ro_after_init

    Linus Torvalds
     

01 Jul, 2017

1 commit

    The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
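
    The conversion itself is mechanical, e.g. (a sketch of the pattern, not
    a specific call site from this patch):

    struct obj_sketch {
        refcount_t refcnt;                   /* was: atomic_t refcnt; */
    };

    static void obj_get_sketch(struct obj_sketch *o)
    {
        refcount_inc(&o->refcnt);            /* saturates instead of overflowing */
    }

    static void obj_put_sketch(struct obj_sketch *o)
    {
        if (refcount_dec_and_test(&o->refcnt))
            kfree(o);
    }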

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

21 Jun, 2017

1 commit

  • Currently, percpu_counter_add is a wrapper around __percpu_counter_add
    which is preempt safe due to explicit calls to preempt_disable. Given
    how __ prefix is used in percpu related interfaces, the naming
    unfortunately creates the false sense that __percpu_counter_add is
    less safe than percpu_counter_add. In terms of context-safety,
    they're equivalent. The only difference is that the __ version takes
    a batch parameter.

    Make this a bit more explicit by just renaming __percpu_counter_add to
    percpu_counter_add_batch.

    This patch doesn't cause any functional changes.
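
    For reference, the two entry points after the rename (the batch value
    shown is illustrative):

    percpu_counter_add(&counter, 1);            /* implicit default batch */
    percpu_counter_add_batch(&counter, 1, 32);  /* explicit batch; was __percpu_counter_add() */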

    tj: Minor updates to patch description for clarity. Cosmetic
    indentation updates.

    Signed-off-by: Nikolay Borisov
    Signed-off-by: Tejun Heo
    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: David Sterba
    Cc: Darrick J. Wong
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: linux-mm@kvack.org
    Cc: "David S. Miller"

    Nikolay Borisov
     

03 Nov, 2015

1 commit

  • This patch fixes following problems :

    1) percpu_counter_init() can return an error, therefore
    init_frag_mem_limit() must propagate this error so that
    inet_frags_init_net() can do the same up to its callers.

    2) If ip[46]_frags_ns_ctl_register() fails, we must unwind
    properly and free the percpu_counter.

    Without this fix, we leave a freed object in the percpu_counters
    global list (if CONFIG_HOTPLUG_CPU), leading to crashes.

    This bug was detected by KASAN and syzkaller tool
    (http://github.com/google/syzkaller)
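
    A sketch of the pattern being fixed: check the percpu_counter_init()
    return value and destroy the counter on the unwind path (the sysctl
    registration helper below is a stand-in for ip[46]_frags_ns_ctl_register()):

    static int frags_init_net_sketch(struct netns_frags *nf)
    {
        int err = percpu_counter_init(&nf->mem, 0, GFP_KERNEL);

        if (err)
            return err;

        err = frags_ns_ctl_register_sketch(nf);
        if (err)
            /* without this, the freed counter stays on the global list */
            percpu_counter_destroy(&nf->mem);
        return err;
    }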

    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Hannes Frederic Sowa
    Cc: Jesper Dangaard Brouer
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jul, 2015

3 commits

    We can simply remove the INET_FRAG_EVICTED flag, avoiding all the flag
    race conditions with the evictor, and use a participation test for the
    evictor list instead. When we're at that point (after inet_frag_kill) in
    the timer there are 2 possible cases:

    1. The evictor added the entry to its evictor list while the timer was
    waiting for the chainlock
    or
    2. The timer unchained the entry and the evictor won't see it

    In both cases we should be able to see list_evictor correctly due
    to the sync on the chainlock.
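
    The participation test boils down to checking whether the queue is
    linked on the evictor's private list, e.g. (a sketch; the helper naming
    may differ from the patch):

    /* True if the evictor grabbed this queue onto its private hlist. */
    static inline bool inet_frag_evicting_sketch(const struct inet_frag_queue *q)
    {
        return !hlist_unhashed(&q->list_evictor);
    }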

    Joint work with Florian Westphal.

    Tested-by: Frank Schreuder
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • Followup patch will call it after inet_frag_queue was freed, so q->net
    doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).

    Tested-by: Frank Schreuder
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • commit 65ba1f1ec0eff ("inet: frags: fix a race between inet_evict_bucket
    and inet_frag_kill") describes the bug, but the fix doesn't work reliably.

    The problem is that the ->flags member can be set on another cpu without
    the chainlock being held by that task, i.e. the RMW cycle can clear the
    INET_FRAG_EVICTED bit after we put the element on the evictor private list.

    We can crash when walking the 'private' evictor list since an element can
    be deleted from list underneath the evictor.

    Joint work with Nikolay Aleksandrov.

    Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
    Reported-by: Johan Schuijt
    Tested-by: Frank Schreuder
    Signed-off-by: Nikolay Alexandrov
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

28 May, 2015

1 commit

  • We currently always send fragments without DF bit set.

    Thus, given following setup:

    mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
       A           R1              R2          B

    Where R1 and R2 run linux with netfilter defragmentation/conntrack
    enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
    will respond with icmp too big error if one of these fragments exceeded
    1400 bytes.

    However, if R1 receives fragment sizes 1200 and 100, it would
    forward the reassembled packet without refragmenting, i.e.
    R2 will send an icmp error in response to a packet that was never sent,
    citing mtu that the original sender never exceeded.

    The other minor issue is that a refragmentation on R1 will conceal the
    MTU of R2-B since refragmentation does not set DF bit on the fragments.

    This modifies ip_fragment so that we track largest fragment size seen
    both for DF and non-DF packets, and set frag_max_size to the largest
    value.

    If the DF fragment size is larger or equal to the non-df one, we will
    consider the packet a path mtu probe:
    We set DF bit on the reassembled skb and also tag it with a new IPCB flag
    to force refragmentation even if skb fits outdev mtu.

    We will also set DF bit on each fragment in this case.
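
    A hedged sketch of the decision at reassembly completion (the tracked
    sizes and the IPSKB_FRAG_PMTU flag follow the description; treat the
    helper itself as illustrative):

    static void ip_frag_set_pmtu_probe_sketch(struct sk_buff *head,
                                              u16 max_df_size, u16 max_nondf_size)
    {
        IPCB(head)->frag_max_size = max(max_df_size, max_nondf_size);

        if (max_df_size && max_df_size >= max_nondf_size) {
            /* Path MTU probe: keep DF on the reassembled skb and force
             * refragmentation on output even if it fits the outdev MTU. */
            ip_hdr(head)->frag_off |= htons(IP_DF);
            IPCB(head)->flags |= IPSKB_FRAG_PMTU;
        }
    }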

    Joint work with Hannes Frederic Sowa.

    Reported-by: Jesse Gross
    Signed-off-by: Florian Westphal
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florian Westphal
     

08 Sep, 2014

1 commit

  • Percpu allocator now supports allocation mask. Add @gfp to
    percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
    with percpu_counters too.

    We could have left percpu_counter_init() alone and added
    percpu_counter_init_gfp(); however, the number of users isn't that
    high and introducing _gfp variants to all percpu data structures would
    be quite ugly, so let's just do the conversion. This is the one with
    the most users. Other percpu data structures are a lot easier to
    convert.

    This patch doesn't make any functional difference.
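
    I.e. callers now pass the mask explicitly:

    /* old: err = percpu_counter_init(&nf->mem, 0); */
    err = percpu_counter_init(&nf->mem, 0, GFP_KERNEL);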

    Signed-off-by: Tejun Heo
    Acked-by: Jan Kara
    Acked-by: "David S. Miller"
    Cc: x86@kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andrew Morton

    Tejun Heo
     

28 Jul, 2014

7 commits

    Rehash is a rare operation, so don't force readers to take
    the read-side rwlock.

    Instead, we only have to detect the (rare) case where
    the secret was altered while we are trying to insert
    a new inetfrag queue into the table.

    If it was changed, drop the bucket lock and recompute
    the hash to get the 'new' chain bucket that we have to
    insert into.
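
    The lockless pattern described is essentially a generation check: sample
    the rehash generation, compute the bucket, then re-check under the bucket
    lock (a hedged sketch with illustrative types and names; the patch's
    actual locking details differ):

    struct frag_bucket_sketch {
        spinlock_t          lock;
        struct hlist_head   chain;
    };

    struct frag_table_sketch {
        unsigned int                rehash_gen;     /* bumped on every rehash */
        struct frag_bucket_sketch   buckets[1024];
    };

    struct frag_entry_sketch { struct hlist_node node; /* keys ... */ };

    /* hypothetical: hashes the entry with the table's current secret */
    u32 frag_hashfn_sketch(const struct frag_table_sketch *tbl,
                           const struct frag_entry_sketch *e);

    static void frag_insert_sketch(struct frag_table_sketch *tbl,
                                   struct frag_entry_sketch *e)
    {
        struct frag_bucket_sketch *b;
        unsigned int gen;
        u32 hash;

    restart:
        gen  = READ_ONCE(tbl->rehash_gen);          /* sampled without any rwlock */
        hash = frag_hashfn_sketch(tbl, e);
        b    = &tbl->buckets[hash % 1024];

        spin_lock(&b->lock);
        if (READ_ONCE(tbl->rehash_gen) != gen) {
            /* secret changed while we hashed: drop the lock, rehash, retry */
            spin_unlock(&b->lock);
            goto restart;
        }
        hlist_add_head(&e->node, &b->chain);
        spin_unlock(&b->lock);
    }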

    Joint work with Nikolay Aleksandrov.

    Signed-off-by: Florian Westphal
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • merge functionality into the eviction workqueue.

    Instead of rebuilding every n seconds, take advantage of the upper
    hash chain length limit.

    If we hit it, mark table for rebuild and schedule workqueue.
    To prevent frequent rebuilds when we're completely overloaded,
    don't rebuild more than once every 5 seconds.

    The ipfrag_secret_interval sysctl is now obsolete and has been marked as
    deprecated; it can still be changed so scripts won't break, but it no
    longer has any effect. A comment is left above each unused secret_timer
    variable to avoid confusion.

    Joint work with Nikolay Aleksandrov.

    Signed-off-by: Florian Westphal
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • no longer used.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
    The 'nqueues' counter is protected by the lru list lock; once that's
    removed, this would need to be converted to an atomic counter. Given it
    isn't used for anything except reporting to userspace via /proc, just
    remove it.

    We still report the memory currently used by fragment
    reassembly queues.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • When the high_thresh limit is reached we try to toss the 'oldest'
    incomplete fragment queues until memory limits are below the low_thresh
    value. This happens in softirq/packet processing context.

    This has two drawbacks:

    1) processors might evict a queue that was about to be completed
    by another cpu, because they will compete wrt. resource usage and
    resource reclaim.

    2) LRU list maintenance is expensive.

    But when constantly overloaded, even the 'least recently used' element is
    recent, so removing 'lru' queue first is not 'fairer' than removing any
    other fragment queue.

    This moves eviction out of the fast path:

    When the low threshold is reached, a work queue is scheduled
    which then iterates over the table and removes the queues that exceed
    the memory limits of the namespace. It sets a new flag called
    INET_FRAG_EVICTED on the evicted queues so the proper counters will get
    incremented when the queue is forcefully expired.

    When the high threshold is reached, no more fragment queues are
    created until we're below the limit again.

    The LRU list is now unused and will be removed in a followup patch.
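
    Schematically, the fast path now only checks the limits and defers the
    actual walk to a work item (names are illustrative; frag_mem_limit() and
    the thresholds are the existing netns_frags fields):

    static struct work_struct frag_evict_work_sketch;   /* INIT_WORK()ed at init time */

    /* Called from packet processing: never evict here, just kick the worker
     * and refuse new queues while above the high threshold. */
    static bool frag_may_alloc_sketch(struct netns_frags *nf)
    {
        if (frag_mem_limit(nf) > nf->low_thresh)
            schedule_work(&frag_evict_work_sketch);

        return frag_mem_limit(nf) <= nf->high_thresh;
    }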

    Joint work with Nikolay Aleksandrov.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • First step to move eviction handling into a work queue.

    We lose two spots that accounted evicted fragments in MIB counters.

    Accounting will be restored since the upcoming work-queue evictor
    invokes the frag queue timer callbacks instead.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

24 Oct, 2013

1 commit

  • All fragmentation hash secrets now get initialized by their
    corresponding hash function with net_get_random_once. Thus we can
    eliminate the initial seeding.

    Also provide a comment that hash secret seeding happens at the first
    call to the corresponding hashing function.
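
    The pattern is simply (a sketch; the secret and hash function names are
    illustrative):

    static u32 frag_hash_sketch(const void *key, u32 len)
    {
        static u32 frag_secret __read_mostly;

        /* seeds frag_secret exactly once, on the first call */
        net_get_random_once(&frag_secret, sizeof(frag_secret));

        return jhash(key, len, frag_secret);
    }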

    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

06 May, 2013

1 commit

  • This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add()
    which was introduced in commit 3ef0eb0db4bf92c6d2510fe5c4dc51852746f206
    ("net: frag, move LRU list maintenance outside of rwlock")

    One cpu has already added a new fragment queue into the hash, but not yet
    into the LRU. Another cpu finds it in the hash and tries to move it to the
    end of the LRU. This leads to a NULL pointer dereference inside
    list_move_tail().

    Another possible race condition is between inet_frag_lru_move() and
    inet_frag_lru_del(): a move can happen after deletion.

    This patch initializes the LRU list head before adding the fragment into
    the hash, and inet_frag_lru_move() doesn't touch it if it's empty.
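
    That is, the fix amounts to initializing the list head up front and
    making the move a no-op for unlinked entries (a sketch of the two spots
    the description refers to):

    /* Before the queue becomes visible in the hash (sketch). */
    static void frag_queue_setup_sketch(struct inet_frag_queue *q)
    {
        INIT_LIST_HEAD(&q->lru_list);   /* a concurrent "move" now sees an empty head */
    }

    static void inet_frag_lru_move_sketch(struct inet_frag_queue *q)
    {
        spin_lock(&q->net->lru_lock);
        if (!list_empty(&q->lru_list))  /* not on the LRU yet (or already deleted) */
            list_move_tail(&q->lru_list, &q->net->lru_list);
        spin_unlock(&q->net->lru_lock);
    }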

    I saw this kernel oops two times in a couple of days.

    [119482.128853] BUG: unable to handle kernel NULL pointer dereference at (null)
    [119482.132693] IP: [] __list_del_entry+0x29/0xd0
    [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0
    [119482.140221] Oops: 0000 [#1] SMP
    [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii
    [119482.152692] CPU 3
    [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D
    [119482.161478] RIP: 0010:[] [] __list_del_entry+0x29/0xd0
    [119482.166004] RSP: 0018:ffff880216d5db58 EFLAGS: 00010207
    [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200
    [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00
    [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014
    [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00
    [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0
    [119482.194140] FS: 00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
    [119482.198928] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0
    [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0)
    [119482.223113] Stack:
    [119482.228004] ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001
    [119482.233038] ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000
    [119482.238083] 00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00
    [119482.243090] Call Trace:
    [119482.248009] [] ip_defrag+0x8fa/0xd10
    [119482.252921] [] ipv4_conntrack_defrag+0x83/0xe0
    [119482.257803] [] nf_iterate+0x8b/0xa0
    [119482.262658] [] ? inet_del_offload+0x40/0x40
    [119482.267527] [] nf_hook_slow+0x74/0x130
    [119482.272412] [] ? inet_del_offload+0x40/0x40
    [119482.277302] [] ip_rcv+0x268/0x320
    [119482.282147] [] __netif_receive_skb_core+0x612/0x7e0
    [119482.286998] [] __netif_receive_skb+0x18/0x60
    [119482.291826] [] process_backlog+0xa0/0x160
    [119482.296648] [] net_rx_action+0x139/0x220
    [119482.301403] [] __do_softirq+0xe7/0x220
    [119482.306103] [] run_ksoftirqd+0x28/0x40
    [119482.310809] [] smpboot_thread_fn+0xff/0x1a0
    [119482.315515] [] ? lg_local_lock_cpu+0x40/0x40
    [119482.320219] [] kthread+0xc0/0xd0
    [119482.324858] [] ? insert_kthread_work+0x40/0x40
    [119482.329460] [] ret_from_fork+0x7c/0xb0
    [119482.334057] [] ? insert_kthread_work+0x40/0x40
    [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
    [119482.343787] RIP [] __list_del_entry+0x29/0xd0
    [119482.348675] RSP
    [119482.353493] CR2: 0000000000000000

    Oops happened on this path:
    ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry()

    Signed-off-by: Konstantin Khlebnikov
    Cc: Jesper Dangaard Brouer
    Cc: Florian Westphal
    Cc: Eric Dumazet
    Cc: David S. Miller
    Acked-by: Florian Westphal
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     

30 Apr, 2013

1 commit

  • Increase fragmentation hash bucket size to 1024 from old 64 elems.

    After we increased the frag mem limits in commit c2a93660 ("net: increase
    fragment memory usage limits"), the hash size of 64 elements is simply
    too small, especially considering the mem limit is per netns while the
    hash table is shared across all netns.

    For the embedded people, note that this increase will change the hash
    table/array from using approx 1 Kbytes to 16 Kbytes.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer