07 Mar, 2014

1 commit

  • Quoting Alexander Aring:
    While fragmentation and unloading of 6lowpan module I got this kernel Oops
    after few seconds:

    BUG: unable to handle kernel paging request at f88bbc30
    [..]
    Modules linked in: ipv6 [last unloaded: 6lowpan]
    Call Trace:
    [] ? call_timer_fn+0x54/0xb3
    [] ? process_timeout+0xa/0xa
    [] run_timer_softirq+0x140/0x15f

    The problem is that incomplete frags are still around after the unload;
    when their frag expire timer fires, we get a crash.

    When a netns is removed (also done when unloading the module), inet_frag
    calls the evictor with the 'force' argument to purge remaining frags.

    The evictor loop terminates when the accounted memory ('work') drops
    to 0 or the lru-list becomes empty. However, the mem accounting is done
    via percpu counters and may not be accurate, i.e. the loop may terminate
    prematurely.

    Alter the evictor so that, when force is requested, it stops only once
    the lru list is empty.
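
    In sketch form, the new loop shape (kernel-style C; names follow the
    inet_frag code of that era, but this is illustrative rather than the
    literal patch):

        /* keep evicting until the memory target is met, or, when forced,
         * until the LRU list is completely drained */
        while (work > 0 || force) {
                spin_lock(&nf->lru_lock);
                if (list_empty(&nf->lru_list)) {
                        spin_unlock(&nf->lru_lock);
                        break;  /* the only trusted exit under 'force' */
                }
                q = list_first_entry(&nf->lru_list,
                                     struct inet_frag_queue, lru_list);
                atomic_inc(&q->refcnt);
                spin_unlock(&nf->lru_lock);
                /* ... kill q, drop refs, subtract freed memory from work ... */
        }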

    Reported-by: Phoebe Buckheister
    Reported-by: Alexander Aring
    Tested-by: Alexander Aring
    Signed-off-by: Florian Westphal
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Florian Westphal
     

06 Mar, 2014

1 commit

  • I stumbled upon this very serious bug while hunting for another one.
    It's a very subtle race condition between inet_frag_evictor,
    inet_frag_intern and the IPv4/6 frag_queue and expire functions
    (basically the users of inet_frag_kill/inet_frag_put).

    What happens is that after a fragment has been added to the hash chain,
    but before it's been added to the lru_list (inet_frag_lru_add) in
    inet_frag_intern, it may get deleted: either by an expired timer (if
    the system load is high or the timeout sufficiently low) or by the
    frag_queue function for various reasons. Then, after it does get added
    to the lru_list, it's only a matter of time before the evictor reaches
    a piece of memory which has been freed, leading to a number of
    different bugs depending on what's left there.

    I've been able to trigger this on both IPv4 and IPv6 (which is expected,
    as the frag code is shared), but it's been much more difficult to
    trigger on IPv4 due to differences in how the two protocols treat
    fragments.

    The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
    in a RR bond, so the same flow can be seen on multiple cards at the
    same time. Then I used multiple instances of ping/ping6 to generate
    fragmented packets and flood the machines with them while running
    other processes to load the attacked machine.

    It is very important to have the _same flow_ coming in on multiple CPUs
    concurrently. Usually the attacked machine would die in less than 30
    minutes; if configured properly, with many evictor calls and short
    timeouts, it could happen in 10 minutes or so.

    An important point to make is that any caller (frag_queue or timer) of
    inet_frag_kill will remove both the timer refcount and the
    original/guarding refcount, thus removing everything that's keeping the
    frag from being freed at the next inet_frag_put. All of this can
    happen before the frag was ever added to the LRU list; then it gets
    added and the evictor uses a freed fragment.

    An example for IPv6: a fragment is being added and is at the stage of
    being inserted in the hash, after the hash lock has been released but
    before inet_frag_lru_add executes (or is able to obtain the lru lock).
    An overlapping fragment for the same flow arrives at a different CPU,
    which finds it in the hash; since it's overlapping, it drops it by
    invoking inet_frag_kill, thus removing all guarding refcounts, and
    afterwards frees it by invoking inet_frag_put, which removes the last
    refcount added previously by inet_frag_find. Then inet_frag_lru_add
    gets executed by inet_frag_intern and we have a freed fragment in the
    lru_list.

    The fix is simple: move the lru_add under the hash-chain-locked region,
    so a removing function has to wait for the fragment to be added to the
    lru_list before it can remove it. This works because the hash chain
    removal is done before the lru_list one, and there's no window between
    the two list adds in which the frag can get dropped. With this fix
    applied I couldn't kill the same machine in 24 hours with the same
    setup.
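
    In rough outline, the reordering in inet_frag_intern (a sketch assuming
    the per-bucket chain_lock of that era, not the literal patch):

        spin_lock(&hb->chain_lock);
        /* ... handle a racing insert of the same queue ... */
        atomic_inc(&qp->refcnt);
        hlist_add_head(&qp->list, &hb->chain);
        inet_frag_lru_add(nf, qp);      /* moved: was done after the unlock */
        spin_unlock(&hb->chain_lock);   /* queue appears in hash and LRU together */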

    Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of
    rwlock")

    CC: Florian Westphal
    CC: Jesper Dangaard Brouer
    CC: David S. Miller

    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

24 Oct, 2013

1 commit

  • All fragmentation hash secrets now get initialized by their
    corresponding hash function with net_get_random_once. Thus we can
    eliminate the initial seeding.

    Also provide a comment that hash secret seeding happens at the first
    call to the corresponding hashing function.
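
    The seed-on-first-use pattern being referred to looks roughly like this
    (variable and function names are illustrative):

        #include <linux/jhash.h>
        #include <linux/net.h>          /* net_get_random_once() */

        static u32 frag_hash_secret __read_mostly;

        /* the hash function seeds its own secret the first time it runs,
         * so no explicit seeding is needed at init time */
        static unsigned int frag_hashfn_sketch(u32 saddr, u32 daddr)
        {
                net_get_random_once(&frag_hash_secret,
                                    sizeof(frag_hash_secret));
                return jhash_2words(saddr, daddr, frag_hash_secret);
        }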

    Cc: David S. Miller
    Cc: Eric Dumazet
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

10 Jul, 2013

1 commit

  • Pull networking updates from David Miller:
    "This is a re-do of the net-next pull request for the current merge
    window. The only difference from the one I made the other day is that
    this has Eliezer's interface renames and the timeout handling changes
    made based upon your feedback, as well as a few bug fixes that have
    trickled in.

    Highlights:

    1) Low latency device polling, eliminating the cost of interrupt
    handling and context switches. Allows direct polling of a network
    device from socket operations, such as recvmsg() and poll().

    Currently ixgbe, mlx4, and bnx2x support this feature.

    Full high level description, performance numbers, and design in
    commit 0a4db187a999 ("Merge branch 'll_poll'")

    From Eliezer Tamir.

    2) With the routing cache removed, ip_check_mc_rcu() gets exercised
    more than ever before in the case where we have lots of multicast
    addresses. Use a hash table instead of a simple linked list, from
    Eric Dumazet.

    3) Add driver for Atheros QCA98xx 802.11ac wireless devices, from
    Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
    Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

    4) Support reporting the TUN device persist flag to userspace, from
    Pavel Emelyanov.

    5) Allow controlling network device VF link state using netlink, from
    Rony Efraim.

    6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

    7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
    Daniel Borkmann and Eric Dumazet.

    8) Allow controlling of TCP quickack behavior on a per-route basis,
    from Cong Wang.

    9) Several bug fixes and improvements to vxlan from Stephen
    Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
    support receiving on multiple UDP ports.

    10) Major cleanups, particularly in the area of debugging and cookie
    lifetime handling, to the SCTP protocol code. From Daniel
    Borkmann.

    11) Allow packets to cross network namespaces when traversing tunnel
    devices. From Nicolas Dichtel.

    12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
    manner akin to how we monitor real network traffic via ptype_all.
    From Daniel Borkmann.

    13) Several bug fixes and improvements for the new alx device driver,
    from Johannes Berg.

    14) Fix scalability issues in the netem packet scheduler's time queue,
    by using an rbtree. From Eric Dumazet.

    15) Several bug fixes in TCP loss recovery handling, from Yuchung
    Cheng.

    16) Add support for GSO segmentation of MPLS packets, from Simon
    Horman.

    17) Make network notifiers have a real data type for the opaque
    pointer that's passed into them. Use this to properly handle
    network device flag changes in arp_netdev_event(). From Jiri
    Pirko and Timo Teräs.

    18) Convert several drivers over to module_pci_driver(), from Peter
    Huewe.

    19) tcp_fixup_rcvbuf() can loop 500 times over loopback; just use an
    O(1) calculation instead. From Eric Dumazet.

    20) Support setting of explicit tunnel peer addresses in ipv6, just
    like ipv4. From Nicolas Dichtel.

    21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

    22) Prevent a single high-rate flow from overrunning an individual cpu
    during RX packet processing via selective flow shedding. From
    Willem de Bruijn.

    23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
    Dumazet.

    24) Don't just drop GSO packets which are above the TBF scheduler's
    burst limit, chop them up so they are in-bounds instead. Also
    from Eric Dumazet.

    25) VLAN offloads are missed when configured on top of a bridge, fix
    from Vlad Yasevich.

    26) Support IPV6 in ping sockets. From Lorenzo Colitti.

    27) Receive flow steering targets should be updated at poll() time
    too, from David Majnemer.

    28) Fix several corner case regressions in PMTU/redirect handling due
    to the routing cache removal, from Timo Teräs.

    29) We have to be mindful of ipv4-mapped ipv6 sockets in
    udp_v6_push_pending_frames(). From Hannes Frederic Sowa.

    30) Fix L2TP sequence number handling bugs, from James Chapman."

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
    drivers/net: caif: fix wrong rtnl_is_locked() usage
    drivers/net: enic: release rtnl_lock on error-path
    vhost-net: fix use-after-free in vhost_net_flush
    net: mv643xx_eth: do not use port number as platform device id
    net: sctp: confirm route during forward progress
    virtio_net: fix race in RX VQ processing
    virtio: support unlocked queue poll
    net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
    Documentation: Fix references to defunct linux-net@vger.kernel.org
    net/fs: change busy poll time accounting
    net: rename low latency sockets functions to busy poll
    bridge: fix some kernel warning in multicast timer
    sfc: Fix memory leak when discarding scattered packets
    sit: fix tunnel update via netlink
    dt:net:stmmac: Add dt specific phy reset callback support.
    dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
    dt:net:stmmac: Allocate platform data only if its NULL.
    net:stmmac: fix memleak in the open method
    ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
    net: ipv6: fix wrong ping_v6_sendmsg return value
    ...

    Linus Torvalds
     

04 Jul, 2013

1 commit

  • The global variable num_physpages is scheduled to be removed, so use
    totalram_pages instead of num_physpages at runtime.
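
    The substitution is mechanical; a hypothetical call site changes like
    this:

        /* hypothetical RAM-based sizing check; only the symbol changes */
        if (totalram_pages >= (128 * 1024))     /* was: num_physpages */
                thresh = 1 << 20;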

    Signed-off-by: Jiang Liu
    Cc: Miklos Szeredi
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

20 Jun, 2013

1 commit

  • This patch removes an empty ifdef from inet_frag_intern()
    in net/ipv4/inet_fragment.c.

    Commit b67bfe0d42cac56c512dd5da4b1b347a23f4b70a
    ("hlist: drop the node parameter from iterators") removed the hlist
    node usage from net/ipv4/inet_fragment.c, but did not remove the
    enclosing #ifdef, which is now empty.

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     

06 May, 2013

1 commit

  • This patch fixes a race between inet_frag_lru_move() and
    inet_frag_lru_add() which was introduced in commit
    3ef0eb0db4bf92c6d2510fe5c4dc51852746f206
    ("net: frag, move LRU list maintenance outside of rwlock").

    One cpu has already added a new fragment queue into the hash but not
    into the LRU. Another cpu finds it in the hash and tries to move it to
    the end of the LRU. This leads to a NULL pointer dereference inside
    list_move_tail().

    Another possible race condition is between inet_frag_lru_move() and
    inet_frag_lru_del(): the move can happen after the deletion.

    This patch initializes the LRU list head before adding the fragment
    into the hash, and inet_frag_lru_move() doesn't touch it if it's empty.
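
    A minimal sketch of the two changes (close to, but not necessarily
    exactly, the patch):

        /* 1) make the LRU head valid before the queue is visible in hash */
        INIT_LIST_HEAD(&q->lru_list);

        /* 2) only move a queue that is actually on the LRU */
        static inline void inet_frag_lru_move_sketch(struct inet_frag_queue *q)
        {
                spin_lock(&q->net->lru_lock);
                if (!list_empty(&q->lru_list))
                        list_move_tail(&q->lru_list, &q->net->lru_list);
                spin_unlock(&q->net->lru_lock);
        }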

    I saw this kernel oops two times in a couple of days.

    [119482.128853] BUG: unable to handle kernel NULL pointer dereference at (null)
    [119482.132693] IP: [] __list_del_entry+0x29/0xd0
    [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0
    [119482.140221] Oops: 0000 [#1] SMP
    [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii
    [119482.152692] CPU 3
    [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D
    [119482.161478] RIP: 0010:[] [] __list_del_entry+0x29/0xd0
    [119482.166004] RSP: 0018:ffff880216d5db58 EFLAGS: 00010207
    [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200
    [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00
    [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014
    [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00
    [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0
    [119482.194140] FS: 00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
    [119482.198928] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0
    [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0)
    [119482.223113] Stack:
    [119482.228004] ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001
    [119482.233038] ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000
    [119482.238083] 00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00
    [119482.243090] Call Trace:
    [119482.248009] [] ip_defrag+0x8fa/0xd10
    [119482.252921] [] ipv4_conntrack_defrag+0x83/0xe0
    [119482.257803] [] nf_iterate+0x8b/0xa0
    [119482.262658] [] ? inet_del_offload+0x40/0x40
    [119482.267527] [] nf_hook_slow+0x74/0x130
    [119482.272412] [] ? inet_del_offload+0x40/0x40
    [119482.277302] [] ip_rcv+0x268/0x320
    [119482.282147] [] __netif_receive_skb_core+0x612/0x7e0
    [119482.286998] [] __netif_receive_skb+0x18/0x60
    [119482.291826] [] process_backlog+0xa0/0x160
    [119482.296648] [] net_rx_action+0x139/0x220
    [119482.301403] [] __do_softirq+0xe7/0x220
    [119482.306103] [] run_ksoftirqd+0x28/0x40
    [119482.310809] [] smpboot_thread_fn+0xff/0x1a0
    [119482.315515] [] ? lg_local_lock_cpu+0x40/0x40
    [119482.320219] [] kthread+0xc0/0xd0
    [119482.324858] [] ? insert_kthread_work+0x40/0x40
    [119482.329460] [] ret_from_fork+0x7c/0xb0
    [119482.334057] [] ? insert_kthread_work+0x40/0x40
    [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
    [119482.343787] RIP [] __list_del_entry+0x29/0xd0
    [119482.348675] RSP
    [119482.353493] CR2: 0000000000000000

    Oops happened on this path:
    ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry()

    Signed-off-by: Konstantin Khlebnikov
    Cc: Jesper Dangaard Brouer
    Cc: Florian Westphal
    Cc: Eric Dumazet
    Cc: David S. Miller
    Acked-by: Florian Westphal
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     

05 Apr, 2013

1 commit

  • This patch implements per hash bucket locking for the frag queue
    hash. This removes two write locks, and the only remaining write
    lock is for protecting hash rebuild. This essentially reduces the
    readers-writer lock to a rebuild lock.
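
    The resulting structure, in sketch form (a spinlock beside each chain;
    the old rwlock is kept only for rebuilds):

        struct inet_frag_bucket {
                struct hlist_head       chain;
                spinlock_t              chain_lock;     /* per-bucket */
        };

        struct inet_frags_sketch {
                struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
                rwlock_t                lock;   /* effectively a rebuild lock */
                /* ... */
        };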

    This patch is part of "net: frag performance followup"
    http://thread.gmane.org/gmane.linux.network/263644
    of which two patches have already been accepted.

    Same test setup as previous:
    (http://thread.gmane.org/gmane.linux.network/257155)
    Two 10G interfaces, on separate NUMA nodes, are under test and use
    Ethernet flow-control. A third interface is used for generating the
    DoS attack (with trafgen).

    Notice, I have changed the frag DoS generator script to be more
    efficient/deadly. Before, it would only hit one RX queue; now it sends
    packets causing multi-queue RX, due to "better" RX hashing.

    Test types summary (netperf UDP_STREAM):
    Test-20G64K == 2x10G with 65K fragments
    Test-20G3F == 2x10G with 3x fragments (3*1472 bytes)
    Test-20G64K+DoS == Same as 20G64K with frag DoS
    Test-20G3F+DoS == Same as 20G3F with frag DoS
    Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS
    Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS

    When I rebased this-patch(03) (on top of net-next commit a210576c) and
    removed the _bh spinlock, I saw a performance regression. BUT this
    was caused by some unrelated change in-between. See tests below.

    Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
    Test (B) is a verifying retest of commit 1b5ab0de, corresponding to patch-02.
    Test (C) is what I reported before for this patch.

    Test (D) is net-next master HEAD (commit a210576c), which reveals some
    (unknown) performance regression (compared against test (B)).
    Test (D) functions as the new base test.

    Performance table summary (in Mbit/s):

    (#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ
    ---------- ------- ------- ---------- --------- -------- -------
    (A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2
    (B) 1b5ab0de : 18841.5 13156.8 4101.08 5314.57 129.0 424.2
    (C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6

    (D) a210576c : 18321.5 11250.4 3635.34 5160.13 119.1 405.2
    (E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6
    (F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3

    Tests (E) and (F) are this patch (03), with (V1) and without (V2) the _bh spinlocks.

    I cannot explain the slowdown for 20G64K (but it's an artificial
    "lab test" so I'm not worried). The other results do show
    improvements, and the test (E) "with _bh" version is slightly better.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet

    ----
    V2:
    - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
    need the spinlock _bh versions, as Netfilter currently does a
    local_bh_disable() before entering inet_fragment.
    - Fold-in desc from cover-mail
    V3:
    - Drop the chain_len counter per hash bucket.
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

28 Mar, 2013

2 commits

  • Move the protection of netns_frags.nqueues updates under the LRU lock
    instead of the write lock. They are located on the same cacheline, and
    this is also needed when transitioning to per hash bucket locking.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • The LRU list is protected by its own lock since commit 3ef0eb0db4
    (net: frag, move LRU list maintenance outside of rwlock), and
    no longer by a read_lock.

    This makes it possible to remove the inet_frag_queue that is about
    to be "evicted" from the LRU list head. This avoids the problem of
    several CPUs grabbing the same frag queue.

    Note, we cannot remove the inet_frag_lru_del() call in fq_unlink()
    called by inet_frag_kill(), because inet_frag_kill() is also used in
    other situations. Thus, we use list_del_init() to allow this
    double list_del to work.
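
    The eviction side then looks roughly like this (sketch):

        spin_lock(&nf->lru_lock);
        if (!list_empty(&nf->lru_list)) {
                q = list_first_entry(&nf->lru_list,
                                     struct inet_frag_queue, lru_list);
                /* detach under lru_lock so no other CPU grabs the same
                 * queue; list_del_init() keeps the later
                 * inet_frag_lru_del() in fq_unlink() a harmless no-op */
                list_del_init(&q->lru_list);
                atomic_inc(&q->refcnt);
        }
        spin_unlock(&nf->lru_lock);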

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

25 Mar, 2013

1 commit


19 Mar, 2013

1 commit

  • This patch introduces a constant limit on the lengths of the fragment
    queue hash table bucket lists. Currently the limit of 128 is chosen
    somewhat arbitrarily; it just ensures that we can fill up the fragment
    cache with empty packets up to the default ip_frag_high_thresh limits.
    It should protect against list iteration eating considerable amounts
    of cpu.

    If we reach the maximum length in one hash bucket, a warning is printed.
    This is implemented on the caller side of inet_frag_find to distinguish
    between the different users of inet_fragment.c.

    I dropped the out of memory warning in the ipv4 fragment lookup path,
    because we already get a warning by the slab allocator.
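
    In sketch form, the bounded lookup (close to the idea, not the exact
    code; locking elided):

        #define INETFRAGS_MAXDEPTH 128          /* the new constant limit */

        struct inet_frag_queue *q;
        int depth = 0;

        hlist_for_each_entry(q, &hb->chain, list) {
                if (q->net == nf && f->match(q, key)) {
                        atomic_inc(&q->refcnt);
                        return q;
                }
                depth++;
        }
        if (depth <= INETFRAGS_MAXDEPTH)
                return inet_frag_create(nf, f, key);
        return ERR_PTR(-ENOBUFS);               /* caller prints the warning */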

    Cc: Eric Dumazet
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones, which look like:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
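
    At a typical call site the conversion looks like this (the inet_fragment
    hash chain is used here purely as an illustrative example):

        struct inet_frag_queue *q;
        struct hlist_node *pos;

        /* before: an extra scratch node parameter */
        hlist_for_each_entry(q, pos, &hb->chain, list)
                if (match(q, key))
                        break;

        /* after: same loop, no node parameter */
        hlist_for_each_entry(q, &hb->chain, list)
                if (match(q, key))
                        break;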

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

30 Jan, 2013

3 commits

  • Updating the fragmentation queues' LRU (Least-Recently-Used) list
    required taking the hash writer lock. However, the LRU list isn't
    tied to the hash at all, so we can use a separate lock for it.
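
    In sketch form (the new lock sits next to the list it guards; names
    follow the code of that era):

        struct netns_frags_sketch {
                struct list_head        lru_list;
                spinlock_t              lru_lock;  /* guards lru_list only */
                /* ... */
        };

        static inline void inet_frag_lru_del_sketch(struct inet_frag_queue *q)
        {
                spin_lock(&q->net->lru_lock);
                list_del(&q->lru_list);
                spin_unlock(&q->net->lru_lock);
        }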

    Original-idea-by: Florian Westphal
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • Replace the per network namespace shared atomic "mem" accounting
    variable, in the fragmentation code, with a lib/percpu_counter.

    Getting percpu_counter to scale to the fragmentation code usage
    requires some tweaks.

    At first view, percpu_counter looks superfast, but it does not
    scale on multi-CPU/NUMA machines, because the default batch size
    is too small for the frag code's usage. Thus, I have adjusted the
    batch size by using __percpu_counter_add() directly, instead of
    percpu_counter_sub() and percpu_counter_add().

    The batch size is increased to 130,000, based on the largest 64K
    fragment memory usage. This does introduce some imprecision in the
    memory accounting, but it does not need to be strict for this
    use-case.

    It is also essential, for this to scale, that the percpu_counter
    does not share a cacheline with other writers.
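
    The accounting helpers then look roughly like this
    (__percpu_counter_add() was the batch-taking primitive of that era; it
    has since been renamed percpu_counter_add_batch()):

        #include <linux/percpu_counter.h>

        /* batch sized so a full 64K fragment fits in the per-cpu slack */
        static int frag_percpu_counter_batch = 130000;

        static inline void add_frag_mem_sketch(struct percpu_counter *mem, int i)
        {
                __percpu_counter_add(mem, i, frag_percpu_counter_batch);
        }

        static inline void sub_frag_mem_sketch(struct percpu_counter *mem, int i)
        {
                __percpu_counter_add(mem, -i, frag_percpu_counter_batch);
        }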

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • This change is primarily a preparation to ease the extension of memory
    limit tracking.

    The change does reduce the number of atomic operations during freeing
    of a frag queue. This does introduce some performance improvement, as
    these atomic operations are at the core of the performance problems
    seen on NUMA systems.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

20 Sep, 2012

1 commit


09 Jun, 2012

1 commit


13 Jul, 2010

1 commit


30 Mar, 2010

1 commit

  • Update gfp.h and slab.h includes to prepare for breaking implicit
    slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h, and if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered:
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from the tests on
    step 6, I'm fairly confident about the coverage of this conversion
    patch. If there is a breakage, it's likely to be something in one of
    the arch headers, which should be easily discoverable on most builds
    of the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

27 Feb, 2009

1 commit


26 Jul, 2008

1 commit

  • Removes a legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better with automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuitively, BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON(), though some instances might
    eventually be promoted to BUG_ON(); I left that for the future.

    I could make at least one BUILD_BUG_ON conversion.
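
    The conversion itself is mechanical, modulo the inverted condition,
    since BUG_TRAP(x) fired when x was false (illustrative example):

        /* before */
        BUG_TRAP(qp->last_in & COMPLETE);

        /* after: WARN_ON() triggers on a true condition, hence the '!' */
        WARN_ON(!(qp->last_in & COMPLETE));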

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

28 Jun, 2008

1 commit

  • The problem is that while we work without the inet_frags.lock, even
    read-locked, the secret rebuild timer may fire (on another CPU, since
    BHs are only disabled locally in inet_frag_find) and change the rnd
    seed for ipv4/6 fragments.

    It was caused by my patch fd9e63544cac30a34c951f0ec958038f0529e244
    ([INET]: Omit double hash calculations in xxx_frag_intern) late
    in the 2.6.24 kernel, so this should probably be queued to -stable.
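
    The hazard, in sketch form: the bucket index depends on the seed, so it
    must be derived under the same lock the rebuild timer takes
    (illustrative, not the literal fix):

        write_lock(&f->lock);
        /* recompute here: f->rnd may have been changed by the secret
         * rebuild timer since any earlier hash calculation */
        hash = f->hashfn(qp);
        hlist_add_head(&qp->list, &f->hash[hash]);
        write_unlock(&f->lock);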

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

03 Apr, 2008

1 commit


29 Mar, 2008

2 commits


29 Jan, 2008

9 commits


18 Oct, 2007

5 commits

  • Since we now allocate the queues in inet_fragment.c, we
    can safely free them in the same place. The ->destructor
    callback thus becomes optional for inet_frags.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Since this callback is used to check for conflicts in the
    hashtable when inserting a newly created frag queue, we can
    do the same by matching the queue against the argument that
    was used to create it.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Here we need another callback, ->match, to check whether the
    entry found in the hash matches the key passed. The key used
    is the same as the creation argument for inet_frag_create.

    Yet again, this ->match is the same for netfilter and ipv6.
    Running a few steps forward: this callback will later
    replace the ->equal one.

    Since inet_frag_find() uses the already consolidated
    inet_frag_create(), remove the xxx_frag_create from the
    protocol code.
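
    The shape of such a ->match callback, sketched with ipv4-style fields
    (type and field names are illustrative):

        struct ip4_create_arg_sketch {
                __be32 saddr, daddr;
                __be16 id;
                u8 protocol;
        };

        struct ipq_sketch {
                struct inet_frag_queue q;
                __be32 saddr, daddr;
                __be16 id;
                u8 protocol;
        };

        static int ip4_frag_match_sketch(struct inet_frag_queue *q, void *a)
        {
                struct ipq_sketch *qp = container_of(q, struct ipq_sketch, q);
                struct ip4_create_arg_sketch *arg = a;

                return qp->id == arg->id &&
                       qp->saddr == arg->saddr &&
                       qp->daddr == arg->daddr &&
                       qp->protocol == arg->protocol;
        }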

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This one uses the xxx_frag_intern() and xxx_frag_alloc()
    routines, which are already consolidated, so remove them
    from protocol code (as promised).

    The ->constructor callback is used to init the rest of
    the frag queue and it is the same for netfilter and ipv6.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Just perform the kzalloc() allocation and set up the common
    fields in the inet_frag_queue(). Then return the result
    to the caller to initialize the rest.

    The inet_frag_alloc() may return NULL, so check the
    return value before doing the container_of(). This looks
    ugly, but the xxx_frag_alloc() will be removed soon.

    The xxx_expire() timer callbacks are patched,
    because the argument is now the inet_frag_queue, not
    the protocol specific queue.
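
    The caller-side pattern described above, in sketch form (signature
    simplified; 'struct ipq' stands in for the protocol-private queue type):

        struct inet_frag_queue *q;

        q = inet_frag_alloc(f, arg);    /* kzalloc() + common field setup */
        if (q == NULL)
                return NULL;            /* never container_of() a NULL */

        return container_of(q, struct ipq, q);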

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov