07 Oct, 2012

1 commit

  • Over time, the skb recycling infrastructure got little interest and
    many bugs. Generic rx path skb allocation now uses page
    fragments for efficient GRO / TCP coalescing, and recycling
    a tx skb for the rx path is not worth the pain.

    The last identified bug is that fat skbs can be recycled
    and can end up using high order pages after a few iterations.

    With help from Maxime Bizon, who pointed out that commit
    87151b8689d (net: allow pskb_expand_head() to get maximum tailroom)
    introduced this regression for recycled skbs.

    Instead of fixing this bug, let's remove skb recycling.

    Drivers wanting really hot skbs should use build_skb() anyway,
    to allocate/populate sk_buff right before netif_receive_skb()

    Signed-off-by: Eric Dumazet
    Cc: Maxime Bizon
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Oct, 2012

1 commit


29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Sep, 2012

1 commit

  • We currently use percpu order-0 pages in __netdev_alloc_frag
    to deliver fragments used by __netdev_alloc_skb()

    Depending on the NIC driver and whether the arch is 32 or 64 bit, this
    allows a page to be split into several fragments (between 1 and 8),
    assuming PAGE_SIZE=4096

    Switching to bigger pages (32768 bytes for the PAGE_SIZE=4096 case) allows:

    - Better filling of space (the ending hole overhead is less of an issue)

    - Fewer calls to the page allocator and fewer accesses to page->_count

    - Possible future struct skb_shared_info changes without major
    performance impact.

    This patch implements a transparent fallback to smaller
    pages in case of memory pressure.

    It also uses a standard "struct page_frag" instead of a custom one.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Benjamin LaHaise
    Signed-off-by: David S. Miller

    Eric Dumazet
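The big-chunk-with-fallback idea above can be sketched in plain userspace C. This is a minimal sketch, not the kernel code: frag_cache, frag_alloc and frag_cache_refill are made-up names, and malloc() stands in for the page allocator (the kernel version refcounts pages rather than leaking the old chunk).

```c
#include <stdlib.h>
#include <stddef.h>

#define FRAG_PAGE_SIZE 4096
#define FRAG_MAX_ORDER 3            /* 4096 << 3 == 32768 bytes */

struct frag_cache {
    char   *chunk;                  /* current backing chunk */
    size_t  size;                   /* size of that chunk */
    size_t  offset;                 /* next free byte */
};

/* Refill the cache, preferring the biggest chunk available and
 * transparently falling back to smaller ones under memory pressure. */
static int frag_cache_refill(struct frag_cache *fc)
{
    int order;

    for (order = FRAG_MAX_ORDER; order >= 0; order--) {
        size_t size = (size_t)FRAG_PAGE_SIZE << order;
        char *chunk = malloc(size); /* stand-in for alloc_pages() */

        if (chunk) {
            fc->chunk  = chunk;
            fc->size   = size;
            fc->offset = 0;
            return 0;
        }
    }
    return -1;                      /* out of memory at every order */
}

/* Carve a fragment out of the current chunk, refilling if needed.
 * Sketch only: the old chunk is leaked here, the kernel refcounts it. */
void *frag_alloc(struct frag_cache *fc, size_t len)
{
    if (!fc->chunk || fc->offset + len > fc->size)
        if (frag_cache_refill(fc))
            return NULL;

    if (len > fc->size)
        return NULL;                /* fragment can never fit */

    void *frag = fc->chunk + fc->offset;
    fc->offset += len;
    return frag;
}
```

Successive fragments come from the same chunk, so the per-fragment cost is mostly pointer arithmetic until the chunk is exhausted.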
     

25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write()s into
    single segments in skbs still in the write queue (not yet sent)

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page

    It's also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
    the page allocator more often than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on loopback device,
    but also benefits other network devices, since 8x fewer frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG-enabled hardware can't cope with bigger fragments,
    but their ndo_start_xmit() should already handle this by splitting a
    fragment into sub-fragments, since some arches have PAGE_SIZE=65536

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
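The "16 pages per skb" and "8x fewer frags" figures above are simple arithmetic; a tiny sketch (frags_needed is a made-up name, not a kernel helper):

```c
#include <stddef.h>

/* How many page fragments a payload needs for a given fragment size. */
size_t frags_needed(size_t payload, size_t frag_size)
{
    return (payload + frag_size - 1) / frag_size;   /* round up */
}
```

A 64 KB TSO packet needs 16 fragments at 4096 bytes each, but only 2 at 32768 bytes, hence 8x fewer map/unmap operations per packet.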
     

20 Sep, 2012

1 commit


01 Aug, 2012

1 commit

  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
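The drop decision described above can be sketched as a two-flag test. This is an illustrative userspace model, not the kernel code: rx_skb, rx_sock and the memalloc field are stand-ins for skb->pfmemalloc and the socket's page-reclaim role.

```c
/* A pfmemalloc skb borrowed memory from the emergency reserve; only
 * sockets that themselves help page reclaim may consume it. */
struct rx_skb  { int pfmemalloc; };   /* allocated from the reserve? */
struct rx_sock { int memalloc;   };   /* socket aids page reclaim?   */

/* Drop the packet so reserve memory stays available for reclaim. */
int should_drop(const struct rx_skb *skb, const struct rx_sock *sk)
{
    return skb->pfmemalloc && !sk->memalloc;
}
```

Protocols such as TCP retransmit, which is why this packet loss is recoverable.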
     

23 Jul, 2012

2 commits


20 Jul, 2012

1 commit


19 Jul, 2012

1 commit


16 Jul, 2012

1 commit


13 Jul, 2012

1 commit

  • This patch is meant to help improve performance by reducing the number of
    locked operations required to allocate a frag on x86 and other platforms.
    This is accomplished by using atomic_set operations on the page count
    instead of calling get_page and put_page. It is based on work originally
    provided by Eric Dumazet.

    In addition it also helps to reduce memory overhead when using TCP. This
    is done by recycling the page if the only holder of the frame is the
    netdev_alloc_frag call itself. This can occur when skb heads are stolen by
    either GRO or TCP and the driver providing the packets is using paged frags
    to store all of the data for the packets.

    Cc: Eric Dumazet
    Signed-off-by: Alexander Duyck
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
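The trick of replacing per-fragment get_page()/put_page() with a single atomic_set can be sketched in userspace C11. This is a sketch under stated assumptions: stdatomic stands in for page->_count, and frag_page, pagecnt_bias and the batch size are illustrative names, not the kernel's exact fields.

```c
#include <stdatomic.h>

#define FRAG_REF_BATCH 65536

struct frag_page {
    atomic_int refcount;     /* stand-in for page->_count            */
    int        pagecnt_bias; /* references we still own locally      */
};

/* Take a large batch of references with ONE atomic store, done while
 * we are the sole owner of the page. */
void frag_page_charge(struct frag_page *p)
{
    atomic_store(&p->refcount, FRAG_REF_BATCH);
    p->pagecnt_bias = FRAG_REF_BATCH;
}

/* Hand one pre-taken reference to a consumer: no locked op needed. */
void frag_page_get(struct frag_page *p)
{
    p->pagecnt_bias--;
}

/* A consumer releasing its reference (the normal put_page path). */
void frag_page_put(struct frag_page *p)
{
    atomic_fetch_sub(&p->refcount, 1);
}

/* The page can be recycled only when every reference we handed out
 * has come back, i.e. we are the sole holder again. */
int frag_page_recyclable(struct frag_page *p)
{
    return atomic_load(&p->refcount) == p->pagecnt_bias;
}
```

The recycle case matters for TCP/GRO, where the stack steals the skb head and releases the page reference quickly, leaving the allocator as sole holder.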
     

11 Jul, 2012

1 commit


05 Jul, 2012

1 commit


04 Jul, 2012

1 commit

  • Pull block bits from Jens Axboe:
    "As vacation is coming up, thought I'd better get rid of my pending
    changes in my for-linus branch for this iteration. It contains:

    - Two patches for mtip32xx. Killing a non-compliant sysfs interface
    and moving it to debugfs, where it belongs.

    - A few patches from Asias. Two legit bug fixes, and one killing an
    interface that is no longer in use.

    - A patch from Jan, making the annoying partition ioctl warning a bit
    less annoying, by restricting it to !CAP_SYS_RAWIO only.

    - Three bug fixes for drbd from Lars Ellenberg.

    - A fix for an old regression for umem, it hasn't really worked since
    the plugging scheme was changed in 3.0.

    - A few fixes from Tejun.

    - A splice fix from Eric Dumazet, fixing an issue with pipe
    resizing."

    * 'for-linus' of git://git.kernel.dk/linux-block:
    scsi: Silence unnecessary warnings about ioctl to partition
    block: Drop dead function blk_abort_queue()
    block: Mitigate lock unbalance caused by lock switching
    block: Avoid missed wakeup in request waitqueue
    umem: fix up unplugging
    splice: fix racy pipe->buffers uses
    drbd: fix null pointer dereference with on-congestion policy when diskless
    drbd: fix list corruption by failing but already aborted reads
    drbd: fix access of unallocated pages and kernel panic
    xen/blkfront: Add WARN to deal with misbehaving backends.
    blkcg: drop local variable @q from blkg_destroy()
    mtip32xx: Create debugfs entries for troubleshooting
    mtip32xx: Remove 'registers' and 'flags' from sysfs
    blkcg: fix blkg_alloc() failure path
    block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
    block: fix return value on cfq_init() failure
    mtip32xx: Remove version.h header file inclusion
    xen/blkback: Copy id field when doing BLKIF_DISCARD.

    Linus Torvalds
     

14 Jun, 2012

1 commit

  • Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
    by splice_shrink_spd() called from vmsplice_to_pipe()

    commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes)
    added capability to adjust pipe->buffers.

    The problem is that some paths don't hold the pipe mutex and assume
    pipe->buffers doesn't change for their duration.

    Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
    use it in place of pipe->buffers where appropriate.

    splice_shrink_spd() loses its struct pipe_inode_info argument.

    Reported-by: Dave Jones
    Signed-off-by: Eric Dumazet
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Tom Herbert
    Cc: stable # 2.6.35
    Tested-by: Dave Jones
    Signed-off-by: Jens Axboe

    Eric Dumazet
     

13 Jun, 2012

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/wireless/iwlwifi/pcie/trans.c

    The iwlwifi conflict was resolved by keeping the code added
    in 'net' that turns off the buggy chip feature.

    The MAINTAINERS conflict was merely overlapping changes, one
    change updated all the wireless web site URLs and the other
    changed some GIT trees to be Johannes's instead of John's.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Jun, 2012

1 commit

  • Fix kernel-doc warnings in net/core:

    Warning(net/core/skbuff.c:3368): No description found for parameter 'delta_truesize'
    Warning(net/core/filter.c:628): No description found for parameter 'pfp'
    Warning(net/core/filter.c:628): Excess function parameter 'sk' description in 'sk_unattached_filter_create'

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
     

08 Jun, 2012

1 commit


20 May, 2012

1 commit

  • Move tcp_try_coalesce() protocol independent part to
    skb_try_coalesce().

    skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
    to build optimized skbs (fewer sk_buffs, and possibly fewer 'headers')

    skb_try_coalesce() is zero copy, unless the copy can fit in the
    destination header (it's a rare case)

    kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
    because IPv6 will need it in patch (ipv6: use skb coalescing in
    reassembly).

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
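The copy-only-if-it-fits behaviour can be sketched like this. struct buf and try_coalesce_sketch are made-up userspace stand-ins for sk_buff and skb_try_coalesce, reduced to the core decision:

```c
#include <string.h>
#include <stddef.h>

struct buf {
    char   data[64];
    size_t len;       /* bytes used in the linear area     */
    size_t size;      /* capacity of the linear area       */
    int    nr_frags;  /* fragments attached (by reference) */
};

/* Small payloads are copied into the spare tail room (the rare copy
 * case); anything bigger is attached by reference, i.e. zero copy. */
int try_coalesce_sketch(struct buf *to, const char *from, size_t len)
{
    if (len <= to->size - to->len) {
        memcpy(to->data + to->len, from, len);   /* fits: copy */
        to->len += len;
        return 1;
    }
    to->nr_frags++;                              /* zero copy ref */
    return 1;
}
```

Either way the source sk_buff can be freed (or partially freed via kfree_skb_partial), which is what shortens receive queues.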
     

19 May, 2012

1 commit

  • Fix two issues introduced in commit a1c7fff7e18f5
    (net: netdev_alloc_skb() use build_skb())

    - Must be IRQ safe (non NAPI drivers can use it)
    - Must not leak the frag if build_skb() fails to allocate sk_buff

    This patch introduces netdev_alloc_frag() for drivers willing to
    use build_skb() instead of __netdev_alloc_skb() variants.

    Factorize code so that:
    __dev_alloc_skb() is a wrapper around __netdev_alloc_skb(), and
    dev_alloc_skb() a wrapper around netdev_alloc_skb()

    Use __GFP_COLD flag.

    Almost all network drivers now benefit from skb->head_frag
    infrastructure.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 May, 2012

1 commit

  • netdev_alloc_skb() is used by network drivers in their RX path to
    allocate an skb to receive an incoming frame.

    With recent skb->head_frag infrastructure, it makes sense to change
    netdev_alloc_skb() to use build_skb() and a frag allocator.

    This permits a zero copy splice(socket->pipe), and better GRO or TCP
    coalescing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 May, 2012

1 commit

  • Use the current logging style.

    This enables use of dynamic debugging as well.

    Convert printk(KERN_ to pr_.
    Add pr_fmt. Remove embedded prefixes, use
    %s, __func__ instead.

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

16 May, 2012

1 commit


07 May, 2012

3 commits

  • With the recent changes for how we compute the skb truesize, it occurs
    to me we are probably going to have a lot of calls to skb_end_pointer -
    skb->head. Instead of running all over the place doing that, it makes
    more sense to make it a separate inline, skb_end_offset(skb); that way
    we can return the correct value without gcc having to do all the
    optimization to cancel out skb->head - skb->head.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
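The helper is just the pointer subtraction made explicit; a userspace sketch (fake_skb and skb_end_offset_sketch are illustrative names):

```c
#include <stddef.h>

/* Models skb_end_pointer(skb) - skb->head as a single inline helper,
 * so callers get the offset directly instead of subtracting pointers. */
struct fake_skb { char *head; char *end; };

size_t skb_end_offset_sketch(const struct fake_skb *skb)
{
    return (size_t)(skb->end - skb->head);
}
```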
     
  • Since there is now only one spot that actually uses "fastpath" there isn't
    much point in carrying it. Instead we can just use a check for skb_cloned
    to verify if we can perform the fast-path free for the head or not.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • The fast-path for pskb_expand_head contains a check where the size plus the
    unaligned size of skb_shared_info is compared against the size of the data
    buffer. This code path has two issues. First is the fact that after the
    recent changes by Eric Dumazet to __alloc_skb and build_skb the shared info
    is always placed in the optimal spot for a buffer size making this check
    unnecessary. The second issue is the fact that the check doesn't take into
    account the aligned size of shared info. As a result the code burns cycles
    doing a memcpy with nothing actually being shifted.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     

04 May, 2012

2 commits

  • This patch adds support for a skb_head_is_locked helper function. It is
    meant to be used any time we are considering transferring the head from
    skb->head to a paged frag. If the head is locked it means we cannot remove
    the head from the skb so it must be copied or we must take the skb as a
    whole.

    Signed-off-by: Alexander Duyck
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
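Based on the description above, the helper presumably reduces to a two-flag test; a sketch under that assumption (head_state and head_is_locked_sketch are made-up names):

```c
/* The head cannot be transferred to a paged frag if it is not itself
 * a page fragment, or if clones still reference it; then it must be
 * copied, or the skb taken as a whole. */
struct head_state { int head_frag; int cloned; };

int head_is_locked_sketch(const struct head_state *s)
{
    return !s->head_frag || s->cloned;
}
```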
     
  • GRO is very optimistic in skb truesize estimates, only taking into
    account the used part of fragments.

    Be conservative, and use more precise computation, so that bloated GRO
    skbs can be collapsed eventually.

    Signed-off-by: Eric Dumazet
    Cc: Alexander Duyck
    Cc: Jeff Kirsher
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 May, 2012

1 commit

  • This change is meant to prevent stealing the skb->head to use as a page in
    the event that the skb->head was cloned. This allows the other clones to
    track each other via shinfo->dataref.

    Without this we break down to two methods for tracking the reference count,
    one being dataref, the other being the page count. As a result it becomes
    difficult to track how many references there are to skb->head.

    Signed-off-by: Alexander Duyck
    Cc: Eric Dumazet
    Cc: Jeff Kirsher
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alexander Duyck
     

01 May, 2012

3 commits

  • __skb_splice_bits() can check if skb to be spliced has its skb->head
    mapped to a page fragment, instead of a kmalloc() area.

    If so we can avoid a copy of the skb head and get a reference on
    underlying page.

    Signed-off-by: Eric Dumazet
    Cc: Ilpo Järvinen
    Cc: Herbert Xu
    Cc: Maciej Żenczykowski
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Jeff Kirsher
    Cc: Ben Hutchings
    Cc: Matt Carlson
    Cc: Michael Chan
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • GRO can check if skb to be merged has its skb->head mapped to a page
    fragment, instead of a kmalloc() area.

    We 'upgrade' skb->head to a fragment in itself

    This avoids the frag_list fallback, and permits building a true GRO skb
    (one sk_buff and up to 16 fragments), using less memory.

    This reduces the number of cache misses when the user makes a copy,
    since a single sk_buff is fetched.

    This is a followup of patch "net: allow skb->head to be a page fragment"

    Signed-off-by: Eric Dumazet
    Cc: Ilpo Järvinen
    Cc: Herbert Xu
    Cc: Maciej Żenczykowski
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Jeff Kirsher
    Cc: Ben Hutchings
    Cc: Matt Carlson
    Cc: Michael Chan
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • skb->head is currently allocated from kmalloc(). This is convenient but
    has the drawback that the data cannot be converted to a page fragment
    if needed.

    We have three spots where it hurts:

    1) GRO aggregation

    When a linear skb must be appended to another skb, GRO uses the
    frag_list fallback, which is very inefficient since we keep all the
    struct sk_buffs around. So drivers enabling GRO but delivering linear
    skbs to the network stack aren't getting full GRO power.

    2) splice(socket -> pipe).

    We must copy the linear part to a page fragment.
    This kind of defeats splice()'s purpose (the zero copy claim)

    3) TCP coalescing.

    Recently introduced, this permits grouping several contiguous segments
    into a single skb. This shortens queue lengths, saves kernel memory,
    and greatly reduces the probability of TCP collapses. This coalescing
    doesn't work on linear skbs (or we would need to copy data, which
    would be too slow)

    Given all these issues, the following patch introduces the possibility
    of having skb->head be a fragment in itself. We use a new skb flag,
    skb->head_frag to carry this information.

    build_skb() is changed to accept a frag_size argument. Drivers willing
    to provide a page fragment instead of kmalloc() data will pass a
    non-zero value, set to the fragment size.

    Then, in situations where we need to convert the skb head to a frag in
    itself, we can check if skb->head_frag is set and avoid the copies or
    various fallbacks we have.

    This means drivers currently using frags could be updated to avoid the
    current skb->head allocation and reduce their memory footprint (aka skb
    truesize); that's 512 or 1024 bytes saved per skb. This also makes
    bpf/netfilter faster since the 'first frag' will be part of the skb
    linear part, with no need to copy data.

    Signed-off-by: Eric Dumazet
    Cc: Ilpo Järvinen
    Cc: Herbert Xu
    Cc: Maciej Żenczykowski
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Cc: Jeff Kirsher
    Cc: Ben Hutchings
    Cc: Matt Carlson
    Cc: Michael Chan
    Signed-off-by: David S. Miller

    Eric Dumazet
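The frag_size convention described above can be sketched as follows. skb_sketch and build_skb_sketch are illustrative, not the kernel API; the point is only the zero-vs-non-zero signalling:

```c
#include <stddef.h>

struct skb_sketch {
    void    *head;
    unsigned head_frag : 1;   /* head lives in a page fragment */
};

/* frag_size == 0 means data came from kmalloc(); a non-zero value
 * means the caller handed us a page fragment of that size, so the
 * head may later be 'upgraded' to a frag without copying. */
struct skb_sketch build_skb_sketch(void *data, size_t frag_size)
{
    struct skb_sketch skb;

    skb.head = data;
    skb.head_frag = frag_size != 0;
    return skb;
}
```

Consumers like GRO, splice() and TCP coalescing then check head_frag to decide between taking a page reference and falling back to a copy.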
     

24 Apr, 2012

3 commits


22 Apr, 2012

1 commit


20 Apr, 2012

1 commit


16 Apr, 2012

1 commit