19 Oct, 2011

1 commit

  • To ease skb->truesize sanitization, it's better to be able to localize
    all references to skb frags size.

    Define accessors : skb_frag_size() to fetch frag size, and
    skb_frag_size_{set|add|sub}() to manipulate it.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
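
The accessor pattern described above can be sketched in plain C. This is a minimal illustration only: `struct frag` and the `frag_size*` names are hypothetical stand-ins, not the kernel's `skb_frag_t` or its real accessors.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a paged fragment descriptor. */
struct frag {
    uint32_t page_offset;
    uint32_t size;
};

/* All reads and writes of the size go through these helpers, so the
 * field can later be changed or audited in one place. */
static inline unsigned int frag_size(const struct frag *f)
{
    return f->size;
}

static inline void frag_size_set(struct frag *f, unsigned int size)
{
    f->size = size;
}

static inline void frag_size_add(struct frag *f, int delta)
{
    f->size += delta;
}

static inline void frag_size_sub(struct frag *f, int delta)
{
    f->size -= delta;
}
```

Callers that used to touch the size field directly would call these helpers instead, which is what makes every reference easy to locate.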
     

25 Aug, 2011

1 commit


07 Dec, 2010

1 commit


24 Oct, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
    bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
    vlan: Calling vlan_hwaccel_do_receive() is always valid.
    tproxy: use the interface primary IP address as a default value for --on-ip
    tproxy: added IPv6 support to the socket match
    cxgb3: function namespace cleanup
    tproxy: added IPv6 support to the TPROXY target
    tproxy: added IPv6 socket lookup function to nf_tproxy_core
    be2net: Changes to use only priority codes allowed by f/w
    tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
    tproxy: added tproxy sockopt interface in the IPV6 layer
    tproxy: added udp6_lib_lookup function
    tproxy: added const specifiers to udp lookup functions
    tproxy: split off ipv6 defragmentation to a separate module
    l2tp: small cleanup
    nf_nat: restrict ICMP translation for embedded header
    can: mcp251x: fix generation of error frames
    can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
    can-raw: add msg_flags to distinguish local traffic
    9p: client code cleanup
    rds: make local functions/variables static
    ...

    Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
    drivers/net/wireless/ath/ath9k/debug.c as per David

    Linus Torvalds
     

07 Sep, 2010

2 commits

  • This patch adds a tracepoint to consume_skb and adds trace_kfree_skb
    before __kfree_skb in skb_free_datagram_locked and net_tx_action.
    Combined with the tracepoint on dev_hard_start_xmit, we can check
    how long it takes to free transmitted packets. Using that, we can
    calculate how many packets the driver held at a given time. This is
    useful when dropped transmitted packets are a problem.

    sshd-6828 [000] 112689.258154: consume_skb: skbaddr=f2d99bb8

    Signed-off-by: Koki Sanagi
    Acked-by: David S. Miller
    Acked-by: Neil Horman
    Cc: Mathieu Desnoyers
    Cc: Kaneshige Kenji
    Cc: Izumo Taku
    Cc: Kosaki Motohiro
    Cc: Lai Jiangshan
    Cc: Scott Mcmillan
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Koki Sanagi
     
  • No need to test twice sk->sk_shutdown & RCV_SHUTDOWN

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Jul, 2010

1 commit


27 May, 2010

1 commit

  • This new sock lock primitive was introduced to speed up some user context
    socket manipulation. But it is unsafe when two threads contend for the
    socket, one using regular lock_sock/release_sock, the other using
    lock_sock_bh/unlock_sock_bh.

    This patch changes lock_sock_bh to be careful against 'owned' state.
    If owned is found to be set, we must take the slow path.
    lock_sock_bh() now returns a boolean to say if the slow path was taken,
    and this boolean is used at unlock_sock_bh time to call the appropriate
    unlock function.

    After this change, BH are either disabled or enabled during the
    lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
    so we rename these functions to lock_sock_fast()/unlock_sock_fast().

    Reported-by: Anton Blanchard
    Signed-off-by: Eric Dumazet
    Tested-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Eric Dumazet
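
The fast/slow locking protocol described above can be modelled in userspace. This is a hedged sketch, not kernel code: `struct model_sock` and the `model_*` functions are hypothetical, a pthread mutex stands in for the socket spinlock, and a condition variable stands in for sleeping on the owner.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model of a socket lock. */
struct model_sock {
    pthread_mutex_t slock;  /* stands in for the socket spinlock */
    pthread_cond_t  wq;     /* waiters for ownership */
    bool            owned;  /* set while a process-context owner holds it */
};

/* Regular (slow) lock: take ownership, sleeping if someone else has it. */
static void model_lock_sock(struct model_sock *sk)
{
    pthread_mutex_lock(&sk->slock);
    while (sk->owned)
        pthread_cond_wait(&sk->wq, &sk->slock);
    sk->owned = true;
    pthread_mutex_unlock(&sk->slock);
}

static void model_release_sock(struct model_sock *sk)
{
    pthread_mutex_lock(&sk->slock);
    sk->owned = false;
    pthread_cond_signal(&sk->wq);
    pthread_mutex_unlock(&sk->slock);
}

/* Fast lock: returns true if the slow path had to be taken. */
static bool model_lock_sock_fast(struct model_sock *sk)
{
    pthread_mutex_lock(&sk->slock);
    if (!sk->owned)
        return false;               /* fast path: keep holding slock */
    while (sk->owned)               /* slow path: wait for the owner */
        pthread_cond_wait(&sk->wq, &sk->slock);
    sk->owned = true;
    pthread_mutex_unlock(&sk->slock);
    return true;
}

/* The boolean from model_lock_sock_fast() selects the matching unlock. */
static void model_unlock_sock_fast(struct model_sock *sk, bool slow)
{
    if (slow)
        model_release_sock(sk);
    else
        pthread_mutex_unlock(&sk->slock);
}
```

The key design point is that the caller must carry the returned boolean to the unlock call, because the two paths leave the lock in different states.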
     

04 May, 2010

1 commit

  • Commit 4b0b72f7dd617b ( net: speedup udp receive path )
    introduced a bug in skb_free_datagram_locked().

    We should not skb_orphan() the skb if we don't have the guarantee that we
    are the last skb user; this might happen with concurrent MSG_PEEK users.

    To keep socket locked for the smallest period of time, we split
    consume_skb() logic, inlined in skb_free_datagram_locked()

    Reported-by: Stephen Hemminger
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
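
A rough model of the rule "only the last user may orphan and free" follows. The types and names (`struct model_skb`, `model_free_datagram_locked`) are hypothetical, and the locked section is only marked by comments rather than implemented.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted buffer. */
struct model_skb {
    atomic_int users;    /* reference count */
    bool       orphaned; /* detached from its owning socket */
    bool       freed;
};

/* Drop one reference; returns true only for the last user. */
static bool model_skb_put(struct model_skb *skb)
{
    return atomic_fetch_sub(&skb->users, 1) == 1;
}

static void model_free_datagram_locked(struct model_skb *skb)
{
    if (!model_skb_put(skb))
        return;             /* e.g. a MSG_PEEK user still holds the buffer */

    /* --- socket locked section would begin here --- */
    skb->orphaned = true;   /* return memory accounting to the socket */
    /* --- socket locked section would end here --- */

    skb->freed = true;      /* the actual free happens outside the lock */
}
```

With two users, the first call only drops a reference; the buffer is orphaned and freed on the second call.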
     

29 Apr, 2010

1 commit

  • Since commit 95766fff ([UDP]: Add memory accounting.),
    each received packet needs one extra lock_sock()/release_sock() pair.

    This added latency because of possible backlog handling. Then later,
    ticket spinlocks added yet another latency source in case of DDOS.

    This patch introduces lock_sock_bh() and unlock_sock_bh()
    synchronization primitives, avoiding one atomic operation and backlog
    processing.

    skb_free_datagram_locked() uses them instead of full blown
    lock_sock()/release_sock(). skb is orphaned inside locked section for
    proper socket memory reclaim, and finally freed outside of it.

    The UDP receive path now takes the socket spinlock only once.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Apr, 2010

1 commit

  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
            return sk->sk_sleep;
    }

    Change all read occurrences of sk_sleep by a call to this function.

    Needed for a future RCU conversion. sk_sleep won't be a field directly
    available.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

06 Nov, 2009

1 commit


31 Oct, 2009

1 commit

  • On UDP sockets, we must call skb_free_datagram() with socket locked,
    or risk sk_forward_alloc corruption. This requirement is not respected
    in SUNRPC.

    Add a convenient helper, skb_free_datagram_locked() and use it in SUNRPC

    Reported-by: Francis Moreau
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2009

1 commit

  • - skb_kill_datagram() can increment sk->sk_drops itself, not callers.

    - UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Aug, 2009

1 commit

  • skb allocation / consumption tracer - Add consumption tracepoint

    This patch adds a tracepoint to skb_copy_datagram_iovec, which is called each
    time a userspace process copies a frame from a socket receive queue to a user
    space buffer. It allows us to hook in and examine each sk_buff that the system
    receives on a per-socket basis, and can be used to compile a list of which skb's
    were received by which processes.

    Signed-off-by: Neil Horman

    include/trace/events/skb.h | 20 ++++++++++++++++++++
    net/core/datagram.c | 3 +++
    2 files changed, 23 insertions(+)
    Signed-off-by: David S. Miller

    Neil Horman
     

10 Jul, 2009

1 commit

  • Add a memory barrier after the poll_wait function, paired with the
    receive callbacks. Add the functions sock_poll_wait and sk_has_sleeper
    to wrap the memory barrier.

    Without the memory barrier, the following race can happen.
    The race fires when the following code paths meet and the tp->rcv_nxt
    and __add_wait_queue updates stay in CPU caches.

    CPU1                         CPU2

    sys_select                   receive packet
      ...                        ...
      __add_wait_queue           update tp->rcv_nxt
      ...                        ...
      tp->rcv_nxt check          sock_def_readable
      ...                        {
      schedule                      ...
                                    if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                            wake_up_interruptible(sk->sk_sleep)
                                    ...
                                 }

    If there were no cache the code would work OK, since the wait_queue and
    rcv_nxt are opposite to each other.

    Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
    passed the tp->rcv_nxt check and sleeps, or will get the new value for
    tp->rcv_nxt and will return with new data mask.
    In both cases the process (CPU1) is being added to the wait queue, so the
    waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

    The bad case is when the __add_wait_queue changes done by CPU1 stay in its
    cache, and so does the tp->rcv_nxt update on the CPU2 side. CPU1 will then
    end up calling schedule and sleep forever if there is no more data on the
    socket.

    Calls to poll_wait in the following modules were omitted:
    net/bluetooth/af_bluetooth.c
    net/irda/af_irda.c
    net/irda/irnet/irnet_ppp.c
    net/mac80211/rc80211_pid_debugfs.c
    net/phonet/socket.c
    net/rds/af_rds.c
    net/rfkill/core.c
    net/sunrpc/cache.c
    net/sunrpc/rpc_pipe.c
    net/tipc/socket.c

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
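
The pairing of barriers described above follows the classic "store, full barrier, load" pattern on both sides. Below is a hedged C11 sketch of that pattern; `waiter_queued`, `rcv_nxt`, `model_poll` and `model_receive` are hypothetical names, and the fences only play the role that the commit assigns to sock_poll_wait and sk_has_sleeper.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool waiter_queued; /* stands in for a non-empty wait queue */
static atomic_int  rcv_nxt;       /* stands in for tp->rcv_nxt */

/* CPU1 side: queue ourselves, full barrier, then check for new data.
 * Returns true if new data is visible, so the caller need not sleep. */
static bool model_poll(int last_seen)
{
    atomic_store_explicit(&waiter_queued, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&rcv_nxt, memory_order_relaxed) != last_seen;
}

/* CPU2 side: publish new data, full barrier, then check for sleepers.
 * Returns true if a wakeup must be issued. */
static bool model_receive(int new_rcv_nxt)
{
    atomic_store_explicit(&rcv_nxt, new_rcv_nxt, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&waiter_queued, memory_order_relaxed);
}
```

With both fences in place, at least one side observes the other's store: either the poller sees the new data, or the receiver sees the queued waiter and wakes it. Removing either fence reopens the lost-wakeup window.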
     

09 Jun, 2009

1 commit


08 Jun, 2009

1 commit

  • I am working on enabling UFO between KVM guests using virtio-net and I have
    some patches that I got working with 2.6.30-rc8. When I wanted to try them
    with net-next-2.6, I noticed that virtio-net is not working with that tree.

    After some debugging, it turned out to be several bugs in the recent patches
    to fix aio with tun driver, specifically the following 2 commits.

    http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commitdiff;h=0a1ec07a67bd8b0033dace237249654d015efa21
    http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commitdiff;h=6f26c9a7555e5bcca3560919db9b852015077dae

    Fix the call to memcpy_from_iovecend() in skb_copy_datagram_from_iovec
    to pass the right iovec offset.

    Signed-off-by: Sridhar Samudrala
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Sridhar Samudrala
     

09 May, 2009

1 commit

  • Commit ead2ceb0ec9f85cff19c43b5cdb2f8a054484431 ("Network Drop Monitor:
    Adding kfree_skb_clean for non-drops and modifying end-of-line points
    for skbs") established new conventions for identifying dropped packets.

    Align skb_kill_datagram() with these conventions so that packets that
    get dropped just before the copy to userspace are properly tracked.

    Signed-off-by: John Dykstra
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Dykstra
     

30 Apr, 2009

1 commit


28 Apr, 2009

1 commit

  • In 2.6.25 we added UDP mem accounting.

    This unfortunately added a penalty when a frame is transmitted, since
    at TX completion time we have to call sock_wfree() to perform the necessary
    memory accounting. This calls sock_def_write_space() and ultimately the
    scheduler if any thread is waiting on the socket.
    Threads waiting for an incoming frame were scheduled, then had to sleep
    again as the event was meaningless.

    (All threads waiting on a socket are using same sk_sleep anchor)

    This adds a lot of extra wakeups and increases latencies, as noted
    by Christoph Lameter, and slows down the softirq handler.

    Reference : http://marc.info/?l=linux-netdev&m=124060437012283&w=2

    Fortunately, Davide Libenzi recently added the concept of keyed wakeups
    into the kernel, and particularly for sockets (see commit
    37e5540b3c9d838eb20f2ca8ea2eb8072271e403
    epoll keyed wakeups: make sockets use keyed wakeups)

    Davide's goal was to optimize epoll, but this new wakeup infrastructure
    can help non-epoll users as well, if they care to set up an appropriate
    handler.

    This patch introduces a new DEFINE_WAIT_FUNC() helper and uses it
    in wait_for_packet(), so that only a relevant event can wake up a thread
    blocked in this function.

    Trace of function calls from bnx2 TX completion bnx2_poll_work() is :
    __kfree_skb()
    skb_release_head_state()
    sock_wfree()
    sock_def_write_space()
    __wake_up_sync_key()
    __wake_up_common()
    receiver_wake_function() : Stops here since thread is waiting for an INPUT

    Reported-by: Christoph Lameter
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
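
The filtering idea behind a keyed wake function can be sketched as a small predicate. This is an illustrative model only: `model_receiver_should_wake` is a hypothetical name, the event bits come from the userspace `<poll.h>` header, and treating a zero key as "no information, wake anyway" is an assumption of this sketch.

```c
#include <poll.h>
#include <stdbool.h>

/* Decide whether a thread that is waiting for input should be woken
 * for a wakeup carrying the given event key. */
static bool model_receiver_should_wake(unsigned long key)
{
    /* A non-zero key that carries neither input nor error events is of
     * no interest to a receiver, e.g. a write-space notification. */
    if (key && !(key & (POLLIN | POLLERR)))
        return false;
    return true;
}
```

In the call trace above, a TX completion would present a write-space key, so the receiver stays asleep instead of being scheduled for nothing.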
     

21 Apr, 2009

2 commits

  • aio_write gets const struct iovec * but tun_chr_aio_write casts this to struct
    iovec * and modifies the iovec. As a result, attempts to use io_submit
    to send packets to a tun device fail with weird errors such as EINVAL.

    Since tun is the only user of skb_copy_datagram_from_iovec, we can
    fix this simply by changing the latter so that it does not
    touch the iovec passed to it.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
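
Copying out of an iovec array without modifying it amounts to walking the segments with a local offset. The sketch below shows one way to do that in userspace; `model_copy_from_iovec_at` is a hypothetical helper, not the kernel function named in these commits.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Copy len bytes into dst, starting offset bytes into the iovec array.
 * The iovec entries are only read, never updated.
 * Returns 0 on success, -1 if the iovec holds fewer bytes than requested. */
static int model_copy_from_iovec_at(void *dst, const struct iovec *iov,
                                    int iovcnt, size_t offset, size_t len)
{
    unsigned char *out = dst;

    for (int i = 0; i < iovcnt && len > 0; i++) {
        if (offset >= iov[i].iov_len) {
            offset -= iov[i].iov_len;   /* skip this segment entirely */
            continue;
        }
        size_t n = iov[i].iov_len - offset;
        if (n > len)
            n = len;
        memcpy(out, (const unsigned char *)iov[i].iov_base + offset, n);
        out += n;
        len -= n;
        offset = 0;                     /* later segments start at byte 0 */
    }
    return len == 0 ? 0 : -1;
}
```

Because the position is tracked in local variables, the caller can pass a `const` iovec and reuse it afterwards, which is the property the fix above relies on.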
     
  • There's an skb_copy_datagram_iovec() to copy out of a paged skb,
    but it modifies the iovec, and does not support starting
    at an offset in the destination. We want both in tun.c, so let's
    add the function.

    It's a carbon copy of skb_copy_datagram_iovec() with enough changes to
    be annoying.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     

14 Mar, 2009

1 commit


05 Nov, 2008

1 commit

  • I noticed a contention on udp_memory_allocated on regular UDP applications.

    While tcp_memory_allocated is seldom used, it appears each incoming UDP frame
    is currently touching udp_memory_allocated when queued, and when received by
    application.

    One possible solution is to use sk_mem_reclaim_partial() instead of
    sk_mem_reclaim(), so that we keep a small reserve (less than one page)
    of memory for each UDP socket.

    We did something very similar on TCP side in commit
    9993e7d313e80bdc005d09c7def91903e0068f07
    ([TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer())

    A more complex solution would need to convert prot->memory_allocated to
    use a percpu_counter with batches of 64 or 128 pages.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
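
The difference between a full and a partial reclaim can be illustrated with a toy model. Everything here is an assumption made for illustration: the 4096-byte quantum, the structure layout, the thresholds and the `model_*` names are not taken from the kernel sources.

```c
#define MODEL_QUANTUM 4096  /* assumed accounting unit: one 4 KiB page */

/* Hypothetical per-socket accounting state. */
struct model_sock_mem {
    int   forward_alloc;    /* bytes pre-charged to this socket */
    long *global_pages;     /* shared counter that many CPUs contend on */
};

/* Give every whole quantum back to the shared counter. */
static void model_reclaim_quanta(struct model_sock_mem *sk)
{
    *sk->global_pages -= sk->forward_alloc / MODEL_QUANTUM;
    sk->forward_alloc %= MODEL_QUANTUM;
}

/* Full reclaim: touches the shared counter as soon as one quantum is free. */
static void model_mem_reclaim(struct model_sock_mem *sk)
{
    if (sk->forward_alloc >= MODEL_QUANTUM)
        model_reclaim_quanta(sk);
}

/* Partial reclaim: leaves a small reserve alone, so a socket that keeps
 * receiving and freeing single frames does not touch the shared counter. */
static void model_mem_reclaim_partial(struct model_sock_mem *sk)
{
    if (sk->forward_alloc > MODEL_QUANTUM)
        model_reclaim_quanta(sk);
}
```

In this model, a socket holding exactly one quantum is left untouched by the partial variant, which is where the reduced contention comes from.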
     

14 Oct, 2008

1 commit

  • Clean up the various different email addresses of mine listed in the code
    to a single current and valid address. As Dave says his network merges
    for 2.6.28 are now done this seems a good point to send them in where
    they won't risk disrupting real changes.

    Signed-off-by: Alan Cox
    Signed-off-by: David S. Miller

    Alan Cox
     

16 Aug, 2008

1 commit

  • There's an skb_copy_datagram_iovec() to copy out of a paged skb, but
    nothing the other way around (because we don't do that).

    We want to allocate big skbs in tun.c, so let's add the function.
    It's a carbon copy of skb_copy_datagram_iovec() with enough changes to
    be annoying.

    Signed-off-by: Rusty Russell
    Signed-off-by: David S. Miller

    Rusty Russell
     

26 Jul, 2008

1 commit

  • Removes a legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better with automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuitively, BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON(), though some might actually be
    promoted to BUG_ON(); I left that to the future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

29 Jan, 2008

3 commits

  • This patch introduces new memory accounting functions for each network
    protocol. Most of them are renamed from memory accounting functions
    for stream protocols. At the same time, some stream memory accounting
    functions are removed since other functions do same thing.

    Renaming:
    sk_stream_free_skb() -> sk_wmem_free_skb()
    __sk_stream_mem_reclaim() -> __sk_mem_reclaim()
    sk_stream_mem_reclaim() -> sk_mem_reclaim()
    sk_stream_mem_schedule() -> __sk_mem_schedule()
    sk_stream_pages() -> sk_mem_pages()
    sk_stream_rmem_schedule() -> sk_rmem_schedule()
    sk_stream_wmem_schedule() -> sk_wmem_schedule()
    sk_charge_skb() -> sk_mem_charge()

    Removing:
    sk_stream_rfree(): consolidates into sock_rfree()
    sk_stream_set_owner_r(): consolidates into skb_set_owner_r()
    sk_stream_mem_schedule()

    The following functions are added.
    sk_has_account(): check if the protocol supports accounting
    sk_mem_uncharge(): do the opposite of sk_mem_charge()

    In addition, to achieve consolidation, updating sk_wmem_queued is
    removed from sk_mem_charge().

    Next, to consolidate memory accounting functions, this patch adds
    memory accounting calls to network core functions. Moreover, present
    memory accounting call is renamed to new accounting call.

    Finally we replace present memory accounting calls with new interface
    in TCP and SCTP.

    Signed-off-by: Takahiro Yasui
    Signed-off-by: Hideo Aoki
    Signed-off-by: David S. Miller

    Hideo Aoki
     
  • The previous move of the UDP inDatagrams counter caused each
    peek of the same packet to be counted separately. This may be
    undesirable.

    This patch fixes this by adding a bit to sk_buff to record whether
    this packet has already been seen through skb_recv_datagram. We
    then only increment the counter when the packet is seen for the
    first time.

    The only dodgy part is the fact that skb_recv_datagram doesn't have
    a good way of returning this new bit of information. So I've added
    a new function __skb_recv_datagram that does return this and made
    skb_recv_datagram a wrapper around it.

    The plan is to eventually replace all uses of skb_recv_datagram with
    this new function at which time it can be renamed its proper name.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
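
The "count only on first sight" logic can be sketched as follows. The structures and the `model_*` functions are hypothetical; they only mirror the shape of a receive helper that reports a per-packet "already peeked" bit to its caller.

```c
#include <stdbool.h>

struct model_pkt   { bool peeked; };              /* per-packet marker bit */
struct model_stats { unsigned long in_datagrams; };

/* Inner receive helper: reports through *peeked whether this packet was
 * already handed out by an earlier peek, and marks it when peeking now. */
static void model_recv_datagram(struct model_pkt *pkt, bool msg_peek,
                                bool *peeked)
{
    *peeked = pkt->peeked;
    if (msg_peek)
        pkt->peeked = true;
}

/* Caller: bump the counter only the first time the packet is seen. */
static void model_udp_recvmsg(struct model_pkt *pkt, bool msg_peek,
                              struct model_stats *st)
{
    bool peeked;

    model_recv_datagram(pkt, msg_peek, &peeked);
    if (!peeked)
        st->in_datagrams++;
}
```

Two peeks followed by a normal read of the same packet then count as a single datagram.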
     
  • Currently it is possible for two processes to peek on the same socket
    and end up incrementing the error counter twice for the same packet.

    This patch fixes it by making skb_kill_datagram return whether it
    succeeded in unlinking the packet and only incrementing the counter
    if it did.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
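
A minimal model of "only the caller that actually unlinked the packet bumps the error counter" is given below. The names are hypothetical and the queue is reduced to a single flag; the real function works on a socket receive queue under a lock.

```c
#include <errno.h>
#include <stdbool.h>

struct model_dgram    { bool queued; };            /* still on the queue? */
struct model_counters { unsigned long in_errors; };

/* Returns 0 if this caller removed the packet, -ENOENT if a peeking
 * caller found that someone else had already removed it. */
static int model_kill_datagram(struct model_dgram *d, bool msg_peek)
{
    int err = 0;

    if (msg_peek) {
        err = -ENOENT;
        if (d->queued) {
            d->queued = false;  /* unlink from the receive queue */
            err = 0;
        }
    }
    return err;
}

/* Caller: count the bad packet only if we were the one to drop it. */
static void model_report_bad_packet(struct model_dgram *d, bool msg_peek,
                                    struct model_counters *c)
{
    if (model_kill_datagram(d, msg_peek) == 0)
        c->in_errors++;
}
```

When two peekers hit the same bad packet, the second sees -ENOENT and the counter is incremented once.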
     

11 Sep, 2007

1 commit

  • When msg_iovlen is zero we shouldn't try to dereference
    msg_iov. Right now the only thing that tries to do so
    is skb_copy_and_csum_datagram_iovec. Since the total
    length should also be zero if msg_iovlen is zero, it's
    sufficient to check the total length there and simply
    return if it's zero.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
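
The early-return guard described above can be shown with a small userspace helper. `model_copy_to_iovec` is a hypothetical function that copies into a single iovec entry; the point is only that the length is checked before the iovec pointer is touched.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Copy len bytes from src into the first iovec entry.
 * Returns 0 on success, -1 if the entry is too small. */
static int model_copy_to_iovec(const unsigned char *src, size_t len,
                               struct iovec *iov)
{
    /* With nothing to copy, iov may legitimately be NULL (zero-length
     * vector), so return before dereferencing it. */
    if (len == 0)
        return 0;

    if (iov->iov_len < len)
        return -1;
    memcpy(iov->iov_base, src, len);
    return 0;
}
```

Calling it with a zero length and a NULL iovec is safe, while any non-zero length still goes through the normal copy path.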
     

28 Apr, 2007

1 commit

  • This reverts eefa3906283a2b60a6d02a2cda593a7d7d7946c5

    The simplification made in that change works with the assumption that
    the 'offset' parameter to these functions is always positive or zero,
    which is not true. It can be and often is negative in order to access
    SKB header values in front of skb->data.

    Signed-off-by: David S. Miller

    David S. Miller
     

26 Apr, 2007

2 commits

  • I noticed recently that, in skb_checksum(), "offset" and "start" are
    essentially the same thing and have the same value throughout the
    function, despite being computed differently. Using a single variable
    allows some cleanups and makes the skb_checksum() function smaller,
    more readable, and presumably marginally faster.

    We appear to have many other "sk_buff walker" functions built on the
    exact same model, so the cleanup applies to them, too. Here is a list
    of the functions I found to be affected:

    net/appletalk/ddp.c:atalk_sum_skb()
    net/core/datagram.c:skb_copy_datagram_iovec()
    net/core/datagram.c:skb_copy_and_csum_datagram()
    net/core/skbuff.c:skb_copy_bits()
    net/core/skbuff.c:skb_store_bits()
    net/core/skbuff.c:skb_checksum()
    net/core/skbuff.c:skb_copy_and_csum_bit()
    net/core/user_dma.c:dma_skb_copy_datagram_iovec()
    net/xfrm/xfrm_algo.c:skb_icv_walk()
    net/xfrm/xfrm_algo.c:skb_to_sgvec()

    OTOH, I admit I'm a bit surprised, the cleanup is rather obvious so I'm
    really wondering if I am missing something. Can anyone please comment
    on this?

    Signed-off-by: Jean Delvare
    Signed-off-by: David S. Miller

    Jean Delvare
     
  • This patch eliminates some duplicate code for the verification of
    receive checksums between UDP-Lite and UDP. It does this by
    introducing __skb_checksum_complete_head which is identical to
    __skb_checksum_complete apart from the fact that it takes
    a length parameter rather than computing the first skb->len bytes.

    As a result UDP-Lite will be able to use hardware checksum offload
    for packets which do not use partial coverage checksums. It also
    means that UDP-Lite loopback no longer does unnecessary checksum
    verification.

    If any NICs start supporting UDP-Lite this would also start working
    automatically.

    This patch removes the assumption that msg_flags has MSG_TRUNC clear
    upon entry in recvmsg.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

11 Feb, 2007

1 commit


03 Dec, 2006

3 commits