23 Dec, 2011

1 commit

  • skb->truesize might be big even for a small packet.

    It's even bigger after commit 87fb4b7b533 (net: more accurate skb
    truesize), especially with a big MTU.

    We should allow queueing at least one packet per receiver, even with a
    low RCVBUF setting.
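
    In sketch form, the relaxed check this implies (assuming the test
    lives in sock_queue_rcv_skb(); the cast is illustrative):

    /* Admit the skb as long as current rmem is below the limit, so an
     * empty queue can always accept one packet, even when skb->truesize
     * alone exceeds sk->sk_rcvbuf. */
    if (atomic_read(&sk->sk_rmem_alloc) >= (unsigned int)sk->sk_rcvbuf)
            return -ENOMEM;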

    Reported-by: Michal Simek
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Oct, 2011

1 commit

  • skb truesize currently accounts for sk_buff struct and part of skb head.
    kmalloc() roundings are also ignored.

    Considering that skb_shared_info is larger than sk_buff, it's time to
    take it into account for better memory accounting.

    This patch introduces SKB_TRUESIZE(X) macro to centralize various
    assumptions into a single place.
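
    In sketch form, the macro bundles the rounding assumptions like this
    (treat as an outline, not the exact hunk):

    /* truesize = payload + aligned sk_buff + aligned skb_shared_info */
    #define SKB_TRUESIZE(X) ((X) +                                       \
                    SKB_DATA_ALIGN(sizeof(struct sk_buff)) +             \
                    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))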

    At skb alloc phase, we put skb_shared_info struct at the exact end of
    skb head, to allow a better use of memory (lowering number of
    reallocations), since kmalloc() gives us power-of-two memory blocks.

    Unless SLUB/SLAB debugging is active, both skb->head and
    skb_shared_info are aligned to cache lines, as before.

    Note: This patch might trigger performance regressions because of
    misconfigured protocol stacks, hitting per-socket or global memory
    limits that were previously not reached. But it's a necessary step for
    more accurate memory accounting.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Aug, 2011

1 commit

  • When assigning a NULL value to an RCU protected pointer, no barrier is
    needed. rcu_assign_pointer() used to handle that special case, but
    will soon change to no longer do so.

    Convert all rcu_assign_pointer of NULL value.

    // <smpl>
    @@ expression P; @@

    - rcu_assign_pointer(P, NULL)
    + RCU_INIT_POINTER(P, NULL)

    // </smpl>
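
    A plain C illustration of the conversion (names hypothetical):

    struct foo __rcu *global_ptr;

    static void clear_ptr(void)
    {
            /* Publishing NULL needs no barrier: a reader that sees NULL
             * has nothing to dereference, so there is no ordering to
             * enforce against the pointed-to object. */
            RCU_INIT_POINTER(global_ptr, NULL);
    }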

    Signed-off-by: Stephen Hemminger
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

22 Jun, 2011

1 commit

  • This patch adds 2 tracepoints to report the status of a socket's
    receive queue and related parameters.

    One tracepoint is added to sock_queue_rcv_skb. It records the rcvbuf
    size and its usage. The other tracepoint is added to __sk_mem_schedule;
    it records the socket memory limits and the current usage.

    By using these tracepoints, we can know in detail why the kernel
    dropped a packet.
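
    A sketch of where the first tracepoint fires (the placement and the
    trace_sock_rcvqueue_full name are my reading of the patch, not a
    verbatim hunk):

    /* in sock_queue_rcv_skb(): record rcvbuf size and usage on drop */
    if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
        (unsigned int)sk->sk_rcvbuf) {
            trace_sock_rcvqueue_full(sk, skb);
            return -ENOMEM;
    }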

    Signed-off-by: Satoru Moriya
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Satoru Moriya
     

14 Jan, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (46 commits)
    hwrng: via_rng - Fix memory scribbling on some CPUs
    crypto: padlock - Move padlock.h into include/crypto
    hwrng: via_rng - Fix asm constraints
    crypto: n2 - use __devexit not __exit in n2_unregister_algs
    crypto: mark crypto workqueues CPU_INTENSIVE
    crypto: mv_cesa - dont return PTR_ERR() of wrong pointer
    crypto: ripemd - Set module author and update email address
    crypto: omap-sham - backlog handling fix
    crypto: gf128mul - Remove experimental tag
    crypto: af_alg - fix af_alg memory_allocated data type
    crypto: aesni-intel - Fixed build with binutils 2.16
    crypto: af_alg - Make sure sk_security is initialized on accept()ed sockets
    net: Add missing lockdep class names for af_alg
    include: Install linux/if_alg.h for user-space crypto API
    crypto: omap-aes - checkpatch --file warning fixes
    crypto: omap-aes - initialize aes module once per request
    crypto: omap-aes - unnecessary code removed
    crypto: omap-aes - error handling implementation improved
    crypto: omap-aes - redundant locking is removed
    crypto: omap-aes - DMA initialization fixes for OMAP off mode
    ...

    Linus Torvalds
     

07 Jan, 2011

1 commit

  • Leonardo Chiquitto found that poll() could block forever on TCP
    sockets when urgent data was received, if the event mask only contains
    POLLPRI.

    He did a bisection and found commit 4938d7e0233 (poll: avoid extra
    wakeups in select/poll) was the source of the problem.

    The problem is that TCP sockets use the standard sock_def_readable()
    function for their sk_data_ready() handler, and sock_def_readable()
    doesn't signal POLLPRI.

    Only TCP is affected by the problem. Adding POLLPRI to the list of
    flags might trigger unnecessary schedules, but URGENT handling is such
    a seldom-used feature that this seems a good compromise.
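
    The fix presumably amounts to adding POLLPRI to the wakeup mask in
    sock_def_readable(); a sketch (the wq->wait target is an assumption):

    /* wake poll()ers for urgent data as well as ordinary data */
    wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
                                    POLLRDNORM | POLLRDBAND);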

    Thanks a lot to Leonardo for providing the bisection result and a test
    program as well.

    Reference : http://www.spinics.net/lists/netdev/msg151793.html

    Reported-and-bisected-by: Leonardo Chiquitto
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2010

1 commit

  • Special care is taken inside sk_prot_alloc to avoid overwriting
    skc_node/skc_nulls_node. We should also avoid overwriting
    skc_bind_node/skc_portaddr_node.

    The patch fixes the following crash:

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] udp4_lib_lookup2+0xad/0x370
    [] __udp4_lib_lookup+0x282/0x360
    [] __udp4_lib_rcv+0x31e/0x700
    [] ? ip_local_deliver_finish+0x65/0x190
    [] ? ip_local_deliver+0x88/0xa0
    [] udp_rcv+0x15/0x20
    [] ip_local_deliver_finish+0x65/0x190
    [] ip_local_deliver+0x88/0xa0
    [] ip_rcv_finish+0x32d/0x6f0
    [] ? netif_receive_skb+0x99c/0x11c0
    [] ip_rcv+0x2bb/0x350
    [] netif_receive_skb+0x99c/0x11c0

    Signed-off-by: Leonard Crestez
    Signed-off-by: Octavian Purdila
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Octavian Purdila
     

10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

    Optimize INET input path a bit further, by :

    1) moving sk_refcnt close to sk_lock.

    This reduces the number of dirtied cache lines by one on 64-bit arches
    (with 64-byte cache lines).

    2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

    (same cache line as hash / family / bound_dev_if / nulls_node)

    This reduces the number of cache lines accessed in lookups by one, and
    doesn't increase the size of inet and timewait socks.
    inet and tw sockets now share the same place-holder for these fields.

    Before patch :

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After patch :

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now uses a single cache line per ignored
    item, instead of two.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Nov, 2010

1 commit

  • Robin Holt tried to boot a 16TB machine and found some limits were
    reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]

    We can switch the infrastructure to use "long" instead of "int", now
    that atomic_long_t primitives are available for free.
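
    A minimal sketch of the widened accounting (the field and helper names
    here are illustrative, not the exact patch):

    atomic_long_t memory_allocated;         /* was atomic_t */

    static long charge_pages(long pages)
    {
            /* a long total cannot overflow at 16TB scale on 64-bit */
            return atomic_long_add_return(pages, &memory_allocated);
    }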

    Signed-off-by: Eric Dumazet
    Reported-by: Robin Holt
    Reviewed-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Oct, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
    bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
    vlan: Calling vlan_hwaccel_do_receive() is always valid.
    tproxy: use the interface primary IP address as a default value for --on-ip
    tproxy: added IPv6 support to the socket match
    cxgb3: function namespace cleanup
    tproxy: added IPv6 support to the TPROXY target
    tproxy: added IPv6 socket lookup function to nf_tproxy_core
    be2net: Changes to use only priority codes allowed by f/w
    tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
    tproxy: added tproxy sockopt interface in the IPV6 layer
    tproxy: added udp6_lib_lookup function
    tproxy: added const specifiers to udp lookup functions
    tproxy: split off ipv6 defragmentation to a separate module
    l2tp: small cleanup
    nf_nat: restrict ICMP translation for embedded header
    can: mcp251x: fix generation of error frames
    can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
    can-raw: add msg_flags to distinguish local traffic
    9p: client code cleanup
    rds: make local functions/variables static
    ...

    Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
    drivers/net/wireless/ath/ath9k/debug.c as per David

    Linus Torvalds
     

08 Oct, 2010

1 commit

  • > ===================================================
    > [ INFO: suspicious rcu_dereference_check() usage. ]
    > ---------------------------------------------------
    > include/linux/cgroup.h:542 invoked rcu_dereference_check() without protection!
    >
    > other info that might help us debug this:
    >
    >
    > rcu_scheduler_active = 1, debug_locks = 0
    > 1 lock held by swapper/1:
    > #0: (net_mutex){+.+.+.}, at: []
    > register_pernet_subsys+0x1f/0x47
    >
    > stack backtrace:
    > Pid: 1, comm: swapper Not tainted 2.6.35.4-28.fc14.x86_64 #1
    > Call Trace:
    > [] lockdep_rcu_dereference+0xaa/0xb3
    > [] sock_update_classid+0x7c/0xa2
    > [] sk_alloc+0x6b/0x77
    > [] __netlink_create+0x37/0xab
    > [] ? rtnetlink_rcv+0x0/0x2d
    > [] netlink_kernel_create+0x74/0x19d
    > [] ? __mutex_lock_common+0x339/0x35b
    > [] rtnetlink_net_init+0x2e/0x48
    > [] ops_init+0xe9/0xff
    > [] register_pernet_operations+0xab/0x130
    > [] register_pernet_subsys+0x2e/0x47
    > [] rtnetlink_init+0x53/0x102
    > [] netlink_proto_init+0x126/0x143
    > [] ? netlink_proto_init+0x0/0x143
    > [] do_one_initcall+0x72/0x186
    > [] kernel_init+0x23b/0x2c9
    > [] kernel_thread_helper+0x4/0x10
    > [] ? restore_args+0x0/0x30
    > [] ? kernel_init+0x0/0x2c9
    > [] ? kernel_thread_helper+0x0/0x10

    The sock_update_classid() function calls task_cls_classid(current),
    but the calling task cannot go away, so there is no danger of
    the associated structures disappearing. Insert an RCU read-side
    critical section to suppress the false positive.
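
    In sketch form, the sock_update_classid() hunk becomes:

    rcu_read_lock();        /* satisfies the lockdep check; the calling
                             * task itself cannot go away under us */
    classid = task_cls_classid(current);
    rcu_read_unlock();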

    Reported-by: Subrata Modak
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

25 Sep, 2010

1 commit

  • We have for each socket :

    One spinlock (sk_slock.slock)
    One rwlock (sk_callback_lock)

    Possible scenarios are :

    (A) (this is used in net/sunrpc/xprtsock.c)
    read_lock(&sk->sk_callback_lock) (without blocking BH)
    <BH>
    spin_lock(&sk->sk_slock.slock);
    ...
    read_lock(&sk->sk_callback_lock);
    ...

    (B)
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)

    (C)
    spin_lock_bh(&sk->sk_slock)
    ...
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)
    spin_unlock_bh(&sk->sk_slock)

    This (C) case conflicts with (A) :

    CPU1 [A]                                CPU2 [C]
    read_lock(callback_lock)
    <BH>                                    spin_lock_bh(slock)
    <wait to spin_lock(slock)>
                                            <wait to write_lock_bh(callback_lock)>

    We have one problematic (C) use case in inet_csk_listen_stop() :

    local_bh_disable();
    bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
    WARN_ON(sock_owned_by_user(child));
    ...
    sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

    lockdep is not happy with this, as reported by Tetsuo Handa

    It seems the only way to deal with this is to use
    read_lock_bh(callback_lock) everywhere.
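
    In outline, readers of the callback lock now disable BHs (a sketch,
    not the full patch):

    read_lock_bh(&sk->sk_callback_lock);
    /* ... inspect or invoke sk callbacks ... */
    read_unlock_bh(&sk->sk_callback_lock);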

    Thanks to Jarek for pointing out a bug in my first attempt and
    suggesting this solution.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Sep, 2010

1 commit

  • __lock_sock() and __release_sock() release and re-grab the socket lock
    but were missing proper annotations. Add them. This removes the
    following warning from sparse. (Currently __lock_sock() does not emit
    any warning about it, but I think it is better to annotate it as well.)

    net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock
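
    The annotations in question look roughly like this (sketch):

    static void __release_sock(struct sock *sk)
            __releases(&sk->sk_lock.slock)
            __acquires(&sk->sk_lock.slock)
    {
            /* drops and re-takes sk_lock.slock internally */
            ...
    }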

    Signed-off-by: Namhyung Kim
    Signed-off-by: David S. Miller

    Namhyung Kim
     

17 Jun, 2010

3 commits

  • AF_UNIX references this, and can be built as a module,
    so...

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Use struct pid and struct cred to store the peer credentials on struct
    sock. This gives enough information to convert the peer credential
    information to a value relative to whatever namespace the socket is in
    at the time.

    This removes nasty surprises when using SO_PEERCRED on socket
    connections where the processes on either side are in different pid
    and user namespaces.
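
    For reference, the userspace view this affects (standard SO_PEERCRED
    usage, shown as a sketch):

    struct ucred cr;
    socklen_t len = sizeof(cr);

    /* peer pid/uid/gid, translated into the caller's namespaces */
    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cr, &len) == 0)
            ; /* cr.pid, cr.uid, cr.gid are now meaningful here */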

    Signed-off-by: Eric W. Biederman
    Acked-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • To keep the coming code clear, and to allow both the sock code and the
    scm code to share the logic, introduce a function to translate from
    struct cred to struct ucred.
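
    A hedged sketch of such a helper (the body, in particular the
    namespace handling, is an assumption rather than the exact patch):

    void cred_to_ucred(struct pid *pid, const struct cred *cred,
                       struct ucred *ucred)
    {
            ucred->pid = pid_vnr(pid);      /* pid in the viewer's ns */
            ucred->uid = ucred->gid = -1;
            if (cred) {
                    ucred->uid = cred->euid;
                    ucred->gid = cred->egid;
            }
    }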

    Signed-off-by: Eric W. Biederman
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

27 May, 2010

1 commit

  • This new sock lock primitive was introduced to speed up some
    user-context socket manipulation. But it is unsafe if two threads mix
    them: one using regular lock_sock()/release_sock(), the other using
    lock_sock_bh()/unlock_sock_bh().

    This patch changes lock_sock_bh to be careful against 'owned' state.
    If owned is found to be set, we must take the slow path.
    lock_sock_bh() now returns a boolean to say if the slow path was taken,
    and this boolean is used at unlock_sock_bh time to call the appropriate
    unlock function.

    After this change, BHs may be either disabled or enabled during the
    lock_sock_bh/unlock_sock_bh protected section. This might be
    misleading, so we rename these functions to
    lock_sock_fast()/unlock_sock_fast().
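
    Typical caller pattern after the rename (a sketch):

    bool slow = lock_sock_fast(sk);     /* true if slow path was taken */
    /* ... short user-context manipulation of sk ... */
    unlock_sock_fast(sk, slow);         /* picks the matching unlock */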

    Reported-by: Anton Blanchard
    Signed-off-by: Eric Dumazet
    Tested-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 May, 2010

2 commits

  • This patch makes tun update its socket classid every time we inject a
    packet into the network stack. This is so that any updates made by the
    admin to the process writing packets to tun take effect.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Up until now cls_cgroup has relied on fetching the classid out of the
    currently executing thread. This runs into trouble when packet
    processing is delayed, in which case it may execute in another
    thread's context.

    Furthermore, even when a packet is not delayed we may fail to
    classify it if soft IRQs have been disabled, because this scenario
    is indistinguishable from one where a packet unrelated to the
    current thread is processed by a real soft IRQ.

    In fact, the current semantics are inherently broken, as a single
    skb may be constructed out of the writes of two different tasks.
    A different manifestation of this problem is when the TCP stack
    transmits in response to an incoming ACK. This is currently
    unclassified.

    As we already have a concept of packet ownership for accounting
    purposes in the skb->sk pointer, this is a natural place to store
    the classid in a persistent manner.

    This patch adds the cls_cgroup classid in struct sock, filling up
    an existing hole on 64-bit :)

    The value is set at socket creation time, so all sockets created via
    socket(2) automatically gain the ID of the thread creating them.
    Whenever another process touches the socket by either reading or
    writing to it, we will change the socket classid to that of the
    process if it has a valid (non-zero) classid.
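
    A sketch of the update helper this describes (close to, but not
    guaranteed to match, the exact patch):

    void sock_update_classid(struct sock *sk)
    {
            u32 classid = task_cls_classid(current);

            /* only overwrite with a valid (non-zero) classid */
            if (classid && classid != sk->sk_classid)
                    sk->sk_classid = classid;
    }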

    For sockets created on inbound connections through accept(2), we
    inherit the classid of the original listening socket through
    sk_clone, possibly preceding the actual accept(2) call.

    In order to minimise risks, I have not made this the authoritative
    classid. For now it is only used as a backup when we execute
    with soft IRQs disabled. Once we're completely happy with its
    semantics we can use it as the sole classid.

    Footnote: I have rearranged the error path on cls_cgroup module
    creation. If we didn't do this, then there is a window where someone
    could create a tc rule using cls_cgroup before the cgroup subsystem
    has been registered.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

18 May, 2010

1 commit

  • Use the low-order bit of skb->_skb_dst to tell that a dst is not
    refcounted.

    Rename _skb_dst to _skb_refdst to make sure all uses are caught.

    skb_dst() returns the dst whether or not the noref bit is set, but
    with a lockdep check to make sure a noref dst is not handed out when
    the current user is not RCU protected.

    New skb_dst_set_noref() helper to set a non-refcounted dst on a skb
    (with lockdep check).

    skb_dst_drop() drops a reference only if skb dst was refcounted.

    skb_dst_force() helper is used to force a refcount on the dst, when
    the skb is queued and no longer RCU protected.

    Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
    !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
    sock_queue_rcv_skb(), in __nf_queue().

    Use skb_dst_force() in dev_requeue_skb().
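
    The tagging scheme in a nutshell (a sketch; the in-tree skb_dst()
    also carries the lockdep check described above):

    #define SKB_DST_NOREF   1UL
    #define SKB_DST_PTRMASK ~(SKB_DST_NOREF)

    /* returns the dst whether or not the noref bit is set */
    static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
    {
            return (struct dst_entry *)(skb->_skb_refdst & SKB_DST_PTRMASK);
    }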

    Note: dst_use_noref() still dirties the dst; we might transform it
    later to do one dirtying per jiffy.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 May, 2010

1 commit

  • TCP-MD5 sessions have intermittent failures when the route cache is
    invalidated: ip_queue_xmit() has to find a new route and calls
    sk_setup_caps(sk, &rt->u.dst), destroying the

    sk->sk_route_caps &= ~NETIF_F_GSO_MASK

    that MD5 desperately tries to maintain everywhere along its path (from
    tcp_transmit_skb() for example).

    So we send a few bad packets, and everything is fine when
    tcp_transmit_skb() is called again for this socket.

    Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use
    a socket field, sk_route_nocaps, containing bits to mask out of
    sk_route_caps.
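
    A sketch of how such a field composes with sk_setup_caps() (helper
    name and body are assumptions based on the description):

    static inline void sk_nocaps_add(struct sock *sk, int flags)
    {
            sk->sk_route_nocaps |= flags;   /* survives rerouting */
            sk->sk_route_caps &= ~flags;    /* applied immediately */
    }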

    Reported-by: Bhaskar Dutta
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Replace the sk_sleep pointer in "struct sock" with sk_wq, a pointer
    to "struct socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper(), which must be used
    inside an rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to (see the sketch after this
    list) :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to wake up tasks when needed.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c and af_unix use an integrated "struct
    socket_wq" instead of dynamically allocated ones. They don't need RCU
    freeing.
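
    The reader-side pattern from items 6) and 7), sketched:

    struct socket_wq *wq;

    rcu_read_lock();
    wq = rcu_dereference(sk->sk_wq);
    if (wq_has_sleeper(wq))
            wake_up_interruptible(&wq->wait);
    rcu_read_unlock();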

    Some cleanups or followups are probably needed (a possible
    sk_callback_lock conversion to a spinlock, for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2010

1 commit

  • The current socket backlog limit is not enough to really stop DDOS
    attacks, because the user thread spends a long time processing a full
    backlog each round, and a user might spin madly on the socket lock.

    We should add the backlog size and the receive_queue size (aka
    rmem_alloc) to pace writers, and let the user run without being slowed
    down too much.

    Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
    stress situations.
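
    A sketch of the helper, per its description (details may differ from
    the merged patch):

    static inline bool sk_rcvqueues_full(const struct sock *sk,
                                         const struct sk_buff *skb)
    {
            unsigned int qsize = sk->sk_backlog.len +
                                 atomic_read(&sk->sk_rmem_alloc);

            /* lets us drop early, without taking the socket lock */
            return qsize + skb->truesize > sk->sk_rcvbuf;
    }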

    Under huge stress from a multiqueue/RPS enabled NIC, a single-flow udp
    receiver can now process ~200,000 pps (instead of ~100 pps before the
    patch) on an 8-core machine.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Apr, 2010

1 commit

  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
            return sk->sk_sleep;
    }

    Replace all read occurrences of sk_sleep with a call to this function.

    Needed for a future RCU conversion: sk_sleep won't be a directly
    available field.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2010

1 commit

  • With the latest CONFIG_PROVE_RCU stuff, I felt comfortable enough to
    make this work.

    sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

    This rwlock is read-locked for a very short time, and dst entries are
    already freed after an RCU grace period. This calls for RCU again :)

    This patch converts sk_dst_lock to a spinlock, and uses RCU for
    readers.

    __sk_dst_get() is supposed to be called with rcu_read_lock() held, or
    with the socket locked by the user, so use the appropriate
    rcu_dereference_check() condition
    (rcu_read_lock_held() || sock_owned_by_user(sk))
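
    Which gives, roughly (a sketch matching the condition above):

    static inline struct dst_entry *__sk_dst_get(struct sock *sk)
    {
            return rcu_dereference_check(sk->sk_dst_cache,
                                         rcu_read_lock_held() ||
                                         sock_owned_by_user(sk));
    }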

    This patch avoids two atomic ops per tx packet on UDP connected sockets,
    for example, and permits sk_dst_lock to be much less dirtied.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
