10 Jan, 2012

1 commit


09 Jan, 2012

1 commit

  • so move it there. Fixes build errors when CONFIG_INET is not defined:

    In file included from include/linux/tcp.h:211:0,
    from include/linux/ipv6.h:221,
    from include/net/ipv6.h:16,
    from include/linux/sunrpc/clnt.h:26,
    from include/linux/nfs_fs.h:50,
    from init/do_mounts.c:20:
    include/net/sock.h: In function 'sk_update_clone':
    include/net/sock.h:1109:3: error: implicit declaration of function 'sock_update_memcg' [-Werror=implicit-function-declaration]

    Signed-off-by: Stephen Rothwell
    Signed-off-by: David S. Miller

    Stephen Rothwell
     

08 Jan, 2012

1 commit

  • Sockets can also be created through sock_clone. Because it copies
    all data in the sock structure, it also copies the memcg-related pointer,
    and all should be fine. However, since we now use reference counts in
    socket creation, we are left with some sockets that have no reference
    counts. It matters when we destroy them, since it leads to a mismatch.

    Signed-off-by: Glauber Costa
    CC: David S. Miller
    CC: Greg Thelen
    CC: Hiroyouki Kamezawa
    CC: Laurent Chavey
    Signed-off-by: David S. Miller

    Glauber Costa
     

24 Dec, 2011

1 commit


23 Dec, 2011

1 commit

  • skb->truesize might be big even for a small packet.

    Its even bigger after commit 87fb4b7b533 (net: more accurate skb
    truesize) and big MTU.

    We should allow queueing at least one packet per receiver, even with a
    low RCVBUF setting.

    Reported-by: Michal Simek
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2011

1 commit

  • We can't scan the proto_list to initialize sock cgroups, as it
    holds a rwlock, and we also want to keep the code generic enough to
    avoid calling the initialization functions of protocols directly,

    Convert proto_list_lock into a mutex, so we can sleep and do the
    necessary allocations. This lock is seldom taken, so there shouldn't
    be any performance penalties associated with that

    Signed-off-by: Glauber Costa
    CC: Hiroyouki Kamezawa
    CC: David S. Miller
    CC: Eric Dumazet
    CC: Stephen Rothwell
    CC: Randy Dunlap
    Signed-off-by: David S. Miller

    Glauber Costa
     

13 Dec, 2011

3 commits

  • This patch introduces memory pressure controls for the tcp
    protocol. It uses the generic socket memory pressure code
    introduced in earlier patches, and fills in the
    necessary data in cg_proto struct.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Eric W. Biederman
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • The goal of this work is to move the memory pressure tcp
    controls to a cgroup, instead of just relying on global
    conditions.

    To avoid excessive overhead in the network fast paths,
    the code that accounts allocated memory to a cgroup is
    hidden inside a static_branch(). This branch is patched out
    until the first non-root cgroup is created. So when nobody
    is using cgroups, even if it is mounted, no significant performance
    penalty should be seen.

    This patch handles the generic part of the code, and has nothing
    tcp-specific.

    Signed-off-by: Glauber Costa
    Reviewed-by: KAMEZAWA Hiroyuki
    CC: Kirill A. Shutemov
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     
  • This patch replaces all uses of struct sock fields' memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem to acessor
    macros. Those macros can either receive a socket argument, or a mem_cgroup
    argument, depending on the context they live in.

    Since we're only doing a macro wrapping here, no performance impact at all is
    expected in the case where we don't have cgroups disabled.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyouki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     

29 Nov, 2011

1 commit


23 Nov, 2011

1 commit

  • This patch adds in the infrastructure code to create the network priority
    cgroup. The cgroup, in addition to the standard processes file creates two
    control files:

    1) prioidx - This is a read-only file that exports the index of this cgroup.
    This is a value that is both arbitrary and unique to a cgroup in this subsystem,
    and is used to index the per-device priority map

    2) priomap - This is a writeable file. On read it reports a table of 2-tuples
    where name is the name of a network interface and priority is
    indicates the priority assigned to frames egresessing on the named interface and
    originating from a pid in this cgroup

    This cgroup allows for skb priority to be set prior to a root qdisc getting
    selected. This is benenficial for DCB enabled systems, in that it allows for any
    application to use dcb configured priorities so without application modification

    Signed-off-by: Neil Horman
    Signed-off-by: John Fastabend
    CC: Robert Love
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

18 Nov, 2011

1 commit


10 Nov, 2011

1 commit

  • The 802.1X EAPOL handshake hostapd does requires
    knowing whether the frame was ack'ed by the peer.
    Currently, we fudge this pretty badly by not even
    transmitting the frame as a normal data frame but
    injecting it with radiotap and getting the status
    out of radiotap monitor as well. This is rather
    complex, confuses users (mon.wlan0 presence) and
    doesn't work with all hardware.

    To get rid of that hack, introduce a real wifi TX
    status option for data frame transmissions.

    This works similar to the existing TX timestamping
    in that it reflects the SKB back to the socket's
    error queue with a SCM_WIFI_STATUS cmsg that has
    an int indicating ACK status (0/1).

    Since it is possible that at some point we will
    want to have TX timestamping and wifi status in a
    single errqueue SKB (there's little point in not
    doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
    to SO_EE_ORIGIN_TXSTATUS which can collect more
    than just the timestamp; keep the old constant
    as an alias of course. Currently the internal APIs
    don't make that possible, but it wouldn't be hard
    to split them up in a way that makes it possible.

    Thanks to Neil Horman for helping me figure out
    the functions that add the control messages.

    Signed-off-by: Johannes Berg
    Signed-off-by: John W. Linville

    Johannes Berg
     

09 Nov, 2011

1 commit


26 Oct, 2011

1 commit


14 Oct, 2011

1 commit

  • skb truesize currently accounts for sk_buff struct and part of skb head.
    kmalloc() roundings are also ignored.

    Considering that skb_shared_info is larger than sk_buff, its time to
    take it into account for better memory accounting.

    This patch introduces SKB_TRUESIZE(X) macro to centralize various
    assumptions into a single place.

    At skb alloc phase, we put skb_shared_info struct at the exact end of
    skb head, to allow a better use of memory (lowering number of
    reallocations), since kmalloc() gives us power-of-two memory blocks.

    Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
    aligned to cache lines, as before.

    Note: This patch might trigger performance regressions because of
    misconfigured protocol stacks, hitting per socket or global memory
    limits that were previously not reached. But its a necessary step for a
    more accurate memory accounting.

    Signed-off-by: Eric Dumazet
    CC: Andi Kleen
    CC: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Oct, 2011

1 commit


25 Aug, 2011

1 commit


02 Aug, 2011

1 commit

  • When assigning a NULL value to an RCU protected pointer, no barrier
    is needed. The rcu_assign_pointer, used to handle that but will soon
    change to not handle the special case.

    Convert all rcu_assign_pointer of NULL value.

    //smpl
    @@ expression P; @@

    - rcu_assign_pointer(P, NULL)
    + RCU_INIT_POINTER(P, NULL)

    //

    Signed-off-by: Stephen Hemminger
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

08 Jul, 2011

1 commit


06 Jul, 2011

1 commit


22 Jun, 2011

1 commit

  • This patch adds 2 tracepoints to get a status of a socket receive queue
    and related parameter.

    One tracepoint is added to sock_queue_rcv_skb. It records rcvbuf size
    and its usage. The other tracepoint is added to __sk_mem_schedule and
    it records limitations of memory for sockets and current usage.

    By using these tracepoints we're able to know detailed reason why kernel
    drop the packet.

    Signed-off-by: Satoru Moriya
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Satoru Moriya
     

31 Mar, 2011

1 commit


14 Jan, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (46 commits)
    hwrng: via_rng - Fix memory scribbling on some CPUs
    crypto: padlock - Move padlock.h into include/crypto
    hwrng: via_rng - Fix asm constraints
    crypto: n2 - use __devexit not __exit in n2_unregister_algs
    crypto: mark crypto workqueues CPU_INTENSIVE
    crypto: mv_cesa - dont return PTR_ERR() of wrong pointer
    crypto: ripemd - Set module author and update email address
    crypto: omap-sham - backlog handling fix
    crypto: gf128mul - Remove experimental tag
    crypto: af_alg - fix af_alg memory_allocated data type
    crypto: aesni-intel - Fixed build with binutils 2.16
    crypto: af_alg - Make sure sk_security is initialized on accept()ed sockets
    net: Add missing lockdep class names for af_alg
    include: Install linux/if_alg.h for user-space crypto API
    crypto: omap-aes - checkpatch --file warning fixes
    crypto: omap-aes - initialize aes module once per request
    crypto: omap-aes - unnecessary code removed
    crypto: omap-aes - error handling implementation improved
    crypto: omap-aes - redundant locking is removed
    crypto: omap-aes - DMA initialization fixes for OMAP off mode
    ...

    Linus Torvalds
     

07 Jan, 2011

1 commit

  • Leonardo Chiquitto found poll() could block forever on tcp sockets and
    Urgent data was received, if the event flag only contains POLLPRI.

    He did a bisection and found commit 4938d7e0233 (poll: avoid extra
    wakeups in select/poll) was the source of the problem.

    Problem is TCP sockets use standard sock_def_readable() function for
    their sk_data_ready() handler, and sock_def_readable() doesnt signal
    POLLPRI.

    Only TCP is affected by the problem. Adding POLLPRI to the list of flags
    might trigger unnecessary schedules, but URGENT handling is such a
    seldom used feature this seems a good compromise.

    Thanks a lot to Leonardo for providing the bisection result and a test
    program as well.

    Reference : http://www.spinics.net/lists/netdev/msg151793.html

    Reported-and-bisected-by: Leonardo Chiquitto
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Dec, 2010

1 commit


17 Dec, 2010

1 commit

  • Special care is taken inside sk_port_alloc to avoid overwriting
    skc_node/skc_nulls_node. We should also avoid overwriting
    skc_bind_node/skc_portaddr_node.

    The patch fixes the following crash:

    BUG: unable to handle kernel paging request at fffffffffffffff0
    IP: [] udp4_lib_lookup2+0xad/0x370
    [] __udp4_lib_lookup+0x282/0x360
    [] __udp4_lib_rcv+0x31e/0x700
    [] ? ip_local_deliver_finish+0x65/0x190
    [] ? ip_local_deliver+0x88/0xa0
    [] udp_rcv+0x15/0x20
    [] ip_local_deliver_finish+0x65/0x190
    [] ip_local_deliver+0x88/0xa0
    [] ip_rcv_finish+0x32d/0x6f0
    [] ? netif_receive_skb+0x99c/0x11c0
    [] ip_rcv+0x2bb/0x350
    [] netif_receive_skb+0x99c/0x11c0

    Signed-off-by: Leonard Crestez
    Signed-off-by: Octavian Purdila
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Octavian Purdila
     

10 Dec, 2010

1 commit

  • Followup of commit b178bb3dfc30 (net: reorder struct sock fields)

    Optimize INET input path a bit further, by :

    1) moving sk_refcnt close to sk_lock.

    This reduces number of dirtied cache lines by one on 64bit arches (and
    64 bytes cache line size).

    2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

    (same cache line than hash / family / bound_dev_if / nulls_node)

    This reduces number of accessed cache lines in lookups by one, and dont
    increase size of inet and timewait socks.
    inet and tw sockets now share same place-holder for these fields.

    Before patch :

    offsetof(struct sock, sk_refcnt) = 0x10
    offsetof(struct sock, sk_lock) = 0x40
    offsetof(struct sock, sk_receive_queue) = 0x60
    offsetof(struct inet_sock, inet_daddr) = 0x270
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

    After patch :

    offsetof(struct sock, sk_refcnt) = 0x44
    offsetof(struct sock, sk_lock) = 0x48
    offsetof(struct sock, sk_receive_queue) = 0x68
    offsetof(struct inet_sock, inet_daddr) = 0x0
    offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

    compute_score() (udp or tcp) now use a single cache line per ignored
    item, instead of two.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Dec, 2010

1 commit


11 Nov, 2010

1 commit

  • Robin Holt tried to boot a 16TB machine and found some limits were
    reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]

    We can switch infrastructure to use long "instead" of "int", now
    atomic_long_t primitives are available for free.

    Signed-off-by: Eric Dumazet
    Reported-by: Robin Holt
    Reviewed-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Oct, 2010

1 commit


24 Oct, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
    bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
    vlan: Calling vlan_hwaccel_do_receive() is always valid.
    tproxy: use the interface primary IP address as a default value for --on-ip
    tproxy: added IPv6 support to the socket match
    cxgb3: function namespace cleanup
    tproxy: added IPv6 support to the TPROXY target
    tproxy: added IPv6 socket lookup function to nf_tproxy_core
    be2net: Changes to use only priority codes allowed by f/w
    tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
    tproxy: added tproxy sockopt interface in the IPV6 layer
    tproxy: added udp6_lib_lookup function
    tproxy: added const specifiers to udp lookup functions
    tproxy: split off ipv6 defragmentation to a separate module
    l2tp: small cleanup
    nf_nat: restrict ICMP translation for embedded header
    can: mcp251x: fix generation of error frames
    can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
    can-raw: add msg_flags to distinguish local traffic
    9p: client code cleanup
    rds: make local functions/variables static
    ...

    Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
    drivers/net/wireless/ath/ath9k/debug.c as per David

    Linus Torvalds
     

08 Oct, 2010

1 commit

  • > ===================================================
    > [ INFO: suspicious rcu_dereference_check() usage. ]
    > ---------------------------------------------------
    > include/linux/cgroup.h:542 invoked rcu_dereference_check() without protection!
    >
    > other info that might help us debug this:
    >
    >
    > rcu_scheduler_active = 1, debug_locks = 0
    > 1 lock held by swapper/1:
    > #0: (net_mutex){+.+.+.}, at: []
    > register_pernet_subsys+0x1f/0x47
    >
    > stack backtrace:
    > Pid: 1, comm: swapper Not tainted 2.6.35.4-28.fc14.x86_64 #1
    > Call Trace:
    > [] lockdep_rcu_dereference+0xaa/0xb3
    > [] sock_update_classid+0x7c/0xa2
    > [] sk_alloc+0x6b/0x77
    > [] __netlink_create+0x37/0xab
    > [] ? rtnetlink_rcv+0x0/0x2d
    > [] netlink_kernel_create+0x74/0x19d
    > [] ? __mutex_lock_common+0x339/0x35b
    > [] rtnetlink_net_init+0x2e/0x48
    > [] ops_init+0xe9/0xff
    > [] register_pernet_operations+0xab/0x130
    > [] register_pernet_subsys+0x2e/0x47
    > [] rtnetlink_init+0x53/0x102
    > [] netlink_proto_init+0x126/0x143
    > [] ? netlink_proto_init+0x0/0x143
    > [] do_one_initcall+0x72/0x186
    > [] kernel_init+0x23b/0x2c9
    > [] kernel_thread_helper+0x4/0x10
    > [] ? restore_args+0x0/0x30
    > [] ? kernel_init+0x0/0x2c9
    > [] ? kernel_thread_helper+0x0/0x10

    The sock_update_classid() function calls task_cls_classid(current),
    but the calling task cannot go away, so there is no danger of
    the associated structures disappearing. Insert an RCU read-side
    critical section to suppress the false positive.

    Reported-by: Subrata Modak
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

27 Sep, 2010

1 commit


25 Sep, 2010

1 commit

  • We have for each socket :

    One spinlock (sk_slock.slock)
    One rwlock (sk_callback_lock)

    Possible scenarios are :

    (A) (this is used in net/sunrpc/xprtsock.c)
    read_lock(&sk->sk_callback_lock) (without blocking BH)

    spin_lock(&sk->sk_slock.slock);
    ...
    read_lock(&sk->sk_callback_lock);
    ...

    (B)
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)

    (C)
    spin_lock_bh(&sk->sk_slock)
    ...
    write_lock_bh(&sk->sk_callback_lock)
    stuff
    write_unlock_bh(&sk->sk_callback_lock)
    spin_unlock_bh(&sk->sk_slock)

    This (C) case conflicts with (A) :

    CPU1 [A] CPU2 [C]
    read_lock(callback_lock)
    spin_lock_bh(slock)

    We have one problematic (C) use case in inet_csk_listen_stop() :

    local_bh_disable();
    bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
    WARN_ON(sock_owned_by_user(child));
    ...
    sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)

    lockdep is not happy with this, as reported by Tetsuo Handa

    It seems only way to deal with this is to use read_lock_bh(callbacklock)
    everywhere.

    Thanks to Jarek for pointing a bug in my first attempt and suggesting
    this solution.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Sep, 2010

1 commit

  • __lock_sock() and __release_sock() releases and regrabs lock but
    were missing proper annotations. Add it. This removes following
    warning from sparse. (Currently __lock_sock() does not emit any
    warning about it but I think it is better to add also.)

    net/core/sock.c:1580:17: warning: context imbalance in '__release_sock' - unexpected unlock

    Signed-off-by: Namhyung Kim
    Signed-off-by: David S. Miller

    Namhyung Kim
     

20 Jul, 2010

1 commit


13 Jul, 2010

1 commit


17 Jun, 2010

2 commits

  • AF_UNIX references this, and can be built as a module,
    so...

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Use struct pid and struct cred to store the peer credentials on struct
    sock. This gives enough information to convert the peer credential
    information to a value relative to whatever namespace the socket is in
    at the time.

    This removes nasty surprises when using SO_PEERCRED on socket
    connetions where the processes on either side are in different pid and
    user namespaces.

    Signed-off-by: Eric W. Biederman
    Acked-by: Daniel Lezcano
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Eric W. Biederman