19 Nov, 2011

1 commit

  • ipv4: Remove all uses of LL_ALLOCATED_SPACE

    The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the
    alignment to the sum of needed_headroom and needed_tailroom. As
    the amount that is then reserved for head room is needed_headroom
    with alignment, this means that the tail room left may be too small.

    This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv4
    with the macro LL_RESERVED_SPACE and direct reference to
    needed_tailroom.

    This also fixes the problem with needed_headroom changing between
    allocating the skb and reserving the head room.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

10 Nov, 2011

1 commit

  • Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :

    > At least, in recent kernels we dont change dst->refcnt in forwarding
    > patch (usinf NOREF skb->dst)
    >
    > One particular point is the atomic_inc(dst->refcnt) we have to perform
    > when queuing an UDP packet if socket asked PKTINFO stuff (for example a
    > typical DNS server has to setup this option)
    >
    > I have one patch somewhere that stores the information in skb->cb[] and
    > avoid the atomic_{inc|dec}(dst->refcnt).
    >

    OK I found it, I did some extra tests and believe its ready.

    [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

    When a socket uses IP_PKTINFO notifications, we currently force a dst
    reference for each received skb. Reader has to access dst to get needed
    information (rt_iif & rt_spec_dst) and must release dst reference.

    We also forced a dst reference if skb was put in socket backlog, even
    without IP_PKTINFO handling. This happens under stress/load.

    We can instead store the needed information in skb->cb[], so that only
    softirq handler really access dst, improving cache hit ratios.

    This removes two atomic operations per packet, and false sharing as
    well.

    On a benchmark using a mono threaded receiver (doing only recvmsg()
    calls), I can reach 720.000 pps instead of 570.000 pps.

    IP_PKTINFO is typically used by DNS servers, and any multihomed aware
    UDP application.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Nov, 2011

1 commit


08 Aug, 2011

1 commit

  • The raw sockets can provide source address for
    routing but their privileges are not considered. We
    can provide non-local source address, make sure the
    FLOWI_FLAG_ANYSRC flag is set if socket has privileges
    for this, i.e. based on hdrincl (IP_HDRINCL) and
    transparent flags.

    Signed-off-by: Julian Anastasov
    Signed-off-by: David S. Miller

    Julian Anastasov
     

27 Jul, 2011

1 commit

  • This allows us to move duplicated code in
    (atomic_inc_not_zero() for now) to

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     

02 Jul, 2011

1 commit


24 May, 2011

1 commit

  • The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    The supporting code for kptr_restrict and %pK are currently in the -mm
    tree. This patch converts users of %p in net/ to %pK. Cases of printing
    pointers to the syslog are not covered, since this would eliminate useful
    information for postmortem debugging and the reading of the syslog is
    already optionally protected by the dmesg_restrict sysctl.

    Signed-off-by: Dan Rosenberg
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dan Rosenberg
     

09 May, 2011

2 commits


29 Apr, 2011

1 commit

  • We lack proper synchronization to manipulate inet->opt ip_options

    Problem is ip_make_skb() calls ip_setup_cork() and
    ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
    without any protection against another thread manipulating inet->opt.

    Another thread can change inet->opt pointer and free old one under us.

    Use RCU to protect inet->opt (changed to inet->inet_opt).

    Instead of handling atomic refcounts, just copy ip_options when
    necessary, to avoid cache line dirtying.

    We cant insert an rcu_head in struct ip_options since its included in
    skb->cb[], so this patch is large because I had to introduce a new
    ip_options_rcu structure.

    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Apr, 2011

1 commit


12 Apr, 2011

1 commit


31 Mar, 2011

2 commits


29 Mar, 2011

1 commit


13 Mar, 2011

4 commits


03 Mar, 2011

1 commit


02 Mar, 2011

3 commits


30 Jan, 2011

1 commit

  • SIOCGETSGCNT is not a unique ioctl value as it it maps tio SIOCPROTOPRIVATE +1,
    which unfortunately means the existing infrastructure for compat networking
    ioctls is insufficient. A trivial compact ioctl implementation would conflict
    with:

    SIOCAX25ADDUID
    SIOCAIPXPRISLT
    SIOCGETSGCNT_IN6
    SIOCGETSGCNT
    SIOCRSSCAUSE
    SIOCX25SSUBSCRIP
    SIOCX25SDTEFACILITIES

    To make this work I have updated the compat_ioctl decode path to mirror the
    the normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function
    so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl
    function into struct proto so I can break out ioctls by which kind of ip socket
    I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only
    works on raw sockets. I have added a ipmr_compat_ioctl that mirrors the normal
    ipmr_ioctl.

    This was necessary because unfortunately the struct layout for the SIOCGETSGCNT
    has unsigned longs in it so changes between 32bit and 64bit kernels.

    This change was sufficient to run a 32bit ip multicast routing daemon on a
    64bit kernel.

    Reported-by: Bill Fenner
    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

18 Nov, 2010

1 commit


19 Aug, 2010

1 commit

  • This patch removes the abstraction introduced by the union skb_shared_tx in
    the shared skb data.

    The access of the different union elements at several places led to some
    confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

    http://marc.info/?l=linux-netdev&m=128084897415886&w=2

    Signed-off-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Oliver Hartkopp
     

11 Jun, 2010

1 commit


07 Jun, 2010

1 commit


11 May, 2010

1 commit


29 Apr, 2010

1 commit

  • When queueing a skb to socket, we can immediately release its dst if
    target socket do not use IP_CMSG_PKTINFO.

    tcp_data_queue() can drop dst too.

    This to benefit from a hot cache line and avoid the receiver, possibly
    on another cpu, to dirty this cache line himself.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Apr, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2010

1 commit


30 Oct, 2009

1 commit


29 Oct, 2009

1 commit

  • Augment raw_send_hdrinc to correct for incorrect ip header length values

    A series of oopses was reported to me recently. Apparently when using AF_RAW
    sockets to send data to peers that were reachable via ipsec encapsulation,
    people could panic or BUG halt their systems.

    I've tracked the problem down to user space sending an invalid ip header over an
    AF_RAW socket with IP_HDRINCL set to 1.

    Basically what happens is that userspace sends down an ip frame that includes
    only the header (no data), but sets the ip header ihl value to a large number,
    one that is larger than the total amount of data passed to the sendmsg call. In
    raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
    that was passed in, but assume the data is all valid. Later during ipsec
    encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
    to provide headroom for the ipsec headers. During this operation, the
    skb->transport_header is repointed to a spot computed by
    skb->network_header + the ip header length (ihl). Since so little data was
    passed in relative to the value of ihl provided by the raw socket, we point
    transport header to an unknown location, resulting in various crashes.

    This fix for this is pretty straightforward, simply validate the value of of
    iph->ihl when sending over a raw socket. If (iph->ihl*4U) > user data buffer
    size, drop the frame and return -EINVAL. I just confirmed this fixes the
    reported crashes.

    Signed-off-by: Neil Horman
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

19 Oct, 2009

1 commit

  • In order to have better cache layouts of struct sock (separate zones
    for rx/tx paths), we need this preliminary patch.

    Goal is to transfert fields used at lookup time in the first
    read-mostly cache line (inside struct sock_common) and move sk_refcnt
    to a separate cache line (only written by rx path)

    This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
    sport and id fields. This allows a future patch to define these
    fields as macros, like sk_refcnt, without name clashes.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Oct, 2009

1 commit

  • sock_queue_rcv_skb() can update sk_drops itself, removing need for
    callers to take care of it. This is more consistent since
    sock_queue_rcv_skb() also reads sk_drops when queueing a skb.

    This adds sk_drops managment to many protocols that not cared yet.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2009

1 commit

  • Create a new socket level option to report number of queue overflows

    Recently I augmented the AF_PACKET protocol to report the number of frames lost
    on the socket receive queue between any two enqueued frames. This value was
    exported via a SOL_PACKET level cmsg. AFter I completed that work it was
    requested that this feature be generalized so that any datagram oriented socket
    could make use of this option. As such I've created this patch, It creates a
    new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
    SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
    overflowed between any two given frames. It also augments the AF_PACKET
    protocol to take advantage of this new feature (as it previously did not touch
    sk->sk_drops, which this patch uses to record the overflow count). Tested
    successfully by me.

    Notes:

    1) Unlike my previous patch, this patch simply records the sk_drops value, which
    is not a number of drops between packets, but rather a total number of drops.
    Deltas must be computed in user space.

    2) While this patch currently works with datagram oriented protocols, it will
    also be accepted by non-datagram oriented protocols. I'm not sure if thats
    agreeable to everyone, but my argument in favor of doing so is that, for those
    protocols which aren't applicable to this option, sk_drops will always be zero,
    and reporting no drops on a receive queue that isn't used for those
    non-participating protocols seems reasonable to me. This also saves us having
    to code in a per-protocol opt in mechanism.

    3) This applies cleanly to net-next assuming that commit
    977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted

    Signed-off-by: Neil Horman
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neil Horman
     

01 Oct, 2009

1 commit

  • This provides safety against negative optlen at the type
    level instead of depending upon (sometimes non-trivial)
    checks against this sprinkled all over the the place, in
    each and every implementation.

    Based upon work done by Arjan van de Ven and feedback
    from Linus Torvalds.

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Sep, 2009

1 commit

  • Christoph Lameter pointed out that packet drops at qdisc level where not
    accounted in SNMP counters. Only if application sets IP_RECVERR, drops
    are reported to user (-ENOBUFS errors) and SNMP counters updated.

    IP_RECVERR is used to enable extended reliable error message passing,
    but these are not needed to update system wide SNMP stats.

    This patch changes things a bit to allow SNMP counters to be updated,
    regardless of IP_RECVERR being set or not on the socket.

    Example after an UDP tx flood
    # netstat -s
    ...
    IP:
    1487048 outgoing packets dropped
    ...
    Udp:
    ...
    SndbufErrors: 1487048

    send() syscalls, do however still return an OK status, to not
    break applications.

    Note : send() manual page explicitly says for -ENOBUFS error :

    "The output queue for a network interface was full.
    This generally indicates that the interface has stopped sending,
    but may be caused by transient congestion.
    (Normally, this does not occur in Linux. Packets are just silently
    dropped when a device queue overflows.) "

    This is not true for IP_RECVERR enabled sockets : a send() syscall
    that hit a qdisc drop returns an ENOBUFS error.

    Many thanks to Christoph, David, and last but not least, Alexey !

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet