12 Jul, 2005

1 commit

  • Move the protocol specific config options out to the specific protocols.
    With this change net/Kconfig now starts to become readable and serve as a
    good basis for further re-structuring.

    The menu structure is left almost intact, except that indentation is
    fixed in most cases. Most visible are the INET changes where several
    "depends on INET" are replaced with a single if INET / endif pair.

    Several new files were created to accomplish this change - they are
    small but serve the purpose that config options are now distributed
    out where they belong.

    Signed-off-by: Sam Ravnborg
    Signed-off-by: David S. Miller

    Sam Ravnborg
     

09 Jul, 2005

3 commits


06 Jul, 2005

3 commits

  • Make TSO segment transmit size decisions at send time, not earlier.

    The basic scheme is that we try to build as large a TSO frame as
    possible when pulling in the user data, but the size of the TSO frame
    output to the card is determined at transmit time.

    This is guided by tp->xmit_size_goal. It is always set to a multiple
    of MSS and tells sendmsg/sendpage how large an SKB to try and build.

    Later, tcp_write_xmit() and tcp_push_one() chop up the packet if
    necessary and conditions warrant. These routines can also decide to
    "defer" in order to wait for more ACKs to arrive and thus allow larger
    TSO frames to be emitted.

    A general observation is that TSO elongates the pipe, thus requiring a
    larger congestion window and larger buffering especially at the sender
    side. Therefore, it is important that applications 1) get a large
    enough socket send buffer (this is accomplished by our dynamic send
    buffer expansion code) 2) do large enough writes.

    Signed-off-by: David S. Miller

    David S. Miller
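
The two-stage sizing described in this commit can be sketched in user-space C. This is a minimal illustration, not the kernel code: the function names, the cap passed in, and the "send what the congestion window allows, otherwise defer" rule are simplifying assumptions. The real tcp_write_xmit() and tcp_push_one() weigh more conditions than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the build-time goal is the largest multiple of the
 * MSS that fits in max_bytes, and never less than one MSS. */
static size_t xmit_size_goal(size_t mss, size_t max_bytes)
{
    assert(mss > 0);
    size_t segs = max_bytes / mss;
    if (segs == 0)
        segs = 1;
    return segs * mss;
}

/* Hypothetical transmit-time decision: how many queued bytes to emit now,
 * given how many more segments the congestion window admits.
 * A return value of 0 stands for "defer and wait for more ACKs". */
static size_t tso_send_now(size_t queued, size_t mss, size_t cwnd_room_segs)
{
    size_t room = cwnd_room_segs * mss;
    if (room == 0)
        return 0;                 /* nothing fits: defer */
    return queued < room ? queued : room;  /* chop the frame if needed */
}
```

With an MSS of 1460 and a 65535-byte cap, the goal comes out at 44 segments; a frame built to that size is still cut down at transmit time if the window only has room for fewer segments.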
     
  • Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

29 Jun, 2005

4 commits


24 Jun, 2005

3 commits


22 Jun, 2005

7 commits

  • Sit tunnel logging is currently broken:

    MAC=01:23:45:67:89:ab->01:23:45:47:89:ac TUNNEL=123.123. 0.123-> 12.123. 6.123

    Apart from the broken IP address, MAC addresses are printed differently
    for sit tunnels than for everything else.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • I missed this one when fixing up iptable_raw.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Netfilter assumes that skb->data == skb->nh.ipv6h

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Here is a simplified version of the patch to fix a bug in IPv6
    multicasting. It:

    1) adds existence check & EADDRINUSE error for regular joins
    2) adds an exception for EADDRINUSE in the source-specific multicast
    join (where a prior join is ok)
    3) adds a missing/needed read_lock on sock_mc_list; without it, this
    would've raced with destroying the socket on interface down
    4) adds a "leave group" in the (INCLUDE, empty) source filter case.
    This frees unneeded socket buffer memory, but also prevents
    an inappropriate interaction among the 8 socket options that
    mess with this. Some would fail as if in the group when you
    aren't really.

    Item #4 had a locking bug in the last version of this patch; rather than
    removing the idev->lock read lock only, I've simplified it to remove
    all lock state in the path and treat it as a direct "leave group" call for
    the (INCLUDE,empty) case it covers. Tested on an MP machine. :-)

    Much thanks to HoerdtMickael, who reported the original bug.

    Signed-off-by: David L Stevens
    Signed-off-by: David S. Miller

    David L Stevens
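
Items 1) and 2) of this commit can be illustrated with a small user-space sketch. Everything here is an assumption made for the example: the fixed-size array, the integer group identifiers, and the function names are hypothetical stand-ins for the kernel's per-socket multicast list, and the locking from item 3) is left out.

```c
#include <assert.h>
#include <errno.h>

#define MAX_GROUPS 8

/* Hypothetical stand-in for a socket's list of joined groups. */
struct mc_list {
    unsigned int groups[MAX_GROUPS];
    int count;
};

/* Regular join: joining a group the socket is already in is an error. */
static int mc_join(struct mc_list *l, unsigned int group)
{
    for (int i = 0; i < l->count; i++)
        if (l->groups[i] == group)
            return -EADDRINUSE;
    if (l->count == MAX_GROUPS)
        return -ENOBUFS;
    l->groups[l->count++] = group;
    return 0;
}

/* Source-specific join: a prior join of the same group is acceptable,
 * so EADDRINUSE from the regular join is not treated as a failure. */
static int mc_source_join(struct mc_list *l, unsigned int group)
{
    int err = mc_join(l, group);
    if (err == -EADDRINUSE)
        err = 0;
    return err;
}
```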
     
  • Essentially netlink at the moment always reports a pid and sequence of 0
    for v6 route activities.
    To understand the repercussions of this look at:
    http://lists.quagga.net/pipermail/quagga-dev/2005-June/003507.html

    While fixing this, I took the liberty to resolve the outstanding issue
    of IPv6 routes inserted via ioctls to have the correct pids as well.

    This patch tries to behave as closely as possible to the v4 routes, i.e.
    it maintains whatever PID the socket issuing the command owns as opposed
    to the process. That made the patch a little bulky.

    I have tested against both a netlink-derived utility to add/del routes
    and an ioctl-derived one. The Quagga folks have tested against quagga.
    This fixes the problem and so far hasn't been detected to introduce any
    new issues.

    Signed-off-by: Jamal Hadi Salim
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

21 Jun, 2005

1 commit

  • This patch adds xfrm_init_state which is simply a wrapper that calls
    xfrm_get_type and subsequently x->type->init_state. It also gets rid
    of the unused args argument.

    Abstracting it out allows us to add common initialisation code, e.g.,
    to set family-specific flags.

    The add_time setting in xfrm_user.c was deleted because it's already
    set by xfrm_state_alloc.

    Signed-off-by: Herbert Xu
    Acked-by: James Morris
    Signed-off-by: David S. Miller

    Herbert Xu
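
The wrapper pattern this commit describes, look up the type and then call its init hook, might be sketched as follows. The struct layouts, the type table, and the error code are assumptions for illustration only; they are simplified stand-ins, not the kernel's xfrm definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct state;  /* simplified stand-in for struct xfrm_state */

struct state_type {
    int proto;
    int (*init_state)(struct state *x);
};

struct state {
    int proto;
    const struct state_type *type;
    int initialised;
};

/* Hypothetical per-type initialiser. */
static int demo_init(struct state *x)
{
    x->initialised = 1;
    return 0;
}

/* Hypothetical table of registered types (proto 50 is just an example). */
static const struct state_type demo_types[] = {
    { 50, demo_init },
};

static const struct state_type *get_type(int proto)
{
    for (size_t i = 0; i < sizeof(demo_types) / sizeof(demo_types[0]); i++)
        if (demo_types[i].proto == proto)
            return &demo_types[i];
    return NULL;
}

/* The wrapper: one place to put common initialisation, then the
 * type lookup, then the type-specific init call. */
static int init_state(struct state *x)
{
    /* common, family-independent setup would go here */
    x->type = get_type(x->proto);
    if (x->type == NULL)
        return -EPROTONOSUPPORT;  /* assumed error code for this sketch */
    return x->type->init_state(x);
}
```

Callers then need a single call instead of repeating the lookup and the hook invocation at every site.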
     

19 Jun, 2005

8 commits

  • In light of my recent patch to net/ipv4/udp.c that replaced the
    spin_lock_irq calls on the receive queue lock with spin_lock_bh,
    here is a similar patch for all other occurrences of spin_lock_irq
    on receive/error queue locks in IPv4 and IPv6.

    In these stacks, we know that they can only be entered from user
    or softirq context. Therefore it's safe to disable BH only.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • This patch ensures that netlink events created as a result of programs
    using ioctls (such as ifconfig, route etc) contain the correct PID for
    those events.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • This patch converts "unsigned flags" to use more explicit types like u16
    instead and incrementally introduces NLMSG_NEW().

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • This patch rectifies some rtnetlink message builders that derive the
    flags from the pid. It is now explicit like the other cases
    which get it right. Also fixes half a dozen dumpers which did not
    set NLM_F_MULTI at all.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
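
The difference between deriving the flags from the pid and passing them explicitly can be shown with a small sketch. The header struct and both fill functions are hypothetical; only the numeric value of the multipart flag matches NLM_F_MULTI in <linux/netlink.h>. The point, presumably, is that once notifications carry a real pid, a builder that infers "multipart" from a non-zero pid marks single notifications incorrectly.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as NLM_F_MULTI; redefined so the sketch builds anywhere. */
#define DEMO_NLM_F_MULTI 0x2

/* Hypothetical, trimmed-down message header. */
struct demo_nlmsghdr {
    uint32_t seq;
    uint32_t pid;
    uint16_t flags;
};

/* Pattern being removed: the flag is guessed from the pid. */
static void fill_derived(struct demo_nlmsghdr *nlh, uint32_t pid, uint32_t seq)
{
    nlh->pid = pid;
    nlh->seq = seq;
    nlh->flags = pid ? DEMO_NLM_F_MULTI : 0;
}

/* Explicit pattern: the caller states whether this is part of a dump. */
static void fill_explicit(struct demo_nlmsghdr *nlh, uint32_t pid,
                          uint32_t seq, uint16_t flags)
{
    nlh->pid = pid;
    nlh->seq = seq;
    nlh->flags = flags;
}
```

A dumper would pass DEMO_NLM_F_MULTI for every message in the dump, while a one-off notification passes 0 regardless of the pid it reports.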
     
  • Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • This chunks out the accept_queue and tcp_listen_opt code and moves
    them to net/core/request_sock.c and include/net/request_sock.h, to
    make it useful for other transport protocols, DCCP being the first one
    to use it.

    Next patches will rename tcp_listen_opt to accept_sock and remove the
    inline tcp functions that just call a reqsk_queue_ function.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Ok, this one just renames some stuff to have a better namespace and to
    dissociate it from TCP:

    struct open_request -> struct request_sock
    tcp_openreq_alloc -> reqsk_alloc
    tcp_openreq_free -> reqsk_free
    tcp_openreq_fastfree -> __reqsk_free

    With this most of the infrastructure closely resembles a struct
    sock methods subset.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Kept this first changeset minimal, without changing existing names to
    ease peer review.

    Basically tcp_openreq_alloc now receives the or_calltable, that in turn
    has two new members:

    ->slab, that replaces tcp_openreq_cachep
    ->obj_size, to inform the size of the openreq descendant for
    a specific protocol

    The protocol specific fields in struct open_request were moved to a
    class hierarchy, with the things that are common to all connection
    oriented PF_INET protocols in struct inet_request_sock, the TCP ones
    in tcp_request_sock, that is an inet_request_sock, that is an
    open_request.

    I.e. this uses the same approach used for the struct sock class
    hierarchy, with sk_prot indicating if the protocol wants to use the
    open_request infrastructure by filling in sk_prot->rsk_prot with an
    or_calltable.

    Results? Performance is improved and TCP v4 now uses only 64 bytes per
    open request minisock, down from 96 without this patch :-)

    Next changeset will rename some of the structs, fields and functions
    mentioned above: struct or_calltable is way unclear, so better name it
    struct request_sock_ops, s/struct open_request/struct request_sock/g,
    etc.

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
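
The "class hierarchy" in this changeset relies on embedding the base struct as the first member, with the ops table carrying the size of the most derived type. A minimal user-space sketch, using the post-rename names from the neighbouring commit: the field names are illustrative assumptions, and calloc()/free() stand in for the per-protocol slab cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Per-protocol description; obj_size is the size of the derived request. */
struct request_sock_ops {
    size_t obj_size;
};

/* Base "class": what every connection request has in common. */
struct request_sock {
    const struct request_sock_ops *rsk_ops;
};

/* Common to connection-oriented PF_INET protocols (fields illustrative). */
struct inet_request_sock {
    struct request_sock req;      /* must be first */
    unsigned int loc_addr, rmt_addr;
    unsigned short rmt_port;
};

/* TCP-specific part (fields illustrative). */
struct tcp_request_sock {
    struct inet_request_sock req; /* must be first */
    unsigned int rcv_isn, snt_isn;
};

/* Allocate an object of the size the protocol asked for. */
static struct request_sock *reqsk_alloc(const struct request_sock_ops *ops)
{
    assert(ops->obj_size >= sizeof(struct request_sock));
    struct request_sock *req = calloc(1, ops->obj_size);
    if (req)
        req->rsk_ops = ops;
    return req;
}

static void reqsk_free(struct request_sock *req)
{
    free(req);
}

/* Downcast: valid because the base struct sits at offset 0. */
static struct tcp_request_sock *tcp_rsk(struct request_sock *req)
{
    return (struct tcp_request_sock *)req;
}
```

Generic code only ever handles struct request_sock pointers, so another protocol can supply its own ops with a different obj_size without touching the shared allocation path.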
     

14 Jun, 2005

2 commits

  • Userland layer-2 tunneling devices allocated through the TUNTAP driver
    (drivers/net/tun.c) have a type of ARPHRD_NONE, and have no link-layer
    address. The kernel complains at regular intervals when IPv6 Privacy
    extensions are enabled because it can't find a hardware address:

    Dec 29 11:02:04 auguste kernel: __ipv6_regen_rndid(idev=cb3e0c00):
    cannot get EUI64 identifier; use random bytes.

    IPv6 Privacy extensions should probably be disabled on that sort of
    device. They won't work anyway. If userland wants a more usual
    Ethernet-ish interface with usual IPv6 autoconfiguration, it will use a
    TAP device with an emulated link-layer and a random hardware address
    rather than a TUN device.

    As far as I could find, the TUN virtual device from TUNTAP is the only
    sort of device using ARPHRD_NONE as its kernel device type.

    Signed-off-by: Rémi Denis-Courmont
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Rémi Denis-Courmont
     
  • We saw the following trace several times:

    |BUG: using smp_processor_id() in preemptible [00000001] code: httpd/30137
    |caller is icmpv6_send+0x23/0x540
    | [] smp_processor_id+0x9b/0xb8
    | [] icmpv6_send+0x23/0x540

    This is because of icmpv6_socket, which is the only user of
    smp_processor_id() in icmpv6_send(), AFAIK.

    Since it should be used in non-preemptive context,
    let's defer the dereference until after disabling preemption
    (by icmpv6_xmit_lock()).

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki
     

09 Jun, 2005

1 commit


03 Jun, 2005

1 commit


30 May, 2005

1 commit


27 May, 2005

1 commit


24 May, 2005

1 commit


19 May, 2005

1 commit

  • Having frag_list members which hold wmem of an sk leads to nightmares
    with partially cloned frag skb's. The reason is that once you unleash
    a skb with a frag_list that has individual sk ownerships into the stack
    you can never undo those ownerships safely as they may have been cloned
    by things like netfilter. Since we have to undo them in order to make
    skb_linearize happy this approach leads to a dead-end.

    So let's go the other way and make this an invariant:

    For any skb on a frag_list, skb->sk must be NULL.

    That is, the socket ownership always belongs to the head skb.
    It turns out that the implementation is actually pretty simple.

    The above invariant is actually violated in the following patch
    for a short duration inside ip_fragment. This is OK because the
    offending frag_list member is either destroyed at the end of the
    slow path without being sent anywhere, or it is detached from
    the frag_list before being sent.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
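
The invariant stated in this commit is easy to express as a check. The sketch below uses heavily simplified, hypothetical structs: in the kernel the frag_list pointer lives in the skb's shared info area rather than directly in struct sk_buff, and this helper is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sock {
    int id;
};

/* Simplified stand-in for struct sk_buff. */
struct sk_buff {
    struct sock *sk;            /* owning socket, if any */
    struct sk_buff *next;       /* next member on a frag_list */
    struct sk_buff *frag_list;  /* chain of fragments hanging off the head */
};

/* Returns true when no skb on the head's frag_list has a socket owner,
 * i.e. ownership belongs to the head skb alone. */
static bool frag_list_owned_by_head(const struct sk_buff *head)
{
    for (const struct sk_buff *f = head->frag_list; f != NULL; f = f->next)
        if (f->sk != NULL)
            return false;
    return true;
}
```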
     

04 May, 2005

2 commits

  • I found a bug that stopped IPsec/IPv6 from working. About
    a month ago IPv6 started using rt6i_idev->dev on the cached socket dst
    entries. If the cached socket dst entry is IPsec, then rt6i_idev will
    be NULL.

    Since we want to look at the rt6i_idev of the original route in this
    case, the easiest fix is to store rt6i_idev in the IPsec dst entry just
    as we do for a number of other IPv6 route attributes. Unfortunately
    this means that we need some new code to handle the references to
    rt6i_idev. That's why this patch is bigger than it would otherwise be.

    I've also done the same thing for IPv4 since it is conceivable that
    once these idev attributes start getting used for accounting, we
    probably need to dereference them for IPv4 IPsec entries too.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Let's recap the problem. The current asynchronous netlink kernel
    message processing is vulnerable to these attacks:

    1) Hit and run: Attacker sends one or more messages and then exits
    before they're processed. This may confuse/disable the next netlink
    user that gets the netlink address of the attacker since it may
    receive the responses to the attacker's messages.

    Proposed solutions:

    a) Synchronous processing.
    b) Stream mode socket.
    c) Restrict/prohibit binding.

    2) Starvation: Because various netlink rcv functions were written
    to not return until all messages have been processed on a socket,
    it is possible for these functions to execute for an arbitrarily
    long period of time. If this is successfully exploited it could
    also be used to hold rtnl forever.

    Proposed solutions:

    a) Synchronous processing.
    b) Stream mode socket.

    Firstly let's cross off solution c). It only solves the first
    problem and it has user-visible impacts. In particular, it'll
    break user space applications that expect to bind or communicate
    with specific netlink addresses (pid's).

    So we're left with a choice of synchronous processing versus
    SOCK_STREAM for netlink.

    For the moment I'm sticking with the synchronous approach as
    suggested by Alexey since it's simpler and I'd rather spend
    my time working on other things.

    However, it does have a number of deficiencies compared to the
    stream mode solution:

    1) User-space to user-space netlink communication is still vulnerable.

    2) Inefficient use of resources. This is especially true for rtnetlink
    since the lock is shared with other users such as networking drivers.
    The latter could hold the rtnl while communicating with hardware which
    causes the rtnetlink user to wait when it could be doing other things.

    3) It is still possible to DoS all netlink users by flooding the kernel
    netlink receive queue. The attacker simply fills the receive socket
    with a single netlink message that fills up the entire queue. The
    attacker then continues to call sendmsg with the same message in a loop.

    Point 3) can be countered by retransmissions in user-space code; however,
    it is pretty messy.

    In light of these problems (in particular, point 3), we should implement
    stream mode netlink at some point. In the meantime, here is a patch
    that implements synchronous processing.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu