11 Oct, 2007

11 commits

  • This change allows the generic attribute interface to be used within
    the netfilter subsystem where this flag was initially introduced.

    The byte-order flag is not yet used; its intended use is to
    allow automatic byte-order conversions for all atomic types.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
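
The flag scheme described above can be sketched as follows. This is a minimal illustration, assuming the flags live in the top bits of a 16-bit attribute type; the DEMO_ names and the exact bit positions are assumptions made for this sketch and are not taken from the commit.

```c
#include <stdint.h>

/* Hypothetical flag bits stored in the upper bits of the attribute type. */
#define DEMO_NLA_F_NESTED        (1u << 15)
#define DEMO_NLA_F_NET_BYTEORDER (1u << 14)
#define DEMO_NLA_TYPE_MASK \
    ((uint16_t)~(DEMO_NLA_F_NESTED | DEMO_NLA_F_NET_BYTEORDER))

/* Strip the flag bits to recover the plain attribute type. */
static inline uint16_t demo_nla_type(uint16_t raw)
{
    return raw & DEMO_NLA_TYPE_MASK;
}

/* Report whether the payload is marked as being in network byte order. */
static inline int demo_nla_is_net_byteorder(uint16_t raw)
{
    return (raw & DEMO_NLA_F_NET_BYTEORDER) != 0;
}
```

A parser built this way masks the type before dispatching on it, so callers that ignore the flags keep working.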
     
  • This patch makes most of the generic device layer network
    namespace safe. This patch makes dev_base_head a
    network namespace variable, and then it picks up
    a few associated variables. The functions:
    dev_getbyhwaddr
    dev_getfirsthwbytype
    dev_get_by_flags
    dev_get_by_name
    __dev_get_by_name
    dev_get_by_index
    __dev_get_by_index
    dev_ioctl
    dev_ethtool
    dev_load
    wireless_process_ioctl

    were modified to take a network namespace argument, and
    deal with it.

    vlan_ioctl_set and brioctl_set were modified so their
    hooks will receive a network namespace argument.

    So basically anything in the core of the network stack that was
    affected by the change of dev_base was modified to handle
    multiple network namespaces. The rest of the network stack was
    simply modified to explicitly use &init_net, the initial network
    namespace. This can be fixed when those components of the network
    stack are modified to handle multiple network namespaces.

    For now the ifindex generator is left global.

    Fundamentally ifindex numbers are per namespace, or else
    we will have corner case problems with migration when
    we get that far.

    At the same time there are assumptions in the network stack
    that the ifindex of a network device won't change. Making
    the ifindex number global seems a good compromise until
    the network stack can cope with ifindex changes when
    you change namespaces, and the like.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
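
A user-space model of the idea, assuming a device list that hangs off each namespace while the ifindex generator stays global. All nsd_ names are hypothetical stand-ins, not the kernel's structures.

```c
#include <stddef.h>
#include <string.h>

struct nsd_dev {
    const char *name;
    int ifindex;
    struct nsd_dev *next;
};

/* Stand-in for a network namespace: it owns its own device list. */
struct nsd_net {
    struct nsd_dev *dev_base_head;
};

/* The ifindex generator is deliberately global, as in the commit text. */
static int nsd_ifindex_counter;

static void nsd_register(struct nsd_net *net, struct nsd_dev *dev,
                         const char *name)
{
    dev->name = name;
    dev->ifindex = ++nsd_ifindex_counter;
    dev->next = net->dev_base_head;
    net->dev_base_head = dev;
}

/* Lookup takes the namespace explicitly and never sees other namespaces. */
static struct nsd_dev *nsd_dev_get_by_name(struct nsd_net *net,
                                           const char *name)
{
    for (struct nsd_dev *d = net->dev_base_head; d; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

Two namespaces can each hold a device called "eth0", yet the two devices still receive distinct ifindex values.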
     
  • Each netlink socket will live in exactly one network namespace,
    this includes the controlling kernel sockets.

    This patch updates all of the existing netlink protocols
    to only support the initial network namespace. Requests
    by clients in other namespaces will get -ECONNREFUSED,
    as they would if the kernel did not have support for
    that netlink protocol compiled in.

    As each netlink protocol is updated to be multiple network
    namespace safe it can register multiple kernel sockets
    to acquire a presence in the rest of the network namespaces.

    The implementation in af_netlink is a simple filter implementation
    at hash table insertion and hash table look up time.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
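
The "simple filter at hash table look up time" could look roughly like this. It is a sketch under assumptions: the table layout, the nlf_ names and the bucket count are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NLF_BUCKETS 8

struct nlf_net { int unused; };          /* stand-in for a namespace */

struct nlf_sock {
    struct nlf_net *net;                 /* namespace the socket lives in */
    uint32_t pid;
    struct nlf_sock *next;
};

struct nlf_table {
    struct nlf_sock *bucket[NLF_BUCKETS];
};

static void nlf_insert(struct nlf_table *t, struct nlf_sock *sk)
{
    struct nlf_sock **head = &t->bucket[sk->pid % NLF_BUCKETS];

    sk->next = *head;
    *head = sk;
}

/* A socket matches only if both the pid and the namespace agree. */
static struct nlf_sock *nlf_lookup(struct nlf_table *t, struct nlf_net *net,
                                   uint32_t pid)
{
    for (struct nlf_sock *sk = t->bucket[pid % NLF_BUCKETS]; sk; sk = sk->next)
        if (sk->pid == pid && sk->net == net)
            return sk;
    return NULL;
}
```

A failed lookup from another namespace is what the caller would turn into the connection-refused error.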
     
  • Every user of the network device notifiers is either a protocol
    stack or a pseudo device. If a protocol stack that does not have
    support for multiple network namespaces receives an event for a
    device that is not in the initial network namespace it quite possibly
    can get confused and do the wrong thing.

    To avoid problems until all of the protocol stacks are converted
    this patch modifies all netdev event handlers to ignore events on
    devices that are not in the initial network namespace.

    As the rest of the code is made network namespace aware these
    checks can be removed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
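
A sketch of the guard added to each event handler, assuming the device carries a pointer to its namespace. The evt_ names and return codes are hypothetical stand-ins for the kernel's notifier types.

```c
#define EVT_NOTIFY_DONE 0   /* event ignored */
#define EVT_NOTIFY_OK   1   /* event handled */

struct evt_net { int unused; };
static struct evt_net evt_init_net;      /* the initial namespace */

struct evt_dev {
    struct evt_net *net;                 /* namespace the device belongs to */
    int events_seen;
};

/* Handler for a stack that is not yet namespace aware. */
static int evt_netdev_event(struct evt_dev *dev, unsigned long event)
{
    (void)event;

    /* Ignore devices outside the initial namespace. */
    if (dev->net != &evt_init_net)
        return EVT_NOTIFY_DONE;

    dev->events_seen++;
    return EVT_NOTIFY_OK;
}
```

Once a protocol stack is converted, the early return is the only line that has to go.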
     
  • This patch modifies every packet receive function
    registered with dev_add_pack() to drop packets if they
    are not from the initial network namespace.

    This should ensure that the various network stacks do
    not receive packets in anything but the initial network
    namespace until the code has been converted and is ready
    for them.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This patch passes in the namespace a new socket should be created in
    and has the socket code do the appropriate reference counting. By
    virtue of this all socket create methods are touched. In addition
    the socket create methods are modified so that they will fail if
    you attempt to create a socket in a non-default network namespace.

    Failing if we attempt to create a socket outside of the default
    network namespace ensures that as we incrementally make the network stack
    network namespace aware we will not export functionality that someone
    has not audited and made certain is network namespace safe.
    This allows us to partially enable network namespaces before all of the
    exotic protocols are supported.

    Any protocol layers I have missed will fail to compile because I now
    pass an extra parameter into the socket creation code.

    [ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
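
The create-time check might be modelled like this. The skc_ names are hypothetical, and the choice of -EAFNOSUPPORT as the error value is an assumption of this sketch; the commit text only says the call fails.

```c
#include <errno.h>

struct skc_net { int unused; };
static struct skc_net skc_init_net;      /* the default namespace */

struct skc_sock {
    struct skc_net *net;                 /* namespace recorded at creation */
};

/* Create method for a protocol that has not been audited for namespaces. */
static int skc_create(struct skc_net *net, struct skc_sock *sock)
{
    /* Refuse anything outside the default namespace (assumed errno). */
    if (net != &skc_init_net)
        return -EAFNOSUPPORT;

    sock->net = net;
    return 0;
}
```

Because the namespace is an extra parameter, a protocol that was missed fails at compile time rather than silently running in the wrong namespace.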
     
  • This patch makes /proc/net per network namespace. It modifies the global
    variables proc_net and proc_net_stat to be per network namespace.
    The proc_net file helpers are modified to take a network namespace argument,
    and all of their callers are fixed to pass &init_net for that argument.
    This ensures that all of the /proc/net files are only visible and
    usable in the initial network namespace until the code behind them
    has been updated to handle multiple network namespaces.

    Making /proc/net per namespace is necessary as at least some files
    in /proc/net depend upon the set of network devices which is per
    network namespace, and even more files in /proc/net have contents
    that are relevant to a single network namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • This trivial patch removes the unneeded pointer iph, which is never used.

    Signed-off-by: Micah Gruber
    Signed-off-by: David S. Miller

    Micah Gruber
     
  • An IPv6 IPsec tunnel gateway incorrectly sends a redirect to the
    router or sender when the network device on which the IPsec-tunnelled
    packet arrived is the same as the one on which the decapsulated packet
    is sent.

    With this patch, the redirect is not sent when the forwarded
    skbuff carries a secpath, since such an skbuff should be assumed to be
    a packet decapsulated from an IPsec tunnel by this host.

    This may be a rare case for an IPsec security gateway; however,
    it is not rare when the gateway is a MIPv6 Home Agent, since
    the other tunnel end-point is a Mobile Node that changes
    its attached network.

    Signed-off-by: Masahide NAKAMURA
    Signed-off-by: David S. Miller

    Masahide NAKAMURA
     
  • When XFRM policy and state become ready after a TCP connection has started,
    the traffic should be transformed immediately; however, it is not
    on IPv6 TCP.

    It depends on the dst cache replacement policy for a connected socket.
    It seems that the replacement is always done for IPv4; in the
    IPv6 case, however, it is done only when the routing cookie has changed.

    This patch fixes this so that a non-transformation dst can be replaced
    by a transformation one.
    This behavior is required by MIPv6 and improves IPv6 IPsec.

    Fixes by Masahide NAKAMURA.

    Signed-off-by: Noriaki TAKAMIYA
    Signed-off-by: Masahide NAKAMURA
    Signed-off-by: David S. Miller

    Noriaki TAKAMIYA
     
  • Add v4mapped address inline to avoid calls to ipv6_addr_type().

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
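
A user-space sketch of what such an inline can look like: a v4-mapped address is ::ffff:a.b.c.d, so the first 80 bits are zero and the next 16 are all ones. The function name is hypothetical, and this is not the kernel's implementation.

```c
#include <netinet/in.h>
#include <string.h>

/* Return nonzero if the address has the ::ffff:0:0/96 prefix. */
static inline int demo_addr_v4mapped(const struct in6_addr *a)
{
    static const unsigned char prefix[12] = {
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
    };

    return memcmp(a->s6_addr, prefix, sizeof(prefix)) == 0;
}
```

A direct prefix comparison like this avoids classifying the whole address when only the mapped-or-not answer is needed.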
     

08 Oct, 2007

1 commit

  • When the ICMPv6 Target address is multicast, Linux processes the
    redirect instead of dropping it. The problem is in this code in
    ndisc_redirect_rcv():

    if (ipv6_addr_equal(dest, target)) {
    on_link = 1;
    } else if (!(ipv6_addr_type(target) & IPV6_ADDR_LINKLOCAL)) {
    ND_PRINTK2(KERN_WARNING
    "ICMPv6 Redirect: target address is not
    link-local.\n");
    return;
    }

    This second check will succeed if the Target address is, for example,
    FF02::1 because it has link-local scope. Instead, it should be checking
    if it's a unicast link-local address, as stated in RFC 2461/4861 Section
    8.1:

    - The ICMP Target Address is either a link-local address (when
    redirected to a router) or the same as the ICMP Destination
    Address (when redirected to the on-link destination).

    I know this doesn't explicitly say unicast link-local address, but it's
    implied.

    This bug is preventing Linux kernels from achieving IPv6 Logo Phase II
    certification because of a recent error that was found in the TAHI test
    suite - Neighbor Discovery suite test 206 (v6LC.2.3.6_G) had the
    multicast address in the Destination field instead of Target field, so
    we were passing the test. This won't be the case anymore.

    The patch below fixes this problem, and also fixes ndisc_send_redirect()
    to not send an invalid redirect with a multicast address in the Target
    field. I re-ran the TAHI Neighbor Discovery section to make sure Linux
    passes all 245 tests now.

    Signed-off-by: Brian Haley
    Acked-by: David L Stevens
    Signed-off-by: David S. Miller

    Brian Haley
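
The difference between the old and the intended check can be modelled in user space. The classifier below is a simplified stand-in for ipv6_addr_type(), and the RD_ flag names and values are invented for this sketch.

```c
#include <netinet/in.h>

#define RD_ADDR_UNICAST   0x1u
#define RD_ADDR_MULTICAST 0x2u
#define RD_ADDR_LINKLOCAL 0x4u

/* Simplified classifier: multicast scope 2 also counts as link-local. */
static unsigned int rd_addr_type(const struct in6_addr *a)
{
    if (a->s6_addr[0] == 0xff) {
        unsigned int t = RD_ADDR_MULTICAST;

        if ((a->s6_addr[1] & 0x0f) == 0x02)
            t |= RD_ADDR_LINKLOCAL;
        return t;
    }
    if (a->s6_addr[0] == 0xfe && (a->s6_addr[1] & 0xc0) == 0x80)
        return RD_ADDR_UNICAST | RD_ADDR_LINKLOCAL;
    return RD_ADDR_UNICAST;
}

/* Old test: any link-local scope passes, including FF02::1. */
static int rd_target_ok_old(const struct in6_addr *target)
{
    return (rd_addr_type(target) & RD_ADDR_LINKLOCAL) != 0;
}

/* Intended test: the target must be unicast and link-local. */
static int rd_target_ok_new(const struct in6_addr *target)
{
    return rd_addr_type(target) == (RD_ADDR_UNICAST | RD_ADDR_LINKLOCAL);
}
```

With FF02::1 as the target the old test accepts the redirect and the stricter one rejects it, while FE80::1 passes both.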
     

29 Sep, 2007

1 commit

  • Based upon a report and initial patch by Peter Lieven.

    tcp4_md5sig_key and tcp6_md5sig_key need to start with
    the exact same members as tcp_md5sig_key, because they
    are both cast to that type by tcp_v{4,6}_md5_do_lookup().

    Unfortunately tcp{4,6}_md5sig_key use a u16 for the key
    length instead of a u8, which is what tcp_md5sig_key
    uses. This just so happens to work by accident on
    little-endian, but on big-endian it doesn't.

    Instead of casting, just place tcp_md5sig_key as the first member of
    the address-family specific structures, adjust the access sites, and
    kill off the ugly casts.

    Signed-off-by: David S. Miller

    David S. Miller
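
The embedding approach can be sketched as below. The struct and function names are hypothetical simplifications; the point is that the lookup returns the address of the embedded member instead of casting between structs whose fields differ in width.

```c
#include <stddef.h>
#include <stdint.h>

/* Common part shared by all address families. */
struct md5d_key_base {
    uint8_t *key;
    uint8_t keylen;
};

/* Address-family specific key: the common part is the first member. */
struct md5d_key_v4 {
    struct md5d_key_base base;
    uint32_t addr;
};

/* Return the common part directly; no cast between struct types is needed. */
static struct md5d_key_base *md5d_v4_lookup(struct md5d_key_v4 *keys,
                                            size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (keys[i].addr == addr)
            return &keys[i].base;
    return NULL;
}
```

Since the field types are now shared by construction, a mismatch such as u16 against u8 cannot reappear on either endianness.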
     

17 Sep, 2007

1 commit

  • The commit 95c385 broke proper source address selection for cases in which
    there is an address which is marked 'deprecated'. The commit mistakenly
    changed ifa->flags to ifa_result->flags (probably copy/paste error from a
    few lines above) in the 'Rule 3' address selection code.

    The patch restores the previous RFC-compliant behavior.

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

15 Sep, 2007

2 commits


11 Sep, 2007

3 commits

  • Some of skbs in sk->write_queue do not have skb->dst because
    we do not fill skb->dst when we allocate new skb in append_data().

    BTW, I think we may not need to (or we should not) increment some stats
    when using corking; if 100 sendmsg() (with MSG_MORE) result in 2 packets,
    how many should we increment?

    If 100, we should set skb->dst for every queued skb.

    If 1 (or 2 (*)), we increment the stats for the first queued skb and
    we should just skip incrementing OutDiscards for the rest of the queued skbs,
    and we should also implement these semantics in other places;
    e.g., we should increment other stats just once, not 100 times.

    *: depends on the place we are discarding the datagram.

    I guess we should just increment by 1 (or 2).

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki
     
  • So I've had a deadlock reported to me. I've found that the sequence of
    events goes like this:

    1) process A (modprobe) runs to remove ip_tables.ko

    2) process B (iptables-restore) runs and calls setsockopt on a netfilter socket,
    increasing the ip_tables socket_ops use count

    3) process A acquires a file lock on the file ip_tables.ko, calls remove_module
    in the kernel, which in turn executes the ip_tables module cleanup routine,
    which calls nf_unregister_sockopt

    4) nf_unregister_sockopt, seeing that the use count is non-zero, puts the
    calling process into uninterruptible sleep, expecting the process using the
    socket option code to wake it up when it exits the kernel

    5) the user of the socket option code (process B), in do_ipt_get_ctl, calls
    ipt_find_table_lock, which in this case calls request_module to load
    ip_tables_nat.ko

    6) request_module forks a copy of modprobe (process C) to load the module and
    blocks until modprobe exits.

    7) Process C, forked by request_module, processes the dependencies of
    ip_tables_nat.ko, of which ip_tables.ko is one.

    8) Process C attempts to lock the requested module and all its dependencies; it
    blocks when it attempts to lock ip_tables.ko (which was previously locked in
    step 3)

    There's not really any great permanent solution to this that I can see, but I've
    developed a two-part solution that corrects the problem.

    Part 1) Modifies the nf_sockopt registration code so that, instead of using a
    use counter internal to the nf_sockopt_ops structure, we instead use a pointer
    to the registering module's owner to do module reference counting when nf_sockopt
    calls a module's set/get routine. This prevents the deadlock by preventing step 4
    from happening.

    Part 2) Enhances the modprobe utility so that by default it performs non-blocking
    remove operations (the same way rmmod does), and adds an option to explicitly
    request blocking operation. So if you select blocking operation in modprobe you
    can still cause the above deadlock, but only if you explicitly try (and since
    root can do any old stupid thing it would like.... :) ).

    Signed-off-by: Neil Horman
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Neil Horman
     
  • From: Denis V. Lunev

    addrconf_dad_failure calls addrconf_dad_stop, which takes the referenced address
    and drops the count. So the in6_ifa_put performed at out: is extra. This
    results in the message "Freeing alive inet6 address" and in dst entries not being released.

    Signed-off-by: Denis V. Lunev
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Denis V. Lunev
     

27 Aug, 2007

1 commit


22 Aug, 2007

1 commit

  • If an ICMPv6 "Packet Too Big" message is received after sending SCTP DATA,
    a kernel panic will occur when the SCTP DATA is sent again.

    This is because of a bad destination address in the call to skb_copy_bits().

    The messages sequence is like this:

    Endpoint A Endpoint B

    (Packet Too Big pmtu=1280)
    ] Not tainted VLI
    EFLAGS: 00010282 (2.6.23-rc2 #1)
    EIP is at skb_copy_bits+0x4f/0x1ef
    eax: 000004d0 ebx: ce12a980 ecx: 00000134 edx: cfd5a880
    esi: c8246858 edi: 00000000 ebp: c0759b14 esp: c0759adc
    ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
    Process swapper (pid: 0, ti=c0759000 task=c06d0340 task.ti=c0713000)
    Stack: c0759b88 c0405867 ce12a980 c8bff838 c789c084 00000000 00000028 cfd5a880
    d09f1890 000005dc 0000007b ce12a980 cfd5a880 c8bff838 c0759b88 d09bc521
    000004d0 fffff96c 00000200 00000100 c0759b50 cfd5a880 00000246 c0759bd4
    Call Trace:
    [] show_trace_log_lvl+0x1a/0x2f
    [] show_stack_log_lvl+0x9b/0xa3
    [] show_registers+0x1b8/0x289
    [] die+0x113/0x246
    [] do_page_fault+0x4ad/0x57e
    [] error_code+0x72/0x78
    [] ip6_output+0x8e5/0xab2 [ipv6]
    [] ip6_xmit+0x2ea/0x3a3 [ipv6]
    [] sctp_v6_xmit+0x248/0x253 [sctp]
    [] sctp_packet_transmit+0x53f/0x5ae [sctp]
    [] sctp_outq_flush+0x555/0x587 [sctp]
    [] sctp_retransmit+0xf8/0x10f [sctp]
    [] sctp_icmp_frag_needed+0x57/0x5b [sctp]
    [] sctp_v6_err+0xcd/0x148 [sctp]
    [] icmpv6_notify+0xe6/0x167 [ipv6]
    [] icmpv6_rcv+0x7d7/0x849 [ipv6]
    [] ip6_input+0x1dc/0x310 [ipv6]
    [] ipv6_rcv+0x294/0x2df [ipv6]
    [] netif_receive_skb+0x2d2/0x335
    [] process_backlog+0x7f/0xd0
    [] net_rx_action+0x96/0x17e
    [] __do_softirq+0x64/0xcd
    [] do_softirq+0x5c/0xac
    =======================
    Code: 00 00 29 ca 89 d0 2b 45 e0 89 55 ec 85 c0 7e 35 39 45 08 8b 55 e4 0f 4e 45 08 8b 75 e0 8b 7d dc 89 c1 c1 e9 02 03 b2 a0 00 00 00 a5 89 c1 83 e1 03 74 02 f3 a4 29 45 08 0f 84 7b 01 00 00 01
    EIP: [] skb_copy_bits+0x4f/0x1ef SS:ESP 0068:c0759adc
    Kernel panic - not syncing: Fatal exception in interrupt

    Arnaldo says:
    ====================
    Thanks! I'm to blame for this one, problem was introduced in:

    b0e380b1d8a8e0aca215df97702f99815f05c094

    @@ -761,7 +762,7 @@ slow_path:
    /*
    * Copy a block of the IP datagram.
    */
    - if (skb_copy_bits(skb, ptr, frag->h.raw, len))
    + if (skb_copy_bits(skb, ptr, skb_transport_header(skb),
    len))
    BUG();
    left -= len;
    ====================

    Signed-off-by: Wei Yongjun
    Acked-by: YOSHIFUJI Hideaki
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Wei Yongjun
     

16 Aug, 2007

1 commit

  • A similar fix to netfilter from Eric Dumazet inspired me to
    look around a bit using some grep/sed, as looking for
    this kind of bug seemed easy to automate. This is one of the cases
    I found where it looks like the semicolon is not valid.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

14 Aug, 2007

1 commit


03 Aug, 2007

1 commit

  • As discovered by Evgeniy Polyakov, if we try to sendmsg after
    a connection reset, we can do incredibly stupid things.

    The core issue is that inet_sendmsg() tries to autobind the
    socket, but we should never do that for TCP. Instead we should
    just go straight into TCP's sendmsg() code which will do all
    of the necessary state and pending socket error checks.

    TCP's sendpage already directly vectors to tcp_sendpage(), so this
    merely brings sendmsg() in line with that.

    Signed-off-by: David S. Miller

    David S. Miller
     

31 Jul, 2007

4 commits


27 Jul, 2007

1 commit

  • Convert rel_info to host-endian before calling ip6_tnl_err().
    Things become much more straightforward that way.
    The key observation (and the reason why that code actually
    worked) is that after ip6_tnl_err() we either immediately
    bailed out or had rel_info set to 0 or had it set to host-endian
    and guaranteed to hit
    (rel_type == ICMP_DEST_UNREACH && rel_code == ICMP_FRAG_NEEDED)
    case. So inconsistent endianness didn't really lead to bugs,
    but it had been subtle and prone to breakage. New variant is
    saner and obviously safe.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

25 Jul, 2007

2 commits


22 Jul, 2007

1 commit


20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

15 Jul, 2007

7 commits

  • Currently, if the link is brought down via ip link or ifconfig down,
    the inet6addr_chain notifiers are not called even though all
    the addresses are removed from the interface. This caused SCTP
    to add duplicate addresses to its list.

    Signed-off-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • From: Dmitry Butskoy

    Taken from http://bugzilla.kernel.org/show_bug.cgi?id=8747

    Problem Description:

    It is related to the possibility to obtain MSG_ERRQUEUE messages from the udp
    and raw sockets, both connected and unconnected.

    There is a little typo in net/ipv6/icmp.c code, which prevents such messages
    from being delivered to the errqueue of the corresponding raw socket when the
    socket is CONNECTED. The typo is a swap of the local/remote addresses.

    Consider the __raw_v6_lookup() function from net/ipv6/raw.c. When a raw socket is
    looked up the usual way, it is something like:

    sk = __raw_v6_lookup(sk, nexthdr, daddr, saddr, IP6CB(skb)->iif);

    where "daddr" is a destination address of the incoming packet (IOW our local
    address), "saddr" is a source address of the incoming packet (the remote end).

    But when the raw socket is looked up for some icmp error report, in
    net/ipv6/icmp.c:icmpv6_notify() , daddr/saddr are obtained from the echoed
    fragment of the "bad" packet, i.e. "daddr" is the original destination
    address of that packet, "saddr" is our local address. Hence,
    icmpv6_notify() must use "saddr, daddr" in its arguments, not "daddr, saddr"
    ...

    Steps to reproduce:

    Create a raw socket, connect it to an address, and cause some error
    situation: e.g. set ttl=1 where the remote address is more than 1 hop away.
    Set IPV6_RECVERR.
    Then send something and wait for the error (e.g. poll() with POLLERR|POLLIN).
    You should receive a "time exceeded" icmp message (because of "ttl=1"), but the
    socket does not receive it.

    If you do not connect your raw socket, you will receive MSG_ERRQUEUE
    successfully. (The reason is that for unconnected socket there are no actual
    checks for local/remote addresses).

    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Dmitry Butskoy
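
The address swap can be shown with a small model of the lookup. The rl_ names are hypothetical, and the matching rules are simplified to "an unspecified address matches anything", which mirrors the remark that unconnected sockets skip the address checks.

```c
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

struct rl_sock {
    struct in6_addr local;    /* unspecified: not bound */
    struct in6_addr remote;   /* unspecified: not connected */
};

/* Core lookup: loc is our address, rmt is the peer's address. */
static struct rl_sock *rl_lookup(struct rl_sock *socks, size_t n,
                                 const struct in6_addr *loc,
                                 const struct in6_addr *rmt)
{
    for (size_t i = 0; i < n; i++) {
        struct rl_sock *s = &socks[i];

        if (!IN6_IS_ADDR_UNSPECIFIED(&s->local) &&
            memcmp(&s->local, loc, sizeof(*loc)) != 0)
            continue;
        if (!IN6_IS_ADDR_UNSPECIFIED(&s->remote) &&
            memcmp(&s->remote, rmt, sizeof(*rmt)) != 0)
            continue;
        return s;
    }
    return NULL;
}

/* Incoming packet: its destination is local, its source is remote. */
static struct rl_sock *rl_lookup_rx(struct rl_sock *socks, size_t n,
                                    const struct in6_addr *saddr,
                                    const struct in6_addr *daddr)
{
    return rl_lookup(socks, n, daddr, saddr);
}

/* ICMP error: the echoed packet was sent by us, so its source is local. */
static struct rl_sock *rl_lookup_err(struct rl_sock *socks, size_t n,
                                     const struct in6_addr *saddr,
                                     const struct in6_addr *daddr)
{
    return rl_lookup(socks, n, saddr, daddr);
}
```

Passing the echoed packet's addresses in receive order makes a connected socket compare its local address against the peer, so the lookup fails; an unconnected socket matches either way.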
     
  • Also remove two unnecessary EXPORT_SYMBOLs and move the
    nf_conntrack_l3proto_ipv4 declaration to the correct file.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Lower ip6tables, arptables and ebtables printk severity similar to
    Dan Aloni's patch for iptables.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • nf_ct_get_tuple() requires the offset to the transport header, and that bothers
    callers such as the icmp[v6] l4proto modules. This introduces a new function
    to simplify them.

    Signed-off-by: Yasuyuki Kozakai
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Yasuyuki Kozakai
     
  • The icmp[v6] l4proto modules parse the headers in an ICMP[v6] error to get the tuple.
    But they have to find the offset to the transport protocol header before that.
    Their processing is almost the same as prepare() of the l3proto modules.
    This makes prepare() more generic to simplify the icmp[v6] l4proto modules
    later.

    Signed-off-by: Yasuyuki Kozakai
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Yasuyuki Kozakai
     
  • Signed-off-by: Yasuyuki Kozakai
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Yasuyuki Kozakai