14 Aug, 2008

7 commits

  • This patch makes the multicast socket per-namespace.

    When a network namespace other than init_net has been created and a
    multicast packet is received, the kernel hangs or panics.

    How to reproduce:

    * create a child network namespace
    * create a pair of virtual veth devices
        * ip link add type veth
    * move one side of the pair to the child namespace
        * ip link set netns dev veth1
    * ping -I veth0 224.0.0.1

    The bug appears because ip_mc_init_dev() returns early when the
    namespace is not init_net, so the multicast fields are never initialized.

    BUG: soft lockup - CPU#0 stuck for 61s! [avahi-daemon:2695]
    Modules linked in:
    irq event stamp: 50350
    hardirqs last enabled at (50349): [] _spin_unlock_irqrestore+0x34/0x39
    hardirqs last disabled at (50350): [] schedule+0x9f/0x5ff
    softirqs last enabled at (45712): [] ip_setsockopt+0x8e7/0x909
    softirqs last disabled at (45710): [] _spin_lock_bh+0x8/0x27

    Pid: 2695, comm: avahi-daemon Not tainted (2.6.27-rc2-00029-g0872073 #3)
    EIP: 0060:[] EFLAGS: 00000297 CPU: 0
    EIP is at __read_lock_failed+0x8/0x10
    EAX: c4f38810 EBX: c4f38810 ECX: 00000000 EDX: c04cc22e
    ESI: fb0000e0 EDI: 00000011 EBP: 0f02000a ESP: c4e3faa0
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    CR0: 8005003b CR2: 44618a40 CR3: 04e37000 CR4: 000006d0
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: ffff0ff0 DR7: 00000400
    [] ? _raw_read_lock+0x23/0x25
    [] ? ip_check_mc+0x1c/0x83
    [] ? ip_route_input+0x229/0xe92
    [] ? trace_hardirqs_on_thunk+0xc/0x10
    [] ? do_IRQ+0x69/0x7d
    [] ? restore_nocheck_notrace+0x0/0xe
    [] ? ip_rcv+0x227/0x505
    [] ? netif_receive_skb+0xfe/0x2b3
    [] ? netif_receive_skb+0x26c/0x2b3
    [] ? process_backlog+0x73/0xbd
    [] ? net_rx_action+0xc1/0x1ae
    [] ? __do_softirq+0x7b/0xef
    [] ? do_softirq+0x37/0x4d
    [] ? dev_queue_xmit+0x3d4/0x40b
    [] ? local_bh_enable+0x96/0xab
    [] ? dev_queue_xmit+0x3d4/0x40b
    [] ? _local_bh_enable+0x79/0x88
    [] ? neigh_resolve_output+0x20f/0x239
    [] ? ip_finish_output+0x1df/0x209
    [] ? ip_dev_loopback_xmit+0x62/0x66
    [] ? ip_local_out+0x15/0x17
    [] ? ip_push_pending_frames+0x25c/0x2bb
    [] ? udp_push_pending_frames+0x2bb/0x30e
    [] ? udp_sendmsg+0x413/0x51d
    [] ? udp_sendmsg+0x433/0x51d
    [] ? inet_sendmsg+0x35/0x3f
    [] ? sock_sendmsg+0xb8/0xd1
    [] ? autoremove_wake_function+0x0/0x2b
    [] ? copy_from_user+0x32/0x5e
    [] ? copy_from_user+0x32/0x5e
    [] ? sys_sendmsg+0x18d/0x1f0
    [] ? pipe_write+0x3cb/0x3d7
    [] ? do_sync_write+0xbe/0x105
    [] ? autoremove_wake_function+0x0/0x2b
    [] ? sys_socketcall+0x176/0x1b0
    [] ? syscall_trace_enter+0x6c/0x7b
    [] ? syscall_call+0x7/0xb

    Signed-off-by: Daniel Lezcano
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • gen_kill_estimator() required rtnl_lock() protection, but since its call
    has moved into an RCU callback, __qdisc_destroy(), let's use est_lock instead.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     
  • Based upon discussions with Jarek P. and Herbert Xu.

    First, we're testing the wrong qdisc. We just reset the device
    queue qdiscs to &noop_qdisc, so checking its state is completely
    pointless here.

    We want to wait until the previous qdisc that was sitting at
    the ->qdisc pointer is not busy any more. And that would be
    ->qdisc_sleeping.

    Because of how we propagate the sampled qdisc pointer down into
    qdisc_run and friends via per-cpu ->output_queue and netif_schedule,
    we also have to wait for the __QDISC_STATE_SCHED bit to clear.

    Signed-off-by: David S. Miller

    David S. Miller
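
The quiescence test described in the commit above can be sketched in user space as follows. This is only an illustration of the idea (look at ->qdisc_sleeping, and treat either the running or the scheduled bit as "busy"); the demo_* types, the bit numbers and the helper name are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit numbers standing in for the kernel's qdisc state bits. */
enum { DEMO_STATE_RUNNING = 0, DEMO_STATE_SCHED = 1 };

/* Hypothetical, simplified stand-ins for struct Qdisc and struct netdev_queue. */
struct demo_qdisc {
    unsigned long state;
};

struct demo_txq {
    struct demo_qdisc *qdisc;          /* already reset to the no-op qdisc */
    struct demo_qdisc *qdisc_sleeping; /* the qdisc that was active before */
};

/* Return 1 while the previously active qdisc is still running or scheduled. */
static int demo_txq_is_busy(const struct demo_txq *txq)
{
    const unsigned long busy_mask =
        (1UL << DEMO_STATE_RUNNING) | (1UL << DEMO_STATE_SCHED);

    assert(txq != NULL && txq->qdisc_sleeping != NULL);
    /* Testing txq->qdisc here would be pointless: it is the no-op qdisc. */
    return (txq->qdisc_sleeping->state & busy_mask) != 0;
}
```

A caller would poll demo_txq_is_busy() for every transmit queue and only proceed once all of them report 0.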
     
  • Recent changes introduced a bug in htb_delete(): cl->parent->children
    counter update misses checking cl->parent for NULL, which is used for
    root classes, so deleting them causes an oops.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
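
The htb_delete() problem described above boils down to a missing NULL check before touching the parent's counter. A minimal user-space sketch, assuming a hypothetical demo_class structure in place of the real struct htb_class:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct htb_class. */
struct demo_class {
    struct demo_class *parent; /* NULL for a root class */
    unsigned int children;     /* number of direct child classes */
};

/*
 * Detach cl from its parent.
 * Returns 1 if a parent counter was updated, 0 if cl is a root class.
 */
static int demo_class_unlink(struct demo_class *cl)
{
    assert(cl != NULL);

    /* Root classes have no parent; dereferencing it would oops. */
    if (cl->parent == NULL)
        return 0;

    cl->parent->children--;
    cl->parent = NULL;
    return 1;
}
```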
     
  • With the new multi-queue transmit code, it is possible to accidentally
    make pktgen pick a non-existing tx queue simply by using a stale
    script to drive pktgen. Access to this non-existing tx queue will
    then trigger a bad memory access and kill the machine.

    For example, setting "queue_map_max 2" will cause my machine to die
    when accessing a garbage spinlock in the non-existing tx queue:

    BUG: spinlock bad magic on CPU#0, kpktgend_0/564
    lock: ffff88001ddf6718, .magic: ffffffff, .owner: /-1, .owner_cpu: 0
    Pid: 564, comm: kpktgend_0 Not tainted 2.6.27-rc3 #35

    Call Trace:
    [] spin_bug+0xa4/0xac
    [] _raw_spin_lock+0x23/0x123
    [] _spin_lock_bh+0x17/0x1b
    [] pktgen_thread_worker+0xa97/0x1002
    [] ? finish_task_switch+0x38/0x97
    [] ? autoremove_wake_function+0x0/0x36
    [] ? autoremove_wake_function+0x0/0x36
    [] ? pktgen_thread_worker+0x0/0x1002
    [] kthread+0x44/0x6d
    [] child_rip+0xa/0x11
    [] ? kthread+0x0/0x6d
    [] ? child_rip+0x0/0x11

    The attached patch adds some sanity checking to prevent
    these sorts of configuration errors.

    Signed-off-by: Andrew Gallatin
    Signed-off-by: David S. Miller

    Andrew Gallatin
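
One possible shape for the sanity check mentioned above is to clamp the configured queue range to the number of transmit queues the device really has. The sketch below is an assumption about the approach, not the patch itself; demo_pkt_dev, demo_clamp_queue_map() and the ntxq parameter are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the pktgen per-device settings. */
struct demo_pkt_dev {
    unsigned int queue_map_min;
    unsigned int queue_map_max;
};

/*
 * Clamp queue_map_min/queue_map_max to the valid range [0, ntxq - 1].
 * Returns the number of fields that had to be corrected.
 */
static int demo_clamp_queue_map(struct demo_pkt_dev *dev, unsigned int ntxq)
{
    int corrected = 0;

    assert(dev != NULL && ntxq > 0);

    if (dev->queue_map_min >= ntxq) {
        dev->queue_map_min = ntxq - 1;
        corrected++;
    }
    if (dev->queue_map_max >= ntxq) {
        dev->queue_map_max = ntxq - 1;
        corrected++;
    }
    return corrected;
}
```

With a single-queue device, a stale "queue_map_max 2" setting would be corrected to 0 instead of indexing a queue that does not exist.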
     
  • Thanks to Eugene Teo for reporting this problem.

    Signed-off-by: Eugene Teo
    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: Gerrit Renker
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     
  • Small fix removing an unnecessary intermediate variable.

    Signed-off-by: Jean-Christophe DUBOIS
    Signed-off-by: David S. Miller

    Jean-Christophe DUBOIS
     

13 Aug, 2008

6 commits

  • Flushing must consistently return ENOMEM on failure of any allocation.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Flushing of actions has been broken since we changed
    the semantics of netlink parsed tb[X] to mean X is an attribute type.
    This makes the flushing work.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • In case of error, the function rxrpc_get_transport returns an ERR
    pointer, but never returns a NULL pointer. So after a call to this
    function, a NULL test should be replaced by an IS_ERR test.

    A simplified version of the semantic patch that makes this change is
    as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    //
    @correct_null_test@
    expression x,E;
    statement S1, S2;
    @@
    x = rxrpc_get_transport(...)

    ? x = E;
    //

    Signed-off-by: Julien Brunel
    Signed-off-by: Julia Lawall
    Signed-off-by: David S. Miller

    Julien Brunel
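
The difference between a NULL test and an IS_ERR test can be shown with a small user-space imitation of the kernel's error-pointer convention. Everything prefixed demo_ below is a hypothetical stand-in written for this sketch; only the idea (small negative errno values encoded in the top of the pointer range) follows the kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *demo_err_ptr(long error)
{
    assert(error < 0 && error >= -DEMO_MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static long demo_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if ptr lies in the last DEMO_MAX_ERRNO addresses, i.e. encodes an error. */
static int demo_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_transport { int id; };
static struct demo_transport demo_trans = { 1 };

/* Like the function in the commit: returns an error pointer on failure, never NULL. */
static struct demo_transport *demo_get_transport(int fail)
{
    if (fail)
        return demo_err_ptr(-ENOMEM);
    return &demo_trans;
}

/* Returns the transport id, or a negative errno value on failure. */
static int demo_connect(int fail)
{
    struct demo_transport *trans = demo_get_transport(fail);

    /* A "trans == NULL" test would never fire here and the error would be missed. */
    if (demo_is_err(trans))
        return (int)demo_ptr_err(trans);
    return trans->id;
}
```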
     
  • At a minimum, the wireless extensions ought to send the name
    in addition to the ifindex.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • It's an internal implementation detail which we _should_ be free to change.
    So we did, and it promptly broke.

    The compiler should be able to work out when to use the __constant version
    anyway.

    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Andrew Morton
     
  • Alexey Dobriyan wrote:
    > On Thu, Aug 07, 2008 at 07:00:56PM +0200, John Gumb wrote:
    >> Scenario: no ipv6 default route set.
    >
    >> # ip -f inet6 route get fec0::1
    >>
    >> BUG: unable to handle kernel NULL pointer dereference at 00000000
    >> IP: [] rt6_fill_node+0x175/0x3b0
    >> EIP is at rt6_fill_node+0x175/0x3b0
    >
    > 0xffffffff80424dd3 is in rt6_fill_node (net/ipv6/route.c:2191).
    > 2186 } else
    > 2187 #endif
    > 2188 NLA_PUT_U32(skb, RTA_IIF, iif);
    > 2189 } else if (dst) {
    > 2190 struct in6_addr saddr_buf;
    > 2191 ====> if (ipv6_dev_get_saddr(ip6_dst_idev(&rt->u.dst)->dev,
    > ^^^^^^^^^^^^^^^^^^^^^^^^
    > NULL
    >
    > 2192 dst, 0, &saddr_buf) == 0)
    > 2193 NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
    > 2194 }

    The commit that changed this can't be reverted easily, but the patch
    below works for me.

    Fix NULL de-reference in rt6_fill_node() when there's no IPv6 input
    device present in the dst entry.

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
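
The NULL dereference quoted above comes from assuming that every route carries an input device. A minimal sketch of the guard, using hypothetical demo_* structures rather than the real rt6_info and inet6_dev, and treating 0 as "no particular device":

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for net_device, inet6_dev and rt6_info. */
struct demo_dev { int ifindex; };
struct demo_idev { struct demo_dev *dev; };
struct demo_route { struct demo_idev *idev; /* may be NULL */ };

/* Return the route's device, or NULL when the route has no idev attached. */
static struct demo_dev *demo_route_dev(const struct demo_route *rt)
{
    assert(rt != NULL);
    return rt->idev ? rt->idev->dev : NULL;
}

/* Interface index to use for source address selection; 0 means "any device". */
static int demo_saddr_ifindex(const struct demo_route *rt)
{
    const struct demo_dev *dev = demo_route_dev(rt);

    return dev ? dev->ifindex : 0;
}
```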
     

12 Aug, 2008

2 commits


11 Aug, 2008

9 commits

  • In order to align the coding styles of ip_vs_zero_stats() and
    its child-function ip_vs_zero_estimator(), clear ip_vs_stats
    members explicitly rather than doing a limited memset().

    This was chosen over modifying ip_vs_zero_estimator() to use
    memset() as it is more robust against changes in members
    in the relevant structures. memset() would be preferred if
    all members of the structure were to be cleared.

    Cc: Sven Wegener
    Signed-off-by: Simon Horman
    Signed-off-by: Sven Wegener

    Simon Horman
     
  • It's a global variable and automatically initialized to zero. And now we can
    also initialize the lock at compile time.

    Signed-off-by: Sven Wegener
    Acked-by: Simon Horman

    Sven Wegener
     
  • There's no reason for dynamically allocating an estimator object for every
    stats object. Directly embed an estimator object into every stats object and
    switch to using the kernel-provided list implementation. This makes the code
    much simpler and faster, as we do not need to traverse the list of all
    estimators to find the one belonging to a stats object. There's no need to use
    an rwlock, as we only have one reader. Also reorder the members of the
    estimator structure slightly to avoid padding overhead. This can't be done
    with the stats object as the members are currently copied to our user space
    object via memcpy() and changing it would break ABI.

    Signed-off-by: Sven Wegener
    Acked-by: Simon Horman

    Sven Wegener
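
Embedding the estimator in the stats object, as described above, makes the mapping between the two a matter of pointer arithmetic instead of a list walk. The sketch below illustrates this with hypothetical demo_* structures and a deliberately simplified rate calculation (a plain difference, not the averaging the real estimator performs).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified estimator: remembers the last sample and a rate. */
struct demo_estimator {
    unsigned long last_conns;
    unsigned int cps;
};

/* Hypothetical stats object with the estimator embedded, not allocated separately. */
struct demo_stats {
    unsigned long conns;
    struct demo_estimator est;
};

/* User-space equivalent of the container_of() idiom. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Map an embedded estimator back to the stats object that contains it. */
static struct demo_stats *demo_est_to_stats(struct demo_estimator *est)
{
    assert(est != NULL);
    return demo_container_of(est, struct demo_stats, est);
}

/* One estimation step: new connections since the previous sample. */
static void demo_estimate(struct demo_estimator *est)
{
    struct demo_stats *stats = demo_est_to_stats(est);

    est->cps = (unsigned int)(stats->conns - est->last_conns);
    est->last_conns = stats->conns;
}
```

Going the other way needs no lookup at all: the estimator of a stats object is simply &stats->est.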
     
  • Signed-off-by: Sven Wegener
    Acked-by: Simon Horman

    Sven Wegener
     
  • Being able to discard these functions saves a couple of bytes at runtime. The
    cleanup functions can't be annotated with __exit as they are also called from
    init functions.

    Signed-off-by: Sven Wegener
    Acked-by: Simon Horman

    Sven Wegener
     
  • No need to do it at runtime and this saves a couple of bytes in the text
    section.

    Signed-off-by: Sven Wegener
    Acked-by: Simon Horman

    Sven Wegener
     
  • Signed-off-by: Sven Wegener
    Acked-by: Simon Horman

    Sven Wegener
     
  • There is a slight chance for a deadlock in the estimator code. We can't call
    del_timer_sync() while holding our lock, as the timer might be active and
    spinning for the lock on another cpu. Work around this issue by using
    try_to_del_timer_sync() and releasing the lock. We could actually delete the
    timer outside of our lock, as the add and kill functions are only ever called
    from userspace via [gs]etsockopt() and are serialized by a mutex, but better
    make this explicit.

    Signed-off-by: Sven Wegener
    Cc: stable
    Acked-by: Simon Horman

    Sven Wegener
     
  • Commit 998e7a76804b7a273a0460c2cdd5a51fa9856717 ("ipvs: Use kthread_run()
    instead of doing a double-fork via kernel_thread()") introduced a possible
    deadlock in the sync code. We need to use the _bh versions for the lock, as the
    lock is also accessed from a bottom half.

    Signed-off-by: Sven Wegener
    Acked-by: Simon Horman

    Sven Wegener
     

09 Aug, 2008

2 commits

  • The socket lock is there to protect the normal UDP receive path.
    Encapsulation UDP sockets don't need that protection. In fact
    the locking is deadly for them as they may contain another UDP
    packet within, possibly with the same addresses.

    Also the nested bit was copied from TCP. TCP needs it because
    of accept(2) spawning sockets. This simply doesn't apply to UDP
    so I've removed it.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Based upon bug reports by Stephen Hemminger.

    We still had some cases using ->qdisc instead of ->qdisc_sleeping.

    Also, qdisc_lookup() should return ingress qdiscs.

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Aug, 2008

5 commits


07 Aug, 2008

9 commits

  • Currently a mesh node will not forward a multicast frame if it is not subscribed
    to the specific multicast address. This patch addresses the issue and fixes mesh
    multicast forwarding.

    Signed-off-by: Luis Carlos Cobo
    Acked-by: Johannes Berg
    Signed-off-by: John W. Linville

    Luis Carlos Cobo
     
  • Now we deal with mesh forwarding before the 802.11->802.3 conversion, thus
    eliminating a few unnecessary steps. The next hop lookup is called from
    ieee80211_master_start_xmit() instead of subif_start_xmit(). Until the next hop
    is found, RA in the frame will be all zeroes for frames originating from the
    device. For forwarded frames, RA will contain the TA of the received frame,
    which will be necessary to send a path error if a next hop is not found.

    Signed-off-by: Luis Carlos Cobo
    Acked-by: Johannes Berg
    Signed-off-by: John W. Linville

    Luis Carlos Cobo
     
  • So far pktgen has had a restriction to use only one device per kernel
    thread. With the new multiqueue architecture this is no longer adequate.

    The patch below is an effort to remove this restriction by allowing a
    tag on the device name in the pktgen configuration, a la eth0@0 etc.
    The tagged name is used for the usual device config just as before.
    A new flag, QUEUE_MAP_CPU, is also introduced to make queue_map mirror
    the sending thread's smp_processor_id().

    An example: We use 4 CPUs to send to one 10g interface (eth0)
    and we use the new tagging to send a mix of packet sizes, 64, 576 and
    1500 bytes. Also we use TX queues according to smp_processor_id()

    PGDEV=/proc/net/pktgen/kpktgend_0
    pgset "add_device eth0@0"

    PGDEV=/proc/net/pktgen/kpktgend_1
    pgset "add_device eth0@1"

    PGDEV=/proc/net/pktgen/kpktgend_2
    pgset "add_device eth0@2"

    PGDEV=/proc/net/pktgen/kpktgend_3
    pgset "add_device eth0@3"
    ....
    PGDEV=/proc/net/pktgen/eth0@0
    pgset "pkt_size 64"
    pgset "flag QUEUE_MAP_CPU"

    PGDEV=/proc/net/pktgen/eth0@1
    pgset "pkt_size 572"
    pgset "flag QUEUE_MAP_CPU"

    PGDEV=/proc/net/pktgen/eth0@2
    pgset "pkt_size 1496"

    PGDEV=/proc/net/pktgen/eth0@3
    pgset "pkt_size 1496"
    pgset "flag QUEUE_MAP_CPU"

    Signed-off-by: Robert Olsson
    Signed-off-by: David S. Miller

    Robert Olsson
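
The tagged names used above (eth0@0, eth0@1, ...) all refer to the same underlying interface, so the part after '@' has to be dropped before the real device is looked up. A minimal sketch of that step, assuming a hypothetical helper demo_strip_tag() that is not part of pktgen:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the interface name without its "@tag" suffix into out.
 * Returns the length of the base name, or -1 if it does not fit in out.
 */
static int demo_strip_tag(const char *name, char *out, size_t outlen)
{
    size_t n;

    assert(name != NULL && out != NULL);

    /* Length of the prefix before the first '@' (or the whole string). */
    n = strcspn(name, "@");
    if (n >= outlen)
        return -1;

    memcpy(out, name, n);
    out[n] = '\0';
    return (int)n;
}
```

Both "eth0@3" and plain "eth0" yield "eth0", so untagged configurations keep working.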
     
  • David S. Miller
     
  • Jeff Garzik
     
  • If a packet_type specifies an active slave to bonding and not just any
    interface, allow it to receive frames that came in on that interface.

    Signed-off-by: Joe Eykholt
    Signed-off-by: Jay Vosburgh
    Signed-off-by: Jeff Garzik

    Joe Eykholt
     
  • Allow a packet_type that specifies the exact device to receive
    even on inactive bonding slave devices. This is important for some
    L2 protocols such as LLDP and FCoE. This can eventually be used
    for the bonding special cases as well.

    Signed-off-by: Joe Eykholt
    Signed-off-by: Jay Vosburgh
    Signed-off-by: Jeff Garzik

    Joe Eykholt
     
  • Otherwise subsequent changes need multiple return values.

    Signed-off-by: Joe Eykholt
    Signed-off-by: Jay Vosburgh
    Signed-off-by: Jeff Garzik

    Joe Eykholt
     
  • If the following packet flow happens, the kernel will panic.

    MachineA                        MachineB
                    SYN
            ---------------------->
                    SYN+ACK
            <----------------------
                    ACK (bad seq)
            ---------------------->

    When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr)
    is finally called by tcp_v4_reqsk_send_ack(), but the first parameter (skb->sk) is
    NULL at that moment, so the kernel panics.
    This patch fixes this bug.

    The oops output is as follows:
    [ 302.812793] IP: [] tcp_v4_md5_do_lookup+0x12/0x42
    [ 302.817075] Oops: 0000 [#1] SMP
    [ 302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
    [ 302.849946]
    [ 302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5)
    [ 302.855184] EIP: 0060:[] EFLAGS: 00010296 CPU: 0
    [ 302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42
    [ 302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046
    [ 302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54
    [ 302.868333] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    [ 302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000)
    [ 302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400
    [ 302.883275] c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0
    [ 302.890971] ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634
    [ 302.900140] Call Trace:
    [ 302.902392] [] tcp_v4_reqsk_send_ack+0x17/0x35
    [ 302.907060] [] tcp_check_req+0x156/0x372
    [ 302.910082] [] printk+0x14/0x18
    [ 302.912868] [] tcp_v4_do_rcv+0x1d3/0x2bf
    [ 302.917423] [] tcp_v4_rcv+0x563/0x5b9
    [ 302.920453] [] ip_local_deliver_finish+0xe8/0x183
    [ 302.923865] [] ip_rcv_finish+0x286/0x2a3
    [ 302.928569] [] dev_alloc_skb+0x11/0x25
    [ 302.931563] [] netif_receive_skb+0x2d6/0x33a
    [ 302.934914] [] pcnet32_poll+0x333/0x680 [pcnet32]
    [ 302.938735] [] net_rx_action+0x5c/0xfe
    [ 302.941792] [] __do_softirq+0x5d/0xc1
    [ 302.944788] [] __do_softirq+0x0/0xc1
    [ 302.948999] [] do_softirq+0x55/0x88
    [ 302.951870] [] handle_fasteoi_irq+0x0/0xa4
    [ 302.954986] [] irq_exit+0x35/0x69
    [ 302.959081] [] do_IRQ+0x99/0xae
    [ 302.961896] [] common_interrupt+0x23/0x28
    [ 302.966279] [] default_idle+0x2a/0x3d
    [ 302.969212] [] cpu_idle+0xb2/0xd2
    [ 302.972169] =======================
    [ 302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6
    [ 303.011610] EIP: [] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54
    [ 303.018360] Kernel panic - not syncing: Fatal exception in interrupt

    Signed-off-by: Gui Jianfeng
    Signed-off-by: David S. Miller

    Gui Jianfeng