21 Jun, 2008

1 commit

  • Alexey Dobriyan writes:
    > Subject: ICMP sockets destruction vs ICMP packets oops

    > After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
    > icmp_reply() wants ICMP socket.
    >
    > Steps to reproduce:
    >
    > launch shell in new netns
    > move real NIC to netns
    > setup routing
    > ping -i 0
    > exit from shell
    >
    > BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    > IP: [] icmp_sk+0x17/0x30
    > PGD 17f3cd067 PUD 17f3ce067 PMD 0
    > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
    > CPU 0
    > Modules linked in: usblp usbcore
    > Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
    > RIP: 0010:[] [] icmp_sk+0x17/0x30
    > RSP: 0018:ffffffff8057fc30 EFLAGS: 00010286
    > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
    > RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
    > RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
    > R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
    > R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
    > FS: 0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
    > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    > CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
    > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    > Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
    > Stack: 0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
    > ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
    > 000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
    > Call Trace:
    > [] icmp_reply+0x44/0x1e0
    > [] ? ip_route_input+0x23a/0x1360
    > [] icmp_echo+0x65/0x70
    > [] icmp_rcv+0x180/0x1b0
    > [] ip_local_deliver+0xf4/0x1f0
    > [] ip_rcv+0x33b/0x650
    > [] netif_receive_skb+0x27a/0x340
    > [] process_backlog+0x9d/0x100
    > [] net_rx_action+0x18d/0x250
    > [] __do_softirq+0x75/0x100
    > [] call_softirq+0x1c/0x30
    > [] do_softirq+0x65/0xa0
    > [] irq_exit+0x97/0xa0
    > [] do_IRQ+0xa8/0x130
    > [] ? mwait_idle+0x0/0x60
    > [] ret_from_intr+0x0/0xf
    > [] ? mwait_idle+0x4c/0x60
    > [] ? mwait_idle+0x43/0x60
    > [] ? cpu_idle+0x57/0xa0
    > [] ? rest_init+0x70/0x80
    > Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
    > 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 8b 04 c3 48 83 c4 08
    > 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
    > RIP [] icmp_sk+0x17/0x30
    > RSP
    > CR2: 0000000000000000
    > ---[ end trace ea161157b76b33e8 ]---
    > Kernel panic - not syncing: Aiee, killing interrupt handler!

    Receiving packets while we are cleaning up a network namespace is a
    racy proposition. It is possible when the packet arrives that we have
    removed some but not all of the state we need to fully process it. We
    have the choice of either playing whack-a-mole with the cleanup routines
    or simply dropping packets when we don't have a network namespace to
    handle them.

    Since the check looks inexpensive in netif_receive_skb let's just
    drop the incoming packets.
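
    A minimal user-space sketch of the idea, not the actual patch: the
    receive path checks whether the device's namespace is still alive and
    drops the packet otherwise. The struct layouts, the "alive" flag and
    the names net_is_alive() and receive_skb() are assumptions made for
    illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct net, net_device, sk_buff. */
struct net { bool alive; };                 /* cleared once teardown starts */
struct net_device { struct net *nd_net; };
struct sk_buff { struct net_device *dev; };

enum { RX_SUCCESS = 0, RX_DROP = 1 };       /* models NET_RX_SUCCESS / NET_RX_DROP */

/* True while the namespace can still process packets. */
static bool net_is_alive(const struct net *net)
{
    return net != NULL && net->alive;
}

/* Early check: drop rather than run protocol handlers against
 * per-namespace state that may already be partly torn down. */
static int receive_skb(struct sk_buff *skb)
{
    assert(skb != NULL && skb->dev != NULL);

    if (!net_is_alive(skb->dev->nd_net))
        return RX_DROP;   /* the kernel would also free the skb here */

    /* ... normal delivery to the protocol handlers would follow ... */
    return RX_SUCCESS;
}
```

    The check is a single flag test per packet, which is why it is cheap
    compared with auditing every per-protocol cleanup routine.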

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

17 Jun, 2008

1 commit

  • Selected device feature bits can be propagated to VLAN devices, so we
    can make use of TX checksum offload and TSO on VLAN-tagged packets.
    However, if the physical device does not do VLAN tag insertion or
    generic checksum offload then the test for TX checksum offload in
    dev_queue_xmit() will see a protocol of htons(ETH_P_8021Q) and yield
    false.

    This splits the checksum offload test into two functions:

    - can_checksum_protocol() tests a given protocol against a feature bitmask

    - dev_can_checksum() first tests the skb protocol against the device
    features; if that fails and the protocol is htons(ETH_P_8021Q) then
    it tests the encapsulated protocol against the effective device
    features for VLANs

    Signed-off-by: Ben Hutchings
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Ben Hutchings
     

21 May, 2008

1 commit


15 May, 2008

1 commit


08 May, 2008

2 commits

  • dev_open() and dev_close() must be called holding the RTNL, since they
    call device functions and netdevice notifiers that are promised the RTNL.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • When a net namespace is destroyed, some devices (those not explicitly
    killed on ns stop) are moved back to init_net.

    The problem is that this net_ns change has one point of failure:
    __dev_alloc_name() may be called if a name collision occurs (and
    this is easy to trigger). This allocator performs a likely-to-fail
    GFP_ATOMIC allocation to find a suitable number. The other possible
    conditions that may cause an error (the device being ns-local or not
    registered) are always false in this case.

    So, when this call fails, the device is unregistered. But this is
    *not* the right thing to do, since after this the device may be
    released (and kfree-ed) improperly. E. g. bridges require more
    actions (sysfs update, timer disarming, etc.), some other devices
    want to remove their private areas from lists, etc.

    I. e. arbitrary use-after-free cases may occur.

    The proposed fix is the following: since the only reason for
    dev_change_net_namespace to fail is the name generation, we may
    give it a unique fall-back name w/o %d-s in it - the dev<ifindex>
    one, since ifindexes are still unique.

    So make this change, raise the failure-case printk loglevel to
    EMERG and replace the unregister_netdevice call with BUG().

    [ Use snprintf() -DaveM ]
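
    A small sketch of such a fall-back name, assuming the format is "dev"
    followed by the ifindex. The helper name fallback_dev_name() is
    hypothetical; IFNAMSIZ is defined locally here with its usual value
    of 16.

```c
#include <assert.h>
#include <stdio.h>

#define IFNAMSIZ 16   /* interface name buffer size, including the NUL */

/* Build a name that cannot collide, because ifindex is unique.
 * No allocation is needed, so this step cannot fail for lack of memory.
 * Returns 0 on success, -1 if the name would not fit. */
static int fallback_dev_name(char buf[IFNAMSIZ], int ifindex)
{
    int n;

    assert(ifindex > 0);
    n = snprintf(buf, IFNAMSIZ, "dev%d", ifindex);
    return (n > 0 && n < IFNAMSIZ) ? 0 : -1;
}
```

    Using snprintf() with the buffer size bounds the write even for large
    ifindex values.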

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

03 May, 2008

2 commits

  • When a netdev is moved across namespaces with the
    'dev_change_net_namespace' function, the 'device_rename' function is
    used to fix up the kobject and refresh the sysfs tree. device_rename
    calls kobject_rename, which checks whether an object with the same
    name already exists. One always does, because we are renaming the
    object to the name it already has.

    The use of 'device_rename' seems wrong to me because we usually don't
    rename the device but just move it across namespaces. As we just want
    to do a mini "netdev_[un]register", IMO the functions
    'netdev_[un]register_kobject' should be used instead, as in a usual
    network device [un]registration.

    This patch replaces device_rename with netdev_unregister_kobject,
    followed by netdev_register_kobject.

    netdev_register_kobject calls device_initialize, which would raise
    a warning indicating the device was already initialized. To fix
    that, I split the device initialization into a separate function
    and call it together with 'netdev_register_kobject' in
    register_netdevice. So we can safely call 'netdev_register_kobject' in
    'dev_change_net_namespace'.

    This fix will allow proper use of the per-namespace sysfs which is
    coming from the -mm tree.

    Signed-off-by: Daniel Lezcano
    Acked-by: Benjamin Thery
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • Remove the fixed size channels[NR_CPUS] array in net/core/dev.c and
    dynamically allocate array based on nr_cpu_ids.

    Signed-off-by: Mike Travis
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Mike Travis
     

29 Apr, 2008

1 commit

  • Some drivers have duplicated unlikely() macros. IS_ERR() already has
    unlikely() in itself.

    This patch cleans up such pointless code.
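
    To illustrate why the extra unlikely() is redundant, here is a
    user-space sketch modelled on the kernel's error-pointer helpers. The
    definitions below are simplified re-creations (assumptions), and
    unlikely() relies on the GCC/Clang __builtin_expect builtin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)   /* GCC/Clang builtin */

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* The branch hint already lives inside IS_ERR(). */
static inline bool IS_ERR(const void *ptr)
{
    return unlikely((uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO);
}

/* Caller after the cleanup: a plain IS_ERR() test is enough.
 * Before: if (unlikely(IS_ERR(p))) ...  (hint applied twice) */
static int check_ptr(const void *p)
{
    if (IS_ERR(p))
        return (int)PTR_ERR(p);
    return 0;
}
```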

    Signed-off-by: Hirofumi Nakagawa
    Acked-by: David S. Miller
    Acked-by: Jeff Garzik
    Cc: Paul Clements
    Cc: Richard Purdie
    Cc: Alessandro Zummo
    Cc: David Brownell
    Cc: James Bottomley
    Cc: Michael Halcrow
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Carsten Otte
    Cc: Patrick McHardy
    Cc: Paul Mundt
    Cc: Jaroslav Kysela
    Cc: Takashi Iwai
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirofumi Nakagawa
     

19 Apr, 2008

1 commit

  • This patch effectively reverts commit d0498d9ae1a5cebac363e38907266d5cd2eedf89
    aka "[NET]: Do not allocate unneeded memory for dev->priv alignment."
    It was found to be buggy because it removed the final unconditional
    += NETDEV_ALIGN_CONST.

    For example, with sizeof(struct net_device) being 2048 bytes, "alloc_size"
    was also 2048 bytes, but the allocator with debugging options turned on
    started giving out memory that is not 32-byte aligned, resulting in
    redzone overwrites.

    The patch makes a small optimization in the ->priv-less case: bumping the
    size to the next 32-byte boundary was always done to ensure ->priv would
    also be aligned. But with no ->priv, there is no need to do that.
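
    The size computation described above can be sketched as follows. This
    is a model, not the kernel function: netdev_alloc_size() and
    netdev_align() are hypothetical helpers, and NETDEV_ALIGN_CONST is
    assumed to be NETDEV_ALIGN - 1 = 31.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NETDEV_ALIGN        32
#define NETDEV_ALIGN_CONST  (NETDEV_ALIGN - 1)   /* assumed to be 31 */

/* Bytes to request so that both the device structure and its private
 * area can be placed on 32-byte boundaries. */
static size_t netdev_alloc_size(size_t sizeof_dev, size_t sizeof_priv)
{
    size_t alloc_size = sizeof_dev;

    assert(sizeof_dev > 0);

    if (sizeof_priv) {
        /* Round up so the private area starts 32-byte aligned. */
        alloc_size = (alloc_size + NETDEV_ALIGN_CONST) &
                     ~(size_t)NETDEV_ALIGN_CONST;
        alloc_size += sizeof_priv;
    }

    /* Unconditional slack: the allocator may return an unaligned block,
     * and the structure is then shifted up to the next boundary. */
    alloc_size += NETDEV_ALIGN_CONST;
    return alloc_size;
}

/* Shift an address up to the next 32-byte boundary. */
static uintptr_t netdev_align(uintptr_t p)
{
    return (p + NETDEV_ALIGN_CONST) & ~(uintptr_t)NETDEV_ALIGN_CONST;
}
```

    Without the final slack, shifting the structure by up to 31 bytes
    would run it past the end of the allocation, which is the overwrite
    the debugging allocator caught.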

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

16 Apr, 2008

2 commits

  • The alloc_netdev_mq() tries to produce 32-byte alignment for both
    the net_device itself and its private data. The second alignment is
    achieved by adding the NETDEV_ALIGN_CONST to the whole size of
    the memory to be allocated.

    However, for those devices that do not need the private area, this
    addition just makes the net_device weigh 1024 + 32 = 1056 bytes,
    which no longer fits the size-1024 kmem cache, i.e. it consumes
    twice as much memory.

    Since loopback device is such (sizeof_priv == 0 for it), and each
    net namespace creates one, this can save a noticeable amount of
    memory for kernel with net namespaces turned on.

    After this set the lo device is actually allocated from a size-1024
    kmem cache on i386 box even with NETPOLL and WIRELESS_EXT turned on.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • dev_set_net is called for
    - just allocated devices
    - devices moving from one namespace to another
    release_net has proper check inside to distinguish these cases.

    Signed-off-by: Denis V. Lunev
    Signed-off-by: David S. Miller

    Denis V. Lunev
     

28 Mar, 2008

1 commit


26 Mar, 2008

3 commits


25 Mar, 2008

1 commit


21 Mar, 2008

1 commit

  • Update: My mailer ate one of Jarek's feedback mails... Fixed the
    parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the
    whitespace issue due to a patch import botch. Changed the types from
    u32 to unsigned int to be more consistent with other variables in the
    area. Also brought the patch up to the latest net-2.6.26 tree.

    Update: Made gso_max_size container 32 bits, not 16. Moved the
    location of gso_max_size within netdev to be less hotpath. Made more
    consistent names between the sock and netdev layers, and added a
    define for the max GSO size.

    Update: Respun for net-2.6.26 tree.

    Update: changed max_gso_frame_size and sk_gso_max_size from signed to
    unsigned - thanks Stephen!

    This patch adds the ability for device drivers to control the size of
    the TSO frames being sent to them, per TCP connection. By setting the
    netdevice's gso_max_size value, the socket layer will set the GSO
    frame size based on that value. This will propagate into the TCP
    layer, and send TSO's of that size to the hardware.

    This can be desirable to help tune the bursty nature of TSO on a
    per-adapter basis, where one may have 1 GbE and 10 GbE devices
    coexisting in a system, one running multiqueue and the other not, etc.

    This can also be desirable for devices that cannot support full 64 KB
    TSO's, but still want to benefit from some level of segmentation
    offloading.
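
    A rough model of how the limit flows from the driver to the TCP
    layer. The struct and function names below are illustrative stand-ins
    (the patch itself names netif_set_gso_max_size(), gso_max_size and
    sk_gso_max_size), and the 64 KB default is taken from the description
    above.

```c
#include <assert.h>

#define GSO_MAX_SIZE 65536u   /* default upper bound (64 KB) */

struct model_netdev { unsigned int gso_max_size; };
struct model_sock   { unsigned int sk_gso_max_size; };

/* Driver side: advertise the largest TSO frame the hardware accepts. */
static void set_gso_max_size(struct model_netdev *dev, unsigned int size)
{
    assert(size > 0 && size <= GSO_MAX_SIZE);
    dev->gso_max_size = size;
}

/* Socket side: copy the limit of the device the connection uses. */
static void sock_setup_caps(struct model_sock *sk,
                            const struct model_netdev *dev)
{
    sk->sk_gso_max_size = dev->gso_max_size;
}

/* TCP side: never build a TSO frame larger than the socket's limit. */
static unsigned int tso_frame_len(const struct model_sock *sk,
                                  unsigned int queued_bytes)
{
    return queued_bytes < sk->sk_gso_max_size ? queued_bytes
                                              : sk->sk_gso_max_size;
}
```

    Because the limit is stored per device and copied per socket, two
    adapters in the same system can use different TSO burst sizes.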

    Signed-off-by: Peter P Waskiewicz Jr
    Signed-off-by: David S. Miller

    Peter P Waskiewicz Jr
     

24 Feb, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (37 commits)
    [NETFILTER]: fix ebtable targets return
    [IP_TUNNEL]: Don't limit the number of tunnels with generic name explicitly.
    [NET]: Restore sanity wrt. print_mac().
    [NEIGH]: Fix race between neighbor lookup and table's hash_rnd update.
    [RTNL]: Validate hardware and broadcast address attribute for RTM_NEWLINK
    tg3: ethtool phys_id default
    [BNX2]: Update version to 1.7.4.
    [BNX2]: Disable parallel detect on an HP blade.
    [BNX2]: More 5706S link down workaround.
    ssb: Fix support for PCI devices behind a SSB->PCI bridge
    zd1211rw: fix sparse warnings
    rtl818x: fix sparse warnings
    ssb: Fix pcicore cardbus mode
    ssb: Make the GPIO API reentrancy safe
    ssb: Fix the GPIO API
    ssb: Fix watchdog access for devices without a chipcommon
    ssb: Fix serial console on new bcm47xx devices
    ath5k: Fix build warnings on some 64-bit platforms.
    WDEV, ath5k, don't return int from bool function
    WDEV: ath5k, fix lock imbalance
    ...

    Linus Torvalds
     

20 Feb, 2008

1 commit

  • Commit a0a400d79e3dd7843e7e81baa3ef2957bdc292d0 ("[NET]: dev_mcast:
    add multicast list synchronization helpers") from you introduced a new
    field "da_synced" to struct dev_addr_list that is not properly
    initialized to 0. So when any of the current users (8021q, macvlan,
    mac80211) calls dev_mc_sync/unsync they mess the address list for both
    devices.

    The attached patch fixed it for me and avoids future problems.
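
    The class of bug can be sketched in user space: an entry allocated
    without zeroing must have every field set explicitly, including the
    new flag. The struct below is a simplified stand-in for
    struct dev_addr_list, and addr_entry_new() is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ADDR_LEN 32

/* Simplified model of an address list entry. */
struct addr_entry {
    struct addr_entry *next;
    unsigned char addr[MAX_ADDR_LEN];
    unsigned char addrlen;
    int users;    /* reference count */
    int synced;   /* models da_synced */
};

/* Allocate and fully initialize a new entry. */
static struct addr_entry *addr_entry_new(const unsigned char *addr,
                                         unsigned char len)
{
    struct addr_entry *e;

    assert(addr != NULL && len <= MAX_ADDR_LEN);

    e = malloc(sizeof(*e));   /* contents are indeterminate, like kmalloc() */
    if (e == NULL)
        return NULL;

    memcpy(e->addr, addr, len);
    e->addrlen = len;
    e->users = 1;
    e->synced = 0;            /* the field that was left uninitialized */
    e->next = NULL;
    return e;
}
```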

    Signed-off-by: Jorge Boncompte [DTI2]
    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Jorge Boncompte [DTI2]
     

15 Feb, 2008

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits)
    [NET]: Make sure sockets implement splice_read
    netconsole: avoid null pointer dereference at show_local_mac()
    [IPV6]: Fix reversed local_df test in ip6_fragment
    [XFRM]: Avoid bogus BUG() when throwing new policy away.
    [AF_KEY]: Fix bug in spdadd
    [NETFILTER] nf_conntrack_proto_tcp.c: Mistyped state corrected.
    net: xfrm statistics depend on INET
    [NETFILTER]: make secmark_tg_destroy() static
    [INET]: Unexport inet_listen_wlock
    [INET]: Unexport __inet_hash_connect
    [NET]: Improve cache line coherency of ingress qdisc
    [NET]: Fix race in dev_close(). (Bug 9750)
    [IPSEC]: Fix bogus usage of u64 on input sequence number
    [RTNETLINK]: Send a single notification on device state changes.
    [NETLABLE]: Hide netlbl_unlabel_audit_addr6 under ifdef CONFIG_IPV6.
    [NETLABEL]: Don't produce unused variables when IPv6 is off.
    [NETLABEL]: Compilation for CONFIG_AUDIT=n case.
    [GENETLINK]: Relax dances with genl_lock.
    [NETLABEL]: Fix lookup logic of netlbl_domhsh_search_def.
    [IPV6]: remove unused method declaration (net/ndisc.h).
    ...

    Linus Torvalds
     

14 Feb, 2008

2 commits


13 Feb, 2008

1 commit

  • There is a race in Linux kernel file net/core/dev.c, function dev_close.
    The function calls function dev_deactivate, which calls function
    dev_watchdog_down that deletes the watchdog timer. However, after that, a
    driver can call netif_carrier_on, which calls function
    __netdev_watchdog_up that can add the watchdog timer again. Function
    unregister_netdevice calls function dev_shutdown that traps the bug
    !timer_pending(&dev->watchdog_timer). Moving dev_deactivate after
    netif_running() has been cleared prevents function netif_carrier_on
    from calling __netdev_watchdog_up and adding the watchdog timer again.
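
    The ordering problem can be modelled with three flags. This is a
    single-threaded sketch, not the kernel code: the struct, the function
    names and the "racing driver" callback (which stands in for a driver
    calling netif_carrier_on at the worst moment) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev {
    bool running;        /* models netif_running() */
    bool carrier;        /* link state */
    bool timer_pending;  /* models the watchdog timer */
};

/* Driver reports link up; the watchdog is re-armed only while running. */
static void carrier_on(struct model_dev *dev)
{
    dev->carrier = true;
    if (dev->running)
        dev->timer_pending = true;
}

/* Models dev_deactivate(): deletes the watchdog timer. */
static void deactivate(struct model_dev *dev)
{
    dev->timer_pending = false;
}

/* Old order: the timer is deleted while the device still counts as running. */
static void close_old_order(struct model_dev *dev,
                            void (*racing_driver)(struct model_dev *))
{
    assert(dev != NULL);
    deactivate(dev);
    if (racing_driver)
        racing_driver(dev);   /* can re-arm the timer */
    dev->running = false;
}

/* New order: clear the running state first, then delete the timer. */
static void close_new_order(struct model_dev *dev,
                            void (*racing_driver)(struct model_dev *))
{
    assert(dev != NULL);
    dev->running = false;
    deactivate(dev);
    if (racing_driver)
        racing_driver(dev);   /* no longer re-arms the timer */
}
```

    In the old order the timer is pending again after close, which is the
    condition dev_shutdown traps on; in the new order it stays deleted.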

    Signed-off-by: Matti Linnanvuori
    Signed-off-by: David S. Miller

    Matti Linnanvuori
     

02 Feb, 2008

3 commits


01 Feb, 2008

1 commit

  • Reuse the existing logic for multicast list synchronization for the
    unicast address list. The core of dev_mc_sync/unsync are split out as
    __dev_addr_sync/unsync and moved from dev_mcast.c to dev.c. These are
    then used to implement dev_unicast_sync/unsync as well.

    I'm working on cleaning up Intel's FCoE stack, which generates new MAC
    addresses from the fibre channel device id assigned by the fabric as
    per the current draft specification in T11. When using such a
    protocol in a VLAN environment it would be nice to not always be
    forced into promiscuous mode, assuming the underlying Ethernet driver
    supports multiple unicast addresses as well.

    Signed-off-by: Chris Leech
    Signed-off-by: Patrick McHardy

    Chris Leech
     

29 Jan, 2008

7 commits


09 Jan, 2008

1 commit


21 Dec, 2007

1 commit


11 Dec, 2007

1 commit


15 Nov, 2007

1 commit

  • Commit fcc5a03ac42564e9e255c1134dda47442289e466:

    [NET]: Allow netdev REGISTER/CHANGENAME events to fail

    makes the register_netdevice_notifier() handle the error from the
    NETDEV_REGISTER event, sent to the registering block.

    The bad news is that in this case the notifier block is
    not removed from the list, but the error is returned to the
    caller. If the caller is a module init function and
    handles this error, this can abort the module loading. The
    notifier block will then be removed from the kernel along with
    the module, but will be left in the list. Oops :(

    I think that the notifier block should be removed from the
    chain in case of error, regardless whether this error is
    handled by the caller or not. In the worst case (the error
    is _not_ handled) module will not receive the events any
    longer.
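
    A simplified sketch of the proposed behaviour: if replaying the
    register event to the new notifier fails, unlink the notifier before
    returning the error. The list handling, the names and the two example
    callbacks are assumptions for illustration; any rollback of events
    already delivered is left out.

```c
#include <assert.h>
#include <stddef.h>

struct notifier {
    int (*call)(struct notifier *nb, int ifindex);  /* 0 or negative error */
    struct notifier *next;
};

static struct notifier *chain;   /* head of the model notifier chain */

static void chain_add(struct notifier *nb)
{
    nb->next = chain;
    chain = nb;
}

static void chain_del(struct notifier *nb)
{
    struct notifier **pp;

    for (pp = &chain; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == nb) {
            *pp = nb->next;
            nb->next = NULL;
            return;
        }
    }
}

/* Register nb and replay a "register" event for each existing device. */
static int register_notifier(struct notifier *nb,
                             const int *ifindexes, size_t n)
{
    size_t i;

    assert(nb != NULL && nb->call != NULL);
    chain_add(nb);

    for (i = 0; i < n; i++) {
        int err = nb->call(nb, ifindexes[i]);
        if (err) {
            chain_del(nb);   /* never leave a failed block in the list */
            return err;
        }
    }
    return 0;
}

/* Example callbacks: one accepts every device, one rejects every device. */
static int accept_all(struct notifier *nb, int ifindex)
{
    (void)nb; (void)ifindex;
    return 0;
}

static int reject_all(struct notifier *nb, int ifindex)
{
    (void)nb; (void)ifindex;
    return -1;
}
```

    After a failed registration the chain no longer points at the block,
    so the caller may free it (or the module may unload) safely.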

    Signed-off-by: Pavel Emelyanov
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

13 Nov, 2007

1 commit