03 Oct, 2012

1 commit

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others
    (see the example sketch after this list).

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) fewer segments for the driver to process, b) fewer calls to the
    page allocator, c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.
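
    For item 6, server-side TCP Fast Open is enabled per listening
    socket. A minimal user-space sketch, assuming a toolchain that
    defines TCP_FASTOPEN; the queue length and address handling are
    illustrative only:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Create a TCP listener that accepts Fast Open requests.
     * Error handling trimmed for brevity; server-side TFO must also
     * be permitted by the net.ipv4.tcp_fastopen sysctl. */
    static int make_tfo_listener(unsigned short port)
    {
            struct sockaddr_in addr;
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int qlen = 16;  /* max queued TFO requests (example value) */

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(port);

            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
            listen(fd, 128);
            return fd;
    }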

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     

25 Aug, 2012

1 commit

  • The operstate of a device is initially IF_OPER_UNKNOWN and is updated
    asynchronously by linkwatch after each change of carrier state
    reported by the driver. The default carrier state of a net device is
    on, and this will never be changed on drivers that do not support
    carrier detection, thus the operstate remains IF_OPER_UNKNOWN.

    For devices that do support carrier detection, the driver must set the
    carrier state to off initially, then poll the hardware state when the
    device is opened. However, we must not activate linkwatch for an
    unregistered device, and commit b473001 ('net: Do not fire linkwatch
    events until the device is registered.') ensured that we don't. But
    this means that the operstate for many devices that support carrier
    detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN.

    The same issue exists with the dormant state.

    The proper initialisation sequence, avoiding a race with opening of
    the device, is:

    rtnl_lock();
    rc = register_netdevice(dev);
    if (rc)
        goto out_unlock;
    netif_carrier_off(dev); /* or netif_dormant_on(dev) */
    rtnl_unlock();

    but it seems silly that this should have to be repeated in so many
    drivers. Further, the operstate seen immediately after opening the
    device may still be IF_OPER_UNKNOWN due to the asynchronous nature of
    linkwatch.

    Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted
    to fix this by setting the operstate synchronously, but it was
    reverted as it could lead to deadlock.

    This initialises the operstate synchronously at registration time
    only.
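
    For reference, the state discussed above is visible from user space
    via sysfs; a small illustrative C program (the device name is an
    example):

    #include <stdio.h>

    int main(void)
    {
            /* /sys/class/net/<dev>/operstate reports the RFC 2863
             * operstate: "unknown", "down", "dormant", "up", ... */
            FILE *f = fopen("/sys/class/net/eth0/operstate", "r");
            char buf[32];

            if (!f) {
                    perror("operstate");
                    return 1;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("operstate: %s", buf); /* e.g. "down" rather than "unknown" */
            fclose(f);
            return 0;
    }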

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     

22 Aug, 2012

1 commit

  • Now that mod_delayed_work() is safe to call from IRQ handlers,
    __cancel_delayed_work() followed by queue_delayed_work() can be
    replaced with mod_delayed_work().

    Most conversions are straightforward except for the following.

    * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
    elaborate dance around its delayed_work. Collapse it so that
    linkwatch_work is queued for immediate execution if LW_URGENT is set
    and the existing timer is kept otherwise.
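
    In outline, the conversion replaces the cancel-and-requeue pair with
    a single call. An illustrative fragment using the workqueue API
    names of that era, not the exact net/core/link_watch.c diff:

    /* before: cancel the pending timer (without waiting) and queue again */
    static void reschedule_old(struct delayed_work *dwork, unsigned long delay)
    {
            __cancel_delayed_work(dwork);
            queue_delayed_work(system_wq, dwork, delay);
    }

    /* after: one call that moves the timer whether or not the work is
     * already pending, and is safe to call from IRQ handlers */
    static void reschedule_new(struct delayed_work *dwork, unsigned long delay)
    {
            mod_delayed_work(system_wq, dwork, delay);
    }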

    Signed-off-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Tomi Valkeinen

    Tejun Heo
     

16 Sep, 2011

1 commit

  • There is a time lag in IFF_RUNNING flag consistency between vlan and
    real devices when the real device has a problem such as a broken link
    or cable.

    This degrades availability, for example by delaying failover in HA
    systems using vlan, since detection of the problem at the real device
    is delayed.

    We can avoid the linkwatch delay (~1 sec) for devices stacked on
    another one, since the delay has already been applied to the real
    device.
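
    A minimal sketch of the kind of check this implies, assuming the
    urgency test simply treats stacked devices (where dev->ifindex
    differs from dev->iflink) as urgent; this is not the exact patch:

    /* Sketch only: carrier events on stacked devices (e.g. vlans) are
     * treated as urgent, since the underlying real device has already
     * absorbed the linkwatch holdoff. */
    static bool stacked_dev_event_is_urgent(const struct net_device *dev)
    {
            return dev->ifindex != dev->iflink;
    }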

    Based on a previous patch from Mitsuo Hayasaka

    Reported-by: Mitsuo Hayasaka
    Signed-off-by: Eric Dumazet
    Cc: Herbert Xu
    Cc: Patrick McHardy
    Cc: "Michał Mirosław"
    Cc: Tom Herbert
    Cc: Stephen Hemminger
    Cc: Jesse Gross
    Tested-by: Mitsuo Hayasaka
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Jul, 2011

1 commit

  • As reported by Ben Greer and Francois Romieu, the code path in
    the netif_carrier code leads it to try to cancel a delayed work
    item only to re-enable it immediately:
    netif_carrier_on
    -> linkwatch_fire_event
    -> linkwatch_schedule_work
    -> cancel_delayed_work
    -> del_timer_sync

    If __cancel_delayed_work is used instead, then there is no problem
    of waiting for a running linkwatch_event.

    There is a race between linkwatch_event running and re-scheduling,
    but it is harmless to schedule an extra scan of the linkwatch queue.
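
    A minimal sketch of the resulting scheduling step, assuming
    linkwatch_work is the file's existing delayed_work and using the
    pre-3.7 workqueue names; not the exact link_watch.c code:

    static void linkwatch_requeue_sketch(unsigned long delay)
    {
            /* non-waiting cancel: plain del_timer() underneath, so a
             * running linkwatch_event is never waited for; at worst a
             * concurrent run causes one extra, harmless queue scan */
            __cancel_delayed_work(&linkwatch_work);
            schedule_delayed_work(&linkwatch_work, delay);
    }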

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

31 Mar, 2011

1 commit


13 Jul, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h. (A short example of the end state
    follows after this list.)

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It is put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, reverse Christmas tree - or at the end
    if there doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.
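
    A hypothetical example of the end state for a file that uses both
    the allocator and gfp flags:

    /* The file states its own dependencies instead of relying on
     * percpu.h pulling in slab.h. */
    #include <linux/gfp.h>
    #include <linux/slab.h>

    static void *example_alloc(size_t len)
    {
            return kmalloc(len, GFP_KERNEL); /* needs slab.h and gfp.h */
    }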

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

18 Nov, 2009

1 commit

  • Herbert Xu wrote:
    > On Tue, Nov 17, 2009 at 04:26:04AM -0800, David Miller wrote:
    >> Really, the link watch stuff is just due for a redesign. I don't
    >> think a simple hack is going to cut it this time, sorry Eric :-)
    >
    > I have no objections against any redesigns, but since the only
    > caller of linkwatch_forget_dev runs in process context with the
    > RTNL, it could also legally emit those events.

    Thanks guys, here is an updated version then, before linkwatch surgery?

    In this version, I force the event to be sent synchronously.

    [PATCH net-next-2.6] linkwatch: linkwatch_forget_dev() to speedup device dismantle

    time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105

    real 0m0.266s
    user 0m0.000s
    sys 0m0.001s

    real 0m0.770s
    user 0m0.000s
    sys 0m0.000s

    real 0m1.022s
    user 0m0.000s
    sys 0m0.000s

    One problem with the current scheme in the vlan dismantle phase is
    that a reference to the device is held by the following chain:

    vlan_dev_stop() ->
    netif_carrier_off(dev) ->
    linkwatch_fire_event(dev) ->
    dev_hold() ...

    And __linkwatch_run_queue() runs up to one second later...

    A generic fix to this problem is to add a linkwatch_forget_dev() method
    to unlink the device from the list of watched devices.

    dev->link_watch_next becomes dev->link_watch_list (and uses a bit more
    memory), so that the device can be unlinked in O(1).
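
    A minimal sketch of what the O(1) unlink looks like with a list_head;
    illustrative only, not the exact linkwatch_forget_dev() body (the
    real function also deals with the pending event):

    static void forget_dev_sketch(struct net_device *dev)
    {
            if (!list_empty(&dev->link_watch_list)) {
                    list_del_init(&dev->link_watch_list);
                    dev_put(dev); /* drop the reference taken when queued */
            }
    }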

    After the patch:
    time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105

    real 0m0.024s
    user 0m0.000s
    sys 0m0.000s

    real 0m0.032s
    user 0m0.000s
    sys 0m0.001s

    real 0m0.033s
    user 0m0.000s
    sys 0m0.000s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jul, 2008

2 commits


11 May, 2007

4 commits

  • Urgent events may be delayed if we already have a non-urgent event
    queued for that device. This patch changes this by making sure that
    an urgent event is always looked at immediately.

    I've replaced the LW_RUNNING flag by LW_URGENT since whether work
    is scheduled is already tracked by the work queue system.

    The only complication is that we have to provide some exclusion for
    the setting of linkwatch_nextevent, which is also accessed in the
    actual work function.
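
    Roughly, the scheduling decision becomes the following (a sketch,
    assuming linkwatch_nextevent and linkwatch_work are the file's
    existing state; not the exact link_watch.c code):

    static void schedule_sketch(bool urgent)
    {
            unsigned long delay = linkwatch_nextevent - jiffies;

            if (urgent)
                    delay = 0; /* urgent events are looked at immediately */

            schedule_delayed_work(&linkwatch_work, delay);
    }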

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • When the jiffies wrap around or when the system boots up for the first
    time, down events can be delayed indefinitely since we no longer
    update linkwatch_nextevent when only urgent events are processed.

    This patch fixes this by setting linkwatch_nextevent when a
    wrap-around occurs.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • Currently all link carrier events are delayed by up to a second
    before they're processed to prevent link storms. This causes
    unnecessary packet loss during that interval.

    In fact, we can achieve the same effect in preventing storms by
    only delaying down events and unnecessary up events. The latter
    is defined as up events when we're already up.
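
    A sketch of that policy; illustrative only, not the exact linkwatch
    predicate:

    /* Only a genuine transition to "up" is urgent; down events and
     * redundant up events keep the holdoff. */
    static bool urgent_sketch(const struct net_device *dev)
    {
            return netif_carrier_ok(dev) && dev->operstate != IF_OPER_UP;
    }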

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
  • These days the link watch mechanism is an integral part of the
    network subsystem as it manages the carrier status. So it now
    makes sense to allocate some memory for it in net_device rather
    than allocating it on demand.

    In fact, this is necessary because we can't tolerate a memory
    allocation failure since that means we'd have to potentially
    throw a link up event away.

    It also simplifies the code greatly.

    In doing so I discovered a subtle race condition in the use
    of singleevent. This race condition still exists (and is
    somewhat magnified) without singleevent but it's now plugged
    thanks to an smp_mb__before_clear_bit.

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

26 Apr, 2007

1 commit

  • Spring cleaning time...

    There seems to be a lot of places in the network code that have
    extra bogus semicolons after conditionals. Most common is a bogus
    semicolon after: switch() { }
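
    The pattern being removed, in miniature (illustrative):

    static int handle(int err)
    {
            switch (err) {
            case 0:
                    break;
            default:
                    return err;
            };  /* <-- the bogus trailing semicolon */

            return 0;
    }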

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

11 Feb, 2007

1 commit


22 Nov, 2006

2 commits

  • Pass the work_struct pointer to the work function rather than context data.
    The work function can use container_of() to work out the data.

    For the cases where the container of the work_struct may go away the moment the
    pending bit is cleared, it is made possible to defer the release of the
    structure by deferring the clearing of the pending bit.

    To make this work, an extra flag is introduced into the management side of the
    work_struct. This governs auto-release of the structure upon execution.

    Ordinarily, the work queue executor would release the work_struct for further
    scheduling or deallocation by clearing the pending bit prior to jumping to the
    work function. This means that, unless the driver makes some guarantee itself
    that the work_struct won't go away, the work function may not access anything
    else in the work_struct or its container lest they be deallocated. This
    is a problem if the auxiliary data is taken away (as done by the last
    patch).

    However, if the pending bit is *not* cleared before jumping to the work
    function, then the work function *may* access the work_struct and its container
    with no problems. But then the work function must itself release the
    work_struct by calling work_release().

    In most cases, automatic release is fine, so this is the default. Special
    initiators exist for the non-auto-release case (ending in _NAR).
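
    An illustrative use of the new convention, with hypothetical names:

    struct my_adapter {
            struct work_struct reset_work;
            int resets;
    };

    /* The work function receives the work_struct pointer and recovers
     * its container with container_of(). */
    static void my_adapter_reset(struct work_struct *work)
    {
            struct my_adapter *ap =
                    container_of(work, struct my_adapter, reset_work);

            ap->resets++;
            /* ... perform the actual reset ... */
    }

    /* set up with: INIT_WORK(&ap->reset_work, my_adapter_reset); */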

    Signed-Off-By: David Howells

    David Howells
     
  • Separate delayable work items from non-delayable work items by splitting
    them into a separate structure (delayed_work), which incorporates a
    work_struct and the timer_list removed from work_struct.

    The work_struct struct is huge, and this limits its usefulness. On a 64-bit
    architecture it's nearly 100 bytes in size. This reduces that by half for
    the non-delayable type of event.
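
    The split in outline, with the field lists simplified:

    struct work_struct {
            atomic_long_t data;
            struct list_head entry;
            void (*func)(struct work_struct *work);
    };

    /* only delayed work carries the timer */
    struct delayed_work {
            struct work_struct work;
            struct timer_list timer;
    };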

    Signed-Off-By: David Howells

    David Howells
     

01 Jul, 2006

1 commit


23 Jun, 2006

1 commit


10 May, 2006

1 commit

  • The test used in the linkwatch does not handle wrap-arounds correctly.
    Since the intention of the code is to eliminate bursts of messages we
    can afford to delay things up to a second. Using that fact we can
    easily handle wrap-arounds by making sure that we don't delay things
    by more than one second.
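
    A sketch of the bound described above; illustrative, not the exact
    code:

    static unsigned long bounded_delay(unsigned long nextevent)
    {
            unsigned long delay = nextevent - jiffies;

            /* never defer by more than one second; a wrapped (huge,
             * unsigned) difference falls into the same branch */
            if (delay > HZ)
                    delay = 0;

            return delay;
    }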

    This is based on diagnosis and a patch by Stefan Rompf.

    Signed-off-by: Herbert Xu
    Acked-by: Stefan Rompf
    Signed-off-by: David S. Miller

    Herbert Xu
     

21 Mar, 2006

2 commits

  • This patch turns the RTNL from a semaphore to a new 2.6.16 mutex and
    gets rid of some of the leftover legacy.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     
  • This patch adds a dormant flag to network devices, an RFC 2863 operstate
    derived from these flags, and the possibility of userspace interaction. It
    allows drivers
    to signal that a device is unusable for user traffic without disabling
    queueing (and therefore the possibility for protocol establishment traffic to
    flow) and a userspace supplicant (WPA, 802.1X) to mark a device unusable
    without changes to the driver.
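
    Illustrative driver-side use, with hypothetical callback names:

    /* Keep the device dormant while the supplicant authenticates,
     * without stopping the queue, then clear the flag when the link
     * becomes usable for normal traffic. */
    static void my_drv_auth_start(struct net_device *dev)
    {
            netif_dormant_on(dev);  /* operstate reports IF_OPER_DORMANT */
    }

    static void my_drv_auth_done(struct net_device *dev)
    {
            netif_dormant_off(dev); /* operstate can now move to IF_OPER_UP */
    }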

    It is the result of our long discussion. However I must admit that it
    represents what Jamal and I agreed on with compromises towards Krzysztof, but
    Thomas and Krzysztof still disagree with some parts. Anyway I think it should
    be applied.

    Signed-off-by: Stefan Rompf
    Signed-off-by: David S. Miller

    Stefan Rompf
     

04 May, 2005

1 commit

  • Some network drivers call netif_stop_queue() when detecting loss of
    carrier. This leads to packets being queued up at the qdisc level for
    an unbound period of time. In order to prevent this effect, the core
    networking stack will now cease to queue packets for any device that
    is operationally down (i.e. the queue is flushed and disabled).

    Signed-off-by: Tommy S. Christensen
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Tommy S. Christensen
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds