27 Jul, 2009

1 commit

  • The use of a static buffer in rose2asc() to return its result is not
    thread-safe and can result in corruption if multiple threads try to
    use one of the procfs files based on rose2asc(). (A sketch of the
    caller-supplied-buffer fix follows this entry.)

    Signed-off-by: Ralf Baechle
    Signed-off-by: David S. Miller

    Ralf Baechle
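
    A minimal sketch of the caller-supplied-buffer fix, assuming
    illustrative names rather than the kernel's actual rose2asc()
    signature; the point is only that the caller owns the storage:

        #include <stdio.h>

        /* Unsafe shape: every caller shares one static buffer, so two
         * threads formatting at the same time can corrupt each other's
         * result. */
        static const char *addr2asc_unsafe(unsigned int addr)
        {
            static char buf[16];

            snprintf(buf, sizeof(buf), "%u", addr);
            return buf;
        }

        /* Thread-safe shape: the caller owns the storage, so concurrent
         * calls never touch shared state. */
        static char *addr2asc(char *buf, size_t len, unsigned int addr)
        {
            snprintf(buf, len, "%u", addr);
            return buf;
        }

        int main(void)
        {
            char buf[16];

            printf("%s\n", addr2asc_unsafe(7));
            printf("%s\n", addr2asc(buf, sizeof(buf), 7));
            return 0;
        }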
     

20 Jul, 2009

1 commit


17 Jul, 2009

1 commit

  • Commit e912b1142be8f1e2c71c71001dc992c6e5eb2ec1
    (net: sk_prot_alloc() should not blindly overwrite memory)
    took care of not zeroing the whole new socket at allocation time.

    sock_copy() is another spot where we must be very careful.
    We should not set refcnt to a non-null value until we are sure
    the other fields are correctly set up; otherwise a lockless
    reader could catch this socket by mistake while it is not yet
    fully (re)initialized.

    This patch puts sk_node & sk_refcnt at the very beginning of
    struct sock to ease the job of sock_copy() & sk_prot_alloc().

    We add an appropriate smp_wmb() before the sk_refcnt
    initializations to match our RCU requirements (changes to the
    sock keys must be committed to memory before sk_refcnt is set).
    A standalone sketch of this publish ordering follows this entry.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
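
    A standalone illustration of this publish ordering, using C11
    atomics in place of the kernel's smp_wmb() and sk_refcnt; the
    struct and helper names are invented for the sketch:

        #include <stdatomic.h>
        #include <stdio.h>

        /* A lockless reader may find the object as soon as refcnt is
         * non-zero, so every other field must be visible before that
         * store (the "init fields, barrier, set sk_refcnt" rule). */
        struct obj {
            int key;
            int data;
            atomic_int refcnt;      /* plays the role of sk_refcnt */
        };

        static void publish(struct obj *o, int key, int data)
        {
            o->key = key;
            o->data = data;
            /* Release store: the C11 counterpart of the smp_wmb()
             * added before setting sk_refcnt; earlier stores cannot
             * be reordered past it. */
            atomic_store_explicit(&o->refcnt, 1, memory_order_release);
        }

        static int try_use(struct obj *o)
        {
            /* Acquire load pairs with the release above: a non-zero
             * refcnt guarantees the other fields are initialized. */
            if (atomic_load_explicit(&o->refcnt, memory_order_acquire) == 0)
                return -1;          /* not published yet, skip it */
            return o->key + o->data;
        }

        int main(void)
        {
            struct obj o = { 0, 0, 0 };

            publish(&o, 1, 2);
            printf("%d\n", try_use(&o));
            return 0;
        }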
     

10 Jul, 2009

2 commits

  • Add an smp_mb__after_lock define to be used as an smp_mb() call
    after taking a lock.

    Make it a no-op on x86, since {read|write|spin}_lock() on x86 are
    already full memory barriers. (A compilable sketch of the idea
    follows this entry.)

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
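
    A compilable sketch of the idea, with userspace stand-ins; the
    LOCK_IS_FULL_BARRIER switch below is made up and merely stands in
    for the arch-specific selection:

        #include <stdatomic.h>
        #include <stdio.h>

        /* Stand-in so the sketch builds outside the kernel. */
        #define smp_mb()  atomic_thread_fence(memory_order_seq_cst)

        /* Architectures whose lock acquisition is already a full
         * memory barrier (x86) can make this a no-op; the generic
         * fallback is a real smp_mb(). */
        #ifdef LOCK_IS_FULL_BARRIER
        #define smp_mb__after_lock()  do { } while (0)
        #else
        #define smp_mb__after_lock()  smp_mb()
        #endif

        int main(void)
        {
            /* ... a lock would be taken here ... */
            smp_mb__after_lock();   /* pays for a barrier only when needed */
            puts("ordered");
            return 0;
        }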
     
  • Add a memory barrier after the poll_wait function, paired with a
    barrier in the receive callbacks. Add the functions sock_poll_wait
    and sk_has_sleeper to wrap the memory barriers. (A standalone
    sketch of the pairing follows this entry.)

    Without the memory barriers, the following race can happen.
    The race fires when the code paths below meet and the tp->rcv_nxt
    and __add_wait_queue updates stay in the CPU caches.

    CPU1                           CPU2

    sys_select                     receive packet
      ...                          ...
      __add_wait_queue             update tp->rcv_nxt
      ...                          ...
      tp->rcv_nxt check            sock_def_readable
      ...                          {
      schedule                     ...
                                   if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                           wake_up_interruptible(sk->sk_sleep)
                                   ...
                                   }

    If there were no caches, the code would work correctly, since the
    wait_queue and rcv_nxt accesses happen in opposite order on the
    two CPUs.

    Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either
    already passed the tp->rcv_nxt check and sleeps, or it will see the
    new value of tp->rcv_nxt and return with a new data mask.
    In both cases the process (CPU1) is being added to the wait queue,
    so the waitqueue_active (CPU2) call cannot miss and will wake up
    CPU1.

    The bad case is when the __add_wait_queue changes done by CPU1 stay
    in its cache, and so does the tp->rcv_nxt update on the CPU2 side.
    CPU1 will then end up calling schedule and sleeping forever if no
    more data arrives on the socket.

    Calls to poll_wait in the following modules were omitted:
    net/bluetooth/af_bluetooth.c
    net/irda/af_irda.c
    net/irda/irnet/irnet_ppp.c
    net/mac80211/rc80211_pid_debugfs.c
    net/phonet/socket.c
    net/rds/af_rds.c
    net/rfkill/core.c
    net/sunrpc/cache.c
    net/sunrpc/rpc_pipe.c
    net/tipc/socket.c

    Signed-off-by: Jiri Olsa
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jiri Olsa
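
    A standalone two-thread sketch of the barrier pairing, written
    with C11 atomics and pthreads (build with -pthread); the two flags
    stand in for the socket wait queue and tp->rcv_nxt:

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static atomic_int on_wait_queue;    /* "queued on sk->sk_sleep" */
        static atomic_int data_ready;       /* "tp->rcv_nxt advanced"   */

        static void *poller(void *arg)      /* CPU1 in the table above */
        {
            (void)arg;
            atomic_store_explicit(&on_wait_queue, 1, memory_order_relaxed);
            /* the barrier in sock_poll_wait: publish the queue update
             * before re-checking whether data already arrived */
            atomic_thread_fence(memory_order_seq_cst);
            if (!atomic_load_explicit(&data_ready, memory_order_relaxed))
                puts("poller: no data yet, would schedule()");
            else
                puts("poller: data already there, no sleep needed");
            return NULL;
        }

        static void *receiver(void *arg)    /* CPU2 in the table above */
        {
            (void)arg;
            atomic_store_explicit(&data_ready, 1, memory_order_relaxed);
            /* the barrier in sk_has_sleeper: publish the data update
             * before checking whether anyone is on the wait queue */
            atomic_thread_fence(memory_order_seq_cst);
            if (atomic_load_explicit(&on_wait_queue, memory_order_relaxed))
                puts("receiver: sleeper present, would wake_up()");
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;

            pthread_create(&a, NULL, poller, NULL);
            pthread_create(&b, NULL, receiver, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
        }

    With both fences in place it is impossible for the poller to miss
    the new data and the receiver to miss the sleeper at the same
    time, which is exactly the lost-wakeup case described above.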
     

30 Jun, 2009

1 commit


29 Jun, 2009

1 commit

  • When NAT helpers change the TCP packet size, the highest seen sequence
    number needs to be corrected. This is currently only done upwards;
    when the packet size is reduced, the sequence number is left
    unchanged. This causes
    TCP conntrack to falsely detect unacknowledged data and decrease the
    timeout.

    Fix by updating the highest seen sequence number in both directions after
    packet mangling.

    Tested-by: Krzysztof Piotr Oledzki
    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

25 Jun, 2009

2 commits

  • Signed-off-by: Rémi Denis-Courmont
    Signed-off-by: David S. Miller

    Rémi Denis-Courmont
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6:
    bnx2: Fix the behavior of ethtool when ONBOOT=no
    qla3xxx: Don't sleep while holding lock.
    qla3xxx: Give the PHY time to come out of reset.
    ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off
    net: Move rx skb_orphan call to where needed
    ipv6: Use correct data types for ICMPv6 type and code
    net: let KS8842 driver depend on HAS_IOMEM
    can: let SJA1000 driver depend on HAS_IOMEM
    netxen: fix firmware init handshake
    netxen: fix build with without CONFIG_PM
    netfilter: xt_rateest: fix comparison with self
    netfilter: xt_quota: fix incomplete initialization
    netfilter: nf_log: fix direct userspace memory access in proc handler
    netfilter: fix some sparse endianess warnings
    netfilter: nf_conntrack: fix conntrack lookup race
    netfilter: nf_conntrack: fix confirmation race condition
    netfilter: nf_conntrack: death_by_timeout() fix

    Linus Torvalds
     

24 Jun, 2009

1 commit

  • In order to get the tun driver to account packets, we need to be
    able to receive packets with destructors set. To be on the safe
    side, I added an skb_orphan call for all protocols by default since
    some of them (IP in particular) cannot properly handle receiving
    packets with destructors set.

    Now it seems that at least one protocol (CAN) expects to be able
    to pass skb->sk through the rx path without getting clobbered.

    So this patch attempts to fix this properly by moving the skb_orphan
    call to where it's actually needed. In particular, I've added it
    to skb_set_owner_[rw] which is what most users of skb->destructor
    call.

    This is actually an improvement for tun too since it means that
    we only give back the amount charged to the socket when the skb
    is passed to another socket that will also be charged accordingly.
    (An accounting-transfer sketch follows this entry.)

    Signed-off-by: Herbert Xu
    Tested-by: Oliver Hartkopp
    Signed-off-by: David S. Miller

    Herbert Xu
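
    A userspace analogue of that ownership rule; buf_orphan() and
    buf_set_owner() mimic the roles of skb_orphan() and
    skb_set_owner_[rw], with made-up types:

        #include <stdio.h>

        struct account { const char *name; long charged; };
        struct buffer  { long size; struct account *owner; };

        /* skb_orphan() analogue: give the charge back to the current
         * owner and drop the ownership link. */
        static void buf_orphan(struct buffer *b)
        {
            if (b->owner) {
                b->owner->charged -= b->size;
                b->owner = NULL;
            }
        }

        /* skb_set_owner_[rw]() analogue: release the old charge only
         * at the moment the buffer is charged to its new owner. */
        static void buf_set_owner(struct buffer *b, struct account *a)
        {
            buf_orphan(b);
            b->owner = a;
            a->charged += b->size;
        }

        int main(void)
        {
            struct account tun = { "tun", 0 }, sock = { "sock", 0 };
            struct buffer b = { 1500, NULL };

            buf_set_owner(&b, &tun);    /* charged while in flight      */
            buf_set_owner(&b, &sock);   /* charge moves on re-ownership */
            printf("%s=%ld %s=%ld\n", tun.name, tun.charged,
                   sock.name, sock.charged);
            return 0;
        }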
     

23 Jun, 2009

2 commits

  • Change all the code that deals directly with ICMPv6 type and code
    values to use u8 instead of a signed int as that's the actual data
    type.

    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (43 commits)
    via-velocity: Fix velocity driver unmapping incorrect size.
    mlx4_en: Remove redundant refill code on RX
    mlx4_en: Removed redundant check on lso header size
    mlx4_en: Cancel port_up check in transmit function
    mlx4_en: using stop/start_all_queues
    mlx4_en: Removed redundant skb->len check
    mlx4_en: Counting all the dropped packets on the TX side
    usbnet cdc_subset: fix issues talking to PXA gadgets
    Net: qla3xxx, remove sleeping in atomic
    ipv4: fix NULL pointer + success return in route lookup path
    isdn: clean up documentation index
    cfg80211: validate station settings
    cfg80211: allow setting station parameters in mesh
    cfg80211: allow adding/deleting stations on mesh
    ath5k: fix beacon_int handling
    MAINTAINERS: Fix Atheros pattern paths
    ath9k: restore PS mode, before we put the chip into FULL SLEEP state.
    ath9k: wait for beacon frame along with CAB
    acer-wmi: fix rfkill conversion
    ath5k: avoid PCI FATAL interrupts by restoring RETRY_TIMEOUT disabling
    ...

    Linus Torvalds
     

19 Jun, 2009

2 commits

  • If the iucv message limit for a communication path is exceeded,
    sendmsg() returns -EAGAIN instead of -EPIPE.
    The calling application can then handle this error situation,
    e.g. by trying again after waiting some time.

    For blocking sockets, sendmsg() waits up to the socket timeout
    before returning -EAGAIN. For the new wait condition, a macro
    has been introduced and iucv_sock_wait_state() has been
    refactored to use this macro.

    Signed-off-by: Hendrik Brueckner
    Signed-off-by: Ursula Braun
    Signed-off-by: David S. Miller

    Hendrik Brueckner
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
    netxen: fix tx ring accounting
    netxen: fix detection of cut-thru firmware mode
    forcedeth: fix dma api mismatches
    atm: sk_wmem_alloc initial value is one
    net: correct off-by-one write allocations reports
    via-velocity : fix no link detection on boot
    Net / e100: Fix suspend of devices that cannot be power managed
    TI DaVinci EMAC : Fix rmmod error
    net: group address list and its count
    ipv4: Fix fib_trie rebalancing, part 2
    pkt_sched: Update drops stats in act_police
    sky2: version 1.23
    sky2: add GRO support
    sky2: skb recycling
    sky2: reduce default transmit ring
    sky2: receive counter update
    sky2: fix shutdown synchronization
    sky2: PCI irq issues
    sky2: more receive shutdown
    sky2: turn off pause during shutdown
    ...

    Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck

    Linus Torvalds
     

17 Jun, 2009

2 commits

  • commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
    (net: No more expensive sock_hold()/sock_put() on each tx)
    changed the initial sk_wmem_alloc value.

    Some protocols check the sk_wmem_alloc value to determine if a
    timer must delay socket deallocation. We must take care of the
    sk_wmem_alloc value being one instead of zero when no write
    allocations are pending.

    Reported by Ingo Molnar, with a full diagnostic from David Miller.

    This patch introduces three helpers to get the read/write
    allocations, and a follow-up patch will use these helpers to
    report correct write allocations to userspace. (A sketch of the
    reporting convention follows this entry.)

    Reported-by: Ingo Molnar
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
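
    A small sketch of that reporting convention, assuming the counter
    carries a permanent +1 offset while the socket is alive; the names
    are invented, not the kernel helpers:

        #include <stdatomic.h>
        #include <stdio.h>

        /* The write-allocation counter starts at one (the "socket is
         * alive" offset), so a helper reporting it to userspace must
         * subtract that offset to show the real pending byte count. */
        struct fake_sock {
            atomic_long wmem_alloc;
        };

        static long wmem_alloc_get(struct fake_sock *sk)
        {
            return atomic_load(&sk->wmem_alloc) - 1;  /* drop the offset */
        }

        int main(void)
        {
            struct fake_sock sk;

            atomic_init(&sk.wmem_alloc, 1);           /* init-time offset */
            printf("pending: %ld\n", wmem_alloc_get(&sk));    /* 0    */
            atomic_fetch_add(&sk.wmem_alloc, 1500);   /* packet queued */
            printf("pending: %ld\n", wmem_alloc_get(&sk));    /* 1500 */
            return 0;
        }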
     
  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits)
    signal: fix __send_signal() false positive kmemcheck warning
    fs: fix do_mount_root() false positive kmemcheck warning
    fs: introduce __getname_gfp()
    trace: annotate bitfields in struct ring_buffer_event
    net: annotate struct sock bitfield
    c2port: annotate bitfield for kmemcheck
    net: annotate inet_timewait_sock bitfields
    ieee1394/csr1212: fix false positive kmemcheck report
    ieee1394: annotate bitfield
    net: annotate bitfields in struct inet_sock
    net: use kmemcheck bitfields API for skbuff
    kmemcheck: introduce bitfield API
    kmemcheck: add opcode self-testing at boot
    x86: unify pte_hidden
    x86: make _PAGE_HIDDEN conditional
    kmemcheck: make kconfig accessible for other architectures
    kmemcheck: enable in the x86 Kconfig
    kmemcheck: add hooks for the page allocator
    kmemcheck: add hooks for page- and sg-dma-mappings
    kmemcheck: don't track page tables
    ...

    Linus Torvalds
     

16 Jun, 2009

1 commit


15 Jun, 2009

4 commits

  • 2009/2/24 Ingo Molnar:
    > ok, this is the last warning i have from today's overnight -tip
    > testruns - a 32-bit system warning in sock_init_data():
    >
    > [ 2.610389] NET: Registered protocol family 16
    > [ 2.616138] initcall netlink_proto_init+0x0/0x170 returned 0 after 7812 usecs
    > [ 2.620010] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f642c184)
    > [ 2.624002] 010000000200000000000000604990c000000000000000000000000000000000
    > [ 2.634076] i i i i i i u u i i i i i i i i i i i i i i i i i i i i i i i i
    > [ 2.641038] ^
    > [ 2.643376]
    > [ 2.644004] Pid: 1, comm: swapper Not tainted (2.6.29-rc6-tip-01751-g4d1c22c-dirty #885)
    > [ 2.648003] EIP: 0060:[] EFLAGS: 00010282 CPU: 0
    > [ 2.652008] EIP is at sock_init_data+0xa1/0x190
    > [ 2.656003] EAX: 0001a800 EBX: f6836c00 ECX: 00463000 EDX: c0e46fe0
    > [ 2.660003] ESI: f642c180 EDI: c0b83088 EBP: f6863ed8 ESP: c0c412ec
    > [ 2.664003] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    > [ 2.668003] CR0: 8005003b CR2: f682c400 CR3: 00b91000 CR4: 000006f0
    > [ 2.672003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    > [ 2.676003] DR6: ffff4ff0 DR7: 00000400
    > [ 2.680002] [] __netlink_create+0x35/0xa0
    > [ 2.684002] [] netlink_kernel_create+0x4c/0x140
    > [ 2.688002] [] rtnetlink_net_init+0x1e/0x40
    > [ 2.696002] [] register_pernet_operations+0x11/0x30
    > [ 2.700002] [] register_pernet_subsys+0x1c/0x30
    > [ 2.704002] [] rtnetlink_init+0x4c/0x100
    > [ 2.708002] [] netlink_proto_init+0x159/0x170
    > [ 2.712002] [] do_one_initcall+0x24/0x150
    > [ 2.716002] [] do_initcalls+0x27/0x40
    > [ 2.723201] [] do_basic_setup+0x1c/0x20
    > [ 2.728002] [] kernel_init+0x5a/0xa0
    > [ 2.732002] [] kernel_thread_helper+0x7/0x10
    > [ 2.736002] [] 0xffffffff

    We fix this false positive by annotating the bitfield in struct
    sock. (A compilable sketch of the annotation pattern follows this
    entry.)

    Reported-by: Ingo Molnar
    Signed-off-by: Vegard Nossum

    Vegard Nossum
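
    A compilable sketch of the annotation pattern; the kmemcheck_*
    macros below are no-op userspace stubs standing in for the
    kernel's kmemcheck helpers, and the zero-length array markers
    (a GCC/Clang extension) only mirror how the real markers bracket
    the bitfield's storage unit:

        #include <stdio.h>

        #define kmemcheck_bitfield_begin(name)         char name##_begin[0]
        #define kmemcheck_bitfield_end(name)           char name##_end[0]
        #define kmemcheck_annotate_bitfield(ptr, name) do { (void)(ptr); } while (0)

        struct flags_holder {
            kmemcheck_bitfield_begin(flags);
            unsigned int a : 1;    /* writing 'a' read-modify-writes   */
            unsigned int b : 1;    /* the unit holding 'b' as well,    */
            kmemcheck_bitfield_end(flags);  /* hence the false positive */
        };

        int main(void)
        {
            struct flags_holder h;

            kmemcheck_annotate_bitfield(&h, flags); /* silences the warning */
            h.a = 1;
            h.b = 0;
            printf("%u %u\n", h.a, h.b);
            return 0;
        }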
     
  • The use of bitfields here would lead to false positive warnings with
    kmemcheck. Silence them.

    (Additionally, one erroneous comment related to the bitfield was also
    fixed.)

    Signed-off-by: Vegard Nossum

    Vegard Nossum
     
  • Signed-off-by: Vegard Nossum

    Vegard Nossum
     
  • Let's use TICKS instead of US, i.e. PSCHED_TICKS2NS and
    PSCHED_NS2TICKS (matching the existing PSCHED_TICKS_PER_SEC), to
    avoid misleading names.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

13 Jun, 2009

3 commits

  • This patch improves ctnetlink event reliability if one broadcast
    listener has set the NETLINK_BROADCAST_ERROR socket option.

    The logic is the following: if an event delivery fails, we keep
    the undelivered events in the missed event cache. Once the next
    packet arrives, we add the new events (if any) to the missed
    events in the cache and we try a new delivery, and so on. Thus,
    if ctnetlink fails to deliver an event, we try to deliver them
    once we see a new packet. Therefore, we may lose state
    transitions but the userspace process gets in sync at some point.

    In the worst case, if no events were delivered to userspace, we
    make sure that destroy events are successfully delivered.
    Basically, if ctnetlink fails to deliver the destroy event, we
    remove the conntrack entry from the hashes and insert it into the
    dying list, which contains inactive entries. Then, the conntrack
    timer is re-added with an extra grace timeout of random32() % 15
    seconds to trigger the event again (this grace timeout is tunable
    via /proc). The use of a limited random timeout value distributes
    the "destroy" resends and avoids accumulating lots of "destroy"
    events at the same time. Events may be delivered out of order,
    but we can identify them by means of the tuple plus the conntrack
    ID. (A condensed sketch of this resend logic follows this entry.)

    The maximum number of conntrack entries (active or inactive) is
    still handled by nf_conntrack_max. Thus, we may start dropping
    packets at some point if we accumulate a lot of inactive conntrack
    entries that did not successfully report the destroy event to
    userspace.

    During my stress tests consisting of setting a very small buffer
    of 2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket
    flag, and generating lots of very small connections, I noticed
    very few destroy entries on the fly waiting to be resent.

    A simple way to test this patch consists of creating a lot of
    entries, setting a very small Netlink buffer in conntrackd (+ a
    patch which is not in the git tree to set the BROADCAST_ERROR
    flag) and invoking `conntrack -F'.

    For expectations, no changes are introduced in this patch.
    Currently, event delivery is only done for new expectations (no
    events from expectation expiration, removal and confirmation).
    In that case, they need a per-expectation event cache to implement
    the same idea that is exposed in this patch.

    This patch can be useful to provide reliable flow-accounting. We
    still have to add a new conntrack extension to store the creation
    and destroy time.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
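
    A condensed, standalone sketch of that resend logic; the function
    names and the fake delivery step are invented, only the "park on a
    dying list and re-arm with a random grace period" shape follows
    the description above:

        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define GRACE_MAX_SECONDS 15   /* the random32() % 15 grace */

        /* Stand-in for the ctnetlink broadcast: pretend it sometimes
         * fails, e.g. because a listener's receive buffer is full. */
        static bool deliver_destroy_event(int ct_id)
        {
            return rand() % 2 == 0;
        }

        /* Timer handler sketch: on delivery failure the entry is not
         * freed but parked on the dying list, and its timer is
         * re-armed with a small random grace period so that many
         * resends do not all fire at once. */
        static void death_by_timeout(int ct_id)
        {
            if (deliver_destroy_event(ct_id)) {
                printf("ct %d: destroy delivered, entry freed\n", ct_id);
                return;
            }
            printf("ct %d: delivery failed, on dying list, retry in %d s\n",
                   ct_id, rand() % GRACE_MAX_SECONDS);
        }

        int main(void)
        {
            srand(1);
            for (int id = 0; id < 4; id++)
                death_by_timeout(id);
            return 0;
        }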
     
  • This patch moves the helper destruction to a function that lives
    in nf_conntrack_helper.c. This new function is used in the patch
    to add ctnetlink reliable event delivery.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     
  • This patch reworks the per-cpu event caching to use the conntrack
    extension infrastructure.

    The main drawback is that we consume more memory per conntrack
    if event delivery is enabled. This patch is required by the
    reliable event delivery patch that follows.

    BTW, this patch allows you to enable/disable event delivery via
    /proc/sys/net/netfilter/nf_conntrack_events at runtime, although
    you can still disable event caching as a compile-time option.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Patrick McHardy

    Pablo Neira Ayuso
     

11 Jun, 2009

4 commits

  • Patrick McHardy
     
  • David S. Miller
     
  • One of the problems with sock memory accounting is that it uses
    a pair of sock_hold()/sock_put() calls for each transmitted packet.

    This slows down bidirectional flows because the receive path
    also needs to take a refcount on the socket and might use a
    different CPU than the transmit path or transmit completion path,
    so these two atomic operations also trigger cache line bounces.

    We can see this in tx or tx/rx workloads (media gateways for
    example), where sock_wfree() can be among the top five functions
    in profiles.

    We use this sock_hold()/sock_put() pair so that sock freeing
    is delayed until all tx packets are completed.

    Since we also update sk_wmem_alloc, we can instead offset
    sk_wmem_alloc by one unit at init time, until sk_free() is called.
    Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
    to drop the initial offset and atomically check whether any
    packets are in flight. (A standalone sketch of this trick follows
    this entry.)

    skb_set_owner_w() doesn't call sock_hold() anymore.

    sock_wfree() doesn't call sock_put() anymore, but checks whether
    sk_wmem_alloc reached 0 to perform the final freeing.

    The drawback is that an skb->truesize error could lead to
    unfreeable sockets, or even worse, to prematurely calling
    __sk_free() on a live socket.

    Nice speedups on SMP. tbench, for example, goes from 2691 MB/s to
    2711 MB/s on my 8-cpu dev machine, even though tbench was not
    really hitting the sk_refcnt contention point. 5% speedup on a UDP
    transmit workload (depending on the number of flows), lowering TX
    completion cpu usage.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
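
    A standalone sketch of that offset-by-one trick with C11 atomics;
    the struct and helper names are invented, only the counting rule
    follows the description above:

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* sk_wmem_alloc is biased by one at socket creation so it
         * doubles as an "in-flight tx bytes or socket still open"
         * reference, removing the per-packet hold/put pair. */
        struct fake_sock {
            atomic_long wmem_alloc;
        };

        static void sock_init(struct fake_sock *sk)
        {
            atomic_init(&sk->wmem_alloc, 1);      /* the initial offset */
        }

        static void charge_tx(struct fake_sock *sk, long truesize)
        {
            atomic_fetch_add(&sk->wmem_alloc, truesize);
        }

        /* tx completion: the decrement that reaches zero frees the
         * socket (sock_wfree() role). */
        static bool uncharge_tx(struct fake_sock *sk, long truesize)
        {
            return atomic_fetch_sub(&sk->wmem_alloc, truesize) == truesize;
        }

        /* sk_free() role: drop the initial offset; if nothing is in
         * flight the counter hits zero here and we can free at once. */
        static bool sock_free(struct fake_sock *sk)
        {
            return atomic_fetch_sub(&sk->wmem_alloc, 1) == 1;
        }

        int main(void)
        {
            struct fake_sock sk;

            sock_init(&sk);
            charge_tx(&sk, 1500);                 /* packet leaves the stack */
            printf("freed at close: %d\n", sock_free(&sk));            /* 0 */
            printf("freed at completion: %d\n", uncharge_tx(&sk, 1500)); /* 1 */
            return 0;
        }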
     
  • In order to handle powersave frames properly we needed to pass
    these out to the device queues again, and introduce the
    skb->requeue bit. This, however, also has unnecessary overhead
    because already-tried frames need to be 'cleaned up', and this
    clean-up code is also buggy when software encryption is used.

    Instead of sending the frames via the master netdev queue
    again, simply put them into the pending queue. This also
    fixes a problem where frames for that particular station
    could be reordered when some were still on the software
    queues and older ones are re-injected into the software
    queue after them.

    Signed-off-by: Johannes Berg
    Signed-off-by: John W. Linville

    Johannes Berg
     

10 Jun, 2009

1 commit


09 Jun, 2009

5 commits

  • Add a netlink interface for configuration of IEEE 802.15.4 devices.
    This interface also specifies the event notifications sent by
    devices towards higher layers.

    Signed-off-by: Dmitry Eremin-Solenikov
    Signed-off-by: Sergey Lapin
    Signed-off-by: David S. Miller

    Sergey Lapin
     
  • Add support for communication over IEEE 802.15.4 networks. This
    implementation is neither certified nor complete, but aims at that
    goal. This commit contains only the socket interface for
    communication over IEEE 802.15.4 networks. One can either send RAW
    datagrams or use SOCK_DGRAM to encapsulate data inside normal
    IEEE 802.15.4 packets. (A minimal usage sketch follows this entry.)

    The configuration interface, drivers and a software MAC 802.15.4
    implementation will follow.

    The initial implementation was done by Maxim Gorbachyov, Maxim
    Osipov and Pavel Smolensky as a research project at Siemens AG.
    Later the stack was heavily reworked to better suit the Linux
    networking model, and it is now maintained as an open project
    partially sponsored by Siemens.

    Signed-off-by: Dmitry Eremin-Solenikov
    Signed-off-by: Sergey Lapin
    Signed-off-by: David S. Miller

    Sergey Lapin
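
    A minimal userspace usage sketch; the AF_IEEE802154 fallback value
    below is an assumption for libc headers that lack the constant,
    and binding/addressing is left out because the sockaddr layout is
    stack-specific:

        #include <sys/socket.h>
        #include <stdio.h>

        #ifndef AF_IEEE802154
        #define AF_IEEE802154 36   /* assumed value of the new family */
        #endif

        int main(void)
        {
            /* Datagram socket carrying payloads inside IEEE 802.15.4
             * frames; needs a kernel with this stack compiled in. */
            int fd = socket(AF_IEEE802154, SOCK_DGRAM, 0);

            if (fd < 0)
                perror("socket(AF_IEEE802154, SOCK_DGRAM)");
            else
                puts("802.15.4 datagram socket created");
            return 0;
        }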
     
  • Change PSCHED_SHIFT from 10 to 6 to increase the schedulers' time
    resolution. This increases the number of (internal) ticks per
    nanosecond 16-fold, and is needed to improve the accuracy of
    schedulers based on rate tables, like HTB, TBF or CBQ, with rates
    above 100Mbit. It is assumed this change is safe for 32-bit
    accounting of time diffs up to 2 minutes, which should be enough
    for common use (extremely low rate values may overflow, so they
    become inaccurate instead). To make full use of this change an
    updated iproute2 will be needed, but using older iproute2 should
    be safe too. (The worked numbers follow this entry.)

    This change breaks the ticks/microseconds similarity, so some
    minor code fixes might be needed. It is also planned to rename
    things accordingly, e.g. to PSCHED_TICKS2NS() etc., in the near
    future.

    Reported-by: Antonio Almeida
    Tested-by: Antonio Almeida
    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
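
    The arithmetic behind the "2 minutes" figure, as a small
    standalone program (the PSCHED_SHIFT values are the ones from the
    text; everything else is plain arithmetic):

        #include <stdio.h>

        /* One tick is (1 << PSCHED_SHIFT) ns, and a signed 32-bit
         * tick difference overflows after 2^31 ticks. */
        int main(void)
        {
            const int shifts[] = { 10, 6 };   /* old and new PSCHED_SHIFT */

            for (int i = 0; i < 2; i++) {
                long long ns_per_tick = 1LL << shifts[i];
                double horizon_s = (double)(1LL << 31) * ns_per_tick / 1e9;

                printf("PSCHED_SHIFT=%d: tick=%lld ns, "
                       "32-bit diff overflows after ~%.0f s\n",
                       shifts[i], ns_per_tick, horizon_s);
            }
            return 0;
        }

    This gives 1024 ns ticks and roughly 2199 s (about 36 minutes) of
    headroom for the old shift, and 64 ns ticks with roughly 137 s
    (a bit over 2 minutes) for the new one, matching the safety claim
    above.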
     
  • Use PSCHED_SHIFT constant instead of '10' in PSCHED_US2NS() and
    PSCHED_NS2US() macros to enable changing this value later.

    Additionally use PSCHED_SHIFT in sch_hfsc SM_SHIFT and ISM_SHIFT
    definitions. This part of the patch is based on feedback from
    Patrick McHardy.

    Reported-by: Antonio Almeida
    Tested-by: Antonio Almeida
    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     
  • Furthermore, it twiddles with the details of SKB list handling
    directly, which we're trying to eliminate.

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Jun, 2009

6 commits