17 Jun, 2009

1 commit

  • * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits)
    signal: fix __send_signal() false positive kmemcheck warning
    fs: fix do_mount_root() false positive kmemcheck warning
    fs: introduce __getname_gfp()
    trace: annotate bitfields in struct ring_buffer_event
    net: annotate struct sock bitfield
    c2port: annotate bitfield for kmemcheck
    net: annotate inet_timewait_sock bitfields
    ieee1394/csr1212: fix false positive kmemcheck report
    ieee1394: annotate bitfield
    net: annotate bitfields in struct inet_sock
    net: use kmemcheck bitfields API for skbuff
    kmemcheck: introduce bitfield API
    kmemcheck: add opcode self-testing at boot
    x86: unify pte_hidden
    x86: make _PAGE_HIDDEN conditional
    kmemcheck: make kconfig accessible for other architectures
    kmemcheck: enable in the x86 Kconfig
    kmemcheck: add hooks for the page allocator
    kmemcheck: add hooks for page- and sg-dma-mappings
    kmemcheck: don't track page tables
    ...

    Linus Torvalds
     

15 Jun, 2009

2 commits

  • The use of bitfields here would lead to false positive warnings with
    kmemcheck. Silence them.

    (Additionally, one erroneous comment related to the bitfield was also
    fixed.)

    Signed-off-by: Vegard Nossum

    Vegard Nossum
     
  • During trie_rebalance(), resize(), inflate() and halve() RCU-free
    tnodes before updating their parents. This depends on RCU delaying
    the real destruction, but if RCU readers start after call_rcu() and
    before the parent update, they could access freed memory.

    This is currently prevented with preempt_disable() on the update
    side, but that is not safe, except maybe for classic RCU, and it
    also conflicts with the GFP_KERNEL memory allocations used from
    these functions.

    This patch explicitly delays freeing of tnodes by adding them to the
    list, which is flushed after the update is finished.

    Reported-by: Yan Zheng
    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

14 Jun, 2009

3 commits

  • IPv4:
    - make PIM register vifs netns local
    - set the netns when a PIM register vif is created
    - make PIM available in all network namespaces (if CONFIG_IP_PIMSM_V2)
    by adding the protocol handler when multicast routing is initialized

    IPv6:
    - make PIM register vifs netns local
    - make PIM available in all network namespaces (if CONFIG_IPV6_PIMSM_V2)
    by adding the protocol handler when multicast routing is initialized

    Signed-off-by: Tom Goff
    Signed-off-by: David S. Miller

    Tom Goff
     
  • Remove the statements about ARP cache size, as this config option
    does not affect it; the cache size is controlled by the neigh_table
    gc thresholds.

    Also remove the experimental and obsolete markings, as the API
    originally intended for ARP caching is useful for implementing
    ARP-like protocols (e.g. NHRP) in user space and has been there
    long enough.

    Signed-off-by: Timo Teras
    Signed-off-by: David S. Miller

    Timo Teräs
     
  • For the sake of power-saving lovers, use a deferrable timer to fire
    rt_check_expire().

    As some big routers' cache equilibrium depends on garbage collection
    being done in time, we take into account the elapsed time between
    two rt_check_expire() invocations to adjust the number of slots we
    have to check.

    Based on an initial idea and patch from Tero Kristo

    Signed-off-by: Eric Dumazet
    Signed-off-by: Tero Kristo
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jun, 2009

1 commit

  • Fix build error introduced by commit bb70dfa5 (netfilter: xtables:
    consolidate comefrom debug cast access):

    net/ipv4/netfilter/ip_tables.c: In function 'ipt_do_table':
    net/ipv4/netfilter/ip_tables.c:421: error: 'comefrom' undeclared (first use in this function)
    net/ipv4/netfilter/ip_tables.c:421: error: (Each undeclared identifier is reported only once
    net/ipv4/netfilter/ip_tables.c:421: error: for each function it appears in.)

    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

11 Jun, 2009

2 commits

  • Patrick McHardy
     
  • One of the problems with sock memory accounting is that it uses a
    pair of sock_hold()/sock_put() calls for each transmitted packet.

    This slows down bidirectional flows, because the receive path also
    needs to take a refcount on the socket and might run on a different
    cpu than the transmit path or the transmit completion path, so
    these two atomic operations also trigger cache line bounces.

    We can see this in tx or tx/rx workloads (media gateways, for
    example), where sock_wfree() can be among the top five functions in
    profiles.

    We use this sock_hold()/sock_put() pair so that socket freeing is
    delayed until all tx packets have completed.

    As we also update sk_wmem_alloc, we can offset sk_wmem_alloc by one
    unit at init time, until sk_free() is called. Once sk_free() is
    called, we atomic_dec_and_test(sk_wmem_alloc) to decrement the
    initial offset and atomically check whether any packets are in
    flight.

    skb_set_owner_w() doesn't call sock_hold() anymore.

    sock_wfree() doesn't call sock_put() anymore, but checks whether
    sk_wmem_alloc has reached 0 to perform the final freeing.

    The drawback is that an skb->truesize error could lead to
    unfreeable sockets, or even worse, to prematurely calling
    __sk_free() on a live socket.

    Nice speedups on SMP: tbench, for example, goes from 2691 MB/s to
    2711 MB/s on my 8-cpu dev machine, even though tbench was not
    really hitting the sk_refcnt contention point. A 5% speedup on a
    UDP transmit workload (depending on the number of flows), lowering
    TX completion cpu usage.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Jun, 2009

1 commit


09 Jun, 2009

2 commits


08 Jun, 2009

1 commit

  • Current conntrack code kills the ICMP conntrack entry as soon as
    the first reply is received. This is incorrect, as we then see only
    the first ICMP echo reply out of several possible duplicates as
    ESTABLISHED, while the rest will be INVALID. Also this unnecessarily
    increases the conntrackd traffic on H-A firewalls.

    Make all the ICMP conntrack entries (including the replied ones)
    last for the default of nf_conntrack_icmp{,v6}_timeout seconds.

    Signed-off-by: Jan "Yenya" Kasprzak
    Signed-off-by: Patrick McHardy

    Jan Kasprzak
     

05 Jun, 2009

1 commit

  • The lock "protects" an assignment and a comparison of an integer.
    When the caller of device_cmp() evaluates the result,
    nat->masq_index may already have been changed (regardless of
    whether the lock is held).

    So the lock either has to be held during nf_ct_iterate_cleanup(),
    or it can be removed.

    This patch does the latter.

    Signed-off-by: Florian Westphal
    Signed-off-by: Patrick McHardy

    Florian Westphal
     

04 Jun, 2009

2 commits


03 Jun, 2009

4 commits

  • Define three accessors to get/set dst attached to a skb

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)
    This one should replace occurrences of:
    dst_release(skb->dst);
    skb->dst = NULL;

    Delete skb->dst field

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

    Delete skb->rtable field

    Setting rtable is not allowed, just set dst instead as rtable is an alias.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Conflicts:
    drivers/net/forcedeth.c

    David S. Miller
     
  • This patch simplifies the conntrack event caching system by removing
    several events:

    * IPCT_[*]_VOLATILE, IPCT_HELPINFO and IPCT_NATINFO have been
    deleted, since they have no clients.
    * IPCT_COUNTER_FILLING, which is a leftover from the 32-bit counter
    days.
    * IPCT_REFRESH, which is not of any use since we always include the
    timeout in the messages.

    After this patch, the existing events are:

    * IPCT_NEW, IPCT_RELATED and IPCT_DESTROY, which identify the
    addition and deletion of entries.
    * IPCT_STATUS, which notes that the status bits have changed,
    e.g. IPS_SEEN_REPLY and IPS_ASSURED.
    * IPCT_PROTOINFO, which reports that internal protocol information
    has changed, e.g. the TCP, DCCP and SCTP protocol state.
    * IPCT_HELPER, which reports that a helper has been assigned to or
    unassigned from this entry.
    * IPCT_MARK and IPCT_SECMARK, which report that the mark has
    changed; this covers the case when a mark is set to zero.
    * IPCT_NATSEQADJ, which reports that there are updates to the NAT
    sequence adjustment.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     

02 Jun, 2009

3 commits


30 May, 2009

1 commit

  • Somewhat luckily, I was looking into these parts with a very fine
    comb, because I had made somewhat similar changes in the same area
    (the conflicts that arose weren't that lucky, though). The loop
    was very much overengineered recently in commit 915219441d566
    (tcp: Use SKB queue and list helpers instead of doing it
    by-hand), while it basically just wants to know if there are
    skbs after 'skb'.

    It also got broken because skb1 = skb->next was improperly
    translated into skb1 = skb1->next (though abstracted). Note that
    'skb1' points to the sk_buff previous to 'skb', or NULL if at the
    head. Two things went wrong:
    - We'd kfree 'skb' on the first iteration instead of the skbuff
    following 'skb' (recovering would require SACK reneging, I think).
    - The list-head case where 'skb1' is NULL is checked too early,
    and the loop won't execute, whereas it previously did.

    In conclusion, mostly revert the recent changes, which makes the
    cset look very messy, but use the proper accessors in the
    previous-like version.

    The effective changes against the original can be viewed with:
    git-diff 915219441d566f1da0caa0e262be49b666159e17^ \
    net/ipv4/tcp_input.c | sed -n -e '57,70 p'

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

29 May, 2009

3 commits


27 May, 2009

6 commits


26 May, 2009

1 commit


25 May, 2009

1 commit


22 May, 2009

1 commit

  • It seems we can fix this by disabling preemption while we
    re-balance the trie. This is with CONFIG_CLASSIC_RCU. It's been
    stress-tested at high loads, continuously taking a full BGP table
    up/down via iproute -batch.

    Note: fib_trie is not updated for CONFIG_PREEMPT_RCU.

    Reported-by: Andrei Popa
    Signed-off-by: Robert Olsson
    Signed-off-by: David S. Miller

    Robert Olsson
     

21 May, 2009

3 commits

  • The netlink message header (struct nlmsghdr) is an unused parameter in
    fill method of fib_rules_ops struct. This patch removes this
    parameter from this method and fixes the places where this method is
    called.

    (include/net/fib_rules.h)

    Signed-off-by: Rami Rosen
    Signed-off-by: David S. Miller

    Rami Rosen
     
  • Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
    analysis, available at http://bugzilla.kernel.org/show_bug.cgi?id=13339
    Quoted here because it's a perfect one:

    begin_of_quotation
    2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
    patch has at least one critical flaw, and another problem.

    rt_intern_hash calculates rthi pointer, which is later used for new entry
    insertion. The same loop calculates cand pointer which is used to clean the
    list. If the pointers are the same, rtable leak occurs, as first the cand is
    removed then the new entry is appended to it.

    This leak leads to unregister_netdevice problem (usage count > 0).

    Another problem of the patch is that it tries to insert the entries in certain
    order, to facilitate counting of entries distinct by all but QoS parameters.
    Unfortunately, referencing an existing rtable entry moves it to list beginning,
    to speed up further lookups, so the carefully built order is destroyed.

    For the first problem the simplest patch is to set rthi=0 when
    rthi==cand, but it will also destroy the ordering.
    end_of_quotation

    Problematic commit is 1080d709fb9d8cd4392f93476ee46a9d6ea05a5b
    (net: implement emergency route cache rebulds when gc_elasticity is exceeded)

    Trying to keep dst_entries ordered is too complex and breaks the fact that
    order should depend on the frequency of use for garbage collection.

    A possible fix is to make rt_intern_hash() simpler, and only make
    rt_check_expire() a little bit smarter, able to cope with an
    arbitrary entry order. The added loop runs on cache-hot data while
    the cpu is prefetching the next object, so it should go unnoticed.

    Reported-and-analyzed-by: Alexander V. Lukyanov
    Signed-off-by: Eric Dumazet
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • rt_check_expire() computes the average and standard deviation of
    chain lengths, but does not correctly reset the length to 0 at the
    beginning of each chain. This probably gives overflows for sum2
    (and sum) on loaded machines, instead of meaningful results.

    Signed-off-by: Eric Dumazet
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 May, 2009

1 commit

  • The DHCP spec allows the server to specify the MTU. This can be useful
    for netbooting with UDP-based NFS-root on a network using jumbo frames.
    This patch allows the kernel IP autoconfiguration to handle this option
    correctly.

    It would be possible to use initramfs and add a script to set the MTU,
    but that seems like a complicated solution if no initramfs is otherwise
    necessary, and would bloat the kernel image more than this code would.

    This patch was originally submitted to LKML in 2003 by Hans-Peter Jansen.

    Signed-off-by: Chris Friesen
    Signed-off-by: David S. Miller

    Chris Friesen