10 Jan, 2011

9 commits

  • In order to compute the features for other offloads (primarily
    scatter/gather), we first need to check whether the NIC can offload
    the checksum for the packet. Since we have already computed this, we
    can use the result directly instead of figuring it out again.

    Signed-off-by: Jesse Gross
    Signed-off-by: David S. Miller

    Jesse Gross
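
    The shape of the change, as a hedged sketch (flag names from the
    <linux/netdevice.h> of that era; "can_csum" stands in for the
    already-computed checksum verdict):

        /* Sketch: once the checksum verdict for this packet is known,
         * dependent offloads can be masked off in the same place.
         * Scatter/gather without checksum offload is useless, so it is
         * cleared together with the checksum flags. */
        static int harmonize_features(int features, bool can_csum)
        {
                if (!can_csum)
                        features &= ~(NETIF_F_ALL_CSUM | NETIF_F_SG);

                return features;
        }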
     
  • This switches skb_needs_linearize() to use the features that have
    been centrally computed. In doing so, this fixes a problem where
    scatter/gather should not be used because the card does not support
    checksum offloading on that type of packet. On device registration
    we only check that some form of checksum offloading is available if
    scatter/gather is enabled, but we must also check at transmission
    time. Examples of this include IPv6 or vlan packets on a NIC that
    only supports IPv4 offloading.

    Signed-off-by: Jesse Gross
    Signed-off-by: David S. Miller

    Jesse Gross
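
    A hedged sketch of the check after the change (the caller now passes
    in the centrally computed feature set instead of the helper reading
    dev->features itself):

        static inline int skb_needs_linearize(struct sk_buff *skb,
                                              int features)
        {
                /* linearize when the skb carries a frag list or page
                 * frags that this packet's feature set cannot handle */
                return skb_is_nonlinear(skb) &&
                       ((skb_has_frag_list(skb) &&
                         !(features & NETIF_F_FRAGLIST)) ||
                        (skb_shinfo(skb)->nr_frags &&
                         !(features & NETIF_F_SG)));
        }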
     
  • This switches dev_gso_segment() to use the device features computed
    by the centralized routine. In doing so, it fixes a problem where
    it would always use dev->features, instead of those appropriate
    to the number of vlan tags if any are present.

    Signed-off-by: Jesse Gross
    Signed-off-by: David S. Miller

    Jesse Gross
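
    Roughly, as a sketch with error handling trimmed, the routine now
    receives the vlan-aware feature set from its caller:

        static int dev_gso_segment(struct sk_buff *skb, int features)
        {
                /* features reflects any vlan tags on this skb,
                 * not the raw dev->features */
                struct sk_buff *segs = skb_gso_segment(skb, features);

                if (IS_ERR(segs))
                        return PTR_ERR(segs);

                skb->next = segs;       /* queue segments for transmission */
                return 0;
        }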
     
  • Now that there is a single function that can compute the device
    features relevant to a packet, we don't want to run it for each
    offload. This converts netif_needs_gso() to take the features
    of the device, rather than computing them itself.

    Signed-off-by: Jesse Gross
    Signed-off-by: David S. Miller

    Jesse Gross
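
    The converted helper plausibly looks like this (a sketch of the new
    signature; the features argument replaces the internal computation):

        static inline int netif_needs_gso(struct sk_buff *skb, int features)
        {
                /* software GSO is needed when the precomputed features
                 * cannot segment this skb, or its checksum setup is off */
                return skb_is_gso(skb) &&
                       (!skb_gso_ok(skb, features) ||
                        unlikely(skb->ip_summed != CHECKSUM_PARTIAL));
        }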
     
  • netif_get_vlan_features() is currently only used by netif_needs_gso(),
    so it only concerns itself with GSO features. However, several other
    places should also take the contents of the packet into account when
    deciding whether to offload to hardware. This generalizes the function
    to return features covering all of the various forms of offloading. Since
    offloads tend to be linked together, this avoids duplicating the logic
    in each location (i.e. the scatter/gather code also needs the checksum
    logic).

    Suggested-by: Michał Mirosław
    Signed-off-by: Jesse Gross
    Signed-off-by: David S. Miller

    Jesse Gross
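
    The intended call pattern, sketched (do_software_gso() is a
    hypothetical stand-in for the software segmentation path):

        static int xmit_prepare(struct sk_buff *skb)
        {
                /* compute the per-packet feature set exactly once ... */
                int features = netif_skb_features(skb);

                /* ... then feed the same value to every offload decision */
                if (netif_needs_gso(skb, features))
                        return do_software_gso(skb, features);

                if (skb_needs_linearize(skb, features) &&
                    __skb_linearize(skb))
                        return -ENOMEM;

                return 0;
        }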
     
  • We currently only have software fallback for one type of checksum: the
    TCP/UDP one's complement. This means that a protocol that uses hardware
    offloading for a different type of checksum (FCoE, SCTP) must directly
    check the device's features and do the right thing ahead of time. By
    the time we get to dev_can_checksum(), we're only deciding whether to
    apply the one algorithm in software or hardware. NETIF_F_HW_CSUM has the
    same capabilities as the software version, so we should always use it if
    present. The primary advantage of this is that multiply-tagged vlans
    can use hardware checksumming.

    Signed-off-by: Jesse Gross
    Signed-off-by: David S. Miller

    Jesse Gross
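
    A hedged sketch of the resulting preference order (flag names from
    <linux/netdevice.h>; the real helper also knows about protocols such
    as FCoE):

        static bool can_checksum_protocol(unsigned long features,
                                          __be16 protocol)
        {
                /* NETIF_F_HW_CSUM handles any protocol, exactly like the
                 * software fallback, so it always wins if present */
                if (features & NETIF_F_HW_CSUM)
                        return true;
                if (protocol == htons(ETH_P_IP))
                        return (features & NETIF_F_IP_CSUM) != 0;
                if (protocol == htons(ETH_P_IPV6))
                        return (features & NETIF_F_IPV6_CSUM) != 0;
                return false;
        }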
     
  • Fix new kernel-doc notation warning in net/core/filter.c:

    Warning(net/core/filter.c:172): No description found for parameter 'fentry'
    Warning(net/core/filter.c:172): Excess function parameter 'filter' description in 'sk_run_filter'

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
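
    The fix presumably just renames the parameter in the kernel-doc
    comment to match the function's signature, along these lines:

        /**
         *      sk_run_filter - run a packet filter on a socket buffer
         *      @skb: buffer to run the filter on
         *      @fentry: filter to apply
         */
        unsigned int sk_run_filter(const struct sk_buff *skb,
                                   const struct sock_filter *fentry);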
     
  • Because NLM_F_DUMP is composed of two bits, NLM_F_ROOT | NLM_F_MATCH,
    the test "if (x & NLM_F_DUMP)" is true when _either_ of the bits is
    set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL, non-dump
    requests with NLM_F_EXCL set are mistaken for dump requests.

    Change the condition to test for _all_ bits being set.

    Signed-off-by: Jan Engelhardt
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Jan Engelhardt
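
    A minimal illustration of the flag arithmetic (nlmsg_flags is the
    __u16 flags field of struct nlmsghdr):

        #include <stdbool.h>
        #include <linux/netlink.h>

        /* NLM_F_DUMP == NLM_F_ROOT | NLM_F_MATCH, and NLM_F_MATCH has
         * the same value (0x200) as NLM_F_EXCL, so a plain AND also
         * fires for non-dump requests carrying NLM_F_EXCL. */

        static bool is_dump_buggy(__u16 flags)
        {
                return flags & NLM_F_DUMP;                 /* either bit */
        }

        static bool is_dump_fixed(__u16 flags)
        {
                return (flags & NLM_F_DUMP) == NLM_F_DUMP; /* all bits */
        }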
     
  • David S. Miller
     

08 Jan, 2011

3 commits

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
    usb: don't use flush_scheduled_work()
    speedtch: don't abuse struct delayed_work
    media/video: don't use flush_scheduled_work()
    media/video: explicitly flush request_module work
    ioc4: use static work_struct for ioc4_load_modules()
    init: don't call flush_scheduled_work() from do_initcalls()
    s390: don't use flush_scheduled_work()
    rtc: don't use flush_scheduled_work()
    mmc: update workqueue usages
    mfd: update workqueue usages
    dvb: don't use flush_scheduled_work()
    leds-wm8350: don't use flush_scheduled_work()
    mISDN: don't use flush_scheduled_work()
    macintosh/ams: don't use flush_scheduled_work()
    vmwgfx: don't use flush_scheduled_work()
    tpm: don't use flush_scheduled_work()
    sonypi: don't use flush_scheduled_work()
    hvsi: don't use flush_scheduled_work()
    xen: don't use flush_scheduled_work()
    gdrom: don't use flush_scheduled_work()
    ...

    Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
    as per Tejun.

    Linus Torvalds
     
  • * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (65 commits)
    [S390] prevent unneccesary loops_per_jiffy recalculation
    [S390] cpuinfo: use get_online_cpus() instead of preempt_disable()
    [S390] smp: remove cpu hotplug messages
    [S390] mutex: enable spinning mutex on s390
    [S390] mutex: Introduce arch_mutex_cpu_relax()
    [S390] cio: fix ccwgroup unregistration race condition
    [S390] perf: add DWARF register lookup for s390
    [S390] cleanup ftrace backend functions
    [S390] ptrace cleanup
    [S390] smp/idle: call init_idle() before starting a new cpu
    [S390] smp: delay idle task creation
    [S390] dasd: Correct retry counter for terminated I/O.
    [S390] dasd: Add support for raw ECKD access.
    [S390] dasd: Prevent deadlock during suspend/resume.
    [S390] dasd: Improve handling of stolen DASD reservation
    [S390] dasd: do path verification for paths added at runtime
    [S390] dasd: add High Performance FICON multitrack support
    [S390] cio: reduce memory consumption of itcw structures
    [S390] nmi: enable machine checks early
    [S390] qeth: buffer count imbalance
    ...

    Linus Torvalds
     
  • …t/npiggin/linux-npiggin

    * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: (57 commits)
    fs: scale mntget/mntput
    fs: rename vfsmount counter helpers
    fs: implement faster dentry memcmp
    fs: prefetch inode data in dcache lookup
    fs: improve scalability of pseudo filesystems
    fs: dcache per-inode inode alias locking
    fs: dcache per-bucket dcache hash locking
    bit_spinlock: add required includes
    kernel: add bl_list
    xfs: provide simple rcu-walk ACL implementation
    btrfs: provide simple rcu-walk ACL implementation
    ext2,3,4: provide simple rcu-walk ACL implementation
    fs: provide simple rcu-walk generic_check_acl implementation
    fs: provide rcu-walk aware permission i_ops
    fs: rcu-walk aware d_revalidate method
    fs: cache optimise dentry and inode for rcu-walk
    fs: dcache reduce branches in lookup path
    fs: dcache remove d_mounted
    fs: fs_struct use seqlock
    fs: rcu-walk for path lookup
    ...

    Linus Torvalds
     

07 Jan, 2011

17 commits

  • The 'seq_window' sysctl sets the initial value for the DCCP Sequence Window,
    which may range from 32..2^46-1 (RFC 4340, 7.5.2). The patch sets the upper
    bound consistently to 2^32-1 on both 32-bit and 64-bit systems, which
    should be sufficient: with an RTT of 1 sec and 1-byte packets, a
    seq_window of 2^32-1
    corresponds to a link speed of 34 Gbps.

    Signed-off-by: Gerrit Renker

    Gerrit Renker
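
    The 34 Gbps figure is straightforward arithmetic (a back-of-the-
    envelope check, assuming the full window of 1-byte packets is sent
    once per 1-second RTT):

        \[
          2^{32}\,\frac{\text{packets}}{\text{s}}
          \times 8\,\frac{\text{bits}}{\text{packet}}
          \approx 3.4\times 10^{10}\,\text{bit/s}
          \approx 34\ \text{Gbit/s}
        \]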
     
  • Currently dccp_check_seqno allows any valid packet to update the Greatest
    Sequence Number Received, even if that packet's sequence number is less than
    the current GSR. This patch adds a check to make sure that the new packet's
    sequence number is greater than GSR.

    Signed-off-by: Samuel Jero
    Signed-off-by: Gerrit Renker

    Samuel Jero
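
    A hedged sketch of the added check, using DCCP's 48-bit sequence
    helpers from net/dccp/dccp.h (maybe_update_gsr() is a hypothetical
    wrapper):

        static void maybe_update_gsr(struct sock *sk, struct sk_buff *skb)
        {
                struct dccp_sock *dp = dccp_sk(sk);
                u64 seq = DCCP_SKB_CB(skb)->dccpd_seq;

                /* GSR only moves forward (48-bit circular comparison) */
                if (after48(seq, dp->dccps_gsr))
                        dccp_update_gsr(sk, seq);
        }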
     
  • Currently dccp_check_seqno returns 0 (indicating a valid packet) if the
    acknowledgment number is out of bounds and the sync that RFC 4340 mandates at
    this point is currently being rate-limited. This function should return -1,
    indicating an invalid packet.

    Signed-off-by: Samuel Jero
    Acked-by: Gerrit Renker

    Samuel Jero
     
  • The problem that this patch aims to fix is vfsmount refcounting
    scalability. We need to take a reference on the vfsmount for every
    successful path lookup, and these lookups often go to the same mount
    point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only take the global sum when the local count
    reaches 0 (this is difficult for vfsmounts, because we can't keep
    preemption disabled for the life of a reference, so a counter would need
    to be per-thread or tied strongly to a particular CPU, which requires
    more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here (a toy model is sketched
    below). Attached mounts and process root and working directory
    references are "long" references, and everything else is a short
    reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fast path: mntget is now just
    a per-CPU increment rather than an atomic increment, and mntput just
    requires a spinlock and a non-atomic decrement in the common case.
    However, the code is otherwise bigger and heavier, so single-threaded
    performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
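
    A toy userspace model of that last scheme (hypothetical names; the
    kernel code differs in detail):

        #define NR_CPUS 4

        struct mnt_count {
                long percpu[NR_CPUS];   /* local increment/decrement deltas */
                long longrefs;          /* attached mounts, roots, cwds */
        };

        /* fast path: touch only the local counter, no atomics */
        static void mnt_get(struct mnt_count *m, int cpu)
        {
                m->percpu[cpu]++;
        }

        static void mnt_put(struct mnt_count *m, int cpu)
        {
                m->percpu[cpu]--;       /* may go negative; only the sum counts */
        }

        /* slow path, only needed once longrefs drops to zero: sum the
         * distributed deltas to learn the true reference count */
        static long mnt_count_sum(struct mnt_count *m)
        {
                long sum = m->longrefs;
                int cpu;

                for (cpu = 0; cpu < NR_CPUS; cpu++)
                        sum += m->percpu[cpu];
                return sum;
        }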
     
  • Regardless of how much we try to scale the dcache, there is likely
    always going to be some fundamental contention when adding or removing
    children under the same parent. Pseudo filesystems do not seem to need
    connected dentries, because by definition they are disconnected.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Reduce some branches and memory accesses in dcache lookup by adding dentry
    flags to indicate common d_ops are set, rather than having to check them.
    This saves a pointer memory access (dentry->d_op) in common path lookup
    situations, and saves another pointer load and branch in cases where we
    have d_op but not the particular operation.

    Patched with:

    git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

    Signed-off-by: Nick Piggin

    Nick Piggin
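
    A hedged sketch of the resulting fast path (DCACHE_OP_REVALIDATE
    stands for the per-operation flags this patch introduces):

        static inline int d_revalidate(struct dentry *dentry,
                                       struct nameidata *nd)
        {
                /* one flag test on an already-hot word; no d_op pointer
                 * load and no method-pointer check in the common case */
                if (likely(!(dentry->d_flags & DCACHE_OP_REVALIDATE)))
                        return 1;

                return dentry->d_op->d_revalidate(dentry, nd);
        }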
     
  • Pseudo filesystems whose inodes are never put on the RCU list and are
    not reachable by rcu-walk dentries do not need to RCU-free their
    inodes.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downside of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (i.e. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
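
    The shape of the change, sketched (assuming an i_rcu callback head in
    struct inode and the usual inode_cachep slab from fs/inode.c):

        static void inode_free_rcu(struct rcu_head *head)
        {
                struct inode *inode = container_of(head, struct inode, i_rcu);

                kmem_cache_free(inode_cachep, inode);
        }

        static void destroy_inode(struct inode *inode)
        {
                /* rcu-walk may still be looking at this inode, so defer
                 * the actual free until a grace period has elapsed */
                call_rcu(&inode->i_rcu, inode_free_rcu);
        }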
     
  • Change d_delete from a dentry deletion notification to a dentry caching
    advice, more like ->drop_inode. Require it to be constant and idempotent,
    and not take d_lock. This is how all existing filesystems use the callback
    anyway.

    This makes fine grained dentry locking of dput and dentry lru scanning
    much simpler.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Since nf_bridge_maybe_copy_header() may change the length of the skb,
    we should check the skb's length after calling it, to handle PPPoE skbs.

    Signed-off-by: Changli Gao
    Signed-off-by: David S. Miller

    Changli Gao
     
  • In commit 1ae4de0cdf855305765592647025bde55e85e451, the secctx was exported
    via the /proc/net/netfilter/nf_conntrack and ctnetlink interfaces
    instead of the secmark.

    That patch introduced the use of security_secid_to_secctx() which may
    return a non-zero value on error.

    In one of my setups, I have NF_CONNTRACK_SECMARK enabled but no
    security modules. Thus, security_secid_to_secctx() returns a negative
    value, which breaks the /proc and `conntrack -L` outputs. To fix this,
    we skip the inclusion of the secctx if the
    aforementioned function fails.

    This patch also fixes the dynamic netlink message size calculation
    if security_secid_to_secctx() returns an error, since its logic is
    also wrong.

    This problem exists in Linux kernel >= 2.6.37.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
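
    A sketch of the /proc side of the fix (names approximate the
    conntrack code):

        static int ct_show_secctx(struct seq_file *s, const struct nf_conn *ct)
        {
                int ret;
                u32 len;
                char *secctx;

                ret = security_secid_to_secctx(ct->secmark, &secctx, &len);
                if (ret)
                        return 0;       /* no LSM context: skip the field */

                ret = seq_printf(s, "secctx=%s ", secctx);

                security_release_secctx(secctx, len);
                return ret;
        }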
     
  • Since nf_ct_expect_dst_hash() may be called without nf_conntrack_lock
    held, nf_ct_expect_hash_rnd should be initialized atomically.

    In this patch, we use nf_conntrack_hash_rnd instead of
    nf_ct_expect_hash_rnd.

    Signed-off-by: Changli Gao
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Changli Gao
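
    One-time initialization without a lock usually follows this pattern
    (a sketch, not the patch verbatim):

        static u32 hash_rnd __read_mostly;

        static void init_hash_rnd(void)
        {
                u32 rand;

                /* zero doubles as the "not yet initialized" marker */
                do {
                        get_random_bytes(&rand, sizeof(rand));
                } while (!rand);

                /* racing CPUs each compute a candidate, but cmpxchg()
                 * lets exactly one writer win, so every user ends up
                 * seeing the same value */
                cmpxchg(&hash_rnd, 0, rand);
        }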
     
  • RFC3168 (The Addition of Explicit Congestion Notification to IP)
    states:

    5.3. Fragmentation

    ECN-capable packets MAY have the DF (Don't Fragment) bit set.
    Reassembly of a fragmented packet MUST NOT lose indications of
    congestion. In other words, if any fragment of an IP packet to be
    reassembled has the CE codepoint set, then one of two actions MUST be
    taken:

    * Set the CE codepoint on the reassembled packet. However, this
    MUST NOT occur if any of the other fragments contributing to
    this reassembly carries the Not-ECT codepoint.

    * The packet is dropped, instead of being reassembled, for any
    other reason.

    This patch implements this requirement for IPv4, choosing the first
    action:

    If any fragment had the Not-ECT codepoint:
        the reassembled frame has Not-ECT
    Else if any fragment had the CE codepoint:
        the reassembled frame has CE

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
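
    A hedged sketch of the folding rule (hypothetical helper; the kernel
    implements the same decision differently):

        /* fold one fragment's ECN field into the running result */
        static u8 fold_frag_ecn(u8 acc, u8 frag_ecn)
        {
                /* any Not-ECT fragment forbids marking CE on the result */
                if (acc == INET_ECN_NOT_ECT || frag_ecn == INET_ECN_NOT_ECT)
                        return INET_ECN_NOT_ECT;

                /* otherwise a single CE fragment makes the result CE */
                if (frag_ecn == INET_ECN_CE)
                        return INET_ECN_CE;

                return acc;
        }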
     
  • The original code has a use-after-free bug because it's not using the
    _safe() version of the list_for_each_entry() macro.

    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller

    Dan Carpenter
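
    For reference, the difference between the two macros (a minimal
    sketch):

        struct item {
                struct list_head list;
        };

        static void free_all(struct list_head *head)
        {
                struct item *pos, *n;

                /* list_for_each_entry() would read pos->list.next AFTER
                 * the body freed pos: a use-after-free. The _safe variant
                 * caches the next element in "n" before the body runs. */
                list_for_each_entry_safe(pos, n, head, list) {
                        list_del(&pos->list);
                        kfree(pos);
                }
        }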
     
  • There is a "goto nla_put_failure" hidden inside the NLA_PUT() macro, but
    we're holding the dcb_lock, so we need to unlock first.

    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller

    Dan Carpenter
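
    A sketch of the pattern being fixed (hypothetical function;
    NLA_PUT_U8() was the error-jumping wrapper of that era):

        static int dcbnl_put_state(struct sk_buff *skb, u8 state)
        {
                spin_lock(&dcb_lock);
                /* expands to "goto nla_put_failure" when the skb runs
                 * out of tailroom */
                NLA_PUT_U8(skb, DCB_ATTR_STATE, state);
                spin_unlock(&dcb_lock);
                return 0;

        nla_put_failure:
                spin_unlock(&dcb_lock); /* the fix: drop the lock here too */
                return -EMSGSIZE;
        }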
     
  • David S. Miller
     
  • Leonardo Chiquitto found that poll() could block forever on TCP sockets
    when urgent data was received, if the event mask only contains POLLPRI.

    He did a bisection and found that commit 4938d7e0233 (poll: avoid extra
    wakeups in select/poll) was the source of the problem.

    The problem is that TCP sockets use the standard sock_def_readable()
    function for their sk_data_ready() handler, and sock_def_readable()
    doesn't signal POLLPRI.

    Only TCP is affected by the problem. Adding POLLPRI to the list of flags
    might trigger unnecessary schedules, but urgent-data handling is such a
    seldom-used feature that this seems a good compromise.

    Thanks a lot to Leonardo for providing the bisection result and a test
    program as well.

    Reference : http://www.spinics.net/lists/netdev/msg151793.html

    Reported-and-bisected-by: Leonardo Chiquitto
    Signed-off-by: Eric Dumazet
    Tested-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
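
    The fix plausibly amounts to adding POLLPRI to the key that
    sock_def_readable() passes to the keyed wakeup, roughly:

        static void sock_def_readable(struct sock *sk, int len)
        {
                struct socket_wq *wq;

                rcu_read_lock();
                wq = rcu_dereference(sk->sk_wq);
                if (wq_has_sleeper(wq))
                        /* POLLPRI added so poll()ers waiting only for
                         * urgent data are woken as well */
                        wake_up_interruptible_sync_poll(&wq->wait,
                                        POLLIN | POLLPRI |
                                        POLLRDNORM | POLLRDBAND);
                sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
                rcu_read_unlock();
        }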
     

06 Jan, 2011

7 commits


05 Jan, 2011

4 commits