13 Jan, 2012

1 commit

  • commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
    RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
    complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
    y).

    We miss needed barriers, even on x86, when y is not NULL.

    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Paul E. McKenney
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Jan, 2012

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1958 commits)
    net: pack skb_shared_info more efficiently
    net_sched: red: split red_parms into parms and vars
    net_sched: sfq: extend limits
    cnic: Improve error recovery on bnx2x devices
    cnic: Re-init dev->stats_addr after chip reset
    net_sched: Bug in netem reordering
    bna: fix sparse warnings/errors
    bna: make ethtool_ops and strings const
    xgmac: cleanups
    net: make ethtool_ops const
    vmxnet3" make ethtool ops const
    xen-netback: make ops structs const
    virtio_net: Pass gfp flags when allocating rx buffers.
    ixgbe: FCoE: Add support for ndo_get_fcoe_hbainfo() call
    netdev: FCoE: Add new ndo_get_fcoe_hbainfo() call
    igb: reset PHY after recovering from PHY power down
    igb: add basic runtime PM support
    igb: Add support for byte queue limits.
    e1000: cleanup CE4100 MDIO registers access
    e1000: unmap ce4100_gbe_mdio_base_virt in e1000_remove
    ...

    Linus Torvalds
     

06 Jan, 2012

1 commit

  • We're doing some odd things there, which already messes up various users
    (see the net/socket.c code that this removes), and it was going to add
    yet more crud to the block layer because of the incorrect error code
    translation.

    ENOIOCTLCMD is not an error return that should be returned to user mode
    from the "ioctl()" system call, but it should *not* be translated as
    EINVAL ("Invalid argument"). It should be translated as ENOTTY
    ("Inappropriate ioctl for device").

    That EINVAL confusion has apparently so permeated some code that the
    block layer actually checks for it, which is sad. We continue to do so
    for now, but add a big comment about how wrong that is, and we should
    remove it entirely eventually. In the meantime, this tries to keep the
    changes localized to just the EINVAL -> ENOTTY fix, and removing code
    that makes it harder to do the right thing.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

05 Jan, 2012

1 commit

  • Define special location values for RX NFC that request the driver to
    select the actual rule location. This allows for implementation on
    devices that use hash-based filter lookup, whereas currently the API is
    more suited to devices with TCAM lookup or linear search.

    In ethtool_set_rxnfc() and the compat wrapper ethtool_ioctl(), copy
    the structure back to user-space after insertion so that the actual
    location is returned.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     

23 Nov, 2011

1 commit

  • This patch adds in the infrastructure code to create the network priority
    cgroup. The cgroup, in addition to the standard processes file creates two
    control files:

    1) prioidx - This is a read-only file that exports the index of this cgroup.
    This is a value that is both arbitrary and unique to a cgroup in this subsystem,
    and is used to index the per-device priority map

    2) priomap - This is a writeable file. On read it reports a table of 2-tuples
    where name is the name of a network interface and priority is
    indicates the priority assigned to frames egresessing on the named interface and
    originating from a pid in this cgroup

    This cgroup allows for skb priority to be set prior to a root qdisc getting
    selected. This is benenficial for DCB enabled systems, in that it allows for any
    application to use dcb configured priorities so without application modification

    Signed-off-by: Neil Horman
    Signed-off-by: John Fastabend
    CC: Robert Love
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    Neil Horman
     

10 Nov, 2011

1 commit

  • The 802.1X EAPOL handshake hostapd does requires
    knowing whether the frame was ack'ed by the peer.
    Currently, we fudge this pretty badly by not even
    transmitting the frame as a normal data frame but
    injecting it with radiotap and getting the status
    out of radiotap monitor as well. This is rather
    complex, confuses users (mon.wlan0 presence) and
    doesn't work with all hardware.

    To get rid of that hack, introduce a real wifi TX
    status option for data frame transmissions.

    This works similar to the existing TX timestamping
    in that it reflects the SKB back to the socket's
    error queue with a SCM_WIFI_STATUS cmsg that has
    an int indicating ACK status (0/1).

    Since it is possible that at some point we will
    want to have TX timestamping and wifi status in a
    single errqueue SKB (there's little point in not
    doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
    to SO_EE_ORIGIN_TXSTATUS which can collect more
    than just the timestamp; keep the old constant
    as an alias of course. Currently the internal APIs
    don't make that possible, but it wouldn't be hard
    to split them up in a way that makes it possible.

    Thanks to Neil Horman for helping me figure out
    the functions that add the control messages.

    Signed-off-by: Johannes Berg
    Signed-off-by: John W. Linville

    Johannes Berg
     

22 Sep, 2011

1 commit

  • Conflicts:
    MAINTAINERS
    drivers/net/Kconfig
    drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
    drivers/net/ethernet/broadcom/tg3.c
    drivers/net/wireless/iwlwifi/iwl-pci.c
    drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
    drivers/net/wireless/rt2x00/rt2800usb.c
    drivers/net/wireless/wl12xx/main.c

    David S. Miller
     

25 Aug, 2011

1 commit

  • Dereferencing a user pointer directly from kernel-space without going
    through the copy_from_user family of functions is a bad idea. Two of
    such usages can be found in the sendmsg code path called from sendmmsg,
    added by

    commit c71d8ebe7a4496fb7231151cb70a6baa0cb56f9a upstream.
    commit 5b47b8038f183b44d2d8ff1c7d11a5c1be706b34 in the 3.0-stable tree.

    Usages are performed through memcmp() and memcpy() directly. Fix those
    by using the already copied msg_sys structure instead of the __user *msg
    structure. Note that msg_sys can be set to NULL by verify_compat_iovec()
    or verify_iovec(), which requires additional NULL pointer checks.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: David Goulet
    CC: Tetsuo Handa
    CC: Anton Blanchard
    CC: David S. Miller
    CC: stable
    Signed-off-by: David S. Miller

    Mathieu Desnoyers
     

08 Aug, 2011

1 commit


05 Aug, 2011

3 commits

  • The sendmmsg() introduced by commit 228e548e "net: Add sendmmsg socket system
    call" is capable of sending to multiple different destination addresses.

    SMACK is using destination's address for checking sendmsg() permission.
    However, security_socket_sendmsg() is called for only once even if multiple
    different destination addresses are passed to sendmmsg().

    Therefore, we need to call security_socket_sendmsg() for each destination
    address rather than only the first destination address.

    Since calling security_socket_sendmsg() every time when only single destination
    address was passed to sendmmsg() is a waste of time, omit calling
    security_socket_sendmsg() unless destination address of previous datagram and
    that of current datagram differs.

    Signed-off-by: Tetsuo Handa
    Acked-by: Anton Blanchard
    Cc: stable [3.0+]
    Signed-off-by: David S. Miller

    Tetsuo Handa
     
  • To limit the amount of time we can spend in sendmmsg, cap the
    number of elements to UIO_MAXIOV (currently 1024).

    For error handling an application using sendmmsg needs to retry at
    the first unsent message, so capping is simpler and requires less
    application logic than returning EINVAL.

    Signed-off-by: Anton Blanchard
    Cc: stable [3.0+]
    Signed-off-by: David S. Miller

    Anton Blanchard
     
  • sendmmsg uses a similar error return strategy as recvmmsg but it
    turns out to be a confusing way to communicate errors.

    The current code stores the error code away and returns it on the next
    sendmmsg call. This means a call with completely valid arguments could
    get an error from a previous call.

    Change things so we only return an error if no datagrams could be sent.
    If less than the requested number of messages were sent, the application
    must retry starting at the first failed one and if the problem is
    persistent the error will be returned.

    This matches the behaviour of other syscalls like read/write - it
    is not an error if less than the requested number of elements are sent.

    Signed-off-by: Anton Blanchard
    Cc: stable [3.0+]
    Signed-off-by: David S. Miller

    Anton Blanchard
     

02 Aug, 2011

1 commit

  • When assigning a NULL value to an RCU protected pointer, no barrier
    is needed. The rcu_assign_pointer, used to handle that but will soon
    change to not handle the special case.

    Convert all rcu_assign_pointer of NULL value.

    //smpl
    @@ expression P; @@

    - rcu_assign_pointer(P, NULL)
    + RCU_INIT_POINTER(P, NULL)

    //

    Signed-off-by: Stephen Hemminger
    Acked-by: Paul E. McKenney
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

28 Jul, 2011

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
    tg3: Remove 5719 jumbo frames and TSO blocks
    tg3: Break larger frags into 4k chunks for 5719
    tg3: Add tx BD budgeting code
    tg3: Consolidate code that calls tg3_tx_set_bd()
    tg3: Add partial fragment unmapping code
    tg3: Generalize tg3_skb_error_unmap()
    tg3: Remove short DMA check for 1st fragment
    tg3: Simplify tx bd assignments
    tg3: Reintroduce tg3_tx_ring_info
    ASIX: Use only 11 bits of header for data size
    ASIX: Simplify condition in rx_fixup()
    Fix cdc-phonet build
    bonding: reduce noise during init
    bonding: fix string comparison errors
    net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared
    net: add IFF_SKB_TX_SHARED flag to priv_flags
    net: sock_sendmsg_nosec() is static
    forcedeth: fix vlans
    gianfar: fix bug caused by 87c288c6e9aa31720b72e2bc2d665e24e1653c3e
    gro: Only reset frag0 when skb can be pulled
    ...

    Linus Torvalds
     
  • Signed-off-by: Eric Dumazet
    CC: Anton Blanchard
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jul, 2011

1 commit


21 May, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
    macvlan: fix panic if lowerdev in a bond
    tg3: Add braces around 5906 workaround.
    tg3: Fix NETIF_F_LOOPBACK error
    macvlan: remove one synchronize_rcu() call
    networking: NET_CLS_ROUTE4 depends on INET
    irda: Fix error propagation in ircomm_lmp_connect_response()
    irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
    irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
    be2net: Kill set but unused variable 'req' in lancer_fw_download()
    irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
    atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
    rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
    rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
    rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
    pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
    isdn: capi: Use pr_debug() instead of ifdefs.
    tg3: Update version to 3.119
    tg3: Apply rx_discards fix to 5719/5720
    ...

    Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
    as per Davem.

    Linus Torvalds
     

18 May, 2011

2 commits


08 May, 2011

1 commit


06 May, 2011

1 commit

  • This patch adds a multiple message send syscall and is the send
    version of the existing recvmmsg syscall. This is heavily
    based on the patch by Arnaldo that added recvmmsg.

    I wrote a microbenchmark to test the performance gains of using
    this new syscall:

    http://ozlabs.org/~anton/junkcode/sendmmsg_test.c

    The test was run on a ppc64 box with a 10 Gbit network card. The
    benchmark can send both UDP and RAW ethernet packets.

    64B UDP

    batch pkts/sec
    1 804570
    2 872800 (+ 8 %)
    4 916556 (+14 %)
    8 939712 (+17 %)
    16 952688 (+18 %)
    32 956448 (+19 %)
    64 964800 (+20 %)

    64B raw socket

    batch pkts/sec
    1 1201449
    2 1350028 (+12 %)
    4 1461416 (+22 %)
    8 1513080 (+26 %)
    16 1541216 (+28 %)
    32 1553440 (+29 %)
    64 1557888 (+30 %)

    We see a 20% improvement in throughput on UDP send and 30%
    on raw socket send.

    [ Add sparc syscall entries. -DaveM ]

    Signed-off-by: Anton Blanchard
    Signed-off-by: David S. Miller

    Anton Blanchard
     

12 Apr, 2011

2 commits


31 Mar, 2011

1 commit


19 Mar, 2011

1 commit

  • This structure was accidentally defined such that its layout can
    differ between 32-bit and 64-bit processes. Add compat structure
    definitions and an ioctl wrapper function.

    Signed-off-by: Ben Hutchings
    Acked-by: Alexander Duyck
    Cc: stable@kernel.org [2.6.30+]
    Signed-off-by: David S. Miller

    Ben Hutchings
     

24 Feb, 2011

1 commit


23 Feb, 2011

1 commit


13 Jan, 2011

1 commit


08 Jan, 2011

1 commit

  • …t/npiggin/linux-npiggin

    * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: (57 commits)
    fs: scale mntget/mntput
    fs: rename vfsmount counter helpers
    fs: implement faster dentry memcmp
    fs: prefetch inode data in dcache lookup
    fs: improve scalability of pseudo filesystems
    fs: dcache per-inode inode alias locking
    fs: dcache per-bucket dcache hash locking
    bit_spinlock: add required includes
    kernel: add bl_list
    xfs: provide simple rcu-walk ACL implementation
    btrfs: provide simple rcu-walk ACL implementation
    ext2,3,4: provide simple rcu-walk ACL implementation
    fs: provide simple rcu-walk generic_check_acl implementation
    fs: provide rcu-walk aware permission i_ops
    fs: rcu-walk aware d_revalidate method
    fs: cache optimise dentry and inode for rcu-walk
    fs: dcache reduce branches in lookup path
    fs: dcache remove d_mounted
    fs: fs_struct use seqlock
    fs: rcu-walk for path lookup
    ...

    Linus Torvalds
     

07 Jan, 2011

5 commits

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only taking the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Regardless of how much we possibly try to scale dcache, there is likely
    always going to be some fundamental contention when adding or removing children
    under the same parent. Pseudo filesystems do not seem need to have connected
    dentries because by definition they are disconnected.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Reduce some branches and memory accesses in dcache lookup by adding dentry
    flags to indicate common d_ops are set, rather than having to check them.
    This saves a pointer memory access (dentry->d_op) in common path lookup
    situations, and saves another pointer load and branch in cases where we
    have d_op but not the particular operation.

    Patched with:

    git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • Pseudo filesystems that don't put inode on RCU list or reachable by
    rcu-walk dentries do not need to RCU free their inodes.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • RCU free the struct inode. This will allow:

    - Subsequent store-free path walking patch. The inode must be consulted for
    permissions when walking, so an RCU inode reference is a must.
    - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
    to take i_lock no longer need to take sb_inode_list_lock to walk the list in
    the first place. This will simplify and optimize locking.
    - Could remove some nested trylock loops in dcache code
    - Could potentially simplify things a bit in VM land. Do not need to take the
    page lock to follow page->mapping.

    The downsides of this is the performance cost of using RCU. In a simple
    creat/unlink microbenchmark, performance drops by about 10% due to inability to
    reuse cache-hot slab objects. As iterations increase and RCU freeing starts
    kicking over, this increases to about 20%.

    In cases where inode lifetimes are longer (ie. many inodes may be allocated
    during the average life span of a single inode), a lot of this cache reuse is
    not applicable, so the regression caused by this patch is smaller.

    The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
    however this adds some complexity to list walking and store-free path walking,
    so I prefer to implement this at a later date, if it is shown to be a win in
    real situations. I haven't found a regression in any non-micro benchmark so I
    doubt it will be a problem.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

18 Dec, 2010

1 commit


11 Dec, 2010

1 commit


13 Nov, 2010

1 commit


31 Oct, 2010

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    isdn: mISDN: socket: fix information leak to userland
    netdev: can: Change mail address of Hans J. Koch
    pcnet_cs: add new_id
    net: Truncate recvfrom and sendto length to INT_MAX.
    RDS: Let rds_message_alloc_sgs() return NULL
    RDS: Copy rds_iovecs into kernel memory instead of rereading from userspace
    RDS: Clean up error handling in rds_cmsg_rdma_args
    RDS: Return -EINVAL if rds_rdma_pages returns an error
    net: fix rds_iovec page count overflow
    can: pch_can: fix section mismatch warning by using a whitelisted name
    can: pch_can: fix sparse warning
    netxen_nic: Fix the tx queue manipulation bug in netxen_nic_probe
    ip_gre: fix fallback tunnel setup
    vmxnet: trivial annotation of protocol constant
    vmxnet3: remove unnecessary byteswapping in BAR writing macros
    ipv6/udp: report SndbufErrors and RcvbufErrors
    phy/marvell: rename 88ec048 to 88e1318s and fix mscr1 addr

    Linus Torvalds
     
  • Signed-off-by: Linus Torvalds
    Signed-off-by: David S. Miller

    Linus Torvalds
     

29 Oct, 2010

1 commit