13 Jan, 2012
1 commit
-
commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).We miss needed barriers, even on x86, when y is not NULL.
Signed-off-by: Eric Dumazet
CC: Stephen Hemminger
CC: Paul E. McKenney
Signed-off-by: David S. Miller
07 Jan, 2012
1 commit
-
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1958 commits)
net: pack skb_shared_info more efficiently
net_sched: red: split red_parms into parms and vars
net_sched: sfq: extend limits
cnic: Improve error recovery on bnx2x devices
cnic: Re-init dev->stats_addr after chip reset
net_sched: Bug in netem reordering
bna: fix sparse warnings/errors
bna: make ethtool_ops and strings const
xgmac: cleanups
net: make ethtool_ops const
vmxnet3" make ethtool ops const
xen-netback: make ops structs const
virtio_net: Pass gfp flags when allocating rx buffers.
ixgbe: FCoE: Add support for ndo_get_fcoe_hbainfo() call
netdev: FCoE: Add new ndo_get_fcoe_hbainfo() call
igb: reset PHY after recovering from PHY power down
igb: add basic runtime PM support
igb: Add support for byte queue limits.
e1000: cleanup CE4100 MDIO registers access
e1000: unmap ce4100_gbe_mdio_base_virt in e1000_remove
...
06 Jan, 2012
1 commit
-
We're doing some odd things there, which already messes up various users
(see the net/socket.c code that this removes), and it was going to add
yet more crud to the block layer because of the incorrect error code
translation.ENOIOCTLCMD is not an error return that should be returned to user mode
from the "ioctl()" system call, but it should *not* be translated as
EINVAL ("Invalid argument"). It should be translated as ENOTTY
("Inappropriate ioctl for device").That EINVAL confusion has apparently so permeated some code that the
block layer actually checks for it, which is sad. We continue to do so
for now, but add a big comment about how wrong that is, and we should
remove it entirely eventually. In the meantime, this tries to keep the
changes localized to just the EINVAL -> ENOTTY fix, and removing code
that makes it harder to do the right thing.Signed-off-by: Linus Torvalds
05 Jan, 2012
1 commit
-
Define special location values for RX NFC that request the driver to
select the actual rule location. This allows for implementation on
devices that use hash-based filter lookup, whereas currently the API is
more suited to devices with TCAM lookup or linear search.In ethtool_set_rxnfc() and the compat wrapper ethtool_ioctl(), copy
the structure back to user-space after insertion so that the actual
location is returned.Signed-off-by: Ben Hutchings
Signed-off-by: David S. Miller
23 Nov, 2011
1 commit
-
This patch adds in the infrastructure code to create the network priority
cgroup. The cgroup, in addition to the standard processes file creates two
control files:1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map2) priomap - This is a writeable file. On read it reports a table of 2-tuples
where name is the name of a network interface and priority is
indicates the priority assigned to frames egresessing on the named interface and
originating from a pid in this cgroupThis cgroup allows for skb priority to be set prior to a root qdisc getting
selected. This is benenficial for DCB enabled systems, in that it allows for any
application to use dcb configured priorities so without application modificationSigned-off-by: Neil Horman
Signed-off-by: John Fastabend
CC: Robert Love
CC: "David S. Miller"
Signed-off-by: David S. Miller
10 Nov, 2011
1 commit
-
The 802.1X EAPOL handshake hostapd does requires
knowing whether the frame was ack'ed by the peer.
Currently, we fudge this pretty badly by not even
transmitting the frame as a normal data frame but
injecting it with radiotap and getting the status
out of radiotap monitor as well. This is rather
complex, confuses users (mon.wlan0 presence) and
doesn't work with all hardware.To get rid of that hack, introduce a real wifi TX
status option for data frame transmissions.This works similar to the existing TX timestamping
in that it reflects the SKB back to the socket's
error queue with a SCM_WIFI_STATUS cmsg that has
an int indicating ACK status (0/1).Since it is possible that at some point we will
want to have TX timestamping and wifi status in a
single errqueue SKB (there's little point in not
doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
to SO_EE_ORIGIN_TXSTATUS which can collect more
than just the timestamp; keep the old constant
as an alias of course. Currently the internal APIs
don't make that possible, but it wouldn't be hard
to split them up in a way that makes it possible.Thanks to Neil Horman for helping me figure out
the functions that add the control messages.Signed-off-by: Johannes Berg
Signed-off-by: John W. Linville
22 Sep, 2011
1 commit
-
Conflicts:
MAINTAINERS
drivers/net/Kconfig
drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
drivers/net/ethernet/broadcom/tg3.c
drivers/net/wireless/iwlwifi/iwl-pci.c
drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
drivers/net/wireless/rt2x00/rt2800usb.c
drivers/net/wireless/wl12xx/main.c
25 Aug, 2011
1 commit
-
Dereferencing a user pointer directly from kernel-space without going
through the copy_from_user family of functions is a bad idea. Two of
such usages can be found in the sendmsg code path called from sendmmsg,
added bycommit c71d8ebe7a4496fb7231151cb70a6baa0cb56f9a upstream.
commit 5b47b8038f183b44d2d8ff1c7d11a5c1be706b34 in the 3.0-stable tree.Usages are performed through memcmp() and memcpy() directly. Fix those
by using the already copied msg_sys structure instead of the __user *msg
structure. Note that msg_sys can be set to NULL by verify_compat_iovec()
or verify_iovec(), which requires additional NULL pointer checks.Signed-off-by: Mathieu Desnoyers
Signed-off-by: David Goulet
CC: Tetsuo Handa
CC: Anton Blanchard
CC: David S. Miller
CC: stable
Signed-off-by: David S. Miller
08 Aug, 2011
1 commit
05 Aug, 2011
3 commits
-
The sendmmsg() introduced by commit 228e548e "net: Add sendmmsg socket system
call" is capable of sending to multiple different destination addresses.SMACK is using destination's address for checking sendmsg() permission.
However, security_socket_sendmsg() is called for only once even if multiple
different destination addresses are passed to sendmmsg().Therefore, we need to call security_socket_sendmsg() for each destination
address rather than only the first destination address.Since calling security_socket_sendmsg() every time when only single destination
address was passed to sendmmsg() is a waste of time, omit calling
security_socket_sendmsg() unless destination address of previous datagram and
that of current datagram differs.Signed-off-by: Tetsuo Handa
Acked-by: Anton Blanchard
Cc: stable [3.0+]
Signed-off-by: David S. Miller -
To limit the amount of time we can spend in sendmmsg, cap the
number of elements to UIO_MAXIOV (currently 1024).For error handling an application using sendmmsg needs to retry at
the first unsent message, so capping is simpler and requires less
application logic than returning EINVAL.Signed-off-by: Anton Blanchard
Cc: stable [3.0+]
Signed-off-by: David S. Miller -
sendmmsg uses a similar error return strategy as recvmmsg but it
turns out to be a confusing way to communicate errors.The current code stores the error code away and returns it on the next
sendmmsg call. This means a call with completely valid arguments could
get an error from a previous call.Change things so we only return an error if no datagrams could be sent.
If less than the requested number of messages were sent, the application
must retry starting at the first failed one and if the problem is
persistent the error will be returned.This matches the behaviour of other syscalls like read/write - it
is not an error if less than the requested number of elements are sent.Signed-off-by: Anton Blanchard
Cc: stable [3.0+]
Signed-off-by: David S. Miller
02 Aug, 2011
1 commit
-
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.Convert all rcu_assign_pointer of NULL value.
//smpl
@@ expression P; @@- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)//
Signed-off-by: Stephen Hemminger
Acked-by: Paul E. McKenney
Signed-off-by: David S. Miller
28 Jul, 2011
2 commits
-
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
tg3: Remove 5719 jumbo frames and TSO blocks
tg3: Break larger frags into 4k chunks for 5719
tg3: Add tx BD budgeting code
tg3: Consolidate code that calls tg3_tx_set_bd()
tg3: Add partial fragment unmapping code
tg3: Generalize tg3_skb_error_unmap()
tg3: Remove short DMA check for 1st fragment
tg3: Simplify tx bd assignments
tg3: Reintroduce tg3_tx_ring_info
ASIX: Use only 11 bits of header for data size
ASIX: Simplify condition in rx_fixup()
Fix cdc-phonet build
bonding: reduce noise during init
bonding: fix string comparison errors
net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared
net: add IFF_SKB_TX_SHARED flag to priv_flags
net: sock_sendmsg_nosec() is static
forcedeth: fix vlans
gianfar: fix bug caused by 87c288c6e9aa31720b72e2bc2d665e24e1653c3e
gro: Only reset frag0 when skb can be pulled
... -
Signed-off-by: Eric Dumazet
CC: Anton Blanchard
Signed-off-by: David S. Miller
27 Jul, 2011
1 commit
-
Workloads using pipes and sockets hit inode_sb_list_lock contention.
superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.Signed-off-by: Eric Dumazet
Reviewed-by: Christoph Hellwig
Signed-off-by: Al Viro
21 May, 2011
1 commit
-
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
macvlan: fix panic if lowerdev in a bond
tg3: Add braces around 5906 workaround.
tg3: Fix NETIF_F_LOOPBACK error
macvlan: remove one synchronize_rcu() call
networking: NET_CLS_ROUTE4 depends on INET
irda: Fix error propagation in ircomm_lmp_connect_response()
irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
be2net: Kill set but unused variable 'req' in lancer_fw_download()
irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
isdn: capi: Use pr_debug() instead of ifdefs.
tg3: Update version to 3.119
tg3: Apply rx_discards fix to 5719/5720
...Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
18 May, 2011
2 commits
-
Conflicts:
drivers/net/vmxnet3/vmxnet3_ethtool.c
net/core/dev.c -
recvmmsg fails on a raw socket with EINVAL. The reason for this is
packet_recvmsg checks the incoming flags:err = -EINVAL;
if (flags & ~(MSG_PEEK|MSG_DONTWAIT|MSG_TRUNC|MSG_CMSG_COMPAT|MSG_ERRQUEUE))
goto out;This patch strips out MSG_WAITFORONE when calling recvmmsg which
fixes the issue.Signed-off-by: Anton Blanchard
Cc: stable@kernel.org [2.6.34+]
Signed-off-by: David S. Miller
08 May, 2011
1 commit
-
The rcu callback wq_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(wq_free_rcu).Signed-off-by: Lai Jiangshan
Acked-by: David S. Miller
Signed-off-by: Paul E. McKenney
Reviewed-by: Josh Triplett
06 May, 2011
1 commit
-
This patch adds a multiple message send syscall and is the send
version of the existing recvmmsg syscall. This is heavily
based on the patch by Arnaldo that added recvmmsg.I wrote a microbenchmark to test the performance gains of using
this new syscall:http://ozlabs.org/~anton/junkcode/sendmmsg_test.c
The test was run on a ppc64 box with a 10 Gbit network card. The
benchmark can send both UDP and RAW ethernet packets.64B UDP
batch pkts/sec
1 804570
2 872800 (+ 8 %)
4 916556 (+14 %)
8 939712 (+17 %)
16 952688 (+18 %)
32 956448 (+19 %)
64 964800 (+20 %)64B raw socket
batch pkts/sec
1 1201449
2 1350028 (+12 %)
4 1461416 (+22 %)
8 1513080 (+26 %)
16 1541216 (+28 %)
32 1553440 (+29 %)
64 1557888 (+30 %)We see a 20% improvement in throughput on UDP send and 30%
on raw socket send.[ Add sparc syscall entries. -DaveM ]
Signed-off-by: Anton Blanchard
Signed-off-by: David S. Miller
12 Apr, 2011
2 commits
-
Conflicts:
drivers/net/smsc911x.c -
This change is meant to add an ntuple data extensions to the rx network flow
classification specifiers. The idea is to allow ntuple to be displayed via
the network flow classification interface.The first patch had some left over stuff from the original flow extension
flags I had added. That bit is removed in this patch.The second had some left over comments that stated we ignored bits in the
masks when we actually match them.This work is based on input from Ben Hutchings.
Signed-off-by: Alexander Duyck
Reviewed-by: Ben Hutchings
Signed-off-by: David S. Miller
31 Mar, 2011
1 commit
-
Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: Lucas De Marchi
19 Mar, 2011
1 commit
-
This structure was accidentally defined such that its layout can
differ between 32-bit and 64-bit processes. Add compat structure
definitions and an ioctl wrapper function.Signed-off-by: Ben Hutchings
Acked-by: Alexander Duyck
Cc: stable@kernel.org [2.6.30+]
Signed-off-by: David S. Miller
24 Feb, 2011
1 commit
-
Use __force to quiet sparse warnings for cases where the code
is simulating user space pointers.Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller
23 Feb, 2011
1 commit
-
Add proper RCU annotations/verbs to sk_wq and wq members
Fix __sctp_write_space() sk_sleep() abuse (and sock->wq access)
Fix sunrpc sk_sleep() abuse too
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
13 Jan, 2011
1 commit
-
Signed-off-by: Al Viro
08 Jan, 2011
1 commit
-
…t/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: (57 commits)
fs: scale mntget/mntput
fs: rename vfsmount counter helpers
fs: implement faster dentry memcmp
fs: prefetch inode data in dcache lookup
fs: improve scalability of pseudo filesystems
fs: dcache per-inode inode alias locking
fs: dcache per-bucket dcache hash locking
bit_spinlock: add required includes
kernel: add bl_list
xfs: provide simple rcu-walk ACL implementation
btrfs: provide simple rcu-walk ACL implementation
ext2,3,4: provide simple rcu-walk ACL implementation
fs: provide simple rcu-walk generic_check_acl implementation
fs: provide rcu-walk aware permission i_ops
fs: rcu-walk aware d_revalidate method
fs: cache optimise dentry and inode for rcu-walk
fs: dcache reduce branches in lookup path
fs: dcache remove d_mounted
fs: fs_struct use seqlock
fs: rcu-walk for path lookup
...
07 Jan, 2011
5 commits
-
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.Signed-off-by: Nick Piggin
-
Regardless of how much we possibly try to scale dcache, there is likely
always going to be some fundamental contention when adding or removing children
under the same parent. Pseudo filesystems do not seem need to have connected
dentries because by definition they are disconnected.Signed-off-by: Nick Piggin
-
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin
-
Pseudo filesystems that don't put inode on RCU list or reachable by
rcu-walk dentries do not need to RCU free their inodes.Signed-off-by: Nick Piggin
-
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.Signed-off-by: Nick Piggin
18 Dec, 2010
1 commit
-
Conflicts:
drivers/net/bnx2x/bnx2x.h
drivers/net/wireless/iwlwifi/iwl-1000.c
drivers/net/wireless/iwlwifi/iwl-6000.c
drivers/net/wireless/iwlwifi/iwl-core.h
drivers/vhost/vhost.c
11 Dec, 2010
1 commit
-
Signed-off-by: Martin Lucina
Signed-off-by: David S. Miller
13 Nov, 2010
1 commit
-
Use modern RCU API / annotations for net_families array.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
31 Oct, 2010
2 commits
-
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
isdn: mISDN: socket: fix information leak to userland
netdev: can: Change mail address of Hans J. Koch
pcnet_cs: add new_id
net: Truncate recvfrom and sendto length to INT_MAX.
RDS: Let rds_message_alloc_sgs() return NULL
RDS: Copy rds_iovecs into kernel memory instead of rereading from userspace
RDS: Clean up error handling in rds_cmsg_rdma_args
RDS: Return -EINVAL if rds_rdma_pages returns an error
net: fix rds_iovec page count overflow
can: pch_can: fix section mismatch warning by using a whitelisted name
can: pch_can: fix sparse warning
netxen_nic: Fix the tx queue manipulation bug in netxen_nic_probe
ip_gre: fix fallback tunnel setup
vmxnet: trivial annotation of protocol constant
vmxnet3: remove unnecessary byteswapping in BAR writing macros
ipv6/udp: report SndbufErrors and RcvbufErrors
phy/marvell: rename 88ec048 to 88e1318s and fix mscr1 addr -
Signed-off-by: Linus Torvalds
Signed-off-by: David S. Miller
29 Oct, 2010
1 commit
-
Signed-off-by: Al Viro