30 Aug, 2013

2 commits


27 Aug, 2013

1 commit


21 Aug, 2013

1 commit

  • getsockopt PACKET_STATISTICS returns tp_packets + tp_drops. Commit
    ee80fbf301 ("packet: account statistics only in tpacket_stats_u")
    cleaned up the getsockopt PACKET_STATISTICS code.
    This also changed semantics. Historically, tp_packets included
    tp_drops on return. The commit removed the line that adds tp_drops
    into tp_packets.

    This patch reinstates the old semantics.

    Signed-off-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Willem de Bruijn
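
The reinstated semantics can be sketched in user space as follows. This is only an illustration: `struct stats_sketch`, `report_stats()` and `delivered()` are hypothetical names that mirror the two counters named above, not the kernel's code.

```c
/* Hypothetical mirror of the two counters discussed in the commit. */
struct stats_sketch {
    unsigned int tp_packets;
    unsigned int tp_drops;
};

/* What the getsockopt path reports under the old (reinstated) semantics:
 * tp_packets includes the dropped packets. */
static struct stats_sketch report_stats(unsigned int queued,
                                        unsigned int dropped)
{
    struct stats_sketch st;

    st.tp_packets = queued + dropped;
    st.tp_drops = dropped;
    return st;
}

/* A reader that wants only the packets actually queued subtracts drops. */
static unsigned int delivered(const struct stats_sketch *st)
{
    return st->tp_packets - st->tp_drops;
}
```

A monitoring tool written against the historical behaviour would compute its delivered count this way, which is why the semantic change mattered.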
     

10 Aug, 2013

1 commit

  • Adding paged frags skbs to af_unix sockets introduced a performance
    regression on large sends because of additional page allocations, even
    if each skb could carry at least 100% more payload than before.

    We can instruct sock_alloc_send_pskb() to attempt high order
    allocations.

    Most of the time, it does a single page allocation instead of 8.

    I added an additional parameter to sock_alloc_send_pskb() to
    let other users opt in to this new feature in follow-up patches.

    Tested:

    Before patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    2304 212992 212992 10.00 46861.15

    After patch :

    $ netperf -t STREAM_STREAM
    STREAM STREAM TEST
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    2304 212992 212992 10.00 57981.11

    Signed-off-by: Eric Dumazet
    Cc: David Rientjes
    Signed-off-by: David S. Miller

    Eric Dumazet
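
The "single page allocation instead of 8" remark can be illustrated with a small counting sketch. It assumes 4 KiB pages and a greedy choice of the largest order that still fits; `count_allocs()` and `SKETCH_PAGE_SIZE` are hypothetical names, and the real allocator logic in sock_alloc_send_pskb() may differ.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Count the allocations needed to back `len` bytes of paged data when a
 * single allocation may cover up to 2^max_order contiguous pages. */
static unsigned int count_allocs(size_t len, unsigned int max_order)
{
    size_t npages = (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    unsigned int allocs = 0;

    while (npages) {
        unsigned int order = max_order;

        /* Fall back to the largest order that does not exceed the rest. */
        while (order && ((size_t)1 << order) > npages)
            order--;
        npages -= (size_t)1 << order;
        allocs++;
    }
    return allocs;
}
```

With `max_order` 0, a 32 KiB payload needs eight allocations; with `max_order` 3 it needs one.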
     

08 Aug, 2013

1 commit

  • This reverts commits:

    0f75b09c798ed00c30d7d5551b896be883bc2aeb
    cbd89acb9eb257ed3b2be867142583fdcf7fdc5b
    c483e02614551e44ced3fe6eedda8e36d3277ccc

    Amongst other things, it modifies the SKB header
    to pull the ethernet headers off via eth_type_trans()
    on the output path, which is bogus.

    It's causing serious regressions for people.

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Aug, 2013

3 commits

  • For ethernet frames, eth_type_trans() already parses the header, so one
    can skip this when checking the frame size.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     
  • Since tpacket_fill_skb() parses the protocol field in ethernet frames'
    headers, it's easy to see if any passed frame is a VLAN one and account
    for the extended size.

    But as the real protocol does not turn up before tpacket_fill_skb()
    runs which in turn also checks the frame length, move the max frame
    length calculation into the function.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
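
A minimal sketch of the size check described above, assuming the protocol value is already in host byte order. The `SK_`-prefixed constants are local mirrors of the usual Ethernet values, and `max_frame_len()` and `frame_fits()` are hypothetical helpers, not the functions touched by the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_ETH_HLEN     14      /* Ethernet header length */
#define SK_VLAN_HLEN    4       /* extra bytes of an 802.1Q tag */
#define SK_ETH_P_8021Q  0x8100  /* 802.1Q ethertype, host byte order */

/* Largest acceptable frame for a given MTU and parsed protocol. */
static size_t max_frame_len(size_t mtu, uint16_t proto)
{
    size_t max = mtu + SK_ETH_HLEN;

    if (proto == SK_ETH_P_8021Q)
        max += SK_VLAN_HLEN;    /* account for the VLAN header */
    return max;
}

static bool frame_fits(size_t len, size_t mtu, uint16_t proto)
{
    return len <= max_frame_len(mtu, proto);
}
```

A 1518-byte frame on a 1500-byte MTU passes only when it carries the VLAN ethertype, which is the case the commit wants to allow.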
     
  • This may be necessary when the SKB is passed to other layers on the go,
    which check the protocol field on their own. An example is a VLAN packet
    sent out using AF_PACKET on a bridge interface. The bridging code checks
    the SKB size, accounting for any VLAN header only if the protocol field
    is set accordingly.

    Note that eth_type_trans() sets skb->dev to the passed argument, so this
    can be skipped in packet_snd() for ethernet frames, as well.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     

23 Jul, 2013

1 commit


20 Jun, 2013

1 commit

  • Conflicts:
    drivers/net/wireless/ath/ath9k/Kconfig
    drivers/net/xen-netback/netback.c
    net/batman-adv/bat_iv_ogm.c
    net/wireless/nl80211.c

    The ath9k Kconfig conflict was a change of a Kconfig option name right
    next to the deletion of another option.

    The xen-netback conflict was overlapping changes involving the
    handling of the notify list in xen_netbk_rx_action().

    Batman conflict resolution provided by Antonio Quartulli, basically
    keep everything in both conflict hunks.

    The nl80211 conflict is a little more involved. In 'net' we added a
    dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
    Linus reported. Meanwhile in 'net-next' the handlers were converted
    to use pre and post doit handlers which use a flag to determine
    whether to hold the RTNL mutex around the operation.

    However, the dump handlers do not use this logic. Instead they have
    to explicitly do the locking. There were apparent bugs in the
    conversion of nl80211_dump_wiphy() in that we were not dropping the
    RTNL mutex in all the return paths, and it seems we very much should
    be doing so. So I fixed that whilst handling the overlapping changes.

    To simplify the initial returns, I take the RTNL mutex after we try
    to allocate 'tb'.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Jun, 2013

1 commit

  • uaddr->sa_data is exactly of size 14, which is hard-coded here and
    passed as a size argument to strncpy(). A device name can be of size
    IFNAMSIZ (== 16), meaning we might leave the destination string
    unterminated. Thus, use strlcpy() and also sizeof() while we're
    at it. We need to memset the data area beforehand, since strlcpy
    does not pad the remaining buffer with zeroes for user space, so
    that we do not possibly leak anything.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
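
The zero-then-copy pattern can be sketched in portable user-space C. Since strlcpy() is not available in every libc, this sketch uses memset() plus a bounded memcpy() to get the same result; `copy_dev_name()` and `SA_DATA_LEN` are hypothetical names.

```c
#include <string.h>

#define SA_DATA_LEN 14  /* size of sockaddr.sa_data, as stated above */

/* Copy a device name into a 14-byte buffer so that the result is always
 * NUL-terminated and every unused byte is zero (nothing stale leaks). */
static void copy_dev_name(char dst[SA_DATA_LEN], const char *name)
{
    size_t n = strlen(name);

    if (n > SA_DATA_LEN - 1)
        n = SA_DATA_LEN - 1;        /* truncate, keep room for the NUL */
    memset(dst, 0, SA_DATA_LEN);    /* clear the whole area first */
    memcpy(dst, name, n);
}
```

A 15-character name (the longest that fits IFNAMSIZ) is truncated to 13 characters plus the terminator rather than left unterminated.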
     

29 May, 2013

1 commit

    So far, only net_device * could be passed along with a netdevice notifier
    event. This patch makes it possible to pass a custom structure
    that provides the info an event listener needs to know.

    Signed-off-by: Jiri Pirko

    v2->v3: fix typo on simeth
    shortened dev_getter
    shortened notifier_info struct name
    v1->v2: fix notifier_call parameter in call_netdevice_notifier()
    Signed-off-by: David S. Miller

    Jiri Pirko
     

04 May, 2013

1 commit

  • Jakub reported that it is fairly easy to trigger the BUG() macro
    from user space with TPACKET_V3's RX_RING by just giving a wrong
    header status flag. We already had a similar situation in commit
    7f5c3e3a80e6654 (``af_packet: remove BUG statement in
    tpacket_destruct_skb'') where this was the case in the TX_RING
    side that could be triggered from user space. So really, don't use
    BUG() or BUG_ON() unless there's really no way out, and in particular
    don't use it for consistency checking when there's user space
    involved, no excuses, especially not if you're slapping the user
    with WARN + dump_stack + BUG all at once. The two functions are
    of concern:

    prb_retire_current_block() [when block status != TP_STATUS_KERNEL]
    prb_open_block() [when block_status != TP_STATUS_KERNEL]

    Calls to prb_open_block() are guarded by earlier checks if block_status
    is really TP_STATUS_KERNEL (racy!), but the first BUG() is easily
    triggerable from user space. The system still behaves stably after they are
    removed. Also remove that yoda condition entirely, since it's already
    guarded.

    Reported-by: Jakub Zawadzki
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Apr, 2013

3 commits


25 Apr, 2013

5 commits

  • Currently, packet_sock has a struct tpacket_stats stats member for
    TPACKET_V1 and TPACKET_V2 statistic accounting, and with TPACKET_V3
    ``union tpacket_stats_u stats_u'' was introduced, where however only
    statistics for TPACKET_V3 are held, and when copied to user space,
    TPACKET_V3 does some hackery and also accesses tpacket_stats' stats,
    although everything could have been done within the union itself.

    Unify accounting within the tpacket_stats_u union so that we can
    remove 8 bytes from packet_sock that are there unnecessarily. Note that
    even if we switch to TPACKET_V3 and would use non mmap(2)ed option,
    this still works due to the union with same types + offsets, that are
    exposed to the user space.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
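
The "same types + offsets" argument can be sketched with a small union. The struct and union names below are hypothetical mirrors of the layout described above (two shared counters, plus one extra V3 counter); they are not copied from the kernel headers.

```c
#include <stddef.h>

/* Hypothetical mirrors of the V1/V2 and V3 statistics layouts. */
struct stats_v1 {
    unsigned int tp_packets;
    unsigned int tp_drops;
};

struct stats_v3 {
    unsigned int tp_packets;
    unsigned int tp_drops;
    unsigned int tp_freeze_q_cnt;
};

union stats_u {
    struct stats_v1 stats1;
    struct stats_v3 stats3;
};

/* Accounting through one member is visible through the other, because
 * the shared counters sit at identical offsets with identical types. */
static void account_drop(union stats_u *u)
{
    u->stats1.tp_drops++;
}
```

Because both members start with the same two fields, a single copy of the counters serves every TPACKET version.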
     
  • There's a 4 byte hole in packet_ring_buffer structure before
    prb_bdqc, that can be filled with 'pending' member, thus we can
    reduce the overall structure size from 224 bytes to 216 bytes.
    This also has the side-effect, that in struct packet_sock 2*4 byte
    holes after the embedded packet_ring_buffer members are removed,
    and overall, packet_sock can be reduced by 1 cacheline:

    Before: size: 1344, cachelines: 21, members: 24
    After: size: 1280, cachelines: 20, members: 24

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
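
The effect of filling a hole can be reproduced with two toy structures. The member names are hypothetical and the layouts are far smaller than packet_ring_buffer; on a typical LP64 target the first struct has a 4-byte hole after `head` and 4 bytes of tail padding, while the second has neither.

```c
#include <stddef.h>

/* Pointer, int, pointer, int: the ints are each followed by padding
 * on targets where pointers are 8 bytes and ints are 4 bytes. */
struct ring_holey {
    void *vec;
    unsigned int head;
    void *bdqc;
    unsigned int pending;
};

/* Moving `pending` next to `head` fills the hole. */
struct ring_packed {
    void *vec;
    unsigned int head;
    unsigned int pending;
    void *bdqc;
};

static size_t bytes_saved(void)
{
    return sizeof(struct ring_holey) - sizeof(struct ring_packed);
}
```

On targets where pointers and ints have the same size there is no hole to begin with, so `bytes_saved()` is simply 0 there.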
     
  • Currently, there is no way to find out which timestamp is reported in
    tpacket{,2,3}_hdr's tp_sec, tp_{n,u}sec members. It can be one of
    SOF_TIMESTAMPING_SYS_HARDWARE, SOF_TIMESTAMPING_RAW_HARDWARE,
    SOF_TIMESTAMPING_SOFTWARE, or a fallback variant late call from the
    PF_PACKET code in software.

    Therefore, report in the tp_status member of the ring buffer which
    timestamp has been reported for RX and TX path. This should not break
    anything for the following reasons: i) in RX ring path, the user needs
    to test for tp_status & TP_STATUS_USER, and later for other flags as
    well such as TP_STATUS_VLAN_VALID et al, so adding other flags will
    do no harm; ii) in TX ring path, time stamps with PACKET_TIMESTAMP
    socketoption are not available resp. had no effect except that the
    application setting this is buggy. Next to TP_STATUS_AVAILABLE, the
    user also should check for other flags such as TP_STATUS_WRONG_FORMAT
    to reclaim frames to the application. Thus, in case TX ts are turned
    off (default case), nothing happens to the application logic, and in
    case we want to use this new feature, we now can also check which of
    the ts source is reported in the status field as provided in the docs.

    Reported-by: Richard Cochran
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
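
On the reader side, checking which timestamp source was reported might look like the sketch below. The `SK_`-prefixed bit values are local mirrors of what linux/if_packet.h is believed to define for the TP_STATUS_TS_* flags and should be treated as assumptions; `ts_source()` and `enum ts_src` are hypothetical.

```c
#include <stdint.h>

/* Assumed bit positions, mirrored locally for a self-contained sketch. */
#define SK_TP_STATUS_USER             (1u << 0)
#define SK_TP_STATUS_TS_SOFTWARE      (1u << 29)
#define SK_TP_STATUS_TS_SYS_HARDWARE  (1u << 30)
#define SK_TP_STATUS_TS_RAW_HARDWARE  (1u << 31)

enum ts_src { TS_NONE, TS_SOFTWARE, TS_SYS_HARDWARE, TS_RAW_HARDWARE };

/* Decode the timestamp source from a frame's tp_status word.
 * Other status bits (e.g. the USER bit) are ignored here. */
static enum ts_src ts_source(uint32_t tp_status)
{
    if (tp_status & SK_TP_STATUS_TS_RAW_HARDWARE)
        return TS_RAW_HARDWARE;
    if (tp_status & SK_TP_STATUS_TS_SYS_HARDWARE)
        return TS_SYS_HARDWARE;
    if (tp_status & SK_TP_STATUS_TS_SOFTWARE)
        return TS_SOFTWARE;
    return TS_NONE;
}
```

An application that only tests `tp_status & TP_STATUS_USER` keeps working unchanged, since the extra bits do not affect that test.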
     
  • Currently, we only have software timestamping for the TX ring buffer
    path, but this limitation stems rather from the implementation. By
    just reusing tpacket_get_timestamp(), we can also allow hardware
    timestamping just as in the RX path.

    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • When transmit timestamping is enabled at the socket level, record a
    timestamp on packets written to a PACKET_TX_RING. Tx timestamps are
    always looped to the application over the socket error queue. Software
    timestamps are also written back into the packet frame header in the
    packet ring.

    Reported-by: Paul Chavent
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

20 Apr, 2013

1 commit

  • This patch introduces a small, internal helper function, that is used by
    PF_PACKET. Based on the flags that are passed, it extracts the packet
    timestamp in the receive path. This is merely a refactoring to remove
    some duplicate code in tpacket_rcv(), to make it more readable, and to
    enable others to use this function in PF_PACKET as well, e.g. for TX.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

17 Apr, 2013

1 commit


15 Apr, 2013

1 commit

  • Currently, sock_tx_timestamp() always returns 0. The comment that
    describes the sock_tx_timestamp() function wrongly says that it
    returns an error when an invalid argument is passed (from commit
    20d4947353be, ``net: socket infrastructure for SO_TIMESTAMPING'').
    Make the function void, so that we can also remove all the unneeded
    if conditions that check for such a _non-existent_ error case in the
    output path.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Mar, 2013

1 commit

    Switch to use the new helper skb_probe_transport_header() to do the l4 header
    probing for untrusted sources. For packets with partial csum, the header should
    already have been set by skb_partial_csum_set().

    Cc: Eric Dumazet
    Signed-off-by: Jason Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jason Wang
     

27 Mar, 2013

1 commit

  • Set the transport header for 1) some drivers (e.g. ixgbe needs l4 header to do
    atr) 2) precise packet length estimation (introduced in 1def9238) needs l4
    header to compute header length.

    So this patch first tries to get l4 header for packet socket through
    skb_flow_dissect(), and pretend no l4 header if skb_flow_dissect() fails.

    Cc: Eric Dumazet
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

20 Mar, 2013

1 commit

  • Changes:
    v3->v2: rebase (no other changes)
    passes selftest
    v2->v1: read f->num_members only once
    fix bug: test rollover mode + flag

    Minimize packet drop in a fanout group. If one socket is full,
    roll over packets to another from the group. Maintain flow
    affinity during normal load using an rxhash fanout policy, while
    dispersing unexpected traffic storms that hit a single cpu, such
    as spoofed-source DoS flows. Rollover breaks affinity for flows
    arriving at saturated sockets during those conditions.

    The patch adds a fanout policy ROLLOVER that rotates between sockets,
    filling each socket before moving to the next. It also adds a fanout
    flag ROLLOVER. If passed along with any other fanout policy, the
    primary policy is applied until the chosen socket is full. Then,
    rollover selects another socket, to delay packet drop until the
    entire system is saturated.

    Probing sockets is not free. Selecting the last used socket, as
    rollover does, is a greedy approach that maximizes chance of
    success, at the cost of extreme load imbalance. In practice, with
    sufficiently long queues to absorb bursts, sockets are drained in
    parallel and load balance looks uniform in `top`.

    To avoid contention, scales counters with number of sockets and
    accesses them lockfree. Values are bounds checked to ensure
    correctness.

    Tested using an application with 9 threads pinned to CPUs, one socket
    per thread and sufficient busywork per packet operation to limit each
    thread to handling 32 Kpps. When sent 500 Kpps single UDP stream
    packets, a FANOUT_CPU setup processes 32 Kpps in total without this
    patch, 270 Kpps with the patch. Tested with read() and with a packet
    ring (V1).

    Also, passes psock_fanout.c unit test added to selftests.

    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
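
The rollover flag described above can be sketched as a selection function. This is a simplified user-space model under stated assumptions: `full[]` stands in for the kernel's "socket has no room" check, `last` remembers the socket rollover picked most recently, and `pick_with_rollover()` is a hypothetical name.

```c
#include <stdbool.h>

/* Return the socket index a packet should go to.
 * `primary` is the index chosen by the primary fanout policy. */
static unsigned int pick_with_rollover(const bool *full, unsigned int n,
                                       unsigned int primary,
                                       unsigned int *last)
{
    unsigned int i, idx;

    if (!full[primary])
        return primary;             /* normal case: keep flow affinity */

    /* Primary is saturated: probe from the last rollover target on. */
    for (i = 0; i < n; i++) {
        idx = (*last + i) % n;
        if (!full[idx]) {
            *last = idx;
            return idx;
        }
    }
    return primary;                 /* every socket is full: drop here */
}
```

Starting the probe at the last used socket is the greedy choice the message mentions: it usually succeeds on the first try, at the cost of filling one socket before moving on.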
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
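
A three-argument iterator of the kind argued for above can be sketched in user space. The macros below are a simplified illustration, not the kernel's definitions: the names `hnode`, `hhead`, `entry_or_null` and `for_each_entry` are hypothetical, and the code relies on the GNU `__typeof__` extension (supported by GCC and Clang).

```c
#include <stddef.h>

struct hnode { struct hnode *next; };
struct hhead { struct hnode *first; };

/* Map a node pointer back to its containing object, or NULL at the end. */
#define entry_or_null(ptr, type, member) \
    ((ptr) ? (type *)((char *)(ptr) - offsetof(type, member)) : NULL)

/* Only the cursor, the head and the member name: no extra node cursor. */
#define for_each_entry(pos, head, member)                                  \
    for (pos = entry_or_null((head)->first, __typeof__(*(pos)), member);   \
         pos;                                                              \
         pos = entry_or_null((pos)->member.next, __typeof__(*(pos)), member))

struct item {
    int value;
    struct hnode link;
};

static int sum_items(struct hhead *head)
{
    struct item *it;
    int sum = 0;

    for_each_entry(it, head, link)
        sum += it->value;
    return sum;
}
```

The object pointer itself serves as the loop cursor, so call sites look the same as with the plain list iterator.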
     

19 Feb, 2013

2 commits

  • proc_net_remove is only used to remove proc entries
    that are under /proc/net; it's not a general function for
    removing proc entries of a netns. If we want to remove
    some proc entries which are under /proc/net/stat/, we still
    need to call remove_proc_entry.

    This patch uses remove_proc_entry to replace proc_net_remove.
    We can remove proc_net_remove after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     
  • Right now, some modules such as bonding use proc_create
    to create proc entries under /proc/net/, and other modules
    such as ipv4 use proc_net_fops_create.

    It looks a little chaotic. This patch changes all of
    proc_net_fops_create to proc_create. We can remove
    proc_net_fops_create after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     

04 Feb, 2013

1 commit

  • When releasing a packet socket, the routine packet_set_ring() is reused
    to free rings instead of allocating them. But when calling it for the
    first time, it fills req->tp_block_nr with the value of rb->pg_vec_len
    which in the second invocation makes it bail out since req->tp_block_nr
    is greater than zero but req->tp_block_size is zero.

    This patch solves the problem by passing a zeroed auto-variable to
    packet_set_ring() upon each invocation from packet_release().

    As far as I can tell, this issue has existed ever since 69e3c75 (net: TX_RING
    and packet mmap), i.e. the original inclusion of TX ring support into
    af_packet, but applies only to sockets with both RX and TX ring
    allocated, which is probably why this went unnoticed all the time.

    Signed-off-by: Phil Sutter
    Cc: Johann Baudy
    Cc: Daniel Borkmann
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Phil Sutter
     

19 Nov, 2012

1 commit

  • Allow an unprivileged user who has created a user namespace, and then
    created a network namespace to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
    CAP_NET_ADMIN), or ns_capable(net->user_ns, CAP_NET_RAW) calls.

    Allow creation of af_key sockets.
    Allow creation of llc sockets.
    Allow creation of af_packet sockets.

    Allow sending xfrm netlink control messages.

    Allow binding to netlink multicast groups.
    Allow sending to netlink multicast groups.
    Allow adding and dropping netlink multicast groups.
    Allow sending to all netlink multicast groups and port ids.

    Allow reading the netfilter SO_IP_SET socket option.
    Allow sending netfilter netlink messages.
    Allow setting and getting ip_vs netfilter socket options.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

08 Nov, 2012

1 commit

  • The tx data offset of packet mmap tx ring used to be :
    (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))

    The problem is that, with SOCK_RAW socket, the payload (14 bytes after
    the beginning of the user data) is misaligned.

    This patch lets the user give an offset for its tx data if
    desired.

    Set sock option PACKET_TX_HAS_OFF to 1, then specify in each frame of
    your tx ring tp_net for SOCK_DGRAM, or tp_mac for SOCK_RAW.

    Signed-off-by: Paul Chavent
    Signed-off-by: David S. Miller

    Paul Chavent
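
The misalignment and a possible choice of offset can be worked through with a small helper. This is an arithmetic sketch only: `aligned_tx_offset()` is a hypothetical function, and the value 32 used in the note below for the default data offset is an assumption about TPACKET2_HDRLEN - sizeof(struct sockaddr_ll), not something stated in the commit.

```c
#include <stddef.h>

/* Smallest frame offset >= min_off at which a link-layer header of
 * l2_hdr_len bytes can start so that the payload behind it is aligned
 * to `align` bytes. `align` must be non-zero. */
static size_t aligned_tx_offset(size_t min_off, size_t l2_hdr_len,
                                size_t align)
{
    size_t payload = min_off + l2_hdr_len;
    size_t rem = payload % align;

    if (rem)
        payload += align - rem;     /* round the payload start up */
    return payload - l2_hdr_len;
}
```

With an assumed default offset of 32 and a 14-byte Ethernet header, the payload starts at byte 46, which is not 4-byte aligned; placing the header at offset 34 (the value one would write to tp_mac for SOCK_RAW) moves the payload to byte 48.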
     

26 Oct, 2012

1 commit

  • This tiny patch removes two unused err assignments. In those two cases the
    err variable is either overwritten with another value at a later point in
    time without having read the previous assignment, or it is assigned and the
    function returns without using/reading err after the assignment.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

01 Sep, 2012

1 commit


25 Aug, 2012

1 commit


24 Aug, 2012

1 commit


23 Aug, 2012

1 commit

  • Change since v1:

    * Fixed inuse counters access spotted by Eric

    In patch eea68e2f (packet: Report socket mclist info via diag module) I
    introduced a "scheduling in atomic" problem in the packet diag module: the
    socket list is traversed under rcu_read_lock(), while the sk mclist access
    performed under it requires the rtnl lock (i.e. a mutex) to be taken.

    [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
    [152363.820573] 4 locks held by crtools/12517:
    [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [] sock_diag_rcv+0x1f/0x3e
    [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [] sock_diag_rcv_msg+0xdb/0x11a
    [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x23/0x1ab
    [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [] packet_diag_dump+0x0/0x1af

    A similar thing was then re-introduced by further packet diag patches (fanout
    mutex and pgvec mutex for rings) :(

    Apart from being terribly sorry for the above, I propose to change the packet
    sk list protection from spinlock to mutex. This lock currently protects two
    modifications:

    * sklist
    * prot inuse counters

    The sklist modifications can be just reprotected with mutex since they already
    occur in a sleeping context. The inuse counters modifications are trickier -- the
    __this_cpu_-s are used inside, thus requiring the caller to handle the potential
    issues with contexts himself. Since packet sockets' counters are modified in two
    places only (packet_create and packet_release) we only need to protect the context
    from being preempted. BH disabling is not required in this case.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov