14 Sep, 2013

1 commit

  • [ Upstream commit 8bcdeaff5ed544704a9a691d4aef0adb3f9c5b8f ]

    getsockopt PACKET_STATISTICS returns tp_packets + tp_drops. Commit
    ee80fbf301 ("packet: account statistics only in tpacket_stats_u")
    cleaned up the getsockopt PACKET_STATISTICS code.
    This also changed semantics. Historically, tp_packets included
    tp_drops on return. The commit removed the line that adds tp_drops
    into tp_packets.

    This patch reinstates the old semantics.

    Signed-off-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
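
The reinstated semantics can be sketched in user-space C. This is a minimal model rather than the kernel code: `struct stats_sketch` and `report_stats()` are hypothetical names that only mirror the tp_packets and tp_drops fields named in the message, and the internal packet counter is assumed to exclude drops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters named in the commit message. */
struct stats_sketch {
    unsigned int tp_packets;  /* packets queued to the socket (assumed to exclude drops) */
    unsigned int tp_drops;    /* packets dropped */
};

/*
 * Build the value reported to user space. With the historical semantics,
 * the reported tp_packets includes tp_drops.
 */
static struct stats_sketch report_stats(const struct stats_sketch *counters)
{
    struct stats_sketch out;

    assert(counters != NULL);
    out = *counters;
    out.tp_packets += out.tp_drops;  /* the addition the cleanup had dropped */
    return out;
}
```

With internal counters of 90 packets and 10 drops, the sketch reports tp_packets as 100 and tp_drops as 10.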
     

13 Jun, 2013

1 commit

  • uaddr->sa_data is exactly of size 14, which is hard-coded here and
    passed as a size argument to strncpy(). A device name can be of size
    IFNAMSIZ (== 16), meaning we might leave the destination string
    unterminated. Thus, use strlcpy() and also sizeof() while we're
    at it. We need to memset the data area beforehand, since strlcpy
    does not pad the remaining buffer with zeroes for user space, so
    that we do not possibly leak anything.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
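
The zero-then-bounded-copy pattern described above can be sketched in portable user-space C. This is an illustration under assumptions, not the kernel patch: `bounded_copy()` is a hypothetical stand-in with strlcpy-like behaviour (strlcpy itself is not available in every C library), and `fill_sa_data()` and `SA_DATA_LEN` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SA_DATA_LEN 14  /* size of sa_data, as stated in the message */

/*
 * Hypothetical strlcpy-like helper: copies at most size - 1 bytes and
 * always NUL-terminates when size > 0. Returns strlen(src).
 */
static size_t bounded_copy(char *dst, const char *src, size_t size)
{
    size_t len;

    assert(dst != NULL && src != NULL);
    len = strlen(src);
    if (size > 0) {
        size_t n = (len >= size) ? size - 1 : len;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Zero the whole area first so no stale bytes remain, then copy. */
static void fill_sa_data(char sa_data[SA_DATA_LEN], const char *dev_name)
{
    memset(sa_data, 0, SA_DATA_LEN);
    bounded_copy(sa_data, dev_name, SA_DATA_LEN);
}
```

A 15-character device name is truncated to 13 characters plus the terminator, and a short name such as "eth0" leaves the remaining bytes zeroed.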
     

04 May, 2013

1 commit

  • Jakub reported that it is fairly easy to trigger the BUG() macro
    from user space with TPACKET_V3's RX_RING by just giving a wrong
    header status flag. We already had a similar situation in commit
    7f5c3e3a80e6654 (``af_packet: remove BUG statement in
    tpacket_destruct_skb'') where this was the case in the TX_RING
    side that could be triggered from user space. So really, don't use
    BUG() or BUG_ON() unless there's really no way out, and in particular
    don't use it for consistency checking when there's user space
    involved, no excuses, especially not if you're slapping the user
    with WARN + dump_stack + BUG all at once. Two functions are of
    concern:

    prb_retire_current_block() [when block status != TP_STATUS_KERNEL]
    prb_open_block() [when block_status != TP_STATUS_KERNEL]

    Calls to prb_open_block() are guarded by earlier checks that block_status
    is really TP_STATUS_KERNEL (racy!), but the BUG() in the first one is easily
    triggerable from user space. The system still behaves stably after they are
    removed. Also remove that yoda condition entirely, since it's already
    guarded.

    Reported-by: Jakub Zawadzki
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Apr, 2013

3 commits


25 Apr, 2013

5 commits

  • Currently, packet_sock has a struct tpacket_stats stats member for
    TPACKET_V1 and TPACKET_V2 statistics accounting. With TPACKET_V3,
    ``union tpacket_stats_u stats_u'' was introduced, but it only holds
    statistics for TPACKET_V3; when copying to user space, TPACKET_V3
    does some hackery and also accesses tpacket_stats' stats,
    although everything could have been done within the union itself.

    Unify accounting within the tpacket_stats_u union so that we can
    remove 8 bytes from packet_sock that are there unnecessarily. Note that
    even if we switch to TPACKET_V3 and use the non-mmap(2)ed option,
    this still works because the union members exposed to user space
    have the same types and offsets.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • There's a 4 byte hole in packet_ring_buffer structure before
    prb_bdqc, that can be filled with 'pending' member, thus we can
    reduce the overall structure size from 224 bytes to 216 bytes.
    This also has the side-effect, that in struct packet_sock 2*4 byte
    holes after the embedded packet_ring_buffer members are removed,
    and overall, packet_sock can be reduced by 1 cacheline:

    Before: size: 1344, cachelines: 21, members: 24
    After: size: 1280, cachelines: 20, members: 24

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Currently, there is no way to find out which timestamp is reported in
    tpacket{,2,3}_hdr's tp_sec, tp_{n,u}sec members. It can be one of
    SOF_TIMESTAMPING_SYS_HARDWARE, SOF_TIMESTAMPING_RAW_HARDWARE,
    SOF_TIMESTAMPING_SOFTWARE, or a late fallback call made in software
    from the PF_PACKET code.

    Therefore, report in the tp_status member of the ring buffer which
    timestamp has been reported for RX and TX path. This should not break
    anything for the following reasons: i) in RX ring path, the user needs
    to test for tp_status & TP_STATUS_USER, and later for other flags as
    well such as TP_STATUS_VLAN_VALID et al, so adding other flags will
    do no harm; ii) in TX ring path, time stamps with the PACKET_TIMESTAMP
    socket option were not available and had no effect, unless the
    application setting it is buggy. Next to TP_STATUS_AVAILABLE, the
    user also should check for other flags such as TP_STATUS_WRONG_FORMAT
    to reclaim frames for the application. Thus, in case TX timestamps are
    turned off (default case), nothing happens to the application logic, and
    in case we want to use this new feature, we now can also check which
    timestamp source is reported in the status field as provided in the docs.

    Reported-by: Richard Cochran
    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
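
A user-space check of the reported timestamp source for an RX frame could look roughly like the sketch below. The `SK_`-prefixed constants are local copies whose values are assumed to match the TP_STATUS_* definitions in linux/if_packet.h, and `enum ts_source` and `ts_source_of()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of status bits; values assumed to match <linux/if_packet.h>. */
#define SK_TP_STATUS_USER            (1u << 0)
#define SK_TP_STATUS_TS_SOFTWARE     (1u << 29)
#define SK_TP_STATUS_TS_SYS_HARDWARE (1u << 30)
#define SK_TP_STATUS_TS_RAW_HARDWARE (1u << 31)

enum ts_source {
    TS_UNREPORTED,    /* no source flag set in tp_status */
    TS_SOFTWARE,
    TS_SYS_HARDWARE,
    TS_RAW_HARDWARE
};

/* Classify the timestamp source of an RX frame from its tp_status word. */
static enum ts_source ts_source_of(uint32_t tp_status)
{
    /* Caller contract: the frame has already been handed to user space. */
    assert(tp_status & SK_TP_STATUS_USER);

    if (tp_status & SK_TP_STATUS_TS_RAW_HARDWARE)
        return TS_RAW_HARDWARE;
    if (tp_status & SK_TP_STATUS_TS_SYS_HARDWARE)
        return TS_SYS_HARDWARE;
    if (tp_status & SK_TP_STATUS_TS_SOFTWARE)
        return TS_SOFTWARE;
    return TS_UNREPORTED;
}
```

Because the source is carried in otherwise unused status bits, existing applications that only test TP_STATUS_USER keep working unchanged.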
     
  • Currently, we only have software timestamping for the TX ring buffer
    path, but this limitation stems rather from the implementation. By
    just reusing tpacket_get_timestamp(), we can also allow hardware
    timestamping just as in the RX path.

    Signed-off-by: Daniel Borkmann
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • When transmit timestamping is enabled at the socket level, record a
    timestamp on packets written to a PACKET_TX_RING. Tx timestamps are
    always looped to the application over the socket error queue. Software
    timestamps are also written back into the packet frame header in the
    packet ring.

    Reported-by: Paul Chavent
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

20 Apr, 2013

1 commit

  • This patch introduces a small, internal helper function, that is used by
    PF_PACKET. Based on the flags that are passed, it extracts the packet
    timestamp in the receive path. This is merely a refactoring to remove
    some duplicate code in tpacket_rcv(), to make it more readable, and to
    enable others to use this function in PF_PACKET as well, e.g. for TX.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

17 Apr, 2013

1 commit


15 Apr, 2013

1 commit

  • Currently, sock_tx_timestamp() always returns 0. The comment that
    describes the sock_tx_timestamp() function wrongly says that it
    returns an error when an invalid argument is passed (from commit
    20d4947353be, ``net: socket infrastructure for SO_TIMESTAMPING'').
    Make the function void, so that we can also remove all the unneeded
    if conditions that check for such a _non-existent_ error case in the
    output path.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Mar, 2013

1 commit

  • Switch to use the new helper skb_probe_transport_header() to do the l4 header
    probing for untrusted sources. For packets with partial csum, the header should
    already have been set by skb_partial_csum_set().

    Cc: Eric Dumazet
    Signed-off-by: Jason Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Jason Wang
     

27 Mar, 2013

1 commit

  • Set the transport header because 1) some drivers need it (e.g. ixgbe needs the
    l4 header to do atr) and 2) precise packet length estimation (introduced in
    1def9238) needs the l4 header to compute the header length.

    So this patch first tries to get the l4 header for the packet socket through
    skb_flow_dissect(), and pretends there is no l4 header if skb_flow_dissect() fails.

    Cc: Eric Dumazet
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

20 Mar, 2013

1 commit

  • Changes:
    v3->v2: rebase (no other changes)
    passes selftest
    v2->v1: read f->num_members only once
    fix bug: test rollover mode + flag

    Minimize packet drop in a fanout group. If one socket is full,
    roll over packets to another from the group. Maintain flow
    affinity during normal load using an rxhash fanout policy, while
    dispersing unexpected traffic storms that hit a single cpu, such
    as spoofed-source DoS flows. Rollover breaks affinity for flows
    arriving at saturated sockets during those conditions.

    The patch adds a fanout policy ROLLOVER that rotates between sockets,
    filling each socket before moving to the next. It also adds a fanout
    flag ROLLOVER. If passed along with any other fanout policy, the
    primary policy is applied until the chosen socket is full. Then,
    rollover selects another socket, to delay packet drop until the
    entire system is saturated.

    Probing sockets is not free. Selecting the last used socket, as
    rollover does, is a greedy approach that maximizes chance of
    success, at the cost of extreme load imbalance. In practice, with
    sufficiently long queues to absorb bursts, sockets are drained in
    parallel and load balance looks uniform in `top`.

    To avoid contention, scales counters with number of sockets and
    accesses them lockfree. Values are bounds checked to ensure
    correctness.

    Tested using an application with 9 threads pinned to CPUs, one socket
    per thread and sufficient busywork per packet operation to limit each
    thread to handling 32 Kpps. When sent a single UDP stream at 500 Kpps,
    a FANOUT_CPU setup processes 32 Kpps in total without this
    patch, 270 Kpps with the patch. Tested with read() and with a packet
    ring (V1).

    Also, passes psock_fanout.c unit test added to selftests.

    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willem de Bruijn
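
The rollover flag described above (apply the primary policy, then move on when the chosen socket is full) can be sketched as follows. This is a simplified model, not the kernel implementation: `pick_with_rollover()` is a hypothetical name, socket fullness is modelled as a plain boolean array, and probing starts at the socket after the primary one, whereas the message says the real code prefers the last used socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the index of the socket that should receive the packet.
 * `primary` is the socket chosen by the primary fanout policy. If it is
 * full, probe the other sockets in order and take the first with room.
 * If every socket is full, stay on the primary one (the packet is then
 * dropped there).
 */
static size_t pick_with_rollover(const bool *full, size_t num, size_t primary)
{
    assert(full != NULL && num > 0 && primary < num);

    if (!full[primary])
        return primary;

    for (size_t step = 1; step < num; step++) {
        size_t i = (primary + step) % num;

        if (!full[i])
            return i;
    }
    return primary;
}
```

In this model flow affinity is preserved as long as the primary socket has room, and it is only broken for flows that arrive at a saturated socket.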
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones. The list iterators look like this:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

19 Feb, 2013

2 commits

  • proc_net_remove is only used to remove proc entries
    under /proc/net; it's not a general function for
    removing proc entries of a netns. If we want to remove
    proc entries under /proc/net/stat/, we still
    need to call remove_proc_entry.

    This patch uses remove_proc_entry to replace proc_net_remove.
    We can remove proc_net_remove after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     
  • Right now, some modules such as bonding use proc_create
    to create proc entries under /proc/net/, and other modules
    such as ipv4 use proc_net_fops_create.

    This looks a little chaotic. This patch changes all
    proc_net_fops_create calls to proc_create. We can remove
    proc_net_fops_create after this patch.

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     

04 Feb, 2013

1 commit

  • When releasing a packet socket, the routine packet_set_ring() is reused
    to free rings instead of allocating them. But when calling it for the
    first time, it fills req->tp_block_nr with the value of rb->pg_vec_len,
    which in the second invocation makes it bail out since req->tp_block_nr
    is greater than zero but req->tp_block_size is zero.

    This patch solves the problem by passing a zeroed auto-variable to
    packet_set_ring() upon each invocation from packet_release().

    As far as I can tell, this issue has existed ever since 69e3c75 (net: TX_RING
    and packet mmap), i.e. the original inclusion of TX ring support into
    af_packet, but applies only to sockets with both RX and TX ring
    allocated, which is probably why it went unnoticed for so long.

    Signed-off-by: Phil Sutter
    Cc: Johann Baudy
    Cc: Daniel Borkmann
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Phil Sutter
     

19 Nov, 2012

1 commit

  • Allow an unprivileged user who has created a user namespace, and then
    created a network namespace to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
    CAP_NET_ADMIN), or ns_capable(net->user_ns, CAP_NET_RAW) calls.

    Allow creation of af_key sockets.
    Allow creation of llc sockets.
    Allow creation of af_packet sockets.

    Allow sending xfrm netlink control messages.

    Allow binding to netlink multicast groups.
    Allow sending to netlink multicast groups.
    Allow adding and dropping netlink multicast groups.
    Allow sending to all netlink multicast groups and port ids.

    Allow reading the netfilter SO_IP_SET socket option.
    Allow sending netfilter netlink messages.
    Allow setting and getting ip_vs netfilter socket options.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

08 Nov, 2012

1 commit

  • The tx data offset of the packet mmap tx ring used to be:
    (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))

    The problem is that, with a SOCK_RAW socket, the payload (14 bytes after
    the beginning of the user data) is misaligned.

    This patch lets the user give an offset for its tx data if
    desired.

    Set the socket option PACKET_TX_HAS_OFF to 1, then specify in each frame of
    your tx ring tp_net for SOCK_DGRAM, or tp_mac for SOCK_RAW.

    Signed-off-by: Paul Chavent
    Signed-off-by: David S. Miller

    Paul Chavent
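
The misalignment can be made concrete with a little arithmetic. The sketch below assumes that the default offset (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) works out to 32 bytes and that the link-layer header is a 14-byte Ethernet header; both numbers are assumptions for illustration, and `payload_misalign()` and `aligned_mac_off()` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; check them against your own headers. */
#define SK_DEFAULT_TX_OFF 32u  /* assumed TPACKET2_HDRLEN - sizeof(struct sockaddr_ll) */
#define SK_ETH_HLEN       14u  /* Ethernet header length */

/* How far the payload following an l2 header at mac_off is from alignment. */
static size_t payload_misalign(size_t mac_off, size_t l2_len, size_t align)
{
    assert(align != 0);
    return (mac_off + l2_len) % align;
}

/* Smallest offset >= min_off at which the payload becomes aligned. */
static size_t aligned_mac_off(size_t min_off, size_t l2_len, size_t align)
{
    size_t rem = payload_misalign(min_off, l2_len, align);

    return rem ? min_off + (align - rem) : min_off;
}
```

Under these assumptions the payload of a SOCK_RAW frame starts at byte 46, two bytes past a 4-byte boundary; a tp_mac value of 34 would move it to byte 48.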
     

26 Oct, 2012

1 commit

  • This tiny patch removes two unused err assignments. In those two cases the
    err variable is either overwritten with another value at a later point in
    time without having read the previous assignment, or it is assigned and the
    function returns without using/reading err after the assignment.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers to portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

01 Sep, 2012

1 commit


25 Aug, 2012

1 commit


24 Aug, 2012

1 commit


23 Aug, 2012

3 commits

  • Change since v1:

    * Fixed inuse counters access spotted by Eric

    In patch eea68e2f (packet: Report socket mclist info via diag module) I
    introduced a "scheduling in atomic" problem in the packet diag module: the
    socket list is traversed under rcu_read_lock(), while the sk mclist access
    performed under it requires the rtnl lock (i.e. a mutex) to be taken.

    [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
    [152363.820573] 4 locks held by crtools/12517:
    [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [] sock_diag_rcv+0x1f/0x3e
    [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [] sock_diag_rcv_msg+0xdb/0x11a
    [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x23/0x1ab
    [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [] packet_diag_dump+0x0/0x1af

    A similar problem was then re-introduced by further packet diag patches (fanout
    mutex and pgvec mutex for rings) :(

    Apart from being terribly sorry for the above, I propose to change the packet
    sk list protection from spinlock to mutex. This lock currently protects two
    modifications:

    * sklist
    * prot inuse counters

    The sklist modifications can be just reprotected with mutex since they already
    occur in a sleeping context. The inuse counters modifications are trickier -- the
    __this_cpu_-s are used inside, thus requiring the caller to handle the potential
    issues with contexts himself. Since packet sockets' counters are modified in two
    places only (packet_create and packet_release) we only need to protect the context
    from being preempted. BH disabling is not required in this case.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Instead of using a hard-coded value for the status variable, it would make
    the code more readable to use its destined define from linux/if_packet.h.

    Signed-off-by: daniel.borkmann@tik.ee.ethz.ch
    Signed-off-by: David S. Miller

    danborkmann@iogearbox.net
     
  • David S. Miller
     

20 Aug, 2012

3 commits

  • If a packet is emitted on one socket in one group of fanout sockets,
    it is transmitted again. It is thus read again on one of the sockets
    of the fanout group. This results in a loop for software which
    generates a packet when receiving one.
    This retransmission is not the intended behavior: a fanout group
    must behave like a single socket. The packet should not be
    transmitted on a socket if it originates from a socket belonging
    to the same fanout group.

    This patch fixes the issue by changing the transmission check to
    take fanout group info into account.

    Reported-by: Aleksandr Kotov
    Signed-off-by: Eric Leblond
    Signed-off-by: David S. Miller

    Eric Leblond
     
  • The reported value is the same as that reported by the FANOUT getsockopt, but
    unlike it, an absent fanout setup results in an absent nlattr, rather
    than in an nlattr with zero value. This is done because a zero fanout
    report may mean either no fanout, or fanout with both id and type zero.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
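
The ambiguity of a zero value can be illustrated with a small model. The packing used here (group id in the low 16 bits, type in the bits above) is an assumption about how the fanout value is laid out, and `struct fanout_cfg`, `struct fanout_report`, `fanout_value()`, `diag_fanout()` and `report_id()` are hypothetical names, not kernel or netlink API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fanout_cfg {
    uint16_t id;    /* fanout group id */
    uint8_t  type;  /* fanout policy */
};

/* Models an optional attribute: absent when no fanout is configured. */
struct fanout_report {
    bool     present;
    uint32_t value;
};

/* Assumed packing: id in bits 0-15, type in the bits above. */
static uint32_t fanout_value(const struct fanout_cfg *cfg)
{
    assert(cfg != NULL);
    return (uint32_t)cfg->id | ((uint32_t)cfg->type << 16);
}

/* Report fanout state; a NULL cfg stands for "no fanout configured". */
static struct fanout_report diag_fanout(const struct fanout_cfg *cfg)
{
    struct fanout_report r = { false, 0 };

    if (cfg != NULL) {
        r.present = true;
        r.value = fanout_value(cfg);
    }
    return r;
}

/* Extract the group id; only meaningful when the attribute is present. */
static uint16_t report_id(struct fanout_report r)
{
    assert(r.present);
    return (uint16_t)(r.value & 0xffffu);
}
```

A group with id 0 and type 0 packs to the value 0, so only the presence of the attribute distinguishes it from a socket with no fanout at all.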
     
  • One extension bit may result in two nlattrs -- one per ring type.
    If some ring type is not configured, then the respective nlattr
    will be empty.

    The structure reported contains the data, that is given to the
    corresponding ring setup socket option.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

15 Aug, 2012

5 commits


13 Aug, 2012

1 commit

  • Here's a quote of the comment about the BUG macro from asm-generic/bug.h:

    Don't use BUG() or BUG_ON() unless there's really no way out; one
    example might be detecting data structure corruption in the middle
    of an operation that can't be backed out of. If the (sub)system
    can somehow continue operating, perhaps with reduced functionality,
    it's probably not BUG-worthy.

    If you're tempted to BUG(), think again: is completely giving up
    really the *only* solution? There are usually better options, where
    users don't need to reboot ASAP and can mostly shut down cleanly.

    In our case, the status flag of a ring buffer slot is managed from both sides,
    the kernel space and the user space. This means that even though the kernel
    side might work as expected, user space can screw up and change this flag
    in the window between send(2) being triggered, when the flag is changed to
    TP_STATUS_SENDING, and the given skb being destructed some time later. This
    will then hit the BUG macro. As David suggested, the best solution is to simply
    remove this statement since it cannot be used for kernel side internal
    consistency checks. I've tested it and the system still behaves /stable/ in
    this case, so in accordance with the above comment, we should rather remove it.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    danborkmann@iogearbox.net