11 Jan, 2017

1 commit


18 Jan, 2015

1 commit

  • Contrary to common expectations for an "int" return, these functions
    return only a positive value -- if used correctly they cannot even
    return 0 because the message header will necessarily be in the skb.

    This makes the very common pattern of

    if (genlmsg_end(...) < 0) { ... }

    be a whole bunch of dead code. Many places also simply do

    return nlmsg_end(...);

    and the caller is expected to deal with it.

    This also commonly (at least for me) causes errors, because it is very
    common to write

    if (my_function(...))
    /* error condition */

    and if my_function() does "return nlmsg_end()" this is of course wrong.

    Additionally, there's not a single place in the kernel that actually
    needs the message length returned, and if anyone needs it later then
    it'll be very easy to just use skb->len there.

    Remove this, and make the functions void. This removes a bunch of dead
    code as described above. The patch adds lines because I did

    - return nlmsg_end(...);
    + nlmsg_end(...);
    + return 0;

    I could have preserved all the functions' return values by returning
    skb->len, but instead I've audited all the places calling the affected
    functions and found that none cared. A few places actually compared the
    return value with < 0, with no change in behaviour either way, so I
    opted for the more efficient version.

    One instance of the error I've made numerous times now is also present
    in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
    check for

    Signed-off-by: David S. Miller

    Johannes Berg
     

25 Apr, 2014

2 commits

  • By passing a netlink socket to a more privileged executable and then
    fooling that executable into writing data to the socket that happens
    to be a valid netlink message, it is possible to make the privileged
    executable do something it did not intend to do.

    To keep this from happening, replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
    which act the same as the previous calls except that they also verify
    that the opener of the socket had the desired permissions.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The permission check in sock_diag_put_filterinfo is wrong, and it is so
    removed from its sources that it is not clear why it is wrong. Move the
    computation into packet_diag_dump and pass a bool with the result into
    sock_diag_filterinfo.

    This does not yet correct the capability check, but simply moves it to
    make it clear what is going on.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

23 Apr, 2014

1 commit


17 Jan, 2014

1 commit

  • In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
    and one atomic_dec() call in skb destructor and use a percpu
    reference count instead in order to determine if packets are
    still pending to be sent out. Micro-benchmark with [1] that has
    been slightly modified (that is, protocol = 0 in socket(2) and
    bind(2)), example on a rather crappy testing machine; I expect
    it to scale and have even better results on bigger machines:

    ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

    With patch: 4,022,015 cyc
    Without patch: 4,812,994 cyc

    time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

    With patch:
    real 1m32.241s
    user 0m0.287s
    sys 1m29.316s

    Without patch:
    real 1m38.386s
    user 0m0.265s
    sys 1m35.572s

    In function tpacket_snd(), it is okay to use packet_read_pending()
    since in the fast path we short-circuit the condition already with
    ph != NULL, since we have next frames to process. In case we have
    MSG_DONTWAIT, we also do not execute this path as need_wait is
    false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
    okay to call packet_read_pending(), because whenever we reach
    that path, we're done processing outgoing frames anyway and only
    look if there are skbs still outstanding to be orphaned. We can
    stay lockless in this percpu counter since it's acceptable for the
    sum to be imprecise at first when we reach this path, but we'll
    level out at 0 after all pending frames have reached the skb
    destructor eventually through tx reclaim.

    When people pin a tx process to particular CPUs, we expect overflows
    to happen in the reference counter: on one CPU we expect a heavy
    increase, and, distributed through ksoftirqd on all CPUs, a decrease,
    for example. As David Laight points out, since the C language doesn't
    define the result of signed int overflow (i.e. rather than wrap, it is
    allowed to saturate as a possible outcome), we have to use an
    unsigned int as the reference count. The sum over all CPUs when tx
    is complete will then be 0 again.

    The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
    can _only_ be set from inside tpacket_snd() path and we made sure
    to increase tx_ring.pending in any case before we called po->xmit(skb).
    So testing for tx_ring.pending == 0 is not too useful. Instead, it
    would rather have been useful to test if lower layers didn't orphan
    the skb so that we're missing ring slots being put back to
    TP_STATUS_AVAILABLE. But such a bug will be caught in user space
    already as we end up realizing that we do not have any
    TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.

    Btw, in case of RX_RING path, we do not make use of the pending
    member, therefore we also don't need to use up any percpu memory
    here. Also note that __alloc_percpu() already returns a zero-filled
    percpu area, so initialization is done already.

    [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Apr, 2013

3 commits


28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones. While the list iterator is nice and
    simple:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

23 Aug, 2012

1 commit

  • Change since v1:

    * Fixed inuse counters access spotted by Eric

    In patch eea68e2f (packet: Report socket mclist info via diag module) I've
    introduced a "scheduling in atomic" problem in the packet diag module -- the
    socket list is traversed under rcu_read_lock(), while the sk mclist access
    performed under it requires the rtnl lock (i.e. -- a mutex) to be taken.

    [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
    [152363.820573] 4 locks held by crtools/12517:
    [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [] sock_diag_rcv+0x1f/0x3e
    [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [] sock_diag_rcv_msg+0xdb/0x11a
    [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x23/0x1ab
    [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [] packet_diag_dump+0x0/0x1af

    A similar thing was then re-introduced by further packet diag patches
    (fanout mutex and pgvec mutex for rings) :(

    Apart from being terribly sorry for the above, I propose to change the packet
    sk list protection from spinlock to mutex. This lock currently protects two
    modifications:

    * sklist
    * prot inuse counters

    The sklist modifications can simply be re-protected with the mutex, since
    they already occur in a sleeping context. The inuse counters modifications
    are trickier -- the __this_cpu_* accessors are used inside, thus requiring
    the caller to handle the potential context issues itself. Since packet
    sockets' counters are modified in only two places (packet_create and
    packet_release), we only need to protect the context from being preempted.
    BH disabling is not required in this case.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

20 Aug, 2012

2 commits

  • The reported value is the same as the one reported by the FANOUT
    getsockopt, but unlike it, an absent fanout setup results in an absent
    nlattr rather than an nlattr with a zero value. This is done because a
    zero fanout report could mean either no fanout at all, or a fanout with
    both id and type zero.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • One extension bit may result in two nlattrs -- one per ring type.
    If some ring type is not configured, then the respective nlattrs
    will be empty.

    The structure reported contains the data, that is given to the
    corresponding ring setup socket option.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

15 Aug, 2012

3 commits

  • The info is reported as an array of packet_diag_mclist structures. Each
    includes not only the directly configured values (index, type, etc), but
    also the "count".

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This reports in one rtattr message all the other scalar values that can
    be set on a packet socket with setsockopt.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The diag module can be built independently from the af_packet.ko one,
    just like it's done in unix sockets.

    The core dumping message carries the info available at socket creation
    time, i.e. family, type and protocol (in the same byte order as shown in
    the proc file).

    The socket inode number and cookie are reserved for future per-socket
    info retrieval. Per-protocol filtering is also reserved for the future
    by requiring sdiag_protocol to be zero.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov