03 Mar, 2015

19 commits

  • Now that there are no more users, kill dev_rebuild_header and all of its
    implementations.

    This is long overdue.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Have ax25_neigh_output perform ordinary arp resolution before calling
    ax25_neigh_xmit.

    Call dev_hard_header in ax25_neigh_output with a destination address so
    it will not fail, and the destination mac address will not need to be
    set in ax25_neigh_xmit.

    Remove arp_find from ax25_neigh_xmit (the ordinary arp resolution added
    to ax25_neigh_output removes the need for calling arp_find).

    Document how close ax25_neigh_output is to neigh_resolve_output.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • - Rename ax25_rebuild_header to ax25_neigh_xmit and call it from
    ax25_neigh_output directly. The rename is to make it clear
    that this is not a rebuild_header operation.

    - Remove ax25_rebuild_header from ax25_header_ops.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The only caller is now ax25_neigh_construct, so move
    neigh_compat_output into ax25_ip.c, make it static, and rename it
    ax25_neigh_output.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The special case has been pushed out into ax25_neigh_construct so there
    is no need to keep this code in arp.c

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • AX25 already has its own private arp cache operations to isolate
    its abuse of dev_rebuild_header to transmit packets. Add a function
    ax25_neigh_construct that will allow all of the ax25 devices to
    force using these operations, so that the generic arp code does
    not need to.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The only user is in ax25_ip.c so stop exporting these functions.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The two sets of header operations are functionally identical, so remove
    the duplicate definition.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • The two sets of header operations are functionally identical, so remove
    the duplicate definition.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Patterned after the similar code in net/rom, this turns out
    to be a trivial, obviously correct transformation.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Not setting the destination address is a bug that I suspect causes no
    problems today, as only the arp code seems to call dev_hard_header and
    the description I have of rose is that it is expected to be used with a
    static neighbour table.

    I have derived the offset and the length of the rose destination address
    from rose_rebuild_header where arp_find calls neigh_ha_snapshot to set
    the destination address.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • In the unlikely (impossible?) event that we attempt to transmit
    an ax25 packet over a non-ax25 device, free the skb so we don't
    leak it.

    Cc: Ralf Baechle
    Cc: linux-hams@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
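
The ownership rule behind this fix can be sketched in userspace C. This is a minimal model, not the kernel code: `struct buf`, `struct dev`, `xmit()` and the `live_bufs` counter are hypothetical stand-ins for `struct sk_buff`, `struct net_device` and the transmit path. The point it illustrates is that a transmit function which takes ownership of a buffer has to release it on every path, including the early error return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct sk_buff and struct net_device. */
struct buf { size_t len; };
struct dev { bool is_ax25; };

static int live_bufs; /* number of buffers allocated but not yet freed */

static struct buf *buf_alloc(size_t len)
{
    struct buf *b = malloc(sizeof(*b));
    if (b) {
        b->len = len;
        live_bufs++;
    }
    return b;
}

static void buf_free(struct buf *b)
{
    assert(b != NULL);
    free(b);
    live_bufs--;
}

/* Takes ownership of b: the buffer is released on every return path. */
static int xmit(struct buf *b, const struct dev *d)
{
    if (!d->is_ax25) {
        buf_free(b); /* wrong device type: drop the packet, do not leak it */
        return -1;
    }
    /* A real driver would hand b to the hardware here; the model frees it. */
    buf_free(b);
    return 0;
}
```

Callers of `xmit()` in this model never free the buffer themselves, which is why forgetting the free on the error path would leak it.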
     
  • Masami noted that it would be better to hide the remaining CONFIG_BPF_SYSCALL-only
    function declarations within the BPF header ifdef, w/o else path dummy alternatives
    since these functions are not supposed to have a user outside of CONFIG_BPF_SYSCALL.

    Suggested-by: Masami Hiramatsu
    Reference: http://article.gmane.org/gmane.linux.kernel.api/8658
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    A small batch with accumulated updates in nf-next, mostly IPVS updates,
    they are:

    1) Add 64-bits stats counters to IPVS, from Julian Anastasov.

    2) Move NETFILTER_XT_MATCH_ADDRTYPE out of NETFILTER_ADVANCED as Docker
    seems to require this, from Anton Blanchard.

    3) Use boolean instead of numeric value in set_match_v*(), from
    coccinelle via Fengguang Wu.

    4) Allows rescheduling of new connections in IPVS when port reuse is
    detected, from Marcelo Ricardo Leitner.

    5) Add missing bits to support arptables extensions from nft_compat,
    from Arturo Borrero.

    Patrick is preparing a large batch to enhance the set infrastructure,
    named expressions among other things, that should follow up soon after
    this batch.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Both sk_attach_filter() and sk_attach_bpf() are setting up sk_filter,
    charging skmem and attaching it to the socket after we got the eBPF
    prog up and ready. Let's refactor that into a common helper.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • …etooth/bluetooth-next

    Johan Hedberg says:

    ====================
    pull request: bluetooth-next 2015-03-02

    Here's the first bluetooth-next pull request targeting the 4.1 kernel:

    - ieee802154/6lowpan cleanups
    - SCO routing to host interface support for the btmrvl driver
    - AMP code cleanups
    - Fixes to AMP HCI init sequence
    - Refactoring of the HCI callback mechanism
    - Added shutdown routine for Intel controllers in the btusb driver
    - New config option to enable/disable Bluetooth debugfs information
    - Fix for early data reception on L2CAP fixed channels

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • Ying Xue says:

    ====================
    net: Remove iocb argument from sendmsg and recvmsg

    Currently there is only one user of the iocb argument: TIPC, whose
    sendmsg() instances use it. Meanwhile, no recvmsg() instance uses the
    iocb argument. Therefore, if we eliminate the weird usage of the iocb
    argument from TIPC, it can be removed from all sendmsg() and recvmsg()
    instances in the whole networking stack.

    Reference:
    https://patchwork.ozlabs.org/patch/433960/

    Changes:

    v2:
    * Fix compile errors in the DCCP module pointed out by David
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Now that TIPC no longer depends on the iocb argument in its internal
    implementations of the sendmsg() and recvmsg() hooks defined in the
    proto structure, no user of the iocb argument remains. We can
    therefore drop the redundant iocb argument from all implementations
    of both sendmsg() and recvmsg() in the entire networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     
  • Currently the iocb argument is used to identify whether or not the
    socket lock is held before tipc_sendmsg()/tipc_send_stream() is called.
    But this usage prevents the iocb argument from being dropped from
    sendmsg() at the common socket layer. Therefore, this commit introduces
    two new functions, __tipc_sendmsg() and __tipc_send_stream(). They
    assume that their callers have already taken the socket lock, thereby
    avoiding the weird usage of the iocb argument.

    Cc: Al Viro
    Cc: Christoph Hellwig
    Reviewed-by: Erik Hugne
    Reviewed-by: Jon Maloy
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
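
The locked/unlocked split described in this commit can be sketched as a small userspace model. Everything here is an assumption for illustration: `struct sock_model`, `send_locked()`, `send_msg()` and `connect_with_data()` are hypothetical names, and the lock is a plain flag rather than the kernel's socket lock. In the kernel the "caller holds the lock" variant conventionally carries a double-underscore prefix, which is avoided here because such identifiers are reserved in userspace C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket with a trivial, single-threaded lock model. */
struct sock_model {
    bool   locked;
    bool   connected;
    size_t bytes_sent;
};

static void lock_sock_model(struct sock_model *sk)
{
    assert(!sk->locked); /* taking the lock twice would be a bug */
    sk->locked = true;
}

static void release_sock_model(struct sock_model *sk)
{
    assert(sk->locked);
    sk->locked = false;
}

/* Locked variant: the caller must already hold the socket lock. */
static int send_locked(struct sock_model *sk, size_t len)
{
    assert(sk->locked);
    sk->bytes_sent += len;
    return (int)len;
}

/* Public entry point: takes the lock, then delegates to the locked variant. */
static int send_msg(struct sock_model *sk, size_t len)
{
    int ret;

    lock_sock_model(sk);
    ret = send_locked(sk, len);
    release_sock_model(sk);
    return ret;
}

/* An internal path that already holds the lock calls the locked variant
 * directly, so no extra argument is needed to signal the lock state. */
static int connect_with_data(struct sock_model *sk, size_t len)
{
    int ret;

    lock_sock_model(sk);
    sk->connected = true;
    ret = send_locked(sk, len);
    release_sock_model(sk);
    return ret;
}
```

The design choice is that the lock state is encoded in which function is called rather than in a side-channel argument, which is what frees the argument for removal.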
     

02 Mar, 2015

21 commits

  • This patch adds support to arptables extensions from nft_compat.

    Signed-off-by: Arturo Borrero Gonzalez
    Signed-off-by: Pablo Neira Ayuso

    Arturo Borrero
     
  • Eyal Birger says:

    ====================
    net: move skb->dropcount to skb->cb[]

    Commit 977750076d98 ("af_packet: add interframe drop cmsg (v6)")
    unionized skb->mark and skb->dropcount in order to allow recording
    of the socket drop count while maintaining struct sk_buff size.

    skb->dropcount was introduced since there was no available room
    in skb->cb[] in packet sockets. However, its introduction led to
    the inability to export skb->mark to userspace.

    It was considered to alias skb->priority instead of skb->mark.
    However, that would lead to the inability to export skb->priority
    to userspace if desired. Such change may also lead to hard-to-find
    issues as skb->priority is assumed to be alias free, and, as noted
    by Shmulik Ladkani, is not 'naturally orthogonal' with other skb
    fields.

    This patch series follows the suggestions made by Eric Dumazet moving
    the dropcount metric to skb->cb[], eliminating this problem
    at the expense of 4 bytes less in skb->cb[] for protocol families
    using it.

    The patch series includes compaction of the bluetooth and packet
    use of skb->cb[] as well as the infrastructure for placing dropcount
    in skb->cb[].
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit 977750076d98 ("af_packet: add interframe drop cmsg (v6)")
    unionized skb->mark and skb->dropcount in order to allow recording
    of the socket drop count while maintaining struct sk_buff size.

    skb->dropcount was introduced since there was no available room
    in skb->cb[] in packet sockets. However, its introduction led to
    the inability to export skb->mark, or any other aliased field to
    userspace if so desired.

    Moving the dropcount metric to skb->cb[] eliminates this problem
    at the expense of 4 bytes less in skb->cb[] for protocol families
    using it.

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
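
A minimal userspace sketch of the idea, assuming a 48-byte control buffer as in the kernel's `skb->cb[]`: the socket layer claims the last four bytes for the drop count, so `mark` no longer has to share storage with it. The names `struct skb_model`, `struct sock_cb_model`, `SOCK_CB_OFFSET`, `set_dropcount()` and `get_dropcount()` are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CB_SIZE 48 /* size of the control buffer in this model */

struct skb_model {
    uint32_t      mark;        /* no longer aliased with the drop count */
    unsigned char cb[CB_SIZE]; /* per-layer private scratch space */
};

/* Socket-layer data kept in the final bytes of cb[]. */
struct sock_cb_model {
    uint32_t dropcount;
};

#define SOCK_CB_OFFSET (CB_SIZE - sizeof(struct sock_cb_model))

static_assert(SOCK_CB_OFFSET == 44,
              "protocol families keep 44 of the 48 bytes");

/* memcpy keeps the access well defined regardless of cb[] alignment. */
static void set_dropcount(struct skb_model *skb, uint32_t drops)
{
    memcpy(skb->cb + SOCK_CB_OFFSET, &drops, sizeof(drops));
}

static uint32_t get_dropcount(const struct skb_model *skb)
{
    uint32_t drops;

    memcpy(&drops, skb->cb + SOCK_CB_OFFSET, sizeof(drops));
    return drops;
}
```

The cost stated in the commit message shows up directly in `SOCK_CB_OFFSET`: protocol families are left with four bytes less of `cb[]`.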
     
  • As part of an effort to move skb->dropcount to skb->cb[], use
    a common function in order to set dropcount in struct sk_buff.

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
     
  • As part of an effort to move skb->dropcount to skb->cb[] use a common
    macro in protocol families using skb->cb[] for ancillary data to
    validate available room in skb->cb[].

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
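
One way such a validation macro could look, sketched in portable C11. The macro name `CB_CHECK_SIZE`, the constants and `struct proto_cb_model` are hypothetical; the kernel has its own build-time assertion facilities, and this sketch uses `static_assert` instead.

```c
#include <assert.h>
#include <stdint.h>

#define CB_SIZE          48               /* control buffer size in this model */
#define SOCK_CB_RESERVED sizeof(uint32_t) /* tail bytes reserved for dropcount */
#define PROTO_CB_ROOM    (CB_SIZE - SOCK_CB_RESERVED)

/* Fails the build if a protocol's private data would overlap the reserved tail. */
#define CB_CHECK_SIZE(type) \
    static_assert(sizeof(type) <= PROTO_CB_ROOM, #type " does not fit in cb[]")

/* Hypothetical per-protocol ancillary data stored at the start of cb[]. */
struct proto_cb_model {
    uint8_t  addr[20];
    uint32_t ifindex;
    uint16_t flags;
};

CB_CHECK_SIZE(struct proto_cb_model);
```

Because the check runs at compile time, a protocol family that later grows its control block past the available room is caught when it is built rather than by memory corruption at run time.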
     
  • As part of an effort to move skb->dropcount to skb->cb[], 4 bytes
    of additional room are needed in skb->cb[] in packet sockets.

    Store the skb original length in the first two fields of sockaddr_ll
    (sll_family and sll_protocol) as they can be derived from the skb when
    needed.

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
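
The space-saving trick can be sketched as follows. This is a model under stated assumptions: `struct ll_addr_model` is a trimmed-down, hypothetical stand-in for `struct sockaddr_ll`, and `stash_origlen()` and `take_origlen()` are invented helper names. While the packet waits in the receive queue, the storage of the first two address fields holds the original length; when the address is needed, the length is read back and the two fields are refilled from values derived elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down model of a link-layer socket address. */
struct ll_addr_model {
    uint16_t family;
    uint16_t protocol;
    int32_t  ifindex;
};

static_assert(offsetof(struct ll_addr_model, ifindex) >= sizeof(uint32_t),
              "the stored length must not overlap ifindex");

/* Receive path: overwrite family + protocol with the original length. */
static void stash_origlen(struct ll_addr_model *ll, uint32_t origlen)
{
    memcpy(ll, &origlen, sizeof(origlen));
}

/* Delivery path: read the length back, then restore the two fields from
 * values the caller can derive from the packet itself. */
static uint32_t take_origlen(struct ll_addr_model *ll,
                             uint16_t family, uint16_t protocol)
{
    uint32_t origlen;

    memcpy(&origlen, ll, sizeof(origlen));
    ll->family = family;
    ll->protocol = protocol;
    return origlen;
}
```

The sketch uses `memcpy` to stay within portable C; however the kernel lays this out, the idea is the same reuse of bytes whose contents can be recomputed later.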
     
  • Commit 3b885787ea4112 ("net: Generalize socket rx gap / receive queue overflow cmsg")
    allowed receiving packet dropcount information as a socket level option.
    RXRPC sockets recvmsg function was changed to support this by calling
    sock_recv_ts_and_drops() instead of sock_recv_timestamp().

    However, protocol families wishing to receive dropcount should call
    sock_queue_rcv_skb() or set the dropcount specifically (as done
    in packet_rcv()). This was not done for rxrpc and thus this feature
    never worked on these sockets.

    Formalize this by not calling sock_recv_ts_and_drops() in rxrpc as
    part of an effort to move skb->dropcount into skb->cb[].

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
     
  • Convert boolean fields incoming and req_start to bit fields and move
    force_active in order to save space in bt_skb_cb in an effort to use
    a portion of skb->cb[] for storing skb->dropcount.

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
     
  • struct hci_req_ctrl is never used outside of struct bt_skb_cb;
    Inlining it frees 8 bytes on a 64 bit system in skb->cb[] allowing
    the addition of more ancillary data.

    Signed-off-by: Eyal Birger
    Reviewed-by: Shmulik Ladkani
    Signed-off-by: David S. Miller

    Eyal Birger
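
Why inlining a nested struct can free space is a matter of padding, which the following sketch illustrates. The field names are hypothetical and only loosely modelled on the structures named in the commit; exact sizes are implementation-defined, so the code compares the two layouts rather than hard-coding a number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*complete_fn)(int status);

/* Before: the request fields sit in their own struct, which is aligned
 * and padded for the pointer it contains. */
struct req_ctrl_model {
    bool        start;
    uint8_t     event;
    complete_fn complete;
};

struct cb_nested {
    uint8_t               pkt_type;
    uint16_t              opcode;
    struct req_ctrl_model req;
};

/* After: the same fields inlined, so the small members pack together
 * ahead of the pointer. */
struct cb_inlined {
    uint8_t     pkt_type;
    bool        req_start;
    uint8_t     req_event;
    uint16_t    opcode;
    complete_fn req_complete;
};

static_assert(sizeof(struct cb_inlined) <= sizeof(struct cb_nested),
              "inlining must not grow the control block");

/* Bytes recovered by inlining on the ABI this is compiled for. */
static size_t bytes_saved(void)
{
    return sizeof(struct cb_nested) - sizeof(struct cb_inlined);
}
```

On a typical 64-bit ABI with 8-byte pointers, `bytes_saved()` would be expected to come out at 8 for this model, in line with the figure in the commit message, but that depends on the compiler's padding rules.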
     
  • When a PADT frame is received, the socket may not be in a good state to
    close down the PPP interface. The current implementation handles this by
    simply blocking all further PPP traffic, and hoping that the lack of traffic
    will trigger the user to investigate.

    Use schedule_work to get to a process context from which we clear down the
    PPP interface, in a fashion analogous to hangup on a TTY-based PPP
    interface. This causes pppd to disconnect immediately, and allows tools to
    take immediate corrective action.

    Note that pppd's rp_pppoe.so plugin has code in it to disable the session
    when it disconnects; however, as a consequence of this patch, the session is
    already disabled before rp_pppoe.so is asked to disable the session. The
    result is a harmless error message:

    Failed to disconnect PPPoE socket: 114 Operation already in progress

    This message is safe to ignore, as long as the error is 114 Operation
    already in progress; in that specific case, it means that the PPPoE session
    has already been disabled before pppd tried to disable it.

    Signed-off-by: Simon Farnsworth
    Tested-by: Dan Williams
    Tested-by: Christoph Schulz
    Signed-off-by: David S. Miller

    Simon Farnsworth
     
  • The bnx2 driver uses .ndo_fix_features to force enable of Rx VLAN tag
    stripping when the card cannot disable it. The driver should remove
    NETIF_F_HW_VLAN_CTAG_RX flag from hw_features instead so it is fixed
    for the ethtool.

    Cc: Sony Chacko
    Cc: Dept-HSGLinuxNICDev@qlogic.com
    Signed-off-by: Ivan Vecera
    Signed-off-by: David S. Miller

    Ivan Vecera
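
The difference between the two approaches can be sketched with a small feature-mask model. This is a simplification under stated assumptions: `struct netdev_model`, `request_features()` and the bit values are hypothetical, and the real feature negotiation in the networking core has more steps. It shows why a bit that is absent from `hw_features` behaves as fixed: user requests can only change bits that are present there.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions for two offload features. */
#define F_VLAN_RX (1u << 0)
#define F_TSO     (1u << 1)

struct netdev_model {
    uint32_t hw_features; /* bits the user is allowed to toggle */
    uint32_t features;    /* bits currently active */
};

/* Apply a user request. Only bits present in hw_features can change;
 * the return value is the set of requested bits refused as fixed. */
static uint32_t request_features(struct netdev_model *dev,
                                 uint32_t mask, uint32_t values)
{
    uint32_t changeable;

    assert((values & ~mask) == 0); /* values must lie within mask */
    changeable = mask & dev->hw_features;
    dev->features = (dev->features & ~changeable) | (values & changeable);
    return mask & ~dev->hw_features;
}
```

With `F_VLAN_RX` active in `features` but left out of `hw_features`, a request to clear it is reported as refused instead of being silently re-enabled afterwards, which is the user-visible difference the commit is after.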
     
  • Add *_SIZE macros for the bits ENDIA_DESC and
    ENDIA_PKT

    Signed-off-by: Arun Chandran
    Acked-by: Nicolas Ferre
    Signed-off-by: David S. Miller

    Arun Chandran
     
  • Program management descriptor's access mode according to the
    dynamically detected CPU endianness.

    Signed-off-by: Arun Chandran
    Acked-by: Nicolas Ferre
    Tested-by: Michal Simek
    Signed-off-by: David S. Miller

    Arun Chandran
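
A rough sketch of how offset/size macro pairs and a runtime endianness probe fit together. The bit positions below (6 and 7) are an assumption made for illustration, as are the names `FIELD_MASK`, `cpu_is_big_endian()` and `program_desc_endianness()`; the driver's own macros and register access helpers are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed field layout, for illustration only. */
#define ENDIA_PKT_OFFSET  6
#define ENDIA_PKT_SIZE    1
#define ENDIA_DESC_OFFSET 7
#define ENDIA_DESC_SIZE   1

/* Build a mask from a field's *_OFFSET and *_SIZE macros. */
#define FIELD_MASK(name) (((1u << name##_SIZE) - 1u) << name##_OFFSET)

static_assert(FIELD_MASK(ENDIA_DESC) == 0x80u, "one bit at position 7");

/* Detect the byte order of the CPU at run time. */
static bool cpu_is_big_endian(void)
{
    const uint16_t probe = 0x0102;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 0x01;
}

/* Return the register value with descriptor access mode set to match
 * the given CPU byte order. */
static uint32_t program_desc_endianness(uint32_t dmacfg, bool big_endian)
{
    if (big_endian)
        dmacfg |= FIELD_MASK(ENDIA_DESC);
    else
        dmacfg &= ~FIELD_MASK(ENDIA_DESC);
    return dmacfg;
}
```

A caller in this model would pass `cpu_is_big_endian()` as the second argument and write the result back to the device register.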
     
  • Allows for packet parsing to be done by the fast path. This performance
    optimization already exists for IPv4. Add similar logic for IPv6.

    Signed-off-by: Amitabha Banerjee
    Signed-off-by: Shrikrishna Khare
    Signed-off-by: David S. Miller

    Shrikrishna Khare
     
  • Daniel Borkmann says:

    ====================
    eBPF support for cls_bpf

    This is the non-RFC version of my patchset posted before netdev01 [1]
    conference. It contains a couple of eBPF cleanups and preparation
    patches to get eBPF support into cls_bpf. The last patch adds the
    actual support. I'll post the iproute2 parts after the kernel bits
    are merged, an initial preview link to the code is mentioned in the
    last patch.

    Patches 4 and 5 were originally one patch, but I've split them into
    two parts upon request as patch 4 only is also needed for Alexei's
    tracing patches that go via tip tree.

    Tested with tc and all in-kernel available BPF test suites.

    I have configured and built LLVM with --enable-experimental-targets=BPF
    but as Alexei put it, the plan is to get rid of the experimental
    status in future [2].

    Thanks a lot!

    v1 -> v2:
    - Removed arch patches from this series
    - x86 is already queued in tip tree, under x86/mm
    - arm64 just reposted directly to arm folks
    - Rest is unchanged

    [1] http://thread.gmane.org/gmane.linux.network/350191
    [2] http://article.gmane.org/gmane.linux.kernel/1874969
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This work extends the "classic" BPF programmable tc classifier by
    extending its scope also to native eBPF code!

    This allows for user space to implement its own custom, 'safe' C like
    classifiers (or whatever other frontend language LLVM et al may
    provide in future), that can then be compiled with the LLVM eBPF
    backend to an eBPF elf file. The result of this can be loaded into
    the kernel via iproute2's tc. In the kernel, they can be JITed on
    major archs and thus run in native performance.

    Simple, minimal toy example to demonstrate the workflow:

    #include <linux/ip.h>
    #include <linux/if_ether.h>
    #include <linux/bpf.h>

    #include "tc_bpf_api.h"

    __section("classify")
    int cls_main(struct sk_buff *skb)
    {
        return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
    }

    char __license[] __section("license") = "GPL";

    The classifier can then be compiled into eBPF opcodes and loaded
    via tc, for example:

    clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
    tc filter add dev em1 parent 1: bpf cls.o [...]

    As it has been demonstrated, the scope can even reach up to a fully
    fledged flow dissector (similarly as in samples/bpf/sockex2_kern.c).

    For tc, maps are allowed to be used, but from kernel context only,
    in other words, eBPF code can keep state across filter invocations.
    In future, we perhaps may reattach from a different application to
    those maps e.g., to read out collected statistics/state.

    Similarly as in socket filters, we may extend functionality for eBPF
    classifiers over time depending on the use cases. For that purpose,
    cls_bpf programs are using BPF_PROG_TYPE_SCHED_CLS program type, so
    we can allow additional functions/accessors (e.g. an ABI compatible
    offset translation to skb fields/metadata). For an initial cls_bpf
    support, we allow the same set of helper functions as eBPF socket
    filters, but we could diverge at some point in time w/o problem.

    I was wondering whether cls_bpf and act_bpf could share C programs,
    I can imagine that at some point, we introduce i) further common
    handlers for both (or even beyond their scope), and/or if truly needed
    ii) some restricted function space for each of them. Both can be
    abstracted easily through struct bpf_verifier_ops in future.

    The context of cls_bpf versus act_bpf is slightly different though:
    a cls_bpf program will return a specific classid whereas act_bpf a
    drop/non-drop return code, latter may also in future mangle skbs.
    That said, we can surely have a "classify" and "action" section in
    a single object file, or, considering the mentioned constraint, add
    the possibility of a shared section.

    The workflow for getting native eBPF running from tc [1] is as
    follows: for f_bpf, I've added a slightly modified ELF parser code
    from Alexei's kernel sample, which reads out the LLVM compiled
    object, sets up maps (and dynamically fixes up map fds) if any, and
    loads the eBPF instructions all centrally through the bpf syscall.

    The resulting fd from the loaded program itself is being passed down
    to cls_bpf, which looks up struct bpf_prog from the fd store, and
    holds reference, so that it stays available also after tc program
    lifetime. On tc filter destruction, it will then drop its reference.

    Moreover, I've also added the optional possibility to annotate an
    eBPF filter with a name (e.g. path to object file, or something
    else if preferred) so that when tc dumps currently installed filters,
    some more context can be given to an admin for a given instance (as
    opposed to just the file descriptor number).

    Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
    exported, so that eBPF can be used from cls_bpf built as a module.
    Thanks to 60a3b2253c41 ("net: bpf: make eBPF interpreter images
    read-only") I think this is of no concern since anything wanting to
    alter eBPF opcode after verification stage would crash the kernel.

    [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf

    Signed-off-by: Daniel Borkmann
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
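
The toy classifier's return expression packs a class identifier into 32 bits, with the major number in the upper half and the minor number in the lower half. The helpers below are a hypothetical userspace sketch of that packing; `make_classid()`, `classid_major()`, `classid_minor()` and `classify_by_tos()` are names invented here and are not part of tc or the kernel.

```c
#include <assert.h>
#include <stdint.h>

static_assert((0x800u << 16) == 0x08000000u,
              "the major number occupies the upper 16 bits");

/* Pack a major:minor pair into one 32-bit class identifier. */
static uint32_t make_classid(uint16_t major, uint16_t minor)
{
    return ((uint32_t)major << 16) | minor;
}

static uint16_t classid_major(uint32_t id)
{
    return (uint16_t)(id >> 16);
}

static uint16_t classid_minor(uint32_t id)
{
    return (uint16_t)(id & 0xffffu);
}

/* Mirrors the toy example: major 0x800, minor taken from the IPv4 TOS byte. */
static uint32_t classify_by_tos(uint8_t tos)
{
    return make_classid(0x800, tos);
}
```

Read this way, the example classifier sorts packets into classes 800:0 through 800:ff according to their TOS value.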
     
  • is_gpl_compatible and prog_type should be moved directly into bpf_prog
    as they stay immutable during bpf_prog's lifetime, are core attributes
    and they can be locked as read-only later on via bpf_prog_select_runtime().

    With a bit of rearranging, this also allows us to shrink bpf_prog_aux
    to exactly 1 cacheline.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As discussed recently and at netconf/netdev01, we want to prevent making
    bpf_verifier_ops registration available for modules, but have them at a
    controlled place inside the kernel instead.

    The reason for this is, that out-of-tree modules can go crazy and define
    and register any verifier ops they want, doing all sorts of crap, even
    bypassing available GPLed eBPF helper functions. We don't want to offer
    such a shiny playground, of course, but keep strict control to ourselves
    inside the core kernel.

    This also encourages us to design eBPF user helpers carefully and
    generically, so they can be shared among various subsystems using eBPF.

    For the eBPF traffic classifier (cls_bpf), it's a good start to share
    the same helper facilities as we currently do in eBPF for socket filters.

    That way, we have BPF_PROG_TYPE_SCHED_CLS look like its own type, thus
    one day if there's a good reason to diverge the set of helper functions
    from the set available to socket filters, we keep ABI compatibility.

    In future, we could place all bpf_prog_type_list at a central place,
    perhaps.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This gets rid of CONFIG_BPF_SYSCALL ifdefs in the socket filter code,
    now that the BPF internal header can deal with it.

    While going over it, I also changed eBPF related functions to a sk_filter
    prefix to be more consistent with the rest of the file.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Socket filter code and other subsystems with upcoming eBPF support should
    not need to deal with the fact that we have CONFIG_BPF_SYSCALL defined or
    not.

    Having the bpf syscall as a config option is a nice thing and I'd expect
    it to stay that way for expert users (I presume one day the default setting
    of it might change, though), but code making use of it should not care if
    it's actually enabled or not.

    Instead, hide this via header files and let the rest deal with it.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
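
The header-stub pattern this commit describes can be sketched as below. The names are hypothetical: `CONFIG_FEATURE_MODEL`, `struct prog_model`, `prog_get()`, `prog_put()` and `attach_prog()` stand in for the real option and functions. When the option is off, the header supplies inline stubs that fail cleanly, so callers compile unchanged and contain no `#ifdef` of their own.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct prog_model { int id; };

/* ---- what the shared header would contain ---- */
#ifdef CONFIG_FEATURE_MODEL
struct prog_model *prog_get(int fd); /* real implementation lives elsewhere */
void prog_put(struct prog_model *prog);
#else
/* Feature compiled out: stubs that report "nothing available". */
static inline struct prog_model *prog_get(int fd)
{
    (void)fd;
    return NULL;
}

static inline void prog_put(struct prog_model *prog)
{
    (void)prog;
}
#endif

/* ---- a caller, free of #ifdef ---- */
static int attach_prog(int fd)
{
    struct prog_model *prog;

    assert(fd >= 0);
    prog = prog_get(fd);
    if (!prog)
        return -EOPNOTSUPP;
    prog_put(prog);
    return 0;
}
```

Built without `CONFIG_FEATURE_MODEL`, `attach_prog()` simply returns an error, and the compiler can discard the unreachable remainder.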
     
  • We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the
    ELF BPF loader where instructions are being loaded that need map fixups.

    An initial stage loads all maps into the kernel, and later on replaces
    related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source
    register and the actual fd as immediate value.

    The kernel verifier recognizes this keyword and replaces the map fd with
    a real pointer internally.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
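
The loader-side fixup step described above can be sketched in userspace C. This is a model, not the loader's code: `struct insn_model` is a simplified stand-in for the kernel's instruction layout (plain bytes instead of 4-bit register fields), and `struct map_reloc` and `fixup_map_fds()` are hypothetical. The constants reflect the 64-bit immediate load opcode (`BPF_LD | BPF_IMM | BPF_DW`, 0x18) and the value 1 for `BPF_PSEUDO_MAP_FD`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace model of an eBPF instruction. */
struct insn_model {
    uint8_t code;
    uint8_t dst_reg;
    uint8_t src_reg;
    int16_t off;
    int32_t imm;
};

#define OP_LD_IMM64   0x18 /* BPF_LD | BPF_IMM | BPF_DW */
#define PSEUDO_MAP_FD 1    /* marks imm as a map file descriptor */

/* A relocation: the instruction at insn_idx refers to map number map_idx. */
struct map_reloc {
    size_t insn_idx;
    size_t map_idx;
};

/* Patch every relocated instruction with the fd of the map it refers to.
 * Returns 0 on success, -1 if a relocation is out of range or does not
 * point at a 64-bit immediate load. */
static int fixup_map_fds(struct insn_model *insns, size_t n_insns,
                         const struct map_reloc *relocs, size_t n_relocs,
                         const int *map_fds, size_t n_maps)
{
    assert(insns != NULL && relocs != NULL && map_fds != NULL);

    for (size_t i = 0; i < n_relocs; i++) {
        const struct map_reloc *r = &relocs[i];

        if (r->insn_idx >= n_insns || r->map_idx >= n_maps)
            return -1;
        if (insns[r->insn_idx].code != OP_LD_IMM64)
            return -1;
        insns[r->insn_idx].src_reg = PSEUDO_MAP_FD;
        insns[r->insn_idx].imm = map_fds[r->map_idx];
    }
    return 0;
}
```

In this model the maps are created first, their descriptors collected in `map_fds`, and only then is the instruction stream patched and handed to the kernel, which swaps each descriptor for an internal pointer during verification.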