25 Jan, 2017

25 commits

  • We may be able to see invalid Broadcom tags when the hardware and drivers are
    misconfigured, or just while exercising the error path. Instead of flooding
    the console with messages, flat out drop the packet.

    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • After commit 51b7b1c34e19 (KSZ8851-SNL: Add ethtool support for
    EEPROM via eeprom_93cx6, 2011-11-21) this structure member is
    unused. Delete it.

    Signed-off-by: Stephen Boyd
    Signed-off-by: David S. Miller

    Stephen Boyd
     
  • Daniel Borkmann says:

    ====================
    Misc BPF improvements

    This series adds various misc improvements to BPF, f.e. allowing
    skb_load_bytes() helper to be used with filter/reuseport programs
    to facilitate programming, test cases for program tag, etc. For
    details, please see individual patches.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • William reported a couple of issues in relation to direct packet
    access. Typical scheme is to check for data + [off] <= data_end
    [...] imm > 0 it
    means that the value has [imm] upper zero bits. F.e. when shifting
    an UNKNOWN_VALUE register by 3 to the right, no matter what value
    it had, we know that the 3 upper most bits must be zero now.
    This is to make sure that ALU operations with unknown registers
    don't overflow. Meaning, once we know that we have more than 48
    upper zero bits, or, in other words cannot go beyond 0xffff offset
    with ALU ops, such an addition will track the target register
    as a new pkt() register with a new id, but 0 offset and 0 range,
    so for that a new data/data_end test will be required. Is the source
    register a CONST_IMM one that is to be added to the pkt() register,
    or the source instruction is an add instruction with immediate
    value, then it will get added if it stays within max 0xffff bounds.
    From there, the pkt() type can be accessed should reg->off + imm be
    within the access range of pkt().

    [...]
    from 28 to 30: R0=imm1,min_value=1,max_value=1
    R1=pkt(id=0,off=0,r=22) R2=pkt_end
    R3=imm144,min_value=144,max_value=144
    R4=imm0,min_value=0,max_value=0
    R5=inv48,min_value=2054,max_value=2054 R10=fp
    30: (bf) r5 = r3
    31: (07) r5 += 23
    32: (77) r5 >>= 3
    33: (bf) r6 = r1
    34: (0f) r6 += r5
    cannot add integer value with 0 upper zero bits to ptr_to_packet

    [...]
    from 52 to 80: R0=imm1,min_value=1,max_value=1
    R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv
    R4=imm272 R5=inv56,min_value=17,max_value=17
    R6=pkt(id=0,off=26,r=34) R10=fp
    80: (07) r4 += 71
    81: (18) r5 = 0xfffffff8
    83: (5f) r4 &= r5
    84: (77) r4 >>= 3
    85: (0f) r1 += r4
    cannot add integer value with 3 upper zero bits to ptr_to_packet

    Thus to get above use-cases working, evaluate_reg_imm_alu() has
    been extended for further ALU ops. This is fine, because we only
    operate strictly within realm of CONST_IMM types, so here we don't
    care about overflows as they will happen in the simulated but also
    real execution and interaction with pkt() in check_packet_ptr_add()
    will check actual imm value once added to pkt(), but it's irrelevant
    before.

    With regards to 06c1c049721a ("bpf: allow helpers access to variable
    memory") that works on UNKNOWN_VALUE registers, the verifier becomes
    now a bit smarter as it can better resolve ALU ops, so we need to
    adapt two test cases there, as min/max bound tracking only becomes
    necessary when registers were spilled to stack. So while mask was
    set before to track upper bound for UNKNOWN_VALUE case, it's now
    resolved directly as CONST_IMM, and such constructs are only necessary
    when f.e. registers are spilled.

    For commit 6b17387307ba ("bpf: recognize 64bit immediate loads as
    consts") that initially enabled dw load tracking only for nfp jit/
    analyzer, I did a couple of tests on large, complex programs and we
    don't increase complexity badly (my tests were in ~3% range on avg).
    I've added a couple of tests similar to affected code above, and
    it works fine with verifier now.

    Reported-by: William Tu
    Signed-off-by: Daniel Borkmann
    Cc: Gianluca Borello
    Cc: William Tu
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
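
The two rejected traces in the message above can be replayed by hand to see what constant folding buys. The following Python sketch is a toy model, not verifier code: `fold_const_alu`, `fits_packet_offset` and `upper_zero_bits_after_rsh` are hypothetical names, and only the add, and, and right-shift operations that appear in the traces are modelled.

```python
MASK64 = (1 << 64) - 1
MAX_PKT_OFF = 0xFFFF  # offset limit quoted in the commit message


def fold_const_alu(value, ops):
    """Apply (op, operand) pairs to a known 64-bit constant."""
    for op, operand in ops:
        if op == "add":
            value = (value + operand) & MASK64
        elif op == "and":
            value &= operand
        elif op == "rsh":
            value >>= operand
        else:
            raise ValueError(f"unmodelled op: {op}")
    return value


def fits_packet_offset(value):
    """True if a constant stays within the 0xffff bound."""
    return 0 <= value <= MAX_PKT_OFF


def upper_zero_bits_after_rsh(known_zero_bits, shift):
    """An unknown 64-bit value shifted right gains `shift` zero upper bits."""
    return min(64, known_zero_bits + shift)


# First trace: r5 = r3 (144); r5 += 23; r5 >>= 3
first = fold_const_alu(144, [("add", 23), ("rsh", 3)])
# Second trace: r4 = 272; r4 += 71; r4 &= 0xfffffff8; r4 >>= 3
second = fold_const_alu(272, [("add", 71), ("and", 0xFFFFFFF8), ("rsh", 3)])
print(first, second)  # 20 42
```

Both results are small constants well below 0xffff, whereas treating the shifted register as unknown only yields 3 known upper zero bits, which matches the rejection text in the second trace.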
     
  • Add the test case used to compare the results from fdinfo with
    af_alg's output on the tag. Tests are from min to max sized
    programs, with and without maps included.

    # ./test_tag
    test_tag: OK (40945 tests)

    Tested on x86_64 and s390x.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • When programs need to calculate the csum from scratch for small UDP
    packets and use bpf_l4_csum_replace() to feed the result from helpers
    like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0
    that would ignore the case of current csum being 0, and which would
    still allow for the helper to set the csum and transform when needed
    to CSUM_MANGLED_0.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
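
The decision described above can be sketched as a small Python model. This is a simplification under stated assumptions: the function takes the freshly computed checksum directly instead of a from/to diff, and `enforce` is a hypothetical name for the new flag, which the message does not spell out. For UDP, a checksum field of 0 means "no checksum", so a computed value of 0 is stored as CSUM_MANGLED_0 (0xffff).

```python
CSUM_MANGLED_0 = 0xFFFF  # stand-in stored when the computed checksum is 0


def set_l4_csum(current, computed, mark_mangled_0, enforce):
    """Return the checksum field value after an update.

    current:        value currently in the packet (0 = checksum unused)
    computed:       freshly calculated 16-bit checksum
    mark_mangled_0: models BPF_F_MARK_MANGLED_0
    enforce:        hypothetical name for the additional flag
    """
    # Without enforcement, a packet that carries no checksum is left alone.
    if mark_mangled_0 and not enforce and current == 0:
        return 0
    result = computed & 0xFFFF
    # A computed 0 would read as "no checksum", so store the mangled form.
    if mark_mangled_0 and result == 0:
        result = CSUM_MANGLED_0
    return result
```

With `enforce` set, a program that builds the checksum from scratch can overwrite a field that is currently 0, which is the case the message describes.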
     
  • BPF_PROG_TYPE_SOCKET_FILTER programs are used in various facilities such as
    for SO_REUSEPORT and packet fanout demuxing, packet filtering, kcm,
    etc, and yet the only facility they can use is BPF_LD with {BPF_ABS,
    BPF_IND} for single byte/half/word access.

    Direct packet access is restricted to tc programs right now,
    but we can still facilitate usage by allowing skb_load_bytes() helper
    added back then in 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper")
    that calls skb_header_pointer() similarly to bpf_load_pointer(), but
    for stack buffers with larger access size.

    Name the previous sk_filter_func_proto() as bpf_base_func_proto()
    since this is used everywhere else as well, similarly for the ctx
    converter, that is, bpf_convert_ctx_access().

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The __is_valid_access() test for cb[] from 62c7989b24db ("bpf: allow
    b/h/w/dw access for bpf's cb in ctx") was unnecessarily complex;
    we can simplify it the same way as the recent fix from 2d071c643f1c
    ("bpf, trace: make ctx access checks more robust") did. Overflow can
    never happen as size is 1/2/4/8 depending on access.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
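
A simplified range test of the kind described above might look like the following Python sketch. The layout constants are assumptions for illustration (cb[] modelled as five 32-bit words at a hypothetical offset), and the alignment check is an assumption too; the point is only that `off + size` cannot overflow when size is limited to 1/2/4/8.

```python
CB_START = 48                # hypothetical byte offset of cb[0] in the ctx
CB_LEN = 5 * 4               # assumption: cb[] is five 32-bit words
CB_END = CB_START + CB_LEN   # first byte past cb[]


def cb_access_ok(off, size):
    """True if a b/h/w/dw access at `off` stays inside cb[]."""
    if size not in (1, 2, 4, 8):
        return False
    if off % size != 0:      # assumption: accesses must be naturally aligned
        return False
    # size is at most 8, so off + size is a plain end-of-range comparison.
    return CB_START <= off and off + size <= CB_END
```

For example, an 8-byte access at offset 56 fits, while one at offset 64 would run past the end of the array and is refused.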
     
  • One line was apparently pasted incorrectly during a new feature patch:

    drivers/net/phy/marvell.c:2090:15: error: initialized field overwritten [-Werror=override-init]
    .features = PHY_GBIT_FEATURES,

    I'm removing the extraneous line here to avoid the W=1 warning and restore
    the previous flags value, and I'm slightly reordering the lines for consistency
    to make it less likely to happen again in the future. The ordering in the
    array is still not the same as in the structure definition, instead I picked
    the order that is most common in this file and that seems to make more sense
    here.

    Fixes: 0b04680fdae4 ("phy: marvell: Add support for temperature sensor")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • The idea for this was born when testing VF support in iproute2 which was
    impeded by hardware requirements. In fact, not every VF-capable hardware
    driver implements all netdev ops, so testing the interface is still hard
    to do even with a well-sorted hardware shelf.

    To overcome this and allow for testing the user-kernel interface, this
    patch allows turning dummy into a PF with a configurable number of VFs.

    Since my patch series 'bus-agnostic-num-vf' has been accepted,
    implementing the required interfaces is pretty straightforward: Iff
    'num_vfs' module parameter was given a value >0, a dummy bus type is
    being registered which implements the 'num_vf()' callback. Additionally,
    a dummy parent device common to all dummy devices is registered which
    sits on the above dummy bus.

    Joint work with Sabrina Dubroca.

    Signed-off-by: Sabrina Dubroca
    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     
  • The ethtool api {get|set}_settings is deprecated.
    We move this driver to new api {get|set}_link_ksettings.

    As I don't have the hardware, I'd be very pleased if
    someone could test this patch.

    Signed-off-by: Philippe Reynes
    Acked-by: Yuval Mintz
    Signed-off-by: David S. Miller

    Philippe Reynes
     
  • Jiri Pirko says:

    ====================
    Add support for offloading packet-sampling

    Yotam says:

    The first patch introduces the psample module, a netlink channel dedicated
    to packet sampling implemented using generic netlink. This module provides
    a generic way for kernel modules to sample packets, while not being tied
    to any specific subsystem like NFLOG.

    The second patch adds the sample tc action, which uses psample to randomly
    sample packets that match a classifier. The user can configure the psample
    group number, the sampling rate and the packet's truncation (to save
    kernel-user traffic).

    The last two patches add the support for offloading the matchall-sample
    tc command in the mlxsw driver, for ingress qdiscs.

    An example for psample usage can be found in the libpsample project at:
    https://github.com/Mellanox/libpsample

    v1->v2:
    - Reword first patch's commit message
    - Fix typo in comment in second patch
    - Change order of tc_sample uapi enum to match convention
    - Rename act_sample action callback tcf_sample -> tcf_sample_act
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Using the MPSC register, add the functions that configure port-based
    packet sampling in hardware and the necessary datatypes in the
    mlxsw_sp_port struct. In addition, add the necessary trap for sampled
    packets and integrate with matchall offloading to allow offloading of the
    sample tc action.

    The current offload support is for the tc command:

    tc filter add dev <DEV> parent ffff: \
    matchall skip_sw \
    action sample rate <RATE> group <GROUP> [trunc <SIZE>]

    Where only ingress qdiscs are supported, and only a combination of
    matchall classifier and sample action will lead to activating hardware
    packet sampling.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Yotam Gigi
     
  • The MPSC register allows configuring ingress packet sampling on a specific
    port of the mlxsw device. The sampled packets are then trapped via the
    PKT_SAMPLE trap.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Yotam Gigi
     
  • This action allows the user to sample traffic matched by a tc classifier.
    The sampling consists of choosing packets randomly and sampling them using
    the psample module. The user can configure the psample group number, the
    sampling rate and the packet's truncation (to save kernel-user traffic).

    Example:
    To sample ingress traffic from interface eth1, one may use the commands:

    tc qdisc add dev eth1 handle ffff: ingress

    tc filter add dev eth1 parent ffff: \
    matchall action sample rate 12 group 4

    Where the first command adds an ingress qdisc and the second starts
    sampling randomly with an average of one sampled packet per 12 packets on
    dev eth1 to psample group 4.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
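
The "one sampled packet per 12 packets on average" behaviour can be sketched in Python as an independent random draw per packet. This is a model of the idea only: `sample_packets` is a hypothetical name, and the kernel action's actual random source and data path are not reproduced here.

```python
import random


def sample_packets(packets, rate, trunc=None, rng=None):
    """Pick each packet with probability 1/rate, optionally truncating it.

    packets: iterable of bytes objects
    rate:    average number of packets per sampled packet (>= 1)
    trunc:   if given, keep at most this many bytes of each sample
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    if rng is None:
        rng = random.Random()
    sampled = []
    for pkt in packets:
        # One value out of `rate` equally likely ones selects the packet.
        if rng.randrange(rate) == 0:
            sampled.append(pkt if trunc is None else pkt[:trunc])
    return sampled
```

With rate 1 every packet is sampled; with rate 12 roughly one twelfth of a long stream is, and `trunc` bounds how much of each packet is copied out.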
     
  • Add a general way for kernel modules to sample packets, without being tied
    to any specific subsystem. This netlink channel can be used by tc,
    iptables, etc. and allows packet sampling in the kernel to be standardized.

    For every sampled packet, the psample module adds the following metadata
    fields:

    PSAMPLE_ATTR_IIFINDEX - the packet's input ifindex, if applicable

    PSAMPLE_ATTR_OIFINDEX - the packet's output ifindex, if applicable

    PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been
    truncated during sampling

    PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the
    user who initiated the sampling. This field allows the user to
    differentiate between several samplers working simultaneously and
    filter packets relevant to him

    PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The
    sequence is kept for each group

    PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets

    PSAMPLE_ATTR_DATA - the actual packet bits

    The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast
    group. In addition, add the GET_GROUPS netlink command which allows the
    user to see the current sample groups, their refcount and sequence number.
    This command currently supports only netlink dump mode.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Reviewed-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
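
How the metadata fields listed above fit together can be illustrated with a small Python model. The attribute names come from the message; everything else is an assumption for illustration, in particular `SampleGroup`, `build_sample`, and the choice to number samples from 0 and bump the counter after each one.

```python
from dataclasses import dataclass


@dataclass
class SampleGroup:
    """Per-group state: the group number and its own sequence counter."""
    group_num: int
    seq: int = 0


def build_sample(group, data, rate, trunc=None, iif=None, oif=None):
    """Build the metadata dict for one sampled packet."""
    payload = data if trunc is None else data[:trunc]
    meta = {
        "PSAMPLE_ATTR_SAMPLE_GROUP": group.group_num,
        "PSAMPLE_ATTR_GROUP_SEQ": group.seq,
        "PSAMPLE_ATTR_SAMPLE_RATE": rate,
        "PSAMPLE_ATTR_ORIGSIZE": len(data),  # size before truncation
        "PSAMPLE_ATTR_DATA": payload,
    }
    # The ifindex attributes are only present "if applicable".
    if iif is not None:
        meta["PSAMPLE_ATTR_IIFINDEX"] = iif
    if oif is not None:
        meta["PSAMPLE_ATTR_OIFINDEX"] = oif
    group.seq += 1  # the sequence is kept for each group
    return meta
```

A listener can then filter on PSAMPLE_ATTR_SAMPLE_GROUP to separate several samplers, and compare PSAMPLE_ATTR_ORIGSIZE with the data length to detect truncation.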
     
  • Florian Fainelli says:

    ====================
    net: couple mdio_module_driver changes

    Small patch series fixing a comment for mdio_module_driver and
    finally utilizing it in b53_mdio.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Eliminate a bit of boilerplate code.

    Signed-off-by: Florian Fainelli
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • The module boilerplate macro is named mdio_module_driver and not
    module_mdio_driver, fix that.

    Fixes: a9049e0c513c ("mdio: Add support for mdio drivers.")
    Signed-off-by: Florian Fainelli
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Martin Blumenstingl says:

    ====================
    stmmac: dwmac-meson8b: configurable RGMII TX delay

    Currently the dwmac-meson8b stmmac glue driver uses a hardcoded 1/4
    cycle (= 2ns) TX clock delay. This seems to work fine for many boards
    (for example Odroid-C2 or Amlogic's reference boards) but there are
    some others where TX traffic is simply broken.
    There are probably multiple reasons why it's working on some boards
    while it's broken on others:
    - some of Amlogic's reference boards are using a Micrel PHY
    - hardware circuit design
    - maybe more...

    iperf3 results on my Mecool BB2 board (Meson GXM, RTL8211F PHY) with
    TX clock delay disabled on the MAC (as it's enabled in the PHY driver).
    TX throughput was virtually zero before:
    $ iperf3 -c 192.168.1.100 -R
    Connecting to host 192.168.1.100, port 5201
    Reverse mode, remote host 192.168.1.100 is sending
    [ 4] local 192.168.1.206 port 52828 connected to 192.168.1.100 port 5201
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.00-1.00 sec 108 MBytes 901 Mbits/sec
    [ 4] 1.00-2.00 sec 94.2 MBytes 791 Mbits/sec
    [ 4] 2.00-3.00 sec 96.5 MBytes 810 Mbits/sec
    [ 4] 3.00-4.00 sec 96.2 MBytes 808 Mbits/sec
    [ 4] 4.00-5.00 sec 96.6 MBytes 810 Mbits/sec
    [ 4] 5.00-6.00 sec 96.5 MBytes 810 Mbits/sec
    [ 4] 6.00-7.00 sec 96.6 MBytes 810 Mbits/sec
    [ 4] 7.00-8.00 sec 96.5 MBytes 809 Mbits/sec
    [ 4] 8.00-9.00 sec 105 MBytes 884 Mbits/sec
    [ 4] 9.00-10.00 sec 111 MBytes 934 Mbits/sec
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bandwidth Retr
    [ 4] 0.00-10.00 sec 1000 MBytes 839 Mbits/sec 0 sender
    [ 4] 0.00-10.00 sec 998 MBytes 837 Mbits/sec receiver

    iperf Done.
    $ iperf3 -c 192.168.1.100
    Connecting to host 192.168.1.100, port 5201
    [ 4] local 192.168.1.206 port 52832 connected to 192.168.1.100 port 5201
    [ ID] Interval Transfer Bandwidth Retr Cwnd
    [ 4] 0.00-1.01 sec 99.5 MBytes 829 Mbits/sec 117 139 KBytes
    [ 4] 1.01-2.00 sec 105 MBytes 884 Mbits/sec 129 70.7 KBytes
    [ 4] 2.00-3.01 sec 107 MBytes 889 Mbits/sec 106 187 KBytes
    [ 4] 3.01-4.01 sec 105 MBytes 878 Mbits/sec 92 143 KBytes
    [ 4] 4.01-5.00 sec 105 MBytes 882 Mbits/sec 140 129 KBytes
    [ 4] 5.00-6.01 sec 106 MBytes 883 Mbits/sec 115 195 KBytes
    [ 4] 6.01-7.00 sec 102 MBytes 863 Mbits/sec 133 70.7 KBytes
    [ 4] 7.00-8.01 sec 106 MBytes 884 Mbits/sec 143 97.6 KBytes
    [ 4] 8.01-9.01 sec 104 MBytes 875 Mbits/sec 124 107 KBytes
    [ 4] 9.01-10.01 sec 105 MBytes 876 Mbits/sec 90 139 KBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bandwidth Retr
    [ 4] 0.00-10.01 sec 1.02 GBytes 874 Mbits/sec 1189 sender
    [ 4] 0.00-10.01 sec 1.02 GBytes 873 Mbits/sec receiver

    iperf Done.

    I get similar TX throughput on my Meson GXBB "MXQ Pro+" board when I
    disable the PHY's TX-delay and configure a 4ns TX-delay on the MAC.
    So changes to at least the RTL8211F PHY driver are needed to get it
    working properly in all situations.

    Changes since v4:
    - add a fallback of 2ns (the value which was previously hardcoded) for
    the TX delay so we are backwards-compatible with older .dts'
    - update the documentation with the new fallback value and add a small
    note that the "amlogic,tx-delay-ns" property is ignored when the phy-mode
    is "rmii".

    Changes since v3:
    - rebased to apply against current net-next branch (fixes a conflict
    with d2ed0a7755fe14c7 "net: ethernet: stmmac: fix of-node and
    fixed-link-phydev leaks")

    Changes since v2:
    - moved all .dts patches (3-7) to a separate series
    - removed the default 2ns TX delay when phy-mode RGMII is specified
    - (rebased against current net-next)

    Changes since v1:
    - renamed the devicetree property "amlogic,tx-delay" to
    "amlogic,tx-delay-ns", which makes the .dts easier to read as we can
    simply specify human-readable values instead of having "preprocessor
    defines and calculation in human brain". Thanks to Andrew Lunn for
    the suggestion!
    - improved documentation to indicate when the MAC TX-delay should be
    configured and how to use the PHY's TX-delay
    - changed the default TX-delay in the dwmac-meson8b driver from 2ns
    to 0ns when any of the rgmii-*id modes are used (the 2ns default
    value still applies for phy-mode "rgmii")
    - added patches to properly reset the PHY on Meson GXBB devices and to
    use a configuration similar to the one we use on Meson GXL devices
    (by passing a phy-handle to stmmac and defining the PHY in the mdio0
    bus - patch 3-6)
    - add the "amlogic,tx-delay-ns" property to all boards which are using
    the RGMII PHY (patch 7)
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Prior to this patch we were using a hardcoded RGMII TX clock delay of
    2ns (= 1/4 cycle of the 125MHz RGMII TX clock). This value works for
    many boards, but unfortunately not for all (due to the way the actual
    circuit is designed, sometimes because the TX delay is enabled in the
    PHY, etc.). Making the TX delay on the MAC side configurable allows us
    to support all possible hardware combinations.

    This allows fixing a compatibility issue on some boards, where the
    RTL8211F PHY is configured to generate the TX delay. We can now turn
    off the TX delay in the MAC, because otherwise we would be applying the
    delay twice (which results in non-working TX traffic).

    Signed-off-by: Martin Blumenstingl
    Tested-by: Neil Armstrong
    Signed-off-by: David S. Miller

    Martin Blumenstingl
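
The numbers above can be checked with a few lines of Python. This is an arithmetic sketch only: the helper names are hypothetical, the device tree is modelled as a plain dict, and the 2ns fallback and the "rmii" exception are taken from the cover letter of this series.

```python
RGMII_TX_CLK_MHZ = 125


def cycle_ns(freq_mhz=RGMII_TX_CLK_MHZ):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000 / freq_mhz


def mac_tx_delay_ns(dt_props, phy_mode):
    """MAC-side TX delay for a board, or None when it is not applied."""
    if phy_mode == "rmii":
        return None  # the property is ignored for this phy-mode
    # Older .dts files lack the property, so fall back to the old 2ns.
    return dt_props.get("amlogic,tx-delay-ns", 2)


def total_tx_delay_ns(mac_ns, phy_ns):
    """Delay seen on the wire when both MAC and PHY add one."""
    return mac_ns + phy_ns


print(cycle_ns() / 4)           # 2.0, the previously hardcoded 1/4 cycle
print(total_tx_delay_ns(2, 2))  # 4, i.e. the delay applied twice
```

A board whose PHY already generates the delay would set the property to 0, so that only one delay is applied.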
     
  • This allows configuring the RGMII TX clock delay. The RGMII clock is
    generated by underlying hardware of the Meson 8b / GXBB DWMAC glue.
    The configuration depends on the actual hardware (no delay may be
    needed due to the design of the actual circuit, the PHY might add this
    delay, etc.).

    Signed-off-by: Martin Blumenstingl
    Tested-by: Neil Armstrong
    Acked-by: Rob Herring
    Signed-off-by: David S. Miller

    Martin Blumenstingl
     
  • Remove the wrong !, otherwise we get false positives about having
    multiple CPU interfaces.

    Fixes: b22de490869d ("net: dsa: store CPU switch structure in the tree")
    Signed-off-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Andrew Lunn
     
  • Implements an optional, per bridge port flag and feature to deliver
    multicast packets to any host on the according port via unicast
    individually. This is done by copying the packet per host and
    changing the multicast destination MAC to a unicast one accordingly.

    multicast-to-unicast works on top of the multicast snooping feature of
    the bridge, which means unicast copies are only delivered to hosts which
    are interested in them and have signaled this via IGMP/MLD reports
    previously.

    This feature is intended for interface types which have a more reliable
    and/or efficient way to deliver unicast packets than broadcast ones
    (e.g. wifi).

    However, it should only be enabled on interfaces where no IGMPv2/MLDv1
    report suppression takes place. This feature is disabled by default.

    The initial patch and idea is from Felix Fietkau.

    Signed-off-by: Felix Fietkau
    [linus.luessing@c0d3.blue: various bug + style fixes, commit message]
    Signed-off-by: Linus Lüssing
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Felix Fietkau
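
The per-host copy described above can be sketched in Python as follows. This is an illustrative model, not bridge code: frames are plain dicts, `snooped_hosts` stands in for the state learned from IGMP/MLD reports, and all names are hypothetical.

```python
def deliver(frame, port, snooped_hosts, mcast_to_unicast):
    """Return the frames to transmit on `port` for one multicast frame.

    frame:            dict with "dst", "src" and "payload" keys
    snooped_hosts:    {(port, group_mac): set of host MACs that reported}
    mcast_to_unicast: the per-port flag; off by default
    """
    if not mcast_to_unicast:
        return [frame]  # normal multicast delivery, frame untouched
    hosts = snooped_hosts.get((port, frame["dst"]), ())
    # One copy per interested host, with only the destination MAC changed.
    return [dict(frame, dst=host) for host in sorted(hosts)]
```

If report suppression hides some listeners from the snooping state, those hosts never appear in the table and get no copy, which is why the flag should stay off on such interfaces.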
     
  • Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
    that denotes the first unprivileged inet port in the namespace. To
    disable all privileged ports set this to zero. It also checks for
    overlap with the local port range. The privileged and local range may
    not overlap.

    The use case for this change is to allow containerized processes to bind
    to privileged ports, but prevent them from ever being allowed to modify
    their container's network configuration. The latter is accomplished by
    ensuring that the network namespace is not a child of the user
    namespace. This modification was needed to allow the container manager
    to disable a namespace's privileged port restrictions without exposing
    control of the network namespace to processes in the user namespace.

    Signed-off-by: Krister Johansen
    Signed-off-by: David S. Miller

    Krister Johansen
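
The two rules above (which ports need privilege, and when a setting clashes with the local port range) can be modelled in a few lines of Python. The function names are hypothetical, and the values used in the comments (1024 and 32768-60999) are the usual defaults, used here only as sample inputs.

```python
def needs_privilege(port, unprivileged_port_start=1024):
    """True if binding to `port` requires privilege in this namespace."""
    # Port 0 asks the kernel to pick a port, so it is never privileged.
    return 0 < port < unprivileged_port_start


def overlaps_local_range(unprivileged_port_start, local_low, local_high):
    """True if the privileged range [0, start) meets [local_low, local_high]."""
    lo = max(0, local_low)
    hi = min(unprivileged_port_start - 1, local_high)
    return lo <= hi


# With the sysctl at 0 no port is privileged, so port 80 can be bound freely.
print(needs_privilege(80), needs_privilege(80, unprivileged_port_start=0))
```

A value that would make the two ranges overlap, such as a start of 40000 with a local range of 32768-60999, is the kind of setting the overlap check is meant to refuse.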
     

24 Jan, 2017

10 commits

  • We need to initialize im_node to NULL, otherwise in case of error path
    it gets passed to kfree() as uninitialized pointer.

    Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Daniel Mack says:

    ====================
    bpf: add longest prefix match map

    This patch set adds a longest prefix match algorithm that can be used
    to match IP addresses to a stored set of ranges. It is exposed as a
    bpf map type.

    Internally, data is stored in an unbalanced tree of nodes that has a
    maximum height of n, where n is the prefixlen the trie was created
    with.

    Note that this has nothing to do with fib or fib6 and is in no way meant
    to replace or share code with it. It's rather a much simpler
    implementation that is specifically written with bpf maps in mind.

    Patch 1/3 adds the implementation, 2/3 an extensive test suite and 3/3
    has benchmarking code for the new trie type.

    Feedback is much appreciated.

    Changelog:

    v3 -> v4:
    * David added a 3rd patch that augments map_perf_test for
    LPM trie benchmarks
    * Limit allocation of maps of this new type to CAP_SYS_ADMIN
    for now, as requested by Alexei
    * Add a stub .map_delete_elem so the core does not stumble
    over a NULL pointer when the syscall is invoked
    * Tests for non-power-of-2 prefix lengths were added
    * More comment style fixes

    v2 -> v3:
    * Store both the key match data and the caller provided
    value in the same byte array attached to a node. This
    avoids double allocations
    * Bring back node->flags to distinguish between 'real'
    and intermediate nodes
    * Fix comment style and some typos

    v1 -> v2:
    * Turn spin lock into raw spinlock
    * Lock with irqsave options during trie_update_elem()
    * Return -ENOMEM properly from trie_alloc()
    * Force attr->flags == BPF_F_NO_PREALLOC during creation
    * Set trie->map.pages after creation to account for map memory
    * Allow arbitrary value sizes
    * Removed node->flags and denote intermediate nodes through
    node->value == NULL instead

    rfc -> v1:
    * Add __rcu pointer annotations to make sparse happy
    * Fold _lpm_trie_find_target_node() into its only caller
    * Fix some minor documentation issues
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Extend the map_perf_test_{user,kern}.c infrastructure to stress test
    lpm-trie lookups. We hook into the kprobe on sys_gettid() and measure
    the latency depending on trie size and lookup count.

    On my Intel Haswell i7-6400U, a single gettid() syscall with an empty
    bpf program takes roughly 6.5us on my system. Lookups in empty tries
    take ~1.8us on first try, ~0.9us on retries. Lookups in tries with 8192
    entries take ~7.1us (on the first _and_ any subsequent try).

    Signed-off-by: David Herrmann
    Reviewed-by: Daniel Mack
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    David Herrmann
     
  • The first part of this program runs randomized tests against the
    lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
    based on simple, linear, single linked lists. The implementation
    should be pretty straightforward.

    Based on tlpm, this inserts randomized data into bpf-lpm-maps and
    verifies the trie-based bpf-map implementation behaves the same way
    as tlpm.

    The second part uses 'real world' IPv4 and IPv6 addresses and tests
    the trie with those.

    Signed-off-by: David Herrmann
    Signed-off-by: Daniel Mack
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    David Herrmann
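
A "trivial" reference implementation of the kind described above can be very small. The Python sketch below is not the test program's code; it is an assumed equivalent that keeps entries in a plain list and scans all of them, which is slow but easy to trust as a baseline for comparing against a trie.

```python
def prefix_matches(prefix, prefix_len, key):
    """True if the first `prefix_len` bits of `key` equal those of `prefix`."""
    full, rem = divmod(prefix_len, 8)
    if prefix[:full] != key[:full]:
        return False
    if rem:
        mask = (0xFF << (8 - rem)) & 0xFF  # top `rem` bits of the next byte
        return (prefix[full] & mask) == (key[full] & mask)
    return True


def tlpm_lookup(entries, key):
    """Linear longest-prefix match over (prefix_bytes, prefix_len, value)."""
    best = None
    for prefix, plen, value in entries:
        if prefix_matches(prefix, plen, key):
            if best is None or plen > best[0]:
                best = (plen, value)
    return None if best is None else best[1]
```

A randomized test then inserts the same data into both structures and checks that every lookup returns the same result.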
     
  • This trie implements a longest prefix match algorithm that can be used
    to match IP addresses to a stored set of ranges.

    Internally, data is stored in an unbalanced trie of nodes that has a
    maximum height of n, where n is the prefixlen the trie was created
    with.

    Tries may be created with prefix lengths that are multiples of 8, in
    the range from 8 to 2048. The key used for lookup and update operations
    is a struct bpf_lpm_trie_key, and the value is a uint64_t.

    The code carries more information about the internal implementation.

    Signed-off-by: Daniel Mack
    Reviewed-by: David Herrmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Mack
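
The height bound mentioned above is easy to see in a plain bitwise trie. The Python sketch below is a simplified stand-in, not the kernel data structure: it allocates one node per key bit instead of sharing path segments through intermediate nodes, and the class and method names are hypothetical.

```python
class Node:
    def __init__(self):
        self.child = [None, None]  # subtrees for next bit 0 and 1
        self.value = None
        self.has_value = False


def bit_at(data, i):
    """Bit i of a byte string, counting from the most significant bit."""
    return (data[i // 8] >> (7 - i % 8)) & 1


class LpmTrie:
    def __init__(self, max_prefixlen):
        self.max_prefixlen = max_prefixlen
        self.root = Node()

    def update(self, prefixlen, data, value):
        if not 0 <= prefixlen <= self.max_prefixlen:
            raise ValueError("prefix length out of range")
        node = self.root
        for i in range(prefixlen):  # at most max_prefixlen levels deep
            b = bit_at(data, i)
            if node.child[b] is None:
                node.child[b] = Node()
            node = node.child[b]
        node.value, node.has_value = value, True

    def lookup(self, data):
        node = self.root
        best = node.value if node.has_value else None
        for i in range(self.max_prefixlen):
            node = node.child[bit_at(data, i)]
            if node is None:
                break
            if node.has_value:  # a longer matching prefix was found
                best = node.value
        return best
```

A lookup walks down from the root and remembers the last node that carried a value, so the deepest such node is the longest matching prefix.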
     
  • Declare net_device_ops structure as const as it is only stored in
    the netdev_ops field of a net_device structure. This field is of type
    const, so net_device_ops structures having same properties can be made
    const too.
    Done using Coccinelle:

    @r1 disable optional_qualifier@
    identifier i;
    position p;
    @@
    static struct net_device_ops i@p={...};

    @ok1@
    identifier r1.i;
    position p;
    struct net_device ndev;
    @@
    ndev.netdev_ops=&i@p

    @bad@
    position p!={r1.p,ok1.p};
    identifier r1.i;
    @@
    i@p

    @depends on !bad disable optional_qualifier@
    identifier r1.i;
    @@
    +const
    struct net_device_ops i;

    File size before:
    text data bss dec hex filename
    6201 744 0 6945 1b21 ethernet/xilinx/xilinx_emaclite.o

    File size after:
    text data bss dec hex filename
    6745 192 0 6937 1b19 ethernet/xilinx/xilinx_emaclite.o

    Signed-off-by: Bhumika Goyal
    Signed-off-by: David S. Miller

    Bhumika Goyal
     
  • Declare net_device_ops structure as const as it is only stored in
    the netdev_ops field of a net_device structure. This field is of type
    const, so net_device_ops structures having same properties can be made
    const too.
    Done using Coccinelle:

    @r1 disable optional_qualifier@
    identifier i;
    position p;
    @@
    static struct net_device_ops i@p={...};

    @ok1@
    identifier r1.i;
    position p;
    struct net_device ndev;
    @@
    ndev.netdev_ops=&i@p

    @bad@
    position p!={r1.p,ok1.p};
    identifier r1.i;
    @@
    i@p

    @depends on !bad disable optional_qualifier@
    identifier r1.i;
    @@
    +const
    struct net_device_ops i;

    File size before:
    text data bss dec hex filename
    4821 744 0 5565 15bd ethernet/moxa/moxart_ether.o

    File size after:
    text data bss dec hex filename
    5373 192 0 5565 15bd ethernet/moxa/moxart_ether.o

    Signed-off-by: Bhumika Goyal
    Signed-off-by: David S. Miller

    Bhumika Goyal
     
  • During reset, functions emac_mac_down() and emac_mac_up() are called,
    so we don't want to free and claim the IRQ unnecessarily. Move those
    operations to open/close.

    Signed-off-by: Timur Tabi
    Reviewed-by: Lino Sanfilippo
    Signed-off-by: David S. Miller

    Timur Tabi
     
  • The EMAC has an internal PHY that is often called the "SGMII". This
    SGMII is also connected to an external PHY, which is managed by phylib.
    These dual PHYs often cause confusion. In this case, the data structure
    for managing the SGMII was mis-named and located in the wrong header file.

    Structure emac_phy is renamed to emac_sgmii to clearly indicate it applies
    to the internal PHY only. It is also moved from emac_phy.h (which
    supports the external PHY) to emac_sgmii.h (where it belongs).

    To keep the changes minimal, only the structure name is changed, not
    the names of any variables of that type.

    Signed-off-by: Timur Tabi
    Signed-off-by: David S. Miller

    Timur Tabi
     
  • Commit 4cace675d687 ("bnx2x: Alloc 4k fragment for each rx ring buffer
    element") added extra put_page() and get_page() calls on arches where
    PAGE_SIZE=4K, like x86.

    Reorder things to avoid this overhead.

    Signed-off-by: Eric Dumazet
    Cc: Gabriel Krisman Bertazi
    Cc: Yuval Mintz
    Cc: Ariel Elior
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Jan, 2017

5 commits