08 Jun, 2017

3 commits

  • Provide a control message that can be specified on the first sendmsg() of a
    client call or the first sendmsg() of a service response to indicate the
    total length of the data to be transmitted for that call.

    Currently, because the length of the payload of an encrypted DATA packet is
    encrypted in front of the data, the packet cannot be encrypted until we
    know how much data it will hold.

    By specifying the length at the beginning of the transmit phase, each DATA
    packet length can be set before we start loading data from userspace (where
    several sendmsg() calls may contribute to a particular packet).

    An error will be returned if too little or too much data is presented in
    the Tx phase.

    Signed-off-by: David Howells

    David Howells
     
  • Consolidate the sendmsg control message parameters into a struct rather
    than passing them individually through the argument list of
    rxrpc_sendmsg_cmsg(). This makes it easier to add more parameters.

    Signed-off-by: David Howells

    David Howells
     
  • Provide a getsockopt() call that can query what cmsg types are supported by
    AF_RXRPC.

    David Howells
     

07 Jun, 2017

37 commits

  • Just some simple overlapping changes in the Marvell PHY driver
    and the DSA core code.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Russell King says:

    ====================
    net: Add phylib support for MV88X3310 10G phy

    This patch series adds support for the Marvell 88x3310 PHY found on
    the SolidRun Macchiatobin board.

    The first patch introduces a set of generic Clause 45 PHY helpers that
    C45 PHY drivers can make use of if they wish.

    Patch 2 ensures that the Clause 22 aneg_done function will not be
    called for incompatible Clause 45 PHYs.

    Patch 3 fixes the aneg restart to be compatible with C45 PHYs - it can
    currently only cope with C22 PHYs.

    Patch 4 moves the "gen10g" driver into the Clause 45 code, grouping all
    core clause 45 code together.

    Patch 5 adds the phy_interface_t types for XAUI and 10GBase-KR links.
    As 10GBase-KR appears to be compatible with XFI and SFI, I currently
    see no reason to add XFI and SFI interface modes. There seems to be
    vendor code out there using these, but they all alias back to the
    same hardware settings.

    Patch 6 adds support for the MV88X3310 PHY, which supports both the
    copper and fiber interfaces. It should be noted that the MV88X3310
    automatically switches its MAC facing interface between 10GBase-KR
    and SGMII depending on the negotiated speed. This was discussed with
    Florian, and we agreed to update the phy interface mode depending on
    the properties of the actual link mode to the PHY.

    v2:
    - update sysfs-class-net-phydev documentation
    - avoid genphy_aneg_done for non-C22 PHYs
    - expand comment about 0x30 constant
    - add comment about lack of reset
    - configure driver using MARVELL_10G_PHY
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add phylib support for the Marvell Alaska X 10 Gigabit PHY (MV88X3310).
    This PHY is able to operate at 10G, 1G, 100M and 10M speeds, and only
    supports Clause 45 accesses.

    The PHY appears (based on the vendor IDs) to be two different vendors'
    IP, with each devad containing several instances.

    This PHY driver has only been tested with the RJ45 copper port, fiber
    port and a Marvell Armada 8040-based ethernet interface.

    It should be noted that to use the full range of speeds, MAC drivers
    need to also reconfigure the link mode as per phydev->interface, since
    the PHY automatically changes its interface mode depending on the
    negotiated speed.

    Signed-off-by: Russell King
    Reviewed-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Russell King
     
  • XAUI allows XGMII to reach an extended distance by using an XGXS layer at
    each end of the MAC to PHY link, operating over four Serdes lanes.

    10GBASE-KR is a single lane Serdes backplane ethernet connection method
    with autonegotiation on the link. Some PHYs use this to connect to the
    ethernet interface at 10G speeds, switching to other connection types
    when utilising slower speeds.

    10GBASE-KR is also used for XFI and SFI to connect to XFP and SFP fiber
    modules.

    Signed-off-by: Russell King
    Reviewed-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Russell King
     
  • Move the old 10G genphy support to sit beside the new clause 45 library
    functions, so all the 10G phy code is together.

    Reviewed-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: Russell King
    Signed-off-by: David S. Miller

    Russell King
     
  • genphy_restart_aneg() can only restart autonegotiation on clause 22
    PHYs. Add a phy_restart_aneg() function which selects between the
    clause 22 and clause 45 restart functionality depending on the PHY
    type and whether the Clause 45 PHY supports the Clause 22 register set.

    Signed-off-by: Russell King
    Reviewed-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Russell King
     
  • Avoid calling genphy_aneg_done() for PHYs that do not implement the
    Clause 22 register set.

    Clause 45 PHYs may implement the Clause 22 register set along with the
    Clause 22 extension MMD. Hence, we can't simply block access to the
    Clause 22 functions based on the PHY being a Clause 45 PHY.

    Signed-off-by: Russell King
    Reviewed-by: Andrew Lunn
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Russell King
     
  • Add generic helpers for 802.3 clause 45 PHYs for >= 10Gbps support.

    Reviewed-by: Andrew Lunn
    Signed-off-by: Russell King
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Russell King
     
  • Pull networking fixes from David Miller:

    1) Made TCP congestion control documentation match current reality,
    from Anmol Sarma.

    2) Various build warning and failure fixes from Arnd Bergmann.

    3) Fix SKB list leak in ipv6_gso_segment().

    4) Use after free in ravb driver, from Eugeniu Rosca.

    5) Don't use udp_poll() in ping protocol driver, from Eric Dumazet.

    6) Don't crash in PCI error recovery of cxgb4 driver, from Guilherme
    Piccoli.

    7) _SRC_NAT_DONE_BIT needs to be cleared using atomics, from Liping
    Zhang.

    8) Use after free in vxlan deletion, from Mark Bloch.

    9) Fix ordering of NAPI poll enabled in ethoc driver, from Max
    Filippov.

    10) Fix stmmac hangs with TSO, from Niklas Cassel.

    11) Fix crash in CALIPSO ipv6, from Richard Haines.

    12) Clear nh_flags properly on mpls link up. From Roopa Prabhu.

    13) Fix regression in sk_err socket error queue handling, noticed by
    ping applications. From Soheil Hassas Yeganeh.

    14) Update mlx4/mlx5 MAINTAINERS information.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (78 commits)
    net: stmmac: fix a broken u32 less than zero check
    net: stmmac: fix completely hung TX when using TSO
    net: ethoc: enable NAPI before poll may be scheduled
    net: bridge: fix a null pointer dereference in br_afspec
    ravb: Fix use-after-free on `ifconfig eth0 down`
    net/ipv6: Fix CALIPSO causing GPF with datagram support
    net: stmmac: ensure jumbo_frm error return is correctly checked for -ve value
    Revert "sit: reload iphdr in ipip6_rcv"
    i40e/i40evf: proper update of the page_offset field
    i40e: Fix state flags for bit set and clean operations of PF
    iwlwifi: fix host command memory leaks
    iwlwifi: fix min API version for 7265D, 3168, 8000 and 8265
    iwlwifi: mvm: clear new beacon command template struct
    iwlwifi: mvm: don't fail when removing a key from an inexisting sta
    iwlwifi: pcie: only use d0i3 in suspend/resume if system_pm is set to d0i3
    iwlwifi: mvm: fix firmware debug restart recording
    iwlwifi: tt: move ucode_loaded check under mutex
    iwlwifi: mvm: support ibss in dqa mode
    iwlwifi: mvm: Fix command queue number on d0i3 flow
    iwlwifi: mvm: rs: start using LQ command color
    ...

    Linus Torvalds
     
  • Pull sparc fixes from David Miller:

    1) Fix TLB context wrap races, from Pavel Tatashin.

    2) Cure some gcc-7 build issues.

    3) Handle invalid setup_hugepagesz command line values properly, from
    Liam R Howlett.

    4) Copy TSB using the correct address shift for the huge TSB, from Mike
    Kravetz.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc64: delete old wrap code
    sparc64: new context wrap
    sparc64: add per-cpu mm of secondary contexts
    sparc64: redefine first version
    sparc64: combine activate_mm and switch_mm
    sparc64: reset mm cpumask after wrap
    sparc/mm/hugepages: Fix setup_hugepagesz for invalid values.
    sparc: Machine description indices can vary
    sparc64: mm: fix copy_tsb to correctly copy huge page TSBs
    arch/sparc: support NR_CPUS = 4096
    sparc64: Add __multi3 for gcc 7.x and later.
    sparc64: Fix build warnings with gcc 7.
    arch/sparc: increase CONFIG_NODES_SHIFT on SPARC64 to 5

    Linus Torvalds
     
  • GCC explicitly does not warn for unused static inline functions for
    -Wunused-function. The manual states:

    Warn whenever a static function is declared but not defined or
    a non-inline static function is unused.

    Clang does warn for static inline functions that are unused.

    It turns out that suppressing the warnings avoids potentially complex
    #ifdef directives, which also reduces LOC.

    Suppress the warning for clang.

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Pavel Tatashin says:

    ====================
    sparc64: context wrap fixes

    This patch series contains fixes for context wrap: when we are out of
    context ids, and need to get a new version.

    It fixes memory corruption issues which happen when more processes are
    started simultaneously than there are context IDs (currently set to
    8K), and a process can get a wrong context.

    sparc64: new context wrap:
    - contains explanation of new wrap method, and also explanation of races
    that it solves
    sparc64: reset mm cpumask after wrap
    - explains the issue of not resetting the cpu mask on a wrap
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The old method, which used an xcall and softint to get a new context id,
    is deleted; it is replaced by a method that uses per_cpu_secondary_mm
    to perform the context wrap without an xcall.

    Signed-off-by: Pavel Tatashin
    Reviewed-by: Bob Picco
    Reviewed-by: Steven Sistare
    Signed-off-by: David S. Miller

    Pavel Tatashin
     
  • The current wrap implementation has a race issue: it is called outside of
    the ctx_alloc_lock, and also does not wait for all CPUs to complete the
    wrap. This means that a thread can get a new context with a new version
    and another thread might still be running with the same context. The
    problem is especially severe on CPUs with shared TLBs, like sun4v. I used
    the following test to very quickly reproduce the problem:
    - start over 8K processes (must be more than context IDs)
    - write and read values at a memory location in every process.

    Very quickly memory corruptions start happening, and what we read back
    does not equal what we wrote.

    Several approaches were explored before settling on this one:

    Approach 1:
    Move smp_new_mmu_context_version() inside ctx_alloc_lock, and wait for
    every process to complete the wrap. (Note: every CPU must WAIT before
    leaving smp_new_mmu_context_version_client() until every one arrives).

    This approach ends up with deadlocks, as some threads own locks which other
    threads are waiting for, and they never receive softint until these threads
    exit smp_new_mmu_context_version_client(). Since we do not allow the exit,
    deadlock happens.

    Approach 2:
    Handle wrap right during mondo interrupt. Use etrap/rtrap to enter
    into C code, and issue new versions to every CPU.
    This approach adds some overhead to runtime: in switch_mm() we must add
    some checks to make sure that versions have not changed due to wrap while
    we were loading the new secondary context. (This could be protected by
    PSTATE_IE, but that degrades performance on M7 and older CPUs, as it
    takes 50 cycles for each access.) Also, we still need a global per-cpu
    array of MMs to know where we need to load new contexts, otherwise we can
    change context to a thread that is going away (if we received the mondo
    between switch_mm() and switch_to() time). Finally, there are some issues
    with window registers in rtrap() when context IDs are changed during CPU
    mondo time.

    The approach in this patch is the simplest and has almost no impact on
    runtime. We use the array of mm's whose secondary contexts were last
    loaded onto the CPUs and bump their versions to the new generation
    without changing context IDs. If a new process comes in to get a context
    ID, it will go through get_new_mmu_context() because of the version
    mismatch. But the running processes do not need to be interrupted. And
    the wrap is quicker, as we do not need to xcall and wait for everyone to
    receive and complete the wrap.

    Signed-off-by: Pavel Tatashin
    Reviewed-by: Bob Picco
    Reviewed-by: Steven Sistare
    Signed-off-by: David S. Miller

    Pavel Tatashin
     
  • The new wrap is going to use information from this array to figure out
    mm's that currently have valid secondary contexts set up.

    Signed-off-by: Pavel Tatashin
    Reviewed-by: Bob Picco
    Reviewed-by: Steven Sistare
    Signed-off-by: David S. Miller

    Pavel Tatashin
     
  • CTX_FIRST_VERSION defines the first context version, but it also defines
    the first context. This patch redefines it to include only the first
    context version.

    Signed-off-by: Pavel Tatashin
    Reviewed-by: Bob Picco
    Reviewed-by: Steven Sistare
    Signed-off-by: David S. Miller

    Pavel Tatashin
     
  • The only difference between these two functions is that in activate_mm we
    unconditionally flush context. However, there is no need to keep this
    difference after fixing a bug where cpumask was not reset on a wrap. So, in
    this patch we combine these.

    Signed-off-by: Pavel Tatashin
    Reviewed-by: Bob Picco
    Reviewed-by: Steven Sistare
    Signed-off-by: David S. Miller

    Pavel Tatashin
     
  • After a wrap (getting a new context version) a process must get a new
    context id, which means that we would need to flush the context id from
    the TLB before running for the first time with this ID on every CPU. But,
    we use mm_cpumask to determine if this process has been running on this CPU
    before, and this mask is not reset after a wrap. So, there are two possible
    fixes for this issue:

    1. Clear mm cpumask whenever mm gets a new context id
    2. Unconditionally flush context every time process is running on a CPU

    This patch implements the first solution.

    Signed-off-by: Pavel Tatashin
    Reviewed-by: Bob Picco
    Reviewed-by: Steven Sistare
    Signed-off-by: David S. Miller

    Pavel Tatashin
     
  • hugetlb_bad_size needs to be called on invalid values. Also change the
    pr_warn to a pr_err to better align with other platforms.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: David S. Miller

    Liam R. Howlett
     
  • VIO devices were being looked up by their index in the machine
    description node block, but this often varies over time as devices are
    added and removed. Instead, store the ID and look up using the type,
    config handle and ID.

    Signed-off-by: James Clarke
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112541
    Signed-off-by: David S. Miller

    James Clarke
     
  • When a TSB grows beyond its current capacity, a new TSB is allocated
    and copy_tsb is called to copy entries from the old TSB to the new.
    A hash shift based on page size is used to calculate the index of an
    entry in the TSB. copy_tsb has hard coded PAGE_SHIFT in these
    calculations. However, for huge page TSBs the value REAL_HPAGE_SHIFT
    should be used. As a result, when copy_tsb is called for a huge page
    TSB the entries are placed at the incorrect index in the newly
    allocated TSB. When doing a hardware table walk, the MMU does not
    match these entries and we end up in the TSB miss handling code.
    This code will then create and write an entry to the correct index
    in the TSB. We take a performance hit for the table walk miss and
    recreation of these entries.

    Pass a new parameter to copy_tsb that is the page size shift to be
    used when copying the TSB.

    Suggested-by: Anthony Yznaga
    Signed-off-by: Mike Kravetz
    Signed-off-by: David S. Miller

    Mike Kravetz
     
  • Linux SPARC64 limits NR_CPUS to 4064 because init_cpu_send_mondo_info()
    only allocates a single page for NR_CPUS mondo entries. Thus we cannot
    use all 4096 CPUs on some SPARC platforms.

    To fix, allocate (2^order) pages where order is set according to the size
    of cpu_list for possible cpus. Since cpu_list_pa and cpu_mondo_block_pa
    are not used in asm code, there are no imm13 offsets from the base PA
    that will break because they can only reach one page.

    Orabug: 25505750

    Signed-off-by: Jane Chu

    Reviewed-by: Bob Picco
    Reviewed-by: Atish Patra
    Signed-off-by: David S. Miller

    Jane Chu
     
  • Commit fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to
    access sk_buff") enabled programs of BPF_PROG_TYPE_CGROUP_SKB
    type to use ld_abs/ind instructions. However, at this point,
    we cannot use them, since offsets relative to SKF_LL_OFF will
    end up pointing skb_mac_header(skb) out of bounds since in the
    egress path it is not yet set at that point in time, but only
    after __dev_queue_xmit() did a general reset on the mac header.
    bpf_internal_load_pointer_neg_helper() will then end up reading
    data from a wrong offset.

    BPF_PROG_TYPE_CGROUP_SKB programs can use bpf_skb_load_bytes()
    already to access packet data, which is also more flexible than
    the insns carried over from cBPF.

    Fixes: fb9a307d11d6 ("bpf: Allow CGROUP_SKB eBPF program to access sk_buff")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Chenbo Feng
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • The check that queue is less than or equal to zero is always true
    because queue is a u32; queue is decremented and will wrap around
    and never go negative. Fix this by making queue an int.

    Detected by CoverityScan, CID#1428988 ("Unsigned compared against 0")

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • stmmac_tso_allocator can fail to set the Last Descriptor bit
    on a descriptor that actually was the last descriptor.

    This happens when the buffer of the last descriptor ends
    up having a size of exactly TSO_MAX_BUFF_SIZE.

    When the IP eventually reaches the next last descriptor,
    which actually has the bit set, the DMA will hang.

    When the DMA hangs, we get a tx timeout, however,
    since stmmac does not do a complete reset of the IP
    in stmmac_tx_timeout, we end up in a state with
    completely hung TX.

    Signed-off-by: Niklas Cassel
    Acked-by: Giuseppe Cavallaro
    Acked-by: Alexandre TORGUE
    Signed-off-by: David S. Miller

    Niklas Cassel
     
  • Tun actually expects a symmetric hash for queue selection to work
    correctly, otherwise packets belonging to a single flow may be
    redirected to the wrong queue. So this patch switches to using
    __skb_get_hash_symmetric().

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • ethoc_reset enables device interrupts, so ethoc_interrupt may schedule a
    NAPI poll before NAPI is enabled in ethoc_open, which results in the
    device being unable to send or receive anything until it is closed and
    reopened. In case the device is flooded with ingress packets it may be
    unable to recover at all.
    Move napi_enable above ethoc_reset in ethoc_open to fix that.

    Fixes: a1702857724f ("net: Add support for the OpenCores 10/100 Mbps Ethernet MAC.")
    Signed-off-by: Max Filippov
    Reviewed-by: Tobias Klauser
    Reviewed-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Max Filippov
     
  • We might call br_afspec() with p == NULL, which is a valid use case if
    the action is on the bridge device itself, but the bridge tunnel code
    dereferences the p pointer without checking. Check if p is null
    first.

    Reported-by: Gustavo A. R. Silva
    Fixes: efa5356b0d97 ("bridge: per vlan dst_metadata netlink support")
    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • The register bits used for the frame mode were masked with DSA (0x1)
    instead of the mask value (0x3) in the 6085 implementation of
    port_set_frame_mode. Fix this.

    Fixes: 56995cbc3540 ("net: dsa: mv88e6xxx: Refactor CPU and DSA port setup")
    Signed-off-by: Vivien Didelot
    Signed-off-by: David S. Miller

    Vivien Didelot
     
  • Commit a47b70ea86bd ("ravb: unmap descriptors when freeing rings") has
    introduced the issue seen in [1] reproduced on H3ULCB board.

    Fix this by relocating the RX skb ringbuffer free operation, so that
    swiotlb page unmapping can be done first. Freeing of aligned TX buffers
    is not relevant to the issue seen in [1]. Still, reposition TX free
    calls as well, to have all kfree() operations performed consistently
    _after_ dma_unmap_*()/dma_free_*().

    [1] Console screenshot with the problem reproduced:

    salvator-x login: root
    root@salvator-x:~# ifconfig eth0 up
    Micrel KSZ9031 Gigabit PHY e6800000.ethernet-ffffffff:00: \
    attached PHY driver [Micrel KSZ9031 Gigabit PHY] \
    (mii_bus:phy_addr=e6800000.ethernet-ffffffff:00, irq=235)
    IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
    root@salvator-x:~#
    root@salvator-x:~# ifconfig eth0 down

    ==================================================================
    BUG: KASAN: use-after-free in swiotlb_tbl_unmap_single+0xc4/0x35c
    Write of size 1538 at addr ffff8006d884f780 by task ifconfig/1649

    CPU: 0 PID: 1649 Comm: ifconfig Not tainted 4.12.0-rc4-00004-g112eb07287d1 #32
    Hardware name: Renesas H3ULCB board based on r8a7795 (DT)
    Call trace:
    [] dump_backtrace+0x0/0x3a4
    [] show_stack+0x14/0x1c
    [] dump_stack+0xf8/0x150
    [] print_address_description+0x7c/0x330
    [] kasan_report+0x2e0/0x2f4
    [] check_memory_region+0x20/0x14c
    [] memcpy+0x48/0x68
    [] swiotlb_tbl_unmap_single+0xc4/0x35c
    [] unmap_single+0x90/0xa4
    [] swiotlb_unmap_page+0xc/0x14
    [] __swiotlb_unmap_page+0xcc/0xe4
    [] ravb_ring_free+0x514/0x870
    [] ravb_close+0x288/0x36c
    [] __dev_close_many+0x14c/0x174
    [] __dev_close+0xc8/0x144
    [] __dev_change_flags+0xd8/0x194
    [] dev_change_flags+0x60/0xb0
    [] devinet_ioctl+0x484/0x9d4
    [] inet_ioctl+0x190/0x194
    [] sock_do_ioctl+0x78/0xa8
    [] sock_ioctl+0x110/0x3c4
    [] vfs_ioctl+0x90/0xa0
    [] do_vfs_ioctl+0x148/0xc38
    [] SyS_ioctl+0x44/0x74
    [] el0_svc_naked+0x24/0x28

    The buggy address belongs to the page:
    page:ffff7e001b6213c0 count:0 mapcount:0 mapping: (null) index:0x0
    flags: 0x4000000000000000()
    raw: 4000000000000000 0000000000000000 0000000000000000 00000000ffffffff
    raw: 0000000000000000 ffff7e001b6213e0 0000000000000000 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8006d884f680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ffff8006d884f700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    >ffff8006d884f780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ^
    ffff8006d884f800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ffff8006d884f880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    ==================================================================
    Disabling lock debugging due to kernel taint
    root@salvator-x:~#

    Fixes: a47b70ea86bd ("ravb: unmap descriptors when freeing rings")
    Signed-off-by: Eugeniu Rosca
    Acked-by: Sergei Shtylyov
    Signed-off-by: David S. Miller

    Eugeniu Rosca
     
  • Martin KaFai Lau says:

    ====================
    Introduce bpf ID

    This patch series:
    1) Introduce ID for both bpf_prog and bpf_map.
    2) Add bpf commands to iterate the prog IDs and map
    IDs of the system.
    3) Add bpf commands to get a prog/map fd from an ID
    4) Add bpf command to get prog/map info from a fd.
    The prog/map info is a jump start in this patchset and is not
    meant to be a complete list; it can be extended in future patches.

    v3:
    - I suspect v2 may not have applied cleanly.
    In particular, patch 1 has a conflict with a recent
    change to struct bpf_prog_aux introduced in a similar time frame:
    8726679a0fa3 ("bpf: teach verifier to track stack depth")
    v3 should have fixed it.

    v2:
    Compiler warning fixes:
    - Remove lockdep_is_held() usage. Add comment
    to explain the lock situation instead.
    - Add static for idr related variables
    - Add __user to the uattr param in bpf_prog_get_info_by_fd()
    and bpf_map_get_info_by_fd().
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add test to exercise the bpf_prog/map id generation,
    bpf_(prog|map)_get_next_id(), bpf_(prog|map)_get_fd_by_id() and
    bpf_get_obj_info_by_fd().

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • A single BPF_OBJ_GET_INFO_BY_FD cmd is used to obtain the info
    for both bpf_prog and bpf_map. The kernel can figure out whether the
    fd is associated with a bpf_prog or a bpf_map.

    The suggested struct bpf_prog_info and struct bpf_map_info are
    not meant to be a complete list; that is not the goal of this patch.
    New fields can be added in future patches.

    The focus of this patch is to create the interface,
    BPF_OBJ_GET_INFO_BY_FD cmd for exposing the bpf_prog's and
    bpf_map's info.

    The obj's info, which will be extended (and get bigger) over time, is
    separated from the bpf_attr to avoid bloating the bpf_attr.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • Add jited_len to struct bpf_prog. It will be
    useful for the struct bpf_prog_info which will
    be added in a later patch.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • Add BPF_MAP_GET_FD_BY_ID command to allow user to get a fd
    from a bpf_map's ID.

    bpf_map_inc_not_zero() is added and is called with map_idr_lock
    held.

    __bpf_map_put() is also added which has the 'bool do_idr_lock'
    param to decide if the map_idr_lock should be acquired when
    freeing the map->id.

    In the error path of bpf_map_inc_not_zero(), it may have to
    call __bpf_map_put(map, false) which does not need
    to take the map_idr_lock when freeing the map->id.

    It is currently limited to CAP_SYS_ADMIN, which we can
    consider lifting in followup patches.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • Add BPF_PROG_GET_FD_BY_ID command to allow user to get a fd
    from a bpf_prog's ID.

    bpf_prog_inc_not_zero() is added and is called with prog_idr_lock
    held.

    __bpf_prog_put() is also added which has the 'bool do_idr_lock'
    param to decide if the prog_idr_lock should be acquired when
    freeing the prog->id.

    In the error path of bpf_prog_inc_not_zero(), it may have to
    call __bpf_prog_put(prog, false), which does not need
    to take the prog_idr_lock when freeing the prog->id.

    It is currently limited to CAP_SYS_ADMIN, which we can
    consider lifting in followup patches.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • This patch adds BPF_PROG_GET_NEXT_ID and BPF_MAP_GET_NEXT_ID
    to allow userspace to iterate all bpf_prog IDs and bpf_map IDs.

    The API is trying to be consistent with the existing
    BPF_MAP_GET_NEXT_KEY.

    It is currently limited to CAP_SYS_ADMIN, which we can
    consider lifting in followup patches.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau