08 Jul, 2017

1 commit

  • As Hongjun/Nicolas summarized in their original patch:

    "
    When a device changes from one netns to another, it's first unregistered,
    then the netns reference is updated and the dev is registered in the new
    netns. Thus, when a slave moves to another netns, it is first
    unregistered. This triggers a NETDEV_UNREGISTER event which is caught by
    the bonding driver. The driver calls bond_release(), which calls
    dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in
    the old netns).
    "

    This is a very special case: because the device is being unregistered,
    no one should still care about the NETDEV_CHANGEMTU event triggered
    at this point. We can avoid broadcasting this event on this path,
    and avoid touching the inetdev_event()/addrconf_notify() paths.

    This requires exporting __dev_set_mtu() to the bonding driver.

    Reported-by: Hongjun Li
    Reported-by: Nicolas Dichtel
    Cc: Jay Vosburgh
    Cc: Veaceslav Falico
    Cc: Andy Gospodarek
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

06 Jul, 2017

2 commits

  • Pull memdup_user() conversions from Al Viro:
    "A fairly self-contained series - hunting down open-coded memdup_user()
    and memdup_user_nul() instances"

    * 'work.memdup_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    bpf: don't open-code memdup_user()
    kimage_file_prepare_segments(): don't open-code memdup_user()
    ethtool: don't open-code memdup_user()
    do_ip_setsockopt(): don't open-code memdup_user()
    do_ipv6_setsockopt(): don't open-code memdup_user()
    irda: don't open-code memdup_user()
    xfrm_user_policy(): don't open-code memdup_user()
    ima_write_policy(): don't open-code memdup_user_nul()
    sel_write_validatetrans(): don't open-code memdup_user_nul()

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP for cases where using the sch_fq
    packet scheduler is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brakmo. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

05 Jul, 2017

1 commit

  • There appears to be a missing break in the TCP_BPF_SNDCWND_CLAMP case.
    Currently the non-error path where val is greater than zero falls through
    to the default case that sets the error return to -EINVAL. Add in
    the missing break.
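    The fall-through pattern being fixed can be sketched in a minimal
    userspace example (names and values are hypothetical; -22 stands in
    for -EINVAL, and case 1 stands in for TCP_BPF_SNDCWND_CLAMP):

```c
#include <assert.h>

/* Buggy pattern: on the success path (val > 0), control falls
 * through into the default case and the result is clobbered. */
static int set_opt_buggy(int optname, int val)
{
    int ret = 0;

    switch (optname) {
    case 1:
        if (val <= 0)
            ret = -22;      /* -EINVAL */
        /* missing break: falls through even on success */
    default:
        ret = -22;
    }
    return ret;
}

/* Fixed pattern: the added break preserves the success result. */
static int set_opt_fixed(int optname, int val)
{
    int ret = 0;

    switch (optname) {
    case 1:
        if (val <= 0)
            ret = -22;
        break;              /* the fix */
    default:
        ret = -22;          /* unknown option */
    }
    return ret;
}
```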

    Detected by CoverityScan, CID#1449376 ("Missing break in switch")

    Fixes: 13bf96411ad2 ("bpf: Adds support for setting sndcwnd clamp")
    Signed-off-by: Colin Ian King
    Acked-by: Daniel Borkmann
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Colin Ian King
     

04 Jul, 2017

2 commits

  • Pull documentation updates from Jonathan Corbet:
    "There has been a fair amount of activity in the docs tree this time
    around. Highlights include:

    - Conversion of a bunch of security documentation into RST

    - The conversion of the remaining DocBook templates by The Amazing
    Mauro Machine. We can now drop the entire DocBook build chain.

    - The usual collection of fixes and minor updates"

    * tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
    scripts/kernel-doc: handle DECLARE_HASHTABLE
    Documentation: atomic_ops.txt is core-api/atomic_ops.rst
    Docs: clean up some DocBook loose ends
    Make the main documentation title less Geocities
    Docs: Use kernel-figure in vidioc-g-selection.rst
    Docs: fix table problems in ras.rst
    Docs: Fix breakage with Sphinx 1.5 and upper
    Docs: Include the Latex "ifthen" package
    doc/kokr/howto: Only send regression fixes after -rc1
    docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
    doc: Document suitability of IBM Verse for kernel development
    Doc: fix a markup error in coding-style.rst
    docs: driver-api: i2c: remove some outdated information
    Documentation: DMA API: fix a typo in a function name
    Docs: Insert missing space to separate link from text
    doc/ko_KR/memory-barriers: Update control-dependencies example
    Documentation, kbuild: fix typo "minimun" -> "minimum"
    docs: Fix some formatting issues in request-key.rst
    doc: ReSTify keys-trusted-encrypted.txt
    doc: ReSTify keys-request-key.txt
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

03 Jul, 2017

6 commits

  • We need to use refcount_set() on a newly created rule to avoid the
    following error:

    [ 64.601749] ------------[ cut here ]------------
    [ 64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184 refcount_sub_and_test+0x75/0xa0
    [ 64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
    [ 64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: G W 4.12.0-smp-DEV #274
    [ 64.601771] task: ffff8837bf482040 task.stack: ffff8837bdc08000
    [ 64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
    [ 64.601774] RSP: 0018:ffff8837bdc0f5c0 EFLAGS: 00010286
    [ 64.601776] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000000
    [ 64.601777] RDX: 0000000000000026 RSI: 0000000000000096 RDI: ffffed06f7b81eae
    [ 64.601778] RBP: ffff8837bdc0f5d0 R08: 0000000000000004 R09: fffffbfff4a54c25
    [ 64.601779] R10: 00000000cbc500e5 R11: ffffffffa52a6128 R12: ffff881febcf6f24
    [ 64.601779] R13: ffff881fbf4eaf00 R14: ffff881febcf6f80 R15: ffff8837d7a4ed00
    [ 64.601781] FS: 00007ff5a2f6b700(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
    [ 64.601782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 64.601783] CR2: 00007ffcdc70d000 CR3: 0000001f9c91e000 CR4: 00000000001406f0
    [ 64.601783] Call Trace:
    [ 64.601786] refcount_dec_and_test+0x11/0x20
    [ 64.601790] fib_nl_delrule+0xc39/0x1630
    [ 64.601793] ? is_bpf_text_address+0xe/0x20
    [ 64.601795] ? fib_nl_newrule+0x25e0/0x25e0
    [ 64.601798] ? depot_save_stack+0x133/0x470
    [ 64.601801] ? ns_capable+0x13/0x20
    [ 64.601803] ? __netlink_ns_capable+0xcc/0x100
    [ 64.601806] rtnetlink_rcv_msg+0x23a/0x6a0
    [ 64.601808] ? rtnl_newlink+0x1630/0x1630
    [ 64.601811] ? memset+0x31/0x40
    [ 64.601813] netlink_rcv_skb+0x2d7/0x440
    [ 64.601815] ? rtnl_newlink+0x1630/0x1630
    [ 64.601816] ? netlink_ack+0xaf0/0xaf0
    [ 64.601818] ? kasan_unpoison_shadow+0x35/0x50
    [ 64.601820] ? __kmalloc_node_track_caller+0x4c/0x70
    [ 64.601821] rtnetlink_rcv+0x28/0x30
    [ 64.601823] netlink_unicast+0x422/0x610
    [ 64.601824] ? netlink_attachskb+0x650/0x650
    [ 64.601826] netlink_sendmsg+0x7b7/0xb60
    [ 64.601828] ? netlink_unicast+0x610/0x610
    [ 64.601830] ? netlink_unicast+0x610/0x610
    [ 64.601832] sock_sendmsg+0xba/0xf0
    [ 64.601834] ___sys_sendmsg+0x6a9/0x8c0
    [ 64.601835] ? copy_msghdr_from_user+0x520/0x520
    [ 64.601837] ? __alloc_pages_nodemask+0x160/0x520
    [ 64.601839] ? memcg_write_event_control+0xd60/0xd60
    [ 64.601841] ? __alloc_pages_slowpath+0x1d50/0x1d50
    [ 64.601843] ? kasan_slab_free+0x71/0xc0
    [ 64.601845] ? mem_cgroup_commit_charge+0xb2/0x11d0
    [ 64.601847] ? lru_cache_add_active_or_unevictable+0x7d/0x1a0
    [ 64.601849] ? __handle_mm_fault+0x1af8/0x2810
    [ 64.601851] ? may_open_dev+0xc0/0xc0
    [ 64.601852] ? __pmd_alloc+0x2c0/0x2c0
    [ 64.601853] ? __fdget+0x13/0x20
    [ 64.601855] __sys_sendmsg+0xc6/0x150
    [ 64.601856] ? __sys_sendmsg+0xc6/0x150
    [ 64.601857] ? SyS_shutdown+0x170/0x170
    [ 64.601859] ? handle_mm_fault+0x28a/0x650
    [ 64.601861] SyS_sendmsg+0x12/0x20
    [ 64.601863] entry_SYSCALL_64_fastpath+0x13/0x94

    Fixes: 717d1e993ad8 ("net: convert fib_rule.refcnt from atomic_t to refcount_t")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit 9256645af098 ("net/core: relax BUILD_BUG_ON in
    netdev_stats_to_stats64") made it possible for the copy to read
    beyond the size of the source.

    Fix it to copy only the source's size to the destination, since the
    destination might be bigger than the source.
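    The fix described above can be sketched in userspace C (the function
    and struct names here are hypothetical, not the kernel's): bound the
    copy by the source's size and zero the destination's tail so fields
    absent in the source read as zero.

```c
#include <string.h>
#include <stddef.h>

/* Minimal sketch of the fix: when dest may be larger than src, copy
 * at most src_size bytes (never read past the end of src) and zero
 * whatever remains of dest. */
static void copy_stats(void *dest, size_t dest_size,
                       const void *src, size_t src_size)
{
    size_t n = src_size < dest_size ? src_size : dest_size;

    memcpy(dest, src, n);                        /* bounded by src */
    memset((char *)dest + n, 0, dest_size - n);  /* new fields are 0 */
}
```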

    ==================================================================
    BUG: KASAN: slab-out-of-bounds in netdev_stats_to_stats64+0xe/0x30 at addr ffff8801be248b20
    Read of size 192 by task VBoxNetAdpCtl/6734
    CPU: 1 PID: 6734 Comm: VBoxNetAdpCtl Tainted: G O 4.11.4prahal+intel+ #118
    Hardware name: LENOVO 20CDCTO1WW/20CDCTO1WW, BIOS GQET52WW (1.32 ) 05/04/2017
    Call Trace:
    dump_stack+0x63/0x86
    kasan_object_err+0x1c/0x70
    kasan_report+0x270/0x520
    ? netdev_stats_to_stats64+0xe/0x30
    ? sched_clock_cpu+0x1b/0x190
    ? __module_address+0x3e/0x3b0
    ? unwind_next_frame+0x1ea/0xb00
    check_memory_region+0x13c/0x1a0
    memcpy+0x23/0x50
    netdev_stats_to_stats64+0xe/0x30
    dev_get_stats+0x1b9/0x230
    rtnl_fill_stats+0x44/0xc00
    ? nla_put+0xc6/0x130
    rtnl_fill_ifinfo+0xe9e/0x3700
    ? rtnl_fill_vfinfo+0xde0/0xde0
    ? sched_clock+0x9/0x10
    ? sched_clock+0x9/0x10
    ? sched_clock_local+0x120/0x130
    ? __module_address+0x3e/0x3b0
    ? unwind_next_frame+0x1ea/0xb00
    ? sched_clock+0x9/0x10
    ? sched_clock+0x9/0x10
    ? sched_clock_cpu+0x1b/0x190
    ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    ? depot_save_stack+0x1d8/0x4a0
    ? depot_save_stack+0x34f/0x4a0
    ? depot_save_stack+0x34f/0x4a0
    ? save_stack+0xb1/0xd0
    ? save_stack_trace+0x16/0x20
    ? save_stack+0x46/0xd0
    ? kasan_slab_alloc+0x12/0x20
    ? __kmalloc_node_track_caller+0x10d/0x350
    ? __kmalloc_reserve.isra.36+0x2c/0xc0
    ? __alloc_skb+0xd0/0x560
    ? rtmsg_ifinfo_build_skb+0x61/0x120
    ? rtmsg_ifinfo.part.25+0x16/0xb0
    ? rtmsg_ifinfo+0x47/0x70
    ? register_netdev+0x15/0x30
    ? vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
    ? vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
    ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    ? do_vfs_ioctl+0x17f/0xff0
    ? SyS_ioctl+0x74/0x80
    ? do_syscall_64+0x182/0x390
    ? __alloc_skb+0xd0/0x560
    ? __alloc_skb+0xd0/0x560
    ? save_stack_trace+0x16/0x20
    ? init_object+0x64/0xa0
    ? ___slab_alloc+0x1ae/0x5c0
    ? ___slab_alloc+0x1ae/0x5c0
    ? __alloc_skb+0xd0/0x560
    ? sched_clock+0x9/0x10
    ? kasan_unpoison_shadow+0x35/0x50
    ? kasan_kmalloc+0xad/0xe0
    ? __kmalloc_node_track_caller+0x246/0x350
    ? __alloc_skb+0xd0/0x560
    ? kasan_unpoison_shadow+0x35/0x50
    ? memset+0x31/0x40
    ? __alloc_skb+0x31f/0x560
    ? napi_consume_skb+0x320/0x320
    ? br_get_link_af_size_filtered+0xb7/0x120 [bridge]
    ? if_nlmsg_size+0x440/0x630
    rtmsg_ifinfo_build_skb+0x83/0x120
    rtmsg_ifinfo.part.25+0x16/0xb0
    rtmsg_ifinfo+0x47/0x70
    register_netdevice+0xa2b/0xe50
    ? __kmalloc+0x171/0x2d0
    ? netdev_change_features+0x80/0x80
    register_netdev+0x15/0x30
    vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
    vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
    ? vboxNetAdpComposeMACAddress+0x1d0/0x1d0 [vboxnetadp]
    ? kasan_check_write+0x14/0x20
    VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    ? VBoxNetAdpLinuxOpen+0x20/0x20 [vboxnetadp]
    ? lock_acquire+0x11c/0x270
    ? __audit_syscall_entry+0x2fb/0x660
    do_vfs_ioctl+0x17f/0xff0
    ? __audit_syscall_entry+0x2fb/0x660
    ? ioctl_preallocate+0x1d0/0x1d0
    ? __audit_syscall_entry+0x2fb/0x660
    ? kmem_cache_free+0xb2/0x250
    ? syscall_trace_enter+0x537/0xd00
    ? exit_to_usermode_loop+0x100/0x100
    SyS_ioctl+0x74/0x80
    ? do_sys_open+0x350/0x350
    ? do_vfs_ioctl+0xff0/0xff0
    do_syscall_64+0x182/0x390
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7f7e39a1ae07
    RSP: 002b:00007ffc6f04c6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 00007ffc6f04c730 RCX: 00007f7e39a1ae07
    RDX: 00007ffc6f04c730 RSI: 00000000c0207601 RDI: 0000000000000007
    RBP: 00007ffc6f04c700 R08: 00007ffc6f04c780 R09: 0000000000000008
    R10: 0000000000000541 R11: 0000000000000206 R12: 0000000000000007
    R13: 00000000c0207601 R14: 00007ffc6f04c730 R15: 0000000000000012
    Object at ffff8801be248008, in cache kmalloc-4096 size: 4096
    Allocated:
    PID = 6734
    save_stack_trace+0x16/0x20
    save_stack+0x46/0xd0
    kasan_kmalloc+0xad/0xe0
    __kmalloc+0x171/0x2d0
    alloc_netdev_mqs+0x8a7/0xbe0
    vboxNetAdpOsCreate+0x65/0x1c0 [vboxnetadp]
    vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
    VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    do_vfs_ioctl+0x17f/0xff0
    SyS_ioctl+0x74/0x80
    do_syscall_64+0x182/0x390
    return_from_SYSCALL_64+0x0/0x6a
    Freed:
    PID = 5600
    save_stack_trace+0x16/0x20
    save_stack+0x46/0xd0
    kasan_slab_free+0x73/0xc0
    kfree+0xe4/0x220
    kvfree+0x25/0x30
    single_release+0x74/0xb0
    __fput+0x265/0x6b0
    ____fput+0x9/0x10
    task_work_run+0xd5/0x150
    exit_to_usermode_loop+0xe2/0x100
    do_syscall_64+0x26c/0x390
    return_from_SYSCALL_64+0x0/0x6a
    Memory state around the buggy address:
    ffff8801be248a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff8801be248b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffff8801be248b80: 00 00 00 00 00 00 00 00 00 00 00 07 fc fc fc fc
    ^
    ffff8801be248c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801be248c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ==================================================================

    Signed-off-by: Alban Browaeys
    Signed-off-by: David S. Miller

    Alban Browaeys
     
  • This work tries to make the semantics and code around the
    narrower ctx access a bit easier to follow. Right now
    everything is done inside the .is_valid_access(). Offset
    matching is done differently for read/write types, meaning
    writes don't support narrower access and thus matching only
    on offsetof(struct foo, bar) is enough, whereas for the read
    case that supports narrower access we must check the range from
    offsetof(struct foo, bar) to offsetof(struct foo, bar) +
    sizeof(bar) - 1 for each of the cases. For read cases of
    individual members that don't support narrower access (like
    packet pointers or skb->cb[] case which has its own narrow
    access logic), we check as usual only offsetof(struct foo,
    bar) like in write case. Then, for the case where narrower
    access is allowed, we also need to set the aux info for the
    access. Meaning, ctx_field_size and converted_op_size have
    to be set. The first is the original field size, e.g. sizeof(bar)
    as in the above example, from the user-facing ctx, and the latter
    one is the target size after actual rewrite happened, thus
    for the kernel-facing ctx. Also here we need the range match, and
    we need to keep convert_ctx_access() and the converted_op_size set
    in is_valid_access() in sync, as both are not at the same location.

    We can simplify the code a bit: check_ctx_access() becomes
    simpler in that we only store ctx_field_size as meta data,
    and later in convert_ctx_accesses() we fetch the target_size
    right from the location where we do the conversion. Should the
    verifier be misconfigured, we reject BPF_WRITE cases or a
    target_size that is not provided. For the subsystems, we always
    work on ranges in is_valid_access() and add small helpers for
    ranges and narrow access, and convert_ctx_accesses() sets
    target_size for the relevant instruction.
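    The range matching described above can be sketched as a small
    predicate (a hypothetical illustration of the rule, not the
    verifier's actual code): a read at [off, off + size) is a valid,
    possibly narrower, access to a ctx field iff it lies entirely
    within that field's range.

```c
#include <stddef.h>

/* Returns nonzero iff the access [off, off + size) falls entirely
 * inside the field spanning [field_off, field_off + field_size). */
static int access_in_range(size_t off, size_t size,
                           size_t field_off, size_t field_size)
{
    return off >= field_off && off + size <= field_off + field_size;
}
```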

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Cc: Yonghong Song
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This work adds a helper that can be used to adjust net room of an
    skb. The helper is generic and can be further extended in future.
    Main use case is for having a programmatic way to add/remove room to
    v4/v6 header options along with cls_bpf on egress and ingress hook
    of the data path. It reuses most of the infrastructure that we added
    for the bpf_skb_change_proto() helper which can be used in nat64
    translations. Similarly, the helper only takes care of adjusting the
    room so that related data is populated and csum adapted out of the
    BPF program using it.

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Add a small skb_mac_header_len() helper, similar to the
    skb_network_header_len() we already have, and replace open-coded
    places in BPF's bpf_skb_change_proto() helper. It will also
    be used in upcoming work.
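    As a sketch of the idea (a simplified stand-in, not the kernel's
    actual sk_buff layout): a header's length is the distance between
    the offsets of two adjacent headers.

```c
/* Hypothetical simplified stand-in for sk_buff: header positions
 * are stored as offsets into the buffer. */
struct skb_sketch {
    unsigned int mac_header;      /* offset of the L2 header */
    unsigned int network_header;  /* offset of the L3 header */
};

/* The MAC header's length is the gap up to the network header. */
static unsigned int mac_header_len(const struct skb_sketch *skb)
{
    return skb->network_header - skb->mac_header;
}
```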

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Fixed build error due to misplaced "#ifdef CONFIG_INET" (moved 1
    statement up).

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

02 Jul, 2017

5 commits

  • Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
    sets the maximum congestion window (the sndcwnd clamp). It is useful to
    limit the sndcwnd when the hosts are close to each other (small RTT).

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
    initial congestion window. This can be used when the hosts are far
    apart (large RTTs) and it is safe to start with a large initial cwnd.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Added support for changing congestion control for SOCK_OPS bpf
    programs through the setsockopt bpf helper function. It also adds
    a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
    congestion controls, like dctcp, that need to enable ECN in the
    SYN packets.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Added support for calling a subset of socket setsockopts from
    BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
    than changed to call the socket setsockopt function directly,
    because the required changes would have been larger.

    The ops supported are:
    SO_RCVBUF
    SO_SNDBUF
    SO_MAX_PACING_RATE
    SO_PRIORITY
    SO_RCVLOWAT
    SO_MARK
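    For context, these are the same per-socket knobs that userspace
    reaches through setsockopt(); a plain-socket sketch of one of the
    listed ops (SO_RCVBUF) follows. Note the kernel typically reports
    back roughly double the requested value to account for bookkeeping
    overhead.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Userspace equivalent of the SO_RCVBUF op wrapped by the BPF
 * helper: request a receive-buffer size on an ordinary socket. */
static int set_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```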

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
    struct that allows BPF programs of this type to access some of the
    socket's fields (such as IP addresses, ports, etc.). It uses the
    existing bpf cgroups infrastructure so the programs can be attached per
    cgroup with full inheritance support. The program will be called at
    appropriate times to set relevant connection parameters such as buffer
    sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
    as IP addresses, port numbers, etc.

    Although there are already 3 mechanisms to set parameters (sysctls,
    route metrics and setsockopts), this new mechanism provides some
    distinct advantages. Unlike sysctls, it can set parameters per
    connection. In contrast to route metrics, it can also use port numbers
    and information provided by a user level program. In addition, it could
    set parameters probabilistically for evaluation purposes (i.e. do
    something different on 10% of the flows and compare results with the
    other 90% of the flows). Also, in cases where IPv6 addresses contain
    geographic information, the rules to make changes based on the distance
    (or RTT) between the hosts are much easier than route metric rules and
    can be global. Finally, unlike setsockopt, it does not require
    application changes and it can be updated easily at any time.

    Although the bpf cgroup framework already contains a sock related
    program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
    (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
    only once during the connection's lifetime. In contrast, the new
    program type will be called multiple times from different places in the
    network stack code. For example, before sending SYN and SYN-ACKs to set
    an appropriate timeout, when the connection is established to set
    congestion control, etc. As a result it has an "op" field to specify the
    type of operation requested.

    The purpose of this new program type is to simplify setting connection
    parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
    easy to use facebook's internal IPv6 addresses to determine if both hosts
    of a connection are in the same datacenter. Therefore, it is easy to
    write a BPF program to choose a small SYN RTO value when both hosts are
    in the same datacenter.

    This patch only contains the framework to support the new BPF program
    type, following patches add the functionality to set various connection
    parameters.

    This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
    and a new bpf syscall command to load a new program of this type:
    BPF_PROG_LOAD_SOCKET_OPS.

    Two new corresponding structs (one for the kernel, one for the
    user/BPF program):

    /* kernel version */
    struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32 op;
        union {
            __u32 reply;
            __u32 replylong[4];
        };
    };

    /* user version
     * Some fields are in network byte order reflecting the sock struct
     * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
     * convert them to host byte order.
     */
    struct bpf_sock_ops {
        __u32 op;
        union {
            __u32 reply;
            __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte order */
    };

    Currently there are two types of ops. The first type expects the BPF
    program to return a value which is then used by the caller (or a
    negative value to indicate the operation is not supported). The second
    type expects state changes to be done by the BPF program, for example
    through a setsockopt BPF helper function, and they ignore the return
    value.

    The reply fields of the bpf_sock_ops struct are there in case a bpf
    program needs to return a value larger than an integer.

    Signed-off-by: Lawrence Brakmo
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

01 Jul, 2017

10 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
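    The safety property can be illustrated with a userspace sketch (this
    is not the kernel refcount API, just the idea behind it): increments
    saturate instead of wrapping, so an overflowed counter can never
    come back around to zero and trigger a premature free.

```c
#include <limits.h>

/* Hypothetical saturating refcount: once saturated (or already zero),
 * the counter sticks at the saturation value. The object may leak,
 * but it can never be freed while references remain. */
#define REFCOUNT_SATURATED UINT_MAX

static unsigned int refcount_inc_sketch(unsigned int r)
{
    if (r == 0 || r == REFCOUNT_SATURATED)
        return REFCOUNT_SATURATED;  /* never resurrect a dead object */
    return r + 1;
}

static unsigned int refcount_dec_sketch(unsigned int r)
{
    if (r == REFCOUNT_SATURATED)
        return REFCOUNT_SATURATED;  /* stuck: safe leak, no UAF */
    return r - 1;
}
```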

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental refcount
    overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental refcount
    overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • A set of overlapping changes in macvlan and the rocker
    driver, nothing serious.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Jun, 2017

4 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. This batch contains connection tracking updates for the cleanup
    iteration path, patches from Florian Westphal:

    1) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
    the dying bit to let the CPU release them.

    2) Add nf_ct_iterate_destroy() to be used on module removal, to kill
    conntrack entries from all namespaces.

    3) Restart iteration on hashtable resizing, since both may occur at
    the same time.

    4) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
    mapping on module removal.

    5) Use nf_ct_iterate_destroy() to remove conntrack entries on helper
    module removal, from Liping Zhang.

    6) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
    if the user requests this, also from Liping.

    7) Add net_ns_barrier() and use it from the FTP helper, to make sure
    no concurrent namespace removal happens while the helper module is
    being removed.

    8) Use NFPROTO_MAX in the layer 3 conntrack protocol array, to reduce
    module size. Same thing in nf_tables.

    Updates for the nf_tables infrastructure:

    9) Prepare usage of the extended ACK reporting infrastructure for
    nf_tables.

    10) Remove an unnecessary forward declaration in the nf_tables hash set.

    11) Skip set size estimation if the number of elements is not specified.

    12) Changes to accommodate a (faster) unresizable hash set implementation,
    for anonymous sets and dynamic-size fixed sets with no timeouts.

    13) Faster lookup function for the unresizable hash table for 2- and
    4-byte keys.

    And, finally, a bunch of assorted small updates and cleanups:

    14) Do not hold a reference to the netdev from ipt_CLUSTERIP; instead,
    subscribe to device events and look up the index from the packet path.
    This fixes an issue that has been present since the very beginning, patch
    from Xin Long.

    15) Use nf_register_net_hook() in ipt_CLUSTERIP, from Florian Westphal.

    16) Use ebt_invalid_target() whenever possible in the ebtables tree,
    from Gao Feng.

    17) Calm down a compilation warning in the nf_dup infrastructure, patch
    from Stephen Hemminger.

    18) Statify functions in the nftables rt expression, also from Stephen.

    19) Update the Makefile to use the canonical method to specify
    nf_tables-objs. From Jike Song.

    20) Use nf_conntrack_helpers_register() in amanda and H323.

    21) Space cleanup for ctnetlink, from linzhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Recently I started seeing warnings about pages with refcount -1. The
    problem was traced to packets being reused after their head had been merged
    into a GRO packet by skb_gro_receive(). While bisecting pointed to
    commit c21b48cc1bbf ("net: adjust skb->truesize in ___pskb_trim()"), and
    I have never seen the issue on a kernel with that commit reverted, I believe
    the real problem appeared earlier, when the option to merge the head frag
    in GRO was implemented.

    Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
    branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
    and head is merged (which in my case happens after the skb_condense()
    call added by the commit mentioned above), the skb is reused including the
    head that has been merged. As a result, we release the page reference
    twice and eventually end up with negative page refcount.

    To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
    the same way it's done in napi_skb_finish().

    Fixes: d7e8883cfcf4 ("net: make GRO aware of skb->head_frag")
    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • attribute_groups are not supposed to change at runtime. All functions
    working with attribute_groups provided by <linux/sysfs.h> work with const
    attribute_group. So mark the non-const structs as const.

    File size before:
    text data bss dec hex filename
    9968 3168 16 13152 3360 net/core/net-sysfs.o

    File size after adding 'const':
    text data bss dec hex filename
    10160 2976 16 13152 3360 net/core/net-sysfs.o

    Signed-off-by: Arvind Yadav
    Signed-off-by: David S. Miller

    Arvind Yadav
     

28 Jun, 2017

1 commit

  • Similar to the fix provided by Dominik Heidler in commit
    9b3dc0a17d73 ("l2tp: cast l2tp traffic counter to unsigned")
    we need to take care of 32bit kernels in dev_get_stats().

    When using atomic_long_read(), we add a 'long' to a u64 and
    might misinterpret the high order bit, unless we cast to unsigned.

    Fixes: caf586e5f23ce ("net: add a core netdev->rx_dropped counter")
    Fixes: 015f0688f57ca ("net: net: add a core netdev->tx_dropped counter")
    Fixes: 6e7333d315a76 ("net: add rx_nohandler stat counter")
    Signed-off-by: Eric Dumazet
    Cc: Jarod Wilson
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jun, 2017

5 commits


25 Jun, 2017

1 commit

  • Switches and modern SR-IOV enabled NICs may multiplex traffic from port
    representors and control messages over a single set of hardware queues.
    Control messages and muxed traffic may need ordered delivery.

    Those requirements make it hard to comfortably use the TC infrastructure
    today unless we have a way of attaching metadata to skbs at the upper
    device. Because a single set of queues is used for many netdevs, stopping
    the TC/sched queues of all of them reliably is impossible, and the lower
    device has to resort to returning NETDEV_TX_BUSY and usually has to take
    extra locks on the fastpath.

    This patch attempts to enable port/representative devs to attach metadata
    to skbs which carry port id. This way representatives can be queueless and
    all queuing can be performed at the lower netdev in the usual way.

    Traffic arriving on the port/representative interfaces will have
    metadata attached and will subsequently be queued to the lower device for
    transmission. The lower device should recognize the metadata and translate
    it to HW specific format which is most likely either a special header
    inserted before the network headers or descriptor/metadata fields.

    Metadata is associated with the lower device by storing the netdev pointer
    along with port id so that if TC decides to redirect or mirror the new
    netdev will not try to interpret it.

    This is mostly for SR-IOV devices since switches don't have lower netdevs
    today.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sridhar Samudrala
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

24 Jun, 2017

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit 31fd85816dbe ("bpf: permits narrower load from bpf program
    context fields") permits narrower load for certain ctx fields.
    The commit however will already generate a masking even if
    the prog-specific ctx conversion produces the result with
    narrower size.

    For example, for __sk_buff->protocol, the ctx conversion
    loads the data into a register with a 2-byte load.
    A narrower 2-byte load should not generate masking.
    For __sk_buff->vlan_present, the conversion function
    sets the result as either 0 or 1, essentially a byte.
    Narrower 2-byte or 1-byte loads should not generate masking.

    To avoid unnecessary masking, prog-specific *_is_valid_access
    now passes converted_op_size back to verifier, which indicates
    the valid data width after perceived future conversion.
    Based on this information, the verifier is able to avoid
    unnecessary masking.

    Since we want more information back from prog-specific
    *_is_valid_access checking, all of them are packed into
    one data structure for more clarity.

    Acked-by: Daniel Borkmann
    Signed-off-by: Yonghong Song
    Signed-off-by: David S. Miller

    Yonghong Song