08 Jul, 2017

1 commit

  • As Hongjun/Nicolas summarized in their original patch:

    "
    When a device changes from one netns to another, it's first unregistered,
    then the netns reference is updated and the dev is registered in the new
    netns. Thus, when a slave moves to another netns, it is first
    unregistered. This triggers a NETDEV_UNREGISTER event which is caught by
    the bonding driver. The driver calls bond_release(), which calls
    dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in
    the old netns).
    "

    This is a very special case: because the device is being unregistered,
    no one should still care about the NETDEV_CHANGEMTU event triggered
    at this point. We can avoid broadcasting this event on this path,
    and avoid touching the inetdev_event()/addrconf_notify() paths.

    This requires exporting __dev_set_mtu() to the bonding driver.

    Reported-by: Hongjun Li
    Reported-by: Nicolas Dichtel
    Cc: Jay Vosburgh
    Cc: Veaceslav Falico
    Cc: Andy Gospodarek
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

06 Jul, 2017

2 commits

  • Pull memdup_user() conversions from Al Viro:
    "A fairly self-contained series - hunting down open-coded memdup_user()
    and memdup_user_nul() instances"

    * 'work.memdup_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    bpf: don't open-code memdup_user()
    kimage_file_prepare_segments(): don't open-code memdup_user()
    ethtool: don't open-code memdup_user()
    do_ip_setsockopt(): don't open-code memdup_user()
    do_ipv6_setsockopt(): don't open-code memdup_user()
    irda: don't open-code memdup_user()
    xfrm_user_policy(): don't open-code memdup_user()
    ima_write_policy(): don't open-code memdup_user_nul()
    sel_write_validatetrans(): don't open-code memdup_user_nul()

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP for cases where using the sch_fq
    packet scheduler is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brakmo. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

05 Jul, 2017

1 commit

  • There appears to be a missing break in the TCP_BPF_SNDCWND_CLAMP case.
    Currently the non-error path where val is greater than zero falls through
    to the default case that sets the error return to -EINVAL. Add in
    the missing break.
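    The fall-through pattern being fixed can be sketched in a minimal
    userspace example (names and values are hypothetical; -22 stands in
    for -EINVAL, and case 1 stands in for TCP_BPF_SNDCWND_CLAMP):

```c
#include <assert.h>

/* Buggy pattern: on the success path (val > 0), control falls
 * through into the default case and the result is clobbered. */
static int set_opt_buggy(int optname, int val)
{
    int ret = 0;

    switch (optname) {
    case 1:
        if (val <= 0)
            ret = -22;      /* -EINVAL */
        /* missing break: falls through even on success */
    default:
        ret = -22;
    }
    return ret;
}

/* Fixed pattern: the added break preserves the success result. */
static int set_opt_fixed(int optname, int val)
{
    int ret = 0;

    switch (optname) {
    case 1:
        if (val <= 0)
            ret = -22;
        break;              /* the fix */
    default:
        ret = -22;          /* unknown option */
    }
    return ret;
}
```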

    Detected by CoverityScan, CID#1449376 ("Missing break in switch")

    Fixes: 13bf96411ad2 ("bpf: Adds support for setting sndcwnd clamp")
    Signed-off-by: Colin Ian King
    Acked-by: Daniel Borkmann
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Colin Ian King
     

04 Jul, 2017

2 commits

  • Pull documentation updates from Jonathan Corbet:
    "There has been a fair amount of activity in the docs tree this time
    around. Highlights include:

    - Conversion of a bunch of security documentation into RST

    - The conversion of the remaining DocBook templates by The Amazing
    Mauro Machine. We can now drop the entire DocBook build chain.

    - The usual collection of fixes and minor updates"

    * tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
    scripts/kernel-doc: handle DECLARE_HASHTABLE
    Documentation: atomic_ops.txt is core-api/atomic_ops.rst
    Docs: clean up some DocBook loose ends
    Make the main documentation title less Geocities
    Docs: Use kernel-figure in vidioc-g-selection.rst
    Docs: fix table problems in ras.rst
    Docs: Fix breakage with Sphinx 1.5 and upper
    Docs: Include the Latex "ifthen" package
    doc/kokr/howto: Only send regression fixes after -rc1
    docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
    doc: Document suitability of IBM Verse for kernel development
    Doc: fix a markup error in coding-style.rst
    docs: driver-api: i2c: remove some outdated information
    Documentation: DMA API: fix a typo in a function name
    Docs: Insert missing space to separate link from text
    doc/ko_KR/memory-barriers: Update control-dependencies example
    Documentation, kbuild: fix typo "minimun" -> "minimum"
    docs: Fix some formatting issues in request-key.rst
    doc: ReSTify keys-trusted-encrypted.txt
    doc: ReSTify keys-request-key.txt
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

03 Jul, 2017

6 commits

  • We need to use refcount_set() on a newly created rule to avoid the
    following error:

    [ 64.601749] ------------[ cut here ]------------
    [ 64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184 refcount_sub_and_test+0x75/0xa0
    [ 64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
    [ 64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: G W 4.12.0-smp-DEV #274
    [ 64.601771] task: ffff8837bf482040 task.stack: ffff8837bdc08000
    [ 64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
    [ 64.601774] RSP: 0018:ffff8837bdc0f5c0 EFLAGS: 00010286
    [ 64.601776] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000000
    [ 64.601777] RDX: 0000000000000026 RSI: 0000000000000096 RDI: ffffed06f7b81eae
    [ 64.601778] RBP: ffff8837bdc0f5d0 R08: 0000000000000004 R09: fffffbfff4a54c25
    [ 64.601779] R10: 00000000cbc500e5 R11: ffffffffa52a6128 R12: ffff881febcf6f24
    [ 64.601779] R13: ffff881fbf4eaf00 R14: ffff881febcf6f80 R15: ffff8837d7a4ed00
    [ 64.601781] FS: 00007ff5a2f6b700(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
    [ 64.601782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 64.601783] CR2: 00007ffcdc70d000 CR3: 0000001f9c91e000 CR4: 00000000001406f0
    [ 64.601783] Call Trace:
    [ 64.601786] refcount_dec_and_test+0x11/0x20
    [ 64.601790] fib_nl_delrule+0xc39/0x1630
    [ 64.601793] ? is_bpf_text_address+0xe/0x20
    [ 64.601795] ? fib_nl_newrule+0x25e0/0x25e0
    [ 64.601798] ? depot_save_stack+0x133/0x470
    [ 64.601801] ? ns_capable+0x13/0x20
    [ 64.601803] ? __netlink_ns_capable+0xcc/0x100
    [ 64.601806] rtnetlink_rcv_msg+0x23a/0x6a0
    [ 64.601808] ? rtnl_newlink+0x1630/0x1630
    [ 64.601811] ? memset+0x31/0x40
    [ 64.601813] netlink_rcv_skb+0x2d7/0x440
    [ 64.601815] ? rtnl_newlink+0x1630/0x1630
    [ 64.601816] ? netlink_ack+0xaf0/0xaf0
    [ 64.601818] ? kasan_unpoison_shadow+0x35/0x50
    [ 64.601820] ? __kmalloc_node_track_caller+0x4c/0x70
    [ 64.601821] rtnetlink_rcv+0x28/0x30
    [ 64.601823] netlink_unicast+0x422/0x610
    [ 64.601824] ? netlink_attachskb+0x650/0x650
    [ 64.601826] netlink_sendmsg+0x7b7/0xb60
    [ 64.601828] ? netlink_unicast+0x610/0x610
    [ 64.601830] ? netlink_unicast+0x610/0x610
    [ 64.601832] sock_sendmsg+0xba/0xf0
    [ 64.601834] ___sys_sendmsg+0x6a9/0x8c0
    [ 64.601835] ? copy_msghdr_from_user+0x520/0x520
    [ 64.601837] ? __alloc_pages_nodemask+0x160/0x520
    [ 64.601839] ? memcg_write_event_control+0xd60/0xd60
    [ 64.601841] ? __alloc_pages_slowpath+0x1d50/0x1d50
    [ 64.601843] ? kasan_slab_free+0x71/0xc0
    [ 64.601845] ? mem_cgroup_commit_charge+0xb2/0x11d0
    [ 64.601847] ? lru_cache_add_active_or_unevictable+0x7d/0x1a0
    [ 64.601849] ? __handle_mm_fault+0x1af8/0x2810
    [ 64.601851] ? may_open_dev+0xc0/0xc0
    [ 64.601852] ? __pmd_alloc+0x2c0/0x2c0
    [ 64.601853] ? __fdget+0x13/0x20
    [ 64.601855] __sys_sendmsg+0xc6/0x150
    [ 64.601856] ? __sys_sendmsg+0xc6/0x150
    [ 64.601857] ? SyS_shutdown+0x170/0x170
    [ 64.601859] ? handle_mm_fault+0x28a/0x650
    [ 64.601861] SyS_sendmsg+0x12/0x20
    [ 64.601863] entry_SYSCALL_64_fastpath+0x13/0x94

    Fixes: 717d1e993ad8 ("net: convert fib_rule.refcnt from atomic_t to refcount_t")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit 9256645af098 ("net/core: relax BUILD_BUG_ON in
    netdev_stats_to_stats64") made it possible for the copy to read
    beyond the size of the source.

    Fix it to copy only the source's size to the destination, since the
    destination might be bigger than the source.
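    The fix described above can be sketched in userspace C (the function
    and struct names here are hypothetical, not the kernel's): bound the
    copy by the source's size and zero the destination's tail so fields
    absent in the source read as zero.

```c
#include <string.h>
#include <stddef.h>

/* Minimal sketch of the fix: when dest may be larger than src, copy
 * at most src_size bytes (never read past the end of src) and zero
 * whatever remains of dest. */
static void copy_stats(void *dest, size_t dest_size,
                       const void *src, size_t src_size)
{
    size_t n = src_size < dest_size ? src_size : dest_size;

    memcpy(dest, src, n);                        /* bounded by src */
    memset((char *)dest + n, 0, dest_size - n);  /* new fields are 0 */
}
```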

    ==================================================================
    BUG: KASAN: slab-out-of-bounds in netdev_stats_to_stats64+0xe/0x30 at addr ffff8801be248b20
    Read of size 192 by task VBoxNetAdpCtl/6734
    CPU: 1 PID: 6734 Comm: VBoxNetAdpCtl Tainted: G O 4.11.4prahal+intel+ #118
    Hardware name: LENOVO 20CDCTO1WW/20CDCTO1WW, BIOS GQET52WW (1.32 ) 05/04/2017
    Call Trace:
    dump_stack+0x63/0x86
    kasan_object_err+0x1c/0x70
    kasan_report+0x270/0x520
    ? netdev_stats_to_stats64+0xe/0x30
    ? sched_clock_cpu+0x1b/0x190
    ? __module_address+0x3e/0x3b0
    ? unwind_next_frame+0x1ea/0xb00
    check_memory_region+0x13c/0x1a0
    memcpy+0x23/0x50
    netdev_stats_to_stats64+0xe/0x30
    dev_get_stats+0x1b9/0x230
    rtnl_fill_stats+0x44/0xc00
    ? nla_put+0xc6/0x130
    rtnl_fill_ifinfo+0xe9e/0x3700
    ? rtnl_fill_vfinfo+0xde0/0xde0
    ? sched_clock+0x9/0x10
    ? sched_clock+0x9/0x10
    ? sched_clock_local+0x120/0x130
    ? __module_address+0x3e/0x3b0
    ? unwind_next_frame+0x1ea/0xb00
    ? sched_clock+0x9/0x10
    ? sched_clock+0x9/0x10
    ? sched_clock_cpu+0x1b/0x190
    ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    ? depot_save_stack+0x1d8/0x4a0
    ? depot_save_stack+0x34f/0x4a0
    ? depot_save_stack+0x34f/0x4a0
    ? save_stack+0xb1/0xd0
    ? save_stack_trace+0x16/0x20
    ? save_stack+0x46/0xd0
    ? kasan_slab_alloc+0x12/0x20
    ? __kmalloc_node_track_caller+0x10d/0x350
    ? __kmalloc_reserve.isra.36+0x2c/0xc0
    ? __alloc_skb+0xd0/0x560
    ? rtmsg_ifinfo_build_skb+0x61/0x120
    ? rtmsg_ifinfo.part.25+0x16/0xb0
    ? rtmsg_ifinfo+0x47/0x70
    ? register_netdev+0x15/0x30
    ? vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
    ? vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
    ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    ? do_vfs_ioctl+0x17f/0xff0
    ? SyS_ioctl+0x74/0x80
    ? do_syscall_64+0x182/0x390
    ? __alloc_skb+0xd0/0x560
    ? __alloc_skb+0xd0/0x560
    ? save_stack_trace+0x16/0x20
    ? init_object+0x64/0xa0
    ? ___slab_alloc+0x1ae/0x5c0
    ? ___slab_alloc+0x1ae/0x5c0
    ? __alloc_skb+0xd0/0x560
    ? sched_clock+0x9/0x10
    ? kasan_unpoison_shadow+0x35/0x50
    ? kasan_kmalloc+0xad/0xe0
    ? __kmalloc_node_track_caller+0x246/0x350
    ? __alloc_skb+0xd0/0x560
    ? kasan_unpoison_shadow+0x35/0x50
    ? memset+0x31/0x40
    ? __alloc_skb+0x31f/0x560
    ? napi_consume_skb+0x320/0x320
    ? br_get_link_af_size_filtered+0xb7/0x120 [bridge]
    ? if_nlmsg_size+0x440/0x630
    rtmsg_ifinfo_build_skb+0x83/0x120
    rtmsg_ifinfo.part.25+0x16/0xb0
    rtmsg_ifinfo+0x47/0x70
    register_netdevice+0xa2b/0xe50
    ? __kmalloc+0x171/0x2d0
    ? netdev_change_features+0x80/0x80
    register_netdev+0x15/0x30
    vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
    vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
    ? vboxNetAdpComposeMACAddress+0x1d0/0x1d0 [vboxnetadp]
    ? kasan_check_write+0x14/0x20
    VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    ? VBoxNetAdpLinuxOpen+0x20/0x20 [vboxnetadp]
    ? lock_acquire+0x11c/0x270
    ? __audit_syscall_entry+0x2fb/0x660
    do_vfs_ioctl+0x17f/0xff0
    ? __audit_syscall_entry+0x2fb/0x660
    ? ioctl_preallocate+0x1d0/0x1d0
    ? __audit_syscall_entry+0x2fb/0x660
    ? kmem_cache_free+0xb2/0x250
    ? syscall_trace_enter+0x537/0xd00
    ? exit_to_usermode_loop+0x100/0x100
    SyS_ioctl+0x74/0x80
    ? do_sys_open+0x350/0x350
    ? do_vfs_ioctl+0xff0/0xff0
    do_syscall_64+0x182/0x390
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7f7e39a1ae07
    RSP: 002b:00007ffc6f04c6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 00007ffc6f04c730 RCX: 00007f7e39a1ae07
    RDX: 00007ffc6f04c730 RSI: 00000000c0207601 RDI: 0000000000000007
    RBP: 00007ffc6f04c700 R08: 00007ffc6f04c780 R09: 0000000000000008
    R10: 0000000000000541 R11: 0000000000000206 R12: 0000000000000007
    R13: 00000000c0207601 R14: 00007ffc6f04c730 R15: 0000000000000012
    Object at ffff8801be248008, in cache kmalloc-4096 size: 4096
    Allocated:
    PID = 6734
    save_stack_trace+0x16/0x20
    save_stack+0x46/0xd0
    kasan_kmalloc+0xad/0xe0
    __kmalloc+0x171/0x2d0
    alloc_netdev_mqs+0x8a7/0xbe0
    vboxNetAdpOsCreate+0x65/0x1c0 [vboxnetadp]
    vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
    VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
    do_vfs_ioctl+0x17f/0xff0
    SyS_ioctl+0x74/0x80
    do_syscall_64+0x182/0x390
    return_from_SYSCALL_64+0x0/0x6a
    Freed:
    PID = 5600
    save_stack_trace+0x16/0x20
    save_stack+0x46/0xd0
    kasan_slab_free+0x73/0xc0
    kfree+0xe4/0x220
    kvfree+0x25/0x30
    single_release+0x74/0xb0
    __fput+0x265/0x6b0
    ____fput+0x9/0x10
    task_work_run+0xd5/0x150
    exit_to_usermode_loop+0xe2/0x100
    do_syscall_64+0x26c/0x390
    return_from_SYSCALL_64+0x0/0x6a
    Memory state around the buggy address:
    ffff8801be248a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff8801be248b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffff8801be248b80: 00 00 00 00 00 00 00 00 00 00 00 07 fc fc fc fc
    ^
    ffff8801be248c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801be248c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ==================================================================

    Signed-off-by: Alban Browaeys
    Signed-off-by: David S. Miller

    Alban Browaeys
     
  • This work tries to make the semantics and code around the
    narrower ctx access a bit easier to follow. Right now
    everything is done inside the .is_valid_access(). Offset
    matching is done differently for read/write types, meaning
    writes don't support narrower access and thus matching only
    on offsetof(struct foo, bar) is enough, whereas for the read
    case that supports narrower access we must check the range from
    offsetof(struct foo, bar) to offsetof(struct foo, bar) +
    sizeof(bar) - 1 for each of the cases. For read cases of
    individual members that don't support narrower access (like
    packet pointers or skb->cb[] case which has its own narrow
    access logic), we check as usual only offsetof(struct foo,
    bar) like in write case. Then, for the case where narrower
    access is allowed, we also need to set the aux info for the
    access. Meaning, ctx_field_size and converted_op_size have
    to be set. The first is the original field size, e.g. sizeof(bar)
    as in the above example, from the user-facing ctx, and the latter
    one is the target size after actual rewrite happened, thus
    for the kernel-facing ctx. Also here we need the range match, and
    we need to keep convert_ctx_access() and the converted_op_size set
    in is_valid_access() in sync, as both are not at the same location.

    We can simplify the code a bit: check_ctx_access() becomes
    simpler in that we only store ctx_field_size as meta data,
    and later in convert_ctx_accesses() we fetch the target_size
    right from the location where we do the conversion. Should the
    verifier be misconfigured, we reject BPF_WRITE cases or a
    target_size that is not provided. For the subsystems, we always
    work on ranges in is_valid_access() and add small helpers for
    ranges and narrow access, and convert_ctx_accesses() sets
    target_size for the relevant instruction.
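    The range matching described above can be sketched as a small
    predicate (a hypothetical illustration of the rule, not the
    verifier's actual code): a read at [off, off + size) is a valid,
    possibly narrower, access to a ctx field iff it lies entirely
    within that field's range.

```c
#include <stddef.h>

/* Returns nonzero iff the access [off, off + size) falls entirely
 * inside the field spanning [field_off, field_off + field_size). */
static int access_in_range(size_t off, size_t size,
                           size_t field_off, size_t field_size)
{
    return off >= field_off && off + size <= field_off + field_size;
}
```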

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Cc: Yonghong Song
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This work adds a helper that can be used to adjust net room of an
    skb. The helper is generic and can be further extended in future.
    Main use case is for having a programmatic way to add/remove room to
    v4/v6 header options along with cls_bpf on egress and ingress hook
    of the data path. It reuses most of the infrastructure that we added
    for the bpf_skb_change_proto() helper which can be used in nat64
    translations. Similarly, the helper only takes care of adjusting the
    room so that related data is populated and csum adapted out of the
    BPF program using it.

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Add a small skb_mac_header_len() helper, similar to the
    skb_network_header_len() we already have, and replace open-coded
    places in BPF's bpf_skb_change_proto() helper. It will also
    be used in upcoming work.
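    As a sketch of the idea (a simplified stand-in, not the kernel's
    actual sk_buff layout): a header's length is the distance between
    the offsets of two adjacent headers.

```c
/* Hypothetical simplified stand-in for sk_buff: header positions
 * are stored as offsets into the buffer. */
struct skb_sketch {
    unsigned int mac_header;      /* offset of the L2 header */
    unsigned int network_header;  /* offset of the L3 header */
};

/* The MAC header's length is the gap up to the network header. */
static unsigned int mac_header_len(const struct skb_sketch *skb)
{
    return skb->network_header - skb->mac_header;
}
```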

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Fixed build error due to misplaced "#ifdef CONFIG_INET" (moved 1
    statement up).

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

02 Jul, 2017

5 commits

  • Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
    sets the maximum congestion window (the sndcwnd clamp). It is useful to
    limit the sndcwnd when the hosts are close to each other (small RTT).

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
    initial congestion window. This can be used when the hosts are far
    apart (large RTTs) and it is safe to start with a large initial cwnd.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Added support for changing congestion control for SOCK_OPS bpf
    programs through the setsockopt bpf helper function. It also adds
    a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
    congestion controls, like dctcp, that need to enable ECN in the
    SYN packets.

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Added support for calling a subset of socket setsockopts from
    BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
    than changed to call the socket setsockopt function directly,
    because the required changes would have been larger.

    The ops supported are:
    SO_RCVBUF
    SO_SNDBUF
    SO_MAX_PACING_RATE
    SO_PRIORITY
    SO_RCVLOWAT
    SO_MARK
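    For context, these are the same per-socket knobs that userspace
    reaches through setsockopt(); a plain-socket sketch of one of the
    listed ops (SO_RCVBUF) follows. Note the kernel typically reports
    back roughly double the requested value to account for bookkeeping
    overhead.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Userspace equivalent of the SO_RCVBUF op wrapped by the BPF
 * helper: request a receive-buffer size on an ordinary socket. */
static int set_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}
```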

    Signed-off-by: Lawrence Brakmo
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     
  • Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
    struct that allows BPF programs of this type to access some of the
    socket's fields (such as IP addresses, ports, etc.). It uses the
    existing bpf cgroups infrastructure so the programs can be attached per
    cgroup with full inheritance support. The program will be called at
    appropriate times to set relevant connection parameters such as buffer
    sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
    as IP addresses, port numbers, etc.

    Although there are already 3 mechanisms to set parameters (sysctls,
    route metrics and setsockopts), this new mechanism provides some
    distinct advantages. Unlike sysctls, it can set parameters per
    connection. In contrast to route metrics, it can also use port numbers
    and information provided by a user level program. In addition, it could
    set parameters probabilistically for evaluation purposes (i.e. do
    something different on 10% of the flows and compare results with the
    other 90% of the flows). Also, in cases where IPv6 addresses contain
    geographic information, the rules to make changes based on the distance
    (or RTT) between the hosts are much easier than route metric rules and
    can be global. Finally, unlike setsockopt, it does not require
    application changes and it can be updated easily at any time.

    Although the bpf cgroup framework already contains a sock related
    program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
    (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
    only once during the connection's lifetime. In contrast, the new
    program type will be called multiple times from different places in the
    network stack code. For example, before sending SYN and SYN-ACKs to set
    an appropriate timeout, when the connection is established to set
    congestion control, etc. As a result it has an "op" field to specify the
    type of operation requested.

    The purpose of this new program type is to simplify setting connection
    parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
    easy to use facebook's internal IPv6 addresses to determine if both hosts
    of a connection are in the same datacenter. Therefore, it is easy to
    write a BPF program to choose a small SYN RTO value when both hosts are
    in the same datacenter.

    This patch only contains the framework to support the new BPF program
    type, following patches add the functionality to set various connection
    parameters.

    This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
    and a new bpf syscall command to load a new program of this type:
    BPF_PROG_LOAD_SOCKET_OPS.

    Two new corresponding structs (one for the kernel, one for the
    user/BPF program):

    /* kernel version */
    struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32 op;
        union {
            __u32 reply;
            __u32 replylong[4];
        };
    };

    /* user version
     * Some fields are in network byte order reflecting the sock struct
     * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
     * convert them to host byte order.
     */
    struct bpf_sock_ops {
        __u32 op;
        union {
            __u32 reply;
            __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte order */
    };

    Currently there are two types of ops. The first type expects the BPF
    program to return a value which is then used by the caller (or a
    negative value to indicate the operation is not supported). The second
    type expects state changes to be done by the BPF program, for example
    through a setsockopt BPF helper function, and they ignore the return
    value.

    The reply fields of the bpf_sock_ops struct are there in case a bpf
    program needs to return a value larger than an integer.

    Signed-off-by: Lawrence Brakmo
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Lawrence Brakmo
     

01 Jul, 2017

10 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.
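    The safety property can be illustrated with a userspace sketch (this
    is not the kernel refcount API, just the idea behind it): increments
    saturate instead of wrapping, so an overflowed counter can never
    come back around to zero and trigger a premature free.

```c
#include <limits.h>

/* Hypothetical saturating refcount: once saturated (or already zero),
 * the counter sticks at the saturation value. The object may leak,
 * but it can never be freed while references remain. */
#define REFCOUNT_SATURATED UINT_MAX

static unsigned int refcount_inc_sketch(unsigned int r)
{
    if (r == 0 || r == REFCOUNT_SATURATED)
        return REFCOUNT_SATURATED;  /* never resurrect a dead object */
    return r + 1;
}

static unsigned int refcount_dec_sketch(unsigned int r)
{
    if (r == REFCOUNT_SATURATED)
        return REFCOUNT_SATURATED;  /* stuck: safe leak, no UAF */
    return r - 1;
}
```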

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows us to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental refcount
    overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This avoids accidental refcount
    overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • A set of overlapping changes in macvlan and the rocker
    driver, nothing serious.

    Signed-off-by: David S. Miller

    David S. Miller
     

30 Jun, 2017

4 commits

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. This batch contains connection tracking updates for the cleanup
    iteration path, patches from Florian Westphal:

    1) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
    the dying bit to let the CPU release them.

    2) Add nf_ct_iterate_destroy() to be used on module removal, to kill
    conntrack entries from all namespaces.

    3) Restart iteration on hashtable resizing, since both may occur at
    the same time.

    4) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
    mapping on module removal.

    5) Use nf_ct_iterate_destroy() to remove conntrack entries on helper
    module removal, from Liping Zhang.

    6) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
    if the user requests this, also from Liping.

    7) Add net_ns_barrier() and use it from the FTP helper, to make sure
    no concurrent namespace removal happens while the helper module is
    being removed.

    8) Use NFPROTO_MAX in the layer 3 conntrack protocol array, to reduce
    module size. Same thing in nf_tables.

    Updates for the nf_tables infrastructure:

    9) Prepare usage of the extended ACK reporting infrastructure for
    nf_tables.

    10) Remove an unnecessary forward declaration in the nf_tables hash set.

    11) Skip set size estimation if the number of elements is not specified.

    12) Changes to accommodate a (faster) unresizable hash set implementation,
    for anonymous sets and dynamic-size fixed sets with no timeouts.

    13) Faster lookup function for the unresizable hash table for 2- and
    4-byte keys.

    And, finally, a bunch of assorted small updates and cleanups:

    14) Do not hold a reference to the netdev from ipt_CLUSTERIP; instead,
    subscribe to device events and look up the index from the packet path.
    This fixes an issue that has been present since the very beginning, patch
    from Xin Long.

    15) Use nf_register_net_hook() in ipt_CLUSTERIP, from Florian Westphal.

    16) Use ebt_invalid_target() whenever possible in the ebtables tree,
    from Gao Feng.

    17) Calm down a compilation warning in the nf_dup infrastructure, patch
    from Stephen Hemminger.

    18) Statify functions in the nftables rt expression, also from Stephen.

    19) Update the Makefile to use the canonical method to specify
    nf_tables-objs. From Jike Song.

    20) Use nf_conntrack_helpers_register() in amanda and H323.

    21) Space cleanup for ctnetlink, from linzhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Recently I started seeing warnings about pages with refcount -1. The
    problem was traced to packets being reused after their head had been merged
    into a GRO packet by skb_gro_receive(). While bisecting pointed to
    commit c21b48cc1bbf ("net: adjust skb->truesize in ___pskb_trim()"), and
    I have never seen the issue on a kernel with that commit reverted, I believe
    the real problem appeared earlier, when the option to merge the head frag
    in GRO was implemented.

    Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
    branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
    and head is merged (which in my case happens after the skb_condense()
    call added by the commit mentioned above), the skb is reused including the
    head that has been merged. As a result, we release the page reference
    twice and eventually end up with negative page refcount.

    To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
    the same way it's done in napi_skb_finish().

    Fixes: d7e8883cfcf4 ("net: make GRO aware of skb->head_frag")
    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Michal Kubeček
     
  • attribute_groups are not supposed to change at runtime. All functions
    working with attribute_groups provided by <linux/sysfs.h> work with const
    attribute_group. So mark the non-const structs as const.

    File size before:
    text data bss dec hex filename
    9968 3168 16 13152 3360 net/core/net-sysfs.o

    File size after adding 'const':
    text data bss dec hex filename
    10160 2976 16 13152 3360 net/core/net-sysfs.o

    Signed-off-by: Arvind Yadav
    Signed-off-by: David S. Miller

    Arvind Yadav
     

28 Jun, 2017

1 commit

  • Similar to the fix provided by Dominik Heidler in commit
    9b3dc0a17d73 ("l2tp: cast l2tp traffic counter to unsigned")
    we need to take care of 32bit kernels in dev_get_stats().

    When using atomic_long_read(), we add a 'long' to a u64 and
    might misinterpret the high order bit, unless we cast to unsigned.

    Fixes: caf586e5f23ce ("net: add a core netdev->rx_dropped counter")
    Fixes: 015f0688f57ca ("net: net: add a core netdev->tx_dropped counter")
    Fixes: 6e7333d315a76 ("net: add rx_nohandler stat counter")
    Signed-off-by: Eric Dumazet
    Cc: Jarod Wilson
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Jun, 2017

5 commits


25 Jun, 2017

1 commit

  • Switches and modern SR-IOV enabled NICs may multiplex traffic from port
    representors and control messages over a single set of hardware queues.
    Control messages and muxed traffic may need ordered delivery.

    Those requirements make it hard to comfortably use the TC infrastructure
    today unless we have a way of attaching metadata to skbs at the upper
    device. Because a single set of queues is used for many netdevs, stopping
    the TC/sched queues of all of them reliably is impossible, and the lower
    device has to resort to returning NETDEV_TX_BUSY and usually has to take
    extra locks on the fastpath.

    This patch attempts to enable port/representative devs to attach metadata
    to skbs which carry port id. This way representatives can be queueless and
    all queuing can be performed at the lower netdev in the usual way.

    Traffic arriving on the port/representative interfaces will have
    metadata attached and will subsequently be queued to the lower device for
    transmission. The lower device should recognize the metadata and translate
    it to HW specific format which is most likely either a special header
    inserted before the network headers or descriptor/metadata fields.

    Metadata is associated with the lower device by storing the netdev pointer
    along with port id so that if TC decides to redirect or mirror the new
    netdev will not try to interpret it.

    This is mostly for SR-IOV devices since switches don't have lower netdevs
    today.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sridhar Samudrala
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

24 Jun, 2017

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit 31fd85816dbe ("bpf: permits narrower load from bpf program
    context fields") permits narrower load for certain ctx fields.
    The commit however will already generate a masking even if
    the prog-specific ctx conversion produces the result with
    narrower size.

    For example, for __sk_buff->protocol, the ctx conversion
    loads the data into a register with a 2-byte load.
    A narrower 2-byte load should not generate masking.
    For __sk_buff->vlan_present, the conversion function
    sets the result as either 0 or 1, essentially a byte.
    Narrower 2-byte or 1-byte loads should not generate masking.

    To avoid unnecessary masking, prog-specific *_is_valid_access
    now passes converted_op_size back to verifier, which indicates
    the valid data width after perceived future conversion.
    Based on this information, the verifier is able to avoid
    unnecessary masking.

    Since we want more information back from prog-specific
    *_is_valid_access checking, all of them are packed into
    one data structure for more clarity.

    Acked-by: Daniel Borkmann
    Signed-off-by: Yonghong Song
    Signed-off-by: David S. Miller

    Yonghong Song