Eric Lee / smarc-fsl-linux-kernel

29 Jan, 2018

1 commit

624441927 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler fix from Thomas Gleixner:
"A single bug fix to prevent a subtle deadlock in the scheduler core
code vs cpu hotplug"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix cpu.max vs. cpuhotplug deadlock

Linus Torvalds
2018-01-29 03:51:45 +0800

26 Jan, 2018

1 commit

f15ca723c net: don't call update_pmtu unconditionally ... Browse Code »

Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
"BUG: unable to handle kernel NULL pointer dereference at (null)"

Let's add a helper to check if update_pmtu is available before calling it.

Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
CC: Roman Kapl
CC: Xin Long
Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2018-01-26 05:27:34 +0800

25 Jan, 2018

5 commits

4ee806d51 net: tcp: close sock if net namespace is exiting ... Browse Code »

When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open. However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open. In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence. The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman
Signed-off-by: David S. Miller

Dan Streetman
2018-01-25 23:56:45 +0800
5b7d27967 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) Avoid negative netdev refcount in error flow of xfrm state add, from
Aviad Yehezkel.

2) Fix tcpdump decoding of IPSEC decap'd frames by filling in the
ethernet header protocol field in xfrm{4,6}_mode_tunnel_input().
From Yossi Kuperman.

3) Fix a syzbot triggered skb_under_panic in pppoe having to do with
failing to allocate an appropriate amount of headroom. From
Guillaume Nault.

4) Fix memory leak in vmxnet3 driver, from Neil Horman.

5) Cure out-of-bounds packet memory access in em_nbyte EMATCH module,
from Wolfgang Bumiller.

6) Restrict what kinds of sockets can be bound to the KCM multiplexer
and also disallow when another layer has attached to the socket and
made use of sk_user_data. From Tom Herbert.

7) Fix use before init of IOTLB in vhost code, from Jason Wang.

8) Correct STACR register write bit definition in IBM emac driver, from
Ivan Mikhaylov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net/ibm/emac: wrong bit is used for STA control register write
net/ibm/emac: add 8192 rx/tx fifo size
vhost: do not try to access device IOTLB when not initialized
vhost: use mutex_lock_nested() in vhost_dev_lock_vqs()
i40e: flower: check if TC offload is enabled on a netdev
qed: Free reserved MR tid
qed: Remove reserveration of dpi for kernel
kcm: Check if sk_user_data already set in kcm_attach
kcm: Only allow TCP sockets to be attached to a KCM mux
net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr
net: sched: em_nbyte: don't add the data offset twice
mlxsw: spectrum_router: Don't log an error on missing neighbor
vmxnet3: repair memory leak
ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
pppoe: take ->needed_headroom of lower device into account on xmit
xfrm: fix boolean assignment in xfrm_get_type_offload
xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version
xfrm: fix error flow in case of add state fails
xfrm: Add SA to hardware at the end of xfrm_state_construct()

Linus Torvalds
2018-01-25 09:24:30 +0800
d3303a65a net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr ... Browse Code »

TCF_LAYER_LINK and TCF_LAYER_NETWORK returned the same pointer as
skb->data points to the network header.
Use skb_mac_header instead.

Signed-off-by: Wolfgang Bumiller
Signed-off-by: David S. Miller

Wolfgang Bumiller
2018-01-25 03:52:48 +0800
03fae44b4 Merge tag 'trace-v4.15-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing fixes from Steven Rostedt:
"With the new ORC unwinder, ftrace stack tracing became disfunctional.

One was that ORC didn't know how to handle the ftrace callbacks in
general (which Josh fixed).

The other was that ORC would just bail if it hit a dynamically
allocated trampoline. Which means all ftrace stack tracing that
happens from the function tracer would produce no results (that
includes killing the max stack size tracer). I added a check to the
ORC unwinder to see if the trampoline belonged to ftrace, and if it
did, use the orc entry of the static trampoline that was used to
create the dynamic one (it would be identical).

Finally, I noticed that the skip values of the stack tracing were out
of whack. I went through and fixed them up"

* tag 'trace-v4.15-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Update stack trace skipping for ORC unwinder
ftrace, orc, x86: Handle ftrace dynamically allocated trampolines
x86/ftrace: Fix ORC unwinding from ftrace handlers

Linus Torvalds
2018-01-25 02:08:16 +0800
5132ede0f Revert "module: Add retpoline tag to VERMAGIC" ... Browse Code »

This reverts commit 6cfb521ac0d5b97470883ff9b7facae264b7ab12.

Turns out distros do not want to make retpoline as part of their "ABI",
so this patch should not have been merged. Sorry Andi, this was my
fault, I suggested it when your original patch was the "correct" way of
doing this instead.

Reported-by: Jiri Kosina
Fixes: 6cfb521ac0d5 ("module: Add retpoline tag to VERMAGIC")
Acked-by: Andi Kleen
Cc: Thomas Gleixner
Cc: David Woodhouse
Cc: rusty@rustcorp.com.au
Cc: arjan.van.de.ven@intel.com
Cc: jeyu@kernel.org
Cc: stable
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Linus Torvalds

Greg Kroah-Hartman
2018-01-25 01:00:05 +0800

24 Jan, 2018

3 commits

ce48c1464 sched/core: Fix cpu.max vs. cpuhotplug deadlock ... Browse Code »

Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

tg_set_cfs_bandwidth()
get_online_cpus()
cpus_read_lock()

cfs_bandwidth_usage_inc()
static_key_slow_inc()
cpus_read_lock()

Reported-by: Tejun Heo
Tested-by: Tejun Heo
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
Signed-off-by: Ingo Molnar

Peter Zijlstra
2018-01-24 17:03:44 +0800
e9191ffb6 ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL ... Browse Code »

Commit 513674b5a2c9 ("net: reevalulate autoflowlabel setting after
sysctl setting") removed the initialisation of
ipv6_pinfo::autoflowlabel and added a second flag to indicate
whether this field or the net namespace default should be used.

The getsockopt() handling for this case was not updated, so it
currently returns 0 for all sockets for which IPV6_AUTOFLOWLABEL is
not explicitly enabled. Fix it to return the effective value, whether
that has been set at the socket or net namespace level.

Fixes: 513674b5a2c9 ("net: reevalulate autoflowlabel setting after sysctl ...")
Signed-off-by: Ben Hutchings
Signed-off-by: David S. Miller

Ben Hutchings
2018-01-24 08:53:24 +0800
6be7fa3c7 ftrace, orc, x86: Handle ftrace dynamically allocated trampolines ... Browse Code »

The function tracer can create a dynamically allocated trampoline that is
called by the function mcount or fentry hook that is used to call the
function callback that is registered. The problem is that the orc undwinder
will bail if it encounters one of these trampolines. This breaks the stack
trace of function callbacks, which include the stack tracer and setting the
stack trace for individual functions.

Since these dynamic trampolines are basically copies of the static ftrace
trampolines defined in ftrace_*.S, we do not need to create new orc entries
for the dynamic trampolines. Finding the return address on the stack will be
identical as the functions that were copied to create the dynamic
trampolines. When encountering a ftrace dynamic trampoline, we can just use
the orc entry of the ftrace static function that was copied for that
trampoline.

Signed-off-by: Steven Rostedt (VMware)

Steven Rostedt (VMware)
2018-01-24 04:56:55 +0800

22 Jan, 2018

1 commit

0d665e7b1 mm, page_vma_mapped: Drop faulty pointer arithmetics in check_pte() ... Browse Code »

Tetsuo reported random crashes under memory pressure on 32-bit x86
system and tracked down to change that introduced
page_vma_mapped_walk().

The root cause of the issue is the faulty pointer math in check_pte().
As ->pte may point to an arbitrary page we have to check that they are
belong to the section before doing math. Otherwise it may lead to weird
results.

It wasn't noticed until now as mem_map[] is virtually contiguous on
flatmem or vmemmap sparsemem. Pointer arithmetic just works against all
'struct page' pointers. But with classic sparsemem, it doesn't because
each section memap is allocated separately and so consecutive pfns
crossing two sections might have struct pages at completely unrelated
addresses.

Let's restructure code a bit and replace pointer arithmetic with
operations on pfns.

Signed-off-by: Kirill A. Shutemov
Reported-and-tested-by: Tetsuo Handa
Acked-by: Michal Hocko
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2018-01-22 09:44:47 +0800

21 Jan, 2018

2 commits

24b612404 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm ... Browse Code »

Pull KVM fixes from Radim Krčmář:
"ARM:
- fix incorrect huge page mappings on systems using the contiguous
hint for hugetlbfs
- support alternative GICv4 init sequence
- correctly implement the ARM SMCC for HVC and SMC handling

PPC:
- add KVM IOCTL for reporting vulnerability and workaround status

s390:
- provide userspace interface for branch prediction changes in
firmware

x86:
- use correct macros for bits"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: wire up bpb feature
KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs()
arm64: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls
KVM: arm64: Fix GICv4 init when called from vgic_its_create
KVM: arm/arm64: Check pagesize when allocating a hugepage at Stage 2

Linus Torvalds
2018-01-21 03:41:09 +0800
35b3fde62 KVM: s390: wire up bpb feature ... Browse Code »

The new firmware interfaces for branch prediction behaviour changes
are transparently available for the guest. Nevertheless, there is
new state attached that should be migrated and properly resetted.
Provide a mechanism for handling reset, migration and VSIE.

Signed-off-by: Christian Borntraeger
Reviewed-by: David Hildenbrand
Reviewed-by: Cornelia Huck
[Changed capability number to 152. - Radim]
Signed-off-by: Radim Krčmář

Christian Borntraeger
2018-01-21 00:30:47 +0800

20 Jan, 2018

1 commit

a3d6c976f sparse doesn't support struct randomization ... Browse Code »

Without this patch, I drown in a sea of unknown attribute warnings

Link: http://lkml.kernel.org/r/20180117024539.27354-1-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Kees Cook
Cc: Ingo Molnar
Cc: Josh Poimboeuf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox
2018-01-20 02:09:41 +0800

19 Jan, 2018

1 commit

3214d01f1 KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds ... Browse Code »

This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
information about the underlying machine's level of vulnerability
to the recently announced vulnerabilities CVE-2017-5715,
CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
instructions to assist software to work around the vulnerabilities.

The ioctl returns two u64 words describing characteristics of the
CPU and required software behaviour respectively, plus two mask
words which indicate which bits have been filled in by the kernel,
for extensibility. The bit definitions are the same as for the
new H_GET_CPU_CHARACTERISTICS hypercall.

There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
indicates whether the new ioctl is available.

Signed-off-by: Paul Mackerras

Paul Mackerras
2018-01-19 12:17:01 +0800

18 Jan, 2018

2 commits

9a4ba2ab0 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler fix from Ingo Molnar:
"A delayacct statistics correctness fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
delayacct: Account blkio completion on the correct task

Linus Torvalds
2018-01-18 04:28:22 +0800
88dc7fca1 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 pti bits and fixes from Thomas Gleixner:
"This last update contains:

- An objtool fix to prevent a segfault with the gold linker by
changing the invocation order. That's not just for gold, it's a
general robustness improvement.

- An improved error message for objtool which spares tearing hairs.

- Make KASAN fail loudly if there is not enough memory instead of
oopsing at some random place later

- RSB fill on context switch to prevent RSB underflow and speculation
through other units.

- Make the retpoline/RSB functionality work reliably for both Intel
and AMD

- Add retpoline to the module version magic so mismatch can be
detected

- A small (non-fix) update for cpufeatures which prevents cpu feature
clashing for the upcoming extra mitigation bits to ease
backporting"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
module: Add retpoline tag to VERMAGIC
x86/cpufeature: Move processor tracing out of scattered features
objtool: Improve error message for bad file argument
objtool: Fix seg fault with gold linker
x86/retpoline: Add LFENCE to the retpoline/RSB filling RSB macros
x86/retpoline: Fill RSB on context switch for affected CPUs
x86/kasan: Panic if there is not enough memory to boot

Linus Torvalds
2018-01-18 03:54:56 +0800

17 Jan, 2018

4 commits

6cfb521ac module: Add retpoline tag to VERMAGIC ... Browse Code »

Add a marker for retpoline to the module VERMAGIC. This catches the case
when a non RETPOLINE compiled module gets loaded into a retpoline kernel,
making it insecure.

It doesn't handle the case when retpoline has been runtime disabled. Even
in this case the match of the retcompile status will be enforced. This
implies that even with retpoline run time disabled all modules loaded need
to be recompiled.

Signed-off-by: Andi Kleen
Signed-off-by: Thomas Gleixner
Reviewed-by: Greg Kroah-Hartman
Acked-by: David Woodhouse
Cc: rusty@rustcorp.com.au
Cc: arjan.van.de.ven@intel.com
Cc: jeyu@kernel.org
Cc: torvalds@linux-foundation.org
Link: https://lkml.kernel.org/r/20180116205228.4890-1-andi@firstfloor.org

Andi Kleen
2018-01-17 18:35:14 +0800
b45a53be5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) Two read past end of buffer fixes in AF_KEY, from Eric Biggers.

2) Memory leak in key_notify_policy(), from Steffen Klassert.

3) Fix overflow with bpf arrays, from Daniel Borkmann.

4) Fix RDMA regression with mlx5 due to mlx5 no longer using
pci_irq_get_affinity(), from Saeed Mahameed.

5) Missing RCU read locking in nl80211_send_iface() when it calls
ieee80211_bss_get_ie(), from Dominik Brodowski.

6) cfg80211 should check dev_set_name()'s return value, from Johannes
Berg.

7) Missing module license tag in 9p protocol, from Stephen Hemminger.

8) Fix crash due to too small MTU in udp ipv6 sendmsg, from Mike
Maloney.

9) Fix endless loop in netlink extack code, from David Ahern.

10) TLS socket layer sets inverted error codes, resulting in an endless
loop. From Robert Hering.

11) Revert openvswitch erspan tunnel support, it's mis-designed and we
need to kill it before it goes into a real release. From William Tu.

12) Fix lan78xx failures in full speed USB mode, from Yuiko Oshino.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits)
net, sched: fix panic when updating miniq {b,q}stats
qed: Fix potential use-after-free in qed_spq_post()
nfp: use the correct index for link speed table
lan78xx: Fix failure in USB Full Speed
sctp: do not allow the v4 socket to bind a v4mapped v6 address
sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
sctp: reinit stream if stream outcnt has been change by sinit in sendmsg
ibmvnic: Fix pending MAC address changes
netlink: extack: avoid parenthesized string constant warning
ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
net: Allow neigh contructor functions ability to modify the primary_key
sh_eth: fix dumping ARSTR
Revert "openvswitch: Add erspan tunnel support."
net/tls: Fix inverted error codes to avoid endless loop
ipv6: ip6_make_skb() needs to clear cork.base.dst
sctp: avoid compiler warning on implicit fallthru
net: ipv4: Make "ip route get" match iif lo rules again.
netlink: extack needs to be reset each time through loop
tipc: fix a memory leak in tipc_nl_node_get_link()
ipv6: fix udpv6 sendmsg crash caused by too small MTU
...

Linus Torvalds
2018-01-17 04:45:30 +0800
81d947e2b net, sched: fix panic when updating miniq {b,q}stats ... Browse Code »

While working on fixing another bug, I ran into the following panic
on arm64 by simply attaching clsact qdisc, adding a filter and running
traffic on ingress to it:

[...]
[ 178.188591] Unable to handle kernel read from unreadable memory at virtual address 810fb501f000
[ 178.197314] Mem abort info:
[ 178.200121] ESR = 0x96000004
[ 178.203168] Exception class = DABT (current EL), IL = 32 bits
[ 178.209095] SET = 0, FnV = 0
[ 178.212157] EA = 0, S1PTW = 0
[ 178.215288] Data abort info:
[ 178.218175] ISV = 0, ISS = 0x00000004
[ 178.222019] CM = 0, WnR = 0
[ 178.224997] user pgtable: 4k pages, 48-bit VAs, pgd = 0000000023cb3f33
[ 178.231531] [0000810fb501f000] *pgd=0000000000000000
[ 178.236508] Internal error: Oops: 96000004 [#1] SMP
[...]
[ 178.311855] CPU: 73 PID: 2497 Comm: ping Tainted: G W 4.15.0-rc7+ #5
[ 178.319413] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
[ 178.326887] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 178.331685] pc : __netif_receive_skb_core+0x49c/0xac8
[ 178.336728] lr : __netif_receive_skb+0x28/0x78
[ 178.341161] sp : ffff00002344b750
[ 178.344465] x29: ffff00002344b750 x28: ffff810fbdfd0580
[ 178.349769] x27: 0000000000000000 x26: ffff000009378000
[...]
[ 178.418715] x1 : 0000000000000054 x0 : 0000000000000000
[ 178.424020] Process ping (pid: 2497, stack limit = 0x000000009f0a3ff4)
[ 178.430537] Call trace:
[ 178.432976] __netif_receive_skb_core+0x49c/0xac8
[ 178.437670] __netif_receive_skb+0x28/0x78
[ 178.441757] process_backlog+0x9c/0x160
[ 178.445584] net_rx_action+0x2f8/0x3f0
[...]

Reason is that sch_ingress and sch_clsact are doing mini_qdisc_pair_init()
which sets up miniq pointers to cpu_{b,q}stats from the underlying qdisc.
Problem is that this cannot work since they are actually set up right after
the qdisc ->init() callback in qdisc_create(), so first packet going into
sch_handle_ingress() tries to call mini_qdisc_bstats_cpu_update() and we
therefore panic.

In order to fix this, allocation of {b,q}stats needs to happen before we
call into ->init(). In net-next, there's already such option through commit
d59f5ffa59d8 ("net: sched: a dflt qdisc may be used with per cpu stats").
However, the bug needs to be fixed in net still for 4.15. Thus, include
these bits to reduce any merge churn and reuse the static_flags field to
set TCQ_F_CPUSTATS, and remove the allocation from qdisc_create() since
there is no other user left. Prashant Bhole ran into the same issue but
for net-next, thus adding him below as well as co-author. Same issue was
also reported by Sandipan Das when using bcc.

Fixes: 46209401f8f6 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath")
Reference: https://lists.iovisor.org/pipermail/iovisor-dev/2018-January/001190.html
Reported-by: Sandipan Das
Co-authored-by: Prashant Bhole
Co-authored-by: John Fastabend
Signed-off-by: Daniel Borkmann
Cc: Jiri Pirko
Signed-off-by: David S. Miller

Daniel Borkmann
2018-01-17 04:02:36 +0800
161f72ed6 Merge tag 'mac80211-for-davem-2018-01-15' of git://git.kernel.org/pub/scm/linux/… ... Browse Code »

…kernel/git/jberg/mac80211

Johannes Berg says:

====================
More fixes:
* hwsim:
- properly flush deletion works at module unload
- validate # of channels passed from userspace
* cfg80211:
- fix RCU locking regression
- initialize on-stack channel data for nl80211 event
- check dev_set_name() return value
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2018-01-17 03:28:14 +0800

16 Jan, 2018

6 commits

c96f5471c delayacct: Account blkio completion on the correct task ... Browse Code »

Before commit:

e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

delayacct_blkio_end() was called after context-switching into the task which
completed I/O.

This resulted in double counting: the task would account a delay both waiting
for I/O and for time spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
In ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete.

But delayacct_blkio_end() relies on 'get_current()', and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.

Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
so that it can update the statistics of the correct task.

Signed-off-by: Josh Snyder
Acked-by: Tejun Heo
Acked-by: Balbir Singh
Cc:
Cc: Brendan Gregg
Cc: Jens Axboe
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-block@vger.kernel.org
Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
Signed-off-by: Ingo Molnar

Josh Snyder
2018-01-16 10:29:36 +0800
6311b7ce4 netlink: extack: avoid parenthesized string constant warning ... Browse Code »

NL_SET_ERR_MSG() and NL_SET_ERR_MSG_ATTR() lead to the following warning
in newer versions of gcc:
warning: array initialized from parenthesized string constant

Just remove the parentheses, they're not needed in this context since
anyway since there can be no operator precendence issues or similar.

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2018-01-16 04:15:23 +0800
cd9ff4de0 ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY ... Browse Code »

Map all lookup neigh keys to INADDR_ANY for loopback/point-to-point devices
to avoid making an entry for every remote ip the device needs to talk to.

This used the be the old behavior but became broken in a263b3093641f
(ipv4: Make neigh lookups directly in output packet path) and later removed
in 0bb4087cbec0 (ipv4: Fix neigh lookup keying over loopback/point-to-point
devices) because it was broken.

Signed-off-by: Jim Westfall
Signed-off-by: David S. Miller

Jim Westfall
2018-01-16 03:53:43 +0800
95a332088 Revert "openvswitch: Add erspan tunnel support." ... Browse Code »

This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.

The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
as a nested attribute to support all ERSPAN v1 and v2's fields.
The current attr is a be32 supporting only one field. Thus, this
patch reverts it and later patch will redo it using nested attr.

Signed-off-by: William Tu
Cc: Jiri Benc
Cc: Pravin Shelar
Acked-by: Jiri Benc
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

William Tu
2018-01-16 03:33:16 +0800
30be8f8db net/tls: Fix inverted error codes to avoid endless loop ... Browse Code »

sendfile() calls can hang endless with using Kernel TLS if a socket error occurs.
Socket error codes must be inverted by Kernel TLS before returning because
they are stored with positive sign. If returned non-inverted they are
interpreted as number of bytes sent, causing endless looping of the
splice mechanic behind sendfile().

Signed-off-by: Robert Hering
Signed-off-by: David S. Miller

r.hering@avm.de
2018-01-16 03:21:57 +0800
66940f35d ptr_ring: document usage around __ptr_ring_peek ... Browse Code »

This explains why is the net usage of __ptr_ring_peek
actually ok without locks.

Signed-off-by: Michael S. Tsirkin
Acked-by: John Fastabend
Signed-off-by: David S. Miller

Michael S. Tsirkin
2018-01-16 02:19:12 +0800

15 Jan, 2018

2 commits

51a1aaa63 mac80211_hwsim: validate number of different channels ... Browse Code »

When creating a new radio on the fly, hwsim allows this
to be done with an arbitrary number of channels, but
cfg80211 only supports a limited number of simultaneous
channels, leading to a warning.

Fix this by validating the number - this requires moving
the define for the maximum out to a visible header file.

Reported-by: syzbot+8dd9051ff19940290931@syzkaller.appspotmail.com
Fixes: b59ec8dd4394 ("mac80211_hwsim: fix number of channels in interface combinations")
Signed-off-by: Johannes Berg

Johannes Berg
2018-01-15 16:34:45 +0800
40548c6b6 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull x86 pti updates from Thomas Gleixner:
"This contains:

- a PTI bugfix to avoid setting reserved CR3 bits when PCID is
disabled. This seems to cause issues on a virtual machine at least
and is incorrect according to the AMD manual.

- a PTI bugfix which disables the perf BTS facility if PTI is
enabled. The BTS AUX buffer is not globally visible and causes the
CPU to fault when the mapping disappears on switching CR3 to user
space. A full fix which restores BTS on PTI is non trivial and will
be worked on.

- PTI bugfixes for EFI and trusted boot which make sure that the user
space visible page table entries have the NX bit cleared

- removal of dead code in the PTI pagetable setup functions

- add PTI documentation

- add a selftest for vsyscall to verify that the kernel actually
implements what it advertises.

- a sysfs interface to expose vulnerability and mitigation
information so there is a coherent way for users to retrieve the
status.

- the initial spectre_v2 mitigations, aka retpoline:

+ The necessary ASM thunk and compiler support

+ The ASM variants of retpoline and the conversion of affected ASM
code

+ Make LFENCE serializing on AMD so it can be used as speculation
trap

+ The RSB fill after vmexit

- initial objtool support for retpoline

As I said in the status mail this is the most of the set of patches
which should go into 4.15 except two straight forward patches still on
hold:

- the retpoline add on of LFENCE which waits for ACKs

- the RSB fill after context switch

Both should be ready to go early next week and with that we'll have
covered the major holes of spectre_v2 and go back to normality"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
x86,perf: Disable intel_bts when PTI
security/Kconfig: Correct the Documentation reference for PTI
x86/pti: Fix !PCID and sanitize defines
selftests/x86: Add test_vsyscall
x86/retpoline: Fill return stack buffer on vmexit
x86/retpoline/irq32: Convert assembler indirect jumps
x86/retpoline/checksum32: Convert assembler indirect jumps
x86/retpoline/xen: Convert Xen hypercall indirect jumps
x86/retpoline/hyperv: Convert assembler indirect jumps
x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
x86/retpoline/entry: Convert entry assembler indirect jumps
x86/retpoline/crypto: Convert crypto assembler indirect jumps
x86/spectre: Add boot time option to select Spectre v2 mitigation
x86/retpoline: Add initial retpoline support
objtool: Allow alternatives to be ignored
objtool: Detect jumps to retpoline thunks
x86/pti: Make unpoison of pgd for trusted boot work for real
x86/alternatives: Fix optimize_nops() checking
sysfs/cpu: Fix typos in vulnerability documentation
x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
...

Linus Torvalds
2018-01-15 01:51:25 +0800

14 Jan, 2018

2 commits

ed93de842 Merge branch 'akpm' (patches from Andrew) ... Browse Code »

Merge misc fixlets from Andrew Morton:
"4 fixes"

* emailed patches from Andrew Morton :
tools/objtool/Makefile: don't assume sync-check.sh is executable
kdump: write correct address of mem_section into vmcoreinfo
kmemleak: allow to coexist with fault injection
MAINTAINERS, nilfs2: change project home URLs

Linus Torvalds
2018-01-14 03:07:55 +0800
a0b128036 kdump: write correct address of mem_section into vmcoreinfo ... Browse Code »

Depending on configuration mem_section can now be an array or a pointer
to an array allocated dynamically. In most cases, we can continue to
refer to it as 'mem_section' regardless of what it is.

But there's one exception: '&mem_section' means "address of the array"
if mem_section is an array, but if mem_section is a pointer, it would
mean "address of the pointer".

We've stepped onto this in kdump code. VMCOREINFO_SYMBOL(mem_section)
writes down address of pointer into vmcoreinfo, not array as we wanted.

Let's introduce VMCOREINFO_SYMBOL_ARRAY() that would handle the
situation correctly for both cases.

Link: http://lkml.kernel.org/r/20180112162532.35896-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Acked-by: Baoquan He
Acked-by: Dave Young
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Greg Kroah-Hartman
Cc: Dave Young
Cc: Baoquan He
Cc: Vivek Goyal
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2018-01-14 02:42:48 +0800

13 Jan, 2018

1 commit

02776b9b5 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull locking fixes from Ingo Molnar:
"No functional effects intended: removes leftovers from recent lockdep
and refcounts work"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/refcounts: Remove stale comment from the ARCH_HAS_REFCOUNT Kconfig entry
locking/lockdep: Remove cross-release leftovers
locking/Documentation: Remove stale crossrelease_fullstack parameter

Linus Torvalds
2018-01-13 02:14:09 +0800

12 Jan, 2018

2 commits

05e0cc84e net/mlx5: Fix get vector affinity helper function ... Browse Code »

mlx5_get_vector_affinity used to call pci_irq_get_affinity and after
reverting the patch that sets the device affinity via PCI_IRQ_AFFINITY
API, calling pci_irq_get_affinity becomes useless and it breaks RDMA
mlx5 users. To fix this, this patch provides an alternative way to
retrieve IRQ vector affinity using legacy IRQ API, following
smp_affinity read procfs implementation.

Fixes: 231243c82793 ("Revert mlx5: move affinity hints assignments to generic code")
Fixes: a435393acafb ("mlx5: move affinity hints assignments to generic code")
Cc: Sagi Grimberg
Signed-off-by: Saeed Mahameed

Saeed Mahameed
2018-01-12 08:01:40 +0800
8978cc921 {net,ib}/mlx5: Don't disable local loopback multicast traffic when needed ... Browse Code »

There are systems platform information management interfaces (such as
HOST2BMC) for which we cannot disable local loopback multicast traffic.

Separate disable_local_lb_mc and disable_local_lb_uc capability bits so
driver will not disable multicast loopback traffic if not supported.
(It is expected that Firmware will not set disable_local_lb_mc if
HOST2BMC is running for example.)

Function mlx5_nic_vport_update_local_lb will do best effort to
disable/enable UC/MC loopback traffic and return success only in case it
succeeded to changed all allowed by Firmware.

Adapt mlx5_ib and mlx5e to support the new cap bits.

Fixes: 2c43c5a036be ("net/mlx5e: Enable local loopback in loopback selftest")
Fixes: c85023e153e3 ("IB/mlx5: Add raw ethernet local loopback support")
Fixes: bded747bb432 ("net/mlx5: Add raw ethernet local loopback firmware command")
Signed-off-by: Eran Ben Elisha
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed

Eran Ben Elisha
2018-01-12 06:52:42 +0800

11 Jan, 2018

1 commit

661e4e33a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf ... Browse Code »

Daniel Borkmann says:

====================
pull-request: bpf 2018-01-09

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Prevent out-of-bounds speculation in BPF maps by masking the
index after bounds checks in order to fix spectre v1, and
add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for
removing the BPF interpreter from the kernel in favor of
JIT-only mode to make spectre v2 harder, from Alexei.

2) Remove false sharing of map refcount with max_entries which
was used in spectre v1, from Daniel.

3) Add a missing NULL psock check in sockmap in order to fix
a race, from John.

4) Fix test_align BPF selftest case since a recent change in
verifier rejects the bit-wise arithmetic on pointers
earlier but test_align update was missing, from Alexei.
====================

Signed-off-by: David S. Miller

David S. Miller
2018-01-11 00:17:21 +0800

10 Jan, 2018

1 commit

be95a845c bpf: avoid false sharing of map refcount with max_entries ... Browse Code »

In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds
speculation") also change the layout of struct bpf_map such that
false sharing of fast-path members like max_entries is avoided
when the maps reference counter is altered. Therefore enforce
them to be placed into separate cachelines.

pahole dump after change:

struct bpf_map {
const struct bpf_map_ops * ops; /* 0 8 */
struct bpf_map * inner_map_meta; /* 8 8 */
void * security; /* 16 8 */
enum bpf_map_type map_type; /* 24 4 */
u32 key_size; /* 28 4 */
u32 value_size; /* 32 4 */
u32 max_entries; /* 36 4 */
u32 map_flags; /* 40 4 */
u32 pages; /* 44 4 */
u32 id; /* 48 4 */
int numa_node; /* 52 4 */
bool unpriv_array; /* 56 1 */

/* XXX 7 bytes hole, try to pack */

/* --- cacheline 1 boundary (64 bytes) --- */
struct user_struct * user; /* 64 8 */
atomic_t refcnt; /* 72 4 */
atomic_t usercnt; /* 76 4 */
struct work_struct work; /* 80 32 */
char name[16]; /* 112 16 */
/* --- cacheline 2 boundary (128 bytes) --- */

/* size: 128, cachelines: 2, members: 17 */
/* sum members: 121, holes: 1, sum holes: 7 */
};

Now all entries in the first cacheline are read only throughout
the life time of the map, set up once during map creation. Overall
struct size and number of cachelines doesn't change from the
reordering. struct bpf_map is usually first member and embedded
in map structs in specific map implementations, so also avoid those
members to sit at the end where it could potentially share the
cacheline with first map values e.g. in the array since remote
CPUs could trigger map updates just as well for those (easily
dirtying members like max_entries intentionally as well) while
having subsequent values in cache.

Quoting from Google's Project Zero blog [1]:

Additionally, at least on the Intel machine on which this was
tested, bouncing modified cache lines between cores is slow,
apparently because the MESI protocol is used for cache coherence
[8]. Changing the reference counter of an eBPF array on one
physical CPU core causes the cache line containing the reference
counter to be bounced over to that CPU core, making reads of the
reference counter on all other CPU cores slow until the changed
reference counter has been written back to memory. Because the
length and the reference counter of an eBPF array are stored in
the same cache line, this also means that changing the reference
counter on one physical CPU core causes reads of the eBPF array's
length to be slow on other physical CPU cores (intentional false
sharing).

While this doesn't 'control' the out-of-bounds speculation through
masking the index as in commit b2157399cc98, triggering a manipulation
of the map's reference counter is really trivial, so lets not allow
to easily affect max_entries from it.

Splitting to separate cachelines also generally makes sense from
a performance perspective anyway in that fast-path won't have a
cache miss if the map gets pinned, reused in other progs, etc out
of control path, thus also avoids unintentional false sharing.

[1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html

Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov

Daniel Borkmann
2018-01-10 02:07:30 +0800

09 Jan, 2018

4 commits

ef7f8cec8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) Frag and UDP handling fixes in i40e driver, from Amritha Nambiar and
Alexander Duyck.

2) Undo unintentional UAPI change in netfilter conntrack, from Florian
Westphal.

3) Revert a change to how error codes are returned from
dev_get_valid_name(), it broke some apps.

4) Cannot cache routes for ipv6 tunnels in the tunnel is ipv4/ipv6
dual-stack. From Eli Cooper.

5) Fix missed PMTU updates in geneve, from Xin Long.

6) Cure double free in macvlan, from Gao Feng.

7) Fix heap out-of-bounds write in rds_message_alloc_sgs(), from
Mohamed Ghannam.

8) FEC bug fixes from FUgang Duan (mis-accounting of dev_id, missed
deferral of probe when the regulator is not ready yet).

9) Missing DMA mapping error checks in 3c59x, from Neil Horman.

10) Turn off Broadcom tags for some b53 switches, from Florian Fainelli.

11) Fix OOPS when get_target_net() is passed an SKB whose NETLINK_CB()
isn't initialized. From Andrei Vagin.

12) Fix crashes in fib6_add(), from Wei Wang.

13) PMTU bug fixes in SCTP from Marcelo Ricardo Leitner.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
sh_eth: fix TXALCR1 offsets
mdio-sun4i: Fix a memory leak
phylink: mark expected switch fall-throughs in phylink_mii_ioctl
sctp: fix the handling of ICMP Frag Needed for too small MTUs
sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled
xen-netfront: enable device after manual module load
bnxt_en: Fix the 'Invalid VF' id check in bnxt_vf_ndo_prep routine.
bnxt_en: Fix population of flow_type in bnxt_hwrm_cfa_flow_alloc()
sh_eth: fix SH7757 GEther initialization
net: fec: free/restore resource in related probe error pathes
uapi/if_ether.h: prevent redefinition of struct ethhdr
ipv6: fix general protection fault in fib6_add()
RDS: null pointer dereference in rds_atomic_free_op
sh_eth: fix TSU resource handling
net: stmmac: enable EEE in MII, GMII or RGMII only
rtnetlink: give a user socket to get_target_net()
MAINTAINERS: Update my email address.
can: ems_usb: improve error reporting for error warning and error passive
can: flex_can: Correct the checking for frame length in flexcan_start_xmit()
can: gs_usb: fix return value of the "set_bittiming" callback
...

Linus Torvalds
2018-01-09 12:21:39 +0800
b2157399c bpf: prevent out-of-bounds speculation ... Browse Code »

Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.

Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.

If prog_array map is created by unpriv user replace
bpf_tail_call(ctx, map, index);
with
if (index >= max_entries) {
index &= map->index_mask;
bpf_tail_call(ctx, map, index);
}
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.

That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.

v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov
Acked-by: John Fastabend
Signed-off-by: Daniel Borkmann

Alexei Starovoitov
2018-01-09 07:53:49 +0800
b6c5734db sctp: fix the handling of ICMP Frag Needed for too small MTUs ... Browse Code »

syzbot reported a hang involving SCTP, on which it kept flooding dmesg
with the message:
[ 246.742374] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
low, using default minimum of 512

That happened because whenever SCTP hits an ICMP Frag Needed, it tries
to adjust to the new MTU and triggers an immediate retransmission. But
it didn't consider the fact that MTUs smaller than the SCTP minimum MTU
allowed (512) would not cause the PMTU to change, and issued the
retransmission anyway (thus leading to another ICMP Frag Needed, and so
on).

As IPv4 (ip_rt_min_pmtu=556) and IPv6 (IPV6_MIN_MTU=1280) minimum MTU
are higher than that, sctp_transport_update_pmtu() is changed to
re-fetch the PMTU that got set after our request, and with that, detect
if there was an actual change or not.

The fix, thus, skips the immediate retransmission if the received ICMP
resulted in no change, in the hope that SCTP will select another path.

Note: The value being used for the minimum MTU (512,
SCTP_DEFAULT_MINSEGMENT) is not right and instead it should be (576,
SCTP_MIN_PMTU), but such change belongs to another patch.

Changes from v1:
- do not disable PMTU discovery, in the light of commit
06ad391919b2 ("[SCTP] Don't disable PMTU discovery when mtu is small")
and as suggested by Xin Long.
- changed the way to break the rtx loop by detecting if the icmp
resulted in a change or not
Changes from v2:
none

See-also: https://lkml.org/lkml/2017/12/22/811
Reported-by: syzbot
Signed-off-by: Marcelo Ricardo Leitner
Signed-off-by: David S. Miller

Marcelo Ricardo Leitner
2018-01-09 03:19:13 +0800
527187d28 locking/lockdep: Remove cross-release leftovers ... Browse Code »

There's two cross-release leftover facilities:

- the crossrelease_hist_*() irq-tracing callbacks (NOPs currently)
- the complete_release_commit() callback (NOP as well)

Remove them.

Cc: David Sterba
Cc: Byungchul Park
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Ingo Molnar
2018-01-09 00:30:45 +0800