Eric Lee / smarc-fsl-linux-kernel

19 Aug, 2020

7 commits

e02c77edd include/asm-generic/vmlinux.lds.h: align ro_after_init

commit 7f897acbe5d57995438c831670b7c400e9c0dc00 upstream.

Since the patch [1], building the kernel using a toolchain built with
binutils 2.33.1 prevents booting a sh4 system under Qemu. Apply the patch
provided by Alan Modra [2] that fixes the alignment of rodata.

[1] https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=ebd2263ba9a9124d93bbc0ece63d7e0fae89b40e
[2] https://www.sourceware.org/ml/binutils/2019-12/msg00112.html

Signed-off-by: Romain Naour
Signed-off-by: Andrew Morton
Cc: Alan Modra
Cc: Bin Meng
Cc: Chen Zhou
Cc: Geert Uytterhoeven
Cc: John Paul Adrian Glaubitz
Cc: Krzysztof Kozlowski
Cc: Kuninori Morimoto
Cc: Rich Felker
Cc: Sam Ravnborg
Cc: Yoshinori Sato
Cc: Arnd Bergmann
Cc:
Link: https://marc.info/?l=linux-sh&m=158429470221261
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Romain Naour
2020-08-19 14:16:25 +0800
dcedddbc7 net: refactor bind_bucket fastreuse into helper

[ Upstream commit 62ffc589abb176821662efc4525ee4ac0b9c3894 ]

Refactor the fastreuse update code in inet_csk_get_port into a small
helper function that can be called from other places.

Acked-by: Matthieu Baerts
Signed-off-by: Tim Froidcoeur
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Tim Froidcoeur
2020-08-19 14:16:23 +0800
e07d0ccd7 tcp: correct read of TFO keys on big endian systems

[ Upstream commit f19008e676366c44e9241af57f331b6c6edf9552 ]

When TFO keys are read back on big endian systems either via the global
sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
don't match what was written.

For example, on s390x:

# echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
# cat /proc/sys/net/ipv4/tcp_fastopen_key
02000000-01000000-04000000-03000000

Instead of:

# cat /proc/sys/net/ipv4/tcp_fastopen_key
00000001-00000002-00000003-00000004

Fix this by converting to the correct endianness on read. This was
reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
selftest on s390x, which depends on the read value matching what was
written. I've confirmed that the test now passes on big and little endian
systems.
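
The mismatch described above can be illustrated with a minimal userspace sketch. It assumes the key is held as sixteen bytes in little-endian word order and is read back by assembling each word byte by byte, so the result does not depend on the host's endianness. The helper names (`put_le32`, `get_le32`, `format_key`) are hypothetical and are not the kernel's functions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Store a 32-bit word into a byte buffer in little-endian order. */
static void put_le32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v & 0xff);
    dst[1] = (uint8_t)((v >> 8) & 0xff);
    dst[2] = (uint8_t)((v >> 16) & 0xff);
    dst[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read a little-endian 32-bit word; independent of host byte order. */
static uint32_t get_le32(const uint8_t *src)
{
    return (uint32_t)src[0] |
           ((uint32_t)src[1] << 8) |
           ((uint32_t)src[2] << 16) |
           ((uint32_t)src[3] << 24);
}

/* Format a 16-byte key as four hex words; out needs at least 36 bytes. */
static void format_key(const uint8_t key[16], char *out, size_t out_len)
{
    assert(out_len >= 36);
    snprintf(out, out_len,
             "%08" PRIx32 "-%08" PRIx32 "-%08" PRIx32 "-%08" PRIx32,
             get_le32(key), get_le32(key + 4),
             get_le32(key + 8), get_le32(key + 12));
}
```

Writing the words 1, 2, 3, 4 with `put_le32()` and formatting them with `format_key()` yields the same text on big and little endian hosts, which is the property the selftest relies on.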

Signed-off-by: Jason Baron
Fixes: 438ac88009bc ("net: fastopen: robustness and endianness fixes for SipHash")
Cc: Ard Biesheuvel
Cc: Eric Dumazet
Reported-and-tested-by: Colin Ian King
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Jason Baron
2020-08-19 14:16:23 +0800
0c122fc90 ipvs: allow connection reuse for unconfirmed conntrack

[ Upstream commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f ]

YangYuxi reports that connection reuse
causes a one-second delay when a SYN hits
an existing connection in TIME_WAIT state.
This delay was added to give time to expire
both the IPVS connection and the corresponding
conntrack. This was considered a rare case
at the time, but it is causing problems in
some environments such as Kubernetes.

As nf_conntrack_tcp_packet() can decide to
release the conntrack in TIME_WAIT state and
to replace it with a fresh NEW conntrack, we
can use this to allow rescheduling just by
tuning our check: if the conntrack is
confirmed, we cannot schedule it to a different
real server and the one-second delay still
applies, but if a new conntrack was created,
we are free to select a new real server without
any delay.

YangYuxi lists some of the problem reports:

- One second connection delay in masquerading mode:
https://marc.info/?t=151683118100004&r=1&w=2

- IPVS low throughput #70747
https://github.com/kubernetes/kubernetes/issues/70747

- Apache Bench can fill up ipvs service proxy in seconds #544
https://github.com/cloudnativelabs/kube-router/issues/544

- Additional 1s latency in `host -> service IP -> pod`
https://github.com/kubernetes/kubernetes/issues/90854

Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")
Co-developed-by: YangYuxi
Signed-off-by: YangYuxi
Signed-off-by: Julian Anastasov
Reviewed-by: Simon Horman
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: Sasha Levin

Julian Anastasov
2020-08-19 14:16:10 +0800
0f09c88f2 seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID

[ Upstream commit 47e33c05f9f07cac3de833e531bcac9ae052c7ca ]

When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong
direction flag set. While this isn't a big deal as nothing currently
enforces these bits in the kernel, it should be defined correctly. Fix
the define and provide support for the old command until it is no longer
needed for backward compatibility.
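
As a rough illustration of what a "direction flag" is, the Linux-only sketch below encodes the same command number twice, once with `_IOR` and once with `_IOW` from `<linux/ioctl.h>`, and accepts both encodings in a handler. The magic value and all `EXAMPLE_*` names are hypothetical; this is not the seccomp header, only the general pattern of keeping an old, wrongly encoded command number working next to the corrected one.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

/* Hypothetical magic and command numbers, for illustration only. */
#define EXAMPLE_IOC_MAGIC           '!'
#define EXAMPLE_ID_VALID_WRONG_DIR  _IOR(EXAMPLE_IOC_MAGIC, 2, uint64_t)
#define EXAMPLE_ID_VALID            _IOW(EXAMPLE_IOC_MAGIC, 2, uint64_t)

/*
 * Accept both encodings: the corrected one, where userspace writes a
 * 64-bit ID to the kernel, and the old one kept for compatibility.
 */
static int example_is_id_valid_cmd(unsigned int cmd)
{
    switch (cmd) {
    case EXAMPLE_ID_VALID:
    case EXAMPLE_ID_VALID_WRONG_DIR:
        return 1;
    default:
        return 0;
    }
}
```

The two values differ only in the direction bits, which `_IOC_DIR()` extracts; the number and size fields are identical.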

Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Signed-off-by: Kees Cook
Signed-off-by: Sasha Levin

Kees Cook
2020-08-19 14:15:58 +0800
3a17c7bfe tpm: Require that all digests are present in TCG_PCR_EVENT2 structures

[ Upstream commit 7f3d176f5f7e3f0477bf82df0f600fcddcdcc4e4 ]

Require that the TCG_PCR_EVENT2.digests.count value strictly matches the
value of TCG_EfiSpecIdEvent.numberOfAlgorithms in the event field of the
TCG_PCClientPCREvent event log header. Also require that
TCG_EfiSpecIdEvent.numberOfAlgorithms is non-zero.

The TCG PC Client Platform Firmware Profile Specification section 9.1
(Family "2.0", Level 00 Revision 1.04) states:

For each Hash algorithm enumerated in the TCG_PCClientPCREvent entry,
there SHALL be a corresponding digest in all TCG_PCR_EVENT2 structures.
Note: This includes EV_NO_ACTION events which do not extend the PCR.

Section 9.4.5.1 provides this description of
TCG_EfiSpecIdEvent.numberOfAlgorithms:

The number of Hash algorithms in the digestSizes field. This field MUST
be set to a value of 0x01 or greater.

Enforce these restrictions, as required by the above specification, in
order to better identify and ignore invalid sequences of bytes at the
end of an otherwise valid TPM2 event log. Firmware doesn't always have
the means necessary to inform the kernel of the actual event log size so
the kernel's event log parsing code should be stringent when parsing the
event log for resiliency against firmware bugs. This is true, for
example, when firmware passes the event log to the kernel via a reserved
memory region described in device tree.

POWER and some ARM systems use the "linux,sml-base" and "linux,sml-size"
device tree properties to describe the memory region used to pass the
event log from firmware to the kernel. Unfortunately, the
"linux,sml-size" property describes the size of the entire reserved
memory region rather than the size of the event log within the memory
region and the event log format does not include information describing
the size of the event log.

tpm_read_log_of(), in drivers/char/tpm/eventlog/of.c, is where the
"linux,sml-size" property is used. At the end of that function,
log->bios_event_log_end is pointing at the end of the reserved memory
region. That's typically 0x10000 bytes offset from "linux,sml-base",
depending on what's defined in the device tree source.

The firmware event log only fills a portion of those 0x10000 bytes and
the rest of the memory region should be zeroed out by firmware. Even in
the case of properly zeroed bytes in the remainder of the memory
region, the only thing allowing the kernel's event log parser to detect
the end of the event log is the following conditional in
__calc_tpm2_event_size():

    if (event_type == 0 && event_field->event_size == 0)
        size = 0;

If that wasn't there, __calc_tpm2_event_size() would think that a 16
byte sequence of zeroes, following an otherwise valid event log, was
a valid event.

However, problems can occur if a single bit is set in the offset
corresponding to either the TCG_PCR_EVENT2.eventType or
TCG_PCR_EVENT2.eventSize fields, after the last valid event log entry.
This could confuse the parser into thinking that an additional entry is
present in the event log and exposing this invalid entry to userspace in
the /sys/kernel/security/tpm0/binary_bios_measurements file. Such
problems have been seen if firmware does not fully zero the memory
region upon a warm reboot.

This patch significantly raises the bar on how difficult it is for
stale/invalid memory to confuse the kernel's event log parser but
there's still, ultimately, a reliance on firmware to properly initialize
the remainder of the memory region reserved for the event log as the
parser cannot be expected to detect a stale but otherwise properly
formatted firmware event log entry.

Fixes: fd5c78694f3f ("tpm: fix handling of the TPM 2.0 event logs")
Signed-off-by: Tyler Hicks
Reviewed-by: Jarkko Sakkinen
Signed-off-by: Jarkko Sakkinen
Signed-off-by: Sasha Levin

Tyler Hicks
2020-08-19 14:15:57 +0800
16d2fb138 tracepoint: Mark __tracepoint_string's __used

commit f3751ad0116fb6881f2c3c957d66a9327f69cefb upstream.

__tracepoint_string's have their string data stored in .rodata, and an
address to that data stored in the "__tracepoint_str" section. Functions
that refer to those strings refer to the symbol of the address. Compiler
optimization can replace those address references with references
directly to the string data. If the address doesn't appear to have other
uses, then it appears dead to the compiler and is removed. This can
break the /tracing/printk_formats sysfs node which iterates the
addresses stored in the "__tracepoint_str" section.

Like other strings stored in custom sections in this header, mark these
__used to inform the compiler that there are other non-obvious users of
the address, so they should still be emitted.
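
The general pattern can be sketched in userspace, assuming GCC or Clang on an ELF target with a GNU-compatible linker (which provides `__start_<section>` and `__stop_<section>` symbols for sections whose names are valid C identifiers). The macro, section name, and function below are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Place a pointer to a string in a custom section. Nothing in this file
 * references the pointer by name, so without "used" the compiler would
 * be free to drop it.
 */
#define EXAMPLE_TP_STRING(name, str) \
    static const char *const name \
    __attribute__((used, section("example_tp_str"))) = str

EXAMPLE_TP_STRING(example_tp_a, "first");
EXAMPLE_TP_STRING(example_tp_b, "second");

/* Section bounds supplied by the linker (toolchain assumption). */
extern const char *const __start_example_tp_str[];
extern const char *const __stop_example_tp_str[];

/* Count the pointers by walking the section, the "non-obvious user". */
static size_t example_count_tp_strings(void)
{
    size_t n = 0;
    const char *const *p;

    for (p = __start_example_tp_str; p < __stop_example_tp_str; p++)
        n++;
    return n;
}
```

The only consumer of the pointers is the section walk, which the compiler cannot see when it decides whether each variable is dead; the attribute is what keeps them in the object file.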

Link: https://lkml.kernel.org/r/20200730224555.2142154-2-ndesaulniers@google.com

Cc: Ingo Molnar
Cc: Miguel Ojeda
Cc: stable@vger.kernel.org
Fixes: 102c9323c35a8 ("tracing: Add __tracepoint_string() to export string pointers")
Reported-by: Tim Murray
Reported-by: Simon MacMullen
Suggested-by: Greg Hackmann
Signed-off-by: Nick Desaulniers
Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Greg Kroah-Hartman

Nick Desaulniers
2020-08-19 14:15:53 +0800

11 Aug, 2020

5 commits

512570b17 nfsd: Fix NFSv4 READ on RDMA when using readv

commit 412055398b9e67e07347a936fc4a6adddabe9cf4 upstream.

svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().

This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.

Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.

That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.

Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.

Fixes: b04209806384 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever
Cc: Timo Rothenpieler
Signed-off-by: Greg Kroah-Hartman

Chuck Lever
2020-08-11 21:33:42 +0800
89c12bc36 ipv6: fix memory leaks on IPV6_ADDRFORM path

[ Upstream commit 8c0de6e96c9794cb523a516c465991a70245da1c ]

IPV6_ADDRFORM causes resource leaks when converting an IPv6 socket
to IPv4, particularly struct ipv6_ac_socklist. Similar to
struct ipv6_mc_socklist, we should just close it on this path.

This bug can be easily reproduced with the following C program:

#include
#include
#include
#include
#include

int main()
{
    int s, value;
    struct sockaddr_in6 addr;
    struct ipv6_mreq m6;

    s = socket(AF_INET6, SOCK_DGRAM, 0);
    addr.sin6_family = AF_INET6;
    addr.sin6_port = htons(5000);
    inet_pton(AF_INET6, "::ffff:192.168.122.194", &addr.sin6_addr);
    connect(s, (struct sockaddr *)&addr, sizeof(addr));

    inet_pton(AF_INET6, "fe80::AAAA", &m6.ipv6mr_multiaddr);
    m6.ipv6mr_interface = 5;
    setsockopt(s, SOL_IPV6, IPV6_JOIN_ANYCAST, &m6, sizeof(m6));

    value = AF_INET;
    setsockopt(s, SOL_IPV6, IPV6_ADDRFORM, &value, sizeof(value));

    close(s);
    return 0;
}

Reported-by: ch3332xr@gmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Cong Wang
2020-08-11 21:33:39 +0800
11e64146d xattr: break delegations in {set,remove}xattr

commit 08b5d5014a27e717826999ad20e394a8811aae92 upstream.

set/removexattr on an exported filesystem should break NFS delegations.
This is true in general, but also for the upcoming support for
RFC 8726 (NFSv4 extended attribute support). Make sure that they do.

Additionally, they need to grow a _locked variant, since callers might
call this with i_rwsem held (like the NFS server code).

Cc: stable@vger.kernel.org # v4.9+
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro
Signed-off-by: Frank van der Linden
Signed-off-by: Chuck Lever
Signed-off-by: Greg Kroah-Hartman

Frank van der Linden
2020-08-11 21:33:39 +0800
6059000e1 Drivers: hv: vmbus: Ignore CHANNELMSG_TL_CONNECT_RESULT(23)

[ Upstream commit ddc9d357b991838c2d975e8d7e4e9db26f37a7ff ]

When a Linux hv_sock app tries to connect to a Service GUID on which no
host app is listening, a recent host (RS3+) sends a
CHANNELMSG_TL_CONNECT_RESULT (23) message to Linux and this triggers such
a warning:

unknown msgtype=23
WARNING: CPU: 2 PID: 0 at drivers/hv/vmbus_drv.c:1031 vmbus_on_msg_dpc

Actually Linux can safely ignore the message because the Linux app's
connect() will time out in 2 seconds: see VSOCK_DEFAULT_CONNECT_TIMEOUT
and vsock_stream_connect(). We don't bother to make use of the message
because: 1) it's only supported on recent hosts; 2) a non-trivial effort
is required to use the message in Linux, but the benefit is small.

So, let's not see the warning by silently ignoring the message.

Signed-off-by: Dexuan Cui
Reviewed-by: Michael Kelley
Signed-off-by: Sasha Levin

Dexuan Cui
2020-08-11 21:33:38 +0800
4bba72b72 drm/drm_fb_helper: fix fbdev with sparc64

[ Upstream commit 2a1658bf922ffd9b7907e270a7d9cdc9643fc45d ]

Recent kernels have been reported to panic using the bochs_drm
framebuffer under qemu-system-sparc64 which was bisected to
commit 7a0483ac4ffc ("drm/bochs: switch to generic drm fbdev emulation").

The backtrace indicates that the shadow framebuffer copy in
drm_fb_helper_dirty_blit_real() is trying to access the real
framebuffer using a virtual address rather than use an IO access
typically implemented using a physical (ASI_PHYS) access on SPARC.

The fix is to replace the memcpy with memcpy_toio() from io.h.

memcpy_toio() uses writeb() where the original fbdev code
used sbus_memcpy_toio(). The latter uses sbus_writeb().

The difference between writeb() and sbus_writeb() is
that writeb() writes bytes in little-endian, whereas sbus_writeb() writes
bytes in big-endian. As endianness does not matter for byte writes, they are
the same. So we can safely use memcpy_toio() here.

Note that this only fixes bochs; in general, fbdev helpers still have
issues with mixing up system memory and __iomem space. Fixing that will
require a lot more work.

v3:
- Improved changelog (Daniel)
- Added FIXME to fbdev_use_iomem (Daniel)

v2:
- Added missing __iomem cast (kernel test robot)
- Made changelog readable and fix typos (Mark)
- Add flag to select iomem - and set it in the bochs driver

Signed-off-by: Sam Ravnborg
Reported-by: Mark Cave-Ayland
Reported-by: kernel test robot
Tested-by: Mark Cave-Ayland
Reviewed-by: Daniel Vetter
Cc: Mark Cave-Ayland
Cc: Thomas Zimmermann
Cc: Gerd Hoffmann
Cc: "David S. Miller"
Cc: sparclinux@vger.kernel.org
Link: https://patchwork.freedesktop.org/patch/msgid/20200709193016.291267-1-sam@ravnborg.org
Link: https://patchwork.freedesktop.org/patch/msgid/20200725191012.GA434957@ravnborg.org
Signed-off-by: Sasha Levin

Sam Ravnborg
2020-08-11 21:33:37 +0800

07 Aug, 2020

5 commits

ca7ace8fd bpf: sockmap: Require attach_bpf_fd when detaching a program

commit bb0de3131f4c60a9bf976681e0fe4d1e55c7a821 upstream.

The sockmap code currently ignores the value of attach_bpf_fd when
detaching a program. This is contrary to the usual behaviour of
checking that attach_bpf_fd represents the currently attached
program.

Ensure that attach_bpf_fd is indeed the currently attached
program. It turns out that all sockmap selftests already do this,
which indicates that this is unlikely to cause breakage.
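
The check described above amounts to "detach only if the caller names the program that is actually attached". A minimal sketch of that idea follows; the types, the function name, and the choice of `-ENOENT` as the error are all assumptions made for illustration, not the sockmap implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an attached program. */
struct example_prog {
    int id;
};

/*
 * Detach the program held in *slot only if it is the one the caller
 * expects. A mismatch leaves the slot untouched and reports an error.
 */
static int example_detach(struct example_prog **slot,
                          const struct example_prog *expected)
{
    assert(slot != NULL);
    if (*slot == NULL || *slot != expected)
        return -ENOENT;
    *slot = NULL;
    return 0;
}
```

A caller that passes a stale or unrelated program gets an error instead of silently removing whatever happens to be attached.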

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Lorenz Bauer
Signed-off-by: Alexei Starovoitov
Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
Signed-off-by: Greg Kroah-Hartman

Lorenz Bauer
2020-08-07 15:34:02 +0800
f06d60ff7 random32: move the pseudo-random 32-bit definitions to prandom.h

commit c0842fbc1b18c7a044e6ff3e8fa78bfa822c7d1a upstream.

The addition of percpu.h to the list of includes in random.h revealed
some circular dependencies on arm64 and possibly other platforms. This
include was added solely for the pseudo-random definitions, which have
nothing to do with the rest of the definitions in this file but are
still there for legacy reasons.

This patch moves the pseudo-random parts to linux/prandom.h and the
percpu.h include with it, which is now guarded by _LINUX_PRANDOM_H and
protected against recursive inclusion.

A further cleanup step would be to remove this from
entirely, and make people who use the prandom infrastructure include
just the new header file. That's a bit of a churn patch, but grepping
for "prandom_" and "next_pseudo_random32" "struct rnd_state" should
catch most users.

But it turns out that that nice cleanup step is fairly painful, because
a _lot_ of code currently seems to depend on the implicit include of
, which can currently come in a lot of ways, including
such fairly core headers as .

So the "nice cleanup" part may or may never happen.

Fixes: 1c9df907da83 ("random: fix circular include dependency on arm64 after addition of percpu.h")
Tested-by: Guenter Roeck
Acked-by: Willy Tarreau
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2020-08-07 15:34:01 +0800
c13100998 random32: remove net_rand_state from the latent entropy gcc plugin

commit 83bdc7275e6206f560d247be856bceba3e1ed8f2 upstream.

It turns out that the plugin right now ends up being really unhappy
about the change from 'static' to 'extern' storage that happened in
commit f227e3ec3b5c ("random32: update the net random state on interrupt
and activity").

This is probably a trivial fix for the latent_entropy plugin, but for
now, just remove net_rand_state from the list of things the plugin
worries about.

Reported-by: Stephen Rothwell
Cc: Emese Revfy
Cc: Kees Cook
Cc: Willy Tarreau
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2020-08-07 15:34:01 +0800
7471f3228 random: fix circular include dependency on arm64 after addition of percpu.h

commit 1c9df907da83812e4f33b59d3d142c864d9da57f upstream.

Daniel Díaz and Kees Cook independently reported that commit
f227e3ec3b5c ("random32: update the net random state on interrupt and
activity") broke arm64 due to a circular dependency on include files
since the addition of percpu.h in random.h.

The correct fix would definitely be to move all the prandom32 stuff out
of random.h but for backporting, a smaller solution is preferred.

This one replaces linux/percpu.h with asm/percpu.h, and this fixes the
problem on x86_64, arm64, arm, and mips. Note that moving percpu.h
around didn't change anything and that removing it entirely broke
differently. When backporting, such options might still be considered
if this patch fails to help.

[ It turns out that an alternate fix seems to be to just remove the
troublesome remove from the arm64
that causes the circular dependency.

But we might as well do the whole belt-and-suspenders thing, and
minimize inclusion in too. Either will fix the
problem, and both are good changes. - Linus ]

Reported-by: Daniel Díaz
Reported-by: Kees Cook
Tested-by: Marc Zyngier
Fixes: f227e3ec3b5c
Cc: Stephen Rothwell
Signed-off-by: Willy Tarreau
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Willy Tarreau
2020-08-07 15:34:01 +0800
c15a77bdd random32: update the net random state on interrupt and activity

commit f227e3ec3b5cad859ad15666874405e8c1bbc1d4 upstream.

This modifies the first 32 bits out of the 128 bits of a random CPU's
net_rand_state on interrupt or CPU activity to complicate remote
observations that could lead to guessing the network RNG's internal
state.

Note that depending on some network devices' interrupt rate moderation
or binding, this re-seeding might happen on every packet or even almost
never.

In addition, with NOHZ some CPUs might not even get timer interrupts,
leaving their local state rarely updated, while they are running
networked processes making use of the random state. For this reason, we
also perform this update in update_process_times() in order to at least
update the state when there is user or system activity, since it's the
only case we care about.

Reported-by: Amit Klein
Suggested-by: Linus Torvalds
Cc: Eric Dumazet
Cc: "Jason A. Donenfeld"
Cc: Andy Lutomirski
Cc: Kees Cook
Cc: Thomas Gleixner
Cc: Peter Zijlstra
Cc:
Signed-off-by: Willy Tarreau
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Willy Tarreau
2020-08-07 15:34:01 +0800

05 Aug, 2020

6 commits

c3883876d rhashtable: Fix unprotected RCU dereference in __rht_ptr

[ Upstream commit 1748f6a2cbc4694523f16da1c892b59861045b9d ]

The rcu_dereference call in rht_ptr_rcu is completely bogus because
we've already dereferenced the value in __rht_ptr and operated on it.
This causes potential double readings which could be fatal. The RCU
dereference must occur prior to the comparison in __rht_ptr.

This patch changes the order of RCU dereference so that it is done
first and the result is then fed to __rht_ptr. The RCU marking
changes have been minimised using casts which will be removed in
a follow-up patch.

Fixes: ba6306e3f648 ("rhashtable: Remove RCU marking from...")
Reported-by: "Gong, Sishuai"
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin

Herbert Xu
2020-08-05 15:59:47 +0800
475cbcef4 net/mlx5e: Modify uplink state on interface up/down

[ Upstream commit 7d0314b11cdd92bca8b89684c06953bf114605fc ]

When setting the PF interface up/down, notify the firmware to update
uplink state via MODIFY_VPORT_STATE, when E-Switch is enabled.

This behavior will prevent sending traffic out on uplink port when PF is
down, such as sending traffic from a VF interface which is still up.
Currently when calling mlx5e_open/close(), the driver only sends PAOS
command to notify the firmware to set the physical port state to
up/down, however, it is not sufficient. When VF is in "auto" state, it
follows the uplink state, which was not updated on mlx5e_open/close()
before this patch.

When switchdev mode is enabled and uplink representor is first enabled,
set the uplink port state value back to its FW default "AUTO".

Fixes: 63bfd399de55 ("net/mlx5e: Send PAOS command on interface up/down")
Signed-off-by: Ron Diskin
Reviewed-by: Roi Dayan
Reviewed-by: Moshe Shemesh
Signed-off-by: Saeed Mahameed
Signed-off-by: Sasha Levin

Ron Diskin
2020-08-05 15:59:47 +0800
731e013e3 xfrm: Fix crash when the hold queue is used.

[ Upstream commit 101dde4207f1daa1fda57d714814a03835dccc3f ]

The commits "xfrm: Move dst->path into struct xfrm_dst"
and "net: Create and use new helper xfrm_dst_child()."
changed xfrm bundle handling under the assumption
that xdst->path and dst->child are not a NULL pointer
only if dst->xfrm is not a NULL pointer. That is true
with one exception. If the xfrm hold queue is used
to wait until a SA is installed by the key manager,
we create a dummy bundle without a valid dst->xfrm
pointer. The current xfrm bundle handling crashes
in that case. Fix this by extending the NULL check
of dst->xfrm with a test of the DST_XFRM_QUEUE flag.

Fixes: 0f6c480f23f4 ("xfrm: Move dst->path into struct xfrm_dst")
Fixes: b92cf4aab8e6 ("net: Create and use new helper xfrm_dst_child().")
Signed-off-by: Steffen Klassert
Signed-off-by: Sasha Levin

Steffen Klassert
2020-08-05 15:59:45 +0800
0307da686 xfrm: policy: match with both mark and mask on user interfaces

[ Upstream commit 4f47e8ab6ab796b5380f74866fa5287aca4dcc58 ]

In commit ed17b8d377ea ("xfrm: fix a warning in xfrm_policy_insert_list"),
it would take 'priority' to make a policy unique, and allow duplicated
policies with different 'priority' to be added, which is not expected
by userland, as Tobias reported in strongswan.

To fix this duplicated policies issue, and also fix the issue in
commit ed17b8d377ea ("xfrm: fix a warning in xfrm_policy_insert_list"),
when doing add/del/get/update on user interfaces, this patch is to change
to look up a policy with both mark and mask by doing:

mark.v == pol->mark.v && mark.m == pol->mark.m

and leave the check:

(mark & pol->mark.m) == pol->mark.v

for tx/rx path only.

As the userland expects an exact mark and mask match to manage policies.

v1->v2:
- make xfrm_policy_mark_match inline and fix the changelog as
Tobias suggested.

Fixes: 295fae568885 ("xfrm: Allow user space manipulation of SPD mark")
Fixes: ed17b8d377ea ("xfrm: fix a warning in xfrm_policy_insert_list")
Reported-by: Tobias Brunner
Tested-by: Tobias Brunner
Signed-off-by: Xin Long
Signed-off-by: Steffen Klassert
Signed-off-by: Sasha Levin

Xin Long
2020-08-05 15:59:44 +0800
b8fa0b037 wireless: Use offsetof instead of custom macro.

commit 6989310f5d4327e8595664954edd40a7f99ddd0d upstream.

Use offsetof to calculate offset of a field to take advantage of
compiler built-in version when possible, and avoid UBSAN warning when
compiling with Clang:

==================================================================
UBSAN: Undefined behaviour in net/wireless/wext-core.c:525:14
member access within null pointer of type 'struct iw_point'
CPU: 3 PID: 165 Comm: kworker/u16:3 Tainted: G S W 4.19.23 #43
Workqueue: cfg80211 __cfg80211_scan_done [cfg80211]
Call trace:
dump_backtrace+0x0/0x194
show_stack+0x20/0x2c
__dump_stack+0x20/0x28
dump_stack+0x70/0x94
ubsan_epilogue+0x14/0x44
ubsan_type_mismatch_common+0xf4/0xfc
__ubsan_handle_type_mismatch_v1+0x34/0x54
wireless_send_event+0x3cc/0x470
___cfg80211_scan_done+0x13c/0x220 [cfg80211]
__cfg80211_scan_done+0x28/0x34 [cfg80211]
process_one_work+0x170/0x35c
worker_thread+0x254/0x380
kthread+0x13c/0x158
ret_from_fork+0x10/0x18
===================================================================
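
The report above is triggered by computing a member's offset through a null pointer. The standard `offsetof` macro from `<stddef.h>` yields the same number without forming such an access. The sketch below uses a hypothetical struct, not the real `struct iw_point`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct with a pointer followed by two short fields. */
struct example_point {
    void *pointer;
    unsigned short length;
    unsigned short flags;
};

/*
 * offsetof is evaluated by the compiler and never dereferences a
 * pointer, so there is nothing for the sanitizer to flag.
 */
static size_t example_length_offset(void)
{
    return offsetof(struct example_point, length);
}
```

Because `length` directly follows the pointer member, its offset equals `sizeof(void *)` on common ABIs.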

Signed-off-by: Pi-Hsun Shih
Reviewed-by: Nick Desaulniers
Link: https://lore.kernel.org/r/20191204081307.138765-1-pihsun@chromium.org
Signed-off-by: Johannes Berg
Signed-off-by: Nick Desaulniers
Signed-off-by: Greg Kroah-Hartman

Pi-Hsun Shih
2020-08-05 15:59:42 +0800
951117a20 IB/rdmavt: Fix RQ counting issues causing use of an invalid RWQE

commit 54a485e9ec084da1a4b32dcf7749c7d760ed8aa5 upstream.

The lookaside count is improperly initialized to the size of the
Receive Queue with the additional +1. In the traces below, the
RQ size is 384, so the count was set to 385.

The lookaside count is then rarely refreshed. Note the high and
incorrect count in the trace below:

rvt_get_rwqe: [hfi1_0] wqe ffffc900078e9008 wr_id 55c7206d75a0 qpn c
qpt 2 pid 3018 num_sge 1 head 1 tail 0, count 385
rvt_get_rwqe: (hfi1_rc_rcv+0x4eb/0x1480 [hfi1]
Cc: # 5.4.x
Reviewed-by: Kaike Wan
Signed-off-by: Mike Marciniszyn
Tested-by: Honggang Li
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman

Mike Marciniszyn
2020-08-05 15:59:42 +0800

01 Aug, 2020

1 commit

182ffc664 tcp: allow at most one TLP probe per flight

[ Upstream commit 76be93fc0702322179bb0ea87295d820ee46ad14 ]

Previously TLP may send multiple probes of new data in one
flight. This happens when the sender is cwnd limited. After the
initial TLP containing new data is sent, the sender receives another
ACK that acks partial inflight. It may re-arm another TLP timer
to send more, if no further ACK returns before the next TLP timeout
(PTO) expires. In theory, the sender may send a large number of TLP
probes until the send queue is depleted. This only happens if the sender
sees such an irregular, uncommon ACK pattern, but it is generally
undesirable behavior, especially during congestion.

The original TLP design allows only one TLP probe per inflight, as
published in "Reducing Web Latency: the Virtue of Gentle Aggression",
SIGCOMM 2013. This patch changes TLP to send at most one probe
per inflight.

Note that if the sender is app-limited, TLP retransmits old data
and does not have this issue.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Yuchung Cheng
2020-08-01 00:39:31 +0800

29 Jul, 2020

7 commits

6d4448ca5 dm integrity: fix integrity recalculation that is improperly skipped

commit 5df96f2b9f58a5d2dc1f30fe7de75e197f2c25f2 upstream.

Commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 ("dm: report suspended
device during destroy") broke integrity recalculation.

The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.

To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().

Signed-off-by: Mikulas Patocka
Fixes: adc0daad366b ("dm: report suspended device during destroy")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mike Snitzer
Signed-off-by: Greg Kroah-Hartman

Mikulas Patocka
2020-07-29 16:18:45 +0800
8f64dc9e1 ASoC: rt5670: Add new gpio1_is_ext_spk_en quirk and enable it on the Lenovo Miix 2 10

commit 85ca6b17e2bb96b19caac3b02c003d670b66de96 upstream.

The Lenovo Miix 2 10 has a keyboard dock with extra speakers in the dock.
Rather than the ACL5672's GPIO1 pin being used as IRQ to the CPU, it is
actually used to enable the amplifier for these speakers
(the IRQ to the CPU comes directly from the jack-detect switch).

Add a quirk for having an ext speaker-amplifier enable pin on GPIO1
and replace the Lenovo Miix 2 10's dmi_system_id table entry's wrong
GPIO_DEV quirk (which needs to be renamed to GPIO1_IS_IRQ) with the
new RT5670_GPIO1_IS_EXT_SPK_EN quirk, so that we enable the external
speaker-amplifier as necessary.

Also update the ident field for the dmi_system_id table entry, as the
Miix models are not Thinkpads.

Fixes: 67e03ff3f32f ("ASoC: codecs: rt5670: add Thinkpad Tablet 10 quirk")
Signed-off-by: Hans de Goede
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1786723
Link: https://lore.kernel.org/r/20200628155231.71089-4-hdegoede@redhat.com
Signed-off-by: Mark Brown
Signed-off-by: Greg Kroah-Hartman

Hans de Goede
2020-07-29 16:18:45 +0800
697bd3e4a x86, vmlinux.lds: Page-align end of ..page_aligned sections

commit de2b41be8fcccb2f5b6c480d35df590476344201 upstream.

On x86-32 the idt_table with 256 entries needs only 2048 bytes. It is
page-aligned, but the end of the .bss..page_aligned section is not
guaranteed to be page-aligned.

As a result, objects from other .bss sections may end up on the same 4k
page as the idt_table, and will accidentally get mapped read-only during
boot, causing unexpected page-faults when the kernel writes to them.

This could be worked around by making the objects in the page aligned
sections page sized, but that's wrong.

Explicit sections which store only page aligned objects have an implicit
guarantee that the object is alone in the page in which it is placed. That
works for all objects except the last one. That's inconsistent.

Enforcing page sized objects for these sections would wreck memory
sanitizers, because the object becomes artificially larger than it should
be and out of bound access becomes legit.

Align the end of the .bss..page_aligned and .data..page_aligned section on
page-size so all objects placed in these sections are guaranteed to have
their own page.
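
The arithmetic behind this can be sketched in plain C. This is a minimal illustration, assuming 4 KiB pages and the 2048-byte idt_table mentioned above; `align_up`, `same_page`, and `DEMO_PAGE_SIZE` are hypothetical names for this sketch, not kernel or linker-script symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* True if both addresses fall on the same page. */
static bool same_page(uintptr_t a, uintptr_t b)
{
    return (a / DEMO_PAGE_SIZE) == (b / DEMO_PAGE_SIZE);
}
```

With a page-aligned 2048-byte object, the first byte after it still lies on the same page, so whatever the linker places there shares the object's page protections. Rounding the section end up with `align_up(end, DEMO_PAGE_SIZE)` moves the next object to a fresh page.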

[ tglx: Amended changelog ]

Signed-off-by: Joerg Roedel
Signed-off-by: Thomas Gleixner
Reviewed-by: Kees Cook
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200721093448.10417-1-joro@8bytes.org
Signed-off-by: Greg Kroah-Hartman

Joerg Roedel
2020-07-29 16:18:45 +0800
615f44e04 io-mapping: indicate mapping failure

commit e0b3e0b1a04367fc15c07f44e78361545b55357c upstream.

The !ATOMIC_IOMAP version of io_mapping_init_wc will always return
success, even when the ioremap fails.

Since the ATOMIC_IOMAP version returns NULL when the init fails, and
callers check for a NULL return on error, this is unexpected.

During a device probe, where the ioremap failed, a crash can look like
this:

BUG: unable to handle page fault for address: 0000000000210000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
Oops: 0002 [#1] PREEMPT SMP
CPU: 0 PID: 177 Comm:
RIP: 0010:fill_page_dma [i915]
gen8_ppgtt_create [i915]
i915_ppgtt_create [i915]
intel_gt_init [i915]
i915_gem_init [i915]
i915_driver_probe [i915]
pci_device_probe
really_probe
driver_probe_device

The remap failure occurred much earlier in the probe. If it had been
propagated, the driver would have exited with an error.

Return NULL on ioremap failure.
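
The difference between the two behaviours can be sketched in userspace C. This is only an illustration of the pattern, not the kernel code: `struct demo_io_mapping`, `demo_ioremap`, and both init functions are hypothetical stand-ins, and the "remap" is simulated with malloc() that fails for a zero size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the mapping metadata; field names are illustrative. */
struct demo_io_mapping {
    unsigned long base;
    unsigned long size;
    void *iomem;
};

/* Hypothetical stand-in for the remap call: fails (returns NULL) for size 0. */
static void *demo_ioremap(unsigned long base, unsigned long size)
{
    (void)base;
    return size ? malloc(size) : NULL;
}

/* Problematic pattern: reports success even when the remap failed. */
static struct demo_io_mapping *
demo_init_unchecked(struct demo_io_mapping *m, unsigned long base, unsigned long size)
{
    assert(m != NULL);
    m->base = base;
    m->size = size;
    m->iomem = demo_ioremap(base, size);
    return m;
}

/* Fixed pattern: propagate the remap failure as a NULL return. */
static struct demo_io_mapping *
demo_init_checked(struct demo_io_mapping *m, unsigned long base, unsigned long size)
{
    void *iomem;

    assert(m != NULL);
    iomem = demo_ioremap(base, size);
    if (!iomem)
        return NULL;

    m->base = base;
    m->size = size;
    m->iomem = iomem;
    return m;
}
```

A caller that only checks the return value for NULL sees the unchecked variant succeed and later dereferences a NULL `iomem`, which matches the shape of the crash above: the fault happens far from the failed remap.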

[akpm@linux-foundation.org: detect ioremap_wc() errors earlier]

Fixes: cafaf14a5d8f ("io-mapping: Always create a struct to hold metadata about the io-mapping")
Signed-off-by: Michael J. Ruhl
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Cc: Mike Rapoport
Cc: Andy Shevchenko
Cc: Chris Wilson
Cc: Daniel Vetter
Cc:
Link: http://lkml.kernel.org/r/20200721171936.81563-1-michael.j.ruhl@intel.com
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Michael J. Ruhl
2020-07-29 16:18:44 +0800
0821295b2 asm-generic/mmiowb: Allow mmiowb_set_pending() when preemptible()

[ Upstream commit bd024e82e4cd95c7f1a475a55f99871936c2b2db ]

Although mmiowb() is concerned only with serialising MMIO writes occurring
in contexts where a spinlock is held, the call to mmiowb_set_pending()
from the MMIO write accessors can occur in preemptible contexts, such
as during driver probe() functions where ordering between CPUs is not
usually a concern, assuming that the task migration path provides the
necessary ordering guarantees.

Unfortunately, the default implementation of mmiowb_set_pending() is not
preempt-safe, as it makes use of a per-cpu variable to track its
internal state. This has been reported to generate the following splat
on riscv:

| BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
| caller is regmap_mmio_write32le+0x1c/0x46
| CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.8.0-rc3-hfu+ #1
| Call Trace:
| walk_stackframe+0x0/0x7a
| dump_stack+0x6e/0x88
| regmap_mmio_write32le+0x18/0x46
| check_preemption_disabled+0xa4/0xaa
| regmap_mmio_write32le+0x18/0x46
| regmap_mmio_write+0x26/0x44
| regmap_write+0x28/0x48
| sifive_gpio_probe+0xc0/0x1da

Although it's possible to fix the driver in this case, other splats have
been seen from other drivers, including the infamous 8250 UART, and so
it's better to address this problem in the mmiowb core itself.

Fix mmiowb_set_pending() by using the raw_cpu_ptr() to get at the mmiowb
state and then only updating the 'mmiowb_pending' field if we are not
preemptible (i.e. we have a non-zero nesting count).
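
The resulting logic can be modelled in a few lines of userspace C. This is a simplified single-state sketch, assuming the nesting count is raised while a spinlock is held; the `demo_` names are hypothetical, and the per-cpu lookup via raw_cpu_ptr() is replaced by an explicit pointer argument.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the per-cpu mmiowb state described above. */
struct demo_mmiowb_state {
    uint16_t nesting_count;   /* number of spinlocks currently held */
    uint16_t mmiowb_pending;  /* non-zero if an MMIO write happened under a lock */
};

/* Taking a lock raises the nesting count. */
static void demo_spin_lock(struct demo_mmiowb_state *ms)
{
    ms->nesting_count++;
}

/* Releasing a lock clears the pending marker and lowers the nesting count. */
static void demo_spin_unlock(struct demo_mmiowb_state *ms)
{
    assert(ms->nesting_count > 0);
    ms->mmiowb_pending = 0;  /* a real implementation would issue the barrier here */
    ms->nesting_count--;
}

/* Called from MMIO write accessors: only record state when a lock is held. */
static void demo_mmiowb_set_pending(struct demo_mmiowb_state *ms)
{
    if (ms->nesting_count)
        ms->mmiowb_pending = ms->nesting_count;
}
```

In this model an MMIO write from preemptible context (nesting count of zero) leaves the state untouched, so nothing depends on which CPU's state was looked up.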

Cc: Arnd Bergmann
Cc: Paul Walmsley
Cc: Guo Ren
Cc: Michael Ellerman
Reported-by: Palmer Dabbelt
Reported-by: Emil Renner Berthing
Tested-by: Emil Renner Berthing
Reviewed-by: Palmer Dabbelt
Acked-by: Palmer Dabbelt
Link: https://lore.kernel.org/r/20200716112816.7356-1-will@kernel.org
Signed-off-by: Will Deacon
Signed-off-by: Sasha Levin

Will Deacon
2020-07-29 16:18:40 +0800
80fed4024 Input: add `SW_MACHINE_COVER`

[ Upstream commit c463bb2a8f8d7d97aa414bf7714fc77e9d3b10df ]

This event code represents the state of a removable cover of a device.
Value 0 means that the cover is open or removed, value 1 means that the
cover is closed.

Reviewed-by: Sebastian Reichel
Acked-by: Tony Lindgren
Signed-off-by: Merlijn Wajer
Link: https://lore.kernel.org/r/20200612125402.18393-2-merlijn@wizzup.org
Signed-off-by: Dmitry Torokhov
Signed-off-by: Sasha Levin

Merlijn Wajer
2020-07-29 16:18:36 +0800
722c6e954 dmabuf: use spinlock to access dmabuf->name

[ Upstream commit 6348dd291e3653534a9e28e6917569bc9967b35b ]

There exists a sleep-while-atomic bug while accessing the dmabuf->name
under mutex in the dmabuffs_dname(). This is caused by the SELinux
permission checks on a process where it tries to validate the inherited
files from fork() by traversing them through iterate_fd() (which
traverses files under spin_lock) and calling
match_file() (security/selinux/hooks.c), where the permission checks happen.
This audit information is logged using dump_common_audit_data(), where it
calls d_path() to get the file path name. If the file check happens on
the dmabuf's fd, then it ends up in ->dmabuffs_dname() and uses a mutex to
access dmabuf->name. The flow will be like below:
flush_unauthorized_files()
  iterate_fd()
    spin_lock() --> Start of the atomic section.
      match_file()
        file_has_perm()
          avc_has_perm()
            avc_audit()
              slow_avc_audit()
                common_lsm_audit()
                  dump_common_audit_data()
                    audit_log_d_path()
                      d_path()
                        dmabuffs_dname()
                          mutex_lock()--> Sleep while atomic.

Call trace captured (on 4.19 kernels) is below:
___might_sleep+0x204/0x208
__might_sleep+0x50/0x88
__mutex_lock_common+0x5c/0x1068
__mutex_lock_common+0x5c/0x1068
mutex_lock_nested+0x40/0x50
dmabuffs_dname+0xa0/0x170
d_path+0x84/0x290
audit_log_d_path+0x74/0x130
common_lsm_audit+0x334/0x6e8
slow_avc_audit+0xb8/0xf8
avc_has_perm+0x154/0x218
file_has_perm+0x70/0x180
match_file+0x60/0x78
iterate_fd+0x128/0x168
selinux_bprm_committing_creds+0x178/0x248
security_bprm_committing_creds+0x30/0x48
install_exec_creds+0x1c/0x68
load_elf_binary+0x3a4/0x14e0
search_binary_handler+0xb0/0x1e0

So, use spinlock to access dmabuf->name to avoid sleep-while-atomic.

Cc: [5.3+]
Signed-off-by: Charan Teja Kalla
Reviewed-by: Michael J. Ruhl
Acked-by: Christian König
[sumits: added comment to spinlock_t definition to avoid warning]
Signed-off-by: Sumit Semwal
Link: https://patchwork.freedesktop.org/patch/msgid/a83e7f0d-4e54-9848-4b58-e1acdbe06735@codeaurora.org
Signed-off-by: Sasha Levin

Charan Teja Kalla
2020-07-29 16:18:29 +0800

22 Jul, 2020

9 commits

1245a1e0e rxrpc: Fix trace string

commit aadf9dcef9d4cd68c73a4ab934f93319c4becc47 upstream.

The trace symbol printer (__print_symbolic()) ignores symbols that map to
an empty string and prints the hex value instead.

Fix the symbol for rxrpc_cong_no_change to " -" instead of "" to avoid
this.
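
Why an empty string falls through to the hex value can be sketched as follows. This is a userspace approximation of a symbolic printer, assuming it treats "nothing was written" as "no symbol matched"; `demo_print_symbolic` and `struct demo_sym` are hypothetical names, and the table contents in the usage note are illustrative.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One value-to-name entry; a NULL name terminates the table. */
struct demo_sym {
    unsigned long value;
    const char *name;
};

/*
 * Write the name for value into buf. If nothing was written, either
 * because no entry matched or because the matching name is empty,
 * fall back to printing the value in hex.
 */
static void demo_print_symbolic(char *buf, size_t len, unsigned long value,
                                const struct demo_sym *syms)
{
    const struct demo_sym *s;

    assert(len > 0);
    buf[0] = '\0';
    for (s = syms; s->name; s++) {
        if (s->value == value) {
            snprintf(buf, len, "%s", s->name);
            break;
        }
    }
    if (buf[0] == '\0')
        snprintf(buf, len, "0x%lx", value);
}
```

With a table entry of `{0, ""}` the output for value 0 is the hex fallback, while `{0, " -"}` prints the intended placeholder.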

Fixes: b54a134a7de4 ("rxrpc: Fix handling of enums-to-string translation in tracing")
Signed-off-by: David Howells
Signed-off-by: Greg Kroah-Hartman

David Howells
2020-07-22 15:33:17 +0800
97f1aecb8 Input: elan_i2c - add more hardware ID for Lenovo laptops

commit a50ca29523b18baea548bdf5df9b4b923c2bb4f6 upstream.

This adds more hardware IDs for Elan touchpads found in various Lenovo
laptops.

Signed-off-by: Dave Wang
Link: https://lore.kernel.org/r/000201d5a8bd$9fead3f0$dfc07bd0$@emc.com.tw
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov
Signed-off-by: Greg Kroah-Hartman

Dave Wang
2020-07-22 15:33:13 +0800
78d85ca83 virt: vbox: Fix VBGL_IOCTL_VMMDEV_REQUEST_BIG and _LOG req numbers to match upstream

commit f794db6841e5480208f0c3a3ac1df445a96b079e upstream.

Until this commit the mainline kernel version (this version) of the
vboxguest module contained a bug where it defined
VBGL_IOCTL_VMMDEV_REQUEST_BIG and VBGL_IOCTL_LOG using
_IOC(_IOC_READ | _IOC_WRITE, 'V', ...) instead of
_IO(V, ...) as the out of tree VirtualBox upstream version does.

Since the VirtualBox userspace bits are always built against VirtualBox
upstream's headers, this means that so far the mainline kernel version
of the vboxguest module has been failing these 2 ioctls with -ENOTTY.
I guess that VBGL_IOCTL_VMMDEV_REQUEST_BIG is never used, causing us to
not hit that one, and so far the vboxguest driver has failed to actually
log any log messages passed to it through VBGL_IOCTL_LOG.

This commit changes the VBGL_IOCTL_VMMDEV_REQUEST_BIG and VBGL_IOCTL_LOG
defines to match the out of tree VirtualBox upstream vboxguest version,
while keeping compatibility with the old wrong request defines so as
to not break the kernel ABI in case someone has been using the old
request defines.
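
The mismatch comes from the direction bits in the encoded request number. The sketch below assumes the common asm-generic ioctl layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction); the `demo_` helpers are hypothetical re-implementations for illustration, and the request number 3 used in the usage note is an assumption, not taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed direction values, mirroring the asm-generic ioctl encoding. */
#define DEMO_IOC_NONE  0u
#define DEMO_IOC_WRITE 1u
#define DEMO_IOC_READ  2u

/* Pack direction, size, type and number into one 32-bit request value. */
static uint32_t demo_ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    assert(dir < 4 && type < 256 && nr < 256 && size < (1u << 14));
    return (dir << 30) | (size << 16) | (type << 8) | nr;
}

/* Equivalent of an _IO()-style request: no direction, no size. */
static uint32_t demo_io(uint32_t type, uint32_t nr)
{
    return demo_ioc(DEMO_IOC_NONE, type, nr, 0);
}

/* Compatibility check: accept both the upstream and the old encoding. */
static int demo_request_matches(uint32_t req, uint32_t type, uint32_t nr)
{
    return req == demo_io(type, nr) ||
           req == demo_ioc(DEMO_IOC_READ | DEMO_IOC_WRITE, type, nr, 0);
}
```

Under these assumptions `demo_io('V', 3)` is 0x5603 while the read/write variant is 0xC0005603, so a driver comparing against only one of the two never matches userspace built against the other. Accepting both, as `demo_request_matches` does, keeps old callers working.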

Fixes: f6ddd094f579 ("virt: Add vboxguest driver for Virtual Box Guest integration UAPI")
Cc: stable@vger.kernel.org
Acked-by: Arnd Bergmann
Reviewed-by: Arnd Bergmann
Signed-off-by: Hans de Goede
Link: https://lore.kernel.org/r/20200709120858.63928-2-hdegoede@redhat.com
Signed-off-by: Greg Kroah-Hartman

Hans de Goede
2020-07-22 15:33:11 +0800
e7e98dd42 bus: ti-sysc: Handle module unlock quirk needed for some RTC

[ Upstream commit e8639e1c986a8a9d0f94549170f6db579376c3ae ]

The RTC modules on am3 and am4 need quirk handling to unlock and lock
them for reset so let's add the quirk handling based on what we already
have for legacy platform data. In later patches we will simply drop the
RTC related platform data and the old quirk handling.

Signed-off-by: Tony Lindgren
Signed-off-by: Sasha Levin

Tony Lindgren
2020-07-22 15:32:58 +0800
c3adbd37c blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flags

[ Upstream commit bfe373f608cf81b7626dfeb904001b0e867c5110 ]

Else there may be magic numbers in /sys/kernel/debug/block/*/state.

Signed-off-by: Hou Tao
Reviewed-by: Bart Van Assche
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin

Hou Tao
2020-07-22 15:32:52 +0800
38b122c0a cgroup: Fix sock_cgroup_data on big-endian.

[ Upstream commit 14b032b8f8fce03a546dcf365454bec8c4a58d7d ]

In order for no_refcnt and is_data to be the lowest order two
bits in the 'val' we have to pad out the bitfield of the u8.

Fixes: ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()")
Reported-by: Guenter Roeck
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Cong Wang
2020-07-22 15:32:50 +0800
94886c86e cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

[ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ]

When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
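
The idea of recording this bit inside the pointer-sized value can be sketched in userspace C. This is a minimal illustration, assuming the allocator returns pointers aligned to at least 4 bytes so the two low bits are free; all `demo_` names and `DEMO_FLAG_` constants are hypothetical, and the reference count is a plain integer rather than the kernel's cgroup refcounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* The two low bits of the value are used as flags (assumes >= 4-byte alignment). */
#define DEMO_FLAG_IS_DATA   ((uintptr_t)1)
#define DEMO_FLAG_NO_REFCNT ((uintptr_t)2)
#define DEMO_FLAG_MASK      ((uintptr_t)3)

struct demo_cgroup {
    int refcnt;
};

/* Store the pointer, optionally marking that no reference was taken for it. */
static uintptr_t demo_pack(struct demo_cgroup *cg, bool no_refcnt)
{
    uintptr_t val = (uintptr_t)cg;

    assert((val & DEMO_FLAG_MASK) == 0);  /* alignment leaves the flag bits clear */
    return no_refcnt ? (val | DEMO_FLAG_NO_REFCNT) : val;
}

/* Recover the pointer by masking off the flag bits. */
static struct demo_cgroup *demo_ptr(uintptr_t val)
{
    return (struct demo_cgroup *)(val & ~DEMO_FLAG_MASK);
}

static bool demo_no_refcnt(uintptr_t val)
{
    return (val & DEMO_FLAG_NO_REFCNT) != 0;
}

/* Cloning copies the value and takes a reference only if the original holds one. */
static uintptr_t demo_clone(uintptr_t val)
{
    if (!demo_no_refcnt(val))
        demo_ptr(val)->refcnt++;
    return val;
}
```

Because the flag travels with the value, the clone and free paths can agree on whether a reference exists without consulting a global that may have changed in between.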

This bug seems to be introduced since the beginning, commit
d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not completely. It seems not easy to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas
Reported-by: Peter Geis
Reported-by: Lu Fengqi
Reported-by: Daniël Sonck
Reported-by: Zhang Qiang
Tested-by: Cameron Berkenpas
Tested-by: Peter Geis
Tested-by: Thomas Lamprecht
Cc: Daniel Borkmann
Cc: Zefan Li
Cc: Tejun Heo
Cc: Roman Gushchin
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Cong Wang
2020-07-22 15:32:49 +0800
30d015f5e vlan: consolidate VLAN parsing code and limit max parsing depth

[ Upstream commit 469aceddfa3ed16e17ee30533fae45e90f62efd8 ]

Toshiaki pointed out that we now have two very similar functions to extract
the L3 protocol number in the presence of VLAN tags. And Daniel pointed out
that the unbounded parsing loop makes it possible for maliciously crafted
packets to loop through potentially hundreds of tags.

Fix both of these issues by consolidating the two parsing functions and
limiting the VLAN tag parsing to a max depth of 8 tags. As part of this,
switch over __vlan_get_protocol() to use skb_header_pointer() instead of
pskb_may_pull(), to avoid the possible side effects of the latter and keep
the skb pointer 'const' through all the parsing functions.
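
A bounded tag walk of this kind can be sketched on a raw frame buffer. This is a simplified userspace illustration, not the kernel helper: the `demo_` names are hypothetical, the parser only reads from a const buffer, and the exact boundary (this sketch accepts up to 8 tags and rejects a ninth) is an assumption about how the limit is counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_P_8021Q    0x8100u
#define DEMO_ETH_P_8021AD   0x88A8u
#define DEMO_ETH_HLEN       14u
#define DEMO_VLAN_HLEN      4u
#define DEMO_VLAN_MAX_DEPTH 8u

static int demo_is_vlan(uint16_t type)
{
    return type == DEMO_ETH_P_8021Q || type == DEMO_ETH_P_8021AD;
}

static uint16_t demo_read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return the L3 EtherType, or 0 if the frame is truncated or too deeply tagged. */
static uint16_t demo_l3_proto(const uint8_t *frame, size_t len)
{
    size_t off = DEMO_ETH_HLEN - 2;  /* offset of the outer EtherType */
    unsigned int depth = 0;
    uint16_t type;

    if (len < DEMO_ETH_HLEN)
        return 0;
    type = demo_read_be16(frame + off);
    while (demo_is_vlan(type)) {
        if (++depth > DEMO_VLAN_MAX_DEPTH)
            return 0;                /* refuse to walk an unbounded tag stack */
        off += DEMO_VLAN_HLEN;
        if (off + 2 > len)
            return 0;                /* encapsulated EtherType is out of bounds */
        type = demo_read_be16(frame + off);
    }
    return type;
}

/* Test helper: build a zeroed frame with ntags 802.1Q tags and the given L3 type. */
static size_t demo_build_frame(uint8_t *buf, size_t cap, unsigned int ntags, uint16_t l3)
{
    size_t need = DEMO_ETH_HLEN + (size_t)ntags * DEMO_VLAN_HLEN;
    size_t off = DEMO_ETH_HLEN - 2;
    unsigned int i;

    assert(cap >= need);
    memset(buf, 0, need);
    for (i = 0; i < ntags; i++, off += DEMO_VLAN_HLEN) {
        buf[off] = 0x81;
        buf[off + 1] = 0x00;
    }
    buf[off] = (uint8_t)(l3 >> 8);
    buf[off + 1] = (uint8_t)(l3 & 0xff);
    return need;
}
```

The parser never modifies the buffer, which mirrors the motivation for keeping the packet pointer const, and the depth counter caps the work per packet regardless of how many tags a sender stacks.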

v2:
- Use limit of 8 tags instead of 32 (matching XMIT_RECURSION_LIMIT)

Reported-by: Toshiaki Makita
Reported-by: Daniel Borkmann
Fixes: d7bf2ebebc2b ("sched: consistently handle layer3 header accesses in the presence of VLANs")
Signed-off-by: Toke Høiland-Jørgensen
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Toke Høiland-Jørgensen
2020-07-22 15:32:49 +0800
9b7fd81cf sched: consistently handle layer3 header accesses in the presence of VLANs

[ Upstream commit d7bf2ebebc2bd61ab95e2a8e33541ef282f303d4 ]

There are a couple of places in net/sched/ that check skb->protocol and act
on the value there. However, in the presence of VLAN tags, the value stored
in skb->protocol can be inconsistent based on whether VLAN acceleration is
enabled. The commit quoted in the Fixes tag below fixed the users of
skb->protocol to use a helper that will always see the VLAN ethertype.

However, most of the callers don't actually handle the VLAN ethertype, but
expect to find the IP header type in the protocol field. This means that
things like changing the ECN field, or parsing diffserv values, stop
working if there's a VLAN tag, or if there are multiple nested VLAN
tags (QinQ).

To fix this, change the helper to take an argument that indicates whether
the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
make sure to skip all of them, so behaviour is consistent even in QinQ
mode.
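
The shape of such a helper, and of a caller that wants the IP header type, can be sketched like this. It is a simplified model rather than the kernel code: a packet is reduced to its list of EtherTypes from outermost to innermost, and `struct demo_pkt`, `demo_skb_protocol`, and `demo_has_ecn_field` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_P_IP     0x0800u
#define DEMO_ETH_P_IPV6   0x86DDu
#define DEMO_ETH_P_8021Q  0x8100u
#define DEMO_ETH_P_8021AD 0x88A8u

/* Simplified packet model: EtherTypes listed from outermost to innermost. */
struct demo_pkt {
    const uint16_t *types;
    size_t ntypes;
};

static bool demo_is_vlan(uint16_t type)
{
    return type == DEMO_ETH_P_8021Q || type == DEMO_ETH_P_8021AD;
}

/* Return the outer type, or the first non-VLAN type when skip_vlan is set. */
static uint16_t demo_skb_protocol(const struct demo_pkt *pkt, bool skip_vlan)
{
    size_t i = 0;

    assert(pkt->ntypes > 0);
    if (skip_vlan) {
        /* Skip every VLAN tag, so QinQ behaves the same as a single tag. */
        while (i + 1 < pkt->ntypes && demo_is_vlan(pkt->types[i]))
            i++;
    }
    return pkt->types[i];
}

/* Example caller: ECN handling only makes sense for IPv4 and IPv6 payloads. */
static bool demo_has_ecn_field(const struct demo_pkt *pkt)
{
    switch (demo_skb_protocol(pkt, true)) {
    case DEMO_ETH_P_IP:
    case DEMO_ETH_P_IPV6:
        return true;
    default:
        return false;
    }
}
```

A caller that passes `skip_vlan = false` still sees the VLAN ethertype, so code that genuinely handles VLAN frames keeps working, while callers like the ECN check get the encapsulated protocol.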

To make the helper usable from the ECN code, move it to if_vlan.h instead
of pkt_sched.h.

v3:
- Remove empty lines
- Move vlan variable definitions inside loop in skb_protocol()
- Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
bpf_skb_ecn_set_ce()

v2:
- Use eth_type_vlan() helper in skb_protocol()
- Also fix code that reads skb->protocol directly
- Change a couple of 'if/else if' statements to switch constructs to avoid
calling the helper twice

Reported-by: Ilya Ponetayev
Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
Signed-off-by: Toke Høiland-Jørgensen
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Toke Høiland-Jørgensen
2020-07-22 15:32:48 +0800