30 Nov, 2017

40 commits

  • commit 67f2519fe2903c4041c0e94394d14d372fe51399 upstream.

    guard_bio_eod() needs to look at the partition capacity, not just the
    capacity of the whole device, when determining if truncation is
    necessary.

    [ 60.268688] attempt to access beyond end of device
    [ 60.268690] unknown-block(9,1): rw=0, want=67103509, limit=67103506
    [ 60.268693] buffer_io_error: 2 callbacks suppressed
    [ 60.268696] Buffer I/O error on dev md1p7, logical block 4524305, async page read

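    A minimal sketch of the partition-aware limit (the helper name is
    illustrative; this shows the idea, not the literal upstream diff):

        static sector_t bio_max_sector(struct bio *bio)
        {
                sector_t maxsector = get_capacity(bio->bi_disk);
                struct hd_struct *part;

                if (bio->bi_partno) {
                        /* Truncate against the partition the bio
                         * targets, not the whole disk. */
                        part = disk_get_part(bio->bi_disk, bio->bi_partno);
                        if (part) {
                                maxsector = part_nr_sects_read(part);
                                disk_put_part(part);
                        }
                }
                return maxsector;
        }
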
    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Greg Edwards
     
  • commit 91af8300d9c1d7c6b6a2fd754109e08d4798b8d8 upstream.

    In bcache code, sysfs entries are created before all resources get
    allocated, e.g. allocation thread of a cache set.

    There is a possibility of a NULL pointer dereference if a resource
    is accessed before it is initialized. Indeed, Jorg Bornschein caught
    one on the cache set allocation thread and got a kernel oops.

    The reason for this bug is that when bch_bucket_alloc() is called
    during cache set registration and attaching, ca->alloc_thread is not
    yet allocated and initialized, so calling wake_up_process() on
    ca->alloc_thread triggers a NULL pointer dereference. A simple and
    fast fix is to check, before waking up ca->alloc_thread, whether it
    is allocated, and only wake it up when it is not NULL.

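    A sketch of the check described above (surrounding code elided):

        /* Only wake the allocator once it actually exists. */
        if (ca->alloc_thread)
                wake_up_process(ca->alloc_thread);
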
    Signed-off-by: Coly Li
    Reported-by: Jorg Bornschein
    Cc: Kent Overstreet
    Reviewed-by: Michael Lyle
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Coly Li
     
  • commit b11270853fa3654f08d4a6a03b23ddb220512d8d upstream.

    The WARN_ON(!key->len) in set_secret() in net/ceph/crypto.c is hit if a
    user tries to add a key of type "ceph" with an invalid payload as
    follows (assuming CONFIG_CEPH_LIB=y):

    echo -e -n '\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
    | keyctl padd ceph desc @s

    This can be hit by fuzzers. As this is merely bad input and not a
    kernel bug, replace the WARN_ON() with return -EINVAL.

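    A sketch of the change in set_secret() (context elided):

        /* A zero-length key is bad userspace input, not a kernel bug,
         * so fail gracefully instead of warning. */
        if (!key->len)
                return -EINVAL;
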
    Fixes: 7af3ea189a9a ("libceph: stop allocating a new cipher on every crypto request")
    Signed-off-by: Eric Biggers
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit db86be3a12d0b6e5c5b51c2ab2a48f06329cb590 upstream.

    We're freeing the list iterator so we should be using the _safe()
    version of hlist_for_each_entry().

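    A sketch of the pattern, assuming the daemon-hash walk in
    ecryptfs_release_messaging() (identifiers from the eCryptfs
    messaging code):

        struct ecryptfs_daemon *daemon;
        struct hlist_node *n;

        /* The _safe variant caches the next node up front, so the
         * loop body may free the current entry. */
        hlist_for_each_entry_safe(daemon, n, &ecryptfs_daemon_hash[i],
                                  euid_chain)
                ecryptfs_exorcise_daemon(daemon);
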
    Fixes: 88b4a07e6610 ("[PATCH] eCryptfs: Public key transport mechanism")
    Signed-off-by: Dan Carpenter
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit a0b3bc855374c50b5ea85273553485af48caf2f7 upstream.

    fscrypt_initialize(), which allocates the global bounce page pool when
    an encrypted file is first accessed, uses "double-checked locking" to
    try to avoid locking fscrypt_init_mutex. However, it doesn't use any
    memory barriers, so it's theoretically possible for a thread to observe
    a bounce page pool which has not been fully initialized. This is a
    classic bug with "double-checked locking".

    While "only a theoretical issue" in the latest kernel, in pre-4.8
    kernels the pointer that was checked was not even the last to be
    initialized, so it was easily possible for a crash (NULL pointer
    dereference) to happen. This was changed only incidentally by the large
    refactor to use fs/crypto/.

    Solve both problems in a trivial way that can easily be backported: just
    always take the mutex. It's theoretically less efficient, but it
    shouldn't be noticeable in practice as the mutex is only acquired very
    briefly once per encrypted file.

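    A sketch of the simplified locking (the signature and the
    pool-allocation helper are simplified/made up for brevity):

        int fscrypt_initialize(void)
        {
                int err = 0;

                /* No lockless fast path: the mutex is cheap because
                 * we only get here once per encrypted file. */
                mutex_lock(&fscrypt_init_mutex);
                if (!fscrypt_bounce_page_pool)
                        err = fscrypt_alloc_bounce_page_pool();
                mutex_unlock(&fscrypt_init_mutex);
                return err;
        }
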
    Later I'd like to make this use a helper macro like DO_ONCE(). However,
    DO_ONCE() runs in atomic context, so we'd need to add a new macro that
    allows blocking.

    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 31ccb1f7ba3cfe29631587d451cf5bb8ab593550 upstream.

    There is a race condition between nilfs_dirty_inode() and
    nilfs_set_file_dirty().

    When a file is opened, nilfs_dirty_inode() is called to update the
    access timestamp in the inode. It calls __nilfs_mark_inode_dirty() in a
    separate transaction. __nilfs_mark_inode_dirty() caches the ifile
    buffer_head in the i_bh field of the inode info structure and marks it
    as dirty.

    After some data was written to the file in another transaction, the
    function nilfs_set_file_dirty() is called, which adds the inode to the
    ns_dirty_files list.

    Then the segment construction calls nilfs_segctor_collect_dirty_files(),
    which goes through the ns_dirty_files list and checks the i_bh field.
    If there is a cached buffer_head in i_bh it is not marked as dirty
    again.

    Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
    transactions, it is possible that a segment construction that writes out
    the ifile occurs in-between the two. If this happens the inode is not
    on the ns_dirty_files list, but its ifile block is still marked as dirty
    and written out.

    In the next segment construction, the data for the file is written out
    and nilfs_bmap_propagate() updates the b-tree. Eventually the bmap root
    is written into the i_bh block, which is not dirty, because it was
    written out in another segment construction.

    As a result the bmap update can be lost, which leads to file system
    corruption. Either the virtual block address points to an unallocated
    DAT block, or the DAT entry will be reused for something different.

    The error can remain undetected for a long time. A typical error
    message would be one of the "bad btree" errors or a warning that a DAT
    entry could not be found.

    This bug can be reproduced reliably by a simple benchmark that creates
    and overwrites millions of 4k files.

    Link: http://lkml.kernel.org/r/1509367935-3086-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
    Signed-off-by: Andreas Rohner
    Signed-off-by: Ryusuke Konishi
    Tested-by: Andreas Rohner
    Tested-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andreas Rohner
     
  • commit ecc0c469f27765ed1e2b967be0aa17cee1a60b76 upstream.

    Currently if the autofs kernel module gets an error when writing to
    the pipe which links to the daemon, then it marks the whole
    mountpoint as catatonic, and it will stop working.

    It is possible that the error is transient. This can happen if the
    daemon is slow and more than 16 requests queue up. If a subsequent
    process tries to queue a request, and is then signalled, the write to
    the pipe will return -ERESTARTSYS and autofs will take that as total
    failure.

    So change the code to treat -ERESTARTSYS and -ENOMEM as transient
    failures which only abort the current request, not the whole
    mountpoint.

    It isn't a crash or a data corruption, but having autofs mountpoints
    suddenly stop working is rather inconvenient.

    Ian said:

    : And given the problems with a half dozen (or so) user space applications
    : consuming large amounts of CPU under heavy mount and umount activity this
    : could happen more easily than we expect.

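    A sketch of the new handling around the pipe write (names as in
    fs/autofs4/waitq.c):

        switch (ret = autofs4_write(sbi, pipe, &pkt, pktsz)) {
        case 0:
                break;
        case -ENOMEM:
        case -ERESTARTSYS:
                /* Transient: fail just this request. */
                autofs4_wait_release(sbi, wq->wait_queue_token, ret);
                break;
        default:
                /* Anything else still poisons the mountpoint. */
                autofs4_catatonic_mode(sbi);
                break;
        }
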
    Link: http://lkml.kernel.org/r/87y3norvgp.fsf@notabene.neil.brown.name
    Signed-off-by: NeilBrown
    Acked-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 5d03a6613957785e94af7a4a6212ad4af66aa5c2 upstream.

    There is a race in the current z3fold implementation between
    do_compact() called in a work queue context and the page release
    procedure when the page's kref goes to 0.

    do_compact() may be waiting for the page lock, which is released by
    release_z3fold_page_locked() right before putting the page onto the
    "stale" list, and then the page may be freed while do_compact()
    modifies its contents.

    The mechanism currently implemented to handle that (checking the
    PAGE_STALE flag) is not reliable enough. Instead, we'll use page's kref
    counter to guarantee that the page is not released if its compaction is
    scheduled. It then becomes compaction function's responsibility to
    decrease the counter and quit immediately if the page was actually
    freed.

    Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com
    Signed-off-by: Vitaly Wool
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     
  • commit bfa62a52cad93686bb8d8171ea5288813248a7c6 upstream.

    The ENOENT USB error means "specified interface or endpoint does
    not exist or is not enabled". Mark the device as not present when we
    encounter this error, similar to what we already do for the ENODEV
    error.

    Otherwise we can end up in an infinite loop in
    rt2x00usb_work_rxdone(), because we keep removing RX entries from
    the queue and putting them back, forever.

    We could hit a similar situation if URB submission kept failing with
    some other error, so we should consider limiting the number of
    entries processed by the rxdone work. But for now, since this patch
    fixes a reproducible soft lockup on single-processor systems and
    matches the meaning of the ENOENT error, let's apply it.

    The patch adds the additional ENOENT check not only in the RX kick
    routine, but also in the other places where we check for the ENODEV
    error.

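    The shape of the check, sketched under the assumption that the
    device-present flag is DEVICE_STATE_PRESENT from rt2x00.h (the
    patch's exact helper may differ):

        if (urb->status == -ENODEV || urb->status == -ENOENT)
                clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags);
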
    Reported-by: Richard Genoud
    Debugged-by: Richard Genoud
    Signed-off-by: Stanislaw Gruszka
    Tested-by: Richard Genoud
    Signed-off-by: Kalle Valo
    Signed-off-by: Greg Kroah-Hartman

    Stanislaw Gruszka
     
  • commit 409fcace9963c1e8d2cb0f7ac62e8b34d47ef979 upstream.

    Fix the final phase of <CLASS|MADDF|MSUBF|MAX|MAXA|MIN|MINA>.<D|S>
    emulation. Provide proper generation of the SIGFPE signal and
    updating of the debugfs FP exception stats in cases where any
    exception flags were set in preceding phases of emulation.

    The CLASS.<D|S> instruction may generate the "Unimplemented
    Operation" FP exception. <MADDF|MSUBF>.<D|S> instructions may
    generate the "Inexact", "Unimplemented Operation", "Invalid
    Operation", "Overflow", and "Underflow" FP exceptions.
    <MAX|MAXA|MIN|MINA>.<D|S> instructions can generate the
    "Unimplemented Operation" and "Invalid Operation" FP exceptions.

    The proper final processing of the cases when any FP exception flag
    is set is achieved by replacing the "break" statement with a "goto
    copcsr" statement. With this solution, the patch brings the final
    phase of emulation of the above instructions in line with the
    previously implemented emulation of other related FPU instructions
    (ADD, SUB, etc.).

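    The pattern, sketched for one affected case (copcsr is the
    emulator's common exit that updates the FCSR, bumps the debugfs
    stats and raises SIGFPE):

        rv.d = ieee754dp_fmax(fs, ft);
        /* was: break;  -- any IEEE exception flags raised above
         * were silently dropped */
        goto copcsr;
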
    Fixes: 38db37ba069f ("MIPS: math-emu: Add support for the MIPS R6 CLASS FPU instruction")
    Fixes: e24c3bec3e8e ("MIPS: math-emu: Add support for the MIPS R6 MADDF FPU instruction")
    Fixes: 83d43305a1df ("MIPS: math-emu: Add support for the MIPS R6 MSUBF FPU instruction")
    Fixes: a79f5f9ba508 ("MIPS: math-emu: Add support for the MIPS R6 MAX{, A} FPU instruction")
    Fixes: 4e9561b20e2f ("MIPS: math-emu: Add support for the MIPS R6 MIN{, A} FPU instruction")
    Signed-off-by: Aleksandar Markovic
    Cc: Ralf Baechle
    Cc: Douglas Leung
    Cc: Goran Ferenc
    Cc: "Maciej W. Rozycki"
    Cc: Miodrag Dinic
    Cc: Paul Burton
    Cc: Petar Jovanovic
    Cc: Raghu Gandham
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17581/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Aleksandar Markovic
     
  • commit 56a46acf62af5ba44fca2f3f1c7c25a2d5385b19 upstream.

    The WLAN LED on the Linksys WRT54GSv1 is active low, but the software
    treats it as active high. Fix the inverted logic.

    Fixes: 7bb26b169116 ("MIPS: BCM47xx: Fix LEDs on WRT54GS V1.0")
    Signed-off-by: Mirko Parthey
    Looks-ok-by: Rafał Miłecki
    Cc: Hauke Mehrtens
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16071/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Mirko Parthey
     
  • commit 547da673173de51f73887377eb275304775064ad upstream.

    Fix a regression introduced by commit 7aeb753b5353 ("MIPS: Implement
    task_user_regset_view.") and activated by commit 6a9c001b7ec3
    ("MIPS: Switch ELF core dumper to use regsets."), which caused n32
    processes to dump o32 core files by failing to set the EF_MIPS_ABI2
    flag in the ELF core file header's `e_flags' member:

    $ file tls-core
    tls-core: ELF 32-bit MSB executable, MIPS, N32 MIPS64 rel2 version 1 (SYSV), [...]
    $ ./tls-core
    Aborted (core dumped)
    $ file core
    core: ELF 32-bit MSB core file MIPS, MIPS-I version 1 (SYSV), SVR4-style
    $

    Previously the flag was set as the result of a:

    #define ELF_CORE_EFLAGS EF_MIPS_ABI2

    statement placed in arch/mips/kernel/binfmt_elfn32.c, however in the
    regset case, i.e. when CORE_DUMP_USE_REGSET is set, ELF_CORE_EFLAGS is
    no longer used by `fill_note_info' in fs/binfmt_elf.c, and instead the
    `->e_flags' member of the regset view chosen is. We have the views
    defined in arch/mips/kernel/ptrace.c, however only an o32 and an n64
    one, and the latter is used for n32 as well. Consequently an o32 core
    file is incorrectly dumped from n32 processes (the ELF32 vs ELF64 class
    is chosen elsewhere, and the 32-bit one is correctly selected for n32).

    Correct the issue then by defining an n32 regset view and using it as
    appropriate. Issue discovered in GDB testing.

    Fixes: 7aeb753b5353 ("MIPS: Implement task_user_regset_view.")
    Signed-off-by: Maciej W. Rozycki
    Cc: Ralf Baechle
    Cc: Djordje Todorovic
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17617/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Maciej W. Rozycki
     
  • commit 3cad14d56adbf7d621fc5a35db42f3acc0a2d6e8 upstream.

    arch/mips/boot/dts/brcm/bcm96358nb4ser.dts does not exist, so
    we cannot build bcm96358nb4ser.dtb.

    Signed-off-by: Masahiro Yamada
    Fixes: 695835511f96 ("MIPS: BMIPS: rename bcm96358nb4ser to bcm6358-neufbox4-sercom")
    Acked-by: James Hogan
    Signed-off-by: Rob Herring
    Signed-off-by: Greg Kroah-Hartman

    Masahiro Yamada
     
  • commit 22b8ba765a726d90e9830ff6134c32b04f12c10f upstream.

    32-bit kernels can be configured to support MIPS64, in which case
    neither CONFIG_64BIT nor CONFIG_CPU_MIPS32_R* will be set. This causes
    the CP0_Status.FR checks at the point of floating point register save
    and restore to be compiled out, which results in odd FP registers not
    being saved or restored to the task or signal context even when
    CP0_Status.FR is set.

    Fix the ifdefs to use CONFIG_CPU_MIPSR2 and CONFIG_CPU_MIPSR6, which are
    enabled for the relevant revisions of either MIPS32 or MIPS64, along
    with some other CPUs such as Octeon (r2), Loongson1 (r2), XLP (r2),
    Loongson 3A R2.

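    The shape of the guard change (a before/after sketch, as described
    above):

        /* before: compiled out on 32-bit kernels built for MIPS64 */
        #if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
            defined(CONFIG_CPU_MIPS32_R6)

        /* after: true for any r2/r6 CPU, whether MIPS32 or MIPS64 */
        #if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_MIPSR6)
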
    The suspect code originates from commit 597ce1723e0f ("MIPS: Support for
    64-bit FP with O32 binaries") in v3.14, however the code in
    __enable_fpu() was consistent and refused to set FR=1, falling back to
    software FPU emulation. This was suboptimal but should be functionally
    correct.

    Commit fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6
    CPU") in v4.2 (and stable tagged back to 4.0) later introduced the bug
    by updating __enable_fpu() to set FR=1 but failing to update the other
    similar ifdefs to enable FR=1 state handling.

    Fixes: fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6 CPU")
    Signed-off-by: James Hogan
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16739/
    Signed-off-by: Greg Kroah-Hartman

    James Hogan
     
  • commit c7fd89a6407ea3a44a2a2fa12d290162c42499c4 upstream.

    Building 32-bit MIPS64r2 kernels produces warnings like the following
    on certain toolchains (such as GNU assembler 2.24.90, but not GNU
    assembler 2.28.51) since commit 22b8ba765a72 ("MIPS: Fix MIPS64 FP
    save/restore on 32-bit kernels"), due to the exposure of fpu_save_16odd
    from fpu_save_double and fpu_restore_16odd from fpu_restore_double:

    arch/mips/kernel/r4k_fpu.S:47: Warning: float register should be even, was 1
    ...
    arch/mips/kernel/r4k_fpu.S:59: Warning: float register should be even, was 1
    ...

    This appears to be because .set mips64r2 does not change the FPU ABI to
    64-bit when -march=mips64r2 (or e.g. -march=xlp) is provided on the
    command line on that toolchain, from the default FPU ABI of 32-bit due
    to the -mabi=32. This makes access to the odd FPU registers invalid.

    Fix by explicitly changing the FPU ABI with .set fp=64 directives in
    fpu_save_16odd and fpu_restore_16odd, and moving the undefine of fp up
    in asmmacro.h so fp doesn't turn into $30.

    Fixes: 22b8ba765a72 ("MIPS: Fix MIPS64 FP save/restore on 32-bit kernels")
    Signed-off-by: James Hogan
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17656/
    Signed-off-by: Greg Kroah-Hartman

    James Hogan
     
  • commit 8a74d29d541cd86569139c6f3f44b2d210458071 upstream.

    A DM device with a mix of discard capabilities (due to some underlying
    devices not having discard support) _should_ just return -EOPNOTSUPP for
    the region of the device that doesn't support discards (even if only by
    way of the underlying driver formally not supporting discards). BUT,
    that does ask the underlying driver to handle something that it never
    advertised support for. In doing so we're exposing users to the
    potential for a underlying disk driver hanging if/when a discard is
    issued a the device that is incapable and never claimed to support
    discards.

    Fix this by requiring that each DM target in a DM table provide discard
    support as a prereq for a DM device to advertise support for discards.

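    A sketch of the resulting rule (names as in drivers/md/dm-table.c;
    the per-device iteration is elided):

        static bool dm_table_supports_discards(struct dm_table *t)
        {
                unsigned int i;

                for (i = 0; i < dm_table_get_num_targets(t); i++) {
                        struct dm_target *ti = dm_table_get_target(t, i);

                        /* Every target must opt in... */
                        if (!ti->num_discard_bios)
                                return false;
                        /* ...and its underlying devices must actually
                         * support discards too. */
                }
                return true;
        }
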
    This may cause some configurations that were happily supporting discards
    (even in the face of a mix of discard support) to stop supporting
    discards -- but the risk of users hitting driver hangs, and forced
    reboots, outweighs supporting those fringe mixed discard
    configurations.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit b9a41d21dceadf8104812626ef85dc56ee8a60ed upstream.

    The following BUG_ON was hit when testing repeat creation and removal of
    DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
    [] dm_get_from_kobject+0x34/0x3a
    [] dm_attr_show+0x2b/0x5e
    [] ? mutex_lock+0x26/0x44
    [] sysfs_kf_seq_show+0x83/0xcf
    [] kernfs_seq_show+0x23/0x25
    [] seq_read+0x16f/0x325
    [] kernfs_fop_read+0x3a/0x13f
    [] __vfs_read+0x26/0x9d
    [] ? security_file_permission+0x3c/0x44
    [] ? rw_verify_area+0x83/0xd9
    [] vfs_read+0x8f/0xcf
    [] ? __fdget_pos+0x12/0x41
    [] SyS_read+0x4b/0x76
    [] system_call_fastpath+0x12/0x71

    The bug can be easily triggered if an extra delay (e.g. 10ms) is added
    between the test of DMF_FREEING & DMF_DELETING and dm_get() in
    dm_get_from_kobject().

    To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
    dm_get() are done in an atomic way, so _minor_lock is used.

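    A sketch of the fixed function, close to the upstream shape:

        struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
        {
                struct mapped_device *md;

                md = container_of(kobj, struct mapped_device,
                                  kobj_holder.kobj);

                spin_lock(&_minor_lock);
                if (test_bit(DMF_FREEING, &md->flags) ||
                    test_bit(DMF_DELETING, &md->flags)) {
                        md = NULL;
                        goto out;
                }
                dm_get(md);
        out:
                spin_unlock(&_minor_lock);
                return md;
        }
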
    The other callers of dm_get() have also been checked to be OK: some
    callers invoke dm_get() under _minor_lock, some callers invoke it
    under _hash_lock, and dm_start_request() invokes it after increasing
    md->open_count.

    Signed-off-by: Hou Tao
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     
  • commit 8593b18ad348733b5d5ddfa0c79dcabf51dff308 upstream.

    Switch the printk() call to the preferred pr_warn() API.

    Fixes: 7e5873d3755c ("MIPS: pci: Add MT7620a PCIE driver")
    Signed-off-by: John Crispin
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/15321/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    John Crispin
     
  • commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run
    queue to send its own IPI. As all CPUs with overloaded tasks must be
    scanned regardless of whether one or many CPUs are lowering their
    priority, because there's no current way to find the CPU with the
    highest priority task that can schedule to one of these CPUs, there
    really only needs to be one IPI being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock;
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next
    rto_cpu        - the CPU to send the next IPI to, also protected by
                     the rto_lock

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks
    that are queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

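    Those helpers boil down to a tiny acquire/release pair over
    rto_loop_start (a sketch):

        static inline bool rto_start_trylock(atomic_t *v)
        {
                /* Only the CPU that flips 0 -> 1 starts the IPI loop. */
                return !atomic_cmpxchg_acquire(v, 0, 1);
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                atomic_set_release(v, 0);
        }
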
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 856eb0916d181da6d043cc33e03f54d5c5bbe54a upstream.

    The structure srcu_struct can be very big; its size is proportional
    to the value of CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS
    8192, so the field io_barrier in struct mapped_device occupies 84kB
    in the debugging kernel and 50kB in the non-debugging kernel. The
    large size may result in failure of the function kzalloc_node.

    In order to avoid the allocation failure, we use the function
    kvzalloc_node, this function falls back to vmalloc if a large contiguous
    chunk of memory is not available. This patch also moves the field
    io_barrier to the last position of struct mapped_device - the reason is
    that on many processor architectures, short memory offsets result in
    smaller code than long memory offsets - on x86-64 it reduces code size by
    320 bytes.

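    A sketch of the allocation change (as in drivers/md/dm.c's device
    constructor; error handling elided):

        /* kvzalloc_node() tries kmalloc first and falls back to
         * vmalloc, so a ~50-85kB mapped_device no longer fails when
         * physically contiguous memory is fragmented. */
        md = kvzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id);
        if (!md)
                return NULL;

        /* ...and the matching free path uses kvfree(md). */
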
    Note to stable kernel maintainers - the kernels 4.11 and older don't have
    the function kvzalloc_node, you can use the function vzalloc_node instead.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 5455f92b54e516995a9ca45bbf790d3629c27a93 upstream.

    If ovl_check_origin() fails, we should put upperdentry. We have a reference
    on it by now. So goto out_put_upper instead of out.

    Fixes: a9d019573e88 ("ovl: lookup non-dir copy-up-origin by file handle")
    Signed-off-by: Vivek Goyal
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Vivek Goyal
     
  • commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.

    The default max_cache_size_bytes for dm-bufio is meant to be the lesser
    of 25% of the size of the vmalloc area and 2% of the size of lowmem.
    However, on 32-bit systems the intermediate result in the expression

    (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100

    overflows, causing the wrong result to be computed. For example, on a
    32-bit system where the vmalloc area is 520093696 bytes, the result is
    1174405 rather than the expected 130023424, which makes the maximum
    cache size much too small (far less than 2% of lowmem). This causes
    severe performance problems for dm-verity users on affected systems.

    Fix this by using mult_frac() to correctly multiply by a percentage. Do
    this for all places in dm-bufio that multiply by a percentage. Also
    replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
    to the comment is now defined in include/linux/vmalloc.h.

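    The 32-bit arithmetic from the example above, worked out
    (mult_frac() divides before multiplying, per include/linux/kernel.h):

        /* naive: 520093696 * 25 = 13002342400 wraps mod 2^32 to
         *        117440512; / 100 = 1174405          (far too small)
         *
         * mult_frac(520093696, 25, 100)
         *   = (520093696 / 100) * 25 + ((520093696 % 100) * 25) / 100
         *   = 5200936 * 25 + (96 * 25) / 100
         *   = 130023400 + 24 = 130023424             (correct)
         */
        mem = mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100);
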
    Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
    Fixes: 95d402f057f2 ("dm: add bufio")
    Signed-off-by: Eric Biggers
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 9dc112e2daf87b40607fd8d357e2d7de32290d45 upstream.

    It is very normal to see allocation failure, especially with blk-mq
    request_queues, so it's unnecessary to report this error and annoy
    people.

    In practice this 'blk_get_request() returned -11' error gets logged
    quite frequently when a blk-mq DM multipath device sees heavy IO.

    This change is marked for stable@ because the annoying message in
    question was included in stable@ commit 7083abbbf.

    Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop")
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 114e025968b5990ad0b57bf60697ea64ee206aac upstream.

    The SCSI layer allows ZBC drives to have a smaller last runt zone. For
    such a device, specifying the entire capacity for a dm-zoned target
    table entry fails because the specified capacity is not aligned on a
    device zone size indicated in the request queue structure of the
    device.

    Fix this problem by ignoring the last runt zone in the entry length
    when setting up the dm-zoned target (ctr method) and when iterating
    table entries of the target (iterate_devices method). This allows
    dm-zoned users to still easily set up a target using the entire
    device capacity (as mandated by dm-zoned) or the aligned capacity
    excluding the last runt zone.

    While at it, replace direct references to the device queue chunk_sectors
    limit with calls to the accessor blk_queue_zone_sectors().

    Reported-by: Peter Desnoyers
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 0440d5c0ca9744b92a07aeb6df0a9a75db6f4280 upstream.

    When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
    this unaligned memory for its buffers (if an unaligned buffer crosses a
    page, XFS frees it and allocates a full page instead - see the function
    xfs_buf_allocate_memory).

    dm-crypt checks if bv_offset is aligned on page size and these checks
    fail with slub_debug and XFS.

    Fix this bug by removing the bv_offset checks. Switch to checking if
    bv_len is aligned instead of bv_offset (this check should be sufficient
    to prevent overruns if a bio with too small bv_len is received).

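    A sketch of the changed check (shape only; cc->sector_size is the
    target's configured encryption sector size):

        /* before: rejected the unaligned offsets slub_debug produces */
        if (unlikely(bv_in.bv_offset & (cc->sector_size - 1)))
                return -EIO;

        /* after: only the length must be sector-aligned */
        if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
                return -EIO;
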
    Fixes: 8f0009a22517 ("dm crypt: optionally support larger encryption sector size")
    Reported-by: Bruno Prémont
    Tested-by: Bruno Prémont
    Signed-off-by: Mikulas Patocka
    Reviewed-by: Milan Broz
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit d1260e2a3f85f4c1010510a15f89597001318b1b upstream.

    When a DM cache in writeback mode moves data between the slow and fast
    device it can often avoid a copy if the triggering bio either:

    i) covers the whole block (no point copying if we're about to overwrite it)
    ii) the migration is a promotion and the origin block is currently discarded

    Prior to this fix there was a race with case (ii). The discard status
    was checked with a shared lock held (rather than exclusive). This meant
    another bio could run in parallel and write data to the origin, removing
    the discard state. After the promotion the parallel write would have
    been lost.

    With this fix the discard status is re-checked once the exclusive
    lock has been acquired. If the block is no longer discarded it falls
    back to the slower full copy path.

    Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     
  • commit 95b1369a9638cfa322ad1c0cde8efbe524059884 upstream.

    When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
    this unaligned memory for its buffers (if an unaligned buffer crosses a
    page, XFS frees it and allocates a full page instead - see the function
    xfs_buf_allocate_memory).

    dm-integrity checks if bv_offset is aligned on page size and this
    check fails with slub_debug and XFS.

    Fix this bug by removing the bv_offset check, leaving only the check for
    bv_len.

    Fixes: 7eada909bfd7 ("dm: add integrity target")
    Reported-by: Bruno Prémont
    Reviewed-by: Milan Broz
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 9ceace3c9c18c67676e75141032a65a8e01f9a7a upstream.

    This commit adds the PCI ID for the Raven platform.

    Signed-off-by: Vijendar Mukunda
    Signed-off-by: Takashi Iwai
    Signed-off-by: Greg Kroah-Hartman

    Vijendar Mukunda
     
  • commit f2ddaf8dfd4a5071ad09074d2f95ab85d35c8a1e upstream.

    Extend the Cavium ThunderX ACS quirk to cover more device IDs and restrict
    it to only Root Ports.

    Signed-off-by: Vadim Lomovtsev
    [bhelgaas: changelog, stable tag]
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Greg Kroah-Hartman

    Vadim Lomovtsev
     
  • commit 7f342678634f16795892677204366e835e450dda upstream.

    The Cavium ThunderX (CN8XXX) family of PCIe Root Ports does not advertise
    an ACS capability. However, the RTL internally implements similar
    protection as if ACS had Request Redirection, Completion Redirection,
    Source Validation, and Upstream Forwarding features enabled.

    Change Cavium ACS capabilities quirk flags accordingly.

    Fixes: b404bcfbf035 ("PCI: Add ACS quirk for all Cavium devices")
    Signed-off-by: Vadim Lomovtsev
    [bhelgaas: tidy changelog, comment, stable tag]
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Greg Kroah-Hartman

    Vadim Lomovtsev
     
  • commit 79aa801e899417a56863d6713f76c4e108856000 upstream.

    The effective_affinity_mask is always set when an interrupt is assigned in
    __assign_irq_vector() -> apic->cpu_mask_to_apicid(), e.g. for struct apic
    apic_physflat: -> default_cpu_mask_to_apicid() ->
    irq_data_update_effective_affinity(), but it looks like d->common->affinity
    remains all-1's before user space or the kernel changes it later.

    In the early allocation/initialization phase of an IRQ, we should use the
    effective_affinity_mask, otherwise Hyper-V may not deliver the interrupt to
    the expected CPU. Without the patch, if we assign 7 Mellanox ConnectX-3
    VFs to a 32-vCPU VM, one of the VFs may fail to receive interrupts.

    Tested-by: Adrian Suhov
    Signed-off-by: Dexuan Cui
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Jake Oshins
    Cc: Jork Loeser
    Cc: Stephen Hemminger
    Cc: K. Y. Srinivasan
    Signed-off-by: Greg Kroah-Hartman

    Dexuan Cui
     
  • commit c00054f540bf81e592e1fee709b0bdbf20f478b5 upstream.

    Previously we programmed the LTR_L1.2_THRESHOLD in the parent (upstream)
    device using the capability pointer of the *child* (downstream) device,
    which corrupted some random word of the parent's config space.

    Use the parent's L1 SS capability pointer to program its
    LTR_L1.2_THRESHOLD.

    Fixes: aeda9adebab8 ("PCI/ASPM: Configure L1 substate settings")
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Vidya Sagar
    CC: Rajat Jain
    Signed-off-by: Greg Kroah-Hartman

    Bjorn Helgaas
     
  • commit 94ac327e043ee40d7fc57b54541da50507ef4e99 upstream.

    Every Port that supports the L1.2 substate advertises its Port
    Common_Mode_Restore_Time, i.e., the time the Port requires to re-establish
    common mode when exiting L1.2 (see PCIe r3.1, sec 7.33.2).

    Per sec 5.5.3.3.1, when exiting L1.2, the Downstream Port (the device at
    the upstream end of the link) must send TS1 training sequences for at least
    T(COMMONMODE) after it detects electrical idle exit on the Link. We want
    this to be long enough for both ends of the Link, so we should set it to
    the maximum of the Port Common_Mode_Restore_Time for the upstream and
    downstream components on the Link.

    Previously we only looked at the Port Common_Mode_Restore_Time of the
    upstream device, so if the downstream device required more time, we didn't
    program the upstream device's T(COMMONMODE) correctly.

    Fixes: f1f0366dd6be ("PCI/ASPM: Calculate and save the L1.2 timing parameters")
    Signed-off-by: Bjorn Helgaas
    Reviewed-by: Vidya Sagar
    Acked-by: Rajat Jain
    Signed-off-by: Greg Kroah-Hartman

    Bjorn Helgaas
     
  • commit 7978db344719dab1e56d05e6fc04aaaddcde0a5e upstream.

    The for_each_available_child_of_node() loop in _of_add_opp_table_v2()
    doesn't drop the reference to "np" on errors. Fix that.

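    A sketch of the fix inside the loop (helper names approximate):

        for_each_available_child_of_node(opp_np, np) {
                ret = _opp_add_static_v2(opp_table, dev, np);
                if (ret) {
                        dev_err(dev, "%s: Failed to add OPP, %d\n",
                                __func__, ret);
                        /* The iterator only drops the reference when
                         * it advances; on early exit, put it here. */
                        of_node_put(np);
                        goto free_table;
                }
        }
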
    Fixes: 274659029c9d (PM / OPP: Add support to parse "operating-points-v2" bindings)
    Signed-off-by: Tobias Jordan
    [ VK: Improved commit log. ]
    Signed-off-by: Viresh Kumar
    Reviewed-by: Stephen Boyd
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Tobias Jordan
     
  • commit 6a468d5990ecd1c2d07dd85f8633bbdd0ba61c40 upstream.

    We can end up sleeping for a while waiting for the dead timeout,
    which means we could get the per-request timer to fire. We did
    handle this case, but if the dead timeout happened right after we
    submitted, we'd either tear down the connection or possibly requeue
    as we're handling an error, and race with the endio, which can lead
    to panics and other hilarity.

    Fixes: 560bc4b39952 ("nbd: handle dead connections")
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit ff57dc94faec023abc267cdc45766fccff497557 upstream.

    If we have a pending signal or the user kills their application then
    it'll bring down the whole device, which is less than awesome. Instead
    wait uninterruptible for the dead timeout so we're sure we gave it our
    best shot.

    Fixes: 560bc4b39952 ("nbd: handle dead connections")
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 0d63785c6b94b5d2f095f90755825f90eea791f5 upstream.

    The mvneta controller provides an 8-bit register to update the
    pending Tx descriptor counter, so a maximum of 255 Tx descriptors
    can be added at once. In the current code the mvneta_txq_pend_desc_add
    function assumes the caller takes care of this limit, but that is
    not the case. In some situations (xmit_more flag), more than 255
    descriptors are added.
    When this happens, the Tx descriptor counter register is updated with a
    wrong value, which breaks the whole Tx queue management.

    This patch fixes the issue by allowing the mvneta_txq_pend_desc_add
    function to process more than 255 Tx descriptors.

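    A sketch of the reworked function (the xmit_more pending
    bookkeeping is elided; register names as in mvneta.c):

        static void mvneta_txq_pend_desc_add(struct mvneta_port *pp,
                                             struct mvneta_tx_queue *txq,
                                             int pend_desc)
        {
                u32 val;

                /* The register only takes 8 bits, so feed it the
                 * descriptors in chunks of at most 255. */
                do {
                        val = min(pend_desc, 255);
                        mvreg_write(pp, MVNETA_TXQ_UPDATE_REG(txq->id),
                                    val);
                        pend_desc -= val;
                } while (pend_desc > 0);
        }
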
    Fixes: 2a90f7e1d5d0 ("net: mvneta: add xmit_more support")
    Signed-off-by: Simon Guinot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Simon Guinot
     
  • commit 05a67cc258e75ac9758e6f13d26337b8be51162a upstream.

    There is a typo inside the pinmux setup code. The function is called
    refclk and not reclk.

    Fixes: 53263a1c6852 ("MIPS: ralink: add mt7628an support")
    Signed-off-by: Mathias Kresin
    Acked-by: John Crispin
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16047/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Mathias Kresin
     
  • commit 8ef4b43cd3794d63052d85898e42424fd3b14d24 upstream.

    According to the datasheet the REFCLK pin is shared with GPIO#37 and
    the PERST pin is shared with GPIO#36.

    Fixes: 53263a1c6852 ("MIPS: ralink: add mt7628an support")
    Signed-off-by: Mathias Kresin
    Acked-by: John Crispin
    Cc: Ralf Baechle
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/16046/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Mathias Kresin
     
  • commit a3f143106596d739e7fbc4b84c96b1475247d876 upstream.

    __cmpxchg64_local_generic() is atomic only with respect to tasks
    and interrupts on the same CPU (that's what the 'local' means). We
    can't use it to implement cmpxchg64() in SMP configurations.

    So, for 32-bit SMP configurations:

    - Don't define cmpxchg64()
    - Don't enable HAVE_VIRT_CPU_ACCOUNTING_GEN, which requires it

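    The shape of the resulting guard in asm/cmpxchg.h (a sketch):

        #ifdef CONFIG_64BIT
        #define cmpxchg64(ptr, o, n)    cmpxchg((ptr), (o), (n))
        #elif !defined(CONFIG_SMP)
        /* Only atomic against this CPU - fine without SMP. */
        #define cmpxchg64(ptr, o, n)    cmpxchg64_local((ptr), (o), (n))
        #endif
        /* 32-bit SMP: no cmpxchg64() is provided at all. */
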
    Fixes: e2093c7b03c1 ("MIPS: Fall back to generic implementation of ...")
    Fixes: bb877e96bea1 ("MIPS: Add support for full dynticks CPU time accounting")
    Signed-off-by: Ben Hutchings
    Cc: Ralf Baechle
    Cc: Deng-Cheng Zhu
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/17413/
    Signed-off-by: James Hogan
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings