Eric Lee / smarc-fsl-linux-kernel

22 Mar, 2020

1 commit

763802b53 x86/mm: split vmalloc_sync_all() ... Browse Code »

Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path. While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.

Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.

To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:

* vmalloc_sync_mappings(), and
* vmalloc_sync_unmappings()

Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized. The only exception is the new call-site added in the
above mentioned commit.

Shile Zhang directed us to a report of an 80% regression in reaim
throughput.

Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot
Reported-by: Shile Zhang
Signed-off-by: Joerg Roedel
Signed-off-by: Andrew Morton
Tested-by: Borislav Petkov
Acked-by: Rafael J. Wysocki [GHES]
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc:
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds

Joerg Roedel
2020-03-22 09:56:06 +0800

16 Mar, 2020

2 commits

34d5a4b33 Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull futex fix from Thomas Gleixner:
"Fix for yet another subtle futex issue.

The futex code used ihold() to prevent inodes from vanishing, but
ihold() does not guarantee inode persistence. Replace the inode
pointer with a per boot, machine wide, unique inode identifier.

The second commit fixes the breakage of the hash mechanism which
causes a 100% performance regression"

* tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Unbreak futex hashing
futex: Fix inode life-time issue

Linus Torvalds
2020-03-16 03:55:52 +0800
ffe6da91b Merge tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull timer fix from Thomas Gleixner:
"A single fix adding the missing time namespace adjustment in
sys/sysinfo which caused sys/sysinfo to be inconsistent with
/proc/uptime when read from a task inside a time namespace"

* tag 'timers-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sys/sysinfo: Respect boottime inside time namespace

Linus Torvalds
2020-03-16 03:48:21 +0800

13 Mar, 2020

1 commit

1b51f6946 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Pull networking fixes from David Miller:
"It looks like a decent sized set of fixes, but a lot of these are one
liner off-by-one and similar type changes:

1) Fix netlink header pointer to calcular bad attribute offset
reported to user. From Pablo Neira Ayuso.

2) Don't double clear PHY interrupts when ->did_interrupt is set,
from Heiner Kallweit.

3) Add missing validation of various (devlink, nl802154, fib, etc.)
attributes, from Jakub Kicinski.

4) Missing *pos increments in various netfilter seq_next ops, from
Vasily Averin.

5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

7) Work around FMAN erratum A050385, from Madalin Bucur.

8) Make sure ARP header is pulled early enough in bonding driver,
from Eric Dumazet.

9) Do a cond_resched() during multicast processing of ipvlan and
macvlan, from Mahesh Bandewar.

10) Don't attach cgroups to unrelated sockets when in interrupt
context, from Shakeel Butt.

11) Fix tpacket ring state management when encountering unknown GSO
types. From Willem de Bruijn.

12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
only in the suspend context. From Heiner Kallweit"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
net: systemport: fix index check to avoid an array out of bounds access
tc-testing: add ETS scheduler to tdc build configuration
net: phy: fix MDIO bus PM PHY resuming
net: hns3: clear port base VLAN when unload PF
net: hns3: fix RMW issue for VLAN filter switch
net: hns3: fix VF VLAN table entries inconsistent issue
net: hns3: fix "tc qdisc del" failed issue
taprio: Fix sending packets without dequeueing them
net: mvmdio: avoid error message for optional IRQ
net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
net: memcg: fix lockdep splat in inet_csk_accept()
s390/qeth: implement smarter resizing of the RX buffer pool
s390/qeth: refactor buffer pool code
s390/qeth: use page pointers to manage RX buffer pool
seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
net/packet: tpacket_rcv: do not increment ring index on drop
sxgbe: Fix off by one in samsung driver strncpy size arg
net: caif: Add lockdep expression to RCU traversal primitive
MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
...

Linus Torvalds
2020-03-13 07:19:19 +0800

12 Mar, 2020

2 commits

addcb1d0e Merge tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux ... Browse Code »

Pull thread fix from Christian Brauner:
"This contains a single fix for a regression which was introduced when
we introduced the ability to select a specific pid at process creation
time.

When this feature is requested, the error value will be set to -EPERM
after exiting the pid allocation loop. This caused EPERM to be
returned when e.g. the init process/child subreaper of the pid
namespace has already died where we used to return ENOMEM before.

The first patch here simply fixes the regression by unconditionally
setting the return value back to ENOMEM again once we've successfully
allocated the requested pid number. This should be easy to backport to
v5.5.

The second patch adds a comment explaining that we must keep returning
ENOMEM since we've been doing it for a long time and have explicitly
documented this behavior for userspace. This seemed worthwhile because
we now have at least two separate example where people tried to change
the return value to something other than ENOMEM (The first version of
the regression fix did that too and the commit message links to an
earlier patch that tried to do the same.).

I have a simple regression test to make sure we catch this regression
in the future but since that introduces a whole new selftest subdir
and test files I'll keep this for v5.7"

* tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
pid: make ENOMEM return value more obvious
pid: Fix error return value in some cases

Linus Torvalds
2020-03-12 01:00:41 +0800
36feb9963 Merge tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull ftrace fix from Steven Rostedt:
"Have ftrace lookup_rec() return a consistent record otherwise it can
break live patching"

* tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Return the first found result in lookup_rec()

Linus Torvalds
2020-03-12 00:54:59 +0800

11 Mar, 2020

4 commits

d9815bff6 ftrace: Return the first found result in lookup_rec() ... Browse Code »

It appears that ip ranges can overlap so. In that case lookup_rec()
returns whatever results it got last even if it found nothing in last
searched page.

This breaks an obscure livepatch late module patching usecase:
- load livepatch
- load the patched module
- unload livepatch
- try to load livepatch again

To fix this return from lookup_rec() as soon as it found the record
containing searched-for ip. This used to be this way prior lookup_rec()
introduction.

Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com

Cc: stable@vger.kernel.org
Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()")
Signed-off-by: Artem Savkov
Signed-off-by: Steven Rostedt (VMware)

Artem Savkov
2020-03-11 22:37:12 +0800
e876ecc67 cgroup: memcg: net: do not associate sock with unrelated cgroup ... Browse Code »

We are testing network memory accounting in our setup and noticed
inconsistent network memory usage and often unrelated cgroups network
usage correlates with testing workload. On further inspection, it
seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
irq context specially for cgroup v1.

mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
and kind of assumes that this can only happen from sk_clone_lock()
and the source sock object has already associated cgroup. However in
cgroup v1, where network memory accounting is opt-in, the source sock
can be unassociated with any cgroup and the new cloned sock can get
associated with unrelated interrupted cgroup.

Cgroup v2 can also suffer if the source sock object was created by
process in the root cgroup or if sk_alloc() is called in irq context.
The fix is to just do nothing in interrupt.

WARNING: Please note that about half of the TCP sockets are allocated
from the IRQ context, so, memory used by such sockets will not be
accouted by the memcg.

The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
Hardware name: ...
Call Trace:

dump_stack+0x57/0x75
mem_cgroup_sk_alloc+0xe9/0xf0
sk_clone_lock+0x2a7/0x420
inet_csk_clone_lock+0x1b/0x110
tcp_create_openreq_child+0x23/0x3b0
tcp_v6_syn_recv_sock+0x88/0x730
tcp_check_req+0x429/0x560
tcp_v6_rcv+0x72d/0xa40
ip6_protocol_deliver_rcu+0xc9/0x400
ip6_input+0x44/0xd0
? ip6_protocol_deliver_rcu+0x400/0x400
ip6_rcv_finish+0x71/0x80
ipv6_rcv+0x5b/0xe0
? ip6_sublist_rcv+0x2e0/0x2e0
process_backlog+0x108/0x1e0
net_rx_action+0x26b/0x460
__do_softirq+0x104/0x2a6
do_softirq_own_stack+0x2a/0x40

do_softirq.part.19+0x40/0x50
__local_bh_enable_ip+0x51/0x60
ip6_finish_output2+0x23d/0x520
? ip6table_mangle_hook+0x55/0x160
__ip6_finish_output+0xa1/0x100
ip6_finish_output+0x30/0xd0
ip6_output+0x73/0x120
? __ip6_finish_output+0x100/0x100
ip6_xmit+0x2e3/0x600
? ipv6_anycast_cleanup+0x50/0x50
? inet6_csk_route_socket+0x136/0x1e0
? skb_free_head+0x1e/0x30
inet6_csk_xmit+0x95/0xf0
__tcp_transmit_skb+0x5b4/0xb20
__tcp_send_ack.part.60+0xa3/0x110
tcp_send_ack+0x1d/0x20
tcp_rcv_state_process+0xe64/0xe80
? tcp_v6_connect+0x5d1/0x5f0
tcp_v6_do_rcv+0x1b1/0x3f0
? tcp_v6_do_rcv+0x1b1/0x3f0
__release_sock+0x7f/0xd0
release_sock+0x30/0xa0
__inet_stream_connect+0x1c3/0x3b0
? prepare_to_wait+0xb0/0xb0
inet_stream_connect+0x3b/0x60
__sys_connect+0x101/0x120
? __sys_getsockopt+0x11b/0x140
__x64_sys_connect+0x1a/0x20
do_syscall_64+0x51/0x200
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Shakeel Butt
Reviewed-by: Roman Gushchin
Signed-off-by: David S. Miller

Shakeel Butt
2020-03-11 06:33:05 +0800
e94148454 Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup fixes from Tejun Heo:

- cgroup.procs listing related fixes.

It didn't interlock properly with exiting tasks leaving a short
window where a cgroup has empty cgroup.procs but still can't be
removed and misbehaved on short reads.

- psi_show() crash fix on 32bit ino archs

- Empty release_agent handling fix

* 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup1: don't call release_agent when it is ""
cgroup: fix psi_show() crash on 32bit ino archs
cgroup: Iterate tasks that did not finish do_exit()
cgroup: cgroup_procs_next should increase position index
cgroup-v1: cgroup_pidlist_next should update position index

Linus Torvalds
2020-03-11 06:05:45 +0800
2c1aca4bd Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq ... Browse Code »

Pull workqueue fixes from Tejun Heo:
"Workqueue has been incorrectly round-robining per-cpu work items.
Hillf's patch fixes that.

The other patch documents memory-ordering properties of workqueue
operations"

* 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: don't use wq_select_unbound_cpu() for bound works
workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()

Linus Torvalds
2020-03-11 05:48:22 +0800

10 Mar, 2020

3 commits

aa202f1f5 workqueue: don't use wq_select_unbound_cpu() for bound works ... Browse Code »

wq_select_unbound_cpu() is designed for unbound workqueues only, but
it's wrongly called when using a bound workqueue too.

Fixing this ensures work queued to a bound workqueue with
cpu=WORK_CPU_UNBOUND always runs on the local CPU.

Before, that would happen only if wq_unbound_cpumask happened to include
it (likely almost always the case), or was empty, or we got lucky with
forced round-robin placement. So restricting
/sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
CPUs would cause some bound work items to run unexpectedly there.

Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Hillf Danton
[dj: massage changelog]
Signed-off-by: Daniel Jordan
Cc: Tejun Heo
Cc: Lai Jiangshan
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo

Hillf Danton
2020-03-10 22:30:51 +0800
10dab84ca pid: make ENOMEM return value more obvious ... Browse Code »

The alloc_pid() codepath used to be simpler. With the introducation of the
ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
support setting a PID") it got more complex. It hasn't been super obvious
that ENOMEM is returned when the pid namespace init process/child subreaper
of the pid namespace has died. As can be seen from multiple attempts to
improve this see e.g. [1] and most recently [2].
We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
comment on top explaining that this is historic and documented behavior and
cannot easily be changed.

[1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
[2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
[3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Christian Brauner

Christian Brauner
2020-03-10 06:40:05 +0800
8d6774365 futex: Unbreak futex hashing ... Browse Code »

The recent futex inode life time fix changed the ordering of the futex key
union struct members, but forgot to adjust the hash function accordingly,

As a result the hashing omits the leading 64bit and even hashes beyond the
futex key causing a bad hash distribution which led to a ~100% performance
regression.

Hand in the futex key pointer instead of a random struct member and make
the size calculation based of the struct offset.

Fixes: 8019ad13ef7f ("futex: Fix inode life-time issue")
Reported-by: Rong Chen
Decoded-by: Linus Torvalds
Signed-off-by: Thomas Gleixner
Tested-by: Rong Chen
Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de

Thomas Gleixner
2020-03-10 05:33:09 +0800

08 Mar, 2020

2 commits

b26ebfe12 pid: Fix error return value in some cases ... Browse Code »

Recent changes to alloc_pid() allow the pid number to be specified on
the command line. If set_tid_size is set, then the code scanning the
levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM
value.

After the code scanning the levels, there are error returns that do not
set retval, assuming it is still set to -ENOMEM.

So set retval back to -ENOMEM after scanning the levels.

Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Corey Minyard
Acked-by: Christian Brauner
Cc: Andrei Vagin
Cc: Dmitry Safonov
Cc: Oleg Nesterov
Cc: Adrian Reber
Cc: # 5.5
Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
[christian.brauner@ubuntu.com: fixup commit message]
Signed-off-by: Christian Brauner

Corey Minyard
2020-03-08 21:22:58 +0800
5dfcc1390 Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:
"Here are a few fixes that should go into this release. This contains:

- Revert of a bad bcache patch from this merge window

- Removed unused function (Daniel)

- Fixup for the blktrace fix from Jan from this release (Cengiz)

- Fix of deeper level bfqq overwrite in BFQ (Carlo)"

* tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
blktrace: fix dereference after null check
Revert "bcache: ignore pending signals when creating gc and allocator thread"
block: Remove used kblockd_schedule_work_on()

Linus Torvalds
2020-03-08 04:14:38 +0800

07 Mar, 2020

1 commit

fa883d6af Merge tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux ... Browse Code »

Pull thread fixes from Christian Brauner:
"Here are a few hopefully uncontroversial fixes:

- Use RCU_INIT_POINTER() when initializing rcu protected members in
task_struct to fix sparse warnings.

- Add pidfd_fdinfo_test binary to .gitignore file"

* tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
selftests: pidfd: Add pidfd_fdinfo_test in .gitignore
exit: Fix Sparse errors and warnings
fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()

Linus Torvalds
2020-03-07 22:01:43 +0800

06 Mar, 2020

2 commits

8019ad13e futex: Fix inode life-time issue ... Browse Code »

As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.

Reported-by: Jann Horn
Suggested-by: Linus Torvalds
Signed-off-by: Peter Zijlstra (Intel)

Peter Zijlstra
2020-03-06 18:06:15 +0800
153031a30 blktrace: fix dereference after null check ... Browse Code »

There was a recent change in blktrace.c that added a RCU protection to
`q->blk_trace` in order to fix a use-after-free issue during access.

However the change missed an edge case that can lead to dereferencing of
`bt` pointer even when it's NULL:

Coverity static analyzer marked this as a FORWARD_NULL issue with CID
1460458.

```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898 ret = 0;
1899 if (bt == NULL)
1900 ret = blk_trace_setup_queue(q, bdev);
1901
1902 if (ret == 0) {
1903 if (attr == &dev_attr_act_mask)
>>> CID 1460458: Null pointer dereferences (FORWARD_NULL)
>>> Dereferencing null pointer "bt".
1904 bt->act_mask = value;
1905 else if (attr == &dev_attr_pid)
1906 bt->pid = value;
1907 else if (attr == &dev_attr_start_lba)
1908 bt->start_lba = value;
1909 else if (attr == &dev_attr_end_lba)
```

Added a reassignment with RCU annotation to fix the issue.

Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei
Reviewed-by: Bob Liu
Reviewed-by: Steven Rostedt (VMware)
Signed-off-by: Cengiz Can
Signed-off-by: Jens Axboe

Cengiz Can
2020-03-06 04:42:40 +0800

05 Mar, 2020

2 commits

2e5383d79 cgroup1: don't call release_agent when it is "" ... Browse Code »

Older (and maybe current) versions of systemd set release_agent to "" when
shutting down, but do not set notify_on_release to 0.

Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate
call_usermodehelper()"), we filter out such calls when the user mode helper
path is "". However, when used in conjunction with an actual (i.e. non "")
STATIC_USERMODEHELPER, the path is never "", so the real usermode helper
will be called with argv[0] == "".

Let's avoid this by not invoking the release_agent when it is "".

Signed-off-by: Tycho Andersen
Signed-off-by: Tejun Heo

Tycho Andersen
2020-03-05 00:53:33 +0800
190ecb190 cgroup: fix psi_show() crash on 32bit ino archs ... Browse Code »

Similar to the commit d7495343228f ("cgroup: fix incorrect
WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not
equal to 1 on 32bit ino archs which triggers all sorts of issues with
psi_show() on s390x. For example,

BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/
Read of size 4 at addr 000000001e0ce000 by task read_all/3667
collect_percpu_times+0x2d0/0x798
psi_show+0x7c/0x2a8
seq_read+0x2ac/0x830
vfs_read+0x92/0x150
ksys_read+0xe2/0x188
system_call+0xd8/0x2b4

Fix it by using cgroup_ino().

Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
Signed-off-by: Qian Cai
Acked-by: Johannes Weiner
Signed-off-by: Tejun Heo
Cc: stable@vger.kernel.org # v5.5

Qian Cai
2020-03-05 00:15:58 +0800

04 Mar, 2020

1 commit

ecc421e05 sys/sysinfo: Respect boottime inside time namespace ... Browse Code »

The sysinfo() syscall includes uptime in seconds but has no correction for
time namespaces which makes it inconsistent with the /proc/uptime inside of
a time namespace.

Add the missing time namespace adjustment call.

Signed-off-by: Cyril Hrubis
Signed-off-by: Thomas Gleixner
Reviewed-by: Dmitry Safonov
Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz

Cyril Hrubis
2020-03-04 02:34:32 +0800

02 Mar, 2020

1 commit

c105df5d8 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull scheduler fix from Ingo Molnar:
"Fix a scheduler statistics bug"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix statistics for find_idlest_group()

Linus Torvalds
2020-03-02 20:51:43 +0800

29 Feb, 2020

2 commits

2edc78b9a Merge tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block ... Browse Code »

Pull block fixes from Jens Axboe:

- Passthrough insertion fix (Ming)

- Kill off some unused arguments (John)

- blktrace RCU fix (Jan)

- Dead fields removal for null_blk (Dongli)

- NVMe polled IO fix (Bijan)

* tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
nvme-pci: Hold cq_poll_lock while completing CQEs
blk-mq: Remove some unused function arguments
null_blk: remove unused fields in 'nullb_cmd'
blktrace: Protect q->blk_trace with RCU
blk-mq: insert passthrough request into hctx->dispatch directly

Linus Torvalds
2020-02-29 03:43:30 +0800
364285981 Merge tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm ... Browse Code »

Pull power management fixes from Rafael Wysocki:
"Fix a recent cpufreq initialization regression (Rafael Wysocki),
revert a devfreq commit that made incompatible changes and broke user
land on some systems (Orson Zhai), drop a stale reference to a
document that has gone away recently (Jonathan Neuschäfer), and fix a
typo in a hibernation code comment (Alexandre Belloni)"

* tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Fix policy initialization for internal governor drivers
Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
Documentation: power: Drop reference to interface.rst

Linus Torvalds
2020-02-29 00:49:52 +0800

28 Feb, 2020

4 commits

22a34c6fe exit: Fix Sparse errors and warnings ... Browse Code »

This patch fixes the following sparse error:
kernel/exit.c:627:25: error: incompatible types in comparison expression

And the following warning:
kernel/exit.c:626:40: warning: incorrect type in assignment

Signed-off-by: Madhuparna Bhowmik
Acked-by: Oleg Nesterov
Acked-by: Christian Brauner
[christian.brauner@ubuntu.com: edit commit message]
Link: https://lore.kernel.org/r/20200130062028.4870-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Christian Brauner

Madhuparna Bhowmik
2020-02-28 20:34:39 +0800
0c282b068 fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer() ... Browse Code »

Use RCU_INIT_POINTER() instead of rcu_access_pointer() in
copy_sighand().

Suggested-by: Oleg Nesterov
Signed-off-by: Madhuparna Bhowmik
Acked-by: Oleg Nesterov
Acked-by: Christian Brauner
[christian.brauner@ubuntu.com: edit commit message]
Link: https://lore.kernel.org/r/20200127175821.10833-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Christian Brauner

Madhuparna Bhowmik
2020-02-28 20:34:36 +0800
189c6967f Merge branches 'pm-sleep' and 'pm-devfreq' ... Browse Code »

* pm-sleep:
PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
Documentation: power: Drop reference to interface.rst

* pm-devfreq:
Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"

Rafael J. Wysocki
2020-02-28 18:00:50 +0800
ed5fa5591 Merge tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit ... Browse Code »

Pull audit fixes from Paul Moore:
"Two fixes for problems found by syzbot:

- Moving audit filter structure fields into a union caused some
problems in the code which populates that filter structure.

We keep the union (that idea is a good one), but we are fixing the
code so that it doesn't needlessly set fields in the union and mess
up the error handling.

- The audit_receive_msg() function wasn't validating user input as
well as it should in all cases, we add the necessary checks"

* tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: always check the netlink payload length in audit_receive_msg()
audit: fix error handling in audit_data_to_entry()

Linus Torvalds
2020-02-28 03:01:22 +0800

27 Feb, 2020

3 commits

289de3598 sched/fair: Fix statistics for find_idlest_group() ... Browse Code »

sgs->group_weight is not set while gathering statistics in
update_sg_wakeup_stats(). This means that a group can be classified as
fully busy with 0 running tasks if utilization is high enough.

This path is mainly used for fork and exec.

Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
Signed-off-by: Vincent Guittot
Signed-off-by: Ingo Molnar
Acked-by: Peter Zijlstra
Acked-by: Mel Gorman
Link: https://lore.kernel.org/r/20200218144534.4564-1-vincent.guittot@linaro.org

Vincent Guittot
2020-02-27 17:08:27 +0800
91ad64a84 Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing and bootconfig updates:
"Fixes and changes to bootconfig before it goes live in a release.

Change in API of bootconfig (before it comes live in a release):
- Have a magic value "BOOTCONFIG" in initrd to know a bootconfig
exists
- Set CONFIG_BOOT_CONFIG to 'n' by default
- Show error if "bootconfig" on cmdline but not compiled in
- Prevent redefining the same value
- Have a way to append values
- Added a SELECT BLK_DEV_INITRD to fix a build failure

Synthetic event fixes:
- Switch to raw_smp_processor_id() for recording CPU value in preempt
section. (No care for what the value actually is)
- Fix samples always recording u64 values
- Fix endianess
- Check number of values matches number of fields
- Fix a printing bug

Fix of trace_printk() breaking postponed start up tests

Make a function static that is only used in a single file"

* tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
bootconfig: Add append value operator support
bootconfig: Prohibit re-defining value on same key
bootconfig: Print array as multiple commands for legacy command line
bootconfig: Reject subkey and value on same parent key
tools/bootconfig: Remove unneeded error message silencer
bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
bootconfig: Set CONFIG_BOOT_CONFIG=n by default
tracing: Clear trace_state when starting trace
bootconfig: Mark boot_config_checksum() static
tracing: Disable trace_printk() on post poned tests
tracing: Have synthetic event test use raw_smp_processor_id()
tracing: Fix number printing bug in print_synth_event()
tracing: Check that number of vals matches number of synth event fields
tracing: Make synth_event trace functions endian-correct
tracing: Make sure synth_event_trace() example always uses u64

Linus Torvalds
2020-02-27 02:34:42 +0800
fda31c502 signal: avoid double atomic counter increments for user accounting ... Browse Code »

When queueing a signal, we increment both the users count of pending
signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
of the user struct itself (because we keep a reference to the user in
the signal structure in order to correctly account for it when freeing).

That turns out to be fairly expensive, because both of them are atomic
updates, and particularly under extreme signal handling pressure on big
machines, you can get a lot of cache contention on the user struct.
That can then cause horrid cacheline ping-pong when you do these
multiple accesses.

So change the reference counting to only pin the user for the _first_
pending signal, and to unpin it when the last pending signal is
dequeued. That means that when a user sees a lot of concurrent signal
queuing - which is the only situation when this matters - the only
atomic access needed is generally the 'sigpending' count update.

This was noticed because of a particularly odd timing artifact on a
dual-socket 96C/192T Cascade Lake platform: when you get into bad
contention, on that machine for some reason seems to be much worse when
the contention happens in the upper 32-byte half of the cacheline.

As a result, the kernel test robot will-it-scale 'signal1' benchmark had
an odd performance regression simply due to random alignment of the
'struct user_struct' (and pointed to a completely unrelated and
apparently nonsensical commit for the regression).

Avoiding the double increments (and decrements on the dequeueing side,
of course) makes for much less contention and hugely improved
performance on that will-it-scale microbenchmark.

Quoting Feng Tang:

"It makes a big difference, that the performance score is tripled! bump
from original 17000 to 54000. Also the gap between 5.0-rc6 and
5.0-rc6+Jiri's patch is reduced to around 2%"

[ The "2% gap" is the odd cacheline placement difference on that
platform: under the extreme contention case, the effect of which half
of the cacheline was hot was 5%, so with the reduced contention the
odd timing artifact is reduced too ]

It does help in the non-contended case too, but is not nearly as
noticeable.

Reported-and-tested-by: Feng Tang
Cc: Eric W. Biederman
Cc: Huang, Ying
Cc: Philip Li
Cc: Andi Kleen
Cc: Jiri Olsa
Cc: Peter Zijlstra
Signed-off-by: Linus Torvalds

Linus Torvalds
2020-02-27 01:54:03 +0800

26 Feb, 2020

1 commit

2910b5aa6 bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue ... Browse Code »

Since commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by
default") also changed the CONFIG_BOOTTIME_TRACING to select
CONFIG_BOOT_CONFIG to show the boot-time tracing on the menu,
it introduced wrong dependencies with BLK_DEV_INITRD as below.

WARNING: unmet direct dependencies detected for BOOT_CONFIG
Depends on [n]: BLK_DEV_INITRD [=n]
Selected by [y]:
- BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]

This makes the CONFIG_BOOT_CONFIG selects CONFIG_BLK_DEV_INITRD to
fix this error and make CONFIG_BOOTTIME_TRACING=n by default, so
that both boot-time tracing and boot configuration off but those
appear on the menu list.

Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2

Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
Reported-by: Randy Dunlap
Compiled-tested-by: Randy Dunlap
Signed-off-by: Masami Hiramatsu
Signed-off-by: Steven Rostedt (VMware)

Masami Hiramatsu
2020-02-26 08:07:58 +0800

25 Feb, 2020

2 commits

c780e86dd blktrace: Protect q->blk_trace with RCU ... Browse Code »

KASAN is reporting that __blk_add_trace() has a use-after-free issue
when accessing q->blk_trace. Indeed the switching of block tracing (and
thus eventual freeing of q->blk_trace) is completely unsynchronized with
the currently running tracing and thus it can happen that the blk_trace
structure is being freed just while __blk_add_trace() works on it.
Protect accesses to q->blk_trace by RCU during tracing and make sure we
wait for the end of RCU grace period when shutting down tracing. Luckily
that is rare enough event that we can afford that. Note that postponing
the freeing of blk_trace to an RCU callback should better be avoided as
it could have unexpected user visible side-effects as debugfs files
would be still existing for a short while block tracing has been shut
down.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
CC: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni
Reviewed-by: Ming Lei
Tested-by: Ming Lei
Reviewed-by: Bart Van Assche
Reported-by: Tristan Madani
Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2020-02-25 23:40:07 +0800
756125289 audit: always check the netlink payload length in audit_receive_msg() ... Browse Code »

This patch ensures that we always check the netlink payload length
in audit_receive_msg() before we take any action on the payload
itself.

Cc: stable@vger.kernel.org
Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com
Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com
Signed-off-by: Paul Moore

Paul Moore
2020-02-25 05:38:57 +0800

23 Feb, 2020

3 commits

2ad3e17eb audit: fix error handling in audit_data_to_entry() ... Browse Code »

Commit 219ca39427bf ("audit: use union for audit_field values since
they are mutually exclusive") combined a number of separate fields in
the audit_field struct into a single union. Generally this worked
just fine because they are generally mutually exclusive.
Unfortunately in audit_data_to_entry() the overlap can be a problem
when a specific error case is triggered that causes the error path
code to attempt to cleanup an audit_field struct and the cleanup
involves attempting to free a stored LSM string (the lsm_str field).
Currently the code always has a non-NULL value in the
audit_field.lsm_str field as the top of the for-loop transfers a
value into audit_field.val (both .lsm_str and .val are part of the
same union); if audit_data_to_entry() fails and the audit_field
struct is specified to contain a LSM string, but the
audit_field.lsm_str has not yet been properly set, the error handling
code will attempt to free the bogus audit_field.lsm_str value that
was set with audit_field.val at the top of the for-loop.

This patch corrects this by ensuring that the audit_field.val is only
set when needed (it is cleared when the audit_field struct is
allocated with kcalloc()). It also corrects a few other issues to
ensure that in case of error the proper error code is returned.

Cc: stable@vger.kernel.org
Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
Signed-off-by: Paul Moore

Paul Moore
2020-02-23 09:36:47 +0800
f3cc24942 Merge tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull irq fixes from Thomas Gleixner:
"Two fixes for the irq core code which are follow ups to the recent MSI
fixes:

- The WARN_ON which was put into the MSI setaffinity callback for
paranoia reasons actually triggered via a callchain which escaped
when all the possible ways to reach that code were analyzed.

The proc/irq/$N/*affinity interfaces have a quirk which came in
when ALPHA moved to the generic interface: In case that the written
affinity mask does not contain any online CPU it calls into ALPHAs
magic auto affinity setting code.

A few years later this mechanism was also made available to x86 for
no good reasons and in a way which circumvents all sanity checks
for interrupts which cannot have their affinity set from process
context on X86 due to the way the X86 interrupt delivery works.

It would be possible to make this work properly, but there is no
point in doing so. If the interrupt is not yet started then the
affinity setting has no effect and if it is started already then it
is already assigned to an online CPU so there is no point to
randomly move it to some other CPU. Just return EINVAL as the code
has done before that change forever.

- The new MSI quirk bit in the irq domain flags turned out to be
already occupied, which escaped the author and the reviewers
because the already in use bits were 0,6,2,3,4,5 listed in that
order.

That bit 6 was simply overlooked because the ordering was straight
forward linear otherwise. So the new bit ended up being a
duplicate.

Fix it up by switching the oddball 6 to the obvious 1"

* tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/irqdomain: Make sure all irq domain flags are distinct
genirq/proc: Reject invalid affinity masks (again)

Linus Torvalds
2020-02-23 09:25:46 +0800
591dd4c10 Merge tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux ... Browse Code »

Pull s390 fixes from Vasily Gorbik:

- Remove ieee_emulation_warnings sysctl which is a dead code.

- Avoid triggering rebuild of the kernel during make install.

- Enable protected virtualization guest support in default configs.

- Fix cio_ignore seq_file .next function to increase position index.
And use kobj_to_dev instead of container_of in cio code.

- Fix storage block address lists to contain absolute addresses in qdio
code.

- Few clang warnings and spelling fixes.

* tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/qdio: fill SBALEs with absolute addresses
s390/qdio: fill SL with absolute addresses
s390: remove obsolete ieee_emulation_warnings
s390: make 'install' not depend on vmlinux
s390/kaslr: Fix casts in get_random
s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
s390/pkey/zcrypt: spelling s/crytp/crypt/
s390/cio: use kobj_to_dev() API
s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
s390/cio: cio_ignore_proc_seq_next should increase position index

Linus Torvalds
2020-02-23 02:43:41 +0800

22 Feb, 2020

2 commits

3dc55dba6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net ... Browse Code »

Pull networking fixes from David Miller:

1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
Cong Wang.

2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
is finished, from Magnus Karlsson.

3) Set table field properly to RT_TABLE_COMPAT when necessary, from
Jethro Beekman.

4) NLA_STRING attributes are not necessary NULL terminated, deal wiht
that in IFLA_ALT_IFNAME. From Eric Dumazet.

5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

6) Handle mtu==0 devices properly in wireguard, from Jason A.
Donenfeld.

7) Fix several lockdep warnings in bonding, from Taehee Yoo.

8) Fix cls_flower port blocking, from Jason Baron.

9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

10) Fix RDMA race in qede driver, from Michal Kalderon.

11) Fix several false lockdep warnings by adding conditions to
list_for_each_entry_rcu(), from Madhuparna Bhowmik.

12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

14) Hey, variables declared in switch statement before any case
statements are not initialized. I learn something every day. Get
rids of this stuff in several parts of the networking, from Kees
Cook.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
bnxt_en: Improve device shutdown method.
net: netlink: cap max groups which will be considered in netlink_bind()
net: thunderx: workaround BGX TX Underflow issue
ionic: fix fw_status read
net: disable BRIDGE_NETFILTER by default
net: macb: Properly handle phylink on at91rm9200
s390/qeth: fix off-by-one in RX copybreak check
s390/qeth: don't warn for napi with 0 budget
s390/qeth: vnicc Fix EOPNOTSUPP precedence
openvswitch: Distribute switch variables for initialization
net: ip6_gre: Distribute switch variables for initialization
net: core: Distribute switch variables for initialization
udp: rehash on disconnect
net/tls: Fix to avoid gettig invalid tls record
bpf: Fix a potential deadlock with bpf_map_do_batch
bpf: Do not grab the bucket spinlock by default on htab batch ops
ice: Wait for VF to be reset/ready before configuration
ice: Don't tell the OS that link is going down
ice: Don't reject odd values of usecs set by user
...

Linus Torvalds
2020-02-22 03:59:51 +0800
412c53a68 y2038: remove unused time32 interfaces ... Browse Code »

No users remain, so kill these off before we grow new ones.

Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Acked-by: Thomas Gleixner
Cc: Deepa Dinamani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arnd Bergmann
2020-02-22 03:22:15 +0800

21 Feb, 2020

1 commit

d8a953ddd bootconfig: Set CONFIG_BOOT_CONFIG=n by default ... Browse Code »

Set CONFIG_BOOT_CONFIG=n by default. This also warns
user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given
in the kernel command line.

Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2

Suggested-by: Steven Rostedt
Signed-off-by: Masami Hiramatsu
Signed-off-by: Steven Rostedt (VMware)

Masami Hiramatsu
2020-02-21 06:52:12 +0800