Eric Lee / smarc-fsl-linux-kernel

06 Mar, 2019

9 commits

dc50537bd kernel: cgroup: add poll file operation

Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.
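
The override-or-default dispatch described above can be sketched in user space as follows. This is only an illustration: `struct file_type`, the `EV_*` bits and all function names are hypothetical and are not the kernel's cgroup or kernfs API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event bits, for illustration only. */
#define EV_READABLE 0x1u
#define EV_PRIORITY 0x2u

struct file_type {
    /* Optional per-file poll callback; NULL means "use the default". */
    unsigned int (*poll)(void *priv);
};

/* Default behaviour: every poller gets the same generic notification. */
unsigned int default_poll(void *priv)
{
    (void)priv;
    return EV_READABLE;
}

/* A custom callback, e.g. for a file that keeps per-fd trigger state. */
unsigned int custom_poll(void *priv)
{
    const int *triggered = priv;

    return *triggered ? EV_PRIORITY : 0u;
}

/* Dispatch: prefer the file's own callback, otherwise fall back. */
unsigned int do_poll(const struct file_type *ft, void *priv)
{
    assert(ft != NULL);
    if (ft->poll)
        return ft->poll(priv);
    return default_poll(priv);
}
```

A file type that leaves `poll` unset keeps the standard behaviour, so existing files would not need to change in this model.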

Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner
Signed-off-by: Suren Baghdasaryan
Cc: Dennis Zhou
Cc: Ingo Molnar
Cc: Jens Axboe
Cc: Li Zefan
Cc: Peter Zijlstra
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-03-06 13:07:17 +0800
5e1f0f098 mm, compaction: capture a page under direct compaction

Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task. This patch uses a
capture_control structure to isolate a page immediately when it is freed
by a direct compactor in the slow path of the page allocator. The
intent is to avoid redundant scanning.
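
A heavily simplified user-space model of the capture idea, assuming a hypothetical `struct capture_control` that records the order being requested and the captured page (an integer id here). The real structure and its checks live in the page allocator and are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one waiting compactor and the order it needs. */
struct capture_control {
    unsigned int order;   /* order the direct compactor is trying to get */
    int captured_page;    /* page id, or -1 while nothing is captured */
};

/*
 * Model of the free path. Returns true when the freed page was handed
 * straight to the waiting compactor instead of going back to the free
 * list, where a parallel task could allocate it first.
 */
bool free_page_model(struct capture_control *capc, int page,
                     unsigned int order, int *free_list_count)
{
    assert(free_list_count != NULL);
    if (capc && capc->captured_page < 0 && capc->order == order) {
        capc->captured_page = page;
        return true;
    }
    (*free_list_count)++;
    return false;
}
```

In this model a captured page never becomes visible on the free list, which is what removes the race window and the follow-up rescan.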

                                     5.0.0-rc1              5.0.0-rc1
                               selective-v3r17          capture-v3r19
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)

Latency is only moderately affected but the devil is in the details. A
closer examination indicates that base page fault latency is reduced but
latency of huge pages is increased as it takes greater care to succeed.
Part of the "problem" is that allocation success rates are close to 100%
even when under pressure and compaction gets harder.

                                5.0.0-rc1              5.0.0-rc1
                          selective-v3r17          capture-v3r19
Percentage huge-3         96.70 (   0.00%)       98.23 (   1.58%)
Percentage huge-5         96.99 (   0.00%)       95.30 (  -1.75%)
Percentage huge-7         94.19 (   0.00%)       97.24 (   3.24%)
Percentage huge-12        94.95 (   0.00%)       97.35 (   2.53%)
Percentage huge-18        96.74 (   0.00%)       97.30 (   0.58%)
Percentage huge-24        97.07 (   0.00%)       97.55 (   0.50%)
Percentage huge-30        95.69 (   0.00%)       98.50 (   2.95%)
Percentage huge-32        96.70 (   0.00%)       99.27 (   2.65%)

And scan rates are reduced as expected by 6% for the migration scanner
and 29% for the free scanner indicating that there is less redundant
work.

Compaction migrate scanned    20815362    19573286
Compaction free scanned       16352612    11510663
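
The quoted 6% and 29% follow directly from the two counter rows. As a quick check, the relative reduction can be computed like this (the helper name is made up for this sketch):

```c
#include <assert.h>

/* Relative reduction from `before` to `after`, in percent. */
double reduction_percent(unsigned long long before, unsigned long long after)
{
    assert(before > 0);
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

Calling `reduction_percent(20815362ULL, 19573286ULL)` gives roughly 6.0 for the migration scanner, and `reduction_percent(16352612ULL, 11510663ULL)` gives roughly 29.6 for the free scanner.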

[mgorman@techsingularity.net: remove redundant check]
Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Dan Carpenter
Cc: David Rientjes
Cc: YueHaibing
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2019-03-06 13:07:17 +0800
6b7e5cad6 mm: remove sysctl_extfrag_handler()

sysctl_extfrag_handler() neglects to propagate the return value from
proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need
to exist, so just use proc_dointvec_minmax() directly.
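
The class of bug being removed can be modelled in user space as below. `parse_minmax()` is a hypothetical stand-in for a range-checked handler, and the 0..1000 range is chosen only for the example; neither is the kernel's proc_dointvec_minmax() signature.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for a range-checked parser: rejects values outside [min, max]. */
int parse_minmax(int input, int min, int max, int *out)
{
    assert(out != NULL);
    if (input < min || input > max)
        return -EINVAL;
    *out = input;
    return 0;
}

/* Buggy shape: the wrapper calls the helper but drops its result. */
int wrapper_drops_error(int input, int *out)
{
    parse_minmax(input, 0, 1000, out);
    return 0;   /* the caller never learns about a failure */
}

/* Fixed shape: no wrapper logic, the helper's result reaches the caller. */
int direct_call(int input, int *out)
{
    return parse_minmax(input, 0, 1000, out);
}
```

With an out-of-range input, `wrapper_drops_error()` reports success while leaving the value untouched, whereas `direct_call()` returns `-EINVAL`.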

Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
Signed-off-by: Matthew Wilcox
Reported-by: Aditya Pakki
Acked-by: Mel Gorman
Acked-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Matthew Wilcox
2019-03-06 13:07:15 +0800
98fa15f34 mm: replace all open encodings for NUMA_NO_NODE

Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code. Please let me know if this
might have missed some instances. This might also have replaced some
false positives. I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where an invalid node number is
encoded as -1. Even though this is implicitly understood, it is always
better to have macros in there. Replace these open encodings for an
invalid node number with the global macro NUMA_NO_NODE. This helps remove
NUMA related assumptions like 'invalid node' from various places,
redirecting them to a common definition.
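
A small before/after sketch of the replacement. The macro is redefined locally so the example stands alone, with the value -1 taken from the description above; the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel macro; -1 as stated in the commit text. */
#define NUMA_NO_NODE (-1)

/* Before: the sentinel is an unexplained literal. */
bool node_is_unset_open_coded(int nid)
{
    return nid == -1;
}

/* After: the same test, but the intent is named. */
bool node_is_unset(int nid)
{
    return nid == NUMA_NO_NODE;
}

/* Typical use: fall back to a concrete node when none was requested. */
int pick_node(int requested, int fallback)
{
    assert(fallback != NUMA_NO_NODE);
    return node_is_unset(requested) ? fallback : requested;
}
```

Both predicates behave identically; the change is about readability and having one definition to grep for.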

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual
Reviewed-by: David Hildenbrand
Acked-by: Jeff Kirsher [ixgbe]
Acked-by: Jens Axboe [mtip32xx]
Acked-by: Vinod Koul [dmaengine.c]
Acked-by: Michael Ellerman [powerpc]
Acked-by: Doug Ledford [drivers/infiniband]
Cc: Joseph Qi
Cc: Hans Verkuil
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anshuman Khandual
2019-03-06 13:07:14 +0800
abd02ac61 PM/Hibernate: exclude all PageOffline() pages

The content of pages that are marked PG_offline is not of interest (e.g.
inflated by a balloon driver), so let's skip these pages.

In saveable_highmem_page(), move the PageReserved() check to a new check
along with the PageOffline() check to separate it from the swsusp
checks.

[david@redhat.com: v2]
Link: http://lkml.kernel.org/r/20181122100627.5189-9-david@redhat.com
Link: http://lkml.kernel.org/r/20181119101616.8901-9-david@redhat.com
Signed-off-by: David Hildenbrand
Acked-by: Pavel Machek
Acked-by: Rafael J. Wysocki
Cc: Pavel Machek
Cc: Len Brown
Cc: Matthew Wilcox
Cc: Michal Hocko
Cc: "Michael S. Tsirkin"
Cc: Alexander Duyck
Cc: Alexey Dobriyan
Cc: Arnd Bergmann
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Christian Hansen
Cc: Dave Young
Cc: David Rientjes
Cc: Greg Kroah-Hartman
Cc: Haiyang Zhang
Cc: Jonathan Corbet
Cc: Juergen Gross
Cc: Julien Freche
Cc: Kairui Song
Cc: Kazuhito Hagio
Cc: "Kirill A. Shutemov"
Cc: Konstantin Khlebnikov
Cc: "K. Y. Srinivasan"
Cc: Lianbo Jiang
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Miles Chen
Cc: Nadav Amit
Cc: Naoya Horiguchi
Cc: Omar Sandoval
Cc: Pankaj gupta
Cc: Pavel Tatashin
Cc: "Rafael J. Wysocki"
Cc: Stefano Stabellini
Cc: Stephen Hemminger
Cc: Stephen Rothwell
Cc: Vitaly Kuznetsov
Cc: Vlastimil Babka
Cc: Xavier Deguillard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Hildenbrand
2019-03-06 13:07:14 +0800
5b56db372 PM/Hibernate: use pfn_to_online_page()

Let's use pfn_to_online_page() instead of pfn_to_page() when checking
for saveable pages to not save/restore offline memory sections.

Link: http://lkml.kernel.org/r/20181119101616.8901-8-david@redhat.com
Signed-off-by: David Hildenbrand
Suggested-by: Michal Hocko
Acked-by: Michal Hocko
Acked-by: Pavel Machek
Acked-by: Rafael J. Wysocki
Cc: Len Brown
Cc: Matthew Wilcox
Cc: "Michael S. Tsirkin"
Cc: Alexander Duyck
Cc: Alexey Dobriyan
Cc: Arnd Bergmann
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Christian Hansen
Cc: Dave Young
Cc: David Rientjes
Cc: Greg Kroah-Hartman
Cc: Haiyang Zhang
Cc: Jonathan Corbet
Cc: Juergen Gross
Cc: Julien Freche
Cc: Kairui Song
Cc: Kazuhito Hagio
Cc: "Kirill A. Shutemov"
Cc: Konstantin Khlebnikov
Cc: "K. Y. Srinivasan"
Cc: Lianbo Jiang
Cc: Mike Rapoport
Cc: Miles Chen
Cc: Nadav Amit
Cc: Naoya Horiguchi
Cc: Omar Sandoval
Cc: Pankaj gupta
Cc: Pavel Tatashin
Cc: "Rafael J. Wysocki"
Cc: Stefano Stabellini
Cc: Stephen Hemminger
Cc: Stephen Rothwell
Cc: Vitaly Kuznetsov
Cc: Vlastimil Babka
Cc: Xavier Deguillard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Hildenbrand
2019-03-06 13:07:14 +0800
e04b742f7 kexec: export PG_offline to VMCOREINFO

Right now, pages inflated as part of a balloon driver will be dumped by
dump tools like makedumpfile. While XEN is able to check in the crash
kernel whether a certain pfn is actually backed by memory in the
hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
other balloon inflated memory will essentially result in zero pages
getting allocated by the hypervisor and the dump getting filled with
this data.

The allocation and reading of zero pages can directly be avoided if a
dumping tool could know which pages only contain stale information not
to be dumped.

We now have PG_offline which can be (and already is by virtio-balloon)
used for marking pages as logically offline. Follow up patches will
make use of this flag also in other balloon implementations.

Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so makedumpfile
can directly skip pages that are logically offline and the content
therefore stale.
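
How a dump tool might use such an exported value can be sketched as follows. Everything here is an assumption made for illustration: the `struct dumped_page` layout, the placeholder `SKETCH_OFFLINE_MAPCOUNT_VALUE`, and the function name. It is not makedumpfile's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for the value a tool would read from VMCOREINFO. */
#define SKETCH_OFFLINE_MAPCOUNT_VALUE ((int32_t)-257)

struct dumped_page {
    int32_t mapcount;   /* field the dump tool inspects (assumed layout) */
};

/* Count pages worth copying, skipping those marked logically offline. */
size_t count_pages_to_dump(const struct dumped_page *pages, size_t n,
                           int32_t offline_value)
{
    size_t kept = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pages[i].mapcount == offline_value)
            continue;   /* stale content, e.g. balloon-inflated */
        kept++;
    }
    return kept;
}
```

Skipped pages are never read, so the hypervisor does not have to allocate zero pages just to satisfy the dump.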

Please note that this is also helpful for a problem we were seeing under
Hyper-V: Dumping logically offline memory (pages kept fake offline while
onlining a section via online_page_callback) would under some conditions
result in a kernel panic when dumping them.

Link: http://lkml.kernel.org/r/20181119101616.8901-4-david@redhat.com
Signed-off-by: David Hildenbrand
Acked-by: Michael S. Tsirkin
Acked-by: Dave Young
Cc: "Kirill A. Shutemov"
Cc: Baoquan He
Cc: Omar Sandoval
Cc: Arnd Bergmann
Cc: Matthew Wilcox
Cc: Michal Hocko
Cc: Lianbo Jiang
Cc: Borislav Petkov
Cc: Kazuhito Hagio
Cc: Alexander Duyck
Cc: Alexey Dobriyan
Cc: Boris Ostrovsky
Cc: Christian Hansen
Cc: David Rientjes
Cc: Greg Kroah-Hartman
Cc: Haiyang Zhang
Cc: Jonathan Corbet
Cc: Juergen Gross
Cc: Julien Freche
Cc: Kairui Song
Cc: Konstantin Khlebnikov
Cc: "K. Y. Srinivasan"
Cc: Len Brown
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Miles Chen
Cc: Nadav Amit
Cc: Naoya Horiguchi
Cc: Pankaj gupta
Cc: Pavel Machek
Cc: Pavel Tatashin
Cc: Rafael J. Wysocki
Cc: "Rafael J. Wysocki"
Cc: Stefano Stabellini
Cc: Stephen Hemminger
Cc: Stephen Rothwell
Cc: Vitaly Kuznetsov
Cc: Vlastimil Babka
Cc: Xavier Deguillard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Hildenbrand
2019-03-06 13:07:14 +0800
3591b1951 Merge tag 's390-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Martin Schwidefsky:

- A copy of Arnd's compat wrapper generation series

- Pass information about the KVM guest to the host in the form of the
control program code and the control program version code

- Map IOV resources to support PCI physical functions on s390

- Add vector load and store alignment hints to improve performance

- Use the "jdd" constraint with gcc 9 to make jump labels work again

- Remove amode workaround for old z/VM releases from the DCSS code

- Add support for in-kernel performance measurements using the CPU
measurement counter facility

- Introduce a new PMU device cpum_cf_diag to capture counters and store
them as event raw data.

- Bug fixes and cleanups

* tag 's390-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
Revert "s390/cpum_cf: Add kernel message exaplanations"
s390/dasd: fix read device characteristic with CONFIG_VMAP_STACK=y
s390/suspend: fix prefix register reset in swsusp_arch_resume
s390: warn about clearing als implied facilities
s390: allow overriding facilities via command line
s390: clean up redundant facilities list setup
s390/als: remove duplicated in-place implementation of stfle
s390/cio: Use cpa range elsewhere within vfio-ccw
s390/cio: Fix vfio-ccw handling of recursive TICs
s390: vfio_ap: link the vfio_ap devices to the vfio_ap bus subsystem
s390/cpum_cf: Handle EBUSY return code from CPU counter facility reservation
s390/cpum_cf: Add kernel message exaplanations
s390/cpum_cf_diag: Add support for s390 counter facility diagnostic trace
s390/cpum_cf: add ctr_stcctm() function
s390/cpum_cf: move common functions into a separate file
s390/cpum_cf: introduce kernel_cpumcf_avail() function
s390/cpu_mf: replace stcctm5() with the stcctm() function
s390/cpu_mf: add store cpu counter multiple instruction support
s390/cpum_cf: Add minimal in-kernel interface for counter measurements
s390/cpum_cf: introduce kernel_cpumcf_alert() to obtain measurement alerts
...

Linus Torvalds
2019-03-06 03:13:10 +0800
645630035 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Here we go, another merge window full of networking and #ebpf changes:

1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP
range without dealing with floods of ARP traffic, from Linus
Lüssing.

2) Throttle buffered multicast packet transmission in mt76, from
Felix Fietkau.

3) Support adaptive interrupt moderation in ice, from Brett Creeley.

4) A lot of struct_size conversions, from Gustavo A. R. Silva.

5) Add peek/push/pop commands to bpftool, as well as bash completion,
from Stanislav Fomichev.

6) Optimize sk_msg_clone(), from Vakul Garg.

7) Add SO_BINDTOIFINDEX, from David Herrmann.

8) Be more conservative with local resends due to local congestion,
from Yuchung Cheng.

9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata.

10) Add health buffer support to devlink, from Eran Ben Elisha.

11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen.

12) Add statistics to basic packet scheduler filter, from Cong Wang.

13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan.

14) Lots of new IP tunneling forwarding tests, also from Nir Dotan.

15) Add 3ad stats to bonding, from Nikolay Aleksandrov.

16) Lots of probing improvements for bpftool, from Quentin Monnet.

17) Various nfp driver #ebpf JIT improvements from Jakub Kicinski.

18) Allow #ebpf programs to access gso_segs from skb shared info, from
Eric Dumazet.

19) Add sock_diag support for AF_XDP sockets, from Björn Töpel.

20) Support 22260 iwlwifi devices, from Luca Coelho.

21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov.

22) Add JMP32 instruction class support to #ebpf, from Jiong Wang.

23) Add spinlock support to #ebpf, from Alexei Starovoitov.

24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson.

25) Add device information API to devlink, from Jakub Kicinski.

26) Add new timestamping socket options which are y2038 safe, from
Deepa Dinamani.

27) Add RX checksum offloading for various sh_eth chips, from Sergei
Shtylyov.

28) Flow offload infrastructure, from Pablo Neira Ayuso.

29) Numerous cleanups, improvements, and bug fixes to the PHY layer
and many drivers from Heiner Kallweit.

30) Lots of changes to try and make packet scheduler classifiers run
lockless as much as possible, from Vlad Buslov.

31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows.

32) Add concurrency tests to tc-tests infrastructure, from Vlad
Buslov.

33) Add hwmon support to aquantia, from Heiner Kallweit.

34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet.

And I would be remiss if I didn't thank the various major networking
subsystem maintainers for integrating much of this work before I even
saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso,
Johannes Berg, Kalle Valo, and many others. Thank you!"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits)
net/sched: avoid unused-label warning
net: ignore sysctl_devconf_inherit_init_net without SYSCTL
phy: mdio-mux: fix Kconfig dependencies
net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg
net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework
selftest/net: Remove duplicate header
sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79
net/mlx5e: Update tx reporter status in case channels were successfully opened
devlink: Add support for direct reporter health state update
devlink: Update reporter state to error even if recover aborted
sctp: call iov_iter_revert() after sending ABORT
team: Free BPF filter when unregistering netdev
ip6mr: Do not call __IP6_INC_STATS() from preemptible context
isdn: mISDN: Fix potential NULL pointer dereference of kzalloc
net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs
cxgb4/chtls: Prefix adapter flags with CXGB4
net-sysfs: Switch to bitmap_zalloc()
mellanox: Switch to bitmap_zalloc()
bpf: add test cases for non-pointer sanitiation logic
mlxsw: i2c: Extend initialization by querying resources data
...

Linus Torvalds
2019-03-06 00:26:13 +0800

05 Mar, 2019

2 commits

4f9020ffd Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
"Assorted fixes that sat in -next for a while, all over the place"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: Fix locking in aio_poll()
exec: Fix mem leak in kernel_read_file
copy_mount_string: Limit string length to PATH_MAX
cgroup: saner refcounting for cgroup_root
fix cgroup_do_mount() handling of failure exits

Linus Torvalds
2019-03-05 05:24:27 +0800
f7fb7c1a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2019-03-04

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add AF_XDP support to libbpf. Rationale is to facilitate writing
AF_XDP applications by offering higher-level APIs that hide many
of the details of the AF_XDP uapi. Sample programs are converted
over to this new interface as well, from Magnus.

2) Introduce a new cant_sleep() macro for annotation of functions
that cannot sleep and use it in BPF_PROG_RUN() to assert that
BPF programs run under preemption disabled context, from Peter.

3) Introduce per BPF prog stats in order to monitor the usage
of BPF; this is controlled by kernel.bpf_stats_enabled sysctl
knob where monitoring tools can make use of this to efficiently
determine the average cost of programs, from Alexei.

4) Split up BPF selftest's test_progs similarly as we already
did with test_verifier. This allows to further reduce merge
conflicts in future and to get more structure into our
quickly growing BPF selftest suite, from Stanislav.

5) Fix a bug in BTF's dedup algorithm which can cause an infinite
loop in some circumstances; also various BPF doc fixes and
improvements, from Andrii.

6) Various BPF sample cleanups and migration to libbpf in order
to further isolate the old sample loader code (so we can get
rid of it at some point), from Jakub.

7) Add a new BPF helper for BPF cgroup skb progs that allows
to set ECN CE code point and a Host Bandwidth Manager (HBM)
sample program for limiting the bandwidth used by v2 cgroups,
from Lawrence.

8) Enable write access to skb->queue_mapping from tc BPF egress
programs in order to let BPF pick TX queue, from Jesper.

9) Fix a bug in BPF spinlock handling for map-in-map which did
not propagate spin_lock_off to the meta map, from Yonghong.

10) Fix a bug in the new per-CPU BPF prog counters to properly
initialize stats for each CPU, from Eric.

11) Add various BPF helper prototypes to selftest's bpf_helpers.h,
from Willem.

12) Fix various BPF samples bugs in XDP and tracing progs,
from Toke, Daniel and Yonghong.

13) Silence preemption splat in test_bpf after BPF_PROG_RUN()
enforces it now everywhere, from Anders.

14) Fix a signedness bug in libbpf's btf_dedup_ref_type() to
get error handling working, from Dan.

15) Fix bpftool documentation and auto-completion with regards
to stream_{verdict,parser} attach types, from Alban.
====================

Signed-off-by: David S. Miller

David S. Miller
2019-03-05 02:14:31 +0800

03 Mar, 2019

1 commit

9eb359140 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

David S. Miller
2019-03-03 04:54:35 +0800

02 Mar, 2019

2 commits

3612af783 bpf: fix sanitation rewrite in case of non-pointers

Marek reported that he saw an issue with the below snippet in that
timing measurements were off when loaded as unpriv while results
were reasonable when loaded as privileged:

[...]
uint64_t a = bpf_ktime_get_ns();
uint64_t b = bpf_ktime_get_ns();
uint64_t delta = b - a;
if ((int64_t)delta > 0) {
[...]

Turns out there is a bug where a corner case is missing in the fix
d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar
type from different paths"), namely fixup_bpf_calls() only checks
whether aux has a non-zero alu_state, but it also needs to test for
the case of BPF_ALU_NON_POINTER since in both occasions we need to
skip the masking rewrite (as there is nothing to mask).
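
The missing corner case can be modelled as a pair of predicates. The flag values and `struct insn_aux_model` below are illustrative assumptions, not the verifier's real constants or data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the kernel's real constants may differ. */
#define SKETCH_ALU_SANITIZE_SRC 0x1u
#define SKETCH_ALU_NON_POINTER  0x8u

struct insn_aux_model {
    unsigned int alu_state;   /* per-instruction sanitation state */
};

/* Before the fix: only a zero state skipped the masking rewrite. */
bool skips_masking_before(const struct insn_aux_model *aux)
{
    assert(aux != NULL);
    return aux->alu_state == 0;
}

/* After the fix: the non-pointer state is skipped as well. */
bool skips_masking_after(const struct insn_aux_model *aux)
{
    assert(aux != NULL);
    return aux->alu_state == 0 ||
           aux->alu_state == SKETCH_ALU_NON_POINTER;
}
```

In this model, a plain scalar subtraction such as `b - a` in the snippet carries the non-pointer state, so the old predicate let it be rewritten even though there was nothing to mask.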

Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
Reported-by: Marek Majkowski
Reported-by: Arthur Fabre
Signed-off-by: Daniel Borkmann
Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
Acked-by: Song Liu
Signed-off-by: Alexei Starovoitov

Daniel Borkmann
2019-03-02 13:24:08 +0800
4b9113045 bpf: fix u64_stats_init() usage in bpf_prog_alloc()

We need to iterate through all possible cpus.

Fixes: 492ecee892c2 ("bpf: enable program stats")
Signed-off-by: Eric Dumazet
Reported-by: Guenter Roeck
Tested-by: Guenter Roeck
Acked-by: Song Liu
Signed-off-by: Daniel Borkmann

Eric Dumazet
2019-03-02 07:31:36 +0800

01 Mar, 2019

2 commits

352d20d61 bpf: drop refcount if bpf_map_new_fd() fails in map_create()

In bpf/syscall.c, map_create() first sets map->usercnt to 1, because a
file descriptor is supposed to be returned to userspace. When
bpf_map_new_fd() fails, drop the refcount.
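
The error path can be modelled in user space like this. `struct map_model`, `new_fd_model()` and `create_map_fd()` are hypothetical names for the sketch, and the real kernel code releases the references through its own helper rather than by decrementing fields directly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct map_model {
    int refcnt;    /* references held on the map object */
    int usercnt;   /* references held on behalf of userspace fds */
};

/* Stand-in for fd allocation; fails with -EMFILE on request. */
int new_fd_model(int should_fail)
{
    return should_fail ? -EMFILE : 3;
}

int create_map_fd(struct map_model *map, int should_fail)
{
    int fd;

    assert(map != NULL);
    map->refcnt = 1;
    map->usercnt = 1;   /* taken for the fd that userspace should get */

    fd = new_fd_model(should_fail);
    if (fd < 0) {
        /* No fd reaches userspace, so give the references back. */
        map->usercnt--;
        map->refcnt--;
    }
    return fd;
}
```

Without the two decrements on the failure path, the counts would stay at 1 with no file descriptor left to ever release them.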

Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID")
Signed-off-by: Peng Sun
Acked-by: Martin KaFai Lau
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann

Peng Sun
2019-03-01 23:04:29 +0800
3fcc5530b bpf: fix build without bpf_syscall

wrap bpf_stats_enabled sysctl with #ifdef

Reported-by: Stephen Rothwell
Fixes: 492ecee892c2 ("bpf: enable program stats")
Signed-off-by: Alexei Starovoitov
Acked-by: Song Liu
Signed-off-by: Daniel Borkmann

Alexei Starovoitov
2019-03-01 07:44:58 +0800

28 Feb, 2019

3 commits

a115d0ed7 bpf: set inner_map_meta->spin_lock_off correctly

Commit d83525ca62cf ("bpf: introduce bpf_spin_lock") introduced
bpf_spin_lock, and the field spin_lock_off in the kernel-internal
structure bpf_map has the following meaning:
>=0 valid offset

spin_lock_off is not copied from the inner map to the map_in_map
inner_map_meta during a map_in_map type map creation, so
inner_map_meta->spin_lock_off = 0.
This gives the verifier the wrong information that inner_map has a
bpf_spin_lock and that the bpf_spin_lock is defined at offset 0. An
access to offset 0 of a value pointer will trigger the following error:
bpf_spin_lock cannot be accessed directly by load/store

This patch fixes the issue by copying the inner map's spin_lock_off
value to inner_map_meta->spin_lock_off.
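
A user-space model of the copy that was missing. `struct map_model` and both functions are hypothetical, and using -1 for "no lock in the value" is an assumption of this sketch; the text above only states that values >= 0 are valid offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct map_model {
    int spin_lock_off;         /* >= 0: offset of the lock in the value */
    unsigned int value_size;
};

/* Does the map's value contain a spin lock, according to its metadata? */
bool has_spin_lock(const struct map_model *map)
{
    assert(map != NULL);
    return map->spin_lock_off >= 0;
}

/* Build the metadata that describes an inner map to the verifier. */
void init_inner_meta(struct map_model *meta, const struct map_model *inner)
{
    assert(meta != NULL && inner != NULL);
    meta->value_size = inner->value_size;
    /* The fix: carry the lock offset over instead of leaving it at 0. */
    meta->spin_lock_off = inner->spin_lock_off;
}
```

If the last assignment is omitted and the metadata starts out zeroed, `has_spin_lock()` reports a lock at offset 0 for every inner map, which mirrors the false verifier error described above.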

Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock")
Signed-off-by: Yonghong Song
Acked-by: Andrii Nakryiko
Signed-off-by: Alexei Starovoitov

Yonghong Song
2019-02-28 09:03:13 +0800
5f8f8b93a bpf: expose program stats via bpf_prog_info

Return bpf program run_time_ns and run_cnt via bpf_prog_info

Signed-off-by: Alexei Starovoitov
Acked-by: Andrii Nakryiko
Signed-off-by: Daniel Borkmann

Alexei Starovoitov
2019-02-28 00:22:50 +0800
492ecee89 bpf: enable program stats

JITed BPF programs are indistinguishable from kernel functions, but unlike
kernel code, BPF code can be changed often.
The typical approach of "perf record" + "perf report" profiling and tuning of
kernel code works just as well for BPF programs, but kernel code doesn't
need to be monitored whereas BPF programs do.
Users load and run a large number of BPF programs.
These BPF stats allow tools to monitor the usage of BPF on the server.
The monitoring tools will turn sysctl kernel.bpf_stats_enabled
on and off for a few seconds to sample the average cost of the programs.
Aggregated data over hours and days will provide insight into the cost of BPF,
and alarms can trigger in case a given program suddenly gets more expensive.

Two sched_clock() calls per program invocation add ~20 nsec.
Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
from ~10 nsec to ~30 nsec.
A static_key minimizes the cost of the stats collection.
There is no measurable difference before/after this patch
with kernel.bpf_stats_enabled=0.
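
The "sample for a few seconds, then compute the average cost" step could look like this on the monitoring side. The field names follow the run_time_ns and run_cnt counters mentioned in this log; `struct prog_stats` and `avg_cost_ns()` are hypothetical, and how the samples are fetched is left out.

```c
#include <assert.h>
#include <stdint.h>

/* One snapshot of a program's counters (hypothetical container). */
struct prog_stats {
    uint64_t run_time_ns;   /* accumulated run time */
    uint64_t run_cnt;       /* number of invocations */
};

/* Average cost per invocation between two snapshots, in nanoseconds. */
uint64_t avg_cost_ns(struct prog_stats before, struct prog_stats after)
{
    uint64_t runs;

    assert(after.run_cnt >= before.run_cnt);
    assert(after.run_time_ns >= before.run_time_ns);

    runs = after.run_cnt - before.run_cnt;
    if (runs == 0)
        return 0;   /* program did not run during the window */
    return (after.run_time_ns - before.run_time_ns) / runs;
}
```

For example, 100 additional runs that accumulate 3000 ns between two snapshots average out to 30 ns per run.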

Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann

Alexei Starovoitov
2019-02-28 00:22:50 +0800

27 Feb, 2019

1 commit

781e62823 bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id()

In bpf/syscall.c, bpf_map_get_fd_by_id() uses bpf_map_inc_not_zero()
to increase both map->refcnt and map->usercnt. If bpf_map_new_fd()
then fails, map->usercnt must be dropped as well.

Fixes: bd5f5f4ecb78 ("bpf: Add BPF_MAP_GET_FD_BY_ID")
Signed-off-by: Peng Sun
Acked-by: Martin KaFai Lau
Signed-off-by: Daniel Borkmann

Peng Sun
2019-02-27 02:08:30 +0800

25 Feb, 2019

2 commits

70f352261 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Three conflicts, one of which, for marvell10g.c, is non-trivial and
requires some follow-up from Heiner or someone else.

The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.

However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.

Signed-off-by: David S. Miller

David S. Miller
2019-02-25 04:06:19 +0800
c4eb1e185 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Hopefully the last pull request for this release. Fingers crossed:

1) Only refcount ESP stats on full sockets, from Martin Willi.

2) Missing barriers in AF_UNIX, from Al Viro.

3) RCU protection fixes in ipv6 route code, from Paolo Abeni.

4) Avoid false positives in untrusted GSO validation, from Willem de
Bruijn.

5) Forwarded mesh packets in mac80211 need more tailroom allocated,
from Felix Fietkau.

6) Use operstate consistently for linkup in team driver, from George
Wilkie.

7) ThunderX bug fixes from Vadim Lomovtsev. Mostly races between VF
and PF code paths.

8) Purge ipv6 exceptions during netdevice removal, from Paolo Abeni.

9) nfp eBPF code gen fixes from Jiong Wang.

10) bnxt_en firmware timeout fix from Michael Chan.

11) Use after free in udp/udpv6 error handlers, from Paolo Abeni.

12) Fix a race in x25_bind triggerable by syzbot, from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
net: phy: realtek: Dummy IRQ calls for RTL8366RB
tcp: repaired skbs must init their tso_segs
net/x25: fix a race in x25_bind()
net: dsa: Remove documentation for port_fdb_prepare
Revert "bridge: do not add port to router list when receives query with source 0.0.0.0"
selftests: fib_tests: sleep after changing carrier. again.
net: set static variable an initial value in atl2_probe()
net: phy: marvell10g: Fix Multi-G advertisement to only advertise 10G
bpf, doc: add bpf list as secondary entry to maintainers file
udp: fix possible user after free in error handler
udpv6: fix possible user after free in error handler
fou6: fix proto error handler argument type
udpv6: add the required annotation to mib type
mdio_bus: Fix use-after-free on device_register fails
net: Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255
bnxt_en: Wait longer for the firmware message response to complete.
bnxt_en: Fix typo in firmware message timeout logic.
nfp: bpf: fix ALU32 high bits clearance bug
nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
Documentation: networking: switchdev: Update port parent ID section
...

Linus Torvalds
2019-02-25 01:28:26 +0800

23 Feb, 2019

1 commit

ea34a0036 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2019-02-23

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a bug in BPF's LPM deletion logic to match correct prefix
length, from Alban.

2) Fix AF_XDP teardown by not destroying umem prematurely as it
is still needed till all outstanding skbs are freed, from Björn.

3) Fix unkillable BPF_PROG_TEST_RUN under preempt kernel by checking
signal_pending() outside need_resched() condition which is never
triggered there, from Stanislav.

4) Fix two nfp JIT bugs, one in code emission for K-based xor, and
another one to explicitly clear upper bits in alu32, from Jiong.

5) Add bpf list address to maintainers file, from Daniel.
====================

Signed-off-by: David S. Miller

David S. Miller
2019-02-23 12:45:38 +0800

22 Feb, 2019

3 commits

7c0cdf0b3 bpf, lpm: fix lookup bug in map_delete_elem

trie_delete_elem() was deleting an entry whenever the prefixlen was
correct, even if the key data did not match. This patch adds a check on
matchlen.

Reproducer:

$ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
$ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
key: 10 00 00 00 aa bb cc dd value: 01
Found 1 element
$ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
$ echo $?
0
$ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
Found 0 elements

A similar reproducer is added in the selftests.

Without the patch:

$ sudo ./tools/testing/selftests/bpf/test_lpm_map
test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
Aborted

With the patch: test_lpm_map runs without errors.
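
The exact-match rule that the added check enforces can be sketched without a trie. `struct lpm_key`, `match_bits()` and `delete_matches()` are hypothetical and work on a single stored key. The sample values come from the reproducer, reading its leading bytes `10 00 00 00` as a prefix length of 16 on a little-endian host (an assumption).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct lpm_key {
    uint32_t prefixlen;   /* number of significant leading bits */
    uint8_t data[4];
};

/* Number of leading bits that a and b share, capped at `limit`. */
uint32_t match_bits(const uint8_t *a, const uint8_t *b, uint32_t limit)
{
    uint32_t n = 0;

    assert(limit <= 32);
    while (n < limit) {
        uint8_t mask = (uint8_t)(0x80u >> (n % 8));

        if ((a[n / 8] ^ b[n / 8]) & mask)
            break;
        n++;
    }
    return n;
}

/* Delete only on an exact match: same length and every prefix bit equal. */
bool delete_matches(const struct lpm_key *stored, const struct lpm_key *req)
{
    assert(stored != NULL && req != NULL);
    if (stored->prefixlen != req->prefixlen)
        return false;
    return match_bits(stored->data, req->data, stored->prefixlen) ==
           stored->prefixlen;
}
```

With the stored key `aa bb cc dd` at prefix length 16, a delete request for `ff ff ff ff` at the same length shares only one leading bit and is rejected, while `aa bb 00 00` matches all 16 bits and is accepted.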

Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
Cc: Craig Gallek
Signed-off-by: Alban Crequy
Acked-by: Craig Gallek
Signed-off-by: Daniel Borkmann

Alban Crequy
2019-02-22 23:17:53 +0800
e80d02dd7 seccomp, bpf: disable preemption before calling into bpf prog

All BPF programs must be called with preemption disabled.

Fixes: 568f196756ad ("bpf: check that BPF programs run with preemption disabled")
Reported-by: syzbot+8bf19ee2aa580de7a2a7@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann

Alexei Starovoitov
2019-02-22 07:14:19 +0800
4e37504d1 psi: avoid divide-by-zero crash inside virtual machines

We've been seeing hard-to-trigger psi crashes when running inside VM
instances:

divide error: 0000 [#1] SMP PTI
Modules linked in: [...]
CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: events psi_clock
RIP: 0010:psi_update_stats+0x270/0x490
RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
psi_clock+0x12/0x50
process_one_work+0x1e0/0x390
worker_thread+0x2b/0x3c0
? rescuer_thread+0x330/0x330
kthread+0x113/0x130
? kthread_create_worker_on_cpu+0x40/0x40
? SyS_exit_group+0x10/0x10
ret_from_fork+0x35/0x40
Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

The Code-line points to `period` being 0 inside update_stats(), and we
divide by that when calculating that period's pressure percentage.

The elapsed period should never be 0. The reason this can happen is due
to an off-by-one in the idle time / missing period calculation combined
with a coarse sched_clock() in the virtual machine.

The target time for aggregation is advanced into the future on a fixed
grid to prevent clock drift. So when an aggregation runs after some idle
period, we can not just set it to "now + psi_period", but have to
calculate the downtime and advance the target time relative to itself.

However, if the aggregator was disabled for exactly one psi_period (ns), we
drop one idle period in the calculation due to a > when we should do >=.
In that case, next_update will be advanced from 'now - psi_period' to
'now' when it should be moved to 'now + psi_period'. The run finishes
with last_update == next_update == sched_clock().

With hardware clocks, this exact nanosecond match isn't likely in the
first place; but if it does happen, the clock will still have moved on and
the period will be non-zero by the time the worker runs. A pointlessly short
period, but besides the extra work, no harm no foul. However, a slow
sched_clock() like we have on VMs might not have advanced either by the
time the worker runs again. And when we calculate the elapsed period, the
result, our pressure divisor, will be 0. Ouch.

Fix this by correctly handling the situation when the elapsed time between
aggregation runs is precisely two periods, and advance the expiration
timestamp correctly to one period into the future.
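
The off-by-one is easiest to see in a small userspace model. The sketch below is not the kernel code: advance_next_update(), its parameters, and the `inclusive` switch are hypothetical names chosen for illustration, and it only mirrors the grid arithmetic described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the aggregation deadline arithmetic (hypothetical
 * helper, not the kernel function). 'expires' is the previous deadline,
 * 'now' is the current clock reading, 'period' is the grid spacing in ns.
 * 'inclusive' selects the fixed comparison (>=) over the buggy one (>).
 */
static uint64_t advance_next_update(uint64_t expires, uint64_t now,
				    uint64_t period, bool inclusive)
{
	uint64_t missed = 0;
	uint64_t late;

	assert(period > 0);
	assert(now >= expires);

	late = now - expires;
	if (inclusive ? late >= period : late > period)
		missed = late / period;

	/* Stay on the fixed grid: step from the old deadline, not from 'now'. */
	return expires + (1 + missed) * period;
}
```

With expires = 1000, now = 1100 and period = 100, the `>` variant returns 1100, which equals `now`, while the `>=` variant returns 1200, one full period ahead.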

Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: Łukasz Siudut
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2019-02-22 01:01:00 +0800

20 Feb, 2019

3 commits

375ca548f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Two easily resolvable overlapping change conflicts, one in
TCP and one in the eBPF verifier.

Signed-off-by: David S. Miller

David S. Miller
2019-02-20 16:34:07 +0800
40e196a90 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking fixes from David Miller:

1) Fix suspend and resume in mt76x0u USB driver, from Stanislaw
Gruszka.

2) Missing memory barriers in xsk, from Magnus Karlsson.

3) rhashtable fixes in mac80211 from Herbert Xu.

4) 32-bit MIPS eBPF JIT fixes from Paul Burton.

5) Fix for_each_netdev_feature() on big endian, from Hauke Mehrtens.

6) GSO validation fixes from Willem de Bruijn.

7) Endianness fix for dwmac4 timestamp handling, from Alexandre Torgue.

8) More strict checks in tcp_v4_err(), from Eric Dumazet.

9) af_alg_release should NULL out the sk after the sock_put(), from Mao
Wenan.

10) Missing unlock in mac80211 mesh error path, from Wei Yongjun.

11) Missing device put in hns driver, from Salil Mehta.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
sky2: Increase D3 delay again
vhost: correctly check the return value of translate_desc() in log_used()
net: netcp: Fix ethss driver probe issue
net: hns: Fixes the missing put_device in positive leg for roce reset
net: stmmac: Fix a race in EEE enable callback
qed: Fix iWARP syn packet mac address validation.
qed: Fix iWARP buffer size provided for syn packet processing.
r8152: Add support for MAC address pass through on RTL8153-BD
mac80211: mesh: fix missing unlock on error in table_path_del()
net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
net: crypto set sk to NULL when af_alg_release.
net: Do not allocate page fragments that are not skb aligned
mm: Use fixed constant in page_frag_alloc instead of size + 1
tcp: tcp_v4_err() should be more careful
tcp: clear icsk_backoff in tcp_write_queue_purge()
net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
qmi_wwan: apply SET_DTR quirk to Sierra WP7607
net: stmmac: handle endianness in dwmac4_get_timestamp
doc: Mention MSG_ZEROCOPY implementation for UDP
mlxsw: __mlxsw_sp_port_headroom_set(): Fix a use of local variable
...

Linus Torvalds
2019-02-20 08:13:19 +0800
568f19675 bpf: check that BPF programs run with preemption disabled ... Browse Code »

Introduce cant_sleep() macro for annotation of functions that
cannot sleep.

Use it in BPF_PROG_RUN to catch execution of BPF programs in
preemptable context.

Suggested-by: Jann Horn
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Alexei Starovoitov
Acked-by: Song Liu
Signed-off-by: Daniel Borkmann
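
As a rough illustration of the annotation idea, here is a userspace model. Everything prefixed with model_ is a hypothetical stand-in: the kernel's real primitives are implemented differently, and this sketch counts violations instead of emitting a warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; the kernel's real primitives differ. */
static int model_preempt_count;	/* > 0 means "preemption disabled" */
static int model_violations;	/* how often the check fired */

static void model_preempt_disable(void)
{
	model_preempt_count++;
}

static void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}

/* Annotation-style check: record a violation when called preemptibly. */
static void model_cant_sleep(void)
{
	if (model_preempt_count == 0)
		model_violations++;
}

typedef int (*model_prog_fn)(const void *ctx);

/* Run a "program" with the check placed in front of it. */
static int model_prog_run(model_prog_fn prog, const void *ctx)
{
	model_cant_sleep();
	return prog(ctx);
}

/* Caller-side pattern: bracket the run with disable/enable. */
static int model_prog_run_safe(model_prog_fn prog, const void *ctx)
{
	int ret;

	model_preempt_disable();
	ret = model_prog_run(prog, ctx);
	model_preempt_enable();
	return ret;
}

/* Trivial "program" used for demonstration only. */
static int model_allow_all(const void *ctx)
{
	(void)ctx;
	return 0;
}
```

Calling model_prog_run() directly bumps model_violations, whereas going through model_prog_run_safe() leaves it unchanged.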

Peter Zijlstra
2019-02-20 04:53:07 +0800

19 Feb, 2019

1 commit

10f490217 Merge tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing fixes from Steven Rostedt:
"Two more tracing fixes

- Have kprobes not use copy_from_user() to access kernel addresses,
because kprobes can legitimately poke at bad kernel memory, which
will fault. Copy from user code should never fault in kernel space.
Using probe_mem_read() can handle kernel address space faulting.

- Put back the entries counter in the tracing output that was
accidentally removed"

* tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix number of entries in trace header
kprobe: Do not use uaccess functions to access kernel memory that can fault

Linus Torvalds
2019-02-19 01:40:16 +0800

18 Feb, 2019

1 commit

dd6f29da6 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf fixes from Ingo Molnar:
"Two fixes on the kernel side: fix an over-eager condition that failed
larger perf ring-buffer sizes, plus fix crashes in the Intel BTS code
for a corner case, found by fuzzing"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix impossible ring-buffer sizes warning
perf/x86: Add check_period PMU callback

Linus Torvalds
2019-02-18 00:38:13 +0800

17 Feb, 2019

2 commits

885e63195 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next ... Browse Code »

Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-02-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) numerous libbpf API improvements, from Andrii, Andrey, Yonghong.

2) test all bpf progs in alu32 mode, from Jiong.

3) skb->sk access and bpf_sk_fullsock(), bpf_tcp_sock() helpers, from Martin.

4) support for IP encap in lwt bpf progs, from Peter.

5) remove XDP_QUERY_XSK_UMEM dead code, from Jan.
====================

Signed-off-by: David S. Miller

David S. Miller
2019-02-17 14:56:34 +0800
6e1077f51 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf ... Browse Code »

Alexei Starovoitov says:

====================
pull-request: bpf 2019-02-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix lockdep false positive in bpf_get_stackid(), from Alexei.

2) several AF_XDP fixes, from Bjorn, Magnus, Davidlohr.

3) fix narrow load from struct bpf_sock, from Martin.

4) mips JIT fixes, from Paul.

5) gso handling fix in bpf helpers, from Willem.
====================

Signed-off-by: David S. Miller

David S. Miller
2019-02-17 14:34:07 +0800

16 Feb, 2019

3 commits

3313da818 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

The netfilter conflicts were rather simple overlapping
changes.

However, the cls_tcindex.c stuff was a bit more complex.

On the 'net' side, Cong is fixing several races and memory
leaks. Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.

What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU. I did it this way because I believe that either Cong's
races don't apply with how Vlad did things, or Cong will have to
implement the race fix slightly differently.

Signed-off-by: David S. Miller

David S. Miller
2019-02-16 04:38:38 +0800
9e7382153 tracing: Fix number of entries in trace header ... Browse Code »

The following commit

441dae8f2f29 ("tracing: Add support for display of tgid in trace output")

removed the call to print_event_info() from print_func_help_header_irq()
which results in the ftrace header not reporting the number of entries
written in the buffer. As this wasn't the original intent of the patch,
re-introduce the call to print_event_info() to restore the original
behaviour.

Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com

Acked-by: Joel Fernandes
Cc: stable@vger.kernel.org
Fixes: 441dae8f2f29 ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Quentin Perret
Signed-off-by: Steven Rostedt (VMware)

Quentin Perret
2019-02-16 01:42:26 +0800
2c4f1fcbe kprobe: Do not use uaccess functions to access kernel memory that can fault ... Browse Code »

Userspace can ask kprobe to intercept strings at any memory address,
including invalid kernel addresses. In this case, fetch_store_strlen()
would crash since it uses a general usercopy function, and user access
functions are no longer allowed to access kernel memory.

For example, we can crash the kernel by doing something as below:

$ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'

[ 103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?)
[ 103.622104] general protection fault: 0000 [#1] SMP PTI
[ 103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96
[ 103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014
[ 103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0
[ 103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63 ca 89 f8 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48
[ 103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246
[ 103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000
[ 103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000
[ 103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 103.644040] FS: 0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000
[ 103.646019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0
[ 103.649147] Call Trace:
[ 103.649781] ? sched_clock_cpu+0xc/0xa0
[ 103.650747] ? do_sys_open+0x5/0x220
[ 103.651635] kprobe_trace_func+0x303/0x380
[ 103.652645] ? do_sys_open+0x5/0x220
[ 103.653528] kprobe_dispatcher+0x45/0x50
[ 103.654682] ? do_sys_open+0x1/0x220
[ 103.655875] kprobe_ftrace_handler+0x90/0xf0
[ 103.657282] ftrace_ops_assist_func+0x54/0xf0
[ 103.658564] ? __call_rcu+0x1dc/0x280
[ 103.659482] 0xffffffffc00000bf
[ 103.660384] ? __ia32_sys_open+0x20/0x20
[ 103.661682] ? do_sys_open+0x1/0x220
[ 103.662863] do_sys_open+0x5/0x220
[ 103.663988] do_syscall_64+0x60/0x210
[ 103.665201] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 103.666862] RIP: 0033:0x7fc22fadccdd
[ 103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44
[ 103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[ 103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd
[ 103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c
[ 103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[ 103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8
[ 103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 103.688056] Modules linked in:
[ 103.689131] ---[ end trace 43792035c28984a1 ]---

This can be fixed by using probe_mem_read() instead, as it can handle faulting
kernel memory addresses, which kprobes can legitimately do.

Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com

Cc: stable@vger.kernel.org
Fixes: 9da3f2b7405 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses")
Signed-off-by: Changbin Du
Signed-off-by: Steven Rostedt (VMware)

Changbin Du
2019-02-16 01:41:23 +0800

15 Feb, 2019

1 commit

02d750408 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull signal fix from Eric Biederman:
"Just a single patch that restores PTRACE_EVENT_EXIT functionality that
was accidentally broken by last week's fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Restore the stop PTRACE_EVENT_EXIT

Linus Torvalds
2019-02-15 23:56:24 +0800

14 Feb, 2019

1 commit

b6ea7bcf7 Merge tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing fix from Steven Rostedt:
"This fixes kprobes/uprobes dynamic processing of strings, where it
processes the args but does not update the remaining length of the
buffer that the string arguments will be placed in. It constantly
passes in the total size of buffer used instead of passing in the
remaining size of the buffer used.

This could cause issues if the strings are larger than the max size of
an event which could cause the strings to be written beyond what was
reserved on the buffer"
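
The difference between passing the total size and passing the remaining size can be sketched in plain C. This is a simplified userspace model, not the tracing code: store_string() and pack_strings() are hypothetical helpers, and the buffer layout is just strings placed back to back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most 'maxlen' bytes of 'src' (including the NUL) to 'dst'.
 * Returns the number of bytes written. Hypothetical helper.
 */
static size_t store_string(char *dst, const char *src, size_t maxlen)
{
	size_t len;

	if (maxlen == 0)
		return 0;
	len = strlen(src) + 1;
	if (len > maxlen)
		len = maxlen;	/* truncate to the space that is left */
	memcpy(dst, src, len - 1);
	dst[len - 1] = '\0';
	return len;
}

/*
 * Pack 'n' strings back to back into 'buf' of 'size' bytes. The limit
 * handed to each copy is the space still left, not the total size.
 * Returns the number of bytes used, which never exceeds 'size'.
 */
static size_t pack_strings(char *buf, size_t size,
			   const char *const *strs, size_t n)
{
	size_t used = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		size_t remaining = size - used;	/* shrinks after every copy */

		used += store_string(buf + used, strs[i], remaining);
	}
	return used;
}
```

Had the loop passed `size` instead of `remaining` to every copy, the second string could be written past the end of `buf`, which is the class of overrun the pull message describes.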

* tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: probeevent: Correctly update remaining space in dynamic area

Linus Torvalds
2019-02-14 02:28:17 +0800

13 Feb, 2019

2 commits

cf43a757f signal: Restore the stop PTRACE_EVENT_EXIT ... Browse Code »

In the middle of do_exit() there is a call
"ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
in TASK_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits
for the debugger to release the task or SIGKILL to be delivered.

Skipping past dequeue_signal when we know a fatal signal has already
been delivered resulted in SIGKILL remaining pending and
TIF_SIGPENDING remaining set. This in turn caused the
scheduler to not sleep in PTRACE_EVENT_EXIT as it figured
a fatal signal was pending. This also caused ptrace_freeze_traced
in ptrace_check_attach to fail because it left a per thread
SIGKILL pending which is what fatal_signal_pending tests for.

This difference in signal state caused strace to report
strace: Exit of unknown pid NNNNN ignored

Therefore update the signal handling state like dequeue_signal
would when removing a per thread SIGKILL, by removing SIGKILL
from the per thread signal mask and clearing TIF_SIGPENDING.
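
The state change can be pictured with a toy userspace model. The struct, the model_ helpers, and the single-word pending mask below are assumptions made for illustration; the kernel keeps this state in different structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SIGKILL 9	/* SIGKILL's number on Linux */

/* Hypothetical per-thread signal state. */
struct model_thread {
	uint64_t pending;	/* one bit per pending per-thread signal */
	bool sigpending_flag;	/* models TIF_SIGPENDING */
};

static uint64_t model_sigmask(int sig)
{
	assert(sig >= 1 && sig <= 64);
	return UINT64_C(1) << (sig - 1);
}

/* Queue a signal for the thread and raise the "work pending" flag. */
static void model_send(struct model_thread *t, int sig)
{
	t->pending |= model_sigmask(sig);
	t->sigpending_flag = true;
}

/* Models the test described above: flag set and SIGKILL still pending. */
static bool model_fatal_signal_pending(const struct model_thread *t)
{
	return t->sigpending_flag &&
	       (t->pending & model_sigmask(MODEL_SIGKILL)) != 0;
}

/* Leave the state as a dequeue of SIGKILL would: bit gone, flag updated. */
static void model_consume_sigkill(struct model_thread *t)
{
	t->pending &= ~model_sigmask(MODEL_SIGKILL);
	t->sigpending_flag = (t->pending != 0);
}
```

In this model, a thread that skips model_consume_sigkill() keeps reporting a pending fatal signal, so anything that refuses to sleep or freeze while one is pending would keep refusing.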

Acked-by: Oleg Nesterov
Reported-by: Oleg Nesterov
Reported-by: Ivan Delalande
Cc: stable@vger.kernel.org
Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2019-02-13 22:31:41 +0800
528871b45 perf/core: Fix impossible ring-buffer sizes warning ... Browse Code »

The following commit:

9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")

results in perf recording failures with larger mmap areas:

root@skl:/tmp# perf record -g -a
failed to mmap with 12 (Cannot allocate memory)

The root cause is that the following condition is buggy:

if (order_base_2(size) >= MAX_ORDER)
goto fail;

The problem is that @size is in bytes and MAX_ORDER is in pages,
so the right test is:

if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
goto fail;

Fix it.
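
A quick userspace check makes the unit mismatch concrete. The constants below are assumed values for illustration (both are configuration dependent), and model_order_base_2() is a plain-C stand-in rather than the kernel macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration; both depend on the kernel config. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_MAX_ORDER  11

/* Smallest n such that (1 << n) >= size. */
static unsigned int model_order_base_2(size_t size)
{
	unsigned int order = 0;

	while (order < sizeof(size_t) * 8 - 1 && ((size_t)1 << order) < size)
		order++;
	return order;
}

/* Buggy test: compares a byte-based order against a page-based limit. */
static bool rejected_before_fix(size_t size_bytes)
{
	return model_order_base_2(size_bytes) >= MODEL_MAX_ORDER;
}

/* Fixed test: shifts the limit into bytes before comparing. */
static bool rejected_after_fix(size_t size_bytes)
{
	return model_order_base_2(size_bytes) >=
	       MODEL_PAGE_SHIFT + MODEL_MAX_ORDER;
}
```

Under these assumed constants, a 1 MiB buffer has a byte order of 20: the old test rejects it (20 >= 11) while the corrected test accepts it (20 < 23).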

Reported-by: "Jin, Yao"
Bisected-by: Borislav Petkov
Analyzed-by: Peter Zijlstra
Cc: Julien Thierry
Cc: Mark Rutland
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Greg Kroah-Hartman
Cc:
Fixes: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
Signed-off-by: Ingo Molnar

Ingo Molnar
2019-02-13 15:05:02 +0800