06 Mar, 2019

9 commits

  • Cgroup has a standardized poll/notification mechanism for waking all
    pollers on all fds when a filesystem node changes. To allow polling for
    custom events, add a .poll callback that can override the default.

    This is in preparation for pollable cgroup pressure files which have
    per-fd trigger configurations.
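
    The override pattern described above can be sketched in plain userspace C.
    This is only an illustration of "use the file's own .poll if present, else
    fall back to the default"; the struct names, event bits and functions
    below are hypothetical and are not the kernel's cgroup or kernfs API.

```c
#include <stddef.h>

/* Hypothetical event bits, not the kernel's poll flags. */
enum { EV_NONE = 0, EV_CHANGED = 1, EV_CUSTOM = 2 };

struct node;

/* Hypothetical per-file operations; .poll is optional. */
struct node_ops {
    int (*poll)(struct node *n);
};

struct node {
    const struct node_ops *ops;
    int changed;        /* set when the node's content changes */
    int custom_pending; /* set by a file-specific trigger */
};

/* Default behaviour: report an event whenever the node changed. */
int default_poll(struct node *n)
{
    return n->changed ? EV_CHANGED : EV_NONE;
}

/* Example custom callback for a file that has its own trigger state. */
int custom_poll(struct node *n)
{
    return n->custom_pending ? EV_CUSTOM : EV_NONE;
}

/* Dispatch: a file-provided .poll overrides the default. */
int node_poll(struct node *n)
{
    if (n->ops && n->ops->poll)
        return n->ops->poll(n);
    return default_poll(n);
}
```

    A node without ops keeps the default wake-on-change behaviour, while a
    node that supplies custom_poll() only reports its own events.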

    Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
    Signed-off-by: Johannes Weiner
    Signed-off-by: Suren Baghdasaryan
    Cc: Dennis Zhou
    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: Li Zefan
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Compaction is inherently race-prone as a suitable page freed during
    compaction can be allocated by any parallel task. This patch uses a
    capture_control structure to isolate a page immediately when it is freed
    by a direct compactor in the slow path of the page allocator. The
    intent is to avoid redundant scanning.

                                 5.0.0-rc1            5.0.0-rc1
                           selective-v3r17        capture-v3r19
    Amean fault-both-1      0.00 (  0.00%)       0.00 *  0.00%*
    Amean fault-both-3   2582.11 (  0.00%)    2563.68 (  0.71%)
    Amean fault-both-5   4500.26 (  0.00%)    4233.52 (  5.93%)
    Amean fault-both-7   5819.53 (  0.00%)    6333.65 ( -8.83%)
    Amean fault-both-12  9321.18 (  0.00%)    9759.38 ( -4.70%)
    Amean fault-both-18  9782.76 (  0.00%)   10338.76 ( -5.68%)
    Amean fault-both-24 15272.81 (  0.00%)   13379.55 * 12.40%*
    Amean fault-both-30 15121.34 (  0.00%)   16158.25 ( -6.86%)
    Amean fault-both-32 18466.67 (  0.00%)   18971.21 ( -2.73%)

    Latency is only moderately affected but the devil is in the details. A
    closer examination indicates that base page fault latency is reduced but
    latency of huge pages is increased as it takes greater care to succeed.
    Part of the "problem" is that allocation success rates are close to 100%
    even when under pressure and compaction gets harder.

                             5.0.0-rc1         5.0.0-rc1
                       selective-v3r17     capture-v3r19
    Percentage huge-3  96.70 (  0.00%)   98.23 (  1.58%)
    Percentage huge-5  96.99 (  0.00%)   95.30 ( -1.75%)
    Percentage huge-7  94.19 (  0.00%)   97.24 (  3.24%)
    Percentage huge-12 94.95 (  0.00%)   97.35 (  2.53%)
    Percentage huge-18 96.74 (  0.00%)   97.30 (  0.58%)
    Percentage huge-24 97.07 (  0.00%)   97.55 (  0.50%)
    Percentage huge-30 95.69 (  0.00%)   98.50 (  2.95%)
    Percentage huge-32 96.70 (  0.00%)   99.27 (  2.65%)

    And scan rates are reduced as expected by 6% for the migration scanner
    and 29% for the free scanner indicating that there is less redundant
    work.

    Compaction migrate scanned    20815362    19573286
    Compaction free scanned       16352612    11510663
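
    The quoted 6% and 29% figures follow directly from the two scan counters.
    A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
/* Percentage by which `after` is lower than `before`. */
double reduction_pct(unsigned long before, unsigned long after)
{
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

    Calling reduction_pct(20815362, 19573286) gives just under 6 for the
    migration scanner, and reduction_pct(16352612, 11510663) gives roughly
    29.6 for the free scanner.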

    [mgorman@techsingularity.net: remove redundant check]
    Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
    Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • sysctl_extfrag_handler() neglects to propagate the return value from
    proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need
    to exist, so just use proc_dointvec_minmax() directly.
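
    The bug class is easy to reproduce outside the kernel. In the sketch
    below, set_int_minmax() is a hypothetical stand-in for a range-checked
    setter (it is not proc_dointvec_minmax() and does not share its
    signature), and the 0..1000 bounds are illustrative only.

```c
#include <errno.h>

/* Hypothetical range-checked setter: rejects out-of-range values. */
int set_int_minmax(int *dst, int val, int min, int max)
{
    if (val < min || val > max)
        return -EINVAL;
    *dst = val;
    return 0;
}

/* Buggy wrapper: calls the helper but always reports success. */
int buggy_handler(int *dst, int val)
{
    set_int_minmax(dst, val, 0, 1000);
    return 0;
}
```

    With the wrapper, an out-of-range write leaves the value unchanged yet
    still returns 0. Calling the helper directly, as the patch does with the
    real function, lets the caller see the error.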

    Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: Aditya Pakki
    Acked-by: Mel Gorman
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where invalid node number is
    encoded as -1. Even though implicitly understood it is always better to
    have macros in there. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places redirecting
    them to a common definition.
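
    As a small userspace illustration of the replacement, the sketch below
    defines a local NUMA_NO_NODE with the value -1 (mirroring the open-coded
    value the patch replaces) and uses it where a bare -1 would otherwise
    appear. The struct and helper are hypothetical.

```c
/* Local stand-in for the kernel macro; the open-coded value was -1. */
#define NUMA_NO_NODE (-1)

/* Hypothetical device description carrying a preferred node id. */
struct dev_sketch {
    int nid;
};

/* Before the change this would read: dev->nid == -1 ? fallback : dev->nid */
int effective_node(const struct dev_sketch *dev, int fallback)
{
    return dev->nid == NUMA_NO_NODE ? fallback : dev->nid;
}
```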

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • The content of pages that are marked PG_offline is not of interest (e.g.
    inflated by a balloon driver), let's skip these pages.

    In saveable_highmem_page(), move the PageReserved() check to a new check
    along with the PageOffline() check to separate it from the swsusp
    checks.

    [david@redhat.com: v2]
    Link: http://lkml.kernel.org/r/20181122100627.5189-9-david@redhat.com
    Link: http://lkml.kernel.org/r/20181119101616.8901-9-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Pavel Machek
    Acked-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Cc: Len Brown
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: "Michael S. Tsirkin"
    Cc: Alexander Duyck
    Cc: Alexey Dobriyan
    Cc: Arnd Bergmann
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Christian Hansen
    Cc: Dave Young
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Jonathan Corbet
    Cc: Juergen Gross
    Cc: Julien Freche
    Cc: Kairui Song
    Cc: Kazuhito Hagio
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: "K. Y. Srinivasan"
    Cc: Lianbo Jiang
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Miles Chen
    Cc: Nadav Amit
    Cc: Naoya Horiguchi
    Cc: Omar Sandoval
    Cc: Pankaj gupta
    Cc: Pavel Tatashin
    Cc: "Rafael J. Wysocki"
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Stephen Rothwell
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xavier Deguillard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's use pfn_to_online_page() instead of pfn_to_page() when checking
    for saveable pages to not save/restore offline memory sections.

    Link: http://lkml.kernel.org/r/20181119101616.8901-8-david@redhat.com
    Signed-off-by: David Hildenbrand
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Pavel Machek
    Acked-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Matthew Wilcox
    Cc: "Michael S. Tsirkin"
    Cc: Alexander Duyck
    Cc: Alexey Dobriyan
    Cc: Arnd Bergmann
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Christian Hansen
    Cc: Dave Young
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Jonathan Corbet
    Cc: Juergen Gross
    Cc: Julien Freche
    Cc: Kairui Song
    Cc: Kazuhito Hagio
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: "K. Y. Srinivasan"
    Cc: Lianbo Jiang
    Cc: Mike Rapoport
    Cc: Miles Chen
    Cc: Nadav Amit
    Cc: Naoya Horiguchi
    Cc: Omar Sandoval
    Cc: Pankaj gupta
    Cc: Pavel Tatashin
    Cc: "Rafael J. Wysocki"
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Stephen Rothwell
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xavier Deguillard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Right now, pages inflated as part of a balloon driver will be dumped by
    dump tools like makedumpfile. While XEN is able to check in the crash
    kernel whether a certain pfn is actually backed by memory in the
    hypervisor (see xen_oldmem_pfn_is_ram) and optimize this case, dumps of
    other balloon inflated memory will essentially result in zero pages
    getting allocated by the hypervisor and the dump getting filled with
    this data.

    The allocation and reading of zero pages can directly be avoided if a
    dumping tool could know which pages only contain stale information not
    to be dumped.

    We now have PG_offline which can be (and already is by virtio-balloon)
    used for marking pages as logically offline. Follow up patches will
    make use of this flag also in other balloon implementations.

    Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so makedumpfile
    can directly skip pages that are logically offline and the content
    therefore stale.

    Please note that this is also helpful for a problem we were seeing under
    Hyper-V: Dumping logically offline memory (pages kept fake offline while
    onlining a section via online_page_callback) would under some conditions
    result in a kernel panic when dumping them.

    Link: http://lkml.kernel.org/r/20181119101616.8901-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michael S. Tsirkin
    Acked-by: Dave Young
    Cc: "Kirill A. Shutemov"
    Cc: Baoquan He
    Cc: Omar Sandoval
    Cc: Arnd Bergmann
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Lianbo Jiang
    Cc: Borislav Petkov
    Cc: Kazuhito Hagio
    Cc: Alexander Duyck
    Cc: Alexey Dobriyan
    Cc: Boris Ostrovsky
    Cc: Christian Hansen
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Jonathan Corbet
    Cc: Juergen Gross
    Cc: Julien Freche
    Cc: Kairui Song
    Cc: Konstantin Khlebnikov
    Cc: "K. Y. Srinivasan"
    Cc: Len Brown
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Miles Chen
    Cc: Nadav Amit
    Cc: Naoya Horiguchi
    Cc: Pankaj gupta
    Cc: Pavel Machek
    Cc: Pavel Tatashin
    Cc: Rafael J. Wysocki
    Cc: "Rafael J. Wysocki"
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Stephen Rothwell
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xavier Deguillard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Pull s390 updates from Martin Schwidefsky:

    - A copy of Arnd's compat wrapper generation series

    - Pass information about the KVM guest to the host in the form of the
    control program code and the control program version code

    - Map IOV resources to support PCI physical functions on s390

    - Add vector load and store alignment hints to improve performance

    - Use the "jdd" constraint with gcc 9 to make jump labels work again

    - Remove amode workaround for old z/VM releases from the DCSS code

    - Add support for in-kernel performance measurements using the CPU
    measurement counter facility

    - Introduce a new PMU device cpum_cf_diag to capture counters and store
    them as event raw data.

    - Bug fixes and cleanups

    * tag 's390-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
    Revert "s390/cpum_cf: Add kernel message exaplanations"
    s390/dasd: fix read device characteristic with CONFIG_VMAP_STACK=y
    s390/suspend: fix prefix register reset in swsusp_arch_resume
    s390: warn about clearing als implied facilities
    s390: allow overriding facilities via command line
    s390: clean up redundant facilities list setup
    s390/als: remove duplicated in-place implementation of stfle
    s390/cio: Use cpa range elsewhere within vfio-ccw
    s390/cio: Fix vfio-ccw handling of recursive TICs
    s390: vfio_ap: link the vfio_ap devices to the vfio_ap bus subsystem
    s390/cpum_cf: Handle EBUSY return code from CPU counter facility reservation
    s390/cpum_cf: Add kernel message exaplanations
    s390/cpum_cf_diag: Add support for s390 counter facility diagnostic trace
    s390/cpum_cf: add ctr_stcctm() function
    s390/cpum_cf: move common functions into a separate file
    s390/cpum_cf: introduce kernel_cpumcf_avail() function
    s390/cpu_mf: replace stcctm5() with the stcctm() function
    s390/cpu_mf: add store cpu counter multiple instruction support
    s390/cpum_cf: Add minimal in-kernel interface for counter measurements
    s390/cpum_cf: introduce kernel_cpumcf_alert() to obtain measurement alerts
    ...

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Here we go, another merge window full of networking and eBPF changes:

    1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP
    range without dealing with floods of ARP traffic, from Linus
    Lüssing.

    2) Throttle buffered multicast packet transmission in mt76, from
    Felix Fietkau.

    3) Support adaptive interrupt moderation in ice, from Brett Creeley.

    4) A lot of struct_size conversions, from Gustavo A. R. Silva.

    5) Add peek/push/pop commands to bpftool, as well as bash completion,
    from Stanislav Fomichev.

    6) Optimize sk_msg_clone(), from Vakul Garg.

    7) Add SO_BINDTOIFINDEX, from David Herrmann.

    8) Be more conservative with local resends due to local congestion,
    from Yuchung Cheng.

    9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata.

    10) Add health buffer support to devlink, from Eran Ben Elisha.

    11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen.

    12) Add statistics to basic packet scheduler filter, from Cong Wang.

    13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan.

    14) Lots of new IP tunneling forwarding tests, also from Nir Dotan.

    15) Add 3ad stats to bonding, from Nikolay Aleksandrov.

    16) Lots of probing improvements for bpftool, from Quentin Monnet.

    17) Various nfp driver eBPF JIT improvements from Jakub Kicinski.

    18) Allow eBPF programs to access gso_segs from skb shared info, from
    Eric Dumazet.

    19) Add sock_diag support for AF_XDP sockets, from Björn Töpel.

    20) Support 22260 iwlwifi devices, from Luca Coelho.

    21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov.

    22) Add JMP32 instruction class support to eBPF, from Jiong Wang.

    23) Add spinlock support to eBPF, from Alexei Starovoitov.

    24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson.

    25) Add device information API to devlink, from Jakub Kicinski.

    26) Add new timestamping socket options which are y2038 safe, from
    Deepa Dinamani.

    27) Add RX checksum offloading for various sh_eth chips, from Sergei
    Shtylyov.

    28) Flow offload infrastructure, from Pablo Neira Ayuso.

    29) Numerous cleanups, improvements, and bug fixes to the PHY layer
    and many drivers from Heiner Kallweit.

    30) Lots of changes to try and make packet scheduler classifiers run
    lockless as much as possible, from Vlad Buslov.

    31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows.

    32) Add concurrency tests to tc-tests infrastructure, from Vlad
    Buslov.

    33) Add hwmon support to aquantia, from Heiner Kallweit.

    34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet.

    And I would be remiss if I didn't thank the various major networking
    subsystem maintainers for integrating much of this work before I even
    saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso,
    Johannes Berg, Kalle Valo, and many others. Thank you!"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits)
    net/sched: avoid unused-label warning
    net: ignore sysctl_devconf_inherit_init_net without SYSCTL
    phy: mdio-mux: fix Kconfig dependencies
    net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg
    net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework
    selftest/net: Remove duplicate header
    sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79
    net/mlx5e: Update tx reporter status in case channels were successfully opened
    devlink: Add support for direct reporter health state update
    devlink: Update reporter state to error even if recover aborted
    sctp: call iov_iter_revert() after sending ABORT
    team: Free BPF filter when unregistering netdev
    ip6mr: Do not call __IP6_INC_STATS() from preemptible context
    isdn: mISDN: Fix potential NULL pointer dereference of kzalloc
    net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs
    cxgb4/chtls: Prefix adapter flags with CXGB4
    net-sysfs: Switch to bitmap_zalloc()
    mellanox: Switch to bitmap_zalloc()
    bpf: add test cases for non-pointer sanitiation logic
    mlxsw: i2c: Extend initialization by querying resources data
    ...

    Linus Torvalds
     

05 Mar, 2019

2 commits

  • Pull vfs fixes from Al Viro:
    "Assorted fixes that sat in -next for a while, all over the place"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    aio: Fix locking in aio_poll()
    exec: Fix mem leak in kernel_read_file
    copy_mount_string: Limit string length to PATH_MAX
    cgroup: saner refcounting for cgroup_root
    fix cgroup_do_mount() handling of failure exits

    Linus Torvalds
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-03-04

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Add AF_XDP support to libbpf. Rationale is to facilitate writing
    AF_XDP applications by offering higher-level APIs that hide many
    of the details of the AF_XDP uapi. Sample programs are converted
    over to this new interface as well, from Magnus.

    2) Introduce a new cant_sleep() macro for annotation of functions
    that cannot sleep and use it in BPF_PROG_RUN() to assert that
    BPF programs run under preemption disabled context, from Peter.

    3) Introduce per BPF prog stats in order to monitor the usage
    of BPF; this is controlled by kernel.bpf_stats_enabled sysctl
    knob where monitoring tools can make use of this to efficiently
    determine the average cost of programs, from Alexei.

    4) Split up BPF selftest's test_progs similarly as we already
    did with test_verifier. This allows to further reduce merge
    conflicts in future and to get more structure into our
    quickly growing BPF selftest suite, from Stanislav.

    5) Fix a bug in BTF's dedup algorithm which can cause an infinite
    loop in some circumstances; also various BPF doc fixes and
    improvements, from Andrii.

    6) Various BPF sample cleanups and migration to libbpf in order
    to further isolate the old sample loader code (so we can get
    rid of it at some point), from Jakub.

    7) Add a new BPF helper for BPF cgroup skb progs that allows
    to set ECN CE code point and a Host Bandwidth Manager (HBM)
    sample program for limiting the bandwidth used by v2 cgroups,
    from Lawrence.

    8) Enable write access to skb->queue_mapping from tc BPF egress
    programs in order to let BPF pick TX queue, from Jesper.

    9) Fix a bug in BPF spinlock handling for map-in-map which did
    not propagate spin_lock_off to the meta map, from Yonghong.

    10) Fix a bug in the new per-CPU BPF prog counters to properly
    initialize stats for each CPU, from Eric.

    11) Add various BPF helper prototypes to selftest's bpf_helpers.h,
    from Willem.

    12) Fix various BPF samples bugs in XDP and tracing progs,
    from Toke, Daniel and Yonghong.

    13) Silence preemption splat in test_bpf after BPF_PROG_RUN()
    enforces it now everywhere, from Anders.

    14) Fix a signedness bug in libbpf's btf_dedup_ref_type() to
    get error handling working, from Dan.

    15) Fix bpftool documentation and auto-completion with regards
    to stream_{verdict,parser} attach types, from Alban.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

03 Mar, 2019

1 commit


02 Mar, 2019

2 commits

  • Marek reported that he saw an issue with the below snippet in that
    timing measurements were off when loaded as unpriv while results
    were reasonable when loaded as privileged:

    [...]
    uint64_t a = bpf_ktime_get_ns();
    uint64_t b = bpf_ktime_get_ns();
    uint64_t delta = b - a;
    if ((int64_t)delta > 0) {
    [...]

    Turns out there is a bug where a corner case is missing in the fix
    d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar
    type from different paths"), namely fixup_bpf_calls() only checks
    whether aux has a non-zero alu_state, but it also needs to test for
    the case of BPF_ALU_NON_POINTER since in both occasions we need to
    skip the masking rewrite (as there is nothing to mask).
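
    The missing corner case can be modelled as a predicate deciding whether
    the masking rewrite is skipped. This is a sketch only: the struct is
    hypothetical and the numeric value given to BPF_ALU_NON_POINTER here is
    illustrative; only the two conditions come from the description above.

```c
#include <stdbool.h>

/* Illustrative value; only the name is taken from the commit text. */
#define BPF_ALU_NON_POINTER (1U << 2)

/* Hypothetical per-instruction auxiliary data. */
struct insn_aux_sketch {
    unsigned int alu_state;
};

/* Before the fix: only a zero alu_state skipped the rewrite. */
bool skip_masking_buggy(const struct insn_aux_sketch *aux)
{
    return aux->alu_state == 0;
}

/* After the fix: the non-pointer case is skipped as well. */
bool skip_masking_fixed(const struct insn_aux_sketch *aux)
{
    return aux->alu_state == 0 ||
           aux->alu_state == BPF_ALU_NON_POINTER;
}
```

    For an instruction marked BPF_ALU_NON_POINTER, the first predicate lets
    the rewrite proceed although there is nothing to mask, while the second
    skips it.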

    Fixes: d3bd7413e0ca ("bpf: fix sanitation of alu op with pointer / scalar type from different paths")
    Reported-by: Marek Majkowski
    Reported-by: Arthur Fabre
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/netdev/CAJPywTJqP34cK20iLM5YmUMz9KXQOdu1-+BZrGMAGgLuBWz7fg@mail.gmail.com/T/
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • We need to iterate through all possible cpus.

    Fixes: 492ecee892c2 ("bpf: enable program stats")
    Signed-off-by: Eric Dumazet
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Eric Dumazet
     

01 Mar, 2019

2 commits


28 Feb, 2019

3 commits

  • Commit d83525ca62cf ("bpf: introduce bpf_spin_lock") introduced
    bpf_spin_lock, and the field spin_lock_off in the kernel-internal
    structure bpf_map has the following meaning:
    >=0 valid offset, <0 error

    However, spin_lock_off is not copied from the inner map to the
    map_in_map inner_map_meta during creation of a map_in_map type map,
    so inner_map_meta->spin_lock_off = 0.
    This gives the verifier the wrong information that inner_map has a
    bpf_spin_lock and that the bpf_spin_lock is defined at offset 0. An
    access to offset 0 of a value pointer then triggers the following error:
    bpf_spin_lock cannot be accessed directly by load/store

    This patch fixes the issue by copying the inner map's spin_lock_off
    value to inner_map_meta->spin_lock_off.
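
    A minimal model of the problem, assuming that a negative spin_lock_off
    marks a map without a usable lock. The struct and function names are
    hypothetical; only the field name follows the commit text.

```c
/* Hypothetical, stripped-down map descriptor. */
struct map_sketch {
    int spin_lock_off; /* >=0: offset of the lock, negative: no lock */
};

/* Before the fix: the meta map starts zeroed and the field is not copied. */
struct map_sketch make_meta_buggy(const struct map_sketch *inner)
{
    struct map_sketch meta = { 0 };
    (void)inner; /* spin_lock_off is never propagated */
    return meta;
}

/* After the fix: the inner map's offset is carried over. */
struct map_sketch make_meta_fixed(const struct map_sketch *inner)
{
    struct map_sketch meta = { 0 };
    meta.spin_lock_off = inner->spin_lock_off;
    return meta;
}
```

    With an inner map that has no lock, the buggy variant yields offset 0,
    which reads as "lock at offset 0", whereas the fixed variant preserves
    the negative marker.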

    Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock")
    Signed-off-by: Yonghong Song
    Acked-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov

    Yonghong Song
     
  • Return bpf program run_time_ns and run_cnt via bpf_prog_info

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • JITed BPF programs are indistinguishable from kernel functions, but unlike
    kernel code, BPF code can be changed often.
    The typical approach of "perf record" + "perf report" profiling and tuning
    of kernel code works just as well for BPF programs, but kernel code doesn't
    need to be monitored whereas BPF programs do.
    Users load and run a large number of BPF programs.
    These BPF stats allow tools to monitor the usage of BPF on the server.
    The monitoring tools will turn sysctl kernel.bpf_stats_enabled
    on and off for a few seconds to sample the average cost of the programs.
    Aggregated data over hours and days will provide insight into the cost of
    BPF, and alarms can trigger in case a given program suddenly gets more
    expensive.

    The two sched_clock() calls per program invocation add ~20 nsec.
    Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
    from ~10 nsec to ~30 nsec.
    A static_key minimizes the cost of the stats collection.
    There is no measurable difference before/after this patch
    with kernel.bpf_stats_enabled=0.
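
    The accounting scheme can be approximated in userspace as two clock reads
    around each invocation, guarded by an enable flag. This is a sketch under
    assumptions: a plain int stands in for the sysctl and static_key,
    clock_gettime() stands in for sched_clock(), and all names other than
    run_cnt and run_time_ns are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

struct prog_stats {
    uint64_t run_cnt;     /* number of timed invocations */
    uint64_t run_time_ns; /* accumulated run time */
};

int stats_enabled; /* stand-in for kernel.bpf_stats_enabled */

uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Trivial stand-in for a program being run. */
int double_it(int x)
{
    return 2 * x;
}

/* Run prog; only pay for the two clock reads when stats are enabled. */
int run_prog(int (*prog)(int), int arg, struct prog_stats *st)
{
    if (!stats_enabled)
        return prog(arg);

    uint64_t start = now_ns();
    int ret = prog(arg);
    st->run_time_ns += now_ns() - start;
    st->run_cnt++;
    return ret;
}

/* Average cost per invocation, as a monitoring tool might compute it. */
uint64_t avg_cost_ns(const struct prog_stats *st)
{
    return st->run_cnt ? st->run_time_ns / st->run_cnt : 0;
}
```

    A monitoring tool following the description above would enable the flag
    briefly, read both counters, and divide run_time_ns by run_cnt.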

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

27 Feb, 2019

1 commit


25 Feb, 2019

2 commits

  • Three conflicts, one of which, for marvell10g.c is non-trivial and
    requires some follow-up from Heiner or someone else.

    The issue is that Heiner converted the marvell10g driver over to
    use the generic c45 code as much as possible.

    However, in 'net' a bug fix appeared which makes sure that a new
    local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
    is cleared.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:
    "Hopefully the last pull request for this release. Fingers crossed:

    1) Only refcount ESP stats on full sockets, from Martin Willi.

    2) Missing barriers in AF_UNIX, from Al Viro.

    3) RCU protection fixes in ipv6 route code, from Paolo Abeni.

    4) Avoid false positives in untrusted GSO validation, from Willem de
    Bruijn.

    5) Forwarded mesh packets in mac80211 need more tailroom allocated,
    from Felix Fietkau.

    6) Use operstate consistently for linkup in team driver, from George
    Wilkie.

    7) ThunderX bug fixes from Vadim Lomovtsev. Mostly races between VF
    and PF code paths.

    8) Purge ipv6 exceptions during netdevice removal, from Paolo Abeni.

    9) nfp eBPF code gen fixes from Jiong Wang.

    10) bnxt_en firmware timeout fix from Michael Chan.

    11) Use after free in udp/udpv6 error handlers, from Paolo Abeni.

    12) Fix a race in x25_bind triggerable by syzbot, from Eric Dumazet"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    net: phy: realtek: Dummy IRQ calls for RTL8366RB
    tcp: repaired skbs must init their tso_segs
    net/x25: fix a race in x25_bind()
    net: dsa: Remove documentation for port_fdb_prepare
    Revert "bridge: do not add port to router list when receives query with source 0.0.0.0"
    selftests: fib_tests: sleep after changing carrier. again.
    net: set static variable an initial value in atl2_probe()
    net: phy: marvell10g: Fix Multi-G advertisement to only advertise 10G
    bpf, doc: add bpf list as secondary entry to maintainers file
    udp: fix possible user after free in error handler
    udpv6: fix possible user after free in error handler
    fou6: fix proto error handler argument type
    udpv6: add the required annotation to mib type
    mdio_bus: Fix use-after-free on device_register fails
    net: Set rtm_table to RT_TABLE_COMPAT for ipv6 for tables > 255
    bnxt_en: Wait longer for the firmware message response to complete.
    bnxt_en: Fix typo in firmware message timeout logic.
    nfp: bpf: fix ALU32 high bits clearance bug
    nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
    Documentation: networking: switchdev: Update port parent ID section
    ...

    Linus Torvalds
     

23 Feb, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2019-02-23

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) Fix a bug in BPF's LPM deletion logic to match correct prefix
    length, from Alban.

    2) Fix AF_XDP teardown by not destroying umem prematurely as it
    is still needed till all outstanding skbs are freed, from Björn.

    3) Fix unkillable BPF_PROG_TEST_RUN under preempt kernel by checking
    signal_pending() outside need_resched() condition which is never
    triggered there, from Stanislav.

    4) Fix two nfp JIT bugs, one in code emission for K-based xor, and
    another one to explicitly clear upper bits in alu32, from Jiong.

    5) Add bpf list address to maintainers file, from Daniel.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Feb, 2019

3 commits

  • trie_delete_elem() was deleting an entry even though it was not matching
    if the prefixlen was correct. This patch adds a check on matchlen.

    Reproducer:

    $ sudo bpftool map create /sys/fs/bpf/mylpm type lpm_trie key 8 value 1 entries 128 name mylpm flags 1
    $ sudo bpftool map update pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 aa bb cc dd value hex 01
    $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
    key: 10 00 00 00 aa bb cc dd value: 01
    Found 1 element
    $ sudo bpftool map delete pinned /sys/fs/bpf/mylpm key hex 10 00 00 00 ff ff ff ff
    $ echo $?
    0
    $ sudo bpftool map dump pinned /sys/fs/bpf/mylpm
    Found 0 elements

    A similar reproducer is added in the selftests.

    Without the patch:

    $ sudo ./tools/testing/selftests/bpf/test_lpm_map
    test_lpm_map: test_lpm_map.c:485: test_lpm_delete: Assertion `bpf_map_delete_elem(map_fd, key) == -1 && errno == ENOENT' failed.
    Aborted

    With the patch: test_lpm_map runs without errors.
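
    The check that was missing can be sketched as follows. This is not the
    kernel's trie code: the key layout mirrors the 8-byte key from the
    reproducer (a 32-bit prefix length followed by four data bytes), and the
    function names are hypothetical. A delete should only hit a stored entry
    when the prefix lengths are equal and all prefix bits actually match.

```c
#include <stdbool.h>
#include <stdint.h>

struct lpm_key {
    uint32_t prefixlen; /* number of significant leading bits */
    uint8_t data[4];
};

/* Count how many leading bits of a and b agree, up to max_bits. */
unsigned int match_bits(const uint8_t *a, const uint8_t *b,
                        unsigned int max_bits)
{
    unsigned int n = 0;

    while (n < max_bits) {
        uint8_t mask = (uint8_t)(0x80u >> (n % 8));
        if ((a[n / 8] ^ b[n / 8]) & mask)
            break;
        n++;
    }
    return n;
}

/* Buggy: a matching prefix length alone is treated as a hit. */
bool delete_matches_buggy(const struct lpm_key *stored,
                          const struct lpm_key *key)
{
    return stored->prefixlen == key->prefixlen;
}

/* Fixed: the matched length must also cover the whole prefix. */
bool delete_matches_fixed(const struct lpm_key *stored,
                          const struct lpm_key *key)
{
    return stored->prefixlen == key->prefixlen &&
           match_bits(stored->data, key->data, key->prefixlen) ==
               key->prefixlen;
}
```

    With the reproducer's values (prefix length 16, stored data aa bb cc dd,
    delete key ff ff ff ff) the two keys diverge within the first 16 bits, so
    only the buggy predicate reports a match.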

    Fixes: e454cf595853 ("bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE")
    Cc: Craig Gallek
    Signed-off-by: Alban Crequy
    Acked-by: Craig Gallek
    Signed-off-by: Daniel Borkmann

    Alban Crequy
     
  • All BPF programs must be called with preemption disabled.

    Fixes: 568f196756ad ("bpf: check that BPF programs run with preemption disabled")
    Reported-by: syzbot+8bf19ee2aa580de7a2a7@syzkaller.appspotmail.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • We've been seeing hard-to-trigger psi crashes when running inside VM
    instances:

    divide error: 0000 [#1] SMP PTI
    Modules linked in: [...]
    CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
    Workqueue: events psi_clock
    RIP: 0010:psi_update_stats+0x270/0x490
    RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
    RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
    RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
    FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    psi_clock+0x12/0x50
    process_one_work+0x1e0/0x390
    worker_thread+0x2b/0x3c0
    ? rescuer_thread+0x330/0x330
    kthread+0x113/0x130
    ? kthread_create_worker_on_cpu+0x40/0x40
    ? SyS_exit_group+0x10/0x10
    ret_from_fork+0x35/0x40
    Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

    The Code-line points to `period` being 0 inside update_stats(), and we
    divide by that when calculating that period's pressure percentage.

    The elapsed period should never be 0. The reason this can happen is due
    to an off-by-one in the idle time / missing period calculation combined
    with a coarse sched_clock() in the virtual machine.

    The target time for aggregation is advanced into the future on a fixed
    grid to prevent clock drift. So when an aggregation runs after some idle
    period, we can not just set it to "now + psi_period", but have to
    calculate the downtime and advance the target time relative to itself.

    However, if the aggregator was disabled exactly one psi_period (ns), we
    drop one idle period in the calculation due to a > when we should do >=.
    In that case, next_update will be advanced from 'now - psi_period' to
    'now' when it should be moved to 'now + psi_period'. The run finishes
    with last_update == next_update == sched_clock().

    With hardware clocks, this exact nanosecond match isn't likely in the
    first place; but if it does happen, the clock will still have moved on and
    the period non-zero by the time the worker runs. A pointlessly short
    period, but besides the extra work, no harm no foul. However, a slow
    sched_clock() like we have on VMs might not have advanced either by the
    time the worker runs again. And when we calculate the elapsed period, the
    result, our pressure divisor, will be 0. Ouch.

    Fix this by correctly handling the situation when the elapsed time between
    aggregation runs is precisely two periods, and advance the expiration
    timestamp correctly to one period into the future.
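
    The off-by-one described above can be reproduced in a small userspace
    sketch. This is not the kernel code: struct agg_state, advance() and the
    `fixed` switch are hypothetical names for illustration, and time is in
    arbitrary units. It assumes the target time is moved forward by
    (1 + missed periods) * psi_period and that the elapsed period is measured
    from last_update plus the missed periods, as the message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the aggregator's bookkeeping. */
struct agg_state {
    uint64_t last_update;   /* time of the previous aggregation run */
    uint64_t next_update;   /* target time of the next run, on a fixed grid */
};

/*
 * Run one aggregation step at time `now` and return the elapsed period
 * that would be used as the pressure divisor.
 * fixed == false uses the buggy '>' comparison, fixed == true uses '>='.
 */
static uint64_t advance(struct agg_state *s, uint64_t now,
                        uint64_t psi_period, bool fixed)
{
    uint64_t expires = s->next_update;
    uint64_t missed_periods = 0;
    uint64_t late, period;

    assert(psi_period > 0 && now >= expires);

    /* How far past the target time are we? */
    late = now - expires;
    if (fixed ? late >= psi_period : late > psi_period)
        missed_periods = late / psi_period;

    /* Advance the target relative to itself to stay on the grid. */
    s->next_update = expires + (1 + missed_periods) * psi_period;

    /* Elapsed time of the current period, excluding idle periods. */
    period = now - (s->last_update + missed_periods * psi_period);
    s->last_update = now;
    return period;
}
```

    With psi_period = 2000, last_update = 1000 and next_update = 3000, a run
    at now = 5000 is exactly one period late. The buggy comparison leaves
    next_update at 5000, so a second run on a clock that has not advanced
    computes a period of 0. With >= the target moves to 7000 and the period
    is 2000.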

    Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Łukasz Siudut
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

20 Feb, 2019

3 commits

  • Two easily resolvable overlapping change conflicts, one in
    TCP and one in the eBPF verifier.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:

    1) Fix suspend and resume in mt76x0u USB driver, from Stanislaw
    Gruszka.

    2) Missing memory barriers in xsk, from Magnus Karlsson.

    3) rhashtable fixes in mac80211 from Herbert Xu.

    4) 32-bit MIPS eBPF JIT fixes from Paul Burton.

    5) Fix for_each_netdev_feature() on big endian, from Hauke Mehrtens.

    6) GSO validation fixes from Willem de Bruijn.

    7) Endianness fix for dwmac4 timestamp handling, from Alexandre Torgue.

    8) More strict checks in tcp_v4_err(), from Eric Dumazet.

    9) af_alg_release should NULL out the sk after the sock_put(), from Mao
    Wenan.

    10) Missing unlock in mac80211 mesh error path, from Wei Yongjun.

    11) Missing device put in hns driver, from Salil Mehta.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
    sky2: Increase D3 delay again
    vhost: correctly check the return value of translate_desc() in log_used()
    net: netcp: Fix ethss driver probe issue
    net: hns: Fixes the missing put_device in positive leg for roce reset
    net: stmmac: Fix a race in EEE enable callback
    qed: Fix iWARP syn packet mac address validation.
    qed: Fix iWARP buffer size provided for syn packet processing.
    r8152: Add support for MAC address pass through on RTL8153-BD
    mac80211: mesh: fix missing unlock on error in table_path_del()
    net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
    net: crypto set sk to NULL when af_alg_release.
    net: Do not allocate page fragments that are not skb aligned
    mm: Use fixed constant in page_frag_alloc instead of size + 1
    tcp: tcp_v4_err() should be more careful
    tcp: clear icsk_backoff in tcp_write_queue_purge()
    net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
    qmi_wwan: apply SET_DTR quirk to Sierra WP7607
    net: stmmac: handle endianness in dwmac4_get_timestamp
    doc: Mention MSG_ZEROCOPY implementation for UDP
    mlxsw: __mlxsw_sp_port_headroom_set(): Fix a use of local variable
    ...

    Linus Torvalds
     
  • Introduce cant_sleep() macro for annotation of functions that
    cannot sleep.

    Use it in BPF_PROG_RUN to catch execution of BPF programs in
    preemptible context.
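
    As a rough userspace analogy only (none of these names are kernel APIs;
    sketch_preempt_count, sketch_cant_sleep() and run_prog() are invented for
    illustration), an annotation of this kind can be pictured as a check that
    reports a violation when the caller has not disabled preemption:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model: a count > 0 means "preemption is disabled". */
static int sketch_preempt_count;
static int sketch_violations;

static void sketch_preempt_disable(void) { sketch_preempt_count++; }

static void sketch_preempt_enable(void)
{
    assert(sketch_preempt_count > 0);
    sketch_preempt_count--;
}

static void sketch_report(const char *file, int line)
{
    sketch_violations++;
    fprintf(stderr, "%s:%d: called from preemptible context\n", file, line);
}

/* Annotation: complain if the caller could be preempted here. */
#define sketch_cant_sleep()                              \
    do {                                                 \
        if (sketch_preempt_count == 0)                   \
            sketch_report(__FILE__, __LINE__);           \
    } while (0)

/* Hypothetical "program runner" that carries the annotation. */
static int run_prog(int (*prog)(int), int ctx)
{
    sketch_cant_sleep();
    return prog(ctx);
}

/* Trivial program used to exercise the runner. */
static int sketch_double(int ctx) { return 2 * ctx; }
```

    Calling run_prog() without a surrounding sketch_preempt_disable() still
    runs the program but records one violation, which is the kind of
    misuse the annotation is meant to surface.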

    Suggested-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Peter Zijlstra
     

19 Feb, 2019

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Two more tracing fixes

    - Have kprobes not use copy_from_user() to access kernel addresses,
    because kprobes can legitimately poke at bad kernel memory, which
    will fault. Copy from user code should never fault in kernel space.
    Using probe_mem_read() can handle kernel address space faulting.

    - Put back the entries counter in the tracing output that was
    accidentally removed"

    * tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix number of entries in trace header
    kprobe: Do not use uaccess functions to access kernel memory that can fault

    Linus Torvalds
     

18 Feb, 2019

1 commit


17 Feb, 2019

2 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2019-02-16

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) numerous libbpf API improvements, from Andrii, Andrey, Yonghong.

    2) test all bpf progs in alu32 mode, from Jiong.

    3) skb->sk access and bpf_sk_fullsock(), bpf_tcp_sock() helpers, from Martin.

    4) support for IP encap in lwt bpf progs, from Peter.

    5) remove XDP_QUERY_XSK_UMEM dead code, from Jan.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2019-02-16

    The following pull-request contains BPF updates for your *net* tree.

    The main changes are:

    1) fix lockdep false positive in bpf_get_stackid(), from Alexei.

    2) several AF_XDP fixes, from Bjorn, Magnus, Davidlohr.

    3) fix narrow load from struct bpf_sock, from Martin.

    4) mips JIT fixes, from Paul.

    5) gso handling fix in bpf helpers, from Willem.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Feb, 2019

3 commits

  • The netfilter conflicts were rather simple overlapping
    changes.

    However, the cls_tcindex.c stuff was a bit more complex.

    On the 'net' side, Cong is fixing several races and memory
    leaks. Whilst on the 'net-next' side we have Vlad adding
    the rtnl-ness support.

    What I've decided to do, in order to resolve this, is revert the
    conversion over to using a workqueue that Cong did, bringing us back
    to pure RCU. I did it this way because I believe that either Cong's
    races don't apply with how Vlad did things, or Cong will have to
    implement the race fix slightly differently.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The following commit

    441dae8f2f29 ("tracing: Add support for display of tgid in trace output")

    removed the call to print_event_info() from print_func_help_header_irq()
    which results in the ftrace header not reporting the number of entries
    written in the buffer. As this wasn't the original intent of the patch,
    re-introduce the call to print_event_info() to restore the original
    behaviour.

    Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com

    Acked-by: Joel Fernandes
    Cc: stable@vger.kernel.org
    Fixes: 441dae8f2f29 ("tracing: Add support for display of tgid in trace output")
    Signed-off-by: Quentin Perret
    Signed-off-by: Steven Rostedt (VMware)

    Quentin Perret
     
  • Userspace can ask kprobes to fetch strings from any memory address,
    including invalid kernel addresses. In this case, fetch_store_strlen()
    would crash because it uses the general usercopy functions, and user access
    functions are no longer allowed to access kernel memory.

    For example, we can crash the kernel by doing something as below:

    $ sudo kprobe 'p:do_sys_open +0(+0(%si)):string'

    [ 103.620391] BUG: GPF in non-whitelisted uaccess (non-canonical address?)
    [ 103.622104] general protection fault: 0000 [#1] SMP PTI
    [ 103.623424] CPU: 10 PID: 1046 Comm: cat Not tainted 5.0.0-rc3-00130-gd73aba1-dirty #96
    [ 103.625321] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-2-g628b2e6-dirty-20190104_103505-linux 04/01/2014
    [ 103.628284] RIP: 0010:process_fetch_insn+0x1ab/0x4b0
    [ 103.629518] Code: 10 83 80 28 2e 00 00 01 31 d2 31 ff 48 8b 74 24 28 eb 0c 81 fa ff 0f 00 00 7f 1c 85 c0 75 18 66 66 90 0f ae e8 48 63 ca 89 f8 0c 31 66 66 90 83 c2 01 84 c9 75 dc 89 54 24 34 89 44 24 28 48
    [ 103.634032] RSP: 0018:ffff88845eb37ce0 EFLAGS: 00010246
    [ 103.635312] RAX: 0000000000000000 RBX: ffff888456c4e5a8 RCX: 0000000000000000
    [ 103.637057] RDX: 0000000000000000 RSI: 2e646c2f6374652f RDI: 0000000000000000
    [ 103.638795] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    [ 103.640556] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
    [ 103.642297] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [ 103.644040] FS: 0000000000000000(0000) GS:ffff88846f000000(0000) knlGS:0000000000000000
    [ 103.646019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 103.647436] CR2: 00007ffc79758038 CR3: 0000000463360006 CR4: 0000000000020ee0
    [ 103.649147] Call Trace:
    [ 103.649781] ? sched_clock_cpu+0xc/0xa0
    [ 103.650747] ? do_sys_open+0x5/0x220
    [ 103.651635] kprobe_trace_func+0x303/0x380
    [ 103.652645] ? do_sys_open+0x5/0x220
    [ 103.653528] kprobe_dispatcher+0x45/0x50
    [ 103.654682] ? do_sys_open+0x1/0x220
    [ 103.655875] kprobe_ftrace_handler+0x90/0xf0
    [ 103.657282] ftrace_ops_assist_func+0x54/0xf0
    [ 103.658564] ? __call_rcu+0x1dc/0x280
    [ 103.659482] 0xffffffffc00000bf
    [ 103.660384] ? __ia32_sys_open+0x20/0x20
    [ 103.661682] ? do_sys_open+0x1/0x220
    [ 103.662863] do_sys_open+0x5/0x220
    [ 103.663988] do_syscall_64+0x60/0x210
    [ 103.665201] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 103.666862] RIP: 0033:0x7fc22fadccdd
    [ 103.668034] Code: 48 89 54 24 e0 41 83 e2 40 75 32 89 f0 25 00 00 41 00 3d 00 00 41 00 74 24 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 3d 00 f0 ff ff 77 33 f3 c3 66 0f 1f 84 00 00 00 00 00 48 8d 44
    [ 103.674029] RSP: 002b:00007ffc7972c3a8 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
    [ 103.676512] RAX: ffffffffffffffda RBX: 0000562f86147a21 RCX: 00007fc22fadccdd
    [ 103.678853] RDX: 0000000000080000 RSI: 00007fc22fae1428 RDI: 00000000ffffff9c
    [ 103.681151] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 103.683489] R10: 0000000000000000 R11: 0000000000000287 R12: 00007fc22fce90a8
    [ 103.685774] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
    [ 103.688056] Modules linked in:
    [ 103.689131] ---[ end trace 43792035c28984a1 ]---

    This can be fixed by using probe_mem_read() instead, as it can handle
    faults on kernel memory addresses, which kprobes can legitimately trigger.

    Link: http://lkml.kernel.org/r/20190125151051.7381-1-changbin.du@gmail.com

    Cc: stable@vger.kernel.org
    Fixes: 9da3f2b7405 ("x86/fault: BUG() when uaccess helpers fault on kernel addresses")
    Signed-off-by: Changbin Du
    Signed-off-by: Steven Rostedt (VMware)

    Changbin Du
     

15 Feb, 2019

1 commit


14 Feb, 2019

1 commit

  • Pull tracing fix from Steven Rostedt:
    "This fixes kprobes/uprobes dynamic processing of strings, where it
    processes the args but does not update the remaining length of the
    buffer that the string arguments will be placed in. It constantly
    passes in the total size of buffer used instead of passing in the
    remaining size of the buffer used.

    This could cause issues if the strings are larger than the max size of
    an event which could cause the strings to be written beyond what was
    reserved on the buffer"

    * tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: probeevent: Correctly update remaining space in dynamic area
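
    The bookkeeping at issue can be illustrated with a userspace sketch. It
    is not the tracing code: store_string() and store_all() are hypothetical
    helpers, and the only point shown is that the remaining size, not the
    total size, must be passed for each successive string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `src` into `dst`, writing at most `maxlen` bytes including the
 * terminating NUL. Returns the number of bytes written (0 if no room).
 */
static size_t store_string(char *dst, size_t maxlen, const char *src)
{
    size_t len;

    if (maxlen == 0)
        return 0;
    len = strlen(src);
    if (len >= maxlen)
        len = maxlen - 1;       /* truncate to what still fits */
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len + 1;
}

/*
 * Store all strings back to back in a fixed-size dynamic area.
 * Returns the number of bytes used.
 */
static size_t store_all(char *area, size_t area_size,
                        const char *const *args, size_t nargs)
{
    char *dyndata = area;
    size_t remaining = area_size;
    size_t i;

    for (i = 0; i < nargs; i++) {
        size_t written = store_string(dyndata, remaining, args[i]);

        assert(written <= remaining);
        dyndata += written;
        remaining -= written;   /* shrink the budget after every string */
    }
    return area_size - remaining;
}
```

    If `remaining` were left at area_size on every iteration, the second
    string could be written past the end of the reserved area; shrinking it
    after each store truncates the string instead.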

    Linus Torvalds
     

13 Feb, 2019

2 commits

  • In the middle of do_exit() there is a call
    "ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
    in TASK_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
    the debugger to release the task or SIGKILL to be delivered.

    Skipping past dequeue_signal when we know a fatal signal has already
    been delivered resulted in SIGKILL remaining pending and
    TIF_SIGPENDING remaining set. This in turn caused the
    scheduler to not sleep in PTRACE_EVENT_EXIT as it figured
    a fatal signal was pending. This also caused ptrace_freeze_traced
    in ptrace_check_attach to fail because it left a per thread
    SIGKILL pending which is what fatal_signal_pending tests for.

    This difference in signal state caused strace to report
    strace: Exit of unknown pid NNNNN ignored

    Therefore update the signal handling state like dequeue_signal
    would when removing a per thread SIGKILL, by removing SIGKILL
    from the per thread signal mask and clearing TIF_SIGPENDING.
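
    A simplified model of that state update, as a userspace sketch. The
    struct and helpers below are hypothetical and ignore blocked and shared
    pending signals; the boolean only stands in for TIF_SIGPENDING, and
    SKETCH_SIGKILL is a local constant, not the system header value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SIGKILL 9u   /* illustrative signal number */

/* Hypothetical per-thread signal state. */
struct thread_sigstate {
    uint64_t pending;       /* bitmask of per-thread pending signals */
    bool sigpending_flag;   /* stands in for TIF_SIGPENDING */
};

/* Bit for signal number `sig` (1-based). */
static uint64_t sigbit(unsigned int sig)
{
    assert(sig >= 1 && sig <= 64);
    return UINT64_C(1) << (sig - 1);
}

/* Model of the test described above: flag set and SIGKILL pending. */
static bool fatal_signal_pending(const struct thread_sigstate *t)
{
    return t->sigpending_flag &&
           (t->pending & sigbit(SKETCH_SIGKILL)) != 0;
}

/*
 * What the shortcut has to do when it skips the normal dequeue path:
 * drop SIGKILL from the pending set and recompute the flag.
 */
static void consume_fatal_signal(struct thread_sigstate *t)
{
    t->pending &= ~sigbit(SKETCH_SIGKILL);
    t->sigpending_flag = (t->pending != 0);
}
```

    Without the consume step, fatal_signal_pending() keeps returning true in
    this model, which mirrors why the task could not sleep in
    PTRACE_EVENT_EXIT.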

    Acked-by: Oleg Nesterov
    Reported-by: Oleg Nesterov
    Reported-by: Ivan Delalande
    Cc: stable@vger.kernel.org
    Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The following commit:

    9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")

    results in perf recording failures with larger mmap areas:

    root@skl:/tmp# perf record -g -a
    failed to mmap with 12 (Cannot allocate memory)

    The root cause is that the following condition is buggy:

    if (order_base_2(size) >= MAX_ORDER)
        goto fail;

    The problem is that @size is in bytes and MAX_ORDER is in pages,
    so the right test is:

    if (order_base_2(size) >= PAGE_SHIFT+MAX_ORDER)
        goto fail;

    Fix it.
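
    The unit mismatch can be checked with a small userspace sketch. The
    values SKETCH_PAGE_SHIFT = 12 and SKETCH_MAX_ORDER = 11 are assumptions
    for illustration (they vary by configuration), and order_base_2() here
    is a local reimplementation that returns the smallest order with
    2^order >= n.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u   /* assumed: 4 KiB pages */
#define SKETCH_MAX_ORDER  11u   /* assumed value, for illustration */

/* Smallest order such that (1 << order) >= n; 0 for n <= 1. */
static unsigned int order_base_2(uint64_t n)
{
    unsigned int order = 0;

    while (order < 64 && (UINT64_C(1) << order) < n)
        order++;
    return order;
}

/* Buggy test: compares a byte order against a page order. */
static bool rejected_buggy(uint64_t size_bytes)
{
    assert(size_bytes > 0);
    return order_base_2(size_bytes) >= SKETCH_MAX_ORDER;
}

/* Fixed test: converts the page-order limit to a byte order first. */
static bool rejected_fixed(uint64_t size_bytes)
{
    assert(size_bytes > 0);
    return order_base_2(size_bytes) >= SKETCH_PAGE_SHIFT + SKETCH_MAX_ORDER;
}
```

    Under these assumptions a 4 MiB size (order 22 in bytes) is rejected by
    the buggy test but accepted by the fixed one, while an 8 MiB size is
    rejected by both.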

    Reported-by: "Jin, Yao"
    Bisected-by: Borislav Petkov
    Analyzed-by: Peter Zijlstra
    Cc: Julien Thierry
    Cc: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Greg Kroah-Hartman
    Cc:
    Fixes: 9dff0aa95a32 ("perf/core: Don't WARN() for impossible ring-buffer sizes")
    Signed-off-by: Ingo Molnar

    Ingo Molnar