18 Sep, 2020

1 commit

  • Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
    the page locking entirely fair, in that if a waiter came in while the
    lock was held, the lock would be transferred to the lockers strictly in
    order.

    That was intended to finally get rid of the long-reported watchdog
    failures that involved the page lock under extreme load, where a process
    could end up waiting essentially forever, as other page lockers stole
    the lock from under it.

    It also improved some benchmarks, but it ended up causing huge
    performance regressions on others, simply because fair lock behavior
    doesn't end up giving out the lock as aggressively, causing better
    worst-case latency, but potentially much worse average latencies and
    throughput.

    Instead of reverting that change entirely, this introduces a controlled
    amount of unfairness, with a sysctl knob to tune it if somebody needs
    to. But the default value should hopefully be good for any normal load,
    allowing a few rounds of lock stealing, but enforcing the strict
    ordering before the lock has been stolen too many times.

    There is also a hint from Matthieu Baerts that the fair page coloring
    may end up exposing an ABBA deadlock that is hidden by the usual
    optimistic lock stealing, and while the unfairness doesn't fix the
    fundamental issue (and I'm still looking at that), it avoids it in
    practice.

    The amount of unfairness can be modified by writing a new value to the
    'sysctl_page_lock_unfairness' variable (default value of 5, exposed
    through /proc/sys/vm/page_lock_unfairness), but that is hopefully
    something we'd use mainly for debugging rather than being necessary for
    any deep system tuning.
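
    A minimal sketch of poking the knob from C (the proc path is the one
    quoted above; error handling is kept to a minimum):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "w");

            if (!f)
                    return 1;
            fprintf(f, "%d\n", 10); /* allow more steals than the default 5 */
            fclose(f);
            return 0;
    }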

    This whole issue has exposed just how critical the page lock can be, and
    how contended it gets under certain loads. And the main contention
    doesn't really seem to be anything related to IO (which was the origin
    of this lock), but for things like just verifying that the page file
    mapping is stable while faulting in the page into a page table.

    Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
    Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
    Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
    Reported-and-tested-by: Michael Larabel
    Tested-by: Matthieu Baerts
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: Amir Goldstein
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 Aug, 2020

1 commit

  • Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    changed ctl_table.proc_handler to take a kernel pointer. Adjust the
    signature of bpf_stats_handler to match ctl_table.proc_handler which
    fixes the following sparse warning:

    kernel/sysctl.c:226:49: warning: incorrect type in argument 3 (different address spaces)
    kernel/sysctl.c:226:49: expected void *
    kernel/sysctl.c:226:49: got void [noderef] __user *buffer
    kernel/sysctl.c:2640:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
    kernel/sysctl.c:2640:35: expected int ( [usertype] *proc_handler )( ... )
    kernel/sysctl.c:2640:35: got int ( * )( ... )
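
    For reference, the signature change in question looks roughly like this
    (a sketch of the before/after prototypes; the surrounding struct is
    elided):

    /* before 32927393dc1c: handlers took a userspace buffer */
    int (*proc_handler)(struct ctl_table *ctl, int write,
                        void __user *buffer, size_t *lenp, loff_t *ppos);

    /* after: the sysctl core passes a kernel buffer instead */
    int (*proc_handler)(struct ctl_table *ctl, int write,
                        void *buffer, size_t *lenp, loff_t *ppos);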

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Signed-off-by: Tobias Klauser
    Signed-off-by: Alexei Starovoitov
    Cc: Christoph Hellwig
    Link: https://lore.kernel.org/bpf/20200824142047.22043-1-tklauser@distanz.ch

    Tobias Klauser
     

13 Aug, 2020

2 commits

  • Proactive compaction uses per-node/zone "fragmentation score" which is
    always in range [0, 100], so use unsigned type of these scores as well as
    for related constants.

    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     
  • For some applications, we need to allocate almost all memory as hugepages.
    However, on a running system, higher-order allocations can fail if the
    memory is fragmented. Linux kernel currently does on-demand compaction as
    we request more hugepages, but this style of compaction incurs very high
    latency. Experiments with one-time full memory compaction (followed by
    hugepage allocations) show that the kernel is able to restore a highly
    fragmented memory state to a fairly compacted memory state within <1 sec
    for a 32G system. Such data suggests that a more proactive compaction
    can help us allocate a large fraction of memory as hugepages while
    keeping allocation latencies low. To that end, this patch defines a new
    sysctl, 'vm.compaction_proactiveness', which dictates the bounds of the
    per-node "fragmentation score" that kcompactd tries to maintain.

    1. Kernel hugepage allocation latencies

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    sysctl -w vm.compaction_proactiveness=20

    percentile latency
    –––––––––– –––––––
    5 2
    10 2
    25 3
    30 3
    40 3
    50 4
    60 4
    75 4
    80 4
    90 5
    95 429

    Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
    total free => 98% of free memory could be allocated as hugepages)

    2. JAVA heap allocation

    In this test, we first fragment memory using the same method as for (1).

    Then, we start a Java process with a heap size set to 700G and request the
    heap to be allocated with THP hugepages. We also set THP to madvise to
    allow hugepage backing of this heap.

    /usr/bin/time
    java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

    The above command allocates 700G of Java heap using hugepages.

    - With vanilla 5.6.0-rc3

    17.39user 1666.48system 27:37.89elapsed

    - With 5.6.0-rc3 + this patch, with proactiveness=20

    8.35user 194.58system 3:19.62elapsed

    Elapsed time remains around 3:15, as proactiveness is further increased.

    Note that proactive compaction happens throughout the runtime of these
    workloads. The situation of one-time compaction, sufficient to supply
    hugepages for following allocation stream, can probably happen for more
    extreme proactiveness values, like 80 or 90.

    In the above Java workload, proactiveness is set to 20. The test starts
    with a node's score of 80 or higher, depending on the delay between the
    fragmentation step and starting the benchmark, which gives more-or-less
    time for the initial round of compaction. As the benchmark consumes
    hugepages, node's score quickly rises above the high threshold (90) and
    proactive compaction starts again, which brings down the score to the low
    threshold level (80). Repeat.
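
    A minimal sketch of the threshold arithmetic described above (the exact
    kernel helper may clamp differently; the function name here is made up):

    /* proactiveness in [0, 100] maps to a [low, high] score window */
    static unsigned int score_wmark(unsigned int proactiveness, int low)
    {
            unsigned int wmark_low = 100U - proactiveness;  /* 20 -> 80 */

            return low ? wmark_low : wmark_low + 10;        /* 20 -> 90 */
    }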

    bpftrace also confirms proactive compaction running 20+ times during the
    runtime of this Java benchmark. kcompactd threads consume 100% of one of
    the CPUs while they try to bring a node's score within thresholds.

    Backoff behavior
    ================

    Above workloads produce a memory state which is easy to compact. However,
    if memory is filled with unmovable pages, proactive compaction should
    essentially back off. To test this aspect:

    - Created a kernel driver that allocates almost all memory as hugepages
    followed by freeing the first 3/4 of each hugepage.
    - Set proactiveness=40.
    - Note that proactive_compact_node() is deferred the maximum number of
    times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
    (=> ~30 seconds between retries).

    [1] https://patchwork.kernel.org/patch/11098289/
    [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
    [3] https://lwn.net/Articles/817905/

    Signed-off-by: Nitin Gupta
    Signed-off-by: Andrew Morton
    Tested-by: Oleksandr Natalenko
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Khalid Aziz
    Reviewed-by: Oleksandr Natalenko
    Cc: Vlastimil Babka
    Cc: Khalid Aziz
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc: Mike Kravetz
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Nitin Gupta
    Cc: Oleksandr Natalenko
    Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
    Signed-off-by: Linus Torvalds

    Nitin Gupta
     

08 Aug, 2020

1 commit

  • When checking a performance change for will-it-scale scalability mmap test
    [1], we found very high lock contention for spinlock of percpu counter
    'vm_committed_as':

    94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

    Actually this heavy lock contention is not always necessary. The
    'vm_committed_as' needs to be very precise when the strict
    OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
    for the percpu counter.

    So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
    lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also
    add a sysctl handler to adjust it when the policy is reconfigured.
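
    The policy-to-batch mapping is, schematically (illustrative values only;
    OVERCOMMIT_NEVER corresponds to vm.overcommit_memory=2):

    /* keep the percpu counter precise only when it has to be */
    static int committed_as_batch(int overcommit_policy, int base_batch)
    {
            if (overcommit_policy == 2)     /* OVERCOMMIT_NEVER */
                    return base_batch;      /* small batch, precise sum */
            return base_batch * 64;         /* ALWAYS/GUESS: 64X lift */
    }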

    Benchmark with the same testcase in [1] shows 53% improvement on an
    8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. We tested with
    test platforms in 0day (server, desktop and laptop), and 80%+ of the
    platforms show improvements with that test. Whether a platform shows
    improvement depends on whether the test's mmap size is bigger than the
    computed batch number.

    And if the lift is 16X, 1/3 of the platforms will show improvements,
    though it should help the mmap/unmap usage generally, as Michal Hocko
    mentioned:

    : I believe that there are non-synthetic workloads which would benefit from
    : a larger batch. E.g. large in memory databases which do large mmaps
    : during startups from multiple threads.

    [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

    Signed-off-by: Feng Tang
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Matthew Wilcox (Oracle)
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Qian Cai
    Cc: Kees Cook
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: Dave Hansen
    Cc: Huang Ying
    Cc: Christoph Lameter
    Cc: Dennis Zhou
    Cc: Haiyang Zhang
    Cc: kernel test robot
    Cc: "K. Y. Srinivasan"
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
    Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
    Signed-off-by: Linus Torvalds

    Feng Tang
     

29 Jul, 2020

1 commit

  • RT tasks by default run at the highest capacity/performance level. When
    uclamp is selected this default behavior is retained by enforcing the
    requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
    uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
    value.

    This is also referred to as 'the default boost value of RT tasks'.

    See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

    On battery powered devices, it is desired to control this default
    (currently hardcoded) behavior at runtime to reduce energy consumed by
    RT tasks.

    For example, for a mobile device manufacturer where the big.LITTLE
    architecture is dominant, the performance of the little cores varies
    across SoCs, and on high-end ones the big cores could be too power
    hungry.

    Given the diversity of SoCs, the new knob allows manufacturers to tune
    the best performance/power for RT tasks for the particular hardware they
    run on.

    They could opt to further tune the value when the user selects
    a different power saving mode or when the device is actively charging.

    The runtime aspect of it further helps in creating a single kernel image
    that can be run on multiple devices that require different tuning.

    Keep in mind that a lot of RT tasks in the system are created by the
    kernel. On Android, for instance, I can see over 50 RT tasks, only
    a handful of which are created by the Android framework.

    To let system admins and device integrators control the default behavior
    globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
    to change the default boost value of the RT tasks.

    I anticipate this to be mostly in the form of modifying the init script
    of a particular device.
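
    A minimal sketch of what such an init-script replacement could do from C
    (the proc filename is inferred from the sysctl name above and should be
    treated as an assumption):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/sched_util_clamp_min_rt_default", "w");

            if (!f)
                    return 1;
            /* ~12.5% of SCHED_CAPACITY_SCALE (1024) instead of full boost */
            fprintf(f, "%d\n", 128);
            fclose(f);
            return 0;
    }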

    To avoid polluting the fast path with unnecessary code, the approach
    taken is to synchronously do the update by traversing all the existing
    tasks in the system. This could race with a concurrent fork(), which is
    dealt with by introducing a sched_post_fork() function that ensures the
    racy fork gets the right update applied.

    Tested on Juno-r2 in combination with the RT capacity awareness [1].
    By default an RT task will go to the highest capacity CPU and run at the
    maximum frequency, which is particularly energy inefficient on high end
    mobile devices because the biggest core[s] are 'huge' and power hungry.

    With this patch the RT task can be controlled to run anywhere by
    default, and doesn't cause the frequency to be maximum all the time.
    Yet any task that really needs to be boosted can easily escape this
    default behavior by modifying its requested uclamp.min value
    (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

    [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com

    Qais Yousef
     

09 Jun, 2020

4 commits

  • Users with SYS_ADMIN capability can add arbitrary taint flags to the
    running kernel by writing to /proc/sys/kernel/tainted or issuing the
    command 'sysctl -w kernel.tainted=...'. This interface, however, accepts
    any integer value, which might cause an invalid set of flags to be
    committed to the tainted_mask bitset.

    This patch introduces a simple way for proc_taint() to ignore any
    eventual invalid bit coming from the user input before committing those
    bits to the kernel tainted_mask.
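
    For illustration, a valid taint bit is still set the usual way (sketch;
    needs CAP_SYS_ADMIN), while bits outside the valid taint flags are now
    dropped before being committed:

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/tainted", "w");

            if (!f)
                    return 1;
            fprintf(f, "%lu\n", 1UL << 5); /* 0x20: TAINT_BAD_PAGE */
            fclose(f);
            return 0;
    }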

    Signed-off-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Reviewed-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: "Theodore Ts'o"
    Link: http://lkml.kernel.org/r/20200512223946.888020-1-aquini@redhat.com
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Usually when the kernel reaches an oops condition, it's a point of no
    return; in case not enough debug information is available in the kernel
    splat, one of the last resorts would be to collect a kernel crash dump
    and analyze it. The problem with this approach is that in order to
    collect the dump, a panic is required (to kexec-load the crash kernel).
    When in an environment of multiple virtual machines, users may prefer to
    try living with the oops, at least until being able to properly shutdown
    their VMs / finish their important tasks.

    This patch implements a way to collect a bit more debug detail when an
    oops event is reached, by printing all the CPUs' backtraces through the
    use of NMIs (on architectures that support that). The sysctl added (and
    documented) here is called "oops_all_cpu_backtrace", and when set will
    (as the name suggests) dump all CPUs' backtraces.
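
    A sketch of opting in from userspace (the path follows the sysctl name
    above):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/oops_all_cpu_backtrace", "w");

            if (!f)
                    return 1;
            fputs("1\n", f); /* dump all CPUs' backtraces on oops */
            fclose(f);
            return 0;
    }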

    Far from ideal, this may nevertheless be the last option for users that
    for some reason cannot panic on oops. Most of the time oopses are clear
    enough to indicate the kernel portion that must be investigated, but in
    virtual environments it's possible to observe hypervisor/KVM issues that
    could lead to oopses shown in other guests' CPUs (like virtual APIC
    crashes). This patch hence aims to help debug such complex issues
    without resorting to kdump.

    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Cc: Luis Chamberlain
    Cc: Iurii Zaikin
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Randy Dunlap
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
    Signed-off-by: Linus Torvalds

    Guilherme G. Piccoli
     
  • Commit 401c636a0eeb ("kernel/hung_task.c: show all hung tasks before
    panic") introduced a change in that we started to show all CPUs
    backtraces when a hung task is detected _and_ the sysctl/kernel
    parameter "hung_task_panic" is set. The idea is good, because usually
    when observing deadlocks (that may lead to hung tasks), the culprit is
    another task holding a lock and not necessarily the task detected as
    hung.

    The problem with this approach is that dumping backtraces is a somewhat
    expensive task, especially when printing them on the console (and
    especially on many-CPU machines, such as the servers commonly found
    nowadays). So, users that plan to collect a kdump to investigate the
    hung tasks and narrow down the deadlock definitely don't need the CPUs'
    backtraces on dmesg/console, which delay the panic and pollute the log
    (the crash tool can easily grab all CPUs' traces with the 'bt -a'
    command).

    Also, there's the reciprocal scenario: some users may be interested in
    seeing the CPUs' backtraces but not in having the system panic when a
    hung task is detected. The current approach hence almost amounts to
    embedding a policy in the kernel, by forcing the CPUs' backtrace dump
    (only) on hung_task_panic.

    This patch decouples the panic event on hung task from the CPUs'
    backtrace dump, by creating (and documenting) a new sysctl called
    "hung_task_all_cpu_backtrace", analogous to the approach taken for
    soft/hard lockups, which have both a panic and an "all_cpu_backtrace"
    sysctl to allow individual control. The new mechanism for dumping the
    CPUs' backtraces on hung task detection respects "hung_task_warnings"
    by not dumping the traces when there are no warnings left.
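
    A sketch of one now-possible policy combination (backtraces without a
    panic; paths follow the sysctl names above):

    #include <stdio.h>

    static int set_knob(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return -1;
            fputs(val, f);
            return fclose(f);
    }

    int main(void)
    {
            set_knob("/proc/sys/kernel/hung_task_all_cpu_backtrace", "1\n");
            set_knob("/proc/sys/kernel/hung_task_panic", "0\n");
            return 0;
    }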

    Signed-off-by: Guilherme G. Piccoli
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com
    Signed-off-by: Linus Torvalds

    Guilherme G. Piccoli
     
  • Analogously to the introduction of panic_on_warn, this patch introduces
    a kernel option named panic_on_taint in order to provide a simple and
    generic way to stop execution and catch a coredump when the kernel gets
    tainted by any given flag.

    This is useful for debugging sessions as it avoids having to rebuild the
    kernel to explicitly add calls to panic() into the code sites that
    introduce the taint flags of interest.

    For instance, if one is interested in proceeding with a post-mortem
    analysis at the point a given code path is hitting a bad page (e.g.
    unaccount_page_cache_page(), or slab_bug()), a coredump can be collected
    by rebooting the kernel with 'panic_on_taint=0x20' amended to the
    command line.

    Another, perhaps less frequent, use for this option would be as a means
    of enforcing a security policy where only a subset of taints, or no
    taint at all (in paranoid mode), is allowed for the running system. The
    optional switch 'nousertaint' is handy in this particular scenario, as
    it avoids userspace-induced crashes when writes to the sysctl interface
    /proc/sys/kernel/tainted would cause false-positive hits for such
    policies.

    [akpm@linux-foundation.org: tweak kernel-parameters.txt wording]

    Suggested-by: Qian Cai
    Signed-off-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Reviewed-by: Luis Chamberlain
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Jonathan Corbet
    Cc: Kees Cook
    Cc: Randy Dunlap
    Cc: "Theodore Ts'o"
    Cc: Adrian Bunk
    Cc: Greg Kroah-Hartman
    Cc: Laura Abbott
    Cc: Jeff Mahoney
    Cc: Jiri Kosina
    Cc: Takashi Iwai
    Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.com
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

04 Jun, 2020

3 commits

  • Merge more updates from Andrew Morton:
    "More mm/ work, plenty more to come

    Subsystems affected by this patch series: slub, memcg, gup, kasan,
    pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
    thp, mmap, kconfig"

    * akpm: (131 commits)
    arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    riscv: support DEBUG_WX
    mm: add DEBUG_WX support
    drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
    mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
    powerpc/mm: drop platform defined pmd_mknotpresent()
    mm: thp: don't need to drain lru cache when splitting and mlocking THP
    hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
    sparc32: register memory occupied by kernel as memblock.memory
    include/linux/memblock.h: fix minor typo and unclear comment
    mm, mempolicy: fix up gup usage in lookup_node
    tools/vm/page_owner_sort.c: filter out unneeded line
    mm: swap: memcg: fix memcg stats for huge pages
    mm: swap: fix vmstats for huge pages
    mm: vmscan: limit the range of LRU type balancing
    mm: vmscan: reclaim writepage is IO cost
    mm: vmscan: determine anon/file pressure balance at the reclaim root
    mm: balance LRU lists based on relative thrashing
    mm: only count actual rotations as LRU reclaim cost
    ...

    Linus Torvalds
     
  • With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
    devices such as zswap, it's possible for swap to be much faster than
    filesystems, and for swapping to be preferable over thrashing filesystem
    caches.

    Allow setting swappiness - which defines the rough relative IO cost of
    cache misses between page cache and swap-backed pages - to reflect such
    situations by making the swap-preferred range configurable.
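
    A sketch of expressing "swap is cheaper than the filesystem" with the
    extended range (this series raises the accepted maximum above the old
    100, to 200; 180 here is just an example value):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/swappiness", "w");

            if (!f)
                    return 1;
            fprintf(f, "%d\n", 180); /* prefer swap over cache reclaim */
            fclose(f);
            return 0;
    }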

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     

11 May, 2020

1 commit

  • The variable sysctl_panic_on_stackoverflow is used in
    arch/parisc/kernel/irq.c and arch/x86/kernel/irq_32.c, but the sysctl file
    interface panic_on_stackoverflow only exists on x86.

    Add a sysctl file interface panic_on_stackoverflow for parisc.

    Signed-off-by: Xiaoming Ni
    Reviewed-by: Luis Chamberlain
    Signed-off-by: Helge Deller

    Xiaoming Ni
     

06 May, 2020

1 commit

  • The newly added bpf_stats_handler function has the wrong #ifdef
    check around it, leading to an unused-function warning when
    CONFIG_SYSCTL is disabled:

    kernel/sysctl.c:205:12: error: unused function 'bpf_stats_handler' [-Werror,-Wunused-function]
    static int bpf_stats_handler(struct ctl_table *table, int write,

    Fix the check to match the reference.
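
    The corrected guard looks roughly like this (a sketch; the point is that
    the handler must be compiled in exactly when the sysctl table entry
    referencing it is):

    #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL)
    static int bpf_stats_handler(struct ctl_table *table, int write,
                                 void __user *buffer, size_t *lenp,
                                 loff_t *ppos)
    {
            /* ... toggle the bpf stats static key under a mutex ... */
            return 0;
    }
    #endif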

    Fixes: d46edd671a14 ("bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Luis Chamberlain
    Acked-by: Martin KaFai Lau
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200505140734.503701-1-arnd@arndb.de

    Arnd Bergmann
     

02 May, 2020

1 commit

  • Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
    Typical userspace tools use kernel.bpf_stats_enabled as follows:

    1. Enable kernel.bpf_stats_enabled;
    2. Check program run_time_ns;
    3. Sleep for the monitoring period;
    4. Check program run_time_ns again, calculate the difference;
    5. Disable kernel.bpf_stats_enabled.

    The problem with this approach is that only one userspace tool can toggle
    this sysctl. If multiple tools toggle the sysctl at the same time, the
    measurement may be inaccurate.

    To fix this problem while keeping backward compatibility, introduce a new
    bpf command BPF_ENABLE_STATS. On success, this command enables stats and
    returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
    only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
    command to support other types of stats in the future.

    With BPF_ENABLE_STATS, user space tool would have the following flow:

    1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
    2. Check program run_time_ns;
    3. Sleep for the monitoring period;
    4. Check program run_time_ns again, calculate the difference;
    5. Close the fd.
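
    A sketch of that fd-based flow using the raw bpf(2) syscall (libbpf
    provides a bpf_enable_stats() wrapper for the same operation):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            union bpf_attr attr;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.enable_stats.type = BPF_STATS_RUN_TIME;

            /* stats stay enabled as long as at least one such fd is open */
            fd = syscall(__NR_bpf, BPF_ENABLE_STATS, &attr, sizeof(attr));
            if (fd < 0)
                    return 1;
            /* ... sample run_time_ns over the monitoring period ... */
            close(fd); /* closing the last such fd disables stats again */
            return 0;
    }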

    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com

    Song Liu
     

03 Apr, 2020

2 commits

  • Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages"),
    compaction is allowed to examine mlocked pages and compact them by
    default. On -RT, even minor pagefaults are problematic because they may
    take a few hundred microseconds to resolve, and until then the task is
    blocked.

    Make compact_unevictable_allowed = 0 the default and issue a warning on
    RT if it is changed.

    [bigeasy@linutronix.de: v5]
    Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
    Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     
  • The proc file `compact_unevictable_allowed' should allow 0 and 1 only.
    The `extra*' attributes have been set properly, but without
    proc_dointvec_minmax() as the `proc_handler' the limit is not enforced.

    Use proc_dointvec_minmax() as the `proc_handler' to enforce the
    specified range.
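
    The resulting table entry looks roughly like this (a sketch; field
    values are taken from the description above):

    static struct ctl_table vm_table[] = {
            {
                    .procname     = "compact_unevictable_allowed",
                    .data         = &sysctl_compact_unevictable_allowed,
                    .maxlen       = sizeof(int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec_minmax, /* enforces extra1/2 */
                    .extra1       = SYSCTL_ZERO,
                    .extra2       = SYSCTL_ONE,
            },
            { }
    };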

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: Iurii Zaikin
    Cc: Mel Gorman
    Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

07 Mar, 2020

1 commit

  • Many embedded boards have a disconnected TTL level serial which can
    generate some garbage that can lead to spurious false sysrq detections.

    Currently, sysrq can be either completely disabled for serial console
    or always disabled (with CONFIG_MAGIC_SYSRQ_SERIAL), since
    commit 732dbf3a6104 ("serial: do not accept sysrq characters via serial port")

    At Arista, we have such boards that can generate BREAK and random
    garbage. While disabling sysrq for the serial console would solve
    the problem with spurious false sysrq triggers, it's also desirable
    to have a way to turn sysrq back on.

    Having a way to enable sysrq has proven beneficial for debugging
    lockups in the field through manual investigation, while on the other
    hand still preventing false sysrq detections.

    As a preparation for adding a sysrq_toggle_support() call into uart,
    remove the private copy of sysrq_enabled from sysctl - it should
    reflect the actual status of sysrq.

    Furthermore, the private copy is already incorrect when
    sysrq_always_enabled is true. So, remove __sysrq_enabled and use the
    getter-helper sysrq_mask() to check the sysrq_key_op enabled status.

    Cc: Iurii Zaikin
    Cc: Jiri Slaby
    Cc: Luis Chamberlain
    Cc: Kees Cook
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Dmitry Safonov
    Link: https://lore.kernel.org/r/20200302175135.269397-2-dima@arista.com
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Safonov
     

19 Feb, 2020

1 commit

  • s390 math emulation was removed with commit 5a79859ae0f3 ("s390:
    remove 31 bit support"), rendering ieee_emulation_warnings useless.
    The code still built because it was protected by CONFIG_MATHEMU, which
    was no longer selectable.

    This patch removes the sysctl_ieee_emulation_warnings declaration and
    the sysctl entry declaration.

    Link: https://lkml.kernel.org/r/20200214172628.3598516-1-steve@sk2.org
    Reviewed-by: Vasily Gorbik
    Signed-off-by: Stephen Kitt
    Signed-off-by: Vasily Gorbik

    Stephen Kitt
     

10 Dec, 2019

1 commit

  • Currently PREEMPT_RCU and TREE_RCU are mutually exclusive Kconfig
    options. But PREEMPT_RCU actually specifies a kind of TREE_RCU,
    namely a preemptible TREE_RCU. This commit therefore makes PREEMPT_RCU
    be a modifier to the TREE_RCU Kconfig option. This has the benefit of
    simplifying several of the #if expressions that formerly needed to
    check both, but now need only check one or the other.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Lai Jiangshan
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney

    Lai Jiangshan
     

02 Dec, 2019

1 commit

  • Currently, the drop_caches proc file and sysctl read back the last value
    written, suggesting this is somehow a stateful setting instead of a
    one-time command. Make it write-only, like e.g. compact_memory.

    While mitigating a VM problem at scale in our fleet, there was confusion
    about whether writing to this file will permanently switch the kernel into
    a non-caching mode. This influences the decision making in a tense
    situation, where tens of people are trying to fix tens of thousands of
    affected machines: Do we need a rollback strategy? What are the
    performance implications of operating in a non-caching state for several
    days? It also caused confusion when the kernel team said we may need to
    write the file several times to make sure it's effective ("But it already
    reads back 3?").

    Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Chris Down
    Acked-by: Vlastimil Babka
    Acked-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

25 Sep, 2019

1 commit

  • arm64 handles top-down mmap layout in a way that can be easily reused by
    other architectures, so make it available in mm. It then introduces a new
    config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
    architectures to benefit from those functions. Note that this new config
    depends on MMU being enabled; if it is selected without MMU support, a
    warning will be thrown.

    Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Suggested-by: Christoph Hellwig
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user-supplied value lies within an allowed range.
    This function uses the extra1 and extra2 members from struct ctl_table
    as the minimum and maximum allowed values.

    Where sysctl handlers are declared, each source file defines some
    read-only variables containing just an integer, whose addresses are
    assigned to the extra1 and extra2 members so that the sysctl range is
    enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
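
    The shared pattern looks roughly like this (a sketch matching the
    description above):

    #include <limits.h>

    const int sysctl_vals[] = { 0, 1, INT_MAX };

    #define SYSCTL_ZERO     ((void *)&sysctl_vals[0])
    #define SYSCTL_ONE      ((void *)&sysctl_vals[1])
    #define SYSCTL_INT_MAX  ((void *)&sysctl_vals[2])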

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                       old     new   delta
    sysctl_vals                  -      12     +12
    __kstrtab_sysctl_vals        -      12     +12
    max                         14      10      -4
    int_max                     16       -     -16
    one                         68       -     -68
    zero                       128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

25 Jun, 2019

1 commit

  • Tasks without a user-defined clamp value are considered not clamped
    and by default their utilization can have any value in the
    [0..SCHED_CAPACITY_SCALE] range.

    Tasks with a user-defined clamp value are allowed to request any value
    in that range, and the required clamp is unconditionally enforced.
    However, a "System Management Software" could be interested in limiting
    the range of clamp values allowed for all tasks.

    Add a privileged interface to define a system default configuration via:

    /proc/sys/kernel/sched_uclamp_util_{min,max}

    which works as an unconditional clamp range restriction for all tasks.

    With the default configuration, the full SCHED_CAPACITY_SCALE range of
    values is allowed for each clamp index. Otherwise, the task-specific
    clamp is capped by the corresponding system default value.
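
    A sketch of restricting the range system-wide (paths follow the names
    above; SCHED_CAPACITY_SCALE is 1024):

    #include <stdio.h>

    int main(void)
    {
            /* cap any task's requested uclamp.min boost at half capacity */
            FILE *f = fopen("/proc/sys/kernel/sched_uclamp_util_min", "w");

            if (!f)
                    return 1;
            fprintf(f, "%d\n", 512);
            fclose(f);
            return 0;
    }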

    Do that by tracking, for each task, the "effective" clamp value and
    bucket the task has been refcounted in at enqueue time. This allows
    lazily aggregating the "requested" and "system default" values at
    enqueue time and simplifies refcounting updates at dequeue time.

    The cached bucket ids are used to avoid (relatively) more expensive
    integer divisions every time a task is enqueued.

    An active flag is used to report when the "effective" value is valid and
    thus the task is actually refcounted in the corresponding rq's bucket.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     

15 Jun, 2019

1 commit

  • Convert proc_dointvec_minmax_bpf_stats() into a more generic
    helper, since we are going to use jump labels more often.

    Note that sysctl_bpf_stats_enabled is removed, since
    it is no longer needed/used.

    Signed-off-by: Eric Dumazet
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

4 commits

  • Today, proc_do_large_bitmap() truncates a large write input buffer to
    PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
    end of the buffer. Further, it fails to notify the caller that the
    buffer was truncated, so it doesn't get called iteratively to finish the
    entire input buffer.

    Tell the caller if there's more work to do by adding the skipped amount
    back to left/*lenp before returning.

    To fix the misparsing, reset the position if we have completely consumed
    a truncated buffer (or if just one char is left, which may be a "-" in a
    range), and ask the caller to come back for more.

    Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
    Signed-off-by: Eric Sandeen
    Signed-off-by: Luis Chamberlain
    Acked-by: Kees Cook
    Cc: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • Currently, when userspace gives us a value that overflows e.g. file-max
    and other callers of __do_proc_doulongvec_minmax(), we simply ignore
    the new value and leave the current value untouched.

    This can be problematic as it gives the illusion that the limit has
    indeed been bumped when in fact it failed. This commit makes sure to
    return EINVAL when an overflow is detected. Please note that this is a
    userspace-facing change.
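
    A sketch of the now-visible failure (the write is expected to fail with
    EINVAL instead of being silently ignored):

    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/fs/file-max", "w");

            if (!f)
                    return 1;
            fputs("99999999999999999999999999\n", f);
            if (fflush(f) == EOF && errno == EINVAL)
                    puts("overflow rejected with EINVAL");
            fclose(f);
            return 0;
    }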

    Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
    Signed-off-by: Christian Brauner
    Acked-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Dominik Brodowski
    Cc: "Eric W. Biederman"
    Cc: Joe Lawrence
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Brauner
     
  • Switch to bitmap_zalloc() to show clearly what we are allocating.
    Besides that, it returns a pointer of bitmap type instead of an opaque
    void *.
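
    The conversion pattern is, schematically (a kernel-side sketch):

    /* before: opaque pointer, manual size arithmetic */
    void *map = kzalloc(BITS_TO_LONGS(nbits) * sizeof(unsigned long),
                        GFP_KERNEL);

    /* after: intent and bitmap type are explicit */
    unsigned long *bitmap = bitmap_zalloc(nbits, GFP_KERNEL);
    bitmap_free(bitmap);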

    Link: http://lkml.kernel.org/r/20190304094037.57756-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Acked-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Luis Chamberlain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Userfaultfd can be misused to make it easier to exploit existing
    use-after-free (and similar) bugs that might otherwise only make a
    short window or race condition available. By using userfaultfd to
    stall a kernel thread, a malicious program can keep some state that it
    wrote, stable for an extended period, which it can then access using an
    existing exploit. While it doesn't cause the exploit itself, and while
    it's not the only thing that can stall a kernel thread when accessing a
    memory location, it's one of the few that never needs privilege.

    We can add a flag, allowing userfaultfd to be restricted, so that in
    general it won't be usable by arbitrary user programs, but in
    environments that require userfaultfd it can be turned back on.

    Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
    whether userfaultfd is allowed by unprivileged users. When this is
    set to zero, only privileged users (root user, or users with the
    CAP_SYS_PTRACE capability) will be able to use the userfaultfd
    syscalls.
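
    A sketch of what an unprivileged caller should observe once the knob is
    set to zero:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

            if (fd < 0) {
                    perror("userfaultfd"); /* EPERM when restricted */
                    return 1;
            }
            close(fd);
            return 0;
    }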

    Andrea said:

    : The only difference between the bpf sysctl and the userfaultfd sysctl
    : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
    : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
    : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
    : already if it's doing other kind of tracking on processes runtime, in
    : addition of userfaultfd. In other words both syscalls works only for
    : root, when the two sysctl are opt-in set to 1.

    [dgilbert@redhat.com: changelog additions]
    [akpm@linux-foundation.org: documentation tweak, per Mike]
    Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
    Signed-off-by: Peter Xu
    Suggested-by: Andrea Arcangeli
    Suggested-by: Mike Rapoport
    Reviewed-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Paolo Bonzini
    Cc: Hugh Dickins
    Cc: Luis Chamberlain
    Cc: Maxime Coquelin
    Cc: Maya Gokhale
    Cc: Jerome Glisse
    Cc: Pavel Emelyanov
    Cc: Johannes Weiner
    Cc: Martin Cracauer
    Cc: Denis Plotnikov
    Cc: Marty McFadden
    Cc: Mike Kravetz
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: "Kirill A . Shutemov"
    Cc: "Dr . David Alan Gilbert"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

19 Apr, 2019

1 commit

  • To make ICMPv6 closer to ICMPv4, add a ratemask parameter. Since the
    ICMPv6 message types use larger numeric values, a simple bitmask doesn't
    fit, so I use a large bitmap. The input and output are in the form of a
    list of ranges. Set the default to rate limit all error messages but
    Packet Too Big. For Packet Too Big, use the ratemask instead of the
    hard-coded behavior.

    There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
    aren't called. This patch only adds them to icmpv6_echo_reply().

    Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
    that it is also acceptable to rate limit informational messages. Thus,
    I removed the current hard-coded behavior of icmpv6_mask_allow() that
    doesn't rate limit informational messages.

    v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
    isn't defined, expand the description in ip-sysctl.txt and remove
    unnecessary conditional before kfree().
    v3: Inline the bitmap instead of dynamically allocating it. Still, a
    pointer to it is needed because of the way proc_do_large_bitmap() works.
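
    A sketch of setting the mask in the range-list format (the proc path is
    inferred from the description and should be treated as an assumption):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/net/ipv6/icmp/ratemask", "w");

            if (!f)
                    return 1;
            /* default policy: every error type except Packet Too Big (2) */
            fputs("0-1,3-127\n", f);
            fclose(f);
            return 0;
    }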

    Signed-off-by: Stephen Suryaputra
    Signed-off-by: David S. Miller

    Stephen Suryaputra
     

06 Apr, 2019

1 commit

  • Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up
    min/max values for the file-max sysctl parameter via the .extra1 and
    .extra2 fields in the corresponding struct ctl_table entry.

    Unfortunately, the minimum value points at the global 'zero' variable,
    which is an int. This results in a KASAN splat when accessed as a long
    by proc_doulongvec_minmax on 64-bit architectures:

    | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
    | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
    |
    | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
    | Hardware name: linux,dummy-virt (DT)
    | Call trace:
    | dump_backtrace+0x0/0x228
    | show_stack+0x14/0x20
    | dump_stack+0xe8/0x124
    | print_address_description+0x60/0x258
    | kasan_report+0x140/0x1a0
    | __asan_report_load8_noabort+0x18/0x20
    | __do_proc_doulongvec_minmax+0x5d8/0x6a0
    | proc_doulongvec_minmax+0x4c/0x78
    | proc_sys_call_handler.isra.19+0x144/0x1d8
    | proc_sys_write+0x34/0x58
    | __vfs_write+0x54/0xe8
    | vfs_write+0x124/0x3c0
    | ksys_write+0xbc/0x168
    | __arm64_sys_write+0x68/0x98
    | el0_svc_common+0x100/0x258
    | el0_svc_handler+0x48/0xc0
    | el0_svc+0x8/0xc
    |
    | The buggy address belongs to the variable:
    | zero+0x0/0x40
    |
    | Memory state around the buggy address:
    | ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
    | ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
    | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
    | ^
    | ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
    | ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    Fix the splat by introducing an unsigned long 'zero_ul' and using that
    instead.

    Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
    Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max")
    Signed-off-by: Will Deacon
    Acked-by: Christian Brauner
    Cc: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Matteo Croce
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon