05 Jan, 2020

1 commit

  • [ Upstream commit 204cb79ad42f015312a5bbd7012d09c93d9b46fb ]

    Currently, the drop_caches proc file and sysctl read back the last value
    written, suggesting this is somehow a stateful setting instead of a
    one-time command. Make it write-only, like e.g. compact_memory.

    While mitigating a VM problem at scale in our fleet, there was confusion
    about whether writing to this file will permanently switch the kernel into
    a non-caching mode. This influences the decision making in a tense
    situation, where tens of people are trying to fix tens of thousands of
    affected machines: Do we need a rollback strategy? What are the
    performance implications of operating in a non-caching state for several
    days? It also caused confusion when the kernel team said we may need to
    write the file several times to make sure it's effective ("But it already
    reads back 3?").

    Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Chris Down
    Acked-by: Vlastimil Babka
    Acked-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     

15 Oct, 2019

1 commit


25 Sep, 2019

1 commit

  • arm64 handles top-down mmap layout in a way that can be easily reused by
    other architectures, so make it available in mm. Also introduce a new
    config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
    architectures to benefit from those functions. Note that this new config
    depends on MMU being enabled; if it is selected without MMU support, a
    warning will be thrown.

    Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Suggested-by: Christoph Hellwig
    Acked-by: Catalin Marinas
    Acked-by: Kees Cook
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Luis Chamberlain
    Cc: Albert Ou
    Cc: Alexander Viro
    Cc: James Hogan
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user-supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members of struct ctl_table as the
    minimum and maximum allowed values.

    Where sysctl handlers are declared, every source file has some read-only
    variables containing just an integer, whose addresses are assigned to
    the extra1 and extra2 members so that the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.
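
    The idea can be sketched in plain userspace C. This is only an
    illustration: struct demo_ctl_table, demo_store_minmax() and demo_entry
    are hypothetical stand-ins, not the kernel's definitions; only the
    shared array of 0, 1 and INT_MAX and the macros that point into it
    follow the description above.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* One shared table of the most common range boundaries. */
static const int sysctl_vals[] = { 0, 1, INT_MAX };

/* Macros that name the array members, so no per-file copies are needed. */
#define SYSCTL_ZERO    ((void *)&sysctl_vals[0])
#define SYSCTL_ONE     ((void *)&sysctl_vals[1])
#define SYSCTL_INT_MAX ((void *)&sysctl_vals[2])

/* Hypothetical, cut-down stand-in for struct ctl_table. */
struct demo_ctl_table {
	const char *procname;
	int *data;
	void *extra1; /* minimum allowed value, or NULL */
	void *extra2; /* maximum allowed value, or NULL */
};

/* Simplified range check in the spirit of proc_dointvec_minmax(). */
static int demo_store_minmax(const struct demo_ctl_table *t, int val)
{
	const int *min = t->extra1;
	const int *max = t->extra2;

	if ((min && val < *min) || (max && val > *max))
		return -EINVAL;
	*t->data = val;
	return 0;
}

static int demo_flag;

/* A boolean-like entry: no local 'zero' or 'one' variable is required. */
static const struct demo_ctl_table demo_entry = {
	.procname = "demo_flag",
	.data     = &demo_flag,
	.extra1   = SYSCTL_ZERO,
	.extra2   = SYSCTL_ONE,
};
```

    With this layout, demo_store_minmax(&demo_entry, 2) is rejected while 0
    and 1 are accepted, and every table entry shares the same three
    constants.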

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                     old     new   delta
    sysctl_vals                -      12     +12
    __kstrtab_sysctl_vals      -      12     +12
    max                       14      10      -4
    int_max                   16       -     -16
    one                       68       -     -68
    zero                     128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

17 Jul, 2019

1 commit


25 Jun, 2019

1 commit

  • Tasks without a user-defined clamp value are considered not clamped
    and by default their utilization can have any value in the
    [0..SCHED_CAPACITY_SCALE] range.

    Tasks with a user-defined clamp value are allowed to request any value
    in that range, and the required clamp is unconditionally enforced.
    However, a "System Management Software" could be interested in limiting
    the range of clamp values allowed for all tasks.

    Add a privileged interface to define a system default configuration via:

    /proc/sys/kernel/sched_uclamp_util_{min,max}

    which works as an unconditional clamp range restriction for all tasks.

    With the default configuration, the full SCHED_CAPACITY_SCALE range of
    values is allowed for each clamp index. Otherwise, the task-specific
    clamp is capped by the corresponding system default value.

    Do that by tracking, for each task, the "effective" clamp value and
    bucket the task has been refcounted in at enqueue time. This
    allows lazy aggregation of the "requested" and "system default" values
    at enqueue time and simplifies refcounting updates at dequeue time.

    The cached bucket ids are used to avoid (relatively) more expensive
    integer divisions every time a task is enqueued.

    An active flag is used to report when the "effective" value is valid and
    thus the task is actually refcounted in the corresponding rq's bucket.
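
    A rough userspace model of the capping and bucket caching described
    above follows. The names prefixed with demo_ and the bucket count of 5
    are assumptions made for illustration; only the capping rule and the
    idea of caching a bucket id computed by integer division come from the
    text.

```c
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024u
#define DEMO_UCLAMP_BUCKETS  5u /* hypothetical bucket count */
/* Bucket width, rounded to the closest integer (205 for 1024 / 5). */
#define DEMO_BUCKET_DELTA \
	((SCHED_CAPACITY_SCALE + DEMO_UCLAMP_BUCKETS / 2) / DEMO_UCLAMP_BUCKETS)

/* Hypothetical per-clamp state: value, cached bucket id, active flag. */
struct demo_uclamp_se {
	unsigned int value;
	unsigned int bucket_id;
	bool active;
};

/* Compute the bucket id once, when the clamp value is set. */
static struct demo_uclamp_se demo_uclamp_make(unsigned int value)
{
	struct demo_uclamp_se se = { value, value / DEMO_BUCKET_DELTA, false };
	return se;
}

/*
 * Aggregate lazily at enqueue time: the task's request is used unless it
 * exceeds the system default, in which case the system default wins.
 * The cached bucket id travels with the chosen value, so no division is
 * needed here.
 */
static struct demo_uclamp_se
demo_uclamp_eff_get(struct demo_uclamp_se req, struct demo_uclamp_se sys_default)
{
	struct demo_uclamp_se eff =
		(req.value > sys_default.value) ? sys_default : req;

	eff.active = true; /* the effective value is now valid */
	return eff;
}
```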

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     

15 Jun, 2019

1 commit

  • Convert proc_dointvec_minmax_bpf_stats() into a more generic
    helper, since we are going to use jump labels more often.

    Note that sysctl_bpf_stats_enabled is removed, since
    it is no longer needed/used.

    Signed-off-by: Eric Dumazet
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

4 commits

  • Today, proc_do_large_bitmap() truncates a large write input buffer to
    PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
    end of the buffer. Further, it fails to notify the caller that the
    buffer was truncated, so it doesn't get called iteratively to finish the
    entire input buffer.

    Tell the caller if there's more work to do by adding the skipped amount
    back to left/*lenp before returning.

    To fix the misparsing, reset the position if we have completely consumed
    a truncated buffer (or if just one char is left, which may be a "-" in a
    range), and ask the caller to come back for more.

    Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
    Signed-off-by: Eric Sandeen
    Signed-off-by: Luis Chamberlain
    Acked-by: Kees Cook
    Cc: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • Currently, when userspace gives us a value that overflows, e.g. for
    file-max and other callers of __do_proc_doulongvec_minmax(), we simply
    ignore the new value and leave the current value untouched.

    This can be problematic as it gives the illusion that the limit has
    indeed been bumped when in fact it failed. This commit makes sure to
    return EINVAL when an overflow is detected. Please note that this is a
    userspace facing change.

    Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
    Signed-off-by: Christian Brauner
    Acked-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Dominik Brodowski
    Cc: "Eric W. Biederman"
    Cc: Joe Lawrence
    Cc: Waiman Long
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Brauner
     
  • Switch to bitmap_zalloc() to show clearly what we are allocating.
    Besides that, it returns a pointer of bitmap type instead of an opaque
    void *.

    Link: http://lkml.kernel.org/r/20190304094037.57756-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Acked-by: Kees Cook
    Reviewed-by: Andrew Morton
    Cc: Luis Chamberlain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Userfaultfd can be misused to make it easier to exploit existing
    use-after-free (and similar) bugs that might otherwise only make a
    short window or race condition available. By using userfaultfd to
    stall a kernel thread, a malicious program can keep some state that it
    wrote, stable for an extended period, which it can then access using an
    existing exploit. While it doesn't cause the exploit itself, and while
    it's not the only thing that can stall a kernel thread when accessing a
    memory location, it's one of the few that never needs privilege.

    We can add a flag, allowing userfaultfd to be restricted, so that in
    general it won't be useable by arbitrary user programs, but in
    environments that require userfaultfd it can be turned back on.

    Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
    whether userfaultfd is allowed by unprivileged users. When this is
    set to zero, only privileged users (root user, or users with the
    CAP_SYS_PTRACE capability) will be able to use the userfaultfd
    syscalls.
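
    The resulting policy can be modelled as a small predicate. This is a
    sketch only: demo_userfaultfd_permitted() and its boolean capability
    argument are hypothetical, standing in for whatever capability check
    the kernel actually performs.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the access rule: when the knob is zero, only callers holding
 * CAP_SYS_PTRACE may proceed; when it is non-zero, everyone may.
 */
static int demo_userfaultfd_permitted(int unprivileged_userfaultfd,
				      bool has_cap_sys_ptrace)
{
	if (!unprivileged_userfaultfd && !has_cap_sys_ptrace)
		return -EPERM;
	return 0;
}
```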

    Andrea said:

    : The only difference between the bpf sysctl and the userfaultfd sysctl
    : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
    : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
    : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
    : already if it's doing other kind of tracking on processes runtime, in
    : addition of userfaultfd. In other words both syscalls works only for
    : root, when the two sysctl are opt-in set to 1.

    [dgilbert@redhat.com: changelog additions]
    [akpm@linux-foundation.org: documentation tweak, per Mike]
    Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
    Signed-off-by: Peter Xu
    Suggested-by: Andrea Arcangeli
    Suggested-by: Mike Rapoport
    Reviewed-by: Mike Rapoport
    Reviewed-by: Andrea Arcangeli
    Cc: Paolo Bonzini
    Cc: Hugh Dickins
    Cc: Luis Chamberlain
    Cc: Maxime Coquelin
    Cc: Maya Gokhale
    Cc: Jerome Glisse
    Cc: Pavel Emelyanov
    Cc: Johannes Weiner
    Cc: Martin Cracauer
    Cc: Denis Plotnikov
    Cc: Marty McFadden
    Cc: Mike Kravetz
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: "Kirill A . Shutemov"
    Cc: "Dr . David Alan Gilbert"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

19 Apr, 2019

1 commit

  • To make ICMPv6 closer to ICMPv4, add a ratemask parameter. Since the
    ICMPv6 message types use larger numeric values, a simple bitmask doesn't
    fit, so use a large bitmap. The input and output are in the form of a
    list of ranges. Set the default to rate limit all error messages but
    Packet Too Big. For Packet Too Big, use ratemask instead of the
    hard-coded behavior.
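
    To illustrate the range-list format, here is a minimal userspace
    parser. It is not proc_do_large_bitmap(); the demo_ names, the 256-bit
    size and the exact error handling are assumptions of this sketch.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NBITS         256u
#define DEMO_BITS_PER_WORD (8u * sizeof(unsigned long))
#define DEMO_NWORDS \
	((DEMO_NBITS + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD)

static void demo_set_bit(unsigned long *map, unsigned int bit)
{
	map[bit / DEMO_BITS_PER_WORD] |= 1UL << (bit % DEMO_BITS_PER_WORD);
}

static int demo_test_bit(const unsigned long *map, unsigned int bit)
{
	return (int)((map[bit / DEMO_BITS_PER_WORD] >>
		      (bit % DEMO_BITS_PER_WORD)) & 1UL);
}

/* Parse a list such as "0-1,3-127" into map (DEMO_NWORDS words). */
static int demo_parse_ranges(const char *s, unsigned long *map)
{
	memset(map, 0, DEMO_NWORDS * sizeof(*map));

	while (*s && *s != '\n') {
		char *end;
		unsigned long lo, hi, b;

		if (*s < '0' || *s > '9')
			return -EINVAL;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {
			/* "lo-hi": parse the upper end of the range. */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -EINVAL;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= DEMO_NBITS)
			return -EINVAL;
		for (b = lo; b <= hi; b++)
			demo_set_bit(map, (unsigned int)b);

		if (*end == ',')
			s = end + 1;
		else if (*end == '\0' || *end == '\n')
			break;
		else
			return -EINVAL;
	}
	return 0;
}
```

    A mask such as "0-1,3-127" would then cover every ICMPv6 error type
    (0 to 127) except type 2, Packet Too Big.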

    There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
    aren't called. This patch only adds them to icmpv6_echo_reply().

    Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
    that it is also acceptable to rate limit informational messages. Thus,
    I removed the current hard-coded behavior of icmpv6_mask_allow() that
    doesn't rate limit informational messages.

    v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
    isn't defined, expand the description in ip-sysctl.txt and remove
    unnecessary conditional before kfree().
    v3: Inline the bitmap instead of allocating it dynamically. A pointer
    to it is still needed because of the way proc_do_large_bitmap() works.

    Signed-off-by: Stephen Suryaputra
    Signed-off-by: David S. Miller

    Stephen Suryaputra
     

06 Apr, 2019

1 commit

  • Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up
    min/max values for the file-max sysctl parameter via the .extra1 and
    .extra2 fields in the corresponding struct ctl_table entry.

    Unfortunately, the minimum value points at the global 'zero' variable,
    which is an int. This results in a KASAN splat when accessed as a long
    by proc_doulongvec_minmax on 64-bit architectures:

    | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
    | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
    |
    | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
    | Hardware name: linux,dummy-virt (DT)
    | Call trace:
    | dump_backtrace+0x0/0x228
    | show_stack+0x14/0x20
    | dump_stack+0xe8/0x124
    | print_address_description+0x60/0x258
    | kasan_report+0x140/0x1a0
    | __asan_report_load8_noabort+0x18/0x20
    | __do_proc_doulongvec_minmax+0x5d8/0x6a0
    | proc_doulongvec_minmax+0x4c/0x78
    | proc_sys_call_handler.isra.19+0x144/0x1d8
    | proc_sys_write+0x34/0x58
    | __vfs_write+0x54/0xe8
    | vfs_write+0x124/0x3c0
    | ksys_write+0xbc/0x168
    | __arm64_sys_write+0x68/0x98
    | el0_svc_common+0x100/0x258
    | el0_svc_handler+0x48/0xc0
    | el0_svc+0x8/0xc
    |
    | The buggy address belongs to the variable:
    | zero+0x0/0x40
    |
    | Memory state around the buggy address:
    | ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
    | ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
    | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
    | ^
    | ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
    | ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    Fix the splat by introducing an unsigned long 'zero_ul' and using that
    instead.

    Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
    Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max")
    Signed-off-by: Will Deacon
    Acked-by: Christian Brauner
    Cc: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Matteo Croce
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     

13 Mar, 2019

2 commits

  • do_proc_do[u]intvec_minmax_conv() had included open-coded versions of
    do_proc_do[u]intvec_conv(); the duplication led to buggy inconsistencies
    (missing range checks). To reduce the likelihood of such problems in the
    future, we can instead refactor both to be defined in terms of their
    non-bounded counterparts (plus the added check).
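
    The shape of that refactoring can be sketched as follows. The demo_
    functions are simplified, hypothetical stand-ins with reduced
    signatures; they only show the bounded conversion delegating to the
    unbounded one and adding a range check on a temporary.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Unbounded conversion: sign + magnitude to int, rejecting overflow. */
static int demo_int_conv(bool neg, unsigned long lval, int *valp)
{
	if (neg) {
		if (lval > (unsigned long)INT_MAX + 1)
			return -EINVAL;
		*valp = (lval == (unsigned long)INT_MAX + 1) ? INT_MIN
							      : -(int)lval;
	} else {
		if (lval > (unsigned long)INT_MAX)
			return -EINVAL;
		*valp = (int)lval;
	}
	return 0;
}

/* Hypothetical parameter block carrying optional bounds. */
struct demo_minmax {
	const int *min;
	const int *max;
};

/*
 * Bounded conversion, defined in terms of the unbounded one. The value is
 * converted into a temporary first so that *valp is only touched once
 * both the overflow check and the range check have passed.
 */
static int demo_int_minmax_conv(bool neg, unsigned long lval, int *valp,
				const struct demo_minmax *p)
{
	int tmp;
	int ret = demo_int_conv(neg, lval, &tmp);

	if (ret)
		return ret;
	if ((p->min && tmp < *p->min) || (p->max && tmp > *p->max))
		return -EINVAL;
	*valp = tmp;
	return 0;
}
```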

    Link: http://lkml.kernel.org/r/20190207165138.5oud57vq4ozwb4kh@hatter.bewilderbeest.net
    Signed-off-by: Zev Weiss
    Cc: Brendan Higgins
    Cc: Iurii Zaikin
    Cc: Kees Cook
    Cc: Luis Chamberlain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zev Weiss
     
  • This bug has apparently existed since the introduction of this function
    in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
    "[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
    neighbour sysctls.").

    As a minimal fix we can simply duplicate the corresponding check in
    do_proc_dointvec_conv().

    Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net
    Signed-off-by: Zev Weiss
    Cc: Brendan Higgins
    Cc: Iurii Zaikin
    Cc: Kees Cook
    Cc: Luis Chamberlain
    Cc: [2.6.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zev Weiss
     

11 Mar, 2019

1 commit

  • Pull networking fixes from David Miller:
    "First batch of fixes in the new merge window:

    1) Double dst_cache free in act_tunnel_key, from Wenxu.

    2) Avoid NULL deref in IN_DEV_MFORWARD() by failing early in the
    ip_route_input_rcu() path, from Paolo Abeni.

    3) Fix appletalk compile regression, from Arnd Bergmann.

    4) If SLAB objects reach the TCP sendpage method we are in serious
    trouble, so put a debugging check there. From Vasily Averin.

    5) Memory leak in hsr layer, from Mao Wenan.

    6) Only test GSO type on GSO packets, from Willem de Bruijn.

    7) Fix crash in xsk_diag_put_umem(), from Eric Dumazet.

    8) Fix VNIC mailbox length in nfp, from Dirk van der Merwe.

    9) Fix race in ipv4 route exception handling, from Xin Long.

    10) Missing DMA memory barrier in hns3 driver, from Jian Shen.

    11) Use after free in __tcf_chain_put(), from Vlad Buslov.

    12) Handle inet_csk_reqsk_queue_add() failures, from Guillaume Nault.

    13) Return value correction when ip_mc_may_pull() fails, from Eric
    Dumazet.

    14) Use after free in x25_device_event(), also from Eric"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
    gro_cells: make sure device is up in gro_cells_receive()
    vxlan: test dev->flags & IFF_UP before calling gro_cells_receive()
    net/x25: fix use-after-free in x25_device_event()
    isdn: mISDNinfineon: fix potential NULL pointer dereference
    net: hns3: fix to stop multiple HNS reset due to the AER changes
    ip: fix ip_mc_may_pull() return value
    net: keep refcount warning in reqsk_free()
    net: stmmac: Avoid one more sometimes uninitialized Clang warning
    net: dsa: mv88e6xxx: Set correct interface mode for CPU/DSA ports
    rxrpc: Fix client call queueing, waiting for channel
    tcp: handle inet_csk_reqsk_queue_add() failures
    net: ethernet: sun: Zero initialize class in default case in niu_add_ethtool_tcam_entry
    8139too : Add support for U.S. Robotics USR997901A 10/100 Cardbus NIC
    fou, fou6: avoid uninit-value in gue_err() and gue6_err()
    net: sched: fix potential use-after-free in __tcf_chain_put()
    vhost: silence an unused-variable warning
    vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock
    connector: fix unsafe usage of ->real_parent
    vxlan: do not need BH again in vxlan_cleanup()
    net: hns3: add dma_rmb() for rx description
    ...

    Linus Torvalds
     

08 Mar, 2019

2 commits

  • Currently, when writing

    echo 18446744073709551616 > /proc/sys/fs/file-max

    /proc/sys/fs/file-max will overflow and be set to 0. That quickly
    crashes the system.

    This commit sets the max and min value for file-max. The max value is
    set to the maximum value of a long int. Any higher value cannot
    currently be used as the percpu counters are long ints and not unsigned
    integers.

    Note that the file-max value is ultimately parsed via
    __do_proc_doulongvec_minmax(). This function does not report an error
    when min or max are exceeded, which means that if a value larger than a
    long int is written, userspace will not receive an error; instead the
    old value will be kept. There is an argument to be made that this should
    be changed and __do_proc_doulongvec_minmax() should return an error when
    a dedicated min or max value is exceeded. However, this has the
    potential to break userspace, so let's defer this to an RFC patch.

    Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io
    Signed-off-by: Christian Brauner
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Dominik Brodowski
    Cc: "Eric W. Biederman"
    Cc: Joe Lawrence
    Cc: Luis Chamberlain
    Cc: Waiman Long
    [christian@brauner.io: v4]
    Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Brauner
     
  • proc_get_long() is a funny function. It uses simple_strtoul(), and for
    a good reason. proc_get_long() wants the parse to always succeed and to
    return the possibly incorrect value together with the trailing
    characters, which it then checks against a pre-defined list of
    acceptable trailing values. However,
    simple_strtoul() explicitly ignores overflows which can cause funny
    things like the following to happen:

    echo 18446744073709551616 > /proc/sys/fs/file-max
    cat /proc/sys/fs/file-max
    0

    (Which will cause your system to silently die behind your back.)

    On the other hand kstrtoul() does do overflow detection but does not
    return the trailing characters, and also fails the parse when anything
    other than '\n' is a trailing character whereas proc_get_long() wants to
    be more lenient.

    Now, before adding another kstrtoul() function, let's simply add a
    static parser, strtoul_lenient(), which:
    - fails on overflow with -ERANGE
    - returns the trailing characters to the caller

    The reason why we should fail on ERANGE is that we already do a partial
    fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So
    we already reject values such as 184467440737095516160 (21 chars) but
    accept values such as 18446744073709551616 (20 chars) but both are
    overflows. So we should just always reject 64bit overflows and not
    special-case this based on the number of chars.
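
    The two properties listed above can be approximated in userspace with
    the C library's strtoul(). This is a sketch, not the kernel helper:
    demo_strtoul_lenient() is a hypothetical name, it handles base 10 only,
    and it relies on strtoul() reporting overflow through errno.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal number at cp.
 *  - fails with -ERANGE if the value does not fit in an unsigned long
 *  - hands the trailing characters back through *endp for the caller
 *    to validate
 */
static int demo_strtoul_lenient(const char *cp, char **endp,
				unsigned long *res)
{
	unsigned long val;
	char *end;

	/* strtoul() would accept leading spaces and a sign; insist on a digit. */
	if (!isdigit((unsigned char)*cp))
		return -EINVAL;

	errno = 0;
	val = strtoul(cp, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	*res = val;
	if (endp)
		*endp = end;
	return 0;
}
```

    Under this sketch the 20-character value 18446744073709551616 is
    rejected for the same reason as the 21-character one: it overflows,
    regardless of its length.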

    Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io
    Signed-off-by: Christian Brauner
    Acked-by: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Luis Chamberlain
    Cc: Joe Lawrence
    Cc: Waiman Long
    Cc: Dominik Brodowski
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Brauner
     

07 Mar, 2019

3 commits

  • When CONFIG_BPF_SYSCALL or CONFIG_SYSCTL is disabled, we get
    a warning about an unused function:

    kernel/sysctl.c:3331:12: error: 'proc_dointvec_minmax_bpf_stats' defined but not used [-Werror=unused-function]
    static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write,

    The CONFIG_BPF_SYSCALL check was already handled, but the SYSCTL check
    is needed on top.

    Fixes: 492ecee892c2 ("bpf: enable program stats")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Kees Cook
    Reviewed-by: Christian Brauner
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Arnd Bergmann
     
  • Merge misc updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton : (159 commits)
    tools/testing/selftests/proc/proc-self-syscall.c: remove duplicate include
    proc: more robust bulk read test
    proc: test /proc/*/maps, smaps, smaps_rollup, statm
    proc: use seq_puts() everywhere
    proc: read kernel cpu stat pointer once
    proc: remove unused argument in proc_pid_lookup()
    fs/proc/thread_self.c: code cleanup for proc_setup_thread_self()
    fs/proc/self.c: code cleanup for proc_setup_self()
    proc: return exit code 4 for skipped tests
    mm,mremap: bail out earlier in mremap_to under map pressure
    mm/sparse: fix a bad comparison
    mm/memory.c: do_fault: avoid usage of stale vm_area_struct
    writeback: fix inode cgroup switching comment
    mm/huge_memory.c: fix "orig_pud" set but not used
    mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC
    mm/memcontrol.c: fix bad line in comment
    mm/cma.c: cma_declare_contiguous: correct err handling
    mm/page_ext.c: fix an imbalance with kmemleak
    mm/compaction: pass pgdat to too_many_isolated() instead of zone
    mm: remove zone_lru_lock() function, access ->lru_lock directly
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - refcount conversions

    - Solve the rq->leaf_cfs_rq_list can of worms for real.

    - improve power-aware scheduling

    - add sysctl knob for Energy Aware Scheduling

    - documentation updates

    - misc other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    kthread: Do not use TIMER_IRQSAFE
    kthread: Convert worker lock to raw spinlock
    sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
    sched/fair: Remove unused 'sd' parameter from select_idle_smt()
    sched/wait: Use freezable_schedule() when possible
    sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
    sched/fair: Explain LLC nohz kick condition
    sched/fair: Simplify nohz_balancer_kick()
    sched/topology: Fix percpu data types in struct sd_data & struct s_data
    sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
    sched/fair: Fix O(nr_cgroups) in the load balancing path
    sched/fair: Optimize update_blocked_averages()
    sched/fair: Fix insertion in rq->leaf_cfs_rq_list
    sched/fair: Add tmp_alone_branch assertion
    sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
    sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
    sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
    sched/fair: Update scale invariance of PELT
    sched/fair: Move the rq_of() helper function
    sched/core: Convert task_struct.stack_refcount to refcount_t
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • sysctl_extfrag_handler() neglects to propagate the return value from
    proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need
    to exist, so just use proc_dointvec_minmax() directly.

    Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: Aditya Pakki
    Acked-by: Mel Gorman
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

01 Mar, 2019

1 commit


28 Feb, 2019

1 commit

  • JITed BPF programs are indistinguishable from kernel functions, but unlike
    kernel code BPF code can be changed often.
    Typical approach of "perf record" + "perf report" profiling and tuning of
    kernel code works just as well for BPF programs, but kernel code doesn't
    need to be monitored whereas BPF programs do.
    Users load and run a large number of BPF programs.
    These BPF stats allow tools to monitor the usage of BPF on the server.
    The monitoring tools will turn sysctl kernel.bpf_stats_enabled
    on and off for a few seconds to sample the average cost of the programs.
    Aggregated data over hours and days will provide insight into the cost
    of BPF, and alarms can trigger in case a given program suddenly gets
    more expensive.

    The cost of two sched_clock() per program invocation adds ~20 nsec.
    Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
    from ~10 nsec to ~30 nsec.
    static_key minimizes the cost of the stats collection.
    There is no measurable difference before/after this patch
    with kernel.bpf_stats_enabled=0
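
    The accounting pattern can be sketched in userspace as below. This is
    an assumption-laden model: a plain bool stands in for the static key,
    timespec_get() stands in for sched_clock(), and all demo_ names are
    hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the static key toggled by kernel.bpf_stats_enabled. */
static bool demo_stats_enabled;

/* Per-program counters: number of runs and total time spent. */
struct demo_prog_stats {
	uint64_t cnt;
	uint64_t nsecs;
};

static uint64_t demo_now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Run prog(ctx); take the two clock readings only when stats are on. */
static int demo_run_prog(int (*prog)(int), int ctx,
			 struct demo_prog_stats *st)
{
	uint64_t start;
	int ret;

	if (!demo_stats_enabled)
		return prog(ctx); /* fast path: no clock reads at all */

	start = demo_now_ns();
	ret = prog(ctx);
	st->cnt++;
	st->nsecs += demo_now_ns() - start;
	return ret;
}

/* Trivial example program used to exercise the wrapper. */
static int demo_prog_double(int x)
{
	return 2 * x;
}
```

    A monitoring tool following the description above would enable the
    switch briefly, then derive an average cost per run as nsecs / cnt.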

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

27 Jan, 2019

1 commit

  • In its current state, Energy Aware Scheduling (EAS) starts automatically
    on asymmetric platforms having an Energy Model (EM). However, there are
    users who want to have an EM (for thermal management for example), but
    don't want EAS with it.

    In order to let users disable EAS explicitly, introduce a new sysctl
    called 'sched_energy_aware'. It is enabled by default so that EAS can
    start automatically on platforms where it makes sense. Flipping it to 0
    rebuilds the scheduling domains and disables EAS.

    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adharmap@codeaurora.org
    Cc: chris.redpath@arm.com
    Cc: currojerez@riseup.net
    Cc: dietmar.eggemann@arm.com
    Cc: edubezval@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: javi.merino@kernel.org
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pkondeti@codeaurora.org
    Cc: rjw@rjwysocki.net
    Cc: skannan@codeaurora.org
    Cc: smuckle@google.com
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Cc: tkjos@google.com
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Cc: viresh.kumar@linaro.org
    Link: https://lkml.kernel.org/r/20181203095628.11858-11-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar

    Quentin Perret
     

05 Jan, 2019

2 commits

  • So that we can also choose at runtime which system info to print on
    panic, in addition to setting it on the kernel cmdline.

    Link: http://lkml.kernel.org/r/1543398842-19295-3-git-send-email-feng.tang@intel.com
    Signed-off-by: Feng Tang
    Suggested-by: Steven Rostedt
    Acked-by: Steven Rostedt (VMware)
    Cc: Thomas Gleixner
    Cc: John Stultz
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Feng Tang
     
  • If the number of input parameters is less than the total parameters, an
    EINVAL error will be returned.

    For example, we use proc_doulongvec_minmax to pass up to two parameters
    with kern_table:

    {
            .procname       = "monitor_signals",
            .data           = &monitor_sigs,
            .maxlen         = 2*sizeof(unsigned long),
            .mode           = 0644,
            .proc_handler   = proc_doulongvec_minmax,
    },

    Reproduce:

    When passing two parameters, it works normally. But when passing only
    one parameter, an error "Invalid argument" (EINVAL) is returned.

    [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    1 2
    [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
    -bash: echo: write error: Invalid argument
    [root@cl150 ~]# echo $?
    1
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    3 2
    [root@cl150 ~]#

    The following is the result after applying this patch. No error is
    returned when the number of input parameters is less than the total
    parameters.

    [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    1 2
    [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
    [root@cl150 ~]# echo $?
    0
    [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
    3 2
    [root@cl150 ~]#

    There are three processing functions dealing with numeric parameters,
    __do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.

    This patch makes __do_proc_doulongvec_minmax behave as
    __do_proc_dointvec does, by adding a check on the 'left' parameter. In
    __do_proc_douintvec, the implementation explicitly does not support
    multiple inputs.

    static int __do_proc_douintvec(...){
            ...
            /*
             * Arrays are not supported, keep this simple. *Do not* add
             * support for them.
             */
            if (vleft != 1) {
                    *lenp = 0;
                    return -EINVAL;
            }
            ...
    }

    So only __do_proc_doulongvec_minmax has the problem. Most uses of
    proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax have just one
    parameter.
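
    The patched behaviour can be illustrated with a small userspace sketch.
    It is not the kernel code: parse_ulong_vec() is a hypothetical helper
    that fills at most "max" slots and treats running out of input as
    success, leaving the remaining slots untouched, which is what the
    "3 2" read-back above shows.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper, not the kernel implementation.
 * Parses up to max whitespace-separated unsigned longs from buf into vals.
 * Returns the number of values stored, or -EINVAL on malformed input.
 * Fewer inputs than slots is NOT an error; unfilled slots keep their value.
 * Input beyond the first max values is ignored.
 */
static int parse_ulong_vec(const char *buf, unsigned long *vals, size_t max)
{
    size_t n = 0;

    assert(buf != NULL && vals != NULL);

    while (n < max) {
        char *end;
        unsigned long v;

        while (isspace((unsigned char)*buf))
            buf++;
        if (*buf == '\0')
            break;                      /* input exhausted early: accept */
        if (!isdigit((unsigned char)*buf))
            return -EINVAL;             /* not a number */

        errno = 0;
        v = strtoul(buf, &end, 10);
        if (errno == ERANGE)
            return -EINVAL;             /* value does not fit */

        vals[n++] = v;
        buf = end;
    }
    return (int)n;
}
```

    Starting from the values {1, 2}, parsing "3" stores one value and
    leaves the second slot alone, so the array reads back as {3, 2}.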

    Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
    Signed-off-by: Cheng Lin
    Acked-by: Luis Chamberlain
    Reviewed-by: Kees Cook
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cheng Lin
     

29 Dec, 2018

1 commit

  • An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

    The kernel reduces the probability of such events by increasing the
    watermark sizes by calling set_recommended_min_free_kbytes early in the
    lifetime of the system. This works reasonably well in general but if
    there are enough sparsely populated pageblocks then the problem can still
    occur as enough memory is free overall and kswapd stays asleep.

    This patch introduces a watermark_boost_factor sysctl that allows a zone
    watermark to be temporarily boosted when an event that causes external
    fragmentation occurs. The boosting will stall allocations that would
    decrease free memory below the boosted low watermark. If the calling
    context allows it, kswapd is woken to reclaim an amount of memory relative
    to the size of the high watermark and the watermark_boost_factor until the
    boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
    to clean some of the pageblocks that may have been affected by the
    fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
    from reclaim context during this operation to avoid excessive system
    disruption in the name of fragmentation avoidance. Care is taken so that
    kswapd will do normal reclaim work if the system is really low on memory.
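
    A rough sketch of how such a boost could be computed follows. The
    commit message only says the amount is relative to the high watermark
    and watermark_boost_factor; the unit (fractions of 10,000), the
    per-event step of one pageblock and the struct and function names
    below are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-zone state; all sizes are in pages. */
struct zone_sketch {
    unsigned long high_wmark;       /* high watermark */
    unsigned long watermark_boost;  /* current temporary boost */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}

/*
 * Called once per fragmentation event. Raises the boost by one pageblock,
 * capped at boost_factor/10000 of the high watermark (assumed unit).
 * A factor of 0 disables boosting.
 */
static void boost_watermark_sketch(struct zone_sketch *z,
                                   unsigned long boost_factor,
                                   unsigned long pageblock_pages)
{
    unsigned long max_boost;

    assert(z != NULL);
    if (boost_factor == 0)
        return;

    max_boost = (unsigned long)((unsigned long long)z->high_wmark *
                                boost_factor / 10000);
    max_boost = max_ul(pageblock_pages, max_boost);
    z->watermark_boost = min_ul(z->watermark_boost + pageblock_pages,
                                max_boost);
}
```

    With a high watermark of 1000 pages, a factor of 15000 and 512-page
    pageblocks, repeated events raise the boost to 512, then 1024, then
    stop at the cap of 1500 pages.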

    This was evaluated using the same workloads as "mm, page_alloc: Spread
    allocations across zones before introducing fragmentation".

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9:  804694
    4.20-rc3+patch:                     408912 (49% reduction)
    4.20-rc3+patch1-4:                   18421 (98% reduction)

                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Amean fault-base-1      653.58 (   0.00%)    652.71 (   0.13%)
    Amean fault-huge-1        0.00 (   0.00%)    178.93 * -99.00%*

                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Percentage huge-1         0.00 (   0.00%)      5.12 ( 100.00%)

    Note that events causing external fragmentation are massively reduced by
    this patch, whether in comparison to the previous kernel or the vanilla
    kernel. The fault latency for huge pages appears to be increased but that
    is only because THP allocations were successful with the patch applied.
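
    The reduction percentages quoted with the event counts can be
    reproduced with simple integer arithmetic. The helper below is an
    illustrative sketch (the name percent_reduction is made up) and
    assumes the figures were rounded to the nearest whole percent.

```c
#include <assert.h>

/*
 * Percentage by which "after" is lower than "before", rounded to the
 * nearest whole percent using integer arithmetic only.
 */
static unsigned int percent_reduction(unsigned long long before,
                                      unsigned long long after)
{
    assert(before > 0 && after <= before);
    return (unsigned int)((100ULL * (before - after) + before / 2) / before);
}
```

    For the 1-socket run without madvise, percent_reduction(804694, 408912)
    gives 49 and percent_reduction(804694, 18421) gives 98, matching the
    figures listed for that configuration.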

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9:  291392
    4.20-rc3+patch:                     191187 (34% reduction)
    4.20-rc3+patch1-4:                   13464 (95% reduction)

    thpfioscale Fault Latencies
                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Min fault-base-1        912.00 (   0.00%)    905.00 (   0.77%)
    Min fault-huge-1        127.00 (   0.00%)    135.00 (  -6.30%)
    Amean fault-base-1     1467.55 (   0.00%)   1481.67 (  -0.96%)
    Amean fault-huge-1     1127.11 (   0.00%)   1063.88 *   5.61%*

                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Percentage huge-1        77.64 (   0.00%)     83.46 (   7.49%)

    As before, massive reduction in external fragmentation events, some jitter
    on latencies and an increase in THP allocation success rates.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9:  215698
    4.20-rc3+patch:                     200210 (7% reduction)
    4.20-rc3+patch1-4:                   14263 (93% reduction)

                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Amean fault-base-5     1346.45 (   0.00%)   1306.87 (   2.94%)
    Amean fault-huge-5     3418.60 (   0.00%)   1348.94 (  60.54%)

                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Percentage huge-5         0.78 (   0.00%)      7.91 ( 910.64%)

    There is a 93% reduction in events that cause fragmentation, a big
    reduction in the huge page fault latency, and a higher allocation
    success rate.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9:  166352
    4.20-rc3+patch:                     147463 (11% reduction)
    4.20-rc3+patch1-4:                   11095 (93% reduction)

    thpfioscale Fault Latencies
                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Amean fault-base-5     6217.43 (   0.00%)   7419.67 * -19.34%*
    Amean fault-huge-5     3163.33 (   0.00%)   3263.80 (  -3.18%)

                                   4.20.0-rc3           4.20.0-rc3
                                 lowzone-v5r8           boost-v5r8
    Percentage huge-5        95.14 (   0.00%)     87.98 (  -7.53%)

    There is a large reduction in fragmentation events with some jitter around
    the latencies and success rates. As before, the high THP allocation
    success rate does mean the system is under a lot of pressure. However, as
    the fragmentation events are reduced, it would be expected that the
    long-term allocation success rate would be higher.

    Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

04 Nov, 2018

1 commit

  • Remove one include of .
    No functional changes.

    Link: http://lkml.kernel.org/r/20181004134223.17735-1-michael@schupikov.de
    Signed-off-by: Michael Schupikov
    Reviewed-by: Richard Weinberger
    Acked-by: Luis Chamberlain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Schupikov
     

05 Sep, 2018

1 commit


24 Aug, 2018

1 commit

  • Disallows open of FIFOs or regular files not owned by the user in world
    writable sticky directories, unless the owner is the same as that of the
    directory or the file is opened without the O_CREAT flag. The purpose
    is to make data spoofing attacks harder. This protection can be turned
    on and off separately for FIFOs and regular files via sysctl, just like
    the symlinks/hardlinks protection. This patch is based on Openwall's
    "HARDEN_FIFO" feature by Solar Designer.
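
    The rule described above can be written as a small predicate. This is
    a userspace sketch of the stated policy only, not the kernel's check:
    the function name, the plain integer uid parameters and the local
    mode-bit macros are assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/* Local copies of the usual mode bits (S_ISVTX and S_IWOTH on Linux). */
#define SKETCH_STICKY_BIT      01000u
#define SKETCH_WORLD_WRITABLE  00002u

/*
 * Returns true if opening an existing FIFO or regular file would be
 * allowed under the policy described in the commit message.
 */
static bool open_allowed_sketch(unsigned int dir_mode, unsigned int dir_uid,
                                unsigned int file_uid, unsigned int opener_uid,
                                int open_flags)
{
    bool sticky_world_writable = (dir_mode & SKETCH_STICKY_BIT) &&
                                 (dir_mode & SKETCH_WORLD_WRITABLE);

    if (!sticky_world_writable)
        return true;    /* restriction only applies to such directories */
    if (!(open_flags & O_CREAT))
        return true;    /* opened without O_CREAT */
    if (file_uid == opener_uid)
        return true;    /* file is owned by the user opening it */
    if (file_uid == dir_uid)
        return true;    /* file owner matches the directory owner */
    return false;
}
```

    For example, in a root-owned directory with mode 01777, a user with
    uid 1001 opening a file owned by uid 1000 with O_CREAT is refused,
    while the same open without O_CREAT is allowed.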

    This is a brief list of old vulnerabilities that could have been prevented
    by this feature, some of them even allow for privilege escalation:

    CVE-2000-1134
    CVE-2007-3852
    CVE-2008-0525
    CVE-2009-0416
    CVE-2011-4834
    CVE-2015-1838
    CVE-2015-7442
    CVE-2016-7489

    This list is not meant to be complete. It's difficult to track down all
    vulnerabilities of this kind because they were often reported without any
    mention of this particular attack vector. In fact, before
    hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
    vehicle to exploit them.

    [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
    Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
    Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
    [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
    [keescook@chromium.org: adjust commit subject]
    Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
    Signed-off-by: Salvatore Mesoraca
    Signed-off-by: Kees Cook
    Suggested-by: Solar Designer
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Salvatore Mesoraca
     

23 Aug, 2018

2 commits

  • Fix a few typos/spellos in kernel/sysctl.c.

    Link: http://lkml.kernel.org/r/bb09a8b9-f984-6dd4-b07b-3ecaf200862e@infradead.org
    Signed-off-by: Randy Dunlap
    Acked-by: Kees Cook
    Cc: "Luis R. Rodriguez"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Currently the task hung checking interval is equal to the timeout; as a
    result, a hung task is detected anywhere between timeout and 2*timeout.
    This is fine for most interactive environments, but it hurts automated
    testing setups (syzbot). In an automated setup we need to strictly order
    CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so
    that an RCU stall is not detected as a task hung and a task hung is not
    detected as silent machine loss. The large variance in task hung detection
    timeout requires setting the silent machine loss timeout to a very large
    value (e.g. if task hung is 3 mins, then silent loss needs to be set to
    ~7 mins). The additional 3 minutes significantly reduce testing efficiency
    because usually we crash the kernel within a minute, and this can add
    hours to the bug localization process as it needs to do dozens of tests.

    Allow setting the checking interval separately from the timeout. This
    allows setting the timeout to, say, 3 minutes, but the checking interval
    to 10 secs.

    The interval is controlled via a new hung_task_check_interval_secs sysctl,
    similar to the existing hung_task_timeout_secs sysctl. The default value
    of 0 results in the current behavior: checking interval is equal to
    timeout.
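
    The effect on detection latency can be sketched as follows. The
    function and struct names are hypothetical, and clamping the interval
    to the timeout is an assumption; the commit message only states that
    0 means "same as the timeout".

```c
#include <assert.h>

/* Earliest and latest time, in seconds, at which a hang can be reported. */
struct detect_window {
    unsigned long earliest_secs;
    unsigned long latest_secs;
};

/* 0 selects the old behaviour: check once per timeout period. */
static unsigned long effective_check_interval(unsigned long timeout_secs,
                                              unsigned long check_interval_secs)
{
    if (check_interval_secs == 0)
        return timeout_secs;
    /* Assumed: an interval longer than the timeout is clamped to it. */
    return check_interval_secs < timeout_secs ? check_interval_secs
                                              : timeout_secs;
}

/*
 * A task must be stuck for at least timeout_secs, and the checker may run
 * up to one interval after that point.
 */
static struct detect_window hung_detect_window(unsigned long timeout_secs,
                                               unsigned long check_interval_secs)
{
    struct detect_window w;

    w.earliest_secs = timeout_secs;
    w.latest_secs = timeout_secs +
                    effective_check_interval(timeout_secs, check_interval_secs);
    return w;
}
```

    With a 180 second timeout, the default interval of 0 gives a window of
    180 to 360 seconds (timeout to 2*timeout), while an interval of 10
    narrows it to 180 to 190 seconds.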

    [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
    Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Cc: Paul E. McKenney
    Cc: Tetsuo Handa
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

16 Jul, 2018

1 commit

  • /proc/sys/kernel/sched_time_avg_ms entry is not used anywhere,
    remove it.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Cc: viresh.kumar@linaro.org
    Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

13 Jun, 2018

1 commit

  • The kzalloc() function has a 2-factor argument form, kcalloc(). This
    patch replaces cases of:

    kzalloc(a * b, gfp)

    with:

    kcalloc(a, b, gfp)

    as well as handling cases of:

    kzalloc(a * b * c, gfp)

    with:

    kzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kcalloc(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.
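
    The commit message does not spell out the motivation, but the usual
    reason for preferring the 2-factor form and array3_size() is that an
    open-coded product can overflow silently. The userspace sketch below
    shows the saturating idea; the *_sketch names are made up and this is
    not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stores a * b in *out and returns false, or returns true on overflow. */
static bool mul_overflows(size_t a, size_t b, size_t *out)
{
    assert(out != NULL);
    if (b != 0 && a > SIZE_MAX / b)
        return true;
    *out = a * b;
    return false;
}

/* a * b, saturating to SIZE_MAX so a later allocation fails cleanly. */
static size_t array_size_sketch(size_t a, size_t b)
{
    size_t r;

    return mul_overflows(a, b, &r) ? SIZE_MAX : r;
}

/* a * b * c with the same saturating behaviour. */
static size_t array3_size_sketch(size_t a, size_t b, size_t c)
{
    size_t ab, abc;

    if (mul_overflows(a, b, &ab) || mul_overflows(ab, c, &abc))
        return SIZE_MAX;
    return abc;
}
```

    A request for SIZE_MAX bytes cannot be satisfied, so an overflowing
    product turns into an allocation failure instead of a buffer that is
    smaller than the caller expects.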

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc
    + kcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc(sizeof(THING) * C2, ...)
    |
    kzalloc(sizeof(TYPE) * C2, ...)
    |
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(C1 * C2, ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

12 Apr, 2018

2 commits

  • Kdoc comments are added to the do_proc_dointvec_minmax_conv_param and
    do_proc_douintvec_minmax_conv_param structures that are used internally
    for range checking.

    The error codes returned by proc_dointvec_minmax() and
    proc_douintvec_minmax() are also documented.

    Link: http://lkml.kernel.org/r/1519926220-7453-3-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Acked-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: Manfred Spraul
    Cc: Matthew Wilcox
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Fix sizeof argument to be the same as the data variable name. Probably
    a copy/paste error.

    Mostly harmless since both variables are unsigned int.

    Fixes kernel bugzilla #197371:
    Possible access to unintended variable in "kernel/sysctl.c" line 1339
    https://bugzilla.kernel.org/show_bug.cgi?id=197371

    Link: http://lkml.kernel.org/r/e0d0531f-361e-ef5f-8499-32743ba907e1@infradead.org
    Signed-off-by: Randy Dunlap
    Reported-by: Petru Mihancea
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

20 Mar, 2018

1 commit

  • Currently one needs to test four kernel configurations to test the
    firmware API completely:

    0)
    CONFIG_FW_LOADER=y

    1)
    o CONFIG_FW_LOADER=y
    o CONFIG_FW_LOADER_USER_HELPER=y

    2)
    o CONFIG_FW_LOADER=y
    o CONFIG_FW_LOADER_USER_HELPER=y
    o CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y

    3) When CONFIG_FW_LOADER=m the built-in stuff is disabled, we have
    no current tests for this.

    We can reduce the requirements to three kernel configurations by making
    fw_config.force_sysfs_fallback a proc knob we can flip on and off. For
    kernels that disable CONFIG_IKCONFIG_PROC this also lets one inspect
    whether CONFIG_FW_LOADER_USER_HELPER_FALLBACK was enabled at build time by
    checking the proc value at boot time.

    Acked-by: Kees Cook
    Signed-off-by: Luis R. Rodriguez
    Signed-off-by: Greg Kroah-Hartman

    Luis R. Rodriguez
     

07 Feb, 2018

1 commit

  • A pipe's size is represented as an 'unsigned int'. As expected, writing a
    value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with
    EINVAL. However, the F_SETPIPE_SZ fcntl silently truncates such values to
    32 bits, rather than failing with EINVAL as expected. (It *does* fail
    with EINVAL for values above (1 << 31) but
    Acked-by: Kees Cook
    Acked-by: Joe Lawrence
    Cc: Alexander Viro
    Cc: "Luis R . Rodriguez"
    Cc: Michael Kerrisk
    Cc: Mikulas Patocka
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers