10 Mar, 2019

1 commit

  • commit f612acfae86af7ecad754ae6a46019be9da05b8e upstream.

    syzkaller reported this:
    BUG: memory leak
    unreferenced object 0xffffc9000488d000 (size 9195520):
    comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
    hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00 ................
    02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff ..........z.....
    backtrace:
    [] __vmalloc_node mm/vmalloc.c:1795 [inline]
    [] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
    [] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
    [] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
    [] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
    [] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
    [] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    It should go to the 'out_free' label to free the allocated buf when
    kernel_read fails.

    Fixes: 39d637af5aa7 ("vfs: forbid write access when reading a file into memory")
    Signed-off-by: YueHaibing
    Signed-off-by: Al Viro
    Cc: Thibaut Sautereau
    Signed-off-by: Greg Kroah-Hartman

    YueHaibing
     

22 Aug, 2018

1 commit

  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead of
    something only for very special cases. This part starts using
    PIDTYPE_TGID enough so that in __send_signal, where signals are
    actually delivered, we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

27 Jul, 2018

1 commit

  • vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
    VMA. This is unreliable as ->mmap may not set ->vm_ops.

    False-positive vma_is_anonymous() may lead to crashes:

    next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
    prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
    pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
    flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
    ------------[ cut here ]------------
    kernel BUG at mm/memory.c:1422!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
    RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
    RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
    RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
    Call Trace:
    unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
    zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
    unmap_mapping_range_vma mm/memory.c:2792 [inline]
    unmap_mapping_range_tree mm/memory.c:2813 [inline]
    unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
    unmap_mapping_range+0x48/0x60 mm/memory.c:2880
    truncate_pagecache+0x54/0x90 mm/truncate.c:800
    truncate_setsize+0x70/0xb0 mm/truncate.c:826
    simple_setattr+0xe9/0x110 fs/libfs.c:409
    notify_change+0xf13/0x10f0 fs/attr.c:335
    do_truncate+0x1ac/0x2b0 fs/open.c:63
    do_sys_ftruncate+0x492/0x560 fs/open.c:205
    __do_sys_ftruncate fs/open.c:215 [inline]
    __se_sys_ftruncate fs/open.c:213 [inline]
    __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reproducer:

    [The reproducer's #include lines and the body of main() were lost in
    extraction; the surviving fragments are the KCOV ioctl definitions:]

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE _IO('c', 100)
    #define KCOV_DISABLE _IO('c', 101)

    This can be fixed by assigning anonymous VMAs own vm_ops and not relying
    on it being NULL.

    If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
    dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.

    Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jul, 2018

2 commits

  • Like vm_area_dup(), it initializes the anon_vma_chain head, and the
    basic mm pointer.

    The rest of the fields end up being different for different users,
    although the plan is to also initialize the 'vm_ops' field to a dummy
    entry.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The vm_area_struct is one of the most fundamental memory management
    objects, but the management of it is entirely open-coded everywhere,
    ranging from allocation and freeing (using kmem_cache_[z]alloc and
    kmem_cache_free) to initializing all the fields.

    We want to unify this in order to end up having some unified
    initialization of the vmas, and the first step to this is to at least
    have basic allocation functions.

    Right now those functions are literally just wrappers around the
    kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

    to the point where the old vma passed in to the vm_area_dup() function
    isn't even used yet (because I've left all the old manual initialization
    alone).

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jul, 2018

1 commit

    Everywhere except in the pid array we distinguish between a task's pid
    and a task's tgid (thread group id). Even in the enumeration we want
    that distinction sometimes, so we have added __PIDTYPE_TGID. With
    leader_pid we almost have an implementation of PIDTYPE_TGID in struct
    signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra hlist_node (a pair of pointers) added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long-term potential is
    allowing zombie thread group leaders to exit, which will remove a lot
    more special cases in the code.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

11 Jun, 2018

1 commit

  • Pull restartable sequence support from Thomas Gleixner:
    "The restartable sequences syscall (finally):

    After a lot of back and forth discussion and massive delays caused by
    the speculative distraction of maintainers, the core set of
    restartable sequences has finally reached a consensus.

    It comes with the basic non-disputed core implementation along with
    support for arm, powerpc and x86, and a full set of selftests.

    It was exposed to linux-next earlier this week, so it does not fully
    comply with the merge window requirements, but there is really no
    point to drag it out for yet another cycle"

    * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rseq/selftests: Provide Makefile, scripts, gitignore
    rseq/selftests: Provide parametrized tests
    rseq/selftests: Provide basic percpu ops test
    rseq/selftests: Provide basic test
    rseq/selftests: Provide rseq library
    selftests/lib.mk: Introduce OVERRIDE_TARGETS
    powerpc: Wire up restartable sequences system call
    powerpc: Add syscall detection for restartable sequences
    powerpc: Add support for restartable sequences
    x86: Wire up restartable sequence system call
    x86: Add support for restartable sequences
    arm: Wire up restartable sequences system call
    arm: Add syscall detection for restartable sequences
    arm: Add restartable sequences support
    rseq: Introduce restartable sequences system call
    uapi/headers: Provide types_32_64.h

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Expose a new system call allowing each thread to register one userspace
    memory area to be used as an ABI between kernel and user-space for two
    purposes: user-space restartable sequences and quick access to read the
    current CPU number value from user-space.

    * Restartable sequences (per-cpu atomics)

    Restartable sequences allow user-space to perform update operations on
    per-cpu data without requiring heavy-weight atomic operations.

    The restartable critical sections (percpu atomics) work has been started
    by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
    critical sections. [1] [2] The re-implementation proposed here brings a
    few simplifications to the ABI which facilitates porting to other
    architectures and speeds up the user-space fast path.

    Here are benchmarks of various rseq use-cases.

    Test hardware:

    arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
    x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

    The following benchmarks were all performed on a single thread.

    * Per-CPU statistic counter increment

             getcpu+atomic (ns/op)   rseq (ns/op)   speedup
    arm32:           344.0               31.4         11.0
    x86-64:           15.3                2.0          7.7

    * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
    per-cpu buffer

             getcpu+atomic (ns/op)   rseq (ns/op)   speedup
    arm32:          2502.0             2250.0          1.1
    x86-64:          117.4               98.0          1.2

    * liburcu percpu: lock-unlock pair, dereference, read/compare word

             getcpu+atomic (ns/op)   rseq (ns/op)   speedup
    arm32:           751.0              128.5          5.8
    x86-64:           53.4               28.6          1.9

    * jemalloc memory allocator adapted to use rseq

    Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
    rseq 2016 implementation):

    The production workload response-time has 1-2% gain avg. latency, and
    the P99 overall latency drops by 2-3%.

    * Reading the current CPU number

    Speeding up reading the current CPU number on which the caller thread is
    running is done by keeping the current CPU number up to date within the
    cpu_id field of the memory area registered by the thread. This is done
    by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
    current thread. Upon return to user-space, a notify-resume handler
    updates the current CPU value within the registered user-space memory
    area. User-space can then read the current CPU number directly from
    memory.

    Keeping the current cpu id in a memory area shared between kernel and
    user-space is an improvement over the mechanisms currently available to
    read the current CPU number, and has the following benefits over
    alternative approaches:

    - 35x speedup on ARM vs system call through glibc
    - 20x speedup on x86 compared to calling glibc, which calls vdso
    executing a "lsl" instruction,
    - 14x speedup on x86 compared to inlined "lsl" instruction,
    - Unlike vdso approaches, this cpu_id value can be read from an inline
    assembly, which makes it a useful building block for restartable
    sequences.
    - The approach of reading the cpu id through memory mapping shared
    between kernel and user-space is portable (e.g. ARM), which is not the
    case for the lsl-based x86 vdso.

    On x86, yet another possible approach would be to use the gs segment
    selector to point to user-space per-cpu data. This approach performs
    similarly to the cpu id cache, but it has two disadvantages: it is
    not portable, and it is incompatible with existing applications already
    using the gs segment selector for other purposes.

    Benchmarking various approaches for reading the current CPU number:

    ARMv7 Processor rev 4 (v7l)
    Machine model: Cubietruck
    - Baseline (empty loop): 8.4 ns
    - Read CPU from rseq cpu_id: 16.7 ns
    - Read CPU from rseq cpu_id (lazy register): 19.8 ns
    - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
    - getcpu system call: 234.9 ns

    x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
    - Baseline (empty loop): 0.8 ns
    - Read CPU from rseq cpu_id: 0.8 ns
    - Read CPU from rseq cpu_id (lazy register): 0.8 ns
    - Read using gs segment selector: 0.8 ns
    - "lsl" inline assembly: 13.0 ns
    - glibc 2.19-0ubuntu6 getcpu: 16.6 ns
    - getcpu system call: 53.9 ns

    - Speed (benchmark taken on v8 of patchset)

    Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
    expectations, that enabling CONFIG_RSEQ slightly accelerates the
    scheduler:

    Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
    2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
    saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
    kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
    restartable sequences series applied.

    * CONFIG_RSEQ=n

    avg.: 41.37 s
    std.dev.: 0.36 s

    * CONFIG_RSEQ=y

    avg.: 40.46 s
    std.dev.: 0.33 s

    - Size

    On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
    567 bytes, and the data size increase of vmlinux is 5696 bytes.

    [1] https://lwn.net/Articles/650333/
    [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Joel Fernandes
    Cc: Catalin Marinas
    Cc: Dave Watson
    Cc: Will Deacon
    Cc: Andi Kleen
    Cc: "H . Peter Anvin"
    Cc: Chris Lameter
    Cc: Russell King
    Cc: Andrew Hunter
    Cc: Michael Kerrisk
    Cc: "Paul E . McKenney"
    Cc: Paul Turner
    Cc: Boqun Feng
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Ben Maurer
    Cc: Alexander Viro
    Cc: linux-api@vger.kernel.org
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
    Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
    Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com

    Mathieu Desnoyers
     

24 May, 2018

1 commit

    Introduce helper:

        int fork_usermode_blob(void *data, size_t len, struct umh_info *info);

        struct umh_info {
                struct file *pipe_to_umh;
                struct file *pipe_from_umh;
                pid_t pid;
        };

    that GPL'ed kernel modules (signed or unsigned) can use to execute part
    of their own data as a swappable user-mode process.

    The kernel will do:
    - allocate a unique file in tmpfs
    - populate that file with [data, data + len] bytes
    - user-mode-helper code will do_execve that file and, before the process
    starts, the kernel will create two unix pipes for bidirectional
    communication between kernel module and umh
    - close tmpfs file, effectively deleting it
    - the fork_usermode_blob will return zero on success and populate
    'struct umh_info' with two unix pipes and the pid of the user process

    As the first step in the development of the bpfilter project
    the fork_usermode_blob() helper is introduced to allow user mode code
    to be invoked from a kernel module. The idea is that user mode code plus
    normal kernel module code are built as part of the kernel build
    and installed as traditional kernel module into distro specified location,
    such that from a distribution point of view, there is
    no difference between regular kernel modules and kernel modules + umh code.
    Such modules can be signed, modprobed, rmmod'ed, etc. The use of this
    new helper by a kernel module doesn't make it special in any way from
    the kernel and user-space tooling point of view.

    Such an approach enables the kernel to delegate functionality
    traditionally done by kernel modules to user-space processes (either
    root or !root) and reduces the security attack surface of the new code.
    Buggy umh code would crash the user process, but not the kernel.
    Another advantage is that the umh code of the kernel module can be
    debugged and tested from user space (e.g. opening the possibility to
    run clang sanitizers, fuzzers or user-space test suites on the umh
    code). In the case of the bpfilter project, such an architecture allows
    the complex control plane to live in user space while the bpf-based
    data plane stays in the kernel.

    Since the umh can crash, be oom-killed by the kernel, or be killed by
    the admin, the kernel module that uses it (like bpfilter) needs to
    manage the lifetime of the umh on its own via the two unix pipes and
    the pid of the umh.

    The exit path of such a kernel module should kill the umh it started,
    so that rmmod of the kernel module cleans up the corresponding umh —
    just as a kernel module that does kmalloc() should kfree() it in its
    exit path.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

12 Apr, 2018

3 commits

  • Since the stack rlimit is used in multiple places during exec and it can
    be changed via other threads (via setrlimit()) or processes (via
    prlimit()), the assumption that the value doesn't change cannot be made.
    This leads to races with mm layout selection and argument size
    calculations. This changes the exec path to use the rlimit stored in
    bprm instead of in current. Before starting the thread, the bprm stack
    rlimit is stored back to current.

    Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Andy Lutomirski
    Reported-by: Brad Spengler
    Acked-by: Michal Hocko
    Cc: Ben Hutchings
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Provide a final callback into fs/exec.c before start_thread() takes
    over, to handle any last-minute changes, like the coming restoration of
    the stack limit.

    Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Cc: Greg KH
    Cc: Hugh Dickins
    Cc: "Jason A. Donenfeld"
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

19 Mar, 2018

1 commit

  • The LSM check should happen after the file has been confirmed to be
    unchanging. Without this, we could have a race between the Time of Check
    (the call to security_kernel_read_file() which could read the file and
    make access policy decisions) and the Time of Use (starting with
    kernel_read_file()'s reading of the file contents). In theory, file
    contents could change between the two.

    Signed-off-by: Kees Cook
    Reviewed-by: Mimi Zohar
    Signed-off-by: James Morris

    Kees Cook
     

04 Jan, 2018

1 commit

  • This is a logical revert of commit e37fdb785a5f ("exec: Use secureexec
    for setting dumpability")

    This weakens dumpability back to checking only for uid/gid changes in
    current (which is useless), but userspace depends on dumpability not
    being tied to secureexec.

    https://bugzilla.redhat.com/show_bug.cgi?id=1528633

    Reported-by: Tom Horsley
    Fixes: e37fdb785a5f ("exec: Use secureexec for setting dumpability")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

18 Dec, 2017

1 commit

  • This reverts commit 04e35f4495dd560db30c25efca4eecae8ec8c375.

    SELinux runs with secureexec for all non-"noatsecure" domain transitions,
    which means lots of processes end up hitting the stack hard-limit change
    that was introduced in order to fix a race with prlimit(). That race fix
    will need to be redesigned.

    Reported-by: Laura Abbott
    Reported-by: Tomáš Trnka
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

15 Dec, 2017

1 commit

  • gcc-8 warns about using strncpy() with the source size as the limit:

    fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]

    This is indeed slightly suspicious, as it protects us from source
    arguments without NUL-termination, but does not guarantee that the
    destination is terminated.

    This keeps the strncpy() to ensure we have a properly padded target
    buffer, but ensures that we use the correct length, by passing the
    actual length of the destination buffer as well as adding a build-time
    check to ensure it is exactly TASK_COMM_LEN.

    There are only 23 callsites, all of which I reviewed to ensure this is
    currently the case. We could get away with doing only the check or
    passing the right length, but it doesn't hurt to do both.

    Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Cc: Alexander Viro
    Cc: Peter Zijlstra
    Cc: Serge Hallyn
    Cc: James Morris
    Cc: Aleksa Sarai
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

30 Nov, 2017

1 commit

  • While the defense-in-depth RLIMIT_STACK limit on setuid processes was
    protected against races from other threads calling setrlimit(), I missed
    protecting it against races from external processes calling prlimit().
    This adds locking around the change and makes sure that rlim_max is set
    too.

    Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Brad Spengler
    Acked-by: Serge Hallyn
    Cc: James Morris
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: Jiri Slaby
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers intent to use this functionality early on, for
    example, before the process spawns its second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

04 Oct, 2017

1 commit

  • Patch series "exec: binfmt_misc: fix use-after-free, kill
    iname[BINPRM_BUF_SIZE]".

    It looks like this code was always wrong, then commit 948b701a607f
    ("binfmt_misc: add persistent opened binary handler for containers")
    added more problems.

    This patch (of 6):

    load_script() can simply use i_name instead; it points into bprm->buf[]
    and nobody can change this memory until we call prepare_binprm().

    The only complication is that we need to also change the signature of
    bprm_change_interp() but this change looks good too.

    While at it, do whitespace/style cleanups.

    NOTE: the real motivation for this change is that people want to
    increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
    this looks more complicated because afaics it is very buggy.

    Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Travis Gummels
    Cc: Ben Woodard
    Cc: Jim Foraker
    Cc:
    Cc: Al Viro
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Sep, 2017

2 commits

  • This patch constifies the path argument to kernel_read_file_from_path().

    Signed-off-by: Mimi Zohar
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Mimi Zohar
     
  • Pull more set_fs removal from Al Viro:
    "Christoph's 'use kernel_read and friends rather than open-coding
    set_fs()' series"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: unexport vfs_readv and vfs_writev
    fs: unexport vfs_read and vfs_write
    fs: unexport __vfs_read/__vfs_write
    lustre: switch to kernel_write
    gadget/f_mass_storage: stop messing with the address limit
    mconsole: switch to kernel_read
    btrfs: switch write_buf to kernel_write
    net/9p: switch p9_fd_read to kernel_write
    mm/nommu: switch do_mmap_private to kernel_read
    serial2002: switch serial2002_tty_write to kernel_{read/write}
    fs: make the buf argument to __kernel_write a void pointer
    fs: fix kernel_write prototype
    fs: fix kernel_read prototype
    fs: move kernel_read to fs/read_write.c
    fs: move kernel_write to fs/read_write.c
    autofs4: switch autofs4_write to __kernel_write
    ashmem: switch to ->read_iter

    Linus Torvalds
     

14 Sep, 2017

1 commit

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic, it becomes much less clear when to use the
    high-level GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE which in itself is tricky because basically none of
    the existing callers provide a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use it without any measurement. This suggests that GFP_TEMPORARY just
    encourages cargo-cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with high-level semantics should be clearly defined to prevent
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    replacing all existing users with plain GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect short-term
    allocations but I propose starting from a clear semantic definition and
    only then add users with proper justification.

    This was brought up before LSF this year by Matthew [1] and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
    seems to be a heuristic without any measured advantage for most (if not
    all) its current users. The follow up discussion has revealed that
    opinions on what might be temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected I propose to simply remove the flag
    and start from scratch if we really need a semantic for short term
    allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Sep, 2017

2 commits


02 Aug, 2017

10 commits

  • Instead of an additional secureexec check for pdeath_signal, just move it
    up into the initial secureexec test. Neither perf nor arch code touches
    pdeath_signal, so the relocation shouldn't change anything.

    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn

    Kees Cook
     
  • For a secureexec, before memory layout selection has happened, reset the
    stack rlimit to something sane to avoid the caller having control over
    the resulting layouts.

    $ ulimit -s
    8192
    $ ulimit -s unlimited
    $ /bin/sh -c 'ulimit -s'
    unlimited
    $ sudo /bin/sh -c 'ulimit -s'
    8192

    Cc: Linus Torvalds
    Signed-off-by: Kees Cook
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn

    Kees Cook
     
  • Since it's already valid to set dumpability in the early part of
    setup_new_exec(), we can consolidate the logic into a single place.
    The BINPRM_FLAGS_ENFORCE_NONDUMP is set during would_dump() calls
    before setup_new_exec(), so its test is safe to move as well.

    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
     
  • Like dumpability, clearing pdeath_signal happens both in setup_new_exec()
    and later in commit_creds(). The test in setup_new_exec() is different
    from all other privilege comparisons, though: it is checking the new cred
    (bprm) uid vs the old cred (current) euid. This appears to be a bug,
    introduced by commit a6f76f23d297 ("CRED: Make execve() take advantage of
    copy-on-write credentials"):

    - if (bprm->e_uid != current_euid() ||
    - bprm->e_gid != current_egid()) {
    - set_dumpable(current->mm, suid_dumpable);
    + if (bprm->cred->uid != current_euid() ||
    + bprm->cred->gid != current_egid()) {

    It was bprm euid vs current euid (and egids), but the effective got
    dropped. Nothing in the exec flow changes bprm->cred->uid (nor gid).
    The call traces are:

    prepare_bprm_creds()
    prepare_exec_creds()
    prepare_creds()
    memcpy(new_creds, old_creds, ...)
    security_prepare_creds() (unimplemented by commoncap)
    ...
    prepare_binprm()
    bprm_fill_uid()
    resets euid/egid to current euid/egid
    sets euid/egid on bprm based on set*id file bits
    security_bprm_set_creds()
    cap_bprm_set_creds()
    handle all caps-based manipulations

    so this test is effectively a test of current_uid() vs current_euid(),
    which is wrong, just like the prior dumpability tests were wrong.

    The commit log says "Clear pdeath_signal and set dumpable on
    certain circumstances that may not be covered by commit_creds()." This
    may be meaning the earlier old euid vs new euid (and egid) test that
    got changed.

    Luckily, as with dumpability, this is all masked by commit_creds()
    which performs old/new euid and egid tests and clears pdeath_signal.

    And again, like dumpability, we should include LSM secureexec logic for
    pdeath_signal clearing. For example, Smack goes out of its way to clear
    pdeath_signal when it finds a secureexec condition.

    Cc: David Howells
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
     
  • The examination of "current" to decide dumpability is wrong. This was a
    check of an euid/uid (or egid/gid) mismatch in the existing process,
    not the newly created one. This appears to stretch back into even the
    "history.git" tree. Luckily, dumpability is later set in commit_creds().
    In earlier kernel versions before creds existed, similar checks also
    existed late in the exec flow, covering up the mistake as far back as I
    could find.

    Note that because the commit_creds() check examines differences of euid,
    uid, egid, gid, and capabilities between the old and new creds, it would
    look like the setup_new_exec() dumpability test could be entirely removed.
    However, the secureexec test may cover a different set of tests (specific
    to the LSMs) than what commit_creds() checks for. So, fix this test to
    use secureexec (the removed euid tests are redundant to the commoncap
    secureexec checks now).

    Cc: David Howells
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
     
  • This removes the bprm_secureexec hook since the logic has been folded into
    the bprm_set_creds hook for all LSMs now.

    Cc: Eric W. Biederman
    Signed-off-by: Kees Cook
    Reviewed-by: John Johansen
    Acked-by: James Morris
    Acked-by: Serge Hallyn

    Kees Cook
     
  • The commoncap implementation of the bprm_secureexec hook is the only LSM
    that depends on the final call to its bprm_set_creds hook (since it may
    be called for multiple files, it ignores bprm->called_set_creds). As a
    result, it cannot safely _clear_ bprm->secureexec since other LSMs may
    have set it. Instead, remove the bprm_secureexec hook by introducing a
    new flag to bprm specific to commoncap: cap_elevated. This is similar to
    cap_effective, but that is used for a specific subset of elevated
    privileges, and exists solely to track state from bprm_set_creds to
    bprm_secureexec. As such, it will be removed in the next patch.

    Here, set the new bprm->cap_elevated flag when setuid/setgid has happened
    from bprm_fill_uid() or fscapabilities have been prepared. This temporarily
    moves the bprm_secureexec hook to a static inline. The helper will be
    removed in the next patch; this makes the step easier to review and bisect,
    since this does not introduce any changes to inputs nor outputs to the
    "elevated privileges" calculation.

    The new flag is merged with the bprm->secureexec flag in setup_new_exec()
    since this marks the end of any further prepare_binprm() calls.

    Cc: Andy Lutomirski
    Signed-off-by: Kees Cook
    Reviewed-by: Andy Lutomirski
    Acked-by: James Morris
    Acked-by: Serge Hallyn

    Kees Cook
     
  • The bprm_secureexec hook can be moved earlier. Right now, it is called
    during create_elf_tables(), via load_binary(), via search_binary_handler(),
    via exec_binprm(). Nearly all (see exception below) state used by
    bprm_secureexec is created during the bprm_set_creds hook, called from
    prepare_binprm().

    For all LSMs (except commoncaps described next), only the first execution
    of bprm_set_creds takes any effect (they all check bprm->called_set_creds
    which prepare_binprm() sets after the first call to the bprm_set_creds
    hook). However, all these LSMs also only do anything with bprm_secureexec
    when they detected a secure state during their first run of bprm_set_creds.
    Therefore, it is functionally identical to move the detection into
    bprm_set_creds, since the results from secureexec here only need to be
    based on the first call to the LSM's bprm_set_creds hook.

    The single exception is that the commoncaps secureexec hook also examines
    euid/uid and egid/gid differences which are controlled by bprm_fill_uid(),
    via prepare_binprm(), which can be called multiple times (e.g.
    binfmt_script, binfmt_misc), and may clear the euid/egid for the final
    load (i.e. the script interpreter). However, while commoncaps specifically
    ignores bprm->cred_prepared, and runs its bprm_set_creds hook each time
    prepare_binprm() may get called, it needs to base the secureexec decision
    on the final call to bprm_set_creds. As a result, it will need special
    handling.

    To begin this refactoring, this adds the secureexec flag to the bprm
    struct, and calls the secureexec hook during setup_new_exec(). This is
    safe since all the cred work is finished (and past the point of no return).
    This explicit call will be removed in later patches once the hook has been
    removed.

    Cc: David Howells
    Signed-off-by: Kees Cook
    Reviewed-by: John Johansen
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
     
  • In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
    the comment about the point of no return should have stayed in
    flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
    changes in commits c89681ed7d0e ("remove steal_locks()"), and
    fd8328be874f ("sanitize handling of shared descriptor tables in failing
    execve()") made it look like it meant the current->sas_ss_sp line instead.

    The comment was referring to the fact that once bprm->mm is NULL, all
    failures from a binfmt load_binary hook (e.g. load_elf_binary), will
    get SEGV raised against current. Move this comment and expand the
    explanation a bit, putting it above the assignment this time, and add
    details about the true nature of "point of no return" being the call
    to flush_old_exec() itself.

    This also removes an erroneous comment about when credentials are being
    installed. That has its own dedicated function, install_exec_creds(),
    which carries a similar (and correct) comment, so remove the bogus comment
    where installation is not actually happening.

    Cc: David Howells
    Cc: Eric W. Biederman
    Signed-off-by: Kees Cook
    Acked-by: "Eric W. Biederman"
    Acked-by: Serge Hallyn

    Kees Cook
     
  • The cred_prepared bprm flag has a misleading name. It has nothing to do
    with the bprm_prepare_cred hook, and actually tracks if bprm_set_creds has
    been called. Rename this flag and improve its comment.

    Cc: David Howells
    Cc: Stephen Smalley
    Cc: Casey Schaufler
    Signed-off-by: Kees Cook
    Acked-by: John Johansen
    Acked-by: James Morris
    Acked-by: Paul Moore
    Acked-by: Serge Hallyn

    Kees Cook
     

08 Jul, 2017

1 commit


24 Jun, 2017

1 commit

  • When limiting the argv/envp strings during exec to 1/4 of the stack limit,
    the storage of the pointers to the strings was not included. This means
    that an exec with huge numbers of tiny strings could eat 1/4 of the stack
    limit in strings and then additional space would be later used by the
    pointers to the strings.

    For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721
    single-byte strings would consume less than 2MB of stack, the max (8MB /
    4) amount allowed, but the pointers to the strings would consume the
    remaining additional stack space (1677721 * 4 == 6710884).

    The result (1677721 + 6710884 == 8388605) would exhaust stack space
    entirely. Controlling this stack exhaustion could result in
    pathological behavior in setuid binaries (CVE-2017-1000365).

    [akpm@linux-foundation.org: additional commenting from Kees]
    Fixes: b6a2fea39318 ("mm: variable length argument support")
    Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Alexander Viro
    Cc: Qualys Security Advisory
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

20 Mar, 2017

1 commit

  • Intel supports faulting on the CPUID instruction beginning with Ivy Bridge.
    When enabled, the processor will fault on attempts to execute the CPUID
    instruction with CPL>0. Exposing this feature to userspace will allow a
    ptracer to trap and emulate the CPUID instruction.

    When supported, this feature is controlled by toggling bit 0 of
    MSR_MISC_FEATURES_ENABLES. It is documented in detail in Section 2.3.2 of
    https://bugzilla.kernel.org/attachment.cgi?id=243991

    Implement a new pair of arch_prctls, available on both x86-32 and x86-64.

    ARCH_GET_CPUID: Returns the current CPUID state, either 0 if CPUID faulting
    is enabled (and thus the CPUID instruction is not available) or 1 if
    CPUID faulting is not enabled.

    ARCH_SET_CPUID: Set the CPUID state to the second argument. If
    cpuid_enabled is 0 CPUID faulting will be activated, otherwise it will
    be deactivated. Returns ENODEV if CPUID faulting is not supported on
    this system.

    The state of the CPUID faulting flag is propagated across forks, but reset
    upon exec.

    Signed-off-by: Kyle Huey
    Cc: Grzegorz Andrejczuk
    Cc: kvm@vger.kernel.org
    Cc: Radim Krčmář
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: linux-kselftest@vger.kernel.org
    Cc: Nadav Amit
    Cc: Robert O'Callahan
    Cc: Richard Weinberger
    Cc: "Rafael J. Wysocki"
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Len Brown
    Cc: Shuah Khan
    Cc: user-mode-linux-devel@lists.sourceforge.net
    Cc: Jeff Dike
    Cc: Alexander Viro
    Cc: user-mode-linux-user@lists.sourceforge.net
    Cc: David Matlack
    Cc: Boris Ostrovsky
    Cc: Dmitry Safonov
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Paolo Bonzini
    Link: http://lkml.kernel.org/r/20170320081628.18952-9-khuey@kylehuey.com
    Signed-off-by: Thomas Gleixner

    Kyle Huey
     

02 Mar, 2017

2 commits

  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched/numa_balancing.h>

    We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar