18 Sep, 2019

1 commit

  • Pull core timer updates from Thomas Gleixner:
    "Timers and timekeeping updates:

    - A large overhaul of the posix CPU timer code which is a preparation
    for moving the CPU timer expiry out into task work so it can be
    properly accounted on the task/process.

    An update to the bogus permission checks will come later during the
    merge window as feedback was not complete before heading off for
    travel.

    - Switch the timerqueue code to use cached rbtrees and get rid of the
    homebrewed caching of the leftmost node.

    - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
    single function

    - Implement the separation of hrtimers to be forced to expire in hard
    interrupt context even when PREEMPT_RT is enabled and mark the
    affected timers accordingly.

    - Implement a mechanism for hrtimers and the timer wheel to protect
    RT against priority inversion and live lock issues when a (hr)timer
    which should be canceled is currently executing the callback.
    Instead of infinitely spinning, the task which tries to cancel the
    timer blocks on a per cpu base expiry lock which is held and
    released by the (hr)timer expiry code.

    - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
    resulting in faster access to timekeeping functions.

    - Updates to various clocksource/clockevent drivers and their device
    tree bindings.

    - The usual small improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    posix-cpu-timers: Fix permission check regression
    posix-cpu-timers: Always clear head pointer on dequeue
    hrtimer: Add a missing bracket and hide `migration_base' on !SMP
    posix-cpu-timers: Make expiry_active check actually work correctly
    posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
    tick: Mark sched_timer to expire in hard interrupt context
    hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
    x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
    posix-cpu-timers: Utilize timerqueue for storage
    posix-cpu-timers: Move state tracking to struct posix_cputimers
    posix-cpu-timers: Deduplicate rlimit handling
    posix-cpu-timers: Remove pointless comparisons
    posix-cpu-timers: Get rid of 64bit divisions
    posix-cpu-timers: Consolidate timer expiry further
    posix-cpu-timers: Get rid of zero checks
    rlimit: Rewrite non-sensical RLIMIT_CPU comment
    posix-cpu-timers: Respect INFINITY for hard RTTIME limit
    posix-cpu-timers: Switch thread group sampling to array
    posix-cpu-timers: Restructure expiry array
    posix-cpu-timers: Remove cputime_expires
    ...

    Linus Torvalds
     

17 Sep, 2019

1 commit

  • Pull x86 cpu-feature updates from Ingo Molnar:

    - Rework the Intel model names symbols/macros, which were decades of
    ad-hoc extensions and added random noise. It's now a coherent, easy
    to follow nomenclature.

    - Add new Intel CPU model IDs:
    - "Tiger Lake" desktop and mobile models
    - "Elkhart Lake" model ID
    - and the "Lightning Mountain" variant of Airmont, plus support code

    - Add the new AVX512_VP2INTERSECT instruction to cpufeatures

    - Remove Intel MPX user-visible APIs and the self-tests, because the
    toolchain (gcc) is not supporting it going forward. This is the
    first, lowest-risk phase of MPX removal.

    - Remove X86_FEATURE_MFENCE_RDTSC

    - Various smaller cleanups and fixes

    * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    x86/cpu: Update init data for new Airmont CPU model
    x86/cpu: Add new Airmont variant to Intel family
    x86/cpu: Add Elkhart Lake to Intel family
    x86/cpu: Add Tiger Lake to Intel family
    x86: Correct misc typos
    x86/intel: Add common OPTDIFFs
    x86/intel: Aggregate microserver naming
    x86/intel: Aggregate big core graphics naming
    x86/intel: Aggregate big core mobile naming
    x86/intel: Aggregate big core client naming
    x86/cpufeature: Explain the macro duplication
    x86/ftrace: Remove mcount() declaration
    x86/PCI: Remove superfluous returns from void functions
    x86/msr-index: Move AMD MSRs where they belong
    x86/cpu: Use constant definitions for CPU models
    lib: Remove redundant ftrace flag removal
    x86/crash: Remove unnecessary comparison
    x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
    x86: Remove X86_FEATURE_MFENCE_RDTSC
    x86/mpx: Remove MPX APIs
    ...

    Linus Torvalds
     

28 Aug, 2019

2 commits

  • Deactivation of the expiry cache is done by setting all clock caches to
    0. That requires a check for zero in every place which updates the
    expiry cache:

    if (cache == 0 || new < cache)
            cache = new;

    Use U64_MAX as the deactivated value, which allows the zero checks to be
    removed when updating the cache and reduces the update to the obvious check:

    if (new < cache)
            cache = new;

    This also removes the weird workaround in do_prlimit() which was required
    to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing
    in 0 to the posix CPU timer code would have effectively disarmed it.
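
    The effect of the sentinel change can be sketched in user-space C. This
    is a minimal illustration, not the kernel code: the function names and
    the EXPIRY_CACHE_INACTIVE macro are made up for the example, and
    UINT64_MAX stands in for the kernel's U64_MAX.

```c
#include <stdint.h>

/* Hypothetical name; UINT64_MAX plays the role of the kernel's U64_MAX. */
#define EXPIRY_CACHE_INACTIVE UINT64_MAX

/* Old scheme: 0 means "inactive", so every update needs a zero check. */
static inline void cache_update_zero_inactive(uint64_t *cache, uint64_t new_expiry)
{
    if (*cache == 0 || new_expiry < *cache)
        *cache = new_expiry;
}

/* New scheme: the maximum value means "inactive"; a plain minimum suffices. */
static inline void cache_update_max_inactive(uint64_t *cache, uint64_t new_expiry)
{
    if (new_expiry < *cache)
        *cache = new_expiry;
}
```

    With the old scheme an expiry value of 0 cannot be told apart from
    "inactive", so a later update overwrites it. With the sentinel at the
    maximum value, 0 simply remains the earliest expiry.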

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de

    Thomas Gleixner
     
  • The comment above the function which arms RLIMIT_CPU in the posix CPU timer
    code makes no sense at all. It claims that the kernel does not return an
    error code when it rejected the attempt to set RLIMIT_CPU. That's clearly
    bogus as the code does an error check and the rlimit is only set and
    activated when the permission checks are ok. In case of a rejection an
    appropriate error code is returned.

    This is a historical and outdated comment which got dragged along even when
    the rlimit handling code was rewritten.

    Replace it with an explanation why the setup function is not called when
    the rlimit value is RLIM_INFINITY and how the 'disarming' is handled.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Link: https://lkml.kernel.org/r/20190821192922.185511287@linutronix.de

    Thomas Gleixner
     

21 Aug, 2019

1 commit


07 Aug, 2019

1 commit

  • It is not desirable to relax the ABI to allow tagged user addresses into
    the kernel indiscriminately. This patch introduces a prctl() interface
    for enabling or disabling the tagged ABI with a global sysctl control
    for preventing applications from enabling the relaxed ABI (meant for
    testing user-space prctl() return error checking without reconfiguring
    the kernel). The ABI properties are inherited by threads of the same
    application and fork()'ed children but cleared on execve(). A Kconfig
    option allows the overall disabling of the relaxed ABI.

    The PR_SET_TAGGED_ADDR_CTRL will be expanded in the future to handle
    MTE-specific settings like imprecise vs precise exceptions.
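
    From user space, opting in to the relaxed ABI might look like the sketch
    below. The wrapper names are hypothetical. The fallback constants mirror
    the values in the uapi header <linux/prctl.h> and are only there in case
    the installed headers predate this interface. On architectures or
    kernels without the feature the call is expected to fail with an error.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values as in <linux/prctl.h>. */
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#endif
#ifndef PR_GET_TAGGED_ADDR_CTRL
#define PR_GET_TAGGED_ADDR_CTRL 56
#endif
#ifndef PR_TAGGED_ADDR_ENABLE
#define PR_TAGGED_ADDR_ENABLE (1UL << 0)
#endif

/* Hypothetical helper: returns 0 on success or a negative errno value. */
static inline int tagged_addr_enable(void)
{
    if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}

/* Hypothetical helper: returns the current flags or a negative errno value. */
static inline int tagged_addr_status(void)
{
    int ret = prctl(PR_GET_TAGGED_ADDR_CTRL, 0UL, 0UL, 0UL, 0UL);

    return ret == -1 ? -errno : ret;
}
```

    Callers should check the return value rather than assume success, since
    the sysctl and Kconfig controls described above can both refuse the
    request.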

    Reviewed-by: Kees Cook
    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrey Konovalov
    Signed-off-by: Will Deacon

    Catalin Marinas
     

22 Jul, 2019

1 commit

  • MPX is being removed from the kernel due to a lack of support in the
    toolchain going forward (gcc).

    The first step is to remove the userspace-visible ABIs so that applications
    will stop using it. The most visible ones are the enable/disable prctl()s.
    Remove them first.

    This is the most minimal and least invasive change needed to ensure that
    apps stop using MPX with new kernels.

    Signed-off-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190705175321.DB42F0AD@viggo.jf.intel.com

    Dave Hansen
     

02 Jun, 2019

2 commits

  • The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
    semaphore taken.") added synchronization of reading argument/environment
    boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
    arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
    the coarse use of mmap_sem in similar situations. But there still
    remained two places that (mis)use mmap_sem.

    get_cmdline should also use arg_lock instead of mmap_sem when it reads the
    boundaries.

    The second place that should use arg_lock is in prctl_set_mm. By
    protecting the boundaries fields with the arg_lock, we can downgrade
    mmap_sem to reader lock (analogous to what we already do in
    prctl_set_mm_map).

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
    Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
    Signed-off-by: Michal Koutný
    Signed-off-by: Laurent Dufour
    Co-developed-by: Laurent Dufour
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Yang Shi
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Koutný
     
  • Although the comment on validate_prctl_map claims there are no capability
    checks, this has not been completely true since commit 4d28df6152aa ("prctl: Allow
    local CAP_SYS_ADMIN changing exe_file"). Extract the check out of the
    function and make the function perform purely arithmetic checks.

    This patch should not change any behavior; it is a mere refactoring for the
    following patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20190502125203.24014-2-mkoutny@suse.com
    Signed-off-by: Michal Koutný
    Reviewed-by: Kirill Tkhai
    Reviewed-by: Cyrill Gorcunov
    Cc: Kirill Tkhai
    Cc: Laurent Dufour
    Cc: Mateusz Guzik
    Cc: Michal Hocko
    Cc: Yang Shi
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Koutný
     

15 May, 2019

1 commit

  • While validating a new map we require @start_data to be strictly less
    than @end_data, which is fine for regular applications (this is why this
    nit didn't trigger for so long). These members are set by executable
    loaders such as the ELF handlers. Still, it is perfectly valid to have a
    loadable data section with zero size in the file; in such a case
    start_data is equal to end_data once the kernel loader finishes.

    As a result when we're trying to restore such programs the procedure fails
    and the kernel returns -EINVAL. From the image dump of a program:

    | "mm_start_code": "0x400000",
    | "mm_end_code": "0x8f5fb4",
    | "mm_start_data": "0xf1bfb0",
    | "mm_end_data": "0xf1bfb0",

    Thus we need to change validate_prctl_map to use the less-than-or-equal
    operator instead of strictly-less-than.
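
    The difference between the two comparisons can be sketched as follows.
    The struct and function names are hypothetical and only model the one
    pair of fields discussed here, not the full validation done by the
    kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of the prctl map. */
struct data_range {
    unsigned long start_data;
    unsigned long end_data;
};

/* Old check: rejects an empty data section (start_data == end_data). */
static inline bool data_range_valid_strict(const struct data_range *r)
{
    return r->start_data < r->end_data;
}

/* Relaxed check: accepts an empty data section. */
static inline bool data_range_valid_relaxed(const struct data_range *r)
{
    return r->start_data <= r->end_data;
}
```

    Fed with the values from the image dump above (both 0xf1bfb0), the
    strict variant rejects the range and the relaxed one accepts it.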

    Link: http://lkml.kernel.org/r/20190408143554.GY1421@uranus.lan
    Fixes: f606b77f1a9e3 ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
    Signed-off-by: Cyrill Gorcunov
    Cc: Andrey Vagin
    Cc: Dmitry Safonov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

08 Mar, 2019

2 commits

  • Merge more updates from Andrew Morton:

    - some of the rest of MM

    - various misc things

    - dynamic-debug updates

    - checkpatch

    - some epoll speedups

    - autofs

    - rapidio

    - lib/, lib/lzo/ updates

    * emailed patches from Andrew Morton: (83 commits)
    samples/mic/mpssd/mpssd.h: remove duplicate header
    kernel/fork.c: remove duplicated include
    include/linux/relay.h: fix percpu annotation in struct rchan
    arch/nios2/mm/fault.c: remove duplicate include
    unicore32: stop printing the virtual memory layout
    MAINTAINERS: fix GTA02 entry and mark as orphan
    mm: create the new vm_fault_t type
    arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
    arch: simplify several early memory allocations
    openrisc: simplify pte_alloc_one_kernel()
    sh: prefer memblock APIs returning virtual address
    microblaze: prefer memblock API returning virtual address
    powerpc: prefer memblock APIs returning virtual address
    lib/lzo: separate lzo-rle from lzo
    lib/lzo: implement run-length encoding
    lib/lzo: fast 8-byte copy on arm64
    lib/lzo: 64-bit CTZ on arm64
    lib/lzo: tidy-up ifdefs
    ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
    ipc: annotate implicit fall through
    ...

    Linus Torvalds
     
  • There is a plan to build the kernel with -Wimplicit-fallthrough and this
    place in the code produced a warning (W=1).

    This commit removes the following warning:

    kernel/sys.c:1748:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
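
    For illustration, the kind of construct that triggers this warning, and
    the comment annotation that GCC accepts as marking the fall through as
    intentional, can be sketched like this. The enum and function are
    hypothetical and are not the code at the location named in the warning.

```c
/* Hypothetical selector, loosely modelled on "self / children / both". */
enum usage_who { USAGE_BOTH, USAGE_CHILDREN, USAGE_SELF };

static inline long usage_total(enum usage_who who, long self, long children)
{
    long total = 0;

    switch (who) {
    case USAGE_BOTH:
        total += self;
        /* fall through */
    case USAGE_CHILDREN:
        total += children;
        break;
    case USAGE_SELF:
        total += self;
        break;
    }
    return total;
}
```

    Without the comment, building with -Wimplicit-fallthrough reports the
    first case as a statement that may fall through.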

    Link: http://lkml.kernel.org/r/20190114203347.17530-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Malaterre
     

26 Jan, 2019

1 commit

  • This change ensures that the set*uid family of syscalls in kernel/sys.c
    (setreuid, setuid, setresuid, setfsuid) all call ns_capable_common with
    the CAP_OPT_INSETID flag, so capability checks in the security_capable
    hook can know whether they are being called from within a set*uid
    syscall. This change is a no-op by itself, but is needed for the
    proposed SafeSetID LSM.

    Signed-off-by: Micah Morton
    Acked-by: Kees Cook
    Signed-off-by: James Morris

    Micah Morton
     

14 Jan, 2019

1 commit

  • UNAME26 is a mechanism to report Linux's version as 2.6.x, for
    compatibility with old/broken software. Due to the way it is
    implemented, it would have to be updated after 5.0, to keep the
    resulting versions unique. Linus Torvalds argued:

    "Do we actually need this?

    I'd rather let it bitrot, and just let it return random versions. It
    will just start again at 2.4.60, won't it?

    Anybody who uses UNAME26 for a 5.x kernel might as well think it's
    still 4.x. The user space is so old that it can't possibly care about
    differences between 4.x and 5.x, can it?

    The only thing that matters is that it shows "2.4.",
    which it will do regardless"

    Signed-off-by: Jonathan Neuschäfer
    Signed-off-by: Linus Torvalds

    Jonathan Neuschäfer
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Dec, 2018

1 commit


21 Sep, 2018

1 commit


25 Aug, 2018

1 commit

  • Pull namespace fixes from Eric Biederman:
    "This is a set of four fairly obvious bug fixes:

    - a switch from d_find_alias to d_find_any_alias because the xattr
    code perversely takes a dentry

    - two mutex vs copy_to_user fixes from Jann Horn

    - a fix to use a sanitized size not the size userspace passed in from
    Christian Brauner"

    * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    getxattr: use correct xattr length
    sys: don't hold uts_sem while accessing userspace memory
    userns: move user access out of the mutex
    cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()

    Linus Torvalds
     

11 Aug, 2018

1 commit

  • Holding uts_sem as a writer while accessing userspace memory allows a
    namespace admin to stall all processes that attempt to take uts_sem.
    Instead, move data through stack buffers and don't access userspace memory
    while uts_sem is held.
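
    The pattern can be sketched in user-space C. This is an analogy under
    stated assumptions, not the kernel change itself: a pthread mutex stands
    in for uts_sem, plain memcpy() stands in for the userspace copy
    routines, and all names are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 65

static pthread_mutex_t name_lock = PTHREAD_MUTEX_INITIALIZER;
static char shared_name[NAME_LEN] = "localhost";

/* Copy the caller's data to the stack first, then lock only for the update. */
static int name_set(const char *src, size_t len)
{
    char tmp[NAME_LEN];

    if (len >= NAME_LEN)
        return -1;
    memcpy(tmp, src, len);          /* potentially slow step, done unlocked */
    tmp[len] = '\0';

    pthread_mutex_lock(&name_lock);
    memcpy(shared_name, tmp, len + 1);
    pthread_mutex_unlock(&name_lock);
    return 0;
}

/* Snapshot under the lock, then copy out to the caller after unlocking. */
static void name_get(char dst[NAME_LEN])
{
    char tmp[NAME_LEN];

    pthread_mutex_lock(&name_lock);
    memcpy(tmp, shared_name, NAME_LEN);
    pthread_mutex_unlock(&name_lock);

    memcpy(dst, tmp, NAME_LEN);     /* potentially slow step, done unlocked */
}
```

    The lock is held only across a bounded in-memory copy, so a caller that
    stalls on its own buffer cannot delay other lock users.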

    Cc: stable@vger.kernel.org
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Jann Horn
    Signed-off-by: Eric W. Biederman

    Jann Horn
     

19 Jun, 2018

1 commit

  • get_monotonic_boottime() is deprecated because it uses the old 'timespec'
    structure. This replaces one of the last callers with a call to
    ktime_get_boottime.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Cyrill Gorcunov
    Cc: Andrew Morton
    Cc: y2038@lists.linaro.org
    Cc: Dominik Brodowski
    Cc: Cyrill Gorcunov
    Link: https://lkml.kernel.org/r/20180618150114.849216-1-arnd@arndb.de

    Arnd Bergmann
     

08 Jun, 2018

1 commit

  • mmap_sem is on the hot path of the kernel and is heavily contended, but it is
    abused too. It is used to protect arg_start|end and env_start|end when
    reading /proc/$PID/cmdline and /proc/$PID/environ, but that doesn't make
    sense since those proc files just expect to read 4 values atomically and
    are not related to VM; they could be set to arbitrary values by C/R.

    And, the mmap_sem contention may cause unexpected issues like the one below:

    INFO: task ps:14018 blocked for more than 120 seconds.
    Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
    message.
    ps D 0 14018 1 0x00000004
    Call Trace:
    schedule+0x36/0x80
    rwsem_down_read_failed+0xf0/0x150
    call_rwsem_down_read_failed+0x18/0x30
    down_read+0x20/0x40
    proc_pid_cmdline_read+0xd9/0x4e0
    __vfs_read+0x37/0x150
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x1a/0xc5

    Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock
    for them to mitigate the abuse of mmap_sem.

    So, introduce a new spinlock in mm_struct to protect the concurrent
    access to arg_start|end, env_start|end and others, and take mmap_sem
    for read instead of write to protect against the race between prctl and
    sys_brk which might break check_data_rlimit(). This makes prctl more
    friendly to other VM operations.

    This patch just eliminates the abuse of mmap_sem, but it can't resolve
    the above hung task warning completely since the later
    access_remote_vm() call needs to acquire mmap_sem. The mmap_sem
    scalability issue will be solved in the future.

    [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
    Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Alexey Dobriyan
    Cc: Matthew Wilcox
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

26 May, 2018

1 commit

  • `resource' can be controlled by user-space, hence leading to a potential
    exploitation of the Spectre variant 1 vulnerability.

    This issue was detected with the help of Smatch:

    kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
    kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)

    Fix this by sanitizing *resource* before using it to index
    current->signal->rlim

    Notice that given that speculation windows are large, the policy is to
    kill the speculation on the first load and not worry if it can be
    completed with a dependent load/store [1].

    [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
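
    The idea behind sanitizing an index after the bounds check can be
    sketched in user-space C. This is an illustration only: in the kernel
    the helper to use is array_index_nospec() from <linux/nospec.h>, an
    optimizing user-space compiler gives no speculation guarantees for code
    like this, and the table and function names are hypothetical. The mask
    computation assumes an arithmetic right shift of negative values (as
    GCC and Clang provide) and that index and size do not exceed LONG_MAX.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* All ones when index < size, zero otherwise, computed without a branch. */
static inline unsigned long index_mask_nospec(unsigned long index,
                                              unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (BITS_PER_LONG - 1));
}

/* Clamp an already bounds-checked index so it cannot exceed the array. */
static inline unsigned long index_nospec(unsigned long index,
                                         unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

#define NLIMITS 16UL                /* hypothetical table size */
static long limits[NLIMITS];

static long read_limit(unsigned long resource)
{
    if (resource >= NLIMITS)
        return -1;
    resource = index_nospec(resource, NLIMITS);  /* sanitize before indexing */
    return limits[resource];
}
```

    The sanitizing step sits between the bounds check and the first load,
    which matches the policy quoted above of killing the speculation on the
    first load.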

    Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Andrew Morton
    Cc: Alexei Starovoitov
    Cc: Dan Williams
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     

03 May, 2018

2 commits

  • Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
    current.

    This is needed both for /proc/$pid/status queries and for seccomp (since
    thread-syncing can trigger seccomp in non-current threads).

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner

    Kees Cook
     
  • Add two new prctls to control aspects of speculation related vulnerabilities
    and their mitigations to provide finer grained control over performance
    impacting mitigations.

    PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
    which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
    the following meaning:

    Bit  Define           Description
    0    PR_SPEC_PRCTL    Mitigation can be controlled per task by
                          PR_SET_SPECULATION_CTRL
    1    PR_SPEC_ENABLE   The speculation feature is enabled, mitigation is
                          disabled
    2    PR_SPEC_DISABLE  The speculation feature is disabled, mitigation is
                          enabled

    If all bits are 0 the CPU is not affected by the speculation misfeature.

    If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
    available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
    misfeature will fail.

    PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
    is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
    control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.

    The common return values are:

    EINVAL  prctl is not implemented by the architecture or the unused prctl()
            arguments are not 0
    ENODEV  arg2 is selecting a not supported speculation misfeature

    PR_SET_SPECULATION_CTRL has these additional return values:

    ERANGE  arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
    ENXIO   prctl control of the selected speculation misfeature is disabled

    The first supported controllable speculation misfeature is
    PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
    architectures.
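
    A user-space sketch of querying and decoding the state, following the
    bit layout described above. The enum and helper names are hypothetical.
    The fallback constants mirror <linux/prctl.h> and are only used when the
    installed headers lack them. Only the three bits listed in this message
    are interpreted.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values as in <linux/prctl.h>. */
#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#endif
#ifndef PR_SPEC_STORE_BYPASS
#define PR_SPEC_STORE_BYPASS 0
#endif
#ifndef PR_SPEC_PRCTL
#define PR_SPEC_PRCTL (1UL << 0)
#endif
#ifndef PR_SPEC_ENABLE
#define PR_SPEC_ENABLE (1UL << 1)
#endif
#ifndef PR_SPEC_DISABLE
#define PR_SPEC_DISABLE (1UL << 2)
#endif

/* Hypothetical classification of a PR_GET_SPECULATION_CTRL return value. */
enum spec_state {
    SPEC_ERROR,                 /* prctl() failed */
    SPEC_NOT_AFFECTED,          /* all bits are 0 */
    SPEC_MITIGATION_ENABLED,    /* PR_SPEC_DISABLE is set */
    SPEC_MITIGATION_DISABLED,   /* PR_SPEC_ENABLE is set */
    SPEC_UNKNOWN                /* bits not covered by this sketch */
};

static inline enum spec_state spec_decode(long ret)
{
    if (ret < 0)
        return SPEC_ERROR;
    if (ret == 0)
        return SPEC_NOT_AFFECTED;
    if ((unsigned long)ret & PR_SPEC_DISABLE)
        return SPEC_MITIGATION_ENABLED;
    if ((unsigned long)ret & PR_SPEC_ENABLE)
        return SPEC_MITIGATION_DISABLED;
    return SPEC_UNKNOWN;
}

/* Non-zero when the mitigation can be controlled per task. */
static inline int spec_per_task_control(long ret)
{
    return ret >= 0 && ((unsigned long)ret & PR_SPEC_PRCTL) != 0;
}

/* Query the store bypass state of the calling task. */
static inline long spec_query_store_bypass(void)
{
    return prctl(PR_GET_SPECULATION_CTRL,
                 (unsigned long)PR_SPEC_STORE_BYPASS, 0UL, 0UL, 0UL);
}
```

    A caller would typically pass the result of spec_query_store_bypass()
    to spec_decode() and only attempt PR_SET_SPECULATION_CTRL when
    spec_per_task_control() reports that per task control is available.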

    Based on an initial patch from Tim Chen and mostly rewritten.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Reviewed-by: Konrad Rzeszutek Wilk

    Thomas Gleixner
     

03 Apr, 2018

3 commits

  • Using this helper allows us to avoid the in-kernel call to the
    sys_setsid() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it
    uses the same calling convention as sys_setsid().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using these helpers allows us to avoid the in-kernel calls to these
    syscalls: sys_setregid(), sys_setgid(), sys_setreuid(), sys_setuid(),
    sys_setresuid(), sys_setresgid(), sys_setfsuid(), and sys_setfsgid().

    The ksys_ prefix denotes that these functions are meant as drop-in
    replacements for the syscalls. In particular, they use the same calling
    convention.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the do_getpgid() helper removes an in-kernel call to the
    sys_getpgid() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

15 Dec, 2017

1 commit

  • The patch remains without practical effect since both macros carry
    identical values. Still, it might become a problem in the future if
    (for whatever reason) the default overflow uid and gid differ. The
    DEFAULT_FS_OVERFLOWGID macro was previously unused.

    Signed-off-by: Wolffhardt Schwabe
    Signed-off-by: Anatoliy Cherepantsev
    Signed-off-by: Eric W. Biederman

    Wolffhardt Schwabe
     

16 Nov, 2017

1 commit

  • Pull arm64 updates from Will Deacon:
    "The big highlight is support for the Scalable Vector Extension (SVE)
    which required extensive ABI work to ensure we don't break existing
    applications by blowing away their signal stack with the rather large
    new vector context ( of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits)
    arm64: Make ARMV8_DEPRECATED depend on SYSCTL
    arm64: Implement __lshrti3 library function
    arm64: support __int128 on gcc 5+
    arm64/sve: Add documentation
    arm64/sve: Detect SVE and activate runtime support
    arm64/sve: KVM: Hide SVE from CPU features exposed to guests
    arm64/sve: KVM: Treat guest SVE use as undefined instruction execution
    arm64/sve: KVM: Prevent guests from using SVE
    arm64/sve: Add sysctl to set the default vector length for new processes
    arm64/sve: Add prctl controls for userspace vector length management
    arm64/sve: ptrace and ELF coredump support
    arm64/sve: Preserve SVE registers around EFI runtime service calls
    arm64/sve: Preserve SVE registers around kernel-mode NEON use
    arm64/sve: Probe SVE capabilities and usable vector lengths
    arm64: cpufeature: Move sys_caps_initialised declarations
    arm64/sve: Backend logic for setting the vector length
    arm64/sve: Signal handling support
    arm64/sve: Support vector length resetting for new processes
    arm64/sve: Core task context handling
    arm64/sve: Low-level CPU setup
    ...

    Linus Torvalds
     

03 Nov, 2017

1 commit

  • This patch adds two arm64-specific prctls, to permit userspace to
    control its vector length:

    * PR_SVE_SET_VL: set the thread's SVE vector length and vector
    length inheritance mode.

    * PR_SVE_GET_VL: get the same information.

    Although these prctls resemble instruction set features in the SVE
    architecture, they provide additional control: the vector length
    inheritance mode is Linux-specific and nothing to do with the
    architecture, and the architecture does not permit EL0 to set its
    own vector length directly. Both can be used in portable tools
    without requiring the use of SVE instructions.
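
    A user-space sketch of reading the vector length through the new prctl.
    The struct and helper names are hypothetical. The fallback constants
    mirror <linux/prctl.h> for headers that predate SVE support. On systems
    without SVE the query is expected to fail.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values as in <linux/prctl.h>. */
#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL 51
#endif
#ifndef PR_SVE_VL_LEN_MASK
#define PR_SVE_VL_LEN_MASK 0xffff
#endif
#ifndef PR_SVE_VL_INHERIT
#define PR_SVE_VL_INHERIT (1 << 17)
#endif

/* Hypothetical decoded form of a successful PR_SVE_GET_VL return value. */
struct sve_vl_info {
    unsigned int vl_bytes;  /* vector length in bytes */
    int inherit;            /* non-zero if the length is inherited across exec */
};

static inline struct sve_vl_info sve_vl_decode(long ret)
{
    struct sve_vl_info info;

    info.vl_bytes = (unsigned int)(ret & PR_SVE_VL_LEN_MASK);
    info.inherit = (ret & PR_SVE_VL_INHERIT) != 0;
    return info;
}

/* Returns the raw prctl() result; negative on error (for example, no SVE). */
static inline long sve_vl_query(void)
{
    return prctl(PR_SVE_GET_VL, 0UL, 0UL, 0UL, 0UL);
}
```

    Neither helper executes an SVE instruction, so a tool built this way can
    run unchanged on hardware without SVE and simply see the query fail.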

    Signed-off-by: Dave Martin
    Reviewed-by: Catalin Marinas
    Cc: Alex Bennée
    [will: Fixed up prctl constants to avoid clash with PDEATHSIG]
    Signed-off-by: Will Deacon

    Dave Martin
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 Jul, 2017

1 commit

  • During checkpointing and restore of userspace tasks
    we bumped into the situation that it's not possible
    to restore tasks whose user namespace does not
    have uid 0 or gid 0 mapped.

    People create user namespace mappings as they like,
    and there is no requirement that a particular uid or gid
    "must be mapped". So, if there is no uid 0 or gid 0
    in the mapping, it's impossible to restore mm->exe_file
    of the processes belonging to this user namespace.

    Also, there is no workaround. It's impossible
    to create a temporary uid/gid mapping, because
    only one write to /proc/[pid]/uid_map and gid_map
    is allowed during a namespace lifetime.
    If there is an entry, then no more mappings can be
    written. If there isn't an entry, we can't write
    there either, otherwise the user task won't be able
    to do that in the future.

    The patch changes the check, and looks for CAP_SYS_ADMIN
    instead of zero uid and gid. This allows a task to be restored
    independently of its user namespace mappings.

    Signed-off-by: Kirill Tkhai
    CC: Andrew Morton
    CC: Serge Hallyn
    CC: "Eric W. Biederman"
    CC: Oleg Nesterov
    CC: Michal Hocko
    CC: Andrei Vagin
    CC: Cyrill Gorcunov
    CC: Stanislav Kinsburskiy
    CC: Pavel Tikhomirov
    Reviewed-by: Cyrill Gorcunov
    Signed-off-by: Eric W. Biederman

    Kirill Tkhai
     

13 Jul, 2017

1 commit


11 Jul, 2017

1 commit

  • PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
    existing mapping because it only updated mm->def_flags which is a
    template for new mappings.

    The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
    flag set. This can be quite surprising for all those applications which
    do not do prctl(); fork() & exec() and want to control their own THP
    behavior.

    Another use case where the immediate semantic of the prctl might be useful
    is a combination of pre- and post-copy migration of containers with
    CRIU. In this case CRIU populates a part of a memory region with data
    that was saved during the pre-copy stage. Afterwards, the region is
    registered with userfaultfd and CRIU expects to get page faults for the
    parts of the region that were not yet populated. However, khugepaged
    collapses the pages and the expected page faults do not occur.

    More generally, prctl(PR_SET_THP_DISABLE) could be used as a
    temporary mechanism for enabling/disabling THP process-wide.

    Implementation-wise, a new MMF_DISABLE_THP flag is added. This flag is
    tested when the decision whether to use huge pages is made, either
    during a page fault or at the time of THP collapse.

    Note that the new implementation makes PR_SET_THP_DISABLE a
    master override for any per-VMA setting, which was not the case
    previously.

    Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
    Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
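
The decision order described in this commit message can be modelled in a few lines. This is a simplified sketch, not kernel code: the `Mm` and `Vma` classes and the `thp_allowed` function are hypothetical names, and the `system_mode` parameter is an assumption that mirrors the "always", "madvise" and "never" values of /sys/kernel/mm/transparent_hugepage/enabled. It only shows that a per-process flag checked at decision time overrides the per-VMA flags and takes effect for mappings that already exist.

```python
from dataclasses import dataclass


@dataclass
class Vma:
    hugepage: bool = False    # models VM_HUGEPAGE on the mapping
    nohugepage: bool = False  # models VM_NOHUGEPAGE on the mapping


@dataclass
class Mm:
    disable_thp: bool = False  # models the per-process MMF_DISABLE_THP flag


def thp_allowed(mm: Mm, vma: Vma, system_mode: str = "always") -> bool:
    """Simplified model of the 'may this mapping use huge pages?' decision."""
    if mm.disable_thp:
        return False          # process-wide override wins over per-VMA flags
    if vma.nohugepage:
        return False          # mapping opted out
    if system_mode == "always":
        return True
    if system_mode == "madvise":
        return vma.hugepage   # only mappings that opted in
    return False              # "never" or anything unrecognised


mm = Mm()
existing = Vma(hugepage=True)          # mapping created before the prctl
print(thp_allowed(mm, existing))       # True
mm.disable_thp = True                  # models prctl(PR_SET_THP_DISABLE, 1)
print(thp_allowed(mm, existing))       # False, even for the existing mapping
mm.disable_thp = False                 # models prctl(PR_SET_THP_DISABLE, 0)
print(thp_allowed(mm, existing))       # True again
```

Because the flag lives on the process rather than being copied into each new mapping, toggling it affects existing and future mappings alike, which is the behaviour the CRIU use case needs.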
     

07 Jul, 2017

1 commit

  • Pull misc compat stuff updates from Al Viro:
    "This part is basically untangling various compat stuff. Compat
    syscalls moved to their native counterparts, getting rid of quite a
    bit of double-copying and/or set_fs() uses. A lot of field-by-field
    copyin/copyout killed off.

    - kernel/compat.c is much closer to containing just the
    copyin/copyout of compat structs. Not all compat syscalls are gone
    from it yet, but it's getting there.

    - ipc/compat_mq.c killed off completely.

    - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
    drivers/block/floppy.c where they belong. Yes, there are several
    drivers that implement some of the same ioctls. Some are m68k and
    one is 32bit-only pmac. drivers/block/floppy.c is the only one in
    that bunch that can be built on biarch"

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: move compat syscalls to native ones
    usbdevfs: get rid of field-by-field copyin
    compat_hdio_ioctl: get rid of set_fs()
    take floppy compat ioctls to sodding floppy.c
    ipmi: get rid of field-by-field __get_user()
    ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
    rt_sigtimedwait(): move compat to native
    select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
    put_compat_rusage(): switch to copy_to_user()
    sigpending(): move compat to native
    getrlimit()/setrlimit(): move compat to native
    times(2): move compat to native
    compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
    fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
    do_sigaltstack(): lift copying to/from userland into callers
    take compat_sys_old_getrlimit() to native syscall
    trim __ARCH_WANT_SYS_OLD_GETRLIMIT

    Linus Torvalds
     

10 Jun, 2017

2 commits


28 May, 2017

1 commit


22 May, 2017

1 commit

  • New helpers: kernel_waitid() and kernel_wait4(). sys_waitid(),
    sys_wait4() and their compat variants switched to those. Copying
    struct rusage to userland is left to syscall itself. For
    compat_sys_wait4() that eliminates the use of set_fs() completely.
    For compat_sys_waitid() it's still needed (for siginfo handling);
    that will change shortly.

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "This is a set of small fixes that were mostly stumbled over during
    more significant development. This proc fix and the fix to
    posix-timers are the most significant of the lot.

    There is a lot of good development going on but unfortunately it
    didn't quite make the merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Fix unbalanced hard link numbers
    signal: Make kill_proc_info static
    rlimit: Properly call security_task_setrlimit
    signal: Remove unused definition of sig_user_definied
    ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
    ipc: Remove unused declaration of recompute_msgmni
    posix-timers: Correct sanity check in posix_cpu_nsleep
    sysctl: Remove dead register_sysctl_root

    Linus Torvalds