10 Sep, 2018

1 commit

  • commit 42a0cc3478584d4d63f68f2f5af021ddbea771fa upstream.

    Holding uts_sem as a writer while accessing userspace memory allows a
    namespace admin to stall all processes that attempt to take uts_sem.
    Instead, move data through stack buffers and don't access userspace memory
    while uts_sem is held.

    Cc: stable@vger.kernel.org
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Jann Horn
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
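
The stack-buffer pattern described in the uts_sem commit above can be sketched in plain C. This is a userspace model, not the kernel code: `lock_stub()`, `unlock_stub()`, `user_copy_stub()`, `set_nodename()` and `get_nodename()` are hypothetical stand-ins for the semaphore operations, the user-copy helpers and the syscall paths. The stub for the user copy asserts that the lock is not held, which is the invariant the commit establishes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 65   /* one utsname-style field, NUL included */

/* Hypothetical stand-ins for uts_sem and the data it protects. */
static int uts_sem_held;                          /* nonzero while "locked" */
static char shared_nodename[NAME_LEN] = "host";

static void lock_stub(void)   { uts_sem_held = 1; }
static void unlock_stub(void) { uts_sem_held = 0; }

/* Models a copy to or from userspace: it may stall for a long time,
 * so it must never run while the lock is held. Returns 0 on success. */
static int user_copy_stub(char *dst, const char *src, size_t len)
{
    assert(!uts_sem_held);
    memcpy(dst, src, len);
    return 0;
}

/* Write path: fetch the user data into a stack buffer first,
 * then take the lock only for the short in-kernel copy. */
int set_nodename(const char *user_buf, size_t len)
{
    char tmp[NAME_LEN];

    if (len >= NAME_LEN)
        return -1;
    if (user_copy_stub(tmp, user_buf, len))
        return -1;
    tmp[len] = '\0';

    lock_stub();
    memcpy(shared_nodename, tmp, len + 1);
    unlock_stub();
    return 0;
}

/* Read path: snapshot under the lock, copy out after dropping it. */
int get_nodename(char *user_buf, size_t len)
{
    char tmp[NAME_LEN];

    if (len > NAME_LEN)
        len = NAME_LEN;

    lock_stub();
    memcpy(tmp, shared_nodename, len);
    unlock_stub();

    return user_copy_stub(user_buf, tmp, len);
}
```

The cost is one extra copy per call; in exchange, the time the lock is held no longer depends on how long a userspace access takes.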
     

30 May, 2018

1 commit

  • commit 23d6aef74da86a33fa6bb75f79565e0a16ee97c2 upstream.

    `resource' can be controlled by user-space, hence leading to a potential
    exploitation of the Spectre variant 1 vulnerability.

    This issue was detected with the help of Smatch:

    kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
    kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)

    Fix this by sanitizing *resource* before using it to index
    current->signal->rlim

    Notice that given that speculation windows are large, the policy is to
    kill the speculation on the first load and not worry if it can be
    completed with a dependent load/store [1].

    [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

    Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com
    Signed-off-by: Gustavo A. R. Silva
    Reviewed-by: Andrew Morton
    Cc: Alexei Starovoitov
    Cc: Dan Williams
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gustavo A. R. Silva
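
The index sanitizing described in the commit above is commonly done with a branch-free mask, in the style of the kernel's array_index_nospec() helper. The sketch below is a userspace illustration of that masking idea only: `index_mask_nospec()`, `read_limit()`, `rlim_demo` and `RLIM_NLIMITS_DEMO` are hypothetical names, and a userspace compiler gives none of the guarantees the kernel helper relies on (the kernel additionally hides the index from the optimizer and has architecture-specific versions).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* All ones when index < size, zero otherwise, computed without a branch.
 * Relies on arithmetic right shift of a negative long, which mainstream
 * compilers provide but ISO C leaves implementation-defined. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG - 1));
}

#define RLIM_NLIMITS_DEMO 16              /* hypothetical table size */
static unsigned long rlim_demo[RLIM_NLIMITS_DEMO];

/* Bounds check, then clamp the index so that even a mispredicted
 * branch cannot use an out-of-range value for the load. */
int read_limit(unsigned long resource, unsigned long *out)
{
    if (resource >= RLIM_NLIMITS_DEMO)
        return -1;
    resource &= index_mask_nospec(resource, RLIM_NLIMITS_DEMO);
    *out = rlim_demo[resource];
    return 0;
}
```

The architectural bounds check stays in place; the mask only ensures that a speculatively executed load sees index 0 instead of an attacker-chosen value.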
     

23 May, 2018

2 commits

  • commit 7bbf1373e228840bb0295a2ca26d548ef37f448e upstream

    Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
    current.

    This is needed both for /proc/$pid/status queries and for seccomp (since
    thread-syncing can trigger seccomp in non-current threads).

    Signed-off-by: Kees Cook
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit b617cfc858161140d69cc0b5cc211996b557a1c7 upstream

    Add two new prctls to control aspects of speculation related vulnerabilities
    and their mitigations to provide finer grained control over performance
    impacting mitigations.

    PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
    which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
    the following meaning:

    Bit  Define           Description
    0    PR_SPEC_PRCTL    Mitigation can be controlled per task by
                          PR_SET_SPECULATION_CTRL
    1    PR_SPEC_ENABLE   The speculation feature is enabled, mitigation is
                          disabled
    2    PR_SPEC_DISABLE  The speculation feature is disabled, mitigation is
                          enabled

    If all bits are 0 the CPU is not affected by the speculation misfeature.

    If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
    available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
    misfeature will fail.

    PR_SET_SPECULATION_CTRL allows per-task control of the speculation
    misfeature that is selected by arg2 of prctl(2). arg3 is used to hand in
    the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.

    The common return values are:

    EINVAL  prctl is not implemented by the architecture or the unused prctl()
            arguments are not 0
    ENODEV  arg2 is selecting an unsupported speculation misfeature

    PR_SET_SPECULATION_CTRL has these additional return values:

    ERANGE  arg3 is incorrect, i.e. it is neither PR_SPEC_ENABLE nor
            PR_SPEC_DISABLE
    ENXIO   prctl control of the selected speculation misfeature is disabled

    The first supported controllable speculation misfeature is
    PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
    architectures.

    Based on an initial patch from Tim Chen and mostly rewritten.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Reviewed-by: Konrad Rzeszutek Wilk
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
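
The bit layout described in the commit above can be decoded as follows. This is a sketch: `spec_ctrl_describe()`, `spec_ctrl_settable()` and `query_store_bypass()` are hypothetical helper names, and the fallback `#define`s are only used when the installed headers predate these prctls. Only bits 0-2 from the commit message are interpreted; bits added by later kernels are ignored.

```c
#include <assert.h>
#include <string.h>
#ifdef __linux__
#include <sys/prctl.h>
#endif

/* Fallbacks for older headers; bit numbers follow the commit message. */
#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#endif
#ifndef PR_SPEC_STORE_BYPASS
#define PR_SPEC_STORE_BYPASS 0
#endif
#ifndef PR_SPEC_PRCTL
#define PR_SPEC_PRCTL   (1UL << 0)
#define PR_SPEC_ENABLE  (1UL << 1)
#define PR_SPEC_DISABLE (1UL << 2)
#endif

/* Translate the value returned by PR_GET_SPECULATION_CTRL into text. */
const char *spec_ctrl_describe(unsigned long bits)
{
    if ((bits & 7UL) == 0)
        return "not affected";
    if (bits & PR_SPEC_DISABLE)
        return "speculation disabled, mitigation enabled";
    if (bits & PR_SPEC_ENABLE)
        return "speculation enabled, mitigation disabled";
    return "per-task control available, state not reported";
}

/* Nonzero when PR_SET_SPECULATION_CTRL can be used for this misfeature. */
int spec_ctrl_settable(unsigned long bits)
{
    return (bits & PR_SPEC_PRCTL) != 0;
}

/* Ask the kernel about the store-bypass misfeature for the calling task.
 * Returns the bit mask, or -1 on error or on non-Linux systems. */
long query_store_bypass(void)
{
#ifdef __linux__
    return prctl(PR_GET_SPECULATION_CTRL,
                 (unsigned long)PR_SPEC_STORE_BYPASS, 0UL, 0UL, 0UL);
#else
    return -1;
#endif
}
```

A caller would typically check `query_store_bypass()` for a negative result first, then pass the non-negative value to the two decode helpers.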
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 Jul, 2017

1 commit

  • During checkpointing and restore of userspace tasks
    we bumped into the situation that it's not possible
    to restore tasks whose user namespace does not
    have uid 0 or gid 0 mapped.

    People create user namespace mappings however they want,
    and nothing requires that a particular uid or gid
    be mapped. So, if there is no uid 0 or gid 0
    in the mapping, it's impossible to restore mm->exe_file
    of the processes belonging to this user namespace.

    Also, there is no workaround. It's impossible
    to create a temporary uid/gid mapping, because
    only one write to /proc/[pid]/uid_map and gid_map
    is allowed during a namespace lifetime.
    If there is an entry, then no more mappings can be
    written. If there isn't an entry, we can't write
    there either, otherwise the user task won't be able
    to do that in the future.

    The patch changes the check, and looks for CAP_SYS_ADMIN
    instead of zero uid and gid. This allows a task to be
    restored independently of its user namespace mappings.

    Signed-off-by: Kirill Tkhai
    CC: Andrew Morton
    CC: Serge Hallyn
    CC: "Eric W. Biederman"
    CC: Oleg Nesterov
    CC: Michal Hocko
    CC: Andrei Vagin
    CC: Cyrill Gorcunov
    CC: Stanislav Kinsburskiy
    CC: Pavel Tikhomirov
    Reviewed-by: Cyrill Gorcunov
    Signed-off-by: Eric W. Biederman

    Kirill Tkhai
     

13 Jul, 2017

1 commit


11 Jul, 2017

1 commit

  • PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
    existing mapping because it only updated mm->def_flags which is a
    template for new mappings.

    The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
    flag set. This can be quite surprising for all those applications which
    do not do prctl(); fork() & exec() and want to control their own THP
    behavior.

    Another usecase when the immediate semantic of the prctl might be useful
    is a combination of pre- and post-copy migration of containers with
    CRIU. In this case CRIU populates a part of a memory region with data
    that was saved during the pre-copy stage. Afterwards, the region is
    registered with userfaultfd and CRIU expects to get page faults for the
    parts of the region that were not yet populated. However, khugepaged
    collapses the pages and the expected page faults do not occur.

    In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
    temporary mechanism for enabling/disabling THP process wide.

    Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
    tested when the decision whether to use huge pages is taken, either
    during page fault or at the time of THP collapse.

    It should be noted, that the new implementation makes PR_SET_THP_DISABLE
    master override to any per-VMA setting, which was not the case
    previously.

    Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
    Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Jul, 2017

1 commit

  • Pull misc compat stuff updates from Al Viro:
    "This part is basically untangling various compat stuff. Compat
    syscalls moved to their native counterparts, getting rid of quite a
    bit of double-copying and/or set_fs() uses. A lot of field-by-field
    copyin/copyout killed off.

    - kernel/compat.c is much closer to containing just the
    copyin/copyout of compat structs. Not all compat syscalls are gone
    from it yet, but it's getting there.

    - ipc/compat_mq.c killed off completely.

    - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
    drivers/block/floppy.c where they belong. Yes, there are several
    drivers that implement some of the same ioctls. Some are m68k and
    one is 32bit-only pmac. drivers/block/floppy.c is the only one in
    that bunch that can be built on biarch"

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: move compat syscalls to native ones
    usbdevfs: get rid of field-by-field copyin
    compat_hdio_ioctl: get rid of set_fs()
    take floppy compat ioctls to sodding floppy.c
    ipmi: get rid of field-by-field __get_user()
    ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
    rt_sigtimedwait(): move compat to native
    select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
    put_compat_rusage(): switch to copy_to_user()
    sigpending(): move compat to native
    getrlimit()/setrlimit(): move compat to native
    times(2): move compat to native
    compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
    fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
    do_sigaltstack(): lift copying to/from userland into callers
    take compat_sys_old_getrlimit() to native syscall
    trim __ARCH_WANT_SYS_OLD_GETRLIMIT

    Linus Torvalds
     

10 Jun, 2017

2 commits


28 May, 2017

1 commit


22 May, 2017

1 commit

  • New helpers: kernel_waitid() and kernel_wait4(). sys_waitid(),
    sys_wait4() and their compat variants switched to those. Copying
    struct rusage to userland is left to syscall itself. For
    compat_sys_wait4() that eliminates the use of set_fs() completely.
    For compat_sys_waitid() it's still needed (for siginfo handling);
    that will change shortly.

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "This is a set of small fixes that were mostly stumbled over during
    more significant development. This proc fix and the fix to
    posix-timers are the most significant of the lot.

    There is a lot of good development going on but unfortunately it
    didn't quite make the merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Fix unbalanced hard link numbers
    signal: Make kill_proc_info static
    rlimit: Properly call security_task_setrlimit
    signal: Remove unused definition of sig_user_definied
    ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
    ipc: Remove unused declaration of recompute_msgmni
    posix-timers: Correct sanity check in posix_cpu_nsleep
    sysctl: Remove dead register_sysctl_root

    Linus Torvalds
     

22 Apr, 2017

1 commit

  • Modify do_prlimit to call security_task_setrlimit passing the task
    whose rlimit we are changing not the tsk->group_leader.

    In general this should not matter, as the lsms implementing
    security_task_setrlimit, apparmor and selinux, both examine the
    task->cred to see what should be allowed on the destination task.

    That task->cred is shared between tasks created with CLONE_THREAD
    unless thread keyrings are in play, in which case both apparmor and
    selinux create duplicate security contexts.

    So the only time when it will matter which thread is passed to
    security_task_setrlimit is if one of the threads of a process performs
    an operation that changes only its credentials. At which point if a
    thread has done that we don't want to hide that information from the
    lsms.

    So fix the call of security_task_setrlimit. With the removal
    of tsk->group_leader this makes the code slightly faster,
    more comprehensible and maintainable.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

06 Mar, 2017

1 commit

  • When SELinux was first added to the kernel, a process could only get
    and set its own resource limits via getrlimit(2) and setrlimit(2), so no
    MAC checks were required for those operations, and thus no security hooks
    were defined for them. Later, SELinux introduced a hook for setrlimit(2)
    with a check if the hard limit was being changed in order to be able to
    rely on the hard limit value as a safe reset point upon context
    transitions.

    Later on, when prlimit(2) was added to the kernel with the ability to get
    or set resource limits (hard or soft) of another process, LSM/SELinux was
    not updated other than to pass the target process to the setrlimit hook.
    This resulted in incomplete control over both getting and setting the
    resource limits of another process.

    Add a new security_task_prlimit() hook to the check_prlimit_permission()
    function to provide complete mediation. The hook is only called when
    acting on another task, and only if the existing DAC/capability checks
    would allow access. Pass flags down to the hook to indicate whether the
    prlimit(2) call will read, write, or both read and write the resource
    limits of the target process.

    The existing security_task_setrlimit() hook is left alone; it continues
    to serve a purpose in supporting the ability to make decisions based on
    the old and/or new resource limit values when setting limits. This
    is consistent with the DAC/capability logic, where
    check_prlimit_permission() performs generic DAC/capability checks for
    acting on another task, while do_prlimit() performs a capability check
    based on a comparison of the old and new resource limits. Fix the
    inline documentation for the hook to match the code.

    Implement the new hook for SELinux. For setting resource limits, we
    reuse the existing setrlimit permission. Note that this does overload
    the setrlimit permission to mean the ability to set the resource limit
    (soft or hard) of another process or the ability to change one's own
    hard limit. For getting resource limits, a new getrlimit permission
    is defined. This was not originally defined since getrlimit(2) could
    only be used to obtain a process' own limits.

    Signed-off-by: Stephen Smalley
    Signed-off-by: James Morris

    Stephen Smalley
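
The mediation rules in the commit above (hook consulted only for another task, only after the DAC/capability checks pass, with flags saying whether limits are read, written, or both) can be modelled as two small functions. This is an illustrative model, not kernel code: the flag values, `struct rlimit_demo`, `prlimit_flags()` and `prlimit_hook_applies()` are all hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag values; the commit only says that flags tell the hook
 * whether the call reads, writes, or reads and writes the limits. */
#define PRLIMIT_READ_DEMO  1u
#define PRLIMIT_WRITE_DEMO 2u

struct rlimit_demo {
    unsigned long cur;   /* soft limit */
    unsigned long max;   /* hard limit */
};

/* A non-NULL old_rlim means the caller wants the current limits back
 * (read); a non-NULL new_rlim means new limits are being set (write). */
unsigned int prlimit_flags(const struct rlimit_demo *new_rlim,
                           const struct rlimit_demo *old_rlim)
{
    unsigned int flags = 0;

    if (old_rlim)
        flags |= PRLIMIT_READ_DEMO;
    if (new_rlim)
        flags |= PRLIMIT_WRITE_DEMO;
    return flags;
}

/* The new hook is consulted only when acting on another task and only
 * if the existing DAC/capability checks would already allow access. */
int prlimit_hook_applies(int target_is_current, int dac_allows)
{
    return !target_is_current && dac_allows;
}
```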
     

02 Mar, 2017

7 commits


24 Feb, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "There is a lot here. A lot of these changes result in subtle user
    visible differences in kernel behavior. I don't expect anything will
    care but I will revert/fix things immediately if any regressions show
    up.

    From Seth Forshee there is a continuation of the work to make the vfs
    ready for unprivileged mounts. We had thought the previous changes
    prevented the creation of files outside of s_user_ns of a filesystem,
    but it turns out we missed the O_CREAT path. Ooops.

    Pavel Tikhomirov and Oleg Nesterov worked together to fix a long
    standing bug in the implementation of PR_SET_CHILD_SUBREAPER where only
    children that are forked after the prctl are considered and not
    children forked before the prctl. The only known user of this prctl,
    systemd, forks all children after the prctl. So no userspace
    regressions will occur. Holding earlier forked children to the same
    rules as later forked children creates a semantic that is sane enough
    to allow checkpointing of processes that use this feature.

    There is a long delayed change by Nikolay Borisov to limit inotify
    instances inside a user namespace.

    Michael Kerrisk extends the API for files used to manipulate
    namespaces with two new trivial ioctls to allow discovery of the
    hierarchy and properties of namespaces.

    Konstantin Khlebnikov with the help of Al Viro adds code that, when a
    network namespace exits, purges its sysctl entries from the dcache, as
    in some circumstances this could use a lot of memory.

    Vivek Goyal fixed a bug with stacked filesystems where the permissions
    on the wrong inode were being checked.

    I continue previous work on ptracing across exec. Allowing a file to
    be setuid across exec while being ptraced if the tracer has enough
    credentials in the user namespace, and if the process has CAP_SETUID
    in its own namespace. Proc files for setuid or otherwise undumpable
    executables are now owned by the root in the user namespace of their
    mm. Allowing debugging of setuid applications in containers to work
    better.

    A bug I introduced with permission checking and automount is now
    fixed. The big change is to mark the mounts that the kernel initiates
    as a result of an automount. This allows the permission checks in sget
    to be safely suppressed for this kind of mount. As the permission
    check happened when the original filesystem was mounted.

    Finally a special case in the mount namespace is removed preventing
    unbounded chains in the mount hash table, and making the semantics
    simpler which benefits CRIU.

    The vfs fix along with related work in ima and evm I believe makes us
    ready to finish developing and merge fully unprivileged mounts of the
    fuse filesystem. The cleanups of the mount namespace makes discussing
    how to fix the worst case complexity of umount. The stacked filesystem
    fixes pave the way for adding multiple mappings for the filesystem
    uids so that efficient and safer containers can be implemented"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc/sysctl: Don't grab i_lock under sysctl_lock.
    vfs: Use upper filesystem inode in bprm_fill_uid()
    proc/sysctl: prune stale dentries during unregistering
    mnt: Tuck mounts under others instead of creating shadow/side mounts.
    prctl: propagate has_child_subreaper flag to every descendant
    introduce the walk_process_tree() helper
    nsfs: Add an ioctl() to return owner UID of a userns
    fs: Better permission checking for submounts
    exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction
    vfs: open() with O_CREAT should not create inodes with unknown ids
    nsfs: Add an ioctl() to return the namespace type
    proc: Better ownership of files for non-dumpable tasks in user namespaces
    exec: Remove LSM_UNSAFE_PTRACE_CAP
    exec: Test the ptracer's saved cred to see if the tracee can gain caps
    exec: Don't reset euid and egid when the tracee has CAP_SETUID
    inotify: Convert to using per-namespace limits

    Linus Torvalds
     

03 Feb, 2017

1 commit

  • If a process forks some children while it has the is_child_subreaper
    flag enabled, they inherit the has_child_subreaper flag (the first
    group); children forked while is_child_subreaper is disabled do
    not inherit it (the second group). So a child-subreaper does not reparent
    all of its descendants when their parents die. Having these two
    differently behaving groups can lead to confusion. It is also
    a problem for CRIU: when we restore a process tree we need to
    somehow determine which descendants belong to which group and,
    much harder, put them into exactly these groups.

    To simplify this, we can propagate the has_child_subreaper flag
    on PR_SET_CHILD_SUBREAPER, walking all descendants of the
    child-subreaper to set their has_child_subreaper flag.

    In common cases when process like systemd first sets itself to
    be a child-subreaper and only after that forks its services, we will
    have zero-length list of descendants to walk. Testing with binary
    subtree of 2^15 processes prctl took < 0.007 sec and has shown close
    to linear dependency(~0.2 * n * usec) on lower numbers of processes.

    Moreover, I doubt anyone intentionally pre-forks children which
    should reparent to init before becoming a subreaper, because one of our
    ancestors might have had the is_child_subreaper flag while forking our
    sub-tree, in which case all our children inherit has_child_subreaper
    and we have no way to influence it. And the only way to check whether we
    have the has_child_subreaper flag is to create some children, kill them
    and see where they reparent to.

    Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing
    seems to be the same.

    Optimize:

    a) When descendant already has has_child_subreaper flag all his subtree
    has it too already.

    * For a) to be true we need to move has_child_subreaper inheritance under
    the same tasklist_lock as adding the task to its ->real_parent->children.
    Without that, a process can inherit a zero has_child_subreaper, then we
    set its parent's flag to 1, check that the parent has no more children,
    and only after that is the child with the wrong flag added to the tree.

    * Also make this inheritance clearer by using real_parent instead of
    current, as on clone(CLONE_PARENT) if current has is_child_subreaper
    and real_parent has no is_child_subreaper or has_child_subreaper, the
    child will have the has_child_subreaper flag set without actually having
    a subreaper among its ancestors.

    b) When some descendant is a child_reaper, its subtree is in a different
    pidns from us (the original child-subreaper) and processes from another
    pidns will never reparent to us.

    So we can skip the subtrees in cases a) and b) during the walk.

    v2: switch to walk_process_tree() general helper, move
    has_child_subreaper inheritance
    v3: remove csr_descendant leftover, change current to real_parent
    in has_child_subreaper inheritance
    v4: small commit message fix

    Fixes: ebec18a6d3aa ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision")
    Signed-off-by: Pavel Tikhomirov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman

    Pavel Tikhomirov
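
The propagation walk with the two pruning rules (a) and (b) from the commit above can be sketched on a toy process tree. This is a model under stated assumptions: `struct task_demo`, its fields and `propagate_subreaper()` are hypothetical, the walk is recursive for brevity where the commit uses the walk_process_tree() helper, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Toy stand-in for a task; only the fields the walk needs. */
struct task_demo {
    int has_child_subreaper;   /* some ancestor is a child-subreaper */
    int is_pidns_reaper;       /* init of a nested pid namespace */
    struct task_demo *children[MAX_CHILDREN];
    size_t nr_children;
};

/* Set has_child_subreaper on the descendants of `parent`.
 * Returns the number of tasks visited, to make the pruning visible. */
size_t propagate_subreaper(struct task_demo *parent)
{
    size_t visited = 0;

    for (size_t i = 0; i < parent->nr_children; i++) {
        struct task_demo *t = parent->children[i];

        visited++;
        if (t->has_child_subreaper)
            continue;   /* (a) the whole subtree already has the flag */
        if (t->is_pidns_reaper)
            continue;   /* (b) subtree lives in another pid namespace */
        t->has_child_subreaper = 1;
        visited += propagate_subreaper(t);
    }
    return visited;
}
```

In the common case from the commit message, where the subreaper sets the flag before forking anything, `nr_children` is zero and the walk does no work.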
     

01 Feb, 2017

1 commit

  • Now that most cputime readers use the transition API which returns the
    task cputime in old style cputime_t, we can safely store the cputime in
    nsecs. This will eventually make cputime statistics less opaque and more
    granular. Back and forth conversions between cputime_t and nsecs in order
    to deal with cputime_t random granularity won't be needed anymore.

    Signed-off-by: Frederic Weisbecker
    Cc: Benjamin Herrenschmidt
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Stanislaw Gruszka
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

25 Dec, 2016

1 commit


13 Dec, 2016

2 commits

  • Merge updates from Andrew Morton:

    - various misc bits

    - most of MM (quite a lot of MM material is awaiting the merge of
    linux-next dependencies)

    - kasan

    - printk updates

    - procfs updates

    - MAINTAINERS

    - /lib updates

    - checkpatch updates

    * emailed patches from Andrew Morton : (123 commits)
    init: reduce rootwait polling interval time to 5ms
    binfmt_elf: use vmalloc() for allocation of vma_filesz
    checkpatch: don't emit unified-diff error for rename-only patches
    checkpatch: don't check c99 types like uint8_t under tools
    checkpatch: avoid multiple line dereferences
    checkpatch: don't check .pl files, improve absolute path commit log test
    scripts/checkpatch.pl: fix spelling
    checkpatch: don't try to get maintained status when --no-tree is given
    lib/ida: document locking requirements a bit better
    lib/rbtree.c: fix typo in comment of ____rb_erase_color
    lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
    MAINTAINERS: add drm and drm/i915 irc channels
    MAINTAINERS: add "C:" for URI for chat where developers hang out
    MAINTAINERS: add drm and drm/i915 bug filing info
    MAINTAINERS: add "B:" for URI where to file bugs
    get_maintainer: look for arbitrary letter prefixes in sections
    printk: add Kconfig option to set default console loglevel
    printk/sound: handle more message headers
    printk/btrfs: handle more message headers
    printk/kdb: handle more message headers
    ...

    Linus Torvalds
     
  • This limitation came with the reason to remove "another way for
    malicious code to obscure a compromised program and masquerade as a
    benign process" by allowing "security-conscious program can use this
    prctl once during its early initialization to ensure the prctl cannot
    later be abused for this purpose":

    http://marc.info/?l=linux-kernel&m=133160684517468&w=2

    This explanation doesn't look sufficient. The only thing "exe" link is
    indicating is the file, used to execve, which is basically nothing and
    not reliable immediately after process has returned from execve system
    call.

    Moreover, to use this feature, all the mappings to previous exe file have
    to be unmapped and all the new exe file permissions must be satisfied.

    Which means, that changing exe link is very similar to calling execve on
    the binary.

    The need to remove this limitation comes from migration of NFS mount
    point, which is not accessible during restore and replaced by other file
    system. Because of this exe link has to be changed twice.

    [akpm@linux-foundation.org: fix up comment]
    Link: http://lkml.kernel.org/r/20160927153755.9337.69650.stgit@localhost.localdomain
    Signed-off-by: Stanislav Kinsburskiy
    Acked-by: Oleg Nesterov
    Acked-by: Cyrill Gorcunov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: John Stultz
    Cc: Matt Helsley
    Cc: Pavel Emelyanov
    Cc: Vlastimil Babka
    Cc: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsburskiy
     

16 Nov, 2016

1 commit

  • Some embedded systems have no use for them. This removes about
    25KB from the kernel binary size when configured out.

    Corresponding syscalls are routed to a stub logging the attempt to
    use those syscalls which should be enough of a clue if they were
    disabled without proper consideration. They are: timer_create,
    timer_gettime, timer_getoverrun, timer_settime, timer_delete,
    clock_adjtime, setitimer, getitimer, alarm.

    The clock_settime, clock_gettime, clock_getres and clock_nanosleep
    syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
    CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
    majority of use cases with very little code.

    Signed-off-by: Nicolas Pitre
    Acked-by: Richard Cochran
    Acked-by: Thomas Gleixner
    Acked-by: John Stultz
    Reviewed-by: Josh Triplett
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
     

24 May, 2016

1 commit

  • PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task
    gets killed by the oom killer it would block oom_reaper from
    asynchronous address space reclaim and reduce the chances of timely OOM
    resolving. Wait for the lock in the killable mode and return with EINTR
    if the task got killed while waiting.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Mar, 2016

1 commit

  • This patchset introduces a /proc/<pid>/timerslack_ns interface which
    would allow controlling processes to be able to set the timerslack value
    on other processes in order to save power by avoiding wakeups (Something
    Android currently does via out-of-tree patches).

    The first patch tries to fix the internal timer_slack_ns usage which was
    defined as a long, which limits the slack range to ~4 seconds on 32bit
    systems. It converts it to a u64, which provides the same basically
    unlimited slack (500 years) on both 32bit and 64bit machines.

    The second patch introduces the /proc/<pid>/timerslack_ns interface
    which allows the full 64bit slack range for a task to be read or set on
    both 32bit and 64bit machines.

    With these two patches, on a 32bit machine, after setting the slack on
    bash to 10 seconds:

    $ time sleep 1

    real 0m10.747s
    user 0m0.001s
    sys 0m0.005s

    The first patch is a little ugly, since I had to chase the slack delta
    arguments through a number of functions converting them to u64s. Let me
    know if it makes sense to break that up more or not.

    Other than that things are fairly straightforward.

    This patch (of 2):

    The timer_slack_ns value in the task struct is currently a unsigned
    long. This means that on 32bit applications, the maximum slack is just
    over 4 seconds. However, on 64bit machines, its much much larger (~500
    years).

    This disparity could make application development a little (as well as
    the default_slack) to a u64. This means both 32bit and 64bit systems
    have the same effective internal slack range.

    Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
    the interface as a unsigned long, so we preserve that limitation on
    32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
    long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
    actually larger then what can be stored by an unsigned long.

    This patch also modifies hrtimer functions which specified the slack
    delta as an unsigned long.

    Signed-off-by: John Stultz
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Kees Cook
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     

21 Jan, 2016

1 commit

  • An unprivileged user can trigger an oops on a kernel with
    CONFIG_CHECKPOINT_RESTORE.

    proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
    start/end values. These get sanity checked as follows:
    BUG_ON(arg_start > arg_end);
    BUG_ON(env_start > env_end);

    These can be changed by prctl_set_mm. It turns out that it also takes the
    semaphore for reading, effectively rendering it useless. This results in:

    kernel BUG at fs/proc/base.c:240!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: virtio_net
    CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
    RIP: proc_pid_cmdline_read+0x520/0x530
    RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206
    RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
    RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
    R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
    FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
    Call Trace:
    __vfs_read+0x37/0x100
    vfs_read+0x82/0x130
    SyS_read+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
    RIP proc_pid_cmdline_read+0x520/0x530
    ---[ end trace 97882617ae9c6818 ]---

    Turns out there are instances where the code just reads the aforementioned
    values without any locking whatsoever - namely environ_read and get_cmdline.

    Interestingly these functions look quite resilient against bogus values,
    but I don't believe this should be relied upon.

    The first patch gets rid of the oops bug by grabbing mmap_sem for
    writing.

    The second patch is optional and puts locking around the aforementioned
    consumers for safety. Consumers of other fields don't seem to benefit
    from similar treatment and are left untouched.

    This patch (of 2):

    The code was taking the semaphore for reading, which protects against
    neither readers nor concurrent modifications.

    The problem could cause a sanity check to fail in procfs's cmdline
    reader, resulting in an oops.

    Note that some functions perform an unlocked read of various mm fields,
    but they seem to be fine despite possible modification.

    Signed-off-by: Mateusz Guzik
    Acked-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Cc: Al Viro
    Cc: Anshuman Khandual
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     

07 Nov, 2015

1 commit

  • setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
    the current pid namespace. This is in contrast to both the other modes of
    setpriority and the example of kill(-1). Fix this. getpriority and
    ioprio have the same failure mode; fix them too.
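
    For reference, the PRIO_USER mode selects processes by user ID rather
    than by pid. A minimal userspace sketch of querying it with
    getpriority(), where the wrapper name is hypothetical:

```c
#include <errno.h>
#include <sys/resource.h>

/*
 * Read the PRIO_USER priority for the caller's real UID (who == 0).
 * getpriority() can legitimately return -1, so errno is cleared first
 * and checked afterwards. Returns 0 on success, -1 on error.
 */
int read_user_priority(int *prio)
{
    int value;

    errno = 0;
    value = getpriority(PRIO_USER, 0);
    if (value == -1 && errno != 0)
        return -1;
    *prio = value;
    return 0;
}
```

    With the fix described above, the set of processes such a call
    considers is limited to the caller's pid namespace.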

    Eric said:

    : After some more thinking about it this patch sounds justifiable.
    :
    : My goal with namespaces is not to build perfect isolation mechanisms
    : as that can get into ill defined territory, but to build well defined
    : mechanisms. And to handle the corner cases so you can use only
    : a single namespace with well defined results.
    :
    : In this case you have found the two interfaces I am aware of that
    : identify processes by uid instead of by pid. Which quite frankly is
    : weird. Unfortunately the weird unexpected cases are hard to handle
    : in the usual way.
    :
    : I was hoping for a little more information. Changes like this one we
    : have to be careful of because someone might be depending on the current
    : behavior. I don't think they are and I do think this make sense as part
    : of the pid namespace.

    Signed-off-by: Ben Segall
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Ambrose Feinstein
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Segall
     

10 Jul, 2015

1 commit

  • Today proc and sysfs do not contain any executable files. Several
    applications today mount proc or sysfs without noexec and nosuid and
    then depend on there being no executable files on proc or sysfs.
    Having any executable files show up on proc or sysfs would cause
    a user space visible regression, and most likely security problems.

    Therefore commit to never allowing executables on proc and sysfs by
    adding a new flag to mark them as filesystems without executables and
    enforce that flag.

    Test the flag where MNT_NOEXEC is tested today, so that the only user
    visible effect will be that executables will be treated as if the
    execute bit is cleared.
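
    The combined test can be pictured as a single predicate that checks the
    per-mount option and the new per-filesystem flag together. A minimal
    sketch; the struct, the flag names and their values are assumptions
    made for illustration, not the kernel's definitions:

```c
#include <stdbool.h>

/* Hypothetical flag values for this sketch. */
#define SKETCH_MNT_NOEXEC 0x1u /* per-mount: mounted with noexec */
#define SKETCH_FS_NOEXEC  0x1u /* per-filesystem: never has executables */

struct sketch_mount {
    unsigned int mnt_flags; /* options chosen at mount time */
    unsigned int fs_flags;  /* properties of the filesystem itself */
};

/* True if nothing reached through this mount may be executed. */
bool sketch_noexec(const struct sketch_mount *m)
{
    return (m->mnt_flags & SKETCH_MNT_NOEXEC) ||
           (m->fs_flags & SKETCH_FS_NOEXEC);
}
```

    A filesystem marked this way is treated as noexec even when the mount
    itself was made without that option.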

    The filesystems proc and sysfs do not currently incorporate any
    executable files so this does not result in any user visible effects.

    This makes it unnecessary to vet changes to proc and sysfs tightly for
    adding executable files or changes to chattr that would modify
    existing files, as no matter what the individual files say they will
    not be treated as executable files by the vfs.

    Not having to vet changes too closely is important as without this we
    are only one proc_create call (or another goof up in the
    implementation of notify_change) from having problematic executables
    on proc. Those mistakes are all too easy to make and would create
    a situation where there are security issues or where the assumptions
    of some program have to be broken (causing userspace regressions).

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

26 Jun, 2015

1 commit

  • Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
    consistent view of mm->arg_start et al. fields, but not enough. In
    particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/PR_SET_MM_ENV_START/
    PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
    but don't check that the start address is lower than the end address _at
    all_.

    Consolidate all consistency checks, so there will be no difference in
    the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.
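
    The missing ordering check is small. A minimal sketch, assuming a
    hypothetical struct that holds only the four fields named above (the
    kernel's consolidated check covers more than this):

```c
#include <errno.h>

/* Hypothetical subset of the fields that PR_SET_MM_* can change. */
struct mm_ranges {
    unsigned long arg_start, arg_end;
    unsigned long env_start, env_end;
};

/* Return 0 if each start is at or below its end, otherwise -EINVAL. */
int validate_mm_ranges(const struct mm_ranges *r)
{
    if (r->arg_start > r->arg_end)
        return -EINVAL;
    if (r->env_start > r->env_end)
        return -EINVAL;
    return 0;
}
```

    Running one such check on the complete candidate set of values, instead
    of checking each field in isolation, is what keeps the individual calls
    and the PR_SET_MM_MAP path consistent with each other.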

    The program below reverses both the ARGV and ENVP areas. It makes
    /proc/$PID/cmdline show garbage (it is only by luck that it doesn't oops).

    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    enum {PAGE_SIZE=4096};

    int main(void)
    {
        void *p;

        /* One inaccessible page; both areas will point into it. */
        p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

    #define PR_SET_MM 35
    #define PR_SET_MM_ARG_START 8
    #define PR_SET_MM_ARG_END 9
    #define PR_SET_MM_ENV_START 10
    #define PR_SET_MM_ENV_END 11
        /* Set each start above its end, which the old checks allowed. */
        prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
        prctl(PR_SET_MM, PR_SET_MM_ARG_END, (unsigned long)p, 0, 0);
        prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
        prctl(PR_SET_MM, PR_SET_MM_ENV_END, (unsigned long)p, 0, 0);

        /* Stay alive so /proc/$PID/cmdline can be inspected. */
        pause();
        return 0;
    }

    [akpm@linux-foundation.org: tidy code, tweak comment]
    Signed-off-by: Alexey Dobriyan
    Acked-by: Cyrill Gorcunov
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

09 Jun, 2015

1 commit

  • The MPX code can only work on the current task. You can not,
    for instance, enable MPX management in another process or
    thread. You can also not handle a fault for another process or
    thread.

    Despite this, we pass a task_struct around prolifically. This
    patch removes all of the task struct passing for code paths
    where the code can not deal with another task (which turns out
    to be all of them).

    This has no functional changes. It's just a cleanup.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: bp@alien8.de
    Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

17 Apr, 2015

1 commit

  • Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
    of calling set_mm_exe_file() which requires some form of serialization --
    mmap_sem in this case. For archs that do not have atomic rmw instructions
    we still fall back to a spinlock alternative, so this should always be
    safe. As such, we only need the mmap_sem for looking up the backing
    vm_file, which can be done sharing the lock. Naturally, this means we
    need to manually deal with both the new and old file reference counting,
    and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
    probably be deleted in the future anyway.
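
    The pattern described here, swapping a pointer atomically and then
    fixing up the reference counts by hand, can be sketched in userspace
    with C11 atomics. All type and function names below are hypothetical
    stand-ins, and atomic_exchange plays the role of xchg():

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical reference-counted file object. */
struct file_ref {
    atomic_int refcount;
};

/* Hypothetical stand-in for the mm, which holds one reference to exe_file. */
struct mm_sketch {
    _Atomic(struct file_ref *) exe_file;
};

void file_get(struct file_ref *f)
{
    atomic_fetch_add(&f->refcount, 1);
}

void file_put(struct file_ref *f)
{
    atomic_fetch_sub(&f->refcount, 1);
}

/*
 * Install new_file without holding a lock: take a reference on behalf
 * of the mm, swap the pointer atomically, then drop the reference the
 * mm held on the previous file.
 */
void replace_exe_file(struct mm_sketch *mm, struct file_ref *new_file)
{
    struct file_ref *old;

    file_get(new_file);
    old = atomic_exchange(&mm->exe_file, new_file);
    if (old)
        file_put(old);
}
```

    Because the exchange returns the previous pointer, exactly one caller
    ends up responsible for dropping each old reference, even if two
    replacements race.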

    Signed-off-by: Davidlohr Bueso
    Suggested-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

16 Apr, 2015

1 commit

  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under the CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.
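
    The effect of leaving the symbol undefined can be modelled with a
    compile-time switch. A minimal sketch in which every sketch_ name is
    hypothetical and only the CONFIG_MULTIUSER symbol comes from the patch:

```c
#include <stdbool.h>

/* Define CONFIG_MULTIUSER at build time to model a multi-user kernel. */
#ifdef CONFIG_MULTIUSER
unsigned int sketch_uid = 1000; /* hypothetical non-root caller */
bool sketch_has_cap = false;

unsigned int sketch_current_uid(void) { return sketch_uid; }
bool sketch_capable(int cap) { (void)cap; return sketch_has_cap; }
#else
/* Single-user build: everything runs as root with all capabilities. */
unsigned int sketch_current_uid(void) { return 0; }
bool sketch_capable(int cap) { (void)cap; return true; }
#endif
```

    Built without the symbol, the accessors collapse to constants, which is
    where the size saving comes from: the compiler can discard the code
    that would otherwise handle the non-root cases.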

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, the capable function was moved in order to avoid
    adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.)

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     

01 Mar, 2015

1 commit

  • There's a uname workaround for broken userspace which can't handle kernel
    versions of 3.x. Update it for 4.x.
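
    A sketch of the kind of mapping such a workaround performs. The
    offsets used here (3.x reported as 2.6.(40 + x), 4.x as 2.6.(60 + x))
    are an assumption of this sketch and are not stated in the commit
    message; the function name is hypothetical.

```c
#include <stdio.h>

/*
 * Model of the release string shown to a program that cannot parse
 * 3.x or 4.x versions. Assumed mapping: 3.x -> 2.6.(40 + x) and
 * 4.x -> 2.6.(60 + x). Returns 0 on success, -1 for other majors.
 */
int uname26_release(unsigned int major, unsigned int minor,
                    char *buf, size_t len)
{
    unsigned int offset;

    if (major == 3)
        offset = 40;
    else if (major == 4)
        offset = 60;
    else
        return -1; /* not covered by this sketch */

    snprintf(buf, len, "2.6.%u", offset + minor);
    return 0;
}
```

    Keeping a separate offset per major version means a 4.x kernel never
    reports the same fake 2.6 number as a 3.x kernel.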

    Signed-off-by: Jon DeVree
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon DeVree