20 Mar, 2017

1 commit

  • Intel supports faulting on the CPUID instruction beginning with Ivy Bridge.
    When enabled, the processor will fault on attempts to execute the CPUID
    instruction with CPL>0. Exposing this feature to userspace will allow a
    ptracer to trap and emulate the CPUID instruction.

    When supported, this feature is controlled by toggling bit 0 of
    MSR_MISC_FEATURES_ENABLES. It is documented in detail in Section 2.3.2 of
    https://bugzilla.kernel.org/attachment.cgi?id=243991

    Implement a new pair of arch_prctls, available on both x86-32 and x86-64.

    ARCH_GET_CPUID: Returns the current CPUID state, either 0 if CPUID faulting
    is enabled (and thus the CPUID instruction is not available) or 1 if
    CPUID faulting is not enabled.

    ARCH_SET_CPUID: Set the CPUID state to the second argument. If
    cpuid_enabled is 0 CPUID faulting will be activated, otherwise it will
    be deactivated. Returns ENODEV if CPUID faulting is not supported on
    this system.

    The state of the CPUID faulting flag is propagated across forks, but reset
    upon exec.

    Signed-off-by: Kyle Huey
    Cc: Grzegorz Andrejczuk
    Cc: kvm@vger.kernel.org
    Cc: Radim Krčmář
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: linux-kselftest@vger.kernel.org
    Cc: Nadav Amit
    Cc: Robert O'Callahan
    Cc: Richard Weinberger
    Cc: "Rafael J. Wysocki"
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Len Brown
    Cc: Shuah Khan
    Cc: user-mode-linux-devel@lists.sourceforge.net
    Cc: Jeff Dike
    Cc: Alexander Viro
    Cc: user-mode-linux-user@lists.sourceforge.net
    Cc: David Matlack
    Cc: Boris Ostrovsky
    Cc: Dmitry Safonov
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Paolo Bonzini
    Link: http://lkml.kernel.org/r/20170320081628.18952-9-khuey@kylehuey.com
    Signed-off-by: Thomas Gleixner

    Kyle Huey
     

02 Mar, 2017

6 commits

  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched/numa_balancing.h>

    We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • threadgroup_change_begin()/end() is a pointless wrapper around
    cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
    in the !CONFIG_CGROUPS=y case.

    Remove the wrappery, move the might_sleep() (the down_read()
    already has a might_sleep() check).

    This debloats a bit and simplifies this API.

    Update all call sites.

    No change in functionality.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 Feb, 2017

1 commit

  • Right now bprm_fill_uid() uses inode fetched from file_inode(bprm->file).
    This in turn returns inode of lower filesystem (in a stacked filesystem
    setup).

    I was playing with modified patches of shiftfs posted by james bottomley
    and realized that through shiftfs setuid bit does not take effect. And
    reason being that we fetch uid/gid from inode of lower fs (and not from
    shiftfs inode). And that results in following checks failing.

    /* We ignore suid/sgid if there are no mappings for them in the ns */
    if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
    !kgid_has_mapping(bprm->cred->user_ns, gid))
    return;

    uid/gid fetched from lower fs inode might not be mapped inside the user
    namespace of container. So we need to look at uid/gid fetched from
    upper filesystem (shiftfs in this particular case) and these should be
    mapped and setuid bit can take affect.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Eric W. Biederman

    Vivek Goyal
     

24 Jan, 2017

1 commit


25 Dec, 2016

1 commit


23 Dec, 2016

1 commit

  • If you have a process that has set itself to be non-dumpable, and it
    then undergoes exec(2), any CLOEXEC file descriptors it has open are
    "exposed" during a race window between the dumpable flags of the process
    being reset for exec(2) and CLOEXEC being applied to the file
    descriptors. This can be exploited by a process by attempting to access
    /proc//fd/... during this window, without requiring CAP_SYS_PTRACE.

    The race in question is after set_dumpable has been (for get_link,
    though the trace is basically the same for readlink):

    [vfs]
    -> proc_pid_link_inode_operations.get_link
    -> proc_pid_get_link
    -> proc_fd_access_allowed
    -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);

    Which will return 0, during the race window and CLOEXEC file descriptors
    will still be open during this window because do_close_on_exec has not
    been called yet. As a result, the ordering of these calls should be
    reversed to avoid this race window.

    This is of particular concern to container runtimes, where joining a
    PID namespace with file descriptors referring to the host filesystem
    can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect
    against access of CLOEXEC file descriptors -- file descriptors which may
    reference filesystem objects the container shouldn't have access to).

    Cc: dev@opencontainers.org
    Cc: # v3.2+
    Reported-by: Michael Crosby
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     

15 Dec, 2016

3 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton : (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Patch series "mm: unexport __get_user_pages_unlocked()".

    This patch series continues the cleanup of get_user_pages*() functions
    taking advantage of the fact we can now pass gup_flags as we please.

    It firstly adds an additional 'locked' parameter to
    get_user_pages_remote() to allow for its callers to utilise
    VM_FAULT_RETRY functionality. This is necessary as the invocation of
    __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
    this and no other existing higher level function would allow it to do
    so.

    Secondly existing callers of __get_user_pages_unlocked() are replaced
    with the appropriate higher-level replacement -
    get_user_pages_unlocked() if the current task and memory descriptor are
    referenced, or get_user_pages_remote() if other task/memory descriptors
    are referenced (having acquiring mmap_sem.)

    This patch (of 2):

    Add a int *locked parameter to get_user_pages_remote() to allow
    VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

    Taking into account the previous adjustments to get_user_pages*()
    functions allowing for the passing of gup_flags, we are now in a
    position where __get_user_pages_unlocked() need only be exported for his
    ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
    subsequently unexport __get_user_pages_unlocked() as well as allowing
    for future flexibility in the use of get_user_pages_remote().

    [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
    Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Pull namespace updates from Eric Biederman:
    "After a lot of discussion and work we have finally reachanged a basic
    understanding of what is necessary to make unprivileged mounts safe in
    the presence of EVM and IMA xattrs which the last commit in this
    series reflects. While technically it is a revert the comments it adds
    are important for people not getting confused in the future. Clearing
    up that confusion allows us to seriously work on unprivileged mounts
    of fuse in the next development cycle.

    The rest of the fixes in this set are in the intersection of user
    namespaces, ptrace, and exec. I started with the first fix which
    started a feedback cycle of finding additional issues during review
    and fixing them. Culiminating in a fix for a bug that has been present
    since at least Linux v1.0.

    Potentially these fixes were candidates for being merged during the rc
    cycle, and are certainly backport candidates but enough little things
    turned up during review and testing that I decided they should be
    handled as part of the normal development process just to be certain
    there were not any great surprises when it came time to backport some
    of these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
    exec: Ensure mm->user_ns contains the execed files
    ptrace: Don't allow accessing an undumpable mm
    ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
    mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

    Linus Torvalds
     

23 Nov, 2016

2 commits

  • When the user namespace support was merged the need to prevent
    ptrace from revealing the contents of an unreadable executable
    was overlooked.

    Correct this oversight by ensuring that the executed file
    or files are in mm->user_ns, by adjusting mm->user_ns.

    Use the new function privileged_wrt_inode_uidgid to see if
    the executable is a member of the user namespace, and as such
    if having CAP_SYS_PTRACE in the user namespace should allow
    tracing the executable. If not update mm->user_ns to
    the parent user namespace until an appropriate parent is found.

    Cc: stable@vger.kernel.org
    Reported-by: Jann Horn
    Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
    overlooked. This can result in incorrect behavior when an application
    like strace traces an exec of a setuid executable.

    Further PT_PTRACE_CAP does not have enough information for making good
    security decisions as it does not report which user namespace the
    capability is in. This has already allowed one mistake through
    insufficient granulariy.

    I found this issue when I was testing another corner case of exec and
    discovered that I could not get strace to set PT_PTRACE_CAP even when
    running strace as root with a full set of caps.

    This change fixes the above issue with strace allowing stracing as
    root a setuid executable without disabling setuid. More fundamentaly
    this change allows what is allowable at all times, by using the correct
    information in it's decision.

    Cc: stable@vger.kernel.org
    Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

16 Nov, 2016

1 commit

  • Some embedded systems have no use for them. This removes about
    25KB from the kernel binary size when configured out.

    Corresponding syscalls are routed to a stub logging the attempt to
    use those syscalls which should be enough of a clue if they were
    disabled without proper consideration. They are: timer_create,
    timer_gettime: timer_getoverrun, timer_settime, timer_delete,
    clock_adjtime, setitimer, getitimer, alarm.

    The clock_settime, clock_gettime, clock_getres and clock_nanosleep
    syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
    CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
    majority of use cases with very little code.

    Signed-off-by: Nicolas Pitre
    Acked-by: Richard Cochran
    Acked-by: Thomas Gleixner
    Acked-by: John Stultz
    Reviewed-by: Josh Triplett
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
     

19 Oct, 2016

1 commit


05 Aug, 2016

1 commit

  • Pull m68knommu updates from Greg Ungerer:
    "This series is all about Nicolas flat format support for MMU systems.

    Traditional m68k no-MMU flat format binaries can now be run on m68k
    MMU enabled systems too. The series includes some nice cleanups of
    the binfmt_flat code and converts it to using proper user space
    accessor functions.

    With all this in place you can boot and run a complete no-MMU flat
    format based user space on an MMU enabled system"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
    m68k: enable binfmt_flat on systems with an MMU
    binfmt_flat: allow compressed flat binary format to work on MMU systems
    binfmt_flat: add MMU-specific support
    binfmt_flat: update libraries' data segment pointer with userspace accessors
    binfmt_flat: use clear_user() rather than memset() to clear .bss
    binfmt_flat: use proper user space accessors with old relocs code
    binfmt_flat: use proper user space accessors with relocs processing code
    binfmt_flat: clean up create_flat_tables() and stack accesses
    binfmt_flat: use generic transfer_args_to_stack()
    elf_fdpic_transfer_args_to_stack(): make it generic
    binfmt_flat: prevent kernel dammage from corrupted executable headers
    binfmt_flat: convert printk invocations to their modern form
    binfmt_flat: assorted cleanups
    m68k: use same start_thread() on MMU and no-MMU
    m68k: fix file path comment
    m68k: fix bFLT executable running on MMU enabled systems

    Linus Torvalds
     

03 Aug, 2016

1 commit

  • Some systems are memory constrained but they need to load very large
    firmwares. The firmware subsystem allows drivers to request this
    firmware be loaded from the filesystem, but this requires that the
    entire firmware be loaded into kernel memory first before it's provided
    to the driver. This can lead to a situation where we map the firmware
    twice, once to load the firmware into kernel memory and once to copy the
    firmware into the final resting place.

    This creates needless memory pressure and delays loading because we have
    to copy from kernel memory to somewhere else. Let's add a
    request_firmware_into_buf() API that allows drivers to request firmware
    be loaded directly into a pre-allocated buffer. This skips the
    intermediate step of allocating a buffer in kernel memory to hold the
    firmware image while it's read from the filesystem. It also requires
    that drivers know how much memory they'll require before requesting the
    firmware and negates any benefits of firmware caching because the
    firmware layer doesn't manage the buffer lifetime.

    For a 16MB buffer, about half the time is spent performing a memcpy from
    the buffer to the final resting place. I see loading times go from
    0.081171 seconds to 0.047696 seconds after applying this patch. Plus
    the vmalloc pressure is reduced.

    This is based on a patch from Vikram Mulukutla on codeaurora.org:
    https://www.codeaurora.org/cgit/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=rel/msm-3.18&id=0a328c5f6cd999f5c591f172216835636f39bcb5

    Link: http://lkml.kernel.org/r/20160607164741.31849-4-stephen.boyd@linaro.org
    Signed-off-by: Stephen Boyd
    Cc: Mimi Zohar
    Cc: Vikram Mulukutla
    Cc: Mark Brown
    Cc: Ming Lei
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Boyd
     

25 Jul, 2016

1 commit

  • This copying of arguments and environment is common to both NOMMU
    binary formats we support. Let's make the elf_fdpic version available
    to the flat format as well.

    While at it, improve the code a bit not to copy below the actual
    data area.

    Signed-off-by: Nicolas Pitre
    Reviewed-by: Greg Ungerer
    Signed-off-by: Greg Ungerer

    Nicolas Pitre
     

24 Jun, 2016

1 commit

  • If a process gets access to a mount from a different user
    namespace, that process should not be able to take advantage of
    setuid files or selinux entrypoints from that filesystem. Prevent
    this by treating mounts from other mount namespaces and those not
    owned by current_user_ns() or an ancestor as nosuid.

    This will make it safer to allow more complex filesystems to be
    mounted in non-root user namespaces.

    This does not remove the need for MNT_LOCK_NOSUID. The setuid,
    setgid, and file capability bits can no longer be abused if code in
    a user namespace were to clear nosuid on an untrusted filesystem,
    but this patch, by itself, is insufficient to protect the system
    from abuse of files that, when execed, would increase MAC privilege.

    As a more concrete explanation, any task that can manipulate a
    vfsmount associated with a given user namespace already has
    capabilities in that namespace and all of its descendents. If they
    can cause a malicious setuid, setgid, or file-caps executable to
    appear in that mount, then that executable will only allow them to
    elevate privileges in exactly the set of namespaces in which they
    are already privileges.

    On the other hand, if they can cause a malicious executable to
    appear with a dangerous MAC label, running it could change the
    caller's security context in a way that should not have been
    possible, even inside the namespace in which the task is confined.

    As a hardening measure, this would have made CVE-2014-5207 much
    more difficult to exploit.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Seth Forshee
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Andy Lutomirski
     

24 May, 2016

2 commits

  • setup_arg_pages requires mmap_sem for write. If the waiting task gets
    killed by the oom killer it would block oom_reaper from asynchronous
    address space reclaim and reduce the chances of timely OOM resolving.
    Wait for the lock in the killable mode and return with EINTR if the task
    got killed while waiting. All the callers are already handling error
    path and the fatal signal doesn't need any additional treatment.

    The same applies to __bprm_mm_init.

    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Acked-by: Vlastimil Babka
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • remove_arg_zero() does free_arg_page() for no reason. This was needed
    before and only if CONFIG_MMU=y: see commit 4fc75ff4816c ("exec: fix
    remove_arg_zero"), install_arg_page() was called for every page != NULL
    in bprm->page[] array. Today install_arg_page() has already gone and
    free_arg_page() is nop after another commit b6a2fea39318 ("mm: variable
    length argument support").

    CONFIG_MMU=n does free_arg_pages() in free_bprm() and thus it doesn't
    need remove_arg_zero()->free_arg_page() too; apart from get_arg_page()
    it never checks if the page in bprm->page[] was allocated or not, so the
    "extra" non-freed page is fine. OTOH, this free_arg_page() can add the
    minor pessimization, the caller is going to do copy_strings_kernel()
    right after remove_arg_zero() which will likely need to re-allocate the
    same page again.

    And as Hujunjie pointed out, the "offset == PAGE_SIZE" check is wrong
    because we are going to increment bprm->p once again before return, so
    CONFIG_MMU=n "leaks" the page anyway if '0' is the final byte in this
    page.

    NOTE: remove_arg_zero() assumes that argv[0] is null-terminated but this
    is not necessarily true. copy_strings() does "len = strnlen_user(...)",
    then copy_from_user(len) but another thread or debuger can overwrite the
    trailing '0' in between. Afaics nothing really bad can happen because
    we must always have the null-terminated bprm->filename copied by the 1st
    copy_strings_kernel(), but perhaps we should change this code to check
    "bprm->p < bprm->exec" anyway, and/or change copy_strings() to ensure
    that the last byte in string is always zero.

    Link: http://lkml.kernel.org/r/20160517155335.GA31435@redhat.com
    Signed-off-by: Oleg Nesterov
    Reported by: hujunjie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

20 May, 2016

1 commit

  • Pull security subsystem updates from James Morris:
    "Highlights:

    - A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
    of modules and firmware to be loaded from a specific device (this
    is from ChromeOS, where the device as a whole is verified
    cryptographically via dm-verity).

    This is disabled by default but can be configured to be enabled by
    default (don't do this if you don't know what you're doing).

    - Keys: allow authentication data to be stored in an asymmetric key.
    Lots of general fixes and updates.

    - SELinux: add restrictions for loading of kernel modules via
    finit_module(). Distinguish non-init user namespace capability
    checks. Apply execstack check on thread stacks"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
    LSM: LoadPin: provide enablement CONFIG
    Yama: use atomic allocations when reporting
    seccomp: Fix comment typo
    ima: add support for creating files using the mknodat syscall
    ima: fix ima_inode_post_setattr
    vfs: forbid write access when reading a file into memory
    fs: fix over-zealous use of "const"
    selinux: apply execstack check on thread stacks
    selinux: distinguish non-init user namespace capability checks
    LSM: LoadPin for kernel file loading restrictions
    fs: define a string representation of the kernel_read_file_id enumeration
    Yama: consolidate error reporting
    string_helpers: add kstrdup_quotable_file
    string_helpers: add kstrdup_quotable_cmdline
    string_helpers: add kstrdup_quotable
    selinux: check ss_initialized before revalidating an inode label
    selinux: delay inode label lookup as long as possible
    selinux: don't revalidate an inode's label when explicitly setting it
    selinux: Change bool variable name to index.
    KEYS: Add KEYCTL_DH_COMPUTE command
    ...

    Linus Torvalds
     

18 May, 2016

1 commit


01 May, 2016

1 commit

  • This patch is based on top of the "vfs: support for a common kernel file
    loader" patch set. In general when the kernel is reading a file into
    memory it does not want anything else writing to it.

    The kernel currently only forbids write access to a file being executed.
    This patch extends this locking to files being read by the kernel.

    Changelog:
    - moved function to kernel_read_file() - Mimi
    - updated patch description - Mimi

    Signed-off-by: Dmitry Kasatkin
    Cc: Al Viro
    Signed-off-by: Mimi Zohar
    Reviewed-by: Luis R. Rodriguez
    Acked-by: Kees Cook

    Dmitry Kasatkin
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

21 Feb, 2016

3 commits

  • This patch defines kernel_read_file_from_fd(), a wrapper for the VFS
    common kernel_read_file().

    Changelog:
    - Separated from the kernel modules patch
    Acked-by: Kees Cook
    Acked-by: Luis R. Rodriguez
    Cc: Al Viro

    Signed-off-by: Mimi Zohar

    Mimi Zohar
     
  • The kernel_read_file security hook is called prior to reading the file
    into memory.

    Changelog v4+:
    - export security_kernel_read_file()

    Signed-off-by: Mimi Zohar
    Acked-by: Kees Cook
    Acked-by: Luis R. Rodriguez
    Acked-by: Casey Schaufler

    Mimi Zohar
     
  • This patch defines kernel_read_file_from_path(), a wrapper for the VFS
    common kernel_read_file().

    Changelog:
    - revert error msg regression - reported by Sergey Senozhatsky
    - Separated from the IMA patch

    Signed-off-by: Mimi Zohar
    Acked-by: Kees Cook
    Acked-by: Luis R. Rodriguez
    Cc: Al Viro

    Mimi Zohar
     

19 Feb, 2016

2 commits

  • To differentiate between the kernel_read_file() callers, this patch
    defines a new enumeration named kernel_read_file_id and includes the
    caller identifier as an argument.

    Subsequent patches define READING_KEXEC_IMAGE, READING_KEXEC_INITRAMFS,
    READING_FIRMWARE, READING_MODULE, and READING_POLICY.

    Changelog v3:
    - Replace the IMA specific enumeration with a generic one.

    Signed-off-by: Mimi Zohar
    Acked-by: Kees Cook
    Acked-by: Luis R. Rodriguez
    Cc: Al Viro

    Mimi Zohar
     
  • For a while it was looked down upon to directly read files from Linux.
    These days there exists a few mechanisms in the kernel that do just
    this though to load a file into a local buffer. There are minor but
    important checks differences on each. This patch set is the first
    attempt at resolving some of these differences.

    This patch introduces a common function for reading files from the kernel
    with the corresponding security post-read hook and function.

    Changelog v4+:
    - export security_kernel_post_read_file() - Fengguang Wu
    v3:
    - additional bounds checking - Luis
    v2:
    - To simplify patch review, re-ordered patches

    Signed-off-by: Mimi Zohar
    Reviewed-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Al Viro

    Mimi Zohar
     

16 Feb, 2016

1 commit

  • For protection keys, we need to understand whether protections
    should be enforced in software or not. In general, we enforce
    protections when working on our own task, but not when on others.
    We call these "current" and "remote" operations.

    This patch introduces a new get_user_pages() variant:

    get_user_pages_remote()

    Which is a replacement for when get_user_pages() is called on
    non-current tsk/mm.

    We also introduce a new gup flag: FOLL_REMOTE which can be used
    for the "__" gup variants to get this new behavior.

    The uprobes is_trap_at_addr() location holds mmap_sem and
    calls get_user_pages(current->mm) on an instruction address. This
    makes it a pretty unique gup caller. Being an instruction access
    and also really originating from the kernel (vs. the app), I opted
    to consider this a 'remote' access where protection keys will not
    be enforced.

    Without protection keys, this patch should not change any behavior.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

04 Jan, 2016

1 commit


10 Jul, 2015

1 commit

  • Today proc and sysfs do not contain any executable files. Several
    applications today mount proc or sysfs without noexec and nosuid and
    then depend on there being no exectuables files on proc or sysfs.
    Having any executable files show on proc or sysfs would cause
    a user space visible regression, and most likely security problems.

    Therefore commit to never allowing executables on proc and sysfs by
    adding a new flag to mark them as filesystems without executables and
    enforce that flag.

    Test the flag where MNT_NOEXEC is tested today, so that the only user
    visible effect will be that exectuables will be treated as if the
    execute bit is cleared.

    The filesystems proc and sysfs do not currently incoporate any
    executable files so this does not result in any user visible effects.

    This makes it unnecessary to vet changes to proc and sysfs tightly for
    adding exectuable files or changes to chattr that would modify
    existing files, as no matter what the individual file say they will
    not be treated as exectuable files by the vfs.

    Not having to vet changes to closely is important as without this we
    are only one proc_create call (or another goof up in the
    implementation of notify_change) from having problematic executables
    on proc. Those mistakes are all too easy to make and would create
    a situation where there are security issues or the assumptions of
    some program having to be broken (and cause userspace regressions).

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

13 May, 2015

1 commit

  • On architectures where the stack grows upwards (CONFIG_STACK_GROWSUP=y,
    currently parisc and metag only) stack randomization sometimes leads to crashes
    when the stack ulimit is set to lower values than STACK_RND_MASK (which is 8 MB
    by default if not defined in arch-specific headers).

    The problem is, that when the stack vm_area_struct is set up in fs/exec.c, the
    additional space needed for the stack randomization (as defined by the value of
    STACK_RND_MASK) was not taken into account yet and as such, when the stack
    randomization code added a random offset to the stack start, the stack
    effectively got smaller than what the user defined via rlimit_max(RLIMIT_STACK)
    which then sometimes leads to out-of-stack situations and crashes.

    This patch fixes it by adding the maximum possible amount of memory (based on
    STACK_RND_MASK) which theoretically could be added by the stack randomization
    code to the initial stack size. That way, the user-defined stack size is always
    guaranteed to be at minimum what is defined via rlimit_max(RLIMIT_STACK).

    This bug is currently not visible on the metag architecture, because on metag
    STACK_RND_MASK is defined to 0 which effectively disables stack randomization.

    The changes to fs/exec.c are inside an "#ifdef CONFIG_STACK_GROWSUP"
    section, so it does not affect other platformws beside those where the
    stack grows upwards (parisc and metag).

    Signed-off-by: Helge Deller
    Cc: linux-parisc@vger.kernel.org
    Cc: James Hogan
    Cc: linux-metag@vger.kernel.org
    Cc: stable@vger.kernel.org # v3.16+

    Helge Deller
     

20 Apr, 2015

1 commit


17 Apr, 2015

1 commit

  • We set sig->notify_count = -1 between RELEASE and ACQUIRE operations:

    spin_unlock_irq(lock);
    ...
    if (!thread_group_leader(tsk)) {
    ...
    for (;;) {
    sig->notify_count = -1;
    write_lock_irq(&tasklist_lock);

    There are no restriction on it so other processors may see this STORE
    mixed with other STOREs in both areas limited by the spinlocks.

    Probably, it may be reordered with the above

    sig->group_exit_task = tsk;
    sig->notify_count = zap_other_threads(tsk);

    in some way.

    Set it under tasklist_lock locked to be sure nothing will be reordered.

    Signed-off-by: Kirill Tkhai
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai