15 Oct, 2016

1 commit

  • This easy-to-trigger warning shows up instantly when running
    Trinity on a kernel with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS disabled.

    At most this should have been a printk, but the -EINVAL alone should be more
    than adequate indicator that something isn't available.

    Signed-off-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Dave Jones
     

09 Sep, 2016

4 commits

  • PKRU is the register that lets you disallow writes or all access to a given
    protection key.

    The XSAVE hardware defines an "init state" of 0 for PKRU: its most
    permissive state, allowing access/writes to everything. Since we start off
    all new processes with the init state, we start all processes off with the
    most permissive possible PKRU.

    This is unfortunate. If a thread is clone()'d [1] before a program has
    time to set PKRU to a restrictive value, that thread will be able to write
    to all data, no matter what pkey is set on it. This weakens any integrity
    guarantees that we want pkeys to provide.

    To fix this, we define a very restrictive PKRU to override the
    XSAVE-provided value when we create a new FPU context. We choose a value
    that only allows access to pkey 0, which is as restrictive as we can
    practically make it.

    This does not cause any practical problems with applications using
    protection keys because we require them to specify initial permissions for
    each key when it is allocated, which override the restrictive default.

    In the end, this ensures that threads which do not know how to manage their
    own pkey rights can not do damage to data which is pkey-protected.

    I would have thought this was a pretty contrived scenario, except that I
    heard a bug report from an MPX user who was creating threads in some very
    early code before main(). It may be crazy, but folks evidently _do_ it.

    Signed-off-by: Dave Hansen
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: mgorman@techsingularity.net
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163021.F3C25D4A@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • This patch adds two new system calls:

    int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
    int pkey_free(int pkey);

    These implement an "allocator" for the protection keys
    themselves, which can be thought of as analogous to the allocator
    that the kernel has for file descriptors. The kernel tracks
    which numbers are in use, and only allows operations on keys that
    are valid. A key which was not obtained by pkey_alloc() may not,
    for instance, be passed to pkey_mprotect().

    These system calls are also very important given the kernel's use
    of pkeys to implement execute-only support. These help ensure
    that userspace can never assume that it has control of a key
    unless it first asks the kernel. The kernel does not promise to
    preserve PKRU (right register) contents except for allocated
    pkeys.

    The 'init_access_rights' argument to pkey_alloc() specifies the
    rights that will be established for the returned pkey. For
    instance:

    pkey = pkey_alloc(flags, PKEY_DENY_WRITE);

    will allocate 'pkey', but also sets the bits in PKRU[1] such that
    writing to 'pkey' is already denied.

    The kernel does not prevent pkey_free() from successfully freeing
    in-use pkeys (those still assigned to a memory range by
    pkey_mprotect()). It would be expensive to implement the checks
    for this, so we instead say, "Just don't do it" since sane
    software will never do it anyway.

    Any piece of userspace calling pkey_alloc() needs to be prepared
    for it to fail. Why? pkey_alloc() returns the same error code
    (ENOSPC) when there are no pkeys and when pkeys are unsupported.
    They can be unsupported for a whole host of reasons, so apps must
    be prepared for this. Also, libraries or LD_PRELOADs might steal
    keys before an application gets access to them.

    This allocation mechanism could be implemented in userspace.
    Even if we did it in userspace, we would still need additional
    user/kernel interfaces to tell userspace which keys are being
    used by the kernel internally (such as for execute-only
    mappings). Having the kernel provide this facility completely
    removes the need for these additional interfaces, or having an
    implementation of this in userspace at all.

    Note that we have to make changes to all of the architectures
    that do not use mman-common.h because we use the new
    PKEY_DENY_ACCESS/WRITE macros in arch-independent code.

    1. PKRU is the Protection Key Rights User register. It is a
    usermode-accessible register that controls whether writes
    and/or access to each individual pkey is allowed or denied.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
    Three of those bits: READ/WRITE/EXEC get translated directly in to
    vma->vm_flags by calc_vm_prot_bits(). If a bit is unset in
    mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
    during the mprotect() call.

    We do this clearing today by first calculating the VMA flags we
    want set, then clearing the ones we do not want to inherit from
    the original VMA:

    vm_flags = calc_vm_prot_bits(prot, key);
    ...
    newflags = vm_flags;
    newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

    However, we *also* want to mask off the original VMA's vm_flags in
    which we store the protection key.

    To do that, this patch adds a new macro:

    ARCH_VM_PKEY_FLAGS

    which allows the architecture to specify additional bits that it would
    like cleared. We use that to ensure that the VM_PKEY_BIT* bits get
    cleared.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163013.E48D6981@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • pkey_mprotect() is just like mprotect, except it also takes a
    protection key as an argument. On systems that do not support
    protection keys, it still works, but requires that key=0.
    Otherwise it does exactly what mprotect does.

    I expect it to get used like this, if you want to guarantee that
    any mapping you create can *never* be accessed without the right
    protection keys set up.

    int real_prot = PROT_READ|PROT_WRITE;
    pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
    ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);

    This way, there is *no* window where the mapping is accessible
    since it was always either PROT_NONE or had a protection key set
    that denied all access.

    We settled on 'unsigned long' for the type of the key here. We
    only need 4 bits on x86 today, but I figured that other
    architectures might need some more space.

    Semantically, we have a bit of a problem if we combine this
    syscall with our previously-introduced execute-only support:
    What do we do when we mix execute-only pkey use with
    pkey_mprotect() use? For instance:

    pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
    mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
    mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?

    To solve that, we make the plain-mprotect()-initiated execute-only
    support only apply to VMAs that have the default protection key (0)
    set on them.

    Proposed semantics:
    1. protection key 0 is special and represents the default,
    "unassigned" protection key. It is always allocated.
    2. mprotect() never affects a mapping's pkey_mprotect()-assigned
    protection key. A protection key of 0 (even if set explicitly)
    represents an unassigned protection key.
    2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
    key may or may not result in a mapping with execute-only
    properties. pkey_mprotect() plus pkey_set() on all threads
    should be used to _guarantee_ execute-only semantics if this
    is not a strong enough semantic.
    3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
    kernel will internally attempt to allocate and dedicate a
    protection key for the purpose of execute-only mappings. This
    may not be possible in cases where there are no free protection
    keys available. It can also happen, of course, in situations
    where there is no hardware support for protection keys.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

19 Feb, 2016

3 commits

  • Protection keys provide new page-based protection in hardware.
    But, they have an interesting attribute: they only affect data
    accesses and never affect instruction fetches. That means that
    if we set up some memory which is set as "access-disabled" via
    protection keys, we can still execute from it.

    This patch uses protection keys to set up mappings to do just that.
    If a user calls:

    mmap(..., PROT_EXEC);
    or
    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
    notice this, and set a special protection key on the memory. It
    also sets the appropriate bits in the Protection Keys User Rights
    (PKRU) register so that the memory becomes unreadable and
    unwritable.

    I haven't found any userspace that does this today. With this
    facility in place, we expect userspace to move to use it
    eventually. Userspace _could_ start doing this today. Any
    PROT_EXEC calls get converted to PROT_READ inside the kernel, and
    would transparently be upgraded to "true" PROT_EXEC with this
    code. IOW, userspace never has to do any PROT_EXEC runtime
    detection.

    This feature provides enhanced protection against leaking
    executable memory contents. This helps thwart attacks which are
    attempting to find ROP gadgets on the fly.

    But, the security provided by this approach is not comprehensive.
    The PKRU register which controls access permissions is a normal
    user register writable from unprivileged userspace. An attacker
    who can execute the 'wrpkru' instruction can easily disable the
    protection provided by this feature.

    The protection key that is used for execute-only support is
    permanently dedicated at compile time. This is fine for now
    because there is currently no API to set a protection key other
    than this one.

    Despite there being a constant PKRU value across the entire
    system, we do not set it unless this feature is in use in a
    process. That is to preserve the PKRU XSAVE 'init state',
    which can lead to faster context switches.

    PKRU *is* a user register and the kernel is modifying it. That
    means that code doing:

    pkru = rdpkru()
    pkru |= 0x100;
    mmap(..., PROT_EXEC);
    wrpkru(pkru);

    could lose the bits in PKRU that enforce execute-only
    permissions. To avoid this, we suggest avoiding ever calling
    mmap() or mprotect() when the PKRU value is expected to be
    unstable.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Piotr Kwapulinski
    Cc: Rik van Riel
    Cc: Stephen Smalley
    Cc: Vladimir Murzin
    Cc: Will Deacon
    Cc: keescook@google.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • The Protection Key Rights for User memory (PKRU) is a 32-bit
    user-accessible register. It contains two bits for each
    protection key: one to write-disable (WD) access to memory
    covered by the key and another to access-disable (AD).

    Userspace can read/write the register with the RDPKRU and WRPKRU
    instructions. But, the register is saved and restored with the
    XSAVE family of instructions, which means we have to treat it
    like a floating point register.

    The kernel needs to write to the register if it wants to
    implement execute-only memory or if it implements a system call
    to change PKRU.

    To do this, we need to create a 'pkru_state' buffer, read the old
    contents in to it, modify it, and then tell the FPU code that
    there is modified data in there so it can (possibly) move the
    buffer back in to the registers.

    This uses the fpu__xfeature_set_state() function that we defined
    in the previous patch.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210236.0BE13217@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • The syscall-level code is passed a protection key and need to
    return an appropriate error code if the protection key is bogus.
    We will be using this in subsequent patches.

    Note that this also begins a series of arch-specific calls that
    we need to expose in otherwise arch-independent code. We create
    a linux/pkeys.h header where we will put *all* the stubs for
    these functions.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210232.774EEAAB@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen