03 Mar, 2018

1 commit

  • [ Upstream commit 3756f6401c302617c5e091081ca4d26ab604bec5 ]

    gcc-8 warns about using strncpy() with the source size as the limit:

    fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]

    This is indeed slightly suspicious, as it protects us from source
    arguments without NUL-termination, but does not guarantee that the
    destination is terminated.

    This keeps the strncpy() to ensure we have properly padded target
    buffer, but ensures that we use the correct length, by passing the
    actual length of the destination buffer as well as adding a build-time
    check to ensure it is exactly TASK_COMM_LEN.

    There are only 23 call sites, all of which I reviewed to ensure this is
    currently the case. We could get away with doing only the check or
    passing the right length, but it doesn't hurt to do both.

    Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Suggested-by: Kees Cook
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Cc: Alexander Viro
    Cc: Peter Zijlstra
    Cc: Serge Hallyn
    Cc: James Morris
    Cc: Aleksa Sarai
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
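
The pattern described in this commit (limit strncpy() by the destination size, and check that size at build time) can be sketched in userspace C roughly as follows. This is an illustration under assumptions, not the kernel's code: copy_comm() and COPY_COMM_CHECKED() are hypothetical names, the negative-array-size trick stands in for the kernel's own build-time assertion, the forced termination is an extra added for this sketch, and TASK_COMM_LEN is assumed to be 16.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TASK_COMM_LEN 16 /* assumed value for this sketch */

/* Hypothetical helper: copy a NUL-terminated name into dst, padding with NULs. */
static char *copy_comm(char *dst, size_t dst_len, const char *src)
{
    /* Limit by the destination size, not by the source size. */
    strncpy(dst, src, dst_len);
    /* Extra safety added for this sketch: force termination. */
    dst[dst_len - 1] = '\0';
    return dst;
}

/*
 * Hypothetical wrapper: the array type gets a negative size, and therefore
 * fails to compile, unless buf is an array of exactly TASK_COMM_LEN bytes.
 */
#define COPY_COMM_CHECKED(buf, src) \
    ((void)sizeof(char[sizeof(buf) == TASK_COMM_LEN ? 1 : -1]), \
     copy_comm((buf), sizeof(buf), (src)))
```

Passing a pointer or a differently sized array as buf makes the build fail, which is the point of doing both the length passing and the size check.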
     

05 Jan, 2018

1 commit

  • commit e816c201aed5232171f8eb80b5d46ae6516683b9 upstream.

    This is a logical revert of commit e37fdb785a5f ("exec: Use secureexec
    for setting dumpability")

    This weakens dumpability back to checking only for uid/gid changes in
    current (which is useless), but userspace depends on dumpability not
    being tied to secureexec.

    https://bugzilla.redhat.com/show_bug.cgi?id=1528633

    Reported-by: Tom Horsley
    Fixes: e37fdb785a5f ("exec: Use secureexec for setting dumpability")
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

20 Dec, 2017

1 commit

  • commit 779f4e1c6c7c661db40dfebd6dd6bda7b5f88aa3 upstream.

    This reverts commit 04e35f4495dd560db30c25efca4eecae8ec8c375.

    SELinux runs with secureexec for all non-"noatsecure" domain transitions,
    which means lots of processes end up hitting the stack hard-limit change
    that was introduced in order to fix a race with prlimit(). That race fix
    will need to be redesigned.

    Reported-by: Laura Abbott
    Reported-by: Tomáš Trnka
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

05 Dec, 2017

1 commit

  • commit 04e35f4495dd560db30c25efca4eecae8ec8c375 upstream.

    While the defense-in-depth RLIMIT_STACK limit on setuid processes was
    protected against races from other threads calling setrlimit(), I missed
    protecting it against races from external processes calling prlimit().
    This adds locking around the change and makes sure that rlim_max is set
    too.

    Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
    Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
    Signed-off-by: Kees Cook
    Reported-by: Ben Hutchings
    Reported-by: Brad Spengler
    Acked-by: Serge Hallyn
    Cc: James Morris
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers intent to use this functionality early on, for
    example, before the process spawns the second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
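
From user space, the register-then-use sequence described in this commit might look like the sketch below. It assumes a Linux system whose installed <linux/membarrier.h> already defines MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED (headers from 4.14 or later), and it goes through syscall(2) directly. The helper names membarrier_call() and private_expedited_barrier() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw system call (flags are 0). */
static int membarrier_call(int cmd)
{
    return (int)syscall(__NR_membarrier, cmd, 0, 0);
}

/*
 * Register intent first (ideally before spawning a second thread), then
 * issue the private expedited barrier. Returns 0 on success, or -1 with
 * errno set, for example on kernels that lack the command.
 */
static int private_expedited_barrier(void)
{
    if (membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) != 0)
        return -1;
    return membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED);
}
```

Skipping the registration step is what produces the EPERM result mentioned in the commit message.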
     

04 Oct, 2017

1 commit

  • Patch series "exec: binfmt_misc: fix use-after-free, kill
    iname[BINPRM_BUF_SIZE]".

    It looks like this code was always wrong, then commit 948b701a607f
    ("binfmt_misc: add persistent opened binary handler for containers")
    added more problems.

    This patch (of 6):

    load_script() can simply use i_name instead; it points into bprm->buf[]
    and nobody can change this memory until we call prepare_binprm().

    The only complication is that we need to also change the signature of
    bprm_change_interp() but this change looks good too.

    While at it, do whitespace/style cleanups.

    NOTE: the real motivation for this change is that people want to
    increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
    this looks more complicated because afaics it is very buggy.

    Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Travis Gummels
    Cc: Ben Woodard
    Cc: Jim Foraker
    Cc:
    Cc: Al Viro
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
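
The idea of using a pointer into the already-read buffer, rather than copying the interpreter name into a second fixed-size array, can be illustrated with a small userspace parser. This is only a sketch: parse_shebang() is a hypothetical function, not the kernel's load_script(), and it assumes a writable, NUL-terminated buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical in-place "#!" parser. Returns a pointer to the interpreter
 * name inside buf (no copy is made), or NULL if buf is not a usable script
 * header. *i_arg is set to the optional argument, also inside buf, or NULL.
 */
static char *parse_shebang(char *buf, char **i_arg)
{
    char *cp, *i_name;

    *i_arg = NULL;
    if (buf[0] != '#' || buf[1] != '!')
        return NULL;

    /* Cut the buffer at the end of the first line. */
    buf[strcspn(buf, "\n")] = '\0';

    /* Strip trailing blanks. */
    cp = buf + strlen(buf);
    while (cp > buf + 2 && (cp[-1] == ' ' || cp[-1] == '\t'))
        *--cp = '\0';

    /* Skip blanks between "#!" and the interpreter path. */
    i_name = buf + 2;
    i_name += strspn(i_name, " \t");
    if (*i_name == '\0')
        return NULL;

    /* Terminate the name; whatever follows is the single optional argument. */
    cp = i_name + strcspn(i_name, " \t");
    if (*cp != '\0') {
        *cp++ = '\0';
        cp += strspn(cp, " \t");
        if (*cp != '\0')
            *i_arg = cp;
    }
    return i_name;
}
```

Because both returned pointers alias buf, they stay valid only as long as nothing overwrites the buffer, which mirrors the constraint the commit message states about bprm->buf[].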
     

15 Sep, 2017

2 commits

  • This patch constifies the path argument to kernel_read_file_from_path().

    Signed-off-by: Mimi Zohar
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Mimi Zohar
     
  • Pull more set_fs removal from Al Viro:
    "Christoph's 'use kernel_read and friends rather than open-coding
    set_fs()' series"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: unexport vfs_readv and vfs_writev
    fs: unexport vfs_read and vfs_write
    fs: unexport __vfs_read/__vfs_write
    lustre: switch to kernel_write
    gadget/f_mass_storage: stop messing with the address limit
    mconsole: switch to kernel_read
    btrfs: switch write_buf to kernel_write
    net/9p: switch p9_fd_read to kernel_write
    mm/nommu: switch do_mmap_private to kernel_read
    serial2002: switch serial2002_tty_write to kernel_{read/write}
    fs: make the buf argument to __kernel_write a void pointer
    fs: fix kernel_write prototype
    fs: fix kernel_read prototype
    fs: move kernel_read to fs/read_write.c
    fs: move kernel_write to fs/read_write.c
    autofs4: switch autofs4_write to __kernel_write
    ashmem: switch to ->read_iter

    Linus Torvalds
     

14 Sep, 2017

1 commit

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic it becomes much less clear when to use the
    highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE which in itself is tricky because basically none of
    the existing callers provide a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use without any measuring. This suggests that GFP_TEMPORARY just
    encourages cargo-cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with highlevel semantic should be clearly defined to prevent
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    converting all existing users to simply use GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect short-term
    allocations but I propose starting from a clear semantic definition and
    only then add users with proper justification.

    This was brought up before LSF this year by Matthew [1] and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
    seems to be a heuristic without any measured advantage for most (if not
    all) its current users. The follow up discussion has revealed that
    opinions on what might be temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected I propose to simply remove the flag
    and start from scratch if we really need a semantic for short term
    allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Sep, 2017

2 commits


02 Aug, 2017

10 commits

  • Instead of an additional secureexec check for pdeath_signal, just move it
    up into the initial secureexec test. Neither perf nor arch code touches
    pdeath_signal, so the relocation shouldn't change anything.

    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn

    Kees Cook
     
  • For a secureexec, before memory layout selection has happened, reset the
    stack rlimit to something sane to avoid the caller having control over
    the resulting layouts.

    $ ulimit -s
    8192
    $ ulimit -s unlimited
    $ /bin/sh -c 'ulimit -s'
    unlimited
    $ sudo /bin/sh -c 'ulimit -s'
    8192

    Cc: Linus Torvalds
    Signed-off-by: Kees Cook
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn

    Kees Cook
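
The clamping behaviour shown in the shell session above can be modelled as a pure function. This is a sketch under assumptions: stack_limit_for_exec() and SANE_STACK_LIMIT are hypothetical names, the 8 MiB value is taken from the "8192" (KiB) in the example output, and the comparison relies on RLIM_INFINITY being the largest rlim_t value, as it is on Linux.

```c
#include <assert.h>
#include <sys/resource.h>

/* Assumed default for this sketch: 8 MiB, matching the "8192" shown above. */
#define SANE_STACK_LIMIT ((rlim_t)8 * 1024 * 1024)

/* Hypothetical helper: the soft stack limit a new image would start with. */
static rlim_t stack_limit_for_exec(rlim_t cur, int secureexec)
{
    /* Only a privilege-gaining exec has its limit pulled back down. */
    if (secureexec && cur > SANE_STACK_LIMIT)
        return SANE_STACK_LIMIT;
    return cur;
}
```

A limit already below the default is left alone, so the caller can still shrink the stack but cannot inflate it across a secureexec.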
     
  • Since it's already valid to set dumpability in the early part of
    setup_new_exec(), we can consolidate the logic into a single place.
    The BINPRM_FLAGS_ENFORCE_NONDUMP is set during would_dump() calls
    before setup_new_exec(), so its test is safe to move as well.

    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
     
  • Like dumpability, clearing pdeath_signal happens both in setup_new_exec()
    and later in commit_creds(). The test in setup_new_exec() is different
    from all other privilege comparisons, though: it is checking the new cred
    (bprm) uid vs the old cred (current) euid. This appears to be a bug,
    introduced by commit a6f76f23d297 ("CRED: Make execve() take advantage of
    copy-on-write credentials"):

    -       if (bprm->e_uid != current_euid() ||
    -           bprm->e_gid != current_egid()) {
    -               set_dumpable(current->mm, suid_dumpable);
    +       if (bprm->cred->uid != current_euid() ||
    +           bprm->cred->gid != current_egid()) {

    It was bprm euid vs current euid (and egids), but the effective got
    dropped. Nothing in the exec flow changes bprm->cred->uid (nor gid).
    The call traces are:

        prepare_bprm_creds()
            prepare_exec_creds()
                prepare_creds()
                    memcpy(new_creds, old_creds, ...)
                    security_prepare_creds() (unimplemented by commoncap)
        ...
        prepare_binprm()
            bprm_fill_uid()
                resets euid/egid to current euid/egid
                sets euid/egid on bprm based on set*id file bits
            security_bprm_set_creds()
                cap_bprm_set_creds()
                    handle all caps-based manipulations

    so this test is effectively a test of current_uid() vs current_euid(),
    which is wrong, just like the prior dumpability tests were wrong.

    The commit log says "Clear pdeath_signal and set dumpable on
    certain circumstances that may not be covered by commit_creds()." This
    may be meaning the earlier old euid vs new euid (and egid) test that
    got changed.

    Luckily, as with dumpability, this is all masked by commit_creds()
    which performs old/new euid and egid tests and clears pdeath_signal.

    And again, like dumpability, we should include LSM secureexec logic for
    pdeath_signal clearing. For example, Smack goes out of its way to clear
    pdeath_signal when it finds a secureexec condition.

    Cc: David Howells
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
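
The difference between the test as it ended up and the test it replaced can be shown with a toy model of the credential flow. Everything here is a simplified assumption for illustration: struct cred_sketch and the three functions are hypothetical, only uid/euid are modelled, and the LSM hooks are left out.

```c
#include <assert.h>
#include <stdbool.h>

struct cred_sketch {
    unsigned int uid;
    unsigned int euid;
};

/* Model of the flow above: copy the old creds, then apply a setuid file bit. */
static struct cred_sketch prepare_new_cred(struct cred_sketch old,
                                           bool file_is_setuid,
                                           unsigned int file_owner)
{
    struct cred_sketch new_cred = old;   /* prepare_creds(): plain copy */

    if (file_is_setuid)
        new_cred.euid = file_owner;      /* bprm_fill_uid(): only euid changes */
    return new_cred;
}

/* The test after the cred conversion: new uid vs old euid. */
static bool buggy_priv_changed(struct cred_sketch old, struct cred_sketch new_cred)
{
    return new_cred.uid != old.euid;
}

/* The earlier test: new euid vs old euid. */
static bool intended_priv_changed(struct cred_sketch old, struct cred_sketch new_cred)
{
    return new_cred.euid != old.euid;
}
```

With an ordinary user (uid 1000, euid 1000) running a setuid-root file, the buggy test sees no change while the intended test does. With a process that is already uid 1000, euid 0 running a plain file, the results are reversed, which is the "current_uid() vs current_euid()" behaviour the commit message describes.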
     
  • The examination of "current" to decide dumpability is wrong. This was a
    check of an euid/uid (or egid/gid) mismatch in the existing process,
    not the newly created one. This appears to stretch back into even the
    "history.git" tree. Luckily, dumpability is later set in commit_creds().
    In earlier kernel versions before creds existed, similar checks also
    existed late in the exec flow, covering up the mistake as far back as I
    could find.

    Note that because the commit_creds() check examines differences of euid,
    uid, egid, gid, and capabilities between the old and new creds, it would
    look like the setup_new_exec() dumpability test could be entirely removed.
    However, the secureexec test may cover a different set of tests (specific
    to the LSMs) than what commit_creds() checks for. So, fix this test to
    use secureexec (the removed euid tests are redundant to the commoncap
    secureexec checks now).

    Cc: David Howells
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
     
  • This removes the bprm_secureexec hook since the logic has been folded into
    the bprm_set_creds hook for all LSMs now.

    Cc: Eric W. Biederman
    Signed-off-by: Kees Cook
    Reviewed-by: John Johansen
    Acked-by: James Morris
    Acked-by: Serge Hallyn

    Kees Cook
     
  • The commoncap implementation of the bprm_secureexec hook is the only LSM
    that depends on the final call to its bprm_set_creds hook (since it may
    be called for multiple files, it ignores bprm->called_set_creds). As a
    result, it cannot safely _clear_ bprm->secureexec since other LSMs may
    have set it. Instead, remove the bprm_secureexec hook by introducing a
    new flag to bprm specific to commoncap: cap_elevated. This is similar to
    cap_effective, but that is used for a specific subset of elevated
    privileges, and exists solely to track state from bprm_set_creds to
    bprm_secureexec. As such, it will be removed in the next patch.

    Here, set the new bprm->cap_elevated flag when setuid/setgid has happened
    from bprm_fill_uid() or fscapabilities have been prepared. This temporarily
    moves the bprm_secureexec hook to a static inline. The helper will be
    removed in the next patch; this makes the step easier to review and bisect,
    since this does not introduce any changes to inputs nor outputs to the
    "elevated privileges" calculation.

    The new flag is merged with the bprm->secureexec flag in setup_new_exec()
    since this marks the end of any further prepare_binprm() calls.

    Cc: Andy Lutomirski
    Signed-off-by: Kees Cook
    Reviewed-by: Andy Lutomirski
    Acked-by: James Morris
    Acked-by: Serge Hallyn

    Kees Cook
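
The interplay between the two flags can be sketched with a toy model. The struct and function names below are hypothetical and the logic is deliberately reduced: one boolean stands in for "some first-pass-only LSM saw a secure condition" and another for "commoncap saw elevation for this file".

```c
#include <assert.h>
#include <stdbool.h>

struct bprm_sketch {
    bool called_set_creds; /* set after the first bprm_set_creds pass */
    bool secureexec;       /* set by LSMs that act only on the first pass */
    bool cap_elevated;     /* recomputed by commoncap on every pass */
};

/* One prepare_binprm() pass over a file (the script, then its interpreter). */
static void prepare_pass(struct bprm_sketch *b, bool lsm_secure, bool file_elevates)
{
    if (!b->called_set_creds && lsm_secure)
        b->secureexec = true;        /* first-pass-only LSM decision */
    b->cap_elevated = file_elevates; /* last pass wins for commoncap */
    b->called_set_creds = true;
}

/* End of all passes: fold the commoncap result into the shared flag. */
static void finish_exec(struct bprm_sketch *b)
{
    b->secureexec |= b->cap_elevated;
}
```

Keeping cap_elevated separate lets commoncap overwrite its own earlier answer on each pass without ever clearing a secureexec decision made by another LSM.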
     
  • The bprm_secureexec hook can be moved earlier. Right now, it is called
    during create_elf_tables(), via load_binary(), via search_binary_handler(),
    via exec_binprm(). Nearly all (see exception below) state used by
    bprm_secureexec is created during the bprm_set_creds hook, called from
    prepare_binprm().

    For all LSMs (except commoncaps described next), only the first execution
    of bprm_set_creds takes any effect (they all check bprm->called_set_creds
    which prepare_binprm() sets after the first call to the bprm_set_creds
    hook). However, all these LSMs also only do anything with bprm_secureexec
    when they detected a secure state during their first run of bprm_set_creds.
    Therefore, it is functionally identical to move the detection into
    bprm_set_creds, since the results from secureexec here only need to be
    based on the first call to the LSM's bprm_set_creds hook.

    The single exception is that the commoncaps secureexec hook also examines
    euid/uid and egid/gid differences which are controlled by bprm_fill_uid(),
    via prepare_binprm(), which can be called multiple times (e.g.
    binfmt_script, binfmt_misc), and may clear the euid/egid for the final
    load (i.e. the script interpreter). However, while commoncaps specifically
    ignores bprm->cred_prepared, and runs its bprm_set_creds hook each time
    prepare_binprm() may get called, it needs to base the secureexec decision
    on the final call to bprm_set_creds. As a result, it will need special
    handling.

    To begin this refactoring, this adds the secureexec flag to the bprm
    struct, and calls the secureexec hook during setup_new_exec(). This is
    safe since all the cred work is finished (and past the point of no return).
    This explicit call will be removed in later patches once the hook has been
    removed.

    Cc: David Howells
    Signed-off-by: Kees Cook
    Reviewed-by: John Johansen
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris

    Kees Cook
     
  • In commit 221af7f87b97 ("Split 'flush_old_exec' into two functions"),
    the comment about the point of no return should have stayed in
    flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
    changes in commits c89681ed7d0e ("remove steal_locks()"), and
    fd8328be874f ("sanitize handling of shared descriptor tables in failing
    execve()") made it look like it meant the current->sas_ss_sp line instead.

    The comment was referring to the fact that once bprm->mm is NULL, all
    failures from a binfmt load_binary hook (e.g. load_elf_binary), will
    get SEGV raised against current. Move this comment and expand the
    explanation a bit, putting it above the assignment this time, and add
    details about the true nature of "point of no return" being the call
    to flush_old_exec() itself.

    This also removes an erroneous comment about when credentials are being
    installed. That has its own dedicated function, install_exec_creds(),
    which carries a similar (and correct) comment, so remove the bogus comment
    where installation is not actually happening.

    Cc: David Howells
    Cc: Eric W. Biederman
    Signed-off-by: Kees Cook
    Acked-by: "Eric W. Biederman"
    Acked-by: Serge Hallyn

    Kees Cook
     
  • The cred_prepared bprm flag has a misleading name. It has nothing to do
    with the bprm_prepare_cred hook, and actually tracks if bprm_set_creds has
    been called. Rename this flag and improve its comment.

    Cc: David Howells
    Cc: Stephen Smalley
    Cc: Casey Schaufler
    Signed-off-by: Kees Cook
    Acked-by: John Johansen
    Acked-by: James Morris
    Acked-by: Paul Moore
    Acked-by: Serge Hallyn

    Kees Cook
     

08 Jul, 2017

1 commit


24 Jun, 2017

1 commit

  • When limiting the argv/envp strings during exec to 1/4 of the stack limit,
    the storage of the pointers to the strings was not included. This means
    that an exec with huge numbers of tiny strings could eat 1/4 of the stack
    limit in strings and then additional space would be later used by the
    pointers to the strings.

    For example, on 32-bit with an 8MB stack rlimit, an exec with 1677721
    single-byte strings would consume less than 2MB of stack, the max (8MB /
    4) amount allowed, but the pointers to the strings would consume the
    remaining additional stack space (1677721 * 4 == 6710884).

    The result (1677721 + 6710884 == 8388605) would exhaust stack space
    entirely. Controlling this stack exhaustion could result in
    pathological behavior in setuid binaries (CVE-2017-1000365).

    [akpm@linux-foundation.org: additional commenting from Kees]
    Fixes: b6a2fea39318 ("mm: variable length argument support")
    Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Acked-by: Michal Hocko
    Cc: Alexander Viro
    Cc: Qualys Security Advisory
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
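
The arithmetic in this commit message can be checked with a few lines of C. The helpers stack_bytes() and fits_quarter_limit() are hypothetical names for this sketch; the second one shows one way to count the pointers against the same quarter-of-the-limit budget, which is an assumption about the shape of the fix rather than a copy of it.

```c
#include <assert.h>

/* Hypothetical helper: bytes used by n strings plus one pointer per string. */
static unsigned long stack_bytes(unsigned long n_strings,
                                 unsigned long bytes_per_string,
                                 unsigned long ptr_size)
{
    return n_strings * bytes_per_string + n_strings * ptr_size;
}

/* Hypothetical check: do strings and pointers together fit in limit / 4? */
static int fits_quarter_limit(unsigned long n_strings,
                              unsigned long bytes_per_string,
                              unsigned long ptr_size,
                              unsigned long stack_limit)
{
    return stack_bytes(n_strings, bytes_per_string, ptr_size) <= stack_limit / 4;
}
```

For the example above, the strings alone (1677721 bytes) fit under 8MB / 4, but once the 4-byte pointers are counted the total is 8388605 bytes and the combined check rejects the exec.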
     

20 Mar, 2017

1 commit

  • Intel supports faulting on the CPUID instruction beginning with Ivy Bridge.
    When enabled, the processor will fault on attempts to execute the CPUID
    instruction with CPL>0. Exposing this feature to userspace will allow a
    ptracer to trap and emulate the CPUID instruction.

    When supported, this feature is controlled by toggling bit 0 of
    MSR_MISC_FEATURES_ENABLES. It is documented in detail in Section 2.3.2 of
    https://bugzilla.kernel.org/attachment.cgi?id=243991

    Implement a new pair of arch_prctls, available on both x86-32 and x86-64.

    ARCH_GET_CPUID: Returns the current CPUID state, either 0 if CPUID faulting
    is enabled (and thus the CPUID instruction is not available) or 1 if
    CPUID faulting is not enabled.

    ARCH_SET_CPUID: Set the CPUID state to the second argument. If
    cpuid_enabled is 0 CPUID faulting will be activated, otherwise it will
    be deactivated. Returns ENODEV if CPUID faulting is not supported on
    this system.

    The state of the CPUID faulting flag is propagated across forks, but reset
    upon exec.

    Signed-off-by: Kyle Huey
    Cc: Grzegorz Andrejczuk
    Cc: kvm@vger.kernel.org
    Cc: Radim Krčmář
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: linux-kselftest@vger.kernel.org
    Cc: Nadav Amit
    Cc: Robert O'Callahan
    Cc: Richard Weinberger
    Cc: "Rafael J. Wysocki"
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Len Brown
    Cc: Shuah Khan
    Cc: user-mode-linux-devel@lists.sourceforge.net
    Cc: Jeff Dike
    Cc: Alexander Viro
    Cc: user-mode-linux-user@lists.sourceforge.net
    Cc: David Matlack
    Cc: Boris Ostrovsky
    Cc: Dmitry Safonov
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Paolo Bonzini
    Link: http://lkml.kernel.org/r/20170320081628.18952-9-khuey@kylehuey.com
    Signed-off-by: Thomas Gleixner

    Kyle Huey
     

02 Mar, 2017

6 commits

  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • …sched/numa_balancing.h>

    We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • threadgroup_change_begin()/end() is a pointless wrapper around
    cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
    in the !CONFIG_CGROUPS=y case.

    Remove the wrappery, move the might_sleep() (the down_read()
    already has a might_sleep() check).

    This debloats a bit and simplifies this API.

    Update all call sites.

    No change in functionality.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 Feb, 2017

1 commit

  • Right now bprm_fill_uid() uses inode fetched from file_inode(bprm->file).
    This in turn returns inode of lower filesystem (in a stacked filesystem
    setup).

    I was playing with modified patches of shiftfs posted by James Bottomley
    and realized that through shiftfs the setuid bit does not take effect.
    The reason is that we fetch uid/gid from the inode of the lower fs (and
    not from the shiftfs inode), which results in the following checks failing.

        /* We ignore suid/sgid if there are no mappings for them in the ns */
        if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
            !kgid_has_mapping(bprm->cred->user_ns, gid))
                return;

    The uid/gid fetched from the lower fs inode might not be mapped inside
    the user namespace of the container. So we need to look at the uid/gid
    fetched from the upper filesystem (shiftfs in this particular case);
    these should be mapped, and the setuid bit can then take effect.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Eric W. Biederman

    Vivek Goyal
     

24 Jan, 2017

1 commit


25 Dec, 2016

1 commit


23 Dec, 2016

1 commit

  • If you have a process that has set itself to be non-dumpable, and it
    then undergoes exec(2), any CLOEXEC file descriptors it has open are
    "exposed" during a race window between the dumpable flags of the process
    being reset for exec(2) and CLOEXEC being applied to the file
    descriptors. This can be exploited by a process by attempting to access
    /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE.

    The race in question is after set_dumpable has been called (for get_link,
    though the trace is basically the same for readlink):

        [vfs]
        -> proc_pid_link_inode_operations.get_link
           -> proc_pid_get_link
              -> proc_fd_access_allowed
                 -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);

    This returns 0 during the race window, and CLOEXEC file descriptors are
    still open at that point because do_close_on_exec has not been called
    yet. As a result, the ordering of these calls should be reversed to
    avoid this race window.

    This is of particular concern to container runtimes, where joining a
    PID namespace with file descriptors referring to the host filesystem
    can result in security issues (since PR_SET_DUMPABLE doesn't protect
    against access of CLOEXEC file descriptors -- file descriptors which may
    reference filesystem objects the container shouldn't have access to).

    Cc: dev@opencontainers.org
    Cc: # v3.2+
    Reported-by: Michael Crosby
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
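
For context, the userspace setup that this race affects (a non-dumpable process holding close-on-exec descriptors) can be sketched as below. The function open_protected() is a hypothetical helper for illustration; it only shows the two protections being established and does not reproduce the race itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/prctl.h>
#include <unistd.h>

/*
 * Open path with close-on-exec set and mark the process non-dumpable.
 * Returns the file descriptor, or -1 on failure.
 */
static int open_protected(const char *path)
{
    int fd = open(path, O_RDONLY | O_CLOEXEC);

    if (fd < 0)
        return -1;
    if (prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A process set up this way expects that neither /proc access nor a later exec exposes the descriptor, which is why the ordering inside exec matters.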
     

15 Dec, 2016

3 commits

  • Merge more updates from Andrew Morton:

    - a few misc things

    - kexec updates

    - DMA-mapping updates to better support networking DMA operations

    - IPC updates

    - various MM changes to improve DAX fault handling

    - lots of radix-tree changes, mainly to the test suite. All leading up
    to reimplementing the IDA/IDR code to be a wrapper layer over the
    radix-tree. However the final trigger-pulling patch is held off for
    4.11.

    * emailed patches from Andrew Morton: (114 commits)
    radix tree test suite: delete unused rcupdate.c
    radix tree test suite: add new tag check
    radix-tree: ensure counts are initialised
    radix tree test suite: cache recently freed objects
    radix tree test suite: add some more functionality
    idr: reduce the number of bits per level from 8 to 6
    rxrpc: abstract away knowledge of IDR internals
    tpm: use idr_find(), not idr_find_slowpath()
    idr: add ida_is_empty
    radix tree test suite: check multiorder iteration
    radix-tree: fix replacement for multiorder entries
    radix-tree: add radix_tree_split_preload()
    radix-tree: add radix_tree_split
    radix-tree: add radix_tree_join
    radix-tree: delete radix_tree_range_tag_if_tagged()
    radix-tree: delete radix_tree_locate_item()
    radix-tree: improve multiorder iterators
    btrfs: fix race in btrfs_free_dummy_fs_info()
    radix-tree: improve dump output
    radix-tree: make radix_tree_find_next_bit more useful
    ...

    Linus Torvalds
     
  • Patch series "mm: unexport __get_user_pages_unlocked()".

    This patch series continues the cleanup of get_user_pages*() functions
    taking advantage of the fact we can now pass gup_flags as we please.

    It firstly adds an additional 'locked' parameter to
    get_user_pages_remote() to allow for its callers to utilise
    VM_FAULT_RETRY functionality. This is necessary as the invocation of
    __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
    this and no other existing higher level function would allow it to do
    so.

    Secondly existing callers of __get_user_pages_unlocked() are replaced
    with the appropriate higher-level replacement -
    get_user_pages_unlocked() if the current task and memory descriptor are
    referenced, or get_user_pages_remote() if other task/memory descriptors
    are referenced (having acquired mmap_sem).

    This patch (of 2):

    Add an int *locked parameter to get_user_pages_remote() to allow
    VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().

    Taking into account the previous adjustments to get_user_pages*()
    functions allowing for the passing of gup_flags, we are now in a
    position where __get_user_pages_unlocked() need only be exported for its
    ability to allow VM_FAULT_RETRY behaviour. This adjustment allows us to
    subsequently unexport __get_user_pages_unlocked() as well as allowing
    for future flexibility in the use of get_user_pages_remote().

    [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
    Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
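
The caller-side contract of the new 'locked' parameter described above can be sketched in plain userspace C. This is only a toy model under stated assumptions: toy_sem, toy_gup_remote() and toy_pin_pages() are hypothetical names that stand in for mmap_sem, get_user_pages_remote() and a caller such as process_vm_rw_single_vec(). None of it is kernel code. The point it illustrates is that when the callee had to drop the lock, it clears *locked, and the caller must then skip its own unlock.

```c
#include <stdbool.h>

/* Toy stand-in for mmap_sem: it only records whether the lock is held. */
struct toy_sem {
    int held;
};

void toy_down_read(struct toy_sem *s) { s->held = 1; }
void toy_up_read(struct toy_sem *s)   { s->held = 0; }

/*
 * Hypothetical mock of a gup-style call that takes an 'int *locked'.
 * 'must_sleep' models a fault that cannot be served with the lock held
 * (the VM_FAULT_RETRY case). In that case the lock is released and
 * *locked is cleared so the caller knows not to release it again.
 */
long toy_gup_remote(struct toy_sem *s, long nr_pages, bool must_sleep,
                    int *locked)
{
    if (must_sleep && locked) {
        toy_up_read(s);
        *locked = 0;
    }
    return nr_pages; /* pretend every requested page was pinned */
}

/* Caller pattern: take the lock, call, unlock only if still locked. */
long toy_pin_pages(struct toy_sem *s, long nr_pages, bool must_sleep)
{
    int locked = 1;
    long pinned;

    toy_down_read(s);
    pinned = toy_gup_remote(s, nr_pages, must_sleep, &locked);
    if (locked)
        toy_up_read(s);
    return pinned;
}
```

Whether or not the mock drops the lock, toy_pin_pages() returns with the lock released exactly once, which is the property the 'locked' out-parameter exists to preserve.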
     
  • Pull namespace updates from Eric Biederman:
    "After a lot of discussion and work we have finally reached a basic
    understanding of what is necessary to make unprivileged mounts safe in
    the presence of EVM and IMA xattrs which the last commit in this
    series reflects. While technically it is a revert the comments it adds
    are important for people not getting confused in the future. Clearing
    up that confusion allows us to seriously work on unprivileged mounts
    of fuse in the next development cycle.

    The rest of the fixes in this set are in the intersection of user
    namespaces, ptrace, and exec. I started with the first fix which
    started a feedback cycle of finding additional issues during review
    and fixing them, culminating in a fix for a bug that has been present
    since at least Linux v1.0.

    Potentially these fixes were candidates for being merged during the rc
    cycle, and are certainly backport candidates but enough little things
    turned up during review and testing that I decided they should be
    handled as part of the normal development process just to be certain
    there were not any great surprises when it came time to backport some
    of these fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Revert "evm: Translate user/group ids relative to s_user_ns when computing HMAC"
    exec: Ensure mm->user_ns contains the execed files
    ptrace: Don't allow accessing an undumpable mm
    ptrace: Capture the ptracer's creds not PT_PTRACE_CAP
    mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

    Linus Torvalds
     

23 Nov, 2016

2 commits

  • When the user namespace support was merged the need to prevent
    ptrace from revealing the contents of an unreadable executable
    was overlooked.

    Correct this oversight by ensuring that the executed file
    or files are in mm->user_ns, by adjusting mm->user_ns.

    Use the new function privileged_wrt_inode_uidgid to see if
    the executable is a member of the user namespace, and as such
    if having CAP_SYS_PTRACE in the user namespace should allow
    tracing the executable. If not, update mm->user_ns to
    the parent user namespace until an appropriate parent is found.

    Cc: stable@vger.kernel.org
    Reported-by: Jann Horn
    Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
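
The "walk up to the parent until an appropriate namespace is found" step in the commit above can be modelled in userspace C. This is a minimal sketch, not the kernel implementation: ns_model, id_range, ns_maps_file() and pick_mm_ns() are hypothetical names, and the assumption that "member of the user namespace" means the file's uid and gid both have a mapping there is a simplification for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each namespace maps one contiguous id range. */
struct id_range {
    unsigned int first;
    unsigned int count;
};

struct ns_model {
    const struct ns_model *parent; /* NULL marks the initial namespace */
    struct id_range uids;
    struct id_range gids;
};

/* True if 'id' falls inside the mapped range. */
bool range_has(const struct id_range *r, unsigned int id)
{
    return id >= r->first && id - r->first < r->count;
}

/* Assumed membership test: both the file's uid and gid are mapped. */
bool ns_maps_file(const struct ns_model *ns, unsigned int uid,
                  unsigned int gid)
{
    return range_has(&ns->uids, uid) && range_has(&ns->gids, gid);
}

/*
 * Start at the namespace of the exec'ing task and climb towards the
 * initial namespace until one is found that contains the file's ids.
 * The initial namespace ends the walk unconditionally.
 */
const struct ns_model *pick_mm_ns(const struct ns_model *ns,
                                  unsigned int uid, unsigned int gid)
{
    while (ns->parent != NULL && !ns_maps_file(ns, uid, gid))
        ns = ns->parent;
    return ns;
}
```

In this model a file owned by ids outside a nested namespace pushes the result up to an ancestor, so holding CAP_SYS_PTRACE only in the nested namespace would no longer be enough to trace the resulting process.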
     
  • When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
    overlooked. This can result in incorrect behavior when an application
    like strace traces an exec of a setuid executable.

    Further PT_PTRACE_CAP does not have enough information for making good
    security decisions as it does not report which user namespace the
    capability is in. This has already allowed one mistake through
    insufficient granularity.

    I found this issue when I was testing another corner case of exec and
    discovered that I could not get strace to set PT_PTRACE_CAP even when
    running strace as root with a full set of caps.

    This change fixes the above issue with strace, allowing a setuid
    executable to be straced as root without disabling setuid. More
    fundamentally, this change allows what is allowable at all times, by
    using the correct information in its decision.

    Cc: stable@vger.kernel.org
    Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

16 Nov, 2016

1 commit

  • Some embedded systems have no use for POSIX timers. This removes about
    25KB from the kernel binary size when configured out.

    Corresponding syscalls are routed to a stub logging the attempt to
    use those syscalls which should be enough of a clue if they were
    disabled without proper consideration. They are: timer_create,
    timer_gettime, timer_getoverrun, timer_settime, timer_delete,
    clock_adjtime, setitimer, getitimer, alarm.

    The clock_settime, clock_gettime, clock_getres and clock_nanosleep
    syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
    CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
    majority of use cases with very little code.

    Signed-off-by: Nicolas Pitre
    Acked-by: Richard Cochran
    Acked-by: Thomas Gleixner
    Acked-by: John Stultz
    Reviewed-by: Josh Triplett
    Cc: Paul Bolle
    Cc: linux-kbuild@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: Michal Marek
    Cc: Edward Cree
    Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
    Signed-off-by: Thomas Gleixner

    Nicolas Pitre
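
From userspace, the reduced clock support described above can be probed with the standard clock_gettime() call. The sketch below is illustrative only: probe_clock() is a hypothetical helper, and the exact errno returned for unsupported clock IDs on a kernel built without POSIX timers is an assumption to verify on the target system.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <time.h>

/*
 * Ask the kernel for the current value of clock 'id'.
 * Returns 0 and fills *ts on success, otherwise returns the errno
 * value reported by clock_gettime().
 */
int probe_clock(clockid_t id, struct timespec *ts)
{
    if (clock_gettime(id, ts) == 0)
        return 0;
    return errno;
}
```

Calling probe_clock() with CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_BOOTTIME should succeed on either configuration, since those are the clocks the commit says the simple wrappers cover. On a kernel with POSIX timers configured out, other IDs such as CLOCK_PROCESS_CPUTIME_ID would be expected to fail.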