06 Dec, 2018

2 commits

  • commit 46f7ecb1e7359f183f5bbd1e08b90e10e52164f9 upstream

    The IBPB control code in x86 removed the usage. Remove the functionality
    which was introduced for this.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185005.559149393@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit dbfe2953f63c640463c630746cd5d9de8b2f63ae upstream

    Currently, IBPB is only issued in cases when switching into a non-dumpable
    process, the rationale being to protect such 'important and security
    sensitive' processes (such as GPG) from data leaking into a different
    userspace process via spectre v2.

    This is however completely insufficient to provide proper userspace-to-userspace
    spectrev2 protection, as any process can poison branch buffers before being
    scheduled out, and the newly scheduled process immediately becomes a spectrev2
    victim.

    In order to minimize the performance impact (for usecases that do require
    spectrev2 protection), issue the barrier only in cases when switching between
    processes where the victim can't be ptraced by the potential attacker (if
    the attacker can ptrace the victim, it doesn't have to bother with branch
    buffers at all).

    [ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
    PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
    fine-grained ]
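
    The switch-time decision described above can be modelled roughly as
    follows. This is a toy userspace model, not kernel code: struct toy_task,
    toy_may_ptrace() and toy_needs_ibpb() are hypothetical names, and the
    permission rule is a drastic simplification of the real ptrace access check.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a task; NOT the kernel's task_struct. */
struct toy_task {
    unsigned int uid;        /* owner of the task */
    bool dumpable;           /* false for "security sensitive" tasks */
    bool has_cap_sys_ptrace; /* tracer-side override */
};

/* Assumed, simplified rule: a tracer may ptrace a target if it holds the
 * capability, or if both share a uid and the target is dumpable. */
static bool toy_may_ptrace(const struct toy_task *tracer,
                           const struct toy_task *target)
{
    if (tracer->has_cap_sys_ptrace)
        return true;
    return tracer->uid == target->uid && target->dumpable;
}

/* On a switch from prev to next, a barrier is only worth its cost when
 * prev could NOT simply ptrace next; otherwise prev has an easier way in
 * than poisoning branch buffers. */
static bool toy_needs_ibpb(const struct toy_task *prev,
                           const struct toy_task *next)
{
    return !toy_may_ptrace(prev, next);
}
```

    In this model a switch between two dumpable tasks of the same user skips
    the barrier, while a switch to another user's task or to a non-dumpable
    task issues it.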

    Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
    Originally-by: Tim Chen
    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: "SchauflerCasey"
    Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     

07 Feb, 2018

1 commit

  • There are several functions that do find_task_by_vpid() followed by
    get_task_struct(). We can use a helper function instead.

    Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

01 Feb, 2018

1 commit


17 Jan, 2018

1 commit


16 Jan, 2018

1 commit


29 Nov, 2017

1 commit

  • With the new SECCOMP_FILTER_FLAG_LOG, we need to be able to extract these
    flags for checkpoint restore, since they describe the state of a filter.

    So, let's add PTRACE_SECCOMP_GET_METADATA, similar to ..._GET_FILTER, which
    returns the metadata of the nth filter (right now, just the flags).
    Hopefully this will be future proof, and new per-filter metadata can be
    added to this struct.
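
    From the tracer's side, a call might look like the sketch below. It
    assumes the request number 0x420d and a two-field struct (filter index in,
    flags out) as found in the kernel UAPI headers, with the struct size passed
    in the addr argument; sketch_seccomp_metadata and sketch_get_filter_flags()
    are hypothetical local names, and the tracee must already be attached.

```c
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Fallback for older libc headers; value assumed from the kernel UAPI. */
#ifndef PTRACE_SECCOMP_GET_METADATA
#define PTRACE_SECCOMP_GET_METADATA 0x420d
#endif

/* Local mirror of the metadata struct (assumed layout: two 64-bit fields). */
struct sketch_seccomp_metadata {
    uint64_t filter_off; /* input: which filter to query */
    uint64_t flags;      /* output: that filter's flags */
};

/* Fetch the flags of the index-th filter of an attached tracee.
 * Returns 0 on success, -1 on failure (errno is set by ptrace). */
static int sketch_get_filter_flags(pid_t tracee, uint64_t index,
                                   uint64_t *flags)
{
    struct sketch_seccomp_metadata md = { .filter_off = index, .flags = 0 };
    long ret = ptrace(PTRACE_SECCOMP_GET_METADATA, tracee,
                      (void *)(uintptr_t)sizeof(md), &md);

    if (ret < 0)
        return -1;
    *flags = md.flags;
    return 0;
}
```

    A checkpoint tool would call this once per filter index and record the
    flags next to the dumped filter program.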

    Signed-off-by: Tycho Andersen
    CC: Kees Cook
    CC: Andy Lutomirski
    CC: Oleg Nesterov
    Signed-off-by: Kees Cook

    Tycho Andersen
     

25 Jul, 2017

1 commit

  • struct siginfo is a union and the kernel since 2.4 has been hiding a union
    tag in the high 16 bits of si_code using the values:
    __SI_KILL
    __SI_TIMER
    __SI_POLL
    __SI_FAULT
    __SI_CHLD
    __SI_RT
    __SI_MESGQ
    __SI_SYS

    While this looks plausible on the surface, in practice this situation has
    not worked well.

    - Injected positive signals are not copied to user space properly
    unless they have these magic high bits set.

    - Injected positive signals are not reported properly by signalfd
    unless they have these magic high bits set.

    - These kernel internal values leaked to userspace via ptrace_peek_siginfo

    - It was possible to inject these kernel internal values and cause the
    kernel to misbehave.

    - Kernel developers got confused and expected these kernel internal values
    in userspace in kernel self tests.

    - Kernel developers got confused and set si_code to __SI_FAULT which
    is SI_USER in userspace which causes userspace to think an ordinary user
    sent the signal and that it was not kernel generated.

    - The values make it impossible to reorganize the code to transform
    siginfo_copy_to_user into a plain copy_to_user. As si_code must
    be massaged before being passed to userspace.

    So remove these kernel internal si codes and make the kernel code simpler
    and more maintainable.

    To replace these kernel internal magic si_codes introduce the helper
    function siginfo_layout, that takes a signal number and an si_code and
    computes which union member of siginfo is being used. Have
    siginfo_layout return an enumeration so that gcc will have enough
    information to warn if a switch statement does not handle all of union
    members.
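
    The idea can be sketched in userspace as below. This is a reduced
    illustration, not the kernel's function: the sketch_ names are
    hypothetical, only a few signals and layouts are covered, and the
    architecture-specific special cases are ignored.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Which union member of siginfo is in use (subset, for illustration). */
enum sketch_layout {
    SKETCH_SIL_KILL,
    SKETCH_SIL_TIMER,
    SKETCH_SIL_FAULT,
    SKETCH_SIL_CHLD,
    SKETCH_SIL_RT
};

/* Derive the layout from (signal, si_code) instead of from magic high bits. */
static enum sketch_layout sketch_siginfo_layout(int sig, int si_code)
{
    if (si_code > SI_USER && si_code < 0x80) {
        /* Positive, kernel-generated codes: the signal picks the member. */
        switch (sig) {
        case SIGILL: case SIGFPE: case SIGSEGV: case SIGBUS: case SIGTRAP:
            return SKETCH_SIL_FAULT;
        case SIGCHLD:
            return SKETCH_SIL_CHLD;
        default:
            return SKETCH_SIL_KILL;
        }
    }
    if (si_code == SI_TIMER)
        return SKETCH_SIL_TIMER;
    if (si_code < 0)
        return SKETCH_SIL_RT; /* SI_QUEUE and friends carry a sigval */
    return SKETCH_SIL_KILL;   /* SI_USER and everything else */
}

/* No default case: with -Wswitch the compiler reports a missing enumerator. */
static const char *sketch_layout_name(enum sketch_layout layout)
{
    switch (layout) {
    case SKETCH_SIL_KILL:  return "kill";
    case SKETCH_SIL_TIMER: return "timer";
    case SKETCH_SIL_FAULT: return "fault";
    case SKETCH_SIL_CHLD:  return "chld";
    case SKETCH_SIL_RT:    return "rt";
    }
    return "unknown";
}
```

    Because the second switch has no default, adding an enumerator without
    handling it shows up as a compiler warning rather than as a silent
    fall-through.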

    A couple of architectures have a messed up ABI that defines signal
    specific duplications of SI_USER which causes more special cases in
    siginfo_layout than I would like. The good news is only problem
    architectures pay the cost.

    Update all of the code that used the previous magic __SI_ values to
    use the new SIL_ values and to call siginfo_layout to get those
    values. Except where not all of the cases are handled, remove the
    defaults in the switch statements so that if a new case is missed in
    the future the lack will show up at compile time.

    Modify the code that copies siginfo si_code to userspace to just copy
    the value and not cast si_code to a short first. The high bits are no
    longer used to hold a magic union member.

    Fixup the siginfo header files to stop including the __SI_ values in
    their constants and for the headers that were missing it to properly
    update the number of si_codes for each signal type.

    The fixes to copy_siginfo_from_user32 implementations have the
    interesting property that several of them previously should never have
    worked, as the __SI_ values they depended upon were kernel internal.
    With that dependency gone those implementations should work much
    better.

    The idea of not passing the __SI_ values out to userspace and then
    not reinserting them has been tested with criu and criu worked without
    changes.

    Ref: 2.4.0-test1
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 May, 2017

1 commit

  • When I introduced ptracer_cred I failed to consider the weirdness of
    fork where the task_struct copies the old value by default. This
    winds up leaving ptracer_cred set even when a process forks and
    the child process does not wind up being ptraced.

    Because ptracer_cred is not set on non-ptraced processes whose
    parents were ptraced this has broken the ability of the enlightenment
    window manager to start setuid children.

    Fix this by properly initializing ptracer_cred in ptrace_init_task

    This must be done with a little bit of care to preserve the current value
    of ptracer_cred when ptrace carries through fork. Re-reading the
    ptracer_cred from the ptracing process at this point is inconsistent
    with how PT_PTRACE_CAP has been maintained all of these years.

    Tested-by: Takashi Iwai
    Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

08 Apr, 2017

1 commit

  • In PT_SEIZED + LISTEN mode STOP/CONT signals cause a wakeup against
    __TASK_TRACED. If this races with the ptrace_unfreeze_traced at the end
    of a PTRACE_LISTEN, this can wake the task /after/ the check against
    __TASK_TRACED, but before the reset of state to TASK_TRACED. This
    causes it to instead clobber TASK_WAKING, allowing a subsequent wakeup
    against TRACED while the task is still on the rq wake_list, corrupting
    it.

    Oleg said:
    "The kernel can crash or this can lead to other hard-to-debug problems.
    In short, "task->state = TASK_TRACED" in ptrace_unfreeze_traced()
    assumes that nobody else can wake it up, but PTRACE_LISTEN breaks the
    contract. Obviously it is very wrong to manipulate task->state if this
    task is already running, or WAKING, or it sleeps again"

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 9899d11f ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL")
    Link: http://lkml.kernel.org/r/xm26y3vfhmkp.fsf_-_@bsegall-linux.mtv.corp.google.com
    Signed-off-by: Ben Segall
    Acked-by: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    bsegall@google.com
     

02 Mar, 2017

3 commits

  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

23 Nov, 2016

3 commits

  • It is the reasonable expectation that if an executable file is not
    readable there will be no way for a user without special privileges to
    read the file. This is enforced in ptrace_attach but if ptrace
    is already attached before exec there is no enforcement for read-only
    executables.

    As the only way to read such an mm is through access_process_vm,
    spin a variant called ptrace_access_vm that will fail if the
    target process is not being ptraced by the current process, or
    the current process did not have sufficient privileges when ptracing
    began to read the target process's mm.

    In the ptrace implementations replace access_process_vm by
    ptrace_access_vm. There remain several ptrace sites that still use
    access_process_vm as they are reading the target executable's
    instructions (for kernel consumption) or register stacks. As such it
    does not appear necessary to add a permission check to those calls.

    This bug has always existed in Linux.

    Fixes: v1.0
    Cc: stable@vger.kernel.org
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
    overlooked. This can result in incorrect behavior when an application
    like strace traces an exec of a setuid executable.

    Further, PT_PTRACE_CAP does not have enough information for making good
    security decisions as it does not report which user namespace the
    capability is in. This has already allowed one mistake through
    insufficient granularity.

    I found this issue when I was testing another corner case of exec and
    discovered that I could not get strace to set PT_PTRACE_CAP even when
    running strace as root with a full set of caps.

    This change fixes the above issue with strace, allowing stracing as
    root a setuid executable without disabling setuid. More fundamentally
    this change allows what is allowable at all times, by using the correct
    information in its decision.

    Cc: stable@vger.kernel.org
    Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • During exec dumpable is cleared if the file that is being executed is
    not readable by the user executing the file. A bug in
    ptrace_may_access allows reading the file if the executable happens to
    enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
    unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)).

    This problem is fixed with only necessary userspace breakage by adding
    a user namespace owner to mm_struct, captured at the time of exec, so
    it is clear in which user namespace CAP_SYS_PTRACE must be present
    to be able to safely give read permission to the executable.

    The function ptrace_may_access is modified to verify that the ptracer
    has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
    This ensures that if the task changes its cred into a subordinate
    user namespace it does not become ptraceable.

    The function ptrace_attach is modified to only set PT_PTRACE_CAP when
    CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of
    PT_PTRACE_CAP is to be a flag to note that whatever permission changes
    the task might go through the tracer has sufficient permissions for
    it not to be an issue. task->cred->user_ns is always the same
    as or descendent of mm->user_ns. Which guarantees that having
    CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
    credentials.

    To prevent regressions mm->dumpable and mm->user_ns are not considered
    when a task has no mm. As simply failing ptrace_may_attach causes
    regressions in privileged applications attempting to read things
    such as /proc//stat

    Cc: stable@vger.kernel.org
    Acked-by: Kees Cook
    Tested-by: Cyrill Gorcunov
    Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

19 Oct, 2016

1 commit

  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
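
    The shape of that interface change can be illustrated with a toy
    userspace model. Everything here is hypothetical (struct sketch_region,
    sketch_access() and the SKETCH_FOLL_* values are stand-ins, not the mm
    API); the point is only that forcing past a permission check becomes
    something the caller has to ask for.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag values; stand-ins for the kernel's FOLL_* flags. */
#define SKETCH_FOLL_WRITE 0x01u
#define SKETCH_FOLL_FORCE 0x10u

/* Toy "mapping": a small buffer plus a write permission bit. */
struct sketch_region {
    char data[16];
    bool writable;
};

/* Copy to or from the region. Returns the number of bytes transferred,
 * or 0 if the access is refused. A write to a read-only region only
 * succeeds when the caller passes SKETCH_FOLL_FORCE explicitly. */
static size_t sketch_access(struct sketch_region *r, char *buf, size_t len,
                            unsigned int gup_flags)
{
    bool write = (gup_flags & SKETCH_FOLL_WRITE) != 0;
    bool force = (gup_flags & SKETCH_FOLL_FORCE) != 0;

    if (len > sizeof(r->data))
        len = sizeof(r->data);
    if (write && !r->writable && !force)
        return 0;

    if (write)
        memcpy(r->data, buf, len);
    else
        memcpy(buf, r->data, len);
    return len;
}
```

    With a boolean 'write' argument the forced path would be invisible at the
    call site; with flags, every caller that wants it has to spell it out.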

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

12 Oct, 2016

1 commit

  • On __ptrace_detach(), called from do_exit()->exit_notify()->
    forget_original_parent()->exit_ptrace(), the TIF_SYSCALL_TRACE in
    thread->flags of the tracee is not cleared up. This results in the
    tracehook_report_syscall_* being called (though there's no longer a tracer
    listening to that) upon its further syscalls.

    Example scenario - attach "strace" to a running process and kill it (the
    strace) with SIGKILL. You'll see that the syscall trace hooks are still
    being called.

    The clearing of this flag should be moved from ptrace_detach() to
    __ptrace_detach().

    Link: http://lkml.kernel.org/r/1472759493-20554-1-git-send-email-alnovak@suse.cz
    Signed-off-by: Ales Novak
    Acked-by: Oleg Nesterov
    Cc: Jiri Kosina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ales Novak
     

04 Aug, 2016

1 commit

  • The use of config_enabled() against config options is ambiguous. In
    practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
    author might have used it for the meaning of IS_ENABLED(). Using
    IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention
    clearer.

    This commit replaces config_enabled() with IS_ENABLED() where possible.
    This commit is only touching bool config options.

    I noticed two cases where config_enabled() is used against a tristate
    option:

    - config_enabled(CONFIG_HWMON)
    [ drivers/net/wireless/ath/ath10k/thermal.c ]

    - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
    [ drivers/gpu/drm/gma500/opregion.c ]

    I did not touch them because they should be converted to IS_BUILTIN()
    in order to keep the logic, but I was not sure it was the authors'
    intention.
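
    The difference between the three tests can be reproduced outside the
    kernel with a preprocessor sketch modelled on the placeholder trick used
    by the kernel's kconfig macros. The SK_ names and the three CONFIG_
    symbols below are made up for illustration; SK_CONFIG_ENABLED() stands in
    for the old config_enabled(), which per the text above behaves like
    IS_BUILTIN().

```c
/* Pretend Kconfig output: FOO=y (built in), BAR=m (module), BAZ unset. */
#define CONFIG_FOO 1
#define CONFIG_BAR_MODULE 1

/* If the pasted name is SK_ARG_PLACEHOLDER_1 it expands to "0," and shifts
 * a 1 into second position; any other name leaves the fallback there. */
#define SK_ARG_PLACEHOLDER_1 0,
#define SK_TAKE_SECOND(ignored, val, ...) val

#define SK_IS_DEFINED(x)      SK_IS_DEFINED_(x)
#define SK_IS_DEFINED_(val)   SK_IS_DEFINED__(SK_ARG_PLACEHOLDER_##val)
#define SK_IS_DEFINED__(junk) SK_TAKE_SECOND(junk 1, 0)

#define SK_OR(x, y)           SK_OR_(x, y)
#define SK_OR_(x, y)          SK_OR__(SK_ARG_PLACEHOLDER_##x, y)
#define SK_OR__(junk, y)      SK_TAKE_SECOND(junk 1, y)

/* 1 only for =y */
#define SK_IS_BUILTIN(option) SK_IS_DEFINED(option)
/* 1 only for =m */
#define SK_IS_MODULE(option)  SK_IS_DEFINED(option##_MODULE)
/* 1 for =y or =m */
#define SK_IS_ENABLED(option) SK_OR(SK_IS_BUILTIN(option), SK_IS_MODULE(option))

/* Stand-in for the ambiguous old helper: same result as SK_IS_BUILTIN(). */
#define SK_CONFIG_ENABLED(option) SK_IS_BUILTIN(option)
```

    For the modular option the old helper and SK_IS_ENABLED() disagree, which
    is exactly why tristate users cannot be converted mechanically.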

    Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Acked-by: Kees Cook
    Cc: Stas Sergeev
    Cc: Matt Redfearn
    Cc: Joshua Kinard
    Cc: Jiri Slaby
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Markos Chandras
    Cc: "Dmitry V. Levin"
    Cc: yu-cheng yu
    Cc: James Hogan
    Cc: Brian Gerst
    Cc: Johannes Berg
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Will Drewry
    Cc: Nikolay Martynov
    Cc: Huacai Chen
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Daniel Borkmann
    Cc: Leonid Yegoshin
    Cc: Rafal Milecki
    Cc: James Cowgill
    Cc: Greg Kroah-Hartman
    Cc: Ralf Baechle
    Cc: Alex Smith
    Cc: Adam Buchbinder
    Cc: Qais Yousef
    Cc: Jiang Liu
    Cc: Mikko Rapeli
    Cc: Paul Gortmaker
    Cc: Denys Vlasenko
    Cc: Brian Norris
    Cc: Hidehiro Kawai
    Cc: "Luis R. Rodriguez"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Roland McGrath
    Cc: Paul Burton
    Cc: Kalle Valo
    Cc: Viresh Kumar
    Cc: Tony Wu
    Cc: Huaitong Han
    Cc: Sumit Semwal
    Cc: Alexei Starovoitov
    Cc: Juergen Gross
    Cc: Jason Cooper
    Cc: "David S. Miller"
    Cc: Oleg Nesterov
    Cc: Andrea Gelmini
    Cc: David Woodhouse
    Cc: Marc Zyngier
    Cc: Rabin Vincent
    Cc: "Maciej W. Rozycki"
    Cc: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

23 Mar, 2016

2 commits

  • This test-case (a simplified version of the one generated by syzkaller)

    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    void test(void)
    {
        for (;;) {
            if (fork()) {
                wait(NULL);
                continue;
            }

            ptrace(PTRACE_SEIZE, getppid(), 0, 0);
            ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
            _exit(0);
        }
    }

    int main(void)
    {
        int np;

        for (np = 0; np < 8; ++np)
            if (!fork())
                test();

        while (wait(NULL) > 0)
            ;
        return 0;
    }

    triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The
    problem is that __ptrace_unlink() clears task->jobctl under siglock but
    task->ptrace is cleared without this lock held; this fools the "else"
    branch which assumes that !PT_SEIZED means PT_PTRACED.

    Note also that most of other PTRACE_SEIZE checks can race with detach
    from the exiting tracer too. Say, the callers of ptrace_trap_notify()
    assume that SEIZED can't go away after it was checked.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Tejun Heo
    Cc: syzkaller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo
    translation should check ptrace() ABI, not caller task ABI.

    This is an ABI change on SPARC. Let's hope that no one relied on the
    old buggy ABI.

    Signed-off-by: Andy Lutomirski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

21 Jan, 2016

2 commits

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.
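
    A minimal sketch of such a "exactly one of two flags" requirement is shown
    below. The SKETCH_ constants and sketch_check_creds_flag() are
    hypothetical names chosen for illustration; only the rule itself (one
    credential-selection flag must be set) comes from the text above.

```c
#include <errno.h>

/* Illustrative mode bits; the two CREDS flags select which credentials
 * of the caller are checked. */
#define SKETCH_MODE_READ      0x01u
#define SKETCH_MODE_ATTACH    0x02u
#define SKETCH_MODE_FSCREDS   0x08u
#define SKETCH_MODE_REALCREDS 0x10u

/* Reject any mode that names neither or both credential sets, so a caller
 * that forgot to choose is denied instead of silently getting a default. */
static int sketch_check_creds_flag(unsigned int mode)
{
    /* Each negation yields 0 or 1; equal results mean "none" or "both". */
    if (!(mode & SKETCH_MODE_FSCREDS) == !(mode & SKETCH_MODE_REALCREDS))
        return -EPERM;
    return 0;
}
```

    Failing closed here means code that omits the flag is caught the first
    time it runs rather than quietly checking the wrong credentials.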

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • ptrace_attach() can hang waiting for STOPPED -> TRACED transition if the
    tracee gets frozen in between, change wait_on_bit() to use TASK_KILLABLE.

    This doesn't really solve the problem(s) and we probably need to fix the
    freezer. In particular, note that this means that pm freezer will fail if
    it races attach-to-stopped-task.

    And otoh perhaps we can just remove JOBCTL_TRAPPING_BIT altogether, it is
    not clear if we really need to hide this transition from debugger, WNOHANG
    after PTRACE_ATTACH can fail anyway if it races with SIGCONT.

    Signed-off-by: Oleg Nesterov
    Reported-by: Andrey Ryabinin
    Cc: Roland McGrath
    Acked-by: Tejun Heo
    Cc: Pedro Alves
    Cc: Jan Kratochvil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Oct, 2015

1 commit

  • This patch adds support for dumping a process' (classic BPF) seccomp
    filters via ptrace.

    PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF
    seccomp filters. addr should be an integer which represents the ith seccomp
    filter (0 is the most recently installed filter). data should be a struct
    sock_filter * with enough room for the ith filter, or NULL, in which case
    the filter is not saved. The return value for this command is the number of
    BPF instructions the program represents, or negative in the case of errors.
    Command specific errors are ENOENT, which indicates that there is no ith
    filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith
    filter was not installed as a classic BPF filter.
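
    Read that way, a tracer would typically call the command twice: once with
    a NULL buffer to learn the instruction count, then again with a buffer of
    that size. The helper below is a sketch under that reading;
    sketch_dump_filter() is a hypothetical name, the 0x420c fallback is
    assumed from the kernel UAPI headers, and the tracee must already be
    attached and stopped.

```c
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <linux/filter.h>

/* Fallback for older libc headers; value assumed from the kernel UAPI. */
#ifndef PTRACE_SECCOMP_GET_FILTER
#define PTRACE_SECCOMP_GET_FILTER 0x420c
#endif

/* Dump the index-th classic BPF seccomp filter of an attached tracee.
 * On success returns the instruction count and stores a malloc'ed array
 * in *out (caller frees). Returns -1 on failure, with errno set by ptrace
 * (for example ENOENT or EMEDIUMTYPE as described above). */
static long sketch_dump_filter(pid_t tracee, unsigned long index,
                               struct sock_filter **out)
{
    struct sock_filter *insns;
    long count;

    /* First pass: NULL data only asks for the number of instructions. */
    count = ptrace(PTRACE_SECCOMP_GET_FILTER, tracee, (void *)index, NULL);
    if (count <= 0)
        return -1;

    insns = calloc((size_t)count, sizeof(*insns));
    if (!insns)
        return -1;

    /* Second pass: copy the program into the buffer. */
    if (ptrace(PTRACE_SECCOMP_GET_FILTER, tracee, (void *)index, insns) < 0) {
        free(insns);
        return -1;
    }

    *out = insns;
    return count;
}
```

    Looping over index 0, 1, 2, ... until the call fails with ENOENT would
    walk the filters from the most recently installed one outwards.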

    A caveat with this approach is that there is no way to get explicitly at
    the hierarchy of seccomp filters, and users need to memcmp() filters to
    decide which are inherited. This means that a task which installs two of
    the same filter can potentially confuse users of this interface.

    v2: * make save_orig const
    * check that the orig_prog exists (not necessary right now, but when
    grows eBPF support it will be)
    * s/n/filter_off and make it an unsigned long to match ptrace
    * count "down" the tree instead of "up" when passing a filter offset

    v3: * don't take the current task's lock for inspecting its seccomp mode
    * use a 0x42** constant for the ptrace command value

    v4: * don't copy to userspace while holding spinlocks

    v5: * add another condition to WARN_ON

    v6: * rebase on net-next

    Signed-off-by: Tycho Andersen
    Acked-by: Kees Cook
    CC: Will Drewry
    Reviewed-by: Oleg Nesterov
    CC: Andy Lutomirski
    CC: Pavel Emelyanov
    CC: Serge E. Hallyn
    CC: Alexei Starovoitov
    CC: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Tycho Andersen
     

16 Jul, 2015

1 commit

  • This patch is the first step in enabling checkpoint/restore of processes
    with seccomp enabled.

    One of the things CRIU does while dumping tasks is inject code into them
    via ptrace to collect information that is only available to the process
    itself. However, if we are in a seccomp mode where these processes are
    prohibited from making these syscalls, then what CRIU does kills the task.

    This patch adds a new ptrace option, PTRACE_O_SUSPEND_SECCOMP, that enables
    a task from the init user namespace which has CAP_SYS_ADMIN and no seccomp
    filters to disable (and re-enable) seccomp filters for another task so that
    they can be successfully dumped (and restored). We restrict the set of
    processes that can disable seccomp through ptrace because although today
    ptrace can be used to bypass seccomp, there is some discussion of closing
    this loophole in the future and we would like this patch to not depend on
    that behavior and be future proofed for when it is removed.

    Note that seccomp can be suspended before any filters are actually
    installed; this behavior is useful on criu restore, so that we can suspend
    seccomp, restore the filters, unmap our restore code from the restored
    process' address space, and then resume the task by detaching and have the
    filters resumed as well.
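
    A tracer could request the option when seizing the task, roughly as
    sketched here. The helper names are hypothetical and the option value
    (bit 21) is an assumption taken from the kernel UAPI headers; per the
    description above the call is expected to fail unless the tracer has
    CAP_SYS_ADMIN in the init user namespace and no seccomp filters itself.

```c
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Assumed value of PTRACE_O_SUSPEND_SECCOMP; defined under a local name
 * so the sketch does not depend on the libc header providing it. */
#define SKETCH_PTRACE_O_SUSPEND_SECCOMP (1UL << 21)

/* Seize the tracee with seccomp suspended. Returns 0 or -1 (errno set). */
static int sketch_seize_suspending_seccomp(pid_t tracee)
{
    long ret = ptrace(PTRACE_SEIZE, tracee, NULL,
                      (void *)SKETCH_PTRACE_O_SUSPEND_SECCOMP);

    return ret < 0 ? -1 : 0;
}

/* Detaching drops the option, so the tracee's filters apply again. */
static int sketch_detach_resuming_seccomp(pid_t tracee)
{
    return ptrace(PTRACE_DETACH, tracee, NULL, NULL) < 0 ? -1 : 0;
}
```

    Because the suspension is tied to the ptrace option, there is no separate
    "resume" call: detaching (or the tracer dying) is what re-enables the
    filters.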

    v2 changes:

    * require that the tracer have no seccomp filters installed
    * drop TIF_NOTSC manipulation from the patch
    * change from ptrace command to a ptrace option and use this ptrace option
    as the flag to check. This means that as soon as the tracer
    detaches/dies, seccomp is re-enabled and as a corollary that one cannot
    disable seccomp across PTRACE_ATTACHs.

    v3 changes:

    * get rid of various #ifdefs everywhere
    * report more sensible errors when PTRACE_O_SUSPEND_SECCOMP is incorrectly
    used

    v4 changes:

    * get rid of may_suspend_seccomp() in favor of a capable() check in ptrace
    directly

    v5 changes:

    * check that seccomp is not enabled (or suspended) on the tracer

    Signed-off-by: Tycho Andersen
    CC: Will Drewry
    CC: Roland McGrath
    CC: Pavel Emelyanov
    CC: Serge E. Hallyn
    Acked-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    [kees: access seccomp.mode through seccomp_mode() instead]
    Signed-off-by: Kees Cook

    Tycho Andersen
     

17 Apr, 2015

2 commits

  • ptrace_detach() re-checks ->ptrace under tasklist lock and calls
    release_task() if __ptrace_detach() returns true. This was needed because
    the __TASK_TRACED tracee could be killed/untraced, and it could even pass
    exit_notify() before we take tasklist_lock.

    But this is no longer possible after 9899d11f6544 "ptrace: ensure
    arch_ptrace/ptrace_request can never race with SIGKILL". We can turn
    these checks into WARN_ON() and remove release_task().

    While at it, document the setting of child->exit_code.

    Signed-off-by: Oleg Nesterov
    Cc: Pavel Labath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ptrace_resume() is called when the tracee is still __TASK_TRACED. We set
    tracee->exit_code and then wake_up_state() changes tracee->state. If the
    tracer's sub-thread does wait() in between, task_stopped_code(ptrace => T)
    wrongly looks like another report from tracee.

    This confuses debugger, and since wait_task_stopped() clears ->exit_code
    the tracee can miss a signal.

    Test-case:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/ptrace.h>
    #include <pthread.h>
    #include <assert.h>

    int pid;

    void *waiter(void *arg)
    {
        int stat;

        for (;;) {
            assert(pid == wait(&stat));
            assert(WIFSTOPPED(stat));
            if (WSTOPSIG(stat) == SIGHUP)
                continue;

            assert(WSTOPSIG(stat) == SIGCONT);
            printf("ERR! extra/wrong report:%x\n", stat);
        }
    }

    int main(void)
    {
        pthread_t thread;

        pid = fork();
        if (!pid) {
            assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
            for (;;)
                kill(getpid(), SIGHUP);
        }

        assert(pthread_create(&thread, NULL, waiter, NULL) == 0);

        for (;;)
            ptrace(PTRACE_CONT, pid, 0, SIGCONT);

        return 0;
    }

    Note for stable: the bug is very old, but without 9899d11f6544 "ptrace:
    ensure arch_ptrace/ptrace_request can never race with SIGKILL" the fix
    should use lock_task_sighand(child).

    Signed-off-by: Oleg Nesterov
    Reported-by: Pavel Labath
    Tested-by: Pavel Labath
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

18 Feb, 2015

1 commit


11 Dec, 2014

1 commit

  • Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
    we can simply pass "dead_children" list to exit_ptrace() and remove
    another release_task() loop. Plus this way we do not need to drop and
    reacquire tasklist_lock.

    Also shift the list_empty(ptraced) check, if we want this optimization it
    makes sense to eliminate the function call altogether.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{,_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible".

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps' (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since the first version of this patch (against 3.15), two new action
    functions have appeared, one in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer-aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

06 Mar, 2014

1 commit

  • Convert all compat system call functions where all parameter types
    have a size of four bytes or less, or are pointer types,
    to COMPAT_SYSCALL_DEFINE.
    The implicit casts within COMPAT_SYSCALL_DEFINE will perform proper
    zero and sign extension to 64 bit of all parameters if needed.
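
    As a rough illustration of what "zero and sign extension to 64 bit"
    means for a 32-bit argument that arrives in a 64-bit register, here
    is a plain C sketch. It is not the macro itself; extend_signed() and
    extend_unsigned() are hypothetical helper names, and the narrowing
    cast to int32_t relies on the wrap-around behaviour that common
    compilers such as GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * reg models a 64-bit register whose low 32 bits carry the argument and
 * whose high 32 bits may hold stale data.
 */
static int64_t extend_signed(uint64_t reg)
{
    /* Truncate to 32 bits, reinterpret as signed, then widen: sign extension. */
    return (int64_t)(int32_t)(uint32_t)reg;
}

static uint64_t extend_unsigned(uint64_t reg)
{
    /* Truncate to 32 bits, then widen: zero extension. */
    return (uint64_t)(uint32_t)reg;
}
```

    The point of the casts is that the stale upper half of the register
    never reaches the system call body.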

    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

13 Nov, 2013

1 commit

  • The get_dumpable() return value is not boolean. Most users of the
    function actually want to be testing for non-SUID_DUMP_USER(1) rather than
    SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a
    protected state. Almost all places did this correctly, excepting the two
    places fixed in this patch.

    Wrong logic:
    if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
    or
    if (dumpable == 0) { /* be protective */ }
    or
    if (!dumpable) { /* be protective */ }

    Correct logic:
    if (dumpable != SUID_DUMP_USER) { /* be protective */ }
    or
    if (dumpable != 1) { /* be protective */ }

    Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
    user was able to ptrace attach to processes that had dropped privileges to
    that user. (This may have been partially mitigated if Yama was enabled.)
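
    The difference between the two checks shows up only for the value 2.
    A minimal stand-alone sketch, using the constant values quoted above
    and two hypothetical helper names (is_protected_wrong() and
    is_protected_correct() are not kernel functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Values as listed in the commit message above. */
#define SUID_DUMP_DISABLE 0
#define SUID_DUMP_USER    1
#define SUID_DUMP_ROOT    2

/* Buggy form: only the "disabled" state is treated as protected. */
static bool is_protected_wrong(int dumpable)
{
    return dumpable == SUID_DUMP_DISABLE;
}

/* Fixed form: everything except "user dumpable" is treated as protected. */
static bool is_protected_correct(int dumpable)
{
    return dumpable != SUID_DUMP_USER;
}
```

    With dumpable == SUID_DUMP_ROOT the first helper reports "not
    protected" while the second reports "protected".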

    The macros have been moved into the file that declares get/set_dumpable(),
    which means things like the ia64 code can see them too.

    CVE-2013-2929

    Reported-by: Vasily Kulikov
    Signed-off-by: Kees Cook
    Cc: "Luck, Tony"
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

12 Sep, 2013

1 commit

  • __ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if task !=
    current, this can lead to surprising results.

    For example, a sub-thread can't readlink("/proc/self/exe") if the
    executable is not readable. setup_new_exec()->would_dump() notices that
    inode_permission(MAY_READ) fails and then it does
    set_dumpable(suid_dumpable). After that get_dumpable() fails.

    (It is not clear why proc_pid_readlink() checks get_dumpable(), perhaps we
    could add PTRACE_MODE_NODUMPABLE)

    Change __ptrace_may_access() to use same_thread_group() instead of "task
    == current". Any security check is pointless when the tasks share the
    same ->mm.
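
    The effect of the changed condition can be modelled in a few lines.
    This is a sketch under assumptions: struct model_task and the three
    helpers are hypothetical, and thread-group membership is modelled by
    comparing a tgid field, which is not necessarily how the kernel
    helper is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced view of a task. */
struct model_task {
    int pid;    /* per-thread id */
    int tgid;   /* id shared by all threads of one process */
};

static bool model_same_thread_group(const struct model_task *a,
                                    const struct model_task *b)
{
    return a->tgid == b->tgid;
}

/* Old condition: the checks are skipped only for the calling thread itself. */
static bool skip_checks_old(const struct model_task *task,
                            const struct model_task *current_task)
{
    return task == current_task;
}

/* New condition: the checks are skipped for any thread of the caller's process. */
static bool skip_checks_new(const struct model_task *task,
                            const struct model_task *current_task)
{
    return model_same_thread_group(task, current_task);
}
```

    A sub-thread inspecting its own group leader fails the old test and
    passes the new one, which corresponds to the /proc/self/exe case.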

    Signed-off-by: Mark Grondona
    Signed-off-by: Ben Woodard
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Grondona
     

07 Aug, 2013

1 commit

  • This reverts commit fab840fc2d542fabcab903db8e03589a6702ba5f.

    This commit even has the test-case to prove that the tracee
    can be killed by SIGTRAP if the debugger does not remove the
    breakpoints before PTRACE_DETACH.

    However, this is exactly what wineserver deliberately does,
    set_thread_context() calls PTRACE_ATTACH + PTRACE_DETACH just
    for PTRACE_POKEUSER(DR*) in between.

    So we should revert this fix and document that PTRACE_DETACH
    should keep the breakpoints.

    Reported-by: Felipe Contreras
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

10 Jul, 2013

2 commits

  • Change ptrace_detach() to call flush_ptrace_hw_breakpoint(child). This
    frees the slots for non-ptrace PERF_TYPE_BREAKPOINT users, and this
    ensures that the tracee won't be killed by SIGTRAP triggered by the
    active breakpoints.

    Test-case:

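    #include <assert.h>
    #include <signal.h>
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <asm/debugreg.h>   /* DR_* constants, x86 only */

    /* Build the DR7 bits that configure and arm debug register drnum. */
    unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len)
    {
        unsigned long dr7;

        dr7 = ((len | type) & 0xf)
            << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
        if (enable)
            dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));

        return dr7;
    }

    /* Write val into debug register dr of the traced process. */
    int write_dr(int pid, int dr, unsigned long val)
    {
        return ptrace(PTRACE_POKEUSER, pid,
                      offsetof (struct user, u_debugreg[dr]),
                      val);
    }

    /* Breakpoint target. */
    void func(void)
    {
    }

    int main(void)
    {
        int pid, stat;
        unsigned long dr7;

        pid = fork();
        if (!pid) {
            /* Child: become a tracee and stop so the parent can set DR0/DR7. */
            assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
            kill(getpid(), SIGHUP);

            func();
            return 0x13;
        }

        assert(pid == waitpid(-1, &stat, 0));
        assert(WSTOPSIG(stat) == SIGHUP);

        /* Arm an execute breakpoint on func() in the child. */
        assert(write_dr(pid, 0, (long)func) == 0);
        dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
        assert(write_dr(pid, 7, dr7) == 0);

        /* Detach without removing the breakpoint; the child must exit normally. */
        assert(ptrace(PTRACE_DETACH, pid, 0,0) == 0);
        assert(pid == waitpid(-1, &stat, 0));
        assert(stat == 0x1300);

        return 0;
    }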

    Before this patch the child is killed after PTRACE_DETACH.

    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Cc: Benjamin Herrenschmidt
    Cc: Ingo Molnar
    Cc: Jan Kratochvil
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: Will Deacon
    Cc: Prasad
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This reverts commit bf26c018490c ("Prepare to fix racy accesses on task
    breakpoints").

    The patch was fine but we can no longer race with SIGKILL after commit
    9899d11f6544 ("ptrace: ensure arch_ptrace/ptrace_request can never race
    with SIGKILL"): the __TASK_TRACED tracee can't be woken up and
    ->ptrace_bps[] can't go away.

    Now that ptrace_get_breakpoints/ptrace_put_breakpoints have no callers,
    we can kill them and remove task->ptrace_bp_refcnt.

    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Acked-by: Michael Neuling
    Cc: Benjamin Herrenschmidt
    Cc: Ingo Molnar
    Cc: Jan Kratochvil
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: Will Deacon
    Cc: Prasad
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

04 Jul, 2013

1 commit

  • crtools uses parasite code for dumping processes. The parasite code is
    injected into a process with the help of PTRACE_SEIZE.

    Currently crtools blocks signals from the parasite code. If a process has
    pending signals, crtools waits while the process handles these signals.

    This method is not suitable for stopped tasks. A stopped task can have a
    few pending signals; when we try to execute parasite code, we
    need to drop SIGSTOP, but all other signals must remain pending, because the
    state of a process must not be changed during checkpointing.

    This patch adds two ptrace commands to get and set the blocked-signal mask.

    I think gdb can use these commands too.

    [akpm@linux-foundation.org: be consistent with brace layout]
    Signed-off-by: Andrey Vagin
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Michael Kerrisk
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     

30 Jun, 2013

1 commit

  • This __put_user() could be used by unprivileged processes to write into
    kernel memory. The issue here is that even if copy_siginfo_to_user()
    fails, the error code is not checked before __put_user() is executed.
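
    The shape of the fix, checking the result of the first copy before
    doing an unchecked store, can be shown with a small user-space
    model. Everything here is hypothetical: model_copy_info() stands in
    for a checked copy, model_unchecked_store() for an unchecked write,
    and a NULL destination plays the role of a bad user pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct model_info {
    int signo;
    int code;
};

/* Checked copy: refuses a "bad" (NULL) destination with -EFAULT. */
static int model_copy_info(struct model_info *dst, const struct model_info *src)
{
    if (dst == NULL)
        return -EFAULT;
    memcpy(dst, src, sizeof(*dst));
    return 0;
}

/* Unchecked store: trusts its destination. Calls are counted for the demo. */
static int unchecked_stores;

static void model_unchecked_store(int *dst, int val)
{
    unchecked_stores++;
    *dst = val;
}

/* Fixed ordering: bail out on error before the unchecked store runs. */
static int model_peek_one(struct model_info *dst, const struct model_info *src)
{
    int ret = model_copy_info(dst, src);

    if (ret)
        return ret;
    model_unchecked_store(&dst->code, src->code);
    return 0;
}
```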

    Luckily, ptrace_peek_siginfo() has been added within the 3.10-rc cycle,
    so it has not hit a stable release yet.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Roland McGrath
    Cc: Paul McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Pavel Emelyanov
    Cc: Pedro Alves
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

08 May, 2013

1 commit

  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

01 May, 2013

1 commit

  • This patch adds a new ptrace request PTRACE_PEEKSIGINFO.

    This request is used to retrieve information about pending signals
    starting with the specified sequence number. Siginfo_t structures are
    copied from the child into the buffer starting at "data".

    The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.
    struct ptrace_peeksiginfo_args {
        u64 off;    /* from which siginfo to start */
        u32 flags;
        s32 nr;     /* how many siginfos to take */
    };

    "nr" has type "s32", because ptrace() returns "long", which has 32 bits on
    i386, and negative values are used for errors.

    Currently there is only one flag, PTRACE_PEEKSIGINFO_SHARED, for dumping
    signals from the process-wide queue. If this flag is not set, signals are
    read from the per-thread queue.

    The request PTRACE_PEEKSIGINFO returns the number of dumped signals. If a
    signal with the specified sequence number doesn't exist, ptrace returns
    zero. The request returns an error if no signal has been dumped.

    Errors:
    EINVAL - one or more specified flags are not supported or nr is negative
    EFAULT - buf or addr is outside your accessible address space.
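
    The off/nr/flags semantics described above can be modelled without
    ptrace. This is a sketch, not the kernel code: signals are reduced
    to plain ints, and model_peeksiginfo(), struct model_peek_args,
    struct model_queues and MODEL_PEEK_SHARED are hypothetical names
    that mirror the description.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_PEEK_SHARED 0x1u  /* stand-in for the "shared queue" flag */

struct model_peek_args {
    uint64_t off;    /* from which entry to start */
    uint32_t flags;
    int32_t  nr;     /* how many entries to take */
};

struct model_queues {
    const int *priv;    uint64_t priv_len;    /* per-thread queue */
    const int *shared;  uint64_t shared_len;  /* process-wide queue */
};

/* Returns the number of entries copied to out, 0 past the end, or -EINVAL. */
static long model_peeksiginfo(const struct model_queues *q,
                              const struct model_peek_args *args, int *out)
{
    const int *queue = q->priv;
    uint64_t len = q->priv_len;
    long copied = 0;

    if (args->flags & ~MODEL_PEEK_SHARED)
        return -EINVAL;             /* unsupported flag */
    if (args->nr < 0)
        return -EINVAL;             /* negative count */
    if (args->flags & MODEL_PEEK_SHARED) {
        queue = q->shared;
        len = q->shared_len;
    }

    for (uint64_t i = args->off; i < len && copied < args->nr; i++)
        out[copied++] = queue[i];

    return copied;
}
```

    A caller would typically loop, advancing off by the returned count,
    until the call returns zero.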

    The resulting siginfo contains the kernel part of si_code, which is usually
    stripped, but it's required for queuing the same siginfo back during restore
    of pending signals.

    This functionality is required for checkpointing pending signals. Pedro
    Alves suggested using it in "gdb" to peek at pending signals. gdb already
    uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
    dequeued. This functionality allows gdb to look at the pending signals
    which were not reported yet.

    The prototype of this code was developed by Oleg Nesterov.

    Signed-off-by: Andrew Vagin
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: David Howells
    Cc: Dave Jones
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pavel Emelyanov
    Cc: Linus Torvalds
    Cc: Pedro Alves
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin