23 Mar, 2016

1 commit

  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

21 Jan, 2016

1 commit

  • Let %h and %e print empty values as "!", "." as "!" and
    ".." as "!.".

    This prevents hostnames and comm values that are empty or consist of one
    or two dots from changing the directory level at which the corefile will
    be stored.

    Consider the case where someone decides to sort coredumps by hostname
    with a core pattern like "/cores/%h/core.%e.%p.%t" or so. In this
    case, hostnames "" and "." would cause the coredump to land directly in
    /cores, which is not what the intent behind the core pattern is, and
    ".." would cause the coredump to land in /.

    Yeah, there probably aren't many people who do that, but I still don't
    want this edgecase to be kind of broken.

    It seems very unlikely that this caused security issues anywhere, so I'm
    not requesting a stable backport.

    [akpm@linux-foundation.org: tweak code comment]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

07 Dec, 2015

1 commit

  • struct timeval on 32-bit systems will have its tv_sec
    value overflow in year 2038 and beyond.
    Use a 64 bit value to print time of the coredump in seconds.
    ktime_get_real_seconds is chosen here for efficiency reasons.

    Suggested by: Arnd Bergmann
    Signed-off-by: Tina Ruchandani
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Arnd Bergmann
     

07 Nov, 2015

2 commits

  • Change zap_threads() paths to use for_each_thread() rather than
    while_each_thread().

    While at it, change zap_threads() to avoid the nested if's to make the
    code more readable and lessen the indentation.

    Signed-off-by: Oleg Nesterov
    Cc: David Rientjes
    Cc: Kyle Walker
    Cc: Michal Hocko
    Cc: Stanislav Kozina
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_will_free_mem() is wrong in many ways, and in particular the
    SIGNAL_GROUP_COREDUMP check is not reliable: a task can participate in the
    coredumping without SIGNAL_GROUP_COREDUMP bit set.

    change zap_threads() paths to always set SIGNAL_GROUP_COREDUMP even if
    other CLONE_VM processes can't react to SIGKILL. Fortunately, at least
    oom-kill case if fine; it kills all tasks sharing the same mm, so it
    should also kill the process which actually dumps the core.

    The change in prepare_signal() is not strictly necessary, it just ensures
    that the patch does not bring another subtle behavioural change. But it
    reminds us that this SIGNAL_GROUP_EXIT/COREDUMP case needs more changes.

    Signed-off-by: Oleg Nesterov
    Cc: David Rientjes
    Cc: Kyle Walker
    Acked-by: Michal Hocko
    Cc: Stanislav Kozina
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Sep, 2015

2 commits

  • On a filesystem like vfat, all files are created with the same owner
    and mode independent of who created the file. When a vfat filesystem
    is mounted with root as owner of all files and read access for everyone,
    root's processes left world-readable coredumps on it (but other
    users' processes only left empty corefiles when given write access
    because of the uid mismatch).

    Given that the old behavior was inconsistent and insecure, I don't see
    a problem with changing it. Now, all processes refuse to dump core unless
    the resulting corefile will only be readable by their owner.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • It was possible for an attacking user to trick root (or another user) into
    writing his coredumps into an attacker-readable, pre-existing file using
    rename() or link(), causing the disclosure of secret data from the victim
    process' virtual memory. Depending on the configuration, it was also
    possible to trick root into overwriting system files with coredumps. Fix
    that issue by never writing coredumps into existing files.

    Requirements for the attack:
    - The attack only applies if the victim's process has a nonzero
    RLIMIT_CORE and is dumpable.
    - The attacker can trick the victim into coredumping into an
    attacker-writable directory D, either because the core_pattern is
    relative and the victim's cwd is attacker-writable or because an
    absolute core_pattern pointing to a world-writable directory is used.
    - The attacker has one of these:
    A: on a system with protected_hardlinks=0:
    execute access to a folder containing a victim-owned,
    attacker-readable file on the same partition as D, and the
    victim-owned file will be deleted before the main part of the attack
    takes place. (In practice, there are lots of files that fulfill
    this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
    This does not apply to most Linux systems because most distros set
    protected_hardlinks=1.
    B: on a system with protected_hardlinks=1:
    execute access to a folder containing a victim-owned,
    attacker-readable and attacker-writable file on the same partition
    as D, and the victim-owned file will be deleted before the main part
    of the attack takes place.
    (This seems to be uncommon.)
    C: on any system, independent of protected_hardlinks:
    write access to a non-sticky folder containing a victim-owned,
    attacker-readable file on the same partition as D
    (This seems to be uncommon.)

    The basic idea is that the attacker moves the victim-owned file to where
    he expects the victim process to dump its core. The victim process dumps
    its core into the existing file, and the attacker reads the coredump from
    it.

    If the attacker can't move the file because he does not have write access
    to the containing directory, he can instead link the file to a directory
    he controls, then wait for the original link to the file to be deleted
    (because the kernel checks that the link count of the corefile is 1).

    A less reliable variant that requires D to be non-sticky works with link()
    and does not require deletion of the original link: link() the file into
    D, but then unlink() it directly before the kernel performs the link count
    check.

    On systems with protected_hardlinks=0, this variant allows an attacker to
    not only gain information from coredumps, but also clobber existing,
    victim-writable files with coredumps. (This could theoretically lead to a
    privilege escalation.)

    Signed-off-by: Jann Horn
    Cc: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

26 Jun, 2015

2 commits

  • This allows detecting improper format string at build time, like:

    fs/coredump.c:225:5: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'int' [-Wformat=]
    err = cn_printf(cn, "%ld", cprm->siginfo->si_signo);
    ^

    As si_signo is always an int, the format should be %d here.

    Signed-off-by: Nicolas Iooss
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     
  • When adding __printf attribute to cn_printf, gcc reports some issues:

    fs/coredump.c:213:5: warning: format '%d' expects argument of type
    'int', but argument 3 has type 'kuid_t' [-Wformat=]
    err = cn_printf(cn, "%d", cred->uid);
    ^
    fs/coredump.c:217:5: warning: format '%d' expects argument of type
    'int', but argument 3 has type 'kgid_t' [-Wformat=]
    err = cn_printf(cn, "%d", cred->gid);
    ^

    These warnings come from the fact that the value of uid/gid needs to be
    extracted from the kuid_t/kgid_t structure before being used as an
    integer. More precisely, cred->uid and cred->gid need to be converted to
    either user-namespace uid/gid or to init_user_ns uid/gid.

    Use init_user_ns in order not to break existing ABI, and document this in
    Documentation/sysctl/kernel.txt.

    While at it, format uid and gid values with %u instead of %d because
    uid_t/__kernel_uid32_t and gid_t/__kernel_gid32_t are unsigned int.

    Signed-off-by: Nicolas Iooss
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     

24 Jun, 2015

1 commit


12 Apr, 2015

1 commit


20 Feb, 2015

1 commit


14 Oct, 2014

1 commit

  • format_corename() can only pass the leader's pid to the core handler,
    but there is no simple way to figure out which thread originated the
    coredump.

    As Jan explains, this also means that there is no simple way to create
    the backtrace of the crashed process:

    As programs are mostly compiled with implicit gcc -fomit-frame-pointer
    one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
    segment) or .debug_frame section. .debug_frame usually is present only
    in separate debug info files usually not even installed on the system.
    While .eh_frame is a part of the executable/library (and it is even
    always mapped for C++ exceptions unwinding) it no longer has to be
    present anywhere on the disk as the program could be upgraded in the
    meantime and the running instance has its executable file already
    unlinked from disk.

    One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
    the file-backed memory including the executable's .eh_frame section.
    But that can create huge core files, for example even due to mmapped
    data files.

    Other possibility would be to read .eh_frame from /proc/PID/mem at the
    core_pattern handler time of the core dump. For the backtrace one needs
    to read the register state first which can be done from core_pattern
    handler:

    ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
    close(0); // close pipe fd to resume the sleeping dumper
    waitpid(); // should report EXIT
    PTRACE_GETREGS or other requests

    The remaining problem is how to get the 'tid' value of the crashed
    thread. It could be read from the first NT_PRSTATUS note of the core
    file but that makes the core_pattern handler complicated.

    Unfortunately %t is already used so this patch uses %i/%I.

    Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
    is experimenting with this. It is using the elfutils
    (https://fedorahosted.org/elfutils/) unwinder for generating the
    backtraces. Apart from not needing matching executables as mentioned
    above, another advantage is that we can get the backtrace without saving
    the core (which might be quite large) to disk.

    [mmilata@redhat.com: final paragraph of changelog]
    Signed-off-by: Jan Kratochvil
    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: Mark Wielaard
    Cc: Martin Milata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

24 Jul, 2014

1 commit

  • Commit 079148b919d0 ("coredump: factor out the setting of PF_DUMPCORE")
    cleaned up the setting of PF_DUMPCORE by removing it from all the
    linux_binfmt->core_dump() and moving it to zap_threads().But this ended
    up clearing all the previously set flags. This causes issues during
    core generation when tsk->flags is checked again (eg. for PF_USED_MATH
    to dump floating point registers). Fix this.

    Signed-off-by: Silesh C V
    Acked-by: Oleg Nesterov
    Cc: Mandeep Singh Baines
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Silesh C V
     

20 Apr, 2014

1 commit

  • A va_list needs to be copied in case it needs to be used twice.

    Thanks to Hugh for debugging this issue, leading to various panics.

    Tested:

    lpq84:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern

    'produce_core' is simply : main() { *(int *)0 = 1;}

    lpq84:~# ./produce_core
    Segmentation fault (core dumped)
    lpq84:~# dmesg | tail -1
    [ 614.352947] Core dump to |/foobar12345 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 lpq84 (null) pipe failed

    Notice the last argument was replaced by a NULL (we were lucky enough to
    not crash, but do not try this on your production machine !)

    After fix :

    lpq83:~# echo "|/foobar12345 %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h %h" >/proc/sys/kernel/core_pattern
    lpq83:~# ./produce_core
    Segmentation fault
    lpq83:~# dmesg | tail -1
    [ 740.800441] Core dump to |/foobar12345 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 lpq83 pipe failed

    Fixes: 5fe9d8ca21cc ("coredump: cn_vprintf() has no reason to call vsnprintf() twice")
    Signed-off-by: Eric Dumazet
    Diagnosed-by: Hugh Dickins
    Acked-by: Oleg Nesterov
    Cc: Neil Horman
    Cc: Andrew Morton
    Cc: stable@vger.kernel.org # 3.11+
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

24 Jan, 2014

1 commit

  • 1. Remove fs/coredump.h. It is not clear why do we need it,
    it only declares __get_dumpable(), signal.c includes it
    for no reason.

    2. Now that get_dumpable() and __get_dumpable() are really
    trivial make them inline in linux/sched.h.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Alex Kelly
    Cc: "Eric W. Biederman"
    Cc: Josh Triplett
    Cc: Petr Matousek
    Cc: Vasily Kulikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Nov, 2013

2 commits


09 Nov, 2013

5 commits


25 Oct, 2013

1 commit


12 Sep, 2013

1 commit

  • Add a new %P variable to be used in core_pattern. This variable contains
    the global PID (PID in the init namespace) as %p contains the PID in the
    current namespace which isn't always what we want.

    The main use for this is to make it easier to handle crashes that happened
    within a container. With that new variables it's possible to have the
    crashes dumped into the container or forwarded to the host with the right
    PID (from the host's point of view).

    Signed-off-by: Stéphane Graber
    Reported-by: Hans Feldt
    Cc: Alexander Viro
    Cc: Eric W. Biederman
    Cc: Andy Whitcroft
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stéphane Graber
     

04 Jul, 2013

6 commits

  • "goto end" should not bypass the "Backward compatibility with
    core_uses_pid" code, move this label up.

    While at it,

    - It is ugly to copy '|' into cn->corename and then inc
    the pointer for argv_split().

    Change format_corename() to increment pat_ptr instead.

    - Remove the dead "if (*pat_ptr == 0)" in format_corename(),
    we already checked it is not zero.

    Signed-off-by: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Colin Walters
    Cc: Denys Vlasenko
    Cc: Jiri Slaby
    Cc: Lennart Poettering
    Cc: Lucas De Marchi
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Imho, "atomic_t call_count" is ugly and should die. It buys nothing and
    in fact it can grow more than necessary, expand doesn't check if it was
    already incremented by another task.

    Kill it, and introduce "static int core_name_size" updated by
    expand_corename(). This is obviously racy too but harmless, and
    core_name_size never grows for no reason.

    We do not bother to to calculate the "right" new size, we simply do
    kmalloc(size_we_need) and use ksize() to rely on kmalloc_index's decision.

    Finally change format_corename() to use expand_corename(), krealloc(NULL)
    is fine.

    Signed-off-by: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Colin Walters
    Cc: Denys Vlasenko
    Cc: Jiri Slaby
    Cc: Lennart Poettering
    Cc: Lucas De Marchi
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The usage of cn_escape() looks really annoying, imho this sequence needs a
    wrapper. And it is buggy. If cn_printf() does expand_corename()
    cn_escape() writes to the freed memory.

    Introduce cn_esc_printf() which hopefully does this all right. It records
    the index before cn_vprintf(), not "char *" which is no longer valid (in
    general) after krealloc().

    Signed-off-by: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Colin Walters
    Cc: Denys Vlasenko
    Cc: Jiri Slaby
    Cc: Lennart Poettering
    Cc: Lucas De Marchi
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • cn_vprintf() looks really overcomplicated and sub-optimal. We do not need
    vsnprintf(NULL) to calculate the size we need, we can simply try to print
    into the current buffer and expand/retry only if necessary.

    Signed-off-by: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Colin Walters
    Cc: Denys Vlasenko
    Cc: Jiri Slaby
    Cc: Lennart Poettering
    Cc: Lucas De Marchi
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Turn cn_printf(...) into cn_vprintf(va_list args), reintroduce
    cn_printf() as a trivial wrapper.

    This simplifies the next change and cn_vprintf() will have more
    callers.

    Signed-off-by: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Colin Walters
    Cc: Denys Vlasenko
    Cc: Jiri Slaby
    Cc: Lennart Poettering
    Cc: Lucas De Marchi
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_coredump() assumes that format_corename() can only fail if
    expand_corename() fails and frees cn->corename. This is not true, for
    example cn_print_exe_file() can fail and in this case nobody frees
    cn->corename.

    Change do_coredump() to always do kfree(cn->corename) after it calls
    format_corename() (NULL is fine), change expand_corename() to do nothing
    if kmalloc() fails.

    Signed-off-by: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Colin Walters
    Cc: Denys Vlasenko
    Cc: Jiri Slaby
    Cc: Lennart Poettering
    Cc: Lucas De Marchi
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 May, 2013

1 commit


02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     

01 May, 2013

6 commits

  • wait_for_dump_helpers() calls wake_up/kill_fasync from inside the
    wait_event-like loop. This is not needed and in fact this is not
    strictly correct, we can/should do this only once after we change
    pipe->writers. We could even check if it becomes zero.

    Change this code to use use wait_event_interruptible(), this can also
    help to make this wait freezable.

    With this patch we check pipe->readers without pipe_lock(), this is
    fine. Once we see pipe->readers == 1 we know that the handler
    decremented the counter, this is all we need.

    Signed-off-by: Oleg Nesterov
    Acked-by: Mandeep Singh Baines
    Cc: Neil Horman
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup. Every linux_binfmt->core_dump() sets PF_DUMPCORE, move this into
    zap_threads() called by do_coredump().

    Signed-off-by: Oleg Nesterov
    Acked-by: Mandeep Singh Baines
    Cc: Neil Horman
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • By discussion with Mandeep.

    Change dump_write(), dump_seek() and do_coredump() to check
    signal_pending() and abort if it is true. dump_seek() does this only
    before f_op->llseek(), otherwise it relies on dump_write().

    We need this change to ensure that the coredump won't delay suspend, and
    to ensure it reacts to SIGKILL "quickly enough", a core dump can take a
    lot of time. In particular this can help oom-killer.

    We add the new trivial helper, dump_interrupted() to add the comments and
    to simplify the potential freezer changes. Perhaps it will have more
    callers.

    Ideally it should do try_to_freeze() but then we need the unpleasant
    changes in dump_write() and wait_for_dump_helpers(). It is not trivial to
    change dump_write() to restart if f_op->write() fails because of
    freezing(). We need to handle the short writes, we need to clear
    TIF_SIGPENDING (and we can't rely on recalc_sigpending() unless we change
    it to check PF_DUMPCORE). And if the buggy f_op->write() sets
    TIF_SIGPENDING we can not distinguish this case from the race with
    freeze_task() + __thaw_task().

    So we simply accept the fact that the freezer can truncate a core-dump but
    at least you can reliably suspend. Hopefully we can tolerate this
    unlikely case and the necessary complications doesn't worth a trouble.
    But if we decide to make the coredumping freezable later we can do this on
    top of this change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Mandeep Singh Baines
    Cc: Neil Horman
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that the coredumping process can be SIGKILL'ed, the setting of
    ->group_exit_code in do_coredump() can race with complete_signal() and
    SIGKILL or 0x80 can be "lost", or wait(status) can report status ==
    SIGKILL | 0x80.

    But the main problem is that it is not clear to me what should we do if
    binfmt->core_dump() succeeds but SIGKILL was sent, that is why this patch
    comes as a separate change.

    This patch adds 0x80 if ->core_dump() succeeds and the process was not
    killed. But perhaps we can (should?) re-set ->group_exit_code changed by
    SIGKILL back to "siginfo->si_signo |= 0x80" in case when core_dumped == T.

    Signed-off-by: Oleg Nesterov
    Tested-by: Mandeep Singh Baines
    Cc: Ingo Molnar
    Cc: Neil Horman
    Cc: "Rafael J. Wysocki"
    Cc: Roland McGrath
    Cc: Tejun Heo
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • prepare_signal() blesses SIGKILL sent to the dumping process but this
    signal can be "lost" anyway. The problems is, complete_signal() sees
    SIGNAL_GROUP_EXIT and skips the "kill them all" logic. And even if the
    dumping process is single-threaded (so the target is always "correct"),
    the group-wide SIGKILL is not recorded in task->pending and thus
    __fatal_signal_pending() won't be true. A multi-threaded case has even
    more problems.

    And even ignoring all technical details, SIGNAL_GROUP_EXIT doesn't look
    right to me. This coredumping process is not exiting yet, it can do a lot
    of work dumping the core.

    With this patch the dumping process doesn't have SIGNAL_GROUP_EXIT, we set
    signal->group_exit_task instead. This makes signal_group_exit() true and
    thus this should equally close the races with exit/exec/stop but allows to
    kill the dumping thread reliably.

    Notes:
    - It is not clear what should we do with ->group_exit_code
    if the dumper was killed, see the next change.

    - we need more (hopefully straightforward) changes to ensure
    that SIGKILL actually interrupts the coredump. Basically we
    need to check __fatal_signal_pending() in dump_write() and
    dump_seek().

    Signed-off-by: Oleg Nesterov
    Tested-by: Mandeep Singh Baines
    Cc: Ingo Molnar
    Cc: Neil Horman
    Cc: "Rafael J. Wysocki"
    Cc: Roland McGrath
    Cc: Tejun Heo
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There are 2 well known and ancient problems with coredump/signals, and a
    lot of related bug reports:

    - do_coredump() clears TIF_SIGPENDING but of course this can't help
    if, say, SIGCHLD comes after that.

    In this case the coredump can fail unexpectedly. See for example
    wait_for_dump_helper()->signal_pending() check but there are other
    reasons.

    - At the same time, dumping a huge core on the slow media can take a
    lot of time/resources and there is no way to kill the coredumping
    task reliably. In particular this is not oom_kill-friendly.

    This patch tries to fix the 1st problem, and makes the preparation for the
    next changes.

    We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
    that this process dumps the core. prepare_signal() checks this flag and
    nacks any signal except SIGKILL.

    Note that this check tries to be conservative, in the long term we should
    probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
    discussion. See marc.info/?l=linux-kernel&m=120508897917439

    Notes:
    - recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
    The patch assumes that dump_write/etc paths should never
    call it, but we can change it as well.

    - There is another source of TIF_SIGPENDING, freezer. This
    will be addressed separately.

    Signed-off-by: Oleg Nesterov
    Tested-by: Mandeep Singh Baines
    Cc: Ingo Molnar
    Cc: Neil Horman
    Cc: "Rafael J. Wysocki"
    Cc: Roland McGrath
    Cc: Tejun Heo
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov