08 Oct, 2016

3 commits

  • Current supplementary groups code can massively overallocate memory and
    is implemented in a way so that access to individual gid is done via 2D
    array.

    If number of gids is
    Cc: Vasily Kulikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • top(1) opens the following files for every PID:

    /proc/*/stat
    /proc/*/statm
    /proc/*/status

    This patch switches /proc/*/status away from seq_printf().
    The result is 13.5% speedup.

    Benchmark is open("/proc/self/status")+read+close 1.000.000 million times.

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% )
    12 context-switches # 0.001 K/sec ( +- 1.09% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.010 K/sec ( +- 0.45% )
    37,424,127,876 cycles # 3.482 GHz ( +- 0.04% )
    8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% )
    3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% )
    65,632,764,147 instructions # 1.75 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% )
    138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% )

    11.263885428 seconds time elapsed ( +- 0.04% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-self-status

    Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs):

    9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% )
    11 context-switches # 0.001 K/sec ( +- 1.54% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    103 page-faults # 0.011 K/sec ( +- 0.60% )
    32,352,310,603 cycles # 3.591 GHz ( +- 0.07% )
    7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% )
    3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% )
    56,012,163,567 instructions # 1.73 insn per cycle
    # 0.14 stalled cycles per insn ( +- 0.00% )
    11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% )
    98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% )

    9.741247736 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

21 May, 2016

1 commit

  • It's not possible to read the process umask without also modifying it,
    which is what umask(2) does. A library cannot read umask safely,
    especially if the main program might be multithreaded.

    Add a new status line ("Umask") in /proc//status. It contains the
    file mode creation mask (umask) in octal. It is only shown for tasks
    which have task->fs.

    This patch is adapted from one originally written by Pierre Carrier.

    The use case is that we have endless trouble with people setting weird
    umask() values (usually on the grounds of "security"), and then
    everything breaking. I'm on the hook to fix these. We'd like to add
    debugging to our program so we can dump out the umask in debug reports.

    Previous versions of the patch used a syscall so you could only read
    your own umask. That's all I need. However there was quite a lot of
    push-back from those, so this new version exports it in /proc.

    See:
    https://lkml.org/lkml/2016/4/13/704 [umask2]
    https://lkml.org/lkml/2016/4/13/487 [getumask]

    Signed-off-by: Richard W.M. Jones
    Acked-by: Konstantin Khlebnikov
    Acked-by: Jerome Marchand
    Acked-by: Kees Cook
    Cc: "Theodore Ts'o"
    Cc: Michal Hocko
    Cc: Pierre Carrier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard W.M. Jones
     

21 Jan, 2016

1 commit

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
    should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
    directories that contain files with lax permissions, e.g. in
    this scenario:
    lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
    drwx------ root root /root
    drwxr-xr-x root root /root/foobar
    -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

07 Nov, 2015

1 commit


01 Oct, 2015

1 commit

  • So the /proc/PID/stat 'wchan' field (the 30th field, which contains
    the absolute kernel address of the kernel function a task is blocked in)
    leaks absolute kernel addresses to unprivileged user-space:

    seq_put_decimal_ull(m, ' ', wchan);

    The absolute address might also leak via /proc/PID/wchan as well, if
    KALLSYMS is turned off or if the symbol lookup fails for some reason:

    static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
    struct pid *pid, struct task_struct *task)
    {
    unsigned long wchan;
    char symname[KSYM_NAME_LEN];

    wchan = get_wchan(task);

    if (lookup_symbol_name(wchan, symname) < 0) {
    if (!ptrace_may_access(task, PTRACE_MODE_READ))
    return 0;
    seq_printf(m, "%lu", wchan);
    } else {
    seq_printf(m, "%s", symname);
    }

    return 0;
    }

    This isn't ideal, because for example it trivially leaks the KASLR offset
    to any local attacker:

    fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35)
    ffffffff8123b380

    Most real-life uses of wchan are symbolic:

    ps -eo pid:10,tid:10,wchan:30,comm

    and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat:

    triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1
    open("/proc/30833/wchan", O_RDONLY) = 6

    There's one compatibility quirk here: procps relies on whether the
    absolute value is non-zero - and we can provide that functionality
    by outputing "0" or "1" depending on whether the task is blocked
    (whether there's a wchan address).

    These days there appears to be very little legitimate reason
    user-space would be interested in the absolute address. The
    absolute address is mostly historic: from the days when we
    didn't have kallsyms and user-space procps had to do the
    decoding itself via the System.map.

    So this patch sets all numeric output to "0" or "1" and keeps only
    symbolic output, in /proc/PID/wchan.

    ( The absolute sleep address can generally still be profiled via
    perf, by tasks with sufficient privileges. )

    Reviewed-by: Thomas Gleixner
    Acked-by: Kees Cook
    Acked-by: Linus Torvalds
    Cc:
    Cc: Al Viro
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Denys Vlasenko
    Cc: Dmitry Vyukov
    Cc: Kostya Serebryany
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: kasan-dev
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Sep, 2015

1 commit

  • Credit where credit is due: this idea comes from Christoph Lameter with
    a lot of valuable input from Serge Hallyn. This patch is heavily based
    on Christoph's patch.

    ===== The status quo =====

    On Linux, there are a number of capabilities defined by the kernel. To
    perform various privileged tasks, processes can wield capabilities that
    they hold.

    Each task has four capability masks: effective (pE), permitted (pP),
    inheritable (pI), and a bounding set (X). When the kernel checks for a
    capability, it checks pE. The other capability masks serve to modify
    what capabilities can be in pE.

    Any task can remove capabilities from pE, pP, or pI at any time. If a
    task has a capability in pP, it can add that capability to pE and/or pI.
    If a task has CAP_SETPCAP, then it can add any capability to pI, and it
    can remove capabilities from X.

    Tasks are not the only things that can have capabilities; files can also
    have capabilities. A file can have no capabilty information at all [1].
    If a file has capability information, then it has a permitted mask (fP)
    and an inheritable mask (fI) as well as a single effective bit (fE) [2].
    File capabilities modify the capabilities of tasks that execve(2) them.

    A task that successfully calls execve has its capabilities modified for
    the file ultimately being excecuted (i.e. the binary itself if that
    binary is ELF or for the interpreter if the binary is a script.) [3] In
    the capability evolution rules, for each mask Z, pZ represents the old
    value and pZ' represents the new value. The rules are:

    pP' = (X & fP) | (pI & fI)
    pI' = pI
    pE' = (fE ? pP' : 0)
    X is unchanged

    For setuid binaries, fP, fI, and fE are modified by a moderately
    complicated set of rules that emulate POSIX behavior. Similarly, if
    euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
    (primary, fP and fI usually end up being the full set). For nonroot
    users executing binaries with neither setuid nor file caps, fI and fP
    are empty and fE is false.

    As an extra complication, if you execute a process as nonroot and fE is
    set, then the "secure exec" rules are in effect: AT_SECURE gets set,
    LD_PRELOAD doesn't work, etc.

    This is rather messy. We've learned that making any changes is
    dangerous, though: if a new kernel version allows an unprivileged
    program to change its security state in a way that persists cross
    execution of a setuid program or a program with file caps, this
    persistent state is surprisingly likely to allow setuid or file-capped
    programs to be exploited for privilege escalation.

    ===== The problem =====

    Capability inheritance is basically useless.

    If you aren't root and you execute an ordinary binary, fI is zero, so
    your capabilities have no effect whatsoever on pP'. This means that you
    can't usefully execute a helper process or a shell command with elevated
    capabilities if you aren't root.

    On current kernels, you can sort of work around this by setting fI to
    the full set for most or all non-setuid executable files. This causes
    pP' = pI for nonroot, and inheritance works. No one does this because
    it's a PITA and it isn't even supported on most filesystems.

    If you try this, you'll discover that every nonroot program ends up with
    secure exec rules, breaking many things.

    This is a problem that has bitten many people who have tried to use
    capabilities for anything useful.

    ===== The proposed change =====

    This patch adds a fifth capability mask called the ambient mask (pA).
    pA does what most people expect pI to do.

    pA obeys the invariant that no bit can ever be set in pA if it is not
    set in both pP and pI. Dropping a bit from pP or pI drops that bit from
    pA. This ensures that existing programs that try to drop capabilities
    still do so, with a complication. Because capability inheritance is so
    broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
    then calling execve effectively drops capabilities. Therefore,
    setresuid from root to nonroot conditionally clears pA unless
    SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can
    re-add bits to pA afterwards.

    The capability evolution rules are changed:

    pA' = (file caps or setuid or setgid ? 0 : pA)
    pP' = (X & fP) | (pI & fI) | pA'
    pI' = pI
    pE' = (fE ? pP' : pA')
    X is unchanged

    If you are nonroot but you have a capability, you can add it to pA. If
    you do so, your children get that capability in pA, pP, and pE. For
    example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
    automatically bind low-numbered ports. Hallelujah!

    Unprivileged users can create user namespaces, map themselves to a
    nonzero uid, and create both privileged (relative to their namespace)
    and unprivileged process trees. This is currently more or less
    impossible. Hallelujah!

    You cannot use pA to try to subvert a setuid, setgid, or file-capped
    program: if you execute any such program, pA gets cleared and the
    resulting evolution rules are unchanged by this patch.

    Users with nonzero pA are unlikely to unintentionally leak that
    capability. If they run programs that try to drop privileges, dropping
    privileges will still work.

    It's worth noting that the degree of paranoia in this patch could
    possibly be reduced without causing serious problems. Specifically, if
    we allowed pA to persist across executing non-pA-aware setuid binaries
    and across setresuid, then, naively, the only capabilities that could
    leak as a result would be the capabilities in pA, and any attacker
    *already* has those capabilities. This would make me nervous, though --
    setuid binaries that tried to privilege-separate might fail to do so,
    and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
    unexpected side effects. (Whether these unexpected side effects would
    be exploitable is an open question.) I've therefore taken the more
    paranoid route. We can revisit this later.

    An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
    ambient capabilities. I think that this would be annoying and would
    make granting otherwise unprivileged users minor ambient capabilities
    (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
    it is with this patch.

    ===== Footnotes =====

    [1] Files that are missing the "security.capability" xattr or that have
    unrecognized values for that xattr end up with has_cap set to false.
    The code that does that appears to be complicated for no good reason.

    [2] The libcap capability mask parsers and formatters are dangerously
    misleading and the documentation is flat-out wrong. fE is *not* a mask;
    it's a single bit. This has probably confused every single person who
    has tried to use file capabilities.

    [3] Linux very confusingly processes both the script and the interpreter
    if applicable, for reasons that elude me. The results from thinking
    about a script's file capabilities and/or setuid bits are mostly
    discarded.

    Preliminary userspace code is here, but it needs updating:
    https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2

    Here is a test program that can be used to verify the functionality
    (from Christoph):

    /*
    * Test program for the ambient capabilities. This program spawns a shell
    * that allows running processes with a defined set of capabilities.
    *
    * (C) 2015 Christoph Lameter
    * Released under: GPL v3 or later.
    *
    *
    * Compile using:
    *
    * gcc -o ambient_test ambient_test.o -lcap-ng
    *
    * This program must have the following capabilities to run properly:
    * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
    *
    * A command to equip the binary with the right caps is:
    *
    * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
    *
    *
    * To get a shell with additional caps that can be inherited by other processes:
    *
    * ./ambient_test /bin/bash
    *
    *
    * Verifying that it works:
    *
    * From the bash spawed by ambient_test run
    *
    * cat /proc/$$/status
    *
    * and have a look at the capabilities.
    */

    #include
    #include
    #include
    #include
    #include
    #include

    /*
    * Definitions from the kernel header files. These are going to be removed
    * when the /usr/include files have these defined.
    */
    #define PR_CAP_AMBIENT 47
    #define PR_CAP_AMBIENT_IS_SET 1
    #define PR_CAP_AMBIENT_RAISE 2
    #define PR_CAP_AMBIENT_LOWER 3
    #define PR_CAP_AMBIENT_CLEAR_ALL 4

    static void set_ambient_cap(int cap)
    {
    int rc;

    capng_get_caps_process();
    rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
    if (rc) {
    printf("Cannot add inheritable cap\n");
    exit(2);
    }
    capng_apply(CAPNG_SELECT_CAPS);

    /* Note the two 0s at the end. Kernel checks for these */
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
    perror("Cannot set cap");
    exit(1);
    }
    }

    int main(int argc, char **argv)
    {
    int rc;

    set_ambient_cap(CAP_NET_RAW);
    set_ambient_cap(CAP_NET_ADMIN);
    set_ambient_cap(CAP_SYS_NICE);

    printf("Ambient_test forking shell\n");
    if (execv(argv[1], argv + 1))
    perror("Cannot exec");

    return 0;
    }

    Signed-off-by: Christoph Lameter # Original author
    Signed-off-by: Andy Lutomirski
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc: Jonathan Corbet
    Cc: Aaron Jones
    Cc: Ted Ts'o
    Cc: Andrew G. Morgan
    Cc: Mimi Zohar
    Cc: Austin S Hemmelgarn
    Cc: Markku Savela
    Cc: Jarkko Sakkinen
    Cc: Michael Kerrisk
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

26 Jun, 2015

1 commit

  • Commit 818411616baf ("fs, proc: introduce /proc//task//children
    entry") introduced the children entry for checkpoint restore and the
    file is only available on kernels configured with CONFIG_EXPERT and
    CONFIG_CHECKPOINT_RESTORE.

    This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
    because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
    But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.

    However, the children proc file is useful outside of checkpoint restore.
    I would like to use it in rkt. The rkt process exec() another program
    it does not control, and that other program will fork()+exec() a child
    process. I would like to find the pid of the child process from an
    external tool without iterating in /proc over all processes to find
    which one has a parent pid equal to rkt.

    This commit introduces CONFIG_PROC_CHILDREN and makes
    CONFIG_CHECKPOINT_RESTORE select it. This allows enabling
    /proc//task//children without needing to enable
    CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.

    Alban tested that /proc//task//children is present when the
    kernel is configured with CONFIG_PROC_CHILDREN=y but without
    CONFIG_CHECKPOINT_RESTORE

    Signed-off-by: Iago López Galeiras
    Tested-by: Alban Crequy
    Reviewed-by: Cyrill Gorcunov
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Djalal Harouni
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iago López Galeiras
     

25 Jun, 2015

1 commit

  • Allowing watchdog threads to be parked means that we now have the
    opportunity of actually seeing persistent parked threads in the output
    of /proc//stat and /proc//status. The existing code reported
    such threads as "Running", which is kind-of true if you think of the
    case where we park them as part of taking cpus offline. But if we allow
    parking them indefinitely, "Running" is pretty misleading, so we report
    them as "Sleeping" instead.

    We could simply report them with a new string, "Parked", but it feels
    like it's a bit risky for userspace to see unexpected new values; the
    output is already documented in Documentation/filesystems/proc.txt, and
    it seems like a mistake to change that lightly.

    The scheduler does report parked tasks with a "P" in debugging output
    from sched_show_task() or dump_cpu_task(), but that's a different API.
    Similarly, the trace_ctxwake_* routines report a "P" for parked tasks,
    but again, different API.

    This change seemed slightly cleaner than updating the task_state_array
    to have additional rows. TASK_DEAD should be subsumed by the exit_state
    bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
    reasonably be reported as "Running" (as it is now). Only TASK_PARKED
    shows up with unreasonable output here.

    Signed-off-by: Chris Metcalf
    Cc: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

16 Apr, 2015

3 commits

  • The seq_printf return value, because it's frequently misused,
    will eventually be converted to void.

    See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
    seq_has_overflowed() and make public")

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The current semantics of string_escape_mem are inadequate for one of its
    current users, vsnprintf(). If that is to honour its contract, it must
    know how much space would be needed for the entire escaped buffer, and
    string_escape_mem provides no way of obtaining that (short of allocating a
    large enough buffer (~4 times input string) to let it play with, and
    that's definitely a big no-no inside vsnprintf).

    So change the semantics for string_escape_mem to be more snprintf-like:
    Return the size of the output that would be generated if the destination
    buffer was big enough, but of course still only write to the part of dst
    it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
    It is then up to the caller to detect whether output was truncated and to
    append a '\0' if desired. Also, we must output partial escape sequences,
    otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
    printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
    they previously contained.

    This also fixes a bug in the escaped_string() helper function, which used
    to unconditionally pass a length of "end-buf" to string_escape_mem();
    since the latter doesn't check osz for being insanely large, it would
    happily write to dst. For example, kasprintf(GFP_KERNEL, "something and
    then %pE", ...); is an easy way to trigger an oops.

    In test-string_helpers.c, the -ENOMEM test is replaced with testing for
    getting the expected return value even if the buffer is too small. We
    also ensure that nothing is written (by relying on a NULL pointer deref)
    if the output size is 0 by passing NULL - this has to work for
    kasprintf("%pE") to work.

    In net/sunrpc/cache.c, I think qword_add still has the same semantics.
    Someone should definitely double-check this.

    In fs/proc/array.c, I made the minimum possible change, but longer-term it
    should stop poking around in seq_file internals.

    [andriy.shevchenko@linux.intel.com: simplify qword_add]
    [andriy.shevchenko@linux.intel.com: add missed curly braces]
    Signed-off-by: Rasmus Villemoes
    Acked-by: Andy Shevchenko
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • If some issues occurred inside a container guest, host user could not know
    which process is in trouble just by guest pid: the users of container
    guest only knew the pid inside containers. This will bring obstacle for
    trouble shooting.

    This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:

    a) In init_pid_ns, nothing changed;

    b) In one pidns, will tell the pid inside containers:
    NStgid: 21776 5 1
    NSpid: 21776 5 1
    NSpgid: 21776 5 1
    NSsid: 21729 1 0
    ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.

    c) If pidns is nested, it depends on which pidns are you in.
    NStgid: 5 1
    NSpid: 5 1
    NSpgid: 5 1
    NSsid: 1 0
    ** Views from level 1

    [akpm@linux-foundation.org: add CONFIG_PID_NS ifdef]
    Signed-off-by: Chen Hanxiao
    Acked-by: Serge Hallyn
    Acked-by: "Eric W. Biederman"
    Tested-by: Serge Hallyn
    Tested-by: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Hanxiao
     

14 Feb, 2015

1 commit


13 Feb, 2015

1 commit


11 Dec, 2014

4 commits

  • p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
    buys nothing and we can remove this check. Other callers already use it
    directly without additional checks.

    Note: with or without this patch ptrace_parent() can return the pointer to
    the freed task, this will be explained/fixed later.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_state() does seq_printf() under rcu_read_lock(), but this is only
    needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
    tgid/ngid and drop rcu lock.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. The usage of fdt looks very ugly, it can't be NULL if ->files is
    not NULL. We can use "unsigned int max_fds" instead.

    2. This also allows to move seq_printf(max_fds) outside of task_lock()
    and join it with the previous seq_printf(). See also the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_state() reads cred->group_info under task_lock() because a long ago
    it was task_struct->group_info and it was actually protected by
    task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
    adds the confusion, move task_unlock() up.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

06 Aug, 2014

1 commit

  • Pull security subsystem updates from James Morris:
    "In this release:

    - PKCS#7 parser for the key management subsystem from David Howells
    - appoint Kees Cook as seccomp maintainer
    - bugfixes and general maintenance across the subsystem"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
    X.509: Need to export x509_request_asymmetric_key()
    netlabel: shorter names for the NetLabel catmap funcs/structs
    netlabel: fix the catmap walking functions
    netlabel: fix the horribly broken catmap functions
    netlabel: fix a problem when setting bits below the previously lowest bit
    PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
    tpm: simplify code by using %*phN specifier
    tpm: Provide a generic means to override the chip returned timeouts
    tpm: missing tpm_chip_put in tpm_get_random()
    tpm: Properly clean sysfs entries in error path
    tpm: Add missing tpm_do_selftest to ST33 I2C driver
    PKCS#7: Use x509_request_asymmetric_key()
    Revert "selinux: fix the default socket labeling in sock_graft()"
    X.509: x509_request_asymmetric_keys() doesn't need string length arguments
    PKCS#7: fix sparse non static symbol warning
    KEYS: revert encrypted key change
    ima: add support for measuring and appraising firmware
    firmware_class: perform new LSM checks
    security: introduce kernel_fw_from_file hook
    PKCS#7: Missing inclusion of linux/err.h
    ...

    Linus Torvalds
     

24 Jul, 2014

2 commits

  • This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
    plus fixing it a different way...

    We found, when trying to run an application from an application which
    had dropped privs that the kernel does security checks on undefined
    capability bits. This was ESPECIALLY difficult to debug as those
    undefined bits are hidden from /proc/$PID/status.

    Consider a root application which drops all capabilities from ALL 4
    capability sets. We assume, since the application is going to set
    eff/perm/inh from an array that it will clear not only the defined caps
    less than CAP_LAST_CAP, but also the higher 28ish bits which are
    undefined future capabilities.

    The BSET gets cleared differently. Instead it is cleared one bit at a
    time. The problem here is that in security/commoncap.c::cap_task_prctl()
    we actually check the validity of a capability being read. So any task
    which attempts to 'read all things set in bset' followed by 'unset all
    things set in bset' will not even attempt to unset the undefined bits
    higher than CAP_LAST_CAP.

    So the 'parent' will look something like:
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffc000000000

    All of this 'should' be fine. Given that these are undefined bits that
    aren't supposed to have anything to do with permissions. But they do...

    So lets now consider a task which cleared the eff/perm/inh completely
    and cleared all of the valid caps in the bset (but not the invalid caps
    it couldn't read out of the kernel). We know that this is exactly what
    the libcap-ng library does and what the go capabilities library does.
    They both leave you in that above situation if you try to clear all of
    you capapabilities from all 4 sets. If that root task calls execve()
    the child task will pick up all caps not blocked by the bset. The bset
    however does not block bits higher than CAP_LAST_CAP. So now the child
    task has bits in eff which are not in the parent. These are
    'meaningless' undefined bits, but still bits which the parent doesn't
    have.

    The problem is now in cred_cap_issubset() (or any operation which does a
    subset test) as the child, while a subset for valid cap bits, is not a
    subset for invalid cap bits! So now we set durring commit creds that
    the child is not dumpable. Given it is 'more priv' than its parent. It
    also means the parent cannot ptrace the child and other stupidity.

    The solution here:
    1) stop hiding capability bits in status
    This makes debugging easier!

    2) stop giving any task undefined capability bits. it's simple, it you
    don't put those invalid bits in CAP_FULL_SET you won't get them in init
    and you won't get them in any other task either.
    This fixes the cap_issubset() tests and resulting fallout (which
    made the init task in a docker container untraceable among other
    things)

    3) mask out undefined bits when sys_capset() is called as it might use
    ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
    This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

    4) mask out undefined bit when we read a file capability off of disk as
    again likely all bits are set in the xattr for forward/backward
    compatibility.
    This lets 'setcap all+pe /bin/bash; /bin/bash' run

    Signed-off-by: Eric Paris
    Reviewed-by: Kees Cook
    Cc: Andrew Vagin
    Cc: Andrew G. Morgan
    Cc: Serge E. Hallyn
    Cc: Kees Cook
    Cc: Steve Grubb
    Cc: Dan Walsh
    Cc: stable@vger.kernel.org
    Signed-off-by: James Morris

    Eric Paris
     
  • Simplify the only user of this data by removing the timespec
    conversion.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: John Stultz

    Thomas Gleixner
     

08 Apr, 2014

1 commit

  • get_task_state() uses the most significant bit to report the state to
    user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
    can be noticed via /proc as Z -> X -> Z change. Note that this was
    possible even before EXIT_TRACE was introduced.

    This is not really bad but imho it make sense to hide EXIT_TRACE from
    user-space completely. So the patch simply swaps EXIT_ZOMBIE and
    EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.

    Signed-off-by: Oleg Nesterov
    Cc: Jan Kratochvil
    Cc: Michal Schmidt
    Cc: Al Viro
    Cc: Lennart Poettering
    Cc: Roland McGrath
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

24 Jan, 2014

2 commits

  • Change the remaining next_thread (ab)users to use while_each_thread().

    The last user which should be changed is next_tid(), but we can't do this
    now.

    __exit_signal() and complete_signal() are fine, they actually need
    next_thread() logic.

    This patch (of 3):

    do_task_stat() can use while_each_thread(), no changes in
    the compiled code.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Al Viro
    Cc: Kees Cook
    Reviewed-by: Sameer Nanda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • get_task_state() and task_state_array[] look confusing and suboptimal, it
    is not clear what it can actually report to user-space and
    task_state_array[] blows .data for no reason.

    1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
    clear. TASK_REPORT is self-documenting but it is not clear
    what ->exit_state can add.

    Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
    into TASK_REPORT and use it to calculate the final result.

    2. With the change above it is obvious that task_state_array[]
    has the unused entries just to make BUILD_BUG_ON() happy.

    Change this BUILD_BUG_ON() to use TASK_REPORT rather than
    TASK_STATE_MAX and shrink task_state_array[].

    3. Turn the "while (state)" loop into fls(state).

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: David Laight
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Oct, 2013

1 commit

  • It is desirable to model from userspace how the scheduler groups tasks
    over time. This patch adds an ID to the numa_group and reports it via
    /proc/PID/status.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

12 Apr, 2013

1 commit

  • The smpboot threads rely on the park/unpark mechanism which binds per
    cpu threads on a particular core. Though the functionality is racy:

    CPU0 CPU1 CPU2
    unpark(T) wake_up_process(T)
    clear(SHOULD_PARK) T runs
    leave parkme() due to !SHOULD_PARK
    bind_to(CPU2) BUG_ON(wrong CPU)

    We cannot let the tasks move themself to the target CPU as one of
    those tasks is actually the migration thread itself, which requires
    that it starts running on the target cpu right away.

    The solution to this problem is to prevent wakeups in park mode which
    are not from unpark(). That way we can guarantee that the association
    of the task to the target cpu is working correctly.

    Add a new task state (TASK_PARKED) which prevents other wakeups and
    use this state explicitly for the unpark wakeup.

    Peter noticed: Also, since the task state is visible to userspace and
    all the parked tasks are still in the PID space, its a good hint in ps
    and friends that these tasks aren't really there for the moment.

    The migration thread has another related issue.

    CPU0 CPU1
    Bring up CPU2
    create_thread(T)
    park(T)
    wait_for_completion()
    parkme()
    complete()
    sched_set_stop_task()
    schedule(TASK_PARKED)

    The sched_set_stop_task() call is issued while the task is on the
    runqueue of CPU1 and that confuses the hell out of the stop_task class
    on that cpu. So we need the same synchronizaion before
    sched_set_stop_task().

    Reported-by: Dave Jones
    Reported-and-tested-by: Dave Hansen
    Reported-and-tested-by: Borislav Petkov
    Acked-by: Peter Ziljstra
    Cc: Srivatsa S. Bhat
    Cc: dhillf@gmail.com
    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

28 Jan, 2013

1 commit

  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running in a full
    dynticks CPU, we'll need to do some extra-computation. This
    way we can account the time it spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

18 Dec, 2012

6 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton : (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc//fdinfo/ fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc//fdinfo/ output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • This allows us to print out eventpoll target file descriptor, events and
    data, the /proc/pid/fdinfo/fd consists of

    | pos: 0
    | flags: 02
    | tfd: 5 events: 1d data: ffffffffffffffff enabled: 1

    [avagin@: fix for unitialized ret variable]

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: Alexey Dobriyan
    Cc: James Bottomley
    Cc: "Aneesh Kumar K.V"
    Cc: Alexey Dobriyan
    Cc: Matthew Helsley
    Cc: "J. Bruce Fields"
    Cc: "Aneesh Kumar K.V"
    Cc: Tvrtko Ursulin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • We display a list of supplementary group for each process in
    /proc//status. However, we show only the first 32 groups, not all of
    them.

    Although this is rare, but sometimes processes do have more than 32
    supplementary groups, and this kernel limitation breaks user-space apps
    that rely on the group list in /proc//status.

    Number 32 comes from the internal NGROUPS_SMALL macro which defines the
    length for the internal kernel "small" groups buffer. There is no
    apparent reason to limit to this value.

    This patch removes the 32 groups printing limit.

    The Linux kernel limits the amount of supplementary groups by NGROUPS_MAX,
    which is currently set to 65536. And this is the maximum count of groups
    we may possibly print.

    Signed-off-by: Artem Bityutskiy
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     
  • It is currently impossible to examine the state of seccomp for a given
    process. While attaching with gdb and attempting "call
    prctl(PR_GET_SECCOMP,...)" will work with some situations, it is not
    reliable. If the process is in seccomp mode 1, this query will kill the
    process (prctl not allowed), if the process is in mode 2 with prctl not
    allowed, it will similarly be killed, and in weird cases, if prctl is
    filtered to return errno 0, it can look like seccomp is disabled.

    When reviewing the state of running processes, there should be a way to
    externally examine the seccomp mode. ("Did this build of Chrome end up
    using seccomp?" "Did my distro ship ssh with seccomp enabled?")

    This adds the "Seccomp" line to /proc/$pid/status.

    Signed-off-by: Kees Cook
    Reviewed-by: Cyrill Gorcunov
    Cc: Andrea Arcangeli
    Cc: James Morris
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Without this patch it is really hard to interpret a bounding set, if
    CAP_LAST_CAP is unknown for a current kernel.

    Non-existant capabilities can not be deleted from a bounding set with help
    of prctl.

    E.g.: Here are two examples without/with this patch.

    CapBnd: ffffffe0fdecffff
    CapBnd: 00000000fdecffff

    I suggest to hide non-existent capabilities. Here is two reasons.
    * It's logically and easier for using.
    * It helps to checkpoint-restore capabilities of tasks, because tasks
    can be restored on another kernel, where CAP_LAST_CAP is bigger.

    Signed-off-by: Andrew Vagin
    Cc: Andrew G. Morgan
    Reviewed-by: Serge E. Hallyn
    Cc: Pavel Emelyanov
    Reviewed-by: Kees Cook
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
     
  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through the David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were commited here adn
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing that user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     

29 Nov, 2012

1 commit

  • We have thread_group_cputime() and thread_group_times(). The naming
    doesn't provide enough information about the difference between
    these two APIs.

    To lower the confusion, rename thread_group_times() to
    thread_group_cputime_adjusted(). This name better suggests that
    it's a version of thread_group_cputime() that does some stabilization
    on the raw cputime values. ie here: scale on top of CFS runtime
    stats and bound lower value for monotonicity.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Paul Gortmaker

    Frederic Weisbecker
     

20 Nov, 2012

1 commit


01 Jun, 2012

3 commits

  • We would like to have an ability to restore command line arguments and
    program environment pointers but first we need to obtain them somehow.
    Thus we put these values into /proc/$pid/stat. The exit_code is needed to
    restore zombie tasks.

    Signed-off-by: Cyrill Gorcunov
    Acked-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Cc: Alexey Dobriyan
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • When we do checkpoint of a task we need to know the list of children the
    task, has but there is no easy and fast way to generate reverse
    parent->children chain from arbitrary (while a parent pid is
    provided in "PPid" field of /proc//status).

    So instead of walking over all pids in the system (creating one big
    process tree in memory, just to figure out which children a task has) --
    we add explicit /proc//task//children entry, because the kernel
    already has this kind of information but it is not yet exported.

    This is a first level children, not the whole process tree.

    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Kees Cook
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • - use int fpr priority and nice, since task_nice()/task_prio() return that

    - field 24: get_mm_rss() returns unsigned long

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Engelhardt
     

16 May, 2012

1 commit